GPU architectures: mainly Nvidia, with comparisons to TPUs
Networking: how data is moved around and the bottlenecks
Collective operations across GPU clusters and their theoretical speed limits
GPU architecture:
Streaming Multiprocessors (SMs): the GPU’s main cores. Each GPU has many of them, e.g. the H100 has 132. Made up of sub-parts:
Tensor cores: specialise in matrix multiplication, i.e. multiply-and-accumulate operations, which makes them well suited to the dense layers of neural networks (1-2 FLOPs per clock cycle).
CUDA cores: general-purpose cores for vector operations: multiply, add, compare etc. Slower but more flexible (1 FLOP per clock cycle), and able to access non-contiguous memory.
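A minimal CUDA sketch of the kind of scalar work that lands on the CUDA cores; the kernel name and shapes are made up for illustration. Matrix multiplies only reach the Tensor Cores when issued as MMA instructions, usually via cuBLAS, CUTLASS or the WMMA intrinsics, not via code like this.

```cuda
// Elementwise multiply-and-accumulate: one thread per element, executed on the
// CUDA cores. fma_kernel is a hypothetical name for illustration only.
__global__ void fma_kernel(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) {
        out[i] = fmaf(a[i], b[i], out[i]);          // one fused multiply-add per thread
    }
}
```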
Warp scheduler: hardware that selects which warp to run each clock cycle.
Operate SIMT (Single Instruction, Multiple Threads): data is grouped into warps and the same operation is applied across the whole warp, with unwanted lanes masked out.
Warps: groups of 32 threads over which one instruction is executed per clock cycle; the same instruction runs across the whole warp.
If threads in a warp need different operations (branch divergence), the branches are executed one after another with the non-participating threads masked off.
Why? Some operations are slower than others, e.g. fetching data is slower than arithmetic. The warp scheduler minimises SM downtime by continually scheduling warps that are ready to run.
Interleaving warps can hide latency
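A hedged sketch of warp divergence and masking: when lanes of one warp take different branches, the hardware runs each branch in turn with the other lanes masked off, so both paths cost time. The kernel name and the even/odd split are purely illustrative.

```cuda
// Threads in the same warp take different branches, so the warp executes both
// branches serially with masking rather than in parallel.
__global__ void divergent_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) {
        x[i] = x[i] * 2.0f;   // even lanes run this while odd lanes are masked off
    } else {
        x[i] = x[i] + 1.0f;   // then odd lanes run this while even lanes are masked off
    }
}
```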
Registers: extremely fast memory holding each thread’s operands just before calculation. Each CUDA core has access to 256 registers.
L1 cache / SMEM: small (256kB per SM) rapid-access memory.
L2 cache: ~50MB of reasonably fast memory shared across the SMs.
High Bandwidth Memory (HBM): the GPU’s main memory, a bit like RAM for the GPU; much larger but slower to access than the on-chip caches.
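A small sketch of the memory hierarchy in use, assuming a block size equal to the tile size: each block stages a tile from HBM (global memory) into the SM’s shared memory, reduces it using SMEM and registers, and writes a single result back to HBM. TILE and tile_sum are made-up names.

```cuda
#define TILE 256  // must match the threads-per-block used at launch

__global__ void tile_sum(const float* in, float* out, int n) {
    __shared__ float smem[TILE];              // lives in the SM's L1/SMEM
    int i = blockIdx.x * TILE + threadIdx.x;  // global index into HBM
    smem[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                          // wait until the whole tile is loaded

    // Per-block tree reduction over the tile, done entirely on-chip.
    for (int stride = TILE / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            smem[threadIdx.x] += smem[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = smem[0];  // one write back to HBM per block
}
```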
The GPU architecture’s advantage is its many small, independent SMs: a program can launch many independent blocks of work that run in parallel across them (sketched below).
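A host-side sketch of that idea, reusing the hypothetical fma_kernel from the earlier sketch: the work is cut into many independent blocks, and the hardware distributes them across whichever SMs are free.

```cuda
#include <cuda_runtime.h>

// Defined in the earlier sketch; declared here so this snippet stands alone.
__global__ void fma_kernel(const float* a, const float* b, float* out, int n);

// a, b and out are device pointers of length n.
void launch_fma(const float* a, const float* b, float* out, int n) {
    int threads = 256;                          // 8 warps per block
    int blocks = (n + threads - 1) / threads;   // enough independent blocks to cover all n elements
    fma_kernel<<<blocks, threads>>>(a, b, out, n);
    cudaDeviceSynchronize();                    // wait for all blocks to finish
}
```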
Networking: how data is shared within a cluster of GPUs.
Node:
Usually around 8 GPUs.
Extremely fast all-to-all NVLink connections
Max bandwidth ~450GB/s per GPU on the H100.
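A sketch of moving a buffer directly between two GPUs in the same node using the CUDA runtime; with peer access enabled, the copy travels over NVLink. Buffer names are placeholders and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// dst_on_gpu1 and src_on_gpu0 are device pointers allocated on GPU 1 and GPU 0.
void copy_gpu0_to_gpu1(void* dst_on_gpu1, const void* src_on_gpu0, size_t bytes) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // let GPU 0 address GPU 1 directly (NVLink if available)
    cudaMemcpyPeer(dst_on_gpu1, 1, src_on_gpu0, 0, bytes);  // dst on device 1, src on device 0
    cudaDeviceSynchronize();
}
```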
Scalable Units:
Collection of nodes, usually around 32
Connected by Ethernet and InfiniBand switches.
Arranged as a fat tree, i.e. a tree with the nodes as leaves and higher bandwidth the further up the tree you go.
There are multiple ‘trunks’, meaning there are multiple equal-cost paths from one node to another, reducing choke points.
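A back-of-the-envelope sketch of the theoretical speed limit mentioned at the top, for a ring all-reduce: each GPU must move roughly 2(N-1)/N times the buffer size, so dividing that by the per-GPU link bandwidth gives a lower bound on the time. The 8-GPU, 450GB/s, 1GiB numbers below are illustrative, not measured.

```cuda
#include <cstdio>

// Lower bound on ring all-reduce time: bytes moved per GPU / per-GPU bandwidth.
double ring_allreduce_seconds(double bytes, int n_gpus, double bandwidth_bytes_per_s) {
    double bytes_moved_per_gpu = 2.0 * (n_gpus - 1) / n_gpus * bytes;
    return bytes_moved_per_gpu / bandwidth_bytes_per_s;
}

int main() {
    // Illustrative numbers: 1 GiB buffer, 8 GPUs in a node, ~450 GB/s per GPU.
    double t = ring_allreduce_seconds(1024.0 * 1024 * 1024, 8, 450e9);
    printf("~%.3f ms lower bound for a 1 GiB ring all-reduce over 8 GPUs\n", t * 1e3);
    return 0;
}
```

With these assumed numbers the bound comes out around 4 ms; real collectives add per-hop latency and rarely sustain full link bandwidth, so measured times are higher.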