NVIDIA Streaming Multiprocessors (SMs): each has its own instruction scheduler
On-chip L2 cache (much faster to access than DRAM)
High-bandwidth DRAM
Code and data are stored in DRAM and loaded into the L2 cache when needed for computation.
Tensor Cores: accelerate matrix multiplications (they only do matmuls)
CUDA cores: perform operations that cannot be expressed as matmuls (see the Tensor Core sketch below for the contrast)
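To make the Tensor Core side concrete, here is a minimal sketch of driving them directly through CUDA's WMMA API. The 16×16×16 tile shape, FP16 inputs, and kernel name are illustrative assumptions, not a production matmul:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively multiplies a pair of 16x16 FP16 tiles on the
// Tensor Cores, accumulating into FP32. Requires compute capability >= 7.0;
// launch with 32 threads (one warp).
__global__ void tile_matmul(const half *a, const half *b, float *c) {
    // Fragments are opaque, warp-distributed views of the matrix tiles.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);           // load A tile (leading dim 16)
    wmma::load_matrix_sync(b_frag, b, 16);           // load B tile (leading dim 16)
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B on Tensor Cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

In practice you would call cuBLAS or a framework rather than writing WMMA by hand; the point is that the hardware unit only exposes matmul-shaped work.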
Extreme parallelisation:
Two-level thread hierarchy:
A kernel's threads are grouped into equally sized blocks
A grid of such blocks is launched to run the kernel (see the launch sketch below)
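A minimal sketch of this two-level launch in CUDA (the kernel and sizes are illustrative): the `<<<blocks, threadsPerBlock>>>` syntax expresses exactly the grid-of-blocks / block-of-threads hierarchy.

```cuda
#include <cuda_runtime.h>

// Element-wise ReLU: a simple op that runs on the CUDA cores.
__global__ void relu(const float *x, float *y, int n) {
    // Each thread derives its global index from the two-level hierarchy.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] > 0.0f ? x[i] : 0.0f;
}

int main() {
    int n = 1 << 20;  // 1M elements (illustrative size)
    float *x, *y;
    cudaMalloc((void **)&x, n * sizeof(float));
    cudaMalloc((void **)&y, n * sizeof(float));

    int threadsPerBlock = 256;  // threads grouped into equally sized blocks
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // grid of blocks
    relu<<<blocks, threadsPerBlock>>>(x, y, n);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```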
GPUs hide the latency of dependent instructions by switching between resident threads rather than stalling.
This allows throughput to keep scaling roughly linearly with batch sizes far beyond the number of physical cores (see the sketch below).
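A sketch of why this works: each SM keeps far more threads resident than it can execute in one cycle, so the scheduler almost always has a ready thread to switch to. The two attributes queried below are real CUDA runtime attributes; the interpretation printed is mine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0, numSMs = 0, threadsPerSM = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);
    cudaDeviceGetAttribute(&threadsPerSM, cudaDevAttrMaxThreadsPerMultiProcessor, dev);

    // Far more threads can be resident than there are cores; switching among
    // them is what hides dependent-instruction latency.
    printf("SMs: %d, max resident threads per SM: %d\n", numSMs, threadsPerSM);
    printf("Max resident threads on device: %d\n", numSMs * threadsPerSM);
    return 0;
}
```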
Why?
GPUs have many SMs, each with its own scheduler, and each SM runs many threads; the threads of a block can communicate via their SM's shared memory.
Each block runs entirely on a single SM; an SM can host multiple blocks, and performance scales roughly linearly with the number of blocks until the SMs saturate
You want the number of blocks to be an integer multiple of the number of SMs to reduce the tail effect (see the sizing sketch below)
Tail effect: under-utilisation in the final wave of a launch, when not all SMs have work (e.g. 6 blocks across 4 SMs run as a full wave of 4 followed by a wave of only 2, leaving 2 SMs idle).
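A sketch of sizing a launch this way. The workload split is hypothetical (it assumes the work can be re-divided across blocks); `cudaDevAttrMultiProcessorCount` is the real attribute for the SM count.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0, numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);

    // 6 blocks on 4 SMs -> waves of 4 then 2: the last wave leaves 2 SMs idle.
    // Rounding the block count up to a multiple of the SM count fills every
    // wave (assuming the work can be re-split across the extra blocks).
    int wantedBlocks = 6;                              // hypothetical workload
    int waves = (wantedBlocks + numSMs - 1) / numSMs;  // ceil(blocks / SMs)
    int roundedBlocks = waves * numSMs;                // integer multiple of SMs
    printf("SMs: %d, blocks: %d -> rounded to %d (%d full waves)\n",
           numSMs, wantedBlocks, roundedBlocks, waves);
    return 0;
}
```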
Limiting factors:
Memory bandwidth (e.g. ReLU: little compute per byte moved)
Maths bandwidth (e.g. a large linear layer)
Latency (when there is too little parallelism to hide instruction latency)
Rule: simple ops tend to be memory-limited; complex or large operations tend to be math-limited (see the arithmetic-intensity sketch below)
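One way to see the rule: compare an op's arithmetic intensity (FLOPs per byte moved) with the machine's FLOPs-to-bandwidth ratio. A worked sketch with illustrative numbers (the peak figures are assumptions, not any particular GPU's spec):

```cuda
#include <cstdio>

int main() {
    // Illustrative machine balance (assumed numbers, not a real GPU spec):
    double peakFlops = 100e12;               // 100 TFLOP/s of math
    double peakBytes = 2e12;                 // 2 TB/s of memory bandwidth
    double balance = peakFlops / peakBytes;  // FLOPs per byte needed to be math-limited

    // ReLU on FP32: 1 op per element, 8 bytes moved (read + write).
    double reluIntensity = 1.0 / 8.0;

    // Square FP32 matmul (N x N): 2*N^3 FLOPs, 3*N^2*4 bytes (A, B, C moved once).
    double n = 4096;
    double gemmIntensity = (2 * n * n * n) / (3 * n * n * 4);

    printf("machine balance: %.1f FLOP/byte\n", balance);
    printf("ReLU:   %.3f FLOP/byte -> %s\n", reluIntensity,
           reluIntensity < balance ? "memory-limited" : "math-limited");
    printf("matmul: %.1f FLOP/byte -> %s\n", gemmIntensity,
           gemmIntensity < balance ? "memory-limited" : "math-limited");
    return 0;
}
```

ReLU does one op per 8 bytes, far below the machine balance, so it is stuck waiting on memory; a large matmul does thousands of ops per byte, so it is limited by math throughput instead.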