% We ignore GDS present on AMD as we don't use it.
Compared to general-purpose processors, GPUs have a significantly higher ratio of computational power to memory-subsystem throughput. To feed the computation units with data, GPU architectures include a number of specific optimizations that require programmer consideration, as well as several types of implicit and explicit caches optimized for different use cases.
There are three types of general-purpose memory provided by the GPU architecture. A large amount of \emph{global memory} is accessible to all threads of the task grid. The much smaller, but significantly faster \emph{shared memory} is local to a thread block. Thread-specific local variables are normally held in registers. If there is not enough register space, some of the variables may be spilled to \emph{local memory}. Thread-specific but dynamically addressed arrays are always stored in local memory (i.e. when array addresses cannot be statically resolved at compilation time). In fact, local memory is a special area of the global memory, but the data would normally not leave the L1 or L2 caches unless an extreme amount is required. Even then, access to variables in local memory incurs a severe performance penalty compared to variables kept in registers and should be avoided where possible.
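As a minimal sketch of where data ends up in this hierarchy, the following CUDA kernel (all names are illustrative, not from the text) touches each memory type in turn:

```cuda
// Illustrative kernel: global, shared, register, and local memory.
__global__ void memory_kinds(float *g_data)   // g_data lives in global memory
{
    __shared__ float s_buf[256];  // shared memory, one copy per thread block

    float r = 0.0f;               // scalar local variable: kept in a register

    // This array is indexed with a value unknown at compile time below,
    // so the compiler must place it in local memory rather than registers.
    float l_arr[8];
    for (int i = 0; i < 8; ++i)
        l_arr[i] = g_data[threadIdx.x * 8 + i];
    r = l_arr[threadIdx.x % 8];   // dynamic index forces local-memory storage

    s_buf[threadIdx.x] = r;       // stage the value in fast shared memory
    __syncthreads();              // make the block's writes visible
    g_data[threadIdx.x] = s_buf[(threadIdx.x + 1) % blockDim.x];
}
```

Had `l_arr` been indexed only with compile-time constants, the compiler could have kept its elements in registers and avoided the local-memory round trip entirely.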
To reduce the load on the memory subsystem, GPUs try to coalesce global memory accesses into as few transactions as possible. This is only possible if the threads of a warp access adjacent memory locations. While older hardware required stricter access patterns to be followed, recent architectures issue the required number of 32- to 128-byte wide transactions. Maximum bandwidth is achieved when as few such transactions as possible are issued to satisfy the data request of a complete warp; the alignment requirements have to be considered as well. When coalescing is not possible, shared memory is often used as an explicit cache to streamline accesses to global memory~\cite{nvidia2014transpose}.
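The classic instance of this technique is a tiled matrix transpose, where a naive kernel would make either its reads or its writes strided. The sketch below follows the general pattern of the cited NVIDIA transpose example; the tile size and identifiers are our own choices, not taken from the text:

```cuda
#define TILE 32  // tile width; one warp reads a full tile row

// Transpose a `height` x `width` matrix `in` into a `width` x `height`
// matrix `out`.  Shared memory acts as an explicit cache: both the read
// from `in` and the write to `out` are coalesced, and only the tile in
// fast shared memory is traversed column-wise.
__global__ void transpose(float *out, const float *in, int width, int height)
{
    // The +1 padding shifts each row to a different bank, avoiding
    // shared-memory bank conflicts on the column-wise accesses.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read

    __syncthreads();  // the whole tile must be loaded before reuse

    // Swap the block indices to find this thread's transposed position.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```

Without the shared-memory tile, one of the two global-memory accesses would have a stride of `width` elements across the warp, forcing a separate memory transaction per thread.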