/articles/toma : revision 56

Viewing changes to section_3x_arch.tex

Committer: Suren A. Chilingaryan
Date: 2018-04-25 11:20:53 UTC
Revision ID: csa@suren.me-20180425112053-fxc3s4tdx1vmyqb7

Re-integrate proofs in long version, step1

The architecture of modern GPUs is rather heterogeneous and includes several types of computational elements organized in SIMD units of different sizes. The balance of performance shifts between different types of operations as architectures are updated. To feed the fast SIMD units with data, a complex hierarchy of memories and caches is introduced. The memory is very sensitive to access patterns, and the optimal patterns also differ between hardware generations~\cite{nvidia2017cudapg}. In this section we briefly explain the GPU architecture and elaborate on the differences between the considered GPUs, focusing on the aspects important for implementing back projection efficiently. To simplify reading for a broader audience, we use the more common CUDA terminology throughout the paper.

\subsection{Hardware Architecture}

The typical GPU consists of several semi-independent \emph{Streaming Multiprocessors (SM)} which share the global GPU memory and the L2 cache~\cite{nvidia2009gf110}. Several \emph{Direct Memory Access (DMA)} engines are included to move data to and from system memory. Each SM includes a task scheduler, computing units, a large register file, a fast on-chip (\emph{shared}) memory, and several different caches. There are a few types of computing units. The number-crunching capabilities are provided by a large number of \emph{Arithmetic Logic Units (ALU)}, also called \emph{Core Units} by NVIDIA. ALUs are aimed at single-precision floating-point and integer arithmetic. Some GPUs also include specialized \emph{half precision} and \emph{double precision} units to perform operations on these types faster. There are also architecture-specific units. All NVIDIA devices include \emph{Special Function Units (SFU)} which are used to quickly compute approximations of transcendental functions. The latest Volta architecture includes \emph{Tensor} units aimed at fast multiplication of small matrices to accelerate deep learning workloads~\cite{nvidia2017v100}. AMD architectures include scalar units to track loop counters, etc.~\cite{amd2012gcn}. The memory operations are executed by \emph{Load/Store Units (LD/ST)}. The memory is either accessed directly, or \emph{Texture} units are used to perform a fast linear interpolation between neighboring data elements while loading the data.

The computing units are arranged in several SIMD sets which are able to execute the same instruction on multiple data elements simultaneously. Each SM includes several such sets, which can often be utilized in parallel. The \emph{dispatch unit} of an SM employs data- and instruction-level parallelism to distribute the workload between all available sets of SIMD units. However, which combinations of instructions can be executed in parallel depends on the architecture. The ratio between the throughput of different instructions, as well as the shared memory bandwidth, varies significantly between architectures.

%Fermi-based GPUs have excellent and well-balanced performance of floating-point and integer instructions. Kepler GPUs perform worse if large amount of integer computations are involved, but have very fast texture engine, etc.

\subsection{Execution Model}\label{section:execution_model}

% We ignore GDS present on AMD as we don't use it.

Compared to general-purpose processors, the ratio between computational power and throughput of the memory subsystem is significantly higher on GPUs. To feed the computation units with data, GPU architectures include a number of specific optimizations that require the programmer's consideration, as well as several types of implicit and explicit caches optimized for different use cases.

There are three types of general-purpose memory provided by the GPU architecture. A large amount of \emph{global memory} is accessible to all threads of the task grid. A much smaller, but significantly faster \emph{shared memory} is local to a thread block. Thread-specific local variables are normally held in registers. If there is not enough register space, some of the variables may be offloaded to \emph{local memory}. Thread-specific, but dynamically addressed arrays (i.e. arrays whose indices cannot be statically resolved at the compilation stage) are always stored in the local memory. In fact, the local memory is a special area of the global memory, but the data would normally not leave the L1 or L2 caches unless an extreme amount is required. Even then, access to variables in the local memory inflicts a severe performance penalty compared to variables kept in registers and should be avoided if possible.
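
The difference between statically and dynamically addressed thread-local arrays can be illustrated with a minimal sketch. The kernels \texttt{pick\_dynamic} and \texttt{pick\_static}, the array size of 8, and the data layout are hypothetical and chosen only for illustration. Whether the array actually ends up in local memory depends on the compiler version and options; the resource usage reported by \texttt{nvcc -Xptxas -v} can be used to check.

\begin{verbatim}
// Illustrative kernels only; names and sizes are hypothetical.
// Assumptions: the grid exactly covers the data, 'in' holds
// 8 floats per thread, and every sel[t] is in the range 0..7.

__global__ void pick_dynamic(const float *in, const int *sel, float *out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float buf[8];
    for (int i = 0; i < 8; i++)
        buf[i] = in[8 * t + i];
    // The index is known only at run time, so buf typically
    // cannot be kept in registers and is placed in local memory.
    out[t] = buf[sel[t]];
}

__global__ void pick_static(const float *in, const int *sel, float *out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int s = sel[t];
    float v = 0.0f;
    // After unrolling, every access uses a compile-time index,
    // so all temporaries can stay in registers.
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        float x = in[8 * t + i];
        if (i == s)
            v = x;
    }
    out[t] = v;
}
\end{verbatim}

Both kernels produce the same result; the second one trades a few extra comparisons for the absence of a dynamically indexed array.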

To reduce the load on the memory subsystem, GPUs try to coalesce global memory accesses into as few transactions as possible. This is only possible if the threads of a warp access adjacent locations in memory. While stricter access patterns had to be followed on older hardware, recent architectures simply perform the required number of 32- to 128-byte wide transactions. The maximum bandwidth is achieved if as few such transactions as possible are issued to satisfy the data request of a complete warp. The alignment requirements have to be considered as well. If coalescing is not possible, shared memory is often used as an explicit cache to streamline accesses to global memory~\cite{nvidia2014transpose}.
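
A minimal sketch of shared memory used as an explicit cache is the tiled matrix transpose. The kernel name \texttt{transpose\_tiled}, the tile size of 32, and the row-major layout are assumptions made for this illustration and are not taken from the back-projection code. A naive transpose has to make either its reads or its writes strided; staging a tile in shared memory allows both the global read and the global write to touch consecutive addresses within a warp.

\begin{verbatim}
#define TILE 32

// Hypothetical example: transposes a row-major matrix with
// 'height' rows and 'width' columns.
// Assumed launch configuration: dim3 block(TILE, TILE) and
// dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE).
__global__ void transpose_tiled(const float *in, float *out,
                                int width, int height)
{
    // One extra column of padding avoids shared memory bank
    // conflicts when the tile is read column-wise below.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    // Coalesced read: consecutive threadIdx.x values access
    // consecutive addresses of the input.
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Swap the block indices so that the write is also
    // contiguous in threadIdx.x.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
\end{verbatim}

The strided access is moved from global memory to the much faster shared memory, where it is considerably cheaper.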
