\section{Conclusion}\label{section:summary}
 We have surveyed a range of GPU architectures presented by the major hardware vendors over the last 10 years. \tablename~\ref{tbl:gpuspec} lists the architecture details and summarizes the considerable shifts of the performance balance between different hardware pipelines. The throughput ratio between floating-point and type-conversion instructions has fluctuated 8-fold. Type conversions are executed at half the rate of the peak floating-point performance on AMD GCN GPUs, but only a single type-conversion instruction can be executed per 12 floating-point operations on NVIDIA Kepler GPUs (considering the peak rates). On the other hand, type-conversion instructions can be executed in parallel with floating-point operations on NVIDIA Kepler, but not on AMD GCN. The ratio between the theoretical throughput of floating-point instructions and the shared memory bandwidth has changed by a factor of 2.6 across the reviewed architectures. A 2-fold change is observed in the ratio between the throughput of floating-point operations and the filtering rate of the texture engine. Furthermore, we found that even more substantial architectural changes have been introduced in some products. AMD has replaced the VLIW architecture with GCN, effectively moving from instruction-level parallelism to a SIMT-only model. On NVIDIA platforms, the execution of bit-mangling and type-conversion operations has shifted between ALU and SFU units. In recent architectures, half-float and Tensor units have been introduced to accelerate machine-learning algorithms. Our study demonstrates that these changes are highly relevant to the performance of the developed algorithms and that a significant speed-up is possible if the low-level details of the GPU architecture are taken into account. In addition to GPU architectures, we also reviewed the Intel Xeon Phi technology in \sectionname~\ref{section:cpurec}. We show that the OpenCL algorithms developed for GPUs are poorly suited for this architecture due to its different scheduling model. Standard threaded code is easier to implement and a better fit for general-purpose CPUs and Intel accelerators based on the Xeon Phi technology.

 We present two algorithms to perform fast back-projection on a variety of GPU architectures. The first utilizes the texture engine for interpolation. The second relies on the ALU units and shared memory. Furthermore, we propose two hybrid approaches that combine these methods and achieve an even higher performance by balancing the load across the GPU subsystems. In \sectionname~\ref{section:multislice} we show that a higher utilization of the texture engine can be achieved if the data is re-arranged into larger vector types. Such vectors are streamed by the texture engine at the same rate as plain floating-point numbers, provided that a high locality of the texture fetches can be ensured across half-warps and also within groups of 4 consecutive threads. On some architectures we can further double the performance by switching to a half-precision data representation at the price of a small penalty to the image quality. The only requirement is the ability of the hardware to perform high-speed conversion between half- and single-precision floating-point formats. Even if half-precision floating-point numbers are not directly supported by the texture engine, in \sectionname~\ref{section:half} we demonstrate that they can still be efficiently utilized by binding a texture with a forged data type. To reach the maximal theoretical rate of the texture engine, the performance bottleneck caused by the low throughput of constant memory and SFU units is resolved by re-assigning work between GPU threads, as explained in \sectionname~\ref{section:newtex}. While this approach results in a lower occupancy on the AMD platform, the resulting performance is considerably improved, especially on AMD VLIW-based GPUs. On the NVIDIA platform we are able to enforce 100\% occupancy instead, see \sectionname~\ref{section:newtex_occupancy}. Consequently, a relatively large amount of local memory is used, but it is completely backed by the L1 cache and the performance is improved significantly on most NVIDIA architectures as well. As can be seen from \figurename~\ref{fig:texeff}, a high utilization of the texture engine is achieved across all hardware platforms. The algorithm is highly portable, and only minor adjustments of the algorithm parameters are required to adapt it to specific hardware. In contrast, the ALU-based algorithm requires significant modifications for some of the considered architectures. As we show in \sectionname~\ref{section:modelling}, different functional blocks may limit the algorithm performance depending on the underlying hardware. Consequently, we were able to significantly boost its performance by re-balancing the load of these functional blocks. For the Maxwell and Pascal micro-architectures, we run both algorithms in parallel, efficiently redistributing the load between the texture engine, shared memory, and ALUs. This approach is explained in \sectionname~\ref{section:hyrbid}. Because of the slow throughput of Kepler's SFU units, in \sectionname~\ref{section:alu_fancy} we propose an alternative method to perform rounding and type-conversion operations using ALUs instead of SFUs. Consequently, part of the SFU load is shifted to the ALUs and a higher performance is achieved. In \sectionname~\ref{section:alu_hx}, we introduce additional caches for the Fermi architecture to reduce the total number of issued instructions. For the AMD VLIW architecture, we significantly increase the amount of work per GPU thread.
Consequently, the kernel runs at a very low occupancy but makes better use of instruction-level parallelism. In \sectionname~\ref{section:alu_occupancy} we also discuss the optimal occupancy for other architectures. It depends on the number of available hardware registers, the kernel complexity, and also the ratio between memory and ALU/SFU instructions. We show that, depending on the architecture, targeting either a higher or a lower occupancy may result in a considerable speed-up.
 %However, 50\% occupancy is optimal for the discussed use-case across the majority of hardware platforms.
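
As an illustration of the ALU-based rounding idea mentioned above, the following sketch shows one well-known way to round a non-negative float to the nearest integer using only floating-point additions (the so-called magic-constant trick). It is meant as a minimal example of the concept rather than a verbatim excerpt of our kernels; it assumes round-to-nearest-even mode, input values below $2^{22}$, and a compiler that does not re-associate floating-point expressions (the default for nvcc).
\begin{verbatim}
// Sketch: round a non-negative float with plain FP additions, avoiding the
// float-to-int conversion path. Adding 1.5 * 2^23 shifts x into a range
// where the ulp is 1, so the hardware rounding produces an integral value;
// subtracting the constant back restores the magnitude.
// Valid for 0 <= x < 2^22 under round-to-nearest-even.
__device__ __forceinline__ float round_via_alu(float x)
{
    const float magic = 12582912.0f;   // 1.5 * 2^23
    return (x + magic) - magic;
}
\end{verbatim}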
 
 Different algorithms can be used to better target the varying balance of subsystem performance in each GPU architecture. We have also shown that it is viable to utilize multiple algorithms in parallel if they are primarily aimed at different hardware units. The optimal ratio between these algorithms can be ensured on the NVIDIA platform, allowing a balanced usage of all GPU components. The recommended algorithms for each platform are summarized in \tablename~\ref{tbl:algs}. The nearest-neighbor interpolation performs significantly faster on the majority of the considered platforms if the ALU-based algorithm is used. Except on Kepler, the linear interpolation is also accelerated if the ALU variant is used, either alone or in combination with the texture-based algorithm. If exact agreement with the standard algorithm is not required, an additional speed-up can be achieved by using the half-float data representation or by replacing the linear interpolation with a combination of the oversampling and nearest-neighbor approaches, as explained in \sectionname~\ref{section:oversample}. Parallel hardware still progresses rapidly and new architectures are announced yearly. To port the algorithms to new devices, the algorithm configuration can be parametrized and a quick search of the parameter space can be executed to find the optimal settings, as sketched below. This approach will not deliver the optimal performance if new functional blocks are introduced in the architecture, e.g. Tensor and Ray Tracing units on the recent NVIDIA GPUs. However, it can address the shifts in the operation balance.
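
As a sketch of this parametrized search, the host-side loop below times a kernel over a small grid of launch configurations using CUDA events and keeps the fastest one. The kernel and the tested parameters (block size and work per thread) are placeholders; an actual tuner would cover the parameters relevant to the specific algorithm.
\begin{verbatim}
// Sketch: brute-force search over a small launch-parameter grid.
// backproject_kernel is a placeholder for any parametrized kernel.
__global__ void backproject_kernel(float *out, int work_per_thread) { /* ... */ }

static float time_config(dim3 grid, dim3 block, int wpt, float *out)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);  cudaEventCreate(&stop);
    cudaEventRecord(start);
    backproject_kernel<<<grid, block>>>(out, wpt);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);  cudaEventDestroy(stop);
    return ms;
}

static void autotune(float *out, int n_pixels)
{
    float best_ms = 1e30f;  int best_bs = 0, best_wpt = 0;
    for (int bs = 64; bs <= 256; bs *= 2)          // threads per block
        for (int wpt = 1; wpt <= 4; wpt *= 2) {    // work items per thread
            dim3 block(bs), grid((n_pixels + bs * wpt - 1) / (bs * wpt));
            float ms = time_config(grid, block, wpt, out);
            if (ms < best_ms) { best_ms = ms; best_bs = bs; best_wpt = wpt; }
        }
    // best_bs / best_wpt now hold the fastest configuration found.
}
\end{verbatim}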

\input{table_8x_algs.tex}

 \figurename~\ref{fig:speedhist} illustrates the history of the NVIDIA platform from 2009 to 2016. While the performance of the standard algorithm has grown on par with the hardware improvements, the optimized algorithms received an additional boost from exploiting the parallelism between GPU subsystems. The speed-up of the optimized back-projection algorithms significantly outpaces the corresponding growth of the hardware performance. In particular, using the new ALU-based algorithm we boosted the performance by 3 to 5 times on the Fermi architecture, while the peak throughput of floating-point instructions improved by only 50\%. The balance of operations changed significantly in the Kepler architecture: the throughput of bit-mangling and type-conversion operations was even reduced on the GTX680 compared to the GTX580. We were still able to preserve a steady performance growth by optimizing the usage of the texture engine and re-balancing the load between SFUs and ALUs. Thanks to the ability to utilize the texture engine in parallel with the ALUs, on the Maxwell and Pascal architectures the algorithm performance again increased beyond the hardware improvements.
 
\begin{figure}[htb]
\centering
\includegraphics[width=\imgsizedefault\textwidth]{img/speedup-steps.pdf}
\caption{\label{fig:speedhist} The figure compares the theoretical peak throughput of the GPU subsystems with the measured performance of the standard and optimized back-projection algorithms. The left part of the figure shows the speed-up relative to the NVIDIA GeForce GTX295. The right part shows the relative speed-up between consecutive architectures. }
\end{figure} 
 
 The NVIDIA Titan X is the newest of the evaluated GPUs. Here, we were able to accelerate the code by 2.5 times using linear interpolation and without loss of image quality. The proposed algorithm is 3.5 times faster if the nearest-neighbor interpolation is used. Even if the reconstruction chain is only able to process a single slice at a time, the proposed hybrid approach is 2 times faster than the standard algorithm. The achieved speed-up across all platforms is presented in \figurename~\ref{fig:speedup}. Some architectures can be accelerated as much as 7 times compared to the state-of-the-art method. High-speed reconstruction is of significant importance for imaging at synchrotron facilities and allows improving the spatial and temporal resolution of the beam-line instrumentation. The back-projection algorithm is also utilized in iterative reconstruction techniques aiming for high-quality reconstruction. Therefore, the faster implementation lowers the computational demands for high-quality offline reconstruction as well. Furthermore, the general concept of balancing the load between the computational units of the GPU is not limited to the presented tomographic reconstruction, but is suggested for any computationally intensive task.

\begin{figure}[htb]
\centering
\includegraphics[width=\imgsizedefault\textwidth]{img/algorithms.pdf}
\caption{\label{fig:speedup} The figure lists the performance improvements of the proposed algorithms using the linear and nearest-neighbor interpolation modes. The speed-up against the standard implementation is measured across all architectures. The black bars show the improvements in single-slice reconstruction performance achieved with the new texture-based kernel due to the optimized fetch locality and the reduced load on the constant memory. The blue bars show the increased speed-up using the multi-slice reconstruction. The green bars indicate whether the alternative ALU-based kernel outperforms the texture-based approach and show the achieved gains. The performance of the hybrid approach is shown in orange. The last two bars show the additional speed-up with approximate methods which do not replicate the results of the standard method exactly.}
\end{figure}