\section{Setup, Methodology, and Conventions}\label{section:setup}
\subsection{Hardware Platform}
 To evaluate the performance of the proposed methods, we have selected 9 AMD and NVIDIA GPUs with different micro-architectures. \tablename~\ref{table:gpus} summarizes the considered GPUs. The cards were installed in three GPU servers. The newer NVIDIA cards with Maxwell and Pascal architectures were installed in a server based on the Supermicro 7047GT platform, specified in \tablename~\ref{table:sys1}. The older NVIDIA cards and all AMD cards were installed in two identical systems based on the Supermicro 7046GT platform; the full specification is given in \tablename~\ref{table:sys2}. Additionally, we have tested how the developed code performs on an Intel Xeon Phi 5110P accelerator, which was installed in the first platform along with the newer NVIDIA cards.

 
\begin{table}[htb]
\caption{\label{table:gpus} List of selected  GPU architectures}
\centering
\begin{tabular} { ccccc }
\hline
Vendor & GPU                 & Arch.    & Code    & Release \\
\hline
NVIDIA & GeForce GTX 295     & GT200    & GT200   & 2009 \\
NVIDIA & GeForce GTX 580     & Fermi    & GF110   & 2010 \\
NVIDIA & GeForce GTX 680     & Kepler   & GK104   & 2012 \\
NVIDIA & GeForce GTX Titan   & Kepler   & GK110   & 2013 \\
NVIDIA & GeForce GTX 980     & Maxwell  & GM204   & 2014 \\
NVIDIA & GeForce GTX Titan X & Pascal   & GP102   & 2016 \\
AMD    & Radeon HD-5970      & VLIW5    & Cypress & 2009 \\
AMD    & Radeon HD-7970      & GCN1     & Tahiti  & 2012 \\
AMD    & Radeon R9-290       & GCN2     & Hawaii  & 2013 \\
\hline
\end{tabular}
\end{table}

\begin{table}[htb]
\caption{\label{table:sys1} Server for newer NVIDIA cards}
\begin{tabular} { l || p{5.5cm} }
\hline
Platform      & Supermicro 7047GT GPU Server \\
Motherboard   & Supermicro X9DRG-QF with Intel C602 chipset \\
Memory        & 256 GB DDR3-1333 Memory \\
Processor     & Dual Intel Xeon E5-2640 (12 cores / 24 threads at 2.5 GHz)  \\
\hline
\end{tabular}
\end{table}

\begin{table}[htb]
\caption{\label{table:sys2} Servers for AMD and older NVIDIA cards}
\begin{tabular} { l || p{5.5cm} }
\hline
Platform        & Supermicro 7046GT GPU Server \\
Motherboard     & Supermicro X8DTG-QF with Intel 5520 chipset \\
Memory          & 96 GB DDR3-1066 Memory \\
Processor       & Dual Intel Xeon X5650 (12 cores at 2.67 GHz) \\
\hline
\end{tabular}
\end{table}

\subsection{Software Setup}
 All described systems were running OpenSuSE 13.1. The code for the NVIDIA cards was developed using the CUDA framework. As newer versions of the framework have dropped support for older GPUs, we used CUDA 6.5 for the NVIDIA GeForce GTX 295 card and CUDA 8.0 for all other NVIDIA GPUs. The AMD version of the code is based on OpenCL and was compiled using the AMD APP SDK 3.0. Additionally, we have tested the performance of the Xeon CPUs and the Xeon Phi accelerator using the Intel SDK for OpenCL. Since the latest version of the Intel OpenCL SDK no longer supports Xeon Phi processors, we again had to use two different SDK versions: the newer one to evaluate the performance of the Xeon processors and the older one to execute the developed methods on the Xeon Phi accelerator. All installed software components are summarized in \tablename~\ref{table:soft}.

\begin{table}[htb]
\caption{\label{table:soft} Software components}
\begin{tabular} {l || l}
\hline
Operating System          & OpenSuSE 13.1 \\
System Configuration      & kernel 3.11.10, glibc 2.18, gcc 4.8.1 \\
CUDA Platform             & CUDA SDK 8.0.61, driver 375.39 \\
CUDA Platform (GT200)     & CUDA SDK 6.5.14, driver 340.102 \\
AMD Platform              & APP SDK 3.0.130.136, driver 15.12 \\
Intel Platform            & OpenCL SDK 2017 v. 7.0.0.2511 \\
Intel Platform (Xeon Phi) & MPSS 3.5.1, OpenCL SDK 4.5.0.8 \\
\hline
\end{tabular}
\end{table}

\subsection{Benchmarking Strategy}
 In this article we do not aim to precisely characterize the performance of the graphics cards, but rather to validate the efficiency of the proposed optimizations. For this reason, we take a relatively relaxed approach to the performance measurements. In most tests, we use a data set consisting of 2048 projections with dimensions of 2048 by 2048 pixels each. 512 slices with the same dimensions are reconstructed, and the median reconstruction time is used to estimate the performance.
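 The listing below is a minimal sketch of such a timing loop using CUDA events. It is an illustration only: \texttt{reconstruct\_slice} is a hypothetical placeholder standing in for the actual reconstruction kernels, and error checking is omitted for brevity.

\begin{verbatim}
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void dummy_kernel() { }

// Placeholder standing in for one slice reconstruction (illustration only).
static void reconstruct_slice(int) { dummy_kernel<<<1, 1>>>(); }

int main() {
    const int n_slices = 512;
    std::vector<float> times(n_slices);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < n_slices; i++) {
        cudaEventRecord(start);
        reconstruct_slice(i);            // results stay on the GPU, no I/O
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&times[i], start, stop);   // milliseconds
    }

    // Report the median of the per-slice reconstruction times.
    std::nth_element(times.begin(), times.begin() + n_slices / 2, times.end());
    printf("median reconstruction time: %.3f ms\n", times[n_slices / 2]);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
\end{verbatim}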
%  Unless specified otherwise, 512 similar slices are reconstructed and the median time is measured to compute the performance. In most tests, we use a typical data set recorded by 4~MPix camera utilized at ANKA synchrotron. It consists of 2000 projections with dimensions of 1776 by 1707 pixels each. 
%ToDo: reintroduce if we provide the sections...
%To prove that these parameters has a negligible effect on the performance, we show how the reconstruction performance depend on size in the section~\ref{section:perf_size} and the stability of the performance in the section~\ref{section:perf_gpuboost}.

 Starting with the Kepler architecture, NVIDIA introduces the GPUBoost technology to adapt the clock speed according to the current load and the processor temperature~\cite{ryan2016gpuboost}. To avoid significant performance discrepancies, we run a heat-up procedure until the performance stabilizes. Furthermore, we verify that the actual hardware clock measured before start of measurements (but after the heat-up procedure) does not significantly differ from the clock measured after the measurements. Otherwise, we re-run the test. Finally, we exclude all I/O operations in the benchmarks. The reconstructions are executed using dummy data and the results are discarded without transferring them back to the system memory. 
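 A minimal sketch of such a clock-stability check is shown below. It assumes the NVML library is available (link with \texttt{-lnvidia-ml}); the benchmark body is only indicated by a comment, and the 5\% threshold is an arbitrary example value rather than the exact criterion used here.

\begin{verbatim}
#include <cstdio>
#include <cstdlib>
#include <nvml.h>

static unsigned int sm_clock(nvmlDevice_t dev) {
    unsigned int mhz = 0;
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &mhz);
    return mhz;
}

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    unsigned int before = sm_clock(dev);   // sampled after the heat-up phase

    // ... heat-up procedure and benchmark kernels run here ...

    unsigned int after = sm_clock(dev);

    // Re-run the test if GPUBoost changed the clock by more than ~5%.
    if (abs((int)after - (int)before) > (int)before / 20)
        fprintf(stderr, "clock drifted from %u to %u MHz, re-run\n",
                before, after);

    nvmlShutdown();
    return 0;
}
\end{verbatim}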
%ToDo: Re-introduce if we add section about GPUBoost on Pascal
%Otherwise, some delays may be introduced between consecutive execution of GPU functions. During this pauses the processor is cooling down and may achieve higher average clock as compared to the more efficient reconstruction pipeline executing consecutive reconstructions without delays in between.
 
\subsection{Quality Evaluation}
 Some of the suggested optimizations alter the resulting reconstruction. In such cases, we assess the effect on quality by comparing the obtained results with the standard reconstruction. The standard Shepp-Logan head phantom with a resolution of 1024x1024 pixels is used for the evaluation~\cite{shepp1974}. We also illustrate the differences between the standard and reduced-quality methods using a cross-section slice of a real volume containing a fossilized wasp, recorded in a recent experiment conducted at the ANKA synchrotron~\cite{vdk2018}. The projection images were recorded using a 12-bit pco.dimax camera~\cite{pco2014dimax}; more details about the setup of the imaging system are available in the referenced article. As the changes are typically small and hardly visible in the 2D image, we show a profile along a vertical line crossing most of the features in the slice, see \figurename~\ref{fig:phantom}.

\begin{figure}[htb]
\centering
%\includegraphics[width=\imgsizedefault\textwidth]{img/phantom.pdf}
\includegraphics[width=\imgsizedefault\textwidth]{img/phanfossil.pdf}
\caption{\label{fig:phantom} Synthetic Shepp-Logan phantom (left) and a reconstructed cross-section slice of a fossilized wasp (right) are used for quality evaluation. All profile plots in the article are shown along the red vertical lines. }
\end{figure} 
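 For clarity, the listing below shows how such a line profile can be extracted from a reconstructed slice stored in row-major order; the interface is illustrative and the column index is an arbitrary example value, not the one used for the figures.

\begin{verbatim}
#include <vector>

// Extract a vertical profile (one image column) from a slice stored
// in row-major order.
std::vector<float> column_profile(const std::vector<float> &slice,
                                  int width, int height, int column) {
    std::vector<float> profile(height);
    for (int y = 0; y < height; y++)
        profile[y] = slice[(size_t)y * width + column];
    return profile;
}
\end{verbatim}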

\subsection{Pseudo-code Conventions}
 To avoid long code listings, we use pseudo-code to describe the algorithms. We use a mixture of mathematical and \emph{C}-style notation to keep it minimalistic and easy to follow. The \emph{C} syntax is mostly adopted for operations, loops, and conditionals. We use $\idiv$ to denote integer division and $\%$ for the modulo operation. No floating-point division is performed in any of the algorithms: the division is always executed on positive integer arguments and produces an integer rounded towards zero. A standard naming scheme for variables is used across all presented algorithms. We group related variables together: the same letter is used to refer to all variables of a group, and the actual variable is specified using a subscript. Furthermore, some algorithms use shared memory to cache data stored in global or constant memory. In such cases, we keep the variable name but add a superscript indicating the memory domain. For instance, $\shmem{c_s}$ points to the sine of the projection angle stored in shared memory: $c$ is the group of variables storing the projection constants, $c_s$ refers specifically to the sine of the projection angle, and the superscript $\shmem{\cdot}$ indicates that the copy in shared memory is accessed. All variables used across the algorithms are listed in \tablename~\ref{table:alg_prms}, \ref{table:alg_idxs}, and \ref{table:alg_vars}. The superscripts used to indicate the memory segment are specified in \tablename~\ref{table:alg_ss}.

\input{table_2x_vars}
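 As an illustration of this convention, the following CUDA fragment (a sketch, not code from the actual implementation) caches the per-projection sines in shared memory; a read from \texttt{shm\_c\_s} corresponds to the superscripted $\shmem{c_s}$ in the pseudo-code.

\begin{verbatim}
#define MAX_PROJ 2048

// Sines of the projection angles kept in constant memory.
__constant__ float c_s[MAX_PROJ];

__global__ void cache_constants_example(int n_proj) {
    __shared__ float shm_c_s[MAX_PROJ];   // per-block copy of c_s

    // Cooperative copy: each thread loads a strided subset of the sines.
    for (int p = threadIdx.x; p < n_proj; p += blockDim.x)
        shm_c_s[p] = c_s[p];
    __syncthreads();                      // make the copy visible to all threads

    // ... the back-projection loop would read shm_c_s[] instead of c_s[] ...
}
\end{verbatim}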

 We use the $\vdata{\cdot}$ symbol to denote all vector variables, i.e. $float2$, $float4$, etc. Furthermore, all proposed algorithms are capable of reconstructing 1, 2, or 4 slices in parallel. If more than one slice is reconstructed, the accumulator and a few other temporary variables use the floating-point vector format to store values for multiple slices. These variables are marked with $\vfloat{\cdot}$. All arithmetic operations in this case are performed in vector form and affect all slices. The vector multiplication is performed element-wise, as it would be in CUDA and OpenCL. We use the standard \emph{C} notation to refer to array indexes and components of the vector variables. The arrays are indexed from 0. For instance, $\vx{\vfloat{s}[0]}$ refers to the first component of the accumulator. Assignments between vector variables and scalars are shown using curly braces, like $\vlist{x, y} = \vfloat{s}[0]$. The floating-point constants are shown without a \emph{C} type specification. However, it is of utmost importance to qualify all floating-point constants as single precision in the \emph{C} code, i.e. to use $0.5f$ in place of $0.5$. Otherwise, double-precision arithmetic is executed, severely penalizing performance on the majority of consumer-grade GPUs.
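 The CUDA fragment below is a hedged sketch (not one of the article's kernels) showing a $float4$ accumulator that keeps partial sums for four slices at once, together with a properly qualified single-precision constant.

\begin{verbatim}
__global__ void accumulate_example(const float4 *in, float4 *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Accumulator holding partial sums for 4 slices reconstructed in parallel.
    float4 s = make_float4(0.f, 0.f, 0.f, 0.f);

    for (int i = 0; i < n; i++) {
        float4 v = in[i * stride + tid];
        // Element-wise arithmetic updates all four slices at once; the
        // constant must stay single precision (0.5f, never 0.5).
        s.x += 0.5f * v.x;
        s.y += 0.5f * v.y;
        s.z += 0.5f * v.z;
        s.w += 0.5f * v.w;
    }
    out[tid] = s;
}
\end{verbatim}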
 
 To perform thread synchronization and to access the texture engine, the algorithms rely on a few functions provided by the CUDA SDK or defined in the OpenCL specification. To preserve neutrality of notation, we use abbreviated keywords to reference these functions. The list of abbreviations, along with the corresponding CUDA and OpenCL functions, is given in \tablename~\ref{table:alg_cmd}. The syntax of OpenCL and CUDA kernels is, in fact, very closely related: only a few language keywords are named differently. It is a trivial task to generate both CUDA and OpenCL kernels based on the provided pseudo-code.

\input{table_2x_funcs}
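 One possible way to realize this mapping from a single source is sketched below. The macro names are illustrative rather than the abbreviations used in the article, and the OpenCL branch assumes that a \texttt{sampler\_t} object named \texttt{sampler} is in scope.

\begin{verbatim}
/* Hypothetical mapping of the abbreviated primitives onto both APIs. */
#ifdef __CUDACC__                 /* CUDA */
  #define sync()             __syncthreads()
  #define texfetch(t, x, y)  tex2D(t, x, y)
#else                             /* OpenCL */
  #define sync()             barrier(CLK_LOCAL_MEM_FENCE)
  #define texfetch(t, x, y)  read_imagef(t, sampler, (float2)(x, y)).x
#endif
\end{verbatim}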

 We use integer division and modulo operations across the code listings. These operations are slow on GPUs and, whenever the divisor is a power of two, should be performed with bit manipulation instead. However, optimizing compilers replace them with the faster bit-level instructions automatically, so we are free to use the notation that is easier to read. There are a few other cases where the optimization is left to the compiler.
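 A minimal example of the substitution the compiler performs when the divisor is a compile-time power of two (here 16) is shown below.

\begin{verbatim}
// Division becomes a shift and modulo becomes a mask when the
// divisor is a compile-time power of two.
__device__ void split_index(unsigned int i, unsigned int *q, unsigned int *r) {
    *q = i / 16;    // compiled to i >> 4
    *r = i % 16;    // compiled to i & 15
}
\end{verbatim}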