\section{Hybrid Approaches}\label{section:hybridrec}
 We have proposed two algorithms to perform back-projection. One relies on the texture engine and is bound by its performance. The second uses shared memory and ALUs and places only a small load on the texture engine. In this section we propose two methods to balance the load across all hardware components.

\subsection{Combined Approach for Pascal Architecture}\label{section:hyrbid}
 On the Maxwell and Pascal architectures, shared memory and SFU performance are the main limiting factors for the ALU-based algorithm. Both resources are only lightly utilized by the texture-based kernel. Therefore, it is possible to run the texture-based kernel for one part of the blocks and the ALU-based kernel for another. NVIDIA GPUs allow a block to detect which SM it is executing on. Consequently, it is possible to ensure that the desired ratio between texture- and ALU-based blocks is maintained.
 
 An array of per-SM counters is statically defined in global memory. The first thread of each block resolves the SM number using a \emph{get\_smid()} helper, which reads the \emph{\%smid} special register, and increments the corresponding cell of the array with an atomic operation. The returned value is the block's sequence number on that SM and, depending on the requested ratio, one of the two algorithms is executed. The code snippet is shown below.
  
\begin{lstlisting}
// Per-SM block counters (one cell per possible SM)
__device__ uint smblocks[128] = {0};

// Returns the ID of the SM executing the calling thread
__device__ static uint get_smid() {
    uint smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ static void reconstruct_hybrid() {
    __shared__ uint block;
    // The first thread obtains the block's sequence number on this SM
    if ((threadIdx.x == 0)&&(threadIdx.y == 0)) {
        uint smid = get_smid();
        block = atomicAdd(&smblocks[smid], 1);
    }
    __syncthreads();
    // Alternate between the two algorithms (1:1 ratio)
    if (block&1) reconstruct_tex(...);
    else reconstruct_alu(...);
}
\end{lstlisting}

 In \sectionname~\ref{section:newtex} we proposed an advanced thread mapping scheme for the texture-based kernel. The intention was to keep pixel-to-block assignments minimal and to preserve the performance also for small images. The ALU kernel, however, targets larger image sizes and works with a 32-by-32 pixel area at minimum. Therefore, an alternative, simpler mapping is used for the texture-based kernel if it is executed as part of the hybrid approach. The block-to-pixel assignments are kept in sync with the ALU-based kernel. At each step a standard region of 16-by-16 pixels is processed. The thread-to-pixel assignments follow the mapping described in \sectionname~\ref{section:remap}. Each thread is responsible for 4 to 16 pixels and processes them in a loop. The same texture is used to perform linear interpolation in the blocks running the texture-based algorithm and to cache data in the blocks executing the ALU-based reconstruction. The performance and the utilization of the GPU subsystems under the different reconstruction modes are reviewed in \tablename~\ref{tbl:utilization}.
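 The per-thread loop can be sketched as follows (a minimal illustration assuming a plain row-major thread-to-pixel assignment rather than the remapping of \sectionname~\ref{section:remap}; \emph{region\_x}, \emph{region\_y}, and \emph{regions\_per\_block} are hypothetical helpers describing how the 16-by-16 regions tile the block area):

\begin{lstlisting}
// Sketch: a block covers a 32x32 or 64x64 pixel area and walks it
// as a sequence of 16x16 regions; with 256 threads each thread
// handles one pixel per region, i.e. 4 to 16 pixels in total.
uint tid = threadIdx.y * blockDim.x + threadIdx.x;  // 0..255
for (uint r = 0; r < regions_per_block; r++) {      // 4 or 16 regions
    uint px = region_x[r] + (tid & 15);             // column within region
    uint py = region_y[r] + (tid >> 4);             // row within region
    // ... accumulate the back-projection sum for pixel (px, py)
}
\end{lstlisting}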
 
 
%ToDo: Shall we add issue slot utilization here?
\begin{table}[htb] %[htbp]
\begin{threeparttable}
\caption{\label{tbl:utilization} Utilization of functional units in hybrid reconstruction mode}
\centering
\noindent

\begin{tabularx}{\columnwidth}{ | X | c | c | c | c | c | }
\hline
Method             & Texture   & Shared     & ALU     & SFU      & Perf.\\
\hline
Tex-based          & 100\%     & 20\%       & 40\%    & 10\%     & 726 GU/s \\
ALU-based          & 10\%      & 90\%       & 60\%    & 50\%     & 693 GU/s \\
Hybrid             & 70\%      & 70\%       & 70\%    & 40\%     & 995 GU/s \\
\hline
Oversampling       & 20\%      & 90\%       & 50\%    & 40\%     & 1107 GU/s \\
\hline
\end{tabularx}

\begin{tablenotes}
\item Utilization of NVIDIA GeForce Titan X (Pascal) subsystems using different reconstruction algorithms. Two slices are reconstructed in parallel according to the configuration given in tables~\ref{tbl:newtex}~and~\ref{tbl:alurec}. The utilization is obtained using \emph{nvprof} based on the following metrics: \emph{tex\_fu\_utilization}, \emph{shared\_utilization}, \emph{single\_precision\_fu\_utilization}, \emph{special\_fu\_utilization}.
\end{tablenotes}
\end{threeparttable}
\end{table}

%Whatever in second half of warps we call tex fetches or shmem fetches, we can't push more shmem fetches until outstanding shmem fetches returned. This happens more-or-less the same time (just a bit faster in facts) independent of what is action of other warps. So, there is a limited number of situations when the combination is effective. While for single-processing processing, the performance is bound by shmem in our ALU version. Actually, the FP limit is quite close and a little (if any) speed-up is possible. 

 In the 2-slice reconstruction mode, the performance of the texture-based and ALU-based kernels is very close. Therefore, half of the blocks run the ALU-based reconstruction and the other half uses the texture engine. The hybrid approach outperforms the optimized texture-based method by 30\% in this case. The ALU-based reconstruction is significantly faster if only a single slice is reconstructed. The SM on Pascal and Maxwell runs up to 8 blocks with 256 threads each. The ALU reconstruction is then executed for 5 blocks and the texture-based reconstruction for the other 3. This secures a 20\% higher throughput over the plain ALU-based reconstruction. Too many registers are required if 4 slices are reconstructed in parallel. Consequently, the low occupancy penalizes the performance of the texture-based kernel significantly.
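 The 5:3 split can be realized with the per-SM counter from the listing above; a minimal sketch (the modulo constant corresponds to the 8 blocks resident on an SM, and the kernel arguments are omitted as before):

\begin{lstlisting}
__global__ static void reconstruct_hybrid_5to3() {
    __shared__ uint block;
    if ((threadIdx.x == 0)&&(threadIdx.y == 0))
        block = atomicAdd(&smblocks[get_smid()], 1);
    __syncthreads();
    // 5 of 8 blocks per SM run the ALU-based reconstruction
    if ((block % 8) < 5) reconstruct_alu(...);
    else reconstruct_tex(...);
}
\end{lstlisting}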
 
 While it is possible to apply the described approach with nearest-neighbor interpolation, in practice there is little speed-up. The ALU kernel outperforms the texture-based kernel significantly unless a 4-slice reconstruction is performed using half-float data. Consequently, there is little benefit in executing them in parallel. The proposed method is only suitable for the Maxwell and Pascal architectures. All other devices are bound by the performance of the ALU units. While the Kepler architecture has a very high ALU performance, the ALUs are also utilized to perform rounding operations to overcome the slow SFU performance. Since the texture-based kernel also uses the ALUs intensively, no performance gains are measured. The used configuration and the achieved performance on the Maxwell and Pascal platforms are presented in \tablename~\ref{tbl:hybrid}.

\input{table_7x_hybrid.tex}


\subsection{Oversampling}\label{section:oversample}
 There is an alternative approach to improve the utilization of the texture engine alongside the ALU-based reconstruction. The idea is to sample several values for each bin of a sinogram and to use nearest-neighbor instead of linear interpolation, see \figurename~\ref{fig:overs}. While more shared memory is required, the number of computations and memory transactions is reduced in this case. A significant speed-up over linear interpolation is achieved if 4 values are sampled for each bin at the offsets $.00$, $.25$, $.50$, and $.75$. \figurename~\ref{fig:quality_overs} and \figurename~\ref{fig:quality_overs_fossil} compare the described approach against the reconstructions performed using nearest-neighbor and linear interpolation. In both cases the reconstruction in oversampling mode is considerably closer to the result obtained using linear interpolation than the plain nearest-neighbor reconstruction is.
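 The accumulation step then reduces to a cache lookup; a minimal sketch with illustrative variable names, following the index arithmetic of \figurename~\ref{fig:overs}:

\begin{lstlisting}
// Sketch: the fractional offset from the first cached bin is
// scaled by the oversampling factor of 4 and rounded to the
// nearest cached sample, replacing the linear interpolation.
float offset = bin_pos - first_cached_bin;   // fractional bin offset
int   idx    = __float2int_rn(4.f * offset); // nearest oversampled value
sum         += cache[proj][idx];             // update the pixel value
\end{lstlisting}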

\begin{figure}[htb]
\centering
\includegraphics[width=\imgsizedefault\textwidth]{img/overs.pdf}
\caption{\label{fig:overs} The figure illustrates the oversampling reconstruction approach. To reconstruct a 32x32 pixel square, a thread block caches 192 values per projection (left). The values are fetched from 48 bins at uniform intervals using the texture engine. Then, the reconstruction is performed and the projections are processed in a loop one after another (right). To determine the required position in the cache, the offset from the first bin of the cache is multiplied by 4 and the result is rounded to the nearest integer. The value at this position is loaded from the array and used to update the pixel value.}
\end{figure}
 
\begin{figure}[htb]
\centering
%\includegraphics[width=\imgsizedefault\textwidth]{img/quality_overs.pdf}
\includegraphics[width=\imgsizedefault\textwidth]{img/quality_overs_zoomed.pdf}
\caption{\label{fig:quality_overs} Comparison of the reconstructions performed using nearest-neighbor (\emph{NN}) and linear interpolation with the hybrid oversampling approach (\emph{overs}). The profile plot along the selected line is shown in the top part of the figure for the phantom and all reconstruction methods. The absolute difference from the exact phantom image is shown along the same line in the bottom part.}
\end{figure} 

\begin{figure}[htb]
	\centering
	%\includegraphics[width=\imgsizedefault\textwidth]{img/quality_overs.pdf}
	\includegraphics[width=\imgsizedefault\textwidth]{img/quality_overs_fossil.pdf}
	\caption{\label{fig:quality_overs_fossil} Comparison of the reconstructions performed using nearest-neighbor (\emph{NN}) and linear interpolation with the hybrid oversampling approach. The profile plot along the selected line is shown in the top part of the figure for the fossilized wasp dataset and all reconstruction methods. The absolute difference from the reconstruction performed using linear interpolation is shown along the same line in the bottom part.}
\end{figure} 

    Implementation-wise, a few modifications are required for optimal performance. The amount of shared memory used is quadrupled. To achieve a reasonable occupancy, the number of cached projections has to be reduced. Typically only 4 to 8 projections are processed in parallel. The amount of shared memory available on Kepler still does not allow reaching 50\% occupancy if multiple slices are reconstructed. Since the performance is significantly penalized if only 2 projections are cached per iteration of the projection loop, the Kepler GPUs run at an occupancy below 50\%. On the Kepler-based Titan card, the occupancy actually permitted by the shared memory usage is hinted to the compiler and 72 registers are used per thread. The GeForce GTX680 is restricted to 63 registers per thread, and hinting an occupancy below 50\% is not useful as the extra registers cannot be assigned.
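    In CUDA such a hint can be expressed with the \emph{\_\_launch\_bounds\_\_} qualifier; a minimal sketch, assuming 256-thread blocks, the 3 resident blocks per SM permitted by shared memory on the Kepler-based Titan, and a hypothetical kernel name:

\begin{lstlisting}
// Sketch: requesting only 3 resident blocks per SM (the limit
// already imposed by shared memory) lets the compiler spend a
// larger register budget per thread, 72 registers in this case.
__global__ void __launch_bounds__(256, 3)
reconstruct_overs(/* ... */) {   // hypothetical kernel name
    // ...
}
\end{lstlisting}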

    To avoid idling at the synchronization point, 32 to 64 threads are used to cache each projection row. The texture engine is expected to perform linear interpolation to deliver data at the fractional offsets. Consequently, the half-float data representation cannot be used together with oversampling. If the caching of $h_x$ is enabled as explained in \sectionname~\ref{section:alu_hx}, the first 16 threads of each warp are assigned to cache the first component of the $h_x$ vector and the second half-warp stores the second component. This allows all required data to be cached using a single 32-bit instruction per thread and reduces the required shared memory bandwidth.
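 One possible arrangement is sketched below (illustrative names; \emph{hx\_value()} is a hypothetical helper computing the value for a given bin):

\begin{lstlisting}
// Sketch: lanes 0-15 of a warp store the first component of the
// cached h_x values and lanes 16-31 the second one, so every
// thread issues a single 32-bit shared memory store.
uint lane = threadIdx.x & 31;
uint bin  = bin_base + (lane & 15);     // bin served by this lane pair
float2 hx = hx_value(bin);              // hypothetical helper
if (lane < 16) hx_cache[bin].x = hx.x;  // first half-warp
else           hx_cache[bin].y = hx.y;  // second half-warp
\end{lstlisting}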

 The data locality is significantly worse if the oversampling approach is used: up to 4 times more values are accessed by each warp. Consequently, shared memory bank conflicts become very likely. To reduce the number of conflicts, the data vectors are split if multiple slices are reconstructed in parallel. The vector components representing the individual sinograms are extracted after the texture fetch and stored into 2 to 4 separate caches. On systems with 32-bit shared memory banks, a dedicated buffer is allocated for each sinogram component. On platforms preferring 64-bit over 32-bit loads, the data is split only if a 4-slice reconstruction is performed. Two buffers are used in this case, each storing the sinogram components for a pair of reconstructed slices. During the accumulation step, the values are extracted from all caches using the same index and re-combined into the appropriate vector again. To allow 64-bit writes also during a single-slice reconstruction on the Kepler platform, the shared memory writes are re-combined as explained in \sectionname~\ref{section:alurec_shmem}. The used configuration and the achieved performance are summarized in \tablename~\ref{tbl:overs}.
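 A minimal sketch of the splitting for a 2-slice reconstruction on a platform with 32-bit shared memory banks (the buffer names, sizes, and the texture-object access are illustrative):

\begin{lstlisting}
// Sketch: the two sinogram components fetched as a float2 are
// stored in separate buffers to avoid bank conflicts and are
// re-combined with the same index during accumulation.
__shared__ float cache_s1[CACHE_SIZE];  // slice 1 components
__shared__ float cache_s2[CACHE_SIZE];  // slice 2 components

float2 v = tex2D<float2>(sino, bin + 0.5f, proj + 0.5f);
cache_s1[i] = v.x;
cache_s2[i] = v.y;
__syncthreads();
// ... accumulation step
float2 val = make_float2(cache_s1[idx], cache_s2[idx]);
\end{lstlisting}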

  
\input{table_7x_overs.tex}

%
%\begin{table}[htb] %[htbp]
%\begin{threeparttable}
%\caption{\label{tbl:shmemoconf} The optimal parameters to prevent shared memory bank conflicts in oversampling kernel}
%\centering
%\noindent
%%\resizebox{\columnwidth}{!}{\begin{tabular}{} ... \end{\tabular}}
%\begin{tabularx}{\columnwidth}{ | c | l | l | l | X | }
%\hline
%Area & Slices & Platform & Threads & Optimizations \\
%\hline
%\multirow{6}{*}{32x32}
%& \multirow{2}{*}{1}
%  & 32-bit      & 64 & - \\
%& & 64-bit      & 32 & write64 \\
%\cline{2-5}
%& \multirow{2}{*}{2}
%  & 32-bit      & 64 & double-buffer \\
%& & 64-bit      & 64 & - \\
%\cline{2-5}
%& \multirow{2}{*}{4}
%  & 32-bit      & 64 & quad-buffer \\
%& & 64-bit      & 64 & double-buffer \\
%\hline
%\multirow{2}{*}{64x64}
%& \multirow{2}{*}{1}
%  & 32-bit      & 64 & - \\
%& & 64-bit      & 64 & write64 \\
%\cline{2-5}
%& \multirow{2}{*}{2}
%  & 32-bit      & 64 & double-buffer \\
%& & 64-bit      & 64 & - \\
%\cline{2-5}
%& \multirow{2}{*}{4}
%  & 32-bit      & 64 & quad-buffer \\
%& & 64-bit      & 64 & double-buffer \\
%\hline
%
%\end{tabularx}
%\begin{tablenotes}
%\item For each considered configuration, the number of threads per projection row and the required optimizations are specified. The \emph{double-buffer} optimization splits the shared memory cache in 2 parts to prevent bank conflicts on the NVIDIA Fermi and all considered AMD architectures. The \emph{write64} optimization combines two writes to shared memory to use full bandwidth of Kepler GPUs.
%\end{tablenotes}
%\end{threeparttable}
%\end{table}