\subsubsection{Optimizing efficiency of the texture cache}\label{section:newtex}
  The \figurename~\ref{eff_std} also evaluates how efficiently the described implementation utilizes the available texture throughput. While the performance is close to the theoretical maximum on most of the considered architectures if the single-slice processing mode is used, the efficiency drops significantly if multiple slices are reconstructed in parallel. The AMD cards and the cards based on the NVIDIA Kepler architecture show sub-optimal performance also in the single-slice reconstruction mode. There are two main reasons for this:
\begin{itemize}
\item The default reconstruction algorithm does not take into consideration the 2D locality of the texture cache explained in section~\ref{section:texture}. Consequently, the texture cache is not used efficiently and does not deliver data fast enough, slowing down the filtering rate.
\item On top of that, the constant memory is relatively slow on the Kepler architecture, see \tablename~\ref{table:gpuratios}, and also becomes a performance bottleneck.
\end{itemize}  

 Here we present an updated algorithm aimed at optimizing the locality of the texture fetches. The original algorithm maps each GPU thread to a single pixel of the output slice and processes the projections sequentially in a loop. Instead, we propose a different mapping scheme that allows processing multiple projections in parallel. As the projection angle normally changes very slowly, it is reasonable to assume that neighboring bins in consecutive projection rows contribute to the same pixel value. Therefore, an improved 2D locality is achieved if each warp is assigned to process several consecutive projections for a group of pixels. Furthermore, to reduce the spread of the accessed bins, it is preferable if the warp processes a square area of the output volume. This is in contrast to the original algorithm, which assigns warps to lines of pixels along the x-dimension. Therefore, a group of 64 consecutive threads is assigned to a 4x4 pixel square. Each pixel of the output slice is reconstructed using 4 threads, each responsible for computing the contribution to the pixel value from a quarter of all available projections. To avoid costly atomic operations, the contributions of the projection subsets are summed completely independently. As the threads contributing to a pixel value belong to the same thread block, the shared memory is used to store the computed partial sums. Then, the threads are re-assigned to perform a reduction and compute the resulting pixel value.


 As 4 threads are now required for each output pixel, it is necessary to either quadruple the size of the computational grid or to increase the number of pixels processed by each thread. To process a projection, the GPU threads need to locate the bins contributing to the values of the reconstructed pixels. This is done based on the pixel coordinates and several projection constants which are stored in the GPU memory. If a thread reconstructs several pixels, the constants can be loaded only once and, then, re-used to locate all required bins for the considered pixels. Therefore, to reduce the load on the GPU memory, it is better to increase the work-load of the GPU threads instead of enlarging the grid dimensions. This means each GPU thread computes the contribution to 4 output pixels from a quarter of the available projections. The following mapping is adopted. The thread block assignments are kept exactly the same as in the standard version. Each thread block is responsible for an output area of 16-by-16 pixels. However, this area is further subdivided into 4-by-4 pixel squares. As we have 256 threads in the block and need 64 threads per square, 4 such squares are processed in parallel, and a complete set of 16 squares requires 4 iterations. There are several ways to arrange the mapping between the GPU threads and the pixel squares, see \figurename~\ref{fig:mappings}. The first mapping is sparse and results in a reduced cache hit rate as compared to the other options. The third option requires fewer registers than the second one as only a single pixel coordinate is incremented for each thread. Since the usage of additional registers may result in reduced occupancy or the spillage of registers into the local memory, the third approach is preferred.
 
\begin{figure}[htb]
\centering
\includegraphics[width=0.45\textwidth]{img/mappings.pdf}
\caption{\label{fig:mappings} The figure illustrates several ways to assign a block of GPU threads to an area of 16-by-16 pixels. Since 4 projections are processed at once, only 64 threads are available for the entire area and it takes 4 iterations to process it completely. For each scheme, the pixels processed in parallel during the first iteration are shown in blue. The first mapping (left) is sparse and results in increased cache misses. The second mapping (center) requires more registers and may cause reduced occupancy. So, the third mapping (right) is preferred. }
\end{figure}
 
 The next question is how to actually map 64 threads to a compound 3D space consisting of a 4-by-4 pixel area within 4 selected projections. As stated in section~\ref{section:texture}, the cache utilization is optimal if all simultaneous accesses by a group of 16 consecutive threads lie within a 4-by-4 texture square. As there are only 4 projections, the spread of the projection bins accessed by each group of 16 threads has to be minimized. The bin is computed as $x\cdot\cos(p\alpha) - y\cdot\sin(p\alpha)$ and, consequently, the spread is minimal if the $(x,y)$ coordinates belong to the smallest possible square area. Therefore, the reconstructed area is further split into 2-by-2 pixel squares. Each group of 16 threads is used to reconstruct one such square for all 4 projections. The \figurename~\ref{fig:newtex4} shows the described mapping. The first 4 threads are assigned to the first projection and are responsible for a 2x2 pixel square. The next 4 threads process the same pixel square of the second projection, and so on. The next group of 16 threads restarts from the first projection in the same way, but processes another square of 2x2 pixels. The pseudo-code to convert the thread indexes to the re-mapped pixel offsets is given in Algorithm~\ref{alg:newtex4remap}.
 
\begin{figure}[htb]
\centering
\includegraphics[width=0.45\textwidth]{img/newtex4.pdf}
\caption{\label{fig:newtex4} The figure illustrates how the 64 threads are mapped to a 4-by-4 pixel area across 4 projections. The pixels in the 4-by-4 area are numbered sequentially from 1 to 16 as shown in the bottom part of the figure. The mapping is shown in the top part. For each projection and output pixel the corresponding thread number is shown. }
\end{figure}


 
\begin{algorithm}[ht]
\DontPrintSemicolon


\caption{\label{alg:newtex4remap} Optimizing the thread mapping for better cache locality}
%\begin{algorithmic}
\KwIn {$threadIdx$ is the original mapping as reported by CUDA/OpenCL}
\KwOut {$\vdata{m}_t'$ and $m_p$ form a new mapping associating each GPU thread with a projection and a pixel which are processed during the first iteration. The pixels on the subsequent iterations are computed by adding 4 to the $y$-coordinate on each step. }
\Begin {
%\tcc{


\tcc{Each thread is responsible for one of the 4 pixels lying within a small 2x2 pixel square which, in its own right, is one of the 4 squares composing the larger 4x4 pixel block. Here we determine the sequential number of the pixel within the small square, the sequential number of the small square within the larger pixel block, and the sequential number of this block.}
$block_n \eq \vy{m_t} \idiv 4$ \;
$square_n \eq \vy{m_t} \mod 4$ \;
$proj_n \eq \vx{m_t} \idiv 4$ \;
$pixel_n \eq \vx{m_t} \mod 4$ \;

\tcc{Converting the sequential number to $x$,$y$ coordinates.}
$\vdata{square} \eq \vlist{square_n \mod 2, square_n \idiv 2}$ \;
$\vdata{pixel} \eq \vlist{pixel_n \mod 2, pixel_n \idiv 2}$ \;

\tcc{Finally compute the actual pixel coordinates within the block.}
$\vdata{m}_t' \eq  2 \mul \vdata{square} + \vdata{pixel}$ \;
$\vx{m_t'} \aeq 4 \mul block_n $ \;
$m_p \eq proj_n$ \;
}
\end{algorithm}
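For illustration, a minimal CUDA sketch of this remapping is given below; the variable names are hypothetical and mirror the notation of the pseudo-code, and a 16-by-16 thread block is assumed.
\begin{lstlisting}
// Sketch of the thread remapping (assumed names, 16x16 thread block).
int block_n  = threadIdx.y / 4;  // which of the 4 parallel 4x4 pixel blocks
int square_n = threadIdx.y % 4;  // which 2x2 square within the 4x4 block
int proj_n   = threadIdx.x / 4;  // which of the 4 projections
int pixel_n  = threadIdx.x % 4;  // which pixel within the 2x2 square

// Convert the sequential numbers to x,y coordinates.
int sx = square_n % 2, sy = square_n / 2;
int px = pixel_n  % 2, py = pixel_n  / 2;

// Pixel coordinates within the 16x16 area and the assigned projection.
int mx = 2 * sx + px + 4 * block_n;
int my = 2 * sy + py;
int mp = proj_n;
\end{lstlisting}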

 Because of the revised thread mapping, each warp now processes multiple projections in parallel and, consequently, also has to access multiple elements of the constant array in parallel. On the NVIDIA platform this results in the serialization of loads from the constant memory and penalizes performance significantly. To avoid this penalty, the projection constants are stored in the global GPU memory and, further, cached in the shared memory. The caching is important because the global memory is slow and the standard L1/L2 caches available across GPU architectures are small and prone to cache poisoning.
%Such complex approach is unnecessary on AMD platform as no serialization is performed by AMD graphics cards. Instead the constant memory is just read trough the L1 cache and only a minor performance penalty is inflicted. Nevertheless, for the sake of uniformity we can use global memory to store projection constants also on AMD platform.
 
 The pseudo-code for the described approach is presented in Algorithm~\ref{alg:newtex4}. The algorithm includes 3 main stages:
\begin{itemize}
\item Caching of the projection constants in the shared memory 
\item Computing the partial sums over a quarter of the projections
\item Summing the partial sums together
\end{itemize}
 
  The shared memory is configured to store the constants for up to 256 projections. To perform the caching, an outer loop is introduced to iterate over the blocks of 256 projections. First, all threads of the block are used to read the constants from the global memory and fill the cache. We use a 64-bit float2 variable to store the values of both trigonometric functions. This is important on the Kepler architecture as the full shared memory bandwidth is only utilized if 64-bit requests are issued. There is no penalty on other platforms, and even a minor speed-up is possible as fewer memory operations are in flight. After synchronization, the inner projection loop is started to compute the partial sums. It iterates over 4 projections at a time. Finally, the inner-most loop is used to process all 4 pixels the thread is responsible for. After processing the input data, another loop is executed. It writes the partial sums into the shared memory and performs the reduction. To avoid non-coalesced global memory writes, all results are first stored in a shared memory buffer $\shmem{\vfloat{r}}$. Then, the results are written out in the fastest possible way. A synchronization is needed when switching between the different mapping modes. However, each reduction is performed by only a single warp. For this reason, it is only necessary to prevent the compiler from re-ordering reads and writes and, consequently, the $fence$ operation is sufficient here.
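As an illustration, the caching stage might look as follows in CUDA; the array names (c\_cos, c\_sin, c\_axis) and the cache size of 256 projections are assumptions made for the sketch.
\begin{lstlisting}
// Sketch of the constant-caching stage (assumed array names).
__shared__ float2 cs_cache[256];  // cos/sin packed for 64-bit shared loads
__shared__ float  ca_cache[256];  // per-projection axis corrections

int tid = threadIdx.y * blockDim.x + threadIdx.x;   // 0..255
for (int pb = 0; pb < n_proj; pb += 256) {
    if (pb + tid < n_proj) {
        cs_cache[tid] = make_float2(c_cos[pb + tid], c_sin[pb + tid]);
        ca_cache[tid] = c_axis[pb + tid];
    }
    __syncthreads();
    // ... partial sums over 4 projections at a time, 4 pixels per thread ...
    __syncthreads();
}
\end{lstlisting}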

%ToDo: Measure something for GT200?
 The proposed algorithm allowed us to reach 80-90\% of the theoretical performance on all platforms except GT200 and Kepler, see \figurename~\ref{fig:texeff}. The proposed optimization is not suitable for the GT200 architecture due to the scarcity of registers and performs worse than the simple straightforward implementation. The performance of the Kepler architecture is bound by the throughput of the constant and shared memories.
  
%Todo: Texture Hit Rate for some architecture?
%ToDo: While we are reading textures we should not enable L1/LDG caching as it would conflict with ...
%ToDo: Enable 8-byte mode
  
  
\begin{algorithm}[htb]
\DontPrintSemicolon
\caption{\label{alg:newtex4} Cache-aware implementation of the back-projection kernel}
%\begin{algorithmic}
\KwIn {Texture and the projection constants $\gmem{c}_*$. Mappings $\vdata{m}_t'$ and $m_p$ are computed as explained in Algorithm~\ref{alg:newtex4remap} and, then, used to compute $\vdata{m}_g'$ and $\vdata{m}$ as specified in \tablename~\ref{table:alg_vars}.}
\KwShMem {$\shmem{\vdata{c}_{cs}}[s_p]$, $\shmem{c_{a}}[s_p]$, $\shmem{\vfloat{s}}[64][4]$, $\shmem{\vfloat{r}}[16][16]$}
\KwOut {Reconstructed slice $\vfloat{r}$}
\Begin {
%$\vdata{m}_g' \eq \vdata{m}_b \vmul \vdata{n}_t + \vdata{m}_t'$ \;
%$\vdata{m} \eq \vdata{m}_g' - \vlist{v_a - \fconst{0.5}, v_a - \fconst{0.5}}$ \;
%$id_l \eq \vy{m_t} \cdot \vx{n_t} + \vx{m_t}$ \;

$\vfloat{s}[4] \eq \{0\}$ \;
\ForToBy{p_b}{0}{n_p}{s_p}{
  \tcc{Caching the projection constants in shared memory}
  $\shmem{\vdata{c}_{cs}}[m_l] \eq \vlist{\gmem{c_c}[p_b + m_l], \gmem{c_s}[p_b + m_l]}$ \;
  $\shmem{c_a}[m_l] \eq \gmem{c_a}[p_b + m_l]$ \;

  \KwSync \;

  \tcc{Computing the partial sums}
  \ForToBy{p}{m_p}{min(s_p, n_p - p_b)}{4}{
    $c_s \eq \vy{\shmem{c_{cs}}[p]}$ \;
    $h \eq \shmem{c_a}[p] + \vx{m} \mul \vx{\shmem{c_{cs}}[p]} - \vy{m} \mul \vy{\shmem{c_{cs}}[p]} + \fconst{0.5}$ \;
    
	\ForTo{b}{0}{4}{
       $\vfloat{s}[b] \aeq $ \KwTex{$h$, $p_b + p + \fconst{0.5}$} \;
%       $h \seq 4 \mul \vy{\shmem{c_{cs}}[p]}$ \;
       $h \seq 4 \mul c_s$ \;
    }
  }

  \KwSync \;
}

%\tcc{Another mapping for reduction}
$\vdata{m}_t'' \eq  \vlist{\vx{m_t} \mod 4, 4 \mul \vy{m_t} + \vx{m_t} \idiv 4}$ \;
\ForTo{b}{0}{4}{
  \tcc{Moving the partial sums to shared memory}
  $\shmem{\vfloat{s}}[\vx{n_t} \mul \vy{m_t'} + \vx{m_t'}][m_p] \eq \vfloat{s}[b]$ \;
  \KwSync \;
  
  \tcc{Performing reduction}
  \CFor{i \eq 2}{i \geq 1}{i \deq 2}{
    \If{$\vx{m_t''} < i$}{
        $\shmem{\vfloat{s}}[\vy{m_t''}][\vx{m_t''}] \aeq \shmem{\vfloat{s}}[\vy{m_t''}][\vx{m_t''}+i]$ \;
    }
    \KwFence \;
  }
  
  \tcc{To coalesce global memory writes, the results are grouped in shared memory}
  \If{$\vx{m_t''} \ifeq 0$} {
     $\shmem{\vfloat{r}}[4 \mul b + \vy{m_t''} \idiv 16][\vy{m_t''} \mod 16] \eq \shmem{\vfloat{s}}[\vy{m_t''}][0]$ \;
  }
  \KwSync \;
}
$\vfloat{r}[\vy{m_g},\vx{m_g}] \eq \shmem{\vfloat{r}}[\vy{m_t}][\vx{m_t}]$  \;
}
\end{algorithm}
  
\subsubsection{Optimizing occupancy for multi-slice mode}\label{section:newtex_occupancy}
 Like the standard algorithm, the cache-aware version can easily be adapted to process 2 and 4 slices in parallel. Only the accumulators and the intermediate buffers in shared memory have to be declared using the appropriate vector type. However, the hardware resource requirements grow significantly if we process multiple slices. In 4-slice mode, the presented version needs 16 32-bit registers just to accumulate the partial sums. Furthermore, 11~KB of shared memory is required per block: 3~KB for the constants and 8~KB for the 2 temporary arrays used for the reduction and for streamlining the memory writes. On the Kepler architecture with 48~KB of shared memory per SM, this effectively restricts the number of resident blocks to 4. With 256-thread blocks, only 32 warps can run and only 50\% occupancy can be achieved on the Kepler architecture.
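This figure follows directly from the shared-memory budget, assuming the hardware limit of 64 resident warps per Kepler SM:
\[
\lfloor 48\,\mathrm{KB} / 11\,\mathrm{KB} \rfloor = 4\;\mathrm{blocks}, \qquad
(4 \cdot 256) / 32 = 32\;\mathrm{warps}, \qquad
32 / 64 = 50\%\;\mathrm{occupancy}.
\]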
 
To handle the shortage of shared memory, we may drop some of the caches. Easy modifications include:
\begin{itemize}
\item Use the same shared memory buffer for both operations: to cache the constants and to perform the reduction
\item Write the results directly after the reduction, without using another shared memory buffer to improve write coalescence. 
\item Either reduce the number of cached constants per iteration or avoid caching the constants in shared memory completely.
\end{itemize}

 The first point has no down-side effects and was omitted from the single-slice discussion only for the sake of simplicity. As the writes are performed only once, the performance penalty due to uncoalesced writes is rather small for any practical number of processed projections. Reducing the constant cache has a larger impact as some warps have to idle at the synchronization point, reducing the effective occupancy. As the caching takes a rather small portion of the execution time, the temporary reduction of the effective occupancy is well compensated by the increased theoretical occupancy. Due to the small size of the L1 cache, however, the global memory bandwidth caps the performance on all considered architectures if direct loads from global memory are performed. One remedy in this case is to revert to storing the projection constants in the constant memory. Even if each warp needs to replay the load operation 4 times to get the differing constants for the 4 reconstructed projections, the constant cache ensures sufficient bandwidth. The improved occupancy also helps to hide the extra latencies caused by the memory fetches and results in better performance on some of the considered architectures.
 
 We do not have direct control over the register allocation. However, the additional loop required to cache the constants increases the register consumption considerably. If the 4-slice variant is compiled for the Pascal architecture, the proposed algorithm requires 46 registers per thread. With the caching of projection constants disabled, only 40 registers are needed. On the other hand, the performance penalty due to direct loads from the global or constant memories may outweigh the gains of better occupancy. Furthermore, NVIDIA allows specifying the desired number of blocks to be resident per SM. The CUDA optimizer, then, changes the code generation to address this requirement. Generally, as explained in section~\ref{section:texture}, this results in an increased computational load and a more intensive use of the L1 caches as part of the local variables is offloaded to the local memory. On the Fermi and Kepler architectures, we may then need to re-balance the amount of available shared memory and L1 cache to achieve high occupancy on one hand and to ensure there is enough L1 cache to back all the required local memory on the other. We have found that setting such bounds has a major positive effect on the performance. However, special care should be taken while porting the code to GPUs with a different compute capability or to a newer CUDA version. The generated code may differ significantly and different bounds will be required for optimal performance. We discuss this aspect in more detail in section~\ref{section:bounds}.

 The considered Maxwell- and Pascal-based GPUs are equipped with 96~KB of shared memory per SM. So, full occupancy can be achieved with all explained caches enabled. However, 46 registers are required if the caching of the projection constants is enabled. This restricts the number of resident threads to 1424 and, hence, only 5 blocks of 256 threads can be scheduled. Disabling the constant cache reduces the register footprint to 40 and allows 6 blocks to run in parallel. If the constant memory is used to store the projection constants, this results in a xx\% performance increase from xx\% to xx\% of the theoretical maximum.
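For instance, assuming the 64K-entry register file of a Maxwell/Pascal SM, the block limits quoted above follow from
\[
\lfloor 65536 / (46 \cdot 256) \rfloor = 5\;\mathrm{blocks}, \qquad
\lfloor 65536 / (40 \cdot 256) \rfloor = 6\;\mathrm{blocks}.
\]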
%ToDo xx

 The Fermi and Kepler GPUs have 64~KB of on-chip memory in each SM which is split between the L1 cache and the shared memory according to a user-specified configuration. A maximum of 48~KB can be allocated for shared memory. Consequently, only 4 blocks can be executed if all caches are enabled in 4-slice mode. This is reasonably high for the Fermi architecture as the occupancy is further bound by the register footprint. The Kepler architecture, however, has double the amount of registers and the shared memory becomes the limiting factor. Combining the constant and reduction caches, it is possible to reduce the required shared memory down to 8~KB and to run up to 6 blocks per SM. However, 53 registers are required on the Kepler-based Titan GPU if the constant caching is enabled. This limits the number of resident threads to 1236, or 4 blocks, and results in an occupancy of 50\%. Unfortunately, the slow performance of Kepler's constant memory makes it impossible to drop the caching of constants in order to reduce the register footprint and improve the occupancy. The performance is improved if the CUDA compiler is instructed to use fewer registers and allow the execution of at least 6 blocks, i.e. running at 75\% occupancy. The resulting code places 4 variables in the local memory, but brings about a 10\% speed-up over the unrestricted version. Bounding the number of registers is also helpful for the Fermi architecture. Setting the block limit to 3 increases the occupancy from 33\% to 50\% and brings the performance up by 18\%. The \tablename~\ref{tbl:newtex_bounds} summarizes the resource consumption, theoretical occupancy, and achieved performance of the methods with and without the resource optimization. The results show that improving the occupancy sometimes brings a considerable speed-up even if some variables have to be offloaded to the local memory backed by the L1 cache. Restricting the number of registers further results in a significant drop of performance due to a drastic increase in the local memory bandwidth.
 
 % ToDO: 2-slice mode. For Fermi, we need 3 (auto?). For Kepler, we explain bellow. Why Kepler is slow
While the problem is significantly less pronounced, it still makes sense to set the bounds for the 1- and 2-slice modes as well. 100\% occupancy is optimal on Kepler and later architectures if a single slice is processed. The 8 blocks should be requested using \emph{\_\_launch\_bounds\_\_}.
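A minimal example of such a bound (the kernel name is hypothetical):
\begin{lstlisting}
// Request 256-thread blocks and at least 8 resident blocks per SM,
// i.e. 2048 threads or 100% occupancy on Kepler and later GPUs.
__global__ void __launch_bounds__(256, 8)
backproject_tex(/* ... kernel arguments ... */)
{
    // ...
}
\end{lstlisting}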


\begin{table}[t]
\begin{threeparttable}
\caption{\label{tbl:newtex_bounds} Occupancy and performance in the 4-slice reconstruction mode}
\centering
\begin{tabular} { | l | l | c | c | c | c | c | }
%\multicolumn{6}{l}{\textbf{NVIDIA GeForce GTX 580 (Fermi, sm21)}} \\
\hline
GPU & Approach 	& SMem & Regs & LMem & Occ. & Perf. \\
\hline
\multirow{3}{*}{580} 
& Std    & 11~KB & 44 & 0 & 33\% & 146.2~GU/s \\
& CMem   &  8~KB & 38 & 0 & 50\% & 161.8~GU/s \\
& Bounds &  8~KB & 42 & 1 & 50\% & 172.8~GU/s \\ 
%\hline
%\multicolumn{6}{l}{\textbf{NVIDIA GeForce GTX 680 (Kepler, sm30)}} \\
%\hline
%Approach 	& SMem & Regs & LMem & Occ. & Perf. \\
\hline
\multirow{3}{*}{680} 
& Std    & 11~KB & 44 & 0 & 50\% & 197.4~GU/s \\
& CMem   &  8~KB & 39 & 0 & 75\% & 167.0~GU/s \\
& Bounds &  8~KB & 40 & 4 & 75\% & 220.1~GU/s \\ 

%\hline
%\multicolumn{6}{l}{\textbf{NVIDIA GeForce GTX Titan (Kepler, sm35)}} \\
%\hline
%Approach 	& SMem & Regs & LMem & Occ. & Perf. \\
\hline
\multirow{3}{*}{Titan} 
& Std    & 11~KB & 54 & 0 & 50\% & 303.7~GU/s \\
& CMem   &  8~KB & 40 & 0 & 75\% & 258.5~GU/s \\
& Bounds &  8~KB & 40 & 4 & 75\% & 338.6~GU/s \\ 
\hline
\end{tabular}
\begin{tablenotes}
\item \tabledesc{The table compares the resource requirements and performance of the standard approach described in section~\ref{section:newtex} with the two methods aimed at improving the occupancy. The first method (\emph{CMem}) skips the caching step and loads the projection constants directly from the GPU constant memory. The second method (\emph{Bounds}) instructs the CUDA compiler to use fewer registers. Both methods perform un-coalesced writes to the global GPU memory. The table lists the usage of shared memory (per block), the number of used 32-bit registers (per thread), the number of 32-bit variables allocated in the local memory (per thread), and the achieved performance in giga-updates per second.}
\end{tablenotes}
\end{threeparttable}
\end{table}



\subsubsection{Handling Kepler Architecture}\label{section:newtex_kepler}
 As already shown in the previous section, the Kepler architecture needs special treatment. Both the shared and constant memories are comparatively slow with respect to the performance of the texture engine, see \tablename~\ref{tbl:gpu_ratios}. Furthermore, 64-bit access is required to fully utilize the available bandwidth of the shared memory. According to our measurements in section~\ref{section:cmem}, there is only 2~KB of L1 constant cache and for optimal performance it requires 64-bit access as well. The larger 8~KB cache layer is shared by a cluster of several SMs, but it is only capable of providing half the bandwidth of the shared memory.
 
 The 64-bit access is enabled in CUDA using the \emph{cudaDeviceSetSharedMemConfig} command with the \emph{cudaSharedMemBankSizeEightByte} argument. Then, the sine and cosine of a projection angle stored in a float2 vector are loaded using a single 64-bit operation. However, only half of the available bandwidth can be used while loading the axis corrections. If all projections have the same center and such corrections are not needed, the $\shmem{c_a}$ cache can be dropped and about a 10\% speed-up is secured on the NVIDIA GTX 680 and Titan GPUs. Otherwise, it is also possible to re-arrange the cache to allow 64-bit accesses and obtain a higher performance as well. The CUDA compiler is able to unroll loops and combine loads from consecutive array elements into a single 64- or 128-bit operation. This is not directly compatible with the proposed algorithm as each thread only processes a part of all projections and does not access successive positions in the constant array. To allow element re-combination, the constants accessed by the same thread are grouped together as illustrated in \figurename~\ref{fig:ld128}. As the 64 elements belonging to each group span exactly 32 64-bit shared memory banks, a padding is added to each group to avoid bank conflicts. The constant cache, then, is allocated as $c_{a}^S[4][64+4]$ and accessed as $\shmem{c_a}[m_p][p>>2]$ instead of $\shmem{c_a}[p]$. The performance is improved by 5\% over the base implementation. The described modification slightly increases the register pressure and for this reason is not utilized on non-Kepler architectures.
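A sketch of the corresponding host-side setting and of the padded cache layout (array name assumed) is given below.
\begin{lstlisting}
// Host side: switch shared memory banks to 64-bit mode (Kepler only).
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

// Device side: axis corrections re-grouped by projection subset, with
// 4 floats of padding per group to avoid bank conflicts.
__shared__ float ca_cache[4][64 + 4];
// ... filled cooperatively from global memory ...
float axis = ca_cache[m_p][p >> 2];   // instead of ca_cache[p]
\end{lstlisting}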
 
\begin{figure}[htb]
\centering
\includegraphics[width=0.45\textwidth]{img/ld128.pdf}
\caption{\label{fig:ld128} The figure illustrates the layout of the projection constant cache optimized for the Kepler architecture. The simple layout (top) is split into 4 parts grouping together the constants accessed by the same threads (middle). A padding is added after each group of constants to prevent bank conflicts. It is expected that the CUDA optimizer may join several load operations together and load the constants for up to 4 projections simultaneously (bottom). No bank conflicts occur as all threads of the warp access different shared memory banks.}
\end{figure} 
 
 In the single-slice processing mode, a better performance can be achieved using a slightly different approach. The performance is penalized as a result of the additional processing step which is required to cache the constants in shared memory. While non-cached access is possible, it is significantly slower due to the low performance of the constant memory. Each GPU thread needs to load 3 constants for each processed projection. These constants are re-used if the thread computes multiple pixels. Fewer loads from the constant memory are required in this case, but a larger number of registers is needed to store the partial sums. The baseline algorithm processes 4 pixels in parallel, but the performance is additionally penalized due to 4-way constant memory re-plays caused by warps accessing the constants of different projections. This pushes the load on the constant memory beyond its limits and makes the caching necessary.
 
 Since the Kepler GPUs are equipped with a large register bank, the other option is to process more pixels per thread. The modification suggested for the Kepler architecture is based on a new thread-to-pixel mapping illustrated in \figurename~\ref{fig:mapping16}. 16 projections are processed in parallel, but all threads are mapped to a single area of 4-by-4 pixels as opposed to 4 such areas in the original modification, see \figurename~\ref{fig:mappings}. The arrangement of threads within the 4-by-4 areas is not changed, see \figurename~\ref{fig:newtex4}. As each group of 4 projections is still assigned to the same warp, the 4-way replay is performed by the GPU engine while reading the constant memory. The projection groups, however, are processed by different warps within the thread block and do not cause additional re-plays. So, the load on the constant memory is reduced 4-fold. Each thread, however, now computes 16 pixels and needs to store 16 partial sums.
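Using the naming of the earlier remapping sketch, the only difference in the index computation is where the group offset is applied.
\begin{lstlisting}
// 4-projection variant: groups of 64 threads cover 4 different 4x4 areas.
int mx = 2 * sx + px + 4 * block_n;
int mp = proj_n;

// 16-projection variant: all 256 threads share a single 4x4 area, while
// each group of 64 threads handles a different subset of 4 projections.
int mx16 = 2 * sx + px;
int mp16 = proj_n + 4 * block_n;
\end{lstlisting}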
 
\begin{figure}[htb]
\centering
\includegraphics[width=0.45\textwidth]{img/mapping16.pdf}
\caption{\label{fig:mapping16} The figure illustrates an alternative way to assign a block of GPU threads to an area of 16-by-16 pixels (left). A 4-by-4 pixel square is processed over 16 projections in parallel. It takes 16 iterations to process the complete area. As the 4 groups of projections are processed by different warps, only a 4-way re-play is needed to load the projection constants for each warp (right). }
\end{figure} 
    
 This causes a significant increase in the register consumption. If left to the CUDA optimizer, 47 registers are required on the GTX Titan and only 5 blocks can be executed. The performance is better if a minimum of 6 blocks is configured. However, this causes a significant spill of variables into the local memory. Since very little shared memory is used in the single-slice mode, a larger portion of the Kepler on-chip memory can be assigned to the L1 cache to reduce the performance penalty caused by the local memory utilization. This is done using the \emph{cudaFuncSetCacheConfig} command specifying the preference for the L1 cache with the \emph{cudaFuncCachePreferL1} argument. In this case, 48~KB are assigned to the L1 cache and only 16~KB to the shared memory. However, this is enough to support the 6 thread blocks permitted by the register consumption.
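For reference, the corresponding host-side call might look as follows (the kernel name is hypothetical):
\begin{lstlisting}
// Prefer a 48 KB L1 / 16 KB shared memory split for this kernel, leaving
// room for the local-memory spills caused by the 6-block bound.
cudaFuncSetCacheConfig(backproject_tex16, cudaFuncCachePreferL1);
\end{lstlisting}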

  The contributions of the 16 projection groups have to be summed together in the reduction part. Kepler introduced a new shuffle instruction allowing data exchange within a warp. It can be efficiently used to speed up the reduction. The Kepler modification of the cache-aware back-projection is shown in Algorithm~\ref{alg:newtex16}. To avoid conditionals in the algorithm, the sinogram may be padded to a multiple of 16 projections.
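A minimal sketch of the shuffle-based reduction over the 16 lanes holding the partial sums of the projection groups is shown below (the buffer name is assumed; on CUDA toolkits prior to 9.0 the non-\emph{\_sync} variant \emph{\_\_shfl\_xor} is used instead).
\begin{lstlisting}
// Butterfly reduction within groups of 16 lanes; each lane holds the
// partial sum contributed by one projection subset.
float sum = s_cache[threadIdx.y][threadIdx.x];
for (int i = 8; i >= 1; i >>= 1)
    sum += __shfl_xor_sync(0xffffffff, sum, i, 16);
\end{lstlisting}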

\begin{algorithm}[htb]
\DontPrintSemicolon
\caption{\label{alg:newtex16} Cache-aware implementation of the back-projection kernel optimized for Kepler architecture}
%\begin{algorithmic}
\KwIn {Texture and the projection constants $\cmem{c}_*$. Mappings $\vdata{m}_t'$ and $m_p$ are computed as explained in Algorithm~\ref{alg:newtex4remap} with the only difference that $4 \mul block_n$ is added to the projection index $m_p$ instead of coordinate $\vx{m_t'}$. The computed  values are, then, used to obtain $\vdata{m}_g'$ and $\vdata{m}$ as specified in \tablename~\ref{table:alg_vars}.}
\KwShMem {$\shmem{\vfloat{s}}[16][16]$, $\shmem{\vfloat{r}}[16][16]$}
\KwOut {Reconstructed slice $\vfloat{r}$}
\Begin {

$\vfloat{s}[4][4] \eq \{0\}$ \;
\tcc{Computing the partial sums}
\ForToBy{p}{m_p}{n_p}{16}{
    $\vdata{c}_{cs} \eq \cmem{\vdata{c}_{cs}}[p]$ \;
    $h' \eq \cmem{c_a}[p] + \vx{m} \mul \vx{\cmem{c_{cs}}[p]} - \vy{m} \mul \vy{\cmem{c_{cs}}[p]} + \fconst{0.5}$ \;
    
	\ForTo{b_x}{0}{4}{
	  \ForTo{b_y}{0}{4}{
	   $h \eq h' + 4 \mul b_x \mul \vx{c_{cs}} - 4 \mul b_y \mul \vy{c_{cs}}$ \;
       $\vfloat{s}[b_y][b_x] \aeq $ \KwTex{$h$, $p + \fconst{0.5}$} \;
      }
    }
}

\ForTo{b_x}{0}{4}{
  \ForTo{b_y}{0}{4}{
    \tcc{Moving partial sums to shared memory}
     $\shmem{\vfloat{s}}[4 \mul \vy{m_t'} + \vx{m_t'}][m_p] \eq \vfloat{s}[b_y][b_x]$ \;
     \KwSync \;
  
  \tcc{Performing reduction}
  $\vfloat{r} \eq \shmem{\vfloat{s}}[\vy{m_t}][\vx{m_t}]$ \;
  \CFor{i \eq 8}{i \geq 1}{i \deq 2}{
     $\vfloat{r} \aeq$ \KwShflXor{$\vfloat{r}$, $i$, $16$} \;   
%     $\shmem{\vfloat{s}}[\vy{m_t}][\vx{m_t}] \aeq \shmem{\vfloat{s}}[\vy{m_t''}][\vx{m_t''}+i]$ \;
  }
  
  \tcc{To coalesce global memory writes, the results are grouped in shared memory}
  \If{$\vx{m_t} \ifeq 0$} {
     $\shmem{\vfloat{r}}[4 \mul b_y + \vy{m_t} \idiv 4][4 \mul b_x + \vy{m_t} \mod 4] \eq \vfloat{r}$ \;
  }
  \KwSync \;
  }
}
$\gmem{\vfloat{r}}[\vy{m_g},\vx{m_g}] \eq \shmem{\vfloat{r}}[\vy{m_t}][\vx{m_t}]$  \;
}
\end{algorithm}

 There is one more trick to reduce the register requirements on the NVIDIA GTX Titan. The counter of the projection loop $p$ is declared as a floating point variable:
\begin{lstlisting}
 for (float p = $m_p$ + 0.5f; p < $n_p$; p += 16.f) { ... }
\end{lstlisting}
The $n_p$ in the kernel arguments is also declared as a floating point number. Then, only 5 variables instead of 8 are spilled into the local memory and the performance is considerably improved. This is only applicable to the NVIDIA GTX Titan GPU. The other Kepler-based card, the NVIDIA GTX 680, actually performs worse with this modification. The cards differ in the revision of their compute capability and the code generated for the GTX 680 actually uses more registers if a floating-point index is used. The effect is purely due to the NVIDIA optimizer. The \emph{ptx} code for both cards is exactly the same, but the \emph{SASS} assemblies extracted from the generated \emph{cubins} differ significantly. All described modifications are compared in \tablename~\ref{tbl:newtex_kepler}.


\begin{table}[t]
\begin{threeparttable}
\caption{\label{tbl:newtex_kepler} List of possible modifications to improve performance on Kepler architecture}
\centering
\begin{tabular} { | l | l | c | c | c | c | c | }
%\multicolumn{7}{l}{\textbf{NVIDIA GeForce GTX 580 (Fermi, sm21)}} \\
\hline
\textbf{GPU} & \textbf{Algorithm} 	& \textbf{Regs} & \textbf{LM} & \textbf{L1} & \textbf{Occ.} & \textbf{Perf.} \\
\hline
\multirow{5}{*}{680} 
& Standard    & 32 &  0 & 16~KB & 100\% & 109~GU/s \\
& Same center & 32 &  0 & 16~KB & 100\% & 120~GU/s \\
& Coalescing  & 32 &  0 & 16~KB & 100\% & 114~GU/s \\ 
& 16 proj.    & 40 &  8 & 48~KB &  75\% & 116~GU/s \\ 
& Float index & 40 & 12 & 48~KB &  75\% & 113~GU/s \\
\hline
\multirow{5}{*}{Titan} 
& Standard    & 32 &  0 & 16~KB & 100\% & 165~GU/s \\
& Same center & 32 &  4 & 16~KB & 100\% & 186~GU/s \\
& Coalescing  & 32 &  0 & 16~KB & 100\% & 177~GU/s \\ 
& 16 proj.    & 40 & 32 & 48~KB &  75\% & 186~GU/s \\ 
& Float index & 40 & 20 & 48~KB &  75\% & 205~GU/s \\
\hline
\end{tabular}
\begin{tablenotes}
\item \tabledesc{The table compares several methods to reduce the memory load and improve the reconstruction performance on the NVIDIA Kepler architecture. The resource requirements and performance of the standard approach described in section~\ref{section:newtex} are compared against: \colid{a} a method assuming that no correction of the rotational axis position is needed, \colid{b} a method re-grouping the projections to allow 64-bit coalesced loads, \colid{c-d} methods processing 16 projections in parallel using an integer and a floating-point iterator in the projection loop, respectively. The table reports the number of used 32-bit registers \colname{Regs}, the size of the required local memory in bytes \colname{LM}, the configured size of the L1 cache per SM \colname{L1}, the theoretical occupancy \colname{Occ}, and the achieved performance in giga-updates per second \colname{Perf}.}
\end{tablenotes}
\end{threeparttable}
\end{table}

 Due to the high register requirements, the described optimization does not bring performance gains in the 2-slice reconstruction mode. A better performance is achieved if the 4-projection mode is used and the layout of the axis correction cache is optimized for 64-bit access as explained above. The sine and cosine constants are stored in float2 vectors and, therefore, are not remapped. It is optimal to run the kernel at full occupancy. The number of generated registers is restricted with \emph{\_\_launch\_bounds\_\_}. 3 - 4 variables are, then, spilled into the local memory. To ensure that they fit in the L1 cache at full occupancy, it is necessary to allocate 32~KB of memory for the L1 cache, leaving the remaining 32~KB for the shared memory. If the coalescing cache is disabled and the same buffer is used to cache the constants and to exchange the partial sums, less than 4~KB of shared memory is required per block and full occupancy is achieved. The constant memory is not a bottleneck for the 4-slice reconstruction mode as the performance is severely limited by the half-to-float type casting in this case. The standard approach to cache the constants can be used without a performance penalty.

 
%Reference something 
 
\subsubsection{Summary}
We have introduced a new cache-aware algorithm which is able to reconstruct up to 4 slices in parallel. Several modifications are proposed to better suit specific GPU architectures. The optimal configuration and the corresponding performance are summarized in \tablename~\ref{tbl:newtex}. The achieved efficiency is further analyzed in \figurename~\ref{fig:texeff}. For the single-slice reconstruction mode, we were able to achieve over 90\% efficiency across all considered platforms. On some GPUs a significant speed-up of up to 90\% over the original algorithm is achieved already in the single-slice processing mode. Using the multi-slice reconstruction and the half-float data representation it is possible to almost quadruple the performance on Fermi and the latest NVIDIA architectures. An efficiency of 80-90\% is achieved if the texture-aware reconstruction kernel is used. The efficiency of the Kepler GPUs is limited due to the comparatively slow on-chip memory and the low performance of the SFU units limiting the speed of type mangling. Nevertheless, due to the proposed optimizations a 2 - 3 times speed-up over the standard single-slice algorithm is achieved.

\begin{table}[htb] %[htbp]
\begin{threeparttable}
\caption{\label{tbl:newtex} Performance and configuration of cache-aware texture-based kernel}
\centering
%\resizebox{\columnwidth}{!}{\begin{tabular}{} ... \end{\tabular}}
\begin{tabular}{ | l  c | r | r r r r c | }
\hline
& & & \multicolumn{5}{c|}{Configuration}  \\
\mhd{|c}{GPU} & \mhd{c|}{SaO} & \mhd{c|}{Perf.} & \mhd{c}{PaO} & \mhd{c}{B} & \mhd{c}{L1/SM} & \mhd{c}{C} & \mhd{c|}{Notes} \\
%GPU & SaO & Perf & PaO & B & L1/SM & Cache & Notes \\

\hline
\multirow{3}{*}{GTX580} 
& 1 &  49 GU/s &  4 & 3 & 16/48 & 256 & - \\
& 2 &  97 GU/s &  4 & 3 & 16/48 & 256 & - \\
& 4 & 173 GU/s &  4 & 3 & 16/48 & 256 & - \\
\hline

\multirow{3}{*}{GTX680} 
& 1 & 115 GU/s & 16 & 6 & 48/16 &   0 & - \\
& 2 & 209 GU/s &  4 & 8 & 32/32 & 256\tnote{1} & *\tnote{2} \\ 
& 4 & 220 GU/s &  4 & 6 & 16/48 & 256 & - \\ 
\hline

\multirow{3}{*}{Titan} 
& 1 & 205 GU/s & 16 & 6 & 48/16 &   0 & *\tnote{3}  \\
& 2 & 292 GU/s &  4 & 8 & 32/32 & 256\tnote{1} & *\tnote{2} \\
& 4 & 339 GU/s &  4 & 6 & 16/48 & 256 & - \\
\hline

\multirow{3}{*}{GTX980} 
& 1 & 156 GU/s &  4 & 8 & -     & 256 & - \\
& 2 & 303 GU/s &  4 & 8 & -     & 256 & - \\
& 4 & 560 GU/s &  4 & 6 & -     &   0 & - \\
\hline

\multirow{3}{*}{Titan X} 
& 1 & 389 GU/s &  4 & 8 & -     & 256 & - \\
& 2 & 739 GU/s &  4 & 8 & -     & 256 & - \\
& 4 &1422 GU/s &  4 & 6 & -     &   0 & - \\
\hline

HD5970 & 1 &  56 GU/s & 4 & - & - & 256 & - \\
\hline
HD7970 & 1 & 115 GU/s & 4 & - & - & 256 & - \\
\hline
R9-290 & 1 & 148 GU/s & 4 & - & - & 256 & - \\
\hline
\end{tabular}
\begin{tablenotes}
\item The table summarizes the performance and optimal configuration for the texture-based kernel across considered platforms. Information is provided for all supported slice-modes (SaO) specifying how many slices are reconstructed at once. The configuration lists: 
\coldesc{a}{PaO}{if 4- or 16-projection modification is used}
\coldesc{b}{B}{the preferred number of blocks per SM}
\coldesc{c}{L1/SM}{how the on-chip SM memory is split between L1 cache and shared memory}
\coldesc{d}{C}{the size of cache used to store projection constants}
\coldesc{e}{Notes}{references the required special settings described below}.
The nearest-neighbor interpolation and the half float data representation are used to measure performance if 4 slices are reconstructed in parallel. Otherwise, the performance is measured using bi-linear interpolation mode and the sinogram is stored in the single-precision floating point format.
\item[1] The projection constants are grouped to allow 64-bit loads
\item[2] No shared memory buffer is used to coalesce global memory writes 
\item[3] Projection loop uses floating point iterator
\end{tablenotes}
\end{threeparttable}
\end{table}



\begin{figure}[htb]
\centering
\includegraphics[width=0.45\textwidth]{img/texeff.pdf}
\caption{\label{fig:texeff} The figure evaluates the efficiency of the optimizations proposed for the texture-based back-projection kernel. The measured throughput is compared to the maximum filter rate of the texture engine and the performance is reported as the percentage of achieved utilization. The results are also reported for processing multiple slices in parallel if the texture engine supports the corresponding vector data type. The nearest-neighbor interpolation and the half-float data representation are used to measure the performance if 4 slices are reconstructed. Otherwise, the performance is measured using the bi-linear interpolation mode and the sinogram is stored in the single-precision floating point format. The blue bars show the performance of the standard \algorithmname~\ref{alg:stdalg} modified to process multiple slices in parallel as explained in \sectionname~\ref{section:multislice} and to store data in the half-float format as explained in \sectionname~\ref{section:half}. The green bars show the improvements achieved using the cache-aware \algorithmname~\ref{alg:newtex4} without any architecture-specific optimizations. The last group of bars shows further improvements due to the tuning of occupancy as explained in \sectionname~\ref{section:newtex_occupancy} and the Kepler-specific modifications discussed in \sectionname~\ref{section:newtex_kepler} and presented in \algorithmname~\ref{alg:newtex16}. The utilization is reported relative to the supported filter rate, not the bandwidth. While a lower utilization is achieved for the multi-slice reconstruction modes, the actual performance is higher as 2/4 slices are processed using a single texture access.}
\end{figure}