/articles/toma : revision 34

10

11

As a 4 threads is now required for each output pixel, it is necessary to either quadruple the size of computational grid or to increase the number of pixels processed by each thread. To process a projection, the GPU threads need to locate the bins contributing to the values of reconstructed pixels. This is done based on the pixels coordinates and several projection constants which are stored in the GPU memory. If the thread reconstructs several pixels, the constants can be loaded only once and, then, re-used to locate all required bins for the considered pixels. Therefore, to reduce load on GPU memory, it is better to increase the work-load of the GPU threads instead of enlarging the grid dimensions. This means each GPU thread computes contribution for 4 output pixels from quarter of available projections. The following mapping is adopted. The thread blocks assignments are kept exactly the same as in the standard version. Each thread block is responsible for the output area of 16-by-16 pixels. However, this area is further subdivided in 4-by-4 pixel squares. As we have 256 threads in the block and we need 64 threads per square, 4 such squares are processed in parallel. And a complete set of 16 squares requires 4 iterations. There are several ways to arrange the mapping between GPU threads and the pixel squares, see \figurename~\ref{fig:mappings}. The first mapping is sparse and results in a reduced cache hit rate as compared to the other options. And the third option requires less registers as only a single pixel coordinate is incremented for each thread. Since the usage of additional registers may result in the reduced occupancy or the spillage of registers into the local memory, the 3rd approach is preferred.

12

13

\begin{figure}[h]

13

\begin{figure}[htb]

14

\centering

15

\includegraphics[width=0.45\textwidth]{img/mappings.pdf}

16

\caption{\label{fig:mappings} The figure illustrates several ways to assign a block of GPU threads to an area of 16-by-16 pixels. Since 4 projections are processed at once, only 64 threads are available for entire area and it take 4 iterations to process it completely. For each possible scheme in blue are shown all pixels which are processed during the first iteration in parallel. The first mapping (left) is sparse and results in increased cache misses. The second mapping (center) requires more registers and may cause reduced occupancy. So, the third mapping (right) is preferred. }

18

19

The next question is how to actually map 64 threads to a compound 3D space consisting of a 4-by-4 pixel area within 4 selected projections. As stated in the section~\ref{section:texture}, the cache utilization is optimal if all simultaneous accesses by a group of 16 consecutive threads lay within 4-by-4 texture square. As there are only 4 projections, the spread of projection bins accessed by each group of 16 threads has to be minimized. The bin is computed as $x\cdot\cos(p\alpha) - y\cdot\sin(p\alpha)$ and, consequently, the spread is minimal if $(x,y)$ belong to the smallest possible square area. Therefore, the reconstructed area is further split in 2-by-2 pixel squares. Each group of 16 threads is used to reconstruct one such square for all 4 projections. The \figurename~\ref{fig:newtex4} shows the described mapping. First 4 threads are assigned to the first projection and are responsible for 2x2 pixel square. The next 4 threads are processing the same pixel square of the second projection, and so on. The next group of 16 threads is restarting from the first projection in the same way, but processes another square of 2x2 pixels. The pseudo-code to convert thread indexes to the re-mapped pixel offsets is given in Algorithm~\ref{alg:newtex4remap}.

20

21

\begin{figure}[h]

21

\begin{figure}[htb]

22

\centering

23

\includegraphics[width=0.45\textwidth]{img/newtex4.pdf}

24

\caption{\label{fig:newtex4} The figure illustrates how the 64 threads are mapped 4-by-4 pixel area along 4 projections. The pixels in the 4-by-4 area are numbered sequentially from 1 to 16 as show in the bottom part of figure. The mapping is shown in the top part. For each projection and output pixel the corresponding thread number is shown. }

75

Enable 8-byte mode

76

77

78

\begin{algorithm}[h]

78

\begin{algorithm}[htb]

79

\DontPrintSemicolon

80

\caption{\label{alg:newtex4} Cache-aware implementation of the back-projection kernel}

81

%\begin{algorithmic}

205

206

The 64-bit access is enabled in CUDA using \emph{cudaDeviceSetSharedMemConfig} command with \emph{cudaSharedMemBankSizeEightByte} argument. Then, the sine and cosine of a projection angles stored in float2 vectors are loaded using a single 64-bit operation. However, only the half of the available bandwidth can be used while loading the axis corrections. If all projections have the same center and such corrections are not needed, the $\shmem{c_a}$ cache can be dropped and about 10\% speed-up is secured on NVIDIA GTX 680 and Titan GPUs. Otherwise, it is also possible to re-arrange the cache to allow 64-bit accesses and get a higher performance as well. The CUDA compiler is able to unroll loops and combine loads from consecutive array elements into a single 64- or 128-bit operation. This is not compatible with the proposed algorithm as each thread only processes a part of all projections and does not access successive positions in the constant array. To allow element re-combination, the constants accessed by the same thread are grouped together as illustrated on \figurename~\ref{fig:ld128}. As 64 elements belonging to each group are spanning exactly over 32 64-bit shared memory banks, the padding is added to each group to avoid bank conflicts. The constant cache, then, is allocated as $c_{a}^S[4][64+4]$ and accessed as $\shmem{c_a}[m_p][p>>2]$ instead of $\shmem{c_a}[p]$. The performance is improved by 5\% over the base implementation. The described modification slightly increases register pressure and for this reason are not utilized on non-Kepler architectures.

207

208

\begin{figure}[h]

208

\begin{figure}[htb]

209

\centering

210

\includegraphics[width=0.45\textwidth]{img/ld128.pdf}

211

\caption{\label{fig:ld128} The figure illustrates a layout of the projection constant cache optimized for Kepler architecture. The simple layout (top) is split in 4 parts grouping together the constants accessed by the same threads (middle). The padding is added after each group of constant to prevent bank conflicts. It is expected that the CUDA optimizer may join several load operations together and load the constants for up to 4-projections simultaneously (bottom). No bank conflicts happen as all threads of the warp are accessing different shared memory banks.}

215

216

Since the Kepler GPUs are equipped with large register bank, the other option is to process more pixels per thread. A modification suggested for Kepler architecture is based on a new thread-to-pixel mapping illustrated on \figurename~\ref{fig:mapping16}. 16 projections are processed in parallel. But all threads are mapped to a single area of 4-by-4 pixels as opposed to 4 such areas in the original modification, see \figurename~\ref{fig:mappings}. The arrangements of threads within 4-by-4 areas are not changed, see \figurename~\ref{fig:newtex4}. As each group of 4 projections is still assigned to the same warp, the 4-way replay is performed by GPU engine while reading constant memory. The projection groups, however, are processed by different warps within the thread black and do not cause additional re-plays. So, the load on the constant memory is reduced 4-fold. Each thread, however, now computes 16 pixels and need to store 16 partial sums.

217

218

\begin{figure}[h]

218

\begin{figure}[htb]

219

\centering

220

\includegraphics[width=0.45\textwidth]{img/mapping16.pdf}

221

\caption{\label{fig:mapping16} The figure illustrates an alternative way to assign a block of GPU threads to an area of 16-by-16 pixels (left). An 4-by-4 pixel square is processed over 16 projections in parallel. It takes 16 iterations to process the complete area. As 4 groups of projections are processed by different warps, only a 4-way re-play is needed to load projection constants for each warp (right). }

225

226

The impact from 16 projection groups should be summed together in the reduction part. Kepler introduced a new shuffle instruction allowing to exchange data within the warp. It can be efficiently used to speed-up reduction. The Kepler modification of cache-aware back-projection is shown in Algorithm~\ref{alg:newtex16}. To avoid conditionals in algorithm, the sinogram may be padded to a multiple of 16 projections.

227

228

\begin{algorithm}[h]

228

\begin{algorithm}[htb]

229

\DontPrintSemicolon

230

\caption{\label{alg:newtex16} Cache-aware implementation of the back-projection kernel optimized for Kepler architecture}

231

%\begin{algorithmic}

317

\subsubsection{Summary}

318

We have introduced a new cache-aware algorithm which is able to reconstruct up 4 slices in parallel. Several modifications is proposed to better suite specific GPU architectures. The optimal configuration and the corresponding performance are summarized in \tablename~\ref{tbl:newtex}. The achieved efficiency is further analyzed on \figurename~\ref{fig:texeff}. For a single slice reconstruction mode, we were able to achieve over 90\% efficiency across all considered platforms. On some GPUs a significant speed of up to 90\% is achieved also in single-slice processing mode as compared to the original algorithm. Using the multi-slice reconstruction and half-float data representation it is possible almost to quadruple performance on Fermi and the latest NVIDIA architectures. 80-90\% efficiency is achieved if texture-aware reconstruction kernel is used. The efficiency of Kepler GPUs is limited due to comparatively slow on-chip memory and low performance of SFU units limiting the speed of type mangling. Nevertheless, due to the proposed optimizations a 2 - 3 times speed-up over standard single-slice algorithm is achieved.

319

320

\begin{table}[h] %[htbp]

320

\begin{table}[htb] %[htbp]

321

\begin{threeparttable}

322

\caption{\label{tbl:newtex} Performance and configuration of cache-aware texture-based kernel}

323

\centering

383

384

385

386

\begin{figure}[h]

386

\begin{figure}[htb]

387

\centering

388

\includegraphics[width=0.45\textwidth]{img/texeff.pdf}

389

\caption{\label{fig:texeff} The figure evaluates efficiency of optimizations proposed for texture-based back-projection kernel. The measured throughput is compared to the maximum filter rate of a texture engine and the performance is reported as a percent of achieved utilization. The results are reported also for processing multiple slices in parallel if the texture engine supports the corresponding vector data type. The nearest-neighbor interpolation and the half float data representation are used to measure performance if 4 slices are reconstructed. Otherwise, the performance is measured using bi-linear interpolation mode and the sinogram is stored in the single-precision floating point format. The blue bars show performance of the standard \algorithmname~\ref{alg:stdalg} modified to process multiple slices in parallel as explained in \sectionname~\ref{section:multislice} and to store data in the half-float format as explained in \sectionname~\ref{section:half}. The green bars show improvements achieved using cache-aware \algorithmname~\ref{alg:newtex4} without any architecture-specific optimizations discussed later on. The green bars show further improvements due to tuning of occupancy as explained in \sectionname~\ref{section:newtex_occupancy} and using Kepler-specific modifications as discussed in \sectionname~\ref{section:newtex_kepler} and presented in \algorithmname~\ref{alg:newtex16}. The utilization is reported according to the supported filter rate, not the bandwidth. While the lower utilization is achieved for multi-slices reconstruction modes, the actual performance is higher as 2-/4-slices are processed using a single texture access.}