/articles/toma : revision 34

1

2

\subsection{Back-projection based on Texture engine}\label{section:texrec}

3

The standard implementation described in previous section performing fairly good. The compilers included in the CUDA Framework and AMD APPSDK are doing a good job of optimizing the execution flow. The loops are automatically unrolled and the operations are re-arranged to allow streaming texture loads as explained in the section~\ref{section:texture}. Still, the default implementation does not utilize all capabilities of texture engine and significant improvement can be achieved on all architectures.

3

The standard implementation described in previous section performing fairly good. The compilers included in the CUDA Framework and AMD APPSDK are doing a good job of optimizing the execution flow. The loops are automatically unrolled and the operations are re-arranged to allow streaming texture loads as explained in the section~\ref{section:memory_performance}. Still, the default implementation does not utilize all capabilities of texture engine and significant improvement can be achieved on all architectures.

4

5

\subsubsection{Standard version}\label{section:standard}

6

First, we will detail how the standard implementation works. Each GPU thread is responsible for a single pixel of output slice and iterates over all projections to sum the contribution from each one. At each iteration, a projection is performed to find a coordinate where the ray passing through the reconstructed image pixel hits the detector. The value at the corresponding position in the sinogram row is fetched using the texture engine and summed up with the contributions from other projections. The texture engine is configured to perform either nearest-neighbor or linear interpolation as desired. The projection is computed according to \equationname~\ref{eq:bin}. To align the coordinate system with rotational axis, the position of the rotational axis is first subtracted from the pixel coordinates and, then, added to the computed detector coordinate to find the required position in the sinogram. To compensate for possible distortions of imaging system during the experiment, the rotational center is not constant, but may include per-projection corrections. Sine and cosine of each projection angle as well as the corrected position of the rotation axis are read from a buffer in the constant memory which is generated during the initialization phase. The computation grid is split in square blocks of 16-by-16 threads. It results in optimal occupancy across all considered platforms. The corresponding pseudo-code is presented in Algorithm~\ref{alg:stdalg}. The variables used here and across other code snippets in the article are described in \tablename~\ref{table:alg_vars}.

17

18

In parallel tomography, exactly the same operations are performed for all the reconstructed slices. Therefore, it is possible to reconstruct multiple slices in parallel if the back projection operator is applied to a compound sinogram which encodes bins from the multiple simple sinograms as vector data. Particularly, it is possible to construct such sinogram using float2 vector type and interleaving values from one sinogram as $x$ components and from another as $y$, see \figurename~\ref{fig:interleave}. With \emph{float2}-typed texture mapped on this interleaved sinogram, it is possible to fully utilize the bandwidth of the texture engine and reconstruct two slices in parallel. The interleaving is done as an additional data preparation step after filtering, but before actually executing the back projection kernel. The back projection kernel, then, must be simply adjusted to use the float2 type and write the $x$ component of the result into the first output slice and the $y$ component into the second. There is a considerable speed-up on all NVIDIA architectures as can be seen on the \figurename~\ref{fig:texeff}.

19

20

\begin{figure}[h]

20

\begin{figure}[htb]

21

\centering

22

\includegraphics[width=0.45\textwidth]{img/interleave.pdf}

23

\caption{\label{fig:interleave} Illustration how two sinograms are interleaved to allow utilization of full 8-byte filtering bandwidth on post Fermi NVIDIA GPUs. }

33

34

The penalty to the quality of reconstruction inflicted by reduced precision is evaluated on \figurename~\ref{fig:quality_nn}. For synthetic phantom it is negligible. However, the behaviour for the real-world measurements may be different, especially if projections are obtained using a camera with high dynamic range. As the optimization proposed in this subsection changes the reconstruction results, it is important to verify that the achieved quality is still satisfactory.

35

36

\begin{figure}[h]

36

\begin{figure}[htb]

37

\centering

38

\includegraphics[width=0.45\textwidth]{img/quality_nn.pdf}

39

\caption{\label{fig:quality_nn} Comparison of the reconstructions performed using single-precision and half-precision input. The profile plot along the selected line is shown in the top part of the figure for the phantom and both reconstructions. There is no visible difference between the methods. The absolute difference between reconstructions along the same line is shown in the bottom part. }

54

55

56

\subsubsection{Optimizing locality of texture fetches}\label{section:remap}

57

The algorithm maps each GPU thread to a single pixel of output slice and processes the projections sequentially in a loop. The default mapping is linear. The thread with coordinates $(x,y)$ in computational grid is used to reconstruct the pixel with coordinates $(x,y)$ in the slice. All threads of an arbitrary warp, then, are used to reconstruct consecutive pixels across the $x$-axis. Consequently, the accesses to the projection array by the threads of a warp are distributed quite widely. Up to 32 different locations may be accessed. As it was explained in the \sectionname~\ref{section:texture}, the locality of fetches within a block, a warp, and also within a group of 4 consecutive threads is important to keep textures engine running at full speed. Texture engine fetches data from the cache at 32-byte granularity and the optimal performance is achieved only if a number of fetches required per a group of 4 threads is bellow a certain threshold.

57

The algorithm maps each GPU thread to a single pixel of output slice and processes the projections sequentially in a loop. The default mapping is linear. The thread with coordinates $(x,y)$ in computational grid is used to reconstruct the pixel with coordinates $(x,y)$ in the slice. All threads of an arbitrary warp, then, are used to reconstruct consecutive pixels across the $x$-axis. Consequently, the accesses to the projection array by the threads of a warp are distributed quite widely. Up to 32 different locations may be accessed. As it was explained in the \sectionname~\ref{section:texture_engine}, the locality of fetches within a block, a warp, and also within a group of 4 consecutive threads is important to keep textures engine running at full speed. Texture engine fetches data from the cache at 32-byte granularity and the optimal performance is achieved only if a number of fetches required per a group of 4 threads is bellow a certain threshold.

58

59

The number of bins ($b$) of projection ($p$) required to reconstruct a rectangular block of pixels ($S$) with dimensions $n$ by $m$ is defined as

60

\begin{equation*}

121

122

There are 256 threads in the block and 64 threads are required per 4-by-4 pixel square. Therefore, 4 such squares are processed in parallel and a complete set of 16 squares requires 4 iterations. \figurename~\ref{fig:mappings} shows several possible sequences to serialize processing. The first mapping is sparse and results in a reduced cache hit rate as compared to the other options. The third option requires less registers as only a single pixel coordinate is incremented for each thread. And the second mapping has a better access locality within the 64-thread warps of the AMD platform. In practice, however, it does not affect performance. On other hand, in multi-slice reconstruction mode the register usage is very high and the extra registers cause reduced occupancy or the spillage of registers into the local memory. Therefore, the 3rd approach is preferred.

123

124

\begin{figure}[h]

124

\begin{figure}[htb]

125

\centering

126

\includegraphics[width=0.45\textwidth]{img/mappings.pdf}

127

\caption{\label{fig:mappings} The figure illustrates several ways to assign a block of GPU threads to an area of 16-by-16 pixels. Since 4 projections are processed at once, only 64 threads are available for entire area and it take 4 iterations to process it completely. For each possible scheme in gray are shown all pixels which are processed during the first iteration in parallel. The first mapping (left) is sparse and results in increased cache misses. The second mapping (center) requires more registers and may cause reduced occupancy. So, the third mapping (right) is preferred. }

129

130

Since the request to multiple locations in the constant memory causes warp serialization on NVIDIA platform, the threads of a warp are always used to process the same projection. The following mapping scheme is adopted. The lowest 4 bits of the thread number in a block define the mapping within each of 4-by-4 pixel squares. The group of 16 threads follows Z-curve as explained in the \sectionname~\ref{section:remap}. Next 2 bits define a square and the top 2 bits define the processed projection. The \figurename~\ref{fig:newtex4_pm} illustrates the proposed mapping and \algorithmname~\ref{alg:newtex4remap} provides the corresponding pseudo-code.

131

132

\begin{figure}[h]

132

\begin{figure}[htb]

133

\centering

134

\includegraphics[width=0.45\textwidth]{img/newtex_pm.pdf}

135

\caption{\label{fig:newtex4_pm} The figure shows the mapping of 256 thread block used to reconstruct square of 16-by-16 pixels along 4 projections. 4 iterations are required to process all pixels. A group of 64 consecutive threads is responsible to process a rectangular area of 16 by 4 pixels (middle). 4 projections are processed in parallel using 4 such groups (right). Each 4-by-4 pixel square is reconstructed by 16 threads arranged along 2\textsuperscript{nd} order Z-curve (left). For each output pixel or block of pixels, the assigned range of threads is shown on the figure. }

185

\subsubsection{Enabling 64-bit access for NVIDIA Kepler GPUs}\label{section:newtex_ld64}

186

Both shared and constant memories are comparatively slow on Kepler with respect to the performance of the texture engine, see~\ref{tbl:gpuratios}. Furthermore, 64-bit access is required to fully utilize the available bandwidth of the shared memory. This is provided in multi-slice mode. However, 64-bit operation should be also enabled in CUDA using \emph{cudaDeviceSetSharedMemConfig} command with \emph{cudaSharedMemBankSizeEightByte} argument.

187

188

According to our measurements in~\ref{section:cmem}, for constant memory NVIDIA maintains two caches. The larger 8~KB cache layer is shared by a cluster of several SM and a small 2~KB L1 cache is residing directly within SM. On Kepler, for optimal performance the L1 cache requires 64-bit access as well and the larger 8~KB is only capable to deliver half of the bandwidth. Some performance improvement is achieved by grouping all projection constants together and storing as a single \emph{float4} vector. Even if only 3 components of the vector are actually required, the performance is considerably better on Kepler GPUs.

188

According to our measurements in~\ref{section:memory_performance}, for constant memory NVIDIA maintains two caches. The larger 8~KB cache layer is shared by a cluster of several SM and a small 2~KB L1 cache is residing directly within SM. On Kepler, for optimal performance the L1 cache requires 64-bit access as well and the larger 8~KB is only capable to deliver half of the bandwidth. Some performance improvement is achieved by grouping all projection constants together and storing as a single \emph{float4} vector. Even if only 3 components of the vector are actually required, the performance is considerably better on Kepler GPUs.

189

190

\subsubsection{Summary}

191

We have introduced a new cache-aware algorithm which is able to reconstruct up 4 slices in parallel. Several modifications is proposed to better suite specific GPU architectures. The optimal configuration and the corresponding performance are summarized in \tablename~\ref{tbl:newtex}. The achieved efficiency is further analyzed on \figurename~\ref{fig:texeff}. For a single slice reconstruction mode, we are able to achieve over 90\% efficiency across all considered platforms. On some GPUs a significant speed of up to 90\% is achieved also in single-slice processing mode as compared to the original algorithm. Using the multi-slice reconstruction and half-float data representation it is possible to quadruple performance on Fermi and the latest NVIDIA architectures. 80-90\% efficiency is achieved if optimized reconstruction kernel is used. The efficiency of Kepler GPUs is limited due to comparatively slow on-chip memory and low performance of SFU units limiting the speed of type mangling. Nevertheless, due to the proposed optimizations a 2 - 3 times speed-up over the standard single-slice algorithm is achieved.

193

\input{table_51_texrec.tex}

194

195

196

\begin{figure}[h]

196

\begin{figure}[htb]

197

\centering

198

\includegraphics[width=0.45\textwidth]{img/texeff.pdf}

199

\caption{\label{fig:texeff} The figure evaluates efficiency of optimizations proposed for texture-based back-projection kernel. The measured throughput is compared to the maximum filter rate of a texture engine and the performance is reported as a percent of achieved utilization. The results are reported also for processing multiple slices in parallel if the texture engine supports the corresponding vector data type. The nearest-neighbour interpolation and the half float data representation are used to measure performance if 4 slices are reconstructed. Otherwise, the performance is measured using bi-linear interpolation mode and the sinogram is stored in the single-precision floating point format. The blue bars show performance of the standard \algorithmname~\ref{alg:stdalg} modified to process multiple slices in parallel as explained in \sectionname~\ref{section:multislice} and to store data in the half-float format as explained in \sectionname~\ref{section:half}. The green bars show improvements due to a better fetch locality achieved with remapping threads as explained in \sectionname~\ref{section:remap}. The red bars show the maximum performance using \algorithmname~\ref{alg:newtex4} with optimal combination of tweaks explained through \sectionname~\ref{section:texrec}. \tablename~\ref{tbl:newtex} summarizes the architecture-specific parameters used at each GPU. The utilization is reported according to the supported filter rate, not the bandwidth. While the lower utilization is achieved for multi-slices reconstruction modes, the actual performance is higher as 2-/4-slices are processed using a single texture access.}