\section{Back-projection based on Texture engine}\label{section:texrec}
 The standard implementation described in the previous section performs fairly well. The compilers included in the CUDA framework and the AMD APP SDK optimize the execution flow automatically: the loops are unrolled and the operations are re-arranged to allow streaming texture loads, as explained in \sectionname~\ref{section:code_flow}. Still, the default implementation does not utilize all capabilities of the texture engine, and a significant improvement can be achieved on all architectures.
 
\subsection{Standard version}\label{section:standard}
 First, we detail how the standard implementation works. Each GPU thread is responsible for a single pixel of the output slice and loops over all projections to sum the contributions from each of them. At each iteration, a projection is performed to find the coordinate where the ray passing through the reconstructed image pixel hits the detector. The value at the corresponding position in the sinogram row is fetched using the texture engine and added to the contributions from the other projections. The texture engine is configured to perform either nearest-neighbor or linear interpolation, as desired. The projection is computed according to \equationname~\ref{eq:bin}. To align the coordinate system with the rotational axis, the position of the rotational axis is first subtracted from the pixel coordinates and then added to the computed detector coordinate to find the required position in the sinogram. To compensate for possible distortions of the imaging system during the experiment, the rotational center is not constant but may include per-projection corrections. The sine and cosine of each projection angle as well as the corrected position of the rotational axis are read from a buffer in constant memory which is generated during the initialization phase. The computation grid is split into square blocks of 16-by-16 threads, which results in optimal occupancy across all considered platforms. The corresponding pseudo-code is presented in \algorithmname~\ref{alg:stdalg}.

\input{alg_51_std}
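
 To make the structure concrete, a minimal CUDA sketch of such a kernel is shown below. The constant-buffer layout, the variable names, and the sign convention of the projection are illustrative only; the authoritative formulation is \algorithmname~\ref{alg:stdalg} together with \equationname~\ref{eq:bin}.

\begin{lstlisting}
// Illustrative sketch of the standard per-pixel back-projection loop.
// c_sin/c_cos/c_axis and MAX_PROJ are hypothetical names; the projection
// sign convention follows eq. (bin) in the text.
#define MAX_PROJ 2048
__constant__ float c_sin[MAX_PROJ], c_cos[MAX_PROJ], c_axis[MAX_PROJ];

__global__ void backproject_std(cudaTextureObject_t sino, float *slice,
                                int width, int n_proj, float center)
{
    // the grid is assumed to exactly cover the output slice
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    float sum = 0.f;
    for (int p = 0; p < n_proj; p++) {
        // detector coordinate of the ray through pixel (x, y)
        float h = (x - center) * c_cos[p] - (y - center) * c_sin[p] + c_axis[p];
        sum += tex2D<float>(sino, h + 0.5f, p + 0.5f);  // interpolated fetch
    }
    slice[y * width + x] = sum;
}
\end{lstlisting}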


 The CUDA platform supports two slightly different approaches to manage textures: the texture reference API and the texture object API~\cite{nvidia2017cudapg}. The texture reference API is universal and is supported by all devices, while the texture object API is only available since the Kepler architecture. Although the reference API can be used on all devices, we found that the object API outperforms it on devices with compute capability 3.5 and later. Therefore, we use the reference API for GT200, Fermi, and the first generation of Kepler devices, and the object API for all newer architectures.
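
 As an illustration, the creation of a texture object with bi-linear filtering using the CUDA runtime API can be sketched as follows; the array and variable names are illustrative.

\begin{lstlisting}
// Sketch: bind a sinogram stored in a cudaArray to a texture object
// with bi-linear filtering (sino_array is a hypothetical cudaArray_t).
cudaResourceDesc res = {};
res.resType = cudaResourceTypeArray;
res.res.array.array = sino_array;

cudaTextureDesc td = {};
td.addressMode[0]   = cudaAddressModeClamp;
td.addressMode[1]   = cudaAddressModeClamp;
td.filterMode       = cudaFilterModeLinear;   // or cudaFilterModePoint
td.readMode         = cudaReadModeElementType;
td.normalizedCoords = 0;

cudaTextureObject_t sino_tex = 0;
cudaCreateTextureObject(&sino_tex, &res, &td, NULL);
\end{lstlisting}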
%ToDo: shall I bring a sample number?
 
 
\subsection{Multi-slice reconstruction}\label{section:multislice}
 %The reconstruction performance of the standard approach is bound by the filtering rate of the texture engine used to interpolate the projection data. The filtering rate defines a number of texels that the texture engine is able to output per second.  However, the maximum supported number of bits per texel varies between architectures. NVIDIA GT200 is bound by 32-bits and is only capable to work with 4-byte single-precision floating point numbers at the full speed.  With Fermi, NVIDIA has decided to keep the filtering rate mostly unchanged, but instead to double the maximum bit-width of supported data types. Since Fermi all NVIDIA architectures are capable to filter 8-byte data at the full pace. This is not limited to double precision numbers. The hardware is also able to manage 8-byte vector types, like \emph{float2}. The standard reconstruction algorithm can benefit from this feature only if changed to double-precision for better accuracy. However, this have little use in practice. Otherwise, only half of the available throughput is achieved. 

 The texture engines integrated in all recent generations of GPUs are capable of filtering 8-byte data at full pace, see \sectionname~\ref{section:texture_engine}. The standard reconstruction algorithm can benefit from this feature only if switched to double precision for better accuracy, which has little use in practice. In parallel-beam tomography, however, exactly the same operations are performed for all reconstructed slices. Therefore, it is possible to reconstruct multiple slices in parallel if the back-projection operator is applied to a compound sinogram which encodes bins from several individual sinograms as vector data. In particular, it is possible to construct such a sinogram using the \emph{float2} vector type and interleave values from one sinogram as $x$ components and from another as $y$ components, see \figurename~\ref{fig:interleave}. With a \emph{float2}-typed texture mapped onto this interleaved sinogram, it is possible to fully utilize the bandwidth of the texture engine and reconstruct two slices in parallel. The interleaving is done as an additional data preparation step between the filtering and back-projection steps. The back-projection kernel is then adjusted to use the \emph{float2} type and to write the $x$ component of the result into the first output slice and the $y$ component into the second. There is a considerable speed-up on all relevant architectures, as can be seen in \figurename~\ref{fig:texeff}.
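
 A minimal sketch of the interleaving kernel is shown below, under the assumption that both filtered sinograms have identical dimensions and are stored contiguously; the names are illustrative.

\begin{lstlisting}
// Sketch: pack bins of two filtered sinograms into one float2 sinogram,
// slice A in the x components and slice B in the y components.
__global__ void interleave2(const float *sino_a, const float *sino_b,
                            float2 *sino_ab, int n_bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_bins)
        sino_ab[i] = make_float2(sino_a[i], sino_b[i]);
}
\end{lstlisting}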
 
\begin{figure}[htb]
\centering
\includegraphics[width=\imgsizedefault\textwidth]{img/interleave.pdf}
\caption{\label{fig:interleave} Interleaving two sinograms to allow utilization of the full 8-byte filtering bandwidth on post-Fermi NVIDIA GPUs. }
\end{figure}



\begin{figure}[htb]
\centering
\includegraphics[width=\imgsizedefault\textwidth]{img/texeff.pdf}
\caption{\label{fig:texeff} Efficiency of the optimizations proposed for the texture-based back-projection kernel. The measured throughput is compared to the maximum filter rate of the texture engine and the performance is reported as the achieved utilization in percent. The results are also reported for processing multiple slices in parallel. Nearest-neighbor interpolation is used to measure the performance if 4 slices are reconstructed in parallel; on the NVIDIA platform the data is additionally stored in the half-precision representation in this case. For single- and dual-slice reconstruction, the performance is measured in bi-linear interpolation mode and the sinogram is stored in the single-precision floating-point format on all platforms. The blue bars show the performance of the standard \algorithmname~\ref{alg:stdalg}, only modified to process multiple slices in parallel. The green bars show the improvement due to better fetch locality. The red bars show the maximum performance achieved by \algorithmname~\ref{alg:newtex4} with the optimal combination of tweaks explained throughout \sectionname~\ref{section:texrec}. \tablename~\ref{tbl:newtex} summarizes the architecture-specific parameters used for each GPU. The utilization is reported with respect to the supported filter rate, not the bandwidth. While lower utilization is achieved in the multi-slice reconstruction modes, the actual performance is higher because 2/4 slices are processed with a single texture access.}
\end{figure}


\subsection{Using half-precision data representation}\label{section:half}
 Since the NVIDIA texture engine is currently limited to 8-byte vectors, the proposed approach cannot be scaled to 4 slices if single-precision input is used. However, CUDA supports a half-precision data type which encodes each floating-point number using only 16 bits. While the reduced precision might affect the quality of the reconstruction, the majority of cameras have a dynamic range of 16 bits or below. High-speed cameras actually used for time-resolved synchrotron tomography have an even lower resolution of only 10--12 bits. Since the higher frequencies in a sinogram are amplified during the filtering step, it is impossible to represent the filtered sinograms by integer numbers without loss of precision. However, using a half-precision floating-point representation to store the input data should have a limited impact on the resulting image quality if all further arithmetic operations are performed in single precision. Unfortunately, half-precision textures are not supported in the latest available version of CUDA yet (CUDA~8.0). While one can store half-precision numbers in GPU memory, it is impossible to map the corresponding texture. Still, it is possible to speed up the reconstruction if the nearest-neighbor interpolation mode is selected. After filtering, the sinograms are down-sampled to the half-precision format and interleaved. The texture mapping is created using the \emph{float2} data type. Upon request, the texture engine returns the nearest value without performing any operations on it. Therefore, the appropriate data is returned even if an incompatible format is configured, as long as the data size matches. To avoid a further penalty to precision, the half-precision numbers are immediately cast to single precision using the \emph{\_\_half22float2} instruction and all further operations are performed in single precision as usual.
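
 Conceptually, the fetch can be sketched as follows: the 8 bytes returned for one \emph{float2} texel are re-interpreted as two \emph{\_\_half2} pairs and converted to single precision immediately. The function and variable names are illustrative.

\begin{lstlisting}
#include <cuda_fp16.h>

// Sketch: one nearest-neighbor fetch returns 8 bytes that actually hold
// four half-precision bins (one per slice); convert them right away.
__device__ float4 fetch4_half(cudaTextureObject_t sino, float h, float p)
{
    float2 raw = tex2D<float2>(sino, h, p);        // bits of 4 half values
    __half2 *pairs = reinterpret_cast<__half2 *>(&raw);
    float2 s01 = __half22float2(pairs[0]);         // slices 0 and 1
    float2 s23 = __half22float2(pairs[1]);         // slices 2 and 3
    return make_float4(s01.x, s01.y, s23.x, s23.y);
}
\end{lstlisting}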
 
  \figurename~\ref{fig:texeff} indicates a significant speed-up on all NVIDIA architectures except Kepler. As can be seen from \tablename~\ref{tbl:gpuspec}, type casting is very slow on Kepler and caps the performance gains. The proposed method is also not viable on the AMD platform: none of the considered AMD GPUs supports the half-precision extension of the OpenCL specification. Without this extension, no hardware instruction is available to convert between half and single precision. While such a conversion can be performed using several bit-mangling operations, this would cap the possible performance gain as well.

 The penalty to the quality of the reconstruction induced by the reduced precision is evaluated in \figurename~\ref{fig:quality_nn} and \figurename~\ref{fig:fossil_nn}. It is negligible for both the synthetic Shepp-Logan phantom and the selected dataset with measured data. However, the behavior for different real-world measurements may vary, especially if the projections are acquired with a camera with a high dynamic range. As the optimization proposed in this subsection changes the reconstruction results, it is important to verify that the achieved quality is still satisfactory for the considered application.

 
\begin{figure}[htb]
\centering
\includegraphics[width=\imgsizedefault\textwidth]{img/quality_nn_zoomed.pdf}
\caption{\label{fig:quality_nn}  Comparison of two reconstructions of the Shepp-Logan head phantom using a single-precision (red) or half-precision (green) input. A profile (top) and the absolute difference between the reconstructions (bottom) are shown along a line crossing the maximum number of features of the phantom. Due to the very small differences between the reconstructions, the red and green lines completely overlap in the top plot. }
\end{figure}

\begin{figure}[htb]
	\centering
	\includegraphics[width=\imgsizedefault\textwidth]{img/quality_nn_fossil.pdf}
	\caption{\label{fig:fossil_nn}  Comparison of two reconstructions of the fossilized wasp dataset using a single-precision (red) or half-precision (green) input. A profile (top) and absolute difference between reconstructions (bottom) are shown along the selected line crossing multiple features. Due to very small differences between reconstructions, the red and green lines completely overlap in the top plot. }
\end{figure}



\subsection{Efficiency of the standard algorithm}\label{section:effeval}
  \figurename~\ref{fig:texeff} evaluates the efficiency of texture-engine utilization. While the performance in single-slice processing mode is close to the theoretical maximum on the majority of the considered architectures, the efficiency drops significantly if multiple slices are reconstructed in parallel. The AMD cards and the cards based on the NVIDIA Kepler architecture show sub-optimal performance also in the single-slice reconstruction mode.
  
  As discussed in \sectionname~\ref{section:ilp}, GPU architectures include multiple functional blocks operating independently. The performance of a GPU application is typically restricted by the slowest and/or most loaded of these blocks. Furthermore, complex algorithms require a large amount of hardware resources like registers and shared memory. A large resource footprint may constrain parallelism and, consequently, limit the GPU's ability to hide memory latencies and schedule the load across all available functional units. The discussed algorithm relies on:
\begin{itemize}
\item Texture engine to fetch and interpolate data
\item ALUs to find the ray incidence point
\item Constant memory to load projection constants
\item SFU units for type conversions and integer multiplications on recent NVIDIA devices; the major load comes from the conversion between half- and single-precision formats in the 4-slice reconstruction mode, but the SFUs are also used for addressing constant-memory arrays and for converting the loop index to the texture coordinate along the projection axis
\end{itemize}

 The standard algorithm has a small register footprint and all GPUs provide enough computing power to find the incidence points. The performance of the texture engine, however, is sub-optimal across all architectures if multi-slice reconstruction is performed. The reason is the poor locality of the texture fetches. The AMD GPUs are also restricted by the performance of the texture cache even if only a single slice is reconstructed. On top of that, the Kepler and AMD VLIW systems have comparatively slow constant memory, which also bounds the performance below the theoretical throughput. Finally, the low SFU performance on the Kepler GPUs restricts the reconstruction if the half-precision format is used to store the sinograms. More information about GPU capabilities and the relative performance of GPU components is given in \tablename~\ref{tbl:gpuspec}.

\subsection{Optimizing locality of texture fetches}\label{section:remap}
 The standard algorithm maps each GPU thread to a single pixel of the output slice. The default mapping is linear: the thread with coordinates $(x,y)$ in the computational grid is used to reconstruct the pixel with coordinates $(x,y)$ in the slice. Since the threads of a warp reconstruct consecutive pixels along the $x$ axis, a large range of sinogram bins is always accessed. Up to 16 different locations are fetched by a warp if 16-by-16 thread blocks are utilized. As discussed in \sectionname~\ref{section:texture_engine}, the locality of fetches within a block, a warp, and also within a group of 4 consecutive threads is important to keep the texture engine running at full speed.

 To improve the locality of the texture fetches, a new thread-to-pixel mapping is proposed. The thread-block assignments are kept exactly the same as in the standard version, i.e. each block of 256 threads is responsible for an output area of 16-by-16 pixels. However, this area is further subdivided into 4-by-4 pixel squares. Within each square, the threads are mapped along a Z-order curve as illustrated in \figurename~\ref{fig:newtex4_pm}, left. Then, a group of 4 threads fetches positions in a sinogram row which are at most 3 bins apart, and only up to 5 elements are required to perform the corresponding linear interpolations. The data required for 16 threads is limited to 8 bins only. \tablename~\ref{tbl:remap} shows the effect of the remapping for 1- and 2-slice reconstruction on the NVIDIA Titan X GPU. According to \figurename~\ref{fig:texeff}, a significant speed-up is also achieved on the other architectures unless the performance is capped by other factors.
 
%This is wrong. On multiple levels:
% 1) 3 bins appart means that actually the accesses lay within 4 pixels (not). And, consequently, up to 5 locations are needed.
% 2) The table reports 8.5 fetches per 16 threads (as probably the stuff is not re-used). 
% 3) Also number of fetches does not affect performance directly as shows our microbenchmarking experimetns.
%The locality of texture fetches is improved significantly with the new mapping. Texture engine fetches data from the cache at 32-byte granularity and the optimal performance is achieved only if a number of fetches required per a group of 4 threads is minimized.
%Due to the alignment requirements up to 2 data accesses will be performed by the texture engine. Maximum 3 fetches are required to serve the data to 16 threads responsible for the complete square. This is in contrast to 3 fetches per 4-thread group and 6 fetches per 16 threads required using the original mapping. \tablename~\ref{tbl:remap} shows the effect of remapping for 1- and 2-slice reconstruction modes on the Pascal-based Titan X GPU. The improved locality results in over 10\% reduction of accesses to the texture cache. This has a significant impact for performance in 2-slice mode. While reduction of cache traffic is also reported for single- slice mode, it does not affect performance as the number of issued fetches lays within capabilities of the texture cache also for standard mapping.  

\begin{table}[t]
\begin{threeparttable}
\caption{\label{tbl:remap} Queries to texture cache with standard and optimized mapping on NVIDIA GeForce Titan X (Pascal)}
\centering
\begin{tabular} { | c | l | c | c | c | c | }
\hline
Slices & Approach & Queries\tnote{a} & Tex. hits\tnote{b} & L2 hits\tnote{c} & Perf.\tnote{d} \\
\hline
\multirow{2}{*}{1} 
& Standard & 0.43 & 96.0\% & 89.0\% & 381~GU/s \\
& Remapped & 0.39 & 95.5\% & 89.4\% & 376~GU/s \\
\hline	
\multirow{2}{*}{2} 
& Standard & 0.61 & 91.5\% & 88.6\% & 534~GU/s \\
& Remapped & 0.53 & 93.8\% & 88.3\% & 724~GU/s \\
\hline
\end{tabular}
\begin{tablenotes}
\item \tabledesc{The table compares the efficiency of the texture fetches using the standard linear mapping scheme and the new scheme with improved locality. The measurements were obtained using the NVIDIA profiler for the 1- and 2-slice reconstruction modes. The table lists: \colid{a} the number of 32-byte queries issued to the texture cache per fetch, \colid{b} the hit rate of the texture cache, \colid{c} the L2 cache hit rate, \colid{d} the achieved reconstruction performance in giga-updates per second.}
\end{tablenotes}
\end{threeparttable}
\end{table}

 The pseudo-code to compute the new thread indexes is given in \algorithmname~\ref{alg:newtex4remap}. The only required modification in \algorithmname~\ref{alg:stdalg} is to use the updated indexes $m_t'$ in place of the ones reported by CUDA/OpenCL.
 
\input{alg_51_remap}
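
 For illustration, the remapping amounts to a few bit operations on the linear in-block thread index. The sketch below uses one possible (row-major) ordering of the 4-by-4 squares; the exact index computation is defined by \algorithmname~\ref{alg:newtex4remap}.

\begin{lstlisting}
// Sketch of the Z-order remapping (illustrative square ordering).
int tid = threadIdx.y * blockDim.x + threadIdx.x;   // linear index, 0..255
int z   = tid & 15;                                 // position inside a 4x4 square
int zx  = (z & 1) | ((z >> 1) & 2);                 // Morton decode: x offset
int zy  = ((z >> 1) & 1) | ((z >> 2) & 2);          // Morton decode: y offset
int sq  = tid >> 4;                                 // which of the 16 squares
int mx  = (sq & 3) * 4 + zx;                        // remapped x within the block
int my  = (sq >> 2) * 4 + zy;                       // remapped y within the block
// the output pixel is then (blockIdx.x * 16 + mx, blockIdx.y * 16 + my)
\end{lstlisting}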

\subsection{Optimizing memory bandwidth}\label{section:newtex}
 Even though the new thread mapping gives a significant speed-up on the majority of the considered architectures, the performance on Kepler and AMD VLIW GPUs is still bound by the slow constant memory. To process a projection, the GPU threads load several geometric constants to locate the point of incidence as defined in \equationname~\ref{eq:bin}. These constants can be re-used multiple times if each GPU thread reconstructs several pixels. Since pixels are reconstructed independently, this also increases the number of independent instructions in the execution flow and improves the scheduler's ability to hide memory latencies and to issue multiple instructions per clock. There are two approaches to adapting the thread-to-pixel mapping: either the number of threads in the computational grid is reduced proportionally, or a new mapping scheme is constructed in such a way that the same number of threads is running but each thread contributes to multiple resulting pixels. The latter can be achieved by processing several projections in parallel. Then, each thread is responsible for a group of pixels but loops over a subset of all projections only. Another thread contributes to the same group of pixels, but from a different subset of projections.

 Both methods perform similarly if properly optimized for the target GPU. With the second approach, however, the dimensions of the computational grid stay unchanged. Consequently, it has an advantage for region-of-interest (ROI) and small-scale reconstructions. For this reason, we focus on this method and elaborate how it is implemented and tuned to run efficiently across platforms. To preserve the good locality of texture fetches, the mapping described in the previous section is adapted with small changes. The thread-block assignments are kept the same: each block is responsible for an output area of 16-by-16 pixels and this area is further subdivided into 4-by-4 pixel squares. In contrast to the original mapping, however, 64 threads are assigned per square. Each thread computes the contribution to the pixel value from a quarter of all available projections. Hence, each thread processes 4 pixels and each pixel is reconstructed using 4 threads. To avoid costly atomic operations, the contributions of the projection subsets are summed independently. Then, the threads are re-assigned to perform a reduction in shared memory and compute the final value of each pixel. To preserve a good spatial locality of the texture fetches, 4 neighboring projections are processed in parallel and the threads step over 4 projections at each loop iteration.
 %As the projection angle is normally changing very slowly, it is reasonable to assume that the neighboring bins contribute to a pixel value if taken from the consecutive projection rows. Therefore, the good locality of texture fetches is preserved if the threads are mapped to process 4 neighboring projections in parallel and each thread steps over 4 projections at each iteration step. 

  There are 256 threads in a block and 64 threads are assigned to reconstruct each 4-by-4 pixel square. Therefore, 4 such squares are processed in parallel and a complete set of 16 squares requires 4 steps of a loop. \figurename~\ref{fig:mappings} shows several possible sequences to serialize the processing. The first mapping is sparse and results in a reduced cache hit rate compared to the other options. Since only a single pixel coordinate has to be incremented in the pixel loop, the third option requires fewer registers than the second. While the second mapping has a better access locality within the 64-thread wavefronts of the AMD platform, this does not affect performance in practice. On the other hand, the register usage is very high in the multi-slice reconstruction modes and the extra registers cause reduced occupancy or register spillage into local memory. Therefore, the third approach is preferred.

\begin{figure}[htb]
\centering
\includegraphics[width=\imgsizedefault\textwidth]{img/mappings.pdf}
\caption{\label{fig:mappings} Several ways to assign a block of GPU threads to an area of 16-by-16 pixels. Since 4 projections are processed at once, only 64 threads are available for the entire area and it takes 4 steps to process it completely. For each scheme, the pixels processed in parallel during the first step are shown in gray. The first mapping (left) is sparse and results in increased cache misses. The second mapping (center) requires more registers and may cause reduced occupancy. Hence, the third mapping (right) is preferred. }
\end{figure}
  
  A request to multiple locations in constant memory by a warp is serialized on the NVIDIA platform. To avoid such serialization, all threads of a warp are always assigned to the same projection. The following mapping scheme is adopted: the lowest 4 bits of the thread number in a block define the mapping within a 4-by-4 pixel square, where a group of 16 threads follows the Z-order curve as explained in \sectionname~\ref{section:remap}; the next 2 bits define the square; and the top 2 bits define the processed projection. \figurename~\ref{fig:newtex4_pm} illustrates the proposed mapping and \algorithmname~\ref{alg:newtex4remap} provides the corresponding pseudo-code.
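
 In code, the decomposition of the thread index can be sketched as follows (the variable names are illustrative):

\begin{lstlisting}
// Sketch: split the linear thread index into (pixel, square, projection).
int tid    = threadIdx.y * blockDim.x + threadIdx.x;  // 0..255
int pixel  = tid & 15;         // bits 0-3: Z-order position inside a 4x4 square
int square = (tid >> 4) & 3;   // bits 4-5: which of the 4 squares of this step
int proj   = tid >> 6;         // bits 6-7: which of the 4 parallel projections
// All 32 threads of a warp share the same proj value, so the constant-memory
// reads of the projection parameters are not serialized.
\end{lstlisting}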

\begin{figure}[htb]
\centering
\includegraphics[width=\imgsizedefault\textwidth]{img/newtex_pm.pdf}
\caption{\label{fig:newtex4_pm} Mapping of a block of 256 threads to reconstruct a square of 16-by-16 pixels along 4 projections. 4 steps are required to process all pixels. A group of 64 consecutive threads is responsible for a rectangular area of 16-by-4 pixels (middle). 4 projections are processed in parallel using 4 such groups (right). Each 4-by-4 pixel square is reconstructed by 16 threads arranged along a Z-order curve (left). For each output pixel or block of pixels, the assigned range of threads is shown in the figure. }
\end{figure}

 The pseudo-code for the complete approach is presented in \algorithmname~\ref{alg:newtex4}. There are two distinct processing steps. First, the partial sums are computed in a 4-element array. It is declared as a local variable, and both the NVIDIA and AMD compilers are able to back it with registers because of its fixed size. The outer loop starts from the first projection assigned to a thread and steps over the projections which are processed in parallel. A large loop-unrolling factor requested with the \emph{pragma unroll} preprocessor directive has a positive impact on performance, especially on the Kepler architecture. At each iteration, the constants are loaded and the inner loop is executed to process the 4 pixels the thread is responsible for. After all projections are processed, the reduction loop is executed. The partial sums are written into shared memory and the reduction is performed. To avoid non-coalesced global memory writes, all results are first stored in a shared memory buffer $\shmem{\vfloat{r}}$ and then written out in a coalesced manner. Synchronization is needed when switching between the different mapping modes. Since each reduction is performed by a single warp only, it is sufficient to prevent the compiler from re-ordering read and write operations between the reduction steps using a \emph{fence} operation. Alternatively, the \emph{shuffle} operation may be utilized to perform the reduction on Kepler and newer NVIDIA architectures. Then, neither the \emph{fence} nor the \emph{if}-condition is required. The reduction loop using the \emph{shuffle} instruction is shown in \algorithmname~\ref{alg:newtex4_shuffle}.
 
\input{alg_51_newtex4}

\input{alg_51_shuffle}
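
 For reference, the shuffle-based reduction over the 4 partial sums can be sketched as below, under the assumption that the re-mapping step places the partial sums of one pixel into 4 consecutive lanes of a warp; \emph{\_\_shfl\_xor\_sync} requires CUDA~9 or later, while older toolkits provide \emph{\_\_shfl\_xor}.

\begin{lstlisting}
// Sketch of a shuffle-based butterfly reduction over groups of 4 lanes.
float sum = partial;                       // this thread's partial contribution
for (int stride = 2; stride > 0; stride >>= 1)
    sum += __shfl_xor_sync(0xffffffffu, sum, stride);
// every lane of the 4-lane group now holds the final pixel value
\end{lstlisting}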

On the GTX295 using CUDA~6, there are a few glitches significantly affecting performance. The \emph{fence} instruction prevents unrolling of the reduction loop. Consequently, the array with the partial sums is referenced indirectly using the loop index. This forces the compiler to allocate the array in local memory instead of using registers and causes an enormous performance penalty. Therefore, a standard \emph{\_\_syncthreads} is used instead. The loop is also not unrolled if the inner reduction loop is implemented directly as written in \algorithmname~\ref{alg:newtex4}. The following formulation causes no issues:

\begin{lstlisting}
for(j = 0; j < 2; j++) { 
    i = 2 >> j;
    ...
}
\end{lstlisting}
 
% For AMD GCN2 we used 16-proj mode due to slow-downs because of constant memory. However, it is because we have not implemented gmem reading. If we would do, I guess it should be fix also newtex4 mode. But I will skip details in the paper.
 
 The GPU constant memory is optimized under the assumption that all threads of a computational grid always access the same constants. Since the new algorithm processes several projections in parallel, this assumption is no longer valid. While the proposed mapping avoids a major penalty due to warp serialization, the slow constant memory is still a bottleneck on older AMD devices. To avoid the performance penalty, the faster and larger shared memory is used instead in this case. The projection constants are initially stored in global GPU memory and then cached in shared memory. \algorithmname~\ref{alg:newtex4_cc} contains an alternative implementation of the accumulation step of \algorithmname~\ref{alg:newtex4}. The shared memory is additionally configured to store the constants for up to 256 projections. In fact, the same shared memory buffer may be used in both steps of the algorithm: first for caching the constants and later for the data exchange while performing the reduction. An outer loop processing blocks of 256 projections is introduced. At each iteration of this loop, the threads of a block are first used to read the constants from global memory and fill the cache. To allow 64-bit loads, we use a \emph{float2} variable to store the values of both trigonometric functions. After synchronization, the inner projection loop is started to compute the partial sums. The inner loop is implemented as in \algorithmname~\ref{alg:newtex4}, with the only difference that the constants are loaded from shared memory. This method, however, cannot be used across all platforms: while the majority of NVIDIA GPUs showed similar performance for both implementations, Kepler-based GPUs perform better if the constant memory is utilized.  % I guess due to increased regiter usage....

\input{alg_51_cc}
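
 The caching step can be sketched as the following kernel fragment; \emph{g\_trig} and \emph{g\_axis} stand for hypothetical global-memory arrays holding the constants, and the number of projections is assumed to be padded to a multiple of 256.

\begin{lstlisting}
// Sketch of the constant-caching loop (n_proj padded to a multiple of 256).
__shared__ float2 s_trig[256];     // (cos, sin) per projection, 64-bit loads
__shared__ float  s_axis[256];     // corrected rotation-axis position

int tid = threadIdx.y * blockDim.x + threadIdx.x;          // 0..255
for (int base = 0; base < n_proj; base += 256) {
    s_trig[tid] = g_trig[base + tid];                      // fill the cache
    s_axis[tid] = g_axis[base + tid];
    __syncthreads();
    // ... inner projection loop as in the accumulation step, but reading
    //     the constants from s_trig / s_axis instead of constant memory ...
    __syncthreads();                                       // before the next refill
}
\end{lstlisting}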


\subsection{Optimizing occupancy}\label{section:newtex_occupancy}
 Similarly to the standard algorithm, the optimized version can easily be adapted to process 2 and 4 slices in parallel. Only the accumulators and intermediate buffers have to be declared with the appropriate vector type. However, the usage of hardware resources grows significantly if multiple slices are processed in parallel. In the 4-slice mode, 16 registers (32-bit each) are required just to accumulate the partial sums. The large register footprint reduces occupancy and may result in sub-optimal performance unless treated properly.
 
 The register allocation is completely out of the developer's control on the AMD platform. NVIDIA allows targeting a desired number of blocks executed by each SM in parallel. This is done using the \emph{\_\_launch\_bounds\_\_} keyword. The CUDA optimizer then changes the code generation to meet the target. It prevents data pre-fetching and may also result in an increased computational load and/or a more intensive usage of the L1 cache, as a part of the local variables is offloaded to local memory. On the Fermi and Kepler architectures, 64~KB of on-chip memory is split between the L1 cache and shared memory according to a user-specified configuration. By default, 48~KB is assigned to shared memory and only 16~KB is left for the L1 cache. If the shared memory consumption is low enough, it is possible to re-balance this ratio to achieve a high occupancy on the one hand and to ensure that there is enough L1 cache to back all the required local memory on the other.
  
  \tablename~\ref{tbl:newtex_bounds} summarizes the resource consumption, theoretical occupancy, and achieved performance on the NVIDIA GTX Titan with and without the resource restriction. The results show that improved occupancy may bring a considerable speed-up even if a significant number of variables has to be offloaded to local memory, provided it is backed by the L1 cache. Without the restriction, the generated code requires 38 registers in the 2-slice reconstruction mode. This limits the number of resident threads to 1724, i.e. 6 blocks, and results in 75\% occupancy. The performance is improved by 15\% if the CUDA compiler is instructed to allow the execution of 8 blocks, i.e. running at full occupancy. To fulfill this requirement, the compiler puts 6 variables in local memory. However, 16~KB of L1 cache is not enough to back the required local memory for 8 resident blocks. On the other hand, only 4~KB of shared memory is required per block for the temporary buffers, or 32~KB for all 8 blocks. Therefore, the ratio between the L1 cache and shared memory is shifted to allow 32~KB of L1 cache. This is done using the \emph{cudaFuncSetCacheConfig} command with the \emph{cudaFuncCachePreferL1} argument to specify the preference for L1 cache. The recommended restrictions for the other architectures are summarized in \tablename~\ref{tbl:newtex}.
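
 For instance, the block-count target and the cache preference can be requested as follows; the kernel name and signature are illustrative.

\begin{lstlisting}
// Sketch: allow at most 256 threads per block and request at least 8
// resident blocks per SM; prefer L1 cache over shared memory.
__global__ void __launch_bounds__(256, 8)
backproject_2slice(cudaTextureObject_t sino, float2 *out,
                   int width, int n_proj, float center);

// host side, once before the first kernel launch:
cudaFuncSetCacheConfig(backproject_2slice, cudaFuncCachePreferL1);
\end{lstlisting}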

%ToDo: Shall I add ShMem and L1/ShMem ratio for completness?
\begin{table}[t]
\begin{threeparttable}
\caption{\label{tbl:newtex_bounds} Occupancy and performance of NVIDIA GeForce GTX Titan (Kepler) using 2-slice reconstruction mode}
\centering
\begin{tabular} { | l | r | r | r | r | }
%\multicolumn{6}{l}{\textbf{NVIDIA GeForce GTX 580 (Fermi, sm21)}} \\
\hline
Restricted & Registers & Local Mem. & Occupancy & Performance \\
\hline
No  & 38 & -  & 75\% & 320~GU/s \\
Yes & 32 & 24 bytes & 100\% & 368~GU/s \\
\hline
\end{tabular}
\end{threeparttable}
\end{table}

 Both the shared and the constant memory are comparatively slow on Kepler with respect to the performance of the texture engine, see \tablename~\ref{tbl:gpuspec}. Furthermore, 64-bit accesses are required to fully utilize the available shared memory bandwidth. This is the case in the multi-slice reconstruction modes; however, the 64-bit mode must also be enabled in CUDA using the \emph{cudaDeviceSetSharedMemConfig} command with the \emph{cudaSharedMemBankSizeEightByte} argument. The constant memory also performs better if 64- or 128-bit wide accesses are performed. A speed-up is achieved if all projection constants are grouped together and stored as a single \emph{float4} vector. Even if only 3 components of the vector are actually required (i.e. one quarter of the bandwidth is wasted), the performance is considerably better.
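
 Both tweaks can be sketched as follows; the names are illustrative and the fourth component of the \emph{float4} is unused padding.

\begin{lstlisting}
// Sketch: enable 8-byte shared-memory banks on Kepler (host side) ...
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

// ... and pack the per-projection constants into a single float4 so that
// one 128-bit constant-memory access returns cos, sin and the axis position.
#define MAX_PROJ 2048                     /* hypothetical upper bound */
__constant__ float4 c_proj[MAX_PROJ];     /* (cos, sin, axis, unused) */
\end{lstlisting}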
 
%\subsubsection{Enabling 64-bit access for NVIDIA Kepler GPUs}\label{section:newtex_ld64}
% According to our measurements in~\ref{section:memory_performance}, for constant memory NVIDIA maintains two caches. The larger 8~KB cache layer is shared by a cluster of several SM and a small 2~KB L1 cache is residing directly within SM. For optimal performance the L1 cache requires 64-bit access as well and the larger 8~KB is only capable to deliver half of the bandwidth. Some performance improvement is achieved by grouping all projection constants together and storing as a single \emph{float4} vector. Even if only 3 components of the vector are actually required, the performance is considerably better on Kepler GPUs.
 
\subsection{Summary}
 We have introduced a new cache-aware algorithm which is able to reconstruct up to 4 slices in parallel. Several modifications are proposed to improve the performance on specific GPU architectures. The optimal configuration and the corresponding performance are summarized in \tablename~\ref{tbl:newtex}. The achieved efficiency is further analyzed in \figurename~\ref{fig:texeff}. In the single-slice reconstruction mode, the performance is above 90\% of the theoretical maximum across all considered platforms. Depending on the architecture, this corresponds to a speed-up of up to 90\% compared to the original algorithm. Using the multi-slice reconstruction and the half-precision data representation, it is possible to quadruple the performance on Fermi and the latest AMD and NVIDIA architectures. An efficiency of about 80--90\% is achieved if the optimized reconstruction kernel is utilized. The efficiency of the Kepler GPUs is restricted by the comparatively slow on-chip memory and the low performance of the SFU units which are used to perform the type-conversion operations. Nevertheless, a 2 to 3 times speed-up over the standard single-slice algorithm is achieved due to the proposed optimizations.

\input{table_5x_texrec.tex}