\section{Alternative algorithm based on ALUs}\label{section:alurec}
 While it is possible to reach a very high reconstruction speed by processing multiple slices in parallel, this option is not available on all GPUs. Furthermore, the ability to re-combine slices for parallel reconstruction may be limited by the architecture of the data processing pipeline or by latency requirements. According to the specifications, the majority of GPUs are able to perform over 32 floating-point operations during a single texture fetch, see \tablename~\ref{tbl:gpuspec}. Only 9 floating-point operations are required to perform a single update of the back-projection algorithm~\cite{cabral1994}. Therefore, an alternative implementation using the algebraic units to perform interpolation may outperform the texture-based kernel by a factor of 3 if executed on a single slice. The challenge is to feed the data into the floating-point units at the required rate. The L1 cache integrated in the SM is small, has low associativity, and is consequently susceptible to cache poisoning. As a result, loads from global memory limit the performance severely. In this section we present a back-projection algorithm that uses the ALUs to perform interpolation and shared memory as an explicit cache. First, we explain the concept and present a base version of the algorithm. Then, we build a simplified performance model and analyze what limits the performance on each of the hardware platforms. Finally, multiple adjustments are discussed to slightly shift the balance between memory operations and different types of computations and to better address the capabilities of a specific architecture.
 
 
\subsection{The Concept}\label{section:alurec_concept} 
 The proposed approach is illustrated in \figurename~\ref{fig:alutex}. To avoid the penalties associated with global memory loads, shared memory is used to cache all bins required for reconstruction by a block of threads. To reserve a large enough buffer for the cache, it is necessary to find an upper bound on the number of bins ($b$) required to reconstruct a rectangular block of pixels ($S$) with dimensions $n$ by $m$. It is defined as

\begin{equation*}
 b_p = \max_{(x,y) \in S}r_p(x,y) - \min_{(x,y) \in S}r_p(x,y)
\end{equation*}

where $r_p(x,y)$ is the incident offset in a projection row, computed as defined in \equationname~\ref{eq:bin}. If ($x_0$, $y_0$) are the coordinates of the maximum of $r_p$ and ($x_1$, $y_1$) those of the minimum, the equation can be reformulated as
\begin{equation*}
    b_p = (x_0 - x_1)\cdot\cos(p\alpha) - (y_0 - y_1)\cdot\sin(p\alpha)
\end{equation*}
or
\begin{equation*}
    b_p = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2} \cdot\cos(\alpha + \beta)
\end{equation*}
where $\beta$ is some angle. Then, $b_p$ can be estimated as:
\begin{equation*}
b_p \leq \sqrt{n^2 + m^2} \cdot\cos(\alpha + \beta)
\end{equation*}

 
%Shorter:
%The \equationname~\ref{eq:bin} can be reform ulated as $r_p(x,y)=\sqrt{x^2 + y^2}\cdot\cos(\alpha + \beta)$, where$\beta$ is some angle.
%Therefore $b_p$ can be estimated as

% As was shown in \sectionname~\ref{section:remap}, maximum $N\sqrt{2}$ bins are required per projection to reconstruct a full pixel square with the side $N$. To perform caching, it is necessary to find the minimal required bin ($h_m$) for each projection. Then, the reconstruction may be performed in two stages as follows.

 It is independent of the processed projection and is minimal if the area $S$ is square. In this case the value of $b$ does not exceed $n \cdot \sqrt{2}$. For practical purposes we assume that $\frac{3}{2}n$ bins are required per projection to reconstruct a full pixel square with side $n$. To perform caching, it is necessary to find the minimal required bin ($h_m$) for each projection. Then, the reconstruction is performed in two stages. First, the required bins are cached for a set of projections. Afterwards, the reconstruction is performed using the data in the cache. To perform caching, the threads of a block are split into several groups. Each group is responsible for caching the bins of a single projection. A subset of a sinogram row consisting of $\frac{3}{2}n$ bins is extracted starting at an offset equal to $h_m$. Based on the thread index within a group, the offset in the sinogram is computed and the corresponding bin is cached in a shared memory array. If necessary, a few bins with a stride equal to the number of threads in a group are cached by the same thread. The threads of the block are then re-assigned to match the output pixels and process the contributions from the cached projections in a loop. As usual, the threads determine the position where the ray passing through the reconstructed pixel hits the detector row. The corresponding bin in the sinogram is computed by each thread and the offset from $h_m$ is found. Typically the offset is not an integer and falls between two cached values. Depending on the configured interpolation mode, either the offset is rounded to the nearest integer and a single value is loaded from shared memory, or both neighboring values are loaded and linear interpolation is performed to compute the contribution of the projection.
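
 The following CUDA sketch summarizes the two-stage structure. It is a minimal illustration only, assuming a single slice, linear interpolation, plain global-memory loads, and a 32x8 thread block with 4 pixels per thread; all identifiers and block dimensions are ours and do not correspond to the production code. The position in the projection row is assumed to follow \equationname~\ref{eq:bin}, and the minimal bin is recomputed here from the four corners of the pixel square instead of using the precomputed constant $c_m$ introduced below.

\begin{lstlisting}
#define NPX        32   // side of the pixel square reconstructed by a block
#define NBINS      48   // 3/2 * NPX bins cached per projection row
#define NPROJ_PASS 16   // projections cached per pass
#define MAX_PROJ 2048   // hypothetical limit on the number of projections

__constant__ float c_cos[MAX_PROJ], c_sin[MAX_PROJ], c_axis[MAX_PROJ];

// Block of 32x8 threads; each thread accumulates 4 pixels of a 32x32 square.
// Boundary clamping is omitted and n_proj is assumed to be a multiple of 16.
__global__ void bp_alu_sketch(const float *sino, int sino_pitch, int n_proj,
                              float *slice, int slice_pitch)
{
    __shared__ float cache[NPROJ_PASS][NBINS];
    __shared__ float off[NPROJ_PASS];           // folded offset per projection

    const int t  = threadIdx.y * 32 + threadIdx.x;      // 0 .. 255
    const int x  = blockIdx.x * NPX + threadIdx.x;
    const int y0 = blockIdx.y * NPX + threadIdx.y;
    float sum[4] = {0.f, 0.f, 0.f, 0.f};

    for (int p0 = 0; p0 < n_proj; p0 += NPROJ_PASS) {
        // Stage 1: each group of 16 threads caches one projection row.
        const int g = t / 16, i = t % 16, p = p0 + g;
        float xb  = blockIdx.x * NPX, yb = blockIdx.y * NPX;
        float h00 = c_axis[p] + xb * c_cos[p] - yb * c_sin[p];
        // Minimum of the sinogram position over the four corners of the square
        float hm  = floorf(fminf(fminf(h00, h00 + NPX * c_cos[p]),
                           fminf(h00 - NPX * c_sin[p],
                                 h00 + NPX * (c_cos[p] - c_sin[p]))));
        if (i == 0) off[g] = c_axis[p] - hm;             // cache offset base
        for (int j = i; j < NBINS; j += 16)              // 3 iterations
            cache[g][j] = sino[p * sino_pitch + (int)hm + j];
        __syncthreads();

        // Stage 2: accumulate the contributions of the cached projections.
        for (int q = 0; q < NPROJ_PASS; q++) {
            float c = c_cos[p0 + q], s = c_sin[p0 + q];
            for (int k = 0; k < 4; k++) {
                float h = off[q] + x * c - (y0 + 8 * k) * s; // offset in cache
                int   b = (int)h;
                float w = h - b;                             // interp. weight
                sum[k] += (1.f - w) * cache[q][b] + w * cache[q][b + 1];
            }
        }
        __syncthreads();
    }
    for (int k = 0; k < 4; k++)
        slice[(y0 + 8 * k) * slice_pitch + x] = sum[k];
}
\end{lstlisting}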

 Both steps depend on $h_m$: to perform the caching and to locate the required value in the cache. This operation is costly and would add significantly to the computation balance if executed by each thread and for each projection. To reduce the amount of required operations, the $h_m$ values are cached in shared memory during the first stage of the algorithm and then re-used in the second. Furthermore, the minimal bin is always accessed while reconstructing one of the corners of the pixel square. The actual corner depends only on the projection angle and is the same across all squares of the reconstructed slice. Therefore, a single value per projection is required to compute the minimal bin. This value ($c_m$) can be defined as the difference between the position accessed to compute the top-left pixel of a square and the minimal position accessed across this square. Then, it is computed as:
\begin{equation*}
 c_m = n \cdot \max (0, \cos(\alpha _p), -\sin(\alpha _p), \cos(\alpha _p) - \sin(\alpha _p))
\end{equation*}
 Using $c_m$, the minimal required bin ($h_m$) is computed as $h_m \eq floor(h_b + c_m)$, where $h_b$ is the bin accessed by the first thread of a block. It is computed based on the index of a thread block in the computational grid as described in \tablename~\ref{table:alg_vars}. The $c_m$ values are computed during the initialization stage and are stored along with the other projection constants in the GPU constant memory. 
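
 For illustration, the per-projection constants could be prepared on the host as in the following sketch; only the $c_m$ formula is taken from the text, while the array names and the copy to constant memory are ours.

\begin{lstlisting}
// Host-side preparation of per-projection constants (illustrative names).
// n is the side of the pixel square assigned to a thread block.
void prepare_constants(int n_proj, float alpha, int n,
                       float *h_cos, float *h_sin, float *h_cmin)
{
    for (int p = 0; p < n_proj; p++) {
        float c = cosf(p * alpha), s = sinf(p * alpha);
        h_cos[p]  = c;
        h_sin[p]  = s;
        // c_m = n * max(0, cos, -sin, cos - sin)
        h_cmin[p] = n * fmaxf(fmaxf(0.f, c), fmaxf(-s, c - s));
    }
}
// The arrays are then copied to __constant__ memory, e.g.
//   cudaMemcpyToSymbol(c_cos, h_cos, n_proj * sizeof(float));
\end{lstlisting}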
 
 
\begin{figure}[htb]
\centering
\includegraphics[width=\imgsizedefault\textwidth]{img/alutex.pdf}
\caption{\label{fig:alutex} The figure illustrates the reconstruction process relying on the shared memory cache and the algebraic units to perform back-projection. To reconstruct a 32x32 pixel square, a thread block caches 48 bins from each projection row. The projections are processed in groups whose size is determined by the available shared memory. At first, the required subset of bins in each projection is determined (left). The selected subsets along with their offsets in the projection rows are cached in shared memory (center). Then, the reconstruction is performed and the projections are processed in a loop one after another (right). Each thread is responsible for several pixels of the output slice. For each pixel the required position in the sinogram is computed. The cache offset for the considered projection row is subtracted from this position and the offset in the cache is determined (bottom). As the offset is typically not an integer, two array elements are loaded from the cache and interpolation is performed.}
\end{figure}
 
 Multiple auxiliary operations are required to perform the reconstruction. The sinogram values are fetched from the cache, interpolated, and summed up. On top of that, $h_m$ is computed for each projection, the selected parts of the sinogram are cached, and the corresponding positions in the projection cache are determined for each pixel. These operations add an additional load on the GPU and significantly reduce the performance. While it is impossible to eliminate the auxiliary operations entirely, there are two major ways to scale down their proportion. Either several slices are reconstructed in parallel or a larger pixel area is assigned to a thread block for reconstruction. The first option reduces the proportion of computations needed to determine which data is cached for each projection and to find the required offset in the shared memory array. Since the reconstruction is no longer bound by the performance of the texture engine, there is no restriction on the number of slices processed in parallel. It is possible to reconstruct 4 or more slices together provided there are enough hardware resources to handle the data. Proportionally less data has to be cached if a larger area is assigned to a thread block. Consequently, the load on global and shared memory is reduced. This is achieved either by increasing the number of threads in a block or by assigning multiple output pixels to each thread. Since the constants can be stored in registers and re-used to process multiple pixels, the load on constant memory is reduced in the latter case as well.
So, the first option cuts down the amount of computations significantly. The second method cuts down computations to a lesser extent, but also slightly reduces the utilization of shared memory. Both ways, however, increase the use of hardware resources significantly: more shared memory and more registers are required. The optimal compromise between these options has to be found for each targeted platform. Furthermore, there are multiple ways to implement the described operations, and each variant puts more load on one GPU subsystem or another. Additional shared-memory caches can be utilized to shift the balance between computations and memory operations. In the next subsection we present a base implementation and then target the specific architectures across the rest of the section.
 
\subsection{Base Implementation}\label{section:alurec_implementation}
 Processing only a single pixel per thread is sub-optimal across all targeted platforms. The optimal load is between 4 and 16 pixels depending on the available hardware resources. Since square areas are the most efficient to cache, we target areas of either 32-by-32 or 64-by-64 pixels per thread block. While intermediate sizes can be used as well, for power-of-two sizes it is easy to design thread mappings suitable for both the caching and the accumulation stages of the reconstruction process. For the sake of simplicity, in \algorithmname~\ref{alg:alurec} we present a simple version processing 4 pixels per thread. A block of 256 threads is used to reconstruct a square of 32-by-32 pixels. The maximum number of bins accessed per projection is then equal to $32 \cdot \sqrt{2}$, or 46. If linear interpolation is used, up to 47 elements of the sinogram array are actually accessed for each projection. Therefore, 16 threads cache all required values in 3 steps. To avoid conditionals, all 48 values are always cached. The same ratio holds if a 64-by-64 area is reconstructed; then 96 values have to be cached. The number of projections processed in parallel is limited by the available shared memory and the size of a single projection row in the cache. A group of 16 projections may be cached at once if only a single slice is reconstructed. In the 4-slice mode or if a 64-by-64 area is reconstructed, only 8 projections are typically processed in parallel. Reducing this number further may have a negative impact on the performance as many threads would need to wait at the synchronization point, reducing the effective occupancy.
 
\input{alg_52_alurec}

 As was already explained, $h_m$ is computed during the caching stage and stored in shared memory. Instead of being repeatedly computed, the value is loaded from shared memory during the reconstruction stage. To find the required offset in the cache, the difference between the position in the sinogram row ($h$) and $h_m$ is computed. The equation for $h$ includes the projection-corrected position of the rotational axis ($c_a$), which is constant for all pixels. It can be integrated into $h_m$ already during the caching step of the algorithm. So the value of $c_a - h_m$ is stored in shared memory instead of $h_m$. Then, only the pixel-dependent part of the projection equation is computed inside the main loop. 

 Since no interpolation is required while the data is read from global memory, it is possible to access the sinograms directly rather than using texture fetches. The loads are always coalesced and thread blocks read each value only once. However, NVIDIA relies on the same LD/ST units to perform shared and global memory operations. Hence, either a shared memory or a global memory instruction can be executed by the SM at each clock. On the other hand, texture loads are performed by specialized units on all architectures. Therefore, it is possible to load data from global and shared memory simultaneously if global memory is accessed through the texture engine. This makes the texture engine the preferred option for getting the data into the shared memory cache. To avoid unnecessary interpolations, the texture engine is configured to use nearest-neighbor interpolation. 
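
 As a concrete example, a texture object over the sinogram can be configured for point (nearest-neighbor) filtering as in the sketch below. The pitched 2D \emph{float} layout and the function name are assumptions of this sketch; texture objects require Kepler-class hardware and CUDA 5 or later, while on Fermi the equivalent configuration is applied to a texture reference.

\begin{lstlisting}
// Create a texture object over a pitched 2D sinogram with point filtering,
// so texture fetches return the raw bin values without interpolation.
cudaTextureObject_t make_sino_texture(float *d_sino, size_t pitch_bytes,
                                      int width, int height)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypePitch2D;
    res.res.pitch2D.devPtr = d_sino;
    res.res.pitch2D.desc = cudaCreateChannelDesc<float>();
    res.res.pitch2D.width = width;
    res.res.pitch2D.height = height;
    res.res.pitch2D.pitchInBytes = pitch_bytes;

    cudaTextureDesc tex = {};
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModePoint;     // nearest-neighbor fetches
    tex.readMode = cudaReadModeElementType;
    tex.normalizedCoords = 0;                 // unnormalized coordinates

    cudaTextureObject_t obj = 0;
    cudaCreateTextureObject(&obj, &res, &tex, NULL);
    return obj;
}
// In the kernel: cache[g][j] = tex2D<float>(obj, hm + j + 0.5f, p + 0.5f);
\end{lstlisting}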

\subsection{Optimizing the thread mapping to avoid shared memory bank conflicts}\label{section:alurec_shmem}
 As for the texture-based reconstruction kernel, the thread-to-pixel mapping is important to achieve good performance. The main goal is to reduce shared memory transactions and avoid shared memory bank conflicts during both stages of the reconstruction. On all architectures, the warps need to avoid accessing multiple rows of the same shared memory bank in a single instruction. While the warp consists of 64 threads on the AMD platform, a maximum of 32 shared memory banks is supported on the reviewed GPUs. To prevent bank conflicts, it is only necessary to avoid accessing the same bank within a group of 32 threads~\cite{amd2013openclpg}. Therefore, there is no need to tackle the larger warp size on AMD while discussing the shared memory access. Furthermore, there are several architecture-specific restrictions. The Fermi and AMD devices are not capable of handling 128-bit data efficiently~\cite{nvidia2017cudapg,amd2013openclpg}. Using 64-bit wide operations is extremely important on the Kepler architecture to utilize the full performance of shared memory; only half of the bandwidth is available if 32-bit accesses are performed. While not as significant as on the Kepler architecture, 64-bit loads are about 20\% faster on AMD Cypress and Tahiti~\cite{amd2013openclpg}. 
 
%[skiped]As explained in the \sectionname~\ref{section:memory_performance} the performance of 64- and 128-bit loads may be improved on the newest NVIDIA devices if groups of 4 consecutive threads are accessing only two banks.

 No changes are required to benefit from 64-bit shared memory in the multi-slice reconstruction modes. A 64-bit access can also easily be facilitated in the caching step of the algorithm if a single-slice reconstruction is performed. Each thread is made responsible for caching 2 bins at once. First, 2 texture fetches are performed to extract the values of the neighboring bins. Then, both values are assembled in a 64-bit \emph{float2} vector and written into shared memory using a single operation, see \algorithmname~\ref{alg:cache64}. This approach, however, reduces the locality of texture fetches. Since $h_m$ may have an odd value, switching to \emph{float2} textures is not an option. However, the load on the texture engine is quite low and, in contrast to shared memory, has little impact on the overall reconstruction speed. This optimization is relevant on NVIDIA Kepler and both older AMD GPUs.
 
\input{alg_52_cache64}
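
 One possible CUDA formulation of this caching step is sketched below; 8 threads per projection row are assumed, and the texture object, the shared-memory layout, and all identifiers are illustrative rather than taken from \algorithmname~\ref{alg:cache64}.

\begin{lstlisting}
// Each thread fetches two neighboring bins and stores them with one 64-bit
// shared-memory write. 'row' points to the float2 cache row of projection p,
// 'lane' is the thread index within the 8-thread group, 'hm' the minimal bin.
__device__ void cache_row_64bit(cudaTextureObject_t sino, float2 *row,
                                int lane, int p, float hm)
{
    for (int j = lane; j < 48 / 2; j += 8) {             // 3 iterations
        float a = tex2D<float>(sino, hm + 2 * j + 0.5f, p + 0.5f);
        float b = tex2D<float>(sino, hm + 2 * j + 1.5f, p + 0.5f);
        row[j] = make_float2(a, b);                      // single 64-bit store
    }
}
\end{lstlisting}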
 
 Only half of the available shared memory banks are utilized on NVIDIA Fermi and all AMD GPUs if 128-bit data is accessed. To circumvent the problem, it is possible to split the \emph{float4} vectors in two parts, store them separately in two buffers in shared memory, and re-combine them before performing the interpolation, see \algorithmname~\ref{alg:split}. 

\input{alg_52_split}
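
 The idea can be illustrated in CUDA as follows for the 4-slice mode, where each sinogram bin holds a \emph{float4}; the two-buffer layout and all names are ours.

\begin{lstlisting}
// Caching: split each float4 bin (4 slices) into two float2 halves stored in
// separate shared-memory buffers, so that only 64-bit accesses are performed.
__device__ void cache_split(float4 v, float2 *row_lo, float2 *row_hi, int j)
{
    row_lo[j] = make_float2(v.x, v.y);   // slices 0 and 1
    row_hi[j] = make_float2(v.z, v.w);   // slices 2 and 3
}

// Reconstruction: re-combine the halves just before the interpolation.
__device__ float4 fetch_split(const float2 *row_lo, const float2 *row_hi, int j)
{
    float2 lo = row_lo[j], hi = row_hi[j];
    return make_float4(lo.x, lo.y, hi.x, hi.y);
}
\end{lstlisting}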

  In the first stage of the algorithm, the number of threads assigned to cache each projection is adjusted to optimize the access to shared memory. If a large 64-by-64 area is reconstructed, a full warp of 32 threads can be assigned to each projection row, avoiding any possible bank conflicts. Unfortunately, this is not completely optimal on the Kepler architecture because it is then impossible to re-combine two bins into a single 64-bit wide write as explained above. It is also not possible to assign 32 threads per row for a smaller 32-by-32 area because only 48 bins have to be cached per projection in this case, and keeping half of the threads idle is wasteful. Therefore, several projection rows are processed by each warp in the described cases. This potentially may cause bank conflicts. If only a single slice is reconstructed, however, the banks are shifted from one projection row to another as illustrated in \figurename~\ref{fig:banks}. The caching is performed without bank conflicts if either 16 threads are assigned per projection row on the platforms with 32-bit shared memory, or 8/16 threads are used on the Kepler devices. Only 8 threads are used to allow bin re-combination if a small area is reconstructed. 16 threads per projection are optimal on all platforms if multiple slices are reconstructed. The 64-bit banks storing a \emph{float2} sinogram are shifted across projection rows in exactly the same way as the 32-bit banks are if a simple \emph{float} sinogram is reconstructed. And on platforms with 32-bit shared memory it is enough to prevent bank conflicts within a group of 16 threads while dealing with 64-/128-bit data. The optimal settings for each reconstruction mode are summarized in \tablename~\ref{tbl:shmemconf}.
  
\begin{figure}[htb]
\centering
\includegraphics[width=\imgsizedefault\textwidth]{img/banks.pdf}
\caption{\label{fig:banks}  The figure illustrates how the warps are assigned to cache a subset of a sinogram on systems with 32-bit and 64-bit shared memory. For each projection, the 48 bins required to reconstruct an area of 32-by-32 pixels are cached. The shared memory banks used to back each group of 16 bins are specified assuming that a 32-bit data format is used.}
\end{figure}  

 According to the documentation, it does not matter how the threads of a half-warp access shared memory. In practice, however, we found that on recent NVIDIA devices the performance of 64- and 128-bit loads is slightly improved if only 1-2 different memory locations are accessed by each group of 4 consecutive threads. The locality of shared memory loads is improved if each half-warp is mapped to a square of 4-by-4 pixels and the threads are arranged along a Z-order curve similarly to the texture fetches. All 256 threads of a block are mapped to 16 such squares. For the reasons explained in \sectionname~\ref{section:newtex}, the squares are arranged linearly along the $x$-axis. Two rows of 4x4 squares are processed in parallel if a small 32-by-32 area is reconstructed. A single row is covered for the bigger area or if only 128 threads are assigned per block. The remaining rows are processed over 4-16 steps. The threads accumulate the sums for each pixel in a register-bound array and dump it to global memory once the processing of all projections is completed. The complete mapping scheme is illustrated in \figurename~\ref{fig:alumap}. The performance of the NVIDIA Titan X is increased by 3\% if the described mapping is utilized.
  

\begin{figure}[htb]
\centering
\includegraphics[width=\imgsizedefault\textwidth]{img/alumap.pdf}
\caption{\label{fig:alumap} The assignment of block threads to pixels as proposed for ALU-based reconstruction }
\end{figure}  


\input{table_6x_shmem.tex}

\subsection{Advanced Caching Mode}\label{section:ycache}
   For linear interpolation two neighboring bins are always loaded, but it is impossible to perform a 64-bit load due to the alignment requirements. Consequently, only half of the available bandwidth is used on the Kepler architecture in the single-slice processing mode. To allow 64-bit access, both values required to perform the linear interpolation are stored as a \emph{float2} vector in the corresponding bin of the cache. The size of the cache is doubled, but the achieved bandwidth is increased by a factor of two on the Kepler platform and is considerably improved on the AMD devices, which are optimized for 64-bit loads. The required amount of shared memory is still adequately low and does not limit occupancy if a single-slice reconstruction is performed. Furthermore, one floating-point operation is eliminated in the interpolation step of the algorithm if the second component of the cached vector actually stores the difference between the values of consecutive bins in the sinogram, as shown in \figurename~\ref{fig:ycache}. 
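
 With this layout the interpolation reduces to a single fused multiply-add, as in the following sketch with illustrative names:

\begin{lstlisting}
// Linear interpolation from the advanced cache: each element stores the bin
// value in .x and the difference to the next bin in .y, so one FMA suffices.
__device__ float interp_advanced(const float2 *row, float h)
{
    int   i = (int)h;            // offset of the left neighbor in the cache
    float w = h - (float)i;      // interpolation coefficient
    float2 v = row[i];
    return fmaf(w, v.y, v.x);    // v.x + w * (next - current)
}
\end{lstlisting}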
  
\begin{figure}[htb]
\centering
\includegraphics[width=\imgsizedefault\textwidth]{img/ycache.pdf}
\caption{\label{fig:ycache} Advanced Caching Mode. Two values are cached for each bin of a sinogram. The second value stores the difference between neighboring bins to allow faster interpolation. The shuffle instruction is used to get values from the bins cached by a different GPU thread.}
\end{figure} 
  
 The caching procedure is modified as shown in \algorithmname~\ref{alg:newcache}. To reduce the required inter-thread communication, each thread caches several consecutive bins. Communication is then only required to compute the second component of the last bin assigned to a thread. The shuffle instruction is used on Kepler and the newer NVIDIA architectures. On NVIDIA Fermi and all AMD GPUs, a read from shared memory is performed after a \emph{fence}-style synchronization. In this case the shared memory cache is also padded by one extra column to allow an unconditional read by the last thread in a group assigned to a projection row.

\input{alg_52_newcache}
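
 One possible CUDA realization of this caching step is sketched below for 16 threads per projection row and 48 cached bins; the boundary handling of the last thread and all identifiers are illustrative and do not correspond to \algorithmname~\ref{alg:newcache}. On CUDA versions before 9.0, \emph{\_\_shfl\_down} is used instead of \emph{\_\_shfl\_down\_sync}.

\begin{lstlisting}
#define RUN 3   // 48 bins / 16 threads: consecutive bins cached per thread

// Fill one row of the advanced cache. 'lane' is the thread index within the
// 16-thread group; consecutive warp lanes are assumed to cache consecutive
// runs of the same projection row.
__device__ void cache_advanced(cudaTextureObject_t sino, float2 *row,
                               int lane, int p, float hm)
{
    float v[RUN + 1];
    for (int j = 0; j < RUN; j++)    // fetch the bins assigned to this thread
        v[j] = tex2D<float>(sino, hm + lane * RUN + j + 0.5f, p + 0.5f);

    // The value after the last bin is the first value of the next thread.
    v[RUN] = __shfl_down_sync(0xffffffffu, v[0], 1);
    if (lane == 15)  // the shuffled value belongs to another projection row
        v[RUN] = tex2D<float>(sino, hm + 16 * RUN + 0.5f, p + 0.5f);

    for (int j = 0; j < RUN; j++)    // store (value, difference) pairs
        row[lane * RUN + j] = make_float2(v[j], v[j + 1] - v[j]);
}
\end{lstlisting}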

 In the case of a 32-by-32 pixel area, 16 threads per projection row are used on all platforms independent of the width of a shared memory bank. The banks are shifted between projection rows on the 64-bit platforms as explained in \sectionname~\ref{section:alurec_shmem}. For the 32-bit architectures it is enough to avoid bank conflicts within a half-warp only. Furthermore, there are also no bank conflicts between the threads of a half-warp as the stride is not a multiple of 4, see the illustration in \figurename~\ref{fig:ybanks}. A full warp is used per row if a thread block is assigned to process the larger 64-by-64 pixel area. The same number of steps is then required to process the complete projection row and, consequently, shared memory is accessed with the same stride without bank conflicts.


\begin{figure}[htb]
\centering
\includegraphics[width=\imgsizedefault\textwidth]{img/ybanks.pdf}
\caption{\label{fig:ybanks} The figure illustrates how the shared memory banks are accessed if the advanced caching mode is used, see \sectionname~\ref{section:ycache}. The presented layout is employed to reconstruct an area of 32x32 pixels. The grayed boxes indicate the banks accessed by a warp during the first caching pass. Two values are cached for each bin and, consequently, each bin spans two memory banks on the platforms with 32-bit shared memory. On these platforms it is also only necessary to avoid conflicts within a half-warp, so the accesses of the second half-warp are not shown. }
\end{figure}  

\subsection{Modeling}\label{section:modelling}
 The proposed method is relatively complex and utilizes multiple GPU subsystems. There are many ways to tune the proposed algorithm to better address the capabilities of a targeted architecture, and it is important to understand the limiting factors in each case. Here, we build a simplified performance model. First, we identify several distinct operations required to perform the back-projection:
\begin{enumerate}
\item The projection constants are loaded from constant memory, and the minimal required bin is computed to decide which data have to be cached.
\item The sinogram subsets are fetched from the texture and cached in shared memory.
\item For each reconstructed pixel, the corresponding position in a sinogram is determined.
\item The offset in the shared memory array is computed.
\item Depending on the requested interpolation type, one or two values are fetched and the contribution of a projection is added to the accumulator.
\end{enumerate}
 
These operations rely on several hardware components:
\begin{itemize}
\item \textbf{Constant memory} is used to retrieve projection constants.
\item \textbf{Texture Engine} is used to retrieve the sinogram values.
\item \textbf{Shared memory} is used while caching the data and retrieving the cached values to perform interpolation.
\item \textbf{ALUs} are used for general-purpose computations, particularly to perform projection and interpolation.
\item \textbf{SFUs} are used on the Kepler, Maxwell, and Pascal architectures to perform rounding operations, to convert data between floating-point and integer representations, and to perform bit-shifts. These instructions are used to compute offsets in the cache and to perform interpolations. While the bit-shifts are not used directly in the pseudo-code, they are implicitly utilized to resolve addresses in the constant and shared memory arrays. 
\end{itemize}

 Each of these components may limit the performance if its resources are exceeded. Furthermore, there is also a limit on the number of instructions which the SM is able to schedule per clock cycle. In particular, the warp scheduler on Fermi is limited to a single instruction per clock: if a memory instruction is launched, half of the ALUs are kept idle. To estimate the performance we assess the number of required operations according to the presented pseudo-code. We assume that the performance is capped either by the slowest of the components or by the total number of instructions. This is a very rough estimate. The developed kernels are resource intensive and are executed at a significantly reduced occupancy. It is difficult to predict how the compiler will generate the code to manage the available resources. Furthermore, some variables are moved to the slower local memory. The local memory is backed by the L1 cache, which shares the hardware with shared memory on the Fermi and Kepler GPUs. Consequently, operations on such variables not only increase latency, but may also penalize the shared memory performance. Nevertheless, the obtained estimates allow us to choose the required optimization strategy for each architecture.
 
 The instructions required to perform a single update of a pixel value are summarized below for the reconstruction using linear interpolation. Rounding/type-conversion (TC) and bit-shift (BS) operations are counted separately because they are scheduled differently on Maxwell/Pascal and Kepler GPUs. For each operation the normalization coefficient, i.e. the number of updates performed per the specified number of instructions, is indicated. A fused multiply-add (FMA) is counted as a single instruction.

\begin{enumerate}
\item Computing and caching of $h_m$ (per $n_t \mul n_q \mul n_v$ updates)
\begin{itemize}
\item \textbf{Constant Memory} (128-bit): $s_t$ 
\item \textbf{Shared Memory} (32-bit): $s_t$ (because a full warp is executed anyway)
\item \textbf{FP}: $4 s_t$  (to compute $h_b$ and $\shmem{h_m}$)
\item \textbf{TC}: $s_t$ (rounding)
\item \textbf{BS}: $2 s_t$ (resolving addresses in constant and shared-memory arrays)
\end{itemize}

\item Caching (per $n_t \mul n_q \mul n_v$ updates)
\begin{itemize}
\item \textbf{Texture Fetches}: $\frac{3}{2} \sqrt{n_t \mul n_q}$
\item \textbf{Shared Memory} (type-dependent, but always 64-bit in Advanced Caching Mode): $\frac{3}{2} \sqrt{n_t \mul n_q}$
\item \textbf{FP}: $4 s_t$ ($3$ if Advanced Caching Mode is not used)
\item \textbf{TC}: $s_t$  (integer to float conversion of projection number to perform texture fetch)
\item \textbf{BS}: $2 s_t$ (resolving addresses in the shared-memory array)
\end{itemize}

\item Setting inner-projection loop and evaluating required position in sinogram  (per $n_q \mul n_v$ updates):
\begin{itemize}
\item \textbf{Constant Memory} (64-bit): $1$ (only cosine and sine of the angle are required here)
\item \textbf{Shared Memory} (32-bit): $0.25 - 1$ (offsets for 4 projections can be loaded at once using a single 128-bit load if the loop is unrolled) %Also doesn't use bandwidth 1/4 on Pascal
\item \textbf{TC}: $2 - 3$ (computing $h$, updating loop index unless unrolled)
\item \textbf{BS}: $1 - 3$ (resolving addresses in the constant array and also in both shared memory caches unless the loop is fully unrolled)
\end{itemize}

\item Computing an offset in the cache and the coefficient for linear interpolation (per $n_v$):
\begin{itemize}
\item \textbf{FP}: $2$ (update to the next offset; computation of interpolation coefficient unless nearest neighbor mode is selected)
\item \textbf{TC}: $2$ (rounding and float-to-integer type conversion; only a single operation is required if nearest neighbor interpolation is performed)
\item \textbf{BS}: $1$ (resolving the address in the shared memory array) 
\end{itemize}

\item Linear Interpolation (for each update):
\begin{itemize}
\item \textbf{Shared Memory} (type dependent): $1$ ($2$ if Advanced Caching Mode is not used)
\item \textbf{FP}: $2$ (interpolation and update; $3$ operations if Advanced Caching Mode is not used and only $1$ if nearest neighbor interpolation is performed )
\end{itemize}
\end{enumerate}

 Further, a single-slice reconstruction ($n_v \eq 1$) using the advanced caching mode is evaluated. Blocks of 256 threads ($n_t \eq 256$) are assigned to process a 32-by-32 pixel square ($n_q \eq 4$). For the sake of simplicity we assume that 16 threads are used to cache a single projection row and that the inner projection loop is fully unrolled. We skip the texture fetches as their load is very low and certainly not a limiting factor here. Then, the following number of operations is estimated per single update:
\begin{itemize}
\item \textbf{Constant Memory}: 2.3 bytes (0.3 instructions)
\item \textbf{Shared Memory}: 8.7 bytes (1.1 instructions)
\item \textbf{FP}: 4.6 (counting FMA as a single operation)
\item \textbf{TC}: 2.0
\item \textbf{BS}: 1.3
\item \textbf{Instructions}: 8.2
\end{itemize}

 To verify these estimates, the prototype implementation was executed under the CUDA profiler. The estimated and measured numbers of operations are compared in \tablename~\ref{tbl:modelprof}. There is a difference, but the error is within 10\%.
 
\begin{table}[htb] %[htbp]
\begin{threeparttable}
\caption{\label{tbl:modelprof} Estimated and measured number of different operations required to perform back-projection using linear interpolation}
\centering
\noindent
\begin{tabularx}{\columnwidth}{ | X | l | l  l | l | l | }
\hline
           & SFU      & Int      & FP              &  Shared  & Constant  \\
Estimated  & 2.03125  & 1.3125   & 4.625           & 1.125    & 0.265625 \\ 
Measured   & 2.032125 & \multicolumn{2}{c|}{5.87}  & 1.265625 & 0.297875\\
\hline
\end{tabularx}
\begin{tablenotes}
\item The table gives the number of operations required to perform the back-projection of a single slice using linear interpolation, processing 4 pixels per GPU thread, and with the advanced caching mode enabled. The measured values are obtained on an NVIDIA GeForce Titan X (Pascal) using \emph{nvprof}. The SFU usage is represented by the value of the \emph{inst\_bit\_convert} metric. It is impossible to separate the number of integer multiplications from the other instructions executed on the ALUs; therefore, a common number is given based on the sum of the \emph{inst\_fp\_32} and \emph{inst\_integer} metrics. The shared memory operations are given as the sum of the counts of the \emph{shared\_store} and \emph{shared\_load} events. To estimate the number of constant memory operations, we have subtracted from the number of executed load/store instructions (the \emph{ldst\_executed} metric) all other memory operations reported by the \emph{shared\_store}, \emph{shared\_load}, and \emph{global\_store} events as well as all texture transactions counted by the \emph{tex\_cache\_transactions} metric.
\end{tablenotes}
\end{threeparttable}
\end{table}

 \tablename~\ref{tbl:model} evaluates the maximum performance according to each execution unit. The throughput is taken from \tablename~\ref{tbl:gpuspec} and the load is computed according to the list above using the following assumptions, which are explained in \sectionname~\ref{section:ilp}. NVIDIA Maxwell and Pascal are not restricted to the SFUs to perform bit-shifts, but are able to use the ALUs as well. On these devices we do not include the integer multiplications in the SFU balance; on NVIDIA Kepler we do. SFUs are either not available or not used on AMD GCN and NVIDIA Fermi, so all types of operations are counted together in the ALU balance.
%The instructions to constant and shared memory can be issued in parallel on AMD and NVIDIA Kepler architectures. Therefore, the loads from the constant memory are not included in the balance of memory operations. On other NVIDIA architectures either load from the constant memory or load from the shared memory can be dispatched. Therefore, we sum both. 

 As can be seen, the performance bottleneck is architecture dependent. The AMD VLIW and NVIDIA Kepler GPUs are bound by the ability to perform rounding operations and to convert variables between integer and floating-point representations. While not limiting the performance in the modeled configuration, this still sets a quite low threshold on the Maxwell and Pascal GPUs. However, the main limiting factor on these architectures is the shared memory bandwidth. The Fermi GPU is only capable of dispatching a single instruction per warp scheduler and is, consequently, bound by the instruction throughput. Finally, the AMD GCN devices are restricted by the performance of the algebraic units.

\begin{table}[htb] %[htbp]
\begin{threeparttable}
\caption{\label{tbl:model} Performance estimates according to model}
\centering
\noindent
\begin{tabularx}{\columnwidth}{ | c | r r r r | X | }
\hline
GPU     & Mem          & ALU          & SFU          & OPS         & Limit \\
\hline
GTX580  & 145          & \textbf{99}  & -            & \textbf{97} & Instructions \\
GTX680  & 188          & 334          & \textbf{77}  & 252         & SFU \\
Titan   & 325          & 577          & \textbf{133} & 436         & SFU \\
GTX980  & \textbf{234} & 432          & 315          & 628         & Memory \\
Titan X & \textbf{576} & 1062         & 776          & 1545        & Memory \\
HD5970  & 169          & 145          & \textbf{114} & 114         & SFU \\
HD7970  & 346          & \textbf{238} & -            & 1160        & ALU \\
R9-290  & 443          & \textbf{304} & -            & 1485        & ALU \\
\hline
\end{tabularx}
\begin{tablenotes}
\item The table gives the estimates of the maximum performance of the back-projection kernel and reports the performance bottleneck for each considered GPU. The numbers are given in giga-updates per second. The performance limit for each execution unit is evaluated separately and the minimum throughput bounding the kernel performance is highlighted in bold. The estimation is made for a kernel configured to use linear interpolation, process 4 pixels per GPU thread, and run in the single-slice reconstruction mode with the advanced caching enabled.
\end{tablenotes}
\end{threeparttable}
\end{table}

\subsection{Rounding Using Floating-Point Arithmetic}\label{section:alu_fancy}
The Kepler performance is severely limited by the rounding and type-conversion operations. The reason is the slow performance of the SFU units on the Kepler platform. In total, 3 SFU operations are required to compute the offset in shared memory and to perform the linear interpolation. 
\begin{lstlisting}
float $h_f = floor(h);$
int $h_i = (int)h_f;$
float $d = \shmem{d}[h_i];$
\end{lstlisting}

 Each of the listed instructions uses the SFU. The first instruction performs rounding and the second converts a floating-point number to an integer. The last operation involves a bit-shift to resolve the address of an array element: the array index has to be multiplied by the size of the data type, but a bit shift is performed in place of the multiplication because the data size is always a power of two. Instead, it is possible to perform the multiplication using floating-point numbers and operate with pointers directly, like:
\begin{lstlisting}
float $h_f = floor(h);$
int $h_i = (int)(4.f * h_f);$
float $d = *(float*)((void*)\shmem{d} + h_i);$
\end{lstlisting}

 Then, one of the 3 SFU instructions is replaced with 2 floating-point operations. However, it is possible to eliminate the SFU instructions entirely. Since the offsets are always small positive numbers, the rounding and type-conversion operations can be implemented using floating-point arithmetic only. The IEEE 754 specification defines the format of a single-precision floating-point number~\cite{ieee754}. It is illustrated in \figurename~\ref{fig:fancy}. For positive numbers, the representation is defined as:
\begin{equation}
f  = 2^{e-127} \cdot (1 + \sum f_i \cdot 2^{i-23})
\end{equation}
Consequently, all fractional components are eliminated if $2^{23}$ is added to the number.
\begin{equation}
f + 2^{23}  = 2^{23} \cdot (1 + \sum f_i \cdot 2^{i-23})
\end{equation}
The rounded number is obtained if $2^{23}$ is subtracted back afterwards. To compute \emph{floor()}, it is necessary to subtract 0.5 before these operations, i.e. the following implementation is suggested:
\begin{lstlisting}
float $e_{23} = exp2(23.f);$
float $h' = h - 0.499999f;$
float $h_{tmp} = h' + e_{23};$
float $h_f = h_{tmp} - e_{23};$
\end{lstlisting}

 The proposed method replaces a single SFU-based rounding instruction with 3 floating-point operations. The $e_{23}$ constant is computed only once at the beginning of the kernel and does not add much to the computation balance. It is further possible to perform the float-to-integer conversion using a simple integer subtraction, which is executed on the ALU. Small integer numbers are fully encoded by the fraction part of an IEEE 754 number. There are still some significant bits representing the exponent, but they can easily be eliminated as illustrated in \figurename~\ref{fig:fancy}. 
\begin{lstlisting}
int $h_i$ =  __float_as_int($h_{tmp}$) - 0x4B000001;
\end{lstlisting}

 The \emph{\_\_float\_as\_int} is a simple re-interpretation of a floating-point number as an integer. Using pointer arithmetic notation, it is equivalent to $*(int*)\&h_{tmp}$. Still, the index computation is left. It is often reasonable to keep some load on the SFUs as well. If the indexing is left unchanged, the $d[\_\_float\_as\_int(h_{tmp}) - 0x4B000001]$ is translated into a single \emph{iSCADD} operation combining the multiplication and the integer subtraction. It is executed on the SFU in a single clock cycle. Consequently, 2 SFU instructions are replaced with 3 floating-point operations and a single SFU instruction is left. The other option is to eliminate the SFU instructions entirely. It is possible with
\begin{lstlisting}
$h_{tmp}$ = 4 * $h_{tmp}$ - (4 - 1) * $e_{23}$;
void *addr = (void*)$\shmem{d}$ + __float_as_int($h_{tmp}$);
float $d$ = *(float*)(addr);
\end{lstlisting}
 In this case, 3 SFU instructions are replaced with 5 floating-point operations. The method to use depends on the expected operation balance, which can be estimated using the performance model proposed in \sectionname~\ref{section:modelling}. Either way the result is exact and there is no penalty to quality. 
 
 To perform nearest-neighbor interpolation, 2 SFU instructions are required on Kepler. One instruction can easily be replaced with a floating-point operation by performing the multiplication before the type conversion as explained above. Alternatively, the SFU instructions are completely replaced with 3 floating-point operations.
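
 For reference, the ALU-only path can be packed into a small device function as in the sketch below. The constant 0x4B000000 is the bit pattern of $2^{23}$ and assumes a zero-based cache index; the function name and the assumption that $h$ is a small positive offset are ours.

\begin{lstlisting}
// ALU-only computation of the cache offset and interpolation coefficient.
// h is assumed to be a small positive offset into the cached projection row.
__device__ float fetch_interp_fp(const float *row, float h)
{
    const float e23 = exp2f(23.f);          // 2^23; hoisted out of the loop
    float htmp = (h - 0.499999f) + e23;     // fraction bits now hold floor(h)
    float hf   = htmp - e23;                // floor(h) as a float
    float w    = h - hf;                    // interpolation coefficient
    int   hi   = __float_as_int(htmp) - 0x4B000000;  // floor(h) as an integer
    return (1.f - w) * row[hi] + w * row[hi + 1];
}
\end{lstlisting}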

\begin{figure}[htb]
\centering
\includegraphics[width=\imgsizedefault\textwidth]{img/fancy.pdf}
\caption{\label{fig:fancy} IEEE 754 representation of a single-precision floating-point number (top) and an example of how the standard integer representation is obtained in the fraction part by adding $2^{23}$ (bottom)}
\end{figure} 

 The performance along with the number of instructions issued per update is shown in \tablename~\ref{tbl:fancy} for the NVIDIA GTX Titan. A speed-up of 20\% is achieved if the rounding is implemented using floating-point instructions but the index computation is left on the SFU. The complete elimination of SFU instructions puts unnecessary load on the ALUs and keeps the SFU units idle. This method has little impact on the VLIW architecture. While the performance is limited by the throughput of integer instructions there, the difference between the performance of the floating-point and special units is not as high as on Kepler. Consequently, the performance is limited at approximately the same level.

\begin{table}[htb] %[htbp]
\begin{threeparttable}
\caption{\label{tbl:fancy} Different interpolation modes on Titan (Kepler) GPU}
\centering
\noindent
\begin{tabularx}{\columnwidth}{ | X | r | r r r | }
\hline
                  &             & \multicolumn{3}{c|}{Instructions} \\
Method            & Performance & FP     &  Integer  & Bit-convert \\
\hline
Standard          & 165 GU/s   & 4.5    & 1.4       & 2.03 \\
FP round          & 197 GU/s   & 7.5    & 1.4       & 0.03 \\
FP round \& index & 182 GU/s   & 8.5    & 1.4       & 0.03 \\
\hline
\end{tabularx}
\begin{tablenotes}
\item The table compares the performance of the 3 different rounding modes described in \sectionname~\ref{section:alu_fancy}. The performance is measured on the Kepler-based Titan GPU and the number of issued instructions is obtained using the NVIDIA profiler. The numbers of floating-point and integer operations are reported by the \emph{inst\_fp\_32} and \emph{inst\_integer} metrics, respectively. The integer counter includes both the additions/subtractions executed on the ALUs and the \emph{iSCADD} operations executed on the SFU. Consequently, the number of integer operations is constant because the \emph{iSCADD} instruction is just replaced with an integer addition. The bit-convert instructions are reported as \emph{inst\_bit\_convert} and actually represent the rounding and type-mangling operations.
\end{tablenotes}
\end{threeparttable}
\end{table}

\subsection{Half-float cache}\label{section:alu_half}
 As in the texture-based reconstruction algorithm, the half-float representation may be used to speed up the reconstruction. While the texture engine is not a performance-limiting factor here, the shared memory bottleneck on the Maxwell and Pascal architectures is lifted if half-float values are cached in shared memory. The values are converted to single precision just before computing the interpolation. Since the texture units are always used in the nearest-neighbor mode, it is possible to use the half-float data representation to speed up the reconstruction also when linear interpolation is performed. About 10\% performance increase is measured on Pascal and Maxwell if 4 slices are reconstructed in parallel and linear interpolation is performed. The high load on the SFU units, which perform the conversions between the half- and single-precision representations, prevents larger speed-ups. It is also the reason why no performance improvements are reported on the other platforms. On the professional series of Tesla cards with the Pascal architecture it could be possible to achieve a higher performance by keeping the computations in half precision all the way through. The results, however, are expected to suffer from additional precision degradation. In any case this is not feasible on the Titan X because of the significantly lower throughput of half-float arithmetic. 
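
 A minimal sketch of such a cache, assuming the \emph{cuda\_fp16} intrinsics and illustrative names, is given below.

\begin{lstlisting}
#include <cuda_fp16.h>

// Caching: store the fetched sinogram value as a 16-bit half.
__device__ void cache_half(__half *row, int j, float value)
{
    row[j] = __float2half(value);
}

// Interpolation: widen to single precision just before the computation.
// The half/float conversions are executed on the SFUs.
__device__ float interp_half(const __half *row, float h)
{
    int   i = (int)h;
    float w = h - (float)i;
    float a = __half2float(row[i]);
    float b = __half2float(row[i + 1]);
    return fmaf(w, b - a, a);
}
\end{lstlisting}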
 
\subsection{Additional Caches}\label{section:alu_hx}
 The Fermi performance is limited by the throughput of the ALU units and also by the number of instructions it is able to dispatch per clock cycle. Using the advanced caching mode, the interpolation footprint is reduced by a single instruction. The advanced caching is used on multiple platforms in the single-slice reconstruction mode; on Fermi, however, it also improves the performance if multiple slices are reconstructed in parallel. The vectors fetched from the texture engine are split into their components and cached using 2 or 4 independent caches. Furthermore, there is an additional option to slightly reduce the number of operations. Two \emph{FMA} instructions are required to find the required offset in the cache.
\begin{equation*}
        h \eq \shmem{h}_m[p] + \vx{f'_g} \mul \cmem{c_c}[p] - \vy{f'_g} \mul \cmem{c_s}[p]
\end{equation*}
This computation is performed once for each pixel the thread is responsible for. If 16 pixels are assigned to each thread, the impact on the overall computational balance is negligible. Fermi, however, is limited by the amount of available registers and, unlike the newer architectures, is restricted to processing only 4 pixels per thread. Therefore, it is relevant to reduce the number of instructions required to compute $h$. When the $\shmem{h}_m$ value is cached, only a single thread in every 16 is actually used to perform the caching. Instead, the cached value may include the $x$ component as well and utilize all threads with a minimal extra load. I.e. the following value is cached in shared memory instead of $\shmem{h}_m$:
\begin{equation*}
   \shmem{h}_x[m_p][\vx{m_t}] = \cmem{c}_a[p] + \vx{f_g} \mul \cmem{c_c}[p] - h_m
\end{equation*}
Then, 32 values are cached per projection row, but only one \emph{FMA} is used to compute the offset: 
\begin{equation*}
        h \eq \shmem{h}_x[p][\vx{m'_t}] - \vy{f'_g} \mul \cmem{c_s}[p]
\end{equation*}
 The amount of required shared memory is significantly increased, but there is no additional memory traffic. A warp either loads the same value, which is broadcast from a single shared memory bank, or up to 32 values are loaded and all banks are utilized. Furthermore, the cosine of the projection angle is no longer loaded from constant memory. Extra instructions, however, are dispatched unless special care is taken. As was mentioned in \sectionname~\ref{section:modelling}, 64- or 128-bit loads are performed to load $\shmem{h_m}$ if the projection loop is unrolled. This is possible because the values for consecutive projections are stored next to each other. Technically it is possible to organize the new cache to keep such an arrangement, but there is a better option which is independent of loop unrolling. The threads of the block are assigned to process a 16-by-16 pixel square at each step instead of the mapping proposed in \figurename~\ref{fig:alumap}. In \sectionname~\ref{section:newtex} the linear mapping scheme is motivated by the ability to maintain only a single index, because each thread then only needs to increment the $y$-coordinate. This property is preserved with the new caching scheme as the $x$ component is already included in the value loaded from the cache. Each thread processes 4 pixels with coordinates $(x,y)$, $(x + 16, y)$, $(x, y + 16)$, and $(x + 16,y + 16)$. It loads $\shmem{h}_x[p_i][x]$ and $\shmem{h}_x[p_i][x+16]$ using a single 64-bit instruction and only needs to increment the $y$-coordinate. The utilization of the shared memory bandwidth is increased as each thread needs to load 64 bits per projection instead of 32. But the total memory bandwidth is still exactly the same as in the base implementation due to the reduced requests to constant memory. About 5\% speed-up is achieved on the NVIDIA Fermi and AMD Tahiti architectures.

 A few other values can be cached to slightly shift the balance of operations. In some cases, it is beneficial to also cache the trigonometric constants in shared memory. This slightly improves the performance across the NVIDIA architectures. The $h_m$ value is normally computed multiple times by all threads responsible for caching a specific projection row. The extra load is not very high, but can be avoided at the price of several additional registers required to introduce a third stage in the reconstruction process. At first, the threads of a block are assigned to compute the $h_m$ values for 256 projections and cache them in shared memory. Then, the values are just loaded at each loop iteration. This was found useful on the AMD VLIW architecture. Vice versa, the caching of $h_m$ can be disabled altogether on systems with fast ALUs but slow shared memory. The suggested cache settings are summarized in \tablename~\ref{tbl:cacheconf}.
 
\input{table_6x_caches.tex}
 
\subsection{Managing Occupancy}\label{section:alu_occupancy}
  The number of pixels processed per block and per thread is one of the main parameters affecting the performance. The optimal configuration depends on the available GPU resources, but also on the size of the reconstructed image. The number of executed blocks could be insufficient to load the GPU evenly if large pixel blocks are used in conjunction with a small image. However, the texture-based approach is expected to perform better for small images in any case. To summarize the performance and optimal configuration we assume that a sufficiently large image is reconstructed and focus on the hardware capabilities only. Depending on the available GPU resources 4, 8, or 16 pixels are assigned per thread. In the last case each thread block is responsible for reconstructing an area of 64x64 pixels; otherwise, only 32x32 pixels are processed. A block of 128 threads is used to allow processing of 8 pixels per thread.

 Once the number of concurrently processed slices is fixed, there are still multiple factors affecting the performance. The optimal implementation of the proposed algorithm should ensure that:
\begin{itemize}
 \item Full occupancy is achieved to hide latencies efficiently.
 \item A large reconstruction area is assigned to each thread block to reduce the amount of caching operations per reconstructed pixel.
 \item As many pixels as possible are assigned to each GPU thread. This reduces the proportion of the auxiliary operations required to compute offsets in the cache and also ensures that a large number of independent instructions is in the execution flow, as required by the architectures relying on instruction-level parallelism (ILP).
 \item The number of threads assigned to cache each projection row is in accordance with the requirements specified in \tablename~\ref{tbl:shmemconf}. Then, no shared memory bank conflicts occur and the shared memory writes are executed optimally.
 \item The number of projections cached at each step is enough to utilize all threads in the block. Otherwise, the threads idling at the synchronization point reduce the effectively achieved occupancy.
 \item The projection loop is completely unrolled to provide additional ILP and to ensure that multiple 32-bit memory operations can be combined into a single 64-/128-bit instruction.
 \item The generated code is able to issue multiple load operations in a streaming fashion as explained in \sectionname~\ref{section:code_flow}. This reduces the penalty inflicted by memory access latencies if other mechanisms fail to hide them entirely.
 \item All appropriate optimizations discussed through this section are implemented.
\end{itemize}

 Due to hardware limitations it is impossible to achieve all these goals simultaneously. The number of required registers increases steeply with the number of pixels assigned per thread and restricts the achieved occupancy. Using the multi-slice reconstruction mode, either a high occupancy or a high number of pixels per thread can be achieved. The amount of available shared memory restricts how many projections can be cached at the desired occupancy level. If only few projections can be cached, a high effective occupancy is still possible in the caching stage if more threads are used to cache each projection or smaller 128-thread blocks are in use. The first option is only available if 16 pixels are reconstructed per thread; consequently, a high number of registers is required in both cases. Furthermore, the number of threads assigned per projection is in turn restricted if shared memory is optimized for 64-bit writes. Most of the proposed optimizations increase the usage of registers and/or shared memory. The use of additional caches could have a negative overall impact if the increased shared memory footprint results in a lower number of cached projections or reduces the achieved occupancy. The streaming loads cause a significant increase of consumed registers and definitively reduce the occupancy. Hints to the compiler reducing the unrolling of the inner projection loop are used to prevent this. Furthermore, the desired occupancy can be targeted on the NVIDIA platform. Forcing a higher occupancy may result in additional computational load and may cause the compiler to back part of the local variables with the slower local memory instead of registers. Vice versa, under low occupancy the compiler may be able to increase ILP and perform the stream-loading more efficiently. 
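
 On CUDA, the occupancy target can be expressed with the \emph{\_\_launch\_bounds\_\_} qualifier; the sketch below requests 4 resident blocks of 256 threads per SM, i.e. roughly 50\% occupancy on an SM with 2048 resident threads (the kernel name and parameters are illustrative).

\begin{lstlisting}
// Limit register usage so that at least 4 blocks of 256 threads stay resident
// per SM (about 50% occupancy on SMs with 2048 resident threads).
__global__ void __launch_bounds__(256, 4)
bp_alu_kernel(cudaTextureObject_t sino, int n_proj, float *slice, int pitch)
{
    /* ... reconstruction code as described above ... */
}
\end{lstlisting}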
 
 The importance of the described aspects differs between architectures and an optimal compromise has to be found for each targeted platform. We found that targeting 50\% occupancy is optimal across the majority of architectures. On the platforms with a larger register bank, an occupancy above 50\% is achieved by default. If 50\% is targeted, more registers are available for data streaming, which often has a positive impact on the performance. Vice versa, on the systems with a small register bank the default occupancy is typically low and restricting the amount of used registers to ensure 50\% occupancy results in faster code. We also found that it is important to keep at least 50\% of the threads busy in the caching stage. Above this threshold the under-utilization has an impact, but a relatively insignificant one. Therefore, to cope with a shortage of shared memory, the number of cached projections is decreased in steps of 4.
 
 The Maxwell and Pascal GPUs have a large amount of both shared memory and registers, but are bound by the shared memory bandwidth. An area of 64-by-64 pixels is processed by each thread block on these platforms in order to reduce the amount of shared memory writes. In the linear interpolation mode the amount of shared memory operations is well balanced with the ALU throughput. The streaming of memory reads is not required if the shared memory loads are interleaved with the ALU- and SFU-bound interpolation instructions. Thus, 100\% occupancy is targeted and a speed-up of 15\% is measured. This is not the case in the nearest-neighbor interpolation mode. The shared memory bandwidth is the bottleneck in this case and the performance is improved if multiple shared memory loads are streamed together. Consequently, significantly more registers are required. By default the CUDA compiler does not utilize the streaming capabilities fully, but runs at 62\% occupancy. Restricting the occupancy to 50\% allows more loads to be streamed together and improves the performance by 7\%. The impact of occupancy on the performance for both the linear and nearest-neighbor interpolation modes is reviewed in \tablename~\ref{tbl:occupancy}.

\begin{table}[htb] %[htbp]
\begin{threeparttable}
\caption{\label{tbl:occupancy} The effect of occupancy-targeting on the performance for NVIDIA Titan X GPU}
\centering
\noindent

\begin{tabularx}{\columnwidth}{ | X | r | r | r | r | }
\multicolumn{5}{c}{\textbf{Linear Interpolation Mode}} \\
\hline
Target &  Registers & Local Memory & Occupancy & Performance \\
\hline
-		& 40 	& - 			& 75\%	& 565 GU/s \\
50\%	& 64 	& - 			& 50\%	& 570 GU/s \\
100\%	& 32 	& 8~bytes	    & 100\%	& \textbf{620 GU/s} \\
\hline
\multicolumn{5}{c}{}\\
\multicolumn{5}{c}{\textbf{Nearest Neighbour Interpolation Mode}} \\
\hline
Target &  Registers & Local Memory & Occupancy & Performance \\
\hline
-		& 48 	& - 			& 62\%	& 1082 GU/s \\
50\%	& 64 	& - 			& 50\%	& \textbf{1158 GU/s} \\
100\%	& 32 	& 40~bytes	    & 100\%	& 954 GU/s \\
\hline

\hline
\end{tabularx}

\begin{tablenotes}
\item A single-slice reconstruction is executed with the settings configured according to \tablename~\ref{tbl:alurec}, with the only exception of the occupancy, which is set as specified in the \emph{Target} column. The best performance is highlighted in bold. 
\end{tablenotes}
\end{threeparttable}
\end{table}

% Or shall we tell 48? 55 regs if not __LAUNCH_BLOCK__ statement at all. 48 if we just specify number of threads per block. The 40 bytes of lmem, however, are used in this case.
 The Kepler GPUs have the same amount of registers as Maxwell and Pascal. However, a more aggressive unrolling is required and is performed by the CUDA compiler to ensure wide memory accesses and to enable a longer flow of independent instructions. The ILP is required to allow the 4 warp schedulers to utilize all 6 ALU blocks integrated in the Kepler SM. Consequently, an increased number of registers is used to execute the same code. For instance, the reconstruction based on linear interpolation as discussed in the previous paragraph uses 55 registers if compiled for the Kepler architecture (compute capability 3.5) instead of only the 40 registers required if the Pascal architecture is targeted. The performance at 100\% occupancy is sub-optimal if linear interpolation is performed. On the other hand, the 64 registers available at 50\% occupancy are not enough to enable efficient streaming of the shared memory loads. Therefore, a small pixel area of 32-by-32 pixels is reconstructed per block and the block is reduced to 128 threads only. The last point is important to keep a high level of ILP and also to achieve full thread utilization in the caching stage, as the number of cached projections is limited due to the low amount of available shared memory. Using nearest-neighbor interpolation, there are enough registers to organize the stream-loading at 50\% occupancy also for the larger pixel area. The Fermi architecture includes only half of the Kepler registers and is bound to a 32-by-32 pixel area in all interpolation modes. While the amount of shared memory is the same as on Kepler, fewer blocks are required to achieve full occupancy here. Consequently, it is possible to cache more projections at 50\% occupancy. On GT200 the amount of registers is even lower and it is not possible to implement the proposed scheme with sufficiently high performance.

 While there is no option to instruct the compiler on the desired occupancy on the AMD GCN devices, the used caches are selected to ensure that at least 50\% occupancy can be achieved. The VLIW architecture needs to issue 4-5 independent instructions at each clock. Therefore, it is important to ensure a very large ILP even at the price of a significantly reduced occupancy. The larger area is assigned to a thread block for a single-slice reconstruction and a smaller thread block is used to process 8 pixels per thread in all other cases. The algorithm runs at about 35\% of the maximum occupancy.

  \tablename~\ref{tbl:alurec} summarizes the proposed configuration and gives the measured performance. If only a single slice is available for reconstruction, the new algorithm outperforms the texture-based version across all considered architectures. The maximum speed is better on Fermi and on all AMD architectures if linear interpolation is performed. Using nearest-neighbor interpolation, the performance is improved on the Kepler GPUs and, if the quality is not to be compromised by the half-float data representation, also across all target architectures.

\input{table_6x_alurec.tex}

 The GPU-specific tuning has a major positive effect on the performance. However, new architectures are announced yearly, and a thorough study would be required to adjust the parameters accordingly. Furthermore, the generated code varies significantly between devices of different compute capabilities even within the same architecture family. While we have not studied it in detail, there are also differences depending on the CUDA version. To avoid manual work, the actual configuration can be parametrized and a quick search in the parameter space executed to find the optimal settings. The parameters may include numeric options like the targeted occupancy, the number of cached projections, the unrolling factor, etc. Switching specific optimizations and caches on and off is also feasible. The automated approach would not deliver the optimal performance if new functional blocks are introduced, like the Tensor Cores of the recently announced NVIDIA Volta architecture. However, it can address shifts in the operation balance.