In the first stage of the algorithm, the number of threads assigned to cache each projection is adjusted to optimize access to the shared memory. If a large 64-by-64 area is reconstructed, a full warp of 32 threads can be assigned to each projection row, which rules out bank conflicts. Unfortunately, this is not optimal on the Kepler architecture because two bins can then no longer be combined into a single 64-bit shared memory write, as explained above. Assigning 32 threads per row is not an option for a smaller 32-by-32 area either: only 48 bins have to be cached per projection in this case, and half of the threads would idle part of the time. Therefore, each warp processes several projection rows in both cases. This may result in bank conflicts. If only a single slice is reconstructed, however, the banks are shifted from one projection row to the next as illustrated in \figurename~\ref{fig:banks}. On platforms with 32-bit shared memory banks, the caching is performed without bank conflicts if groups of 16 threads are used per projection row. On the Kepler platform, only 8 threads are used per projection row instead, and each thread writes two values into the cache. If a 64-by-64 area is assigned to a thread block, 16 threads are used on Kepler to process 32 bins per iteration. For the multi-slice reconstruction modes, 16 threads per projection are optimal on all platforms. On Kepler, the \emph{float2} sinogram with 64-bit banks behaves exactly like the \emph{float} sinogram with 32-bit banks, i.e.\ 16 threads per projection are required to avoid bank conflicts. On other platforms, it is enough to prevent bank conflicts within groups of 16 threads when dealing with 64-/128-bit data, so no conflicts arise if 16 threads are used per projection row.
Two shared memory buffers, however, are used as explained above if the 4-slice reconstruction is executed on any AMD device or on a Fermi-based NVIDIA GPU. The optimal settings for each reconstruction mode are summarized in \tablename~\ref{tbl:shmemconf}.
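
The bank shift between consecutive projection rows in the single-slice case can be checked with a small model. The following Python sketch is an illustration only, not part of the reconstruction code. It assumes 32 banks, a warp of 32 threads, and projection rows stored back to back with a pitch of 48 bins. The function \texttt{conflict\_degree} and its parameters are hypothetical names introduced here. The model covers only accesses whose width does not exceed the bank width and ignores how the hardware splits wider accesses.

```python
WARP_SIZE = 32
NUM_BANKS = 32  # assumed number of shared memory banks


def conflict_degree(threads_per_row, values_per_thread, bank_words, row_pitch=48):
    """Worst-case number of distinct bank entries one warp requests from a bank.

    threads_per_row   -- threads of the warp assigned to one projection row
    values_per_thread -- consecutive 32-bit bins written by each thread
    bank_words        -- bank width in 32-bit words (1: 32-bit, 2: 64-bit banks)
    row_pitch         -- bins stored per projection row (48 for a 32-by-32 area)

    A result of 1 means the warp writes without bank conflicts.
    """
    entries_per_bank = {}
    for tid in range(WARP_SIZE):
        # Consecutive groups of threads handle consecutive projection rows.
        row, lane = divmod(tid, threads_per_row)
        first = row * row_pitch + lane * values_per_thread
        for word in range(first, first + values_per_thread):
            entry = word // bank_words  # index of the bank-wide memory entry
            entries_per_bank.setdefault(entry % NUM_BANKS, set()).add(entry)
    # Threads touching the same entry are served together, so only distinct
    # entries that map to the same bank count as a conflict.
    return max(len(entries) for entries in entries_per_bank.values())


if __name__ == "__main__":
    print(conflict_degree(16, 1, 1))  # 32-bit banks, 16 threads per row
    print(conflict_degree(8, 2, 2))   # 64-bit banks, 8 threads writing 2 bins each
    print(conflict_degree(16, 1, 1, row_pitch=64))  # hypothetical 64-bin pitch
```

Under these assumptions, both configurations described above yield a degree of 1: with a pitch of 48 bins, each row starts 16 banks (32-bit mode) or 24 banks (64-bit mode) after the previous one, so the groups of a warp land on disjoint banks. With a hypothetical pitch of 64 bins, the rows would start at the same bank and the 16-thread configuration would produce a 2-way conflict.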
\includegraphics[width=0.45\textwidth]{img/banks.pdf}
\caption{\label{fig:banks} The figure illustrates how the warps are assigned to cache a subset of a sinogram on systems with 32-bit and 64-bit shared memory. For each projection, the 48 bins required to reconstruct an area of 32-by-32 pixels are cached. The shared memory banks backing each group of 16 bins are specified assuming that the 32-bit data format is used.}