As four threads are now required for each output pixel, it is necessary either to quadruple the size of the computational grid or to increase the number of pixels processed by each thread. To process a projection, the GPU threads need to locate the bins contributing to the values of the reconstructed pixels. This is done using the pixel coordinates and several projection constants that are stored in GPU memory. If a thread reconstructs several pixels, the constants are loaded only once and then re-used to locate the required bins for all of these pixels. Therefore, to reduce the load on GPU memory, it is better to increase the workload of the GPU threads than to enlarge the grid dimensions. Each GPU thread then computes the contributions to 4 output pixels from a quarter of the available projections.

The following mapping is adopted. The assignment of thread blocks is kept exactly the same as in the standard version: each thread block is responsible for an output area of 16-by-16 pixels. However, this area is further subdivided into 16 squares of 4-by-4 pixels. Since there are 256 threads in a block and 64 threads are needed per square (16 pixels with 4 threads each), 4 squares are processed in parallel, and the complete set of 16 squares requires 4 iterations. There are several ways to arrange the mapping between GPU threads and pixel squares, see \figurename~\ref{fig:mappings}. The first mapping is sparse and results in a lower cache hit rate than the other options. The second mapping requires more registers than the third one, in which only a single pixel coordinate is incremented for each thread. Since the use of additional registers may reduce occupancy or cause registers to spill into local memory, the third approach is preferred.
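The index arithmetic behind such a mapping can be sketched on the host side. The following C++ fragment is only an illustration under stated assumptions: the text does not give the exact layout of the third mapping, so the sketch assumes that the four squares processed in parallel form a 16-by-4 pixel strip and that only the vertical coordinate is advanced between iterations. All names (mapThread, coversAreaExactlyOnce, PixelTask and the constants) are hypothetical and do not come from the actual implementation.

```cpp
#include <cassert>

// All names below are hypothetical; the layout is an assumed reading of the
// third mapping, not the authors' kernel code.
constexpr int kAreaSide = 16;         // a thread block covers 16-by-16 pixels
constexpr int kSquareSide = 4;        // the area is split into 4-by-4 pixel squares
constexpr int kThreadsPerBlock = 256;
constexpr int kProjectionGroups = 4;  // each thread handles a quarter of the projections
constexpr int kPixelsPerIteration = kThreadsPerBlock / kProjectionGroups;  // 64
constexpr int kIterations = kAreaSide * kAreaSide / kPixelsPerIteration;   // 4

struct PixelTask {
    int x;                // pixel column inside the 16-by-16 area
    int y;                // pixel row inside the 16-by-16 area
    int projectionGroup;  // which quarter of the projections this thread sums
};

// Maps a thread index and an iteration number to the pixel it updates.
// Assumption: the 64 pixels of one iteration form a 16-by-4 strip (four
// 4-by-4 squares side by side), so only y changes between iterations.
inline PixelTask mapThread(int tid, int iteration) {
    assert(tid >= 0 && tid < kThreadsPerBlock);
    assert(iteration >= 0 && iteration < kIterations);
    const int group = tid / kPixelsPerIteration;  // 0..3
    const int local = tid % kPixelsPerIteration;  // 0..63
    const int x = local % kAreaSide;              // fixed for the thread
    const int yBase = local / kAreaSide;          // 0..3, fixed for the thread
    return PixelTask{x, yBase + kSquareSide * iteration, group};
}

// Returns true if every (pixel, projection group) pair of the area is
// visited exactly once over all threads and iterations.
inline bool coversAreaExactlyOnce() {
    int visits[kAreaSide][kAreaSide][kProjectionGroups] = {};
    for (int it = 0; it < kIterations; ++it) {
        for (int tid = 0; tid < kThreadsPerBlock; ++tid) {
            const PixelTask t = mapThread(tid, it);
            ++visits[t.y][t.x][t.projectionGroup];
        }
    }
    for (int y = 0; y < kAreaSide; ++y)
        for (int x = 0; x < kAreaSide; ++x)
            for (int g = 0; g < kProjectionGroups; ++g)
                if (visits[y][x][g] != 1) return false;
    return true;
}
```

In a real kernel the block offset would be added to both coordinates, and the four partial sums computed for one pixel would still have to be combined. The sketch only shows that, under the assumed layout, each thread keeps its horizontal coordinate and projection group fixed and advances a single coordinate by 4 per iteration.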
\includegraphics[width=0.45\textwidth]{img/mappings.pdf}
\caption{\label{fig:mappings} Possible ways to assign a block of GPU threads to an area of 16-by-16 pixels. Since 4 projections are processed at once, only 64 threads per projection are available for the entire area, and it takes 4 iterations to process it completely. For each scheme, the pixels processed in parallel during the first iteration are shown in blue. The first mapping (left) is sparse and results in more cache misses. The second mapping (center) requires more registers and may reduce occupancy. Therefore, the third mapping (right) is preferred. }