\section{Tomographic Reconstruction} \label{section:tomo}
 At synchrotron imaging beamlines, information about X-ray attenuation and/or phase changes in the sample is used to reconstruct its internal structure. The object is placed on a rotation stage in front of a pixel detector and rotated in equiangular steps. As the object rotates, the pixel detector registers a series of two-dimensional intensity images of the incident X-rays. Typically the X-rays are not detected directly but converted to visible light by a scintillator placed between the sample and the pixel detector. Conventional CCD cameras are then used to record the intensities, which correspond to projections of the sample volume. Due to the rather large source-to-sample distance, imaging at synchrotron light sources is usually well described by a parallel-beam geometry. The beam direction is perpendicular to the rotation axis and to the rows of the pixel detector. Therefore, the 3D reconstruction problem can be split into a series of 2D reconstructions performed on cross-sectional slices. The origin of the coordinate system coincides with the center of the sample rotation stage, and the rotation axis is anti-parallel to gravity. To reconstruct a slice, the projection values are ``smeared'' back over the 2D cross section along the direction of incidence and accumulated over all projection angles. To compensate for the blurring inherent to this procedure, high-pass filtering of the projection data is performed prior to back projection~\cite{natterer2001mathematical}.
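In discrete form this procedure can be summarized, up to a constant normalization factor and in notation introduced here for clarity ($f$ for the reconstructed slice, $q_p$ for the filtered projection acquired under the $p$-th rotation angle, $N_p$ for the number of projections), as
\begin{equation}
f(x,y) \approx \sum_{p=0}^{N_p-1} q_p\bigl(r_p(x,y)\bigr),
\end{equation}
where $r_p(x,y)$ is the detector coordinate at which the ray passing through the point $(x,y)$ is recorded; it is given explicitly by Eq.~\ref{eq:bin} below.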

%\begin{figure}[htb]
%\centering
%\includegraphics[width=\imgsizedefault\textwidth]{img/recon2.eps}
%\caption{\label{recon} Image reconstruction using a simple back projection. The sample
%is rotated and a pixel detector acquires a series of parallel projections under
%different rotation angles (left). For image reconstruction, all projections are
%"smeared" back onto the cross section along the direction of incidence
%yielding an integrated image (right).}
%\end{figure}

 The typical reconstruction data flow using parallel accelerators is shown in \figurename~\ref{dataflow}. The projections are loaded into the system memory either from a storage facility or directly from a camera and are then transferred into the GPU memory before the pre-processing and reconstruction steps are executed. Cameras equipped with a PCIe interface can also transfer the projections directly into the GPU memory using the GPUDirect or DirectGMA technologies; the latter is supported by the UFO framework~\cite{vogelgesang2016dgma}. The loaded projections are pre-processed with a chain of filters to compensate for the defects of the optical system. Then, the projections are rearranged to group together the chunks of data required to reconstruct each slice. These chunks are called \emph{sinograms} and are distributed between the parallel accelerators available in the system in a round-robin fashion. Filtering and back projection of each slice are performed independently on the assigned GPU; the results are transferred back and are either stored or passed on for online processing and visualization.

 To utilize the system resources efficiently, all described steps are usually pipelined. The output volume is divided into multiple subvolumes, each encompassing multiple slices. The data required to reconstruct a subvolume is loaded and sent further through the pipeline. While the next portion of data is loaded, the already loaded data is pre-processed, assembled into sinograms, and reconstructed. The pre-processing is significantly less computationally intensive than the reconstruction and is often, but not always, performed on CPUs; OpenCL, OpenMP, or POSIX threads are used to utilize all CPU cores. The pre-processed sinograms are then distributed between the GPUs for reconstruction. For each GPU a separate data pipeline is started: while one sinogram is transferred into the GPU memory, the sinograms already residing there are first filtered, then back projected onto the resulting slice, and finally transferred back to the system memory. Event-based asynchronous APIs and double buffering are used to execute data transfers in parallel with the reconstruction. Such an approach allows all system resources, including disk/network I/O, the PCIe bus, CPUs, and GPUs, to be used in parallel.
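As an illustration of this pipelining scheme, the sketch below overlaps sinogram uploads, per-slice reconstruction, and result downloads using two CUDA streams and double buffering. It is a minimal example under simplifying assumptions (one sinogram per slice, pinned host buffers, and a placeholder \texttt{filter\_and\_backproject()} standing in for the actual reconstruction kernels); none of the identifiers belong to the reviewed frameworks.

\begin{verbatim}
#include <cuda_runtime.h>

// Placeholder for the per-slice filtering and back-projection kernels.
void filter_and_backproject(const float *d_sino, float *d_slice, cudaStream_t s);

// Reconstruct n_slices slices, overlapping PCIe transfers with GPU computation.
// host_sinograms and host_slices are assumed to be pinned (cudaHostAlloc) so
// that cudaMemcpyAsync can run asynchronously.
void reconstruct_pipelined(const float *host_sinograms, float *host_slices,
                           size_t n_slices, size_t sino_floats, size_t slice_floats)
{
    cudaStream_t stream[2];
    float *d_sino[2], *d_slice[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&d_sino[i], sino_floats * sizeof(float));
        cudaMalloc(&d_slice[i], slice_floats * sizeof(float));
    }

    for (size_t s = 0; s < n_slices; ++s) {
        int b = s % 2;  // alternate between the two buffers/streams
        // Upload the next sinogram while the other stream still reconstructs.
        cudaMemcpyAsync(d_sino[b], host_sinograms + s * sino_floats,
                        sino_floats * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        filter_and_backproject(d_sino[b], d_slice[b], stream[b]);
        cudaMemcpyAsync(host_slices + s * slice_floats, d_slice[b],
                        slice_floats * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaStreamDestroy(stream[i]);
        cudaFree(d_sino[i]);
        cudaFree(d_slice[i]);
    }
}
\end{verbatim}

Within one stream the upload, kernels, and download are serialized, while operations issued to the two streams may overlap, so the PCIe bus and the GPU are kept busy simultaneously, mirroring the pipelined architecture of \figurename~\ref{dataflow}.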

\begin{figure}[htb]
\centering\includegraphics[width=\imgsizedefault\textwidth]{img/dataflow.pdf}
\caption{\label{dataflow} The data flow in image reconstruction framework. The data is split in blocks and processed using pipelined architecture to efficiently use all system resources. }
\end{figure}

 A single row from each of the projections is required to reconstruct one slice of the output 3D volume. These rows are grouped together in a sinogram. For the sake of simplicity, we refer to these rows as \emph{projections} while discussing reconstruction from sinograms. Each slice is reconstructed independently. First, each sinogram row is convolved with a high-pass filter to reduce blurring, an effect inherent to back projection. The convolution is normally performed as a multiplication in the Fourier domain, and the implementation is based on available FFT libraries: NVIDIA \emph{cuFFT} is used on the CUDA platform, and either AMD \emph{clFFT} or Apple \emph{oclFFT} is utilized with OpenCL. For optimal FFT performance, multiple sinogram rows are transformed to and from the Fourier domain together using the batched transformation mode. After filtering, the buffer with the filtered sinograms is either bound to a texture on the CUDA platform or copied into a texture if OpenCL is used. The pixel-driven approach is used to compute the back projection: for each pixel $(x,y)$ of the resulting slice, the contributions of all projections are summed. This is done by computing the position $r_p$ at which the corresponding back-projection ray originates and interpolating between the values of the projection bins around this position.
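A possible shape of this filtering step on the CUDA platform is sketched below using batched \emph{cuFFT} real-to-complex transforms. The function and variable names (\texttt{filter\_rows}, \texttt{d\_filter}, \ldots) are illustrative, the zero-padding usually applied to avoid circular convolution is omitted for brevity, and the normalization of the unnormalized cuFFT transforms is assumed to be folded into the filter coefficients.

\begin{verbatim}
#include <cuda_runtime.h>
#include <cufft.h>

// Multiply every spectrum by the (real-valued) high-pass filter.
// The same filter is applied to each of the batched rows.
__global__ void apply_filter(cufftComplex *spec, const float *filter,
                             int n_freq, int n_rows)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_freq * n_rows) {
        float w = filter[i % n_freq];
        spec[i].x *= w;
        spec[i].y *= w;
    }
}

// Filter n_rows sinogram rows of length row_len in place.
// d_spec must hold n_rows * (row_len/2 + 1) cufftComplex values.
void filter_rows(float *d_rows, cufftComplex *d_spec, const float *d_filter,
                 int row_len, int n_rows)
{
    int n_freq = row_len / 2 + 1;       // length of a real-to-complex spectrum
    cufftHandle fwd, inv;
    cufftPlan1d(&fwd, row_len, CUFFT_R2C, n_rows);   // batched forward FFTs
    cufftPlan1d(&inv, row_len, CUFFT_C2R, n_rows);   // batched inverse FFTs

    cufftExecR2C(fwd, d_rows, d_spec);
    int threads = 256;
    int blocks = (n_freq * n_rows + threads - 1) / threads;
    apply_filter<<<blocks, threads>>>(d_spec, d_filter, n_freq, n_rows);
    cufftExecC2R(inv, d_spec, d_rows);  // overwrite the rows with the result

    cufftDestroy(fwd);
    cufftDestroy(inv);
}
\end{verbatim}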
 
   If $\alpha$ is the angle between consecutive projections, the positions are computed as follows:
\begin{equation}\label{eq:bin}
r_p(x,y) = x\cdot\cos(p\alpha) - y\cdot\sin(p\alpha)
\end{equation} 
 
  As the computation of trigonometric functions is relatively slow on all GPU architectures, the values of $\cos(p\alpha)$ and $\sin(p\alpha)$ are normally pre-computed on the CPU for all projections, transferred to the GPU constant memory, and then re-used for each slice. With this optimization, the back-projection performance is essentially determined by how fast the interpolation can be performed. Two interpolation modes are generally used: nearest-neighbor interpolation is faster and better at preserving edges, while linear interpolation reconstructs textures better. More sophisticated interpolation algorithms can be used as well, but they are significantly slower and are rarely if ever employed. All reviewed reconstruction frameworks rely on the GPU texture engine to perform the interpolation, a technique first proposed in the early nineties for the \emph{SGI RealityEngine}~\cite{cabral1994}.
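A minimal sketch of such a pixel-driven back-projection kernel is given below. It assumes that the filtered sinogram has already been placed in a texture configured for the desired interpolation mode and that the trigonometric tables have been uploaded to constant memory with \texttt{cudaMemcpyToSymbol}; the identifiers (\texttt{c\_cos}, \texttt{c\_sin}, \texttt{axis\_pos}, \texttt{MAX\_PROJ}) are illustrative and not taken from the reviewed frameworks.

\begin{verbatim}
#include <cuda_runtime.h>

#define MAX_PROJ 4096   // upper bound on the number of projection angles

// Pre-computed cos(p*alpha) and sin(p*alpha), one entry per projection,
// uploaded once from the host and re-used for every slice.
__constant__ float c_cos[MAX_PROJ];
__constant__ float c_sin[MAX_PROJ];

// Pixel-driven back projection: each thread accumulates the contributions
// of all projections for one pixel (ix, iy) of the output slice.
__global__ void backproject(cudaTextureObject_t sinogram, float *slice,
                            int width, int n_proj, float axis_pos)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= width || iy >= width)
        return;

    // Pixel coordinates relative to the rotation axis.
    float x = ix - axis_pos;
    float y = iy - axis_pos;

    float sum = 0.0f;
    for (int p = 0; p < n_proj; ++p) {
        // Detector position r_p(x,y) = x*cos(p*alpha) - y*sin(p*alpha),
        // shifted back to detector-bin coordinates.
        float r = x * c_cos[p] - y * c_sin[p] + axis_pos;
        // The texture engine performs the nearest-neighbor or linear
        // interpolation, depending on the texture configuration;
        // the 0.5f offsets address the texel centers.
        sum += tex2D<float>(sinogram, r + 0.5f, p + 0.5f);
    }
    slice[iy * width + ix] = sum;
}
\end{verbatim}

The kernel reads the projection values through the texture cache, which matches the spatial locality of the access pattern, while the resulting slice is written to global memory in a coalesced fashion.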


% Explain C2C optimization ?