/articles/toma : revision 34

To get this branch, use:

bzr branch
http://darksoft.org/webbzr/articles/toma

« back to all changes in this revision

Viewing changes to section_4x_tomo.tex

Committer: Suren A. Chilingaryan
Date: 2017-12-23 08:49:35 UTC
Revision ID: csa@suren.me-20171223084935-yg4j912ehufjz6d0

Fix cross-references and some latex complaints

Show diffs side-by-side

added added

removed removed

section_4x_tomo.tex

Due to rather large source-to-sample distances, imaging at synchrotron light sources is usually well described by a parallel-beam geometry. The beam direction is perpendicular to the rotation axis and to the lines of the pixel detector. Therefore, the 3D reconstruction problem can be split up into a series of 2D reconstructions performed with cross-sectional slices. A coordinate system is defined so that the center of rotation is located at the center of the sample coordinate system and the sample is rotating around the vertical axis. According to FBP algorithm, to determine a function of the sample object at a given position with coordinates $(x,y)$ in the slice $z$, $\sum_{p=1}^P I_p(x\cdot\cos(p\alpha) - y\cdot\sin(p\alpha), z)$ is computed over all projections, where $P$ is the number of projections, $\alpha$ is the angle between projections, and $I_p$ is the image of $p$-th projection. For a given slice, this corresponds to smearing back the projection values over the 2D cross section, and integrating over all projection angles, see \figurename~\ref{recon}. Since the images are digitized, a linear interpolation is performed to get the values of $I_p$ at the computation point. To compensate the blurring effect, high-pass filtering of the projection data prior to back projection is performed.

%ToDo: Add stuff about filtering.

\begin{figure}[h]

\begin{figure}[htb]

\centering

\includegraphics[width=0.45\textwidth]{img/recon2.eps}

\caption{\label{recon} Image reconstruction using a simple back projection. The sample

The typical reconstruction data flow using parallel accelerators is represented on \figurename~\ref{dataflow}. The projections are loaded into the memory either from the storage or directly from the camera. In most cases, the data is loaded in the system memory first and transferred to the GPU memory before pre-processing or reconstruction stages. However, if used together with a custom camera with PCIe-interface, the UFO framework also allow direct transfer of the projection data into the GPU memory using DirectGMA technology~\cite{vogelgesang2016dgma}. The loaded projections are preprocessed with a chain of filters to compensate the defects of optical system. Then, the projections are transposed in order to group together the chunks of data required to reconstruct each slice. These chunks are called sinograms and distributed between the parallel accelerators available in the system in a round-robin fashion. Filtering and back-projection on each slice are performed on each GPU independently, the results are transferred back and either stored or passed further for online processing and visualization. To efficiently utilize the system resources, usually all described steps are pipelined. The output volume is divided into multiple subvolumes, each encompassing multiple slices. The data required to reconstruct each subvolume is loaded and send further trough the pipeline. While next portion of the data is loaded, the already loaded data is preprocessed, transposed into synograms, and reconstructed. The preprocessing is significantly less compute-intensive compared to reconstruction and is often, but not always, performed on CPUs. OpenCL, OpenMP, or POSIX threads are used to utilize all CPU cores. Basically, this allows to use all system resources including Disk/Network IO, CPUs, and GPUs in parallel. Unless the preprocessing is executed on GPUs, the sinogram generation is performed on CPU as well. Even using the slower system memory, it takes negligible amount of time. The generated sinograms are, then, distributed between GPUs for reconstruction. For each GPU a new data pipeline is started. While one sinogram is transferred into the GPU memory, the sinograms already residing in GPU memory are first filtered, then back projected to the resulting slice, and finally transferred back to the system memory. Hence, the data transfers over PCIe bus are also performed in parallel with the reconstruction.

\begin{figure}[h]

\begin{figure}[htb]

\centering\includegraphics[width=0.45\textwidth]{img/dataflow.pdf}

\caption{\label{dataflow} The data flow in image reconstruction framework. The data is split in blocks and processed using pipelined architecture to efficiently use all system resources. }

\end{figure}

As computation of trigonometric function is rather slow on all GPU architectures, the values of $cos(p\alpha)$ and $sin(p\alpha)$ are normally pre-computed on CPU for all projections, transferred to GPU constant memory, and, then, re-used for each slice. Assuming this optimization, the back projection performance is basically determined by how fast the interpolation could be made. Two interpolation modes are generally used. The nearest neighbor interpolation is faster and better at preserving the edges while the linear interpolation reconstructs the texture better. While more sophisticated interpolation algorithms can be used as well, they are significantly slower and are rarely if ever used. All the reviewed implementations rely on GPU texture engine to perform interpolation.

%ToDo: http://developer.amd.com/community/blog/2015/08/24/accelerating-performance-acl-clfft-library/

The complete algorithm is represented on \figurename~\ref{fbp}. Event-based asynchronous API and double-buffering are used during the data transfers in both directions. The filtering is normally performed as convolution in Fourier space and is relying on the available FFT libraries. NVIDIA \emph{cuFFT} is used on CUDA platform and either \emph{clFFT} or \emph{oclFFT} is utilized for OpenCL~\cite{xxx}. For optimal FFT performance, multiple sinogram rows are converted to and from FFT domain together using batched transformation mode. After filtering, the buffer with the filtered sinograms is either bind to texture on CUDA platform or copied to the texture if OpenCL is used.

The complete algorithm is represented on \figurename~\ref{fbp}. Event-based asynchronous API and double-buffering are used during the data transfers in both directions. The filtering is normally performed as convolution in Fourier space and is relying on the available FFT libraries. NVIDIA \emph{cuFFT} is used on CUDA platform and either AMD \emph{clFFT} or Apple \emph{oclFFT} is utilized for OpenCL. For optimal FFT performance, multiple sinogram rows are converted to and from FFT domain together using batched transformation mode. After filtering, the buffer with the filtered sinograms is either bind to texture on CUDA platform or copied to the texture if OpenCL is used.

The standard implementation of back projection is released as follows. Each pixel of the reconstructed slice is associated with a dedicated GPU thread and all pixels of the slice are reconstructed by the CUDA kernel in parallel. Within the kernel a loop is executed over all projections. At each iteration, a projection bin corresponding to the reconstructed pixel is calculated and the texture fetching is performed. The texture engine is configured to carry out a linear interpolation while fetching the data. This technique is actually quite old and was first proposed in the beginning of the nineties for the \emph{SGI RealityEngine}~\cite{cabral1994}.

\begin{figure}[h]

\begin{figure}[htb]

\centering

\includegraphics[width=0.45\textwidth]{img/fbp.pdf}

\caption{\label{fbp} The standard implementation of filtering back projection algorithm. The algorithm consists of two main steps: filtering and back projection. First, the sinograms are convoluted with the configured filter and the result are stored as 2D texture. Then, the resulting slice is computed by summing the impact from all projections using GPU texture engine. Double-buffering is used to perform data transfer in parallel with computations. }

Older »