The typical reconstruction data flow using parallel accelerators is shown in \figurename~\ref{dataflow}. The projections are loaded into memory either from storage or directly from the camera. In most cases, the data is loaded into the system memory first and transferred to the GPU memory before the pre-processing and reconstruction stages. However, if used together with a custom camera with a PCIe interface, the UFO framework also allows direct transfer of the projection data into the GPU memory using DirectGMA technology~\cite{vogelgesang2016dgma}. The loaded projections are pre-processed with a chain of filters to compensate for the defects of the optical system. Then, the projections are transposed in order to group together the chunks of data required to reconstruct each slice. These chunks, called sinograms, are distributed among the parallel accelerators available in the system in a round-robin fashion. Filtering and back-projection are performed on each GPU independently; the results are transferred back and either stored or passed on for online processing and visualization. To utilize the system resources efficiently, all the described steps are usually pipelined. The output volume is divided into multiple subvolumes, each encompassing multiple slices. The data required to reconstruct each subvolume is loaded and sent further through the pipeline. While the next portion of the data is loaded, the already loaded data is pre-processed, transposed into sinograms, and reconstructed. The pre-processing is significantly less compute-intensive than the reconstruction and is often, but not always, performed on CPUs. OpenCL, OpenMP, or POSIX threads are used to utilize all CPU cores. This allows all system resources, including disk/network I/O, CPUs, and GPUs, to be used in parallel. Unless the pre-processing is executed on GPUs, the sinogram generation is performed on the CPU as well.
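The transposition of projections into sinograms and their round-robin assignment to accelerators can be sketched as follows. This is a minimal illustrative Python sketch, not the UFO framework API; all function names are hypothetical.

```python
# Illustrative sketch (hypothetical names, not the UFO API):
# projections are indexed as (angle, detector_row, detector_column).

def projections_to_sinograms(projections):
    """Transpose a stack of projections into sinograms: sinogram r
    collects detector row r across all projection angles, i.e. the
    data needed to reconstruct slice r."""
    n_angles = len(projections)
    n_rows = len(projections[0])
    return [[projections[a][r] for a in range(n_angles)]
            for r in range(n_rows)]

def round_robin(sinograms, n_gpus):
    """Distribute sinogram i to accelerator i mod n_gpus."""
    buckets = [[] for _ in range(n_gpus)]
    for i, sino in enumerate(sinograms):
        buckets[i % n_gpus].append(sino)
    return buckets
```

In a real pipeline the transposition is performed on blocks of projections rather than the whole scan, so that sinogram generation overlaps with loading the next block.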
Even though it uses the slower system memory, this step takes a negligible amount of time. The generated sinograms are then distributed among the GPUs for reconstruction. A new data pipeline is started for each GPU. While one sinogram is transferred into the GPU memory, the sinograms already residing there are first filtered, then back-projected onto the resulting slice, and finally transferred back to the system memory. Hence, the data transfers over the PCIe bus are also performed in parallel with the reconstruction.
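The overlap of transfers and computation described above amounts to a chain of stages connected by bounded queues, with each stage running concurrently. A minimal sketch of such a pipeline, assuming hypothetical per-stage functions (e.g. upload, filter, back-project, download) passed in by the caller:

```python
# Minimal pipelining sketch using threads and bounded queues.
# The stage functions are placeholders for upload / filter /
# back-project / download; they are not the UFO framework API.
import queue
import threading

def pipeline(stages, items):
    """Run each stage in its own thread, connected by small queues,
    so that e.g. a PCIe upload of one sinogram overlaps with the
    filtering and back-projection of the previous ones."""
    qs = [queue.Queue(maxsize=2) for _ in range(len(stages) + 1)]

    def worker(fn, qin, qout):
        while True:
            item = qin.get()
            if item is None:          # sentinel: propagate and stop
                qout.put(None)
                return
            qout.put(fn(item))

    threads = [threading.Thread(target=worker, args=(fn, qs[i], qs[i + 1]))
               for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    for item in items:
        qs[0].put(item)
    qs[0].put(None)

    results = []
    while True:
        r = qs[-1].get()
        if r is None:
            break
        results.append(r)
    for t in threads:
        t.join()
    return results
```

The bounded queues provide back-pressure: a fast loader cannot run far ahead of the reconstruction stage, which keeps memory use proportional to the block size rather than to the whole scan.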
\centering\includegraphics[width=0.45\textwidth]{img/dataflow.pdf}
\caption{\label{dataflow} The data flow in the image reconstruction framework. The data is split into blocks and processed using a pipelined architecture to utilize all system resources efficiently.}