\section{Introduction}\label{section:intro}
 X-ray tomography is a powerful tool to investigate materials and small animals at the micro- and nano-scale~\cite{withers2007nanotomography}.
Information about X-ray attenuation and/or phase changes in the sample is used to reconstruct its internal structure. Recent advances in X-ray optics and detector technology have paved the way for a variety of new X-ray imaging experiments aiming to study dynamic processes in materials and to analyze small organisms in vivo. At the Swiss Light Source (SLS), scientists were able to take high-quality 3D snapshots of the 150 Hz oscillations of a blowfly flight motor~\cite{mokso2015fly}. A temporal resolution of 20 milliseconds was achieved during a tensile test performed at SLS~\cite{maire2016} and also in the analysis of the morphological dynamics of fast-moving weevils at the ANKA synchrotron at KIT~\cite{dosSantosRolo2013invivo}.
%The  foaming of aluminum alloy precursors was studied with temporal resolution of 200 milliseconds~\cite{kamm2016foam}. At ANKA synchrotron, %the morphological dynamics in fast-moving weevils were obtained with temporal resolution of 50 milliseconds~\cite{dosSantosRolo2013invivo}.
 
 To achieve these results, the instrumentation used at imaging beamlines has recently undergone a major update. The installed streaming cameras are able to deliver up to hundreds of thousands of frames per second with a continuous data rate of up to 8 GB/s~\cite{marone2017}. Newly developed control systems at ANKA~\cite{vogelgesang2016}, SLS~\cite{marone2017}, and other synchrotron facilities use the acquired imaging information to track the processes under study and adjust the instrumentation accordingly. These control systems rely heavily on the performance of the integrated image processing frameworks. Faster acquisition and a high level of automation are essential to study dynamic phenomena and at the same time enable experiments with significantly increased sample throughput. For example, in 2015 Diamond Light Source (DLS) reported that typically about 3000 scans are recorded during 5 days of operation at a single imaging beamline~\cite{atwood2015}. Consequently, the amount of data generated at imaging beamlines grows quickly and results in a steep rise in the required computing power. To achieve higher temporal resolution and to prolong the duration of experiments, advanced methods have been developed that incorporate a priori knowledge into the reconstruction procedure. These methods are able to produce high-quality images from undersampled and underexposed measurements, as demonstrated by~\cite{mirone2014dict,eyndhoven2015}. Unfortunately, these methods are computationally significantly more demanding than traditional reconstruction algorithms and further increase the load on the computing infrastructure~\cite{ashkarin2015}.

%Fortunately, the highly parallel computing architectures has brought a significant boost also to computing domain. 
 To tackle the performance challenge, several reconstruction frameworks have been developed and optimized to utilize the parallel capabilities of modern computing architectures. At SLS, \emph{GridRec}, a fast reconstruction approach optimized for conventional CPU technology, has been adopted~\cite{marone2012regridding}. The reconstruction is scheduled across a dedicated cluster and delivers a 3D image within a couple of minutes~\cite{marone2017}. Other frameworks use GPUs to accelerate the computation and are able to achieve minute-scale reconstructions on a single node equipped with multiple GPU adapters. PyHST, developed at ESRF, uses the CUDA framework to offload image reconstruction to NVIDIA GPUs~\cite{chilingaryan2011gpu}. The second version of PyHST also provides a number of iterative reconstruction techniques~\cite{mirone2014pyhst2}. The UFO parallel computing framework is used at the ANKA synchrotron to realize in-vivo tomography and laminography experiments~\cite{vogelgesang2012ufo,vogelgesang2016dgma}. It constructs a data processing workflow by combining basic building blocks in a graph structure. OpenCL is used to execute the reconstruction on parallel accelerators with a primary focus on NVIDIA and AMD GPUs. ASTRA is a fast and flexible development platform for tomographic algorithms with MATLAB and Python interfaces~\cite{aarle2016astra,palenstijn2017astra}. It is implemented in C++ and uses CUDA to offload computations to GPUs. Several other frameworks are based on the ASTRA libraries to provide GPU-accelerated reconstruction, for instance the Savu framework at DLS~\cite{atwood2015} or TomoPy at the Advanced Photon Source (APS)~\cite{gursoy2014tomopy}. Recent versions of TomoPy also support \emph{UFO} and \emph{GridRec} as backends. All of the GPU-accelerated frameworks are also capable of distributing the computation across a GPU cluster.

 While most of today's imaging frameworks rely heavily on parallel hardware to speed up the reconstruction, specific features of the GPU architecture are rarely considered. On the other hand, the hardware architectures differ significantly~\cite{zhang2014performance}. The organization of memory and cache hierarchies, the performance balance between different types of operations, and even the type of parallelism vary. A significant speed-up is possible if details of the specific architecture are taken into account, as illustrated in~\cite{csa2011sc}. Fast execution is especially important if the reconstruction is embedded in a control workflow. Minimal latency is essential to track faster processes and to improve the achieved spatial and temporal resolutions. Due to unavoidable communication overhead, it is not always possible to reduce the latency by scaling up the reconstruction cluster.

 For online monitoring and control, fast analytical methods are normally used to reconstruct 3D images. There are two main approaches: Filtered Back Projection (FBP) and methods based on the Fourier Slice Theorem~\cite{natterer2001mathematical}. The latter methods are asymptotically faster but, due to the interpolation involved in the Fourier domain, are more sensitive to the quality of the available projections. For typical geometries, Fourier-based methods are several times faster on the same computing hardware~\cite{rshkarin2015} and should be preferred if the computing infrastructure is limited to general-purpose processors only~\cite{marone2017}. A recent study suggests implementing back projection as a convolution in log-polar coordinates in order to achieve high reconstruction speed with interpolation in the image domain~\cite{anderson2016fbpconv}. However, this new method has not yet been adopted in production environments. Filtered Back Projection is still the method of choice, largely due to its simplicity and robustness. Therefore, the efficiency of the FBP implementation remains crucial for the monitoring and control systems in operation. Furthermore, methods used for low-dose tomography normally consist of sequences of forward and back projections. Thus, a faster implementation of back projection also lowers the computational demands for high-quality offline reconstruction and might reduce the required hardware investment.
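 To make the computational pattern discussed throughout this paper concrete, the following listing gives a minimal sketch of a pixel-driven back-projection kernel in CUDA. It is an illustrative example only, not the implementation evaluated in this work; the names, the parallel-beam geometry, and the explicit linear interpolation are simplifying assumptions compared to the optimized kernels presented in \sectionname~\ref{section:tomo}.
\begin{verbatim}
// Illustrative sketch only (hypothetical names, simplified parallel-beam
// geometry): each thread reconstructs one pixel of the slice by summing
// the linearly interpolated contributions of all projections.
__global__ void backproject_naive(float *slice, const float *sinogram,
                                   int n_proj, int n_bins, int size,
                                   float axis, const float *cos_t,
                                   const float *sin_t)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= size || y >= size) return;

    float sum = 0.0f;
    for (int p = 0; p < n_proj; p++) {
        // detector coordinate of pixel (x, y) for projection p
        float h = axis + (x - size / 2) * cos_t[p]
                       - (y - size / 2) * sin_t[p];
        int bin = (int)floorf(h);
        if (bin >= 0 && bin < n_bins - 1) {
            float d = h - bin;   // linear interpolation weight
            sum += (1.0f - d) * sinogram[p * n_bins + bin]
                 +         d  * sinogram[p * n_bins + bin + 1];
        }
    }
    slice[y * size + x] = sum;
}
\end{verbatim}
The optimized variants discussed later perform this interpolation either via the texture engine or on the computational cores, and the hybrid schemes combine both paths.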

 While there are several articles aiming at the optimization of back projection for general-purpose processors and Intel Xeon Phi accelerators~\cite{treibig2013simd}, to the best of our knowledge there are no publications considering the variety of GPU architectures. A number of papers address specific GPU architectures~\cite{zinsser2013,papenhausen2013}. Multiple papers perform a general analysis of a range of GPU architectures, reveal undisclosed details through micro-benchmarking, and propose guidelines for performance optimization~\cite{volkov2016thesis,mei2017microbench,zhang2017performance}. This information is invaluable to understand the factors limiting performance on a specific architecture and to find an alternative approach that achieves better performance. Several papers propose methods to auto-tune computation kernels~\cite{lim2017autotuning}. However, the tuning is limited to finding the optimal configuration of pre-defined parameters such as the desired occupancy, the dimensions of execution blocks, etc. For instance, there are no automated solutions to tune the balance between the texture engine and the computational cores.
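 As a simple illustration of the limited scope of such auto-tuners, the following hypothetical sketch times one and the same CUDA kernel for several candidate block sizes and keeps the fastest configuration. The kernel, names, and candidate values are assumptions made for illustration and do not correspond to any of the cited tools.
\begin{verbatim}
// Hypothetical sketch of parameter-sweep auto-tuning: only a pre-defined
// launch parameter (the block size) is varied; the kernel itself stays fixed.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 24;
    float *data;
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    int candidates[] = {64, 128, 256, 512, 1024};
    float best_ms = 1e30f;
    int best_block = 0;

    for (int block : candidates) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        scale<<<(n + block - 1) / block, block>>>(data, n, 0.5f);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) { best_ms = ms; best_block = block; }
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    printf("fastest block size: %d (%.3f ms)\n", best_block, best_ms);
    cudaFree(data);
    return 0;
}
\end{verbatim}
Decisions such as how to split the work between the texture engine and the computational cores lie outside the search space of this kind of tuning and have to be addressed by the algorithm design itself.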
 
 In \cite{csa2018sbac}, we presented two highly optimized back-projection algorithms for NVIDIA Pascal GPUs and a hybrid approach that balances the load between different GPU subsystems by using both algorithms in parallel. While the algorithms can be used on different hardware, multiple modifications are required to address the differences between the architectures efficiently. Furthermore, the proposed hybrid approach is only suitable for NVIDIA GPUs of the last few generations. A different load-balancing scheme is required for AMD, Intel, and older NVIDIA GPUs. In this paper, we review a variety of parallel architectures presented in the last 10 years and establish a methodology to expand the original work to different parallel hardware. We discuss the hardware differences in detail, build a performance model, and demonstrate how these differences can be addressed to optimize the performance of the FBP algorithm. In particular, we suggest modifications to adapt the developed algorithms to architectures with on-chip memory optimized for 64-bit access. To address further differences in the memory subsystems, we propose several alternative caching methods. We introduce an approach to reduce the overall number of executed instructions for systems with a bottleneck in the instruction throughput. We discuss optimal blocking strategies in great detail and suggest how code generation can be tweaked on the NVIDIA platform. We also propose two new methods to balance the load between different GPU subsystems: one targets the NVIDIA Kepler architecture, and the other can be applied universally, albeit with a minor penalty to quality. The proposed performance model also allows us to estimate the speed on future architectures and to select the appropriate modification and parametrization of the algorithms. To the best of our knowledge, we present the first comprehensive overview of GPU architectures across multiple vendors and GPU generations. Furthermore, using the back-projection algorithm as an example, we illustrate how specific hardware features can be addressed and estimate the possible gains. Thus, the contribution of this paper goes beyond the proposed back-projection algorithm and also suggests optimization strategies suitable for other applications.
 
 In this paper we focus on the optimization of the back-projection algorithm and only briefly mention the organization of the data flow, as it is already explained in the literature~\cite{chilingaryan2011gpu,vogelgesang2016dgma}. We also do not cover scaling issues, since the proposed optimizations can easily be integrated into existing frameworks like ASTRA, PyHST, or UFO, which already provide multi-GPU and GPU-cluster support. The article is organized as follows. The hardware setup, software configuration, and pseudo-code conventions are listed in \sectionname~\ref{section:setup}. A short introduction to parallel architectures, required to understand the proposed optimizations, is given in \sectionname~\ref{section:arch}. In this section we also highlight the differences between the considered parallel architectures. The Filtered Back Projection algorithm and its state-of-the-art implementation are presented in \sectionname~\ref{section:tomo}. A number of optimizations to this implementation are proposed in \sectionname~\ref{section:texrec}. An alternative implementation relying on a different set of hardware resources is developed in \sectionname~\ref{section:alurec}. A hybrid method combining both approaches to fully utilize all hardware resources is presented in \sectionname~\ref{section:hybridrec}. The achieved performance improvements are finally discussed in \sectionname~\ref{section:summary}.
% The full algorithm performance is evaluated in section 5.