/articles/toma : revision 56

1

\section{Introduction}\label{section:intro}

2

X-ray tomography is a powerful tool to investigate materials and living organisms at micro- and nano-scale~\cite{withers2007nanotomography}. Information about X-ray attenuation or/and phase change in a sample is used to reconstruct its internal structure. At imaging beamlines, the samples are routinely placed onto a rotation stage in front of a pixel detector and the detector is registering a series of parallel 2D projections of the sample volume. Various imaging techniques are, then, used to reconstruct a full 3D volume from series of the recorded 2D projections. Recent advances in X-ray optics and detector technology has paved a way for variety of new X-ray imaging experiments aimed to study fast processes in materials or real-time dynamics in small living organisms. At Swiss Light Source (SLS) the scientists were able to make high quality 3D snapshots through 150 Hz oscillations of the blowfly flight motor~\cite{mokso2015fly}. The temporal resolution of 20 milliseconds was achieved during the stencil test performed at SLS~\cite{maire2016} and the analysis of morphological dynamics in fast-moving weevils at ANKA synchrotron~\cite{dosSantosRolo2013invivo}.

2

X-ray tomography is a powerful tool to investigate materials and small animals at micro- and nano-scale~\cite{withers2007nanotomography}. Information about X-ray attenuation or/and phase changes in the sample are used to reconstruct its internal structure. At synchrotron imaging beamlines, samples are typically placed at a rotating stage in front of a pixel detector. It registers a series of 2D projections while the sample is turning. Various imaging techniques are then used to reconstruct the full 3D volume from the series of projections. Recent advances in X-ray optics and detector technology have paved the way for a variety of new X-ray imaging experiments aiming to study dynamic processes in materials and to analyze small organisms in vivo. At the Swiss Light Source (SLS) scientists were able to take high quality 3D snapshots of 150 Hz oscillations of a blowfly flight motor~\cite{mokso2015fly}. A temporal resolution of 20 milliseconds was achieved during a stencil test performed at SLS~\cite{maire2016} and also in the analysis of morphological dynamics of fast-moving weevils at the ANKA synchrotron at KIT~\cite{dosSantosRolo2013invivo}.

3

%The foaming of aluminum alloy precursors was studied with temporal resolution of 200 milliseconds~\cite{kamm2016foam}. At ANKA synchrotron, %the morphological dynamics in fast-moving weevils were obtained with temporal resolution of 50 milliseconds~\cite{dosSantosRolo2013invivo}.

4

To achieve these results, the instrumentation used at imaging beamlines has recently undergone a major update. Installed streaming cameras are able to deliver up to hundred thousands frames per second with the data bandwidth up to 8 GB/s continuously~\cite{marone2017}. Newly developed control systems at ANKA~\cite{vogelgesang2016}, SLS~\cite{marone2017}, and other synchrotron facilities relying on acquired imaging information to track processes under the study and adjust instrumentation accordingly. Hence, the operation of control systems is highly depending on the speed of integrated image processing frameworks. Faster acquisition and higher level of automation not only allowed to study dynamic phenomena, but also significantly increased the sample throughput. For example, in 2015 the Diamond Light Source (DLS) reported that typically about 3000 scans are made during a 5 days operation on a single imaging beamline~\cite{atwood2015}. Consequently, amount of data generated at imaging beamlines quickly grows and the computational demands are on the steep rise. Furthermore, to achieve higher temporal resolution and prolong experiment duration, there is an active research on the methods of low-dose tomography which are aimed to incorporate a priori knowledge in the reconstruction procedure and produce high quality images from undersampled and underexposed measurements, for example see~\cite{mirone2014dict, eyndhoven2015}. These methods are significantly more computationally intensive when traditional reconstruction techniques and would further increase load on the computing infrastructure~\cite{ashkarin2015}.

4

5

To achieve these results, the instrumentation used at imaging beamlines has recently undergone a major update. The nowadays installed streaming cameras are able to deliver up to hundreds of thousands of frames per second with a continuous data rate up to 8 GB/s~\cite{marone2017}. Newly developed control systems at ANKA~\cite{vogelgesang2016}, SLS~\cite{marone2017}, and other synchrotron facilities use the acquired imaging information to track the processes under study and adjust the instrumentation accordingly. These control systems rely highly on the performance of the integrated image processing framework.

6

7

Faster acquisition and a higher level of automation is essential to study dynamic phenomena, but also enables experiments with significantly increased the sample throughput. For example, in 2015 Diamond Light Source (DLS) has reported that typically about 3000 scans are recorded during 5 days of operation at a single imaging beamline~\cite{atwood2015}. Consequently, the amount of data generated at imaging beamlines quickly grows and results in a steep rise in the computational demand of these techniques. In order to achieve higher temporal resolution and prolong experiment duration, advanced methods are developed that incorporate a priori knowledge in the reconstruction procedure. These methods are able to produce high quality images from undersampled and underexposed measurements, as demonstrated by~\cite{mirone2014dict, eyndhoven2015}. Unfortunately these methods are computationally significantly more demanding then traditional reconstruction algorithms and further increase the load to the computing infrastructure~\cite{ashkarin2015}.

5

8

6

9

%Fortunately, the highly parallel computing architectures has brought a significant boost also to computing domain.

7

To tackle the performance problem a number of reconstruction frameworks were developed and optimized to use parallel capabilities of nowadays computing architectures. At SLS \emph{GridRec}, a fast reconstruction approach optimized for conventional CPU technology, is adopted~\cite{marone2012regridding}. The reconstruction is scheduled across the dedicated cluster and produces a full 3D image within a couple of minutes~\cite{marone2017}. Other frameworks tend to use AMD and NVIDIA GPUs to accelerate computations and able to achieve a minute-scale reconstruction using a single node equipped with multiple GPU adapters. PyHST is developed at ESRF and uses CUDA framework to offload image reconstruction to NVIDIA GPUs~\cite{chilingaryan2011gpu}. The second version of the framework provides a number of iterative reconstruction techniques as well~\cite{mirone2014pyhst2}. The UFO parallel computing framework is used at ANKA synchrotron to manage in-vivo tomography and laminography experiments~\cite{vogelgesang2012ufo,vogelgesang2016dgma}. It builds an arbitrary data processing workflows by combining basic building blocks in a graph structure. OpenCL libraries are used to offload reconstruction to parallel accelerators with primary focus on NVIDIA and AMD GPUs. ASTRA is fast and flexible development platform for tomographic algorithms with bindings to MATLAB and using CUDA for computation offload~\cite{aarle2016astra,palenstijn2017astra}. Several other frameworks are based on ASTRA libraries to provide GPU-accelerated reconstruction, for instance Savu framework at DLS~\cite{atwood2015} and TomoPy at APS (Advanced Photon Source)~\cite{gursoy2014tomopy}. Recent versions of TomoPy support \emph{UFO} and \emph{GridRec} backends as well. All of the GPU-accelerated frameworks are capable to distribute the computation across the cluster of computation nodes as well.

8

9

While most of the nowadays imaging frameworks are heavily rely on the parallel hardware to speed-up reconstruction, the implementations are rarely optimized for the specific architectures but utilizing a most common denominator instead. On other hand, the hardware architectures differ significantly~\cite{zhang2014performance}. Organization of memory and cache hierarchies, performance balance between different types of operations, and even type of parallelism varies. A significant speed-up is possible if the details of the specific architecture are taken into the consideration as illustrated in~\cite{csa2011sc}. This is especially important if reconstruction is running as part of control workflow and the reduction of the reconstruction latency may allow to track faster processes and achieve higher temporal or spatial resolution. Due to imminent extra communication latencies, the same result is not always possible to reach by scaling the reconstruction cluster.

10

11

For online monitoring and control, normally the fast analytical methods are used to reconstruct 3D images. There are two main approaches FBP (Filtered Back Projection) and the methods based on the Fourier Slice Theorem~\cite{natterer2001mathematical}. The later methods are asymptotically faster, but due to involved interpolation in the Fourier domain are more sensitive to the quality of the source projection data. For typical volume geometries the Fourier-based methods produce results several times faster using the same hardware~\cite{rshkarin2015} and are preferred if the computing cluster does not provide parallel accelerators and is limited to the general-purpose processors only~\cite{marone2017}. The recent research suggests possibility to implement back projection as convolution in log-polar coordinates providing the high reconstruction speed while performing interpolation in the image domain only~\cite{anderson2016fbpconv}. However, the new method has yet to receive an adoption in the production environment. Meanwhile, the Filtered Back Projection is still method of choice, largely due to it simplicity and robustness. Therefore, efficiency of FBP implementation is still crucial for the currently used real-time monitoring and control systems. Furthermore, since methods of low dose tomography are normally consists of series of forward and back projections, the faster implementation of back projection will also allow to reduce investments in the hardware used for high-quality offline reconstruction.

10

To tackle the performance challenge several reconstruction frameworks have been developed and optimized to use parallel capabilities of nowadays computing architectures. At SLS \emph{GridRec}, a fast reconstruction approach optimized for conventional CPU technology, has been adopted~\cite{marone2012regridding}. The reconstruction is scheduled across a dedicated cluster and reconstructs a 3D image within a couple of minutes~\cite{marone2017}. Other frameworks use GPUs to accelerate the computation and are able to achieve minute-scale reconstructions at a single node equipped with multiple GPU adapters. PyHST is developed at ESRF and uses the CUDA framework to offload image reconstruction to NVIDIA GPUs~\cite{chilingaryan2011gpu}. The second version of PyHST provides also a number of iterative reconstruction techniques~\cite{mirone2014pyhst2}. The UFO parallel computing framework is used at ANKA synchrotron to realize in-vivo tomography and laminography experiments~\cite{vogelgesang2012ufo,vogelgesang2016dgma}. It constructs a data processing workflow by combining basic building blocks in a graph structure. OpenCL is used to execute the reconstruction at parallel accelerators with a primary focus on NVIDIA and AMD GPUs. ASTRA is a fast and flexible development platform for tomographic algorithms with MATLAB and python interfaces~\cite{aarle2016astra,palenstijn2017astra}. It is implemented in C++ and uses CUDA to offload computations to GPU. Several other frameworks are based on ASTRA libraries to provide GPU-accelerated reconstruction, for instance the Savu framework at DLS~\cite{atwood2015} or TomoPy at the Advanced Photon Source (APS)~\cite{gursoy2014tomopy}. Recent versions of TomoPy also support \emph{UFO} and \emph{GridRec} as backends. All of the GPU-accelerated frameworks are capable to distribute the computation across the cluster of computation nodes as well.

11

12

While most of the nowadays imaging frameworks are heavily rely on the parallel hardware to speed-up reconstruction, the implementations are rarely optimized for the specific architectures but utilizing the most common denominator instead. On other hand, the hardware architectures differ significantly~\cite{zhang2014performance}. Organization of memory and cache hierarchies, performance balance between different types of operations, and even type of parallelism varies. A significant speed-up is possible if the details of the specific architecture are taken into the consideration as illustrated in~\cite{csa2011sc}. This is especially important if the reconstruction is running as part of control workflow. A reduced latency may allow to track faster processes and to improve the achieved spatial and temporal resolution. Due to imminent communication latencies, the same result can not always be reached by scaling the reconstruction cluster.

13

14

For online monitoring and control, normally fast analytical methods are used to reconstruct 3D images. There are two main approaches, Filtered Back Projection (FBP) and methods based on the Fourier Slice Theorem~\cite{natterer2001mathematical}. The later methods are asymptotically faster, but due to the involved interpolation in the Fourier domain are more sensitive to the quality of the available projections. For typical geometries Fourier-based methods are several times faster using the same computing hardware~\cite{rshkarin2015} and should be preferred if the computing infrastructure is limited to general-purpose processors only~\cite{marone2017}. Recent study suggests to implement back projection as convolution in log-polar coordinates in order to gain high reconstruction speed with interpolation in the image domain~\cite{anderson2016fbpconv}. However, this new method has not yet been adopted in production environments. Still, Filtered Back Projection is the method of choice, largely due to it simplicity and robustness. Therefore, the efficiency of the FBP implementation is still crucial for the currently used real-time monitoring and control systems. Furthermore, since methods used for low dose tomography normally consist of sequences of forward and back projections, a faster implementation of the back projection lower the computational demands for high-quality offline reconstruction and might reduce the required hardware investments.

12

15

13

16

While there are several articles aimed on optimization of Back Projection to general-purpose processors and Intel Xeon-Phi accelerators~\cite{treibig2013simd}, up to our knowledge there are no publications considering multiple GPU architectures. A number of paper address specific GPUs~\cite{zinsser2013,papenhausen2013}. Multiple papers perform a general analysis of a range of GPU architectures, reveal undisclosed details trough micro-benchmarking, and propose guidelines for performance optimization~\cite{volkov2016thesis,mei2017microbench,zhang2017performance}. This information is invaluable to understand factors limiting performance on a specific architecture and to find an alternative approach to achieve better performance. Several papers propose methods to auto-tune computation kernels~\cite{lim2017autotuning}. However, the tuning is limited to finding optimal number of registers, sizes of thread blocks, and other pre-defined parameters. For instance, there are no automated solutions to tune balance between texture engine and computational cores.

14

17

15

In this article we will review differences between recent parallel architectures and explain how the FBP algorithm can be mapped on the existing parallel hardware efficiently. We will focus on the architecture-specific optimizations and will only briefly mention the organization of data flow as it is already explained in the literature~\cite{chilingaryan2011gpu,vogelgesang2016dgma}. We also will not discuss scaling issues since proposed optimizations can be integrated with the existing frameworks like ASTRA, PyHST, and UFO which are already providing multi-GPU and GPU-cluster support. The article is organized as follows. The hardware setup, software configuration, and pseudo-code conventions are listed in \sectionname~\ref{section:setup}. The short introduction to parallel architectures required to understand the proposed optimizations is given in \sectionname~\ref{section:arch}. In this section we also highlight the differences between considered parallel architectures. Filtered Back Projection algorithm and the state-of-the-art implementation are presented in \sectionname~\ref{section:tomo}. A number of optimizations to the state-of-the-art implementation of the back-projection algorithm are proposed in \sectionname~\ref{section:texrec}. An alternative implementation relaying on a different set of hardware resources is developed in \sectionname~\ref{section:alurec}. A hybrid approach combining both approaches to fully utilize all hardware resources is presented in \sectionname~\ref{section:hybridrec}. The achieved performance improvements are discussed in \sectionname~\ref{section:summary}.

18

In this article we will review differences between recent parallel architectures and explain how the FBP algorithm can be mapped on the existing parallel hardware efficiently. We will focus on the architecture-specific optimizations and will only briefly mention the organization of data flow as it is already explained in the literature~\cite{chilingaryan2011gpu,vogelgesang2016dgma}. We also do not cover scaling issues since the proposed optimizations can be easily integrated in existing frameworks like ASTRA, PyHST, or UFO which are already providing multi-GPU and GPU-cluster support. The article is organized as follows. The hardware setup, software configuration, and pseudo-code conventions are listed in \sectionname~\ref{section:setup}. The short introduction to parallel architectures required to understand the proposed optimizations is given in \sectionname~\ref{section:arch}. In this section we also highlight the differences between considered parallel architectures. Filtered Back Projection algorithm and the state-of-the-art implementation are presented in \sectionname~\ref{section:tomo}. A number of optimizations to the state-of-the-art implementation of the back-projection algorithm are proposed in \sectionname~\ref{section:texrec}. An alternative implementation relaying on a different set of hardware resources is developed in \sectionname~\ref{section:alurec}. A hybrid approach combining both approaches to fully utilize all hardware resources is presented in \sectionname~\ref{section:hybridrec}. The achieved performance improvements are discussed in \sectionname~\ref{section:summary}.

16

19

% The full algorithm performance is evaluated in section 5.

17

20

18

21

b'\\ No newline at end of file'