\subsection{CPU and Xeon Phi}\label{section:cpurec}
While we do not target CPU-based architectures, the OpenCL code developed for the AMD platform is easy to adapt to general-purpose processors, and we have evaluated CPU performance for the sake of completeness. General-purpose processors do not provide a texture engine. While recent versions of the OpenCL frameworks emulate the missing functionality, better performance is achieved by targeting the algebraic units of the CPU directly. We adapted both the standard and the ALU-based algorithms to load data directly from system memory instead of fetching it through the texture engine. The standard algorithm is additionally modified to perform linear interpolation explicitly. The main difference between the two methods is that the ALU algorithm caches data in shared memory, while the adapted standard method loads data directly from system memory and relies on the CPU caches. On CPUs, however, there is no special hardware component backing shared memory: appropriate blocking is enough to utilize the CPU caches, and the intermediate caching step is not strictly required. On the other hand, the amount of required computation is reduced if the second term of the linear interpolation and a few other intermediate values are pre-computed and cached in shared memory, as proposed in sections~\ref{section:ycache} and \ref{section:alu_hx}. In either case, performance improves if multiple slices are reconstructed in parallel and a larger pixel area is assigned to each thread block. On newer systems supporting 256-bit AVX instructions it makes sense to scale processing up to at least 8 slices in parallel. Assigning more pixels per block is important for efficient cache use; the optimal number is determined by the size of the L2 cache available per CPU core.
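To make the adaptation concrete, the following C sketch outlines how the inner loop of the standard algorithm can be expressed on a CPU: the linear interpolation is performed explicitly on data loaded from system memory, and several slices are accumulated together so that the compiler can vectorize across them. It is only a minimal illustration; the identifiers (\texttt{bp\_pixel}, \texttt{sino}, \texttt{NSLICES}) and the interleaved sinogram layout are assumptions and do not correspond to the actual kernel code.

\begin{verbatim}
#include <stddef.h>

#define NSLICES 4   /* slices reconstructed together; 8 on AVX systems */

/* Accumulate one pixel over all projections. The sinogram is assumed to
   be stored interleaved as [projection][bin][slice]; bounds checks are
   omitted for brevity. */
static void bp_pixel(const float *sino, int n_proj, int n_bins,
                     const float *cos_t, const float *sin_t,
                     float x, float y, float axis, float acc[NSLICES])
{
    for (int p = 0; p < n_proj; p++) {
        /* Detector coordinate; interpolation is explicit instead of
           relying on a texture engine. */
        float h = axis + x * cos_t[p] - y * sin_t[p];
        int   i = (int)h;
        float w = h - (float)i;
        const float *row = sino + ((size_t)p * n_bins + i) * NSLICES;
        for (int s = 0; s < NSLICES; s++)   /* vectorizable over slices */
            acc[s] += (1.0f - w) * row[s] + w * row[s + NSLICES];
    }
}
\end{verbatim}

With such a layout the per-slice accumulators map naturally onto SSE or AVX lanes, which is why reconstructing 4 or 8 slices in parallel pays off.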
 
 To evaluate performance, we used a server equipped with two Intel Xeon X5650 processors (6 cores, 2.66 - 3.06 GHz, 12 MB L3 cache, 128-bit SSE4.2 instructions) and an Intel Xeon Phi 5110P accelerator (60 cores, 1.05 GHz, 30 MB L2 cache, 512-bit IMCI instructions). There are two major OpenCL frameworks supporting general-purpose processors: AMD and Intel deliver their own SDKs, but processors from both vendors are supported in either case. The AMD framework is not able to run the ALU algorithm efficiently without further adaptation; faster reconstruction is possible if the simpler standard algorithm is used instead. Still, it is significantly slower than the Intel SDK running the same OpenCL code on the same hardware. The speed increases further if the Intel SDK runs the ALU variant with the advanced caching mode and $h_x$ caching enabled. The best performance is measured in the 4-slice reconstruction mode with 64x64-pixel regions assigned per thread block. As a baseline, we compared the reconstruction speed against the CPU version of PyHST~\cite{chilingaryan2011gpu}. It implements a multi-threaded and cache-aware reconstruction, but the computation is not vectorized. Each thread processes a subset of all slices; compound sinograms for the simultaneous reconstruction of several slices are not supported. The performance is summarized in \tablename~\ref{tbl:cpu_perf}. PyHST outperforms the OpenCL prototype in the single-slice mode, but it is slower if multiple slices are reconstructed at once. A performance of 33~GU/s is measured on a newer server with dual Xeon E5-2680 v3 processors (12 cores, 2.50 - 3.30 GHz, 30 MB L3 cache, 256-bit AVX2 instructions). Even then, the achieved reconstruction speed is inferior to the performance delivered by the slowest of the considered GPUs.
As the Xeon Phi line is discontinued, the latest versions of the Intel OpenCL SDK no longer support Xeon Phi processors. For this reason we had to resort to a much older version from 2014. This version performs significantly worse on general-purpose CPUs; the performance it delivers is on par with the AMD SDK. Consequently, the performance measured on the Xeon Phi is barely above the speed of a pair of old Xeon processors.

\begin{table}[htb] %[htbp]
\begin{threeparttable}
\caption{\label{tbl:cpu_perf} Performance using general-purpose processors}
\centering
\noindent
%\resizebox{\columnwidth}{!}{\begin{tabular}{} ... \end{\tabular}}
\begin{tabularx}{\columnwidth}{ | X  c | r r | c | }
\hline
& & \multicolumn{2}{c|}{2x Xeon X5650} &  Xeon Phi 5110P \\
\mhd{|c}{Method} & \mhd{c|}{$n_v$} & \mhd{c}{AMD} & \mhd{c|}{Intel} & \mhd{c|}{Intel} \\

\hline
PyHST & 12 & \multicolumn{2}{c|}{9.3 GU/s} & - \\

\hline
\multirow{2}{*}{Standard} 
& 1 & 1.2 GU/s    & 3.6 GU/s    & 16.2 GU/s\\
& 4 & 4.2 GU/s    & 10.2 GU/s   & 12.1 GU/s\\

\hline
\multirow{2}{*}{Synchronized} 
& 1 & 0.9 GU/s    & 3.9 GU/s     & - \\
& 4 & 3.2 GU/s    & 10.6 GU/s    & - \\

\hline
\multirow{2}{*}{ALU algorithm} 
& 1 & 0.9 GU/s    & 6.1 GU/s     & 2.7 GU/s \\
& 4 & 3.7 GU/s    & 14.1 GU/s    & 0.2 GU/s \\
\hline
\end{tabularx}
\end{threeparttable}
\end{table}

% We actually have only 15 threads on Intel and 30 on AMD per 12 cores. So, there is no switching. The same thread iteratively processes workgroup.
% On other hand, the registers anyway need to be stored and loaded if we switch work-items.
There is a significant architectural difference between CPU and GPU platforms which is not considered in our implementation. When a thread block is scheduled to an SM, the SM permanently assigns registers to all threads of the block and can switch between the executing threads without significant penalty. This is not the case for general-purpose processors: the registers in use have to be saved and restored whenever the CPU core switches to another thread of the block~\cite{intel2011openclpg}. To avoid the associated performance penalty, threads on the CPU platform usually execute a large number of instructions before switching. For the proposed back-projection algorithm this means that a thread processes multiple projections before giving way to the other threads of the block. Consequently, the data cached from the first projection of a block is already evicted from the L1 cache by the time the next thread is started. While this can be prevented by synchronizing the block threads at each projection iteration, the performance is then penalized the other way due to expensive context switches. This penalty actually plays a significant role in the performance difference between the AMD and Intel SDKs. With the Intel SDK, the performance of the standard algorithm is slightly improved if the synchronization is performed before moving to the next projection; on AMD, it penalizes performance even further, as shown in \tablename~\ref{tbl:cpu_perf}. Higher performance could probably be achieved if the number of context switches were reduced without significantly penalizing the L1 cache hit rate. However, it is much simpler to target general-purpose architectures with plain C code. No context switches are required if the CPU cores are made responsible for different subsets of slices, and both the L1 and L2 caches can be targeted directly with appropriate blocking.
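A plain C implementation along these lines is sketched below, assuming an OpenMP runtime; the routine names, the blocking factor, and the sinogram layout are illustrative assumptions rather than the actual PyHST or OpenCL code. Each thread reconstructs its own subset of slices, so no context switches occur inside the projection loop, and the pixels of a slice are processed in blocks whose accumulators stay resident in the per-core caches while the projection loop streams through the sinogram.

\begin{verbatim}
#include <stddef.h>
#include <omp.h>

#define BLOCK 64   /* pixels per block side; tune to the per-core cache */

/* Back-project one slice. n_pixels is assumed to be a multiple of BLOCK
   and detector bounds checks are omitted for brevity. */
static void reconstruct_slice(float *slice, const float *sino,
                              int n_pixels, int n_proj, int n_bins,
                              const float *cos_t, const float *sin_t,
                              float axis)
{
    for (int by = 0; by < n_pixels; by += BLOCK)
        for (int bx = 0; bx < n_pixels; bx += BLOCK) {
            float acc[BLOCK * BLOCK] = {0.0f};  /* 16 KB accumulator block */
            for (int p = 0; p < n_proj; p++) {  /* stream sinogram rows */
                const float *row = sino + (size_t)p * n_bins;
                for (int y = 0; y < BLOCK; y++)
                    for (int x = 0; x < BLOCK; x++) {
                        float h = axis
                                + (bx + x - n_pixels / 2) * cos_t[p]
                                - (by + y - n_pixels / 2) * sin_t[p];
                        int   i = (int)h;
                        float w = h - (float)i;
                        acc[y * BLOCK + x] +=
                            (1.0f - w) * row[i] + w * row[i + 1];
                    }
            }
            for (int y = 0; y < BLOCK; y++)     /* write block back */
                for (int x = 0; x < BLOCK; x++)
                    slice[(size_t)(by + y) * n_pixels + bx + x] =
                        acc[y * BLOCK + x];
        }
}

/* Each CPU core (OpenMP thread) handles its own subset of slices. */
void reconstruct(float *vol, const float *sino, int n_slices,
                 int n_pixels, int n_proj, int n_bins,
                 const float *cos_t, const float *sin_t, float axis)
{
    #pragma omp parallel for schedule(static)
    for (int s = 0; s < n_slices; s++)
        reconstruct_slice(vol  + (size_t)s * n_pixels * n_pixels,
                          sino + (size_t)s * n_proj * n_bins,
                          n_pixels, n_proj, n_bins, cos_t, sin_t, axis);
}
\end{verbatim}

Because the per-block accumulators are reused across all projections, the blocking factor can be chosen to match the L1 and L2 cache sizes of the target processor.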

%Using the half-float format to store the data is certainly useful as well.