\begin{table}[htb] %[htbp]
\begin{threeparttable}
\caption{\label{tbl:overs} Performance and optimal configuration of the ALU-based back-projection kernel using oversampling and nearest-neighbor interpolation}
\centering
\noindent
%\resizebox{\columnwidth}{!}{\begin{tabular}{} ... \end{\tabular}}
\begin{tabularx}{\columnwidth}{ | X c | r | l l l l l l | }
\hline
%& & & \multicolumn{5}{c|}{Configuration} \\
%\mhd{|c}{GPU} & \mhd{c|}{Slices} & \mhd{c|}{Perf.} & \mhd{c}{Area} & \mhd{c}{Blocks} & \mhd{c}{L1/SM} & \mhd{c}{CC} & \mhd{c|}{PaO} \\
& & \mhd{c|}{Perf} & \multicolumn{6}{c|}{Configuration} \\
\mhd{|c}{GPU} & \mhd{c|}{$n_v$} & GU/s & \mhd{c}{$n_q$} & \mhd{c}{C} & \mhd{c}{$s_t/s_d$} & \mhd{c}{U} & \mhd{c}{R} & \mhd{c|}{O} \\
%& & Perf. & Px. & Caches & Pr. & U & Rnd. & Occ. \\
\hline
\multirow{3}{*}{GTX580}
& 1 & 80 & 4 & 1 & 32 / 8 & - & SFU & 75\% \\
& 2 & 116 & 4 & 2 & 32 / 8 & - & SFU & 50\% \\
& 4 & 142 & 4 & 4 & 64 / 4 & 2 & SFU & 50\% \\
\hline
% In NN, we need 1/2 cache. The
\multirow{3}{*}{GTX680}
& 1 & 123 & 16 & 1 & 32 / 4\tnote{1} & 4 & ALU\tnote{2} & 50\% \\
& 2 & 160 & 8 & 1 & 32 / 4 & 2 & ALU & 50\% \\
& 4 & 165 & 4 & 2 & 64 / 4 & 2 & SFU & 50\% \\
\hline
\multirow{3}{*}{Titan}
& 1 & 195 & 16 & 1 & 32 / 4\tnote{1} & 4 & ALU\tnote{2} & 50\% \\
& 2 & 237 & 8 & 1 & 32 / 4 & 2 & ALU & 43\% \\
& 4 & 279 & 4 & 2 & 64 / 4 & 2 & SFU & 37\% \\
\hline
\multirow{3}{*}{GTX980}
& 1 & 218 & 16 & 1 & 32 / 8 & - & SFU & 50\% \\
& 2 & 269 & 16 & 2 & 64 / 4 & - & SFU & 50\% \\
& 4 & 292 & 4 & 4 & 64 / 4 & 2 & SFU & 50\% \\
\hline
\multirow{3}{*}{Titan X}
& 1 & 606 & 16 & 1 & 32 / 8 & - & SFU & 50\% \\
& 2 & 693 & 16 & 2 & 64 / 4 & - & SFU & 50\% \\
& 4 & 743 & 4 & 4 & 64 / 4 & 2 & SFU & 50\% \\
\hline
\multirow{3}{*}{HD5970}
& 1 & 63 & 16 & 1 & 32 / 8\tnote{1} & - & - & - \\
& 2 & 71 & 8 & 1 & 32 / 4 & - & - & - \\
& 4 & 73 & 8 & 2 & 32 / 4 & 2 & - & - \\
\hline
\multirow{3}{*}{HD7970}
& 1 & 178 & 16 & 1 & 32 / 8\tnote{1} & - & - & - \\
& 2 & 222 & 4 & 1 & 32 / 8 & - & - & - \\
& 4 & 233 & 4 & 2 & 64 / 4 & 2 & - & - \\
\hline
\multirow{3}{*}{R9-290}
& 1 & 219 & 16 & 1 & 32 / 8 & - & - & - \\
& 2 & 298 & 4 & 2 & 32 / 8 & - & - & - \\
& 4 & 384 & 4 & 4 & 64 / 4 & 2 & - & - \\
\hline
\end{tabularx}
\begin{tablenotes}
\item The table summarizes the performance and optimal configuration of the ALU-based back-projection kernel when oversampling and nearest-neighbor interpolation are used to update the values of reconstructed pixels. The configuration specifies: \tblcol{$n_q$} - the number of pixels per thread, \tblcol{C} - the number of separate arrays used to cache the sinogram (either a dedicated array stores each component of the sinogram vector, or two components are stored together to allow 64-bit writes), \tblcol{$s_t/s_d$} - the number of threads used to cache a projection row and the number of cached projections, \tblcol{U} - the unrolling hint for the inner projection loop, \tblcol{R} - the units used to perform rounding and type conversions (the index is always computed using the SFU), \tblcol{O} - the desired occupancy. The caches are configured as specified in \tablename~\ref{tbl:cacheconf}.
\item[1] Each GPU thread caches 2 values at once to enable 64-bit writes.
\item[2] The use of the SFU is also avoided when resolving array addresses, see \sectionname~\ref{section:alu_fancy}.
\end{tablenotes}
\end{threeparttable}
\end{table}