\begin{table}[htb] %[htbp]
\begin{threeparttable}
\caption{\label{tbl:alurec} Performance and configuration of the ALU-based back-projection kernel}
\centering
\noindent
%\resizebox{\columnwidth}{!}{\begin{tabular}{} ... \end{\tabular}}
\begin{tabularx}{\columnwidth}{ | X c | r r | l l l l l | }
\hline
%& & & \multicolumn{5}{c|}{Configuration} \\
%\mhd{|c}{GPU} & \mhd{c|}{Slices} & \mhd{c|}{Perf.} & \mhd{c}{Area} & \mhd{c}{Blocks} & \mhd{c}{L1/SM} & \mhd{c}{CC} & \mhd{c|}{PaO} \\
& & \multicolumn{2}{c|}{Perf. (GU/s)} & \multicolumn{5}{c|}{Configuration} \\
\mhd{|c}{GPU} & \mhd{c|}{$n_v$} & \mhd{c}{Lin} & \mhd{c|}{NN} & \mhd{c}{$n_q$} & \mhd{c}{$s_d$} & \mhd{c}{U} & \mhd{c}{R} & \mhd{c|}{O} \\
%& & Lin. & NN & Px. & Pr. & U & Rnd. & Occ. \\
\hline
\multirow{3}{*}{GTX580}
& 1 & 80 & 120 & 4 & 16 & - & SFU & 75\% \\
& 2 & 113 & 188 & 4 & 16 & - & SFU & 50\% \\
& 4 & 142 & 247 & 4 & 8\tnote{2} & - & SFU & 50\% \\
\hline
% In NN, we need 1/2 cache. Therefore, 4x4 area.
\multirow{3}{*}{GTX680}
& 1 & 123 & 195 & 8\tnote{3} & 8\tnote{4} & 4 & ALU & 50\% \\
& 2 & 160 & 290 & 8 & 8 & 2 & ALU & 50\% \\
& 4 & 165 & 306 & 4 & 8 & 2 & SFU & 50\% \\
\hline
\multirow{3}{*}{Titan}
& 1 & 195 & 268 & 8\tnote{3} & 8\tnote{4} & 4 & ALU & 50\% \\
& 2 & 237 & 429 & 8 & 8 & 2 & ALU & 50\% \\
& 4 & 278 & 471 & 4 & 8 & 2 & SFU & 50\% \\
\hline
\multirow{3}{*}{GTX980}
& 1 & 218 & 452 & 16 & 8 & - & SFU &100\%\tnote{5} \\
& 2 & 269 & 510 & 16 & 8 & - & SFU & 50\% \\
& 4\tnote{1} & 292 & 567 & 4 & 16 & - & ALU & 50\% \\
\hline
\multirow{3}{*}{Titan X}
& 1 & 606 & 1161 & 16 & 8 & - & SFU &100\%\tnote{5} \\
& 2 & 692 & 1328 & 16 & 8 & - & SFU & 50\% \\
& 4\tnote{1} & 743 & 1405 & 4 & 16 & - & ALU & 50\% \\
\hline
\multirow{3}{*}{HD5970}
& 1 & 63 & 116 & 16 & 8\tnote{3} & - & - & - \\
& 2 & 71 & 146 & 8 & 16 & - & - & - \\
& 4 & 73 & 160 & 8 & 8 & - & - & - \\
\hline
\multirow{3}{*}{HD7970}
& 1 & 178 & 290 & 16 & 8\tnote{3} & - & - & - \\
& 2 & 221 & 430 & 4\tnote{3} & 16\tnote{6} & - & - & - \\
& 4 & 233 & 450 & 4 & 8 & - & - & - \\
\hline
\multirow{3}{*}{R9-290}
& 1 & 219 & 341 & 16 & 8 & - & - & - \\
& 2 & 298 & 582 & 4\tnote{3} & 16\tnote{6} & - & - & - \\
& 4 & 383 & 635 & 4 & 16 & - & - & - \\
\hline
\end{tabularx}
\begin{tablenotes}
\item The table summarizes the performance and optimal configuration of the ALU-based back-projection kernel. \tblcol{$n_v$} is the number of slices processed simultaneously. The performance is reported for the linear (\tblcol{Lin}) and nearest-neighbor (\tblcol{NN}) interpolation modes. The configuration specifies: \tblcol{$n_q$} -- the number of pixels per thread, \tblcol{$s_d$} -- the number of cached projections, \tblcol{U} -- the unrolling hint for the inner projection loop, \tblcol{R} -- the units performing rounding and type conversions (the index is always computed using the SFU), and \tblcol{O} -- the target occupancy. The caches are configured as specified in \tablename~\ref{tbl:cacheconf}. The number of threads used to cache a projection row is determined according to the guidelines in \tablename~\ref{tbl:shmemconf}.
\item[1] The configuration and performance are given for the half-float data representation. The half-float values are also cached in shared memory.
\item[2] Because of the reduced shared-memory requirements, 16 projections are cached in the nearest-neighbor interpolation mode.
\item[3] A larger $64\times64$ area is reconstructed if nearest-neighbor interpolation is performed, and 16 pixels are assigned to each GPU thread.
\item[4] In the nearest-neighbor interpolation mode, each GPU thread caches 2 values at once to enable 64-bit writes. Consequently, only 16 threads are used per projection row, and 16 projections are cached to utilize all threads.
\item[5] A 50\% occupancy is targeted in the nearest-neighbor interpolation mode.
\item[6] Since a $64\times64$ pixel area is assigned to each thread block in the nearest-neighbor interpolation mode, 32 threads are used per projection row and only 8 projections are cached.
\end{tablenotes}
\end{threeparttable}
\end{table}
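
The projection counts in notes 3, 4, and 6 of \tablename~\ref{tbl:alurec} can be cross-checked with a short calculation. This is a sketch under stated assumptions rather than a description of the kernel code: it assumes, as notes 4 and 6 suggest, that all threads of a block take part in caching the projection rows and that the cached values are 32-bit. The symbols $n_t$ (threads per block) and $t_r$ (threads per projection row) are introduced here only for illustration.
\begin{align*}
n_t &= \frac{64 \times 64~\text{pixels}}{16~\text{pixels per thread}} = 256, \\
s_d &= \frac{n_t}{t_r} = \frac{256}{16} = 16 && \text{(note 4: $t_r = 16$, each thread writing $2 \times 32 = 64$ bits)}, \\
s_d &= \frac{n_t}{t_r} = \frac{256}{32} = 8 && \text{(note 6: $t_r = 32$)}.
\end{align*}
Both results match the 16 and 8 cached projections stated in notes 4 and 6, which is consistent with the $s_d$ cached projection rows being spread over all $n_t$ threads of the block.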