\item Mention that 128-bit loads cause bank conflicts on Fermi and AMD, but not on newer architectures (again \sectionname~\ref{section:alurec_shmem}; benchmark?).
\item Describe the test and show numbers illustrating that, as with texture fetches, the number of shared memory transactions is reduced if groups of 4 consecutive threads request less data. The practical effect is discussed at the end of \sectionname~\ref{section:alurec_shmem}.
\item Optional: Discuss constant memory and the L1 and L1.5 caches. Not only is the latency lower, but Kepler also has double the bandwidth from L1. Present a chart showing constant memory (CMem) performance vs. size (a standard cache-study plot). If I remember correctly, we can also cite a paper for this. If skipped, remove a few sentences from \sectionname~\ref{section:newtex_ld64}; they are not that important for the text flow.
Each SM is equipped with several types of execution units and can schedule multiple operations in parallel. The ALU is the most common element of the architecture. All ALUs of an SM are organized into several SIMD sets which can be used independently. There are fewer units of other types, but they can also execute independent workloads in parallel with the ALUs. An SM includes one or more \emph{warp schedulers} which issue instructions of the resident warps to the available SIMD units. In each clock cycle, a scheduler can either issue a single instruction or \emph{dual-issue} two independent instructions from the same warp. Normally, there are enough schedulers to fully utilize the available ALUs even if each scheduler issues only a single instruction per cycle. The SM of the Kepler generation, however, includes 6 sets of ALUs but only 4 warp schedulers~\cite{nvidia2012gk110}. Consequently, dual-issue is strictly required for optimal performance. The VLIW architecture used in older AMD GPUs has a significantly different scheduling model and requires 4 or 5 independent instructions in the instruction stream for optimal performance~\cite{zhang2011ati}.
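
The dual-issue requirement on Kepler can be made concrete with a rough estimate. As a simplifying assumption (introduced here for illustration only), suppose that each of the $N_{\mathrm{set}} = 6$ ALU sets accepts one warp instruction per clock and that the $N_{\mathrm{sched}} = 4$ schedulers are the only source of instructions. The average issue rate $r$ that each scheduler must sustain to keep all ALU sets busy is then
\[
  r = \frac{N_{\mathrm{set}}}{N_{\mathrm{sched}}} = \frac{6}{4} = 1.5 \quad \mbox{instructions per clock}.
\]
If a fraction $d$ of the issue slots is dual-issued, the average rate is $r = (1 - d) + 2d = 1 + d$, so full ALU utilization needs $d \ge 0.5$. Under these assumptions, at least every second issue slot has to carry two independent ALU instructions, and single-issue alone reaches at most $4/6 \approx 67\%$ of the peak ALU throughput.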
%Even if schedulers are able to utilize all ALU elements without relying on dual-issue, only the ability to intermix different types of operations in the flow and make them independent of each other would allow to load all available types of SIMD units.