Configuration
=============

1. For CELL, set MAX_PPU to 0 and undef MAX_SPU; the PPUs are
   too slow to be used (see the sketch after this list).
2. For x86, the Intel Math Kernel Library is fine; for the PPU,
   Goto is best (but still too slow). The reference designs are
   a bit slower if compiled with a recent gcc.
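
A minimal sketch of item 1, assuming MAX_PPU and MAX_SPU are
ordinary compile-time macros in a build header; the CELL guard
and the x86 value are illustrative assumptions, not taken from
the mrses sources:

    /* hypothetical build header; only the MAX_PPU / MAX_SPU lines
       follow the README, the rest is an assumed skeleton */
    #ifdef CELL                /* building for the Cell processor */
    # define MAX_PPU 0         /* PPUs are too slow: no PPU workers */
    # undef  MAX_SPU           /* left undefined, as the README says */
    #else                      /* plain x86 build */
    # define MAX_PPU 4         /* example number of CPU worker threads */
    #endif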

Expectations
============

1. SPUs are limited by the local store (local memory), which
   is only 256 KB. The application uses width * (nA + nB)
   (plus alignment corrections) for the data buffer and some
   amount of temporary buffers, mainly dependent on the width
   (see the budget sketch after this list).
2. properties > width ;)
3. Pointers between the PPU and the SPUs are transferred as
   32-bit integers. For this reason it is safer to compile
   the PPU application as a 32-bit binary (see the pointer
   sketch after this list).
4. Calls to mrses_iterate with NULL and non-NULL ires should
   not be mixed.
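
A rough sanity check of the local-store budget from item 1;
the element type (float) and the size of the temporary area
are illustrative assumptions, only the width * (nA + nB) term
comes from the text above:

    #include <stdio.h>
    #include <stddef.h>

    #define LOCAL_STORE (256u * 1024u)            /* SPU local store, bytes */

    int main(void)
    {
        size_t width = 16, nA = 1024, nB = 1024;  /* example problem sizes */

        size_t data = width * (nA + nB) * sizeof(float); /* data buffer */
        size_t temp = 8 * width * sizeof(float);         /* assumed temporaries */
        size_t need = data + temp;                       /* + alignment slack */

        printf("estimated local store usage: %zu of %u bytes (%s)\n",
               need, LOCAL_STORE, need <= LOCAL_STORE ? "fits" : "too large");
        return 0;
    }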
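
A sketch of why item 3 suggests a 32-bit PPU binary: effective
addresses handed to the SPUs are stored in 32-bit fields, so a
64-bit pointer could be silently truncated. The structure below
is hypothetical, not the actual mrses layout:

    #include <stdint.h>
    #include <assert.h>

    typedef struct {
        uint32_t data_ea;   /* effective address passed to the SPU (32 bit) */
        uint32_t size;
    } spu_args_t;           /* hypothetical argument block */

    static void set_data_pointer(spu_args_t *args, void *ptr)
    {
        uintptr_t ea = (uintptr_t)ptr;

        /* always true in a 32-bit binary; in a 64-bit binary the address
           may exceed 32 bits and would otherwise be truncated silently */
        assert(ea <= UINT32_MAX);
        args->data_ea = (uint32_t)ea;
    }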

ToDo
====

1. SPUs have 128 registers. These registers are used for the
   matrix multiplication, but it would be nice to optimize
   the Cholesky decomposition, etc. in the same way (see the
   register-blocking sketch after this list).

2. The vectorizations used for the SPU can be migrated to the
   PPU and to the Intel architecture.

3. The SPU is dual issue: memory accesses and operations can
   be performed in parallel if properly aligned (no code
   reordering is done by the SPU itself).

4. DMA is asynchronous; interleaving computation and memory
   transfers would make the transfer time negligible (see the
   double-buffering sketch after this list).

5. It is not clear why the PPU is about 10 times slower than
   Intel at the same clock speed, whether by design or
   because something is completely wrong. The 256 KB cache
   should not be the problem.

6. If the last question is resolved, it would be nice to move
   the histogram computation to the SPE.

7. On a hyperthreading server, the computation per thread is
   approximately 2 times slower (in total it is still OK).
   Even if the number of used PPUs is decreased, it remains
   slower. Somehow the processes are not bound to a specific
   core but migrate back and forth, which probably causes the
   slowdown... Needs more investigation overall (see the
   affinity sketch after this list).

8. Replace the matrix multiplication with vector-to-matrix
   multiplication in the PPE.

9. Somehow interleave operations in iterate mode when ires is
   supplied.
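
A sketch of the register-blocking idea from ToDo item 1: a
small block of the result is kept in SPU registers and
accumulated with fused multiply-adds. Row-major layout, a 4x4
block and 16-byte alignment are assumptions; this is not the
mrses kernel:

    #include <spu_intrinsics.h>

    /* C[4][4] += A[4][K] * B[K][4]; A has row stride K, B and C have
       row stride 4, all buffers 16-byte aligned */
    void mm_block_4x4(const float *A, const float *B, float *C, int K)
    {
        vec_float4 *c = (vec_float4 *)C;
        vec_float4 c0 = c[0], c1 = c[1], c2 = c[2], c3 = c[3];
        int k;

        for (k = 0; k < K; k++) {
            vec_float4 b = *(const vec_float4 *)(B + 4 * k);  /* row k of B */

            /* one fused multiply-add per block row, all kept in registers */
            c0 = spu_madd(spu_splats(A[0 * K + k]), b, c0);
            c1 = spu_madd(spu_splats(A[1 * K + k]), b, c1);
            c2 = spu_madd(spu_splats(A[2 * K + k]), b, c2);
            c3 = spu_madd(spu_splats(A[3 * K + k]), b, c3);
        }

        c[0] = c0; c[1] = c1; c[2] = c2; c[3] = c3;
    }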
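
A sketch related to ToDo item 7: pinning each worker thread to
a fixed core prevents the migration suspected above. This uses
the GNU/Linux affinity API; how thread numbers map to cores is
an assumption:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* bind the calling thread to one core so the scheduler cannot move it */
    static int pin_to_core(int core)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    /* e.g. call pin_to_core(thread_id) at the start of each worker thread */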