Configuration
=============

1. For CELL, set MAX_PPU to 0 and undef MAX_SPU; the PPUs are
   too slow to be used (see the sketch after this list).
2. For x86, the Intel Math Kernel Library is fine; for the PPU,
   Goto is best (but still too slow). The reference designs are
   a bit slower if compiled with a recent gcc.
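
A minimal sketch of item 1, assuming MAX_PPU and MAX_SPU are
ordinary compile-time macros in a build header; the CELL guard
and the x86 value are illustrative assumptions, not taken from
the mrses sources:

    /* hypothetical build header; only the MAX_PPU / MAX_SPU lines
       follow the README, the rest is an assumed skeleton */
    #ifdef CELL                /* building for the Cell processor */
    # define MAX_PPU 0         /* PPUs are too slow: no PPU workers */
    # undef  MAX_SPU           /* left undefined, as the README says */
    #else                      /* plain x86 build */
    # define MAX_PPU 4         /* example number of CPU worker threads */
    #endif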

Expectations
============

1. SPUs are limited by the local store (local memory), which
   is only 256 KB. The application uses width * (nA + nB)
   (plus alignment corrections) for the data buffer and some
   amount of temporary buffers, mainly dependent on the width
   (see the budget sketch after this list).
2. properties > width ;)
3. Pointers between the PPU and the SPUs are transferred as
   32-bit integers. For this reason it is safer to compile
   the PPU application as a 32-bit binary (see the pointer
   sketch after this list).
4. Calls to mrses_iterate with NULL and non-NULL ires should
   not be mixed.
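
A rough sanity check of the local-store budget from item 1;
the element type (float) and the size of the temporary area
are illustrative assumptions, only the width * (nA + nB) term
comes from the text above:

    #include <stdio.h>
    #include <stddef.h>

    #define LOCAL_STORE (256u * 1024u)            /* SPU local store, bytes */

    int main(void)
    {
        size_t width = 16, nA = 1024, nB = 1024;  /* example problem sizes */

        size_t data = width * (nA + nB) * sizeof(float); /* data buffer */
        size_t temp = 8 * width * sizeof(float);         /* assumed temporaries */
        size_t need = data + temp;                       /* + alignment slack */

        printf("estimated local store usage: %zu of %u bytes (%s)\n",
               need, LOCAL_STORE, need <= LOCAL_STORE ? "fits" : "too large");
        return 0;
    }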
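
A sketch of why item 3 suggests a 32-bit PPU binary: effective
addresses handed to the SPUs are stored in 32-bit fields, so a
64-bit pointer could be silently truncated. The structure below
is hypothetical, not the actual mrses layout:

    #include <stdint.h>
    #include <assert.h>

    typedef struct {
        uint32_t data_ea;   /* effective address passed to the SPU (32 bit) */
        uint32_t size;
    } spu_args_t;           /* hypothetical argument block */

    static void set_data_pointer(spu_args_t *args, void *ptr)
    {
        uintptr_t ea = (uintptr_t)ptr;

        /* always true in a 32-bit binary; in a 64-bit binary the address
           may exceed 32 bits and would otherwise be truncated silently */
        assert(ea <= UINT32_MAX);
        args->data_ea = (uint32_t)ea;
    }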

ToDo
====

1. SPUs have 128 registers. These registers are used for the
   matrix multiplication, but it would be nice to optimize
   the Cholesky decomposition, etc. in the same way (see the
   register-blocking sketch after this list).

2. The vectorizations used for the SPU can be migrated to the
   PPU and to the Intel architecture.

3. The SPU is dual issue: memory accesses and operations can
   be performed in parallel if properly aligned (no code
   reordering is done by the SPU itself).

4. DMA is asynchronous; interleaving computation and memory
   transfers would make the transfer time negligible (see the
   double-buffering sketch after this list).

5. It is not clear why the PPU is about 10 times slower than
   Intel at the same clock speed, whether by design or
   because something is completely wrong. The 256 KB cache
   should not be the problem.

6. If the last question is resolved, it would be nice to move
   the histogram computation to the SPE.

7. On a hyperthreading server, the computation per thread is
   approximately 2 times slower (in total it is still OK).
   Even if the number of used PPUs is decreased, it remains
   slower. Somehow the processes are not bound to a specific
   core but migrate back and forth, which probably causes the
   slowdown... Needs more investigation overall (see the
   affinity sketch after this list).

8. Replace the matrix multiplication with vector-to-matrix
   multiplication in the PPE.

9. Somehow interleave operations in iterate mode when ires is
   supplied.
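
A sketch of the register-blocking idea from ToDo item 1: a
small block of the result is kept in SPU registers and
accumulated with fused multiply-adds. Row-major layout, a 4x4
block and 16-byte alignment are assumptions; this is not the
mrses kernel:

    #include <spu_intrinsics.h>

    /* C[4][4] += A[4][K] * B[K][4]; A has row stride K, B and C have
       row stride 4, all buffers 16-byte aligned */
    void mm_block_4x4(const float *A, const float *B, float *C, int K)
    {
        vec_float4 *c = (vec_float4 *)C;
        vec_float4 c0 = c[0], c1 = c[1], c2 = c[2], c3 = c[3];
        int k;

        for (k = 0; k < K; k++) {
            vec_float4 b = *(const vec_float4 *)(B + 4 * k);  /* row k of B */

            /* one fused multiply-add per block row, all kept in registers */
            c0 = spu_madd(spu_splats(A[0 * K + k]), b, c0);
            c1 = spu_madd(spu_splats(A[1 * K + k]), b, c1);
            c2 = spu_madd(spu_splats(A[2 * K + k]), b, c2);
            c3 = spu_madd(spu_splats(A[3 * K + k]), b, c3);
        }

        c[0] = c0; c[1] = c1; c[2] = c2; c[3] = c3;
    }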
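
A sketch related to ToDo item 7: pinning each worker thread to
a fixed core prevents the migration suspected above. This uses
the GNU/Linux affinity API; how thread numbers map to cores is
an assumption:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* bind the calling thread to one core so the scheduler cannot move it */
    static int pin_to_core(int core)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    /* e.g. call pin_to_core(thread_id) at the start of each worker thread */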