/ani/mrses : revision 1

To get this branch, use:

bzr branch
http://suren.me/webbzr/ani/mrses


Viewing changes to cell/README

Committer: Suren A. Chilingaryan
Date: 2010-04-28 04:30:08 UTC
Revision ID: csa@dside.dyndns.org-20100428043008-vd9z0nso9axezvlp

Initial import

files added:

README

VERSION

bmc.m

cell

cell/Makefile

cell/Makefile.in

cell/README

cell/atlas_potrf.c

cell/atlas_potrf.h

cell/buildutils

cell/buildutils/README_build_env.txt

cell/buildutils/cellsdk_select_compiler

cell/buildutils/make.env

cell/buildutils/make.footer

cell/buildutils/make.header

cell/ext

cell/ext/blas.h

cell/ext/cblas.h

cell/ext/lapack.h

cell/ext/lapack_errno.h

cell/hw_sched.c

cell/hw_sched.h

cell/hw_thread.c

cell/hw_thread.h

cell/mrses.h

cell/mrses_hw.c

cell/mrses_impl.c

cell/mrses_impl.h

cell/mrses_ppu.c

cell/mrses_ppu.h

cell/mrses_spe.c

cell/mrses_spe.h

cell/mrses_spe.txt

cell/mrses_spu.c

cell/mrses_spu.h

cell/msg.c

cell/msg.h

cell/ppu

cell/ppu/Makefile

cell/spu

cell/spu/Makefile

cell/test

cell/test.c

cell/test/Makefile

cell/tools.c

cell/tools.h

cell/vec_potrf.c

cell/vec_potrf.h

cell/vec_potrf.pl

cell/vec_potrf_mtxmul.h

libs

libs/ppu

libs/ppu/blas_LINUX.a

libs/ppu/lapack_LINUX.a

libs/ppu/libgslblas.a

libs/ppu/libgslcblas.a

libs/spu

libs/spu.txt

libs/spu/blas_SPE.a

libs/spu/lapack_SPE.a

libs/spu/libgslblas.a

libs/spu/libgslcblas.a

mrses.m

mrses_hw_debug.m

mrses_hw_distance.m

mrses_mtx.m

mrses_orig.m

mrses_software.m

release.sh

scripts

scripts/mrses_install.sh

test.m

test.sh


cell/README

Configuration
=============

1. For CELL, set MAX_PPU to 0 and undef MAX_SPU; the PPUs are
   too slow to be used.
2. For x86, the Intel Math Kernel Library is fine; for PPU,
   Goto is best (but still too slow). The reference designs
   are a bit slower if compiled with a recent gcc.
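
A minimal sketch of the CELL setting from item 1. The README does not say
where MAX_PPU and MAX_SPU are defined, so the placement (a project header
or compiler flags) is an assumption, and so is the reading that an
undefined MAX_SPU leaves the SPU count unrestricted.

```c
/* Assumed placement: wherever the build defines these limits. */
#undef MAX_PPU
#define MAX_PPU 0   /* no computation on the PPUs: they are too slow */

#undef MAX_SPU      /* presumably: do not cap the number of SPUs */
```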

Expectations
============

1. SPUs are limited by their local store (local memory), which
   is only 256 KB. The application uses width * (nA + nB) (+
   alignment corrections) for the data buffer plus a number of
   temporary buffers whose size depends mainly on width.
2. properties > width ;)
3. Pointers between PPU and SPU are transferred as 32-bit
   integers. For this reason it is safer to compile a PPU
   application as a 32-bit binary.
4. Calls to mrses_iterate with NULL and non-NULL ires should
   not be mixed.
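
Items 1 and 3 can be checked up front. The sketch below is only an
illustration: all function names are hypothetical, the element size,
the per-row padding rule and the `reserved` amount for code, stack and
temporary buffers are assumptions (the README gives only the
width * (nA + nB) term and the 256 KB limit).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPU_LOCAL_STORE_BYTES ((size_t)256 * 1024)  /* from the README */

/* Round n up to a multiple of align (align must be a power of two). */
static size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Assumed layout: (nA + nB) rows of `width` elements, each row padded
 * to `align` bytes. The real alignment corrections may differ. */
static size_t data_buffer_bytes(size_t width, size_t nA, size_t nB,
                                size_t elem_size, size_t align)
{
    return round_up(width * elem_size, align) * (nA + nB);
}

/* Returns 1 if the data buffer plus `reserved` bytes (code, stack,
 * temporary buffers) fits into the local store, 0 otherwise. */
static int fits_local_store(size_t width, size_t nA, size_t nB,
                            size_t elem_size, size_t align, size_t reserved)
{
    size_t need = data_buffer_bytes(width, nA, nB, elem_size, align);
    return need <= SPU_LOCAL_STORE_BYTES &&
           reserved <= SPU_LOCAL_STORE_BYTES - need;
}

/* Returns 1 and stores the value if the pointer fits into 32 bits,
 * 0 otherwise. In a 32-bit PPU binary this can never fail. */
static int ptr_to_u32(const void *p, uint32_t *out)
{
    uintptr_t v = (uintptr_t)p;
    if (v > UINT32_MAX)
        return 0;
    *out = (uint32_t)v;
    return 1;
}
```

With 4-byte elements and 16-byte alignment, width = 10 pads each row from
40 to 48 bytes, so nA = nB = 100 needs 9600 bytes for the data buffer.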

ToDo
====

1. SPUs have 128 registers. I have used these registers for
   matrix multiplication, but it would be nice to optimize
   the Cholesky decomposition, etc. in the same way.
2. The vectorizations used for SPU can be migrated to PPU
   and Intel architecture.
3. The SPU is dual-issue: memory access and operations can be
   performed in parallel if properly aligned (no code
   reordering is supported by the SPU).
4. DMA is asynchronous; interleaving computations and memory
   transfers would make the transfer time negligible.

5. It is not clear why the PPUs are 10 times slower than Intel
   at the same clock speed: either it is by design or something
   is completely wrong. The 256 KB cache should be no problem.
6. If the previous question is resolved, it would be nice to
   move the histogram computation to SPE.
7. On a hyperthreading server, the computation per thread is
   approx. 2 times slower (in sum still OK). Even if you
   decrease the number of PPUs used, it would be slower anyway.
   Somehow processes are not bound to a certain core but
   migrate here and there, and this probably causes
   slowdowns... Needs more investigation overall.

8. Replace matrix multiplication with vector-to-matrix
   multiplication in the PPE.
9. Somehow interleave operations in iterate mode when ires
   is supplied.
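
The interleaving proposed in item 4 is usually done with double
buffering. The sketch below shows only the control flow and is not code
from this project: fetch_chunk, sum_double_buffered and CHUNK are
hypothetical names, and memcpy stands in for the asynchronous transfer
so that the example runs anywhere. On an SPU the fetch would be a
non-blocking DMA get and the marked point would wait on its tag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* elements per transfer; tiny, for illustration only */

/* Stand-in for an asynchronous DMA get. Here the copy completes at
 * once; on an SPU the call would return before the data arrives. */
static void fetch_chunk(float *local, const float *remote, size_t n)
{
    if (n > 0)
        memcpy(local, remote, n * sizeof *local);
}

/* Sum n floats from `remote`, requesting chunk i+1 before chunk i
 * is processed, so that transfer and computation could overlap. */
static float sum_double_buffered(const float *remote, size_t n)
{
    float buf[2][CHUNK];
    size_t len[2] = {0, 0};
    size_t off;
    float total = 0.0f;
    int cur = 0;

    assert(remote != NULL || n == 0);

    /* Prime the pipeline with the first chunk. */
    len[cur] = n < CHUNK ? n : CHUNK;
    fetch_chunk(buf[cur], remote, len[cur]);
    off = len[cur];

    while (len[cur] > 0) {
        int next = cur ^ 1;

        /* Request the next chunk into the idle buffer first. */
        len[next] = (n - off) < CHUNK ? (n - off) : CHUNK;
        fetch_chunk(buf[next], remote + off, len[next]);
        off += len[next];

        /* Real hardware: wait here until buf[cur] has arrived. */
        for (size_t i = 0; i < len[cur]; i++)
            total += buf[cur][i];

        cur = next;
    }
    return total;
}
```

The two buffers double the local-store cost of each transfer, so the
chunk size has to be chosen against the 256 KB limit.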
