Bugs
----
 1) Reading all projections at once on systems with a large amount of memory is
 extremely slow. Besides, it crashes on this line, around line 1457:
     BIG_SINOS_natural_ordering[i_pro-Parameters.NUM_FIRST_IMAGE , 0 :throw_end+1 -throw_start ] = newitem
 Actually, the line can easily be commented out; the slowdown itself is in
 extract_edf, at ima.getData.
 2) In sinogram mode we read slices one by one; this is not compatible
    with the current multi-reconstructor processing.
 3) Filtering on the Intel platform
 4) It looks like there are some problems with the Apple FFT library: it sometimes returns invalid
    timings (the OpenCL timings are imprecise anyway). A related-or-unrelated problem is the poor
    performance of the GTX580 on the UFO server: the performance with the old and newer kernels
    differs very little, and it is slower than the GTX480 in my box (the timer problem exists on
    my box as well).
 5) Check dimensions (number of layers)
 6) Exiting (check_alloc, check_code) within a thread results in a segmentation fault. We should not
 exit while threads are running; instead, the error should be returned to the calling code until the
 threads are stopped, and we should exit only in a single-threaded context (see the error-propagation
 sketch after this list).
 7) dim_fft is computed twice as big as necessary (4096 to accommodate 2000, for instance). Do we
 really need this for better precision, or is it a bug? The current version from Alessandro increases
 it by another factor of two (to 8192) in order to perform "extra symmetric padding with fai360"
 (see the padding sketch after this list).
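
A minimal sketch of the error propagation proposed in bug 6, using hypothetical names
(worker, errors, stop_event): workers never call exit(); they report failures through a
queue, and the caller decides whether to terminate once it is back in a single-threaded
context.

    import threading
    import queue
    import sys

    errors = queue.Queue()          # collects (thread name, exception) pairs
    stop_event = threading.Event()  # asks the other workers to wind down

    def worker(name):
        try:
            # ... per-thread reconstruction work; it should poll stop_event periodically ...
            raise MemoryError("check_alloc failed")   # simulated failure
        except Exception as exc:
            errors.put((name, exc))   # report instead of calling exit()
            stop_event.set()          # let the remaining workers stop early

    threads = [threading.Thread(target=worker, args=("gpu%d" % i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                      # back in a single-threaded context

    if not errors.empty():
        name, exc = errors.get()
        print("reconstruction failed in %s: %s" % (name, exc), file=sys.stderr)
        sys.exit(1)                   # exit only after all threads have stopped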
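
For bug 7, a small sketch of the arithmetic (illustrative names and numbers only). Note
that padding the sinogram rows to roughly twice their width before the ramp filter is a
common way to avoid circular-convolution (wrap-around) artifacts, which may be the
"precision" reason behind the current size; the fai360 case then doubles the transform
once more.

    def next_pow2(n):
        """Smallest power of two >= n."""
        p = 1
        while p < n:
            p *= 2
        return p

    num_bins = 2000
    print(next_pow2(num_bins))          # 2048: smallest power of two holding the data
    print(next_pow2(2 * num_bins))      # 4096: the size currently computed (~2x padding)
    print(2 * next_pow2(2 * num_bins))  # 8192: Alessandro's version with fai360 padding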
    
Features
--------
1. Implement a pipeline: reading, preprocessing, reconstructing, storing; all
   in dedicated threads (see the pipeline sketch after this list).
   - Avoid serialization while writing the output file
   - Implement slice preloading into the GPU memory
2. Re-implement the preprocessing code currently in Python using SIMD and OpenMP.
   Clean unused stuff out of the Python code.
3. Try to use PTX assembler to optimize register usage in CUDA kernels.
4. Investigate data-compression and online compression to reduce usage of
   PCIe bandwidth. 
5. Try to use faster FFT implementations
6. Use NUMA libraries to use host memory more efficiently
7. Visualize reconstructed slices
8. Estimate the amount of memory needed for slice reading. Find a compromise
   between waits during readout and processing big bunches, which would allow
   us to use HYBRID mode (a rough estimate is sketched after this list).
9. Implement linear interpolation and oversampling in OpenCL/CPU mode
10.On multi-GPU systems CUDA initialization takes up to 10 s; to save this
   time we should implement some kind of CUDA daemon.
11.Merge new features from ESRF branch
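
A minimal sketch of the pipeline from feature 1, with hypothetical stage functions
(read_slice, preprocess, reconstruct, store): each stage runs in its own thread and
hands work to the next one through a bounded queue, so reading, preprocessing,
reconstruction and storage overlap instead of running one after another.

    import threading
    import queue

    DONE = object()  # sentinel marking the end of the stream

    def stage(func, inbox, outbox):
        """Run one pipeline stage in its own thread."""
        while True:
            item = inbox.get()
            if item is DONE:
                if outbox is not None:
                    outbox.put(DONE)
                break
            result = func(item)
            if outbox is not None:
                outbox.put(result)

    # Placeholders; the real functions would wrap EDF reading, preprocessing,
    # GPU reconstruction and writing of the output file.
    read_slice = preprocess = reconstruct = store = lambda x: x

    q_read, q_prep, q_rec, q_store = (queue.Queue(maxsize=4) for _ in range(4))
    stages = [
        threading.Thread(target=stage, args=(read_slice, q_read, q_prep)),
        threading.Thread(target=stage, args=(preprocess, q_prep, q_rec)),
        threading.Thread(target=stage, args=(reconstruct, q_rec, q_store)),
        threading.Thread(target=stage, args=(store, q_store, None)),
    ]
    for t in stages:
        t.start()
    for slice_number in range(16):
        q_read.put(slice_number)   # feed work into the first stage
    q_read.put(DONE)
    for t in stages:
        t.join()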
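
For feature 8, a back-of-the-envelope estimate with made-up figures: the bunch size is
bounded by how many float32 projections fit into the memory budget, and the readout wait
per bunch follows from the sustained disk bandwidth.

    # Illustrative numbers, not measurements.
    num_bins   = 2048                        # detector width (pixels)
    num_slices = 2048                        # detector height (pixels)
    proj_bytes = num_bins * num_slices * 4   # one float32 projection

    memory_budget = 8 * 1024**3              # bytes reserved for projection buffers
    disk_bw       = 500 * 1024**2            # sustained read bandwidth, bytes/s

    bunch = memory_budget // proj_bytes
    print("projections per bunch:", bunch)
    print("readout wait per bunch: %.1f s" % (bunch * proj_bytes / disk_bw))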

Felix
-----
1. Felix should find a center
2. Adjustable center of rotation
3. Divide slices equally over 180 degrees
4. Support HDF format

OpenCL
------
1. The counters are imprecise: I got a big difference between the time spent
in HST and the time calculated from the counters (even if RECON_BENCH is turned on).
2. The counters enforce a completely synchronous mode. That's bad and ugly;
we should use glib timers and asynchronous mode (see the profiling sketch after this list).
3. Implement multi-queue processing to interleave computations and transfers
4. Provide optional padding for aligned access
5. Support for optimized 2real/1complex FFT transform
6. Find a way to map cl_image to buffer without copying
7. Support linear interpolation in CPU mode. Find out why we go out
of range and need a specific clamping mode
8. Implement support for fai mode
9. Port the Mirrone patch supporting irregular angles.
10. Provide a faster CPU implementation
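
A sketch of one way to keep per-kernel timings without forcing a synchronous mode
(items 1 and 2). pyopencl is used purely for illustration; the same pattern applies to
the C API, where the glib timers mentioned above would fit. The idea: enqueue everything
asynchronously, keep the returned events, synchronize once at the end, and only then
read the profiling timestamps.

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

    prg = cl.Program(ctx, """
        __kernel void scale(__global float *buf, const float factor) {
            int i = get_global_id(0);
            buf[i] *= factor;
        }
    """).build()

    host = np.arange(1 << 20, dtype=np.float32)
    dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                    hostbuf=host)

    # Enqueue several kernels without waiting for any of them.
    events = [prg.scale(queue, host.shape, None, dev, np.float32(f))
              for f in (2.0, 0.5, 3.0)]

    queue.finish()   # a single synchronization point at the end

    for i, ev in enumerate(events):
        elapsed_ms = (ev.profile.end - ev.profile.start) * 1e-6
        print("kernel %d took %.3f ms on the device" % (i, elapsed_ms))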

Potential Problems
------------------
1. In the case of an odd number of projections we zero gpu_data during initialization.
However, the direct and inverse FFTs are then computed on each slice, which due to
precision errors could (or, more probably, could not) result in very wrong numbers
affecting the results of the paired Fourier transform (the pairing is sketched below).
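
A small numpy sketch of the paired (two-real-in-one-complex) Fourier transform referred
to above, also relevant to OpenCL item 5: two real rows are packed into one complex
signal, transformed once, and the two spectra are separated afterwards; with an odd
number of projections the partner of the last row is the zero-filled one.

    import numpy as np

    n = 2048
    a = np.random.rand(n)   # a real sinogram row
    b = np.zeros(n)         # the zero-filled partner row (odd projection count)

    c = a + 1j * b                         # pack two real rows into one complex row
    C = np.fft.fft(c)                      # a single complex FFT for both rows
    C_rev = np.conj(np.roll(C[::-1], 1))   # conj(C[(n - k) % n])

    A = (C + C_rev) / 2                    # spectrum of a
    B = (C - C_rev) / 2j                   # spectrum of b (zero, up to rounding)

    print(np.allclose(A, np.fft.fft(a)))   # True
    print(np.allclose(B, np.fft.fft(b)))   # True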