/normxcorr/trunk

To get this branch, use:
bzr branch http://suren.me/webbzr/normxcorr/trunk
Acceptable precision: 10^-3, 10^-5
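
One possible reading of the figures above is as tolerances when comparing
computed results against a reference implementation. The sketch below assumes
that reading. The helper name max_rel_error and the choice of a relative error
measure are illustrative assumptions, not taken from the project.

```c
#include <stddef.h>

/* Hypothetical helper: largest difference between a computed result and a
 * reference result. The difference is relative for reference magnitudes
 * above 1 and absolute otherwise, to avoid dividing by tiny numbers. */
static double max_rel_error(const float *result, const float *reference, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = (double)result[i] - (double)reference[i];
        double scale = (double)reference[i];
        if (diff < 0.0) diff = -diff;
        if (scale < 0.0) scale = -scale;
        double err = (scale > 1.0) ? diff / scale : diff;
        if (err > worst) worst = err;
    }
    return worst;
}
```

A caller would then compare the returned value against the chosen tolerance,
for example 1e-3 or 1e-5.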

Problems
========
 - CUDA 2.2 crashes in the find_max2 kernel; CUDA 2.3 and 3.0 work well.
 - gcc 4.3.4 (SuSE 11.2) has an optimization bug: with -O3 and thread
 support disabled (DICT_SUPPORT_THREADS), it computes wrong results. With -O2
 everything is fine.
 Actually, the strange effects start when the call to
 hw_schedule_task(..,dictLoadImageThread) is commented out (it is never
 executed anyway, but #ifdefing or commenting it out somehow affects the
 optimizer).


Limitations
===========


ToDo
====
 1. Implement Volkov's fast Fourier code for multiples of 2 for the 2D case.
 2. In CUDA 3.0, blocking of multiple 2D FFTs gives no performance benefit; see
    if this changes in future versions.
 3. When copying from the host to CUDA (fragment mode), the memory
    transfer is interleaved with computations. Unfortunately, in image mode
    the memory transfer is handled as a computation and no interleaving
    is possible. Therefore, in most cases the fragment mode is faster than
    the image mode.
 4. We can probably use the same buffer for cuda_base_buffer and
    cuda_data_buffer. The problem is that the extra space should be zeroed,
    and in the base buffer more data is filled. Another option is to unblock
    computations in load base (3D copy?); then we would need it not CP_BLOCK
    times, but just once.
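
The zeroing concern in item 4 can be illustrated on the host side. The sketch
below is plain C rather than CUDA, and the helper name load_zero_padded and its
parameters are hypothetical. It only shows the idea that when one buffer is
reused for fills of different sizes, everything outside the freshly copied
region must be cleared.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy a w x h image into a buf_w x buf_h row-major
 * buffer and zero everything outside the copied region, so that stale data
 * left by an earlier, larger fill cannot leak into later computations. */
static void load_zero_padded(float *buf, size_t buf_w, size_t buf_h,
                             const float *src, size_t w, size_t h)
{
    assert(w <= buf_w && h <= buf_h);
    for (size_t y = 0; y < h; y++) {
        float *row = buf + y * buf_w;
        memcpy(row, src + y * w, w * sizeof(float));      /* image data */
        memset(row + w, 0, (buf_w - w) * sizeof(float));  /* right padding */
    }
    /* Rows below the image are cleared in one call. */
    memset(buf + h * buf_w, 0, (buf_h - h) * buf_w * sizeof(float));
}
```

On the device, the same effect would need a memset or a dedicated kernel over
the padding region, which is the extra cost that has to be weighed against the
memory saved by sharing the buffer.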