/normxcorr/trunk

To get this branch, use:
bzr branch http://suren.me/webbzr/normxcorr/trunk
Acceptable precision: 10^-3, 10^-5

Problems
========
 - CUDA 2.2 crashes in the find_max2 kernel; CUDA 2.3 and 3.0 work well.
 - gcc 4.3.4 (SuSE 11.2) has an optimization bug: with -O3 and thread
 support disabled (DICT_SUPPORT_THREADS) it computes wrong results. With -O2
 everything is fine.
 Actually, the strange effects start when hw_schedule_task(..,dictLoadImageThread)
 is commented out (it is never executed anyway, but #ifdef-ing or commenting it
 out somehow affects the optimizer).
 - SLI mode should be disabled for multi-GPU support, otherwise the application
 will run extremely slowly. Under Windows this can be done in the PhysX properties.
 

MATLAB fixes for original version
=================================
    To compare speed with the original version, the following changes are
    necessary in ${MATLAB}/toolbox/images/images/cpcorr.m
    
    a) Increase CORRSIZE value from 5 to 15. (Line 76: CORRSIZE = 5;)
    b) Change
	input_fractional_offset = xyinput(icp,:) - round(xyinput(icp,:));
	base_fractional_offset = xybase_in(icp,:) - round(xybase_in(icp,:));
    to
	input_fractional_offset = xyinput(icp,:) - round(xyinput(icp,:)*1000)/1000;
	base_fractional_offset = xybase_in(icp,:) - round(xybase_in(icp,:)*1000)/1000;
    
    The code path selected by normxcorr2 (conv2 or FFT) can be seen by removing
    the semicolon from the end of the following lines in normxcorr2.m
	conv_time = time_conv2(T_size,A_size);
	fft_time = 3*time_fft2(outsize);
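
    As a rough illustration of what change (b) does numerically, here is a
    minimal Python sketch (not part of the project). The helper names
    matlab_round, fractional_offset_original and fractional_offset_patched are
    made up for this example, and matlab_round only assumes MATLAB's documented
    behaviour of rounding halves away from zero (Python's built-in round()
    rounds halves to even, so it is not used here).

```python
import math


def matlab_round(x):
    """Round half away from zero, mimicking MATLAB's round() (assumption)."""
    return math.copysign(math.floor(abs(x) + 0.5), x)


def fractional_offset_original(coord):
    # Original cpcorr.m expression: distance to the nearest integer pixel.
    return coord - matlab_round(coord)


def fractional_offset_patched(coord):
    # Patched expression: distance to the nearest multiple of 0.001.
    return coord - matlab_round(coord * 1000) / 1000


x = 12.3456
print(fractional_offset_original(x))  # roughly 0.3456
print(fractional_offset_patched(x))   # roughly -0.0004
```

    With the patched expression the offset magnitude never exceeds about
    0.0005, whereas the original expression yields values up to 0.5.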

ToDo
====
 1. Implement Volkov's fast Fourier transform code for multiples of 2 for the
    2D case.
 2. In CUDA 3.0, blocking of multiple 2D FFTs gives no performance benefit; see
    if this changes in future versions.
 3. When we copy from the host to CUDA (fragment mode), the memory transfer
    is interleaved with computations. Unfortunately, in image mode the memory
    transfer is handled as a computation and no interleaving is possible.
    Therefore, in most cases the fragment mode is faster than the image mode.
 4. We can probably use the same buffer for cuda_base_buffer and cuda_data_buffer.
    The problem is that the extra space should be zeroed, and in the base buffer
    more data is filled. Another option is to unblock computations in load base
    (3D copy?); then we would need it not CP_BLOCK times, but just once.
 5. Eliminate optimization modes below 3 (?) and provide options to switch
    threading on/off. Implement image preloading and multipass mode in Matlab.
 6. The normxcorr2 routine of Matlab implements 2 methods of cross-correlation
    computation: using ifft(fft * fft) and conv2(). Before Matlab 2007 the
    first one was faster for CORRSIZE=15 and, for that reason, it is used
    here. However, since Matlab 2007 some improvements were made to conv2
    and now it is significantly faster than the FFT approach. For the 2009b
    version it is 4 times faster (1.4 ms against 6.5 ms). Besides that, the
    direct conv2 computation is just additions and multiplications, which
    should perform better on a GPU. For these reasons, it makes sense to
    implement normxcorr2 using the second approach.
	c(x,y) = sum_all(sum_all(a(i,j)b(x-i,y-j)))

    Okay, that's actually wrong. I forgot to increase CORRSIZE to 15 in the image
    toolkit. In reality the FFT approach is slightly faster (74 ms against 96 ms),
    but multiplications may be better suited for GPU code and it makes sense
    to try. As a source, the following OpenCL example from AMD can be used:
    http://developer.amd.com/gpu/ATIStreamSDK/ImageConvolutionOpenCL/pages/ImageConvolutionUsingOpenCL.aspx
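
    For reference, the direct summation c(x,y) from item 6 can be sketched in
    a few lines of plain Python. This is only an illustration of the formula,
    not the project's code and not the normalized cross-correlation itself.
    The function name conv2_full is made up for this example, and it assumes
    the "full" output size (rows_a + rows_b - 1 by cols_a + cols_b - 1).

```python
def conv2_full(a, b):
    """Direct 2D convolution: c(x,y) = sum over i,j of a(i,j) * b(x-i, y-j).

    a and b are non-empty lists of equal-length rows. The result has the
    "full" size, (rows_a + rows_b - 1) x (cols_a + cols_b - 1).
    """
    rows_a, cols_a = len(a), len(a[0])
    rows_b, cols_b = len(b), len(b[0])
    out = [[0.0] * (cols_a + cols_b - 1) for _ in range(rows_a + rows_b - 1)]
    for i in range(rows_a):
        for j in range(cols_a):
            for k in range(rows_b):
                for m in range(cols_b):
                    # With x = i + k and y = j + m, b[k][m] is b(x-i, y-j).
                    out[i + k][j + m] += a[i][j] * b[k][m]
    return out


a = [[1, 2],
     [3, 4]]
b = [[1, 1],
     [1, 1]]
for row in conv2_full(a, b):
    print(row)
```

    Every output element is an independent sum of products, which is why
    this form looks like a natural fit for one GPU thread per output element.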