Acceptable precision: 10^-3, 10^-5
Problems
========
- CUDA 2.2 crashes in the find_max2 kernel; CUDA 2.3 and 3.0 work well.
- gcc 4.3.4 (SuSE 11.2) has an optimization bug: with -O3 and thread
support disabled (DICT_SUPPORT_THREADS) it computes wrong results. With -O2
everything is fine.
Strangely, the effects start when the hw_schedule_task(..,dictLoadImageThread)
call is commented out (it is never executed anyway, but #ifdef-ing or commenting
it out somehow affects the optimizer).
- SLI mode should be disabled for multi-GPU support; otherwise the application
will run extremely slowly. Under Windows this can be done in the PhysX properties.
MATLAB fixes for original version
=================================
To compare speed with the original version, the following changes are needed in
${MATLAB}/toolbox/images/images/cpcorr.m
a) Increase CORRSIZE value from 5 to 15. (Line 76: CORRSIZE = 5;)
b) Change
input_fractional_offset = xyinput(icp,:) - round(xyinput(icp,:));
base_fractional_offset = xybase_in(icp,:) - round(xybase_in(icp,:));
to
input_fractional_offset = xyinput(icp,:) - round(xyinput(icp,:)*1000)/1000;
base_fractional_offset = xybase_in(icp,:) - round(xybase_in(icp,:)*1000)/1000;
The selected computation path (conv2 or FFT) can be seen by removing the
semicolon from the end of the following lines in normxcorr2.m
conv_time = time_conv2(T_size,A_size);
fft_time = 3*time_fft2(outsize);
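As an illustration only (not part of cpcorr.m), the effect of change b) can be
sketched in Python. The helper names are hypothetical, and matlab_round is an
assumed emulation of MATLAB's round(), which rounds halves away from zero,
whereas Python's built-in round() rounds halves to even.

```python
import math


def matlab_round(x):
    """Round half away from zero, emulating MATLAB's round() (assumption)."""
    return math.copysign(math.floor(abs(x) + 0.5), x)


def fractional_offset_original(coord):
    # Original cpcorr.m expression: offset from the nearest whole pixel,
    # so the result lies within +/- 0.5.
    return coord - matlab_round(coord)


def fractional_offset_patched(coord):
    # Patched expression: offset from the coordinate rounded to 1/1000 pixel,
    # so the result lies within +/- 0.0005.
    return coord - matlab_round(coord * 1000) / 1000
```

With the patched expression the fractional offset of 10.3456 is about -0.0004
instead of about 0.3456.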
ToDo
====
1. Implement Volkov's fast Fourier code for multiples of 2 for the 2D case.
2. In CUDA 3.0, blocking of multiple 2D FFTs gives no performance benefit;
check whether this changes in future versions.
3. When copying from the host to CUDA (fragment mode), the memory
transfer is interleaved with computations. Unfortunately, in image mode
the memory transfer is handled as computation and no interleaving
is possible. Therefore, in most cases the fragment mode is faster than
image mode.
4. We can probably use the same buffer for cuda_base_buffer and cuda_data_buffer.
The problem is that the extra space must be zeroed, and more data is filled
into the base buffer. Another option is to unblock computations in load base
(3D copy?); then we would need it only once instead of CP_BLOCK times.
5. Eliminate optimization modes below 3 (?) and provide options to switch
threading on/off. Implement image preloading and multipass mode in Matlab.
6. The normxcorr2 routine of Matlab implements 2 methods of cross-correlation
computation: ifft(fft * fft) and conv2(). Before Matlab 2007 the first one
was faster for CORRSIZE=15 and, for that reason, it is used here. However,
since Matlab 2007 some improvements have been made to conv2, and it is now
significantly faster than the FFT approach. For version 2009b it is 4 times
faster (1.4 ms against 6.5 ms). Besides that, the direct conv2 computation
is just additions and multiplications, which should perform better on a GPU.
For these reasons, it makes sense to implement normxcorr2 using the second
approach.
c(x,y) = sum_all(sum_all(a(i,j)b(x-i,y-j)))
Okay, that's actually wrong. I forgot to increase CORRSIZE to 15 in the image
toolkit. In fact the FFT approach is slightly faster (74 ms against 96 ms), but
multiplications may be better suited for GPU code, so it makes sense
to try. As a source, the following OpenCL example from AMD can be used:
http://developer.amd.com/gpu/ATIStreamSDK/ImageConvolutionOpenCL/pages/ImageConvolutionUsingOpenCL.aspx
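A minimal sketch of the direct formula c(x,y) above in plain Python, as a
reference when porting to a GPU kernel. The function name conv2_full and the
list-of-lists layout are assumptions for illustration; it computes the full
(untrimmed) convolution and is neither the Matlab nor the AMD implementation.

```python
def conv2_full(a, b):
    """Direct full 2D convolution: c(x,y) = sum_i sum_j a(i,j) * b(x-i, y-j)."""
    ha, wa = len(a), len(a[0])
    hb, wb = len(b), len(b[0])
    # The full convolution has (ha + hb - 1) x (wa + wb - 1) output samples.
    out = [[0.0] * (wa + wb - 1) for _ in range(ha + hb - 1)]
    for i in range(ha):
        for j in range(wa):
            for k in range(hb):
                for l in range(wb):
                    # With x = i + k and y = j + l, b(x-i, y-j) is b(k, l).
                    out[i + k][j + l] += a[i][j] * b[k][l]
    return out
```

Only multiplications and additions are involved, and every output sample can be
computed independently of the others.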