Actually, strange effects start if hw_schedule_task(..,dictLoadImageThread)
is commented out (it is never executed anyway, but #ifdef-ing or commenting it out somehow
- SLI mode should be disabled for multi-GPU support, otherwise the application
  will run extremely slowly. Under Windows this can be done in the PhysX properties.

MATLAB fixes for original version
=================================
To compare speed with the original version, the following changes are needed in
${MATLAB}/toolbox/images/images/cpcorr.m:

a) Increase CORRSIZE value from 5 to 15. (Line 76: CORRSIZE = 5;)

input_fractional_offset = xyinput(icp,:) - round(xyinput(icp,:));
base_fractional_offset = xybase_in(icp,:) - round(xybase_in(icp,:));

input_fractional_offset = xyinput(icp,:) - round(xyinput(icp,:)*1000)/1000;
base_fractional_offset = xybase_in(icp,:) - round(xybase_in(icp,:)*1000)/1000;
the selected path can be seen by removing the semicolon from the end of
the following lines in normxcorr2.m:
conv_time = time_conv2(T_size,A_size);
fft_time = 3*time_fft2(outsize);
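
The two forms of the fractional offset listed above for cpcorr.m differ only in
what the coordinate is rounded to: the nearest whole pixel, or the nearest
0.001 pixel. The sketch below is an illustration in Python, not code from
cpcorr.m. The helper names and the sample coordinate are made up, and
matlab_round() is an assumed stand-in for MATLAB's round(), which rounds
halves away from zero (Python's built-in round() does not).

```python
import math

def matlab_round(x):
    """Round half away from zero, mimicking MATLAB's round() (assumed helper)."""
    return math.copysign(math.floor(abs(x) + 0.5), x)

def offset_integer(x):
    # x - round(x): offset of the coordinate from the nearest whole pixel
    return x - matlab_round(x)

def offset_millipixel(x):
    # x - round(x*1000)/1000: what remains below a resolution of 0.001 pixel
    return x - matlab_round(x * 1000) / 1000

x = 12.3456  # hypothetical control point coordinate
print(offset_integer(x))     # close to 0.3456
print(offset_millipixel(x))  # close to -0.0004
```

With the first form the offset can be as large as half a pixel, with the
second it stays below 0.0005 pixel in magnitude.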
the problem the extra space should be zeroed, and more data is filled in the
base buffer. Another option is to unblock the computations in load base (3D
copy?); then we would need it not CP_BLOCK times, but just once.
5. Eliminate optimization modes below 3 (?) and provide options to switch
threading on/off. Implement image preloading and multipass mode in MATLAB.
6. The normxcorr2 routine of MATLAB implements 2 methods of cross-correlation
computation: using ifft(fft * fft) and conv2(). Before MATLAB 2007 the
first one was faster for CORRSIZE=15 and, for that reason, it is used
here. However, since MATLAB 2007 some improvements have been made to conv2
and it is now significantly faster than the fft approach. For the 2009b
version it is more than 4 times faster (1.4 ms against 6.5 ms). Besides
that, the direct conv2 computation is just additions and multiplications,
which should perform better on a GPU. For these reasons, it makes sense to
implement normxcorr2 using the second approach.
c(x,y) = sum_all(sum_all(a(i,j)b(x-i,y-j)))

Okay, that's actually wrong. I forgot to increase CORRSIZE to 15 in the image
toolbox. In reality the FFT approach is slightly faster (74 ms against 96 ms),
but multiplications may be better suited for GPU code, so it makes sense
to try. The following OpenCL example from AMD can be used as a source:
http://developer.amd.com/gpu/ATIStreamSDK/ImageConvolutionOpenCL/pages/ImageConvolutionUsingOpenCL.aspx
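
For reference, the direct sum c(x,y) = sum over i,j of a(i,j)*b(x-i,y-j) from
item 6 can be written out as a minimal sketch. This is plain Python for
illustration only, not the GPU code and not the AMD example. The function
names and the sample arrays are made up, and the "full" output size
(rows_a + rows_b - 1, same for columns) is an assumption.

```python
def conv2_full(a, b):
    """Full 2D convolution c(x,y) = sum_i sum_j a(i,j) * b(x-i, y-j)."""
    ah, aw = len(a), len(a[0])
    bh, bw = len(b), len(b[0])
    out = [[0] * (aw + bw - 1) for _ in range(ah + bh - 1)]
    for i in range(ah):
        for j in range(aw):
            for k in range(bh):
                for l in range(bw):
                    # k = x - i and l = y - j, so this term belongs to c(i+k, j+l)
                    out[i + k][j + l] += a[i][j] * b[k][l]
    return out

def rot180(t):
    """Rotate a 2D list by 180 degrees (reverse rows and columns)."""
    return [row[::-1] for row in t[::-1]]

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # made-up sample data
template = [[1, 0], [0, -1]]                # made-up sample data

conv = conv2_full(image, template)
# Cross-correlation is convolution with the template rotated by 180 degrees.
xcorr = conv2_full(image, rot180(template))
print(conv)
print(xcorr)
```

Every output element is an independent sum of products, which is what makes
one-thread-per-output-element a natural fit to try on a GPU.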