/normxcorr/trunk : revision 37

To get this branch, use:

bzr branch http://suren.me/webbzr/normxcorr/trunk


Viewing changes to dict_hw/README

Committer: Suren A. Chilingaryan
Date: 2010-08-08 13:41:04 UTC
Revision ID: csa@dside.dyndns.org-20100808134104-w7l3e5f4d62ygi5c

Fixes

files modified:
dict_hw/README

dict_hw/cmake/FindCUDA.cmake

dict_hw/src/dict_hw.cu

dict_hw/src/hw_sched.c

dict_hw/src/hw_sched.h


dict_hw/README

Actually, strange effects start if hw_schedule_task(..,dictLoadImageThread)
is commented out (it is never executed anyway, but #ifdef'ing or commenting
it out somehow affects the optimizer).

Limitations
===========

- SLI mode should be disabled for multi-GPU support, otherwise the application
  will run extremely slowly. Under Windows, it can be disabled in the PhysX
  properties.

MATLAB fixes for original version
=================================

To compare speed with the original version, the following changes are
necessary in ${MATLAB}/toolbox/images/images/cpcorr.m:

a) Increase the CORRSIZE value from 5 to 15 (line 76: CORRSIZE = 5;).

b) Change

    input_fractional_offset = xyinput(icp,:) - round(xyinput(icp,:));
    base_fractional_offset = xybase_in(icp,:) - round(xybase_in(icp,:));

to

    input_fractional_offset = xyinput(icp,:) - round(xyinput(icp,:)*1000)/1000;
    base_fractional_offset = xybase_in(icp,:) - round(xybase_in(icp,:)*1000)/1000;

The selected path (conv2 or FFT) can be seen by removing the semicolon from
the end of the following lines in normxcorr2.m, so that MATLAB prints both
time estimates:

    conv_time = time_conv2(T_size,A_size);
    fft_time = 3*time_fft2(outsize);

ToDo
====

the problem: the extra space should be zeroed, and more data is filled into
the base buffer. Another option is to unblock computations in load base (3D
copy?); then we would not need it CP_BLOCK times, but just once.

5. Eliminate optimization modes below 3 (?) and provide options to switch
threading on/off. Implement image preloading and multipass mode in Matlab.

6. The normxcorr2 routine of Matlab implements 2 methods of cross-correlation
computation: using ifft(fft * fft) and conv2(). Before Matlab 2007, the
first one was faster for CORRSIZE=15 and, for that reason, it is used
here. However, since Matlab 2007 some improvements were made to conv2,
and now it is significantly faster than the FFT approach. For version
2009b it is 4 times faster (1.4 ms against 6.5 ms). Besides that, direct
conv2 computation is just additions and multiplications, which should
perform better on GPU. For these reasons, it makes sense to implement
normxcorr2 using the second approach.

    c(x,y) = \sum_{i}\sum_{j} a(i,j)\, b(x-i,\, y-j)

Okay, that's actually wrong. I forgot to increase CORRSIZE to 15 in the
Image Processing Toolbox. In fact, the FFT approach is slightly faster
(74 ms against 96 ms), but multiplications may be better suited for GPU
code, so it makes sense to try. The following OpenCL example from AMD can
be used as a starting point:

http://developer.amd.com/gpu/ATIStreamSDK/ImageConvolutionOpenCL/pages/ImageConvolutionUsingOpenCL.aspx
