Acceptable precision: 10^-3, 10^-5
Problems
========
- CUDA 2.2 crashes in the find_max2 kernel; CUDA 2.3 and 3.0 work well.
- gcc 4.3.4 (SuSE 11.2) has an optimization bug: with -O3 and thread
support disabled (DICT_SUPPORT_THREADS), results are computed wrongly. With -O2
everything is fine.
The strange effects actually start when the hw_schedule_task(..,dictLoadImageThread)
call is commented out (it is never executed anyway, but #ifdef-ing or commenting
it out somehow affects the optimizer).
Limitations
===========
ToDo
====
1. Implement Volkov's fast Fourier code for multiples of 2 for the 2D case.
2. In CUDA 3.0, blocking of multiple 2D FFTs gives no performance benefit; check
whether this changes in future versions.
3. When copying from the host to CUDA (fragment mode), the memory
transfer is interleaved with computations. Unfortunately, in image mode
the memory transfer is handled as a computation, so no interleaving
is possible. Therefore, in most cases the fragment mode is faster than
the image mode.
4. We can probably use the same buffer for cuda_base_buffer and cuda_data_buffer.
The problem is that the extra space must be zeroed, and more data is filled
into the base buffer. Another option is to unblock the computations in load
base (3D copy?); then we would need it not CP_BLOCK times, but just once.
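
The zeroing requirement from item 4 can be sketched on the host side as
follows. This is only an illustration, not the project's code: the helper
name load_zero_padded, its parameters, and the row-major layout are all
assumptions, and a real version would work on device memory. It copies a
smaller image into the top-left corner of a shared buffer and clears all
remaining space, so stale data from a previous, larger fill cannot leak into
the next computation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the project): copy a w x h image into the
 * top-left corner of a row-major buffer that has `pitch` floats per row and
 * `rows` rows, and zero everything outside the copied area. */
void load_zero_padded(float *buf, size_t pitch, size_t rows,
                      const float *src, size_t w, size_t h)
{
    assert(w <= pitch && h <= rows);

    for (size_t y = 0; y < h; y++) {
        /* Copy one source row, then clear the margin to its right. */
        memcpy(buf + y * pitch, src + y * w, w * sizeof(float));
        memset(buf + y * pitch + w, 0, (pitch - w) * sizeof(float));
    }

    /* Clear all rows below the image in a single call
     * (all-zero bytes represent 0.0f in IEEE 754). */
    memset(buf + h * pitch, 0, (rows - h) * pitch * sizeof(float));
}
```

Calling load_zero_padded(buf, 4, 3, src, 2, 2) on a 4 x 3 buffer leaves the
2 x 2 image in the corner and zeros in the other eight cells, regardless of
what the buffer held before.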