Post by Jerome Kieffer
On Thu, 21 May 2015 07:59:35 -0400
Post by Andreas Kloeckner
Post by Luke Pfister
Is there a suggested way to do the equivalent of np.sum along a particular
axis for a high-dimensional GPUarray?
I saw that this was discussed in 2009, before GPUarrays carried stride
information.
Hand-writing a kernel is probably still your best option. Just map the
non-reduction axes to the grid/thread block axes, and write a for loop
to do the summation.
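For concreteness, here is a minimal PyCUDA sketch of that one-thread-per-output-element approach, assuming a C-contiguous float32 array of shape (n0, n1, n2) reduced along its last axis; the kernel name and launch configuration are made up for illustration:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void sum_last_axis(const float *in, float *out,
                              int n0, int n1, int n2)
{
    /* one thread per (i, j) output element; the reduction axis
       is handled by a plain for loop */
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n0 || j >= n1)
        return;

    float acc = 0.0f;
    for (int k = 0; k < n2; ++k)
        acc += in[(i * n1 + j) * n2 + k];
    out[i * n1 + j] = acc;
}
""")
sum_last_axis = mod.get_function("sum_last_axis")

n0, n1, n2 = 8, 16, 1000
a = gpuarray.to_gpu(np.random.rand(n0, n1, n2).astype(np.float32))
out = gpuarray.empty((n0, n1), np.float32)

block = (16, 8, 1)
grid = ((n1 + block[0] - 1) // block[0], (n0 + block[1] - 1) // block[1])
sum_last_axis(a.gpudata, out.gpudata,
              np.int32(n0), np.int32(n1), np.int32(n2),
              block=block, grid=grid)
assert np.allclose(out.get(), a.get().sum(axis=-1), rtol=1e-3)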
Won't you win by having one workgroup (sorry, that's the OpenCL name; I can't remember the CUDA one)
do a partial parallel reduction?
I.e. one workgroup = 32 threads,
32 x (read + add) into shared memory, repeated as often as needed for the size of the gpuarray dimension,
then a parallel reduction within shared memory (even without a barrier, since we are inside a warp).
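A rough sketch of that workgroup-per-output variant, in CUDA terms (workgroup ~ thread block), again with made-up names and a 2D float32 array reduced along its last axis (for higher-dimensional arrays, flatten the non-reduction axes into rows first). Note that the barrier-free intra-warp trick relies on implicit warp synchrony, which on recent architectures needs volatile shared memory or __syncwarp(); this version keeps explicit __syncthreads() calls, which are cheap for a 32-thread block:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void sum_last_axis_wg(const float *in, float *out, int n2)
{
    __shared__ float buf[32];
    int row  = blockIdx.x;   /* one block (workgroup) per output element */
    int lane = threadIdx.x;  /* 0..31 */

    /* 32 x (read + add), repeated until the reduction axis is consumed */
    float acc = 0.0f;
    for (int k = lane; k < n2; k += 32)
        acc += in[row * n2 + k];
    buf[lane] = acc;
    __syncthreads();

    /* tree reduction in shared memory */
    for (int offset = 16; offset > 0; offset >>= 1) {
        if (lane < offset)
            buf[lane] += buf[lane + offset];
        __syncthreads();
    }
    if (lane == 0)
        out[row] = buf[0];
}
""")
sum_last_axis_wg = mod.get_function("sum_last_axis_wg")

n_rows, n2 = 128, 1000
a = gpuarray.to_gpu(np.random.rand(n_rows, n2).astype(np.float32))
out = gpuarray.empty((n_rows,), np.float32)
sum_last_axis_wg(a.gpudata, out.gpudata, np.int32(n2),
                 block=(32, 1, 1), grid=(n_rows, 1))
assert np.allclose(out.get(), a.get().sum(axis=-1), rtol=1e-3)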
I'd say that depends on the shape of the array, or, specifically, on
whether your other axes are big enough to fill the GPU. If they are,
then parallel reduction is not a winner: it does no less work than the
sequential loop (work = n either way; the reduction only shortens the
span to log(n)). On the other hand, if you're hurting to fill the
machine, then it might be worth considering.
Andreas