Discussion:
[PyCUDA] Sum along axis of GPUarray?
Luke Pfister
2015-05-20 21:46:50 UTC
Permalink
Is there a suggested way to do the equivalent of np.sum along a particular
axis for a high-dimensional GPUarray?

I saw that this was discussed in 2009, before GPUarrays carried stride
information.

Thanks,
Luke
Andreas Kloeckner
2015-05-21 11:59:35 UTC
Permalink
Post by Luke Pfister
Is there a suggested way to do the equivalent of np.sum along a particular
axis for a high-dimensional GPUarray?
I saw that this was discussed in 2009, before GPUarrays carried stride
information.
Hand-writing a kernel is probably still your best option. Just map the
non-reduction axes to the grid/thread block axes, and write a for loop
to do the summation.
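The mapping Andreas describes might look like the following sketch. The kernel source and all names here are illustrative, not from the thread: one thread per output element, non-reduction axes on the grid/block axes, and a plain for loop over the reduction axis. The CPU function emulates the same thread mapping so the logic can be checked without a GPU.

```python
import numpy as np

# Hypothetical CUDA kernel (untested sketch): each thread owns one (i, j)
# output element and serially sums over axis 2 of a C-contiguous array.
KERNEL_SRC = r"""
__global__ void sum_axis2(const float *in, float *out,
                          int n0, int n1, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* axis 0 */
    int j = blockIdx.y * blockDim.y + threadIdx.y;  /* axis 1 */
    if (i >= n0 || j >= n1)
        return;
    float acc = 0.0f;
    for (int k = 0; k < n2; ++k)   /* serial loop over the reduction axis */
        acc += in[(i * n1 + j) * n2 + k];
    out[i * n1 + j] = acc;
}
"""

def sum_axis2_emulated(a):
    """CPU emulation of the kernel's thread mapping, for checking the logic."""
    n0, n1, n2 = a.shape
    out = np.zeros((n0, n1), dtype=a.dtype)
    for i in range(n0):              # plays the role of grid/block axis x
        for j in range(n1):          # plays the role of grid/block axis y
            acc = a.dtype.type(0)
            for k in range(n2):      # the per-thread for loop
                acc += a[i, j, k]
            out[i, j] = acc
    return out

a = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
assert np.allclose(sum_axis2_emulated(a), a.sum(axis=2))
```

With PyCUDA one would compile `KERNEL_SRC` via `pycuda.compiler.SourceModule` and launch it with a 2D grid covering the (n0, n1) output shape.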

HTH,
Andreas
Jerome Kieffer
2015-05-21 12:54:30 UTC
Permalink
On Thu, 21 May 2015 07:59:35 -0400
Post by Andreas Kloeckner
Post by Luke Pfister
Is there a suggested way to do the equivalent of np.sum along a particular
axis for a high-dimensional GPUarray?
I saw that this was discussed in 2009, before GPUarrays carried stride
information.
Hand-writing a kernel is probably still your best option. Just map the
non-reduction axes to the grid/thread block axes, and write a for loop
to do the summation.
Won't you win by having 1 workgroup (sorry, it is the OpenCL name; I can't remember the CUDA one)
doing a partial parallel reduction?

i.e. 1 workgroup = 32 threads
First stage:
32 × (read + add) into shared memory, as many times as needed for the size of the gpuarray dimension

Second stage:
Parallel reduction within the shared memory (even without a barrier, as we are within a warp)
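The two stages Jerome outlines can be emulated on the CPU as follows. This is a sketch, not code from the thread: a fixed 32-lane "workgroup" where each lane strides through the input accumulating a private partial sum (stage 1), followed by a tree reduction over the 32 partials (stage 2).

```python
import numpy as np

WARP = 32  # Jerome's "1 workgroup = 32 threads"

def warp_sum(values):
    """Emulate the two-stage warp reduction for a 1D input."""
    # Stage 1: each lane reads and adds a strided slice of the input,
    # as many times as needed, leaving one partial sum per lane in
    # (emulated) shared memory.
    shared = np.zeros(WARP, dtype=np.float64)
    for lane in range(WARP):
        shared[lane] = values[lane::WARP].sum()

    # Stage 2: pairwise tree reduction over the 32 partials.  On a GPU the
    # lanes of one warp execute in lockstep, which is why Jerome notes that
    # no barrier is needed inside the warp.
    stride = WARP // 2
    while stride > 0:
        for lane in range(stride):
            shared[lane] += shared[lane + stride]
        stride //= 2
    return shared[0]

x = np.arange(1000, dtype=np.float64)
assert warp_sum(x) == x.sum()
```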

Cheers,
--
Jérôme Kieffer
tel +33 476 882 445
Andreas Kloeckner
2015-05-21 13:12:10 UTC
Permalink
Post by Jerome Kieffer
On Thu, 21 May 2015 07:59:35 -0400
Post by Andreas Kloeckner
Post by Luke Pfister
Is there a suggested way to do the equivalent of np.sum along a particular
axis for a high-dimensional GPUarray?
I saw that this was discussed in 2009, before GPUarrays carried stride
information.
Hand-writing a kernel is probably still your best option. Just map the
non-reduction axes to the grid/thread block axes, and write a for loop
to do the summation.
Won't you win by having 1 workgroup (sorry it is the OpenCL name, can't remember the CUDA one)
doing a partial parallel reduction ?
i.e. 1 workgroup = 32 threads
32 × (read + add) into shared memory, as many times as needed for the size of the gpuarray dimension
Parallel reduction within the shared memory (even without a barrier, as we are within a warp)
I'd say that depends on the shape of the array; specifically, on
whether your other axes are big enough to fill the GPU. If they are,
then parallel reduction is not a winner, since it's not
work-efficient (work = n, span = log(n)). On the other hand, if you're
struggling to fill the machine, then it might be worth considering.
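The work/span figures Andreas cites can be checked with a quick count (an illustrative sketch, not from the thread): a tree reduction over n elements performs n - 1 additions in total but needs only log2(n) dependent rounds.

```python
from math import log2

def tree_reduce_cost(n):
    """Count total additions (work) and parallel rounds (span) of a
    pairwise tree reduction over n elements (n a power of two)."""
    work, span = 0, 0
    while n > 1:
        work += n // 2   # each round does n/2 pairwise adds in parallel
        n //= 2
        span += 1
    return work, span

work, span = tree_reduce_cost(1024)
assert work == 1023               # n - 1 additions in total
assert span == int(log2(1024))    # only log2(n) dependent rounds
```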

Andreas
