Discussion:
[PyCUDA] Sum along axis of GPUarray?
Luke Pfister
2015-05-20 21:46:50 UTC
Permalink
Is there a suggested way to do the equivalent of np.sum along a particular
axis for a high-dimensional GPUarray?

I saw that this was discussed in 2009, before GPUarrays carried stride
information.

Thanks,
Luke
Andreas Kloeckner
2015-05-21 11:59:35 UTC
Permalink
Post by Luke Pfister
Is there a suggested way to do the equivalent of np.sum along a particular
axis for a high-dimensional GPUarray?
I saw that this was discussed in 2009, before GPUarrays carried stride
information.
Hand-writing a kernel is probably still your best option. Just map the
non-reduction axes to the grid/thread block axes, and write a for loop
to do the summation.
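The mapping Andreas describes might look like the following sketch. The kernel source and all names here are illustrative, not from the thread: one thread per output element, non-reduction axes on the grid/block axes, and a plain for loop over the reduction axis. The CPU function emulates the same thread mapping so the logic can be checked without a GPU.

```python
import numpy as np

# Hypothetical CUDA kernel (untested sketch): each thread owns one (i, j)
# output element and serially sums over axis 2 of a C-contiguous array.
KERNEL_SRC = r"""
__global__ void sum_axis2(const float *in, float *out,
                          int n0, int n1, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* axis 0 */
    int j = blockIdx.y * blockDim.y + threadIdx.y;  /* axis 1 */
    if (i >= n0 || j >= n1)
        return;
    float acc = 0.0f;
    for (int k = 0; k < n2; ++k)   /* serial loop over the reduction axis */
        acc += in[(i * n1 + j) * n2 + k];
    out[i * n1 + j] = acc;
}
"""

def sum_axis2_emulated(a):
    """CPU emulation of the kernel's thread mapping, for checking the logic."""
    n0, n1, n2 = a.shape
    out = np.zeros((n0, n1), dtype=a.dtype)
    for i in range(n0):              # plays the role of grid/block axis x
        for j in range(n1):          # plays the role of grid/block axis y
            acc = a.dtype.type(0)
            for k in range(n2):      # the per-thread for loop
                acc += a[i, j, k]
            out[i, j] = acc
    return out

a = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
assert np.allclose(sum_axis2_emulated(a), a.sum(axis=2))
```

With PyCUDA one would compile `KERNEL_SRC` via `pycuda.compiler.SourceModule` and launch it with a 2D grid covering the (n0, n1) output shape.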

HTH,
Andreas
Jerome Kieffer
2015-05-21 12:54:30 UTC
Permalink
On Thu, 21 May 2015 07:59:35 -0400
Post by Andreas Kloeckner
Post by Luke Pfister
Is there a suggested way to do the equivalent of np.sum along a particular
axis for a high-dimensional GPUarray?
I saw that this was discussed in 2009, before GPUarrays carried stride
information.
Hand-writing a kernel is probably still your best option. Just map the
non-reduction axes to the grid/thread block axes, and write a for loop
to do the summation.
Won't you win by having 1 workgroup (sorry, it is the OpenCL name; I can't remember the CUDA one)
doing a partial parallel reduction?

i.e. 1 workgroup = 32 threads
First stage:
32 × (read + add) into shared memory, as many times as needed for the size of the gpuarray dimension

Second stage:
Parallel reduction within the shared memory (even without a barrier, as we are within a warp)
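The two stages Jerome outlines can be emulated on the CPU as follows. This is a sketch, not code from the thread: a fixed 32-lane "workgroup" where each lane strides through the input accumulating a private partial sum (stage 1), followed by a tree reduction over the 32 partials (stage 2).

```python
import numpy as np

WARP = 32  # Jerome's "1 workgroup = 32 threads"

def warp_sum(values):
    """Emulate the two-stage warp reduction for a 1D input."""
    # Stage 1: each lane reads and adds a strided slice of the input,
    # as many times as needed, leaving one partial sum per lane in
    # (emulated) shared memory.
    shared = np.zeros(WARP, dtype=np.float64)
    for lane in range(WARP):
        shared[lane] = values[lane::WARP].sum()

    # Stage 2: pairwise tree reduction over the 32 partials.  On a GPU the
    # lanes of one warp execute in lockstep, which is why Jerome notes that
    # no barrier is needed inside the warp.
    stride = WARP // 2
    while stride > 0:
        for lane in range(stride):
            shared[lane] += shared[lane + stride]
        stride //= 2
    return shared[0]

x = np.arange(1000, dtype=np.float64)
assert warp_sum(x) == x.sum()
```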

Cheers,
--
Jérôme Kieffer
tel +33 476 882 445
Andreas Kloeckner
2015-05-21 13:12:10 UTC
Permalink
Post by Jerome Kieffer
On Thu, 21 May 2015 07:59:35 -0400
Post by Andreas Kloeckner
Post by Luke Pfister
Is there a suggested way to do the equivalent of np.sum along a particular
axis for a high-dimensional GPUarray?
I saw that this was discussed in 2009, before GPUarrays carried stride
information.
Hand-writing a kernel is probably still your best option. Just map the
non-reduction axes to the grid/thread block axes, and write a for loop
to do the summation.
Won't you win by having 1 workgroup (sorry it is the OpenCL name, can't remember the CUDA one)
doing a partial parallel reduction ?
i.e. 1 workgroup = 32 threads
32 × (read + add) into shared memory, as many times as needed for the size of the gpuarray dimension
Parallel reduction within the shared memory (even without a barrier, as we are within a warp)
I'd say that depends on the shape of the array; specifically, on
whether your other axes are big enough to fill the GPU. If they are,
then parallel reduction is not a winner, since it's not
work-efficient (work = n, span = log(n)). On the other hand, if you're
struggling to fill the machine, then it might be worth considering.
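The work/span figures Andreas cites can be checked with a quick count (an illustrative sketch, not from the thread): a tree reduction over n elements performs n - 1 additions in total but needs only log2(n) dependent rounds.

```python
from math import log2

def tree_reduce_cost(n):
    """Count total additions (work) and parallel rounds (span) of a
    pairwise tree reduction over n elements (n a power of two)."""
    work, span = 0, 0
    while n > 1:
        work += n // 2   # each round does n/2 pairwise adds in parallel
        n //= 2
        span += 1
    return work, span

work, span = tree_reduce_cost(1024)
assert work == 1023               # n - 1 additions in total
assert span == int(log2(1024))    # only log2(n) dependent rounds
```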

Andreas
