Discussion:
[PyCUDA] Elementwise operations on noncontiguous arrays
Keegan Owsley
2016-11-30 17:27:46 UTC
Hello,

I've just slapped together a patch to pycuda that makes most elementwise
operations work with noncontiguous arrays. There are a bunch of hacks in
there, and the code needs some reorg before it's ready to be considered for
upstream (I made these changes while learning the pycuda codebase, so
there's a bunch of crud that can be cleaned out), but I figure I might as
well put it out there in its current state and see what you guys think.
It's also not extremely well-tested (I have no idea if it interferes with
skcuda, for example), but all of the main functions appear to work.

You can check out the code at https://bitbucket.org/owsleyk_omega/pycuda.

Briefly, this works by adding new parameters into elementwise kernels that
describe the stride and shape of your arrays, then using a function that
computes the location in memory from the stride, shape, and index.
Elementwise kernel ops are modified so that they use the proper indexing.
See an example of a kernel that's generated below:

#include <pycuda-complex.hpp>

typedef struct
{
    unsigned n[2];
    long stride[2];
} dim;

__device__ unsigned i2m(unsigned i, dim d)
{
    unsigned m = 0;
    unsigned j = i;
    for (int k = 0; k < 2; k++)
    {
        m += d.stride[k] * (j % d.n[k]);
        j = j / d.n[k];
    }
    return m;
}




__global__ void axpbyz(float a, float *x, float b, float *y, float *z,
        unsigned long long n, dim *__dim0, dim *__dim1, dim *__dim2)
{
    unsigned tid = threadIdx.x;
    unsigned total_threads = gridDim.x*blockDim.x;
    unsigned cta_start = blockDim.x*blockIdx.x;
    unsigned i;

    ;

    for (i = cta_start + tid; i < n; i += total_threads)
    {
        z[i2m(i,*__dim2)] = a*x[i2m(i,*__dim0)] + b*y[i2m(i,*__dim1)];
    }

    ;
}
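
As an aside, here is what i2m computes, transcribed into Python (a sketch
for illustration, not part of the patch). Axes are ordered fastest-varying
first, and strides are in element units:

def i2m(i, n, stride):
    # Map flat element index i to a memory offset, peeling off one
    # axis coordinate at a time (fastest-varying axis first).
    m = 0
    for n_k, s_k in zip(n, stride):
        m += s_k * (i % n_k)
        i //= n_k
    return m

# For a[::2, :] of a C-contiguous (4, 3) array -- shape (2, 3), element
# strides (6, 1) -- flat indices 0..5 map to offsets 0, 1, 2, 6, 7, 8,
# i.e. rows 0 and 2 of the parent array:
print([i2m(i, n=(3, 2), stride=(1, 6)) for i in range(6)])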

I've also attached a patch file that should take you from latest git to the
version in my repo. All of the changes are in elementwise.py and
gpuarray.py.
Andreas Kloeckner
2016-12-01 01:24:50 UTC
Keegan,
Thanks for putting this together and sharing it! I have one main
question about this, regarding performance:

Modulo (especially variable-denominator modulo) has a habit of being
fantastically slow on GPUs. Could you time contiguous
vs. noncontiguous for various levels of "gappiness" and number of
axes? I'm asking this because I'd be OK with a 50% slowdown, but not
necessarily a factor of 5 slowdown on actual GPU hardware.

Thanks!
Andreas
Keegan Owsley
2016-12-01 19:54:06 UTC
It seems striding in kernels can indeed incur a large penalty, though the
impact depends on the array size. I've just updated the code to use the
noncontiguous kernel only when necessary.

Using a GTX 970, I get the following results using the pow kernel (I'm
using ipython's %timeit here, because for some reason python appears to
…):

>>> import pycuda.autoinit; import pycuda.gpuarray as gpuarray
>>> b1 = gpuarray.arange(100000, dtype='float64').reshape(1000, 100)
>>> b2 = b1[::2, :-1].copy()
>>> # force CUDA to compile the kernels before we time
>>> b1[::2,:-1]**2
>>> b2**2
>>> %timeit b1[::2,:-1]**2
1000 loops, best of 3: 654 µs per loop
>>> %timeit b2**2
10000 loops, best of 3: 147 µs per loop
>>> b1 = gpuarray.arange(1000000, dtype='float64').reshape(1000, 1000)
>>> b2 = b1[::2, :-1].copy()
>>> %timeit b1[::2,:-1]**2
100 loops, best of 3: 2.05 ms per loop
>>> %timeit b2**2
1000 loops, best of 3: 1.66 ms per loop
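
A fuller sweep over "gappiness" and axis count, along the lines Andreas
asked about, could look like this (a rough harness sketch, not code from
the patch; time_pow and the synchronize-per-call timing are my own):

import timeit

import pycuda.autoinit
import pycuda.driver as driver
import pycuda.gpuarray as gpuarray

def time_pow(x, repeats=100):
    # Warm up first so kernel compilation isn't included in the timing.
    x**2
    driver.Context.synchronize()
    t = timeit.timeit(
        lambda: (x**2, driver.Context.synchronize()),
        number=repeats)
    return t / repeats

b = gpuarray.arange(1000000, dtype='float64').reshape(1000, 1000)
for step in (1, 2, 4):  # increasing "gappiness" of the row slice
    v = b[::step, :-1]
    print(step, time_pow(v), time_pow(v.copy()))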

I'll try to clean up the code and get it into my repo today.

Keegan


On Thu, Dec 1, 2016 at 10:03 AM, Frédéric Bastien wrote:
I can share our experience with this in Theano.
I added code to "fuse" dimensions that are contiguous, so if you have a 3d
tensor and only one dimension is noncontiguous, you pay the indexing cost
of a 2d tensor rather than a 3d one. I saw a 2x speedup in such cases.
That was with an old gpu and an old cuda version, so this may have changed.
In the new Theano back-end, we have rewritten this in a more readable way:
https://github.com/Theano/libgpuarray/blob/master/src/gpuarray_util.c#L153
This also takes broadcasting into account, and it can be done before
calling the kernel.
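
For concreteness, a minimal sketch of that fusing idea (a hypothetical
helper, not the libgpuarray code; shapes and strides are in element units,
C order with the slowest-varying axis first):

def fuse_dims(shape, strides):
    # Merge axis k+1 into axis k whenever the two are contiguous in
    # memory, i.e. strides[k] == shape[k+1] * strides[k+1].
    out_shape, out_strides = [shape[0]], [strides[0]]
    for n, s in zip(shape[1:], strides[1:]):
        if out_strides[-1] == n * s:
            out_shape[-1] *= n
            out_strides[-1] = s
        else:
            out_shape.append(n)
            out_strides.append(s)
    return out_shape, out_strides

print(fuse_dims([1000, 100], [100, 1]))  # contiguous: -> ([100000], [1])
print(fuse_dims([500, 100], [200, 1]))   # every other row: stays 2d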
Keegan Owsley
2016-12-01 23:43:16 UTC
Alright, I've updated the code in my repository. It now passes all of the
unit tests in test_gpuarray (at least, the ones that were passing before).
The old behavior is the default, so this should preserve compatibility with
existing code (still largely untested). To create a noncontiguous kernel,
pass use_strides=True into get_elwise_kernel, and pass
gpuarray.device_shape_and_strides into func.prepared_async_call. I've
updated all of the methods in gpuarray.py to check for noncontiguous
inputs/outputs and call the appropriate function.

Something that I don't think I made clear before: the kernels generated by
get_elwise_module_noncontig are modified using regular expressions, so that
you don't need to change your code downstream to get strided array support.
I'm not convinced yet that this is the best approach, but it works okay so
far.
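
To illustrate the kind of rewrite involved, here is a toy version of the
idea (not the patch's actual regex): flat accesses like x[i] are rewritten
to go through i2m.

import re

def stride_rewrite(operation, arg_names):
    # Toy version: rewrite each flat access "name[i]" to use i2m with
    # that argument's dim struct.
    for k, name in enumerate(arg_names):
        operation = re.sub(
            r"\b%s\[i\]" % re.escape(name),
            "%s[i2m(i, *__dim%d)]" % (name, k),
            operation)
    return operation

print(stride_rewrite("z[i] = x[i] + y[i]", ["x", "y", "z"]))
# -> z[i2m(i, *__dim2)] = x[i2m(i, *__dim0)] + y[i2m(i, *__dim1)]

A naive rewrite like this is also exactly what breaks on nested accesses
such as y[x[i]], which comes up again below.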

So, for example...

Code like this:

@context_dependent_memoize
def get_binary_op_kernel(dtype_x, dtype_y, dtype_z, operator,
        use_strides=False):
    return get_elwise_kernel(
            "%(tp_x)s *x, %(tp_y)s *y, %(tp_z)s *z" % {
                "tp_x": dtype_to_ctype(dtype_x),
                "tp_y": dtype_to_ctype(dtype_y),
                "tp_z": dtype_to_ctype(dtype_z),
                },
            "z[i] = x[i] %s y[i]" % operator,
            "multiply", use_strides=use_strides)


Generates kernels like this:

#include <pycuda-complex.hpp>

typedef struct
{
    unsigned n[2];
    long stride[2];
} dim;

__device__ unsigned i2m(unsigned i, dim d)
{
    unsigned m = 0;
    unsigned j = i;
    for (int k = 0; k < 2; k++)
    {
        m += d.stride[k] * (j % d.n[k]);
        j = j / d.n[k];
    }
    return m;
}




__global__ void multiply(double *x, double *y, double *z,
        unsigned long long n, dim *__dim0, dim *__dim1, dim *__dim2)
{
    unsigned tid = threadIdx.x;
    unsigned total_threads = gridDim.x*blockDim.x;
    unsigned cta_start = blockDim.x*blockIdx.x;
    unsigned i;

    ;

    for (i = cta_start + tid; i < n; i += total_threads)
    {
        z[i2m(i,*__dim2)] = x[i2m(i,*__dim0)] / y[i2m(i,*__dim1)];
    }

    ;
}


and you use them like this:


def is_noncontig(*vecs):
    r = False
    for v in vecs:
        r = r or not v.flags.forc
    return r

def _div(self, other, out, stream=None):
    """Divides an array by another array."""

    assert self.shape == other.shape

    noncontig = is_noncontig(self, other, out)

    func = elementwise.get_binary_op_kernel(self.dtype, other.dtype,
            out.dtype, "/", use_strides=noncontig)

    if noncontig:
        func.prepared_async_call(self._grid, self._block, stream,
                self.gpudata, other.gpudata,
                out.gpudata, self.mem_size,
                self.device_shape_and_strides,
                other.device_shape_and_strides,
                out.device_shape_and_strides)
    else:
        func.prepared_async_call(self._grid, self._block, stream,
                self.gpudata, other.gpudata,
                out.gpudata, self.mem_size)


In principle, similar updates can be made to modules that depend on
pycuda.elementwise down the road, to also support strided arrays without
big rewrites.
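
From the user's side, with the patch applied, a strided operation should
dispatch the noncontiguous kernel automatically. A usage sketch (untested
here):

import pycuda.autoinit
import pycuda.gpuarray as gpuarray

a = gpuarray.arange(10000, dtype='float64').reshape(100, 100)
b = a[::2, :-1]   # noncontiguous view
c = b / (b + 1)   # is_noncontig sees the views and the strided
                  # kernels get used under the hood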

I'm open to suggestions for improving performance, but I'd like to focus on
compatibility for now. Frédéric's suggestion is a good one, but I suspect
there's another source of slowdowns somewhere, because the slowdowns I'm
seeing don't scale with array size.

Keegan
Andreas Kloeckner
2016-12-04 23:30:22 UTC
Post by Keegan Owsley
Something that I don't think I made clear before: the kernels generated by
get_elwise_module_noncontig are modified using regular expressions, so that
you don't need to change your code downstream to get strided array support.
I'm not convinced yet that this is the best approach, but it works okay so
far.
To make it easier to see your changes and comment on them (such as
exactly how robust the Regex stuff is), could you put this up as a pull
request here?

https://gitlab.tiker.net/inducer/pycuda

This will automatically run tests, so it's easier for me to handle.

Thanks!
Andreas
Keegan Owsley
2016-12-05 20:10:21 UTC
I've created the PR here:
https://gitlab.tiker.net/inducer/pycuda/merge_requests/1

Regarding the regex, I have two versions: one that's simpler and uses the
standard `re` module but has trouble with nested array accesses, and a
more robust one that uses the `regex` module and handles nesting
correctly. The following kernel fails with the former but succeeds with
the latter:

z[i] = y[x[i]]

I didn't see any kernels in pycuda that use this kind of nested access, so
the first regex is sufficient for all of the examples in there, but the
second is necessary to support arbitrary kernels. It's possible that there
are other situations in which the regex(es) fail, but I haven't run into
them yet.
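
For reference, the nested case is what the `regex` module's recursive
subpatterns make tractable; a toy demonstration (not the actual pattern
from the merge request):

import regex  # third-party module; supports recursive subpatterns

# (?1) re-enters group 1, so the pattern matches a complete bracketed
# index expression even when other brackets nest inside it.
bracketed = regex.compile(r"(\[(?:[^\[\]]|(?1))*\])")
print(bracketed.findall("z[i] = y[x[i]]"))
# -> ['[i]', '[x[i]]']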

Keegan