Keegan Owsley
2016-11-30 17:27:46 UTC
Hello,
I've just slapped together a patch to pycuda that makes most elementwise
operations work with noncontiguous arrays. There are a bunch of hacks in
there, and the code needs some reorg before it's ready to be considered for
upstream (I made these changes while learning the pycuda codebase, so
there's a bunch of crud that can be cleaned out), but I figure I might as
well put it out there in its current state and see what you guys think.
It's also not extremely well-tested (I have no idea if it interferes with
skcuda, for example), but all of the main functions appear to work.
You can check out the code at https://bitbucket.org/owsleyk_omega/pycuda.
Briefly, this works by adding new parameters into elementwise kernels that
describe the stride and shape of your arrays, then using a function that
computes the location in memory from the stride, shape, and index.
Elementwise kernel ops are modified so that they use the proper indexing.
See an example of a kernel that's generated below:
#include <pycuda-complex.hpp>
typedef struct
{
unsigned n[2];
long stride[2];
} dim;
__device__ unsigned i2m(unsigned i, dim d)
{
unsigned m = 0;
unsigned j = i;
for(int k = 0; k < 2; k++)
{
m += d.stride[k] * (j%d.n[k]);
j = j / d.n[k];
}
return m;
}
__global__ void axpbyz(float a, float *x, float b, float *y, float
*z, unsigned long long n, dim *__dim0, dim *__dim1, dim *__dim2)
{
unsigned tid = threadIdx.x;
unsigned total_threads = gridDim.x*blockDim.x;
unsigned cta_start = blockDim.x*blockIdx.x;
unsigned i;
;
for (i = cta_start + tid; i < n; i += total_threads)
{
z[i2m(i,*__dim2)] = a*x[i2m(i,*__dim0)] + b*y[i2m(i,*__dim1)];
}
;
}
I've also attached a patch file that should take you from latest git to the
version in my repo. All of the changes are in elementwise.py and
gpuarray.py.
I've just slapped together a patch to pycuda that makes most elementwise
operations work with noncontiguous arrays. There are a bunch of hacks in
there, and the code needs some reorg before it's ready to be considered for
upstream (I made these changes while learning the pycuda codebase, so
there's a bunch of crud that can be cleaned out), but I figure I might as
well put it out there in its current state and see what you guys think.
It's also not extremely well-tested (I have no idea if it interferes with
skcuda, for example), but all of the main functions appear to work.
You can check out the code at https://bitbucket.org/owsleyk_omega/pycuda.
Briefly, this works by adding new parameters into elementwise kernels that
describe the stride and shape of your arrays, then using a function that
computes the location in memory from the stride, shape, and index.
Elementwise kernel ops are modified so that they use the proper indexing.
See an example of a kernel that's generated below:
#include <pycuda-complex.hpp>
typedef struct
{
unsigned n[2];
long stride[2];
} dim;
__device__ unsigned i2m(unsigned i, dim d)
{
unsigned m = 0;
unsigned j = i;
for(int k = 0; k < 2; k++)
{
m += d.stride[k] * (j%d.n[k]);
j = j / d.n[k];
}
return m;
}
__global__ void axpbyz(float a, float *x, float b, float *y, float
*z, unsigned long long n, dim *__dim0, dim *__dim1, dim *__dim2)
{
unsigned tid = threadIdx.x;
unsigned total_threads = gridDim.x*blockDim.x;
unsigned cta_start = blockDim.x*blockIdx.x;
unsigned i;
;
for (i = cta_start + tid; i < n; i += total_threads)
{
z[i2m(i,*__dim2)] = a*x[i2m(i,*__dim0)] + b*y[i2m(i,*__dim1)];
}
;
}
I've also attached a patch file that should take you from latest git to the
version in my repo. All of the changes are in elementwise.py and
gpuarray.py.