[PyCUDA] Strange behaviour of simple cuda kernel

Manuele Sigona

2017-11-06 08:02:38 UTC

Hello everybody,

I'm quite new to CUDA and PyCUDA.

I need a kernel that creates a matrix (of dimension n x d) out of an array (1 x d) by simply repeating the same array n times.

For example, suppose we have n = 4 and d = 3. If the array is [1 2 3],

the result of my kernel should be:

[1 2 3
 1 2 3
 1 2 3
 1 2 3]   (a 4 x 3 matrix)

Basically, it's the same as doing numpy.tile(array, (n, 1))
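
To make the expected result concrete without needing a GPU, here is a minimal plain-Python sketch of the same operation. The helper name `tile_rows` is hypothetical, introduced only for illustration; with NumPy installed, numpy.tile(array, (n, 1)) should give the same values as an array.

```python
def tile_rows(vec, n):
    """Return n independent copies of vec as a list of rows (an n x d matrix)."""
    return [list(vec) for _ in range(n)]


# The n = 4, d = 3 example from above.
expected = tile_rows([1, 2, 3], 4)
for row in expected:
    print(row)
```

Each row is a separate copy, so changing one row of `expected` does not affect the others.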

I've written the code below:

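import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
from pycuda import gpuarray
from pycuda.compiler import SourceModule

kernel_code_template = """
__global__ void TileKernel(float *in, float *out)
{
    // Each thread computes one element of out
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.x * blockDim.x + threadIdx.x;

    if (y > %(n)s || x > %(d)s) return;

    out[y * %(d)s + x] = in[x];
}
"""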

d = 64
n = 512

blockSizex = 16
blockSizey = 16
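# round up so the grid covers all d columns and n rows;
# // keeps the result an integer on Python 3 as well
gridSizex = (d + blockSizex - 1) // blockSizex
gridSizey = (n + blockSizey - 1) // blockSizey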

# get the kernel code from the template
kernel_code = kernel_code_template % {
    'd': d,
    'n': n
}
mod = SourceModule(kernel_code)
TileKernel = mod.get_function("TileKernel")

vec_cpu = np.arange(d).astype(np.float32) # just as an example
vec_gpu = gpuarray.to_gpu(vec_cpu)
out_gpu = gpuarray.empty((n, d), np.float32)

TileKernel.prepare("PP")
TileKernel.prepared_call((gridSizex, gridSizey), (blockSizex, blockSizey, 1), vec_gpu.gpudata, out_gpu.gpudata)

out_cpu = out_gpu.get()

Now, if I run this code with d equal to a power of 2 >= 16, I get the right result (identical to numpy.tile(vec_cpu, (n, 1))).
But if I set d to anything else (for example, 88), every element of the output matrix is correct except in the first column:
some of its entries are right, while others hold a different value (equal to d),
and which entries of the first column are wrong changes from run to run.
I really can't figure out where the problem is, but maybe it's just something simple that I'm missing...
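
One way to narrow this down might be to replay the kernel's index arithmetic on the CPU. The sketch below is plain Python, not PyCUDA: `ceil_div` and `conflicting_writes` are hypothetical helper names, and the loop visits simulated threads one by one instead of in parallel. It only records which flat output index each thread that passes the bounds check would write to, and reports the indices claimed by more than one thread. The `use_ge` flag switches the check from `>` (as posted) to `>=`, which is an assumption about what the check was meant to be.

```python
def ceil_div(a, b):
    """Integer ceiling division, as used for the grid size."""
    return (a + b - 1) // b


def conflicting_writes(n, d, block_x=16, block_y=16, use_ge=False):
    """Map each flat output index written by more than one simulated
    thread to the list of (x, y) thread coordinates that write it."""
    threads_x = ceil_div(d, block_x) * block_x  # launched threads along x
    threads_y = ceil_div(n, block_y) * block_y  # launched threads along y
    writers = {}
    for y in range(threads_y):
        for x in range(threads_x):
            if use_ge:
                skip = y >= n or x >= d  # assumed intended check
            else:
                skip = y > n or x > d    # check as posted
            if skip:
                continue
            # Same index expression as out[y * d + x] in the kernel.
            writers.setdefault(y * d + x, []).append((x, y))
    return {idx: w for idx, w in writers.items() if len(w) > 1}


conflicts = conflicting_writes(n=512, d=88)
print(len(conflicts), "output elements have more than one writer")
print("all in the first column:", all(idx % 88 == 0 for idx in conflicts))
print("d = 64:", len(conflicting_writes(n=512, d=64)), "conflicts")
print("d = 88 with >=:", len(conflicting_writes(n=512, d=88, use_ge=True)), "conflicts")
```

In this simulation, with d = 88 the thread with x == d passes the `>` check and its write lands on element 0 of the next row, the same element that row's x == 0 thread writes, so on the GPU the two writes would race. That thread also reads in[d], one element past the end of the input. With d = 64 the grid covers the columns exactly, so no thread with x == d is launched, which would fit the behaviour I see.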

Any help will be appreciated, thanks in advance!

Best regards,
Manuele