Manuele Sigona

2017-11-06 08:02:38 UTC

Hello everybody,

I'm quite new to CUDA and PyCUDA.

I need a kernel that creates a matrix (of dimension n x d) out of an array (1 x d) by simply repeating the same array n times.

For example, with n = 4 and d = 3, if the array is [1 2 3], the result of my kernel should be:

[1 2 3
 1 2 3
 1 2 3
 1 2 3] (a 4 x 3 matrix)

Basically, it's the same as doing numpy.tile(array, (n, 1))
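For reference, here is a minimal NumPy sketch of the target behaviour, using the n = 4, d = 3 example above. The variable names are only illustrative.

```python
import numpy as np

n, d = 4, 3
array = np.array([1, 2, 3], dtype=np.float32)  # the 1 x d input

# Repeat the row n times along the first axis; the result has shape (n, d).
expected = np.tile(array, (n, 1))
print(expected.shape)
```

Every row of expected is a copy of array, which is what the kernel's output should match.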

I've written the code below:

import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

kernel_code_template = """
__global__ void TileKernel(float *in, float *out)
{
    // Each thread computes one element of out
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (y > %(n)s || x > %(d)s) return;
    out[y * %(d)s + x] = in[x];
}
"""

d = 64
n = 512
blockSizex = 16
blockSizey = 16
# ceiling division; // keeps the grid dimensions integers
gridSizex = (d + blockSizex - 1) // blockSizex
gridSizey = (n + blockSizey - 1) // blockSizey

# get the kernel code from the template
kernel_code = kernel_code_template % {
    'd': d,
    'n': n
}
mod = SourceModule(kernel_code)
TileKernel = mod.get_function("TileKernel")

vec_cpu = np.arange(d).astype(np.float32)  # just as an example
vec_gpu = gpuarray.to_gpu(vec_cpu)
out_gpu = gpuarray.empty((n, d), np.float32)

TileKernel.prepare("PP")
TileKernel.prepared_call((gridSizex, gridSizey), (blockSizex, blockSizey, 1), vec_gpu.gpudata, out_gpu.gpudata)
out_cpu = out_gpu.get()

Now, if I run this code with d equal to a power of 2 that is >= 16, I get the right result (just like numpy.tile(vec_cpu, (n, 1))).

But if I set d to anything else (for example, 88), every element of the output matrix has the correct value except in the first column: some entries there are right, but others have a different value (equal to d), and the entries of the first column that are wrong differ from run to run.
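To pin down which entries are wrong, a CPU-side comparison against numpy.tile along these lines can help. This is only a sketch: find_mismatches is a hypothetical helper name, and the corrupted array below is a made-up stand-in for the kernel output (out_cpu), not real output from a run.

```python
import numpy as np

def find_mismatches(out, vec):
    """Return (row, col) index pairs where out differs from np.tile(vec, (n, 1))."""
    expected = np.tile(vec, (out.shape[0], 1))
    return np.argwhere(out != expected)

# Stand-in data (hypothetical, not real kernel output): a correct tiling
# with two first-column entries overwritten by the value d.
n, d = 8, 5
vec = np.arange(d).astype(np.float32)
fake_out = np.tile(vec, (n, 1))
fake_out[[2, 5], 0] = d

bad = find_mismatches(fake_out, vec)
print(bad)                    # row/column indices of the wrong entries
print(np.unique(bad[:, 1]))   # which columns are affected
```

With the real data, the call would be find_mismatches(out_cpu, vec_cpu).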

I really can't figure out where the problem is, but maybe it's just something simple that I'm missing...

Any help will be appreciated, thanks in advance!

Best regards,

Manuele
