Discussion:
cuModuleLoadDataEx failed: device kernel image is invalid
Zhangsheng Lai
2018-04-19 01:34:08 UTC
I'm encountering this error when I run my code in the same Docker environment
but on a different workstation.

```
Traceback (most recent call last):
  File "simple_peer.py", line 76, in <module>
    tslr_gpu, lr_gpu = mp.initialise()
  File "/root/distributed-mpp/naive/mccullochpitts.py", line 102, in initialise
    """, arch='sm_60')
  File "/root/anaconda3/lib/python3.6/site-packages/pycuda/compiler.py", line 294, in __init__
    self.module = module_from_buffer(cubin)
pycuda._driver.LogicError: cuModuleLoadDataEx failed: device kernel image is invalid -

```
I did a quick search and only found
https://github.com/inducer/pycuda/issues/45 , but it doesn't seem
relevant to my problem, since the same code runs fine on my original
workstation. Can anyone see what the issue is?

Below is my code that I'm trying to run:
```
def initialise(self):
    """
    Documentation here
    """

    mod = SourceModule("""
        #include <math.h>
        __global__ void initial(float *tslr_out, float *lr_out, float *W_gpu,
                                float *b_gpu, int *x_gpu, int d, float temp)
        {
            int tx = threadIdx.x;

            // Wx accumulates the W_ji x_i product value
            float Wx = 0;

            // Matrix-vector product of W and x
            for (int k = 0; k < d; ++k)
            {
                float W_element = W_gpu[tx * d + k];
                float x_element = x_gpu[k];
                Wx += W_element * x_element;
            }

            // Linear response, and signed linear response scaled by temp
            lr_out[tx] = Wx + b_gpu[tx];
            tslr_out[tx] = (0.5 / temp) * (1 - 2 * x_gpu[tx]) * (Wx + b_gpu[tx]);
        }
        """, arch='sm_60')

    func = mod.get_function("initial")

    # format characters for prepare() are defined at
    # https://docs.python.org/2/library/struct.html
    func.prepare("PPPPPif")

    dsize_nparray = np.zeros((self.d,), dtype=np.float32)

    lr_gpu = cuda.mem_alloc(dsize_nparray.nbytes)
    slr_gpu = cuda.mem_alloc(dsize_nparray.nbytes)
    tslr_gpu = cuda.mem_alloc(dsize_nparray.nbytes)

    grid = (1, 1)
    block = (self.d, 1, 1)
    # block = (self.d, self.d, 1)

    func.prepared_call(grid, block, tslr_gpu, lr_gpu, self.W_gpu,
                       self.b_gpu, self.x_gpu, self.d, self.temp)

    return tslr_gpu, lr_gpu
```
Andreas Kloeckner
2018-04-19 14:48:02 UTC
You're prescribing the GPU architecture (arch='...'). If this doesn't
match the GPU in the machine you're running on, it can easily cause
exactly this error. Just deleting that kwarg should be fine.
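If you do want to pin the architecture, a safer approach is to derive it
from the device's compute capability at run time. A minimal sketch
(`arch_from_capability` is a hypothetical helper; the `(major, minor)`
tuple is what `pycuda.driver.Device.compute_capability()` returns):

```python
# Hypothetical helper: build the arch string nvcc expects ("sm_60",
# "sm_52", ...) from a (major, minor) compute-capability tuple, e.g.
# the one returned by pycuda.driver.Device.compute_capability().
def arch_from_capability(capability):
    major, minor = capability
    return "sm_%d%d" % (major, minor)

# On a Pascal P100 (capability (6, 0)) this yields "sm_60"; on a
# Maxwell card (5, 2) it yields "sm_52" -- which is why a hard-coded
# arch='sm_60' produces an invalid kernel image on the other machine.
```

In practice, omitting `arch` entirely lets PyCUDA pick the right value
for the current device, which is why deleting the kwarg fixes this.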

Andreas
Post by Zhangsheng Lai
I'm encountering this error as I run my code on the same docker environment
but on different workstations.
[...]
pycuda._driver.LogicError: cuModuleLoadDataEx failed: device kernel image
is invalid -
_______________________________________________
PyCUDA mailing list
https://lists.tiker.net/listinfo/pycuda
Andreas Kloeckner
2018-04-21 01:28:50 UTC
Post by Zhangsheng Lai
Hi Andreas,
Thanks! It worked! Can I ask whether you think cuda.memcpy_peer can be
used with one thread per GPU (
https://wiki.tiker.net/PyCuda/Examples/MultipleThreads)? I think this is
more of a threading question than a PyCUDA question, but I would like your
insights on this.
Please make sure to keep the list cc'd for archival.

As for your question, I don't see why not.
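As a rough sketch of the shape this takes (the GPU calls are left as
comments so the skeleton runs anywhere; the real versions and the exact
context handling are up to you):

```python
import threading

# Structure of the one-context-per-thread pattern from the
# MultipleThreads wiki example. In real code each worker would call
# cuda.Device(device_id).make_context() on entry, do its work and
# transfers (cuda.memcpy_peer takes explicit source and destination
# contexts), and pop the context before returning.
results = {}
results_lock = threading.Lock()

def worker(device_id):
    # ctx = cuda.Device(device_id).make_context()
    value = device_id + 1  # stand-in for the per-GPU computation
    with results_lock:
        results[device_id] = value
    # ctx.pop()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The main thing to get right is that each context is made current in the
thread that uses it; the copies themselves don't care which thread
issues them as long as the contexts are passed explicitly.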

Andreas
