Discussion:
[PyCUDA] Dynamic parallelism (sm_35) with PyCUDA
Ahmed Fasih
2013-02-12 04:34:05 UTC
Hi folks, I write in the hope that someone with a K20 (Kepler, compute
capability 3.5) has gotten it to do dynamic parallelism, wherein a
kernel can launch grids on its own without returning to the CPU. A
"hello world" example is given at [1], page 23. With the suggested
nvcc command I was able to compile and run the hello world perfectly,
but when I try to load the kernels into PyCUDA, I get a LogicError
asserting

"cuModuleLoadDataEx failed: invalid image -"

The error occurs both before and after I take the runtime API calls
like cudaDeviceSynchronize() out of the kernel code, and persists
after I pass the "-rdc=true" and "-lcudadevrt" options to
SourceModule. Could PyCUDA be choking on the runtime API calls in a
kernel? If anybody's gotten this to work, please share!

On a related note, I've added support for the sm_35 (sm_30?) feature
of choosing four- versus eight-byte banks for shared memory from
PyCUDA; I can send out a rough patch.

[1] http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
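For context, the driver-level knob behind that bank-size feature is
cuCtxSetSharedMemConfig. A minimal sketch of how a patched PyCUDA might
expose it follows; the enum values mirror CUsharedconfig in cuda.h, but
the set_shared_config method name and the helper are guesses, not the
actual patch:

```python
# Hypothetical usage sketch for the proposed bank-size patch; the constants
# mirror CUsharedconfig in cuda.h, but the patched method name is a guess.
SHARED_MEM_CONFIG_DEFAULT_BANK_SIZE = 0    # CU_SHARED_MEM_CONFIG_DEFAULT_BANK_SIZE
SHARED_MEM_CONFIG_FOUR_BYTE_BANK_SIZE = 1  # CU_SHARED_MEM_CONFIG_FOUR_BYTE_BANK_SIZE
SHARED_MEM_CONFIG_EIGHT_BYTE_BANK_SIZE = 2 # CU_SHARED_MEM_CONFIG_EIGHT_BYTE_BANK_SIZE

def pick_bank_size(itemsize):
    """Choose the bank width matching the dominant shared-memory element size."""
    return (SHARED_MEM_CONFIG_EIGHT_BYTE_BANK_SIZE if itemsize == 8
            else SHARED_MEM_CONFIG_FOUR_BYTE_BANK_SIZE)

try:
    import pycuda.autoinit  # requires a CUDA driver and device
    # pycuda.autoinit.context.set_shared_config(pick_bank_size(8))  # hypothetical
except Exception:
    pass  # no CUDA setup available; the helper above is still usable
```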

For completeness, here's the original CUDA code from there:

#include <stdio.h>

__global__ void childKernel()
{
    printf("Hello ");
}

__global__ void parentKernel()
{
    // launch child
    childKernel<<<1,1>>>();
    if (cudaSuccess != cudaGetLastError()) {
        return;
    }
    // wait for child to complete
    if (cudaSuccess != cudaDeviceSynchronize()) {
        return;
    }
    printf("World!\n");
}

int main(int argc, char *argv[])
{
    // launch parent
    parentKernel<<<1,1>>>();
    if (cudaSuccess != cudaGetLastError()) {
        return 1;
    }
    // wait for parent to complete
    if (cudaSuccess != cudaDeviceSynchronize()) {
        return 2;
    }
    return 0;
}


And compiled via:
$ nvcc -arch=sm_35 -rdc=true hello_world.cu -o hello -lcudadevrt

(copied & pasted from NVIDIA doc [1], all rights reserved by NVIDIA etc.)
Andreas Kloeckner
2013-02-14 05:36:53 UTC
Hi Ahmed,
Post by Ahmed Fasih
[original message quoted in full; snipped]
Can you send a runnable example that exhibits that failure? (And yes,
please send that patch!) I've got access to a K20, but it currently
has the weird problem of working fine under OpenCL while refusing to
be detected under CUDA...?

Andreas
Ahmed Fasih
2013-02-15 17:26:45 UTC
Thanks Andreas, attached are a .cu file and a .py file that I've
adapted from the linked Nvidia document to try to run with PyCUDA. I
can compile and run the .cu file using the command included in its
header, but when I run "python hello.py", I get this error:

$ python hello.py
Traceback (most recent call last):
  File "hello.py", line 11, in <module>
    mod = SourceModule(custr, options=['-rdc=true', '-lcudadevrt'])
  File "/usr/local/lib/python2.7/dist-packages/pycuda-2012.1-py2.7-linux-x86_64.egg/pycuda/compiler.py", line 285, in __init__
    self.module = module_from_buffer(cubin)
pycuda._driver.LogicError: cuModuleLoadDataEx failed: invalid image -
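
For reference, the scrubbed hello.py presumably looked roughly like
this. It is a hypothetical reconstruction: only the SourceModule call
is taken from the traceback, and the kernel source is a guess at the
adapted NVIDIA example (the real file is at the attachment URL at the
end of this message):

```python
# Hypothetical reconstruction of hello.py; only the SourceModule call comes
# from the traceback, everything else is a guess at the adapted example.
custr = r"""
#include <stdio.h>
__global__ void childKernel() { printf("Hello "); }
__global__ void parentKernel()
{
    childKernel<<<1,1>>>();  // device-side launch: dynamic parallelism
    if (cudaSuccess != cudaGetLastError()) return;
    if (cudaSuccess != cudaDeviceSynchronize()) return;  // wait for the child
    printf("World!\n");
}
"""

try:
    import pycuda.autoinit  # requires a CUDA driver and device
    from pycuda.compiler import SourceModule
    # This is the line that raises: cuModuleLoadDataEx failed: invalid image
    mod = SourceModule(custr, options=['-rdc=true', '-lcudadevrt'])
    mod.get_function('parentKernel')(grid=(1, 1), block=(1, 1, 1))
except Exception:
    pass  # without a CUDA setup (or with this bug) the load fails as described
```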

The patch for the 4- vs. 8-byte shared-memory bank configuration is forthcoming. Thanks!
Ahmed


On Thu, Feb 14, 2013 at 12:36 AM, Andreas Kloeckner
Post by Andreas Kloeckner
[previous message quoted in full; snipped]
--
Ahmed Fasih
fasih.1 at osu.edu
wuzzyview at gmail.com
614 547 3323 (Google Voice)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hello.py
Type: application/octet-stream
Size: 352 bytes
Desc: not available
URL: <http://lists.tiker.net/pipermail/pycuda/attachments/20130215/0231b29c/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hello_world.cu
Type: application/octet-stream
Size: 1061 bytes
Desc: not available
URL: <http://lists.tiker.net/pipermail/pycuda/attachments/20130215/0231b29c/attachment-0001.obj>
Andreas Kloeckner
2014-07-14 20:20:39 UTC
Hi all,
Post by Ahmed Fasih
[original message quoted in full; snipped]
Sorry about the thread necromancy. I think I've got a lead on how to
make this happen, here:

https://github.com/inducer/pycuda/issues/45#issuecomment-48953922
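
The rough idea, sketched under the assumption that bypassing
SourceModule's cuModuleLoadDataEx path and loading an nvcc-produced
module from disk helps; the nvcc flags here are guesses, not a
verified recipe, and a relocatable (-rdc=true) image may still need
device-linking against cudadevrt before the driver accepts it:

```python
# Sketch of the precompile-then-load idea: build a module with nvcc outside of
# SourceModule, then load the file directly with the driver API. The flags are
# assumptions; device-linking against cudadevrt may still be required.
import subprocess

def build_nvcc_cmd(src, out, arch='sm_35'):
    """Assemble the nvcc invocation; kept as data so it can be inspected."""
    return ['nvcc', '-arch=' + arch, '-rdc=true', '-cubin', src,
            '-o', out, '-lcudadevrt']

cmd = build_nvcc_cmd('hello_world.cu', 'hello_world.cubin')

try:
    subprocess.check_call(cmd)  # needs the CUDA toolkit on PATH
    import pycuda.autoinit      # needs a CUDA driver and device
    import pycuda.driver as drv
    mod = drv.module_from_file('hello_world.cubin')
except Exception:
    pass  # no toolkit/device here; the command itself is still inspectable
```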

Andreas

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 810 bytes
Desc: not available
URL: <http://lists.tiker.net/pipermail/pycuda/attachments/20140714/8f49957a/attachment.sig>