Ahmed Fasih
2013-02-12 04:34:05 UTC
Hi folks, I write in the hope that someone has gotten a K20 (Kepler, compute
capability 3.5) device and has gotten it to do dynamic parallelism, wherein a
kernel can kick off grids on its own without returning to the CPU. A "hello
world" example is given at [1], page 23. With the suggested nvcc command I was
able to compile and run the hello world perfectly, but when I try to load the
kernels into PyCUDA, I get LogicErrors asserting
"cuModuleLoadDataEx failed: invalid image -"
The errors occur both before and after I take out the runtime API calls like
cudaDeviceSynchronize() from the kernel code, and they persist after I pass
the "-rdc=true" and "-lcudadevrt" options into SourceModule (a sketch of my
SourceModule call is below, after the CUDA listing). Could PyCUDA be choking
on the runtime API calls in a kernel? If anybody's gotten this to work,
please share!
On a related note, I've added support for the sm_35 (sm_30?) feature of
choosing four- versus eight-byte banks for shared memory from within PyCUDA;
I can send out a rough patch.
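Usage would look roughly like this (the names are provisional and may change
before I send the patch; underneath it's just the driver's
cuCtxSetSharedMemConfig / cuFuncSetSharedMemConfig):

import pycuda.autoinit
import pycuda.driver as drv

# Provisional interface: a shared_config enum mirroring the driver's
# CUsharedconfig values, set on the current context.
drv.Context.set_shared_config(drv.shared_config.EIGHT_BYTE_BANK_SIZE)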
[1] http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
For completeness, here's the original CUDA code from there:
#include <stdio.h>

__global__ void childKernel()
{
    printf("Hello ");
}

__global__ void parentKernel()
{
    // launch child
    childKernel<<<1,1>>>();
    if (cudaSuccess != cudaGetLastError()) {
        return;
    }

    // wait for child to complete
    if (cudaSuccess != cudaDeviceSynchronize()) {
        return;
    }

    printf("World!\n");
}

int main(int argc, char *argv[])
{
    // launch parent
    parentKernel<<<1,1>>>();
    if (cudaSuccess != cudaGetLastError()) {
        return 1;
    }

    // wait for parent to complete
    if (cudaSuccess != cudaDeviceSynchronize()) {
        return 2;
    }

    return 0;
}
And compiled via:
$ nvcc -arch=sm_35 -rdc=true hello_world.cu -o hello -lcudadevrt
(copied & pasted from NVIDIA doc [1], all rights reserved by NVIDIA etc.)
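And here's roughly how I'm trying to load the same kernels through PyCUDA (an
illustrative sketch of my attempt, not working code; the SourceModule call is
where the LogicError is raised, and arch/options mirror the nvcc line above):

import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Parent/child kernels from the NVIDIA example above, without main()
kernel_source = """
#include <stdio.h>

__global__ void childKernel()
{
    printf("Hello ");
}

__global__ void parentKernel()
{
    childKernel<<<1,1>>>();
    if (cudaSuccess != cudaDeviceSynchronize()) {
        return;
    }
    printf("World!\\n");
}
"""

# Pass the dynamic-parallelism flags straight through to nvcc; this is
# the call that dies with "cuModuleLoadDataEx failed: invalid image -".
mod = SourceModule(kernel_source,
                   arch="sm_35",
                   options=["-rdc=true", "-lcudadevrt"])

parent = mod.get_function("parentKernel")
parent(block=(1, 1, 1), grid=(1, 1))
drv.Context.synchronize()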