Discussion:
[PyCUDA] [Support Request] How to pass a list of lists to a PyCUDA Kernel?
Frank Ihle
2016-05-11 17:26:28 UTC
Permalink
Hello CUDA,

I am trying to speed up my Python program with a not-so-trivial algorithm, so
I need to know: what is the correct way of transferring a list of lists
of floats to a (Py)CUDA kernel?

*An example*

Given the following list as an example:

listToProc = [[-1,-2,-3,-4,-5], [1,2,3,4,5,6,7,8.1,9]]

it shall be transferred to a PyCUDA kernel for further processing. I
would then proceed with the common approach for transferring a list of
values (not a list of lists), like this:

listToProcAr = np.array(listToProc, dtype=np.object)
listToProcAr_gpu = cuda.mem_alloc(listToProcAr.nbytes)
cuda.memcpy_htod(listToProcAr_gpu, listToProcAr)

*However, this results in two problems:*

1) listToProcAr.nbytes = 2 - i.e. too little memory is reserved. I
believe this can be solved by

listBytes = 0
for currentList in listToProc:
    listBytes += np.array(currentList, dtype=np.float32).nbytes

and replacing the allocation accordingly:

listToProcAr_gpu = cuda.mem_alloc(listBytes)

2) and the *actual problem*

cuda.memcpy_htod(listToProcAr_gpu, listToProcAr) still seems to create
a wrong pointer in the kernel, because trying to access the last
element of the second list (listToProc[1][8]) raises:

PyCUDA WARNING: a clean-up operation failed (dead context maybe?)

So I'm a little clueless at the moment.
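The root cause can be sketched without a GPU at all: a ragged nested list can only become a NumPy *object* array, whose elements are pointers to Python lists on the host, not rows of floats (a minimal sketch; dtype=object is the current spelling of the deprecated np.object):

```python
import numpy as np

listToProc = [[-1, -2, -3, -4, -5], [1, 2, 3, 4, 5, 6, 7, 8.1, 9]]

# A ragged nested list can only become an *object* array: each element
# is a pointer to a Python list on the host, not a row of floats.
listToProcAr = np.array(listToProc, dtype=object)

print(listToProcAr.shape)     # (2,) - two pointers, no float data
print(type(listToProcAr[0]))  # <class 'list'>
# nbytes counts only the pointer storage, so mem_alloc() reserves far
# too little, and memcpy_htod() copies host pointers that are
# meaningless on the device.
```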

------------------------------------------------------------------------

*The PyCUDA code*

__global__ void procTheListKernel(float ** listOfLists)
{
    listOfLists[0][0] = 0;
    listOfLists[1][8] = 0;
    __syncthreads();
}

Can anyone help me out?

Kind Regards
Frank
Andreas Kloeckner
2016-05-12 04:31:54 UTC
Permalink
Hi Frank,
Post by Frank Ihle
I try to speed up my Python program with a not so trivial algorithm, so
I need to know. What is the correct way of transferring a list of lists
of floats to the (Py)CUDA Kernel?
Nested, variable-sized structures are generally tricky to map onto
array-shaped hardware. You'll likely want to store your data in a
CSR-like data structure:

https://en.wikipedia.org/wiki/Sparse_matrix

Scans (such as the one in PyCUDA) can help significantly with the
resulting index computations.
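A minimal NumPy sketch of such a CSR-style packing (the names are illustrative, not a PyCUDA API): concatenate all rows into one flat float32 buffer and keep a row-offset array, so row i lives in data[indptr[i]:indptr[i+1]]. Both arrays are contiguous and can be copied to the device as-is.

```python
import numpy as np

listToProc = [[-1, -2, -3, -4, -5], [1, 2, 3, 4, 5, 6, 7, 8.1, 9]]

# CSR-style packing: one contiguous float32 buffer plus row offsets.
data = np.concatenate([np.asarray(row, dtype=np.float32)
                       for row in listToProc])
indptr = np.concatenate(([0], np.cumsum([len(row) for row in listToProc])))
indptr = indptr.astype(np.int32)

# listToProc[1][8] becomes data[indptr[1] + 8]:
print(indptr.tolist())       # [0, 5, 14]
print(data[indptr[1] + 8])   # 9.0
```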

Hope that helps,
Andreas
Frank Ihle
2016-05-12 22:02:43 UTC
Permalink
Hi Andreas,

and pardon my possible double post; I was not sure whether HTML was
prohibited or not.

Thanks for the suggestion and your quick response; however, I don't
think this is going to solve my case. More precisely, I am trying to
process a stack of Python/OpenCV images in CUDA. I know a single image
can easily be transformed into a 1D pointer, but a list of images is
what I'm thinking about right now. One workaround would be appending
all images into one huge image and keeping the offsets. This way I know
exactly at which (x, y) the next image starts in the huge picture. It
works, but it is not very nice to maintain, at least for my goals.

Another case would be a list of 1D lists of random sizes (m x 1), as
suggested:

listToProc =
[[-1,-2,-3,-4,-5],[1,2,3,4,5,6,7,8.1,9],[2,3],[456,2,3,63],[1,2,3]]

Do you think one of the two cases can be transformed into a sparse
matrix? If so, is there an example of using these matrices with CUDA?
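For the ragged-list case, the mapping onto a CSR layout is direct; a sketch in pure NumPy (on the device, the np.searchsorted call would become a small loop or binary search over indptr inside the kernel):

```python
import numpy as np

listToProc = [[-1, -2, -3, -4, -5],
              [1, 2, 3, 4, 5, 6, 7, 8.1, 9],
              [2, 3],
              [456, 2, 3, 63],
              [1, 2, 3]]

# Flat float32 buffer plus row-offset array (CSR-style).
data = np.concatenate([np.asarray(r, dtype=np.float32) for r in listToProc])
indptr = np.concatenate(([0], np.cumsum([len(r) for r in listToProc])))

# One flat index j per "thread"; a binary search over indptr recovers
# which sub-list the element belongs to and its position inside it.
def row_col(j):
    row = int(np.searchsorted(indptr, j, side="right")) - 1
    return row, int(j - indptr[row])

assert row_col(13) == (1, 8)        # last element of the second list
assert data[13] == np.float32(9.0)
```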

Kind Regards
Frank
Post by Andreas Kloeckner
Hi Frank,
Post by Frank Ihle
I try to speed up my Python program with a not so trivial algorithm, so
I need to know. What is the correct way of transferring a list of lists
of floats to the (Py)CUDA Kernel?
Nested, variable-sized structures are generally tricky to map onto
array-shaped hardware. You'll likely want to store your data in a
CSR-like data structure:
https://en.wikipedia.org/wiki/Sparse_matrix
Scans (such as the one in PyCUDA) can help significantly with the
resulting index computations.
Hope that helps,
Andreas