Discussion:
[PyCUDA] GPUArrays of GPUArrays
Christian Hacker
2015-04-16 16:30:11 UTC
Greetings. I am developing a supervised-learning neural network simulator
with complex weights (MLMVN) and am attempting to parallelize the
underlying linear algebra with PyCUDA. So far I've managed to implement
all the required functions using the GPUArray class, the pycuda.cumath
module, and the scikits.cuda library without writing a single custom
kernel, but there is a significant caveat. Because the topology of the
network (# of layers, # of neurons per layer) is generated dynamically,
the simulator must be able to routinely create a variable number of 2D
arrays (with varying shapes) that contain the weights for each layer.
Consequently, what I need is an array of arrays, where each subarray has
dimensions specific to the layer it represents.

If I were implementing this in numpy, it would be trivial: create a 1D
array with dtype=object and shape=(# of layers,), then assign the 2D
weight array for each layer to the corresponding element of the 1D array.
This would be similar to a Matlab cell array or a C# jagged array, since
the internal dimensions of the array are not all the same. Because the
pycuda.gpuarray class doesn't support element assignment, creating a
device-side container analogous to the numpy array described above isn't
possible. I've tried constructing a "jagged" numpy array and simply
calling gpuarray.to_gpu(numpy_array), but that does not work and seems to
"confuse" my graphics card.

The only solution I've found is to allocate a 1D numpy array of objects
as before, and then iteratively assign a separate GPUArray to each element
to hold the weight array for the corresponding layer. In other words, each
element of the 1D numpy array is a reference to a GPUArray on the graphics
card. There is significant overhead (~1 order of magnitude) in accessing
each GPUArray compared to accessing a numpy array stored on the host
machine, but I assumed this would be a non-issue, since the host code
wouldn't be modifying those GPUArrays anyway, just passing them to the
pycuda.cumath functions and gpuarray operators. That assumption appears to
be incorrect: the GPU simulator runs extremely slowly, and its performance
only deteriorates as the learning sets grow. On the bright side, it can
(and has) converge(d). My conclusion is that the device and host are
constantly swapping data during the simulation, and I suspect my method
for storing each layer's weights is to blame.
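
To make that concrete, here is a minimal sketch of the storage scheme I am
using (the layer shapes and variable names are made up for illustration):

import numpy as np
import pycuda.autoinit          # creates a CUDA context
import pycuda.gpuarray as gpuarray

# Hypothetical layer sizes; each layer gets its own 2D weight matrix.
layer_shapes = [(64, 32), (32, 16), (16, 4)]

# Host-side container: a 1D object array whose elements are GPUArrays.
weights = np.empty(len(layer_shapes), dtype=object)
for i, shape in enumerate(layer_shapes):
    weights[i] = gpuarray.to_gpu(
        np.random.standard_normal(shape).astype(np.complex64))

# Arithmetic still happens on the device, one layer (and one kernel
# launch) at a time.
halved = [0.5 * w for w in weights]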

So my question is this: does referencing a GPUArray from within a numpy
array of objects entail some kind of ungodly overhead, and is there a
*good* way to store a "jagged" GPUArray? If anyone is willing to help me
through this issue, I will be grateful. Source code will be provided upon
request. Apologies for the length of this posting and for the no-doubt
numerous mistakes in it.

CH
Andreas Kloeckner
2015-04-18 01:45:25 UTC
Hi Christian,
Post by Christian Hacker
So my question is this: does referencing a GPUArray from within a numpy
array of objects entail some kind of ungodly overhead, and is there a
*good* way to store a "jagged" GPUArray?
FWIW, I use object arrays with GPUArrays in them all the time, and they
work just fine. One thing to note is that a separate kernel will be
launched to perform arithmetic on each of the sub-arrays. As a result,
if the sub-array size is small enough that kernel launch overhead is
comparable to the cost of the operation on the array, then you will
start seeing a performance impact. I would say that as soon as the size
of your sub-arrays is around 10,000 or so, you should be OK.

If your sub-arrays are smaller and you care about every last bit of
performance, you will likely need to roll a custom solution that stores
segment boundaries along with the array.
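
For illustration, one way such a packed layout might look (the shapes and
helper names here are invented, not anything PyCUDA provides):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

# Hypothetical per-layer weight matrices on the host.
host_layers = [np.random.standard_normal(s).astype(np.complex64)
               for s in [(64, 32), (32, 16), (16, 4)]]

# Pack everything into one flat device array and remember where each
# segment starts, so element-wise work runs as a single large kernel.
sizes = [a.size for a in host_layers]
offsets = np.concatenate(([0], np.cumsum(sizes)))
flat = gpuarray.to_gpu(np.concatenate([a.ravel() for a in host_layers]))

flat *= 0.5   # one kernel launch covering all layers at once

def layer_view(i):
    # Recover a single layer as a contiguous slice of the flat array.
    start, stop = int(offsets[i]), int(offsets[i + 1])
    return flat[start:stop].reshape(host_layers[i].shape)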

Hope that helps,
Andreas
Andreas Kloeckner
2015-04-18 19:30:07 UTC
Dear Christian,

First of all, please make sure to keep the list cc'd in your
replies. Without the benefits of searchable archival, justifying the
time to answer questions is much harder.
Post by Christian Hacker
Thank you for your reply. If I'm understanding you correctly, it is
acceptable to allocate numpy arrays of objects on the host and then assign
GPUArray instances as elements of those arrays. I didn't take into account
the overhead of launching a kernel per sub-array - that may explain why
things run so slowly. I will test the simulator with larger network
topologies once I have PyCUDA set up on a machine with a sufficiently
powerful GPU.
If you will indulge my ignorance a little more, there is another problem I
would like advice on. I have run into a possible bottleneck in the
learning algorithm, specifically where the simulator must compare the
calculated error of the current learning cycle against a user-defined
threshold value to determine whether further learning is required.
Currently I store this threshold value in a (1, 1) GPUArray and use the
overloaded comparison operators to check it against the calculated network
error, which is also stored on the GPU. The issue is that the code driving
the simulator is all host-side: a conditional statement checks the result
of the comparison and decides whether to continue working. Because
comparing two GPUArrays returns another GPUArray holding a binary integer
value, while a Python conditional needs that value on the host, I have no
choice but to transfer a single binary integer from the device to the
host - every single learning cycle. Given the variety of operations the
simulator must perform each learning cycle, it would be unwieldy and
perhaps impossible to use an if_positive(...) call to sidestep this issue.
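To show the pattern concretely, each learning cycle currently boils down
to something like this (the values here are made up):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

threshold = gpuarray.to_gpu(np.full((1, 1), 1e-3))   # user-defined limit
error = gpuarray.to_gpu(np.full((1, 1), 5e-2))       # current network error

# The comparison runs on the device and yields another (1, 1) GPUArray ...
still_learning = error > threshold
# ... but the host-side driver loop needs the actual value, which forces
# a device-to-host read of that single number every learning cycle.
if still_learning.get()[0, 0]:
    pass  # run another learning cycle

So my question is this: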
Is it possible to write a custom kernel (or even a Python function) that
can return integer values to the Python interpreter after evaluating GPU
array data, without requiring the transfer of any of that data from the
device to the host?
Yes, but only in a limited way. With enough mapping/unmapping logic,
device kernels can indeed write to host memory. However, I would
anticipate that the latency incurred in this process is similar to (if
not worse than) that of reading from the device.

Quite simply, if data resides on the device, the only way to get it off
of there is a read. Perhaps the best way to hide that (and quite an easy
one, if I understand your situation right) would be to continue the
computation, overlapped with the transfer, and defer the convergence check
until the transfer finishes. Here's an example of code that does this:

https://github.com/inducer/pycuda/blob/master/pycuda/sparse/cg.py
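
For what it's worth, a minimal sketch of that deferral idea, separate from
the cg.py code and with made-up names and threshold, might look like this:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

error = gpuarray.to_gpu(np.array([1.0]))   # hypothetical device-side error
threshold = 1e-3

stream = drv.Stream()
host_err = drv.pagelocked_empty((1,), np.float64)

for cycle in range(10000):
    # ... update the weights and recompute `error` on the device ...

    # Start copying the current error into pinned host memory, but
    # don't wait for the copy to finish.
    error.get_async(stream=stream, ary=host_err)

    # ... launch the next cycle's kernels here, overlapping the copy ...

    # Only now block on the (tiny) transfer and test convergence,
    # effectively checking one cycle late.
    stream.synchronize()
    if host_err[0] <= threshold:
        break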

Andreas
