Christian Hacker
2015-04-16 16:30:11 UTC
Greetings. I am developing a supervised learning neural network simulator
with complex weights (MLMVN) and am attempting to parallelize the
underlying linear algebra with pycuda. Thus far I've managed to implement
all required functions using the GPUArray class, the pycuda.cumath module,
and the scikits.cuda library without writing a single custom kernel, but
there's a significant caveat. Because the topology of the network (# of
layers, # of neurons per layer) is generated dynamically, the simulator
must be able to routinely create a variable number of 2d arrays (with
varying shapes) that contain the weights for each layer. Consequently,
what I need is an array of arrays, where each subarray has dimensions that
are specific to the layer it represents. If I were implementing this in
numpy, it would be trivial: create a 1d array with dtype=object and
shape=(# of layers,), then assign the 2d weight array for each layer to
the corresponding element of the 1d array. This would be similar to a
Matlab cell array or a C# jagged array, in that the inner arrays do not
all have the same dimensions.
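
For concreteness, the host-side container I have in mind looks roughly
like this (layer shapes are invented purely for illustration):

    import numpy as np

    # Made-up topology: one (rows, cols) weight shape per layer.
    layer_shapes = [(6, 10), (11, 7), (8, 3)]

    # 1d object array, one slot per layer; each slot holds a 2d complex
    # weight matrix of a layer-specific shape.
    weights = np.empty(len(layer_shapes), dtype=object)
    for i, shape in enumerate(layer_shapes):
        weights[i] = (np.random.randn(*shape)
                      + 1j * np.random.randn(*shape)).astype(np.complex64)

    print([w.shape for w in weights])   # [(6, 10), (11, 7), (8, 3)]
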
Because the pycuda.gpuarray class doesn't support element assignment,
creating a device-side container analogous to the numpy array described
above isn't possible. I've tried constructing such a "jagged" numpy
array and simply calling gpuarray.to_gpu(numpy_array), but that does
not work and seems to "confuse" my graphics card. The only solution
I've been able to find is to allocate a 1d numpy array of objects as
before, but then iteratively assign a separate GPUArray to each of its
elements, one per layer's weight matrix. In other words, each element
of the 1d numpy array is a Python-side handle to a GPUArray that lives
on the graphics card.
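
In code, the workaround looks something like this (again a simplified
sketch with made-up shapes, not my actual initialization code):

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray

    layer_shapes = [(6, 10), (11, 7), (8, 3)]   # made-up topology

    # Host-side object array whose elements are device-side GPUArrays.
    gpu_weights = np.empty(len(layer_shapes), dtype=object)
    for i, shape in enumerate(layer_shapes):
        host_w = (np.random.randn(*shape)
                  + 1j * np.random.randn(*shape)).astype(np.complex64)
        gpu_weights[i] = gpuarray.to_gpu(host_w)   # one transfer per layer

    # Each element behaves like any other GPUArray:
    doubled = 2 * gpu_weights[0]   # result stays on the device
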
There is significant overhead (~ 1 order of magnitude) in accessing each
GPUArray compared to accessing a numpy array stored on the host machine,
but I assumed that this would be a non-issue since the host code wouldn't
be modifying those GPUArrays anyway, just passing them to the
pycuda.cumath functions and gpuarray operators.
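
For example, the per-layer work inside a learning iteration is only
supposed to do things along these lines ('gpu_weights' being the object
array from the sketch above; this is a simplified stand-in, not the
actual MLMVN update rule):

    # Every operand is already a GPUArray, so in principle nothing here
    # should touch host memory.
    for i in range(len(gpu_weights)):
        w = gpu_weights[i]                 # host-side handle lookup only
        gpu_weights[i] = w + 0.01 * w      # gpuarray operators, device-side
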
This assumption appears to be incorrect - the GPU simulator runs
extremely slowly, and its performance only gets worse as the learning
set grows. On the bright
side, it can (and has) converge(d). My conclusion is that the device and
host are constantly swapping data during the simulation, and I suspect my
method for storing the weights of each layer is to blame.

So my question is this: does referencing a GPUArray from within a numpy
array of objects entail some kind of ungodly overhead, and is there a
*good* way to store a "jagged" GPUArray? If anyone is willing to help me
through this issue, I will be grateful. Source code will be provided upon
request. Apologies for the length and the no doubt numerous mistakes in
this posting.
CH