Discussion:
[PyCUDA] out of memory issues
Keith Brown
2015-11-23 16:10:45 UTC
Permalink
I have 2 small matrices of shape (160080, 3) and type float32, and I am
calculating their dot product. While doing this, I keep getting
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory.

I have 2 cards, each with 3GB of memory. Each matrix takes about 1875
kilobytes. I am not sure why this is occurring.

x=np.ones((160080,3L)).astype(np.float32)
a_gpu=gpuarray.to_gpu(x)
b_gpu=gpuarray.to_gpu(x)
c_gpu = linalg.dot(a_gpu,b_gpu,'N','T',handle=handle)

My handle is a cublasxt handle (not regular cublas, since blasxt apparently
does better memory handling).

Any idea what is going on?
Andreas Kloeckner
2015-11-23 16:24:46 UTC
Permalink
Is "linalg" the regular numpy linalg module? If so, that's not going to
work, because that effectively accesses the GPU array element-by-element
across the PCIe bus. You probably need to call a cublas wrapper.

Andreas
Keith Brown
2015-11-23 16:28:00 UTC
Permalink
This is the linalg from the cublas wrapper, not from numpy.

Lev Givon
2015-11-23 16:35:55 UTC
Permalink
Did you also modify skcuda.linalg.dot() to explicitly call the cublasXt*gemm
functions rather than the stock cublas*gemm functions? The cublasXt*gemm
functions expect host memory pointers as their arguments, not GPU memory
pointers.
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/


_______________________________________________
PyCUDA mailing list
Keith Brown
2015-11-23 16:43:07 UTC
Permalink
I modified add_dot() to use cublasxt.cublasXtSgemm. I don't think I
need to modify dot(), because it calls add_dot at the end. It's not
calling cublasxt.cublasXtSgemm directly unless my matrix is 1-D (which
it isn't). Correct?

BTW, smaller matrices work fine; it's just the larger ones that fail.
Yiyin Zhou
2015-11-23 16:45:12 UTC
Permalink
Post by Keith Brown
c_gpu = linalg.dot(a_gpu,b_gpu,'N','T',handle=handle)
Isn't your output matrix of size 160080x160080?
Yiyin
Thomas Unterthiner
2015-11-23 16:41:41 UTC
Permalink
You are computing the product of a [160080, 3] and a [3, 160080] matrix,
so the result is a [160080, 160080] matrix. To store a matrix of that
size (as float32) you would need about 95GB of RAM. That's a tough fit
for a 3GB GPU ;-)
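The arithmetic behind that 95GB figure is quick to check:

```python
# Quick check of the result size discussed above: a (160080, 3) by
# (3, 160080) product yields a (160080, 160080) float32 matrix.
rows = 160080
bytes_per_float32 = 4
result_bytes = rows * rows * bytes_per_float32   # ~1.03e11 bytes
result_gib = result_bytes / 2**30
print(round(result_gib, 1))  # roughly 95.5 GiB -- far beyond a 3GB card
```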
Jonas Bardino
2015-11-23 16:38:46 UTC
Permalink
Ehmm, I'm not sure I understand exactly what you're doing, but it sounds
like you're trying to calculate the dot product of a 160080 x 3 matrix and
a similar one transposed, i.e. a 3 x 160080 matrix. That would give you a
160080 x 160080 matrix as the result - which surely won't fit in your 3GB
of GPU memory.

Cheers, Jonas
Keith Brown
2015-11-23 18:58:38 UTC
Permalink
Correct. My result matrix will be too large.

<sigh>

I would have thought cublasXt would take care of this for me - that it
would do some sort of divide and conquer.

Is there a way to attack this sort of problem?
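For what it's worth, cublasXt tiles the computation but the full result still has to live in host memory somewhere; the usual divide-and-conquer answer when the output itself is too big is to produce it one block at a time and reduce each block before moving on. A minimal CPU sketch of that pattern (the block size and the row-sum reduction are placeholders for whatever you actually need from the product):

```python
import numpy as np

# Hypothetical sketch: the full (n, n) product of x @ x.T never fits in
# memory, so compute it one row-block at a time and reduce each block
# (here, a row-wise sum) before discarding it. On a GPU the same loop
# would ship one block's reduced result back per iteration.
def blocked_outer_product_reduce(x, block=4096):
    n = x.shape[0]
    out = np.empty(n, dtype=x.dtype)
    for start in range(0, n, block):
        stop = min(start + block, n)
        c_block = x[start:stop] @ x.T          # (block, n) slice of the product
        out[start:stop] = c_block.sum(axis=1)  # keep only the reduction
    return out

x = np.ones((1000, 3), dtype=np.float32)
print(blocked_outer_product_reduce(x)[:3])  # each entry is 3 * 1000 = 3000
```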
Keith Brown
2015-11-23 20:14:40 UTC
Permalink
Thanks all for the replies.

My goal is simple. At least, I thought it was simple :-)

I have a function where I calculate the dot product:

def F(a, b):
    return np.dot(a.T, b)

I need to do this 8k times. The max size of 'a' and 'b' is (3 million, 1).

For smaller sizes of 'a' and 'b', linalg.dot works great. But I want a
more efficient way using the GPU.

Perhaps the GPU isn't the way to go, since the memory required is too large?
(https://developer.nvidia.com/cublas)
"By using a streaming design, cuBLAS-XT efficiently manages transfers across the PCI-Express bus automatically, which allows input and output data to be stored on the host's system memory. This provides out-of-core operation - the size of operand data is only limited by system memory size, not by GPU on-board memory size."
So I don’t think cuBLAS-XT can help unless you have more than 95 GB of system RAM. If that is not the case, I think you have to step back and think about what you need to do with this array ultimately, and where you want to stage the data if you need to compute all 95 GB of it at once.
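One observation on the (3 million, 1) case: with column vectors, np.dot(a.T, b) is a single scalar, so the 95GB blow-up doesn't arise at all - and the 8k separate calls can be batched into one elementwise multiply plus a column sum, which is a far better fit for a GPU than 8k tiny gemms. A sketch (the sizes and names below are stand-ins for illustration):

```python
import numpy as np

# If every a and b is a column vector, a.T @ b is a scalar. Stacking
# the pairs as columns of A and B turns the whole job into one
# elementwise multiply plus a column sum -- no giant output matrix.
rng = np.random.default_rng(0)
n, k = 1000, 8   # stand-ins for 3 million rows and 8000 pairs
A = rng.standard_normal((n, k)).astype(np.float32)
B = rng.standard_normal((n, k)).astype(np.float32)

batched = (A * B).sum(axis=0)                      # all k dot products at once
reference = np.array([A[:, i] @ B[:, i] for i in range(k)])
print(np.allclose(batched, reference, atol=1e-3))  # True
```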
Keith Brown
2015-11-24 04:03:33 UTC
Permalink
Does anyone have any thoughts? Is this feasible?
Keith Brown
2015-11-24 21:08:40 UTC
Permalink
So it turns out the reason it works on the CPU is that a.T is a view and
doesn't occupy extra memory (if any). For pycuda, I need to do
a.T.copy() to get it to work, but that takes up more memory, which is
leading to a memory allocation error.

Does anyone have an example of dot product with streams?
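Two notes, hedged as suggestions. First, the dot wrapper in this thread already takes transpose flags (the 'N'/'T' arguments in the original call), so passing the transpose flag instead of materializing a.T.copy() may avoid the extra allocation. Second, while I don't have a streams example handy, the chunking pattern a stream-based version would overlap looks like this on the CPU (chunk size is arbitrary):

```python
import numpy as np

# Chunked inner-product sketch: process the vectors one chunk at a
# time, so only one chunk-sized buffer is ever resident, and no
# transposed copy of a is materialized (the reduction never needs a.T).
# A streams version would overlap chunk i's transfer with chunk i-1's
# compute; the loop structure is the same.
def chunked_dot(a, b, chunk=1 << 20):
    total = np.float64(0.0)
    for start in range(0, a.shape[0], chunk):
        stop = start + chunk
        total += np.dot(a[start:stop, 0], b[start:stop, 0])
    return total

a = np.ones((3_000_000, 1), dtype=np.float32)
b = np.full((3_000_000, 1), 2.0, dtype=np.float32)
print(chunked_dot(a, b))  # 6000000.0
```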