Discussion:
[PyCUDA] out of memory issues
Keith Brown
2015-11-23 16:10:45 UTC
Permalink
I have 2 small matrices of shape (160080, 3) and type float32, and I am
calculating their dot product. While doing this, I keep getting
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory.

I have 2 cards, each with 3GB of memory. Each matrix takes about 1875
kilobytes. I am not sure why this is occurring.

x=np.ones((160080,3L)).astype(np.float32)
a_gpu=gpuarray.to_gpu(x)
b_gpu=gpuarray.to_gpu(x)
c_gpu = linalg.dot(a_gpu,b_gpu,'N','T',handle=handle)

My handle is a cublasxt handle (not regular cublas, since blasxt apparently
does better memory handling).

Any idea what is going on?
Andreas Kloeckner
2015-11-23 16:24:46 UTC
Permalink
Is "linalg" the regular numpy linalg module? If so, that's not going to
work, because that effectively accesses the GPU array element-by-element
across the PCIe bus. You probably need to call a cublas wrapper.

Andreas
Keith Brown
2015-11-23 16:28:00 UTC
Permalink
This is the linalg from the cublas wrapper, not from numpy.

Lev Givon
2015-11-23 16:35:55 UTC
Permalink
Did you also modify skcuda.linalg.dot() to explicitly call the cublasXt*gemm
functions rather than the stock cublas*gemm functions? The cublasXt*gemm
functions expect host memory pointers as their arguments, not GPU memory
pointers.
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/


_______________________________________________
PyCUDA mailing list
Keith Brown
2015-11-23 16:43:07 UTC
Permalink
I modified add_dot() to use cublasxt.cublasXtSgemm. I don't think I
need to modify dot(), because it calls add_dot at the end. It's not
calling cublasxt.cublasXtSgemm directly unless my matrix is 1-D (which
it isn't). Correct?

BTW, smaller matrices work fine; it's just the larger ones that fail.
Yiyin Zhou
2015-11-23 16:45:12 UTC
Permalink
Post by Keith Brown
c_gpu = linalg.dot(a_gpu,b_gpu,'N','T',handle=handle)
Isn't your output matrix of size 160080x160080?
Yiyin
Thomas Unterthiner
2015-11-23 16:41:41 UTC
Permalink
You are computing the product of a [160080, 3] and a [3, 160080] matrix,
so the result is a [160080, 160080] matrix. To store a matrix of that
size (as float32) you would need about 95GB of RAM. That's a tough fit
for a 3GB GPU ;-)
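The arithmetic behind that 95GB figure is quick to check:

```python
# Quick check of the result size discussed above: a (160080, 3) by
# (3, 160080) product yields a (160080, 160080) float32 matrix.
rows = 160080
bytes_per_float32 = 4
result_bytes = rows * rows * bytes_per_float32   # ~1.03e11 bytes
result_gib = result_bytes / 2**30
print(round(result_gib, 1))  # roughly 95.5 GiB -- far beyond a 3GB card
```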
Jonas Bardino
2015-11-23 16:38:46 UTC
Permalink
Ehmm, I'm not sure I understand exactly what you're doing, but it sounds
like you're trying to calculate the dot product of a 160080 x 3 matrix and
a similar one transposed, i.e. a 3 x 160080 matrix. That would give you a
160080 x 160080 matrix as the result - which surely won't fit in your 3GB
of GPU memory.

Cheers, Jonas
Keith Brown
2015-11-23 18:58:38 UTC
Permalink
Correct. My result matrix will be too large.

<sigh>

I would have thought cublasXt would take care of this for me - that it
would do some sort of divide and conquer.

Is there a way to attack this sort of problem?
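For what it's worth, cublasXt tiles the computation but the full result still has to live in host memory somewhere; the usual divide-and-conquer answer when the output itself is too big is to produce it one block at a time and reduce each block before moving on. A minimal CPU sketch of that pattern (the block size and the row-sum reduction are placeholders for whatever you actually need from the product):

```python
import numpy as np

# Hypothetical sketch: the full (n, n) product of x @ x.T never fits in
# memory, so compute it one row-block at a time and reduce each block
# (here, a row-wise sum) before discarding it. On a GPU the same loop
# would ship one block's reduced result back per iteration.
def blocked_outer_product_reduce(x, block=4096):
    n = x.shape[0]
    out = np.empty(n, dtype=x.dtype)
    for start in range(0, n, block):
        stop = min(start + block, n)
        c_block = x[start:stop] @ x.T          # (block, n) slice of the product
        out[start:stop] = c_block.sum(axis=1)  # keep only the reduction
    return out

x = np.ones((1000, 3), dtype=np.float32)
print(blocked_outer_product_reduce(x)[:3])  # each entry is 3 * 1000 = 3000
```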
Keith Brown
2015-11-23 20:14:40 UTC
Permalink
Thanks all for the replies.

My goal is simple. At least, I thought it was simple :-)

I have a function where I calculate the dot product:

def F(a, b):
    return np.dot(a.T, b)

I need to do this 8k times. The max size of 'a' and 'b' is (3 million, 1).

For smaller sizes of 'a' and 'b', linalg.dot works great. But I want a
more efficient way using the GPU.

Perhaps the GPU isn't the way to go, since the memory required is too large?
(https://developer.nvidia.com/cublas)
"By using a streaming design, cuBLAS-XT efficiently manages transfers across the PCI-Express bus automatically, which allows input and output data to be stored on the host's system memory. This provides out-of-core operation - the size of operand data is only limited by system memory size, not by GPU on-board memory size."
So I don’t think cuBLAS-XT can help unless you have more than 95 GB of system RAM. If that is not the case, I think you have to step back and think about what you need to do with this array ultimately, and where you want to stage the data if you need to compute all 95 GB of it at once.
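One observation on the (3 million, 1) case: with column vectors, np.dot(a.T, b) is a single scalar, so the 95GB blow-up doesn't arise at all - and the 8k separate calls can be batched into one elementwise multiply plus a column sum, which is a far better fit for a GPU than 8k tiny gemms. A sketch (the sizes and names below are stand-ins for illustration):

```python
import numpy as np

# If every a and b is a column vector, a.T @ b is a scalar. Stacking
# the pairs as columns of A and B turns the whole job into one
# elementwise multiply plus a column sum -- no giant output matrix.
rng = np.random.default_rng(0)
n, k = 1000, 8   # stand-ins for 3 million rows and 8000 pairs
A = rng.standard_normal((n, k)).astype(np.float32)
B = rng.standard_normal((n, k)).astype(np.float32)

batched = (A * B).sum(axis=0)                      # all k dot products at once
reference = np.array([A[:, i] @ B[:, i] for i in range(k)])
print(np.allclose(batched, reference, atol=1e-3))  # True
```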
Keith Brown
2015-11-24 04:03:33 UTC
Permalink
Does anyone have any thoughts? Is this feasible?
Keith Brown
2015-11-24 21:08:40 UTC
Permalink
So it turns out the reason it works on the CPU is that a.T is a view and
doesn't occupy extra memory (if any). For pycuda, I need to do
a.T.copy() to get it to work, but that takes up more memory, which is
leading to a memory allocation error.

Does anyone have an example of dot product with streams?
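Two notes, hedged as suggestions. First, the dot wrapper in this thread already takes transpose flags (the 'N'/'T' arguments in the original call), so passing the transpose flag instead of materializing a.T.copy() may avoid the extra allocation. Second, while I don't have a streams example handy, the chunking pattern a stream-based version would overlap looks like this on the CPU (chunk size is arbitrary):

```python
import numpy as np

# Chunked inner-product sketch: process the vectors one chunk at a
# time, so only one chunk-sized buffer is ever resident, and no
# transposed copy of a is materialized (the reduction never needs a.T).
# A streams version would overlap chunk i's transfer with chunk i-1's
# compute; the loop structure is the same.
def chunked_dot(a, b, chunk=1 << 20):
    total = np.float64(0.0)
    for start in range(0, a.shape[0], chunk):
        stop = start + chunk
        total += np.dot(a[start:stop, 0], b[start:stop, 0])
    return total

a = np.ones((3_000_000, 1), dtype=np.float32)
b = np.full((3_000_000, 1), 2.0, dtype=np.float32)
print(chunked_dot(a, b))  # 6000000.0
```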