Discussion:
[PyCUDA] Handling large matrix multiplication
Keith Brown
2015-11-09 03:23:18 UTC
Permalink
I have several thousand matrices for which I need to calculate dot
products. It seems PyCUDA should do the trick (I hope), but I am
running into an issue with block sizes.

Here is my code

#!/usr/bin/env python
import time
from string import Template
import numpy as np
from pycuda import compiler, gpuarray
import pycuda.autoinit


def main():
    size = 4
    d = {'size': size}

    # One thread computes one element of the size x size product.
    src = Template("""
    __global__ void MatrixMulKernel(float *a, float *b, float *c)
    {
        int tx = threadIdx.x;
        int ty = threadIdx.y;
        float Pvalue = 0;

        for (int k = 0; k < $size; ++k) {
            float Aelement = a[ty * $size + k];
            float Belement = b[k * $size + tx];
            Pvalue += Aelement * Belement;
        }
        c[ty * $size + tx] = Pvalue;
    }
    """)

    a_cpu = np.random.randn(size, size).astype(np.float32)
    b_cpu = np.random.randn(size, size).astype(np.float32)

    a_gpu = gpuarray.to_gpu(a_cpu)
    b_gpu = gpuarray.to_gpu(b_cpu)
    c_gpu = gpuarray.empty((size, size), np.float32)

    mod = compiler.SourceModule(src.substitute(d))
    mm = mod.get_function("MatrixMulKernel")

    # Launch one thread per output element. A fixed 16x16 block with a
    # smaller matrix would let the extra threads index past the end of c,
    # so size the block to the matrix here.
    mm(a_gpu, b_gpu, c_gpu,
       block=(size, size, 1))

    start = time.time()
    gpu_ans = c_gpu.get()   # get() waits for the kernel to finish
    stop = time.time()
    print "Gpu", stop - start

    start = time.time()
    cpu_ans = np.dot(a_cpu, b_cpu)
    stop = time.time()
    print "Cpu", stop - start

    #print gpu_ans
    #print cpu_ans
    print np.allclose(gpu_ans, cpu_ans, atol=1e-03)


if __name__ == '__main__':
    main()


A couple of issues:
When I increase the size of the matrix, the GPU result seems to get less
accurate than the CPU dot product, so I have to use np.allclose to compare them.
Also, what is the optimal block size I should be using?
Lev Givon
2015-11-09 04:33:17 UTC
Permalink
Post by Keith Brown
I have several thousand matrices where I need to calculate their dot
product. So, it seems pyCuda should do the trick (i hope). I am
running into an issue with block sizes.
Here is my code
#!/usr/bin/env python
import sys,time
from string import Template
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
from pycuda.compiler import SourceModule
import pycuda.autoinit
d={}
size=4
d['size']=size
src=Template("""
__global__ void MatrixMulKernel(float *a, float *b, float *c)
{
int tx = threadIdx.x;
int ty = threadIdx.y;
float Pvalue = 0;
for (int k = 0; k < $size; ++k) {
float Aelement = a[ty * $size + k];
float Belement = b[k * $size + tx];
Pvalue += Aelement * Belement;
}
c[ty * $size + tx] = Pvalue;
}
""")
#src.substitute(d)
a_cpu = np.random.randn(size,size).astype(np.float32)
b_cpu = np.random.randn(size,size).astype(np.float32)
a_gpu=gpuarray.to_gpu(a_cpu)
b_gpu=gpuarray.to_gpu(b_cpu)
c_gpu = gpuarray.empty((size,size), np.float32)
src.substitute(d)
mod = compiler.SourceModule(src.substitute(d))
mm=mod.get_function("MatrixMulKernel")
v=mm(a_gpu,b_gpu,c_gpu,
block=(16,16,1),
)
start=time.time()
gpu_ans=c_gpu.get()
stop=time.time()
print "Gpu",stop-start
start=time.time()
cpu_ans=np.dot(a_cpu,b_cpu)
stop=time.time()
print "Cpu",stop-start
#print gpu_ans
#print cpu_ans
print np.allclose(gpu_ans,cpu_ans,atol=1e-03)
When I increase size of matrix it seems it gets less accurate than CPU
dot product therefore I have to use np.allclose to compare.
It isn't necessarily clear that the CPU answer is "more accurate"; since the
summations performed on the GPU may occur in a different order than those on the
CPU and since floating point addition is not associative, the difference between
the GPU and CPU results may become more pronounced for the larger summations
required when computing the dot product of large matrices.
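A tiny illustration of the ordering effect (my own sketch, not part of the
original code): summing the same float32 values in two different orders
usually gives slightly different results.

import numpy as np

x = np.random.randn(100000).astype(np.float32)

# Same values, two accumulation orders; in float32 the results usually
# differ slightly because (a + b) + c != a + (b + c) in general.
s_original = x.sum(dtype=np.float32)
s_sorted = np.sort(x).sum(dtype=np.float32)
print "difference:", s_original - s_sorted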
Post by Keith Brown
Also, what is the optimal block size I should be using?
It depends on your matrix size; you generally want to set the block (and grid)
size to maximize the number of threads active at any given time.
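For example, a minimal sketch of the usual pattern (mine, not from the original
post; the matrix size below is just illustrative): pick a fixed block, compute
how many blocks are needed to tile the output, and guard the kernel with a
bounds check so threads that fall outside the matrix do nothing.

import numpy as np
import pycuda.autoinit
from pycuda import compiler, gpuarray

n = 1000                      # matrix dimension (illustrative)
block = (16, 16, 1)           # threads per block
grid = ((n + block[0] - 1) // block[0],
        (n + block[1] - 1) // block[1])   # enough blocks to cover n x n

kernel = """
__global__ void MatMul(const float *a, const float *b, float *c, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += a[row * n + k] * b[k * n + col];
        c[row * n + col] = acc;
    }
}
"""

a_gpu = gpuarray.to_gpu(np.random.randn(n, n).astype(np.float32))
b_gpu = gpuarray.to_gpu(np.random.randn(n, n).astype(np.float32))
c_gpu = gpuarray.empty((n, n), np.float32)

mod = compiler.SourceModule(kernel)
matmul = mod.get_function("MatMul")
matmul(a_gpu, b_gpu, c_gpu, np.int32(n), block=block, grid=grid)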

If your matrices are very small (4 x 4), it isn't clear that using the GPU will
save you much time compared to using numpy because of the cost of copying the
matrices to and from GPU memory.

Note that if you are dealing with large matrices, you may wish to check out the
CUBLAS functions for matrix multiplication; a dot() function that uses those
functions is available in scikit-cuda [1], although the Python code that makes
the function easy to use may impose some noticeable overhead if you plan to
invoke it several thousand times.

[1] http://scikit-cuda.rtfd.org
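For what it's worth, the CUBLAS-backed path looks roughly like this (a sketch
assuming a recent scikit-cuda importable as skcuda; older releases used the
scikits.cuda namespace):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import skcuda.linalg as linalg

linalg.init()                     # sets up the CUBLAS context

a_gpu = gpuarray.to_gpu(np.random.randn(1024, 1024).astype(np.float32))
b_gpu = gpuarray.to_gpu(np.random.randn(1024, 1024).astype(np.float32))

c_gpu = linalg.dot(a_gpu, b_gpu)  # dispatches to cublasSgemm for float32
result = c_gpu.get()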
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/


Keith Brown
2015-11-09 04:46:47 UTC
Permalink
Thanks Lev.
My matrix size is going to be large, somewhere near n=100000.
So, how can I test between CPU and GPU matrix math? I thought my
technique was good enough, but apparently not.
Post by Lev Givon
Post by Keith Brown
I have several thousand matrices where I need to calculate their dot
product. So, it seems pyCuda should do the trick (i hope). I am
running into an issue with block sizes.
[...]
When I increase size of matrix it seems it gets less accurate than CPU
dot product therefore I have to use np.allclose to compare.
It isn't necessarily clear that the CPU answer is "more accurate"; since the
summations performed on the GPU may occur in a different order than those on the
CPU and since floating point addition is not associative, the difference between
the GPU and CPU results may become more pronounced for the larger summations
required when computing the dot product of large matrices.
Post by Keith Brown
Also, what is the optimal block size I should be using?
It depends on your matrix size; you generally want to set the block (and grid)
size to maximize the number of threads active at a specific time.
If your matrices are very small (4 x 4), it isn't clear that using the GPU will
save you much time compared to using numpy because of the cost of copying the
matrices to and from GPU memory.
Note that if you are dealing with large matrices, you may wish to check out the
CUBLAS functions for matrix multiplication; a dot() function that uses those
functions is available in scikit-cuda [1], although the Python code that makes
the function easy to use may impose some noticeable overhead if you plan to
invoke it several thousand times.
[1] http://scikit-cuda.rtfd.org
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/
Lev Givon
2015-11-09 05:26:56 UTC
Permalink
Post by Keith Brown
Thanks Lev.
My matrix size is going to be large, somewhere near n=100000.
(I assume n = total number of elements in the matrix; a 10**5 x 10**5 matrix
of 32-bit floating point values would by itself occupy 10**5 * 10**5 * 4
bytes = 40 GB, which is more memory than currently available GPUs provide.)
Post by Keith Brown
So, how can I test between CPU and GPU matrix math? I thought my
technique was good enough but apparently not.
If you are trying to ensure that the CPU and GPU are doing as similar floating
point computations as possible, you may want to look into whether the intrinsic
single precision functions that CUDA provides to enable control of rounding
during addition and multiplication (e.g., __fadd_rd, __fadd_rn, etc.) may be
useful, as well as compiler options that affect processing of denormals (e.g.,
--ftz). For the purposes of checking algorithmic correctness against an existing
(CPU-based) implementation, you may want to use double precision (even if you
plan to use single precision for your actual computations). In general, though,
it is prudent to test results (via allclose()) with some defined tolerance in
light of the effects of floating point operations.
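Something along these lines (my sketch; the tolerances are illustrative, not
recommendations) is one way to do that comparison:

import numpy as np

def check_against_cpu(a32, b32, gpu_result, rtol=1e-4, atol=1e-6):
    # Promote the float32 inputs to float64 for the CPU reference, then
    # compare the float32 GPU product against it with explicit tolerances.
    reference = np.dot(a32.astype(np.float64), b32.astype(np.float64))
    return np.allclose(gpu_result.astype(np.float64), reference,
                       rtol=rtol, atol=atol)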
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/
Keith Brown
2015-11-10 02:30:02 UTC
Permalink
Hi Lev,
I started to use NVBLAS, which is a GPU-accelerated implementation of BLAS.
So far it's OK (I think), although I don't see all of my GPUs being utilized
when I do a numpy.dot(A,B). I will play around with it more to get a better
idea. I want to avoid writing my own matrix multiplication method.
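A minimal way to check whether the gemm behind numpy.dot is actually being
offloaded (a sketch; the size and script name are illustrative) is to time it
once as-is and once with NVBLAS preloaded, e.g.
LD_PRELOAD=libnvblas.so NVBLAS_CONFIG_FILE=./nvblas.conf python bench.py.
Only Level-3 BLAS calls are intercepted, and the devices used are governed by
NVBLAS_GPU_LIST in nvblas.conf.

import time
import numpy as np

n = 8192                              # illustrative size
a = np.random.randn(n, n).astype(np.float32)
b = np.random.randn(n, n).astype(np.float32)

start = time.time()
c = np.dot(a, b)                      # 2-D float32 dot maps to sgemm
print "%d x %d sgemm: %.3f s" % (n, n, time.time() - start)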
Post by Lev Givon
Post by Keith Brown
Thanks Lev.
My matrix size is going to be large, somewhere near n=100000.
(I assume n = total number of elements in the matrix; a matrix of size 10**5 x
10**5 32-bit floating point values would require more memory than currently
available GPUs can provide.)
Post by Keith Brown
So, how can I test between CPU and GPU matrix math? I thought my
technique was good enough but apparently not.
If you are trying to ensure that the CPU and GPU are doing as similar floating
point computations as possible, you may want to look into whether the intrinsic
single precision functions that CUDA provides to enable control of rounding
during addition and multiplication (e.g., __fadd_rd, __fadd_rn, etc.) may be
useful, as well as compiler options that affect processing of denormals (e.g.,
--ftz). For the purposes of checking algorithmic correctness against an existing
(CPU-based) implementation, you may want to use double precision (even if you
plan to use single precision for your actual computations). In general, though,
it is prudent to test results (via allclose()) with some defined tolerance in
light of the effects of floating point operations.
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/