Discussion:
GPUs slower than CPU on dot product
archana sapkota
2017-04-25 19:49:38 UTC
Hello,
I just started working with PyCUDA; CUDA as a whole is new to me. I
am trying to use the GPU to compute dot products of a large number
of vectors, because it was taking several days using multiple CPU cores.

But on my first try, I am sad to say I did not see any boost in speed. Here
is the piece of code that I am currently running, just to see how
much speedup I would get. My vectors of interest have a dimension of
around 3000. Eventually I will be computing the dot product (or L2 norm)
of those vectors.

I would highly appreciate it if someone could suggest what I am missing and
how I could achieve my goal.

I also see some difference between the results from numpy and from the GPU.
It is not a big concern right now, but I am curious why.

Here is the sample code I am working with:

import pycuda.gpuarray as gpuarray
import pycuda.reduction as reduction
import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA context on import
from pycuda.compiler import SourceModule
import numpy
import time

# Dot product: sum over i of x[i]*y[i]
krnl = reduction.ReductionKernel(numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="float *x, float *y")
# Sum of squared differences: sum over i of (x[i] - y[i])^2
ssd = reduction.ReductionKernel(numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="(x[i] - y[i])*(x[i] - y[i])",
        arguments="float *x, float *y")

for i in range(10):
    a = numpy.random.randn(3000)  # float64 on the host
    b = numpy.random.randn(3000)

    # The GPU copies are converted to float32
    a_gpu = gpuarray.to_gpu(a.astype(numpy.float32))
    b_gpu = gpuarray.to_gpu(b.astype(numpy.float32))

    # Only the dot product is timed on the CPU
    start = time.time()
    numpy_dot = numpy.dot(a, b)
    end = time.time()
    dt = end - start
    numpy_ssd = numpy.sum((a - b) ** 2)

    print("CPU time", dt)
    print("numpy_dot", numpy_dot)
    print("numpy_ssd", numpy_ssd)

    # Only the dot product is timed on the GPU (kernel launch plus .get())
    start = time.time()
    my_dot_prod = krnl(a_gpu, b_gpu).get()
    end = time.time()
    my_ssd = ssd(a_gpu, b_gpu).get()

    dt = end - start
    print("GPU time", dt)
    print("my dot product", my_dot_prod)
    print("my ssd", my_ssd)
    print("\n")

Example timings are:
CPU time 5.9604644775390625e-06
numpy_dot -19.7736554062
numpy_ssd 5975.41368065
GPU time 0.0009388923645019531
my dot product -19.77365493774414
my ssd 5975.4140625

Thanks,
Arch
Lev E Givon
2017-04-25 20:11:42 UTC
Several points:

- The first invocation of a kernel is slower than subsequent
invocations because it includes the time taken to compile the kernel.
- Owing to the relatively low bandwidth of GPU to host memory
transfers, you will probably not see any overall speedup for
relatively short vectors such as those you are processing if you are
loading a new vector into GPU memory at every iteration. You will
probably see better performance processing your vectors in parallel on
the CPU using something like Python's multiprocessing module or dask.
- Since you are using single-precision floats, you will see
differences between the CUDA and numpy results because of internal
implementation differences.
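
To make the last two points concrete, here is a minimal CPU-only sketch in plain numpy (not PyCUDA). It assumes the vector pairs can be stacked as rows of two 2-D arrays, so that one call handles every pair instead of paying a per-call overhead for each 3000-element vector. The names n_pairs, A and B and the use of numpy.einsum are illustrative choices, not something proposed in this thread. It also compares float64 and float32 results: in the posted code numpy.dot runs on the float64 arrays while the GPU receives float32 copies, which alone can account for small differences.

```python
import numpy

rng = numpy.random.default_rng(0)
n_pairs, dim = 200, 3000  # n_pairs is hypothetical; dim is from the thread

# Each row of A is paired with the same row of B.
A = rng.standard_normal((n_pairs, dim))
B = rng.standard_normal((n_pairs, dim))

# One call computes all row-wise dot products in float64.
dots64 = numpy.einsum("ij,ij->i", A, B)

# Reference: one numpy.dot call per pair, as in the posted loop.
loop = numpy.array([numpy.dot(a, b) for a, b in zip(A, B)])

# Same computation on float32 copies, the precision used on the GPU.
A32 = A.astype(numpy.float32)
B32 = B.astype(numpy.float32)
dots32 = numpy.einsum("ij,ij->i", A32, B32)

# Largest float64 vs float32 discrepancy over all pairs.
max_abs_diff = numpy.max(numpy.abs(dots64 - dots32))
print("max |float64 - float32| =", max_abs_diff)
```

The same row-per-vector layout would also let a single host-to-GPU transfer carry many vectors at once, though whether that pays off on a given GPU would have to be measured.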
--
Lev E. Givon, PhD
http://lebedov.github.io
Daniel Berjón Díez
2017-04-25 21:27:58 UTC
GPUs are good at crunching numbers, but only provided that you can give
them enough numbers to crunch to hide the latency of memory access.

GPUs are only efficient if your problem exhibits high arithmetic intensity,
which can be loosely defined as the number of arithmetic operations per
memory access.

Sadly, dot product is not such a problem: just one product and an addition
for every two operands; the sum does not help either because it cannot be
that well parallelized.
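
As a rough back-of-the-envelope check of this argument, the sketch below estimates arithmetic intensity as floating-point operations per byte moved. That is one common way to make the loose definition concrete and is an assumption here, as are the float32 operand size, the no-cache-reuse model, and the function names. The matrix product is included only as a contrasting case; it is not part of the original question.

```python
# Rough arithmetic-intensity estimates (FLOPs per byte moved).
# Assumptions: float32 operands (4 bytes each), every operand moved
# exactly once, no cache reuse. Function names are illustrative only.

BYTES_PER_FLOAT32 = 4


def dot_intensity(n):
    """Dot product of two length-n vectors: n multiplies + (n - 1) adds."""
    flops = 2 * n - 1
    bytes_moved = 2 * n * BYTES_PER_FLOAT32  # read x and y
    return flops / bytes_moved


def matmul_intensity(n):
    """Product of two n-by-n matrices: n*n outputs, each a length-n dot."""
    flops = n * n * (2 * n - 1)
    bytes_moved = 3 * n * n * BYTES_PER_FLOAT32  # read A and B, write C
    return flops / bytes_moved


n = 3000  # vector dimension mentioned in the thread
print("dot product:    ", dot_intensity(n), "FLOPs/byte")
print("matrix product: ", matmul_intensity(n), "FLOPs/byte")
```

Under these assumptions the dot product stays just below 0.25 FLOPs per byte no matter how long the vectors are, while the matrix product's intensity grows in proportion to n.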

Regards,
Daniel
_______________________________________________
PyCUDA mailing list
https://lists.tiker.net/listinfo/pycuda