[PyCUDA] Is this a candidate that could be ported to pycuda?

Discussion:

Bruce Labitt

2015-03-28 15:29:01 UTC

From reading the documentation, I am unsure whether parallelizing this kind
of function is worth doing in PyCUDA.

I'm trying to add the effect of phase noise into a radar simulation. The
simulation is written in SciPy/NumPy. Currently I am using joblib to run
on multiple cores. It is too slow for the scenarios I wish to try. It does
work for a small number of targets and reduced phase noise array sizes.
The following is the current approach:

Function to parallelize:

from numpy import cos, pi

def MSIN( farray, Mf, tf, jj ):
    """
    farray, Mf, tf, jj

    farray  array of frequencies (size = 10000)
    Mf      array of coefficients (size = 10000)
    tf      2D array ~[2048 x 256] of time
    jj      list of indices (fraction of the problem to solve)
    """
    Msin = 0.0
    for ii in jj:
        # add one weighted cosine term, evaluated over the whole tf array
        Msin = Msin + Mf[ii] * 2.0*cos( 2.0*pi*farray[ii]*tf )
    return Msin

Current method to call the function in parallel (multiprocessing):

from functools import reduce
from operator import add
from joblib import Parallel, delayed

"""
====================================================
Parallel computes the function MSIN with njobs cores
====================================================
"""
# each element of idx is one list of indices, i.e. one job's share of the sum
MMM = Parallel(n_jobs=njobs, max_nbytes=None)(
    delayed(MSIN)( f, aa, tf1, ii ) for ii in idx)
Msin = reduce(add, MMM) # add all the results of the cores together

Any suggestions to port this to PyCUDA? Is it a reasonable candidate?

In essence, it is accumulating a scalar-weighted cos function for many
elements of a 2D array. It 'feels' like it should be portable. Any
roadblocks foreseen? The 2D array of times is contiguous in the sense of
stride. But there are discontinuous jumps in time values in the array,
which I do not think is a problem.
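
For reference, the sum that MSIN accumulates can be written out as a formula. This is only a restatement of the loop above; the row and column indices r, c and the subscripted f_i (standing for farray[i]) are notation introduced here, not taken from the original code.

```latex
M_{\sin}(r, c) \;=\; \sum_{i \,\in\, jj} 2\, M_f[i]\, \cos\!\bigl( 2\pi\, f_i\, t_f(r, c) \bigr)
```

Each output element depends only on its own time value t_f(r, c) and on the shared arrays Mf and farray, so the jumps in time values between neighbouring elements do not couple the elements to each other.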

I have from DumpProperties.py
Device #0: GeForce GTX 680M
Compute Capability: 3.0
Total Memory: 4193984 KB
CAN_MAP_HOST_MEMORY: 1
CLOCK_RATE: 758000
MAX_BLOCK_DIM_X: 1024
MAX_BLOCK_DIM_Y: 1024
MAX_BLOCK_DIM_Z: 64
MAX_GRID_DIM_X: 2147483647
MAX_GRID_DIM_Y: 65535
MAX_GRID_DIM_Z: 65535

CUDA 6.5

Thanks in advance for any insight, or suggestions on how to attack the
problem.

-Bruce

Craig Stringham

2015-03-29 03:19:31 UTC

Hi Bruce,
That's an excellent problem for a GPU. However, because each problem uses a
fair amount of memory, how carefully you access that memory will dominate
your performance gains (as is typical when using a GPU). For example, tf
won't fit in the shared memory or cache of a multiprocessor, so you'll also
want to divide the problem again.
If you don't need to get this working for routine usage, though, you might
just try using numba primitives to move it to a GPU. I haven't used them,
so I can't attest that they will give you a good answer. On the other hand,
this is the sort of problem that makes learning CUDA and PyCUDA easy, so
you might as well give it a shot.
Regards,
Craig
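
A rough back-of-the-envelope check of the point about tf not fitting in shared memory, as a minimal sketch. The 48 KiB figure is an assumption brought in from the CUDA specifications for compute capability 3.0 devices (it does not appear in the thread), and the helper name tf_footprint is hypothetical.

```python
# Assumption (not from the thread): 48 KiB is the largest shared-memory
# allocation a block can use on a compute capability 3.0 device.
SHARED_MEM_BYTES = 48 * 1024

ROWS, COLS = 2048, 256  # approximate tf shape quoted in the original post


def tf_footprint(itemsize):
    """Return (bytes occupied by tf, minimum number of shared-memory-sized tiles)."""
    tf_bytes = ROWS * COLS * itemsize
    tiles = -(-tf_bytes // SHARED_MEM_BYTES)  # ceiling division
    return tf_bytes, tiles


for name, itemsize in (("float32", 4), ("float64", 8)):
    tf_bytes, tiles = tf_footprint(itemsize)
    print(f"{name}: tf occupies {tf_bytes // 2**20} MiB, i.e. at least {tiles} tiles")
```

Under these assumptions tf is tens of times larger than one block's shared memory in either precision, which is why the problem has to be split over tf as well as over the frequency indices.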

_______________________________________________
PyCUDA mailing list
http://lists.tiker.net/listinfo/pycuda

Craig Stringham

2015-03-29 03:26:34 UTC

One other tip on setting up your program: reduce memory accesses as much
as possible, and try to maximize the computation you perform for every
memory transfer. So you'll probably want to load a large chunk of tf and
compute on several indices of Mf and farray.
Craig
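
The access pattern suggested above (read a chunk of tf once, then apply many frequencies to it) can be prototyped on the CPU in NumPy before writing any kernel. This is a minimal sketch, not PyCUDA code: the function names msin_reference and msin_tiled, the tile and fchunk sizes, and the small random test data are all hypothetical choices made for illustration.

```python
import numpy as np


def msin_reference(farray, Mf, tf):
    """Direct form of the original loop: one full pass over tf per frequency."""
    out = np.zeros(tf.shape, dtype=np.float64)
    for k in range(len(farray)):
        out += Mf[k] * 2.0 * np.cos(2.0 * np.pi * farray[k] * tf)
    return out


def msin_tiled(farray, Mf, tf, tile=4096, fchunk=64):
    """Read each tile of tf once and apply blocks of frequencies to it."""
    flat = tf.ravel()
    out = np.zeros(flat.shape, dtype=np.float64)
    for start in range(0, flat.size, tile):
        t = flat[start:start + tile]              # one chunk of time values
        acc = np.zeros(t.shape, dtype=np.float64)
        for k0 in range(0, len(farray), fchunk):
            f = farray[k0:k0 + fchunk, None]      # column of frequencies
            m = Mf[k0:k0 + fchunk, None]          # matching coefficients
            # (frequencies x times) block, summed over the frequency axis
            acc += (m * 2.0 * np.cos(2.0 * np.pi * f * t[None, :])).sum(axis=0)
        out[start:start + tile] = acc
    return out.reshape(tf.shape)


# Small made-up data, only to compare the two orderings.
rng = np.random.default_rng(0)
farray = rng.uniform(1.0, 1e3, size=50)
Mf = rng.normal(size=50)
tf = rng.uniform(0.0, 1e-2, size=(32, 16))

print(np.allclose(msin_reference(farray, Mf, tf),
                  msin_tiled(farray, Mf, tf, tile=100, fchunk=7)))
```

In a CUDA kernel the same idea would roughly correspond to each thread holding its own tf value and looping over the frequencies, but that mapping is an assumption and is not shown here.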
