Discussion:
[PyCUDA] gpuarray functions in sourcemodules
Fil Peters
2015-02-17 16:28:31 UTC
Permalink
Hello,

I am new to PyCUDA and have started testing it. I was wondering whether it is possible to use the gpuarray functions inside a SourceModule.

For example, I was trying to convert the following code into a PyCUDA SourceModule:

numpy code:

fac1 = np.float32(0.5)
fac2 = np.float32(1.001)
for i in range(niter):
    c = a
    d = b
    a = (c + d)*fac1
    b = c*fac2
    f += a*b + a*a
    k = np.dot(a, b)


Everything up to the line "f += a*b + a*a" works well:

mod = SourceModule("""
__global__ void vecmul(float *dest, float *in1, float *in2, float *in3,
                       float *in4, int niter)
{
    const int i = blockDim.x*blockIdx.x + threadIdx.x;

    for (int n = 0; n < niter; n++) {
        in3[i] = in1[i];
        in4[i] = in2[i];
        in1[i] = (in3[i] + in4[i]) * 0.5f;
        in2[i] = in3[i] * 1.001f;
        dest[i] += in1[i] * in2[i] + in1[i] * in1[i];
    }
}
""")

(Of course I realize that the dot product is not very useful in this loop, but in my final program I will need to reuse this value inside the loop.)
So for this specific case the question is how to incorporate the function gpuarray.dot() in the code, or, if that is not possible, how to include a reduction kernel in the SourceModule.

many thanks in advance,
Fil





_______________________________________________
PyCUDA mailing list
Andreas Kloeckner
2015-02-17 16:45:40 UTC
Permalink
Post by Fil Peters
[...]
So for this specific case the question is how to incorporate the
function gpuarray.dot() in the code, or if that is not possible how to
include a reduction kernel in the sourcemodule.
No, you can't, sorry.

Andreas
Fil Peters
2015-02-19 09:03:46 UTC
Permalink
Thanks for the answer. It is a pity that it is not possible to use these functions, especially since it also seems not to be possible to use the cuBLAS functions in source modules. To be able to use the gpuarray functions in a large loop, one has to avoid the copy to main memory. To take from the simple speed-test example:

##################
# GPUArray SECTION
# The result is copied back to main memory on each iteration, this is a bottleneck
....
....
for i in range(n_iter):
    a_gpu = pycuda.cumath.sin(a_gpu)
end.record() # end timing
.....
....

Would it be possible to not copy the result to main memory? (I also do not see why the result needs to be copied back to main memory; it seems more logical to me to copy only when you ask for it.)
Kind regards,
Fil
Post by Andreas Kloeckner
[...]
No, you can't, sorry.
Andreas
Andreas Kloeckner
2015-02-21 00:45:12 UTC
Permalink
Post by Fil Peters
##################
# GPUArray SECTION
# The result is copied back to main memory on each iteration, this is a bottleneck
....
....
a_gpu = pycuda.cumath.sin(a_gpu)
end.record() # end timing
.....
....
Would it be possible to not copy the result to main memory? (I also do not see why the result needs to be copied back to main memory; it seems more logical to me to copy only when you ask for it.)
If you're talking about the memory-bandwidth hit you're taking from the
global load/store, that's easy to solve: investigate ElementwiseKernel
in PyCUDA. It lets you merge multiple operations so that only one
fetch/store cycle is needed.

HTH,
Andreas
