Daniel Gebreiter

2016-10-16 20:22:11 UTC

Hello all,

I get "pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered" errors when I use PyCUDA with matrices over certain sizes. Only restarting Spyder remedies the issue. The matrix sizes are still well below what I believe my graphics card (a GeForce GTX 1060, 3 GB) should be able to handle. Is there a PyCUDA-imposed limit?

I've created a fairly simple example which computes the cross products of pairs of 3D vectors.
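(In the test data further down, every row of a is [1, 0, 0] and every row of b is [0, 1, 0], so every cross product should come out as [0, 0, 1] — a quick NumPy check of the expected result:)

```python
import numpy as np

# Expected result for the test vectors used throughout the example:
# x-hat cross y-hat = z-hat.
print(np.cross([1, 0, 0], [0, 1, 0]))  # [0 0 1]
```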

The code works fine for up to N ≈ 35000 vectors. Above that, I get the following error:

Traceback (most recent call last):
  File "C:\owncloud\Python\float3_example.py", line 68, in <module>
    dest = c_gpu.get()
  File "C:\WinPython-64bit-3.5.2.2Qt5\python-3.5.2.amd64\lib\site-packages\pycuda-2016.1.2-py3.5-win-amd64.egg\pycuda\gpuarray.py", line 271, in get
    _memcpy_discontig(ary, self, async=async, stream=stream)
  File "C:\WinPython-64bit-3.5.2.2Qt5\python-3.5.2.amd64\lib\site-packages\pycuda-2016.1.2-py3.5-win-amd64.egg\pycuda\gpuarray.py", line 1190, in _memcpy_discontig
    drv.memcpy_dtoh(dst, src.gpudata)
pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered

Assuming the problem lies with my code rather than PyCUDA: is there a problem with my use of the float3 vector type inside (but not outside) the CUDA kernel? (The results are correct for small matrices.) I couldn't find a succinct best-practice example of passing lists of 3D vectors (float3s) to a kernel using PyCUDA. Or is the problem the way I have set up blocks and grids? (I tried many configurations.)
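One thing I did convince myself of: as far as I can tell, a C-contiguous (N, 3) float32 NumPy array should have the same memory layout as a CUDA float3 array, since float3 is three packed 4-byte floats with no padding (unlike float4, which is 16-byte aligned). A quick sketch of that check:

```python
import numpy as np

# A C-contiguous (N, 3) float32 array stores 3 consecutive 4-byte
# floats per row, i.e. 12 bytes per vector -- matching sizeof(float3).
a = np.zeros((1000, 3), dtype=np.float32)
assert a.flags['C_CONTIGUOUS']
assert a.itemsize == 4           # 4-byte float32 elements
assert a.strides == (12, 4)      # 12 bytes per vector, no padding
print("layout matches float3[N]")
```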

Many thanks!

Here's the very simple example:

from __future__ import print_function
from __future__ import absolute_import
import pycuda.autoinit
import numpy
from pycuda.compiler import SourceModule
from pycuda import gpuarray

mod = SourceModule("""
__global__ void cross_products(float3* vCs, float3* vAs, float3* vBs, int w, int h)
{
    const int c = blockIdx.x * blockDim.x + threadIdx.x;
    const int r = blockIdx.y * blockDim.y + threadIdx.y;
    int i = r * w + c;  // 1D flat index

    // Check if within array bounds.
    if ((c >= w) || (r >= h))
    {
        return;
    }

    float3 vA = vAs[i];
    float3 vB = vBs[i];
    float3 vC = make_float3(vA.y*vB.z - vA.z*vB.y,
                            vA.z*vB.x - vA.x*vB.z,
                            vA.x*vB.y - vA.y*vB.x);
    vCs[i] = vC;
}
""")

cross_products = mod.get_function("cross_products")

N = 32000  # on my machine, this fails if N > 36000
M = 3

a = numpy.ndarray((N, M), dtype=numpy.float32)
b = numpy.ndarray((N, M), dtype=numpy.float32)
for i in range(0, N):
    a[i] = [1, 0, 0]
    b[i] = [0, 1, 0]

c = numpy.zeros((N, M), dtype=numpy.float32)

print("a x b")
print(numpy.cross(a, b))

M_gpu = numpy.int32(M)
N_gpu = numpy.int32(N)
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)
c_gpu = gpuarray.to_gpu(c)

bx = 32  # 256
by = 32  # 1
gdimX = (int)((M + bx - 1) / bx)
gdimY = (int)((N + by - 1) / by)

print("grid")
print(gdimX)
print(gdimY)

cross_products(c_gpu, a_gpu, b_gpu, M_gpu, N_gpu,
               block=(bx, by, 1), grid=(gdimX, gdimY))

dest = c_gpu.get()

print("dest")
print(dest)
print("diff")
print(dest - numpy.cross(a, b))
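For reference, the ceil-division grid sizing used in the example, isolated as plain Python (the sizes are the ones from my runs):

```python
# Integer ceil division used to size the launch grid; the block in
# the example is 32x32 threads.
def grid_dim(n, block):
    return (n + block - 1) // block

bx = by = 32
M, N = 3, 35000            # N near the size where it starts failing
gdimX = grid_dim(M, bx)    # 1
gdimY = grid_dim(N, by)    # 1094
print(gdimX, gdimY)        # 1 1094
```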
