[PyCUDA] an illegal memory access was encountered

Daniel Gebreiter

2016-10-18 19:18:13 UTC

We've managed to reproduce the error on a second machine. On this machine, which has a slightly less powerful graphics card, the 'illegal memory access' error appeared when N was approx. 17000 (previously: 37000). Below that, no errors or unusual behaviour were recorded, and the results were correct. The combined array sizes were still only a fraction of the capacity of the card (2GB).
Perhaps someone has an explanation? Have I conceptually misunderstood something? Why does the error occur only when N is larger than a given number? And how could I determine from which N the code will fail?
Many thanks,
Daniel
Here's the sample code once more. It produces the pairwise cross-products of two lists of 3d vectors:
from __future__ import print_function
from __future__ import absolute_import
import pycuda.autoinit
import numpy
from pycuda.compiler import SourceModule
from pycuda import gpuarray

mod = SourceModule("""
__global__ void cross_products(float3* vCs, float3* vAs, float3* vBs, int w, int h)
{
    const int r = blockIdx.x * blockDim.x + threadIdx.x;
    const int c = blockIdx.y * blockDim.y + threadIdx.y;
    int i = r * w + c;

    if ((c >= w) || (r >= h) || i > w*h)
    {
        return;
    }

    float3 vA = vAs[i];
    float3 vB = vBs[i];
    float3 vC = make_float3(vA.y*vB.z - vA.z*vB.y,
                            vA.z*vB.x - vA.x*vB.z,
                            vA.x*vB.y - vA.y*vB.x);
    vCs[i] = vC;
}
""")

cross_products = mod.get_function("cross_products")

N = 35000
M = 3
a = numpy.ndarray((N,M), dtype = numpy.float32)
b = numpy.ndarray((N,M), dtype = numpy.float32)
for i in range(0,N):
    a[i] = [1,0,0]
    b[i] = [0,1,0]

c = numpy.zeros((N,M), dtype = numpy.float32)

print("a x b")
print(numpy.cross(a,b))
print(numpy.cross(a,b).nbytes)

M_gpu = numpy.int32(M)
N_gpu = numpy.int32(N)
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)
c_gpu = gpuarray.to_gpu(c)

bx = 32 #256
by = 3 #1
gdimX = int((N + bx-1) / bx)
gdimY = int((M + by-1) / by)
print("grid")
print(gdimX)
print(gdimY)

cross_products(c_gpu, a_gpu, b_gpu, M_gpu, N_gpu, block=(bx,by,1), grid = (gdimX, gdimY))

dest = c_gpu.get()
print(a_gpu.mem_size)
print("dest")
print(dest)
print("diff")
print(numpy.sum(dest-numpy.cross(a,b)))
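One thing that may be worth checking is the index arithmetic itself. The sketch below runs on the host only (no GPU needed) and compares the largest flat index the kernel's guard lets through against the number of float3 elements each buffer holds. It assumes that float3 is three packed 4-byte floats (12 bytes) and that the kernel is launched with w = M and h = N as in the sample; the helper name max_flat_index is made up for illustration. This is a possible explanation, not a confirmed diagnosis.

```python
# Host-side index arithmetic only; nothing here touches the GPU.
N, M = 35000, 3        # values from the sample code
FLOAT3_BYTES = 12      # assumed sizeof(float3): three 4-byte floats


def max_flat_index(w, h):
    """Largest i = r*w + c that passes the guard (c < w and r < h)."""
    return (h - 1) * w + (w - 1)


buffer_bytes = N * M * 4                      # one (N, M) float32 array
float3_count = buffer_bytes // FLOAT3_BYTES   # float3 elements in that array
largest_i = max_flat_index(M, N)              # kernel receives w=M, h=N

print("float3 elements per buffer:", float3_count)
print("largest index passing the guard:", largest_i)
print("index beyond buffer:", largest_i >= float3_count)
```

If this reading is right, the kernel indexes past the end of each buffer for every N, and whether that triggers an error would depend on how the driver happens to lay out memory, which could explain why the threshold differs between the two cards.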

From: ***@hotmail.com
To: ***@tiker.net
Date: Sun, 16 Oct 2016 22:22:11 +0200
Subject: [PyCUDA] an illegal memory access was encountered

Hello all,
I get "pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered" errors when I use PyCUDA with matrices above a certain size. Only a restart of Spyder remedies the issue. The matrix sizes are still well below what I believe my graphics card should be able to handle (a GeForce GTX 1060, 3GB). Is there a limit imposed by PyCUDA?
I've created a fairly simple example which computes the pairwise cross products of two lists of 3D vectors.
The code works fine for up to approx. N = 35000 vectors. Above that, I get the following error:
Traceback (most recent call last):
  File "C:\owncloud\Python\float3_example.py", line 68, in <module>
    dest = c_gpu.get()
  File "C:\WinPython-64bit-3.5.2.2Qt5\python-3.5.2.amd64\lib\site-packages\pycuda-2016.1.2-py3.5-win-amd64.egg\pycuda\gpuarray.py", line 271, in get
    _memcpy_discontig(ary, self, async=async, stream=stream)
  File "C:\WinPython-64bit-3.5.2.2Qt5\python-3.5.2.amd64\lib\site-packages\pycuda-2016.1.2-py3.5-win-amd64.egg\pycuda\gpuarray.py", line 1190, in _memcpy_discontig
    drv.memcpy_dtoh(dst, src.gpudata)
pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered
Assuming the problem lies with my code rather than PyCUDA: is there a problem with my usage of the float3 vector type inside but not outside the CUDA kernel? (The results are correct for small matrices.) I couldn't find a succinct best-practice example of passing lists of 3D vectors (or float3s) to a kernel using PyCUDA. Or is it the way I have set up blocks and grids (I tried many)?
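To make the layout I am assuming explicit, here is a small host-only sketch. It mirrors float3 with a NumPy structured dtype (the name float3 on the Python side is my own stand-in, not a PyCUDA type) and reinterprets an (N, 3) float32 array as N records of three floats each. It assumes the array is C-contiguous and that float3 has no padding.

```python
import numpy

# Hypothetical host-side mirror of CUDA's float3: three packed 32-bit floats.
float3 = numpy.dtype([("x", numpy.float32),
                      ("y", numpy.float32),
                      ("z", numpy.float32)])

N = 4
a = numpy.zeros((N, 3), dtype=numpy.float32)
a[:] = [1, 0, 0]

# Reinterpret each row of three floats as one float3 record (a view, no copy).
a_vec = a.view(float3).reshape(N)

print(float3.itemsize)   # bytes per record
print(a_vec.shape)       # number of float3 elements in the buffer
print(a_vec["x"])        # x components of all vectors
```

Under these assumptions, an (N, 3) float32 array seen through a float3* pointer contains N elements, not 3*N.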
Many thanks!
Here's the very simple example:
from __future__ import print_function
from __future__ import absolute_import
import pycuda.autoinit
import numpy
from pycuda.compiler import SourceModule
from pycuda import gpuarray

mod = SourceModule("""
__global__ void cross_products(float3* vCs, float3* vAs, float3* vBs, int w, int h)
{
    const int c = blockIdx.x * blockDim.x + threadIdx.x;
    const int r = blockIdx.y * blockDim.y + threadIdx.y;
    int i = r * w + c; // 1D flat index

    // Check if within array bounds.
    if ((c >= w) || (r >= h))
    {
        return;
    }

    float3 vA = vAs[i];
    float3 vB = vBs[i];
    float3 vC = make_float3(vA.y*vB.z - vA.z*vB.y,
                            vA.z*vB.x - vA.x*vB.z,
                            vA.x*vB.y - vA.y*vB.x);
    vCs[i] = vC;
}
""")

cross_products = mod.get_function("cross_products")

N = 32000 #on my machine, this fails if N > 36000
M = 3
a = numpy.ndarray((N,M), dtype = numpy.float32)
b = numpy.ndarray((N,M), dtype = numpy.float32)
for i in range(0,N):
    a[i] = [1,0,0]
    b[i] = [0,1,0]

c = numpy.zeros((N,M), dtype = numpy.float32)

print("a x b")
print(numpy.cross(a,b))

M_gpu = numpy.int32(M)
N_gpu = numpy.int32(N)
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)
c_gpu = gpuarray.to_gpu(c)

bx = 32 #256
by = 32 #1
gdimX = int((M + bx-1) / bx)
gdimY = int((N + by-1) / by)
print("grid")
print(gdimX)
print(gdimY)

cross_products(c_gpu, a_gpu, b_gpu, M_gpu, N_gpu, block=(bx,by,1), grid = (gdimX, gdimY))

dest = c_gpu.get()

print("dest")
print(dest)
print("diff")
print(dest-numpy.cross(a,b))
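For comparison, here is a sketch of an alternative layout with one thread per vector on a 1D grid, so that the index never exceeds the number of float3 elements. The kernel source is included only as a string and is not compiled here; the names cross_products_1d, launch_config_1d and emulate_1d are made up for this sketch, and the NumPy emulation stands in for the GPU so the indexing can be checked on any machine. I have not verified that this resolves the error.

```python
import numpy

# CUDA source for a one-thread-per-vector variant (not compiled in this sketch).
KERNEL_1D = """
__global__ void cross_products_1d(float3* vCs, const float3* vAs,
                                  const float3* vBs, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;  // one float3 per thread
    if (i >= n) { return; }                               // n = number of vectors

    float3 vA = vAs[i];
    float3 vB = vBs[i];
    vCs[i] = make_float3(vA.y*vB.z - vA.z*vB.y,
                         vA.z*vB.x - vA.x*vB.z,
                         vA.x*vB.y - vA.y*vB.x);
}
"""


def launch_config_1d(n, block_x=256):
    """Return (block, grid) tuples that cover n vectors with a 1D grid."""
    grid_x = (n + block_x - 1) // block_x   # ceiling division
    return (block_x, 1, 1), (grid_x, 1)


def emulate_1d(a, b, block_x=256):
    """CPU stand-in for the kernel: apply the same guard to every thread index."""
    n = a.shape[0]
    block, grid = launch_config_1d(n, block_x)
    i = numpy.arange(grid[0] * block[0])    # every global thread index
    i = i[i < n]                            # threads that pass the guard
    c = numpy.zeros_like(a)
    c[i] = numpy.cross(a[i], b[i])
    return c, i


N = 35000
a = numpy.tile(numpy.array([1, 0, 0], dtype=numpy.float32), (N, 1))
b = numpy.tile(numpy.array([0, 1, 0], dtype=numpy.float32), (N, 1))
c, touched = emulate_1d(a, b)

print(launch_config_1d(N))
print(touched.max() == N - 1)
print(numpy.array_equal(c, numpy.cross(a, b)))
```

With PyCUDA, the launch would presumably take the form cross_products_1d(c_gpu, a_gpu, b_gpu, numpy.int32(N), block=block, grid=grid), passing the number of vectors rather than separate width and height values.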