Andreas Schäfer
2015-07-11 22:56:40 UTC
Hi pycuda community,

I'm rather new to programming with pycuda and am currently trying to implement a shallow water wave equation solver in it. It has worked pretty well so far, but the major bottleneck in terms of speed is the iteration process. I have been searching for several hours for an appropriate solution and still end up with a loop in Python that calls 4 independent kernels for each iteration step.

The version that has worked so far used direct driver.In() and driver.Out() calls for each kernel, but that's pretty slow; keeping the data on the device is significantly faster. However, when I work with explicit CUDA memory allocation, all elements are zero after the first iteration. What's my error, I'm wondering?

I attached the code, but here's the current structure:

    # Allocate and copy all arrays to the device:
    u_gpu = drv.mem_alloc(u.nbytes)
    v_gpu = drv.mem_alloc(v.nbytes)
    eta_gpu = drv.mem_alloc(eta.nbytes)
    …
    drv.memcpy_htod(u_gpu, u)
    … etc.
    …
    for i in range(n_iterations):
        u_old_gpu = u_gpu
        Kernel1(u_gpu, u_old_gpu, v_gpu, … grid, block)
        v_old_gpu = v_gpu
        Kernel2(v_gpu, v_old_gpu, u_gpu, … grid, block)
        Kernel3  - needs kernel2 and kernel1 to finish beforehand
        Kernel4  - needs kernel3 to finish beforehand
        etc.

and then everything is copied back to the host.

If n_iterations > 1, all arrays are filled with zeros, unless I copy the data back to the host and then back to the device again. The same error shows up when I use gpuarrays, so I guess I have a logic error in my CUDA application, maybe a missing device synchronization or something similar?

I'm working with a GeForce GTX 970 and an i7 3770K on Windows 10.

Looking forward to your answer, and thanks in advance!

Cheers,
Andreas
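
One thing that may be worth checking in the structure above is the line `u_old_gpu=u_gpu`: in Python this binds a second name to the same device allocation and does not copy any data, so a kernel that receives both would read and write the same buffer. The sketch below illustrates only that aliasing effect on the host, using plain Python arrays as stand-ins for device buffers; `shift_kernel`, `step_aliased` and `step_copied` are hypothetical names, and the real kernels and the actual cause of the zeros may well be different. On the device, the analogue of the explicit copy would be a separately allocated "old" buffer filled with `drv.memcpy_dtod(dest, src, size)`.

```python
from array import array


def shift_kernel(u_new, u_old):
    """Hypothetical stand-in for a CUDA kernel: each cell takes the old
    value of its left neighbour; the first cell is set to zero."""
    for i in range(len(u_new)):
        u_new[i] = u_old[i - 1] if i > 0 else 0.0


def step_aliased(u):
    # Mirrors `u_old_gpu = u_gpu`: a second name for the SAME buffer, no copy.
    u_old = u
    shift_kernel(u, u_old)
    return u


def step_copied(u, u_old):
    # Explicit element-wise copy into a separate, pre-allocated buffer
    # before the kernel overwrites `u`.
    u_old[:] = u
    shift_kernel(u, u_old)
    return u


aliased = step_aliased(array("d", [1.0, 2.0, 3.0, 4.0]))
copied = step_copied(array("d", [1.0, 2.0, 3.0, 4.0]), array("d", [0.0] * 4))

print(aliased.tolist())
print(copied.tolist())
```

In this sequential stand-in the aliased version overwrites values before they are read, while the copied version keeps the old state intact. On a GPU the same aliasing would be a read/write race between threads, so the exact outcome there is not predictable from this sketch.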