Discussion:
[PyCUDA] Question How to Safely Use pycuda with mpi4py
kjs
2014-11-02 02:04:13 UTC
Permalink
Hello,

I have written an MPI routine in Python that sends jobs to N worker
processes. The root process handles file IO and the workers do the
computation. In the worker processes, calls are made to a CUDA-enabled
GPU to do FFTs.

Is it safe to have N processes potentially making calls to the same GPU
at the same time? I have not made any amendments to the CUDA code[0],
and have little knowledge of what could possibly go wrong.

Thanks much,
Kevin

[0] I am using python-mne with CUDA enabled to call scikits.cuda.fft
https://github.com/mne-tools/mne-python/blob/master/mne/cuda.py
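
For concreteness, a minimal sketch of the layout described above: rank 0
handles IO, each worker owns its own CUDA context and runs the FFT. The
job data, sizes, and overall structure are placeholders, not Kevin's
actual code; it assumes mpi4py, PyCUDA, and scikits.cuda are installed.

# Hypothetical sketch: root scatters jobs, workers FFT them on the GPU.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Root: one placeholder job per process (rank 0's own slot is unused).
    jobs = [np.random.randn(4096).astype(np.float32)
            for _ in range(comm.Get_size())]
else:
    jobs = None

job = comm.scatter(jobs, root=0)

result = None
if rank != 0:
    # Importing pycuda.autoinit creates one CUDA context per process;
    # each worker must do this itself, since contexts cannot be shared
    # across processes.
    import pycuda.autoinit  # noqa: F401
    import pycuda.gpuarray as gpuarray
    from scikits.cuda import fft  # the FFT bindings mne's cuda.py uses

    x_gpu = gpuarray.to_gpu(job)
    xf_gpu = gpuarray.empty(job.shape[0] // 2 + 1, np.complex64)
    plan = fft.Plan(job.shape, np.float32, np.complex64)
    fft.fft(x_gpu, xf_gpu, plan)  # real-to-complex FFT on the GPU
    result = xf_gpu.get()

results = comm.gather(result, root=0)  # root collects the spectra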
Andreas Kloeckner
2014-11-02 02:50:00 UTC
Permalink
Yes, the CUDA runtime manages this just fine, in the sense that it is
safe. N processes talking to the same GPU may not yield awesome
performance, though.

Andreas
Eric Larson
2014-11-02 02:57:09 UTC
Permalink
Hey Kevin,

Not sure about the CUDA limitations; I'll let others speak to that...

But in developing the mne-python CUDA filtering code, IIRC the primary
limitation was (by far) transferring the data to and from the GPU. The FFT
computations themselves were a fraction of the total time. I suspect using
multiple jobs won't help CUDA filtering very much since the jobs would
presumably compete for the same memory bandwidth, but I would love to be
wrong about this. If it works better, it would be great to open an
mne-python issue for it, as we are always looking for speedups :)

Cheers,
Eric
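
One way to see where the time goes is to time the copies and the FFT
separately. A rough sketch (the array size is arbitrary, and the
skcuda/scikits.cuda import assumes the same bindings as above):

# Time host->device copy, the FFT, and device->host copy separately.
import time

import numpy as np
import pycuda.autoinit  # noqa: F401
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from scikits.cuda import fft

x = np.random.randn(2 ** 22).astype(np.float32)
xf_gpu = gpuarray.empty(x.shape[0] // 2 + 1, np.complex64)
plan = fft.Plan(x.shape, np.float32, np.complex64)

t0 = time.time()
x_gpu = gpuarray.to_gpu(x)        # host -> device
drv.Context.synchronize()
t_up = time.time() - t0

t0 = time.time()
fft.fft(x_gpu, xf_gpu, plan)      # the FFT itself
drv.Context.synchronize()
t_fft = time.time() - t0

t0 = time.time()
xf = xf_gpu.get()                 # device -> host (blocks until done)
t_down = time.time() - t0

print("upload %.4f s, fft %.4f s, download %.4f s" % (t_up, t_fft, t_down))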
kjs
2014-11-02 03:55:47 UTC
Permalink
Thanks Andreas, this is good to know. I noticed that even though PyCUDA
is currently using only one of two GPUs, that GPU is only ever at ~35%
memory and ~22% processing utilization. This could be related to Eric's
observation that the PCIe x16 bus bandwidth reaches capacity while the
GPU is pushing FFT'ed arrays back out, which would allow only one or two
arrays onto the GPU at the same time.

From what I have seen, using CUDA speeds up my FFTs ~2x, though the
workers do many other computations on the CPU. It's a worst-case
scenario that all N workers try to send data to the GPU at the same
time.

-Kevin
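
Since only one of the two GPUs is being used, one option would be to
derive the device from the MPI rank instead of relying on
pycuda.autoinit's default of device 0. A hypothetical sketch:

# Round-robin workers over all available GPUs.
import pycuda.driver as drv
from mpi4py import MPI

drv.init()
rank = MPI.COMM_WORLD.Get_rank()
dev = drv.Device(rank % drv.Device.count())
ctx = dev.make_context()  # replaces pycuda.autoinit for this process
try:
    # ... build FFT plans and process jobs here ...
    pass
finally:
    ctx.pop()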
kjs
2014-11-06 16:10:57 UTC
Permalink
In the routine I described earlier, I am beginning to see the following
error. Please note, I was able to run this routine successfully all the
way through when PyCUDA was linked against the system CUDA5. The errors
started popping up after I installed CUDA6 system-wide and recompiled
PyCUDA. I am running Debian Testing.

Traceback (most recent call last):
  File "feature_extractor.py", line 475, in <module>
    main()
  File "feature_extractor.py", line 467, in main
    fe.set_features(fname[0])
  File "feature_extractor.py", line 51, in set_features
    self.apply_filters()
  File "feature_extractor.py", line 99, in apply_filters
    n_jobs='cuda', copy = False, verbose=False)
  File "<string>", line 2, in band_stop_filter
  File "/home/kjs/py-virt-envs/dreateam/local/lib/python2.7/site-packages/mne-0.9.git-py2.7.egg/mne/utils.py", line 509, in verbose
    return function(*args, **kwargs)
  File "/home/kjs/py-virt-envs/dreateam/local/lib/python2.7/site-packages/mne-0.9.git-py2.7.egg/mne/filter.py", line 742, in band_stop_filter
    xf = _filter(x, Fs, freq, gain, filter_length, picks, n_jobs, copy)
  File "/home/kjs/py-virt-envs/dreateam/local/lib/python2.7/site-packages/mne-0.9.git-py2.7.egg/mne/filter.py", line 345, in _filter
    n_jobs=n_jobs)
  File "/home/kjs/py-virt-envs/dreateam/local/lib/python2.7/site-packages/mne-0.9.git-py2.7.egg/mne/filter.py", line 141, in _overlap_add_filter
    n_segments, n_seg, cuda_dict)
  File "/home/kjs/py-virt-envs/dreateam/local/lib/python2.7/site-packages/mne-0.9.git-py2.7.egg/mne/filter.py", line 173, in _1d_overlap_filter
    prod = fft_multiply_repeated(h_fft, seg, cuda_dict)
  File "/home/kjs/py-virt-envs/dreateam/local/lib/python2.7/site-packages/mne-0.9.git-py2.7.egg/mne/cuda.py", line 196, in fft_multiply_repeated
    x = np.array(cuda_dict['x'].get(), dtype=x.dtype, subok=True,
  File "/home/kjs/py-virt-envs/dreateam/local/lib/python2.7/site-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/gpuarray.py", line 264, in get
    drv.memcpy_dtoh(ary, self.gpudata)
pycuda._driver.LaunchError: cuMemcpyDtoH failed: launch timeout
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: launch timeout
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: launch timeout
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: launch timeout
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: launch timeout

Thanks,
Kevin
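
"launch timeout" usually points at the display watchdog: on a GPU that
also drives a display, the driver kills any kernel (and with it the
context) that runs longer than a few seconds. A quick sketch to check
whether the watchdog is active on each device:

# Print whether the kernel-execution watchdog is enabled per GPU.
import pycuda.driver as drv

drv.init()
for i in range(drv.Device.count()):
    dev = drv.Device(i)
    wd = dev.get_attribute(drv.device_attribute.KERNEL_EXEC_TIMEOUT)
    print("device %d (%s): watchdog %s" % (i, dev.name(),
                                           "on" if wd else "off"))

If the watchdog is on for the busy GPU, running the workers on the
non-display GPU (or shrinking the per-launch workload) would sidestep it.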
Eric Larson
2014-11-06 16:26:45 UTC
Permalink
If it's not too much hassle, you could try uninstalling all CUDA5-related
system packages to ensure that PyCUDA is linking to the appropriate CUDA6
library, headers, etc., but I doubt that's actually your problem.

Eric
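
One quick way to confirm the linkage, using PyCUDA's own version
queries (a sketch; the printed values are examples):

import pycuda.driver as drv

drv.init()
# Tuple of the CUDA version PyCUDA was compiled against, e.g. (6, 0, 0)
print("compiled against CUDA:", drv.get_version())
# Integer driver version, e.g. 6000 for a CUDA 6.0 driver
print("driver CUDA version:", drv.get_driver_version())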