Discussion:
[PyCUDA] Possible to use pycuda.driver.memcpy_peer to communicate between processes?
Gavin Weiguang Ding
2014-10-06 20:30:46 UTC
Hi,

I'm trying to do p2p communication between two GPUs without going through
CPU memory.
I also need to communicate between two processes (because a single Theano
process can only use one GPU). Is that possible with PyCUDA, specifically
with pycuda.driver.memcpy_peer?

Any comments/suggestions?

Thanks!
Gavin
Lev Givon
2014-10-06 20:54:33 UTC
Yes, but your GPUs need to support GPUDirect peer-to-peer communication.

Assuming that you have the appropriate hardware, you can also use mpi4py to
transfer data between GPUs if it has been compiled against an MPI implementation
that supports GPU-to-GPU communication (e.g., OpenMPI 1.8, MVAPICH2 2.0).
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Gavin Weiguang Ding
2014-10-07 03:55:49 UTC
Hi Lev,

Thanks for the reply!
My GPUs do support GPUDirect; I've verified this with the "simpleP2P"
example from the CUDA samples.

I've experimented with it a bit, but without success. I'm new to
PyCUDA and multiprocessing, so excuse me if I ask dumb questions.

If I understand it correctly, I need to call pycuda.driver.init() and
make_context() inside each process.

But to use pycuda.driver.memcpy_peer, I need to pass the context created in
one process to the other. When I try to pass the context with a Pipe or
Queue from multiprocessing, I get a pickling error.
Is this the right approach, assuming the pickling error can be solved?
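
For reference, the failure doesn't seem specific to PyCUDA: any object that
wraps process-local state refuses to pickle, and multiprocessing's Queue and
Pipe pickle everything they carry. A minimal CPU-only sketch, using a
threading.Lock as a stand-in for a CUDA context:

```python
import pickle
import threading

# A Lock, like a CUDA context, wraps process-local state and cannot
# be serialized; Queue.put() and Pipe.send() pickle under the hood.
lock = threading.Lock()
try:
    pickle.dumps(lock)
except TypeError as exc:
    print("pickling failed:", exc)
```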

I've attached my toy example code, if that could explain my problem better.
Thanks!

import atexit
import numpy as np
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from multiprocessing import Process, Queue


def gen_gpuarray(obj_queue):
    # generate a gpuarray to pass
    drv.init()
    dev0 = drv.Device(1)
    ctx0 = dev0.make_context()
    atexit.register(ctx0.pop)

    ctx0.push()

    # exchange contexts with the other process (this is where I get
    # the pickling error)
    obj_queue.put(ctx0)
    ctx1 = obj_queue.get()

    ctx0.enable_peer_access(ctx1)

    x = np.random.rand(1000)
    x_gpu = gpuarray.to_gpu(x)

    obj_queue.put(x_gpu)

    ctx0.pop()


def get_gpuarray(obj_queue):
    # try to receive the gpuarray
    drv.init()
    dev1 = drv.Device(2)
    ctx1 = dev1.make_context()
    atexit.register(ctx1.pop)

    ctx1.push()

    ctx0 = obj_queue.get()
    obj_queue.put(ctx1)
    ctx1.enable_peer_access(ctx0)

    to_get_gpu = obj_queue.get()

    y_gpu = gpuarray.zeros_like(to_get_gpu)

    drv.memcpy_peer(y_gpu.ptr, to_get_gpu.ptr,
                    to_get_gpu.dtype.itemsize * to_get_gpu.size,
                    ctx1, ctx0)

    ctx1.pop()


if __name__ == '__main__':
    q = Queue()

    process_gen = Process(target=gen_gpuarray, args=(q,))
    process_get = Process(target=get_gpuarray, args=(q,))

    process_gen.start()
    process_get.start()

    process_gen.join()
    process_get.join()
Lev Givon
2014-10-07 04:21:26 UTC
Since CUDA contexts are private to the process that creates them, you can't use
a context set up in one process from another. In recent versions of CUDA,
however, you can use its IPC API to share a GPU memory allocation created in one
process with another process. See https://gist.github.com/lebedov/6408165 for
an example of how to use the API from PyCUDA (requires pyzmq).
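
Incidentally, the gist follows the general resource-handle idiom: rather than
shipping an unpicklable resource across processes, you ship a small picklable
handle and reopen the resource on the receiving side. Stripped of CUDA, the
shape of it looks like this (a temp file stands in for device memory and its
path for the IPC handle; all names here are illustrative):

```python
import multiprocessing as mp
import os
import tempfile


def producer(q):
    # Allocate the "resource" and send only a small, picklable handle
    # (here, the file path) across the Queue.
    fd, path = tempfile.mkstemp()
    os.write(fd, b"device data")
    os.close(fd)
    q.put(path)


def consumer(q, result):
    # Reopen the resource from the handle; no unpicklable object
    # ever crossed the process boundary.
    path = q.get()
    with open(path, "rb") as f:
        result.put(f.read())
    os.remove(path)


if __name__ == "__main__":
    ctx = mp.get_context("fork")  # assumes a POSIX platform
    q, result = ctx.Queue(), ctx.Queue()
    p1 = ctx.Process(target=producer, args=(q,))
    p2 = ctx.Process(target=consumer, args=(q, result))
    p1.start(); p2.start()
    print(result.get())
    p1.join(); p2.join()
```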
Gavin Weiguang Ding
2014-10-07 22:40:46 UTC
Hi Lev,

Thanks a lot for the example! It really helps!

Just to make sure I understand the example correctly:

Does the actual p2p data transfer happen when the following line executes?
x_gpu = gpuarray.GPUArray(shape, dtype, gpudata=drv.IPCMemoryHandle(h))

So, if I use different devices in proc1 and proc2, would that line actually
transfer the data from device 1 to device 2 through p2p?

My GPU server has 3 GPUs, and only devices 1 and 2 can do p2p.
What I observe is that the example works when
1. proc1 and proc2 use the same device
2. proc1 and proc2 use devices 1 and 2 respectively, or vice versa

It doesn't work when proc1 uses device 0 and proc2 uses device 1 or 2
(or vice versa); it fails with
"LogicError: cuIpcOpenMemHandle failed: invalid/unknown error code"

So I guess that's the intended behaviour, and IPCMemoryHandle only works
when p2p (or staying within the same GPU) is possible?

Thanks a lot!
Lev Givon
2014-10-08 06:05:49 UTC
CUDA's IPC mechanism doesn't actually copy any data; it makes it possible
to share a device pointer created in one process with some other process. You
still need to transfer the data from one location to the other.
Gavin Weiguang Ding
2014-10-08 14:41:01 UTC
I see.

Again, to confirm: are the following lines (in proc2) the right way to
actually copy the data?

x_gpu = gpuarray.GPUArray(shape, dtype, gpudata=drv.IPCMemoryHandle(h))
y_gpu = gpuarray.zeros_like(x_gpu)
drv.memcpy_peer(y_gpu.ptr, x_gpu.ptr,
                x_gpu.dtype.itemsize * x_gpu.size,
                ctx, ctx)

It seems to work for me; I'm just not sure about passing the same context
as both dest_ctx and src_ctx.
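
One small aside: the byte count computed by hand there,
x_gpu.dtype.itemsize * x_gpu.size, is the same value numpy exposes as the
nbytes attribute (and, if I recall correctly, gpuarray mirrors it), so the
two spellings are interchangeable. A CPU-only check:

```python
import numpy as np

x = np.random.rand(1000)  # float64, so 8 bytes per element
# itemsize * size is the total byte count of the array's data
assert x.dtype.itemsize * x.size == x.nbytes == 8000
```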

Thanks a lot!!!
Lev Givon
2014-10-08 15:29:48 UTC
Yes - that should do it.