Discussion:
[PyCUDA] Question about managing contexts
Alex Park
2015-04-28 22:41:33 UTC
Hi,

I'm trying to use multiple GPUs with MPI, using IPC handles for p2p
communication instead of the built-in MPI primitives.

I think I'm not quite understanding how contexts should be managed. For
example, I have two versions of a toy example to try out accessing data
between nodes via ipc handle. Both seem to work, in the sense that process
1 can 'see' the data from process 0, but the first version completes
without any error, while the second version generates the following error:

PyCUDA WARNING: a clean-up operation failed (dead context maybe?)


cuMemFree failed: invalid value

The two versions are attached below. Would appreciate any insight as to
what I'm doing wrong.

-Alex



Here are the two versions:

*VERSION 1*

from mpi4py import MPI
import numpy as np
import atexit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray


class TestMGPU(object):
    def __init__(self):
        self.mpi_size = MPI.COMM_WORLD.size
        self.mpi_rank = MPI.COMM_WORLD.rank

    def proc(self):
        if self.mpi_rank == 0:
            ctx = drv.Device(self.mpi_rank).make_context()
            self.x_gpu = gpuarray.to_gpu(np.random.rand(8))
            h = drv.mem_get_ipc_handle(self.x_gpu.ptr)
            MPI.COMM_WORLD.send((h, self.x_gpu.shape, self.x_gpu.dtype), dest=1)
            print 'p1 self.x_gpu:', self.x_gpu
            ctx.detach()
        else:
            ctx = drv.Device(self.mpi_rank).make_context()
            h, s, d = MPI.COMM_WORLD.recv(source=0)
            ptr = drv.IPCMemoryHandle(h)
            xt_gpu = gpuarray.GPUArray(s, d, gpudata=ptr)
            print 'xt_gpu: ', xt_gpu
            ctx.detach()


if __name__ == '__main__':
    drv.init()
    atexit.register(MPI.Finalize)
    a = TestMGPU()
    a.proc()




*VERSION 2 (Imports are the same)*

class TestMGPU(object):
    def __init__(self):
        self.mpi_size = MPI.COMM_WORLD.size
        self.mpi_rank = MPI.COMM_WORLD.rank
        self.x_gpu = gpuarray.to_gpu(np.random.rand(8))

    def proc(self):
        if self.mpi_rank == 0:
            h = drv.mem_get_ipc_handle(self.x_gpu.ptr)
            MPI.COMM_WORLD.send((h, self.x_gpu.shape, self.x_gpu.dtype), dest=1)
            print 'p1 self.x_gpu:', self.x_gpu
        else:
            h, s, d = MPI.COMM_WORLD.recv(source=0)
            ptr = drv.IPCMemoryHandle(h)
            xt_gpu = gpuarray.GPUArray(s, d, gpudata=ptr)
            print 'xt_gpu: ', xt_gpu


if __name__ == '__main__':
    drv.init()
    ctx = drv.Device(MPI.COMM_WORLD.rank).make_context()
    atexit.register(ctx.pop)
    atexit.register(MPI.Finalize)
    a = TestMGPU()
    a.proc()
Andreas Kloeckner
2015-05-01 16:34:38 UTC
Post by Alex Park
Hi,
I'm trying to use multiple GPUs with MPI, using IPC handles for p2p
communication instead of the built-in MPI primitives.
I think I'm not quite understanding how contexts should be managed. For
example, I have two versions of a toy example to try out accessing data
between nodes via ipc handle. Both seem to work, in the sense that process
1 can 'see' the data from process 0, but the first version completes
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: invalid value
The two versions are attached below. Would appreciate any insight as to
what I'm doing wrong.
Context management in CUDA is a bit of a mess. In particular, resources
in a given context cannot be freed if that context isn't (or can't be
made) the active context in the current thread. Your Context.pop atexit
makes sure that PyCUDA can select contexts when it tries to do clean-up,
which may (because of MPI) run in a different thread than the one that
does the bulk of the work.

That's my best guess as to what's going on. tl;dr: The Context.pop() is
the main difference.
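The rule described above can be seen in a GPU-free toy model. The classes below are illustrative stand-ins, not the PyCUDA or CUDA driver API: a per-thread context stack, and an allocation that can only be freed while its owning context is current.

```python
# Toy model of CUDA's context-stack rule: a resource can only be
# freed while its owning context is current on the calling thread.
# Illustrative stand-ins, not the PyCUDA API.

class Context(object):
    _stack = []                      # per-thread stack of active contexts

    def make_current(self):
        Context._stack.append(self)

    def pop(self):
        assert Context._stack and Context._stack[-1] is self
        Context._stack.pop()

    @classmethod
    def current(cls):
        return cls._stack[-1] if cls._stack else None


class Allocation(object):
    def __init__(self, ctx):
        self.ctx = ctx
        self.freed = False

    def free(self):
        # Mirrors cuMemFree failing with "invalid value" when the
        # owning context is not (and cannot be made) current.
        if Context.current() is not self.ctx:
            raise RuntimeError("clean-up failed: dead context maybe?")
        self.freed = True


ctx = Context()
ctx.make_current()
buf = Allocation(ctx)

ctx.pop()                # e.g. via atexit.register(ctx.pop)
try:
    buf.free()           # cleanup now runs with no current context
except RuntimeError as e:
    print(e)             # clean-up failed: dead context maybe?
```

In this model, popping the context before cleanup runs is exactly what makes the free fail, which matches the symptom in Version 2.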

Andreas
Alex Park
2015-05-11 19:44:16 UTC
Hi,

Thank you for the response.

As a followup question, I was looking at the underlying code for
ipc_mem_handle, and it seems like when a handle is deleted, it tries to do
a mem_free on the underlying device pointer.

So could there not be a situation as follows:

1. Process A allocates gpuarray G and passes IPC handle H to Process B
2. Process B unpacks H, does something with it, then lets it go out of
scope, so H is closed and the underlying gpu mem is freed
3. Process A then tries to do something on G but then finds that its
memory has already been freed
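The three steps above can be sketched without a GPU. The classes below are toy stand-ins, not the real API: `Buffer` plays the role of process A's device allocation, and `IpcHandle.close()` reproduces the problematic mem_free in the handle's destructor.

```python
# Toy model of the IPC lifetime hazard: if closing the imported
# handle frees the allocation, the exporting side is left with a
# dangling pointer. Illustrative stand-ins, not the real API.

class Buffer(object):
    def __init__(self):
        self.freed = False

    def free(self):
        if self.freed:
            raise RuntimeError("double free")
        self.freed = True

    def read(self):
        if self.freed:
            raise RuntimeError("use after free")
        return "data"


class IpcHandle(object):
    """Handle whose close() wrongly frees the underlying memory,
    mimicking the mem_free in ipc_mem_handle's destructor."""
    def __init__(self, buf):
        self.buf = buf

    def close(self):
        self.buf.free()     # the problematic behavior


g = Buffer()                # 1. process A allocates gpuarray G
h = IpcHandle(g)            #    ... and exports handle H to process B
h.close()                   # 2. B lets H go out of scope: memory freed
try:
    g.read()                # 3. A touches G: memory is already gone
except RuntimeError as e:
    print(e)                # use after free
```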

-Alex
Post by Andreas Kloeckner
Post by Alex Park
Hi,
I'm trying to use multiple gpus with mpi and ipc handles instead of the
built-in mpi primitives to p2p communication.
I think I'm not quite understanding how contexts should be managed. For
example, I have two versions of a toy example to try out accessing data
between nodes via ipc handle. Both seem to work, in the sense that process
1 can 'see' the data from process 0, but the first version completes
without any error, while the second version generates the following
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: invalid value
The two versions are attached below. Would appreciate any insight as to
what I'm doing wrong.
Context management in CUDA is a bit of a mess. In particular, resources
in a given context cannot be freed if that context isn't (or can't be
made) the active context in the current thread. Your Context.pop atexit
makes sure that PyCUDA can select contexts when it tries to do clean-up,
which may (because of MPI) run in a different thread than the one that
does the bulk of the work.
That's my best guess as to what's going on. tl;dr: The Context.pop() is
the main difference.
Andreas
--
*Alex Park, PhD *
*Engineer*

*Making machines smarter.*
*Nervana Systems* | nervanasys.com | (617) 283-6951
6440 Lusk Blvd. #D211, San Diego, CA 92121
2483 Old Middlefield Way #203, Mountain View, CA 94043
Andreas Kloeckner
2015-05-11 19:57:35 UTC
Post by Alex Park
Thank you for the response.
As a followup question, I was looking at the underlying code for
ipc_mem_handle, and it seems like when a handle is deleted, it tries to do
a mem_free on the underlying device pointer.
1. Process A allocates gpuarray G and passes IPC handle H to Process B
2. Process B unpacks H, does something with it, then lets it go out of
scope, so H is closed and the underlying gpu mem is freed
3. Process A then tries to do something on G but then finds that its
memory has already been freed
Hum, sounds like that needs to be fixed. I'd be happy to take a pull request.

Andreas
Alex Park
2015-05-11 20:18:12 UTC
Not sure if it's sufficiently tested for other people's usage, but deleting

/src/cpp/cuda.hpp: line 1624

seemed to solve my problems. Logic here being that the memory will be
freed inside the process that allocated the memory when the object from
which the handle was grabbed gets deleted.
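That ownership rule can be sketched with toy classes (not the real API): closing an imported handle only drops the local mapping, and the memory is freed by the process that allocated it, when its own allocation object dies.

```python
# Toy model of the corrected ownership rule: the allocating process
# frees the memory; an imported IPC handle's close() only unmaps it
# locally. Illustrative stand-ins, not the real API.

class Buffer(object):
    def __init__(self):
        self.freed = False

    def free(self):
        self.freed = True


class ImportedHandle(object):
    def __init__(self, buf):
        self.buf = buf
        self.mapped = True

    def close(self):
        self.mapped = False   # unmap locally; do NOT free buf


g = Buffer()                  # allocated by process A
h = ImportedHandle(g)         # imported by process B
h.close()                     # B is done with the handle
assert not g.freed            # A's memory is untouched
g.free()                      # A frees it when its gpuarray is deleted
```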

-Alex
Post by Andreas Kloeckner
Post by Alex Park
Thank you for the response.
As a followup question, I was looking at the underlying code for
ipc_mem_handle, and it seems like when a handle is deleted, it tries to do
a mem_free on the underlying device pointer.
1. Process A allocates gpuarray G and passes IPC handle H to Process B
2. Process B unpacks H, does something with it, then lets it go out of
scope, so H is closed and the underlying gpu mem is freed
3. Process A then tries to do something on G but then finds that its
memory has already been freed
Hum, sounds like that needs to be fixed. I'd be happy to take a pull request.
Andreas
Andreas Kloeckner
2015-05-11 20:56:06 UTC
Post by Alex Park
Not sure if its sufficiently tested for other peoples' usage, but deleting
/src/cpp/cuda.hpp: line 1624
seemed to solve my problems. Logic here being that the memory will be
freed inside the process that allocated the memory when the object from
which the handle was grabbed gets deleted.
Done in git. It would be great if you could give this a whirl and report
back. At any rate, thanks for the suggestion!

Andreas
