Discussion:
[PyCUDA] Question about managing contexts
Alex Park
2015-04-28 22:41:33 UTC
Hi,

I'm trying to use multiple GPUs with MPI, using IPC handles for p2p
communication instead of the built-in MPI primitives.

I think I'm not quite understanding how contexts should be managed. For
example, I have two versions of a toy example to try out accessing data
between nodes via ipc handle. Both seem to work, in the sense that process
1 can 'see' the data from process 0, but the first version completes
without any error, while the second version generates the following error:

PyCUDA WARNING: a clean-up operation failed (dead context maybe?)


cuMemFree failed: invalid value

The two versions are attached below. Would appreciate any insight as to
what I'm doing wrong.

-Alex



Here are the two versions:

*VERSION 1*

from mpi4py import MPI
import numpy as np
import atexit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray


class TestMGPU(object):
    def __init__(self):
        self.mpi_size = MPI.COMM_WORLD.size
        self.mpi_rank = MPI.COMM_WORLD.rank

    def proc(self):
        if self.mpi_rank == 0:
            ctx = drv.Device(self.mpi_rank).make_context()
            self.x_gpu = gpuarray.to_gpu(np.random.rand(8))
            h = drv.mem_get_ipc_handle(self.x_gpu.ptr)
            MPI.COMM_WORLD.send((h, self.x_gpu.shape, self.x_gpu.dtype), dest=1)
            print 'p1 self.x_gpu:', self.x_gpu
            ctx.detach()
        else:
            ctx = drv.Device(self.mpi_rank).make_context()
            h, s, d = MPI.COMM_WORLD.recv(source=0)
            ptr = drv.IPCMemoryHandle(h)
            xt_gpu = gpuarray.GPUArray(s, d, gpudata=ptr)
            print 'xt_gpu: ', xt_gpu
            ctx.detach()


if __name__ == '__main__':
    drv.init()
    atexit.register(MPI.Finalize)
    a = TestMGPU()
    a.proc()




*VERSION 2 (Imports are the same)*

class TestMGPU(object):
    def __init__(self):
        self.mpi_size = MPI.COMM_WORLD.size
        self.mpi_rank = MPI.COMM_WORLD.rank
        self.x_gpu = gpuarray.to_gpu(np.random.rand(8))

    def proc(self):
        if self.mpi_rank == 0:
            h = drv.mem_get_ipc_handle(self.x_gpu.ptr)
            MPI.COMM_WORLD.send((h, self.x_gpu.shape, self.x_gpu.dtype), dest=1)
            print 'p1 self.x_gpu:', self.x_gpu
        else:
            h, s, d = MPI.COMM_WORLD.recv(source=0)
            ptr = drv.IPCMemoryHandle(h)
            xt_gpu = gpuarray.GPUArray(s, d, gpudata=ptr)
            print 'xt_gpu: ', xt_gpu


if __name__ == '__main__':
    drv.init()
    ctx = drv.Device(MPI.COMM_WORLD.rank).make_context()
    atexit.register(ctx.pop)
    atexit.register(MPI.Finalize)
    a = TestMGPU()
    a.proc()
Andreas Kloeckner
2015-05-01 16:34:38 UTC
Post by Alex Park
Hi,
I'm trying to use multiple GPUs with MPI, using IPC handles for p2p
communication instead of the built-in MPI primitives.
I think I'm not quite understanding how contexts should be managed. For
example, I have two versions of a toy example to try out accessing data
between nodes via ipc handle. Both seem to work, in the sense that process
1 can 'see' the data from process 0, but the first version completes
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: invalid value
The two versions are attached below. Would appreciate any insight as to
what I'm doing wrong.
Context management in CUDA is a bit of a mess. In particular, resources
in a given context cannot be freed if that context isn't (or can't be
made) the active context in the current thread. Your Context.pop atexit
makes sure that PyCUDA can select contexts when it tries to do clean-up,
which may (because of MPI) run in a different thread than the one that
does the bulk of the work.

That's my best guess as to what's going on. tl;dr: The Context.pop() is
the main difference.
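The rule described above can be seen in a GPU-free toy model. The classes below are illustrative stand-ins, not the PyCUDA or CUDA driver API: a per-thread context stack, and an allocation that can only be freed while its owning context is current.

```python
# Toy model of CUDA's context-stack rule: a resource can only be
# freed while its owning context is current on the calling thread.
# Illustrative stand-ins, not the PyCUDA API.

class Context(object):
    _stack = []                      # per-thread stack of active contexts

    def make_current(self):
        Context._stack.append(self)

    def pop(self):
        assert Context._stack and Context._stack[-1] is self
        Context._stack.pop()

    @classmethod
    def current(cls):
        return cls._stack[-1] if cls._stack else None


class Allocation(object):
    def __init__(self, ctx):
        self.ctx = ctx
        self.freed = False

    def free(self):
        # Mirrors cuMemFree failing with "invalid value" when the
        # owning context is not (and cannot be made) current.
        if Context.current() is not self.ctx:
            raise RuntimeError("clean-up failed: dead context maybe?")
        self.freed = True


ctx = Context()
ctx.make_current()
buf = Allocation(ctx)

ctx.pop()                # e.g. via atexit.register(ctx.pop)
try:
    buf.free()           # cleanup now runs with no current context
except RuntimeError as e:
    print(e)             # clean-up failed: dead context maybe?
```

In this model, popping the context before cleanup runs is exactly what makes the free fail, which matches the symptom in Version 2.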

Andreas
Alex Park
2015-05-11 19:44:16 UTC
Hi,

Thank you for the response.

As a followup question, I was looking at the underlying code for
ipc_mem_handle, and it seems like when a handle is deleted, it tries to do
a mem_free on the underlying device pointer.

So could there not be a situation as follows:

1. Process A allocates gpuarray G and passes IPC handle H to Process B
2. Process B unpacks H, does something with it, then lets it go out of
scope, so H is closed and the underlying gpu mem is freed
3. Process A then tries to do something on G but then finds that its
memory has already been freed
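The three steps above can be sketched without a GPU. The classes below are toy stand-ins, not the real API: `Buffer` plays the role of process A's device allocation, and `IpcHandle.close()` reproduces the problematic mem_free in the handle's destructor.

```python
# Toy model of the IPC lifetime hazard: if closing the imported
# handle frees the allocation, the exporting side is left with a
# dangling pointer. Illustrative stand-ins, not the real API.

class Buffer(object):
    def __init__(self):
        self.freed = False

    def free(self):
        if self.freed:
            raise RuntimeError("double free")
        self.freed = True

    def read(self):
        if self.freed:
            raise RuntimeError("use after free")
        return "data"


class IpcHandle(object):
    """Handle whose close() wrongly frees the underlying memory,
    mimicking the mem_free in ipc_mem_handle's destructor."""
    def __init__(self, buf):
        self.buf = buf

    def close(self):
        self.buf.free()     # the problematic behavior


g = Buffer()                # 1. process A allocates gpuarray G
h = IpcHandle(g)            #    ... and exports handle H to process B
h.close()                   # 2. B lets H go out of scope: memory freed
try:
    g.read()                # 3. A touches G: memory is already gone
except RuntimeError as e:
    print(e)                # use after free
```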

-Alex
Post by Andreas Kloeckner
Post by Alex Park
Hi,
I'm trying to use multiple gpus with mpi and ipc handles instead of the
built-in mpi primitives to p2p communication.
I think I'm not quite understanding how contexts should be managed. For
example, I have two versions of a toy example to try out accessing data
between nodes via ipc handle. Both seem to work, in the sense that process
1 can 'see' the data from process 0, but the first version completes
without any error, while the second version generates the following
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: invalid value
The two versions are attached below. Would appreciate any insight as to
what I'm doing wrong.
Context management in CUDA is a bit of a mess. In particular, resources
in a given context cannot be freed if that context isn't (or can't be
made) the active context in the current thread. Your Context.pop atexit
makes sure that PyCUDA can select contexts when it tries to do clean-up,
which may (because of MPI) run in a different thread than the one that
does the bulk of the work.
That's my best guess as to what's going on. tl;dr: The Context.pop() is
the main difference.
Andreas
--
*Alex Park, PhD *
*Engineer*

*Making machines smarter.*
*Nervana Systems* | nervanasys.com | (617) 283-6951
6440 Lusk Blvd. #D211, San Diego, CA 92121
2483 Old Middlefield Way #203, Mountain View, CA 94043
Andreas Kloeckner
2015-05-11 19:57:35 UTC
Post by Alex Park
Thank you for the response.
As a followup question, I was looking at the underlying code for
ipc_mem_handle, and it seems like when a handle is deleted, it tries to do
a mem_free on the underlying device pointer.
1. Process A allocates gpuarray G and passes IPC handle H to Process B
2. Process B unpacks H, does something with it, then lets it go out of
scope, so H is closed and the underlying gpu mem is freed
3. Process A then tries to do something on G but then finds that its
memory has already been freed
Hum, sounds like that needs to be fixed. I'd be happy to take a pull request.

Andreas
Alex Park
2015-05-11 20:18:12 UTC
Not sure if it's sufficiently tested for other people's usage, but deleting

/src/cpp/cuda.hpp: line 1624

seemed to solve my problems. Logic here being that the memory will be
freed inside the process that allocated the memory when the object from
which the handle was grabbed gets deleted.
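That ownership rule can be sketched with toy classes (not the real API): closing an imported handle only drops the local mapping, and the memory is freed by the process that allocated it, when its own allocation object dies.

```python
# Toy model of the corrected ownership rule: the allocating process
# frees the memory; an imported IPC handle's close() only unmaps it
# locally. Illustrative stand-ins, not the real API.

class Buffer(object):
    def __init__(self):
        self.freed = False

    def free(self):
        self.freed = True


class ImportedHandle(object):
    def __init__(self, buf):
        self.buf = buf
        self.mapped = True

    def close(self):
        self.mapped = False   # unmap locally; do NOT free buf


g = Buffer()                  # allocated by process A
h = ImportedHandle(g)         # imported by process B
h.close()                     # B is done with the handle
assert not g.freed            # A's memory is untouched
g.free()                      # A frees it when its gpuarray is deleted
```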

-Alex
Post by Andreas Kloeckner
Post by Alex Park
Thank you for the response.
As a followup question, I was looking at the underlying code for
ipc_mem_handle, and it seems like when a handle is deleted, it tries to do
a mem_free on the underlying device pointer.
1. Process A allocates gpuarray G and passes IPC handle H to Process B
2. Process B unpacks H, does something with it, then lets it go out of
scope, so H is closed and the underlying gpu mem is freed
3. Process A then tries to do something on G but then finds that its
memory has already been freed
Hum, sounds like that needs to be fixed. I'd be happy to take a pull request.
Andreas
Andreas Kloeckner
2015-05-11 20:56:06 UTC
Post by Alex Park
Not sure if its sufficiently tested for other peoples' usage, but deleting
/src/cpp/cuda.hpp: line 1624
seemed to solve my problems. Logic here being that the memory will be
freed inside the process that allocated the memory when the object from
which the handle was grabbed gets deleted.
Done in git. It would be great if you could give this a whirl and report
back. At any rate, thanks for the suggestion!

Andreas
