Discussion:
[PyCUDA] Two-step gpu to gpu transfer
Baskaran Sankaran
2015-11-11 06:46:38 UTC
Hi all,

I am looking for a solution for exchanging some tensors between two GPUs
that do not have P2P enabled. Assuming two GPUs on the same node, I guess I
have to do it in two steps: first copy from the first GPU (gpu-0) to host
memory, and then copy from host memory to the other GPU (gpu-1). However,
it is not exactly clear to me how to go about this.

Any help is appreciated.

Thanks
- Baskaran
Andreas Kloeckner
2015-11-11 07:40:41 UTC
Post by Baskaran Sankaran
Hi all,
I am looking for a solution for exchanging some tensors between two gpus,
that do not have P2P enabled. Assuming two GPUs on the same node, I guess I
have to do it in two steps; first copy to host memory from GPU (gpu-0) and
then copy from host memory to the other GPU (gpu-1). However it is not
exactly clear to me as to how I can go about this.
(1) Allocate memory on host
(2) memcpy(host_mem, gpu0_mem)
(3) memcpy(gpu1_mem, host_mem)
(4) (Optionally) free host_mem
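
For concreteness, a minimal PyCUDA sketch of those four steps in a single
process (two contexts; buffer sizes and device numbers are illustrative, and
the source allocation is assumed to already hold your data):

import numpy as np
import pycuda.driver as drv

drv.init()
ctx0 = drv.Device(0).make_context()
a_gpu0 = drv.mem_alloc(1024 * 4)                   # source buffer on gpu-0
host_buf = drv.pagelocked_empty(1024, np.float32)  # (1) page-locked host staging buffer
drv.memcpy_dtoh(host_buf, a_gpu0)                  # (2) gpu-0 -> host
ctx0.pop()

ctx1 = drv.Device(1).make_context()
b_gpu1 = drv.mem_alloc(host_buf.nbytes)            # destination buffer on gpu-1
drv.memcpy_htod(b_gpu1, host_buf)                  # (3) host -> gpu-1
ctx1.pop()
# (4) the staging buffer is released when host_buf is garbage collected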

Not sure what you're asking...

Andreas
Baskaran Sankaran
2015-11-17 20:08:10 UTC
@Lev, thanks for the tip; I will look into it.

In the meanwhile, I am running into some speed issues. I notice that it
slows down progressively, by almost half, within just 7000 updates. It
starts at about 2.6 sec/mini-batch (average speed), but after 7000
mini-batches the time increases to 3.7 sec/mini-batch.

I suspect that I may not be sending the host memory pointers but the actual
arrays, serialized by zmq's send_pyobj (see the code below). Could someone
confirm whether I am doing this correctly? Should I just be
sending/receiving host memory pointers?

Also, is it correct that the host memory pointers don't change throughout
training? I call pagelocked_zeros_like once and then just copy into the same
memory through memcpy_dtoh_async. In that case, I thought I wouldn't have to
send/receive tr_params_host_list and tr_params_host_other_list every time;
however, that didn't work.

Here are the relevant snippets from my code:

# We need to create gpuarrays to receive values from the other gpu;
# done once before training
for tr_param in itemlist(tparams):
    # create empty gpuarrays to receive from other gpu
    tr_param_other = theano.shared(tr_param.get_value() * 0.)
    tr_param_ga_other = theano.misc.pycuda_utils.to_gpuarray(tr_param_other.container.value)
    tr_params_other_list.append(tr_param_other)
    tr_params_ga_other_list.append(tr_param_ga_other)

    # gpuarrays for current params
    tr_param_ga = theano.misc.pycuda_utils.to_gpuarray(tr_param.container.value)
    tr_param_host = drv.pagelocked_zeros_like(tr_param_ga)
    tr_params_ga_list.append(tr_param_ga)
    tr_params_host_list.append(tr_param_host)

# Now during training, we need to copy to host and then exchange the
# params in host mem
for x, y in train:
    mb_start = time.time()
    ...
    f_cost = f_update(x, y)

    if numpy.mod(uidx, syncFreq) == 0:
        # copy from device to host memory and pass host params list
        d2h_start = time.time()
        for tr_param_host, tr_param_ga in zip(tr_params_host_list, tr_params_ga_list):
            drv.memcpy_dtoh_async(tr_param_host, tr_param_ga.ptr)

        sock.send_pyobj(tr_params_host_list)
        d2h = time.time() - d2h_start
        d2h_tot += d2h

        h2d_start = time.time()
        tr_params_host_other_list = sock.recv_pyobj()  # receive host params list

        for tr_param_ga_other, tr_param_host_other in zip(tr_params_ga_other_list, tr_params_host_other_list):
            drv.memcpy_htod_async(tr_param_ga_other.ptr, tr_param_host_other)
        h2d = time.time() - h2d_start
        h2d_tot += h2d
        f_avg_params(x, y)  # average the params in two gpus

    mb_tot += time.time() - mb_start
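
The zmq socket sock used above is created elsewhere; as a rough sketch of
the kind of setup assumed here (PAIR sockets; the endpoint and the role flag
distinguishing the two worker processes are just placeholders):

import sys
import zmq

is_master = (len(sys.argv) > 1 and sys.argv[1] == "master")  # placeholder role flag
zmq_ctx = zmq.Context()
sock = zmq_ctx.socket(zmq.PAIR)
if is_master:                            # e.g. the gpu-0 process binds ...
    sock.bind("tcp://127.0.0.1:5555")    # placeholder endpoint
else:                                    # ... and the gpu-1 process connects
    sock.connect("tcp://127.0.0.1:5555")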

The other possibility is that send_pyobj() and recv_pyobj() are blocking,
causing the slowdown while they wait. But the d2h/h2d times increase only
marginally, for example from 0.1 sec per mini-batch in the beginning to 0.24
sec after 7k mini-batches, so that clearly doesn't explain more than 1 sec of
slowdown. In any case, I have now added zmq.NOBLOCK to send_pyobj(); I will
have to see if it helps.

Thanks a lot for any help on these.

Best
- Baskaran
Thanks Andreas for the hint. Actually, what I am trying is a little bit more
complex than that. I have two Python processes running on two GPUs. In a
simpler setting, I have an array x in gpu0's Python process to be transferred
to gpu1's process, and vice versa.
* alloc-host-memory
* memcpy from device to host (gpu0 to host; gpu1 to host)
* send/receive objects in host memory to the Python process in the other gpu
* memcpy from host to device within respective gpu
The solution and output from a sample run follow. Now, I wonder if it is
possible to improve this further. One possibility is eliminating the
device-to-host copy, because I need to transfer several theano tensors
between multiple (up to 4) gpus, and I need to do this quite frequently
(say every nth mini-batch) during training.
Note: Not all gpus are P2P capable, so memcpy_peer wouldn't work.
If you have access to a recent release of OpenMPI or MVAPICH2 built with
CUDA support, you may wish to try using mpi4py for transferring data between
GPUArrays in different processes; you can pass the MPI wrapper functions the
GPUArray pointers and let the underlying MPI implementation determine when to
take advantage of P2P.
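
As a rough, untested sketch of that approach (assuming a CUDA-aware MPI
build, an mpi4py recent enough to provide MPI.memory.fromaddress, one rank
per GPU, and illustrative array sizes/dtypes):

from mpi4py import MPI
import numpy as np
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

drv.init()
ctx = drv.Device(rank).make_context()    # one MPI rank per GPU

x_gpu = gpuarray.to_gpu(np.arange(8, dtype=np.float64) * (rank + 1))
y_gpu = gpuarray.empty_like(x_gpu)

# wrap the raw device pointers so the CUDA-aware MPI library can read/write
# device memory directly, staging through the host itself if it has to
sendbuf = MPI.memory.fromaddress(int(x_gpu.gpudata), x_gpu.nbytes)
recvbuf = MPI.memory.fromaddress(int(y_gpu.gpudata), y_gpu.nbytes)

other = 1 - rank                         # assumes exactly two ranks
comm.Sendrecv([sendbuf, MPI.DOUBLE], dest=other,
              recvbuf=[recvbuf, MPI.DOUBLE], source=other)

print(rank, y_gpu.get())
ctx.pop()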
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/
Lev Givon
2015-11-17 22:21:01 UTC
Post by Baskaran Sankaran
@Lev, thanks for the tip; I will look into it.
In the meanwhile, I am running into some speed issues. I notice that it
slows down progressively almost by a factor of 0.5, in just 7000 updates.
It starts with about 2.6 sec/ mini-batch (average speed), but after 7000
mini-batches, the time increases to 3.7 secs/ mini-batch.
I suspect that I may not be sending the host memory pointers but the actual
arrays, serialized by zmq's send_pyobj (see below in the code). Could
someone confirm whether I am doing it correctly? Should I just be sending/
receiving host memory pointers?
You are transmitting the array contents. If you use IPC to send the GPU array
pointers to both processes [1], you should be able to perform a device-to-device
copy between the two memory locations even if you can't use P2P [2] (assuming
that UVA is supported on both devices).

[1] https://gist.github.com/e554b3985e196b07f93b
[2] https://gist.github.com/3078644
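
Roughly, the idea looks like this (a sketch only; the endpoint, device
numbering, and role flag are illustrative, and the exporting side must keep
its allocation alive while the peer is using it):

# run with "export" in the gpu-0 process and "import" in the gpu-1 process
import sys
import numpy as np
import zmq
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

role = sys.argv[1]                       # "export" or "import"
drv.init()
ctx = drv.Device(0 if role == "export" else 1).make_context()

sock = zmq.Context().socket(zmq.PAIR)
if role == "export":
    sock.bind("tcp://127.0.0.1:5555")
    x_gpu = gpuarray.to_gpu(np.arange(8, dtype=np.float32))
    handle = drv.mem_get_ipc_handle(x_gpu.gpudata)    # opaque, picklable handle
    sock.send_pyobj((handle, x_gpu.shape, x_gpu.dtype))
    sock.recv()                                       # block until the peer is done
else:
    sock.connect("tcp://127.0.0.1:5555")
    handle, shape, dtype = sock.recv_pyobj()
    src_ptr = drv.IPCMemoryHandle(handle)             # map the remote allocation
    x_local = gpuarray.empty(shape, dtype)
    drv.memcpy_dtod(x_local.gpudata, src_ptr, x_local.nbytes)
    print(x_local.get())
    sock.send(b"done")

ctx.pop()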
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/
Baskaran Sankaran
2015-11-17 22:56:56 UTC
No, UVA is not enabled on them; I already tried memcpy_dtod.

In send_pyobj, the array would just be passed by reference though, right?

Thanks
Post by Lev Givon
Post by Baskaran Sankaran
@Lev, thanks for the tip; I will look into it.
In the meanwhile, I am running into some speed issues. I notice that it
slows down progressively almost by a factor of 0.5, in just 7000 updates.
It starts with about 2.6 sec/ mini-batch (average speed), but after 7000
mini-batches, the time increases to 3.7 secs/ mini-batch.
I suspect that I may not be sending the host memory pointers but the actual
arrays, serialized by zmq's send_pyobj (see below in the code). Could
someone confirm whether I am doing it correctly? Should I just be
sending/receiving host memory pointers?
You are transmitting the array contents. If you use IPC to send the GPU array
pointers to both processes [1], you should be able to perform a device-to-device
copy between the two memory locations even if you can't use P2P [2] (assuming
that UVA is supported on both devices).
[1] https://gist.github.com/e554b3985e196b07f93b
[2] https://gist.github.com/3078644
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/
Lev Givon
2015-11-17 23:12:03 UTC
Post by Baskaran Sankaran
No UVA is not enabled on them; I already tried memcpy_dtod.
In send_pyobj the array would just be passed as reference though right?
Since send_pyobj() passes its argument directly to pickle.dumps(), it
will serialize the array contents just as it would with a normally allocated numpy array.
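
To make that concrete, and to show the buffer-protocol alternative that skips
pickling entirely (a small self-contained sketch; the inproc endpoint is just
for illustration):

import pickle
import numpy as np
import zmq

a = np.zeros((4, 4), dtype=np.float32)   # stands in for a page-locked host array

# what send_pyobj() effectively puts on the wire: a pickle of the whole array
print(len(pickle.dumps(a)))              # grows with the array size

# alternative without pickling: send the raw buffer and rebuild it on the
# receiving side (shape/dtype must be agreed on out of band)
zctx = zmq.Context()
tx, rx = zctx.socket(zmq.PAIR), zctx.socket(zmq.PAIR)
tx.bind("inproc://demo"); rx.connect("inproc://demo")
tx.send(a, copy=False)                   # numpy arrays expose the buffer interface
b = np.frombuffer(rx.recv(), dtype=a.dtype).reshape(a.shape)
assert (a == b).all()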
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/
Baskaran Sankaran
2015-11-17 23:15:57 UTC
Oh right, that's a good catch. If I can get the host pointer, then I can
just use send. Is it possible to access the host memory pointer in PyCUDA,
then?
Post by Lev Givon
Post by Baskaran Sankaran
No UVA is not enabled on them; I already tried memcpy_dtod.
In send_pyobj the array would just be passed as reference though right?
Since send_pyobj() passes its argument directly to pickle.dumps(), it
will serialize the array contents just as it would with a normally allocated numpy array.
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/
Lev Givon
2015-11-17 23:43:03 UTC
Post by Baskaran Sankaran
Post by Lev Givon
Post by Baskaran Sankaran
No UVA is not enabled on them; I already tried memcpy_dtod.
In send_pyobj the array would just be passed as reference though right?
Since send_pyobj() passes its argument directly to pickle.dumps(), it
will serialize the array contents just as it would with a normally
allocated numpy array.
Oh right, that's a good catch. If I can get the host pointer, then I can
just use send. Is it possible to access host mem pointer in pycuda then?
Well, you can get a pointer to the array's host memory via
array_name.ctypes.data, but you can't use that pointer by default in a different
Python process because the memory isn't shared. Not sure that PyCUDA currently
provides anything to facilitate sharing of host memory between processes (the
IPC mechanism I mentioned only applies to GPU pointers).
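
For illustration, getting the raw address of a page-locked buffer is easy,
but the number is only meaningful inside the process that owns the
allocation (sizes here are arbitrary):

import numpy as np
import pycuda.autoinit          # creates a CUDA context, needed for pinned allocation
import pycuda.driver as drv

buf = drv.pagelocked_zeros(16, np.float32)
print(hex(buf.ctypes.data))     # host address of the pinned buffer; valid
                                # only in this process's address space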
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/