@Lev, thanks for the tip; I will look into it.
In the meantime, I am running into some speed issues. I notice that training
slows down progressively, by roughly 40% within just 7000 updates: it starts
at about 2.6 secs/mini-batch (average speed), but after 7000 mini-batches the
time increases to 3.7 secs/mini-batch.
I suspect that I may not be sending the host memory pointers but the actual
arrays, serialized by zmq's send_pyobj (see the code below). Could someone
confirm whether I am doing this correctly? Should I just be sending/receiving
host memory pointers?
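To make the question concrete, this is the kind of raw-buffer send/receive I
have in mind, adapted from the pyzmq serialization example (untested in my
setup; send_array/recv_array are just illustrative helpers):

import numpy
import zmq

def send_array(socket, A, flags=0, copy=True, track=False):
    # ship dtype/shape as a small JSON header, then the raw array buffer
    md = dict(dtype=str(A.dtype), shape=A.shape)
    socket.send_json(md, flags | zmq.SNDMORE)
    return socket.send(A, flags, copy=copy, track=track)

def recv_array(socket, flags=0, copy=True, track=False):
    # rebuild a numpy view on the received bytes; no pickling involved
    md = socket.recv_json(flags=flags)
    msg = socket.recv(flags=flags, copy=copy, track=track)
    A = numpy.frombuffer(memoryview(msg), dtype=md['dtype'])
    return A.reshape(md['shape'])

Since both processes already know each parameter's dtype and shape, the JSON
header could probably be dropped altogether.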
Also, is it correct that the host memory pointers don't change throughout
training? I call pagelocked_zeros_like once and then just keep copying into
the same memory through memcpy_dtoh_async. Given that, I thought I wouldn't
have to send/receive the tr_params_host_list/tr_params_host_other_list every
time; however, that didn't work.
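My current understanding of why it didn't work (please correct me if I am
wrong): send_pyobj pickles the arrays, so every recv_pyobj builds brand-new
numpy arrays on the receiving side, and the pre-allocated pinned buffers are
never refilled on their own. If the receiving process also keeps pagelocked
buffers of its own (not shown in my snippet below), reusing them would look
roughly like this:

received = sock.recv_pyobj()
for tr_param_host_other, new_arr in zip(tr_params_host_other_list, received):
    # copy the freshly unpickled data into the persistent pinned buffer
    numpy.copyto(tr_param_host_other, new_arr)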
Here are the relevant snippets from my code:
# We need to create gpuarrays to receive values from the other gpu
# (done once before training)
for tr_param in itemlist(tparams):
    # create empty gpuarrays to receive from other gpu
    tr_param_other = theano.shared(tr_param.get_value() * 0.)
    tr_param_ga_other = theano.misc.pycuda_utils.to_gpuarray(tr_param_other.container.value)
    tr_params_other_list.append(tr_param_other)
    tr_params_ga_other_list.append(tr_param_ga_other)

    # gpuarrays and pinned host buffers for current params
    tr_param_ga = theano.misc.pycuda_utils.to_gpuarray(tr_param.container.value)
    tr_param_host = drv.pagelocked_zeros_like(tr_param_ga)
    tr_params_ga_list.append(tr_param_ga)
    tr_params_host_list.append(tr_param_host)
# Now during training, we need to copy to host and then exchange the
# params in host mem
for x, y in train:
    mb_start = time.time()
    ...
    f_cost = f_update(x, y)
    if numpy.mod(uidx, syncFreq) == 0:
        # copy from device to host memory and pass host params list
        d2h_start = time.time()
        for tr_param_host, tr_param_ga in zip(tr_params_host_list,
                                              tr_params_ga_list):
            drv.memcpy_dtoh_async(tr_param_host, tr_param_ga.ptr)
        sock.send_pyobj(tr_params_host_list)
        d2h = time.time() - d2h_start
        d2h_tot += d2h

        h2d_start = time.time()
        tr_params_host_other_list = sock.recv_pyobj()  # receive host params list
        for tr_param_ga_other, tr_param_host_other in zip(tr_params_ga_other_list,
                                                          tr_params_host_other_list):
            drv.memcpy_htod_async(tr_param_ga_other.ptr, tr_param_host_other)
        h2d = time.time() - h2d_start
        h2d_tot += h2d

        f_avg_params(x, y)  # average the params in two gpus
    mb_tot += time.time() - mb_start
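One detail I am not sure about (an assumption on my part, not something I
have verified): memcpy_dtoh_async returns before the copy has finished, so
the host buffers probably need a synchronize before they are pickled and
sent. With the copies on the default stream, that would look like:

for tr_param_host, tr_param_ga in zip(tr_params_host_list, tr_params_ga_list):
    drv.memcpy_dtoh_async(tr_param_host, tr_param_ga.ptr)
drv.Context.synchronize()  # make sure the copies have landed in pinned memory
sock.send_pyobj(tr_params_host_list)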
The other possibility is that send_pyobj() and recv_pyobj() are blocking and
the waiting causes the slowdown. But the d2h/h2d times increase only
marginally, for example from 0.1 secs per mini-batch at the beginning to
0.24 secs after 7k mini-batches, so that clearly doesn't explain a slowdown
of more than 1 sec. In any case, I have now added zmq.NOBLOCK to
send_pyobj(); I will have to see if it helps.
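My understanding of what NOBLOCK actually changes (a sketch, untested): it
does not make the send asynchronous yet still delivered; it just raises
zmq.Again instead of waiting, so that case has to be handled or the sync
round skipped:

try:
    sock.send_pyobj(tr_params_host_list, flags=zmq.NOBLOCK)
except zmq.Again:
    # the peer could not take the message right now; nothing was sent,
    # so this sync round is simply skipped
    pass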
Thanks a lot for any help on these.
Best
- Baskaran
Thanks Andreas for the hint. Actually, what I am trying is a little bit more
complex than that. I have two Python processes running on two GPUs. In a
simpler setting, I have an array x in gpu0's Python process that needs to be
transferred to gpu1's process, and vice versa:
* alloc-host-memory
* memcpy from device to host (gpu0 to host; gpu1 to host)
* send/receive objects in host memory to the Python process in the other gpu
* memcpy from host to device within respective gpu
The solution and output from a sample run follow. Now I wonder if it is
possible to improve this further. One possibility is whether the
device-to-host copy can be eliminated, because I need to transfer several
Theano tensors between multiple (up to 4) gpus and I need to do this quite
frequently (say every nth mini-batch) during training.
Note: Not all gpus are P2P capable and so memcpy_peer wouldn't work.
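As a quick check of which device pairs are P2P capable at all, something
along these lines should work with PyCUDA (just a sketch, untested):

import pycuda.driver as drv

drv.init()
ngpus = drv.Device.count()
for i in range(ngpus):
    for j in range(ngpus):
        if i != j:
            ok = drv.Device(i).can_access_peer(drv.Device(j))
            print("gpu%d -> gpu%d peer access: %s" % (i, j, ok))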
If you have access to a recent release of OpenMPI or MVAPICH2 built with
CUDA support, you may wish to try using mpi4py for transferring data between
GPUArrays in different processes; you can pass the MPI wrapper functions the
GPUArray pointers and let the underlying MPI implementation determine when
to take advantage of P2P.
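A rough sketch of the idea, under some assumptions of my own (a CUDA-aware
OpenMPI/MVAPICH2, an mpi4py/PyCUDA combination recent enough that a GPUArray
is accepted directly as an MPI buffer via __cuda_array_interface__, and one
GPU per MPI rank):

from mpi4py import MPI
import numpy as np
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

drv.init()
ctx = drv.Device(rank).make_context()   # one GPU per rank

x = gpuarray.to_gpu(np.full(1024, float(rank), dtype=np.float32))
y = gpuarray.empty_like(x)              # buffer for the other rank's params

other = 1 - rank
# Sendrecv keeps both ranks from blocking in Send at the same time; a
# CUDA-aware MPI can take the device buffers directly, with no host staging.
comm.Sendrecv([x, MPI.FLOAT], dest=other, recvbuf=[y, MPI.FLOAT], source=other)

ctx.pop()

Run with something like mpiexec -n 2 python exchange.py; older mpi4py
versions would instead need the raw device pointer wrapped by hand.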
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/