[PyCUDA] Pycuda and RDMA transfer
Andreas Kloeckner
2015-10-31 03:16:23 UTC
Apologies for emailing you directly. I did subscribe to the PyCUDA mailing
list, but my request has not been approved yet.
There is no approval process. It's likely that the subscription request
confirmation went to your spam folder. I've CC'd the list; maybe someone on
there knows.
I have recently been using PyCUDA to parallelize Theano across two GPUs,
and I must say it has been really useful. For example, I was able to
achieve a 1.85x speedup of our neural MT system with PyCUDA over the
single-GPU version.
I'm happy to hear you're finding the software useful.
I am now trying to see whether I can parallelize it across more GPUs.
However, the GPUs in this case are connected through socket-level links
rather than through PCIe switches. Here is the topology of a typical node
in the GPU cluster.
[xc181] ~ $ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
GPU0    X       PIX     SOC     SOC     PHB     0-7
GPU1    PIX     X       SOC     SOC     PHB     0-7
GPU2    SOC     SOC     X       PIX     SOC     8-15
GPU3    SOC     SOC     PIX     X       SOC     8-15
mlx4_0  PHB     PHB     SOC     SOC     X
X = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch
So, I wonder whether the PyCUDA peer-to-peer copy (memcpy_peer) will work
across these socket-level links. I am unable to test this on the cluster
here, because GPUDirect is enabled only between pairs of GPUs (0-1 and
2-3). However, from the NVIDIA website it seems that GPUDirect v3 supports
RDMA, which allows these kinds of transfers (across two nodes or across
socket-linked GPUs).
https://developer.nvidia.com/gpudirect
http://devblogs.nvidia.com/parallelforall/benchmarking-gpudirect-rdma-on-modern-server-platforms/
I must admit that I am not very familiar with the differences between these
technologies, so my understanding could be incorrect.
So, my question is: will PyCUDA's memcpy_peer support RDMA-style GPUDirect
transfers? Any information would be greatly appreciated.
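For concreteness, the kind of check I have in mind looks roughly like this
(just a sketch; I am assuming PyCUDA's Device.can_access_peer wrapper here,
and I have not been able to run it across the socket link myself):

import pycuda.driver as cuda

cuda.init()
ndev = cuda.Device.count()

# Ask the driver which device pairs report direct peer access.
# On the topology above, the PIX pairs (0-1 and 2-3) should say True;
# what the SOC (QPI) pairs report is exactly what I am unsure about.
for i in range(ndev):
    for j in range(ndev):
        if i != j:
            di, dj = cuda.Device(i), cuda.Device(j)
            print("GPU%d -> GPU%d peer access: %s"
                  % (i, j, di.can_access_peer(dj)))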
Sorry, I haven't used this technology myself, so I simply don't
know. What I can say is that if any amount of control over this is
available through the CUDA API, that same level of control should also
be achievable through PyCUDA.
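For what it's worth, the PyCUDA-level calls for a cross-device copy look
roughly like the sketch below, written for devices 0 and 1. This is only a
sketch, not something I have run on a multi-socket box; as I understand it,
the driver decides whether such a copy goes directly between the devices or
is staged through host memory.

import numpy as np
import pycuda.driver as cuda

cuda.init()

a = np.arange(1 << 20, dtype=np.float32)

# One context per device; make_context() leaves the new context
# current, so pop each one after its allocation.
ctx0 = cuda.Device(0).make_context()
src = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(src, a)
ctx0.pop()

ctx1 = cuda.Device(1).make_context()
dst = cuda.mem_alloc(a.nbytes)
ctx1.pop()

# Cross-device copy; memcpy_peer (which wraps cuMemcpyPeer) takes the
# source and destination contexts explicitly.
ctx0.push()
cuda.memcpy_peer(dst, src, a.nbytes, dest_context=ctx1, src_context=ctx0)
cuda.Context.synchronize()
ctx0.pop()

# Read the data back from device 1 to check the round trip, then
# free each allocation with its owning context current.
ctx1.push()
result = np.empty_like(a)
cuda.memcpy_dtoh(result, dst)
dst.free()
ctx1.pop()
ctx1.detach()

ctx0.push()
src.free()
ctx0.pop()
ctx0.detach()

assert np.allclose(result, a)

Whether a copy like this actually moves data peer-to-peer over the socket
link, or whether GPUDirect RDMA comes into play at all, is exactly the part
I cannot answer.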

Maybe someone on the list has an idea.

Hope that helps at least a bit,
Andreas
