[PyCUDA] Frequent crashes after using pycuda

Discussion:

Craig Stringham

2014-12-10 20:03:33 UTC

Hi,
I keep crashing a server (kernel panic) when using pycuda within ipython.
It doesn't seem to matter which kernel I run, and it only crashes several
minutes after I have run a kernel while keeping the ipython shell open. I
am using the latest git version of pycuda with the latest release of CUDA
(6.5) on a K40m. Below is the last portion of the crash dmesg.
Have any of you had a similar issue? I will try putting the GPU in
persistence mode since it appears it is falling off the bus... maybe that
will fix it.
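
For reference, persistence mode is toggled with `nvidia-smi -pm`. The sketch below wraps that call in Python; the helper names `persistence_mode_command` and `set_persistence_mode` are hypothetical, and it assumes `nvidia-smi` is on the PATH and that the script runs with sufficient privileges (normally root).

```python
import subprocess


def persistence_mode_command(enable=True, gpu_index=None):
    """Build the nvidia-smi argument list that switches persistence mode."""
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]  # restrict the change to one GPU
    cmd += ["-pm", "1" if enable else "0"]
    return cmd


def set_persistence_mode(enable=True, gpu_index=None):
    """Run the command; raises CalledProcessError if nvidia-smi reports failure."""
    return subprocess.run(persistence_mode_command(enable, gpu_index), check=True)
```

Calling `set_persistence_mode(True)` would apply the setting to all GPUs; passing `gpu_index=0` limits it to the first device.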

...
<7>nvidia 0000:09:00.0: irq 131 for MSI/MSI-X
<6>IPT_DBLK IN=eth1 OUT= MAC=00:1e:64:98:6c:53:00:1b:8f:d9:dc:04:08:00 SRC=198.20.69.74 DST=137.79.160.131 LEN=40 TOS=0x00 PREC=0x00 TTL=116 ID=60947 PROTO=TCP SPT=20849 DPT=80 WINDOW=9963 RES=0x00 SYN URGP=0
<7>nvidia 0000:09:00.0: irq 131 for MSI/MSI-X
<7>nvidia 0000:09:00.0: irq 131 for MSI/MSI-X
<0>{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
<0>{1}[Hardware Error]: event severity: fatal
<0>{1}[Hardware Error]: Error 0, type: fatal
<0>{1}[Hardware Error]: section_type: PCIe error
<0>{1}[Hardware Error]: port_type: 4, root port
<0>{1}[Hardware Error]: version: 1.16
<0>{1}[Hardware Error]: command: 0x4010, status: 0x0547
<0>{1}[Hardware Error]: device_id: 0000:00:03.2
<0>{1}[Hardware Error]: slot: 0
<0>{1}[Hardware Error]: secondary_bus: 0x09
<0>{1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x0e0a
<0>{1}[Hardware Error]: class_code: 000406
<0>{1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
<0>Kernel panic - not syncing: Fatal hardware error!
<4>Pid: 0, comm: swapper Tainted: P --------------- 2.6.32-504.1.3.el6.x86_64 #1
<4>Call Trace:
<4> <NMI> [<ffffffff815292dc>] ? panic+0xa7/0x16f
<4> [<ffffffff8131c191>] ? ghes_notify_nmi+0x241/0x250
<4> [<ffffffff81530095>] ? notifier_call_chain+0x55/0x80
<4> [<ffffffff815300fa>] ? atomic_notifier_call_chain+0x1a/0x20
<4> [<ffffffff810a4eae>] ? notify_die+0x2e/0x30
<4> [<ffffffff8152dd89>] ? do_nmi+0x1e9/0x340
<4> [<ffffffff8152d620>] ? nmi+0x20/0x30
<4> [<ffffffff812ea5e1>] ? intel_idle+0xb1/0x170
<4> <<EOE>> [<ffffffff810a3d88>] ? hrtimer_start+0x18/0x20
<4> [<ffffffff81426cd8>] ? menu_select+0x178/0x390
<4> [<ffffffff81425bb7>] ? cpuidle_idle_call+0xa7/0x140
<4> [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
<4> [<ffffffff8151063a>] ? rest_init+0x7a/0x80
<4> [<ffffffff81c2af8f>] ? start_kernel+0x424/0x430
<4> [<ffffffff81c2a33a>] ? x86_64_start_reservations+0x125/0x129
<4> [<ffffffff81c2a453>] ? x86_64_start_kernel+0x115/0x124
<4>NVRM: GPU at 0000:09:00.0 has fallen off the bus.
<4>NVRM: GPU is on Board 0320414000580.
<4>NVRM: GPU at 0000:09:00.0 has fallen off the bus.
<4>NVRM: GPU is on Board 0320414000580.

Thanks,
Craig

Andreas Kloeckner

2014-12-10 23:04:53 UTC

Hi Craig,

I've never seen such a thing, and I'm not even sure what PyCUDA could do
to hard-crash a machine like that. It sounds like either a
driver or hardware bug to me. If there's something that PyCUDA can
do to avoid the hard crash, I'd be more than happy to do that--but I'm
not even sure what that would be...

One possible issue is threads. PyCUDA (CUDA generally, really) isn't
very good around them--but that should only be an issue if multiple
threads touch the GPU.
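
One way to rule threads out is to route every GPU call through a single dedicated thread. The sketch below is only an illustration of that idea, not something PyCUDA provides: `on_gpu_thread` and `gpu_thread_id` are hypothetical helpers built on the standard library, and it assumes the CUDA context would be created by a function submitted to that same thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# A pool with exactly one worker, so every submitted call runs on the same thread.
_gpu_executor = ThreadPoolExecutor(max_workers=1)


def on_gpu_thread(func, *args, **kwargs):
    """Run func on the single GPU-owning thread and return its result."""
    return _gpu_executor.submit(func, *args, **kwargs).result()


def gpu_thread_id():
    """Identifier of the thread that executes all submitted work."""
    return on_gpu_thread(threading.get_ident)
```

Any code path that needs the GPU, whichever thread it starts on, would then wrap its work as `on_gpu_thread(some_gpu_function, args...)`, so only one thread ever touches the device.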

As a simple test, can you write a script that does some PyCUDA
computations and then sits in a raw_input()? That's about the closest
analog to IPython I can think of.
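
A minimal sketch of such a test script follows. It assumes Python 3 (where `raw_input()` is spelled `input()`), an installed NumPy and PyCUDA, and a single default GPU; the array size and the `results_match` helper are arbitrary choices for illustration.

```python
"""Run one small PyCUDA computation, then idle with the CUDA context still open."""


def results_match(inputs, outputs, factor=2.0):
    """Return True if every output equals factor times its input."""
    return all(out == factor * inp for inp, out in zip(inputs, outputs))


def main():
    try:
        import numpy as np
        import pycuda.autoinit  # noqa: F401  creates a context on the default GPU
        import pycuda.gpuarray as gpuarray
    except ImportError as exc:
        print("PyCUDA or NumPy is not available:", exc)
        return

    host_in = np.arange(1024, dtype=np.float32)
    dev_in = gpuarray.to_gpu(host_in)  # host -> device copy
    host_out = (2 * dev_in).get()      # elementwise kernel, then device -> host copy
    print("results correct:", results_match(host_in.tolist(), host_out.tolist()))

    # Keep the process (and its CUDA context) alive, as an open IPython shell would.
    input("Kernel finished; press Enter to exit... ")


if __name__ == "__main__":
    main()
```

Leaving the script at the prompt for several minutes mimics the idle IPython session described above without involving IPython itself.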

Anyhow, these are just some suggestions. Hope some of this is helpful.

Andreas

Craig Stringham

2014-12-11 00:25:35 UTC

Thank you Andreas. I tried python with just an input at the end and it
still crashed, so it is not an ipython-specific thing. Also, putting the
GPU in persistence mode didn't stop it either (it did take longer to crash
though). Maybe I'll write a CUDA-only program with an infinite loop at the
end and see if that will crash it as well. As you said, since it causes a
hard crash it is most likely a driver or hardware issue.
Thanks!
Craig

Craig Stringham

2014-12-11 23:04:25 UTC

So it's definitely not a PyCUDA issue or even Python-related. It
consistently crashes on test six of the gpumemtest... and now it kernel
panics on every reboot... I think we have a hardware failure.
Thank you for your help,
Craig
