Craig Stringham
2014-12-10 20:03:33 UTC
Hi,
I keep crashing a server (kernel panic) when using pycuda within ipython.
It doesn't seem to matter which CUDA kernel I run, and the crash only happens
several minutes after a kernel has run, while the ipython shell is still open.
I am using the latest git version of pycuda with the latest release of CUDA
(6.5) on a K40m. Below is the last portion of the dmesg output from the crash.
Has anyone had a similar issue? Since the GPU appears to be falling off the
bus, I will try putting it in persistence mode; maybe that will fix it.
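For reference, a minimal sketch of how persistence mode could be enabled from
Python by calling nvidia-smi with its -pm flag (and -i to select one GPU). The
helper names persistence_mode_command and set_persistence_mode are hypothetical,
made up for this illustration, and the call assumes nvidia-smi is on PATH and
the script runs with root privileges. Whether this actually prevents the crash
is untested.

```python
import shutil
import subprocess


def persistence_mode_command(enable=True, gpu_index=None):
    """Build the nvidia-smi argument list that toggles persistence mode.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    cmd = ["nvidia-smi", "-pm", "1" if enable else "0"]
    if gpu_index is not None:
        # -i restricts the change to a single GPU; without it all GPUs are affected.
        cmd += ["-i", str(gpu_index)]
    return cmd


def set_persistence_mode(enable=True, gpu_index=None):
    """Run nvidia-smi to toggle persistence mode (normally needs root)."""
    cmd = persistence_mode_command(enable, gpu_index)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("nvidia-smi not found on PATH")
    # check=True raises CalledProcessError if nvidia-smi reports a failure.
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Calling set_persistence_mode(True, gpu_index=0) as root would then be the
equivalent of running "nvidia-smi -pm 1 -i 0" by hand.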
...
<7>nvidia 0000:09:00.0: irq 131 for MSI/MSI-X
<6>IPT_DBLK IN=eth1 OUT= MAC=00:1e:64:98:6c:53:00:1b:8f:d9:dc:04:08:00 SRC=198.20.69.74 DST=137.79.160.131 LEN=40 TOS=0x00 PREC=0x00 TTL=116 ID=60947 PROTO=TCP SPT=20849 DPT=80 WINDOW=9963 RES=0x00 SYN URGP=0
<7>nvidia 0000:09:00.0: irq 131 for MSI/MSI-X
<7>nvidia 0000:09:00.0: irq 131 for MSI/MSI-X
<0>{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
<0>{1}[Hardware Error]: event severity: fatal
<0>{1}[Hardware Error]: Error 0, type: fatal
<0>{1}[Hardware Error]: section_type: PCIe error
<0>{1}[Hardware Error]: port_type: 4, root port
<0>{1}[Hardware Error]: version: 1.16
<0>{1}[Hardware Error]: command: 0x4010, status: 0x0547
<0>{1}[Hardware Error]: device_id: 0000:00:03.2
<0>{1}[Hardware Error]: slot: 0
<0>{1}[Hardware Error]: secondary_bus: 0x09
<0>{1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x0e0a
<0>{1}[Hardware Error]: class_code: 000406
<0>{1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
<0>Kernel panic - not syncing: Fatal hardware error!
<4>Pid: 0, comm: swapper Tainted: P --------------- 2.6.32-504.1.3.el6.x86_64 #1
<4>Call Trace:
<4> <NMI> [<ffffffff815292dc>] ? panic+0xa7/0x16f
<4> [<ffffffff8131c191>] ? ghes_notify_nmi+0x241/0x250
<4> [<ffffffff81530095>] ? notifier_call_chain+0x55/0x80
<4> [<ffffffff815300fa>] ? atomic_notifier_call_chain+0x1a/0x20
<4> [<ffffffff810a4eae>] ? notify_die+0x2e/0x30
<4> [<ffffffff8152dd89>] ? do_nmi+0x1e9/0x340
<4> [<ffffffff8152d620>] ? nmi+0x20/0x30
<4> [<ffffffff812ea5e1>] ? intel_idle+0xb1/0x170
<4> <<EOE>> [<ffffffff810a3d88>] ? hrtimer_start+0x18/0x20
<4> [<ffffffff81426cd8>] ? menu_select+0x178/0x390
<4> [<ffffffff81425bb7>] ? cpuidle_idle_call+0xa7/0x140
<4> [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
<4> [<ffffffff8151063a>] ? rest_init+0x7a/0x80
<4> [<ffffffff81c2af8f>] ? start_kernel+0x424/0x430
<4> [<ffffffff81c2a33a>] ? x86_64_start_reservations+0x125/0x129
<4> [<ffffffff81c2a453>] ? x86_64_start_kernel+0x115/0x124
<4>NVRM: GPU at 0000:09:00.0 has fallen off the bus.
<4>NVRM: GPU is on Board 0320414000580.
<4>NVRM: GPU at 0000:09:00.0 has fallen off the bus.
<4>NVRM: GPU is on Board 0320414000580.
Thanks,
Craig
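
In case it helps anyone checking their own logs for the same symptom, here is a
rough sketch of a scan for the NVRM "fallen off the bus" message shown in the
dmesg output above. The function name find_gpu_bus_drops is hypothetical, and
the pattern is based only on the wording of the lines in this log, so other
driver versions may phrase the message differently.

```python
import re

# Matches lines such as:
#   <4>NVRM: GPU at 0000:09:00.0 has fallen off the bus.
# The pattern is an assumption derived from the log in this post.
BUS_DROP_RE = re.compile(r"NVRM: GPU at ([0-9a-fA-F:.]+) has fallen off the bus")


def find_gpu_bus_drops(dmesg_text):
    """Return the PCI addresses of GPUs reported as fallen off the bus.

    Addresses are returned once each, in order of first appearance.
    """
    seen = []
    for match in BUS_DROP_RE.finditer(dmesg_text):
        address = match.group(1)
        if address not in seen:
            seen.append(address)
    return seen


sample = """<4>NVRM: GPU at 0000:09:00.0 has fallen off the bus.
<4>NVRM: GPU is on Board 0320414000580.
<4>NVRM: GPU at 0000:09:00.0 has fallen off the bus.
<4>NVRM: GPU is on Board 0320414000580."""

dropped = find_gpu_bus_drops(sample)
print(dropped)
```

Feeding it the saved dmesg text gives the PCI address to compare against the
secondary_bus value in the PCIe hardware error section.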