[Ipmitool-devel] Is the BMC robust to recover from system hangs? ipmitool unresponsive

Rahul Nabar rpnabar at gmail.com
Mon Aug 30 10:06:55 CDT 2010


On Mon, Aug 30, 2010 at 7:54 AM, Andy Cress <andy.cress at us.kontron.com> wrote:

Thanks very much Andy for taking the time for such a detailed
response! That sure helps!
>
> Yes, that is a key function that all IPMI BMCs are supposed to provide.
> The BMC is generally not affected by what the OS does, unless there are
> IPMI-aware applications running in the OS, specifically talking to the
> BMC.

Nothing that I am aware of other than ipmitool. None of the
vendor-specific GUIs etc.

> 1) IPMI LAN configuration.  Make sure that the IPMI LAN was properly
> configured.  It sounds like you may have tested this beforehand.  Even
> something like the ARP configuration could cause the port to no longer
> be visible to the router.

Yes, I had tested extensively prior to the failure. This is an HPC cluster
with about 300 identical servers, and the other servers in the group are
still responding perfectly. All are on their own dedicated IP subnet,
although the IPMI physical network is the same as the normal 1 GigE
Ethernet network, i.e. IPMI traffic is piggybacking on the same Ethernet
adapter port.

>
> 3) Some OS-resident (custom?) IPMI-aware application that may be causing
> trouble/stress/configuration problems with the BMC.

Nothing that I can imagine. I'm using CentOS and fairly standard Linux tools.

> In a healthy
> system, the 'ps -ef' output on the target should show any ipmi-related
> processes that are running.

I don't see any suspicious processes (on a sister node that hasn't
crashed). But here's the ps -ef output in case anything looks out of
place to you.

[root@eu001 ~]# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Aug29 ?        00:00:01 init [3]
root         2     1  0 Aug29 ?        00:00:00 [migration/0]
root         3     1  0 Aug29 ?        00:00:00 [ksoftirqd/0]
root         4     1  0 Aug29 ?        00:00:00 [watchdog/0]
root         5     1  0 Aug29 ?        00:00:00 [migration/1]
root         6     1  0 Aug29 ?        00:00:00 [ksoftirqd/1]
root         7     1  0 Aug29 ?        00:00:00 [watchdog/1]
root         8     1  0 Aug29 ?        00:00:00 [migration/2]
root         9     1  0 Aug29 ?        00:00:00 [ksoftirqd/2]
root        10     1  0 Aug29 ?        00:00:00 [watchdog/2]
root        11     1  0 Aug29 ?        00:00:00 [migration/3]
root        12     1  0 Aug29 ?        00:00:00 [ksoftirqd/3]
root        13     1  0 Aug29 ?        00:00:00 [watchdog/3]
root        14     1  0 Aug29 ?        00:00:00 [migration/4]
root        15     1  0 Aug29 ?        00:00:00 [ksoftirqd/4]
root        16     1  0 Aug29 ?        00:00:00 [watchdog/4]
root        17     1  0 Aug29 ?        00:00:00 [migration/5]
root        18     1  0 Aug29 ?        00:00:00 [ksoftirqd/5]
root        19     1  0 Aug29 ?        00:00:00 [watchdog/5]
root        20     1  0 Aug29 ?        00:00:00 [migration/6]
root        21     1  0 Aug29 ?        00:00:00 [ksoftirqd/6]
root        22     1  0 Aug29 ?        00:00:00 [watchdog/6]
root        23     1  0 Aug29 ?        00:00:00 [migration/7]
root        24     1  0 Aug29 ?        00:00:00 [ksoftirqd/7]
root        25     1  0 Aug29 ?        00:00:00 [watchdog/7]
root        26     1  0 Aug29 ?        00:00:00 [events/0]
root        27     1  0 Aug29 ?        00:00:00 [events/1]
root        28     1  0 Aug29 ?        00:00:00 [events/2]
root        29     1  0 Aug29 ?        00:00:00 [events/3]
root        30     1  0 Aug29 ?        00:00:00 [events/4]
root        31     1  0 Aug29 ?        00:00:00 [events/5]
root        32     1  0 Aug29 ?        00:00:00 [events/6]
root        33     1  0 Aug29 ?        00:00:00 [events/7]
root        34     1  0 Aug29 ?        00:00:00 [khelper]
root       169     1  0 Aug29 ?        00:00:00 [kthread]
root       181   169  0 Aug29 ?        00:00:00 [kblockd/0]
root       182   169  0 Aug29 ?        00:00:00 [kblockd/1]
root       183   169  0 Aug29 ?        00:00:00 [kblockd/2]
root       184   169  0 Aug29 ?        00:00:00 [kblockd/3]
root       185   169  0 Aug29 ?        00:00:00 [kblockd/4]
root       186   169  0 Aug29 ?        00:00:00 [kblockd/5]
root       187   169  0 Aug29 ?        00:00:00 [kblockd/6]
root       188   169  0 Aug29 ?        00:00:00 [kblockd/7]
root       189   169  0 Aug29 ?        00:00:00 [kacpid]
root       302   169  0 Aug29 ?        00:00:00 [cqueue/0]
root       303   169  0 Aug29 ?        00:00:00 [cqueue/1]
root       304   169  0 Aug29 ?        00:00:00 [cqueue/2]
root       305   169  0 Aug29 ?        00:00:00 [cqueue/3]
root       306   169  0 Aug29 ?        00:00:00 [cqueue/4]
root       307   169  0 Aug29 ?        00:00:00 [cqueue/5]
root       308   169  0 Aug29 ?        00:00:00 [cqueue/6]
root       309   169  0 Aug29 ?        00:00:00 [cqueue/7]
root       312   169  0 Aug29 ?        00:00:00 [khubd]
root       314   169  0 Aug29 ?        00:00:00 [kseriod]
root       437   169  0 Aug29 ?        00:00:00 [pdflush]
root       438   169  0 Aug29 ?        00:00:30 [pdflush]
root       439   169  0 Aug29 ?        00:00:00 [kswapd0]
root       440   169  0 Aug29 ?        00:00:00 [kswapd1]
root       441   169  0 Aug29 ?        00:00:00 [aio/0]
root       442   169  0 Aug29 ?        00:00:00 [aio/1]
root       443   169  0 Aug29 ?        00:00:00 [aio/2]
root       444   169  0 Aug29 ?        00:00:00 [aio/3]
root       445   169  0 Aug29 ?        00:00:00 [aio/4]
root       446   169  0 Aug29 ?        00:00:00 [aio/5]
root       447   169  0 Aug29 ?        00:00:00 [aio/6]
root       448   169  0 Aug29 ?        00:00:00 [aio/7]
root       598   169  0 Aug29 ?        00:00:00 [kpsmoused]
root       703   169  0 Aug29 ?        00:00:00 [mpt_poll_0]
root       704   169  0 Aug29 ?        00:00:00 [scsi_eh_0]
root       732   169  0 Aug29 ?        00:00:00 [kstriped]
root       769   169  0 Aug29 ?        00:00:01 [kjournald]
root       794   169  0 Aug29 ?        00:00:00 [kauditd]
root       827     1  0 Aug29 ?        00:00:00 /sbin/udevd -d
root      1416   169  0 Aug29 ?        00:00:00 [cxgb3]
root      2046   169  0 Aug29 ?        00:00:00 [kmpathd/0]
root      2047   169  0 Aug29 ?        00:00:00 [kmpathd/1]
root      2048   169  0 Aug29 ?        00:00:00 [kmpathd/2]
root      2049   169  0 Aug29 ?        00:00:00 [kmpathd/3]
root      2050   169  0 Aug29 ?        00:00:00 [kmpathd/4]
root      2051   169  0 Aug29 ?        00:00:00 [kmpathd/5]
root      2052   169  0 Aug29 ?        00:00:00 [kmpathd/6]
root      2053   169  0 Aug29 ?        00:00:00 [kmpathd/7]
root      2054   169  0 Aug29 ?        00:00:00 [kmpath_handlerd]
root      2089   169  0 Aug29 ?        00:03:03 [kjournald]
root      2091   169  0 Aug29 ?        00:00:00 [kjournald]
root      2093   169  0 Aug29 ?        00:00:00 [kjournald]
root      2320   169  0 Aug29 ?        00:00:00 [iw_cxgb3]
root      2394   169  0 Aug29 ?        00:00:00 [ib_mcast]
root      2395   169  0 Aug29 ?        00:00:00 [ib_inform]
root      2396   169  0 Aug29 ?        00:00:00 [local_sa]
root      2406   169  0 Aug29 ?        00:00:00 [ib_cm/0]
root      2407   169  0 Aug29 ?        00:00:00 [ib_cm/1]
root      2408   169  0 Aug29 ?        00:00:00 [ib_cm/2]
root      2409   169  0 Aug29 ?        00:00:00 [ib_cm/3]
root      2410   169  0 Aug29 ?        00:00:00 [ib_cm/4]
root      2411   169  0 Aug29 ?        00:00:00 [ib_cm/5]
root      2412   169  0 Aug29 ?        00:00:00 [ib_cm/6]
root      2413   169  0 Aug29 ?        00:00:00 [ib_cm/7]
root      2433   169  0 Aug29 ?        00:00:00 [ipoib]
root      2476   169  0 Aug29 ?        00:00:00 [ib_addr]
root      2486   169  0 Aug29 ?        00:00:00 [iw_cm_wq]
root      2496   169  0 Aug29 ?        00:00:00 [rdma_cm]
root      3111     1  0 Aug29 ?        00:00:00 /sbin/dhclient -1 -q -lf /var/lib/dhclient/dhclient-eth1.leases -pf /var/run/dhclien
root      3383     1  0 Aug29 ?        00:00:00 auditd
root      3385  3383  0 Aug29 ?        00:00:00 /sbin/audispd
root      3415     1  0 Aug29 ?        00:00:00 syslogd -m 0
root      3418     1  0 Aug29 ?        00:00:00 klogd -x
root      3432     1  0 Aug29 ?        00:00:00 irqbalance
rpc       3452     1  0 Aug29 ?        00:00:00 portmap
root      3489   169  0 Aug29 ?        00:00:00 [rpciod/0]
root      3490   169  0 Aug29 ?        00:00:00 [rpciod/1]
root      3491   169  0 Aug29 ?        00:00:00 [rpciod/2]
root      3492   169  0 Aug29 ?        00:00:00 [rpciod/3]
root      3493   169  0 Aug29 ?        00:00:00 [rpciod/4]
root      3494   169  0 Aug29 ?        00:00:00 [rpciod/5]
root      3495   169  0 Aug29 ?        00:00:00 [rpciod/6]
root      3496   169  0 Aug29 ?        00:00:00 [rpciod/7]
root      3509     1  0 Aug29 ?        00:00:00 rpc.statd
root      3541     1  0 Aug29 ?        00:00:00 rpc.idmapd
dbus      3564     1  0 Aug29 ?        00:00:00 dbus-daemon --system
root      3617     1  0 Aug29 ?        00:00:00 [lockd]
root      3642     1  0 Aug29 ?        00:00:00 pcscd
root      3656     1  0 Aug29 ?        00:00:00 /usr/sbin/acpid
68        3669     1  0 Aug29 ?        00:00:00 hald
root      3670  3669  0 Aug29 ?        00:00:00 hald-runner
68        3678  3670  0 Aug29 ?        00:00:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket
root      3740     1  0 Aug29 ?        00:00:00 /usr/bin/hidd --server
root      3775     1  0 Aug29 ?        00:00:00 automount
root      3799     1  0 Aug29 ?        00:00:00 /usr/sbin/sshd
ntp       3818     1  0 Aug29 ?        00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
root      3831     1  0 Aug29 ?        00:00:00 crond
root      3872     1  0 Aug29 ?        00:00:06 /opt/torque/sbin/pbs_mom
root      3896     1  0 Aug29 ?        00:00:00 /usr/sbin/atd
condor    3911     1  0 Aug29 ?        00:00:07 /usr/sbin/condor_master -pidfile /condor/var/run/condor/master.pid
condor    3921  3911  0 Aug29 ?        00:00:25 condor_startd -f
root      3960     1  0 Aug29 ?        00:00:00 /usr/sbin/smartd -q never
root      3963     1  0 Aug29 tty1     00:00:00 /sbin/mingetty tty1
root      3965     1  0 Aug29 tty2     00:00:00 /sbin/mingetty tty2
root      3969     1  0 Aug29 tty3     00:00:00 /sbin/mingetty tty3
root      3970     1  0 Aug29 tty4     00:00:00 /sbin/mingetty tty4
root      3971     1  0 Aug29 tty5     00:00:00 /sbin/mingetty tty5
root      3973     1  0 Aug29 tty6     00:00:00 /sbin/mingetty tty6
root      4322  3799  0 Aug29 ?        00:00:00 sshd: cfarberow@pts/0
512       4323  4322  0 Aug29 pts/0    00:00:00 -bash
512      17952  3872  0 10:00 ?        00:00:00 orted -mca ess env -mca orte_ess_jobid 2802450432 -mca orte_ess_vpid 2 -mca orte_ess
512      17953 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17954 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17955 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17956 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17957 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17958 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17959 17953 99 10:00 ?        00:03:37 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
512      17960 17954 99 10:00 ?        00:03:37 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
512      17961 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17962 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17963 17955 99 10:00 ?        00:03:37 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
512      17964 17956 99 10:00 ?        00:03:37 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
512      17965 17957 99 10:00 ?        00:03:37 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
512      17966 17961 99 10:00 ?        00:03:37 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
512      17967 17958 99 10:00 ?        00:03:36 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
512      17968 17962 99 10:00 ?        00:03:37 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
root     18046  3831  0 10:03 ?        00:00:00 crond
root     18047 18046  0 10:03 ?        00:00:00 [bash] <defunct>
root     18068 18046  0 10:03 ?        00:00:00 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t
root     18074  3799  0 10:03 ?        00:00:00 sshd: root@pts/9
root     18075 18074  0 10:03 pts/9    00:00:00 -bash
root     18105 18075  0 10:03 pts/9    00:00:00 ps -ef
[root@eu001 ~]# ps -ef | grep ipmi
root     18107 18075  0 10:04 pts/9    00:00:00 grep ipmi
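
The only hit above is the grep itself. For checking many nodes, the same
filter could be scripted. What follows is only a minimal Python sketch, not
something from this thread: the function name find_ipmi_processes and the
name fragments in IPMI_PATTERNS are my own assumptions, and it expects the
standard eight-column 'ps -ef' layout shown above.

```python
# Assumed name fragments for IPMI-aware software; adjust for your site.
IPMI_PATTERNS = ("ipmi", "bmc")


def find_ipmi_processes(ps_output, patterns=IPMI_PATTERNS):
    """Return (pid, cmd) pairs from 'ps -ef' text whose CMD matches a pattern.

    The header, wrapped continuation lines, and the grep used to search
    are skipped.
    """
    hits = []
    for line in ps_output.splitlines():
        # UID PID PPID C STIME TTY TIME CMD: CMD is everything after field 7.
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header or a wrapped fragment of a long command line
        pid, cmd = int(fields[1]), fields[7]
        if cmd.startswith("grep "):
            continue  # the search command itself, as in the output above
        if any(p in cmd.lower() for p in patterns):
            hits.append((pid, cmd))
    return hits
```

Feeding it the captured output of 'ps -ef' from each node returns an empty
list when nothing IPMI-related is running.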


>
> 4) A bug in the BMC.  You didn't mention which vendor's IPMI BMC is
> being used, but from the To list, it might be Dell (?).  Get the BMC
> version number and find out if there is an upgrade from the vendor.
> That is more important than what ipmitool does.  If the BMC is in a bad
> state, the history from the IPMI SEL may be helpful to the vendor.  If
> it is reproducible after an upgrade, the vendor should be able to fix
> it.


Yup! It is indeed Dell: an R410 server. I've posted on the Dell list
too. I'll wait to see if I get any ideas there.
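
On the suggestion to get the BMC version: a rough Python sketch of how that
could be collected on a healthy node follows. It assumes ipmitool is
installed, the IPMI kernel drivers are loaded, and that 'ipmitool mc info'
prints "Key : Value" lines including a "Firmware Revision" entry. The
function names are hypothetical.

```python
import subprocess


def parse_mc_info(text):
    """Turn 'Key : Value' lines into a dict, ignoring lines without a value."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info


def bmc_firmware(text):
    """Return the 'Firmware Revision' field, or None if it is absent."""
    return parse_mc_info(text).get("Firmware Revision")


def read_mc_info():
    """Run 'ipmitool mc info' locally (needs root) and return its output."""
    result = subprocess.run(
        ["ipmitool", "mc", "info"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling bmc_firmware(read_mc_info()) on each node would give a value to
compare across the cluster and against the vendor's latest release. The SEL
history mentioned above can be dumped separately with 'ipmitool sel list'.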

Thanks again!

-- 
Rahul



More information about the Linux-PowerEdge mailing list