Using the ILOM for troubleshooting on ODA
I worked on root cause analysis for a strange node reboot on client's Oracle Database Appliance yesterday. The case was quite interesting from the perspective that none of the logs contained any information related to the cause of the reboot. I could only see the log entries for normal activities and then - BOOM! - the start-up sequence! It looked like someone just power cycled the node. I also observed the heartbeat timeouts followed by the node eviction on the remaining node. There was still one place I hadn't checked and it revealed the cause of the issue. One of the cool things about ODA is it's service processor (SP) called Integrated Lights Out Manager (ILOM), which allows you to do many things that you'd normally do being physically located in the data center - power cycle the node, change the BIOS settings, choose boot devices, and ... (the drum-roll) ... see the console outputs from the server node! And it doesn't only show the current console output but it keeps logging it too. Each ODA server has its own ILOM, so I found out the IP address for the ILOM of the node which failed and connected to it using SSH. [code light="true" highlight="1,11"] $ ssh pythian@oda01a-mgmt Password: Oracle(R) Integrated Lights Out Manager Version 184.108.40.206.a r70764 Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved. -> -> ls / Targets: HOST STORAGE SYS SP Properties: Commands: cd show [/code] ILOM can be browsed as it would be a directory structure. Here the "Targets" are different components of the system. When you "cd" into a target you see sub-components and so on. Each target can have properties, they are displayed as variable=value pairs under "Properties" section. And there are also list of "Commands" that you can execute for the current target. the "ls" command shows the sub-targets, the properties and the commands for the current target. Here's how I found the console outputs from the failed node: [code light="true" highlight="1,4,20,23,40,43"] -> cd HOST /HOST -> ls /HOST Targets: console diag Properties: boot_device = default generate_host_nmi = (Cannot show property) Commands: cd set show -> cd console /HOST/console -> ls /HOST/console Targets: history Properties: line_count = 0 pause_count = 0 start_from = end Commands: cd show start stop -> cd history /HOST/console/history -> ls [/code] The last "ls" command started printing all the history of console outputs on my screen and look what I found just before the startup sequence (I removed some lines to make this shorter and I also highlighted the most interesting lines): [code light="true" highlight="8,10"] divide error: 0000 [#1] SMP last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/0000:1f:00.0/host7/port-7:1/expander-7:1/port-7:1:2/end_device-7:1:2/target7:0:15/7:0:15:0/timeout CPU 3 Modules linked in: iptable_filter(U) ip_tables(U) x_tables(U) oracleacfs(P)(U) oracleadvm(P)(U) oracleoks(P)(U) mptctl(U) mptbase(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) rfkill(U) nfs(U) fscache(U) nfs_acl(U) auth_rpcgss(U) lockd(U) sunrpc(U) bonding(U) be2iscsi(U) ib_iser(U) rdma_cm(U) ib_cm(U) iw_cm(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) iscsi_tcp(U) bnx2i(U) cnic(U) uio(U) dm_round_robin(U) ipv6(U) cxgb3i(U) libcxgbi(U) cxgb3(U) mdio(U) libiscsi_tcp(U) libiscsi(U) scsi_transport_iscsi(U) video(U ) output(U) sbs(U) sbshc(U) parport_pc(U) lp(U) parport(U) ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) igb(U) ixgbe(U) joydev(U) ses(U) enclosure(U) e1000e(U) snd_seq_dummy(U) snd_seq_oss(U) snd_seq_midi_event(U) snd_seq(U) snd_seq_device(U) snd_pcm_oss(U) snd_mixer_oss(U) snd_pcm(U) snd_timer(U) snd(U) soundcore(U) snd_page_alloc(U) iTCO_wdt(U) iTCO_vendor_support(U) shpchp(U) i2c_i801(U) i2c_core(U) ioatdma(U) dca(U) pcspkr(U) dm_multipath(U) usb_storage(U) mpt2sas(U) scsi_transport_sas(U) raid_class(U) ahci(U) raid1(U) [last unloaded: microcode] Pid: 29478, comm: top Tainted: P W 2.6.32-300.11.1.el5uek #1 SUN FIRE X4370 M2 SERVER RIP: 0010:[<ffffffff8104b3e8>] [<ffffffff8104b3e8>] thread_group_times+0x5b/0xab ... Kernel panic - not syncing: Fatal exception Pid: 29478, comm: top Tainted: P D W 2.6.32-300.11.1.el5uek #1 Call Trace: [<ffffffff8105797e>] panic+0xa5/0x162 [<ffffffff8107ae09>] ? up+0x39/0x3e [<ffffffff810580d1>] ? release_console_sem+0x194/0x19d [<ffffffff8105839a>] ? console_unblank+0x6a/0x6f [<ffffffff8105764b>] ? print_oops_end_marker+0x23/0x25 [<ffffffff81456ea6>] oops_end+0xb7/0xc7 [<ffffffff8101565d>] die+0x5a/0x63 [<ffffffff8145677c>] do_trap+0x115/0x124 [<ffffffff81013674>] do_divide_error+0x96/0x9f [<ffffffff8104b3e8>] ? thread_group_times+0x5b/0xab [<ffffffff810dd2f8>] ? get_page_from_freelist+0x4be/0x65e [<ffffffff81012b1b>] divide_error+0x1b/0x20 [<ffffffff8104b3e8>] ? thread_group_times+0x5b/0xab [<ffffffff8104b3d4>] ? thread_group_times+0x47/0xab [<ffffffff8116ee13>] ? collect_sigign_sigcatch+0x46/0x5e [<ffffffff8116f366>] do_task_stat+0x354/0x8c3 [<ffffffff81238267>] ? put_dec+0xcf/0xd2 [<ffffffff81238396>] ? number+0x12c/0x244 [<ffffffff8107419b>] ? get_pid_task+0xe/0x19 [<ffffffff811eac34>] ? security_task_to_inode+0x16/0x18 [<ffffffff8116a77b>] ? task_lock+0x15/0x17 [<ffffffff8116add1>] ? task_dumpable+0x29/0x3c [<ffffffff8116c1c6>] ? pid_revalidate+0x80/0x99 [<ffffffff81135992>] ? seq_open+0x25/0xba [<ffffffff81135a08>] ? seq_open+0x9b/0xba [<ffffffff8116d147>] ? proc_single_show+0x0/0x7a [<ffffffff81135b2e>] ? single_open+0x8f/0xb8 [<ffffffff8116aa0e>] ? proc_single_open+0x23/0x3b [<ffffffff81127cc1>] ? do_filp_open+0x4f8/0x92d [<ffffffff8116f8e9>] proc_tgid_stat+0x14/0x16 [<ffffffff8116d1a6>] proc_single_show+0x5f/0x7a [<ffffffff81135e73>] seq_read+0x193/0x350 [<ffffffff811ea88c>] ? security_file_permission+0x16/0x18 [<ffffffff8111a797>] vfs_read+0xad/0x107 [<ffffffff8111b24b>] sys_read+0x4c/0x70 [<ffffffff81011db2>] system_call_fastpath+0x16/0x1b Rebooting in 60 seconds..??? [/code] A quick search on My Oracle Support quickly found a match: Kernel Panic at "thread_group_times+0x5b/0xab" (Doc ID 1620097.1)". The call stack and the massages are a 100% match and the root cause is a kernel bug that's fixed in more recent versions. I'm not sure how I would have gotten to the root cause if this system was not an ODA and the server would have just bounced without logging the Kernel Panic in any of the logs. ODA's ILOM definitely made the troubleshooting effort less painful and probably saved us from couple more incidents caused by this bug in the future as we'd been able to troubleshoot it quicklyand we'll be able to implement the fix sooner.