Using the ILOM for troubleshooting on ODA

Sep 17, 2014 12:00:00 AM

Yesterday I worked on a root cause analysis for a strange node reboot on a client's Oracle Database Appliance (ODA).

The case was interesting because none of the logs contained any information about the cause of the reboot. I could only see log entries for normal activity and then - BOOM! - the start-up sequence! It looked like someone had just power cycled the node. On the remaining node I also observed heartbeat timeouts followed by the node eviction. There was still one place I hadn't checked, and it revealed the cause of the issue.

Leveraging ODA Architecture for Troubleshooting

The Role of Integrated Lights Out Manager (ILOM)

One of the cool things about ODA is its service processor (SP), called Integrated Lights Out Manager (ILOM). It lets you do many things that would normally require being physically present in the data center: power cycle the node, change the BIOS settings, choose boot devices, and ... (drum roll) ... see the console output from the server node! It doesn't just show the current console output; it keeps a log of it too. Each ODA server node has its own ILOM, so I looked up the IP address of the ILOM for the failed node and connected to it over SSH.

    $ ssh pythian@oda01a-mgmt
    Password:

    Oracle(R) Integrated Lights Out Manager
    Version 3.0.14.13.a r70764
    Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved.

    ->
    -> ls
     /
        Targets:
            HOST
            STORAGE
            SYS
            SP

        Properties:

        Commands:
            cd
            show

Navigating the ILOM Directory Structure

Accessing the Host Console History

ILOM can be browsed as if it were a directory structure. The "Targets" are the different components of the system. When you "cd" into a target, you see its sub-components, and so on. Each target can have properties, which are displayed as variable=value pairs under the "Properties" section. There is also a list of "Commands" that you can execute for the current target. The "ls" command shows the sub-targets, the properties, and the commands for the current target. Here's how I found the console output from the failed node:

    -> cd HOST
    /HOST

    -> ls
     /HOST
        Targets:
            console
            diag

        Properties:
            boot_device = default
            generate_host_nmi = (Cannot show property)

        Commands:
            cd
            set
            show

    -> cd console
    /HOST/console

    -> ls
     /HOST/console
        Targets:
            history

        Properties:
            line_count = 0
            pause_count = 0
            start_from = end

        Commands:
            cd
            show
            start
            stop

    -> cd history
    /HOST/console/history

    -> ls

Analysis of the Console Output

Identifying the Kernel Panic

The last "ls" command started printing the whole history of console output on my screen. Look what I found just before the startup sequence (I removed some lines to make this shorter):

    divide error: 0000 [#1] SMP
    last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/0000:1f:00.0/host7/port-7:1/expander-7:1/port-7:1:2/end_device-7:1:2/target7:0:15/7:0:15:0/timeout
    CPU: 3
    Modules linked in:
      iptable_filter(U) ip_tables(U) x_tables(U) oracleacfs(P)(U) oracleadvm(P)(U)
      oracleoks(P)(U) mptctl(U) mptbase(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U)
      ...
      usb_storage(U) mpt2sas(U) scsi_transport_sas(U) raid_class(U) ahci(U) raid1(U)
      [last unloaded: microcode]

    Pid: 29478, comm: top Tainted: P W 2.6.32-300.11.1.el5uek #1 SUN FIRE X4370 M2 SERVER
    RIP: 0010:[<ffffffff8104b3e8>] [<ffffffff8104b3e8>] thread_group_times+0x5b/0xab
    ...
    Kernel panic - not syncing: Fatal exception
    Pid: 29478, comm: top Tainted: P D W 2.6.32-300.11.1.el5uek #1
    Call Trace:
      [<ffffffff8105797e>] panic+0xa5/0x162
      ...
      [<ffffffff81013674>] do_divide_error+0x96/0x9f
      [<ffffffff8104b3e8>] ? thread_group_times+0x5b/0xab
      ...
      [<ffffffff81011db2>] system_call_fastpath+0x16/0x1b

    Rebooting in 60 seconds..???
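Scrolling through a long console history by eye is easy to get wrong, so here is a minimal Python sketch of how the same check could be automated. It assumes the console history has already been copied into a text string or file; the function name `find_panics` and both regular expressions are my own illustration, based only on the lines shown in the excerpt above, and they are not an exhaustive list of kernel oops formats.

```python
import re

# Assumed patterns, modelled on the console excerpt above.
# Other kernels or architectures may print these lines differently.
PANIC_RE = re.compile(r"Kernel panic - not syncing: (?P<reason>.+)")
RIP_RE = re.compile(r"RIP: .*?(?P<symbol>[\w.]+\+0x[0-9a-f]+/0x[0-9a-f]+)")


def find_panics(console_text):
    """Return one dict per 'Kernel panic' line.

    Each dict holds the line number, the panic reason, and the most
    recent RIP symbol seen before the panic line (or None).
    """
    findings = []
    last_symbol = None
    for number, line in enumerate(console_text.splitlines(), start=1):
        rip = RIP_RE.search(line)
        if rip:
            last_symbol = rip.group("symbol")
        panic = PANIC_RE.search(line)
        if panic:
            findings.append({
                "line": number,
                "reason": panic.group("reason").strip(),
                "symbol": last_symbol,
            })
            last_symbol = None  # do not carry a symbol over to a later panic
    return findings


# A shortened copy of the excerpt above, used as sample input.
SAMPLE = """\
divide error: 0000 [#1] SMP
CPU: 3
Pid: 29478, comm: top Tainted: P W 2.6.32-300.11.1.el5uek #1 SUN FIRE X4370 M2 SERVER
RIP: 0010:[<ffffffff8104b3e8>] [<ffffffff8104b3e8>] thread_group_times+0x5b/0xab
Kernel panic - not syncing: Fatal exception
"""

if __name__ == "__main__":
    for finding in find_panics(SAMPLE):
        print(finding)
```

For the sample input, the reported symbol is `thread_group_times+0x5b/0xab`, which is the string worth pasting into a support search.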

Root Cause Discovery

Matching the Call Stack with Known Issues

A quick search on My Oracle Support found a match: Kernel Panic at "thread_group_times+0x5b/0xab" (Doc ID 1620097.1).

The call stack and the messages are a 100% match, and the root cause is a kernel bug that's fixed in more recent versions. I'm not sure how I would have gotten to the root cause if this system had not been an ODA and the server had just bounced without logging the kernel panic in any of the logs.

ODA's ILOM definitely made the troubleshooting effort less painful. It probably also saved us from a couple more incidents caused by this bug, since we were able to troubleshoot it quickly and can implement the fix sooner.