Oracle 10.2.0.3 CRS – Missed Heart Beats Format in ocssd.log

The Oracle CRS 10.2.0.3 patchset changed how CSS logs missed heartbeats.

Here is an example of how missed heartbeats are logged in ocssd.log in 10.2.0.3:

[    CSSD]2007-02-02 14:41:06.867 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 50% heartbeat fatal, eviction in 29.440 seconds
[    CSSD]2007-02-02 14:41:21.865 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 75% heartbeat fatal, eviction in 14.440 seconds
[    CSSD]2007-02-02 14:41:30.864 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 90% heartbeat fatal, eviction in 5.440 seconds
[    CSSD]2007-02-02 14:41:31.866 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 90% heartbeat fatal, eviction in 4.440 seconds
[    CSSD]2007-02-02 14:41:32.868 [1199618400] >TRACE:   clssnmPollingThread: node node1 (1) is impending reconfig
[    CSSD]2007-02-02 14:41:32.868 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 90% heartbeat fatal, eviction in 3.440 seconds
[    CSSD]2007-02-02 14:41:32.868 [1199618400] >TRACE:   clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[    CSSD]2007-02-02 14:41:33.870 [1199618400] >TRACE:   clssnmPollingThread: node node1 (1) is impending reconfig
[    CSSD]2007-02-02 14:41:33.870 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 90% heartbeat fatal, eviction in 2.430 seconds
[    CSSD]2007-02-02 14:41:34.862 [1199618400] >TRACE:   clssnmPollingThread: node node1 (1) is impending reconfig
[    CSSD]2007-02-02 14:41:34.862 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 90% heartbeat fatal, eviction in 1.440 seconds
[    CSSD]2007-02-02 14:41:35.864 [1199618400] >TRACE:   clssnmPollingThread: node node1 (1) is impending reconfig
[    CSSD]2007-02-02 14:41:35.864 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 90% heartbeat fatal, eviction in 0.440 seconds
[    CSSD]2007-02-02 14:41:36.306 [1199618400] >TRACE:   clssnmPollingThread: node node1 (1) is impending reconfig
[    CSSD]2007-02-02 14:41:36.306 [1199618400] >TRACE:   clssnmPollingThread: Eviction started for node node1 (1), flags 0x000f, state 3, wt4c 0
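
If you want to track these warnings automatically, the lines can be picked apart with a regular expression. The following is only a minimal sketch: the pattern is derived from the sample lines above (Linux, 10.2.0.3), and the function name and dictionary keys are my own choices for illustration, not anything Oracle provides.

```python
import re
from datetime import datetime

# Pattern built from the sample 10.2.0.3 lines above; other platforms or
# patch levels may log a different layout (assumption).
WARNING_RE = re.compile(
    r"\[\s*CSSD\](?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[\d+\] >WARNING: clssnmPollingThread: node (?P<node>\S+) "
    r"\((?P<num>\d+)\) at (?P<pct>\d+)% heartbeat fatal, "
    r"eviction in (?P<secs>\d+\.\d+) seconds"
)


def parse_warning(line):
    """Return the fields of a heartbeat warning line, or None if no match."""
    m = WARNING_RE.search(line)
    if m is None:
        return None
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f"),
        "node": m.group("node"),
        "node_number": int(m.group("num")),
        "percent": int(m.group("pct")),
        "seconds_to_eviction": float(m.group("secs")),
    }
```

TRACE lines such as "is impending reconfig" do not match the pattern and return None, so the function can be applied to every line of the log.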

Note that 10.2.0.2 would start logging each heartbeat miss from the second miss:

[ CSSD]2006-11-07 22:15:59.420 [1107360096] >TRACE: clssnmPollingThread: node node1 (1) missed(2) checkin(s)
[ CSSD]2006-11-07 22:16:00.422 [1107360096] >TRACE: clssnmPollingThread: node node1 (1) missed(3) checkin(s)
[ CSSD]2006-11-07 22:16:01.424 [1107360096] >TRACE: clssnmPollingThread: node node1 (1) missed(4) checkin(s)
[ CSSD]2006-11-07 22:16:02.426 [1107360096] >TRACE: clssnmPollingThread: node node1 (1) missed(5) checkin(s)
[ CSSD]2006-11-07 22:16:03.428 [1107360096] >TRACE: clssnmPollingThread: node node1 (1) missed(6) checkin(s)
[ CSSD]2006-11-07 22:16:04.430 [1107360096] >TRACE: clssnmPollingThread: node node1 (1) missed(7) checkin(s)
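
Because 10.2.0.2 logs every miss, short runs of missed checkins can be spotted by grouping the lines into streaks. This is a rough sketch under one assumption that the sample does not confirm: that the missed counter starts again from a lower number once heartbeats resume. The pattern is taken from the sample lines above, and `miss_streaks` is a hypothetical helper name.

```python
import re

# Pattern built from the sample 10.2.0.2 lines above.
MISSED_RE = re.compile(
    r"clssnmPollingThread: node (?P<node>\S+) \((?P<num>\d+)\) "
    r"missed\((?P<count>\d+)\) checkin\(s\)"
)


def miss_streaks(lines):
    """Return a list of (node, highest_missed_count), one entry per streak.

    Assumption: a counter that does not increase for a node means the
    previous streak ended and a new one began.
    """
    streaks = []
    open_streak = {}  # node name -> index of its current streak in `streaks`
    for line in lines:
        m = MISSED_RE.search(line)
        if m is None:
            continue
        node = m.group("node")
        count = int(m.group("count"))
        idx = open_streak.get(node)
        if idx is not None and count > streaks[idx][1]:
            # Counter is still climbing: extend the current streak.
            streaks[idx] = (node, count)
        else:
            # First sighting of the node, or the counter restarted.
            streaks.append((node, count))
            open_streak[node] = len(streaks) - 1
    return streaks
```

A streak that peaks at a low count and is never followed by an eviction is the kind of short instability that would go unreported if logging started only at the 50% mark.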

10.2.0.3 has a somewhat more user-friendly format and tells you when a potential eviction would occur, but it starts logging only after 50% of the heartbeats are missed. This means that you won’t be aware of “short” interconnect instability, if there is any. I would prefer something like "missed i checking out of n".

The output above is from the Linux platform. It might be different on other operating systems. On Linux, even with 10g, you should have hangcheck-timer installed, but this is another topic and I might blog about it soon.

About the Author

What does it take to be chief technology officer at a company of technology experts? Experience. Imagination. Passion. Alex Gorbachev has all three. He’s played a key role in taking the company global, having set up Pythian’s Asia Pacific operations. Today, the CTO office is an incubator of new services and technologies – a mini-startup inside Pythian. Most recently, Alex built a Big Data Engineering services team and established a Data Science practice. Highly sought after for his deep expertise and interest in emerging trends, Alex routinely speaks at industry events as a member of the OakTable.

5 Comments

Totally agree with the “missed i checking out of n”…. Gr8 post b.t.w

Fairlie,
Thanks for reading. Glad to see you around.

We are getting the same error. We were testing the interconnect and pulled a cable. I was expecting no change, as we had another interconnect in place, but the Clusterware rebooted one of the nodes and kept rebooting it until we plugged the cable back in. Any ideas why this should happen?

Alex Gorbachev
June 19, 2007 10:06 am

A redundant interconnect is a must for a reliable RAC cluster. However, it needs to work BEFORE you put Oracle Clusterware on top of it.

Oracle Clusterware doesn’t have the ability to support a redundant interconnect natively, so the way to configure one is to use a third-party product. There are quite a few terms for it, like bonding, trunking and so on.
In a nutshell, all those products join two or more network cards and present them as a single virtual network card, so that if one of the real network cards or cables stops working, it is transparent to the virtual network card (except for lower bandwidth). This way Oracle Clusterware uses just one network, which is redundant behind the scenes.

What platform are you on? What technology are you using?

To test whether it works, I would start by setting up a simple ping over the interconnect from every host. Before doing so, stop Clusterware on every node using “crsctl stop crs”. Then, while the pings are running, pull out a cable and see what happens to the ping responses. Wait several minutes to see if it recovers; perhaps the technology you are using takes a long time to reconfigure.

Hi,
This is good info. Would it be safe to assume that the above “clssnmPollingThread” error is from the network heartbeat? There is one more heartbeat that CSS performs, the vote disk I/O, for which there is a disktimeout.

b) In the case of “clssnmPollingThread”, does the node reboot after it is evicted?

c) Assuming the above error is for the network heartbeat, what does the error look like when it is a vote disk I/O timeout?

Thanks
