Oracle 10.2.0.3 CRS – Missed Heart Beats Format in ocssd.log

Feb 4, 2007 / By Alex Gorbachev

The Oracle CRS 10.2.0.3 patchset changed how CSS logs missed heartbeats.

Here is an example of how missed heartbeats are logged in ocssd.log in 10.2.0.3:

[    CSSD]2007-02-02 14:41:06.867 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 50% heartbeat fatal, eviction in 29.440 seconds
[    CSSD]2007-02-02 14:41:21.865 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 75% heartbeat fatal, eviction in 14.440 seconds
[    CSSD]2007-02-02 14:41:30.864 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 90% heartbeat fatal, eviction in 5.440 seconds
[    CSSD]2007-02-02 14:41:31.866 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 90% heartbeat fatal, eviction in 4.440 seconds
[    CSSD]2007-02-02 14:41:32.868 [1199618400] >TRACE:   clssnmPollingThread: node node1 (1) is impending reconfig
[    CSSD]2007-02-02 14:41:32.868 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 90% heartbeat fatal, eviction in 3.440 seconds
[    CSSD]2007-02-02 14:41:32.868 [1199618400] >TRACE:   clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[    CSSD]2007-02-02 14:41:33.870 [1199618400] >TRACE:   clssnmPollingThread: node node1 (1) is impending reconfig
[    CSSD]2007-02-02 14:41:33.870 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 90% heartbeat fatal, eviction in 2.430 seconds
[    CSSD]2007-02-02 14:41:34.862 [1199618400] >TRACE:   clssnmPollingThread: node node1 (1) is impending reconfig
[    CSSD]2007-02-02 14:41:34.862 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 90% heartbeat fatal, eviction in 1.440 seconds
[    CSSD]2007-02-02 14:41:35.864 [1199618400] >TRACE:   clssnmPollingThread: node node1 (1) is impending reconfig
[    CSSD]2007-02-02 14:41:35.864 [1199618400] >WARNING: clssnmPollingThread: node node1 (1) at 90% heartbeat fatal, eviction in 0.440 seconds
[    CSSD]2007-02-02 14:41:36.306 [1199618400] >TRACE:   clssnmPollingThread: node node1 (1) is impending reconfig
[    CSSD]2007-02-02 14:41:36.306 [1199618400] >TRACE:   clssnmPollingThread: Eviction started for node node1 (1), flags 0x000f, state 3, wt4c 0

Note that 10.2.0.2 would start logging each heartbeat miss from the second miss:

[ CSSD]2006-11-07 22:15:59.420 [1107360096] >TRACE: clssnmPollingThread: node node1 (1) missed(2) checkin(s)
[ CSSD]2006-11-07 22:16:00.422 [1107360096] >TRACE: clssnmPollingThread: node node1 (1) missed(3) checkin(s)
[ CSSD]2006-11-07 22:16:01.424 [1107360096] >TRACE: clssnmPollingThread: node node1 (1) missed(4) checkin(s)
[ CSSD]2006-11-07 22:16:02.426 [1107360096] >TRACE: clssnmPollingThread: node node1 (1) missed(5) checkin(s)
[ CSSD]2006-11-07 22:16:03.428 [1107360096] >TRACE: clssnmPollingThread: node node1 (1) missed(6) checkin(s)
[ CSSD]2006-11-07 22:16:04.430 [1107360096] >TRACE: clssnmPollingThread: node node1 (1) missed(7) checkin(s)

10.2.0.3 has a somewhat more user-friendly format and tells you when a potential eviction would occur, but it starts logging only after 50% of heartbeats are missed. This means that you won’t be aware of “short” interconnect instability, if there is any. I would prefer something like "missed i checking out of n".

The output above is from the Linux platform. It might be different on other operating systems. On Linux, even with 10g, you should have hangcheck-timer installed, but this is another topic and I might blog about it soon.
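For anyone who wants to track these messages automatically, here is a minimal Python sketch that recognizes both formats quoted in this post. The regular expressions are derived only from the sample lines above, so they may need adjusting for other platforms or patch levels. The function name `parse_heartbeat_line` and the dictionary keys are my own choices for illustration, not anything Oracle provides.

```python
import re
from datetime import datetime, timedelta

# Patterns are based solely on the sample ocssd.log lines quoted in this post.
_TS = r"\](?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
_NODE = r".*node (?P<node>\S+) \((?P<num>\d+)\) "

# 10.2.0.3 style: "... at 50% heartbeat fatal, eviction in 29.440 seconds"
NEW_FORMAT = re.compile(
    _TS + _NODE
    + r"at (?P<pct>\d+)% heartbeat fatal, eviction in (?P<secs>\d+\.\d+) seconds"
)
# 10.2.0.2 style: "... missed(2) checkin(s)"
OLD_FORMAT = re.compile(_TS + _NODE + r"missed\((?P<missed>\d+)\) checkin\(s\)")


def parse_heartbeat_line(line):
    """Return a dict describing a missed-heartbeat line, or None for other lines."""
    m = NEW_FORMAT.search(line)
    if m:
        logged_at = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
        seconds_left = float(m["secs"])
        return {
            "format": "10.2.0.3",
            "node": m["node"],
            "logged_at": logged_at,
            "percent": int(m["pct"]),
            "seconds_left": seconds_left,
            # Log timestamp plus the remaining time reported on the same line.
            "predicted_eviction": logged_at + timedelta(seconds=seconds_left),
        }
    m = OLD_FORMAT.search(line)
    if m:
        return {
            "format": "10.2.0.2",
            "node": m["node"],
            "logged_at": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f"),
            "missed": int(m["missed"]),
        }
    return None


if __name__ == "__main__":
    sample = (
        "[    CSSD]2007-02-02 14:41:06.867 [1199618400] >WARNING: "
        "clssnmPollingThread: node node1 (1) at 50% heartbeat fatal, "
        "eviction in 29.440 seconds"
    )
    print(parse_heartbeat_line(sample)["predicted_eviction"])
```

Applied to the first 10.2.0.3 line above, adding 29.440 seconds to 14:41:06.867 gives a predicted eviction at 14:41:36.307, one millisecond away from the "Eviction started" message logged at 14:41:36.306.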

5 Responses to “Oracle 10.2.0.3 CRS – Missed Heart Beats Format in ocssd.log”

  • Fairlie says:

    Totally agree with the “missed i checking out of n”…. Gr8 post b.t.w

  • Fairlie,
    Thanks for reading. Glad to see you around.

  • Ray says:

    We are getting the same error. We were testing the interconnect and pulled a cable. I was expecting no change as we had another interconnect in place but the Clusterware rebooted one of the nodes and continually rebooted until we plugged the cable back in. Any ideas why this should happen?

  • A redundant interconnect is a must for a reliable RAC cluster. However, it needs to work BEFORE you put Oracle Clusterware on top of it.

    Oracle Clusterware can’t support a redundant interconnect natively, so the way to configure one is to use a third-party product. There are quite a few terms for it, like bonding, trunking and so on.
    In a nutshell, all those products join two or more network cards and present them as a single virtual network card, so that if one of the real network cards or cables stops working, it’s transparent to the virtual network card (except for lower bandwidth). This way Oracle Clusterware uses just one network, which is redundant behind the scenes.

    What platform are you on? What technology are you using?

    To test if it works, I would start by setting up a simple ping over the interconnect from every host. Before doing so, stop Clusterware on every node using “crsctl stop crs”. Then, while the pings are running, pull out a cable and see what happens with the ping responses. Wait several minutes to see if it recovers; perhaps the technology you are using takes a long time to reconfigure.

  • Sunny says:

    Hi,
    This is good info. Would it be safe to assume that the above “clssnmPollingThread” error is from the network heartbeat? There is one more heartbeat that CSS performs, and that is the vote disk I/O, for which there is a disktimeout.

    b) In the case of “clssnmPollingThread”, does the node reboot after it is evicted?

    c) Assuming the above error is for the network heartbeat, what does the error look like when it is a vote disk I/O timeout?

    Thanks
