RAC – Clusterware Intermittently Fails to Start

While working on a new Grid Infrastructure installation on Red Hat V8.4, we had some issues testing a new cluster: The clusterware failed to come back up after a node reboot… but sometimes it worked.
We had the latest 19c software patched to 19.12 and were using ASMLib to configure the ASM disks.
When failing, the messages in the CRS alert.log stated that the clusterware was failing to find any voting disks—see the last line of the output:
2021-09-23 08:12:21.292 [OSYSMOND(28810)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 28810
2021-09-23 08:12:21.288 [CSSDMONITOR(28808)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 28808
2021-09-23 08:12:21.651 [CSSDAGENT(28847)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 28847
2021-09-23 08:12:27.533 [OCSSD(28867)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 28867
2021-09-23 08:12:28.592 [OCSSD(28867)]CRS-1713: CSSD daemon is started in hub mode
2021-09-23 08:12:28.703 [OCSSD(28867)]CRS-1714: Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /racd01/app/oracle/diag/crs/l-04296-d5c/crs/trace/ocssd.trc
And in the ocssd.trc file:
2021-09-23 08:12:28.703 : SKGFD:3462854400: ERROR: -14(asmlib /opt/oracle/extapi/64/asm/orcl/1/libasm.so version failed with 2 )
2021-09-23 08:12:28.703 : CLSF:3462854400: Allocated CLSF context
2021-09-23 08:12:28.703 : CSSD:3462854400: [ INFO] clssnmvDDiscThread: using discovery string ORCL:* for initial discovery
2021-09-23 08:12:28.703 : SKGFD:3462854400: Discovery with str:ORCL:*:
2021-09-23 08:12:28.703 : SKGFD:3462854400: UFS discovery with :ORCL:*:
2021-09-23 08:12:28.703 : SKGFD:3462854400: Execute glob on the string /racd01/app/19.3.0.0.0/grid/dbs/ORCL:*
2021-09-23 08:12:28.703 : SKGFD:3462854400: OSS discovery with :ORCL:*:
2021-09-23 08:12:28.703 : SKGFD:3462854400: Discovery skipping bad asmlib :ASM::
2021-09-23 08:12:28.703 : CSSD:3462854400: [ INFO] clssnmvDiskVerify: Successful discovery of 0 disks
2021-09-23 08:12:28.703 : CSSD:3462854400: [ INFO] clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2021-09-23 08:12:28.703 : CSSD:3462854400: [ INFO] clssnmvFindInitialConfigs: No voting files found
2021-09-23 08:12:28.703 : CSSD:3462854400: [ INFO] (:CSSNM00070:)clssnmCompleteInitVFDiscovery: Voting file not found. Retrying discovery in 15 seconds
Note line 1: 'libasm.so version failed with 2'.
We checked the disks using the discovery tool:
$ /usr/sbin/oracleasm-discover
Using ASMLib from /opt/oracle/extapi/64/asm/orcl/1/libasm.so
asm_version() failed with code 2
It gave the same error! We traced the discovery with strace /usr/sbin/oracleasm-discover, which clearly shows that we have no permissions on the ASMLib device files:
write(1, "Using ASMLib from /opt/oracle/ex"..., 61) = 61 openat(AT_FDCWD, "/dev/oracleasm/.query_version", O_RDWR) = -1 EACCES (Permission denied) openat(AT_FDCWD, "/dev/oracleasm/iid", O_RDONLY) = -1 EACCES (Permission denied) write(2, "asm_version() failed with code 2"..., 33) = 33 exit_group(2) = ? +++ exited with 2 +++
We checked /dev/oracleasm and indeed found that the files there have the wrong user and group:
$ ls -lasi
total 0
18446638755664080616 0 drwxr-xr-x.  4 root root    0 Sep 23 06:50 .
                   3 0 drwxr-xr-x. 23 root root 5260 Sep 23 06:51 ..
18446638755664098920 0 -rwxrwx---.  1 root root    0 Sep 23 06:50 .check_iid
18446638755664086952 0 drwxr-xr-x.  1 root root    0 Sep 23 06:50 disks
18446638755664083432 0 -rwxrwx---.  1 root root    0 Sep 23 06:50 .get_iid
18446638755664094696 0 drwxrwx---.  1 root root    0 Sep 23 06:50 iid
18446638755664099624 0 -rwxrwx---.  1 root root    0 Sep 23 06:50 .query_disk
18446638755664071464 0 -rwxrwx---.  1 root root    0 Sep 23 06:50 .query_handle
18446638755664090472 0 -rwxrwx---.  1 root root    0 Sep 23 06:50 .query_version
They should be owned by the Oracle software owner and the DBA group. After checking the oracleasm log /var/log/oracleasm, we found the following error:
Thu Sep 23 08:12:28 Creating /dev/oracleasm mount point: /dev/oracleasm
Thu Sep 23 08:12:28 Creating /dev/oracleasm mount point: /dev/oracleasm
Thu Sep 23 08:12:28 Loading module "oracleasm": oracleasm
Thu Sep 23 08:12:28 Loading module "oracleasm": oracleasm
Thu Sep 23 08:12:28 Configuring "oracleasm" to use device physical block size
Thu Sep 23 08:12:28 Configuring "oracleasm" to use device physical block size
Thu Sep 23 08:12:28 chown: invalid user: 'oracle:dba'
Thu Sep 23 08:12:28 Mounting ASMlib driver filesystem: failed
Thu Sep 23 08:12:28 Mounting ASMlib driver filesystem: failed
Thu Sep 23 08:12:28 Unable to fix ASM driver permissions
Thu Sep 23 08:12:28 Unable to fix ASM driver permissions
Note line 7: it seems the oracle user is not valid! But I am logged in as oracle?
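The 'oracle:dba' pair that the service tries to apply comes from the ASMLib configuration itself (set when ASMLib was configured), not from anything in the unit file. If in doubt, the current settings can be printed by running the configure command with no arguments; the output below is a sketch of what we would expect on this cluster, not a capture from it:

# /usr/sbin/oracleasm configure
ORACLEASM_ENABLED=true
ORACLEASM_UID=oracle
ORACLEASM_GID=dba
ORACLEASM_SCANBOOT=true
...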
Checking the users, we see that oracle is not in the /etc/passwd file, so we must be getting it from somewhere else.
So we checked /etc/nsswitch.conf and found that we also get user information from the System Security Services Daemon (sssd):
passwd: files sss
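So the oracle account is served by sssd (for example from LDAP or Active Directory) rather than by the local passwd file. An easy way to see the difference is to compare a local-file lookup with a full NSS lookup; the uid/gid and GECOS values below are placeholders, not our real ones:

$ grep '^oracle:' /etc/passwd
$ getent passwd oracle
oracle:*:54321:54321:Oracle software owner:/home/oracle:/bin/bash

The grep returns nothing, while getent resolves the user through NSS, which in this setup only works once sssd is up.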
Checking the services, we see that they started at around the same time:
# systemctl status sssd
- sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2021-09-23 08:12:28 EDT; 46min ago
...

# systemctl status oracleasm
- oracleasm.service - Load oracleasm Modules
   Loaded: loaded (/usr/lib/systemd/system/oracleasm.service; enabled; vendor preset: disabled)
   Active: active (exited) since Thu 2021-09-23 08:12:28 EDT; 46min ago
...
This could be our cause, and it would explain why the problem is intermittent. If oracleasm starts before the sssd service, it cannot resolve the oracle user when it tries to change the ownership of the ASM device files, and the mount fails; if oracleasm happens to start after sssd, the user details are available and the clusterware starts normally.
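One way to confirm the ordering for a given boot, rather than eyeballing systemctl status, is to interleave both units' messages from the journal for the current boot (journaling is on by default on RHEL 8):

# journalctl -b -u sssd.service -u oracleasm.service --no-pager

On a bad boot the oracleasm.service start-up messages appear before sssd reports that it is up; on a good boot they appear after it.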
So we added an After clause (line 3) to the /usr/lib/systemd/system/oracleasm.service file:
[Unit]
Description=Load oracleasm Modules
After=sssd.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/oracleasm.init start_sysctl
ExecStop=/usr/sbin/oracleasm.init stop_sysctl
ExecReload=/usr/sbin/oracleasm.init restart_sysctl

[Install]
WantedBy=multi-user.target
This makes sure that the sssd service starts before oracleasm.
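Two small notes on this change. First, systemd only picks up unit-file edits after a daemon reload, so run systemctl daemon-reload (or reboot) before the next test. Second, we edited the packaged file under /usr/lib/systemd/system, which a future oracleasm package update could overwrite; the same ordering can instead be added as a drop-in override, which survives updates. A minimal sketch, with the drop-in file name after-sssd.conf chosen arbitrarily:

# mkdir -p /etc/systemd/system/oracleasm.service.d
# cat > /etc/systemd/system/oracleasm.service.d/after-sssd.conf <<'EOF'
[Unit]
After=sssd.service
EOF
# systemctl daemon-reload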
This resolved the issue and the clusterware comes up normally every time now.
With the fix in place, the /dev/oracleasm files look like this:
$ ls -lasi
total 0
18446638755664080616 0 drwxr-xr-x.  4 root   root    0 Sep 23 09:50 .
                   3 0 drwxr-xr-x. 23 root   root 5260 Sep 23 09:51 ..
18446638755664098920 0 -rw-rw----.  1 oracle dba     0 Sep 23 09:50 .check_iid
18446638755664086952 0 drwxr-xr-x.  1 root   root    0 Sep 23 09:50 disks
18446638755664083432 0 -rw-rw----.  1 oracle dba     0 Sep 23 09:50 .get_iid
18446638755664094696 0 drwxrwx---.  1 oracle dba     0 Sep 23 09:50 iid
18446638755664099624 0 -rw-rw----.  1 oracle dba     0 Sep 23 09:50 .query_disk
18446638755664071464 0 -rw-rw----.  1 oracle dba     0 Sep 23 09:50 .query_handle
18446638755664090472 0 -rw-rw----.  1 oracle dba     0 Sep 23 09:50 .query_version
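As a final check after each test reboot, it is worth verifying both the ASMLib side and the clusterware side directly, for example (run as root, with the Grid Infrastructure environment set so that crsctl is on the PATH):

# /usr/sbin/oracleasm-discover 'ORCL:*'
# crsctl check crs
# crsctl query css votedisk

oracleasm-discover should now list the ASM disks instead of failing with code 2, and crsctl query css votedisk should show the voting files that CSSD could not find before.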
Thank you for reading my post! For more like this, please subscribe at the top of the page.