Availability Group - Delay with Listener Connectivity After Failover
Tags: MySQL, Google Cloud Platform, Technical Track, High Availability, Cloud, Open Source, Google Cloud Platform (Gcp), Proxysql
The beauty of working for multiple clients from different industries, is that you get exposed to a myriad of environment setup and configuration. Every company has its own standards for Network and Server configuration, as well different hardware vendors. This introduces their own kinks and excitement to your everyday work - half of which you'll likely not encounter if you are working in-house and using the same hardware. The past week we encountered a rare and interesting issue with High Availability Group. The issue was two-fold, first it was initially not failing over automatically one one node though that was not as exciting as the second part - when it was able to failover correctly, our client was experiencing delays with the availability of the Listener Name outside its own sub-net after failover, automatic or otherwise. It is reachable within its own subnet but takes more than thirty minutes to be reachable outside of it even though the Failover happened smoothly and without error. The first part was fairly straightforward. Checking on the cluster logs and event logs the automatic failover was throwing the error below when trying to failover on one of the nodes.
The error is as it says, the Computer object does not have the appropriate permissions on the Domain to register the DNS Name Resource for the Listener. For the cluster to perform this operation smoothly "Authenticated Users" should have read/write all permissions on the Computer Object for the cluster, its nodes and the Listener DNS Name. To do this, Log in to the Active Directory ServerCluster network name resource 'Listener_DNS_NAME' failed registration of one or more associated DNS name(s) for the following reason: DNS operation refused. . Ensure that the network adapters associated with dependent IP address resources are configured with at least one accessible DNS server.
- Open Active Directory Users and Computers.
- On the View menu, select Advanced Features.
- Right-click the object and then click Properties.
- On the Security tab, click Advanced to view all of the permission entries that exist for the object.
- Verify that the Authenticated Users is in the list and has the permission to Read and Write All. Add the required permissions then Save the changes.
Ah, the often over-looked SPN. This should be part of your installation process - setting the SPN. To keep the story short and so we can get to the exciting part, you can refer here for the detailed instructions on how to configure the SPN for SQL Server. Aside from registering the SPN for each of the Nodes as specified on the MSDN Link, You'll also need to register the SPN for the Listener, as always 1433 is the port being used by your SQL Server:The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server ComputerName$. The target name used was HTTP/ComputerName.Domain.com. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Ensure that the target SPN is only registered on the account used by the server. This error can also happen if the target service account password is different than what is configured on the Kerberos Key Distribution Center for that target service. Ensure that the service on the server and the KDC are both configured to use the same password. If the server name is not fully qualified, and the target domain (Domain.COM) is different from the client domain (Doamin.COM), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.
setspn -A MSSQLSvc/Listener_DNS_NAME.Domain.com:1433 DOMAIN/SQLServiceAccountThis will enable Kerberos for the client connection to the Availability Group Listener and address the errors we received above. After configuring the SPN for the servers Automatic Fail over is now running smoothly, or so we thought. The client came back to us that it was taking some time for the application to connect to the Listener Name. Checking on the cluster logs and SQL Server's end, everything appears to be in order. No errors being thrown and Server is accessible. Now it get's interesting. Ping test within the Database sub net is successful but ping test outside of it was timing out. It was able to connect though after a substantial amount of time. After a few more test and checking the behavior is the same on both nodes. It takes more than thirty minutes for the Name to be reachable outside of the database sub net. After involving the Network Admin we found out that a MAC Address conflict is happening. That's our "Aha!" moment. Windows 2003 servers and later issues a Gratuitous ARP (GARP) requests during failover. There are some switches/devices that does not forward Gratuitous ARP by default. This causes the devices on the other end of the switch to not have the correct MAC address associated to the Name. This causes the failure. It often corrects itself when the router detects the failures and do a broadcast and gets the correct value. That's why it becomes accessible after some time. This KB details the issue. To address this, changes must be done on the configuration of the switches, you'll need to check with your hard ware vendor for this. Now, that could be the end of our problems, but after enabling the switch to forward GARP, we found out that the server itself is not sending a GARP request. This is a server configuration issue and requires some Registry changes. Open the Registry for the server and locate the key below:
From there check if there is a key for
ArpRetryCount, if there is make sure that the Value is not set to 0. The value could be between 0-3. After changing this and restarting the servers Everything works perfectly. Last two issues are a bit rare, and something I wouldn't have experienced if the client wasn't using that particular hard ware and that particular standard configuration.