troubleshooting

Keeping Cassandra Alive

Troubleshooting Cassandra under pressure This is the second blog post in the series. This is a bit more technical than the first one. I will explain some things that can be made to keep a cluster/server running when you are having problems in that cluster. There were a lot of changes in Cassandra over the…

Using the ILOM for Troubleshooting on ODA

I worked on root cause analysis for a strange node reboot on client’s Oracle Database Appliance yesterday. The case was quite interesting from the perspective that none of the logs contained any information related to the cause of the reboot. I could only see the log entries for normal activities and then – BOOM! –…

Unexpected Shutdown Caused by ASR

In past few days I had two incidents and an outage, for just a few minutes. However, outage in a production environment is related to cost relatively and strictly. The server that had outage was because of failing over and then failing back about 4 to 5 times in 15 minutes. I was holding pager,…

Corollary to Tilton’s law: There only is the first problem.

Kenny Tilton posted about database troubleshooting, and he anecdotally illustrates and elaborates on a law of troubleshooting that I strongly agree with: Always solve the first problem. The corollary to his law is that “there only is the first problem.” I’m not sure I entirely agree with that one, but I will admit that that corollary is true at least 90% of the time, which is often enough to make it an incredibly useful insight.