Sorry for the extended outage. I had a self-inflicted problem that took most of the day to resolve. Hopefully things will be stable now. We are running on all new hardware and upgraded software. Let's hope it is worth the trouble.
|
Thanks for the message, Jeff. It is hard when we dcum junkies can't get our fix. I barely knew what to do with myself. (only partially kidding) |
I hope the main forum comes back soon. |
Sorry about the unexpected outage this morning. If you are interested in the gory details, they are on the home page.
|
I read the home page. Did network traffic from the DB cluster knock over your firewall? Sometimes a route will send everything out and back. |
This is a possibility, but not the most likely one. The firewall was reporting scans from the Internet. Of course there is nothing new about that, but there could have been something special about that traffic that choked the firewall. My best guess is that there is no relationship between the firewall issues and the cluster issues. But the coincidence of timing is hard to ignore. BTW, I am going to solve the cluster traffic problem by running a direct cable between the data nodes. I added the cable today and the nodes can communicate over it. But, at the moment, the cluster is not using that route. I'm still trying to fix that. |
Yup, a crossover cable can work and it's clean if you have the NICs to spare. That should eliminate the effects of load, routing, and latency in one shot. I'm wondering if the scans are really user connections timing out. During the outage, I wasn't going straight to the outage page. A lot of pages were hanging for quite a while. Maybe the app servers aren't playing nice with the cluster DB connections and that bogged everything down, leaving lots of connection requests hanging at the firewall. |
This is a real possibility. The servers were full of connections in time-wait state. |
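For anyone who wants to check for the same symptom on their own servers, here is a rough sketch of tallying TCP connections by state from `netstat -ant`-style output. This is not what was run on the site; the function name `count_tcp_states` and the sample addresses are made up for illustration, and it assumes the state sits in the sixth column, as it does in typical Linux `netstat -ant` output.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Tally TCP connection states from `netstat -ant`-style text.

    Assumes the usual six-column layout:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Only rows whose first column is tcp/tcp4/tcp6 are connections;
        # header lines and UDP rows are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


# Hypothetical sample output; the addresses are invented for this example.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:80             203.0.113.7:51234       TIME_WAIT
tcp        0      0 10.0.0.5:80             203.0.113.8:51300       TIME_WAIT
tcp        0      0 10.0.0.5:3306           10.0.0.6:40112          ESTABLISHED
"""

if __name__ == "__main__":
    for state, n in count_tcp_states(SAMPLE).most_common():
        print(f"{state:12} {n}")
```

On a live box you could feed it the real output of `netstat -ant` instead of the sample text. A TIME_WAIT count that dwarfs ESTABLISHED would be consistent with lots of short-lived connections being opened and closed, though it does not by itself say which side is at fault.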