Today I may have inadvertently discovered the root cause of the major problems we had back in November that caused DCUM to be down for considerable periods of time. Unfortunately, it took some outages today in order to discover the problem.
The short answer is that it appears that the primary switch we have -- a Cisco SG300 -- got into what I might charitably call a less than optimal state. The longer explanation is that we've used this switch for years without a problem. When we had the issues in November, I rebooted it at some point just to see if that made any difference, but as I recall it didn't. The switch has not been rebooted since then. All this time, it appeared to work and, in fact, did work up to a certain point.
Today I was trying to transfer a very large file from one server to another. I noticed that the transfer was very slow -- topping out at about 500 Kbps on a gigabit network. I started doing various testing and found out that while several servers were pushing 20 Mbps or so, no transfers would get anywhere near that speed. Even stranger, I found that one server's traffic was being broadcast to every other device on the switch. There are a couple of known causes for this sort of thing, but taking steps to address those issues didn't change anything. While I watched the network, things got worse and worse and then, suddenly, some ports simply stopped passing traffic. Unfortunately, one port was for a database cluster node and when it stopped communicating, it crashed.
I then rebooted the switch which, of course, caused the entire cluster to go down. When the switch came back up, several ports simply wouldn't turn on. There was a fairly recent firmware update, so I installed that. We also moved some cables around on the switch to find ports that worked. Finally, after several reboots, all ports started working. Then, I had to get the db cluster working again. While that took time, there were no problems. When I finished, I found that the slow db queries that had been the problem back in November were now fast. So, I can probably go back to a more conventional configuration of db components. But, I'll wait a while for that.
I assume that there were bugs in the previous version of the switch firmware and I now don't know how much to trust that switch. In the network stack of firewall, switch, and servers, it is the least expensive device. So, moving to something beefier might be a good idea. |
Thank you for all your hard work. |
+1. I don't understand much of what you described but appreciate all you do for us. I don't know if you were a reader of The Dish, Jeff, but now that Andrew Sullivan has stopped blogging, I appreciate how much of a community he created. You do as well, and I'm very grateful. |
Thanks for all you do. |