jsteele
Post 01/04/2015 10:39     Subject: Re:What was the outcome on the website performance issue

I still haven't found the root cause. Here is what I do know:

For roughly two years, the system worked relatively smoothly. In late October or early November -- I don't remember off the top of my head -- while moving a thread or deleting messages -- again, I don't remember exactly -- I received an error message about the database cluster going away. The cluster then started having very brief outages for which I could not determine a cause. Because I was going to have to restart the cluster in an attempt to clear things up, I decided to upgrade at that time. In retrospect, this was probably a bad idea because it introduced another variable and made troubleshooting more difficult.

Periodic slowness during times of high activity continued. The most obvious symptom during such times was high resource usage. While troubleshooting what might be causing the load, I discovered all kinds of unusual things occurring -- particularly extremely heavy access by a bunch of different indexing spiders. There was also a lot of "script kiddie" scanning of the site, though that wasn't new -- this has always happened. But blocking such activity never showed a positive impact, so I now think all of this was a red herring.
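
For anyone wanting to check spider traffic on their own site, here is a minimal sketch of how the requests could be tallied by user agent. It assumes Apache is writing the standard "combined" log format, where the user agent is the last quoted field on each line; the function name count_user_agents is just illustrative and is not what was actually run on this site.

```python
import re
from collections import Counter

# In the combined log format each line ends with: "referer" "user-agent"
_UA_RE = re.compile(r'"[^"]*" "([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping user-agent string -> number of requests.

    Lines that do not end in two quoted fields are skipped.
    """
    counts = Counter()
    for line in lines:
        match = _UA_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open access log file to count_user_agents() and printing the result of .most_common(10) would list the ten busiest agents, which is usually enough to see whether a handful of crawlers dominate the traffic.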

I then decided to deploy two additional servers and move components to those. During the two years that things had worked, I wasn't running in the recommended deployment configuration. When the new servers arrived, I set things up as recommended, but the slowness got even worse, causing an outage.

Eventually, I changed the deployment configuration to one that is not recommended, but actually works. Basically, we have the web tier consisting of Apache and Tomcat, the MySQL tier consisting of MySQL, and the data tier consisting of a MySQL NDB cluster. It is recommended to have all three tiers on separate hardware, or at least to have the MySQL and NDB tiers on separate hardware. I originally had all three tiers on the same servers (two servers, load balanced). When the new servers came, I moved the NDB cluster to them. However, queries from the MySQL servers on the web tier machines to the cluster were extremely slow. The same query from a MySQL server located on a cluster machine was very fast. So, I configured Tomcat to connect to the MySQL servers on the cluster machines and that's how things are running now.
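
The slow-versus-fast comparison described above can be made repeatable with a small timing harness. This is only a sketch: run_query is a hypothetical callable that would execute the same query through whichever MySQL server is being tested (using whatever client library is at hand), and time_query is an illustrative name, not part of any MySQL tooling.

```python
import statistics
import time


def time_query(run_query, repeats=20):
    """Call run_query() `repeats` times; return (median, worst) seconds.

    run_query is a hypothetical zero-argument callable that runs one
    query against the MySQL server under test.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), max(samples)
```

Running it once with a callable pointed at a web tier MySQL server and once with a callable pointed at a MySQL server on a cluster machine gives two pairs of numbers to compare. The median shows the typical case and the worst sample shows whether there are occasional stalls.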

The open question is why the MySQL servers on the web tier machines are so slow. I have eliminated all the obvious answers and am simply baffled at this point. I don't think it is an issue of network communication between those machines -- they are using 1 gig copper connected to the same switch and there is no sign of any network problem. I think it is something in the servers themselves, but probably not something included with the MySQL distribution. Rather, I suspect some software that is part of the Linux distribution I use (CentOS) was upgraded and caused the initial problems. I now think that the original slowness and high resource usage were a reflection of whatever component is at the root of the trouble. That is probably something that was updated during routine patching at some point, and I have no idea what it might be. But I think that whatever that component was, it caused the MySQL servers to start working slowly, causing the initial problem, and then caused the MySQL servers to continue working slowly even when they were connecting to other machines.
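
One simple way to double-check that the network path itself is not the problem is to time plain TCP connections from a web tier machine to a cluster machine. The sketch below uses only Python's standard socket module; the function name tcp_connect_times is illustrative, and the host and port to test are whatever the target service actually listens on.

```python
import socket
import time


def tcp_connect_times(host, port, attempts=10, timeout=2.0):
    """Return a list of seconds taken to open a TCP connection to host:port.

    Each attempt opens a fresh connection and closes it immediately.
    A failed or timed-out attempt raises OSError.
    """
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            samples.append(time.perf_counter() - start)
    return samples
```

If connections between machines on the same switch consistently open in well under a millisecond or so while queries still take far longer, that would support the view that the delay is inside the servers rather than on the wire.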

I am planning to configure another server that is currently not being used as a virtual machine host, create a bunch of virtual servers on it, and test a number of different configurations to see if I can prove or disprove my theory. But I haven't made much progress on this so far.
Anonymous
Post 01/03/2015 23:28     Subject: What was the outcome on the website performance issue

I know you were working your butt off to restore a DB and install new hardware.

Did you find the root cause? In the end, do you think it was capacity or some db corruption?