I've noticed recently that the reply pages are timing out. It's not my internet connection, since it's happened many times for me in two locations on two different computers. Perhaps a bug or high-traffic issues on the forum? |
Yes, I am aware of the problem and desperately trying to find a solution. It looks like the problem is that the database is overwhelmed. I've been trying various solutions with little success.
|
Not OP, but this has been a major issue for the past couple of days. I'd say fully 3/4 - maybe more like 9 out of 10 of my posts are timing out and I'm getting error messages. Do you have an update on a solution? Don't make me find something more useful to do with my downtime!! |
I am doing my best to resolve this issue. Believe me, it is affecting me as well. I can't even post half the time. I really have no idea what the root cause might be. It seems to be related to the database. Right now, I am basically treating symptoms in the hope that I can stumble across the root cause and correct it. But, the problem only shows up during periods of high traffic. Unfortunately, during periods of high traffic, the amount of debugging information is overwhelming. So, it's a bit of a Catch-22. I made a couple of changes yesterday that I hope will help and will make a couple more tonight. But, this is really driving me crazy.
|
It's my impression that the problem is not load related, based on the time of day and some checks of the online users. I see long pauses at times I would not expect a lot of traffic. So even if the transactions complete at off-peak times, the underlying symptom appears under low load.
The immediate things that come to mind are the total size of the database, an app holding open idle db connections, or whatever you changed a few months ago, because I would put the problem's start sometime in November or December. It doesn't feel like a query problem or a problem rendering the results, although occasionally we see posts that show raw HTML when displayed, which is odd. I know that's not much help because those are pretty obvious, but it's really all I can observe from here on the other side of the looking glass. |
Thanks. I'm basically collecting debugging information around the clock. I have been seeing the following issues:
1) Errors generated when a nonexistent page is requested. Because of the way the forums are written, this doesn't simply generate a "page not found" error like it normally would. Instead, it generates a Java exception. I had been ignoring these, but because they were filling the log files I started to pay attention. I found that the vast majority either come from spambots or Google's spider. I have no idea why the GoogleBot is trying to index nonexistent pages -- and these aren't pages that once existed but were deleted. They are pages that never existed. Spambots tend to make requests for various php pages that are unlikely to exist on a Java application, to say the least. I think the sheer traffic generated by these sources added to the problem and I've taken steps to stop it. While I was typing this message I saw another of these errors, but it was caused by the BingBot. It tried to retrieve page 28 of a 27 page discussion. The last post in that discussion was in May 2010 and it has never had a page 28. So, why would the BingBot be trying to retrieve it? It makes no sense.
2) Search-related errors. This is a weird one because it appears that the search system simply crashes very frequently. Yet, I've never had one single complaint about the search crashing -- just a great many complaints that the search system sucks. I can't reproduce the issue myself and I've never seen it with my own eyes. I just see the errors in the logs. I downgraded the search engine this morning in order to see if the errors go away. So far, I haven't seen any of those errors.
3) DB connection pool timeouts. This is what I believe is the most important. However, it's also the one that only shows up under heavy load. I have been tweaking pool and db settings for days trying to find a combination that works. So far, no luck.
4) Search indexing. I think this accounts for the freezes during non-peak hours. When the search indexer starts running, system load goes up. Then, I see a lot of dropped connections as users react to frozen pages and hit "reload". Then, when the indexer finishes, all sorts of connections come flying in at once, which leads to some of the db issues described above. I have also made a couple of changes to the indexer to see if that fixes things. I unexpectedly stumbled across a change that I had made in 2010 that may have contributed to this issue. I reversed that change, so hopefully this will be less of an issue.
I still think there is something else that I am not seeing that is the root cause. If so, I will find it but it might take some time. |
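On item 1, the usual way to keep bot requests for nonexistent pages out of the exception log is to validate the requested page number before touching the database and answer with a plain 404. The following is only a minimal sketch in Python, not the forum's actual Java code: the names `page_count` and `resolve_page` and the page size of 15 posts are assumptions made up for illustration.

```python
import math
from typing import Optional, Tuple

POSTS_PER_PAGE = 15  # assumed page size, not taken from the forum software


def page_count(total_posts: int, per_page: int = POSTS_PER_PAGE) -> int:
    """Number of pages in a discussion; an empty discussion still has page 1."""
    return max(1, math.ceil(total_posts / per_page))


def resolve_page(
    total_posts: int, requested: str, per_page: int = POSTS_PER_PAGE
) -> Tuple[int, Optional[Tuple[int, int]]]:
    """Return (http_status, (start, end)) for a requested page.

    Anything that is not an integer inside 1..page_count gets a 404
    instead of falling through to the query layer and raising there.
    """
    try:
        page = int(requested)
    except (TypeError, ValueError):
        return 404, None  # e.g. a spambot asking for "index.php"
    if page < 1 or page > page_count(total_posts, per_page):
        return 404, None  # e.g. page 28 of a 27 page discussion
    start = (page - 1) * per_page
    end = min(start + per_page, total_posts)
    return 200, (start, end)
```

With 405 posts at 15 per page (27 pages), `resolve_page(405, "28")` returns a 404 without doing any further work, while `resolve_page(405, "27")` returns the offsets of the last page.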
At some point in the recent past, I noticed that you started (?) preventing people from making multiple posts in too short a time. I run into it when I post, realize that autocorrect has turned me into a total idiot, and try to post a correction.
Is it possible that this change introduced a bug (like not letting go of a connection), or is it something you introduced in order to fix the existing issue? |
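To make the "not letting go of a connection" idea concrete, here is a minimal sketch in Python of how an early return in a posting check could leak pooled connections until every later request times out. Everything in it is hypothetical (`TinyPool`, `leaky_post`, `safe_post`, the tiny pool size and timeout); it says nothing about what the forum software actually does.

```python
import queue
from contextlib import contextmanager


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the timeout."""


class TinyPool:
    """Toy fixed-size pool; strings stand in for real db connections."""

    def __init__(self, size: int, timeout: float = 0.05):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")
        self._timeout = timeout

    def acquire(self) -> str:
        try:
            return self._free.get(timeout=self._timeout)
        except queue.Empty:
            raise PoolTimeout("no free connection") from None

    def release(self, conn: str) -> None:
        self._free.put(conn)

    @contextmanager
    def connection(self):
        conn = self.acquire()
        try:
            yield conn
        finally:
            self.release(conn)  # runs on every exit path


def leaky_post(pool: TinyPool, too_soon: bool) -> str:
    conn = pool.acquire()
    if too_soon:
        return "rejected"  # BUG: this path never releases conn
    pool.release(conn)
    return "posted"


def safe_post(pool: TinyPool, too_soon: bool) -> str:
    with pool.connection():
        if too_soon:
            return "rejected"  # the context manager still releases
        return "posted"
```

With a pool of two, two rejected posts through `leaky_post` are enough to make the next `acquire()` raise `PoolTimeout`, whereas `safe_post` can reject any number of posts and the pool stays full. A leak like this would also fit a problem that appears suddenly after a settings change rather than growing gradually.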
There are a lot of db lock errors. It looks like the db needs to be upgraded. |
Is it possible that you're being hacked? Maybe some hacker group wants to expose all of our personal secrets matched with our IP addresses? Ha ha. That's funny, right? Funny because it can't happen... right? |
Actually, this is a good point. The forum software has a built-in mechanism to combat spammers. I had forgotten about that, but some time ago (within the past couple of months) I did adjust it. Let me go take a look at that. |
The db lock errors are the symptom of other problems. You may be correct that the database needs to be upgraded, but this problem showed up all of a sudden. If it were just a problem of load growing beyond capacity, I would have expected it to be more gradual. I am focusing on the database -- for obvious reasons -- as the culprit here. |
Being hacked has been one of the possibilities that I have considered. I am a pretty good system administrator, so it is unusual that I cannot find the problem even after all this time. At this point, I'd almost be happy to discover a hacker. I doubt they are after any personal information, and it would be pretty difficult, if not impossible, for them to get it. But, hackers do a lot of things just for fun. However, I'll stress that, having considered the possibility of a hack being the cause of this, I took steps to catch him if he is there. So far, there is no evidence of a hacker. That's either good if it means there is no hacker, or bad if it means he is a hell of a good hacker. |
Do you have historical statistics on post completion time (for example, from monitoring with automated transactions)? It may be that transaction times were creeping up and you just reached the point where timeouts started happening. |
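If such timings exist, one simple way to check for a creep is to compute a high percentile per day and flag the days that come close to the timeout. This is a minimal Python sketch under stated assumptions: the function names, the 30 second timeout, and the 80% headroom are invented for illustration, and the input is assumed to be a mapping from day to a list of completion times in seconds.

```python
from statistics import quantiles
from typing import Dict, List


def p95(samples: List[float]) -> float:
    """95th percentile; with n=20 the last of the 19 cut points is p95."""
    return quantiles(samples, n=20)[-1]


def p95_by_day(timings: Dict[str, List[float]]) -> Dict[str, float]:
    """Map each day to the p95 of its post completion times (seconds)."""
    return {day: p95(samples) for day, samples in timings.items()}


def days_near_timeout(
    timings: Dict[str, List[float]],
    timeout_s: float = 30.0,   # assumed request timeout
    headroom: float = 0.8,     # flag days whose p95 exceeds 80% of it
) -> List[str]:
    """Days on which the slowest 5% of posts were close to timing out."""
    limit = timeout_s * headroom
    return [day for day, value in p95_by_day(timings).items() if value > limit]
```

A gradual creep would show up as the daily p95 rising steadily toward the limit over weeks; a sudden jump from one day to the next would point to a change or an outside cause instead.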
The reason I brought it up is that the site has gotten a ton of publicity lately, with the city paper, etc. Possibly someone wants to be clever and show us vicious bitches up for caring about strollers or whatever. Keep laying the traps, Jeff! |
It looks like you changed something today. It's snappier. What was it? Or am I just getting lucky? |