Today's Problems

jsteele
Site Admin
One thing that I've found from running this site is a lack of understanding of what is involved in keeping things going and a divergence of opinion about how it should be done. Having just had an unplanned "really f'd up day", I'm inclined to wax poetic on the topic.

Because Sundays are normally fairly slow, I decided to do some upgrades on the servers. We have a cluster of two web servers and another cluster of two database servers. There are also a few random servers that do this and that, but they aren't important -- regardless of their opinions on the matter -- for this discussion. I try to have every service duplicated so that we can always fail over to a backup. But, in reality, that is very difficult. Today, after some firmware upgrades to the primary web server, the server's network interface cards (Ethernet connections) went out. That server could not connect to the network and for all intents and purposes became a very large and not particularly efficient paperweight. That's when my blood pressure nearly blew my eyeballs out of their sockets.

I had already failed over everything (so I thought) to the second member of the web cluster. However, the authentication server had not been clustered and was unavailable. As a result, nobody could log in. Not even me. Then, because whatever higher beings determine our fate decided that I needed a good kick in the guts, the application server on the front end died. That meant that we had no home page. Look at the bright side: no need to worry about the inability to log in when you can't get to the home page, which is the only page that allows you to log in.

But, I'm a hardcore computer geek who has been through a bunch of tough times. I have backups. Do what you want to me. I'll recover from a backup. I guess I didn't mention that while I was installing those firmware upgrades that caused my primary server to henceforth identify as a paperweight, I decided to run an OS upgrade on the server that does my daily backups. Yes, there was that. Little did I know that the upgrade changed the database that was being used. So, I have a backup program that without fail conducts backups. That program is more faithful than my dog (and I don't even have to feed it). But, it relies on the database that was just replaced with an incompatible one. So, I have no ability at this time to recover the backups. And, did I say that I can't even log in to my own website?

This is the point where I really feel like I can share some advice. Despite my best efforts, everything has gone wrong. How do you confront this situation? I focused on what was important. For me, that was being able to manage the website. I needed to be able to log in. Therefore, I created an entirely new version of the forums that used a different authentication facility. This version was only available to me. That allowed me to log in and manage things as necessary. Then, I focused on the backup server. I had to play a little fast and a little loose, but as long as this is just between friends, I managed to trick the backup program into running. This sort of thing doesn't show up on certification exams, which is probably why I'm not certified for anything (at least that I'll admit). Once I could recover my backups, I was home free.

The lesson is: focus on what is important and have good backups.

I recovered all that I needed from my backups and got things running in respectable, if not perfect, form. The other problems will be solved later. But, if anyone wants a giant paperweight, I have one available at an attractive price.

Anonymous
Well done for wrestling the beast back into shape.

I was worried you'd been severely hacked. Glad that's not the case.
Anonymous
This may call for a Festivus form.
Anonymous
Anonymous wrote: This may call for a Festivus form.

Wonderful airing of grievances, Jeff, and now on to the feats of strength!
Anonymous
I know there is a lot going on, and thanks for all you do, but are you aware that the report button is an issue? At least for me? iPhone 5S on the latest version of iOS. Submitting the report gets a server error.
Anonymous
Anonymous wrote: I know there is a lot going on, and thanks for all you do, but are you aware that the report button is an issue? At least for me? iPhone 5S on the latest version of iOS. Submitting the report gets a server error.


Me too. On a laptop, using Safari.
Anonymous
I love how Jeff posted a long explanation of what sounds like a very difficult day, and in the responses, there are complaints about things not working. This is amusing to me. Say "sorry you had a rough day, thanks for all you do!" here, and then start a new "just so you know I have this issue" thread.
Anonymous
I have next to no idea what you just said but I'll go with the lesson that can be applied universally:


Focus on what's important and have good backups.
Anonymous
Meant to say: Thanks! and

Focus on what's important and have good backups.
Anonymous
Anonymous wrote:
Anonymous wrote: I know there is a lot going on, and thanks for all you do, but are you aware that the report button is an issue? At least for me? iPhone 5S on the latest version of iOS. Submitting the report gets a server error.


Me too. On a laptop, using Safari.


FYI, I've been describing this same type of problem on the Home Page Down thread. I'm glad it's not just me.

Anonymous
Why are you still running on your own hardware vice going the hosted route?
jsteele
Site Admin
Anonymous wrote:Why are you still running on your own hardware vice going the hosted route?


I haven't found a hosted service that meets my requirements, and I like the control of owning my own hardware. Really the only downside to doing it this way is that I'm the only one who can fix things. If I had a backup person or two, there would really be no justification for going to a hosted service. I'll probably need to add a couple more servers to get complete redundancy, but hardware is pretty cheap these days. The server that went down ended up having its motherboard replaced. If the server that does backups had not gotten hosed on the same day, I doubt anyone would have noticed the outage. Once I was able to get access to my backups, things were running again quickly. All in all, it was not that bad considering I lost the motherboard on the primary server. Also, hats off to Dell, who replaced the motherboard within 24 hours of my requesting it.

Anonymous
Anonymous wrote:Meant to say: Thanks! and

Focus on what's important and have good backups.



(S)He who laughs last had a backup.
Anonymous
Jeff, I am NOT a tech person and didn't understand 99% of what you wrote but glad everything worked out.

When my computer crashes, I just go to my other one and then to the kids who roll their eyes at me, fix it, and then ask me to please join them in the 21st century.

Smart asses but love 'em to death (and not just because they're computer literate).
Anonymous
I didn't understand the original post but wanted to thank you all the same.