Nightmares
And the winner of our contest is...
In spring of 2002 I was called in by a Company to help them with their websites (I wasn't an employee at that time, I had a small shop of about 10 guys doing web development/business systems).
The issue? Their Host went bankrupt and they had 4 days to move 100+ sites with 50+ pages each off their servers before the plug was pulled.
If this didn't go right, 80,000-plus doctors around the country would be very unhappy - they use the sites to look for Continuing Medical Education seminars, register, make payments, keep track of their certificates, etc. - and the Company would lose a LOT of money.
So - no big deal, right? Just access the servers and pull the code off and put it on our servers.
SORRY - the Hosting Vendor of the Company actually outsourced the hosting to another vendor whom they hadn't paid and they refused to give us access at all! With the short amount of time that we had, I pulled my entire team together to go to every page, save as html, fix the image src tags and everything else and rebuild the sites by hand as static HTML (no database driven content). We got their sites back up in 1/2 day before the plug got pulled, and then a week or two later launched a Content Management system for them to be able to keep the content up to date.
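For readers curious what "fix the image src tags" looks like when scripted rather than done by hand, here is a minimal sketch. It is not what the team actually used (the story says the work was manual). It assumes each page has already been saved as an HTML string and that the images have been copied into a local folder. The function name `localize_img_srcs` and the `images` folder are hypothetical, chosen only for illustration.

```python
import posixpath
import re
from urllib.parse import urlparse

# Matches the src attribute of an <img> tag, capturing the quote character
# so the same quote style can be written back out.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)


def localize_img_srcs(html, image_dir="images"):
    """Rewrite every <img src> to point at a local copy under image_dir.

    image_dir is an assumed layout: all images live in one flat folder
    next to the saved page.
    """
    def rewrite(match):
        prefix, quote, src = match.groups()
        # Keep only the file name from the original URL or path.
        filename = posixpath.basename(urlparse(src).path)
        if not filename:
            # Nothing usable in the src (for example, it was empty): leave it alone.
            return match.group(0)
        return f"{prefix}{quote}{image_dir}/{filename}{quote}"

    return IMG_SRC.sub(rewrite, html)


page = '<p><img alt="logo" src="http://old-host.example/assets/logo.gif"></p>'
print(localize_img_srcs(page))
```

A regular expression is enough for a one-off rescue like this, but it will miss unquoted attributes and treats similarly named attributes such as `data-src` the same way, so a real HTML parser would be the safer choice for anything longer-lived.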
This took a team of six 72+ hours of work. Non-stop. Three all-nighters in a row, interspersed with Colin Powell 20-minute power naps.
Needless to say, the Company was very happy we could rescue them, and now that I've moved on from owning my own company they have very happily employed me.
-A horrifying tale of nightmarish servers and HTML brought to you by a Mr. Anderson
And the runners-up are...
Several years ago at a former employer, I was lucky enough to be onsite working, when what initially appeared to be a sprinkler system malfunction began to drench the data center. We rushed inside with umbrellas and plastic sheeting to cover the servers - and more importantly the UPS battery units - to give us enough time to shut down and power everything off. In the process I became soaked, and risked electric shock, even in the raised floor environment, from the rapidly ponding water. Thankfully, nobody received any shocks, and only a few machines were eventually lost, due to our quick action to power them down as gracefully as possible.
Then, to my dismay, we discovered the water was not coming out of the relatively clean fire suppression system, but rather from a large pipe two floors above, which had been skewered by a runaway forklift. The really scary part was that the building was a 50+ year old semiconductor fab, with unknown and possibly carcinogenic chemical spills built up over the years in the crawl spaces and crevices through which the leak cascaded. So, I got to take a "Silkwood" style emergency shower immediately afterward and hope for the best. I'm still waiting and hoping "it's not a tumor" as the years go by...
-Our tale of watery demise and resurrection is brought to you by the mysterious Gene Kelly
I think that the worst nightmare is the one that you know will come, and it will repeat all over again.
I work at a large bank. As a "modern" bank, it lets customers access their accounts over the Internet. This is hosted in an application with a balanced architecture of about 25 servers (the frontend). Those servers access a mainframe that runs the accounting info (the backend). The problem (or nightmare) arises on the last days of the month, when customers want to check whether they have received their salary. Every month there are performance issues in the mainframe, so the frontend retains connections.
Imagine the staff at "rush hour" continuously shutting down and restarting the application and web servers running in the frontend, trying to keep the connection count down. To make things worse, that makes the managers think that the "fault" is in the frontend!
But when the rush hour ends, the nightmare continues. Every frontend machine has its own independent application server instance. No clustering and no cross domain. Thus, the staff has to manually connect to each web and application server on each frontend machine to check that everything is normal again... until the next day or the next month...
After one of these nightmares, the boss ordered us to open a case with the application server vendor. He told us that he needed a support engineer in person ASAP. Then began the nightmare for this engineer. He could not attach his notebook to the corporate network because of "security reasons", so he had to use a staff workstation. But he wasn't allowed to connect to the Internet from the corporate network. Thus, he had to manually "cut and paste" from the workstation to his notebook, which was connected to the Internet through a mobile phone. Otherwise, he had to diagnose the problem without a connection either to the servers or to the vendor labs. As you may imagine, the problem wasn't really with the application servers, so the engineer could only recommend some tuning... And the nightmare came back at the next end of month.
-A tale of shocking frontend madness from Green Teddy
One of the first real experiences I had with an IT nightmare was when I worked for an educational website company that hosted over 30 different websites. My job was scary enough: I was a Jr. Sys admin working in a Network Operations center that was responsible for monitoring 452 machines in a data center, 24x7. Eek.
The company was in the process of moving to a new location in about a month, so to prepare for this, all of us were put up in other buildings around the complex as the main building we used to work out of was being renovated. The main data center was a short walk away so if there was a problem that required hands on tweaking, we had to run over there to fix it.
One night while I was working a 12 hour shift from 6pm to 6am, something went wrong. I started to notice some of our websites weren't allowing people to log in. Some Senior sys admins discovered that one of the two machines that ran the web portal for a certain site was down. So 3 of us ran to the data center to see what was up.
The two machines in question were named Smokey and Bandit. Well, Smokey lived up to his name. When we found Smokey, he was smoking, literally. Obviously something got fried. So we unplugged him, loaded him on a cart and prepared to take him back to our building. As we opened the door, we realized the powers that be just wouldn't make this any easier for us. It's now pouring rain with thunder and lightning outside. Great. So we cover up Smokey with plastic bags and attempt to try this again. All 3 of us are getting soaked as we are trying to steer the small cart in the pouring rain down a steep hill.
We eventually get Smokey back to the building. We are all freezing now since we are all wet and the air conditioning is on. Smokey is eventually packed up in a box and sent back to the manufacturer for repair. For the next 2 days, certain key websites had problems with logins, hanging pages and timeouts. Customers were complaining that they couldn't get onto the site and everyone's nerves were shot. The constant beeping of the monitoring software was making us all delusional.
So the day arrives where we get Smokey back. As I take off the cover to the machine and a Senior Sys Admin looks inside the box to check out the work that was done, his screams echo down the hallway as the real horror story began: They replaced the WRONG parts in the box! So we have to ship him out again!
In the meantime, the smart computer geeks at this company found a temporary solution to the problem. But I'll never forget such craziness as finding smoking machines, getting soaked in the rain, and dealing with the chaos that soon followed in the next few days.
-A macabre tale of the elements from our friend Beans
That day was difficult to forget. Winter, early January, and a new challenge faced me. My boss told me to join the on-site people we had for a very important customer. My mission was to provide help and knowledge to improve the customer's IT department. As an experienced IT Consultant, it didn't seem too complicated...
After a few days I had a good feel for the infrastructure and all the things to improve. I also had a feel for the customer, who was a very strange person to say the least. I will call him Jekyll or Hyde.
One of the first things that impressed me was that the customer had hundreds of machines but lacked a proper monitoring system. Jekyll relied on a very old system whose natural state was down. With this scenario, it was easy to suggest an improvement plan to address this situation. I surfed the web, asked friends, and we discovered a small but brave company called Histeric [Editor's note...I believe they may mean Hyperic here, or maybe he's just being very witty] and a very interesting product. And it was cheap!
It was like heaven. We had thousands of metrics from the systems that provided the most critical services. Now we had tools to solve incidents, arguments to reject incidents that were assigned to us incorrectly, historical data, automatic inventory, etc. That is, a totally new world.
Jekyll was pleased with the new software and all the information it provided for free. One morning I met an old mate. I knew him from a project several years ago, so I went to greet him. Suddenly I felt something like a laser beam aimed directly at the back of my head. I turned around and there was Hyde, staring at me with a look on his face that would have scared even the Sasquatch itself... Ten minutes later, I was told to go to Hyde's cubicle and the first phrase was: "Would you like coffee and cookies?". "What?" I said... "Yes, for you and your friend. This is not a meeting place." This is very weird, I thought. Not so.
I was delighted with this new software and the development team behind it, especially a very supportive guy called Mr. Single. Hyde perceived my joy as some hidden personal interest in the tool, and the software that one day pleased Jekyll was forbidden by Hyde. Totally forbidden, every trace of the product had to be erased.
Well, that's life. Several months later we had a very important problem to diagnose, but theoretically we only had our museum software. Jekyll told us to get metrics, to use that tool from the Histeric guys. "Err... you told us to uninstall it." "What? I provide you with monitoring tools and you do not use them?" he said. My understanding of the situation was below zero. Luckily, I had left Histeric running in a clandestine way on the troublesome machines, and I opened the Histeric Dashboard. "Very good, you have to use my tools, that's why I have bought them", Jekyll said.
Hyde does not like the tool at all, but Jekyll loves it. So do I. The problem is telling the two of them apart.
-A Frightening yarn about a real life Jekyll and Hyde brought to you by a very clever Mr. Brokenturnip
About 10 years ago, I was working on a new customer CRM portal, using commercial software as a base, for a company that supplies electric wire. The company had a number of plants around the country, several of which had labor strikes going on. Part of the CRM portal included developing an Availability To Promise functionality that looked up inventory in the various plants. During development and pilot of this, we used terminal services to remotely access various servers in the plants and headquarters.
Inconveniently, during a maintenance window where we were upgrading the commercial software, one of the disgruntled employees at one of the plants discovered that the port was open to terminal services and tunneled into our main servers. This employee deleted dlls and config files and renamed key executables to create general chaos in the test environment and the production pilot. Since the upgrade was literally mid-flight, this became an even worse game of restoring the system. Restoring from backup set us back two days, costing us the work already done to prepare and perform the upgrade. Naturally there was a review the next week, so we were racing for deadlines.
In addition to messing with our CRM system, this person also managed to find the directory where key graphics for the corporate website were located, and replaced the boring corporate gifs with porn. Needless to say, this was the single worst 72-hour day I ever had to recover the installation and return operations - sans terminal services access - to normal. To add insult to injury, with terminal services turned off I needed to work directly in the server room to finish the upgrade - which meant 3 days in a freezing, noisy server room - with my server being RIGHT next to the air conditioning vent.
-A chilling tale brought to you by Elizabeth