Earlier this year, I wrote an article entitled The IBM Power Trap, describing how a series of decisions about how you manage your applications and your IBM Power server can leave you stuck. A few people reached out to me with questions, so I thought it might be helpful to share a real-life story about one of our customers, whom we will call Glass Corp International (GCI), who got caught in this trap and had to be rescued.
From Boom Time to Bust
With their busiest weekend of the year just days away, GCI, an Ohio-based company, had all their production lines running to ensure they would have plenty of stock available. Incoming orders passed through their ERP (Enterprise Resource Planning) system, hosted on an IBM Power 5 server running V5R3, where they were broken down into raw-material requirements and work orders for each line. Everything was running smoothly, until it was not. When GCI's Power server failed, the IT team sprang into action. They had prepared for this scenario with an extended support agreement for their aging hardware, so a quick call to their hardware support vendor was expected to bring a fast resolution. But after multiple unsuccessful repair attempts, concerns began to mount as to whether the server could be fixed in a timely manner. As days passed, GCI realized they had "all their eggs in the V5R3 repair basket" and needed to investigate other solutions. On Friday, GCI reached out to Racksquared to inquire about further assistance.
After some brief conversations and a short drive, GCI associates arrived at Racksquared with their Power server in hand, hoping someone might be able to correctly diagnose the problem and implement a fix. But once again, given the age of the hardware and the nature of the failure, a repair was not in the cards. At that point, the decision was made to rebuild GCI's LPAR in the Racksquared IBM Power in the Cloud. With the age of the OS (V5R3) and the state of the hardware, there was no smooth, graceful migration path, so the team decided to go straight to V7R4 and hope for the best. A few hours later, with the LPAR configured, it was time to recover GCI's data from their cloud backup service provider. That process was not fully automated and had never been tested, which led to new problems. When the call went out to the service provider for assistance, there was no answer. It turned out the company had decided to start the holiday weekend early, and the one technician with the required expertise was on the golf course enjoying his long weekend. Hours later, contact was made with the technician, and the data was retrieved and loaded onto the V7R4 server. With fingers crossed, the system was started and, to everyone's relief, it worked! In just over 36 hours from initial contact with Racksquared, GCI had been recovered from a stand-alone Power 5 server running V5R3 to a cloud-hosted Power 9 system running V7R4, and their business was up and running. What a relief for GCI!
Things to be Learned from GCI
The GCI disaster epitomizes the saying "the best laid plans of mice and men": no matter how good the plan, it does not guarantee success. GCI had taken steps to mitigate their risk, but their disaster recovery plan had never been fully tested. With an older hardware platform, they invested in extended hardware support, and they faithfully performed offsite backups of their data. What they hadn't done was clearly define their Recovery Time Objective (RTO: how quickly they need to be back up and running) or their Recovery Point Objective (RPO: how much data they can afford to lose), and then test their disaster recovery plan against those objectives. Let's take a look at other lessons to be learned from GCI.
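To make RTO and RPO concrete, here is a minimal sketch of the kind of back-of-the-envelope check a DR test should produce. All the numbers here are hypothetical, chosen only to illustrate the arithmetic: the measured recovery time is the sum of the durations of each recovery step, and the worst-case data loss is bounded by the backup interval.

```python
from datetime import timedelta

# Hypothetical objectives (illustrative values only, not GCI's actual figures).
rto = timedelta(hours=8)    # maximum tolerable downtime
rpo = timedelta(hours=24)   # maximum tolerable data loss

# Hypothetical step durations measured during a DR test.
recovery_steps = {
    "provision replacement LPAR": timedelta(hours=4),
    "reach backup vendor over a holiday weekend": timedelta(hours=20),
    "restore and verify data": timedelta(hours=12),
}

# Total recovery time is the sum of the individual steps.
measured_rto = sum(recovery_steps.values(), timedelta())

# With nightly offsite backups, worst-case data loss is one backup interval.
backup_interval = timedelta(hours=24)

print(f"Measured recovery time: {measured_rto} (objective: {rto})")
print(f"Worst-case data loss:   {backup_interval} (objective: {rpo})")

if measured_rto > rto:
    print("RTO missed: recovery takes longer than the business can tolerate")
if backup_interval > rpo:
    print("RPO missed: more data could be lost than the business can tolerate")
```

The point of running a test like this annually is that numbers such as "20 hours to reach the backup vendor" only surface when you actually walk through the plan, not when you write it.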
Install OS and Application Upgrades
Saving time and money by skipping OS or application upgrades is a formula for eventual disaster, as GCI found out. They were extremely fortunate that their custom application "just worked" when installed on a V7R4 system. Had they been running a third-party application, they might not have been so lucky. And what if that third-party application provider were no longer in business to issue a license key for the new system? This disaster would not have ended so well.
Extended Support may not Save the Day
Extended support is often the quick-fix solution companies reach for when hardware or software goes out of support. Honestly, it's not a bad short-term option while you plan an upgrade or a migration off the platform, but as a long-term solution, don't bet the business on it. Most extended support contracts contain caveats that limit the vendor's responsibility for providing a fix, or for doing so in a timely manner. The older your hardware, the harder it will be for your service provider to find replacement parts. GCI learned this when replacement parts never materialized from their service provider.
Sometimes you get Exactly what you Pay for
Budgets are tight, and 24x7x365 service agreements can be expensive, leading companies to purchase a less expensive business-hours-only agreement. This might seem like a sound money-saving strategy, but when the perfect storm hits, it can leave you without support for a long time. Take GCI's scenario, where disaster struck at the start of a holiday weekend. With a standard business-hours support model, they could have been down for almost four days before the vendor would even look at their problem. If you can't afford an extended outage, those cost savings don't matter. The compounding issue GCI faced was that their vendor had only one person on staff able to assist them, and he was golfing. Make sure you understand exactly what you are buying!
Create and Test your Disaster Recovery Plan
GCI knew their IBM Power server was a critical part of their business, and as a result they developed a DR plan. Where they fell short was in testing that plan. All the components for a quick recovery were in place, and had everything worked as planned, I wouldn't be telling this story. But the story continues. Because GCI had never fully tested their plan, the "gaps" became apparent only at execution time, creating delays that added days to their recovery. Fortunately, they had some manual workarounds, so the business was not completely down, but those manual processes had an adverse impact on production and product delivery. Failure to test is not unique to GCI; at Racksquared we frequently onboard customers who opt out of testing their DR plan. These tests are built into their price, so they are already paying for them, yet they still fail to take advantage of them. I highly recommend an annual DR test exercise to work out the "kinks" before you actually need the solution.
How Racksquared can Help you
Racksquared Data Centers is a full-service data center offering everything from colocation to fully managed IT solutions. We specialize in IBM Power solutions, including colocation, cloud backup, disaster recovery, IBM Power in the cloud, and iSeries administration services. Our customers gain access to IBM technical expertise, the latest IBM technologies (servers and SANs), and a secure, reliable, resilient environment for their IT systems. We become an extension of your team, providing flexible designs and solutions while managing, monitoring, and supporting your infrastructure, so that your business is always up and running.