Tolerant Application Hosting

IT Guy: “We had a CPU failure in one of our servers so I moved your site to new hardware.”

Client: “Thanks! When did this happen?”

IT Guy: “A few minutes ago.”

Client: “But I’ve been on the web site for the last few minutes and I never saw it go down.”

IT Guy: “It didn’t go down.  I moved it live, before the hardware crashed.”

Client: “Cool!  How’d you do that?”

One of the biggest pains in IT is a hardware failure.  A single failure can cost hours to weeks of availability, and critical data can be lost.  If the failure impacts a client’s system, it becomes an SLA issue and can strain the relationship.  But what if a severe hardware failure only affected a running system for a minute or two?  Or how about being able to move a running system to healthier or more powerful hardware without any disruption to service?

I’ve spent the last 18 months slowly building an environment that is quite tolerant of hardware failures and maintenance requirements.  The enabling technology is called vMotion, and it is available in a product called VMware vSphere 4.  The idea is that a group of very powerful servers is connected and configured into a cluster.  Onto that, we layer vSphere, then the operating system of choice.  Finally, the application is installed in its typical fashion.  The operating system is unaware that it is not running on a real server.  It is running on a virtual server, or “virtual machine,” hosted on a vMotion cluster.

The benefits of running virtual machines (VMs) in a vMotion cluster are numerous.  Here are my top three favorites.

1. Hardware failures become a non-critical issue.  If a hardware error does occur, all the VMs on that server can be evacuated to another healthy server within the vMotion cluster without disruption to service.  If the server crashes outright, the VMs are automatically rebooted on another server in the vMotion cluster.  The outage lasts only a minute or two while the VMs boot up in their new home.  This allows me to stay in bed at 3 AM and watch recovery alerts trickle in rather than having to get up and drive to the data center to revive a dead server.  And if it weren’t for the report we’d send you the next morning, you wouldn’t even know it happened.

2. The underlying hardware can be maintained or upgraded without disrupting the VMs.  Since VMs can be relocated on the fly, we simply migrate the running VMs to another server in the vMotion cluster and perform maintenance as necessary.  When the maintenance is finished, VMs automatically migrate back to the server in order to re-balance the workload in the vMotion cluster.  This significantly reduces the number of times you have to explain to your VPs that the site will be down for maintenance and helps keep me on regular daytime hours.

3. Hardware can be seamlessly added to the vMotion cluster to increase its pool of processing power and memory.  This allows scaling of applications with little or no disruption.  As soon as a new server is added to the vMotion cluster, some of the running VMs are migrated to the new server in order to re-balance the workload across the cluster.  To ice this cake, we can set up standby hardware that automatically comes online when the workload requires it.  So when your marketing campaign goes viral, I’ll be sitting there with you enjoying a drink at the celebration instead of monitoring the servers and making sure they don’t overload and crash.
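
For the curious, here is a toy Python sketch of the three behaviors above: evacuating a server, restarting VMs after a crash, and re-balancing when hardware is added or returned to service.  It is only a model to make the idea concrete.  It does not use VMware's API and does not reflect vSphere's actual placement logic; the Host and Cluster classes, the server and VM names, the load numbers, and the greedy re-balancing rule are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """A physical server in the (hypothetical) cluster."""
    name: str
    vms: dict = field(default_factory=dict)  # VM name -> abstract load units
    healthy: bool = True

    @property
    def load(self) -> int:
        return sum(self.vms.values())


class Cluster:
    """Toy cluster: places VMs on the least-loaded healthy host."""

    def __init__(self, hosts):
        self.hosts = {h.name: h for h in hosts}

    def _healthy(self):
        return [h for h in self.hosts.values() if h.healthy]

    def evacuate(self, name, crashed=False):
        """Move every VM off a host.

        crashed=False models a live migration (no downtime);
        crashed=True models a restart on another host (brief outage).
        """
        source = self.hosts[name]
        source.healthy = False  # stop placing VMs on this host
        if source.vms and not self._healthy():
            raise RuntimeError("no healthy host left to receive VMs")
        how = "restarted" if crashed else "live-migrated"
        moved = []
        for vm in list(source.vms):
            target = min(self._healthy(), key=lambda h: h.load)
            target.vms[vm] = source.vms.pop(vm)
            moved.append((vm, target.name, how))
        return moved

    def add_host(self, host):
        """Add new hardware (or standby capacity) and spread the load."""
        self.hosts[host.name] = host
        self.rebalance()

    def return_to_service(self, name):
        """Mark a repaired host healthy again and spread the load."""
        self.hosts[name].healthy = True
        self.rebalance()

    def rebalance(self):
        """Greedy rule (an assumption): keep moving the smallest VM from the
        busiest host to the idlest one while that leaves the idlest host
        below the busiest host's current load."""
        while True:
            healthy = self._healthy()
            busiest = max(healthy, key=lambda h: h.load)
            idlest = min(healthy, key=lambda h: h.load)
            if busiest is idlest or not busiest.vms:
                return
            vm = min(busiest.vms, key=busiest.vms.get)
            if idlest.load + busiest.vms[vm] >= busiest.load:
                return
            idlest.vms[vm] = busiest.vms.pop(vm)


if __name__ == "__main__":
    cluster = Cluster([Host("esx1", {"web": 4, "db": 6}), Host("esx2", {"mail": 3})])
    print(cluster.evacuate("esx1"))      # maintenance: VMs move while running
    cluster.add_host(Host("esx3"))       # new hardware triggers a re-balance
    cluster.return_to_service("esx1")    # repaired host rejoins the pool
    for host in cluster.hosts.values():
        print(host.name, host.vms)
```

The point of the sketch is that VMs are treated as movable workloads rather than fixtures of one machine: once placement is decoupled from hardware, failure recovery, maintenance, and scaling all reduce to the same operation of picking a new home for a VM.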

It’s a win-win strategy.  I get to play with all this cool technology that tolerates the normal course of hardware issues and you can run your business-critical application without worrying about its health and availability.

Posted by Mark Chester on 12/21/09 11:18 AM
