Mar 12, 2007

What Should You Do if a Server Fails? Absolutely Nothing.

A server failure can be disruptive to your business and your personal life. Servers tend to fail at the worst times: late at night, over the weekend, or in the middle of an important demo. Disk drives fail, motherboards burn out, and software crashes; these things happen. To fix the problem, someone has to reboot the server or reinstall the software on a new machine. What if your data center could recover automatically by performing these actions for you?

Cassatt Collage goes beyond provisioning. Collage constantly monitors the applications and servers in your data center, polling for a heartbeat from each server using the standard OS-level and application-level monitors available in Linux, Windows, and Solaris. You can also introduce your own custom monitors, such as a customized agent or a script running inside a database, and have Collage watch those as well. When a server fails, Collage replaces it with a new one from the free pool and boots the same application or service on the new server, all within minutes and without losing any data.
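The poll-retry-failover behavior described above can be sketched roughly as follows. This is my own illustration, not Collage's actual API or implementation: the function names, the dictionary shapes, and the MAX_RETRIES constant are all made up for the example.

```python
import time

MAX_RETRIES = 3  # consecutive missed heartbeats before failover

def check_heartbeat(server):
    # Placeholder monitor: a real deployment would poll SNMP, an
    # OS-level agent, or a custom script, as described above.
    return server["healthy"]

def poll_once(servers, free_pool, failures):
    """One polling pass: count missed heartbeats and fail over any
    server that has exhausted its retries. Returns replaced names."""
    replaced = []
    for server in list(servers):
        name = server["name"]
        if check_heartbeat(server):
            failures[name] = 0
            continue
        failures[name] = failures.get(name, 0) + 1
        if failures[name] >= MAX_RETRIES and free_pool:
            replacement = free_pool.pop()
            replacement["app"] = server["app"]  # boot the same application
            servers.remove(server)              # quarantine the failed node
            servers.append(replacement)
            replaced.append(name)
    return replaced

def monitor(servers, free_pool, interval=30):
    """Poll forever at a fixed interval, failing over as needed."""
    failures = {}
    while True:
        poll_once(servers, free_pool, failures)
        time.sleep(interval)
```

With three retries, a server is only replaced after three consecutive missed polls, which keeps a single dropped heartbeat from triggering an unnecessary failover.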

For each application, you specify the monitoring parameters: which monitors to use, the polling interval, and how many retries to allow. In a smaller configuration with fewer than 100 servers, I like to monitor SNMP at 30-second intervals with 3 retries. For larger configurations with ~400 servers, I would increase the polling interval to 60 seconds. For Apache servers, I add an HTTP monitor on a system URL embedded in my web app so that I can verify the Apache service is actually serving requests.
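An HTTP monitor like the Apache check can be sketched in a few lines of Python. This is a hypothetical stand-in, not how Collage's monitors are written; the system URL would be whatever page you embed in your own web app.

```python
from urllib.request import urlopen
from urllib.error import URLError

def http_monitor(url, timeout=5):
    """Return True if the URL answers with HTTP 200, else False.

    A check like this catches the case where the Apache process is
    up but the web application behind it is no longer responding.
    """
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False
```

The point of hitting a URL inside the app, rather than just checking the Apache process, is that it exercises the whole request path: a hung application server fails this check even though the process table looks healthy.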

As part of our standard customer demo, I like to pull a blade server that's running an application and watch Collage respond to the failure. I let the customer pull the blade out of the chassis in the lab; the blinking lights, fan noise, and 10 racks of servers always add a little extra to the demo. By the time we return to the conference room (with the failed blade in hand), Collage has quarantined the failed blade in a maintenance pool, allocated a new server from the free pool, and booted that server with the same application. All of this takes only 3 minutes from bare metal to a running application on a new server. The demo is always memorable and illustrates high availability in a very simple way. (Seeing is believing.) I once gave this demo to some visiting executives from TCS; when I met them again in San Jose a year later, they still remembered it.


Jan said...

This is a very useful post and tells me a lot about Cassatt's capabilities. It would be valuable for me to hear more about your work with blade servers. I've talked to many IT pros who bought blades prior to 2007 and had a lot of problems with failures, as well as heat and configuration woes. One IT pro told me that he only uses half as many blades in a chassis as he could in order to avoid overheating. What's your experience been?

Vinay Pai said...

At Cassatt, we have about 500 servers that we use to test and qualify Collage on a variety of platforms, spanning multiple generations of systems from Dell, HP, Sun, IBM, and generic x86 hardware.

We have five IBM BladeCenters deployed within Cassatt, purchased in 2004 and 2005, and we've never seen any heat-related issues or failures with them. Three are running at full capacity (14 HS20 blades), one is running with 10 blades, and one with 12. Most of the blades have local disks, and all of the BladeCenters have dual switch modules and dual power supplies. Collage only powers on the systems that are in use, so we're not always running every blade in a chassis. Also, Collage can be configured to respond to the internal temperature of a BladeCenter: in data center environments where air circulation is poor, it could take corrective action when temperatures reach dangerous levels.