Aug 28, 2007

Don't Worry, It's Safe to Power off that Server and Power It on Again

There's a lot of superstition out there about data center best practices, and a fair amount of voodoo when it comes to powering down servers. Will the server come up when you power it on? Will the power supply fail?

This morning, I spoke with Mukesh Khattar, who has studied the failure rates for the 30,000 servers in his data center. During the past three years, only 1 power supply failed. That's 1 failure in 90,000 server-years, or a failure rate of about 0.001% per year. (Yeah, that's five-nines, baby!)
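As a rough check on that figure, here is the arithmetic sketched in Python. It assumes exposure is counted in server-years with one power supply per server; the variable names are illustrative and do not come from Mukesh's study.

```python
# Back-of-the-envelope check of the annual failure rate quoted above.
# Assumption: exposure is counted in server-years (one supply per server).
servers = 30_000
years = 3
failures = 1

server_years = servers * years                 # total years of operation observed
annual_failure_rate = failures / server_years  # failures per server per year
annual_survival = 1 - annual_failure_rate      # fraction that gets through a year

print(f"Server-years observed: {server_years:,}")
print(f"Annual failure rate:   {annual_failure_rate:.4%}")
print(f"Annual survival:       {annual_survival:.4%}")
```

Under those assumptions the rate works out to roughly 0.0011% per year, which rounds to the 0.001% quoted above.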

Mukesh is also looking at reliability rates for servers with dual power supplies and single power supplies. A power supply is designed to run at 80%-90% load, which is the case in a server with a single power supply. When you have dual power supplies in a server, each supply only runs at 40%-50% load. Since these power supplies are running below their optimum load level, they run less efficiently and consequently generate more heat.

During our conversation, I started thinking about what you're spending for that additional reliability. With a Dell 2950, that second power supply costs $299. For 1,000 servers, you've just spent $299k on those second power supplies. I'm not even counting the additional operating expense for electricity and cooling. During the 3-year depreciation period for those servers, only about 0.06 power supplies are expected to fail. That's right, not even 1 server out of those 1,000 is expected to fail in 3 years due to a power-supply failure.
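The numbers in that paragraph can be reproduced with a short Python sketch. One assumption to flag: the 0.06 figure appears to follow if the 0.001% annual rate is applied per power supply across the 2,000 supplies in a dual-supply fleet of 1,000 servers. That interpretation is mine, not something stated in the conversation.

```python
# Rough cost/benefit of a second power supply, using the figures in the post.
# Assumption: the 0.001%/year rate applies per power supply, and a
# dual-supply fleet of 1,000 servers has 2,000 supplies in total.
servers = 1_000
second_psu_cost = 299          # USD, Dell 2950 option price quoted in the post
annual_failure_rate = 0.00001  # 0.001% per year
years = 3                      # depreciation period

extra_capex = servers * second_psu_cost
supplies = servers * 2
expected_failures = supplies * annual_failure_rate * years

print(f"Extra capital cost: ${extra_capex:,}")
print(f"Expected supply failures over {years} years: {expected_failures:.2f}")
```

If you instead count one supply per server, the expected number of failures halves to about 0.03, which only strengthens the point.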

So, a few take-aways:
  1. Don't over-provision your server hardware for failures that are extremely unlikely. You're going to drive up your capital and operating expenses to guard against a failure that will probably never occur.
  2. Power off those unused servers. When you power them back on, they will come back on. Throw away your garlic cloves and salt shakers. It will be okay. Look at the data; it will set you free.
  3. Take a different approach to high availability. Instead of trying to bullet-proof your hardware to prevent a failure, think about a graceful way to recover from a hardware failure.

1 comment:

James Urquhart said...

From Service Level Automation in the Data Center:

"Remember this point for some of my future posts--this myth is busted, and knowing this opens you to some very quick and simple power efficiency practices."