This morning, I spoke with Mukesh Khattar, who has studied the failure rates for the 30,000 servers in his data center. During the past three years, only 1 power supply failed. That works out to 1 failure across 30,000 servers over 3 years, or roughly 0.001% per year. (Yeah, that's five-nines, baby!)
Mukesh is also looking at reliability rates for servers with dual power supplies and single power supplies. A power supply is designed to run at 80%-90% load, which is the case in a server with a single power supply. When you have dual power supplies in a server, each supply only runs at 40%-50% load. Because each supply is running below its optimum load level, it converts power less efficiently and generates more waste heat.
In our conversation, I started thinking about what you're spending for that additional reliability. With a Dell 2950, that second power supply costs $299. For 1,000 servers, you've just spent $299k on those second power supplies. I'm not even counting the additional operating expense for electricity and cooling. During the 3-year depreciation period for those servers, you'd expect only about 0.06 power supply failures (2,000 supplies x 0.001% per year x 3 years). That's right, not even 1 server out of that 1,000 is expected to lose a power supply in 3 years.
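If you want to rerun the numbers for your own fleet, here's a rough sketch of the arithmetic in Python. The variable names are mine, and it assumes each of the 30,000 servers in Mukesh's data counts as one power supply, which the post doesn't actually confirm. It also ignores electricity and cooling, just like the estimate above.

```python
# Back-of-the-envelope math using the figures quoted in this post.
# Assumption: the observed fleet is treated as one power supply per server.

observed_failures = 1
fleet_size = 30_000        # servers in the data center
observation_years = 3

# Annual failure rate per power supply (unrounded).
annual_failure_rate = observed_failures / (fleet_size * observation_years)

servers = 1_000
supplies_per_server = 2    # dual power supplies
second_supply_cost = 299   # USD, second supply for a Dell 2950
depreciation_years = 3

# Extra capital spent on redundant supplies.
extra_capex = servers * second_supply_cost

# Expected number of supply failures over the depreciation period.
expected_failures = (
    servers * supplies_per_server * annual_failure_rate * depreciation_years
)

# Simple ratio: dollars of extra hardware per expected supply failure.
capex_per_expected_failure = extra_capex / expected_failures

print(f"Annual failure rate: {annual_failure_rate:.4%}")
print(f"Extra capex: ${extra_capex:,}")
print(f"Expected failures in {depreciation_years} years: {expected_failures:.3f}")
print(f"Capex per expected failure: ${capex_per_expected_failure:,.0f}")
```

With the unrounded rate, the expected failure count comes out slightly higher than 0.06 (closer to 0.067), because the 0.06 figure uses the rounded 0.001% per year. Either way, it's well under one failure for $299k of extra hardware.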
So, a few take-aways:
- Don't over-provision your server hardware for failures that are extremely unlikely. You're going to drive up your capital and operating expenses to guard against something that will probably never happen.
- Power off those unused servers. When you power them back on, they will come back on. Throw away your garlic cloves and salt shakers. It will be okay. Look at the data. It will set you free.
- Take a different approach to high availability. Instead of trying to bullet-proof your hardware to prevent a failure, think about a graceful way to recover from a hardware failure.