Sep 3, 2007

Yes, It's Still Safe to Power Off and Power On That Server

After my previous post on the reliability of power supplies, I decided to see what our Cassatt experiences can tell us about server reliability. Within my department, I have engineering labs in three locations (Colorado Springs, Minneapolis and San Jose) with about 500 servers in total.

Mukund and I decided to look at the data from 123 servers located in San Jose. These servers are used by Mukund's team for System Test activities. His team has developed over 700 automated tests that are used to qualify our Cassatt product suite. As part of the test run, servers are routinely power-cycled. We physically pull power from the servers at the start of each test run. All the nodes are on managed Power Distribution Units (APCs and Baytechs), and the automated tests power down the outlets at the PDU before running the tests. This has been in place since 2004.

For the 123 servers that were analyzed, not a single power supply or disk drive failed during the past two years.

Here are the server counts in the study:
  • 26 IBM HS-20 blades
  • 8 HP DL380 G4
  • 45 HP DL360 G4
  • 8 HP DL360 G3
  • 6 HP DL140
  • 3 HP DL385
  • 5 Sun SPARC
  • 1 IBM x345
  • 15 Dell 1850
  • 6 Dell 2650
During the past 5 months, the power supplies on these 123 servers were power-cycled 18,826 times. That's an average of about once per day per server. In addition, as part of the system testing, these servers were power-cycled repeatedly through their power controllers. Power operations from the power controller put stress on the server's internal components, such as the motherboard and disk drives, but the power supply remains connected to AC power. These power-controller operations are not counted in the 18,826 figure cited earlier.
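For anyone who wants to check the arithmetic, here is a quick sketch in Python. It adds up the server counts listed above and works out the average number of PDU power cycles per server per day. The 30-day month is my assumption (the exact dates of the 5-month window are not given), and the variable names are just for illustration.

```python
# Server counts exactly as listed in the post.
inventory = {
    "IBM HS-20 blades": 26,
    "HP DL380 G4": 8,
    "HP DL360 G4": 45,
    "HP DL360 G3": 8,
    "HP DL140": 6,
    "HP DL385": 3,
    "Sun SPARC": 5,
    "IBM x345": 1,
    "Dell 1850": 15,
    "Dell 2650": 6,
}

total_servers = sum(inventory.values())

# PDU-level power cycles reported for the 5-month window.
total_cycles = 18_826

# Assumption: 5 months is treated as 5 * 30 = 150 days.
days = 5 * 30

cycles_per_server = total_cycles / total_servers
cycles_per_server_per_day = cycles_per_server / days

print(f"servers in study:       {total_servers}")
print(f"cycles per server:      {cycles_per_server:.1f}")
print(f"cycles per server-day:  {cycles_per_server_per_day:.2f}")
```

Under that assumption, the result comes out very close to one power cycle per server per day, which matches the average quoted above.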

In a future posting, Mukund and I will provide more details on these additional power operations. We will also provide data from the servers in our other engineering labs.

So if you're still afraid to power down that server, don't worry! Power supplies and hard drives are very reliable these days. From several different studies, we've seen that power supplies hold up quite well under (and are even designed for) power cycling.

1 comment:

Scott said...

We’re getting to a point where network automation isn’t an option any network administrator can go without. Many businesses struggle to deliver solid and reliable IT service to their clients simply because the lack of automation forces the people taking care of the network to work out all the bugs and patches rather than working on improving existing applications.

Luckily, the department I work in has a great ITSM setup, and we are able to process workflow much more efficiently than other companies do, mainly because of our excellent database administration.