Most companies have specific requirements for the reliability and availability of server systems, especially systems hosting crucial company operations such as email and web commerce sites. A systems administrator is usually required to report regularly on system uptime (and downtime). Although this may seem like a straightforward and quantifiable proposition, it can be a tricky metric to report because what most IT folks consider availability or uptime is subjective. For example, availability does not necessarily equate to uptime, and vice versa: a system can be up and running but not available. In addition, how do you report on a system that is up and available but in a degraded mode? So, before you begin gathering uptime metrics, you had better come to an agreement on what qualifies as up and available vs down and unavailable, as well as the various states in between.
Reliability: the probability that the product (system) will perform its intended function for a specified time period when operating under normal (or stated) environmental conditions.
Availability: the probability that a server will perform its intended function under normal operating conditions when needed, usually expressed as a percentage.
Quantitative Calculations of Availability
Ignoring the possibility of degraded states for now, let’s treat availability as equivalent to system uptime. This is usually expressed as the number of nines in the uptime percentage. For example, five nines equates to 99.999% uptime, which allows only about 5 minutes and 15 seconds of downtime per year!
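To make the "nines" concrete, here is a minimal Python sketch that converts a number of nines into a yearly downtime budget. The function name and the 365-day year are my own assumptions for illustration; a leap year shifts the numbers slightly.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def allowed_downtime_minutes(nines, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget in minutes for a given number of nines.

    Three nines means 99.9% availability, five nines means 99.999%, etc.
    """
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return period_minutes * unavailability


for n in range(2, 6):
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines: {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five nines works out to roughly 5.26 minutes per year, which matches the 5 minutes and 15 seconds quoted above.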
There are a number of simple ways to check system uptime, such as Microsoft’s uptime.exe tool, running systeminfo.exe at a command prompt, or just looking at Task Manager on Windows Vista or Windows Server 2008. However, these only tell you how long the system has been up since its last boot. To track uptime percentages, what you really need to do is track the amount of downtime within a specific period.
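One way to track downtime for a period is to keep a simple log of outage start and end times and total them up. The sketch below is only an illustration: the function name, the record format, and the dates are hypothetical, and it assumes the recorded outages do not overlap each other.

```python
from datetime import datetime, timedelta


def total_downtime_minutes(outages, period_start, period_end):
    """Sum outage durations (in minutes) that fall inside the reporting period.

    `outages` is a list of (start, end) datetime pairs. Outages that straddle
    the period boundary are clipped to it. Overlapping outages are NOT merged.
    """
    total = timedelta()
    for start, end in outages:
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            total += clipped_end - clipped_start
    return total.total_seconds() / 60


# Hypothetical 30-day reporting period with one 26-minute unscheduled outage.
period_start = datetime(2024, 1, 1)
period_end = period_start + timedelta(days=30)
outages = [(datetime(2024, 1, 10, 2, 0), datetime(2024, 1, 10, 2, 26))]

downtime = total_downtime_minutes(outages, period_start, period_end)
print(f"Downtime this period: {downtime:.0f} minutes")
```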
The easiest way to calculate your uptime or “Nines” rating is as follows:
Availability = (Uptime / (Uptime + Downtime)) x 100
OR
Availability = ((Elapsed Time – Downtime) / Elapsed Time) x 100
For example, in the last 30 days (43,200 minutes) I have a system that has been online and available for 29 days, 23 hours and 34 minutes (43,174 minutes). In other words, it has been offline for 26 minutes within a 30-day period due to an unscheduled outage:
Elapsed Time (Minutes in Month) = 43,200
Downtime Minutes = 26
Availability = (43,174 / (43,174 + 26)) x 100 ≈ 99.94%
or
Availability = ((43,200 – 26) / 43,200) x 100 ≈ 99.94%
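If you would rather script this than do it by hand, the same calculation can be sketched in a few lines of Python. The function name is just one I picked for illustration, and it uses the elapsed-time form of the formula.

```python
def availability_percent(elapsed_minutes, downtime_minutes):
    """Availability = ((Elapsed Time - Downtime) / Elapsed Time) x 100."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed_minutes must be positive")
    return (elapsed_minutes - downtime_minutes) / elapsed_minutes * 100


# The 30-day example from above: 43,200 minutes elapsed, 26 minutes down.
result = availability_percent(43_200, 26)
print(f"Availability: {result:.2f}%")
```

Carried to two decimal places, the example comes out at about 99.94%, so this system clears three nines for the month but falls short of four.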
Alternatively, you can just plug your numbers into this handy uptime calculator.
Qualitative Calculations of Availability
Quantifying uptime, although still a chore, is fine when you view uptime as a binary state with only two possible scenarios. But how do you determine the availability of a sluggish email server, or of a SQL Server cluster that keeps failing back and forth between nodes? This is where it gets subjective and, since every system is different, I don’t think there are any set rules or standards. The best course of action is probably to obtain service baselines that capture a server’s normal quality of service, so that a degraded service can be measured against that baseline. For example, say we have a system that normally handles 100 transactions per second, but in the last 30 days there was a 24-hour period (1,440 minutes) where it was only handling 25 transactions per second. A reboot (which completed in 5 minutes) resolved the problem. In this scenario, I think it is fair to assume the system was 75% degraded, or 25% available, until the reboot. The uptime calculation for the month may be as follows:
Availability = ((Elapsed Time – (Degraded Time x Degraded %) – Downtime) / Elapsed Time) x 100
Availability = ((43,200 – (1,440 x 0.75) – 5) / 43,200) x 100 ≈ 97.5%
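Here is a rough Python sketch of the degraded-state calculation. Both function names are hypothetical, and it rests on the same assumption I made above: that the degraded percentage can be read directly off the throughput baseline.

```python
def degraded_fraction(baseline_rate, observed_rate):
    """Fraction of normal capacity lost, e.g. 100 tps -> 25 tps gives 0.75."""
    return 1 - (observed_rate / baseline_rate)


def weighted_availability_percent(elapsed_minutes, degraded_minutes,
                                  fraction_degraded, downtime_minutes):
    """Availability with degraded time counted as partial downtime."""
    lost_minutes = degraded_minutes * fraction_degraded + downtime_minutes
    return (elapsed_minutes - lost_minutes) / elapsed_minutes * 100


# 30-day month, 24 hours at 25 of the usual 100 transactions per second,
# followed by a 5-minute reboot.
fraction = degraded_fraction(baseline_rate=100, observed_rate=25)
result = weighted_availability_percent(43_200, 1_440, fraction, 5)
print(f"Degraded fraction: {fraction:.2f}")
print(f"Availability: {result:.2f}%")
```

With these inputs the result is about 97.49%, which rounds to the 97.5% shown above. Pick a different baseline metric (response time, queue depth) and you will get a different number, which is exactly the subjectivity problem.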
In the end, the real difficulty is not the calculation but quantifying the degraded state.
Cost vs. Availability
What continually amazes me, however, is the number of woefully underfunded IT departments saddled with specific, yet unattainable, uptime targets. The unfortunate systems administrators in these shops are doomed to failure, or at least to many sleepless nights. I’m not saying that 3, 4 or even 5 nines is impossible. It just requires an incredibly robust infrastructure, and the result can be heavily influenced by “how” uptime is calculated. You must weigh the cost of downtime against the cost of building out the network and server infrastructure for high availability: as the acceptable downtime decreases, costs increase dramatically, if not exponentially. For a fantastic description of the difficulties of achieving high nines, you may want to read (or point your CTO to) the article “Five Nines: Chasing the Dream?” to better ground everyone in uptime reality. In the end, you have to come to an agreement with IT management on what constitutes “downtime” so that you can properly calculate your uptime.
For some pointers on increasing availability, see my related post, “Increasing Server Uptime”.
[...] Beckford, “Calculating Server Uptime”, February 13, [...]