- Inform customers
- Check procedures and systems
- (Re-)consider redundancy and failover
- Review your SLAs and contracts
2. Check your own procedures and systems. Could you have brought up your systems more quickly or more effectively after the data center itself failed? Did your processes contribute to the problem in some way, or make recovery more difficult? If you haven’t had a failure yet, this is harder, but try to replicate failure (on backup systems, perhaps) if possible.
3. (Re-)consider redundancy and failover. Even the best, and most expensive, data center in the world can have failures. Take a careful look at the cost/benefit of hosting your services at a redundant data center as well. You could cut costs, for example, by running your redundant systems on older or less-powerful hardware, smaller pipes, or less sophisticated centers. It may be better to serve 50% of your normal traffic successfully from your backup systems, or it may be better to simply apologize to customers and wait for your main system to come up. (At the very least, you had better have a plan for putting up a temporary “site is down” message if something does go wrong.) This is a very fact-specific inquiry. The point is to do the inquiry. Don’t just rest with the status quo.
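One way to start that inquiry is a back-of-the-envelope comparison of redundancy options. The sketch below is purely illustrative; every figure (outage hours, revenue per hour, the option names and their costs) is a hypothetical placeholder you would replace with your own numbers.

```python
# Hypothetical cost/benefit comparison of redundancy options.
# All figures are made-up placeholders -- substitute your own.

OUTAGE_HOURS_PER_YEAR = 8    # expected primary data center downtime
REVENUE_PER_HOUR = 10_000    # revenue lost per hour fully offline

options = {
    # name: (annual cost, fraction of traffic still served during an outage)
    "full mirror":      (120_000, 1.0),
    "cheaper backup":   (40_000,  0.5),
    "status page only": (1_000,   0.0),
}

for name, (annual_cost, served) in options.items():
    lost_revenue = OUTAGE_HOURS_PER_YEAR * REVENUE_PER_HOUR * (1 - served)
    total = annual_cost + lost_revenue
    print(f"{name:>16}: ${total:,.0f} expected annual cost")
```

Even this crude model makes the trade-off concrete: with these (invented) numbers, the cheaper 50%-capacity backup beats both the full mirror and the apology page, but changing the outage estimate or revenue figure can flip that ordering.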
4. Look over your Service Level Agreements (SLAs). An SLA is a contract between the data center provider and you, the customer. As such, it ought to be treated with the respect due any contract you enter into that may determine your future existence as a viable business.
What does your data center promise? Make sure you have detailed and specific requirements, not vague statements like “We’re the best in the industry and promise 99.99999% uptime.” This future promise is unlikely to be enforceable under contract law, and is essentially marketing “puffery.” While nice for the marketing people, it carries no real legal weight.
Instead, work with your technical people (and your lawyers) to draw up specific requirements, like: “power will be maintained to our servers at all times, meaning no more than 30 seconds of power loss over any 1-hour period.” Other metrics might include network uptime, average time for the data center helpdesk to answer, and so on. If you have expectations, put them in writing in the SLA! Any promises the marketing people made to you are essentially worthless once you sign the contract that is the SLA (see the parol evidence rule).
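A requirement that specific is also one you can check mechanically. Here is a minimal sketch of verifying the example clause above (“no more than 30 seconds of power loss over any 1-hour period”), assuming you already log power-loss events as (timestamp, seconds lost) pairs; the function name and log format are my own invention for illustration.

```python
from collections import deque

WINDOW_SECONDS = 3600      # "any 1-hour period"
THRESHOLD_SECONDS = 30     # maximum tolerated power loss in that window

def sla_violations(events):
    """events: list of (timestamp, loss_seconds) pairs, sorted by timestamp.
    Returns the timestamps at which the rolling 1-hour total exceeds 30s."""
    window = deque()   # events within the trailing hour
    total = 0.0
    violations = []
    for ts, loss in events:
        window.append((ts, loss))
        total += loss
        # Drop events that have aged out of the 1-hour window.
        while window and window[0][0] <= ts - WINDOW_SECONDS:
            total -= window.popleft()[1]
        if total > THRESHOLD_SECONDS:
            violations.append(ts)
    return violations
```

For example, `sla_violations([(0, 20), (1800, 15)])` flags the second event, since 35 seconds of loss fall within one hour. The point is not this particular script but that a requirement written this precisely can be monitored and disputed with data rather than adjectives.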
Each requirement should be accompanied by a consequence for the data center if it is not met (sometimes called “liquidated damages,” although the consequence need not be monetary). For example: the data center owes $1,000 for every minute of insufficient electrical power to your servers, up to a maximum of $50,000 per day. Remember: this is incredibly situation-specific; both your lawyers and your technical folks will need to be involved to get the specificity that will protect you in a courtroom. These “damages” also need to be reasonable from both perspectives, and cannot be punitive. Finally, you should always require an action plan from the data center provider explaining how it will avoid the problem in the future.
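The arithmetic of a clause like that should be unambiguous enough to write down directly. A sketch, using the illustrative figures from the example above ($1,000 per minute, capped at $50,000 per day; these are the article’s made-up numbers, not real contract terms):

```python
# Illustrative liquidated-damages calculation: $1,000 per minute of
# insufficient power, capped at $50,000 per day. Placeholder figures only.

RATE_PER_MINUTE = 1_000
DAILY_CAP = 50_000

def damages_for_day(minutes_of_violation):
    """Damages owed for one day's worth of SLA violations."""
    return min(minutes_of_violation * RATE_PER_MINUTE, DAILY_CAP)
```

Note how the cap matters: 10 minutes of violation costs $10,000, but a two-hour outage hits the $50,000 ceiling. Whether a cap, and at what level, is acceptable is exactly the kind of question your lawyers and technical people should settle together.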
You may decide that your SLAs are really for managers and marketers, and that you’d like to put the technical specifications into a “Service Level Specification” (SLS) document, and perhaps the legal specifics into a “Service Contract” or similar. Whatever you call them, just make sure your lawyer tells you that you have a proper contract in the correct jurisdiction, and that it will actually make what’s important to you enforceable. That’s the leverage you’ll need if things ever go really wrong, and (paradoxically, some may feel) it will keep you out of court: the more specific your contract is, the less likely anyone will want to litigate it.
Some resources for preparing SLAs:
I highly recommend you have your lawyer review anything you come up with. It’s far, far cheaper to pay for legal assistance now than either to go out of business or to end up suing your data center in court.
If your SLAs are not up to the level I’ve described, take the time to update them now.
5. Finally: learn from the experience. Review what happened. Discuss. Create a test environment. Fail your services. Review. Discuss. Wash, rinse, repeat.
Have other sites or resources to contribute? Put them in the comments, or send them to me directly.