What to plan for in terms of physical disasters that affect your technology...
1) Physical Disaster: a natural disaster (the obvious examples are flood, tornado, earthquake) or a man-made disaster. Examples of man-made disasters are:
a. Fire is the big one. Even an underground computer room with halon fire suppression can be caught up in a larger fire that starts elsewhere in the building and overwhelms the computer room's suppression system. Sprinklers on the office floors above can leak water down into the computer room and wreck the equipment. Smoke does its own damage, and when firefighters arrive, their foam or water damages the infrastructure further. Believe it or not, plenty of computer rooms use water-based fire suppression that, if triggered, would ruin all the equipment in the room! A fire can even break out down the street in an underground pipe and wipe out the local network lines, so that network connectivity is lost.
b. A distraught employee shoots up the place, injuring or killing key employees (there are any number of examples of this), and then sets a fire. If the data recovery procedures were not documented well, i.e., they existed only in the heads of the people who were gravely injured or killed, how would you recover the data? Unfortunately, poor documentation of recovery procedures is quite common.
c. Someone sets off a bomb. This happened last week at the eBay Network Operations Center in San Jose while some colleagues were about a mile away. They did not even know until they saw it on the news; the bomb was too small to do any real damage that would interrupt eBay operations.
d. A disgruntled system admin is fired, and the next day all disk drives are formatted clean with swear words as labels (this actually happened to my computer center once).
e. Power is lost across a large geographic region for a couple of days (i.e., longer than on-site batteries and/or generators can cover) because the electrical distribution grid is overloaded. This happened a couple of years ago in the Midwest and Northeast.
2) Worms: Let's say someone brings in a USB thumb drive infected with a worm, to share some files with a friend or to install a personal productivity app (or installs something on their laptop at home and then brings the laptop in), and nothing happens for a couple of months. Then, on Halloween, the worm starts propagating across the internal network so aggressively that all systems slow to a crawl. In other words, the worm silently infected all the internal computers for months, and on a trigger day every copy tries to break out of the internal network through the firewall into the wider Internet. You shut down all the systems, run anti-virus cleaners, and bring the systems back up one by one. But you missed a couple of systems, or someone hooks up an infected laptop, or the anti-virus cleaner is not effective against this particular worm, and the worm re-infects every system within minutes. So you take all systems off the network and restore from last week's backups, only to find that those backups are infected too. You go back to last month's backups, and the worm is still in there. Finally, you are forced back months in time until you find a clean backup. At this point you may decide to work with the anti-virus vendor to create an effective worm cleaner, since restoring every internal system from months-old backups is not practical. The recent SQL Slammer worm and the Melissa virus are examples.
3) DDoS = Distributed Denial of Service attack: Many personal computers across the Internet are infected with a virus that takes direction from a central system. Increasingly, this central system is owned by organized crime, which sells a service to spammers: sending out zillions of emails at low cost from thousands of PCs simultaneously. Sometimes this "zombie army" is used in a DDoS, where the zombies flood a server, say Amazon.com, with traffic in an attempt to bog it down, and the attackers then demand money to stop the attack.
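The backup hunt described in item 2 above (going back week by week and month by month until a clean copy turns up) can be sketched in a few lines. This is only an illustration under stated assumptions: the Backup record, the labels, the dates, and the scan function are all hypothetical stand-ins, and in practice the scan would be whatever anti-virus check you trust for the worm in question.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Backup:
    """Hypothetical record describing one backup set."""
    taken_on: date
    label: str


def newest_clean_backup(
    backups: Iterable[Backup],
    is_clean: Callable[[Backup], bool],
) -> Optional[Backup]:
    """Walk backups from newest to oldest and return the first clean one.

    Returns None if every backup fails the scan.
    """
    for backup in sorted(backups, key=lambda b: b.taken_on, reverse=True):
        if is_clean(backup):
            return backup
    return None


# Made-up example data: assume the worm arrived on 15 August,
# so every backup taken on or after that date carries it.
INFECTION_DATE = date(2003, 8, 15)

backups = [
    Backup(date(2003, 10, 24), "weekly-43"),
    Backup(date(2003, 9, 30), "monthly-09"),
    Backup(date(2003, 8, 31), "monthly-08"),
    Backup(date(2003, 7, 31), "monthly-07"),
]


def scan(backup: Backup) -> bool:
    """Stand-in for a real anti-virus scan of the backup contents."""
    return backup.taken_on < INFECTION_DATE


clean = newest_clean_backup(backups, scan)
print(clean.label if clean else "no clean backup found")
```

The point of the sketch is the retention question it raises: if your oldest retained backup is newer than the infection date, the function returns None and there is nothing clean to restore from.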
The correct planning items should be:
1) The indispensable, but still rather simple and cheap thing to do: Exhaustively document backup and recovery procedures. Even more important, hold recovery drills at least once a year, with an occasional surprise exercise! TEST, TEST, TEST! You would be amazed how often a disk fails, IT tries to restore it, and only then finds out that the backup tapes have not been made properly all along, so the data is lost forever, or that the recovery procedure itself fails. There is no more important disaster preparation than an occasional total system rebuild.
2) The expensive things to do: A hot, geographically distributed system, such as a cluster that is half in San Francisco and half in Oakland or, better yet, split with a site in another state entirely. Another strategy is a cold standby facility, such as one at SunGard, that could be up and running within a few days, accommodate your staff, and run all your operations.
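As a small illustration of the "TEST, TEST, TEST" advice in item 1, here is a minimal sketch of one check a recovery drill could automate: restore a backup into a scratch directory, then compare every file against the original by SHA-256 checksum. The function names and the directory layout are assumptions for the example, not part of any particular backup product, and a real drill would also cover databases, permissions, and bringing the applications back up.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(source: Path, restored: Path) -> list[str]:
    """Return the relative paths that are missing or differ after a restore.

    An empty list means every source file came back byte-for-byte identical.
    """
    expected = tree_digests(source)
    actual = tree_digests(restored)
    return sorted(
        name for name, digest in expected.items() if actual.get(name) != digest
    )
```

A drill script might call verify_restore(Path("/data"), Path("/scratch/restore")) (both paths hypothetical) and fail loudly if the returned list is not empty. Comparing against live data only works for files that have not changed since the backup was taken, so in practice you would more likely store the digests at backup time and compare the restored tree against those.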