Thanks to Dana Epp for pointing out this interview with Dr. Bruce Lindsay. It is an interesting conversation about architecting systems in a way to deal with “unexpected” failures.
Some interesting quotes from the interview:
Let’s consider an example of system design for graceful degradation. In the Sabre system for airline reservations, they’re willing to have something like one in 1,000 duplicate reservations, because it turns out it doesn’t matter if somebody has two reservations in the system. Eventually, somebody will figure it out and do something about it. That’s not a black-and-white system. It’s one that has a fuzzy edge to it and is willing to tolerate some inconsistencies because they have other ways of recovering from the inconsistency.
and
Laziness in programming leads to lack of specificity in error messages. You might say, “You know, I have a choice between an unparameterized message that sort of says what’s happening, and a much more complicated parameterized message for which I’d have to write 15 lines of code.” Guess what I’m going to choose?
I think that, in general, software developers—and hardware people aren’t a heckuvalot better—are not investing enough money in correctly stating the errors and being more precise in describing the error.
Plus, a name which describes some of the toughest bugs there are to deal with, Heisenbugs:
So the real definition of a Heisenbug is when you look, it goes away—in deference to Dr. [Werner] Heisenberg, who said, “The more closely you look at one thing, the less closely can you see something else.”
Very interesting read. Thanks again for the link Dana.