ACM's interview of Bruce Lindsay by Steve Bourne is a classic. Designing for failure is something a lot of people talk about, but not with much specificity. Not so with this interview:
SB Are you really thinking of system failures as opposed to user errors?
BL I don’t think of user errors, such as improper input kinds of things, as “failures.” Those are normal occurrences. I don’t think of a compiler saying you misspelled goto as really being an error. That’s expected.
If you look at the OWASP Top Ten, for example, or other lists such as SANS, so many of the so-called security vulnerabilities go back to input validation issues. But bad input is a normal occurrence, and even input that is not intended to be malicious, for example input resulting from poor training, can cause faults that compromise the system.
SB One thing we could explore here is what techniques we have in our toolkit for error detection.
BL Fault handling always begins with the detection of the fault—most often by use of some kind of redundancy in the system, whether it be parity, sanity checks in the code where we may spend 10 percent of the lines of code checking the consistency of our state to see if we should go into error handling. Timeout is not exactly a redundancy but that’s another way of detecting errors.
The key point here is that if you go to sea with only one clock, you can’t tell whether it’s telling you the right time. You need to have some way to check. For example, if you read a message from a network, you might want to check the header to see if it is really a message that you were expecting—that is, look for some bits and some position that says, “Aha, this seems like I should go further.”
Or if you read from the disk, you might want to check a label on the disk block to see if it was the block you thought you were asking for. It’s always some kind of redundancy in the state that allows you to detect the occurrence of an error. If you hadn’t thought about failures, why would you put the address of a disk block into the disk block?
SB So, really what you’re trying to do is establish confidence in your belief about what’s going on in the system?
BL In a large sense, that’s right. And to validate, as you go, that the data or state upon which you’re going to operate is self-consistent.
Self-consistency echoes an assurance goal also advocated by Brian Snow, who asks: how can we safely use security gear that we cannot trust?
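To make the disk block example concrete, here is a minimal Python sketch of a block that carries its own address, so the reader can check that it got the block it asked for. The layout (a 512-byte block with an 8-byte address and a CRC32 in front of the payload) and the names encode_block, decode_block, and BlockCheckError are assumptions made up for illustration; nothing here comes from the interview or from a real file system.

```python
import struct
import zlib

BLOCK_SIZE = 512
# Assumed header layout: block address (8 bytes) + CRC32 of payload (4 bytes).
HEADER = struct.Struct("<QI")
PAYLOAD_SIZE = BLOCK_SIZE - HEADER.size


class BlockCheckError(Exception):
    """Raised when a block fails its self-consistency check."""


def encode_block(address: int, payload: bytes) -> bytes:
    """Build a block that stores its own address and a payload checksum."""
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload too large for one block")
    payload = payload.ljust(PAYLOAD_SIZE, b"\0")
    return HEADER.pack(address, zlib.crc32(payload)) + payload


def decode_block(expected_address: int, raw: bytes) -> bytes:
    """Return the payload only if the block is the one we asked for."""
    if len(raw) != BLOCK_SIZE:
        raise BlockCheckError("short or oversized read")
    address, crc = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    # Redundancy 1: the address stored in the block must match the request.
    if address != expected_address:
        raise BlockCheckError(
            f"asked for block {expected_address}, got block {address}"
        )
    # Redundancy 2: the payload must match its stored checksum.
    if zlib.crc32(payload) != crc:
        raise BlockCheckError("payload checksum mismatch")
    return payload
```

The same pattern covers the network message case: the "bits in the header" play the role of the stored address, and the check happens before any further processing.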
SB Once you’ve detected the error, now what? You can report it, but the question is who do you report it back to and what do you report back?
BL There are two classes of detection. One is that I looked at my own guts and they didn’t look right, and so I say this is an error situation. The other is I called some other component that failed to perform as requested. In either case, I’m faced with a detected error. The first thing to do is fold your tent—that is, put the state back so that the state that you manage is coherent. Then you report to the guy who called you, possibly making some dumps along the way, or you can attempt alternate logic to circumvent the exception.
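A rough Python sketch of "fold your tent, then report" might look like the following. The Ledger class, its transfer method, and TransferError are hypothetical names invented for this example, and taking a full snapshot is just one simple way to get back to a coherent state; a real database would use something like a log instead.

```python
import copy


class TransferError(Exception):
    """The named error this component reports to its caller."""


class Ledger:
    def __init__(self, balances):
        self.balances = dict(balances)

    def transfer(self, src, dst, amount):
        # Remember the coherent state we can fold back to.
        snapshot = copy.deepcopy(self.balances)
        try:
            self.balances[src] -= amount
            self.balances[dst] += amount  # KeyError if dst is unknown
            if self.balances[src] < 0:
                raise ValueError("insufficient funds")
        except (KeyError, ValueError) as exc:
            # Step 1: put our own state back so it is coherent again.
            self.balances = snapshot
            # Step 2: report to the caller under our own error name.
            raise TransferError(f"transfer failed: {exc!r}") from exc
```

The ordering is the point: the half-applied debit is undone before the caller ever hears about the failure.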
In our database projects, what typically happens is it gets reported up, up, up the chain until you get to some very high level that then says, “Oh, I see this as one of those really bad ones. I’m going to initiate the massive dumping now.” When you report an error, you should classify it. You should give it a name. If you’re a component that reports errors, there should be an exhaustive list of the errors that you would report.
That’s one of the real problems in today’s programming language architecture for exception handling. Each component should list the exceptions that were raised: typically if I call you and you say that you can raise A, B, and C, but you can call Joe who can raise D, E, and F, and you ignore D, E, and F, then I’m suddenly faced with D, E, and F at my level and there’s nothing in your interface that said D, E, and F errors were things you caused. That seems to be ubiquitous in the programming and the language facilities. You are never required to say these are all the errors that might escape from a call to me. And that’s because you’re allowed to ignore errors. I’ve sometimes advocated that, no, you’re not allowed to ignore any error. You can reclassify an error and report it back up, but you’ve got to get it in the loop.
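Here is one way the "reclassify, never ignore" rule could be approximated in Python, which, like most languages, will not enforce it for you. Everything named here (StoreErrorKind, StoreError, store_get, and the lower layer joe_read standing in for "Joe") is hypothetical. The idea is only that the component publishes a closed list of error kinds and maps whatever escapes from below onto that list.

```python
import enum


class StoreErrorKind(enum.Enum):
    """The exhaustive list of errors this component reports."""
    NOT_FOUND = "not_found"
    UNAVAILABLE = "unavailable"
    INTERNAL = "internal"


class StoreError(Exception):
    def __init__(self, kind, detail=""):
        super().__init__(f"{kind.value}: {detail}")
        self.kind = kind


def joe_read(key):
    """Hypothetical lower layer with its own, different errors."""
    data = {"a": "1"}
    if key == "slow":
        raise TimeoutError("backend timed out")
    return data[key]  # KeyError on unknown keys


def store_get(key):
    """Raises only StoreError, with a kind from StoreErrorKind."""
    try:
        return joe_read(key)
    except KeyError as exc:
        raise StoreError(StoreErrorKind.NOT_FOUND, repr(key)) from exc
    except TimeoutError as exc:
        raise StoreError(StoreErrorKind.UNAVAILABLE, str(exc)) from exc
    except Exception as exc:
        # Anything Joe raises that we did not anticipate is still
        # reclassified rather than leaked or swallowed.
        raise StoreError(StoreErrorKind.INTERNAL, type(exc).__name__) from exc
```

The original exception is kept as the cause (via "from exc"), so the dumps higher up the chain still have the detail, but the caller only ever has to handle the three kinds in the interface.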
Security exceptions in many cases require special handling, so that they can be softened before any data is returned to a client or written to a log file. Programming language support for error detection, softening, and reporting is pathetic; programmers must literally grow their own wheat to make bread in this case.
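As a rough illustration of softening, the Python sketch below keeps the detailed reason on the server side and hands the client one uniform message plus an incident id for correlation. The names (AuthError, check_password, login), the in-memory dict of users, and the plain string comparison are all toy assumptions; a real system would compare password hashes and would decide separately how much detail its logs may hold.

```python
import logging
import uuid

log = logging.getLogger("auth")


class AuthError(Exception):
    """Internal, detailed error. Not meant to reach the client."""


def check_password(user, password, db):
    # Toy check against a hypothetical in-memory dict of users.
    if user not in db:
        raise AuthError(f"unknown user {user!r}")
    if db[user] != password:
        raise AuthError(f"bad password for {user!r}")


def login(user, password, db):
    try:
        check_password(user, password, db)
        return {"ok": True}
    except AuthError as exc:
        incident = uuid.uuid4().hex
        # The log gets the reason (never the password itself).
        log.warning("login failed incident=%s reason=%s", incident, exc)
        # The client gets the same softened answer for every cause, so
        # it cannot tell an unknown user from a wrong password.
        return {"ok": False, "error": "invalid credentials",
                "incident": incident}
```

Note that the softening step had to be written by hand at the boundary, which is exactly the grow-your-own-wheat problem.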
SB One of the interesting aspects of this is the trade-off between how long it takes to detect something and how much time you really have to recover in the system.
BL And what it means to remove the failed component, because there is a split brain problem that I think you’re out and you think I’m out. Who’s in charge?
SB Right, and while they’re arguing about it, nothing is happening.
BL That’s also possible, although some of these systems can continue service. It’s rare that a distributed system needs the participation of all active members to perform a single action.
There is also the issue of dealing out the failed members. If there are five of us and four of us think that you’re dead, the next thing to do is make sure you’re dead by putting three more bullets in you.
We see that particularly in the area of shared management of disks, where you have two processors, two systems, connected to the same set of storage. The problem is that if one system is going to take over the storage for the other system, then the first system better not be using the storage anymore. We actually find architectural facilities in the storage subsystems for freezing out participants—so-called fencing facilities.
So if I think you’re dead and I want to take over your use and responsibility for the storage, the first thing I want to do is tell the storage, “Pay no attention to him anymore. I’m in charge now.” If you were to continue to use the storage while I blithely go forward and think I’m the one who’s in charge of it, terrible things can happen.
There are two aspects of collaboration in distributed systems. One is figure out who’s playing, and the second one is, if someone now is considered not playing, make damn sure somehow that they’re not playing.
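Both aspects can be sketched in a few lines of Python. The first half decides who is playing, using a heartbeat timeout for suspicion and a majority vote for the verdict. The second half makes sure the loser is not playing, using an epoch number that the storage checks on every write. This epoch scheme is a commonly used fencing-token approach chosen for illustration, not necessarily how the storage subsystems Lindsay mentions do it, and all names (suspects, confirmed_dead, Storage, FencedError) are hypothetical.

```python
class FencedError(Exception):
    """Raised when a fenced-out participant tries to use the storage."""


def suspects(last_heartbeat, now, timeout):
    """Members whose last heartbeat is more than `timeout` seconds old."""
    return {m for m, seen in last_heartbeat.items() if now - seen > timeout}


def confirmed_dead(views, members):
    """Members that a strict majority of the group suspects.

    `views` maps each observer to the set of members it suspects.
    """
    needed = len(members) // 2 + 1
    dead = set()
    for member in members:
        votes = sum(1 for observer, suspected in views.items()
                    if observer != member and member in suspected)
        if votes >= needed:
            dead.add(member)
    return dead


class Storage:
    """Storage that pays attention only to the current epoch's owner."""

    def __init__(self):
        self.epoch = 0
        self.data = {}

    def take_over(self):
        """'Pay no attention to him anymore. I'm in charge now.'"""
        self.epoch += 1
        return self.epoch

    def write(self, epoch, key, value):
        if epoch != self.epoch:
            raise FencedError(
                f"epoch {epoch} is fenced out; current epoch is {self.epoch}"
            )
        self.data[key] = value
```

With five members where four suspect the fifth and the fifth suspects everyone else, confirmed_dead names only the fifth, which settles the split brain argument. Once the new owner calls take_over, any write the old owner attempts with its stale epoch is rejected by the storage itself, whether or not the old owner has noticed that it was voted out.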
This lesson applies to a lot of security in distributed systems, including authentication systems, SE(I)Ms, XML gateways, and a lot more. Part of the solution is to use tools, for example an STS, that let "security" function not as some dualistic boolean but as a composable domain with its own rules, behavior, and logic for adapting to these situations, since the languages and servers cannot do this without a tremendous amount of fu. Remember that Survivability has three R's: Resistance, Recognition, and Recovery. The Anasazi certainly did.