ACM's interview of Bruce Lindsay by Steve Bourne is a classic. Designing for failure is something a lot of people talk about, but not with much specificity. Not so with this interview:
SB Are you really thinking of system failures as opposed to user errors?

BL I don’t think of user errors, such as improper input kinds of things, as “failures.” Those are normal occurrences. I don’t think of a compiler saying you misspelled goto as really being an error. That’s expected.
If you look at the OWASP Top Ten, for example, or other lists like SANS, so many of the so-called security vulnerabilities go back to input validation issues. But they are normal, and even input that is not intended to be malicious, for example due to poor training, causes faults that can compromise the system.
SB One thing we could explore here is what techniques we have in our toolkit for error detection.
BL Fault handling always begins with the detection of the fault—most often by use of some kind of redundancy in the system, whether it be parity, sanity checks in the code where we may spend 10 percent of the lines of code checking the consistency of our state to see if we should go into error handling. Timeout is not exactly a redundancy but that’s another way of detecting errors.
The key point here is that if you go to sea with only one clock, you can’t tell whether it’s telling you the right time. You need to have some way to check. For example, if you read a message from a network, you might want to check the header to see if it is really a message that you were expecting—that is, look for some bits and some position that says, “Aha, this seems like I should go further.”

Or if you read from the disk, you might want to check a label on the disk block to see if it was the block you thought you were asking for. It’s always some kind of redundancy in the state that allows you to detect the occurrence of an error. If you hadn’t thought about failures, why would you put the address of a disk block into the disk block?
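Lindsay's disk-block example translates almost directly into code. Here is a minimal Java sketch of detection via redundancy in the state (the LabeledBlockStore shape and names are mine, not from the interview): each block carries its own address as a label, and a read refuses to trust data whose label doesn't match what was asked for.

```java
import java.nio.ByteBuffer;

// Illustrative sketch: blocks are self-identifying, so reads can detect
// that the storage layer handed back the wrong block.
public class LabeledBlockStore {
    static final int BLOCK_SIZE = 4096;

    static class BadBlockException extends Exception {
        BadBlockException(String msg) { super(msg); }
    }

    private final byte[][] disk; // stand-in for a raw device

    LabeledBlockStore(int blocks) {
        disk = new byte[blocks][BLOCK_SIZE];
    }

    // Write: embed the block's own address as the first 8 bytes (the label).
    void write(long blockNo, byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(disk[(int) blockNo]);
        buf.putLong(blockNo); // redundant self-description
        buf.put(payload, 0, Math.min(payload.length, BLOCK_SIZE - 8));
    }

    // Read: check the label against what we asked for before trusting the data.
    byte[] read(long blockNo) throws BadBlockException {
        ByteBuffer buf = ByteBuffer.wrap(disk[(int) blockNo]);
        long label = buf.getLong();
        if (label != blockNo) { // detection via redundancy
            throw new BadBlockException(
                "asked for block " + blockNo + " but got block labeled " + label);
        }
        byte[] payload = new byte[BLOCK_SIZE - 8];
        buf.get(payload);
        return payload;
    }
}
```

The same pattern covers the network case: a magic number or expected type field in a message header is redundancy you only bother to write because you have thought about failure.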
SB So, really what you’re trying to do is establish confidence in your belief about what’s going on in the system?
BL In a large sense, that’s right. And to validate, as you go, that the data or state upon which you’re going to operate is self-consistent.
Self-consistency echoes an assurance goal also advocated by Brian Snow, who asks: how can we safely use security gear that we cannot trust?
SB Once you’ve detected the error, now what? You can report it, but the question is who do you report it back to and what do you report back?
BL There are two classes of detection. One is that I looked at my own guts and they didn’t look right, and so I say this is an error situation. The other is I called some other component that failed to perform as requested. In either case, I’m faced with a detected error. The first thing to do is fold your tent—that is, put the state back so that the state that you manage is coherent. Then you report to the guy who called you, possibly making some dumps along the way, or you can attempt alternate logic to circumvent the exception.
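In code, "fold your tent" means restoring the state you manage to coherence before you report up. A Java sketch of the idea (the Accounts example and its names are mine, purely illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: on a detected error, undo the partial update so the
// state we manage stays coherent, then reclassify and report to the caller.
public class Accounts {
    public static class TransferException extends Exception {
        public TransferException(String msg, Throwable cause) { super(msg, cause); }
    }

    private final Map<String, Long> balances = new HashMap<>();

    public Accounts() {
        balances.put("alice", 100L);
        balances.put("bob", 0L);
    }

    public void transfer(String from, String to, long amount) throws TransferException {
        long before = balances.get(from);
        balances.put(from, before - amount);   // first half of the update
        try {
            credit(to, amount);                // may fail partway through
        } catch (RuntimeException e) {
            balances.put(from, before);        // fold your tent: restore coherence
            throw new TransferException("transfer aborted, state restored", e);
        }
    }

    private void credit(String to, long amount) {
        if (!balances.containsKey(to)) {
            throw new IllegalStateException("no such account: " + to);
        }
        balances.put(to, balances.get(to) + amount);
    }
}
```

The caller never observes the half-done transfer; it sees either success or a named, classified error against a coherent state.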
In our database projects, what typically happens is it gets reported up, up, up the chain until you get to some very high level that then says, “Oh, I see this as one of those really bad ones. I’m going to initiate the massive dumping now.”

When you report an error, you should classify it. You should give it a name. If you’re a component that reports errors, there should be an exhaustive list of the errors that you would report.

That’s one of the real problems in today’s programming language architecture for exception handling. Each component should list the exceptions that were raised: typically if I call you and you say that you can raise A, B, and C, but you can call Joe who can raise D, E, and F, and you ignore D, E, and F, then I’m suddenly faced with D, E, and F at my level and there’s nothing in your interface that said D, E, and F errors were things you caused.

That seems to be ubiquitous in the programming and the language facilities. You are never required to say these are all the errors that might escape from a call to me. And that’s because you’re allowed to ignore errors. I’ve sometimes advocated that, no, you’re not allowed to ignore any error. You can reclassify an error and report it back up, but you’ve got to get it in the loop.
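Java's checked exceptions get partway to this discipline. Here is a hedged sketch of what Lindsay is asking for: a component declares an exhaustive list of the errors that can escape it, and anything raised by lower layers ("Joe's" D, E, and F) gets reclassified rather than leaking through undeclared. CatalogService and its exception names are invented for illustration.

```java
import java.io.IOException;

// Illustrative sketch: the throws clause is the exhaustive, named error list.
public class CatalogService {
    // Callers see only these two classified errors.
    public static class CatalogUnavailableException extends Exception {
        public CatalogUnavailableException(Throwable cause) { super(cause); }
    }
    public static class NoSuchEntryException extends Exception {
        public NoSuchEntryException(String key) { super(key); }
    }

    public String lookup(String key)
            throws CatalogUnavailableException, NoSuchEntryException {
        try {
            String value = readFromDisk(key);  // the lower layer may raise D, E, F...
            if (value == null) {
                throw new NoSuchEntryException(key);
            }
            return value;
        } catch (IOException e) {
            // ...but we reclassify and report up rather than ignore or leak it.
            throw new CatalogUnavailableException(e);
        }
    }

    private String readFromDisk(String key) throws IOException {
        // stand-in for the lower-level component
        throw new IOException("disk offline");
    }
}
```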
Security exceptions in many cases require special handling, so that they may be softened before returning any data to a client or log file. The programming language support for error detection, softening, and reporting is pathetic; programmers must literally grow their own wheat to make bread in this case.
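Growing your own wheat here might look like the following Java sketch, with the LoginHandler shape invented for illustration: keep the diagnostic detail in the server-side audit log, and soften the answer before it reaches the client.

```java
import java.util.logging.Logger;

// Illustrative sketch: security failures are logged in detail internally
// but softened to a uniform, low-information answer for the caller.
public class LoginHandler {
    private static final Logger AUDIT = Logger.getLogger("audit");

    public static class AuthFailure extends Exception {
        public AuthFailure(String detail) { super(detail); }
    }

    public String login(String user, String password) {
        try {
            authenticate(user, password);
            return "welcome";
        } catch (AuthFailure e) {
            // operators get the detail...
            AUDIT.warning("login failed for user=" + user + ": " + e.getMessage());
            // ...the client gets no hint whether the user exists,
            // the password was wrong, or the account is locked.
            return "login failed";
        }
    }

    private void authenticate(String user, String password) throws AuthFailure {
        // stand-in failure with detail we must not leak
        throw new AuthFailure("no such user in directory realm");
    }
}
```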
SB One of the interesting aspects of this is the trade-off between how long it takes to detect something and how much time you really have to recover in the system.
BL And what it means to remove the failed component, because there is a split-brain problem: I think you’re out and you think I’m out. Who’s in charge?
SB Right, and while they’re arguing about it, nothing is happening.
BL That’s also possible, although some of these systems can continue service. It’s rare that a distributed system needs the participation of all active members to perform a single action.

There is also the issue of dealing out the failed members. If there are five of us and four of us think that you’re dead, the next thing to do is make sure you’re dead by putting three more bullets in you.

We see that particularly in the area of shared management of disks, where you have two processors, two systems, connected to the same set of storage. The problem is that if one system is going to take over the storage for the other system, then the first system better not be using the storage anymore. We actually find architectural facilities in the storage subsystems for freezing out participants—so-called fencing facilities.

So if I think you’re dead and I want to take over your use and responsibility for the storage, the first thing I want to do is tell the storage, “Pay no attention to him anymore. I’m in charge now.” If you were to continue to use the storage while I blithely go forward and think I’m the one who’s in charge of it, terrible things can happen.

There are two aspects of collaboration in distributed systems. One is figure out who’s playing, and the second one is, if someone now is considered not playing, make damn sure somehow that they’re not playing.
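Fencing reduces to a simple rule: the storage remembers an epoch, a takeover bumps it, and writes from a stale owner are rejected. A toy Java sketch (FencedStore is my illustration, not any real subsystem's API):

```java
// Illustrative sketch of a fencing facility: the storage tracks an epoch,
// so a "dead" node that is in fact still running cannot keep writing.
public class FencedStore {
    private long currentEpoch = 0;
    private byte[] data = new byte[0];

    // "Pay no attention to him anymore. I'm in charge now."
    public synchronized long fence() {
        return ++currentEpoch; // the new owner is granted a higher epoch
    }

    // Every write presents the epoch its owner was granted.
    public synchronized void write(long epoch, byte[] bytes) {
        if (epoch < currentEpoch) { // stale owner: fenced out
            throw new IllegalStateException(
                "write rejected: epoch " + epoch + " < current " + currentEpoch);
        }
        data = bytes.clone();
    }
}
```

The old owner's writes now fail fast instead of silently corrupting shared state, which is exactly the "make damn sure they're not playing" half of the problem.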
This lesson applies to a lot of security in distributed systems, including authentication systems, SE(I)Ms, XML gateways, and a lot more. Part of the solution is to use tools, for example an STS, that let "security" function not as some dualistic boolean but as a composable domain with its own rules, behavior, and logic to adapt to these situations, since the languages and servers are not able to without a tremendous amount of fu. Remember that Survivability has three R's: Resistance, Recognition, and Recovery. The Anasazi certainly did.