I recently got to work on a project with Howard Lipson. One interesting concept he has been working on for a while is survivability in software systems. Survivability encompasses a number of domains I have seen in deployment and operations, but it brings them together into a unifying structure. Howard's slides from a recent presentation, "Cyber Security and Control System Survivability," are online.
Traditional computer security is not adequate to keep highly distributed systems running in the face of cyber attacks. Survivability is an emerging discipline - a risk-management-based security paradigm.
Survivability is defined as:
the ability of a system to fulfill its mission, in a timely manner, in the presence of attacks, failures, or accidents.
The 3 R's of Survivability
Resistance - ability of a system to repel attacks
Recognition - ability to recognize attacks and the extent of the damage
Recovery - ability to restore essential services during attack, and recover full services after attack
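The recognition and recovery properties above can be sketched in code. This is a minimal illustration, not anything from Howard's material; all names (`full_service`, `essential_service`, `handle_request`) are hypothetical. The key idea is that when the full system fails, the mission (serving requests) survives in a degraded, essential-only mode.

```python
# Hypothetical sketch of Recognition and Recovery for a single service.
# When the full service fails, the request is still answered by an
# essential fallback: the mission survives even if a component does not.

def full_service(healthy):
    """Normal operation; fails when the component is compromised."""
    if not healthy:
        raise RuntimeError("full service unavailable")
    return "full response"

def essential_service():
    """Reduced functionality that keeps the mission alive."""
    return "degraded but essential response"

def handle_request(healthy):
    try:
        return full_service(healthy)
    except RuntimeError:            # Recognition: detect the failure
        return essential_service()  # Recovery: restore essential service

print(handle_request(healthy=True))   # full response
print(handle_request(healthy=False))  # degraded but essential response
```

Resistance is harder to show in a few lines, since it lives in input validation, authentication, and deployment hardening rather than in one control-flow construct.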
And the fundamental goal of survivability:
The mission must survive
Not any individual component
Not even the system itself
This concept reminds me of something I heard Guy Kawasaki say about building software startups. Everyone, he said, focuses on what it takes to get the plane in the air, but not many people focus on what it takes to keep the plane in the air. In reality, the plane should spend much more time in the air than taking off.
I agree with Howard that this is an emerging discipline, and there are numerous issues to consider. I blogged about an interview with Bruce Lindsay that discusses some of them, not least the total lack of programming-language support for error detection. In Lindsay's words:
In fact, there is zero language support for detection. What we see in the languages are facilities for dealing with the error once it has been discovered. Throwing an exception in the language is something the logic of the program does.
Most of the scripting languages, for example, have very little support at the language and semantic level for dealing with exceptions. And at the end of the day, most of what’s in the languages is stuff that you could have coded yourself.
There are some fairly dangerous features in languages—in particular, the raise error or throw exception and the handlers. How does that relate to the stack of procedure calls? What we see in some early approaches to language-supported error handling is that the stack is peeled back without doing anything until you find some level in the stack that has declared that it’s interested in handling the particular exception that’s in the error at the moment.
In general, folding a procedure or subroutine activation—method activation—without cleaning up the mess that may have already been made, the partially completed state transformation of that function, is very dangerous. If there have been memory allocations, for example, and you just peel back the stack entry, those memory allocations are likely not to be undone.
So it’s very important that at every level of the procedure activation, those procedures be given a chance to fold up their tent neatly, even if they can’t deal with the exception.
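Lindsay's point about letting every procedure activation "fold up its tent" can be made concrete. The sketch below is my own illustration (the resource-tracking names are invented): each frame wraps its partial state in a cleanup handler that runs during unwinding, even though neither frame can actually handle the exception.

```python
# Illustrative sketch: each frame releases what it acquired while the
# exception propagates, so peeling back the stack leaves no partially
# completed state behind.

open_resources = []  # stands in for memory allocations, locks, handles

def acquire(name):
    open_resources.append(name)
    return name

def release(name):
    open_resources.remove(name)

def inner():
    buf = acquire("inner-buffer")
    try:
        raise RuntimeError("failure deep in the call stack")
    finally:
        release(buf)  # runs during unwinding; inner() has no handler

def outer():
    lock = acquire("outer-lock")
    try:
        inner()
    finally:
        release(lock)  # this frame also folds up its tent

try:
    outer()
except RuntimeError as e:
    print("handled at top level:", e)

print("leaked resources:", open_resources)  # → []
```

Without the `finally` clauses, the intermediate frames would be discarded with their allocations still registered, which is exactly the danger Lindsay describes in early language-supported error handling.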
Programming languages and frameworks need to be built to support the non-functional qualities of a system, not just the functional ones.