ACM has a great article: Bruce Lindsay interviewed by Steve Bourne (of Bourne shell fame) on the topic of designing for failure. The interview contains a metric ton of in-the-trenches wisdom about what can go wrong and what your program can do about it. Topics include error detection, reporting, recovery, and heisenbugs. While much of the discussion is database-centric, the principles are relevant to any application.