Our countdown, brought to you by SearchStorage.com high availability expert Evan Marcus, includes some common-sense tips for the everyday storage admin to follow.
#9: Invest in failure isolation
Applications should check for every error condition
- Act on errors as soon as they are found
- This requires developer training
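As a rough illustration of checking every error condition on a write path, here is a minimal Python sketch. The names save_record and save_or_report are hypothetical and not from the original tip. The sketch assumes a POSIX-like file system and treats any OSError from open, write, fsync, or close as a failure that the caller must act on.

```python
import os
import tempfile


def save_record(path: str, data: bytes) -> None:
    """Write data to path; any failing step raises OSError to the caller."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < len(data):
            # os.write may write fewer bytes than requested, so loop.
            n = os.write(fd, data[written:])
            if n == 0:
                raise OSError("write made no progress")
            written += n
        # Ask the OS to push the data to the device before reporting success.
        os.fsync(fd)
    finally:
        os.close(fd)


def save_or_report(path: str, data: bytes) -> bool:
    """Act on the error at the point of detection instead of ignoring it."""
    try:
        save_record(path, data)
    except OSError as exc:
        print(f"save failed for {path}: {exc}")
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(save_or_report(os.path.join(tmp, "record.bin"), b"payload"))
        print(save_or_report(os.path.join(tmp, "no-such-dir", "x"), b"payload"))
```

What "acting on" an error should mean (retry, alert, fail over) depends on the application. The sketch only reports the failure and returns a status the caller cannot mistake for success.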
Failure in one component shouldn't propagate
- Network failures not seen by router or network management layer
- Disk failures not seen by application after write error
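One way to keep a failure in one component from propagating is to contain it at the component boundary and record it explicitly. The following Python sketch assumes that each component can be modeled as a callable. The names Result and run_isolated are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Result:
    name: str
    ok: bool
    value: object = None
    error: Optional[str] = None


def run_isolated(components: Dict[str, Callable[[], object]]) -> List[Result]:
    """Run each component; a failure in one is recorded, not propagated."""
    results: List[Result] = []
    for name, fn in components.items():
        try:
            results.append(Result(name, True, fn()))
        except Exception as exc:
            # Contain the failure here and make it visible to the caller.
            results.append(
                Result(name, False, error=f"{type(exc).__name__}: {exc}")
            )
    return results


def failing_disk_check() -> str:
    raise IOError("simulated write error")


if __name__ == "__main__":
    for r in run_isolated({
        "network": lambda: "link up",
        "disk": failing_disk_check,
        "cache": lambda: "warm",
    }):
        print(r)
```

The failure is still reported in the returned list, so isolating it does not hide it. The other components simply keep running.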
Catching errors late probably means data corruption
- Error has propagated through system
- May leave other unknown side effects
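To catch an error before it spreads, a component can verify data at its boundary. The Python sketch below is one possible approach and is not prescribed by the original tip: it prefixes each payload with a SHA-256 digest and refuses to hand corrupted data onward. The names pack and unpack are hypothetical.

```python
import hashlib

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def pack(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest."""
    return hashlib.sha256(payload).digest() + payload


def unpack(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if the blob is damaged."""
    if len(blob) < DIGEST_LEN:
        raise ValueError("blob too short to contain a digest")
    digest, payload = blob[:DIGEST_LEN], blob[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        # Stop here, before the bad data reaches the rest of the system.
        raise ValueError("checksum mismatch: data is corrupt")
    return payload


if __name__ == "__main__":
    blob = pack(b"customer record")
    print(unpack(blob))
    damaged = blob[:-1] + bytes([blob[-1] ^ 0xFF])  # flip bits in last byte
    try:
        unpack(damaged)
    except ValueError as exc:
        print(f"caught early: {exc}")
```

The check costs one hash per read, but a corrupted record is rejected where it enters the component instead of being discovered later through its side effects.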
Evan's book, Blueprints for High Availability: Designing Resilient Distributed Systems, is available in our bookstore.
This material is copyright 1997-2002 by Evan Marcus and Hal L. Stern. It may not be used in whole or part for commercial purposes without the express permission of both authors.
This was first published in December 2002