FM Past Program: Fault-Tolerant Systems

This page is not maintained. It is probably out of date.

The goal of this focus area was to create a formalized theory of fault-tolerance including redundancy management, clock synchronization, Byzantine agreement, voting, etc. Much of the theory developed here is applicable to future fault-tolerant systems designs A detailed design of a fault-tolerant reliable computing base, the Reliable Computing Platform (RCP), has been developed and proven correct. It is hoped that the RCP will serve as a demonstration of the formal methods process and provide a foundation that can be expanded and used for future aerospace applications. It is one of the largest formal verifications ever performed.

The RCP architecture was designed in accordance with a system-design philosphy called ``Design For Validation''. The basic tenets of this design philosophy can be summarized in the following four statements:

  1. A system is designed in such a manner that complete and accurate models can be constructed to estimate critical properties such as reliability and performance. All parameters of the model that cannot be deduced from the logical design must be measured. All such parameters must be measurable within a feasible amount of time.

  2. The design process makes tradeoffs in favor of designs that minimize the number of parameters that must be measured in order to reduce the validation cost. A design that has exceptional performance properties yet requires the measurement of hundreds of parameters (for example, by time-consuming fault-injection experiments) would be rejected over a less capable system that requires minimal expermimentation.

  3. The system is designed and verified using rigourous mathematical techniques, usually refered to as a formal verification. It is assumed that the formal verification makes system failure due to design faults negligible so the reliability model does not include transitions representing design errors.

  4. The reliability (or performance) model is shown to be accurate with respect to the system implementation. This is accomplished analytically not experimentally.

Thus, a major objective of this approach is to minimize the amount of experimental testing required and maximize the ability to reason mathematically about correctness of the design. Although testing cannot be eliminated from the design/validation process, the primary basis of belief in the dependability of the system must come from analysis rather that from testing.

Curator and Responsible NASA Official: C. Michael Holloway
last modified: 2 August 2000 (10:13:08)