FM Past Program: Fault-Tolerant Systems
This page is not maintained.
It is probably out of date.
The goal of this focus area was to create a formalized theory of
fault-tolerance including redundancy management,
clock synchronization,
Byzantine agreement,
voting, etc.
Much of the theory
developed here is applicable to future fault-tolerant systems designs
A detailed design of a fault-tolerant reliable computing base, the
Reliable Computing Platform (RCP),
has been developed and proven
correct. It is hoped that the RCP will serve as a demonstration of
the formal methods process and provide a foundation that can be
expanded and used for future aerospace applications. It is one of the
largest formal verifications ever performed.
The RCP architecture was designed in accordance with a system-design philosphy
called ``Design For
Validation''.
The basic tenets of this design philosophy can be summarized in the
following four statements:
- A system is designed in such a manner that complete and accurate
models can be constructed to estimate critical properties such as
reliability and performance. All parameters of the model
that cannot be deduced from the logical design must be measured. All
such parameters must be measurable within a feasible amount of time.
- The design process makes tradeoffs in favor of designs that
minimize the number of parameters that must be measured in order to reduce the
validation cost. A design that has exceptional performance
properties yet requires the measurement of hundreds of parameters (for
example, by time-consuming fault-injection experiments) would be rejected
over a less capable system that requires minimal expermimentation.
- The system is designed and verified using rigourous
mathematical techniques, usually refered to as a formal
verification. It is assumed that the formal verification makes
system failure due to design faults negligible so the reliability model
does not include transitions representing design errors.
- The reliability (or performance) model is shown to be accurate
with respect to the system implementation. This is accomplished
analytically not experimentally.
Thus, a major objective of this approach is to minimize the amount of
experimental testing required and maximize the ability to reason
mathematically about correctness of the design. Although testing
cannot be eliminated from the design/validation process, the
primary basis of belief in the dependability of the system must come
from analysis rather that from testing.
Curator and Responsible NASA Official: C. Michael Holloway
last modified: 2 August 2000 (10:13:08)