The redundancy management strategies of virtually all fault-tolerant systems depend on some form of voting which in turn depends upon synchronization. Although in many systems the clock synchronization function has not been decoupled from the applications (e.g. the redundant versions of the applications synchronize by messages), research and experience have led us to believe that solving the synchronization problem independently from the applications design can provide significant simplification of the system The operating system is built on top of this clock-synchronization foundation. Of course, the correctness of this foundation is essential. Thus, the clock synchronization algorithm and its implementation are prime candidates for formal methods. The verification strategy shown below is being explored.
Maximum Clock Skew Property ^ | | proof | Synchronization Algorithm ^ | | proof | Digital Circuit Implementation
The top-level in the hierarchy is an abstract property of the form:
where \delta is the maximum clock skew guaranteed by the algorithm as long as a sufficient number of clocks (and the processors they are attached to) are working. The function C_p(t) gives the value of clock p at real time t.
The middle level in the hierarchy is a mathematical definition of the synchronization algorithm. The bottom level is a detailed digital design of a circuit that implements the algorithm. The bottom level is sufficiently detailed to make translation into silicon straight forward.
The verification process involves two important steps:
The first step has already been completed by SRI International. The first such proof was accomplished during the design and verification of SIFT. The proof was done by hand in the style of most journal proofs. More recently this proof step has been mechanically verified using the Ehdm theorem prover. In addition, SRI has mechanically verified Schneider's clock synchronization paradigm using Ehdm. A further generalization was found at NASA Langley; in particular, the bounded delay assumption was shown to follow from the other assumptions of the theory. The design of a digital circuit to distribute clock values in support of fault-tolerant synchronization has been completed by SRI International and is currently being formally verified
Work was done at NASA Langley to design and implement a fault-tolerant clock synchronization circuit capable of recovery from transient faults. The circuit was implemented using programmable logic devices (PLDs) and FOXI fiber optic communications chips. Unlike the SRI circuit, the convergence algorithm was implemented in hardware.