NCSC-TG-022

 

Library No. 5-236,061

 

Version 1

 

FOREWORD

 

A Guide to Understanding Trusted Recovery in Trusted Systems provides a set of good

practices related to trusted recovery. We have written this guideline to help the vendor and

evaluator community understand the requirements for trusted recovery, as well as the level of detail

required for trusted recovery at all applicable classes, as described in the Department of Defense

Trusted Computer Systems Evaluation Criteria. In an effort to provide guidance, we make

recommendations in this technical guideline that are not requirements in the Criteria.

 

The Trusted Recovery Guide is the latest in a series of technical guidelines published by the

National Computer Security Center. These publications provide insight to the Trusted Computer

Systems Evaluation Criteria requirements for the computer security vendor and technical

evaluator. The goal of the Technical Guideline Program is to discuss each feature of the Criteria

in detail and to provide the proper interpretations with specific guidance.

 

The National Computer Security Center has established an aggressive program to study and

implement computer security technology. Our goal is to encourage the widespread availability of

trusted computer products for use by any organization desiring better protection of its important

data. One way we do this is by the Trusted Product Evaluation Program. This program focuses on

the security features of commercially produced and supported computer systems. We evaluate the

protection capabilities against the established criteria presented in the Trusted Computer System

Evaluation Criteria. This program, and an open and cooperative business relationship with the

computer and telecommunications industries, will result in the fulfillment of our country's

information systems security requirements. We resolve to meet the challenge of identifying trusted

computer products suitable for use in processing information that needs protection.

 

We invite your suggestions for revising this technical guideline. We will review this document as the

need arises.

 

30 December 1991

 

 

 

Patrick R. Gallagher, Jr.

 

Director

 

National Computer Security Center

 

ACKNOWLEDGMENTS

 

The National Computer Security Center extends special recognition and acknowledgment to

Dr. Virgil D. Gligor as the primary author of this document. James N. Menendez and Capt. James

A. Muysenberg (USAF) are recognized for the development of this guideline, and Capt.

Muysenberg is recognized for its editing and publication.

 

We wish to thank the many members of the computer security community who

enthusiastically gave their time and technical expertise in reviewing this guideline and providing

valuable comments and suggestions.

 

TABLE OF CONTENTS

 

FOREWORD   

 

ACKNOWLEDGMENTS  

 

1.0   INTRODUCTION     

 

1.1   Background 

 

1.2   Purpose    

 

1.3   Scope

 

1.4   Control Objective

 

1.5   Document Overview

 

2.0   FAILURES, DISCONTINUITIES, AND RECOVERY 

 

2.1   State-Transition (Action) Failures

 

2.2   TCB Failures     

 

2.3   Media Failures   

 

2.4   Discontinuity of Operation  

 

3.0   PROPERTIES OF TRUSTED RECOVERY    

 

3.1   Secure States    

 

3.2   Secure State Transitions    

 

4.0   DESIGN APPROACHES FOR TRUSTED RECOVERY  

 

4.1   Responsibility for Trusted Recovery

 

4.2   Some Practical Difficulties with Current Formalisms 

 

4.3   Summary of Current Approaches to Recovery     

 

4.3.1 Types of System Recovery    

 

4.3.2 Current Approaches     

 

4.3.3 Implementation of Atomic State Transitions    

 

4.3.3.1     Shadowing  

 

4.3.3.2     Logging    

 

4.3.3.3     Logging and Shadowing  

 

4.3.4 Recovery with Non-Atomic State Transitions    

 

4.3.4.1     Sources of Inconsistency--A Generic Example   

 

4.3.4.2     Non-Atomic TCB Primitives   

 

4.3.4.3     Idempotency of Recovery Procedures

 

4.3.4.4     Recovery With Non-Atomic System Primitives    

 

4.4   Design Options for Trusted Recovery

 

5.0   IMPACT OF OTHER TCSEC REQUIREMENTS ON TRUSTED RECOVERY    

 

5.1   Operational Assurance  

 

5.2   Life-Cycle Assurance   

 

5.2.1 Security Testing 

 

5.2.2 Design Specification and Verification   

 

5.2.3 Configuration Management    

 

5.2.4 Trusted Distribution   

 

5.3   Documentation    

 

5.3.1 Trusted Facility Manual

 

5.3.2 Test Documentation     

 

5.3.3 Design Documentation   

 

6.0   SATISFYING THE TCSEC REQUIREMENTS 

 

6.1   Requirements for Security Class B3

 

6.1.1 Operational Assurance  

 

6.1.1.1     System Architecture    

 

6.1.1.2     Trusted Facility Management 

 

6.1.2 Life-Cycle Assurance   

 

6.1.2.1     Security Testing 

 

6.1.2.2     Design Specification and Verification   

 

6.1.2.3     Configuration Management    

 

6.1.3 Documentation    

 

6.1.3.1     Trusted Facility Manual

 

6.1.3.2     Test Documentation     

 

6.1.3.3     Design Documentation   

 

6.2   Additional Requirements of Security Class A1        

 

6.2.1 Additional Life-Cycle Assurance Requirements  

 

6.2.1.1     Configuration Management    

 

6.2.1.2     Trusted Distribution   

 

GLOSSARY   

 

BIBLIOGRAPHY     

 

1.0   INTRODUCTION

 

1.1   BACKGROUND

 

The principal goal of the National Computer Security Center (NCSC) is to encourage the

widespread availability of trusted computer systems. In support of this goal the NCSC created a

metric, the DoD Trusted Computer System Evaluation Criteria (TCSEC) [17], against which

computer systems could be evaluated.

 

The TCSEC was originally published on 15 August 1983 as CSC-STD-001-83. In December

1985 the Department of Defense adopted it, with a few changes, as a Department of Defense

Standard, DoD 5200.28-STD. DoD Directive 5200.28, Security Requirements for Automatic

Information Systems (AISs) [10], requires the Department of Defense to use the TCSEC. The

TCSEC is the standard used for evaluating the effectiveness of security controls built into DoD

AISs.

 

The TCSEC is divided into four divisions: D, C, B, and A. These divisions are ordered in a

hierarchical manner. The TCSEC reserves the highest division (A) for systems providing the best

available level of assurance. Within divisions C and B are subdivisions known as classes, which

also are ordered in a hierarchical manner to represent different levels of security in these divisions.

 

1.2   PURPOSE

 

An important assurance requirement of the TCSEC, which appears in classes B3 to A1, is

trusted recovery. The objective of trusted recovery is to ensure the maintenance of the security and

accountability properties of a system in the face of failures and discontinuities of operation. To

accomplish this, a system should incorporate a set of mechanisms enabling it to remain in a secure

state whenever a well-defined set of anticipated failures or discontinuities occur. It also should

include a set of procedures enabling the administrators to bring the system to a secure state

whenever unanticipated failures or discontinuities occur. (Chapter 6 explains the distinction

between anticipated and unanticipated failures.)

 

Besides these mechanisms, the TCSEC's B3-A1 classes require the implementor to follow

specific design principles and practices, collectively called assurance measures. The TCSEC

further requires the developer to provide specific documentation evidence sufficient for an

evaluator or accreditor to verify that the mechanisms and assurances are sufficient to meet

specified requirements.

 

This guide presents the issues involved in the design of trusted recovery. It provides guidance

to manufacturers on what functions of trusted recovery to incorporate into their systems. It also

provides guidance to system evaluators and accreditors on how to evaluate the design and

implementation of trusted recovery functions. This document contains suggestions and

recommendations derived from TCSEC objectives but which the TCSEC does not require.

Examples in this document are not the only way of accomplishing trusted recovery. Nor are the

recommendations supplementary requirements to the TCSEC. The only measure of TCSEC

compliance is the TCSEC itself.

 

This guideline isn't a tutorial introduction to the topic of recovery. Instead, it's a summary of

trusted recovery issues that should be addressed by operating systems designed to satisfy the

requirements of the B3 and A1 classes. We assume the reader of this document is an operating

system designer or evaluator who is already familiar with the notion of recovery in operating

systems. The guide explains the security properties of system recovery (and the notion of trusted

recovery). It also defines a set of baseline requirements and recommendations for the design and

evaluation of trusted recovery mechanisms and assurance. The reader who is unfamiliar with the

notion of system recovery and security modeling required of B3 and A1 systems may find it useful

to refer both to the recovery literature (such as [1, 5, 14-16, 20-23, 25, 27]) and the security

literature (such as [3, 11, 26, 29]) cited in this guide.

 

1.3   SCOPE

 

Trusted recovery refers to mechanisms and procedures necessary to ensure that failures and

discontinuities of operation don't compromise a system's secure operation. The guidelines for

trusted recovery presented refer to the design of these mechanisms and procedures required for the

classes B3 and A1 of the TCSEC. These guidelines apply to computer systems and products built

or modified with the intention of satisfying TCSEC requirements. We make additional

recommendations derived from the stated objectives of the TCSEC.

 

This guide does not address recovery measures designed to tolerate failures caused by physical

attacks on ADP equipment, natural disasters, or water or fire damage, nor administrative measures

that deal with such events. The evaluation of these measures is beyond the scope of the TCSEC [17, p. 89].

 

1.4   CONTROL OBJECTIVE

 

Trusted recovery is one of the areas of operational assurance. The assurance control objective

states:

 

"Systems that are used to process or handle classified or other sensitive information must be

designed to guarantee correct and accurate interpretation of the security policy and must not

distort the intent of that policy. Assurance must be provided that correct implementation and

operation of the policy exists throughout the system's life-cycle." [17, p. 63]

 

This objective affects trusted recovery in two important ways. First, the design and

implementation of the recovery mechanisms and procedures must satisfy the life-cycle assurance

requirements of correct implementation and operation. Second, both a system's administrative

procedures and recovery mechanisms should ensure correct enforcement of the system security

policy in the face of system failures and discontinuities of operation. The notions of failure and

discontinuity of operation are defined in Chapter 2.

 

1.5   DOCUMENT OVERVIEW

 

This guide contains five chapters besides this introductory chapter. Chapter 2 reviews the key

notions of failure, discontinuity of operation, and recovery. Chapter 3 discusses the properties of

trusted recovery. Chapter 4 presents recovery design approaches and options that can be used for

trusted recovery. Chapter 5 discusses the impact of the other TCSEC requirements on trusted

recovery. Chapter 6 presents TCSEC requirements that affect the design and implementation of

trusted recovery functions, and includes additional recommendations corresponding to B3-A1

evaluation classes. The glossary contains the definitions of the significant terms used. Following

this is a list of the references cited in the text.

 

2.0   FAILURES, DISCONTINUITIES, AND RECOVERY

 

The TCSEC requires for security classes B3 and A1 that:

 

"Procedures and/or mechanisms shall be provided to assure that, after an ADP system failure

or other discontinuity, recovery without a protection compromise is obtained." [17, p. 39]

 

In this chapter we discuss the notions of failure and discontinuity of Trusted Computing Base

(TCB) operations, and present an informal qualitative description of their effects on system states.

We also briefly present general recovery approaches used in practice. Throughout this chapter and

document we use the term "failure" for an event causing a system function to behave inconsistently

with its informal specification. We reserve the term "discontinuity" of operation for failures caused

by user, administrator, or operator action.

 

Recovery mechanisms of computer systems are designed to respond to anticipated failures or

discontinuities of operation. These mechanisms do not handle "unanticipated" failures nor

"unanticipated" discontinuities of operation; therefore, computer-system documentation should

include descriptions of administrative procedures to handle such events. In a well-designed system,

unanticipated failures and discontinuities of operation are events expected to occur with very low

frequency, e.g., once or twice per year. For this reason, administrative procedures, as opposed to

automated mechanisms in the system, represent an adequate response to unanticipated failures and

discontinuities of operation, even when these procedures are complex and extensive.

 

One can't establish formal models of failure and discontinuity of operation in which proofs

demonstrate the model's internal consistency. Neither physical systems, such as devices,

processing units, and storage, nor behaviors of users, administrators, and operators, have formal

properties [21]. Therefore, formal modeling and specification of expected failures and

discontinuities of operation can't be required. Only informal assumptions derived from operational

experience can be made about expected failures, discontinuities, their effects, and their

frequencies. References [14, 15, 21] present examples of such assumptions. These informal

assumptions, which should be stated explicitly in system documentation, form the basis for the

design of the recovery mechanisms and the definition of the administrative recovery procedures.

 

However, recovery mechanisms and administrative procedures must reconstruct consistent

system states, or prevent state transitions to inconsistent states, as a direct response to occurrences

of expected failures or discontinuities of operation [8, 9]. A system state is "consistent" if the

variables defining it satisfy given predicates expressing formally or informally invariant properties

of the system, discussed in Section 3.1. A "state transition" is a function which changes the

variables of a system state in a specified way, i.e., as specified by constraints on the system's rules of

operation, discussed in Section 3.2. Therefore, the design of recovery mechanisms and

administrative procedures should use invariant properties and state-transition constraints of the

security model defined for the system, as discussed in Chapter 3.
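
As an illustration of preventing state transitions to inconsistent states, the following minimal Python sketch applies a transition to a copy of the state and commits it only if every invariant predicate still holds. The toy state, the two invariants, and all function names are assumptions made for this example; the invariants of a real system come from its security model.

```python
import copy

# Hypothetical invariants over a toy state (not taken from the TCSEC).
def every_object_is_labeled(state):
    return all(name in state["labels"] for name in state["objects"])

def no_orphan_labels(state):
    return all(name in state["objects"] for name in state["labels"])

INVARIANTS = [every_object_is_labeled, no_orphan_labels]

def is_consistent(state):
    """A state is consistent if every invariant predicate holds."""
    return all(invariant(state) for invariant in INVARIANTS)

def guarded_transition(state, transition):
    """Apply `transition` to a copy; commit only if the result is consistent."""
    candidate = copy.deepcopy(state)
    transition(candidate)
    if not is_consistent(candidate):
        return state, False        # refuse: the original state is kept
    return candidate, True

def create_labeled(state):
    state["objects"].add("report")
    state["labels"]["report"] = "SECRET"

def create_unlabeled(state):
    state["objects"].add("memo")   # forgets to attach a label

state = {"objects": set(), "labels": {}}
state, ok_good = guarded_transition(state, create_labeled)
state, ok_bad = guarded_transition(state, create_unlabeled)
```

The second transition is rejected because it would leave an object without a label, so the state committed by the first transition is retained.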

 

The role of recovery mechanisms and of trusted recovery can be best understood by

illustrating the effect of failures and discontinuities of operation on typical systems. Informal and

qualitative assumptions of failures derived from operational experience with various systems have

been presented in the literature [14, 15, 21]. Using these informal assumptions we can define

general classes of failures that affect the operation of a TCB.

 

One class of failures is identical to the class of errors caused when users pass wrong

parameters to TCB primitives, or invoke the wrong TCB primitives, and when system resources

are exhausted or found in an inconsistent state because of user actions. These are called state-

transition failures or action failures. We cover this type of user-induced failure, which falls more

naturally in the area of exception processing, for two reasons: (1) the failures of this class are,

nevertheless, TCB domain failures regardless of their cause; and (2) the processing of these

failures, not just their specification and documentation, is relevant to system security.

 

For example, incorrect error processing can bring the system into a state where a user cannot

communicate with the TCB, or can contribute to the mishandling of covert channels. However, we

place the major emphasis in this guideline on the more traditional notions of failure, namely TCB

failures, media failures, and administrator-induced discontinuity of operation.

 

2.1   STATE-TRANSITION (ACTION) FAILURES

 

State-transition failures, also called action failures, occur whenever a TCB primitive, which

causes a state transition, cannot complete its function because it detects exceptional conditions

during its execution. State-transition failures can be caused by bad parameters passed to TCB

primitives, by exhaustion of resource limits, by missing objects needed during TCB primitive

execution, and so on.

 

The effects of state-transition failures on TCB states are not as far-reaching as those of other

failures. Because these failures occur often, the code of TCB primitives usually includes recovery

mechanisms that undo the temporary modifications of system states before the primitive's return,

thus returning the system to a consistent state. If the recovery mechanisms of TCB primitives fail

to undo temporary modifications of system states, the system may remain in an inconsistent state

and eventually crash. A crash is a failure that causes the processors' registers to be reset to some

standard values [21]. Because consistent system states cannot be recovered from processor and

primary memory registers after a crash, these registers are referred to as "volatile" storage. In

contrast, consistent system states can usually be recovered from magnetic media such as disks and

tapes; these media are called "nonvolatile" storage.

 

Examples of recovery mechanisms included in TCB primitives to undo temporary state

modifications after state-transition failures are found in most contemporary operating systems. For

instance, consider the "creat" primitive of a hypothetical UNIX(R) system which allocates i-node

table entries before allocating file table entries [1]. If the file table is full at the time the "creat"

call is made, a state-transition failure would occur. Before returning to the caller, the recovery code

of "creat" deallocates the i-node table entry allocated for the file that couldn't be created. Failure

to deallocate such entries would cause the i-node table to fill up and remain full, causing a system

crash.

 

(R) UNIX is a registered trademark of UNIX System Laboratories, Inc.
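
The undo behavior described for "creat" can be sketched as follows. This is a toy Python model, not actual UNIX kernel code; the Table class, the TableFull exception, and the table sizes are hypothetical names and values chosen for the example.

```python
class TableFull(Exception):
    """Raised when a fixed-size table has no free entry (hypothetical)."""

class Table:
    def __init__(self, size):
        self.size = size
        self.used = set()

    def allocate(self):
        for slot in range(self.size):
            if slot not in self.used:
                self.used.add(slot)
                return slot
        raise TableFull()

    def free(self, slot):
        self.used.discard(slot)

def creat(inode_table, file_table):
    """Toy model of "creat": i-node entry first, then file table entry."""
    inode = inode_table.allocate()
    try:
        fd = file_table.allocate()
    except TableFull:
        # Recovery code: undo the temporary modification before returning.
        inode_table.free(inode)
        raise
    return inode, fd

inodes = Table(size=4)
files = Table(size=1)
creat(inodes, files)          # succeeds and uses the only file table entry
try:
    creat(inodes, files)      # state-transition failure: file table is full
    failed = False
except TableFull:
    failed = True
```

After the failed call, the i-node table holds only the entry of the file that was created, so repeated failures cannot fill it.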

 

2.2   TCB FAILURES

 

TCB failures occur whenever the TCB code detects an error below the TCB primitives'

interface which can't be fixed; i.e., the error cannot be masked. TCB failures are caused by

persistent inconsistencies in critical system tables, by wild branches of the TCB code (possibly

caused by transient hardware failures), by power failures, by processor failures, and so on. TCB

failures always cause a system crash.

 

In systems providing a high degree of hardware fault tolerance, system crashes still occur

because of software errors. Since crashes cause volatile storage to be lost, and since nonvolatile

media usually survive crashes, recovery mechanisms can reconstruct consistent states in a

maintenance mode of operation. After reconstructing a consistent state, the recovery mechanisms

restart the system with no process execution in progress; i.e., processes that were active, blocked,

or swapped out before the crash are aborted. New processes, which run the code of aborted

processes executing at the time of the crash, can be started by users after the consistent state is

reconstructed. Recovery mechanisms can reconstruct consistent states by either removing or

completing incomplete updates of various objects represented on nonvolatile media. Properties of

and design approaches for recovery mechanisms able to reconstruct consistent states from

nonvolatile storage after TCB failures are discussed in Section 3.2 and Chapter 4.
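
The idea of removing or completing incomplete updates can be illustrated with a small Python sketch. It assumes a simple log of update and commit records kept on nonvolatile storage, which is only one possible design; the record layout, the key names, and the recover function are hypothetical, and the sketch assumes that concurrent transactions touch disjoint keys.

```python
def recover(disk, log):
    """Rebuild a consistent disk image from a log that survived the crash.

    Each record is (kind, txn, key, old, new); kind is "update" or "commit".
    Assumes that different transactions update disjoint keys.
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Complete committed updates by reapplying their new values in log order.
    for kind, txn, key, old, new in log:
        if kind == "update" and txn in committed:
            disk[key] = new
    # Remove uncommitted updates by restoring old values in reverse order.
    for kind, txn, key, old, new in reversed(log):
        if kind == "update" and txn not in committed:
            if old is None:
                disk.pop(key, None)
            else:
                disk[key] = old
    return disk

log = [
    ("update", "t1", "file_a.label", None, "SECRET"),
    ("commit", "t1", None, None, None),
    ("update", "t2", "file_b.label", "SECRET", "UNCLASSIFIED"),
    # crash: t2 never committed
]
# At the crash, t1's write had not reached the disk but t2's write had.
disk = {"file_b.label": "UNCLASSIFIED"}
recovered = recover(disk, log)
```

In this example the committed label assignment is completed and the half-finished label change is removed. Running the procedure a second time leaves the result unchanged, which matters if a crash interrupts recovery itself.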

 

Some TCB failures allow a system to shut down in an orderly manner. These failures may be

caused by process swap-space exhaustion, timer-interrupt table exhaustion, and, in general, by

conditions that can't be handled by TCB primitives themselves in normal modes of operation.

Traps originated by persistent hardware failures, such as memory and bus parity errors, also may

cause failures.

 

2.3   MEDIA FAILURES

 

Media failures occur whenever errors are detected on some nonvolatile storage device that the

TCB cannot fix (i.e., the errors can't be masked). Media failures are caused by hardware failures

such as disk head crashes, persistent read/write failures due to misaligned heads, worn-out

magnetic coating, dust on the disk surface, and so on. They also are caused by software failures

such as TCB failures which make media unreadable.

 

The effect of media failures is that part, or all, of the media representing TCB objects becomes

inaccessible or corrupt. Data structures relevant to system security also may be corrupted by

media failures, e.g., object security labels. The system usually crashes unless the lost data can be

retrieved from archival storage and rebuilt on a redundant storage device. Of course, media failures

that don't affect TCB objects may not cause system crashes. If redundant media aren't available,

or if users and administrators don't keep archival data up-to-date, media failures may become

unrecoverable failures. Administrative recovery procedures may have to be used to bring the

system to a consistent state. As discussed in Chapters 5 and 6, all these procedures should be

explained in the system's Trusted Facility Manual.

 

2.4   DISCONTINUITY OF OPERATION

 

Failures induced by users, administrators, and operators cause discontinuities of operation.

Inside an operating system, discontinuities of operation manifest themselves most often as state-

transition failures, TCB failures, and, less often, as media failures. They are caused by erroneous

actions, such as unexpected system shutdowns, e.g., by turning off the power. Also, they can be

caused by lack of action, such as ignoring the exhaustion of critical system resources under

administrative control despite documented or on-line warnings, e.g., audit trail is 95% full,

insufficient swap space left, inadequate configuration installed, etc.

 

The effects of discontinuities of operation are the same as those of the state-transition and TCB

failures mentioned above. Recovery mechanisms or administrative procedures necessary for the

reconstruction of a consistent state also are correspondingly similar to those used for failures. For

example, cancellation of a TCB primitive call by depressing the "break" key during the call's

execution might have the same effect as a state-transition failure detected by the TCB primitive.

Each TCB primitive and state transition would have to be designed either to ignore user

cancellation signals during execution of critical code sections or to clean up internal data structures

during the processing of such signals.
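
The first of these two design options, ignoring cancellation during critical code sections, can be modeled in Python as shown below. This is a sketch only: it models the deferral logic with an ordinary object rather than real operating system signals, and the CancellationGate class and its method names are assumptions made for the example.

```python
from contextlib import contextmanager

class Cancelled(Exception):
    """Models a user cancellation signal such as the "break" key."""

class CancellationGate:
    def __init__(self):
        self.depth = 0          # nesting depth of critical sections
        self.pending = False    # a cancellation arrived while deferred

    def request_cancel(self):
        """Called on user cancellation; raises only outside critical code."""
        if self.depth > 0:
            self.pending = True
        else:
            raise Cancelled()

    @contextmanager
    def critical(self):
        self.depth += 1
        try:
            yield
        finally:
            self.depth -= 1
        # Deliver a deferred cancellation once the update has completed.
        if self.depth == 0 and self.pending:
            self.pending = False
            raise Cancelled()

gate = CancellationGate()
free_list, in_use = [1, 2, 3], []
try:
    with gate.critical():
        block = free_list.pop()
        gate.request_cancel()   # the "break" arrives in mid-update
        in_use.append(block)    # the update still completes
    cancelled = False
except Cancelled:
    cancelled = True
```

The cancellation is honored, but only after the block has been moved from one list to the other, so neither internal data structure is left half-updated.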

 

Actions such as system shutdowns by power-off action during execution of TCB code may

cause TCB failures. Recovery mechanisms for TCB failures caused by power failures also may be

able to handle unexpected system shutdowns. In either case, during subsequent power-on

procedures, the TCB not only detects that TCB failures left the system in an inconsistent state, but

also initiates recovery of a consistent state before the system enters the normal mode of operation.
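
One common way to detect at power-on that the previous session did not end in an orderly manner is a clean-shutdown marker kept on nonvolatile storage. The Python sketch below illustrates that technique under stated assumptions: the guideline does not prescribe this mechanism, and the marker name, the mode names, and the boot, shutdown, and recover_ok functions are hypothetical.

```python
def shutdown(nonvolatile):
    """Orderly shutdown: record that the state on disk is consistent."""
    nonvolatile["clean_shutdown"] = True

def boot(nonvolatile, recover):
    """Return the mode of operation entered after power-on."""
    if not nonvolatile.get("clean_shutdown", False):
        # The previous session ended without an orderly shutdown, so
        # recovery must succeed before normal operation may begin.
        if not recover(nonvolatile):
            return "maintenance"
    nonvolatile["clean_shutdown"] = False   # the system is running again
    return "normal"

calls = []

def recover_ok(nonvolatile):
    """Hypothetical recovery routine that removes an incomplete update."""
    calls.append("recover")
    nonvolatile.pop("half_written", None)
    return True

disk = {"clean_shutdown": True}
mode_first = boot(disk, recover_ok)     # clean start: no recovery needed
disk["half_written"] = "x"              # power-off during TCB code
mode_second = boot(disk, recover_ok)    # marker is absent: recovery runs
```

If the recovery routine reports failure, the sketch stays in a maintenance mode, where administrative recovery procedures would apply.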

 

Somewhat less often, administrator or operator actions cause media failures. For example,

initiation of on-line diagnostic tests of a media controller during normal mode of system operation,

instead of the maintenance mode, would most likely cause media failures. Similarly, initiation of

TCB maintenance actions such as disk reformatting in the normal mode of operation would

certainly cause subsequent media failures. Discontinuity of operation caused by administrator- or

operator-induced failures may require use of administrative recovery procedures.

 

3.0   PROPERTIES OF TRUSTED RECOVERY

 

The properties of trusted recovery are defined in terms of two notions: secure states and secure

state transitions. A system state is secure whenever consistency invariants derived from valid

interpretations of security and accountability models are satisfied. A state transition is secure if

both its input state and its output state are secure, and it satisfies the constraints placed on it by valid

interpretations of security policy and accountability policy models.
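
The three-part definition of a secure state transition can be written out as a short Python sketch. The invariant (every object carries a label) and the constraint (the label of an existing object does not change) are hypothetical examples chosen for illustration, not properties required by any particular security model.

```python
def is_secure_state(state, invariants):
    """A state is secure if all consistency invariants hold."""
    return all(invariant(state) for invariant in invariants)

def is_secure_transition(before, after, invariants, constraints):
    """Secure input state, secure output state, and all constraints met."""
    return (is_secure_state(before, invariants)
            and is_secure_state(after, invariants)
            and all(constraint(before, after) for constraint in constraints))

# Hypothetical invariant: every object carries a label.
def all_labeled(state):
    return all(label is not None for label in state.values())

# Hypothetical constraint: labels of surviving objects never change.
def labels_unchanged(before, after):
    return all(after[name] == label
               for name, label in before.items() if name in after)

s0 = {"plan": "SECRET"}
s1 = {"plan": "SECRET", "memo": "UNCLASSIFIED"}   # a new object is created
s2 = {"plan": "UNCLASSIFIED"}                     # the label is changed

created = is_secure_transition(s0, s1, [all_labeled], [labels_unchanged])
relabeled = is_secure_transition(s0, s2, [all_labeled], [labels_unchanged])
```

The second transition links two states that each satisfy the invariant, yet it is not secure because it violates the transition constraint. This is why invariants on states alone do not suffice.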

 

Accountability models include models of user authentication, trusted path, and audit. The

notions of invariants for secure states and constraints for specific state transitions are briefly

illustrated in this chapter and discussed in detail in reference [11]. Reference [29] discusses the

notion of a valid interpretation of a security model in detail and reference [3] illustrates it. For the

sake of brevity, interpretations of security models aren't illustrated in this guideline.

 

3.1   SECURE STATES

 

State-machine (or "state-transition") models of security, such as the Bell-La Padula