NCSC-TG-022
Library No. 5-236,061
Version 1
FOREWORD
A Guide to Understanding Trusted Recovery in Trusted
Systems provides a set of good
practices related to trusted recovery. We have written
this guideline to help the vendor and
evaluator community understand the requirements for
trusted recovery, as well as the level of detail
required for trusted recovery at all applicable classes,
as described in the Department of Defense
Trusted Computer System Evaluation Criteria. In an
effort to provide guidance, we make
recommendations in this technical guideline that are not
requirements in the Criteria.
The Trusted Recovery Guide is the latest in a series of
technical guidelines published by the
National Computer Security Center. These publications
provide insight into the Trusted Computer
System Evaluation Criteria requirements for the computer
security vendor and technical
evaluator. The goal of the Technical Guideline Program is
to discuss each feature of the Criteria
in detail and to provide the proper interpretations with
specific guidance.
The National Computer Security Center has established an
aggressive program to study and
implement computer security technology. Our goal is to
encourage the widespread availability of
trusted computer products for use by any organization
desiring better protection of its important
data. One way we do this is by the Trusted Product
Evaluation Program. This program focuses on
the security features of commercially produced and
supported computer systems. We evaluate the
protection capabilities against the established criteria
presented in the Trusted Computer System
Evaluation Criteria. This program, and an open and
cooperative business relationship with the
computer and telecommunications industries, will result
in the fulfillment of our country's
information systems security requirements. We resolve to
meet the challenge of identifying trusted
computer products suitable for use in processing
information that needs protection.
We invite your suggestions for revising this technical
guideline. We will review this document as the
need arises.
30 December 1991
Patrick R. Gallagher, Jr.
Director
National Computer Security Center
ACKNOWLEDGMENTS
The National Computer Security Center extends special
recognition and acknowledgment to
Dr. Virgil D. Gligor as the primary author of this
document. James N. Menendez and Capt. James
A. Muysenberg (USAF) are recognized for the development
of this guideline, and Capt.
Muysenberg is recognized for its editing and publication.
We wish to thank the many members of the computer
security community who
enthusiastically gave their time and technical expertise
in reviewing this guideline and providing
valuable comments and suggestions.
TABLE OF CONTENTS
FOREWORD
ACKNOWLEDGMENTS
1.0 INTRODUCTION
1.1 Background
1.2 Purpose
1.3 Scope
1.4 Control Objective
1.5 Document Overview
2.0 FAILURES, DISCONTINUITIES, AND RECOVERY
2.1 State-Transition (Action) Failures
2.2 TCB Failures
2.3 Media Failures
2.4 Discontinuity of Operation
3.0 PROPERTIES OF TRUSTED RECOVERY
3.1 Secure States
3.2 Secure State Transitions
4.0 DESIGN APPROACHES FOR TRUSTED RECOVERY
4.1 Responsibility for Trusted Recovery
4.2 Some Practical Difficulties with Current Formalisms
4.3 Summary of Current Approaches to Recovery
4.3.1 Types of System Recovery
4.3.2 Current Approaches
4.3.3 Implementation of Atomic State Transitions
4.3.3.1 Shadowing
4.3.3.2 Logging
4.3.3.3 Logging and Shadowing
4.3.4 Recovery with Non-Atomic State Transitions
4.3.4.1 Sources of Inconsistency--A Generic Example
4.3.4.2 Non-Atomic TCB Primitives
4.3.4.3 Idempotency of Recovery Procedures
4.3.4.4 Recovery With Non-Atomic System Primitives
4.4 Design Options for Trusted Recovery
5.0 IMPACT OF OTHER TCSEC REQUIREMENTS ON TRUSTED RECOVERY
5.1 Operational Assurance
5.2 Life-Cycle Assurance
5.2.1 Security Testing
5.2.2 Design Specification and Verification
5.2.3 Configuration Management
5.2.4 Trusted Distribution
5.3 Documentation
5.3.1 Trusted Facility Manual
5.3.2 Test Documentation
5.3.3 Design Documentation
6.0 SATISFYING THE TCSEC REQUIREMENTS
6.1 Requirements for Security Class B3
6.1.1 Operational Assurance
6.1.1.1 System Architecture
6.1.1.2 Trusted Facility Management
6.1.2 Life-Cycle Assurance
6.1.2.1 Security Testing
6.1.2.2 Design Specification and Verification
6.1.2.3 Configuration Management
6.1.3 Documentation
6.1.3.1 Trusted Facility Manual
6.1.3.2 Test Documentation
6.1.3.3 Design Documentation
6.2 Additional Requirements of Security Class A1
6.2.1 Additional Life-Cycle Assurance Requirements
6.2.1.1 Configuration Management
6.2.1.2 Trusted Distribution
GLOSSARY
BIBLIOGRAPHY
1.0 INTRODUCTION
1.1 BACKGROUND
The principal goal of the National Computer Security
Center (NCSC) is to encourage the
widespread availability of trusted computer systems. In
support of this goal the NCSC created a
metric, the DoD Trusted Computer System Evaluation
Criteria (TCSEC) [17], against which
computer systems could be evaluated.
The TCSEC was originally published on 15 August 1983 as
CSC-STD-001-83. In December
1985 the Department of Defense adopted it, with a few
changes, as a Department of Defense
Standard, DoD 5200.28-STD. DoD Directive 5200.28,
Security Requirements for Automatic
Information Systems (AISs) [10], requires the Department
of Defense to use the TCSEC. The
TCSEC is the standard used for evaluating the
effectiveness of security controls built into DoD
AISs.
The TCSEC is divided into four divisions: D, C, B, and A.
These divisions are ordered in a
hierarchical manner. The TCSEC reserves the highest
division (A) for systems providing the best
available level of assurance. Within divisions C and B
are subdivisions known as classes, which
also are ordered in a hierarchical manner to represent
different levels of security in these divisions.
1.2 PURPOSE
An important assurance requirement of the TCSEC, which
appears in classes B3 to A1, is
trusted recovery. The objective of trusted recovery is to
ensure the maintenance of the security and
accountability properties of a system in the face of
failures and discontinuities of operation. To
accomplish this, a system should incorporate a set of
mechanisms enabling it to remain in a secure
state whenever a well-defined set of anticipated failures
or discontinuities occur. It also should
include a set of procedures enabling the administrators
to bring the system to a secure state
whenever unanticipated failures or discontinuities occur.
(Chapter 2 explains the distinction
between anticipated and unanticipated failures.)
Besides these mechanisms, the TCSEC's B3-A1 classes
require the implementor to follow
specific design principles and practices, collectively
called assurance measures. The TCSEC
further requires the developer to provide specific
documentation evidence sufficient for an
evaluator or accreditor to verify that the mechanisms and
assurances are sufficient to meet
specified requirements.
This guide presents the issues involved in the design of
trusted recovery. It provides guidance
to manufacturers on what functions of trusted recovery to
incorporate into their systems. It also
provides guidance to system evaluators and accreditors on
how to evaluate the design and
implementation of trusted recovery functions. This
document contains suggestions and
recommendations derived from TCSEC objectives but which
the TCSEC does not require.
Examples in this document are not the only way of
accomplishing trusted recovery. Nor are the
recommendations supplementary requirements to the TCSEC.
The only measure of TCSEC
compliance is the TCSEC itself.
This guideline isn't a tutorial introduction to the topic
of recovery. Instead, it's a summary of
trusted recovery issues that should be addressed by
operating systems designed to satisfy the
requirements of the B3 and A1 classes. We assume the
reader of this document is an operating
system designer or evaluator who is already familiar with
the notion of recovery in operating
systems. The guide explains the security properties of
system recovery (and the notion of trusted
recovery). It also defines a set of baseline requirements
and recommendations for the design and
evaluation of trusted recovery mechanisms and assurance. The
reader who is unfamiliar with the
notion of system recovery and security modeling required
of B3 and A1 systems may find it useful
to refer both to the recovery literature (such as [1, 5,
14-16, 20-23, 25, 27]) and the security
literature (such as [3, 11, 26, 29]) cited in this guide.
1.3 SCOPE
Trusted recovery refers to mechanisms and procedures
necessary to ensure that failures and
discontinuities of operation don't compromise a system's
secure operation. The guidelines for
trusted recovery presented refer to the design of these
mechanisms and procedures required for the
classes B3 and A1 of the TCSEC. These guidelines apply to
computer systems and products built
or modified with the intention of satisfying TCSEC
requirements. We make additional
recommendations derived from the stated objectives of the
TCSEC.
Not addressed are recovery measures designed to tolerate
failures caused by physical attacks
on ADP equipment, natural disasters, water or fire
damage, nor administrative measures that deal
with such events. The evaluation of these measures is
beyond the scope of the TCSEC [17, p. 89].
1.4 CONTROL OBJECTIVE
Trusted recovery is one of the areas of operational
assurance. The assurance control objective
states:
"Systems that are used to process or handle
classified or other sensitive information must be
designed to guarantee correct and accurate interpretation
of the security policy and must not
distort the intent of that policy. Assurance must be
provided that correct implementation and
operation of the policy exists throughout the system's
life-cycle." [17, p. 63]
This objective affects trusted recovery in two important
ways. First, the design and
implementation of the recovery mechanisms and procedures
must satisfy the life-cycle assurance
requirements of correct implementation and operation.
Second, both a system's administrative
procedures and recovery mechanisms should ensure correct
enforcement of the system security
policy in the face of system failures and discontinuities
of operation. The notions of failure and
discontinuity of operation are defined in Chapter 2.
1.5 DOCUMENT OVERVIEW
This guide contains five chapters besides this
introductory chapter. Chapter 2 reviews the key
notions of failure, discontinuity of operation, and
recovery. Chapter 3 discusses the properties of
trusted recovery. Chapter 4 presents recovery design
approaches and options that can be used for
trusted recovery. Chapter 5 discusses the impact of the
other TCSEC requirements on trusted
recovery. Chapter 6 presents TCSEC requirements that
affect the design and implementation of
trusted recovery functions, and includes additional
recommendations corresponding to B3-A1
evaluation classes. The glossary contains the definitions
of the significant terms used. Following
this is a list of the references cited in the text.
2.0 FAILURES, DISCONTINUITIES, AND RECOVERY
The TCSEC requires for security classes B3 and A1 that:
"Procedures and/or mechanisms shall be provided to
assure that, after an ADP system failure
or other discontinuity, recovery without a protection
compromise is obtained." [17, p. 39]
In this chapter we discuss the notions of failure and
discontinuity of Trusted Computing Base
(TCB) operations, and present an informal qualitative
description of their effects on system states.
We also briefly present general recovery approaches used
in practice. Throughout this chapter and
document we use the term "failure" for an event
causing a system function to behave inconsistently
with its informal specification. We reserve the term
"discontinuity" of operation for failures caused
by user, administrator, or operator action.
Recovery mechanisms of computer systems are designed to
respond to anticipated failures or
discontinuities of operation. These mechanisms do not
handle "unanticipated" failures nor
"unanticipated" discontinuities of operation;
therefore, computer-system documentation should
include descriptions of administrative procedures to
handle such events. In a well-designed system,
unanticipated failures and discontinuities of operation
are events expected to occur with very low
frequency, i.e., once or twice per year. For this reason,
administrative procedures, as opposed to
automated mechanisms in the system, represent an adequate
response to unanticipated failures and
discontinuities of operation, even when these procedures
are complex and extensive.
One can't establish formal models of failure and discontinuity
of operation in which proofs
demonstrate the model's internal consistency. Neither
physical systems, such as devices,
processing units, and storage, nor behaviors of users,
administrators, and operators, have formal
properties [21]. Therefore, formal modeling and
specification of expected failures and
discontinuities of operation can't be required. Only
informal assumptions derived from operational
experience can be made about expected failures,
discontinuities, their effects, and their
frequencies. References [14, 15, 21] present examples of
such assumptions. These informal
assumptions, which should be stated explicitly in system
documentation, form the basis for the
design of the recovery mechanisms and the definition of
the administrative recovery procedures.
However, recovery mechanisms and administrative
procedures must reconstruct consistent
system states, or prevent state transitions to
inconsistent states, as a direct response to occurrences
of expected failures or discontinuities of operation [8,
9]. A system state is "consistent" if the
variables defining it satisfy given predicates expressing
formally or informally invariant properties
of the system, discussed in Section 3.1. A "state
transition" is a function which changes the
variables of a system state in a specified way, i.e.,
specified as constraints on the system's rules of
operation (discussed in Section 3.2). Therefore, the design
of recovery mechanisms and
administrative procedures should use invariant properties
and state-transition constraints of the
security model defined for the system; see the discussion
in Chapter 3.
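
As an informal illustration of these definitions, the following Python
sketch models a system state as a set of variables, consistency as a
list of invariant predicates, and a state transition as a function
whose result is accepted only if it is still consistent. The names
(SystemState, INVARIANTS, apply_transition, open_file) and the two
invariants are hypothetical and chosen only for illustration; they are
not taken from any security model cited in this guide.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class SystemState:
    # Hypothetical state variables, for illustration only.
    allocated_inodes: int
    open_files: int
    inode_table_size: int


# Each invariant is a predicate over a state. A state is "consistent"
# when every predicate holds.
INVARIANTS: List[Callable[[SystemState], bool]] = [
    lambda s: 0 <= s.allocated_inodes <= s.inode_table_size,
    lambda s: 0 <= s.open_files <= s.allocated_inodes,
]


def is_consistent(state: SystemState) -> bool:
    return all(invariant(state) for invariant in INVARIANTS)


def apply_transition(state: SystemState, transition) -> SystemState:
    """Return the new state, or the unchanged old state if the
    transition would lead to an inconsistent state."""
    candidate = transition(state)
    return candidate if is_consistent(candidate) else state


def open_file(state: SystemState) -> SystemState:
    # Hypothetical transition: one more open file.
    return replace(state, open_files=state.open_files + 1)
```

In this sketch an offending transition is simply refused; a real TCB
would also have to report the failure to the caller.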
The role of recovery mechanisms and of trusted recovery
can be best understood by
illustrating the effect of failures and discontinuities
of operation on typical systems. Informal and
qualitative assumptions of failures derived from
operational experience with various systems have
been presented in the literature [14, 15, 21]. Using these
informal assumptions we can define
general classes of failures that affect the operation of
a TCB.
One class of failures is identical to the class of errors
caused when users pass wrong
parameters to TCB primitives, or invoke the wrong TCB
primitives, and when system resources
are exhausted or found in an inconsistent state because
of user actions. These are called state-
transition failures or action failures. We cover this
type of user-induced failure, which falls more
naturally in the area of exception processing, for two
reasons: (1) the failures of this class are,
nevertheless, TCB domain failures regardless of their
cause; and (2) the processing of these
failures (not just their specification and
documentation) is relevant to system security.
For example, incorrect error processing can bring the
system into a state where a user cannot
communicate with the TCB, or can contribute to the
mishandling of covert channels. However, we
place the major emphasis in this guideline on the more
traditional notions of failure, namely TCB
failures, media failures, and administrator-induced
discontinuity of operation.
2.1 STATE-TRANSITION (ACTION) FAILURES
State-transition failures, also called action failures,
occur whenever a TCB primitive, which
causes a state transition, cannot complete its function
because it detects exceptional conditions
during its execution. State-transition failures can be
caused by bad parameters passed to TCB
primitives, by exhaustion of resource limits, by missing
objects needed during TCB primitive
execution, and so on.
The effects of state-transition failures on TCB states
are not as far-reaching as those of other
failures. Because these failures occur often, the code of
TCB primitives usually includes recovery
mechanisms that undo the temporary modifications of
system states before the primitive's return,
thus returning the system to a consistent state. If the
recovery mechanisms of TCB primitives fail
to undo temporary modifications of system states, the
system may remain in an inconsistent state
and eventually crash. A crash is a failure that causes
the processors' registers to be reset to some
standard values [21]. Because consistent system states
cannot be recovered from processor and
primary memory registers after a crash, these registers
are referred to as "volatile" storage. In
contrast, consistent system states can usually be
recovered from magnetic media such as disks and
tapes; these media are called "nonvolatile"
storage.
Examples of recovery mechanisms included in TCB
primitives to undo temporary state
modifications after state-transition failures are found
in most contemporary operating systems. For
instance, consider the "creat" primitive of a
hypothetical UNIX(R) system which allocates i-node
table entries before allocating file table entries [1].
If the file table is full at the time the "creat"
call is made, a state-transition failure would occur.
Before returning to the caller, the recovery code
of "creat" deallocates the i-node table entry allocated
for the file that couldn't be created. Failure
to deallocate such entries would cause the i-node table
to fill up and remain full, causing a system
crash.
(R) UNIX is a registered trademark of UNIX System
Laboratories, Inc.
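
The "creat" example above can be sketched as follows. This is a
minimal Python model, not UNIX kernel code: the Table class, the
TableFull exception, and the return convention are assumptions made
for illustration. It shows only the pattern of undoing a temporary
allocation before the primitive returns.

```python
class TableFull(Exception):
    """Raised when a fixed-size table has no free entry."""


class Table:
    """Hypothetical fixed-size kernel table."""

    def __init__(self, size):
        self.size = size
        self.entries = set()
        self._next = 0

    def allocate(self):
        if len(self.entries) >= self.size:
            raise TableFull()
        self._next += 1
        self.entries.add(self._next)
        return self._next

    def free(self, entry):
        self.entries.discard(entry)


def creat(inode_table, file_table):
    """Allocate an i-node entry, then a file table entry.

    If the second allocation fails, the first is undone before
    returning, so the tables are left as they were found. A full
    i-node table raises TableFull before anything is modified.
    """
    inode = inode_table.allocate()
    try:
        file_entry = file_table.allocate()
    except TableFull:
        inode_table.free(inode)   # recovery: undo the temporary change
        return None               # report the failure to the caller
    return inode, file_entry
```

Without the call to free() in the exception handler, every failed
"creat" would leak one i-node table entry, which is the slow table
exhaustion described above.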
2.2 TCB FAILURES
TCB failures occur whenever the TCB code detects an error
below the TCB primitives'
interface which can't be fixed; i.e., the error cannot be
masked. TCB failures are caused by
persistent inconsistencies in critical system tables, by
wild branches of the TCB code (possibly
caused by transient hardware failures), by power
failures, by processor failures, and so on. TCB
failures always cause a system crash.
In systems providing a high degree of hardware fault
tolerance, system crashes still occur
because of software errors. Since crashes cause volatile
storage to be lost, and since nonvolatile
media usually survive crashes, recovery mechanisms can
reconstruct consistent states in a
maintenance mode of operation. After reconstructing a consistent
state, the recovery mechanisms
restart the system with no process execution in progress,
e.g., processes that were active, blocked,
or swapped out before the crash are aborted. New
processes, which run the code of aborted
processes executing at the time of the crash, can be
started by users after the consistent state is
reconstructed. Recovery mechanisms can reconstruct
consistent states by either removing or
completing incomplete updates of various objects
represented on nonvolatile media. Properties of
and design approaches for recovery mechanisms able to
reconstruct consistent states from
nonvolatile storage after TCB failures are discussed in
Section 3.2 and Chapter 4.
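
A minimal sketch of "removing or completing incomplete updates" is
given below. It assumes, purely for illustration, that nonvolatile
storage is modeled as a Python dict with a "data" area and a "log" of
pending update records, that each record carries a "committed" flag,
and that the log holds at most one pending record per key. None of
these names or layouts come from a particular system.

```python
def recover(disk):
    """Rebuild a consistent "data" area from the log after a crash.

    disk["data"]: key -> value
    disk["log"]:  list of dicts with keys "key", "old", "new",
                  "committed"; "old" is None if the key is new.
    """
    for record in disk["log"]:
        key = record["key"]
        if record["committed"]:
            # Complete the update (roll forward).
            disk["data"][key] = record["new"]
        elif record["old"] is None:
            # Remove the incomplete update: the key did not exist.
            disk["data"].pop(key, None)
        else:
            # Remove the incomplete update: restore the old value.
            disk["data"][key] = record["old"]
    disk["log"].clear()
    return disk
```

In this sketch, running recover() a second time on the result changes
nothing, because the log is empty after the first pass.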
Some TCB failures allow a system to shut down in an
orderly manner. These failures may be
caused by process swap-space exhaustion, timer-interrupt
table exhaustion, and, in general, by
conditions that can't be handled by TCB primitives
themselves in normal modes of operation.
Traps originated by persistent hardware failures, such as
memory and bus parity errors, also may
cause failures.
2.3 MEDIA FAILURES
Media failures occur whenever errors are detected on some
nonvolatile storage device that the
TCB cannot fix (i.e., the errors can't be masked). Media
failures are caused by hardware failures
such as disk head crashes, persistent read/write failures
due to misaligned heads, worn-out
magnetic coating, dust on the disk surface, and so on.
They also are caused by software failures
such as TCB failures which make media unreadable.
The effect of media failures is that part, or all, of the
media representing TCB objects become
inaccessible or corrupt. Data structures relevant to
system security also may be corrupted by
media failures, e.g., object security labels. The system
usually crashes unless the lost data can be
retrieved from archival storage and rebuilt on a
redundant storage device. Of course, media failures
that don't affect TCB objects may not cause system
crashes. If redundant media aren't available,
or if users and administrators don't keep archival data
up-to-date, media failures may become
unrecoverable failures. Administrative recovery
procedures may have to be used to bring the
system to a consistent state. As discussed in Chapters 5
and 6, all these procedures should be
explained in the system's Trusted Facility Manual.
2.4 DISCONTINUITY OF OPERATION
Failures induced by users, administrators, and operators
cause discontinuities of operation.
Inside an operating system, discontinuities of operation
manifest themselves most often as state-
transition failures, TCB failures, and, less often, as
media failures. They are caused by erroneous
actions, such as unexpected system shutdowns, e.g., by
turning off the power. Also, they can be
caused by lack of action, such as ignoring the exhaustion
of critical system resources under
administrative control despite documented or on-line
warnings, e.g., audit trail is 95% full,
insufficient swap space left, inadequate configuration
installed, etc.
The effects of discontinuities of operation are the same
as those of the state-transition and TCB
failures mentioned above. Recovery mechanisms or
administrative procedures necessary for the
reconstruction of a consistent state also are
correspondingly similar to those used for failures. For
example, cancellation of a TCB primitive call by
depressing the "break" key during the call's
execution might have the same effect as a
state-transition failure detected by the TCB primitive.
Each TCB primitive and state transition would have to be
designed either to ignore user
cancellation signals during execution of critical code
sections or to clean up internal data structures
during the processing of such signals.
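
The first of these two options, ignoring cancellation during a
critical code section, might be sketched as follows. The CancelGuard
class, the update_entry primitive, and the on_midpoint test hook are
hypothetical; the sketch models the "break" key as an ordinary method
call rather than a real signal. On a POSIX system this role is
typically played by signal masking.

```python
class CancelGuard:
    """Defers a user cancellation request during a critical section."""

    def __init__(self):
        self.in_critical = False
        self.pending = False

    def request_cancel(self):
        """Called when the user presses "break". Returns True if the
        cancellation may take effect immediately."""
        if self.in_critical:
            self.pending = True    # remember it and act on it later
            return False
        return True

    def __enter__(self):
        self.in_critical = True
        return self

    def __exit__(self, exc_type, exc, tb):
        self.in_critical = False
        return False               # never swallow exceptions


def update_entry(guard, table, key, value, on_midpoint=None):
    """Hypothetical two-step primitive that is never left half done.

    Returns True if a cancellation arrived during the update."""
    with guard:
        table[key] = None          # step 1: reserve the entry
        if on_midpoint is not None:
            on_midpoint()          # test hook: a "break" arrives here
        table[key] = value         # step 2: fill it in
    cancelled = guard.pending
    guard.pending = False
    return cancelled
```

The caller learns that a cancellation was requested only after both
steps have completed, so the table never exposes a reserved but
unfilled entry.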
Actions such as system shutdowns by power-off action
during execution of TCB code may
cause TCB failures. Recovery mechanisms for TCB failures
caused by power failures also may be
able to handle unexpected system shutdowns. In either
case, during subsequent power-on
procedures, the TCB not only detects that TCB failures
left the system in an inconsistent state, but
also initiates recovery of a consistent state before the
system enters the normal mode of operation.
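
One common way to detect an unclean shutdown at power-on is a
clean-shutdown marker kept on nonvolatile storage. The sketch below
assumes such a marker; the names boot, shutdown, and "clean_shutdown",
and the use of a dict to stand in for nonvolatile storage, are
hypothetical and are not taken from this guide.

```python
def boot(nvram, recover):
    """Power-on procedure.

    nvram:   dict standing in for nonvolatile storage.
    recover: callable that rebuilds a consistent state from nvram.
    """
    if not nvram.get("clean_shutdown", False):
        # The previous session did not end with an orderly shutdown,
        # so the state on nonvolatile storage may be inconsistent.
        recover(nvram)
    # The marker stays cleared for as long as the system is running.
    nvram["clean_shutdown"] = False
    return "normal"


def shutdown(nvram):
    """Orderly shutdown: set the marker as the last action."""
    nvram["clean_shutdown"] = True
```

Because the marker is cleared during boot and set only by an orderly
shutdown, a power-off at any other time leaves it cleared, and the
next boot runs recovery before entering the normal mode of operation.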
Somewhat less often, administrator or operator actions
cause media failures. For example,
initiation of on-line diagnostic tests of a media
controller during normal mode of system operation,
instead of the maintenance mode, would most likely cause
media failures. Similarly, initiation of
TCB maintenance actions such as disk reformatting in the
normal mode of operation would
certainly cause subsequent media failures. Discontinuity
of operation caused by administrator- or
operator-induced failures may require use of
administrative recovery procedures.
3.0 PROPERTIES OF TRUSTED RECOVERY
The properties of trusted recovery are defined in terms
of two notions: secure states and secure
state transitions. A system state is secure whenever
consistency invariants derived from valid
interpretations of security and accountability models are
satisfied. A state transition is secure if
both its input state and its output state are secure, and
it satisfies the constraints placed on it by valid
interpretations of security policy and accountability
policy models.
Accountability models include models of user
authentication, trusted path, and audit. The
notions of invariants for secure states and constraints
for specific state transitions are briefly
illustrated in this chapter and discussed in detail in
reference [11]. Reference [29] discusses the
notion of a valid interpretation of a security model in
detail and reference [3] illustrates it. For the
sake of brevity, interpretations of security models
aren't illustrated in this guideline.
3.1 SECURE STATES
State-machine (or "state-transition") models of
security, such as the Bell-LaPadula