This article discusses methods for restoring secure systems, such as payment card processing systems, to normal cryptographic operation after an attack or adverse event is detected, and for recovering access to critical information.
Types of failure modes in a cryptographic system
Access to critical information can easily be lost if an error or compromise is detected in the keying process, or if any of numerous other failures occurs, including lost key cards or tokens, forgotten passwords, hardware failure, power loss, memory corruption, and electromagnetic interference. This is especially critical in the payment card industry, where cardholder data is a prime target for attack.
Using backups for failures due to the keying process
If the failure is caused by a key (or keys) becoming inaccessible through corruption or other factors, access to information can usually be restored by using a backup copy of the key and its related keying material. Instead of using backups, an organization may find it more convenient to create new keying material, provided that doing so still allows the original protected information to be recovered. New keying material can reduce the risk of further compromise, but in most cases, without an advanced and automated key management system, generating and distributing it is an extensive process, and it is much easier to use backups.
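The backup path described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `fingerprint` and `recover_key` are hypothetical, and the integrity check is assumed to be a SHA-256 fingerprint recorded when the key was created. A production system would typically keep keys and backups inside a hardware security module or a key management system rather than in program memory.

```python
import hashlib
import hmac
import secrets


def fingerprint(key: bytes) -> str:
    """Return a SHA-256 fingerprint used to detect a corrupted key copy."""
    return hashlib.sha256(key).hexdigest()


def recover_key(active: bytes, backup: bytes, expected_fp: str) -> bytes:
    """Return a usable key, preferring the active copy over the backup."""
    for candidate in (active, backup):
        # compare_digest performs a constant-time comparison of the fingerprints
        if hmac.compare_digest(fingerprint(candidate), expected_fp):
            return candidate
    # Neither copy verifies: the only remaining option is new keying material.
    raise ValueError("active and backup copies are both unusable; re-keying is required")


# Hypothetical scenario: the active copy is corrupted, the backup is intact.
key = secrets.token_bytes(32)
expected = fingerprint(key)
corrupted = bytes([key[0] ^ 0xFF]) + key[1:]  # simulate memory corruption
restored = recover_key(corrupted, key, expected)
```

When neither copy verifies, the sketch raises an error instead of guessing, which corresponds to the case where new keying material has to be generated and distributed.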
Complications in recovery due to multiple key users
Whether backups are used or new keying material is generated, recovery from a compromise is relatively simple when a key or set of keys is shared between a single pair of users. Damage assessment and protection against future compromise are also straightforward. However, if a large number of users depend on the same key or group of keys to access information, recovery and damage assessment can be extremely complex and expensive. In such cases, it is preferable to have some form of alternate key available at the various locations, or to use well-defined, secure paths for replacing compromised keys automatically, which avoids the need for manual key distribution. Whenever a widespread key compromise occurs among many users, a means should be in place to notify each of them about the compromise and the key replacement. It should also be emphasized that if any keying material has been subjected to unauthorized use, all information and processes protected by that keying material are immediately affected.
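To make the bookkeeping concrete, the following Python sketch tracks which users and which protected items depend on each key, so that a compromise report immediately yields the list of users to notify and the items to treat as affected. The `KeyRecord` structure, the `report_compromise` function, and all key, user, and item names are hypothetical and serve only as an illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class KeyRecord:
    key_id: str
    users: set[str] = field(default_factory=set)
    protected_items: set[str] = field(default_factory=set)
    compromised: bool = False


def report_compromise(
    registry: dict[str, KeyRecord], key_id: str
) -> tuple[list[str], list[str]]:
    """Mark a key compromised; return (users to notify, items to treat as affected)."""
    record = registry[key_id]
    record.compromised = True
    return sorted(record.users), sorted(record.protected_items)


# Hypothetical registry: one key shared by a pair, one shared by a larger group.
registry = {
    "pair-key": KeyRecord("pair-key", {"alice", "bob"}, {"session-log"}),
    "group-key": KeyRecord(
        "group-key", {"alice", "bob", "carol", "dave"}, {"ledger", "card-vault"}
    ),
}
notify, affected = report_compromise(registry, "group-key")
```

The size of the two returned lists grows with the number of users and items tied to a key, which is one way to see why recovery for a widely shared key is more costly than for a key shared by a single pair.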
Establishing system redundancy for non-key-related failures
If a failure is due to a non-key-related event, a contingency plan should be in place to recover any lost data or communication capability, and to restore access mechanisms and critical processes while maintaining integrity protection and confidentiality for authorization and authentication. A backup system or method should be available to bypass or replace the failed system or component, and recovery procedures should be written down, with responsibilities delegated to specific personnel. Any spare components that might be needed should be on hand for emergencies, including files, manuals, and critical software routines.
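The idea of bypassing a failed component can be illustrated with a minimal failover wrapper in Python. The function `call_with_failover` and the `primary` and `standby` components are hypothetical; a real contingency plan would also cover state synchronization, alerting, and the written procedures mentioned above.

```python
from typing import Callable, Sequence


def call_with_failover(
    components: Sequence[Callable[[str], str]], request: str
) -> str:
    """Try each component in order and return the first successful result."""
    errors = []
    for component in components:
        try:
            return component(request)
        except Exception as exc:  # a real plan would catch narrower error types
            errors.append(exc)
    raise RuntimeError(f"all {len(components)} components failed: {errors}")


def primary(request: str) -> str:
    # Simulated hardware failure of the main unit.
    raise ConnectionError("primary unit lost power")


def standby(request: str) -> str:
    return f"standby handled {request}"


result = call_with_failover([primary, standby], "authorize-txn")
```

Ordering the components from preferred to least preferred keeps the fallback policy in one place, so adding another spare is a matter of extending the list.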
The sequence of events for recovery and restoring system operation
The survivability of a system depends on its ability to withstand adverse conditions and/or attacks without damage, and to return to normal operation (in terms of security, reliability, performance, data availability, and correct functionality) after such an event is detected. This can involve several processes executed sequentially. The first step after detecting an attack or adverse event is to shut down all components affected by it, including keying material, data access, control systems, and processes. The next step is the critical one: bringing the system back into an operational state with minimal loss of data and functionality. A damage assessment should then be conducted as soon as possible, together with measures to protect against future compromise and to reduce the consequences of the unauthorized use of protected data or processes. These measures can vary widely depending on the nature of the failure. Certain failures or misuses cannot be prevented, so the system may need to be designed with a higher tolerance for them.
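The sequence described above (shut down, restore, assess) can be modeled as a small state machine that rejects steps taken out of order. This Python sketch is illustrative only: the phase names and the `advance` function are assumptions introduced here, not terminology from any standard.

```python
from enum import Enum, auto


class Phase(Enum):
    OPERATIONAL = auto()
    SHUT_DOWN = auto()
    RESTORED = auto()
    ASSESSED = auto()


# Allowed transitions, in the order the text describes them.
NEXT_PHASE = {
    Phase.OPERATIONAL: Phase.SHUT_DOWN,  # event detected: shut down affected components
    Phase.SHUT_DOWN: Phase.RESTORED,     # bring the system back with minimal loss
    Phase.RESTORED: Phase.ASSESSED,      # assess damage and add protections
    Phase.ASSESSED: Phase.OPERATIONAL,   # resume normal operation
}


def advance(current: Phase, target: Phase) -> Phase:
    """Move to the target phase, rejecting any step taken out of order."""
    if NEXT_PHASE[current] is not target:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target


# Walk through one full recovery, recording each phase reached.
phase = Phase.OPERATIONAL
history = [phase]
for step in (Phase.SHUT_DOWN, Phase.RESTORED, Phase.ASSESSED):
    phase = advance(phase, step)
    history.append(phase)
```

Encoding the order explicitly means that an attempt to restore service before the affected components have been shut down fails loudly instead of passing unnoticed.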
Survivability comes in many forms
The concept of system survivability can be broken down into the following five types:
- Information survivability,
- Computer-system survivability,
- Network survivability,
- Application service survivability, and
- Enterprise survivability.
Survivability as it exists today
Considerable development work is still needed before current systems achieve an acceptable level of survivability. Many systems, subsystems, and networks in operation today fall far short of what is needed and have deficiencies that seriously hinder survivability. Organizations should therefore employ adequate key and/or data recovery mechanisms to ensure that information is available whenever needed and to comply with assurance requirements.
Considering the survivability level in future systems
To limit problems caused by failures and misuse, survivability should be considered thoroughly in the initial design of a system. Detailed survivability requirements should be defined from the start, so that stronger architectures, components, protocols, and cryptographic infrastructures can be implemented.
References and further reading
- Recommendation for Key Management – Part 1: General (2007), by Elaine Barker, William Barker, William Burr, William Polk, and Miles Smid
- Practical Architectures for Survivable Systems and Networks (2000), by Peter G. Neumann
- Computer Safety, Reliability and Security (2002), edited by Stuart Anderson, Massimo Felici, and Sandro Bologna
- Insecure Context Switching: Inoculating regular expressions for Survivability (2009), by Will Drewry and Tavis Ormandy