ITSCM recovery options
An organization’s ITSCM strategy is a balance between the cost of risk reduction measures and recovery options to support the recovery of critical business processes within agreed timescales. The following is a list of the potential IT recovery options that need to be considered when developing the strategy.
For certain types of services, manual work-arounds can be an effective interim measure for a limited timeframe until the IT service is resumed. For instance, a Service Desk call-logging service could survive for a limited time using paper forms supported by a spreadsheet on a laptop computer.
In the past, reciprocal arrangements were typical contingency measures where agreements were put in place with another organization using similar technology. This is no longer effective or possible for most types of IT systems, but can still be used in specific cases – for example, setting up an agreement to share high-speed printing facilities. Reciprocal arrangements can also be used for the off-site storage of backups and other critical information.
The gradual recovery option (sometimes referred to as ‘cold standby’) provides empty accommodation, fully equipped with power, environmental controls, local network cabling infrastructure and telecommunications connections, which is available in a disaster situation for an organization to install its own computer equipment. It does not include the actual computing equipment, so it is not suitable for services requiring speedy recovery, as set-up time is required before recovery of services can begin. This recovery option is only recommended for services that can bear a recovery delay of days or weeks, not hours. For any non-critical service that can bear this type of delay, the cost of this option should be weighed against the benefit to the business before deciding whether gradual recovery should be included in the organization’s ITSCM options.
The accommodation may be provided commercially by a third party, for a fee, or may be private (established by the organization itself), and may be provided as either a fixed or a portable service.
A portable facility is typically a prefabricated building provided by a third party and located, when needed, at a predetermined site agreed with the organization. This may be in another location some distance from the home site, perhaps another owned building. The supply of replacement computer equipment will need to be planned: suppliers do not always guarantee delivery of replacement equipment within a fixed deadline, though they will normally undertake to supply it on a best-efforts basis.
The intermediate recovery option (sometimes referred to as ‘warm standby’) is selected by organizations that need to recover IT facilities within a predetermined time to prevent impacts to the business process. The predetermined time will have been agreed with the business during the BIA.
Most common is the use of commercial facilities, which are offered by third-party recovery organizations to a number of subscribers, spreading the cost across those subscribers. Commercial facilities often include operation, system management and technical support. The cost varies depending on the facilities requested, such as processors, peripherals, communications, and how quickly the services must be restored.
The advantage of this service is that the customer can have virtually instantaneous access to a site, housed in a secure building, in the event of a disaster. It must be understood, however, that the restoration of services at the site may take some time, as delays may be encountered while the site is re-configured for the organization that invokes the service, and the organization’s applications and data will need to be restored from backups.
One potentially major disadvantage is the security implications of running IT services at a third party’s data centre. This must be taken into account when planning to use this type of facility. For some organizations, the external intermediate recovery option may not be appropriate for this reason.
If the site is invoked, there is often a daily fee for use of the service in an emergency, although this may be offset against additional cost of working insurance.
Commercial recovery services can be provided in self-contained, portable or mobile form where an agreed system is delivered to a customer’s site, within an agreed time.
The fast recovery option (sometimes referred to as ‘hot standby’) provides for fast recovery and restoration of services and is sometimes provided as an extension to the intermediate recovery provided by a third-party recovery provider. Some organizations will provide their own facilities within the organization, but not on an alternative site to the one used for the normal operations. Others implement their own internal second locations on an alternative site to provide more resilient recovery.
Where there is a need for a fast restoration of a service, it is possible to ‘rent’ floor space at the recovery site and install servers or systems with application systems and communications already available, and data mirrored from the operational servers. In the event of a system failure, the customers can then recover and switch over to the backup facility with little loss of service. This typically involves the re-establishment of the critical systems and services within a 24-hour period.
The immediate recovery option (also often referred to as ‘hot standby’, ‘mirroring’, ‘load balancing’ or ‘split site’) provides for immediate restoration of services, with no loss of service. For business critical services, organizations requiring continuous operation will provide their own facilities within the organization, but not on the same site as the normal operations. Sufficient IT equipment will be ‘dual located’ in either an owned or hosted location to run the complete service from either location in the event of loss of one facility, with no loss of service to the customer. The second site can then be recovered whilst the service is provided from the single operable location. This is an expensive option, but may be justified for critical business processes or VBFs where non-availability for a short period could result in a significant impact, or where it would not be appropriate to be running IT services on a third party’s premises for security or other reasons. The facility needs to be located separately and far enough away from the home site that it will not be affected by a disaster affecting that location. However, these mirrored server and site options should be implemented in close liaison with Availability Management, as they support services with high levels of availability.
The strategy is likely to include a combination of risk response measures and a combination of the above recovery options, as illustrated in Figure 4.25.
Figure 4.25 Example set of recovery options
Figure 4.25 shows that a number of options may be used to provide continuity of service. An example from Figure 4.25 shows that, initially, continuity of the Service Desk is provided using manual processes such as a set of forms, and maybe a spreadsheet operating from a laptop computer, whilst recovery plans for the service are completed on an alternative ‘fast recovery’ site. Once the alternative site has become operational, the Service Desk can switch back to using the IT service. However, use of the external ‘fast recovery’ alternative site is probably limited in duration, so while running temporarily from this site, the ‘intermediate site’ can be made operational and long-term operations can be transferred there.
Different services within an organization require different in-built resilience and different recovery options. Whatever option is chosen, the solution will need to be cost-justified. As a general rule, the longer the business can survive without a service, the cheaper the solution will be. For example, a critical healthcare system that requires continuous operation will be very costly, as potential loss of service will need to be eliminated by the use of immediate recovery, whereas a service the absence of which does not severely affect the business for a week or so could be supported by a much cheaper solution, such as intermediate recovery.
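The general rule above can be sketched as a simple lookup from the outage a service can tolerate to the cheapest recovery option that still meets it. This is only an illustration: the function and constant names are hypothetical, the 24-hour limit for fast recovery follows the figure quoted earlier in this section, and the seven-day limit for intermediate recovery is an assumption loosely based on the ‘week or so’ example. In practice the thresholds are agreed with the business during the BIA.

```python
from enum import Enum


class RecoveryOption(Enum):
    IMMEDIATE = "immediate recovery"
    FAST = "fast recovery"
    INTERMEDIATE = "intermediate recovery"
    GRADUAL = "gradual recovery"


# Fast recovery is described in the text as typically within 24 hours.
FAST_LIMIT_HOURS = 24.0
# Assumption for illustration only: intermediate recovery covers up to a week.
INTERMEDIATE_LIMIT_HOURS = 7 * 24.0


def suggest_option(tolerable_outage_hours: float) -> RecoveryOption:
    """Return the cheapest option whose recovery time fits the tolerable outage."""
    if tolerable_outage_hours < 0:
        raise ValueError("tolerable outage cannot be negative")
    if tolerable_outage_hours == 0:
        # Continuous operation required: no loss of service is acceptable.
        return RecoveryOption.IMMEDIATE
    if tolerable_outage_hours <= FAST_LIMIT_HOURS:
        return RecoveryOption.FAST
    if tolerable_outage_hours <= INTERMEDIATE_LIMIT_HOURS:
        return RecoveryOption.INTERMEDIATE
    # Delays of more than a week (days or weeks) can be tolerated.
    return RecoveryOption.GRADUAL


if __name__ == "__main__":
    for hours in (0, 12, 7 * 24, 21 * 24):
        print(hours, "->", suggest_option(hours).value)
```

A lookup like this only suggests a starting point; the chosen option still has to be cost-justified against the impact identified for each service.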
As well as the recovery of the computing equipment, planning needs to include the recovery of accommodation and infrastructure for both IT and user staff. Other areas to be taken into account include critical services such as power, telecommunications, water, couriers, post, paper records and reference material.
It is important to remember that the recovery is based around a series of stand-by arrangements including accommodation, procedures and people, as well as systems and telecommunications. Certain actions are necessary to implement the stand-by arrangements. For example:
Stage 3 – Implementation
Once the strategy has been approved, the IT Service Continuity Plans need to be produced in line with the Business Continuity Plans.
ITSCM plans need to be developed to enable the necessary information for critical systems, services and facilities to either continue to be provided or to be reinstated within an acceptable period to the business. An example ITSCM recovery plan is contained in Appendix K. Generally the Business Continuity Plans rely on the availability of IT services, facilities and resources. As a consequence of this, ITSCM plans need to address all activities to ensure that the required services, facilities and resources are delivered in an acceptable operational state and are ‘fit for purpose’ when accepted by the business. This entails not only the restoration of services and facilities, but also the understanding of dependencies between them, the testing required prior to delivery (performance, functional, operational and acceptance testing) and the validation of data integrity and consistency.
It should be noted that the continuity plans are more than just recovery plans, and should include documentation of the resilience measures and the measures that have been put into place to enable recovery, together with explanations of why a particular approach has been taken (this facilitates decisions should invocation determine that the particular situation requires a modification to the plan). However, the format of the plan should enable rapid access to the recovery information itself, perhaps as an appendix that can be accessed directly. All key staff should have access to copies of all the necessary recovery documentation.
Management of the distribution of the plans is important to ensure that copies are available to key staff at all times. The plans should be controlled documents (with formalized documents maintained under Change Management and Configuration Management control) to ensure that only the latest versions are in circulation and each recipient should ensure that a personal copy is maintained off-site.
The plan should ensure that all details regarding recovery of the IT services following a disaster are fully documented. It should have sufficient details to enable a technical person unfamiliar with the systems to be able to follow the procedures. The recovery plans include key details such as the data recovery point, a list of dependent systems, the nature of the dependency and their data recovery points, system hardware and software requirements, configuration details and references to other relevant or essential information about the service and systems.
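As a rough illustration, the key details listed above could be held in a structured record so that gaps are easy to spot before the plan is issued. The class and field names below are hypothetical and are not taken from any standard plan template; a service may also legitimately have no dependent systems, in which case the flagged field simply needs confirming rather than filling in.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Dependency:
    system: str               # name of the dependent system
    nature: str               # nature of the dependency
    data_recovery_point: str  # recovery point of the dependent system


@dataclass
class RecoveryPlanEntry:
    service: str
    data_recovery_point: str = ""
    dependencies: List[Dependency] = field(default_factory=list)
    hardware_requirements: List[str] = field(default_factory=list)
    software_requirements: List[str] = field(default_factory=list)
    configuration_details: str = ""
    references: List[str] = field(default_factory=list)

    def missing_details(self) -> List[str]:
        """Return the names of fields that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    entry = RecoveryPlanEntry(
        service="Service Desk call logging",
        data_recovery_point="last nightly backup",
    )
    print(entry.missing_details())
```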
It is a good idea to include a checklist that covers specific actions required during all stages of recovery for the service and system. For example, after the system has been restored to an operational state, connectivity checks, functionality checks or data consistency and integrity checks should be carried out prior to handing the service over to the business.
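A checklist of this kind can also be expressed as a small script that runs each check and reports whether the service is ready to be handed over. The sketch below is a minimal example under stated assumptions: the function name is hypothetical, and the individual checks are placeholders that would be replaced by real connectivity, functionality and data integrity tests for the service concerned.

```python
from typing import Callable, Dict, List, Tuple

# A check is any callable that returns True on success.
Check = Callable[[], bool]


def run_handover_checklist(checks: Dict[str, Check]) -> Tuple[bool, List[str]]:
    """Run every check and return (ready_for_handover, names_of_failed_checks)."""
    failed: List[str] = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that raises is treated as a failure, not a crash.
            ok = False
        if not ok:
            failed.append(name)
    return (not failed, failed)


if __name__ == "__main__":
    # Placeholder checks for illustration only.
    example_checks: Dict[str, Check] = {
        "connectivity": lambda: True,
        "functionality": lambda: True,
        "data consistency and integrity": lambda: False,
    }
    ready, failed = run_handover_checklist(example_checks)
    print("Ready for handover:", ready)
    print("Failed checks:", failed)
```

Running all checks rather than stopping at the first failure gives the recovery team a complete list of outstanding items before the service is offered back to the business.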
There are a number of technical plans that may already exist within an organization, documenting recovery procedures from a normal operational failure. The development and maintenance of these plans will be the responsibility of the specialist teams, but will be coordinated by the Business Continuity Management team. These will be useful additions or appendices to the main plan. Additionally, plans that will need to be integrated with the main BCP are:
Finally, each critical business area is responsible for the development of a plan detailing the individuals who will be in the recovery teams and the tasks to be undertaken on invocation of recovery arrangements.
The ITSCM Plan must contain all the information needed to recover the IT systems, networks and telecommunications in a disaster situation once a decision to invoke has been made, and then to manage the business return to normal operation once the service disruption has been resolved. One of the most important inputs into the plan development is the results of the Business Impact Analysis. Additionally other areas will need to be analysed, such as Service Level Agreements (SLA), security requirements, operating instructions and procedures and external contracts. It is likely that a separate SLA with alternative targets will have been agreed if running at a recovery site following a disaster.
Other areas that will need to be implemented following the approval of the strategy are: