During the disaster recovery process, the organizational structure will inevitably be different from normal operation and is based around:
- Executive – including senior management/executive board, with overall authority and control within the organization and responsible for crisis management and liaison with other departments, divisions, organizations, the media, regulators, emergency services etc.
- Coordination – typically one level below the executive group and responsible for coordinating the overall recovery effort within the organization
- Recovery – a series of business and service recovery teams, representing the critical business functions and the services that need to be established to support these functions. Each team is responsible for executing the plans within their own areas and for liaison with staff, customers and third parties. Within IT the recovery teams should be grouped by IT service and application. For example, the infrastructure team may have one or more people responsible for recovering external connections, voice services, local area networks, etc. and the support teams may be split by platform, operating system or application. In addition, the recovery priorities for the service, application or its components identified during the Business Impact Analysis should be documented within the recovery plans and applied during their execution.
Experience has shown that recovery plans that have not been fully tested do not work as intended, if at all. Testing is therefore a critical part of the overall ITSCM process and the only way of ensuring that the selected strategy, standby arrangements, logistics, business recovery plans and procedures will actually work in practice.
The IT service provider is responsible for ensuring that the IT services can be recovered in the required timescales with the required functionality and the required performance following a disaster.
There are four basic types of tests that can be undertaken:
- Walk-through tests can be conducted when the plan has been produced simply by getting the relevant people together to see if the plan(s) at least work in a simulated way.
- Full tests should be conducted as soon as possible after the plan production and at regular intervals of at least annually thereafter. They should involve the business units to assist in proving the capability to recover the services appropriately. They should, as far as possible, replicate an actual invocation of all stand-by arrangements and should involve external parties if they are planned to be involved in an actual invocation. The tests must not only prove recovery of the IT services but also the recovery of the business processes. It is recommended that an independent observer records all the activities of the tests and the timings of the service recovery. The observer’s documentation of the tests will be vital input into the subsequent post mortem review. The full tests may be announced or unannounced. The first test of the plan is likely to be announced and carefully planned, but subsequent tests may be ‘sprung’ on key players without warning. It is also essential that many different people get involved, including those not very familiar with the IT service and systems, as the people with the most knowledge may not be available when a disaster actually occurs.
- Partial tests can also be undertaken where recovery of certain elements of the overall plan is tested, such as single services or servers. These types of tests should be in addition to the full test not instead of the full test. The full test is the best way of testing that all services can be recovered in required timescales and can run together on the recovery systems.
- Scenario tests can be used to test reactions and plans to specific conditions, events and scenarios. They can include testing that BCPs and IT Service Continuity Plans interface with each other, as well as interfacing with all other plans involved in the handling and management of a major incident.
All tests need to be undertaken against defined test scenarios, which are described as realistically as possible. It should be noted, however, that even the most comprehensive test does not cover everything. For example, in a service disruption where there has been injury or even death to colleagues, the reaction of staff to a crisis cannot be tested and the plans need to make allowance for this. In addition, tests must have clearly defined objectives and Critical Success Factors, which will be used to determine the success or otherwise of the exercise.
188.8.131.52 Stage 4 – Ongoing operation
This stage consists of the following:
- Education, awareness and training – this should cover the organization and, in particular, the IT organization, for service continuity-specific items. This ensures that all staff are aware of the implications of business continuity and of service continuity and consider these as part of their normal working, and that everyone involved in the plan has been trained in how to implement their actions.
- Review – regular review of all of the deliverables from the ITSCM process needs to be undertaken to ensure that they remain current.
- Testing – following the initial testing, it is necessary to establish a programme of regular testing to ensure that the critical components of the strategy are tested, preferably at least annually, although testing of IT Service Continuity Plans should be arranged in line with business needs and the needs of the BCPs. All plans should also be tested after every major business change. It is important that any changes to the IT technology are also included in the strategy, implemented in an appropriate fashion and tested to ensure that they function correctly within the overall provision of IT following a disaster. The backup and recovery of IT service should also be monitored and tested to ensure that when they are needed during a major incident, they will operate as needed. This aspect is covered more fully in the Service Operation publication
- Change Management – the Change Management process should ensure that all changes are assessed for their potential impact on the ITSCM plans. If the planned change will invalidate the plans, then the plan must be updated before the change is implemented, and it should be tested as part of the change testing. The plans themselves must be under very strict Change Management and Configuration Management control. Inaccurate plans and inadequate recovery capabilities may result in the failure of BCPs. Also, on an ongoing basis, whenever there are new services or where services have major changes, it is essential that a BIA and risk assessment is conducted on the new or changed service and the strategy and plans updated accordingly.
Invocation is the ultimate test of the Business Continuity and ITSCM Plans. If all the preparatory work has been successfully completed, and plans developed and tested, then an invocation of the Business Continuity Plans should be a straightforward process, but if the plans have not been tested, failures can be expected. It is important that due consideration is given to the design of all invocation processes, to ensure that they are fit for purpose and interface to all other relevant invocation processes.
Invocation is a key component of the plans, which must include the invocation process and guidance. It should be remembered that the decision to invoke, especially if a third-party recovery facility is to be used, should not be taken lightly. Costs will be involved and the process will involve disruption to the business. This decision is typically made by a ‘crisis management’ team, comprising senior managers from the business and support departments (including IT), using information gathered through damage assessment and other sources.
A disruption could occur at any time of the day or night, so it is essential that guidance on the invocation process is readily available. Plans must be available to key staff in the office and away from the office.
The decision to invoke must be made quickly, as there may be a lead-time involved in establishing facilities at a recovery site. In the case of a serious building fire, the decision may be fairly easy to make. However, in the case of power failure or hardware fault, where a resolution is expected within a short period, a deadline should be set by which time if the incident has not been resolved, invocation will take place. If using external services providers, they should be warned immediately if there is a chance that invocation might take place.
The decision to invoke needs to take into account the:
- Extent of the damage and scope of the potential invocation
- Likely length of the disruption and unavailability of premises and/or services
- Time of day/month/year and the potential business impact. At year-end, the need to invoke may be more pressing, to ensure that year-end processing is completed on time.
Therefore the design of the invocation process must provide guidance on how all of these areas and circumstances should be assessed to assist the person invoking the continuity plan.
The ITSCM Plan should include details of activities that need to be undertaken, including:
- Retrieval of backup tapes or use of data vaulting to retrieve data
- Retrieval of essential documentation, procedures, workstation images, etc. stored off-site
- Mobilization of the appropriate technical personnel to go to the recovery site to commence the recovery of required systems and services
- Contacting and putting on alert telecommunications suppliers, support services, application vendors, etc. who may be required to undertake actions or provide assistance in the recovery process.
The invocation and initial recovery is likely to be a time of high activity, involving long hours for many individuals. This must be recognized and managed by the recovery team leaders to ensure that breaks are provided and prevent ‘burn-out’. Planning for shifts and handovers must be undertaken to ensure that the best use is made of the facilities available. It is also vitally important to ensure that the usual business and technology controls remain in place during invocation, recovery and return to normal to ensure that information security is maintained at the correct level and that data protection is preserved.
Once the recovery has been completed, the business should be able to operate from the recovery site at the level determined and agreed in the strategy and relevant SLA. The objective, however, will be to build up the business to normal levels, maintain operation from the recovery site in the short term and vacate the recovery site in the shortest possible time. Details of all these activities need to be contained within the plans. If using external services, there will be a finite contractual period for using the facility. Whatever the period, a return to normal must be carefully planned and undertaken in a controlled fashion. Typically this will be over a weekend and may include some necessary downtime in business hours. It is important that this is managed well and that all personnel involved are aware of their responsibilities to ensure a smooth transition.