The expanded incident lifecycle
A guiding principle of Availability Management is to recognize that it is still possible to gain customer satisfaction even when things go wrong. One approach to help achieve this requires Availability Management to ensure that the duration of any incident is minimized to enable normal business operations to resume as quickly as possible. An aim of Availability Management is to ensure the duration and impact from incidents impacting IT services are minimized, to enable business operations to resume as quickly as is possible. The analysis of the ‘expanded incident lifecycle’ enables the total IT service downtime for any given incident to be broken down and mapped against the major stages through which all incidents progress (the lifecycle). Availability Management should work closely with Incident Management and Problem Management in the analysis of all incidents causing unavailability.
A good technique to help with the technical analysis of incidents affecting the availability of components and IT services is to take an incident ‘lifecycle’ view. Every incident passes through several major stages. The time elapsed in these stages may vary considerably. For Availability Management purposes, the standard incident ‘lifecycle’, as described within Incident Management, has been expanded to provide additional help and guidance particularly in the area of ‘designing for recovery’. Figure 4.15 illustrates the expanded incident lifecycle.
Figure 4.15 The expanded incident lifecycle
From the above it can be seen that an incident can be broken down into individual stages within a lifecycle that can be timed and measured. This lifecycle view provides an important framework in determining, amongst others, systems management requirements for event and incident detection, diagnostic data capture requirements and tools for diagnosis, recovery plans to aid speedy recovery and how to verify that IT service has been restored. The individual stages of the lifecycle are considered in more detail as follows.
- Incident detection: the time at which the IT service provider organization is made aware of an incident. Systems management tools positively influence the ability to detect events and incidents and therefore to improve levels of availability that can be delivered. Implementation and exploitation should have a strong focus on achieving high availability and enhanced recovery objectives. In the context of recovery, such tools should be exploited to provide automated failure detection, assist failure diagnosis and support automated error recovery, with scripted responses. Tools are very important in reducing all stages of the incident lifecycle, but principally the detection of events and incidents. Ideally the event is automatically detected and resolved, before the users have noticed it or have been impacted in any way.
- Incident diagnosis: the time at which diagnosis to determine the underlying cause has been completed. When IT components fail, it is important that the required level of diagnostics is captured, to enable problem determination to identify the root cause and resolve the issue. The use and capability of diagnostic tools and skills is critical to the speedy resolution of service issues. For certain failures, the capture of diagnostics may extend service downtime. However, the non-capture of the appropriate diagnostics creates and exposes the service to repeat failures. Where the time required to take diagnostics is considered excessive, or varies from the target, a review should be instigated to identify if techniques and/or procedures can be streamlined to reduce the time required. Equally the scope of the diagnostic data available for capture can be assessed to ensure only the diagnostic data considered essential is taken. The additional downtime required to capture diagnostics should be included in the recovery metrics documented for each IT component.
- Incidentrepair: the time at which the failure has been repaired/fixed. Repair times for incidents should be continuously monitored and compared against the targets agreed within OLAs, underpinning contracts and other agreements. This is particularly important with respect to externally provided services and supplier performance. Wherever breaches are observed, techniques should be used to reduce or remove the breaches from similar incidents in the future.
- Incident recovery: the time at which component recovery has been completed. The backup and recovery requirements for the components underpinning a new IT service should be identified as early as possible within the design cycle. These requirements should cover hardware, software and data and recovery targets. The outcome from this activity should be a documented set of recovery requirements that enables the development of appropriate recovery plans. To anticipate and prepare for performing recovery such that reinstatement of service is effective and efficient requires the development and testing of appropriate recovery plans based on the documented recovery requirements. Wherever possible, the operational activities within the recovery plan should be automated. The testing of the recovery plans also delivers approximate timings for recovery. These recovery metrics can be used to support the communication of estimated recovery of service and validate or enhance the Component Failure Impact Analysis documentation. Availability Management must continuously seek and promote faster methods of recovery for all potential Incidents. This can be achieved via a range of methods, including automated failure detection, automated recovery, more stringent escalation procedures, exploitation of new and faster recovery tools and techniques. Availability requirements should also contribute to determining what spare parts are kept within the Definitive Spares to facilitate quick and effective repairs, as described within the Service Transition publication.
- Incident restoration: the time at which normal business service is resumed. An incident can only be considered ‘closed’ once service has been restored and normal business operation has resumed. It is important that the restored IT service is verified as working correctly as soon as service restoration is completed and before any technical staff involved in the incident are stood down. In the majority of cases, this is simply a case of getting confirmation from the affected users. However, the users for some services may be customers of the business, i.e. ATM services, internet-based services. For these types of services, it is recommended that IT service verification procedures are developed to enable the IT service provider organization to verify that a restored IT service is now working as expected. These could simply be visual checks of transaction throughput or user simulation scripts that validate the end-to-end service.
Each stage, and the associated time taken, influences the total downtime perceived by the user. By taking this approach it is possible to see where time is being ‘lost’ for the duration of an incident. For example, the service was unavailable to the business for 60 minutes, yet it only took five minutes to apply a fix – where did the other 55 minutes go?
Using this approach identifies possible areas of inefficiency that combine to make the loss of service experienced by the business greater than it need be. These could cover areas such as poor automation (alerts, automated recovery etc.), poor diagnostic tools and scripts, unclear escalation procedures (which delay the escalation to the appropriate technical support group or supplier), or lack of comprehensive operational documentation. Availability Management needs to work in close association with Incident and Problem Management to ensure repeat occurrences are eliminated. It is recommended that these measures are established and captured for all availability incidents. This provides Availability Management with metrics for both specific incidents and trending information. This information can be used as input to SFA assignments, SIP activities and regular Availability Management reporting and to provide an impetus for continual improvement activity to pursue cost-effective improvements. It can also enable targets to be set for specific stages of the incident lifecycle. While accepting that each incident may have a wide range of technical complexity, the targets can be used to reflect the consistency of how the IT service provider organization responds to incidents.
An output from the Availability Management process is the real-time monitoring requirements for IT services and components. To achieve the levels of availability required and/or ensure the rapid restoration of service following an IT failure requires investment and exploitation of a systems management toolset. Systems management tools are an essential building block for IT services that require a high level of availability and can provide an invaluable role in reducing the amount of downtime incurred. Availability Management requirements cover the detection and alerting of IT service and component exceptions, automated escalation and notification of IT failures and the automated recovery and restoration of components from known IT failure situations. This makes it possible to identify where ‘time is being lost’ and provides the basis for the identification of factors that can improve recovery and restoration times. These activities are performed on a regular basis within Service Operation.