Главная Обратная связь


Designing resilience

Capacity Management assists with the identification and improvement of the resilience within the IT infrastructure or any subset of it, wherever it is cost-justified. In conjunction with Availability Management, Capacity Management should use techniques such as Component Failure Impact Analysis (CFIA, as described in section 4.4 on Availability Management) to identify how susceptible the current configuration is to the failure or overload of individual components and make recommendations on any cost-effective solutions.

Capacity Management should be able to identify the impact on the available resources of particular failures, and the potential for running the most important services on the remaining resources. So the provision of spare capacity can act as resilience or fail-over in failure situations.

The requirements for resilience in the IT infrastructure should always be considered at the time of the service or system design. However, for many services, the resilience of the service is only considered after it is in live operational use. Incorporating resilience into Service Design is much more effective and efficient than trying to add it at a later date, once a service has become operational. Threshold management and control

The technical limits and constraints on the individual services and components can be used by the monitoring activities to set the thresholds at which warnings and alarms are raised and exception reports are produced. However, care must be exercised when setting thresholds, because many thresholds are dependent on the work being run on the particular component.

The management and control of service and component thresholds is fundamental to the effective delivery of services to meet their agreed service levels. It ensures that all service and component thresholds are maintained at the appropriate levels and are continuously, automatically monitored, and alerts and warnings generated when breaches occur. Whenever monitored thresholds are breached or threatened, then alarms are raised and breaches, warnings and exception reports are produced. Analysis of the situation should then be completed and remedial action taken whenever justified, ensuring that the situation does not recur. The same data items can be used to identify when SLAs are breached or likely to be breached or when component performance degrades or is likely to be degraded. By setting thresholds below or above the actual targets, action can be taken and a breach of the SLA targets avoided. Threshold monitoring should not only alarm on exceeding a threshold, but should also monitor the rate of change and predict when the threshold will be reached. For example, a disk-space monitor should monitor the rate of growth and raise an alarm when the current rate will cause the disk to be full within the next N days. If a 1GB disk has reached 90% capacity, and is growing at 100KB per day, it will be 1,000 days before it is full. If it is growing at 10MB per day, it will only be 10 days before it is full. The monitoring and management of these events and alarms is covered in detail in the Service Operations publication.

There may be occasions when optimization of infrastructure components and resources is needed to maintain or improve performance or throughput. This can often be done through Workload Management, which is a generic term to cover such actions as:

  • Rescheduling a particular service or workload to run at a different time of day or day of the week, etc. (usually away from peak times to off-peak windows) – which will often mean having to make adjustments to job-scheduling software
  • Moving a service or workload from one location or set of CIs to another – often to balance utilization or traffic
  • Technical ‘virtualization’: setting up and using virtualization techniques and systems to allow the movement of processing around the infrastructure to give better performance/resilience in a dynamic fashion
  • Limiting or moving demand for components or resources through Demand Management techniques, in conjunction with Financial Management (see section

It will only be possible to manage workloads effectively if a good understanding exists of which workloads will run at what time and how much resource utilization each workload places on the IT infrastructure. Diligent monitoring and analysis of workloads, together with a comprehensive CMIS, are therefore needed on an ongoing operational basis. Demand Management

The prime objective of Demand Management is to influence user and customer demand for IT services and manage the impact on IT resources.

This activity can be carried out as a short-term requirement because there is insufficient current capacity to support the work being run, or, as a deliberate policy of IT management, to limit the required capacity in the long term.

Short-term Demand Management may occur when there has been a partial failure of a critical resource in the IT infrastructure. For example, if there has been a failure of a processor within a multi-processor server, it may not be possible to run the full range of services. However, a limited subset of the services could be run. Capacity Management should be aware of the business priority of each of the services, know the resource requirements of each service (in this case, the amount of processor power required to run the service) and then be able to identify which services can be run while there is a limited amount of processor power available.

Long-term Demand Management may be required when it is difficult to cost-justify an expensive upgrade. For example, many processors are heavily utilized for only a few hours each day, typically 10.00–12.00 and 14.00–16.00. Within these periods, the processor may be overloaded for only one or two hours. For the hours between 18.00–08.00, these processors are only very lightly loaded and the components are under-utilized. Is it possible to justify the cost of an upgrade to provide additional capacity for only a few hours in 24 hours? Or is it possible to influence the demand and spread the requirement for resource across 24 hours, thereby delaying or avoiding altogether the need for a costly upgrade?

Demand Management needs to understand which services are utilizing the resource and to what level, and the schedule of when they must be run. Then a decision can be made on whether it will be possible to influence the use of resource and, if so, which option is appropriate.

The influence on the services that are running could be exercised by:

  • Physical constraints: for example, it may be possible to stop some services from being available at certain times, or to limit the number of customers who can use a particular service – for example, by limiting the number of concurrent users; the constraint could be implemented on a specific resource or component – for example, by limiting the number of physical connections to a network router or switch
  • Financial constraints: if charging for IT services is in place, reduced rates could be offered for running work at times of the day when there is currently less demand for the resource. This is known as differential charging. Modelling and trending

A prime objective of Capacity Management is to predict the behaviour of IT services under a given volume and variety of work. Modelling is an activity that can be used to beneficial effect in any of the sub-processes of Capacity Management.

The different types of modelling range from making estimates based on experience and current resource utilization information, to pilot studies, prototypes and full-scale benchmarks. The former is a cheap and reasonable approach for day-to-day small decisions, while the latter is expensive, but may be advisable when implementing a large new project or service. With all types of modelling, similar levels of accuracy can be obtained, but all are totally dependent on the skill of the person constructing the model and the information used to create it.


The first stage in modelling is to create a baseline model that reflects accurately the performance that is being achieved. When this baseline model has been created, predictive modelling can be done, i.e. ask the ‘What if?’ questions that reflect failures, planned changes to the hardware and/or the volume/variety of workloads. If the baseline model is accurate, then the accuracy of the result of the potential failures and changes can be trusted.

Effective Capacity Management, together with modelling techniques, enables Capacity Management to answer the ‘What if?’ questions. What if the throughput of Service A doubles? What if Service B is moved from the current server onto a new server – what will be the effect on the response times of the two services?

Trend analysis

Trend analysis can be done on the resource utilization and service performance information that has been collected by the Capacity Management process. The data can be analysed in a spreadsheet, and the graphical and trending and forecasting facilities used to show the utilization of a particular resource over a previous period of time, and how it can be expected to change in the future.

Typically, trend analysis only provides estimates of future resource utilization information. Trend analysis is less effective in producing an accurate estimate of response times, in which case either analytical or simulation modelling should be used. Trend analysis is most effective when there is a linear relationship between a small number of variables, and less effective when there are non-linear relationships between variables or when there are many variables.

sdamzavas.net - 2019 год. Все права принадлежат их авторам! В случае нарушение авторского права, обращайтесь по форме обратной связи...