Definitions & Scope
Incident Management is responsible for handling all Incidents. This includes all kinds of malfunctions, errors and bugs, whether reported by users or technical staff or detected and reported automatically by monitoring tools and systems.
The primary goals of Incident Management are:
- to restore normal service operation as quickly as possible after Incidents,
- to minimize adverse impacts on business operations,
- to maintain the best possible levels of service quality and availability.
An Incident is defined as:
- an unplanned interruption to a service (e.g. unavailability of the mail system) OR
- a reduction in the quality of a service (e.g. reduced network bandwidth or increased response times) OR
- a failure of a service component which is necessary for service provision, even if the service has not been impacted yet (e.g. failure of one part of a hardware cluster)
All occurrences of such Incidents with an actual or potential negative impact on service quality are handled by Incident Management. Input into the process could come from end-users, GS & IT staff or other processes such as Event Management.
![Incident management process overview](../sites/test-static-04.web.cern.ch/files/incidentmanagement.jpg)
The Incident Management process design faces the following challenges:
- standardized collection and documentation of information
- correct and consistent classification and dispatching of Incidents
- correct prioritization of Incidents and implementation of deadlines
- further issues (e.g. escalation, automation, involvement of the right and exclusion of the wrong people)
Classification
Incident Urgency classes:
- High: The damage caused by the Incident increases rapidly.
- Medium: The damage caused by the Incident increases considerably over time.
- Low: The damage caused by the Incident increases only marginally over time.
Incident Impact classes:
- Down: critical adverse impact on the service
- Degraded: major adverse impact on the service
- Affected: minor adverse impact on the service
- Disrupted: small number of the population affected
The Priority [P] is obtained from the combination of Impact [I] and Urgency [U]: [P] = [U] + [I] - 1.
VIP treatment is implemented by raising the Priority by one step (e.g. a priority 2 becomes a priority 1, a 6 becomes a 5, etc.).
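The priority calculation can be illustrated with a minimal Python sketch. The numeric values given to the Urgency and Impact classes (1 for the most severe class, counting up in the order listed above) are an assumption made for illustration; the text does not state them. The names `Urgency`, `Impact` and `priority` are hypothetical, and capping a VIP priority at 1 is likewise an assumption, since the text does not say what happens when a priority 1 Incident receives VIP treatment.

```python
from enum import IntEnum


class Urgency(IntEnum):
    # Assumed numbering: 1 = most urgent (not stated in the text).
    HIGH = 1
    MEDIUM = 2
    LOW = 3


class Impact(IntEnum):
    # Assumed numbering: 1 = most severe (not stated in the text).
    DOWN = 1
    DEGRADED = 2
    AFFECTED = 3
    DISRUPTED = 4


def priority(urgency: Urgency, impact: Impact, vip: bool = False) -> int:
    """Return the Priority P = U + I - 1, raised one step for VIPs."""
    p = int(urgency) + int(impact) - 1
    if vip:
        # "Raising" the priority means a lower number; the floor of 1
        # is an assumption for Incidents that are already priority 1.
        p = max(1, p - 1)
    return p


print(priority(Urgency.HIGH, Impact.DEGRADED))            # 1 + 2 - 1
print(priority(Urgency.LOW, Impact.DISRUPTED, vip=True))  # 3 + 4 - 1, then one step up
```

Under this assumed numbering, priorities range from 1 to 6, which agrees with the VIP example in which a 6 becomes a 5.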
The Classification Matrix
![Classification](../sites/test-static-04.web.cern.ch/files/tn_Priority.JPG)
Sample Restoration Deadline Matrix
![Restoration](../sites/test-static-04.web.cern.ch/files/tab2b.png)