Specifically, three key pieces of industry guidance go some way towards explaining resilience: Cobit 5, ITIL v3 and the US FFIEC IT Examination Handbook.
Cobit 5, as part of managing critical IT assets (Cobit 5 - BAI09.02) and maintaining a continuity strategy (Cobit 5 - DSS04.02), states[i]:
- Maintain the resilience of critical assets by applying regular preventive maintenance, monitoring performance, and, if required, providing alternative and/or additional assets to minimise the likelihood of failure; and
- Assess the likelihood of threats that could cause loss of business continuity and identify measures that will reduce the likelihood and impact through improved prevention and increased resilience.
The IT Infrastructure Library v3 (ITIL v3) defines resilience as “the ability of a Configuration Item or IT Service to resist Failure or to Recover quickly following a Failure. For example an armoured cable will resist failure when put under stress.”[ii]
ITIL v3 provides further guidance in Service Operation, highlighting that “resilience is designed and built into the system, for example multiple redundant disks or multiple processors. This protects the system against hardware failure since it is able to continue operating using the duplicated hardware component.”[iii]
ITIL v3 also provides guidance on software resilience, noting that “software, data and operating system resilience is also designed into the system, for example mirrored databases (where a database is duplicated on a backup device) and disk-striping technology (where individual bits of data are distributed across a disk array – so that a disk failure results in the loss of only a part of data, which can be easily recovered using algorithms)… setting up and using virtualization systems to allow movement of processing around the infrastructure to give better performance/resilience in a dynamic fashion.”[iv]
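The quotation does not say which algorithms are meant, and striping on its own provides no way to rebuild lost data. One common scheme, assumed here purely for illustration, stores an XOR parity block alongside the striped data (as RAID 5 does), so that any single lost block can be recomputed from the survivors. A minimal Python sketch with made-up block contents:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks striped across three disks (illustrative contents).
data_disks = [b"ABCD", b"EFGH", b"IJKL"]

# A further disk holds the parity block: the XOR of all data blocks.
parity = xor_blocks(data_disks)

# Simulate losing the second disk, then rebuild it from the
# surviving data blocks plus the parity block.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_disks[1])
```

The rebuild works because XOR-ing a value with itself cancels it out, so combining the parity with every surviving block leaves only the missing one. This covers a single failure; losing two blocks at once is beyond what one parity block can recover.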
ITIL v3 defines fault tolerance as “the ability of an IT service or other configuration item to continue to operate correctly after failure of a component part.”[v]
ITIL v3 defines a countermeasure as referring to “any type of control. The term is most often used when referring to measures that increase resilience, fault tolerance or reliability of an IT service.”[vi]
ITIL v3 defines redundancy as “the use of one or more additional configuration items to provide fault tolerance. The term also has a generic meaning of obsolescence, or no longer needed.”[vii]
ITIL v3 defines high availability as “an approach or design that minimizes or hides the effects of configuration item failure from the users of an IT service. High availability solutions are designed to achieve an agreed level of availability and make use of techniques such as fault tolerance, resilience and fast recovery to reduce the number and impact of incidents.”[viii]
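To make the link between redundancy and an agreed level of availability concrete, the sketch below uses the standard textbook formulas for components in series and in parallel. It assumes component failures are independent, which real systems often violate (shared power, shared software defects), and the 99% figure is illustrative rather than taken from ITIL.

```python
HOURS_PER_YEAR = 8760  # 365 days; ignores leap years

def series_availability(availabilities):
    """All components are needed, so their availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel_availability(a, n):
    """n redundant components; the group fails only if all n fail."""
    return 1 - (1 - a) ** n

def annual_downtime_hours(availability):
    """Expected hours of unavailability per year."""
    return (1 - availability) * HOURS_PER_YEAR

single = 0.99  # illustrative availability of one server
pair = parallel_availability(single, 2)

print(f"one server:  {annual_downtime_hours(single):.1f} h/year")
print(f"two servers: {annual_downtime_hours(pair):.1f} h/year")
```

Under these assumptions, adding one redundant server takes expected downtime from roughly 88 hours a year to under one hour, while chaining two 99% components in series lowers the combined availability to about 98%.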
The FFIEC IT Examination Handbook defines resiliency as “the ability of an organization to recover from a significant disruption and resume critical operations” and resiliency testing as “testing of an institution’s business continuity and disaster recovery resumption plans.”[ix]
So what is IT Resilience?
From the preceding literature review of industry guidance, resilience comprises the following:
- Failure risk assessment and preventative countermeasures
- Rapid incident detection and response
- Recovery and countermeasure improvement
In practice, IT failure risk assessments would be performed end to end, at the service, application and infrastructure levels (i.e. a business service is delivered through applications hosted on infrastructure). These risk assessments would then be used to design and implement preventative countermeasures.
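One way to make such an end-to-end assessment concrete is to map each business service to the tiers it depends on and flag any tier that relies on a single component. The sketch below is only an illustration: the service name, tier names and host names are hypothetical, and a real assessment would also weigh likelihood and impact.

```python
# Hypothetical service map: each tier lists its interchangeable components.
service_map = {
    "online-banking": {
        "web": ["web-01", "web-02"],
        "app": ["app-01", "app-02"],
        "database": ["db-01"],
    },
}

def single_points_of_failure(service_map):
    """Return (service, tier, component) for every tier with one component."""
    findings = []
    for service, tiers in service_map.items():
        for tier, components in tiers.items():
            if len(components) == 1:
                findings.append((service, tier, components[0]))
    return findings

for service, tier, component in single_points_of_failure(service_map):
    print(f"{service}: tier '{tier}' depends solely on {component}")
```

Here the database tier is flagged, which would point the design effort towards a countermeasure such as a mirrored or clustered database.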
The countermeasures you’d expect to see are redundancy, clustering, load balancing, fault tolerance or automatic failover built into the architecture, leaving no single points of failure.
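As a rough illustration of automatic failover, the sketch below tries each redundant replica in turn and only gives up when all of them have failed. The `primary` and `standby` functions are hypothetical stand-ins for real service endpoints; production failover is usually handled by a load balancer or cluster manager rather than by application code like this.

```python
class AllReplicasFailed(Exception):
    """Raised when no replica could handle the request."""

def call_with_failover(replicas, request):
    """Try each replica in order; return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            errors.append(exc)  # remember the failure and try the next one
    raise AllReplicasFailed(errors)

def primary(request):
    # Hypothetical primary that is currently down.
    raise ConnectionError("primary unreachable")

def standby(request):
    # Hypothetical standby that is healthy.
    return f"standby handled {request}"

result = call_with_failover([primary, standby], "GET /balance")
print(result)
```

The caller never sees the primary's failure, which is the behaviour the high availability definition describes as hiding the effects of a failure from users.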
When an incident affects either the assessed risks or the resilience features in the architecture, you’d expect it to be detected early and handled by a well-rehearsed, tested and informed incident management process that restores those resilience features.
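Early detection is often built on heartbeats: each component reports in regularly, and one that stays silent for too long is treated as failed. The following is a minimal sketch under that assumption; the component names, timestamps and 30-second timeout are all hypothetical.

```python
def detect_failures(last_heartbeat, now, timeout_seconds=30):
    """Return the components whose last heartbeat is older than the timeout."""
    return sorted(
        name for name, seen in last_heartbeat.items()
        if now - seen > timeout_seconds
    )

# Hypothetical heartbeat times, in seconds since a common start point.
last_heartbeat = {"web-01": 995, "web-02": 990, "db-01": 940}

overdue = detect_failures(last_heartbeat, now=1000)
print(overdue)
```

Passing `now` in explicitly keeps the check easy to test. In a live monitor the list of overdue components would feed the incident management process rather than just being printed.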
Finally, you’d expect appropriate recovery options to be available to support rapid recovery, such as up-to-date backups, fully tested disaster recovery sites and well-tested IT business continuity plans.
[i] ISACA, Cobit 5 - Enabling Processes, United States, 2012. Available at: http://www.isaca.org/COBIT/Pages/COBIT-5-Enabling-Processes-product-page.aspx (Accessed 6 March 2014).
[ii] AXELOS Limited, ITIL glossary and abbreviations, United Kingdom, 2011. Available at: http://www.itil-officialsite.com/InternationalActivities/ITILGlossaries_2.aspx (Accessed 6 March 2014).