The Promise of Self-Healing IT Systems

The human body has a fascinating capacity to heal itself. Throughout our lifetime, the cells in our body work continuously to bring us back to a state of equilibrium. This makes the cell a dynamic living unit that can monitor and restore its processes based on the DNA code it was created with. Whenever cells are attacked and destroyed by viruses, new cells come to the rescue and replace them quickly.

Extending this analogy, we can think of a computer system as a human body made up of various types of cells, which can be either hardware or software. When they are software units, the smaller they are, the easier it is for them to self-heal, recuperate from failures, multiply, or even be destroyed when that is needed. We can call these small units microservices, and they can indeed behave much like the cells of a human body. This does not mean, however, that self-healing applies only to microservices. Like most other techniques, self-healing can be applied to almost any type of IT system. Just as life is all about collective thought and progress, each computer system is part of something bigger. It can communicate, cooperate, and adapt to other systems, which makes it part of a wider ecosystem.

Back to Basics: Self-Healing in IT Systems

In the world of IT, self-healing systems are described as “any device or system that has the ability to perceive that it is not operating correctly and, without external assistance, make the necessary adjustments to restore itself to normal operation”. In short, a self-healing system can proactively monitor and identify a potential variance from its standard parameters, validate it with a degree of confidence and resume normal operations without human intervention.

A self-healing system can be divided into three components:

A System

Here, the system is always running and alert, and it needs no external assistance to behave normally. It can be anything from an application, a third-party API or a piece of hardware to the network itself.

A Monitoring Mechanism

This checks that the system is functioning normally and has not deviated from its expected behaviour. It does so by tracking a range of metrics within the system and comparing them against their threshold values. This can include server monitoring, network monitoring, database monitoring, log monitoring and application performance monitoring, amongst others.

A Restore Protocol

This takes the necessary steps to bring the system back to normal functionality without external assistance. It may range from simple scripts to sophisticated bots. It can be any form of software that is able to restore or repair the malfunctioning system.

It is important that these three components are fused together in the IT system and work in tandem. Let us take a practical example.

Consider an AEM system used for content management. During the first week of every month, content authors are very active, authoring and publishing their content in AEM forms. The huge volume of transactions drives CPU utilization up, which often interrupts the content authors with system shutdowns and eventually impacts the business.

Here, we can monitor AEM system performance by deploying a shell script that checks CPU utilization every 10 minutes and compares it against predefined healthy parameters. The variance can be validated by verifying that CPU usage has stayed above the 80% threshold over consecutive 2-minute intervals. Once the variance is validated, the script should also know what needs to be done to trigger the restore service. This is the monitoring and discovery mechanism of the self-healing system.

The restore service here could simply be a script that collects the necessary log information and restarts the AEM services in a controlled way within the clustered environment. This prevents system outages and large-scale service disruption, thus minimizing the loss of business and consumer goodwill.
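The monitoring and restore steps described above can be sketched in a few lines. This is only an illustration, written in Python rather than as the shell script mentioned in the example. The names `should_restore` and `monitor_once` are hypothetical, the requirement of two consecutive breaches is an assumption, and the restore step is passed in as a plain callable rather than being a real AEM restart command.

```python
from typing import Callable, Iterable

CPU_THRESHOLD = 80.0    # percent, taken from the example above
REQUIRED_BREACHES = 2   # assumption: two consecutive samples over the threshold


def should_restore(samples: Iterable[float],
                   threshold: float = CPU_THRESHOLD,
                   required: int = REQUIRED_BREACHES) -> bool:
    """Return True once `required` consecutive samples exceed `threshold`."""
    streak = 0
    for value in samples:
        # A healthy sample resets the streak, so isolated spikes are ignored.
        streak = streak + 1 if value > threshold else 0
        if streak >= required:
            return True
    return False


def monitor_once(samples: Iterable[float],
                 restore: Callable[[], None]) -> bool:
    """Run one monitoring pass and call `restore` if the variance is validated."""
    if should_restore(samples):
        restore()  # hypothetical hook: collect logs, then restart services
        return True
    return False
```

Requiring a streak of breaches, rather than reacting to a single reading, is one way to "validate the variance" before a restart is triggered. In a real deployment the samples would come from the operating system and the `restore` callable would wrap the controlled restart.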

Different types of self-healing systems

Seamless recovery from failures often requires sophisticated DevOps practices managed across clustered environments. We can classify self-healing systems into three categories: Level 1, Level 2 and Level 3.

Level 1

Here, rule-based solutions are defined, and monitoring and discovery tools use those rules to trigger restore services. Examples include a simple server restart or a configuration change.
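A rule-based setup of this kind can be pictured as a small table that maps conditions on metrics to restore actions. The sketch below is a minimal illustration under stated assumptions: the metric names (`cpu_percent`, `disk_free_percent`), the thresholds and the action names are all hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    name: str
    condition: Callable[[Metrics], bool]  # True when the rule should fire
    action: str                           # name of the restore action to run


# Hypothetical rules; thresholds and action names are illustrative only.
RULES: Tuple[Rule, ...] = (
    Rule("high_cpu", lambda m: m.get("cpu_percent", 0.0) > 80.0, "restart_service"),
    Rule("low_disk", lambda m: m.get("disk_free_percent", 100.0) < 10.0, "rotate_logs"),
)


def actions_for(metrics: Metrics, rules: Tuple[Rule, ...] = RULES) -> List[str]:
    """Return the restore actions whose rule conditions match the metrics."""
    return [rule.action for rule in rules if rule.condition(metrics)]
```

For instance, `actions_for({"cpu_percent": 91.0, "disk_free_percent": 50.0})` selects only the restart action, because only the CPU rule matches.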

Level 2

Here, monitoring and discovery tools watch the IT system, gather information from various sources and show the resulting metrics on a single dashboard for better visualization. What is missing at this level is root cause identification.

Level 3

Here, the system is capable of predicting faults ahead of time based on real-time analytics. It logs past incidents and threshold levels so that threshold values can be changed dynamically based on system requirements. An example of this approach is dynamically monitoring the load and traffic of the system, predicting the load based on historical data and the traffic for that day and time, and scaling the system up or down as required.
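The load-prediction example can be illustrated with a deliberately simple forecast: average the load previously seen at the same hour of day, then size the system for it. This is a sketch and not a real analytics pipeline. The function names, the `(hour, load)` history format and the per-instance capacity figure are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def predict_load(history: Iterable[Tuple[int, float]], hour: int) -> float:
    """Predict load for `hour` as the mean of past loads seen at that hour.

    `history` holds (hour_of_day, requests_per_minute) pairs (assumed format).
    """
    by_hour: Dict[int, List[float]] = defaultdict(list)
    for past_hour, load in history:
        by_hour[past_hour].append(load)
    if hour not in by_hour:
        raise ValueError(f"no historical data for hour {hour}")
    return mean(by_hour[hour])


def instances_needed(predicted_load: float,
                     capacity_per_instance: float,
                     min_instances: int = 1) -> int:
    """Return how many instances are needed to serve the predicted load."""
    return max(min_instances, math.ceil(predicted_load / capacity_per_instance))
```

With a history of 900 and 1100 requests per minute at 09:00, the forecast for 09:00 is 1000. At an assumed capacity of 300 requests per minute per instance, that gives four instances. A production system would likely use a richer model, but the shape is the same: forecast from history, then scale up or down ahead of the demand.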

It is crucial to reduce system downtime so that enterprises can focus on their actual business rather than on managing IT challenges. This, more than anything else, is the need of the hour in today’s technology-driven world.


Head of Engineering (Automatenext)