Artificial intelligence for IT operations (AIOps) combines sophisticated methods from deep learning, data stream processing, and domain knowledge to analyse infrastructure data from internal and external sources in order to automate operations and detect anomalies (unusual system behaviour) before they impact the quality of service. Odej Kao, professor at the University of Technology Berlin, gave a keynote presentation about artificial intelligence for IT operations at DevOpsCon Berlin 2021.
Log data is the most powerful source of information, widely available, and can be well-processed by AI-based prediction models, as Kao explained:
In data stream processing we frequently struggle to find sufficient amounts of data. In AIOps, on the other hand, we have many different sources (e.g., metrics, logs, traces, events, alerts) with several terabytes of data produced per day in a typical IT infrastructure. We utilize the power of these hidden gems to assist DevOps administrators and, together with the AI models, improve the availability, security, and performance of the overall system.
According to Kao, AI-driven log analytics will be a mandatory component in future Industry 4.0, IoT, smart cities and homes, autonomous driving, data centers, and IT organizations.
Most companies have already set the scene for operating AIOps platforms: monitoring and ELK stacks are in place and need to be extended with AI-based analytics tools to ensure availability, performance, and security, Kao said.
Kao presented what an AIOps workflow can look like:
The workflow starts with collecting data from many different sources, e.g. metric data from hardware CPU/mem/net utilization, system logs from logstash, and distributed traces from the resource manager. The hard part here is to get a holistic picture of the current infrastructure: due to virtualization, SDNs, VNFs, etc., the system changes at short intervals, so we need to discover the current topology graph and the dependencies.
Then we can map the recorded data to sources and activate the AIOps pipeline, which typically consists of three steps: anomaly detection, root cause analysis, and remediation decision-making. The first two steps exploit various deep learning techniques, while the decision-making aims to automate the handling of system anomalies.
Alerting DevOps administrators is the minimum requirement. In the future, the activation of pre-defined recovery workflows or even the dynamic design of new workflows will be possible.
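To make these three stages concrete, here is a minimal sketch in Python; the component names (AnomalyDetector, RootCauseAnalyzer, Remediator) and the scoring logic are hypothetical placeholders for illustration, not part of any specific AIOps product.

```python
# Illustrative sketch of the three pipeline stages described above.
# All names and the scoring rule are hypothetical placeholders.
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Event:
    source: str    # e.g. "logstash", "node-exporter", "jaeger"
    payload: dict  # raw metric sample, log line, or trace span


class AnomalyDetector:
    def detect(self, events: Iterable[Event]) -> List[Event]:
        # Step 1: flag events that deviate from the learned normal behaviour.
        return [e for e in events if e.payload.get("score", 0.0) > 0.9]


class RootCauseAnalyzer:
    def localize(self, anomalies: List[Event]) -> Dict[str, str]:
        # Step 2: correlate anomalies with the discovered topology graph to
        # find the most likely faulty component (trivially simplified here).
        return {"component": anomalies[0].source} if anomalies else {}


class Remediator:
    def act(self, root_cause: Dict[str, str]) -> None:
        # Step 3: alerting is the minimum action; pre-defined recovery
        # workflows could be triggered here instead.
        if root_cause:
            print(f"ALERT: suspected problem in {root_cause['component']}")


def run_pipeline(events: Iterable[Event]) -> None:
    anomalies = AnomalyDetector().detect(events)
    root_cause = RootCauseAnalyzer().localize(anomalies)
    Remediator().act(root_cause)
```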
InfoQ interviewed Odej Kao about artificial intelligence for IT operations.
InfoQ: What’s the state of practice of AIOps?
Odej Kao: AIOps is on the rise. Many companies have already set the scene by installing sophisticated monitoring infrastructure and collecting and analyzing data from different sources. Logs in particular have a long tradition of being used by system operators to identify problems, and people are familiar with the power of this information for troubleshooting.
For example, a de facto standard for log storage and manual analysis is the ELK stack. The next logical step is to extend this infrastructure with add-ons for analytics like our logsight.ai, moogsoft or coralogix. These components take the available data, search it in real time for anomalies, issue incident alerts and reports, and finally gather all the necessary troubleshooting data for visualisation in the company's own dashboard, e.g. Kibana (a simplified sketch of such an integration follows after this answer).
The currently existing AIOps platforms are working fine, but we need additional research and development work in terms of explainability, root cause analysis, false alarm prevention, and automatic remediation. I believe that in 2-3 years, the majority of companies will operate AIOps platforms simply to keep pace with the increasing data center complexity of a future IoT world.
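As an illustration of that kind of integration, the hedged sketch below pulls recent log documents from an Elasticsearch index and hands them to an analysis step; the index pattern, the URL, and the detect_anomalies function are assumptions made for this example, and a real add-on such as logsight.ai would replace the placeholder logic.

```python
# Minimal sketch: fetch the last five minutes of logs from Elasticsearch
# and pass them to a placeholder anomaly detection function.
# ES_URL, INDEX, and detect_anomalies are assumptions for illustration only.
import requests

ES_URL = "http://localhost:9200"  # assumed local Elasticsearch node
INDEX = "logs-*"                  # hypothetical index pattern


def fetch_recent_logs() -> list:
    query = {
        "size": 1000,
        "query": {"range": {"@timestamp": {"gte": "now-5m"}}},
    }
    resp = requests.get(f"{ES_URL}/{INDEX}/_search", json=query, timeout=10)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]


def detect_anomalies(docs: list) -> list:
    # Placeholder for the AI-based analysis step; a real add-on would
    # apply a learned model here instead of a keyword match.
    return [d for d in docs if "error" in str(d.get("message", "")).lower()]


if __name__ == "__main__":
    suspicious = detect_anomalies(fetch_recent_logs())
    print(f"{len(suspicious)} suspicious log lines in the last five minutes")
```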
InfoQ: Which approaches exist for AIOps and what are the main differences?
Kao: The main difference is the design of the prediction model. There are typically three different approaches, balancing explainability (why a certain action was taken) against adaptivity (dealing with previously unknown situations and challenges).
A rule-based approach utilizes a set of rules derived from DevOps knowledge; it is fully explainable, but limited to the existing, pre-defined catalogue.
A supervised learning model is created by injecting failures into the system and recording the output; the corresponding input/output values serve as a learning base for the model. It works very fast; however, the lab systems used for injecting failures often differ from real systems in terms of noise (updates, upgrades, releases, competing applications, etc.).
An unsupervised approach assumes that the system runs smoothly most of the time and that anomalies are significantly rarer than normal values. Thus, the corresponding prediction model describes the normal state of the system and identifies deviations from the expected (normal) behaviour as anomalies. This approach has the best adaptivity, but classifying a detected anomaly requires a subsequent root cause analysis step to determine the anomaly type.
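To illustrate the unsupervised idea, the sketch below fits a model on data recorded during normal operation and flags deviations in new observations; scikit-learn's IsolationForest and the synthetic feature vectors are only an example setup, not the model of any particular AIOps platform.

```python
# Unsupervised anomaly detection sketch: learn "normal" from past samples,
# then flag deviations. Model choice and data are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

# Assumed training data: feature vectors (e.g. log-template counts or
# CPU/memory/latency samples) recorded while the system ran normally.
rng = np.random.default_rng(42)
normal_samples = rng.normal(loc=0.5, scale=0.1, size=(10_000, 4))

model = IsolationForest(contamination=0.01, random_state=42)
model.fit(normal_samples)

# predict() returns 1 for inliers and -1 for anomalies.
new_samples = np.array([[0.5, 0.5, 0.5, 0.5],   # looks normal
                        [3.0, 0.1, 4.2, 0.0]])  # clear deviation
print(model.predict(new_samples))               # expected: [ 1 -1]
```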
InfoQ: How can we use AI to analyze logs, and what benefits does this bring?
Kao: Logs are the most powerful data source. They are written into the code by human developers and thus contain significant semantic information that we can exploit. They are widely available and, in contrast to metric data, cover the frequent changes that we see in agile development.
In large companies we see thousands of software, hardware, and configuration changes per day, and each of them is a possible source of error. The logs help us to understand the impact of changes on the overall system and to interpret the recorded data. Every update influences the prediction model and creates a "new normal"; we see this in the logs and can adapt (a simplified sketch of this adaptation follows after this answer).
And only in cases where the system behaviour cannot be explained by the modification do we present the most likely log lines responsible for errors, performance degradation, or security problems. Our tool logsight.ai needs 3.5 minutes to load, pre-process, and analyse 350K log lines from production systems and to detect all 60 types of errors contained in the data. Thus, it assists developers and operators by tremendously speeding up troubleshooting.
The DevOps administrators do not need to scroll through thousands of unrelated log lines, but get all relevant information presented in the dashboard and can immediately start solving the detected problem. This has a significant impact on the availability, performance, and security of the system.
The analysis of logs is not limited to providing support to DevOps and troubleshooting. Analysis can also bring important contributions to other fields such as cyber security, compliance and regulations, and user experience.
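The "new normal" adaptation Kao describes can be sketched by comparing the distribution of log templates before and after a change and adapting the baseline when the shift is large; the template strings and the 0.1 threshold below are invented purely for illustration.

```python
# Sketch of "new normal" detection: if the log-template distribution shifts
# strongly after a change, adapt the baseline instead of alerting on every
# difference. Templates and threshold are illustrative assumptions.
from collections import Counter


def template_distribution(templates):
    counts = Counter(templates)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def distribution_shift(before, after):
    # Total variation distance between two template distributions.
    keys = set(before) | set(after)
    return 0.5 * sum(abs(before.get(k, 0.0) - after.get(k, 0.0)) for k in keys)


baseline = template_distribution(["conn accepted", "request served"] * 50)
current = template_distribution(["conn accepted", "request served"] * 40
                                + ["cache rebuilt"] * 20)  # new template after an update

if distribution_shift(baseline, current) > 0.1:
    print("Log behaviour changed: adapt the baseline to the new normal")
else:
    print("No significant shift: keep the existing baseline")
```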
InfoQ: What will the future bring us for AIOps?
Kao: I believe that AIOps platforms will be a standard component of every infrastructure. The current approach of hiring more SREs/NREs does not scale with growing data centers and the widening scope into edge and fog computing environments.
Moreover, logs are a vital part of every autonomous system -- from self-driving vehicles to IoT sensors in smart cities and homes -- and serve for debugging, fraud detection, and security improvements, but also as a foundation for legal claims.
Therefore, I do not see how data centers and complex infrastructures can fulfill future obligations without investing in the AI-driven automation of such basic operations.