Advancing System Reliability: Meta's AI-Driven Approach to Root Cause Analysis

Aug 22, 2024

Meta recently shared how it is improving system reliability through advanced investigation tools, including the AI-assisted HawkEye, which aids in debugging machine learning workflows. Meta has also developed a new AI-based investigation system that combines heuristic-based retrieval with large language model (LLM) ranking to assist in root cause analysis. The system has shown promising results: for investigations related to Meta's web monorepo, it achieved 42% accuracy in identifying root causes at the start of an investigation.

HawkEye is a toolkit developed by Meta as part of its Prediction Robustness program. The program was created to drive innovative tools and services that ensure the quality of Meta products relying on machine learning (ML) model predictions. HawkEye is designed to improve the monitoring, observability, and debuggability of Meta's ML products. It includes components for mining root causes as well as UX workflows for guided exploration.

Investigating issues within large systems like Meta's can be complex, especially when dealing with monolithic repositories that involve multiple teams and numerous changes. To build context and isolate the root cause, traditional investigations require significant time and effort. To streamline this process, Meta's new system reduces the search space for potential causes using heuristics, such as code ownership and runtime code graphs. After narrowing down to a few hundred relevant changes, an LLM-based ranking system identifies the most likely root causes, ultimately focusing on the top five changes.
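The retrieval step described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Change` record, the `retrieve_candidates` function, and the choice to keep a change when either heuristic matches are hypothetical and are not taken from Meta's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical, simplified view of one code change."""
    change_id: str
    owner_team: str
    touched_files: frozenset


def retrieve_candidates(changes, affected_teams, runtime_files):
    """Narrow the search space with two heuristics (assumed logic).

    - Code ownership: the change belongs to a team tied to the investigation.
    - Runtime code graph: the change touches a file on the affected code path.
    A change is kept if either heuristic matches.
    """
    kept = []
    for change in changes:
        owned = change.owner_team in affected_teams
        on_path = bool(change.touched_files & runtime_files)
        if owned or on_path:
            kept.append(change)
    return kept


changes = [
    Change("D1", "feed", frozenset({"a.php"})),
    Change("D2", "ads", frozenset({"b.php"})),
    Change("D3", "ads", frozenset({"c.php"})),
]
candidates = retrieve_candidates(changes, {"feed"}, {"b.php"})
print([c.change_id for c in candidates])
```

In this toy run, `D1` is kept by ownership and `D2` by the runtime path, while `D3` matches neither heuristic and is dropped before any ranking would take place.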

The ranking system, which uses a fine-tuned Llama model, employs a structured prompt technique to handle context window limitations, allowing it to rank changes effectively. Backtesting has shown that in 42% of cases, the actual root cause is among the top five ranked suggestions.
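The article does not spell out the prompt technique, so the following is only one plausible way to rank a few hundred changes when a single prompt can hold a limited number of them: rank in batches and let the survivors of each batch compete again. The functions `rank_in_batches` and `top_k_hit_rate`, the batch size of 20, and the `pick_top` callback (a stand-in for an LLM call) are all assumptions. The second function shows how a top-five backtest figure such as the reported 42% could be computed.

```python
def rank_in_batches(changes, pick_top, batch_size=20, keep=5):
    """Reduce a long candidate list to `keep` items using batched ranking.

    `pick_top(batch, n)` stands in for a model call and must return at most
    `n` items from `batch`, best first. Lists that already have `keep` or
    fewer items are returned unchanged.
    """
    if keep >= batch_size:
        raise ValueError("keep must be smaller than batch_size")
    candidates = list(changes)
    while len(candidates) > keep:
        survivors = []
        for start in range(0, len(candidates), batch_size):
            batch = candidates[start:start + batch_size]
            survivors.extend(pick_top(batch, keep))
        candidates = survivors
    return candidates


def top_k_hit_rate(ranked_lists, true_causes, k=5):
    """Fraction of investigations whose known root cause is in the top k."""
    hits = sum(
        1 for ranked, cause in zip(ranked_lists, true_causes)
        if cause in ranked[:k]
    )
    return hits / len(true_causes)


# Toy stand-in for the model: a higher number means a more suspicious change.
scores = {f"D{i}": i for i in range(50)}


def pick_top(batch, n):
    return sorted(batch, key=scores.get, reverse=True)[:n]


top_five = rank_in_batches(list(scores), pick_top)
print(top_five)
print(top_k_hit_rate([["a", "b"], ["c", "d"]], ["b", "x"]))
```

Each round shrinks the candidate list, so the loop terminates as long as `pick_top` respects its limit; a real model call would also need handling for malformed or repeated answers.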

Llama 2 (7B) root cause analysis training process.

Training the LLM involved fine-tuning a Llama 2 (7B) model on Meta's historical investigation data, which helped the model learn to follow root cause analysis (RCA) instructions. The training process used a specially curated dataset of 5,000 instruction-tuning examples. Each example contained details of 2 to 20 changes from Meta's retriever, including the known root cause, together with information available at the start of the investigation, such as its title and impact. This curated dataset allows the model to rank potential code changes by their relevance to an investigation with a good level of confidence.
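To make the shape of such a training example concrete, here is a hedged sketch of how one record might be assembled. The field names, the prompt wording, and the `build_example` helper are hypothetical; only the constraints (2 to 20 candidate changes, the known root cause among them, and the investigation's title and impact) come from the description above. Shuffling the candidates is an added design assumption so that position carries no signal.

```python
import random


def build_example(title, impact, candidate_changes, root_cause_id, rng):
    """Assemble one hypothetical instruction-tuning record."""
    if not 2 <= len(candidate_changes) <= 20:
        raise ValueError("expected between 2 and 20 candidate changes")
    if root_cause_id not in {c["id"] for c in candidate_changes}:
        raise ValueError("the known root cause must be among the candidates")
    shuffled = list(candidate_changes)
    rng.shuffle(shuffled)  # assumption: avoid a positional shortcut
    listing = "\n".join(f"- {c['id']}: {c['summary']}" for c in shuffled)
    instruction = (
        "Identify the change most likely to have caused the issue.\n"
        f"Investigation title: {title}\n"
        f"Impact: {impact}\n"
        f"Candidate changes:\n{listing}"
    )
    return {"instruction": instruction, "response": root_cause_id}


example = build_example(
    title="Checkout page latency regression",
    impact="p95 latency up for web users",
    candidate_changes=[
        {"id": "D1", "summary": "Update button copy"},
        {"id": "D2", "summary": "Add synchronous pricing lookup"},
        {"id": "D3", "summary": "Refactor logging helper"},
    ],
    root_cause_id="D2",
    rng=random.Random(0),
)
print(example["instruction"])
```

The investigation and change summaries in the usage example are invented purely to exercise the function.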

Meta's AI-assisted investigation tools aim to reduce the time and effort needed for root cause analysis, but they also present challenges, such as the risk of incorrect suggestions. To address this, Meta ensures that the system's results are explainable and reproducible, with confidence measurements used to avoid low-confidence recommendations.
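A simple way to picture the confidence safeguard is a threshold filter that returns nothing when no suggestion is trustworthy enough. This is a sketch only: the `filter_by_confidence` function, the score scale of 0 to 1, and the threshold of 0.5 are assumptions, since the article does not describe how Meta measures or applies confidence.

```python
def filter_by_confidence(suggestions, threshold=0.5):
    """Keep (change_id, confidence) pairs at or above the threshold.

    The result is sorted by confidence, highest first. An empty list means
    the system abstains rather than showing a weak recommendation.
    """
    kept = [s for s in suggestions if s[1] >= threshold]
    return sorted(kept, key=lambda s: s[1], reverse=True)


suggestions = [("D1", 0.9), ("D2", 0.3), ("D3", 0.6)]
print(filter_by_confidence(suggestions))
print(filter_by_confidence([("D2", 0.3)]))
```

Abstaining on weak evidence trades some coverage for fewer misleading suggestions, which matches the concern about incorrect recommendations raised above.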

Other AI-assisted investigation tools available on the market include:

BigPanda Root Cause Analysis: An AI-powered tool that quickly identifies the root cause of issues in IT systems by analyzing data and providing recommendations. It features automatic issue identification in real time, reducing investigation and resolution time.
ZDX AI-Powered Root Cause Analysis: A tool that leverages AI and machine learning to analyze data and provide recommendations for remediation, enabling fast identification of issues in networks and applications.
IBM Watson AIOps: An AI-powered tool that analyzes data to identify the root cause of issues in IT systems, providing recommendations for remediation and automatic issue identification in real time.
Skylar Automated Root Cause Analysis: A tool that automates log analysis using machine learning, processing millions or billions of log messages from applications to identify the root cause of issues quickly.

Looking ahead, Meta plans to expand the capabilities of its AI systems, potentially allowing them to autonomously execute workflows and even detect potential incidents before they occur, further enhancing system reliability.

About the Author

Claudio Masolo

Claudio is a cloud engineer. In his spare time, he likes running, reading, and playing old video games.
