Advancing System Reliability: Meta's AI-Driven Approach to Root Cause Analysis

Meta recently shared how it is improving system reliability through advanced investigation tools, including the AI-assisted HawkEye, which aids in debugging machine learning workflows. Meta has also developed a new AI-based investigation system that combines heuristic-based retrieval with large language model (LLM) ranking to assist in root cause analysis. In backtesting on investigations related to Meta's web monorepo, the system placed the actual root cause among its top five suggestions 42% of the time, using only the information available at the start of an investigation.

HawkEye is a toolkit developed by Meta as part of its Prediction Robustness program, which was created to drive innovative tools and services that ensure the quality of Meta products relying on machine learning (ML) model predictions. HawkEye is designed to improve the monitoring, observability, and debuggability of Meta's ML products. It includes components for mining root causes as well as UX workflows for guided exploration.

Investigating issues in systems as large as Meta's can be complex, especially in monolithic repositories that involve many teams and a high volume of changes. Traditional investigations require significant time and effort to build context and isolate the root cause. To streamline this process, Meta's new system first reduces the search space of potential causes using heuristics such as code ownership and runtime code graphs. Once the candidates have been narrowed to a few hundred relevant changes, an LLM-based ranking system identifies the most likely root causes, ultimately focusing on the top five changes.
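The narrowing step can be sketched roughly as follows. This is a minimal illustration, not Meta's implementation: the `Change` schema, the `narrow_candidates` function, and the choice to keep a change when either heuristic matches are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A code change (hypothetical schema, not Meta's)."""
    change_id: str
    owner_team: str
    touched_files: frozenset


def narrow_candidates(changes, affected_teams, runtime_files):
    """Keep changes that pass at least one heuristic.

    - Code ownership: the change is owned by a team tied to the affected system.
    - Runtime code graph: the change touched a file that the affected code
      path executes at runtime.

    Combining the two with OR is an assumption of this sketch.
    """
    kept = []
    for change in changes:
        owned = change.owner_team in affected_teams
        on_graph = bool(change.touched_files & runtime_files)
        if owned or on_graph:
            kept.append(change)
    return kept


changes = [
    Change("D1", "feed", frozenset({"feed/render.py"})),
    Change("D2", "ads", frozenset({"ads/bid.py"})),
    Change("D3", "infra", frozenset({"feed/cache.py", "infra/rpc.py"})),
]
kept = narrow_candidates(changes, {"feed"}, {"feed/cache.py"})
print([c.change_id for c in kept])  # ['D1', 'D3']
```

Here D1 survives on ownership, D3 survives because it touched a file on the runtime path, and D2 matches neither heuristic and is dropped before any LLM ranking takes place.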

The ranking system, which uses a fine-tuned Llama model, employs a structured prompting technique to work within context window limitations, allowing it to rank changes effectively. Backtesting has shown that in 42% of cases, the actual root cause is among the top five ranked suggestions.
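One plausible way to rank a few hundred changes with a model that can only see a handful at a time is to rank in rounds: split the candidates into small batches, keep the best few from each batch, and repeat until a single batch remains. The article does not spell out the prompt structure, so the sketch below is an assumption throughout. The `rank_batch` callable is a hypothetical stand-in for the LLM call, and the batch size of 20 simply echoes the 2-20 changes per training example mentioned in the article.

```python
BATCH_SIZE = 20  # assumption: echoes the 2-20 changes per training example
TOP_K = 5        # the system ultimately focuses on the top five changes


def rank_in_rounds(candidates, rank_batch, batch_size=BATCH_SIZE, top_k=TOP_K):
    """Reduce candidates to top_k by repeatedly ranking small batches.

    rank_batch(batch) must return the batch ordered from most to least
    likely root cause. In a real system it would wrap an LLM prompt.
    """
    if batch_size <= top_k:
        raise ValueError("batch_size must exceed top_k or rounds never shrink")
    survivors = list(candidates)
    # Each round keeps at most top_k items per batch, so the pool shrinks
    # until it fits into a single batch.
    while len(survivors) > batch_size:
        next_round = []
        for start in range(0, len(survivors), batch_size):
            batch = survivors[start:start + batch_size]
            next_round.extend(rank_batch(batch)[:top_k])
        survivors = next_round
    # Final pass: one batch that fits in the context window.
    return rank_batch(survivors)[:top_k]


def toy_rank_batch(batch):
    # Stand-in for the LLM: each candidate is (change_id, score),
    # and a higher score means a more likely root cause.
    return sorted(batch, key=lambda c: c[1], reverse=True)


candidates = [(f"D{i}", i) for i in range(300)]
top = rank_in_rounds(candidates, toy_rank_batch)
print([cid for cid, _ in top])  # ['D299', 'D298', 'D297', 'D296', 'D295']
```

With a deterministic scorer like the toy one, the overall top five always survive every round. A real LLM ranker is not guaranteed to be that consistent across batches, which is one reason backtesting against known root causes matters.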

 

Figure: Llama 2 (7B) root cause analysis training process.

Training the LLM involved fine-tuning a Llama 2 (7B) model on Meta's historical investigation data, which helped the model learn to follow root cause analysis (RCA) instructions. The training process used a specially curated dataset of 5,000 instruction-tuning examples. Each example contains details of 2-20 changes from Meta's retriever, including the known root cause, together with information that was available at the start of the investigation, such as its title and impact. This curated dataset allows the model to rank potential code changes by their relevance to an investigation with a good level of confidence.
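To make the shape of such a training example concrete, here is a small sketch of how one record might be assembled. The field names, the prompt wording, and the `build_example` helper are hypothetical. Only the ingredients come from the article: 2-20 candidate changes, the known root cause among them, and the investigation's title and impact.

```python
import json


def build_example(title, impact, changes, root_cause_id):
    """Build one instruction-tuning record (hypothetical format)."""
    if not 2 <= len(changes) <= 20:
        raise ValueError("expected 2-20 candidate changes")
    if root_cause_id not in {c["id"] for c in changes}:
        raise ValueError("known root cause must be among the candidates")
    listing = "\n".join(f"- {c['id']}: {c['summary']}" for c in changes)
    instruction = (
        "Given the investigation below, name the change most likely to be "
        "the root cause.\n"
        f"Title: {title}\nImpact: {impact}\nCandidate changes:\n{listing}"
    )
    # The response is the label the model should learn to produce.
    return {"instruction": instruction, "response": root_cause_id}


example = build_example(
    title="Checkout page latency regression",
    impact="Slower page loads for a subset of users",
    changes=[
        {"id": "D101", "summary": "Refactor logging helper"},
        {"id": "D102", "summary": "Add synchronous call in checkout render path"},
    ],
    root_cause_id="D102",
)
print(json.dumps(example, indent=2))
```

The investigation and change summaries above are invented purely for illustration.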

Meta's AI-assisted investigation tools aim to reduce the time and effort needed for root cause analysis, but they also present challenges, such as the risk of incorrect suggestions. To address this, Meta ensures that the system's results are explainable and reproducible, with confidence measurements used to avoid low-confidence recommendations.
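Suppressing low-confidence recommendations can be as simple as a threshold applied to the ranked output. The sketch below assumes each suggestion carries a confidence score between 0 and 1. How Meta actually computes confidence is not described in the article, and the `Suggestion` type and the 0.7 cutoff are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    change_id: str
    confidence: float  # assumed to lie in [0, 1]


def confident_suggestions(ranked, min_confidence=0.7):
    """Drop suggestions below the threshold; may return an empty list.

    Returning nothing is deliberate: showing no suggestion is preferable
    to sending responders after an unlikely root cause.
    """
    return [s for s in ranked if s.confidence >= min_confidence]


ranked = [Suggestion("D7", 0.91), Suggestion("D3", 0.72), Suggestion("D9", 0.40)]
shown = confident_suggestions(ranked)
print([s.change_id for s in shown])  # ['D7', 'D3']
```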

Other AI-assisted investigation tools available on the market include:

  • BigPanda Root Cause Analysis: An AI-powered tool that quickly identifies the root cause of issues in IT systems by analyzing data and providing recommendations. It features automatic issue identification in real time, reducing investigation and resolution time.
  • ZDX AI-Powered Root Cause Analysis: A tool that leverages AI and machine learning to analyze data and provide recommendations for remediation, enabling fast identification of issues in networks and applications.
  • IBM Watson AIOps: An AI-powered tool that analyzes data to identify the root cause of issues in IT systems, providing recommendations for remediation and automatic issue identification in real time.
  • Skylar Automated Root Cause Analysis: A tool that automates log analysis using machine learning, processing millions or billions of log messages from applications to identify the root cause of issues quickly.

Looking ahead, Meta plans to expand the capabilities of its AI systems, potentially allowing them to autonomously execute workflows and even detect potential incidents before they occur, further enhancing system reliability.
