Meta Optimizes Data Center Sustainability with Reinforcement Learning

In a recent blog post, Meta describes how its engineers use reinforcement learning (RL) to optimize environmental controls in Meta’s data centers, reducing energy consumption and water usage while addressing broader challenges such as climate change. Reinforcement learning is a branch of machine learning and optimal control that focuses on how an intelligent agent can make decisions in a changing environment to maximize a reward signal.

Meta’s reinforcement learning-based approach has proven effective in optimizing data center cooling systems, which consume significant energy and water, especially when adapting to changing weather conditions.

Since 2021, Meta’s engineers have applied RL to improve airflow supply for cooling across various weather conditions. Cooling systems are the second-largest consumer of resources in Meta’s data centers, following the IT load. Optimizing these systems reduces not only energy use but also water consumption and greenhouse gas (GHG) emissions. One of Meta’s pilot regions has already reduced supply fan energy consumption by 20% and water usage by 4%.

Meta’s data centers primarily use outdoor air and evaporative cooling systems to maintain temperatures between 65°F and 85°F (18°C to 30°C) and relative humidity between 13% and 80% (data from the sustainability report). This method is both water- and energy-efficient, but further optimization is necessary to reduce the amount of air that must be conditioned. This is where reinforcement learning plays a key role.

Meta’s data centers use a two-tiered penthouse design that draws in 100% outside air. This air is regulated by modulating dampers and mixed with heat from server exhaust when necessary to balance the temperature. After passing through filters and a misting chamber, the air is cooled and humidified before being pushed through fans into the server room. The system also exhausts hot air out of the building to maintain efficient air circulation. Water plays a critical role in evaporative cooling and humidification, keeping the air temperature and moisture levels within optimal ranges.

The penthouse cooling system within Meta’s data centers

Three control loops (temperature, humidity, and airflow) are adjusted to keep the cooling system running efficiently. Of these, the airflow setpoints are particularly hard to model because they depend on local conditions within the data center. RL helps address this complexity by dynamically adjusting the airflow based on real-time data and environmental conditions.

Reinforcement learning is well suited to data center cooling because it models the control system as a sequence of states. The RL agent learns from feedback that the environment provides in the form of rewards, specifically energy and water savings. By analyzing data collected from thousands of sensors, RL fine-tunes the airflow setpoints to achieve optimal cooling efficiency while staying within operational parameters.
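To make the idea of a reward signal concrete, here is a minimal Python sketch of how energy and water use could be folded into a single reward while penalizing states outside the operating ranges quoted earlier. Meta's blog post does not publish its reward function, so the `State` fields, the weights, and the penalty are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Operating envelope quoted in the article (from Meta's sustainability report).
TEMP_RANGE_F = (65.0, 85.0)
HUMIDITY_RANGE_PCT = (13.0, 80.0)


@dataclass(frozen=True)
class State:
    """Hypothetical snapshot of the conditions the agent observes."""
    supply_temp_f: float
    relative_humidity_pct: float
    fan_energy_kwh: float
    water_liters: float


def envelope_violation(state: State) -> float:
    """Return how far the state sits outside the allowed ranges (0 if inside)."""
    t_lo, t_hi = TEMP_RANGE_F
    h_lo, h_hi = HUMIDITY_RANGE_PCT
    temp_excess = max(t_lo - state.supply_temp_f, 0.0, state.supply_temp_f - t_hi)
    hum_excess = max(
        h_lo - state.relative_humidity_pct,
        0.0,
        state.relative_humidity_pct - h_hi,
    )
    return temp_excess + hum_excess


def reward(
    state: State,
    energy_weight: float = 1.0,    # assumed weight, not from the source
    water_weight: float = 0.01,    # assumed weight, not from the source
    penalty_weight: float = 100.0  # assumed weight, not from the source
) -> float:
    """Higher is better: less energy and water, and no envelope violations."""
    cost = energy_weight * state.fan_energy_kwh + water_weight * state.water_liters
    return -cost - penalty_weight * envelope_violation(state)


if __name__ == "__main__":
    inside = State(75.0, 40.0, 10.0, 100.0)
    too_hot = State(90.0, 40.0, 10.0, 100.0)
    print(reward(inside), reward(too_hot))
```

The large penalty weight reflects one plausible design choice: a thermal violation should always cost more than any energy or water saving can earn, so the agent never trades safety for efficiency.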

To ensure reliability, Meta’s engineers use a simulator-based RL approach. This allows the RL model to be trained in a simulated environment, which mirrors real-life data center conditions. The simulator uses physics-based models to predict how the building’s systems will respond to changes in weather, IT load, and other variables. By incorporating both historical and simulated data, the RL model can be trained to handle a wide range of conditions, ensuring that the cooling system remains efficient even in outlier scenarios. This offline approach reduces the risks associated with deploying RL models directly in live environments, such as thermal safety breaches or service interruptions.
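The simulator-first workflow can be illustrated with a deliberately tiny Python sketch. It is not Meta's simulator or algorithm, neither of which is described in detail here. The toy `simulate` function, the weather buckets, the airflow levels, and the heat-rise constant are all invented, and the learner is an epsilon-greedy, one-step value update (a contextual bandit, about the simplest RL setting). The point is only that the policy is learned entirely against a model before anything touches a live building.

```python
import random

OUTDOOR_TEMPS_F = [60.0, 70.0, 80.0]  # hypothetical weather buckets (states)
AIRFLOW_LEVELS = [0.6, 0.8, 1.0]      # hypothetical fractions of max airflow (actions)
MAX_INLET_TEMP_F = 85.0               # upper temperature bound quoted in the article
HEAT_RISE_AT_FULL_FLOW_F = 3.5        # made-up heat pickup at full airflow


def simulate(outdoor_temp_f, airflow):
    """Toy stand-in for a physics-based simulator: (inlet temperature, fan energy)."""
    inlet_temp = outdoor_temp_f + HEAT_RISE_AT_FULL_FLOW_F / airflow
    fan_energy = airflow ** 3  # idealized fan affinity law: power ~ flow cubed
    return inlet_temp, fan_energy


def reward(outdoor_temp_f, airflow):
    """Negative fan energy, with a fixed penalty for exceeding the temperature limit."""
    inlet_temp, fan_energy = simulate(outdoor_temp_f, airflow)
    penalty = 10.0 if inlet_temp > MAX_INLET_TEMP_F else 0.0
    return -fan_energy - penalty


def train(episodes=2000, epsilon=0.2, learning_rate=0.1, seed=0):
    """Learn one airflow level per weather bucket using only simulated feedback."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in OUTDOOR_TEMPS_F for a in AIRFLOW_LEVELS}
    for _ in range(episodes):
        s = rng.choice(OUTDOOR_TEMPS_F)
        if rng.random() < epsilon:
            a = rng.choice(AIRFLOW_LEVELS)  # explore
        else:
            a = max(AIRFLOW_LEVELS, key=lambda x: q[(s, x)])  # exploit
        # Move the estimate toward the simulated reward for this state-action pair.
        q[(s, a)] += learning_rate * (reward(s, a) - q[(s, a)])
    return {s: max(AIRFLOW_LEVELS, key=lambda x: q[(s, x)]) for s in OUTDOOR_TEMPS_F}


if __name__ == "__main__":
    print(train())
```

In this toy setup the learned policy settles on the lowest airflow that still keeps the simulated inlet temperature under the limit, which is the same trade-off the article describes at a much larger scale.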

The results of the RL pilot project have been promising. By controlling the supply airflow setpoints, the engineers have managed to maintain stable temperature conditions while reducing the amount of air needed for cooling. This translates into significant energy savings for the supply fans and reduced water use during evaporative cooling.
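A rough back-of-the-envelope calculation shows why a modest cut in airflow can yield a large fan energy saving. The Python sketch below assumes the idealized fan affinity law, under which fan power scales with the cube of airflow. Real fan systems deviate from this, and the article does not state how much airflow was actually reduced, so the result is an illustration only.

```python
def fan_power_ratio(airflow_ratio: float) -> float:
    """Idealized fan affinity law: power scales with the cube of airflow."""
    return airflow_ratio ** 3


def airflow_ratio_for_power_ratio(power_ratio: float) -> float:
    """Invert the affinity law: the airflow ratio that yields a given power ratio."""
    return power_ratio ** (1.0 / 3.0)


if __name__ == "__main__":
    energy_saving = 0.20  # supply fan energy reduction reported for the pilot region
    airflow_ratio = airflow_ratio_for_power_ratio(1.0 - energy_saving)
    print(f"Airflow reduction implied by the ideal law: {1.0 - airflow_ratio:.1%}")
    print(f"Fan power at 90% airflow: {fan_power_ratio(0.9):.1%} of baseline")
```

Under this assumption, the reported 20% fan energy saving would correspond to trimming airflow by only about 7%.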

Meta is applying the same RL methodology to optimize the design of its new data centers, which are being developed specifically to support artificial intelligence workloads. By integrating RL into the design phase, Meta’s engineers aim to ensure that these new data centers are sustainable from the outset. Additionally, they are rolling out this RL approach across Meta’s existing data centers to maximize energy and water savings over the coming years.

Google and Microsoft are also using AI to improve their data centers. DeepMind reduced the energy used for cooling Google’s data centers by up to 40%. Microsoft has introduced AI-driven anomaly detection methods to monitor and address irregularities in power and water usage within its data centers. These methods use telemetry data from electrical and mechanical devices. Additionally, Microsoft employs AI-based techniques to detect and resolve issues with power meters and to identify optimal server placement, reducing wasted power, network, and cooling capacity.

In conclusion, using reinforcement learning for data center cooling optimization is a key component of Meta’s long-term sustainability strategy. By using AI to make its data centers more efficient, Meta aims to reduce its environmental impact while meeting the growing demands of its digital infrastructure.
