Cloud Waste Management: How to Optimize Your Cloud Resources

Key Takeaways

  • There is a tradeoff between speed, quality, and cost; only two can be achieved at any given time.
  • Cloud waste has an environmental impact as well as a financial one.
  • Cloud provider, open-source, or enterprise tools can be used to detect cloud waste.
  • Rightsizing resources and automating resource management significantly reduce cloud waste.
  • Educating teams and raising their awareness is important for reducing waste.

 

Introduction

The FinOps Foundation’s 2024 "State of FinOps" survey reported that organizations’ top priority has shifted to reducing cloud waste, that is, unused resources.

Before looking at how to manage cloud waste, let’s first define what counts as waste in the cloud and why managing it matters.

Definition of cloud waste

FinOps.org defines waste as 'Any usage or cost of resources which provide no value to an organization.'

As cloud adoption increases, financial decisions are shifting to the edges of the organization, where engineers can purchase and provision cloud infrastructure whenever they need it. This has opened new doors for innovation and lets companies avoid upfront capital investments in data centers, but it has also made it harder for finance teams to budget and forecast the company’s IT spending.

Often, if the cloud is not managed properly and there is inadequate or missing governance, the "cloud spend" can go haywire due to unused and underutilized resources.

These "unused or underutilized resources" are often called cloud waste. Organizations must realize they still pay for the capacity they provision in the cloud even if they are not using it.

Importance of cloud waste management

Reducing cloud spend has become essential under current macroeconomic conditions, in which businesses want to cut operational costs without compromising the value they deliver to their customers.

Organizations have also grown conscious of their carbon emissions and are committing to reducing their carbon footprint. Optimizing resources within the cloud helps them reduce their expenses in the cloud and contributes to their sustainability goals.

Understanding Cloud Waste

To better understand cloud waste, we need to understand the iron triangle of project management, which states that there is always a tradeoff between speed, quality, and cost. If you want to deliver a quality product or feature quickly, it will cost you more. Businesses are always trying to innovate and deliver continuous value to their customers, which often means pressuring delivery teams to improve time to market. As a result, capacity is overprovisioned, and resources provisioned to validate a theory or concept are never deleted because the teams have moved on, either to delivering the accepted solution or to another project assignment. This is one of the major causes of cloud waste.

Another reason is manual provisioning of resources in the cloud. A solution architecture with multiple moving parts requires a variety of resources to be provisioned for development and hosting. Validating and testing the solution involves multiple iterations of creating and tearing down these resources. If they are created manually, some pieces can be missed during teardown, contributing to the waste.

In other scenarios, the business decides to shut down a line of business because it is not profitable, but the decision is never communicated to the teams managing the cloud resources that support it. Or a technology modernization is completed, but the legacy infrastructure components remain intact because a separate team manages them. With no effective communication between the teams, the legacy systems keep running while the business process runs on the new platform. This constitutes cloud waste: the resource is no longer needed but is still running.

Examples of cloud waste

Cloud waste can be categorized into different types for ease of understanding. In reality, no single category accounts for all the cloud waste within an organization; it is usually a combination of several or all of them. Let’s look at each one.

Idle resources

Resources that are no longer needed or are 100% unutilized are considered idle resources. Examples include environments that are needed only while testing is in progress and development servers that are not needed outside working hours.

Overprovisioned resources

This applies mainly to compute resources such as virtual machines and RDBMS systems, or to cloud resources provisioned at a higher SKU than needed. Their capacity is higher than what is required to operate comfortably without impacting the business. This happens when capacity is allocated statically based on peak load, while actual load is significantly lower than the peak most of the time. It also happens when a portion of the business is shut down and usage drops while capacity is still sized for the previous requirement. These are categorized as overprovisioned resources and have implications for the business.
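
The two categories above can be told apart from utilization data. The following is a minimal sketch, not a definitive rule: the `ResourceUsage` type, the `classify` function, and both thresholds are assumptions made for illustration, and real workloads would also need memory, network, and I/O metrics.

```python
from dataclasses import dataclass


@dataclass
class ResourceUsage:
    name: str
    avg_cpu_percent: float   # average CPU utilization over the observation window
    peak_cpu_percent: float  # highest CPU utilization over the same window


# Hypothetical thresholds; tune them to your own workloads.
IDLE_PEAK_PERCENT = 1.0
OVERPROVISIONED_AVG_PERCENT = 20.0


def classify(usage: ResourceUsage) -> str:
    """Return "idle", "overprovisioned", or "ok" for one resource."""
    if usage.peak_cpu_percent <= IDLE_PEAK_PERCENT:
        return "idle"  # never did meaningful work in the window
    if usage.avg_cpu_percent < OVERPROVISIONED_AVG_PERCENT:
        return "overprovisioned"  # sized for a peak it rarely reaches
    return "ok"
```

For example, a database that peaks at 75% CPU but averages 8% would be reported as overprovisioned under these assumed thresholds.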

Impact of cloud waste on business

Financial implications

Since you pay for each resource provisioned in the cloud, managing cloud waste becomes critical, as it directly impacts your business’s bottom line. CFOs and finance teams struggle to forecast and budget cloud spend because they have no visibility into how much capacity is wasted and no good way to review it regularly.

Environmental impact

Moving to the cloud can significantly reduce carbon emissions compared to running on-premises data centers. However, unused capacity running in the cloud still consumes electricity and other resources, adding carbon emissions that could be avoided. Companies are now committed to reducing their carbon footprint, so eliminating this waste supports those commitments. Let’s look at how companies can identify cloud waste.

Identifying Cloud Waste

One of the first and most important steps in cloud waste management is identifying cloud waste. For companies with limited cloud use and very few teams provisioning resources in the cloud, it is easier to identify unused resources. However, for large enterprises with massive cloud footprints and multiple teams responsible for provisioning cloud resources, there is a need for automated and effective ways to detect waste at scale.

Below are some tools and techniques to identify cloud waste.

Cloud provider tools

Make use of cloud-native tools to detect cloud waste. Here are some of them for the three popular public cloud providers:

  1. Trusted Advisor and Cost Optimization Hub (AWS)
  2. Active Assist (GCP)
  3. Advisor (Azure)

To track your carbon footprint, you can use the following services:

  1. Customer Carbon Footprint Tool (AWS)
  2. Carbon Footprint (GCP)
  3. Emissions Impact Dashboard (Azure)

Third-party tools

Native cloud provider tools are sufficient in the initial stages, but as you evolve in your FinOps journey, you run into their limitations. You can then explore third-party tools on the market, which include both open-source software and enterprise-licensed products. Let’s look at how we, at my current organization, use open-source software and a native cloud service to identify cloud waste.

Case studies/examples

At Tenerity, we used Cloud Custodian and Amazon QuickSight to detect and report on cloud waste in an automated way. Cloud Custodian is an open-source CNCF project that you can use to manage governance in the cloud. It supports all three major public cloud providers (AWS, GCP, and Azure) with a unified syntax, which makes it easier to use in a multi-cloud setup. It uses "filters" to select the resources in scope and "actions" to act on the filtered resources. The Cloud Custodian documentation has popular examples for each cloud provider, and you can also create custom filters and actions, which makes it very useful in practice.
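
The filter-and-action model can be illustrated in a few lines of plain Python. This is a conceptual sketch only: it is not Cloud Custodian's policy syntax or API, and `run_policy`, the resource fields, and the `waste-candidate` tag are all hypothetical names chosen for the example.

```python
from typing import Any, Callable, Dict, List

Resource = Dict[str, Any]
Filter = Callable[[Resource], bool]
Action = Callable[[Resource], None]


def run_policy(
    resources: List[Resource], filters: List[Filter], actions: List[Action]
) -> List[Resource]:
    """Apply every action to each resource that passes all filters."""
    matched = [r for r in resources if all(f(r) for f in filters)]
    for resource in matched:
        for action in actions:
            action(resource)
    return matched


# Hypothetical policy: flag volumes left unattached for more than 30 days.
def is_unattached(resource: Resource) -> bool:
    return resource["attached"] is False


def older_than_30_days(resource: Resource) -> bool:
    return resource["age_days"] > 30


def mark_as_waste(resource: Resource) -> None:
    resource.setdefault("tags", {})["waste-candidate"] = "true"


volumes = [
    {"id": "vol-1", "attached": False, "age_days": 45},
    {"id": "vol-2", "attached": True, "age_days": 90},
    {"id": "vol-3", "attached": False, "age_days": 3},
]
flagged = run_policy(volumes, [is_unattached, older_than_30_days], [mark_as_waste])
```

Only `vol-1` passes both filters here, so it is the only resource the action tags. Keeping selection and remediation separate is what lets the same filters feed a report first and an automated action later.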

Here is how we deployed the solution for Tenerity:

  1. We deployed Cloud Custodian as a Docker container on AWS Fargate to keep the overhead and cost of running its infrastructure minimal. Cloud Custodian runs weekly for us, but you can configure it to run as often as you need.
  2. The results are parsed into an Excel file stored in an S3 bucket.
  3. An AWS Glue DataBrew job pre-processes and cleans the data, and adds mappings and transformations for business context such as the business unit.
  4. The resulting data is stored in S3 again, this time as a Parquet file, which also reduces the S3 cost.
  5. An AWS Glue job is triggered to infer the schema and create an Athena table, which is then used to build the QuickSight dashboard.
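
The business-context mapping in step 3 can be sketched in plain Python. This is an illustration of the idea only, not the DataBrew job described above: the account IDs, the business unit names, and the `enrich` function are assumptions.

```python
from typing import Dict, List

# Hypothetical mapping from cloud account ID to business unit.
ACCOUNT_TO_BUSINESS_UNIT = {
    "111111111111": "payments",
    "222222222222": "loyalty",
}


def enrich(findings: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return copies of the findings with a business_unit field added."""
    return [
        {
            **finding,
            "business_unit": ACCOUNT_TO_BUSINESS_UNIT.get(
                finding["account_id"], "unknown"
            ),
        }
        for finding in findings
    ]
```

Findings from accounts missing in the mapping are labeled `unknown` rather than dropped, so gaps in the mapping stay visible in the report.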

Architecture Diagram

QuickSight dashboard

As a result, we were able to present a high-level summary of cloud waste across the organization, in which stakeholders can select a business unit or policy and view the results. We were also able to plot week-over-week trends for each policy or business unit, which helped us track progress and spot any new resources classified as waste. Engineering teams could look up the impacted resources and take action on their own, without waiting for anyone to provide the data and recommendations.
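
A week-over-week trend of this kind reduces to counting findings per week and policy and taking the difference. Here is a minimal sketch, assuming each finding is a dictionary with hypothetical `week` and `policy` fields; it does not reflect how QuickSight computes the trend.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_week_and_policy(findings: Iterable[Dict[str, str]]) -> Counter:
    """Count findings per (week, policy) pair."""
    return Counter((f["week"], f["policy"]) for f in findings)


def week_over_week_change(
    counts: Counter, policy: str, previous_week: str, current_week: str
) -> int:
    """Negative values mean fewer wasteful resources than the week before."""
    # A Counter returns 0 for pairs it has never seen.
    return counts[(current_week, policy)] - counts[(previous_week, policy)]
```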

The results might overlap with the findings from other tools, but we chose Cloud Custodian for the flexibility it provides in creating your own filters and actions. Below are some strategies for reducing cloud waste that have proven effective in the industry.

Strategies for Reducing Cloud Waste

There are multiple ways of reducing cloud waste, some of which are described below.

Rightsizing instances

All the public cloud providers have their own recommendation dashboards. For example, AWS has Cost Optimization Hub, and GCP has Active Assist. Start with the rightsizing recommendations from your cloud provider, and have your engineering teams analyze each one and validate whether it can be acted upon. Not every recommendation is actionable, because the tools lack the business context that your teams have, so base your actions on the usage data. Suppress any finding that cannot be remediated.
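
One way to validate a rightsizing recommendation against usage data is to work out the capacity the observed peak actually needs, plus some headroom. The sketch below is an assumption-laden illustration: the size catalogue, the 20% headroom, and `recommend_size` are hypothetical, and it considers CPU only.

```python
from typing import List, Tuple

# Hypothetical size catalogue as (name, vCPUs), ordered from smallest to largest.
SIZES: List[Tuple[str, int]] = [
    ("small", 2),
    ("medium", 4),
    ("large", 8),
    ("xlarge", 16),
]


def recommend_size(
    current_vcpus: int, peak_cpu_percent: float, headroom: float = 0.2
) -> str:
    """Pick the smallest size that covers the observed peak plus headroom."""
    needed_vcpus = current_vcpus * (peak_cpu_percent / 100.0) * (1.0 + headroom)
    for name, vcpus in SIZES:
        if vcpus >= needed_vcpus:
            return name
    return SIZES[-1][0]  # nothing is large enough; keep the largest size
```

A 16-vCPU instance that peaks at 20% CPU needs roughly 3.84 vCPUs with headroom, so this sketch would suggest `medium`. Your teams would still need to check the result against business context the numbers do not capture.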

Automating resource management

Automating resource management is another way to manage cloud waste: resources are provisioned only when they are required and only in the capacity required.

Scheduling on/off times

Define a power schedule for your workloads and adhere to it. Shut off resources automatically when the schedule says they are not needed, such as outside business hours or over the weekend, especially in non-production environments, and turn them back on when needed. You can compare the actual usage hours with the required hours and track the result as a power schedule adherence rate KPI.
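
A power schedule and its adherence KPI can be sketched as follows. The weekday 08:00 to 18:00 schedule and the definition of the KPI (required hours divided by actual running hours, capped at 100%) are assumptions made for illustration; your organization may define both differently.

```python
from datetime import datetime


def should_be_running(now: datetime, start_hour: int = 8, end_hour: int = 18) -> bool:
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < end_hour


def adherence_rate(required_hours: float, actual_hours: float) -> float:
    """Percentage of running time that the schedule actually required."""
    if actual_hours <= required_hours:
        return 100.0  # the resource ran no longer than the schedule asked for
    return 100.0 * required_hours / actual_hours
```

Under this definition, a development server that is required for 50 hours a week but runs for 100 scores 50%.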

Autoscaling

Design your workloads, especially the stateless services, to be configured with auto-scaling, which scales up or down based on demand, usage, or load.
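
To make the scaling decision concrete, here is a minimal sketch of a proportional rule, similar in spirit to what target-tracking autoscalers do. It is not any provider's implementation, and the function and parameter names are assumptions.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Scale the replica count in proportion to metric / target, within bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))
```

With four replicas running at 90% CPU against a 60% target, the rule asks for six replicas; when load drops, the same rule scales back down to the configured minimum, which is where the waste reduction comes from.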

Implementing policies and governance

Governance is paramount as it helps you manage cloud resources effectively. You should define your proactive and reactive approach to governance. A few of the governance activities are listed below.

Tagging policies

Define your organization’s tagging dictionary, laying out the mandatory and optional tags for resources, and define the governance policies around it. For example, as a proactive measure, you can create service control policies (SCPs) in AWS that deny provisioning of resources without the mandatory tags. As a reactive measure, use tools like Cloud Custodian to identify and flag resources that bypassed the SCP, then define an action for them, such as notifying the team, tagging the resource, or terminating the resource. Track a tag compliance KPI to measure progress in your tagging initiatives.
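
A tag compliance KPI can be computed as the share of resources that carry every mandatory tag. The sketch below assumes a hypothetical set of mandatory tags and represents each resource by its tag dictionary.

```python
from typing import Dict, List, Set

# Hypothetical mandatory tags; use your organization's tagging dictionary.
MANDATORY_TAGS = {"owner", "business-unit", "environment"}


def missing_tags(tags: Dict[str, str]) -> Set[str]:
    """Return the mandatory tag keys that a resource does not have."""
    return MANDATORY_TAGS - set(tags)


def tag_compliance_rate(resources: List[Dict[str, str]]) -> float:
    """Percentage of resources that carry every mandatory tag."""
    if not resources:
        return 100.0
    compliant = sum(1 for tags in resources if not missing_tags(tags))
    return 100.0 * compliant / len(resources)
```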

Budgeting and alerts

Work with your finance teams to define and record the cloud spend budget. Create alerts that notify the relevant teams when spend approaches or exceeds the budget. You can use the native cloud provider services and tools to do that. This lets you monitor your cloud spend closely and react before an overrun grows.
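
The logic behind a budget alert is simple enough to sketch. The thresholds (50%, 80%, and 100% of budget) and the linear month-end forecast below are assumptions for illustration; native budgeting services offer their own threshold and forecast options.

```python
from typing import List, Sequence


def crossed_thresholds(
    spend: float, budget: float, thresholds: Sequence[float] = (0.5, 0.8, 1.0)
) -> List[float]:
    """Return the budget fractions that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]


def forecast_month_end(
    spend_to_date: float, days_elapsed: int, days_in_month: int
) -> float:
    """Project month-end spend, assuming the daily run rate stays constant."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return spend_to_date / days_elapsed * days_in_month
```

Alerting on the forecast as well as on actual spend gives teams time to act before the budget is exceeded.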

Automated reporting

Automate the delivery of reports. As in the case study above, you can set up email delivery of the Cloud Custodian QuickSight reports to stakeholders. You can also configure report delivery within the native cloud provider services.

Best Practices for Ongoing Cloud Waste Management

Building and deploying solutions is a good start, but delivering value to the business requires a process and a practice for using them. The initial phases require continuous awareness and training so that teams become comfortable with the solution and adopt it as a practice. Once the teams consistently review and resolve the findings, you can track KPIs to understand which areas or services generate the most waste, then identify the root causes and gaps within the organization so that waste is mitigated at the source. The FinOps Foundation has a reference on activities that can be carried out to optimize cost.

Regular audits and reviews

Regular review meetings with individual teams should be conducted to plan the actions and track progress, especially when the remediation is not automated. Also, meet with stakeholders and leadership teams to report the status, impediments, and show overall progress.

Training and awareness of teams

Conduct training sessions to educate teams about the need for cloud waste management and set clear expectations with them. Define a process for educating new employees through recorded sessions and documentation.

Conclusion

Cloud waste management has become the top priority for organizations in 2024 due to macroeconomic uncertainty as well as sustainability commitments. Businesses need to maximize the value of their cloud investments, yet a considerable amount of spend is lost to cloud waste. Invest in promoting a FinOps culture during all phases of product development and operations, and in training teams to treat cost as a non-functional requirement, like performance and security, while building products and applications. Ensure that automated ways are used to detect, report, and act on waste. Remember, the goal is to maximize the value of cloud investment through optimal use, not simply to cut costs.
