Prometheus Monitoring Platform "Graduates" from the Cloud Native Computing Foundation (CNCF)

On August 9th, 2018, the Cloud Native Computing Foundation (CNCF) announced that Prometheus, an open source monitoring toolkit, has graduated from incubation status. To achieve this status, projects must demonstrate growth, documentation, organized governance processes, and a commitment to community sustainability and inclusivity.

As more organizations build cloud native applications, monitoring these distributed applications has become increasingly important. Given this growth, it made sense to elevate Prometheus's status within the CNCF. Chris Aniszczyk, COO of the CNCF, explains:

Since its inception in 2012, Prometheus has become one of the top open source monitoring tools of choice for enterprises building modern cloud native applications. Prometheus has cultivated an active developer and user community, giving the Technical Oversight Committee (TOC) full confidence to graduate the project.

To move from incubating to graduated status, the project had to adopt the CNCF code of conduct, undergo an independent security audit, and disclose a governance structure that outlines how it plans to grow its community. Many people within the open source community applauded the graduation, including Thomas Di Giacomo, SUSE CTO:

It's a great complement to Cloud Native Computing Foundation projects like Kubernetes and to other distributed systems. Today's graduation is a well-deserved recognition of the technology, its community maturity, and the value Prometheus brings to a wide range of use cases.

Prometheus originated at SoundCloud, a popular streaming music and podcast service, in 2012 and was contributed to CNCF in May 2016. Since entering the incubation phase, there have been 30 official major and minor releases.

In a 2017 GOTO conference presentation, Björn Rabenstein, a production engineer at SoundCloud, discussed some of the drivers behind developing Prometheus, including why black-box monitoring is not sufficient. Joseph Yankel, senior developer team lead at CERT, defines black-box monitoring as:

[A method that] allows you to check a system or application (e.g., checking disk space, or pinging a host) to see if a host or service is alive, but does not help you understand how it may have gotten to the current state.

 

Image Source: (screen-shot) https://www.youtube.com/watch?v=hhZrOHKIxLw

Instead, Prometheus takes a white-box monitoring approach built on time-series data, which lets you determine why the system reached its current state. Another driver for this methodology is SoundCloud's DevOps approach to monitoring its systems, which Rabenstein describes as "you build it, you run it". Previous black-box approaches resulted in products being thrown over the fence to an operations team to manage without much context. Now, SoundCloud focuses on building in monitoring capabilities from the ground up by including metrics in code or stat tables.
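To make "including metrics in code" concrete, here is a minimal sketch of an in-process counter that application code increments as it works and that can be rendered as plain text for a monitoring system to collect. The `Counter` class, the metric name `http_requests_total`, and the `handle_request` function are hypothetical names chosen for illustration; this is not the official Prometheus client library, and the output only loosely follows the style of its text format.

```python
from collections import defaultdict


class Counter:
    """Hypothetical in-process counter (illustration only, not a real client library)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(float)  # sorted label tuple -> running total

    def inc(self, amount=1.0, **labels):
        # A counter only ever goes up; decreases would hide what happened.
        if amount < 0:
            raise ValueError("counters only go up")
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def render(self):
        # One line per label combination, preceded by descriptive comments.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            suffix = "{" + label_str + "}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines)


# Assumed metric name, for illustration.
requests_total = Counter("http_requests_total", "Requests handled, by status.")


def handle_request(ok):
    # The application records what happened at the moment it happens.
    requests_total.inc(status="200" if ok else "500")


for outcome in (True, True, False):
    handle_request(outcome)

print(requests_total.render())
```

Because the counts come from inside the application, they carry context (here, the response status) that an external ping or disk check cannot see.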

By leveraging white-box monitoring, alerting within Prometheus becomes more intelligent. Instead of relying on static thresholds alone, Prometheus can take advantage of time-series trends. Rabenstein illustrated a scenario in which traditional monitoring sets a disk space threshold at 85%. In this scenario, alerts fire unnecessarily whenever usage crosses the threshold, even though the disk never actually fills up. By using a time-series trend instead, an alert can be sent only when the trend line predicts an actual outage. Rabenstein argues that for on-call engineers, avoiding unnecessary call-outs is important for preventing alert fatigue.

Image Source: (screen-shot) https://www.youtube.com/watch?v=hhZrOHKIxLw
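The difference between a static threshold and a trend-based alert can be sketched with a simple least-squares extrapolation. This is an illustration under stated assumptions, not the implementation from the talk: the sample data, the four-hour horizon, and all function names are hypothetical. Prometheus's query language offers a `predict_linear` function for this kind of extrapolation; the Python below only mirrors the idea.

```python
def fit_line(samples):
    """Least-squares fit over (timestamp_seconds, value) pairs.

    Returns (slope, intercept).
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("need samples at two or more distinct times")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def predict(samples, seconds_ahead):
    """Extrapolate the fitted line to seconds_ahead past the last sample."""
    slope, intercept = fit_line(samples)
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


def threshold_alert(samples, limit=85.0):
    """Traditional check: fire when the latest value crosses a fixed limit."""
    return samples[-1][1] >= limit


def trend_alert(samples, horizon_seconds=4 * 3600, full=100.0):
    """Trend check: fire when the disk is predicted to be full within the horizon."""
    return predict(samples, horizon_seconds) >= full


HOUR = 3600
# Disk A (hypothetical): sits at 90% used and is not growing.
steady = [(h * HOUR, 90.0) for h in range(6)]
# Disk B (hypothetical): 80% used at most, but gaining 10 points per hour.
filling = [(h * HOUR, 30.0 + 10.0 * h) for h in range(6)]

for name, series in (("steady", steady), ("filling", filling)):
    print(name, "threshold:", threshold_alert(series), "trend:", trend_alert(series))
```

In this sketch the steady disk trips the 85% threshold without ever being at risk, while the filling disk stays under the threshold yet is on course to run out of space, which is the case the trend-based alert catches.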

In addition to leveraging trend lines, Prometheus provides a way to build alert routing trees and the ability to group, or bundle, notifications. So if an underlying host disappears, a single alert can be sent instead of one alert for each service that is running on that host.

Image Source: (screen-shot) https://www.youtube.com/watch?v=hhZrOHKIxLw
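The grouping idea can be sketched as bucketing alerts by a shared label so that each bucket produces one notification. The alert names, the `host` label, and the functions below are hypothetical and chosen for illustration; in a real Prometheus deployment grouping and routing are driven by configuration rather than by application code like this.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def notifications(alerts, group_by):
    """Build one message per group instead of one message per alert."""
    messages = []
    for key, members in sorted(group_alerts(alerts, group_by).items()):
        names = ", ".join(sorted(a["alertname"] for a in members))
        labels = ", ".join(f"{l}={v}" for l, v in zip(group_by, key))
        messages.append(f"[{labels}] {len(members)} alert(s): {names}")
    return messages


# Hypothetical firing alerts: three services share a host that disappeared.
alerts = [
    {"alertname": "ApiDown", "host": "node-1"},
    {"alertname": "DbDown", "host": "node-1"},
    {"alertname": "CacheDown", "host": "node-1"},
    {"alertname": "DiskFilling", "host": "node-2"},
]

for message in notifications(alerts, ["host"]):
    print(message)
```

Four alerts collapse into two notifications here, so the on-call engineer sees one message for the failed host rather than one per affected service.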

For additional information regarding Prometheus milestones and its tenure at CNCF, please visit the following blog.
