LinkedIn's Open-Source "iris-message-processor" Achieves 86.6x Faster Escalation Management Speeds

LinkedIn developed a new open-source service called "iris-message-processor" to enhance the performance and reliability of its existing Iris escalation management system. "iris-message-processor" significantly improves processing speeds, being ~4.6x faster under average loads and ~86.6x faster under high loads than its predecessor.

"iris-message-processor" is a fully-distributed replacement for the previous iris-sender subprocess, allowing for more concurrent processing and direct sending of messages to appropriate vendors. It enables increased horizontal scaling and reduces demands on the existing database by not using it as a message queue.

The iris-message-processor can handle large bursts of escalations, processing a test load of 6000 escalations in less than 10 seconds, a massive improvement from the 30 minutes taken by the previous system. The service maintained efficient processing times even when dropping 50% of its nodes simultaneously, rebalancing the cluster in less than 30 seconds and processing active escalations in under three seconds, even under an above-average load.

LinkedIn achieved this performance improvement by splitting Iris escalations into buckets that are dynamically assigned to different nodes in the iris-message-processor cluster, which allows concurrent processing and direct message sending. The company re-architected Iris to handle "10x" growth, aiming to avoid future redesigns as demand increases.
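The bucket idea can be illustrated with a minimal Go sketch. It is not taken from the iris-message-processor codebase: the bucket count, the FNV hash, the round-robin assignment, and the names `bucketFor` and `assignBuckets` are all assumptions made for illustration. The sketch only shows how a fixed set of buckets can be re-spread over whichever nodes are currently alive.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// numBuckets is a hypothetical bucket count chosen for this example.
const numBuckets = 8

// bucketFor maps an escalation ID to a fixed bucket by hashing the ID.
// The bucket of an escalation never changes; only bucket ownership does.
func bucketFor(escalationID string) int {
	h := fnv.New32a()
	h.Write([]byte(escalationID))
	return int(h.Sum32() % numBuckets)
}

// assignBuckets spreads all buckets round-robin over the live nodes.
// Sorting the node names makes the assignment deterministic.
func assignBuckets(nodes []string) map[int]string {
	sorted := append([]string(nil), nodes...)
	sort.Strings(sorted)
	owners := make(map[int]string, numBuckets)
	if len(sorted) == 0 {
		return owners
	}
	for b := 0; b < numBuckets; b++ {
		owners[b] = sorted[b%len(sorted)]
	}
	return owners
}

func main() {
	owners := assignBuckets([]string{"node-a", "node-b", "node-c", "node-d"})
	b := bucketFor("escalation-42")
	fmt.Printf("bucket %d owned by %s\n", b, owners[b])

	// Simulate losing half of the nodes: recomputing the assignment
	// gives every bucket a surviving owner.
	owners = assignBuckets([]string{"node-a", "node-c"})
	fmt.Printf("after rebalance: bucket %d owned by %s\n", b, owners[b])
}
```

Because escalations map to buckets and buckets map to nodes, a membership change only requires recomputing the second mapping. How the real service detects membership changes and coordinates ownership is not described in the article and is left out here.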


Architecture diagram of the new Iris ecosystem (Source)

The new service is written in Go. It avoids using the database as a message queue, which reduces strain on the existing database system. This change also makes it possible to use an eventually consistent database to store the resulting messages for tracking, which improves scalability.

LinkedIn used the iris-message-processor for about a year with no outages, consistently beating SLOs of 1000ms/msg. It designed the iris-message-processor to be a drop-in replacement for the iris-sender, retaining the existing Iris API while offering substantial performance improvements.

LinkedIn's engineers implemented various measures to ensure stability during rollout. These include preserving backward compatibility and facilitating gradual ramp-up for verification and testing.

The company has made the iris-message-processor an open-source project, sharing it alongside the existing Iris and Oncall repositories. The tool is designed to run outside LinkedIn's internal environment, where it can completely replace off-the-shelf incident management systems.

InfoQ spoke with Diego Cepeda, senior software engineer at LinkedIn, about the iris-message-processor, its implementation, and its open-sourcing.

InfoQ: How has the resource usage (CPU, memory, network bandwidth) changed with the introduction of iris-message-processor, and what impact has this had on operational costs?

Diego Cepeda: Practically speaking, the resource utilization changes have been negligible. LinkedIn operates its critical monitoring infrastructure with very high levels of redundancy for fault tolerance, as we consider reliable monitoring the foundation for reliable operations.

We have instances of Iris geographically distributed across our different data centers, each with enough capacity to handle the entire site's alerting needs. However, Iris is efficient enough that we only use 3 instances per data center, each with only 8 cores and 32 GiB of memory, to comfortably surpass that threshold.

We also have 3 MySQL hosts per data center to serve Iris and iris-message-processor. It is worth noting that this configuration, even for the scale LinkedIn is operating at, is over-provisioned as each iris-message-processor instance uses, on average, less than 5% of its CPU allocation and less than 1% of its memory allocation.

InfoQ: Do you have any Return on Investment (ROI) metrics demonstrating the cost-effectiveness of switching to iris-message-processor vs. switching to commercial, off-the-shelf solutions?

Cepeda: We have not performed any comparisons or analysis versus commercial systems. However, with ~6k internal users of Iris who actively get scheduled for on-call, we can speculate that it would be a sizable investment to provision them all with off-the-shelf solutions.

We do have time as a comparison variable, such as the effort it takes to maintain Iris on an ongoing basis, and we estimate that the majority of that effort is spent on helping our developers onboard to the system and answering questions.

Moreover, we do not have anyone who exclusively works on Iris. Instead, it is just one of the many different services that the monitoring infrastructure team concurrently works on.

Iris has already provided a return on our overall investment, including development costs. For an organization that chooses to leverage Iris, there's a possibility that the ROI would be more dramatic because they would only incur the cost of hardware and maintenance and can freely benefit from Iris' open source status.

InfoQ: You chose Go as the language for the iris-message-processor. What were the reasons behind this decision, and how has this choice impacted the performance and scalability of the system?

Cepeda: We selected Go for its capacity for rapidly developing bug-free, highly scalable, concurrent applications. Go's suitability for this task stems from its lightweight goroutines, which make it easy to manage concurrent tasks without the complexities of traditional threading. The use of channels for communication between goroutines enhances safety and synchronization.
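The goroutine-and-channel pattern mentioned in this answer can be sketched as a small worker pool. This is an illustrative example only, not code from iris-message-processor: the function `processAll`, the worker count, and the result strings are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll fans escalation IDs out to a fixed number of worker goroutines
// and collects one result string per ID over a channel.
func processAll(ids []int, workers int) []string {
	jobs := make(chan int)
	results := make(chan string)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker pulls IDs until the jobs channel is closed.
			for id := range jobs {
				results <- fmt.Sprintf("escalation %d processed", id)
			}
		}()
	}

	// Feed the jobs, then close the channel so workers leave their loops.
	go func() {
		for _, id := range ids {
			jobs <- id
		}
		close(jobs)
	}()

	// Close results once every worker has finished so the collector ends.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	out := processAll([]int{1, 2, 3, 4, 5}, 3)
	fmt.Println(len(out), "escalations processed")
}
```

The workers share no mutable state; all coordination happens through the two channels and the `sync.WaitGroup`, which is the style of synchronization the answer refers to. Result order is not guaranteed, since the workers run concurrently.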

Furthermore, Go's comprehensive standard library includes concurrency, networking, distributed systems, and testing packages, reducing the need for third-party dependencies and streamlining development. These features collectively enable us to write code with fewer bugs, support efficient scaling, and quickly deliver robust concurrent applications like iris-message-processor.

Over the last few years, we have been gradually re-writing most of our critical monitoring stack in Go and have consistently reaped significant benefits in speed of development, performance, and operability. A prime example is the iris-message-processor, which outperformed its previous incarnation in performance by several orders of magnitude under heavy load.

InfoQ: You mentioned that iris-message-processor has been running for about a year at LinkedIn with no outages and consistently beating SLOs. Can you share some instances where the system was put to the test in a high-stress situation and how it performed?

Cepeda: Thankfully, the systems at LinkedIn are reliable, so it is rare that we truly get to put the system to the test. Of course, with any sufficiently large and complex system, some form of failure is always inevitable. One such occasion happened when a DNS issue caused a large portion of the services in one of our production data centers, including the iris-message-processor itself, to become unreachable.

This was as close to a high-stress scenario for our system as we have seen, with thousands of alerts for hundreds of services going bad simultaneously and an entire third of the Iris cluster unavailable to boot.

Thankfully, this was a scenario that we had in mind when designing the iris-message-processor.

As we had planned, the cluster could drop the unreachable nodes, rebalance itself, and process the entire wave of escalations in less than 60 seconds. This is something that could have taken the previous system tens of minutes to resolve fully, resulting in untimely messages and costing valuable time for engineers to be alerted of the issue and resolve it.

InfoQ: With iris-message-processor now being open-source, what features or improvements do you hope to see from the community? How do you plan to manage contributions to make sure they align with the project's goals and quality standards?

Cepeda: The iris-message-processor has at its core an idea of extensibility and flexibility. We recognize that no two environments are alike, and that is reflected in the design and implementation. For example, we currently leverage Twilio to send voice calls and text messages; however, other organizations could have different preferred vendors.

Rather than lock them in and force them to adapt, we made message vendors an easily pluggable interface so that any organization that wants to integrate into any other system can very quickly write a new vendor plugin and be up and running with minimal changes to the codebase.
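A pluggable vendor interface of this kind might look roughly like the following Go sketch. The `Vendor` interface, the `Message` type, the `Dispatcher`, and the `logVendor` stand-in are hypothetical names invented for this example; the actual plugin contract in the iris-message-processor repository may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a hypothetical outbound notification.
type Message struct {
	Mode   string // for example "sms" or "call"
	Target string
	Body   string
}

// Vendor is a hypothetical plugin contract: anything that can deliver
// a Message can be registered with the dispatcher.
type Vendor interface {
	Name() string
	Send(m Message) error
}

// logVendor is a stand-in vendor that records messages in memory
// instead of calling an external provider.
type logVendor struct{ sent []Message }

func (v *logVendor) Name() string { return "log" }

func (v *logVendor) Send(m Message) error {
	v.sent = append(v.sent, m)
	return nil
}

// ErrNoVendor is returned when no vendor handles a message mode.
var ErrNoVendor = errors.New("no vendor registered for mode")

// Dispatcher routes each message mode to the vendor registered for it.
type Dispatcher struct{ byMode map[string]Vendor }

func NewDispatcher() *Dispatcher {
	return &Dispatcher{byMode: map[string]Vendor{}}
}

func (d *Dispatcher) Register(mode string, v Vendor) { d.byMode[mode] = v }

func (d *Dispatcher) Dispatch(m Message) error {
	v, ok := d.byMode[m.Mode]
	if !ok {
		return fmt.Errorf("%w: %q", ErrNoVendor, m.Mode)
	}
	return v.Send(m)
}

func main() {
	d := NewDispatcher()
	sms := &logVendor{}
	d.Register("sms", sms)

	msg := Message{Mode: "sms", Target: "+15550100", Body: "disk full on host-1"}
	if err := d.Dispatch(msg); err != nil {
		fmt.Println("send failed:", err)
	}
	fmt.Println("messages sent via", sms.Name(), "=", len(sms.sent))

	// A mode without a registered vendor produces an error.
	if err := d.Dispatch(Message{Mode: "call"}); err != nil {
		fmt.Println("error:", err)
	}
}
```

In this sketch, supporting a different provider means writing one more type that satisfies `Vendor` and registering it; the dispatch logic stays unchanged.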

This pattern is also present in our choice of storage. The iris-message-processor uses MySQL as a data store, but plenty of alternative data stores could easily fit the bill. That is why all MySQL-specific code is abstracted behind an interface so that it's possible to write a new integration for a different data store without having to change any other parts of the code. This design makes it much more viable for organizations with different tech stacks to adopt the new system easily.
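The same idea applied to storage can be sketched as follows. The `MessageStore` interface, the `SentMessage` type, and the in-memory implementation are assumptions for illustration and do not reflect the project's actual storage interface; a MySQL-backed type would satisfy the same interface.

```go
package main

import (
	"fmt"
	"sync"
)

// SentMessage is a hypothetical record of a message kept for tracking.
type SentMessage struct {
	EscalationID string
	Body         string
}

// MessageStore is a hypothetical storage contract. Callers depend only on
// this interface, so the backing data store can be swapped.
type MessageStore interface {
	Save(m SentMessage) error
	ByEscalation(escalationID string) ([]SentMessage, error)
}

// memoryStore is an in-memory implementation used here in place of a
// real database.
type memoryStore struct {
	mu    sync.Mutex
	byEsc map[string][]SentMessage
}

func newMemoryStore() *memoryStore {
	return &memoryStore{byEsc: map[string][]SentMessage{}}
}

func (s *memoryStore) Save(m SentMessage) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.byEsc[m.EscalationID] = append(s.byEsc[m.EscalationID], m)
	return nil
}

func (s *memoryStore) ByEscalation(id string) ([]SentMessage, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	// Return a copy so callers cannot mutate the stored slice.
	return append([]SentMessage(nil), s.byEsc[id]...), nil
}

// recordSend knows nothing about the concrete store it writes to.
func recordSend(store MessageStore, escalationID, body string) error {
	return store.Save(SentMessage{EscalationID: escalationID, Body: body})
}

func main() {
	store := newMemoryStore()
	if err := recordSend(store, "esc-1", "paging primary on-call"); err != nil {
		fmt.Println("save failed:", err)
		return
	}
	msgs, err := store.ByEscalation("esc-1")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(len(msgs), "message(s) tracked for esc-1")
}
```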

Managing contributions is definitely a challenge we have faced with the already open-source Iris and Oncall. Here, our approach is test-driven development and verification. To ensure a consistent standard of quality, we expect new contributions to be testable and, in addition to that, verified in our local development environments before being accepted.

That being said, we are looking forward to the possibility of external contributors helping us make this an even better and easier-to-use platform.
