Marketo is a marketing automation software, executing over 20 billions customer defined actions per month. Apurva Pawar, Daniel Pugliese, Dennis Bronnikov and Pei-Chiang Ma from Marketo’s engineering team explained at Reactive Summit how they rewrote the core of their system with Akka, an actor model framework for the JVM.
Marketo is used by marketers to execute automated tasks based on configurables rules. To provide some context, the core concepts are:
- Lead: customer or potential customer
- Action: one of 50+ operations that may act on a lead
- (Trigger) campaign: a set of 1-N customer defined actions that occur on event listeners
- Task: a set of customer defined actions
The campaign engine is the component in the center of the rework. It is an event-driven framework, triggering actions to be carried on leads. The legacy system had several shortcomings leading to the redesign:
- Not leveraging multi-threading efficiently
- Resources only shared across a subset of customers
- Underutilized database resources
- Difficulty to scale up components independently
To fix these shortcomings, the team established the project goals:
- Scale horizontally
- Handle fault-tolerance with self-healing
- Create a responsive system
- Provide fairness and elasticity to tenants
- Handle back pressure
- Ensure exactly once execution
- Preserve order for tasks to same target
Backpressure is handled through coordination between the producer and consumer actor. A persistent queue is used to buffer messages but only when the consumer can’t keep up. This is achieved through negotiation between the consumer and producer: messages are pushed directly when possible and pulled from the persistent queue when the producer can’t keep up. This technique of negotiation allowed the team to reduce the writes by 99,84%, from 65,000 to 300 writes per minute.
Fairness is achieved through the actor design. One top-level supervisor by tenant is created, giving some control over splitting system resources by tenant. This reduces the risk of a bigger tenant monopolizing the system and starving the system of resources for smaller tenants.
Resource sharing is optimized through custom sharding allocation. For example, some threads might be using database connections heavily. Multiple shards of that same database landing on the same node might exhaust the connection pool under heavy load. Shards are then allocated to be spread out on different nodes. Some nodes might have more shards than others, which is tolerated in order to maximize resource usage.
The rewrite resulted in several performance improvements. The throughput increased by a factor of 20. Additionally, the system consumes 75% less cpu cores and 33% less memory.
The team concludes by enumerating the key points of their solution: fault-tolerance, self-healing, light runtime footprint, high degree of parallelism and actor domain-driven design. They also mention the next improvements in their pipelines, which are integrating akka streams and implement adapative execution bandwidth control.