Spotify clients generate up to 1.5 million events per second at peak hours, and all of them are handled by the company's Event Delivery System, which is designed to have predictable latency and to never lose an event. Igor Maravic noted this in his presentation at the recent QCon London conference, where he gave a high-level overview of the system and some of its key operational aspects.
Over 250 unique event types are generated by the different clients, ranging in size from a few bytes to a few kB. Some events have a strict no-loss requirement, for example those used for royalty calculations, but to simplify the system it is designed to deliver 100% of all events irrespective of individual requirements. All events are stored in hourly buckets, each containing all events for a specific date and hour. Each event is stamped with the time it was received, thereby guaranteeing that it is stored in the right bucket.
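As a rough illustration of the hourly bucketing idea, the sketch below groups events by the hour in which they were received. It is not Spotify's implementation: the function names `bucket_key` and `assign_to_buckets`, the `YYYY-MM-DDTHH` label format, and the use of UTC are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_key(received_at: datetime) -> str:
    """Return an hourly bucket label (UTC date and hour) for a receipt time.

    Hypothetical helper: expects a timezone-aware datetime.
    """
    if received_at.tzinfo is None:
        raise ValueError("received_at must be timezone-aware")
    return received_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")


def assign_to_buckets(events):
    """Group (received_at, payload) pairs into hourly buckets.

    The bucket is chosen from the receipt timestamp only, so an event
    always lands in the bucket for the hour in which it was received.
    """
    buckets = defaultdict(list)
    for received_at, payload in events:
        buckets[bucket_key(received_at)].append(payload)
    return dict(buckets)


if __name__ == "__main__":
    sample = [
        (datetime(2016, 3, 1, 13, 59, 59, tzinfo=timezone.utc), "play"),
        (datetime(2016, 3, 1, 14, 0, 0, tzinfo=timezone.utc), "pause"),
    ]
    print(assign_to_buckets(sample))
```

Because the key depends only on the receipt time, two events received one second apart on either side of the hour end up in different buckets, regardless of anything the client reports.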
Maravic, Software Engineer at Spotify, emphasizes that designing for guaranteed delivery of all events is not enough; monitoring is essential to finding out whether the design requirements are actually met. Their Event Delivery System is a complex distributed system with many microservices working together. Each component is monitored to show which parts may need optimizing, to simplify finding the actual problem when incidents occur, and to find problems in data delivery. They have identified three types of monitoring:
- System monitoring for the general health of the system, CPU and memory usage, etc.
- Data monitoring for checking the timeliness of the data. This enables them to ensure that data is delivered according to the latency requirements.
- Data loss monitoring for completeness in event delivery. For this they have built a tool that monitors all the inputs and all the outputs, making it possible to find data loss or other delivery problems.
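The last two kinds of monitoring can be pictured with a small sketch. The article does not describe how Spotify's tool works internally, so everything here is an assumption for illustration: events are taken to carry unique identifiers, and the helper names `find_delivery_gaps` and `is_late` are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def find_delivery_gaps(received_ids, delivered_ids):
    """Compare event IDs seen at the input with those seen at the output.

    Returns events that never arrived ("missing") and events that were
    delivered more often than they were received ("unexpected").
    """
    received = Counter(received_ids)
    delivered = Counter(delivered_ids)
    # Counter subtraction keeps only positive counts.
    return {
        "missing": dict(received - delivered),
        "unexpected": dict(delivered - received),
    }


def is_late(bucket_closed_at, delivered_at, max_latency):
    """Return True if a bucket was delivered later than the allowed latency."""
    return delivered_at - bucket_closed_at > max_latency


if __name__ == "__main__":
    print(find_delivery_gaps(["a", "b", "c"], ["a", "c", "c"]))
    closed = datetime(2016, 3, 1, 14, 0, tzinfo=timezone.utc)
    delivered = datetime(2016, 3, 1, 14, 45, tzinfo=timezone.utc)
    print(is_late(closed, delivered, timedelta(minutes=30)))
```

Comparing inputs with outputs in both directions flags duplicates as well as losses, which is one way to read "other delivery problems" in the description above.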
Maravic notes that although their systems must run 24/7, they don't have an operations team. Instead, the developers building services also have operational responsibility, which he believes is a good thing that pushes good developers to become great developers.
Maravic has also written a series of blog posts with more detailed information about the architecture including some performance figures.