Transcript
Samuels: Imagine putting on your headphones and listening to your favorite song, or discovering a new artist whose songs you just can't live without, or hearing an album that just dropped by an artist you already know and love and adding those songs into your rotation. These are the kinds of listening experiences that we at Spotify want to connect our users to. In this talk, we'll discuss how we use technology to get there. I'm Emily, I'm a staff engineer at Spotify. I've been here for over five years now, working on music recommendations.
Muppalla: I'm Anil, I'm a data engineer at Spotify, I've been there for about two and a half years. I also work in recommendations.
Samuels: At Spotify, we're a music streaming service and our mission is to unlock the potential of human creativity by giving a million creative artists the opportunity to live off of their art and billions of fans the opportunity to enjoy and be inspired by it. At Spotify, we have 96 million subscribers, 207 million monthly active users, we've paid out over €10 billion to rights holders. There are over 40 million songs on our platform, over 3 billion playlists on our service, and we're available in 79 markets.
You can see that we're dealing with a large scale here, and specifically, at Spotify, Anil and I work on the home tab. For those of you who aren't familiar with the home tab, it looks something like this. There are a lot of different ways that you can discover and listen to music on the home tab. There are things like your heavy rotation, which could be playlists, albums, and artists that you've been listening to a lot in the last month. There's recently played, which is just the last few things that you've played on Spotify. For this user, we've also seen that they're a fan of Linkin Park, so we wanted to give them more ways to listen to that artist; we have a playlist featuring Linkin Park and a Linkin Park radio station. There are also album picks, which are albums that we think this user might like. So there are a lot of different ways that you can listen to music on the home tab.
You're going to hear some vocabulary in this talk and I just wanted to explain it to you all before we get into it. You might hear us say the word "card" in reference to playlists or albums or podcasts. It's just any kind of individual playable item on Spotify and multiple cards can make up a shelf. In this case, the shelf is your heavy rotation and the cards are just the items that this person has listened to a lot recently in the last month.
In this talk, we're going to go through how we create those recommendations. Back in 2016, we started with a batch architecture. Then in 2017, we moved over to a services architecture to hide the complexity in that batch architecture, which allowed us to be more reactive. Then from 2018 to today, we leveraged our move to the Google Cloud Platform and added streaming pipelines to our architecture to build a product based on user activity.
Batch
Let me paint you a picture of Spotify back in 2016. We were running a lot of Hadoop jobs back then. We had a big Hadoop cluster, one of the largest in Europe at the time, and we were managing our services and our databases in-house, so we were running a lot of things on-premise. In this batch architecture, we started off with some inputs: the songs played logs, which are just all the songs that users are listening to on our platform, and we also used Word2Vec to figure out what songs are similar to other songs.
Many of you may already be familiar with Word2Vec, but I'll just give you a short crash course. Word2Vec is a machine learning model that allows you to find vector representations of words. Given a corpus of text, a bunch of documents, you can see which words are similar to each other based on how often they co-occur across those different documents. How does that apply to Spotify? How do we use Word2Vec? Well, like I said, we have over three billion playlists on our platform. We use those playlists as the input into Word2Vec: we treat the tracks as if they are the words, and the playlists as if they are the documents. We say that tracks that occur together across different playlists are going to be similar to each other, and they'll be closer to each other in this vector space. You can see that 2Pac could be in one section of the vector space and Mozart could be in another section. Bach would be closer to Mozart than to 2Pac, because they are more similar.
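To make that concrete, here is a minimal sketch of the idea using Spark MLlib's Word2Vec as a stand-in implementation; the talk doesn't name the library Spotify used, and the vector size, minimum count, and track URI below are illustrative assumptions.

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD

// Each playlist is treated as a "document" and its track URIs as the "words",
// so tracks that co-occur across many playlists end up close in vector space.
def trainTrackVectors(playlists: RDD[Seq[String]]): Word2VecModel = {
  val word2vec = new Word2Vec()
    .setVectorSize(100) // embedding dimensionality (assumed value)
    .setMinCount(5)     // drop tracks that appear in very few playlists (assumed value)
  word2vec.fit(playlists)
}

// Usage: nearest tracks to a given track in the learned vector space.
// The URI below is a placeholder, not a real Spotify track ID.
// val similar = trainTrackVectors(playlists).findSynonyms("spotify:track:example-bach", 10)
```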
Now, we have all the songs that users have played and we know what songs are similar to each other. We are able to apply this not just to songs, but also to albums, playlists, and artists; we can find artists and playlists that are similar to songs. We had Hadoop jobs that took this as input, and for a specific user, we'd look at what songs they played and then use Word2Vec to figure out what playlists are similar to those songs. From there, we could create a shelf of playlists that we think a user might like to listen to. Once we had that, we wrote it out into Cassandra, our data store.
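A rough sketch of that per-user shelf-building step is below. The real job was a Scalding pipeline over the full songs played logs; this plain-Scala version only illustrates the logic for a single user, and the types, function names, and ranking heuristic are assumptions.

```scala
// Hypothetical data model: a "card" is a playable item, a "shelf" groups cards.
case class Card(uri: String)
case class Shelf(title: String, cards: Seq[Card])

// songsPlayed: track URIs from the user's songs played log.
// similarPlaylists: trackUri -> playlist URIs nearby in the Word2Vec vector space,
// assumed to be precomputed by the Word2Vec batch job.
def playlistShelfFor(
    songsPlayed: Seq[String],
    similarPlaylists: String => Seq[String],
    maxCards: Int = 10): Shelf = {
  val ranked = songsPlayed
    .flatMap(similarPlaylists)                 // candidate playlists for this user
    .groupBy(identity)
    .toSeq
    .map { case (uri, hits) => (uri, hits.size) }
    .sortBy { case (_, count) => -count }      // rank by how often a playlist recurs
    .take(maxCards)
  Shelf("Playlists for you", ranked.map { case (uri, _) => Card(uri) })
}
```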
So when a user opens up the home tab and wants to look at music recommendations, there's a content management service that knows the different types of shelves we want to load on home, and it knows the services it needs to call to get the content. It will call a service to fetch that shelf for home, which goes to Cassandra, gets the content, and returns it back up to the client. For this whole talk, we're just talking about mobile clients.
Pros & Cons of the Batch Approach
What are the pros and cons of this approach? What are the advantages and disadvantages? Well, one advantage is that there's low latency to load home. It's pretty fast to go from the client to Cassandra and back, so home can be pretty snappy when it loads. You can also fall back to old data if you fail to generate recommendations: if for some reason our Hadoop jobs fail, the cluster is messed up, we can't run our jobs, or they're taking too long, we can always fall back to an older recommendation and just give the user something that's a little bit stale rather than nothing at all.
What are some of the drawbacks to this approach? For our users, the recommendations are updated only once every 24 hours, so we're not able to react to what users are doing during the day in our app. We have to wait until they finish listening to their songs for the day and then process all of that data. Because of how Hadoop jobs work and the amount of data we're processing, we can only react once every 24 hours with a new music recommendation.
We also calculate recommendations for every user that's listened to a song on Spotify, not just home users. There are a lot of ways to listen to music on Spotify: you can listen through the home tab, but you can also listen through your library, through search, through browse, through radio. There are a lot of other ways to get music on our service, and because of that, when we're processing all of those songs played in that log, we're processing more data than we need to, and we're creating recommendations for users who may never even come to the home tab.
Experimentation in the system can be difficult. Let's say you have a new idea for a shelf, a new way you want to make a recommendation to a user; there's a lot in the system that you need to know about to be able to get to an A/B test. You have to know how to write a Hadoop job, and at that time we were running Scalding, a framework on top of MapReduce that Twitter wrote, so you have to know how to write a Scalding job in this case. You need to know how to process those songs played logs, you need to know how Word2Vec works and how to use that data, you need to know how to ingest data into Cassandra, and you also need to know how to stand up a service to fetch that data out of Cassandra. There's a lot of institutional knowledge that you need to have just to try out a new hypothesis in an A/B test.
There's also a lot of operational overhead needed to maintain Cassandra and Hadoop. At that time we were running our own Hadoop cluster, and we had a team whose job it was just to make sure that that thing was running, that it was getting the latest updates, that we had enough nodes, and that the resources were being shared fairly. When we first started out, it was kind of a free-for-all in terms of using the Hadoop cluster. If you had a resource-intensive job, you could make it so other people's jobs wouldn't be able to run; they would have to wait until those resources freed up.
We tried a couple of strategies to mitigate this. We worked with resource pools: we had a resource pool for production jobs versus development jobs, so that trying out development jobs wouldn't impact your ability to run production jobs. Even that wasn't enough, so we added resource pools per team, so that one team's jobs couldn't impact another team's jobs.
It was a lot of work just to maintain the system. We were also maintaining our own Cassandra clusters, and Cassandra has a lot of knobs and things that you can tune to make it fit your use case. There was a lot of tweaking that we had to do to make sure that it worked for us. Sometimes Cassandra would run a lot of compactions, and that would impact our ability to serve production traffic. That was something else we had to spend a lot of time on to make sure it worked correctly.
We knew that this architecture was not going to be the best one for us. We knew that we could do better and improve upon it. Anil [Muppalla] is going to talk about how we moved to a more services-based architecture in 2017.
Services
Muppalla: We started to adopt services in 2017. This was at the time when Spotify was investing in moving to GCP. We had moved our back-end infrastructure to GCP as well, and we also realized that we wanted to be reactive to users and give them the best possible recommendations as soon as they wanted them. What did we do? We replaced the songs played data set and the Word2Vec data set with corresponding services. The songs played service would return a bunch of songs for the user in real time, and the Word2Vec service would give the latest recommendations for a bunch of songs for that user.
In this system, when a user loads up the homepage, the Spotify client makes a request to the content management system. The CMS is where we define which content the user should see, how we should render that content, and where the content lives. The CMS would then talk to our create shelf for home service. This is the service that fetches the songs played for that user from the songs played service, gets a bunch of playlist recommendations based on those songs, packages those playlists as cards into a shelf, and returns that to the user, and the user sees it instantly on the homepage.
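Here is a minimal sketch of that request-time flow. The service interfaces, method names, and shelf title are assumptions for illustration; the actual Spotify back-end services are not public.

```scala
import scala.concurrent.{ExecutionContext, Future}

case class Card(uri: String)
case class Shelf(title: String, cards: Seq[Card])

// Hypothetical interfaces standing in for the real back-end services.
trait SongsPlayedService { def recentSongs(userId: String): Future[Seq[String]] }
trait Word2VecService    { def similarPlaylists(trackUris: Seq[String]): Future[Seq[String]] }

class CreateShelfForHome(songs: SongsPlayedService, word2vec: Word2VecService)
                        (implicit ec: ExecutionContext) {
  // Called at request time, every time a user loads the home tab.
  def playlistShelf(userId: String, maxCards: Int = 10): Future[Shelf] =
    for {
      played    <- songs.recentSongs(userId)          // what the user has been listening to
      playlists <- word2vec.similarPlaylists(played)  // playlists near those songs in vector space
    } yield Shelf("Playlists for you", playlists.distinct.take(maxCards).map(Card(_)))
}
```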
In this architecture, we made sure that it's easy to write shelves. In this system, writing a shelf was as simple as writing a back-end service, so the more back-end services you wrote, the more shelves there were, and they would serve requests as they came in.
Pros & Cons of the Services Approach
What are some of the pros and cons of this approach? We were updating recommendations at request time. Every time a user loads the homepage, we are either creating new recommendations or updating existing ones, so the user always sees something relevant; it's more reactive. We are now calculating these recommendations only for the users that are using the home page, so we have reduced the number of calculations we are doing compared to what we used to do in the batch system, so we saw an improvement there.
The stack is further simplified because now the complexity of the Word2Vec model is hidden behind a service. All you need to know is how to interact with the service, and it's much easier to manage the system because it's all just services. It's easier to experiment: any developer can come in, think up a content hypothesis, just write a service, and go from there. Since we moved our back-end infrastructure to Google, we saw decreased overhead in managing our systems, because we moved away from our on-premise data centers.
What are some of the cons of this? We saw that as we added more and more content, more and more recommendations for users, it would take longer to load home, because we are computing these recommendations at request time. We also saw that since we don't store these recommendations anywhere, if for some reason the request failed, the user would just see nothing on the homepage, which is a very bad experience.
Streaming ++ Services
In 2018, Spotify was investing heavily in moving the data stack to Google Cloud as well. Today, we're using a combination of streaming pipelines and services to compute the recommendations you see on home. What's a streaming pipeline? At Spotify, we write Google Dataflow pipelines, for both batch and streaming use cases. We use Spotify Scio, a Scala wrapper on Apache Beam, to write these pipelines. A streaming pipeline processes real-time data, and real-time data is an unbounded stream of user events.
At Spotify, all user events are available to us as Google Pub/Sub topics; every interaction you make in the app is available to us in real time as Pub/Sub topics. In a streaming pipeline, you can perform aggregations on this data by collecting it into time-based windows. You can apply operations like groupBy, countBy, and join, and once you have the results, you can store them in other Google stores like Pub/Sub, BigQuery, Google Cloud Storage, and Bigtable.
For the Spotify home use case, we care about three signals: songs played, follows, and hearts. A songs played signal is fired every time a user completes listening to a song, a follow signal is fired when a user follows an artist or a playlist, and a heart signal is fired every time a user hearts a track. All these events are available to us as individual Pub/Sub topics. We take these Pub/Sub topics and write a streaming pipeline that consumes the messages from them by making what we call subscriptions, so as the events come in, the streaming pipeline processes them instantly.
What do we do in the streaming pipeline? We perform aggregations on the messages we get to determine how often we need to either create a new recommendation or update an existing recommendation for that user. Once we've decided the cadence at which we should do it, we publish a message to another Pub/Sub topic saying, "Hey, this user is ready for a new recommendation."
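A minimal Scio sketch of that trigger pipeline is below. The subscription and topic names, window size, event payload (assumed to be just a user id), and the threshold for "ready for a new recommendation" are all illustrative assumptions, and the Pub/Sub read/write methods shown are from the older (pre-0.10) Scio API; exact method names vary by Scio version.

```scala
import com.spotify.scio._
import com.spotify.scio.values.SCollection
import org.joda.time.Duration

object RecommendationTriggerPipeline {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    // Consume the three signals via Pub/Sub subscriptions; each message here
    // is assumed to carry just the user id of the event.
    val signals: SCollection[String] =
      sc.pubsubSubscription[String]("projects/my-project/subscriptions/songs-played")
        .union(sc.pubsubSubscription[String]("projects/my-project/subscriptions/follows"))
        .union(sc.pubsubSubscription[String]("projects/my-project/subscriptions/hearts"))

    signals
      .withFixedWindows(Duration.standardMinutes(5))    // time-based window
      .countByValue                                     // events per user within the window
      .collect { case (userId, n) if n >= 3 => userId } // cadence threshold (assumed)
      .saveAsPubsub("projects/my-project/topics/create-recommendation")

    sc.run()
  }
}
```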
We then have this create shelf service; it's a huge monolith service that has shelves and content hypotheses written as functions inside it. What this service does is consume from that "create recommendation for this user" topic and start creating recommendations for that user. Each recommendation is nothing but a function here, so we've reduced the complexity even further. For example, we take what we did in the services system: every time an event comes in for a user, we fetch the songs for that user, get playlist recommendations based on those songs, neatly package them into our playlist recommendations shelf, and then write it to Google Bigtable. Google Bigtable is a highly scalable key-value store which is very similar to Cassandra.
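A sketch of the "a shelf is just a function" idea, including the Bigtable write, is below. The table name, row-key scheme, column family, and the stubbed-out recommendation logic are assumptions; the write uses the google-cloud-bigtable data client.

```scala
import com.google.cloud.bigtable.data.v2.BigtableDataClient
import com.google.cloud.bigtable.data.v2.models.RowMutation

case class Card(uri: String)
case class Shelf(id: String, title: String, cards: Seq[Card])

class CreateShelfService(bigtable: BigtableDataClient) {
  // Every content hypothesis is just a function from a user id to a shelf;
  // adding a new shelf means adding another entry to this map.
  val shelves: Map[String, String => Shelf] = Map(
    "playlist-recs" -> (userId => Shelf("playlist-recs", "Playlists for you", playlistCards(userId)))
  )

  // Placeholder: the real function calls the songs played and Word2Vec services.
  private def playlistCards(userId: String): Seq[Card] = Seq.empty

  // Invoked for each "create recommendation for this user" message.
  def handle(userId: String): Unit =
    shelves.values.foreach { makeShelf =>
      val shelf = makeShelf(userId)
      val mutation = RowMutation
        .create("home-shelves", s"$userId#${shelf.id}")                  // table and row key are assumed
        .setCell("shelf", "cards", shelf.cards.map(_.uri).mkString(",")) // family/qualifier are assumed
      bigtable.mutateRow(mutation)
    }
}
```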
In this architecture, when a user loads the homepage, the client makes a request to the CMS, and the CMS quickly fetches the content from Bigtable and shows it on home. We've reduced the complexity of writing new shelves in the system: you just add as many functions as you want to that create shelf service.
Pros & Cons of the Streaming ++ Services Approach
What did we learn? What are the pros and cons of this approach? We are now updating recommendations based on user events. We are listening to the songs you have played, the artists you have followed, and the tracks you have hearted, and we make decisions based on that. We've separated out the computation of recommendations from the serving of those recommendations in this system. Since we are sensitive to what the user is doing in the app, we are sure that we are giving the user fresher content. We are able to fall back to older recommendations if for some reason we are unable to compute new ones: if the streaming pipeline is having an incident or the services are down, the app is still responsive and you're still getting recommendations. It's easy to experiment now, because since we moved from services to this approach, all a developer has to do is write a function for any content hypothesis they want, and they're ready to go.
What are some of the cons? Since we added streaming pipelines into this ecosystem, the stack has become a bit more complex. You need to know how the streaming ecosystem works and have an awareness of how to deal with the services when there is any kind of issue. There's more tuning that needs to be done in the system: when you process real-time events, you need to be aware of event spikes. I can tell you about an incident that happened where, for some reason, we didn't see any events coming from Google Pub/Sub; when that incident was resolved, we saw a huge spike of millions of events come in. We hadn't accounted for this, and that basically bombarded all the downstream services, so we had to reduce the capacity so that we consumed those messages slowly, and we had to manage that. It's very important to consider this case when you're building and using streaming pipelines, and very important to have guardrails so you protect your ecosystem.
Debugging is more complicated. If there is an incident on your side, you have to know whether it's the streaming pipeline, your service, the logic, or Bigtable having an issue. There are just more pieces to consider when you're debugging issues.
Lessons Learned and Takeaways
What did we learn through this evolution? From batch, we learned that because we store the recommendations in Cassandra, it's easy to fall back to old ones, and since we're just fetching from Cassandra, the latency to load home is really low. But the updates are slow: because of the way we ran our batch jobs, it takes longer to update recommendations. With services, the updates were fast; we were responding to users as quickly as they were interacting with the app. But as we added more and more recommendations, home got slow, and since we didn't store any recommendations, there was no fallback.
In the streaming and services combination, the updates were fast and frequent because we were listening to user events as they were happening. Since we stored the recommendations in Bigtable, we were able to load home with low latency, and we were also able to fall back to old recommendations in case of incidents. One caveat is that we have to manage the balance between computing recommendations as quickly as we want and the downstream load that we put on the other systems that we depend on. Through the evolution of these three years, we managed to pick the best parts of each architecture and still be fast and relevant today.
Takeaways
What are some of the key takeaways? We've seen that since we moved to managed infrastructure, we could focus more on the product, we could iterate faster on our content ideas, and we spent less time managing our infrastructure. If you care about timeliness, if you want to react to events that are happening in the moment, I would suggest using streaming pipelines, but please be aware of event spikes; they can really harm your ecosystem.
Through the evolution of the home architecture, we've optimized for developer productivity and ease of experimentation. We moved from batch pipelines to services, where it was easy to just write a service. Today, writing a shelf is as easy as writing a function, and the rest is already taken care of.
Questions & Answers
Participant 1: In the new architecture with services and streams, do you have versioning in Bigtable? How do you support the fallback to older versions?
Muppalla: We make sure that we have at least one version for each shelf. Bigtable has a garbage collection policy that you can set based on either time or the number of versions it has to keep for each shelf. That way we make sure that there is at least one recommendation for that specific shelf for that user.
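For illustration, a garbage collection policy like that can be expressed with the Bigtable admin client roughly as follows; the table name, column family, and version count are assumptions, not Spotify's actual configuration.

```scala
import com.google.cloud.bigtable.admin.v2.BigtableTableAdminClient
import com.google.cloud.bigtable.admin.v2.models.GCRules.GCRULES
import com.google.cloud.bigtable.admin.v2.models.ModifyColumnFamiliesRequest

object ShelfGcPolicy {
  // Keep only the two most recent versions of each shelf cell, so there is
  // always at least one older recommendation left to fall back to.
  // (A time-based rule such as GCRULES.maxAge(...) could be used instead.)
  def apply(admin: BigtableTableAdminClient): Unit =
    admin.modifyFamilies(
      ModifyColumnFamiliesRequest
        .of("home-shelves")                            // assumed table name
        .updateFamily("shelf", GCRULES.maxVersions(2)) // assumed column family
    )
}
```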
Participant 2: Very interesting talk, by the way. This is a question more related to how you categorize songs rather than recommendations. I'm wondering about radio edits, for example: it's often the case that the metadata can't be used to tell a radio edit apart from the original version of the same song. Do you have any advanced ways of figuring out whether something is a radio edit when you categorize it, so that you wouldn't recommend something to a user as part of a playlist with the same song appearing three or four times as alternative versions?
Samuels: I think I understand what you're saying. We also have a concept of track IDs, but we also have recording IDs, that's unique for each recording. That can help us make sure that we're not recommending duplicate things if we are creating our playlist based off of those recording IDs instead. I'm not sure if that's exactly what you mean.
Participant 2: Yes. I have a question about that, how do you guys do A/B testing in these architectures? Because A/B testing is quite important in terms of a recommendation system. Could you briefly elaborate how you do it?
Muppalla: We have an entire A/B testing ecosystem already in place. It's all exposed to us as back-end services, so when you write, say, a function here to test your hypothesis, every time that shelf is supposed to be curated, we make sure that this user is in that specific A/B test and curate it accordingly. When we show it, we're also sensitive to that user, so they see the right experiment.
Participant 3: I would like to know if the team set up did change with the architecture set up?
Samuels: Yes. It definitely changed since 2016.
Participant 3: No. I mean, did they change from batch to the microservices? Was it needed to change the team as well to be able to work with the new architecture?
Samuels: I don't think that we had to change the team, the teams just change at Spotify over time just because we do a lot of reorgs there. We didn't have to change the structure of our teams to be able to implement this. Our teams are usually set up where we have data engineers, ML engineers, back-end engineers all together working on them, so we have the full stack skillset.
Participant 4: The architecture looked like you guys are serving the recommendations real time. Can you talk about the training and if that's instance-based as well, or if that's batch, and how that works?
Samuels: The training for Word2Vec?
Participant 4: Yes, Word2Vec and how you do all that kind of stuff.
Samuels: Word2Vec gets trained, I believe, weekly across all of our systems. We use user vectors to figure out what to recommend for users, and those get updated daily, but the actual model itself gets trained weekly.
Participant 5: Do you guys use the lyrics of the songs or the tempo of the song to drive recommendations?
Samuels: No, not right now.
Participant 6: A related question. I think the [inaudible 00:27:32] challenge last year was Spotify, or I think it was 2017, that one was proposing a very different algorithm. In addition to Word2Vec, it had a whole bunch of things, the one that won the prize. I don't know if that's actually coming into production or not.
Samuels: I don't know anything about that.