

AI, ML & Data Engineering

Leveraging the Transformer Architecture for Music Recommendation on YouTube


Sep 06, 2024 3 min read


Google has described an approach for applying transformer models, the architecture that ignited the current generative AI boom, to music recommendation. The approach, currently being applied experimentally on YouTube, aims to build a recommender that understands sequences of user actions while listening to music, so that it can better predict user preferences based on context.

A recommender leverages the information conveyed by different user actions, such as listening to, skipping, or liking a piece, to recommend items the user is likely to be interested in.

A typical scenario where current music recommenders fail, say Google researchers, is when a user's context changes, e.g., from home listening to gym listening. This context change can shift their music preferences towards a different genre or rhythm, e.g., from relaxing to upbeat music. Taking such contextual changes into account makes the task of a recommendation system much harder, since it needs to interpret user actions in the user's current context.

This is where the transformer architecture may help, they believe, since it is especially suited to making sense of sequences of input data, as shown by NLP and, more generally, large language models (LLMs). Google researchers are confident that the transformer architecture can make sense of sequences of user actions, interpreted in the user's context, as well as it does of language.

The self-attention layers capture the relationship between words of text in a sentence, which suggests that they might be able to resolve the relationship between user actions as well. The attention layers in transformers learn attention weights between the pieces of input (tokens), which are akin to word relationships in the input sentence.
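To make the idea of attention weights over user actions concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration only, not Google's implementation: the two-dimensional vectors, the `history` list, and the `current_context` query are made-up values, and a real transformer would derive queries and keys from learned projections rather than use raw vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical embeddings for three past actions (values are invented).
history = [
    [1.0, 0.0],  # listened to a relaxing track
    [0.0, 1.0],  # listened to an upbeat track
    [0.9, 0.1],  # liked a relaxing track
]
# Hypothetical query vector standing in for the current (gym) context.
current_context = [0.0, 2.0]

weights = attention_weights(current_context, history)
most_relevant = max(range(len(weights)), key=lambda i: weights[i])
```

The weights sum to one, and with these invented values the upbeat-track action receives the largest share, which is the kind of context-dependent emphasis the researchers describe.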

Google researchers aim to adapt the transformer architecture from generative models to understanding sequential user actions based on the current user context. This understanding is then blended with personalized ranking models to produce a recommendation. To explain how user actions may have different meanings depending on the context, the researchers describe a user at the gym who might prefer more upbeat music. At home, the same user would normally skip that kind of music, so those skips should get a lower attention weight when the user is at the gym. In other words, the recommender applies different attention weights in the current user context than it would across the user's global listening history.

We still utilize their previous music listening, while recommending upbeat music that is close to their usual music listening. In effect, we are learning which previous actions are relevant in the current task of ranking music, and which actions are irrelevant.

In short, Google's transformer-based recommender follows the typical structure of a recommendation system and comprises three phases: retrieving items from a corpus or library, ranking them based on user actions, and filtering them to show a reduced selection to the user. While ranking items, the system combines a transformer with an existing ranking model.
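The three phases described above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: the `corpus`, the `score` values, and the filtering rule (drop already-played tracks, keep the top few) are hypothetical and are not taken from Google's system.

```python
def retrieve(corpus, candidate_ids):
    """Phase 1: pull candidate tracks from the corpus by id."""
    return [corpus[i] for i in candidate_ids if i in corpus]

def rank(candidates, score_fn):
    """Phase 2: order candidates by a relevance score, highest first."""
    return sorted(candidates, key=score_fn, reverse=True)

def filter_top(ranked, already_played, limit):
    """Phase 3 (assumed rule): drop already-played tracks, keep the top few."""
    fresh = [t for t in ranked if t["id"] not in already_played]
    return fresh[:limit]

# Hypothetical corpus; the scores stand in for a ranking model's output.
corpus = {
    "a": {"id": "a", "score": 0.2},
    "b": {"id": "b", "score": 0.9},
    "c": {"id": "c", "score": 0.5},
    "d": {"id": "d", "score": 0.7},
}

candidates = retrieve(corpus, ["a", "b", "c", "d", "missing"])
ranked = rank(candidates, score_fn=lambda t: t["score"])
shown = filter_top(ranked, already_played={"b"}, limit=2)
shown_ids = [t["id"] for t in shown]
```

In the system the article describes, the transformer would contribute to the scoring used in the ranking phase; here a fixed `score` field takes its place.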

Each track is associated with a vector called a track embedding, which is used by both the transformer and the ranking model. Signals associated with user actions and track metadata are projected onto a vector of the same length, so they can be manipulated just like track embeddings. For example, when providing inputs to the transformer, the user-action embedding and the music-track embedding are simply added together to generate a token. Finally, the output of the transformer is combined with that of the ranking model using a multi-layer neural network.
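The following sketch illustrates the two steps just described: building a token by adding a projected user-action signal to a track embedding, and combining the transformer output with the ranking model output through a small multi-layer network. All dimensions, weights, and names (`project`, `make_token`, `combine`) are hypothetical; in a real system the weights would be learned, and the article does not specify layer sizes or how the two outputs are joined (concatenation is assumed here).

```python
def project(signal, weights):
    """Linear projection of a raw signal onto the embedding size."""
    return [sum(w * s for w, s in zip(row, signal)) for row in weights]

def make_token(track_embedding, action_embedding):
    """A token is the element-wise sum of the two same-length embeddings."""
    return [t + a for t, a in zip(track_embedding, action_embedding)]

def dense(x, weights, bias):
    """One fully connected layer without activation."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(values):
    return [max(0.0, v) for v in values]

def combine(transformer_out, ranker_out, hidden_w, hidden_b, out_w, out_b):
    """Concatenate both outputs (an assumption) and apply a two-layer network."""
    hidden = relu(dense(transformer_out + ranker_out, hidden_w, hidden_b))
    return dense(hidden, out_w, out_b)[0]

# Token construction with invented values (embedding size 3).
track_embedding = [0.2, -0.1, 0.5]
action_signal = [1.0, 0.0]  # hypothetical one-hot signal: [skip, listen]
projection = [[0.5, -0.5], [0.1, 0.3], [0.0, 0.2]]
action_embedding = project(action_signal, projection)
token = make_token(track_embedding, action_embedding)

# Combining stage with invented outputs and hand-picked weights.
score = combine(
    transformer_out=[1.0, -1.0],
    ranker_out=[0.5],
    hidden_w=[[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]],
    hidden_b=[0.0, 0.0],
    out_w=[[2.0, 1.0]],
    out_b=[0.1],
)
```

Because the action signal is projected to the same length as the track embedding, the two can be summed directly, which is what allows them to be "manipulated just like track embeddings".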

According to Google's researchers, initial experiments show that the recommender improves, as measured by a reduction in skip rate and an increase in the time users spend listening to music.

About the Author

Sergio De Simone

Sergio De Simone is a software engineer. He has been working as a software engineer for over twenty-five years across a range of projects and companies, in work environments as different as Siemens, HP, and small startups. For the last 10+ years, his focus has been on development for mobile platforms and related technologies. He is currently working for BigML, Inc., where he leads iOS and macOS development.
