While using a Large Language Model chatbot opens the door to innovative solutions, Spotify engineer Ates Goral argues that crafting a user experience that feels as natural as possible requires specific efforts to prevent rendering jank and to reduce latency.
Streaming a Markdown response returned by the LLM leads to rendering jank because special Markdown characters, such as *, remain ambiguous until the full expression has been received, e.g., until the closing * arrives. For example, when the stream has only delivered "This is *import", the renderer cannot yet tell whether the asterisk opens an emphasized span or is a literal character. The same problem applies to links and all other Markdown operators. This implies that Markdown expressions cannot be rendered correctly until they are complete, which means that for a short period of time the rendered output is wrong.
To solve this problem, Spotify uses a buffering parser that stops emitting characters as soon as it encounters a Markdown special character and waits until either the full Markdown expression is complete or an unexpected character rules it out.
Doing this while streaming requires a stateful stream processor that can consume characters one by one. The stream processor either passes characters through as they arrive or buffers them when it encounters Markdown-like character sequences.
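As a rough illustration of this approach, the sketch below implements a minimal stateful processor that handles only single-asterisk emphasis and ignores bold, links, and the rest of the Markdown syntax. It is an assumption-laden sketch of the general technique, not Spotify's actual implementation.

```typescript
// A minimal sketch (not Spotify's code) of a stateful, character-by-character
// processor: it passes plain text through immediately, but holds back output
// once it sees a Markdown-like special character, flushing the buffer either
// when the expression completes or when a character rules the expression out.
type Emit = (text: string) => void;

class BufferingMarkdownStreamer {
  private buffer = "";

  constructor(private emit: Emit) {}

  // Consume one character from the LLM stream.
  write(char: string): void {
    if (this.buffer === "") {
      if (char === "*") {
        // Possible start of emphasis: hold it back until we know more.
        this.buffer = char;
      } else {
        this.emit(char);
      }
      return;
    }

    this.buffer += char;
    if (char === "*") {
      // Closing delimiter seen: the expression is complete, emit it whole.
      this.emit(this.buffer);
      this.buffer = "";
    } else if (char === "\n") {
      // Emphasis cannot span a line break in this sketch: give up and flush.
      this.emit(this.buffer);
      this.buffer = "";
    }
  }

  // Flush whatever is left when the stream ends.
  end(): void {
    if (this.buffer !== "") {
      this.emit(this.buffer);
      this.buffer = "";
    }
  }
}

// Usage: characters before the first "*" pass straight through; the
// emphasized span is only emitted once the closing "*" arrives.
const streamer = new BufferingMarkdownStreamer((text) => process.stdout.write(text));
for (const char of "This is *important* news.\n") {
  streamer.write(char);
}
streamer.end();
```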
While this solution is, in principle, relatively easy to implement manually, supporting the full Markdown specification requires using an off-the-shelf parser, says Goral.
Latency, on the other hand, is mostly the result of the multiple LLM roundtrips needed to consult external data sources that extend the LLM's initial response.
LLMs have a good grasp of general human language and culture, but they’re not a great source of up-to-date, accurate information. We therefore tell LLMs to tell us when they need information beyond their grasp through the use of tools.
In other words, based on user input, the initial response provided by the LLM also indicates which other services to consult to get the missing information. When those additional pieces of data are received, the LLM assembles the full response, which is finally displayed to the user.
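As a rough illustration of such a roundtrip, the sketch below shows an initial LLM answer that names the external lookups it still needs; the client runs those lookups so the full response can then be assembled. The response shape and function names are hypothetical, not Sidekick's actual interface.

```typescript
// Hypothetical shape of an initial LLM answer that declares which external
// lookups are still needed before the response can be completed.
interface InitialAnswer {
  text: string;                 // what the LLM can say right away
  toolRequests: Array<{         // data it asked the client to fetch
    name: string;               // e.g. "playlistLookup" (hypothetical)
    args: Record<string, string>;
  }>;
}

// Hypothetical client-side function backing each tool request.
// In a real system this would call a backend service; here it is a stub.
async function runTool(name: string, args: Record<string, string>): Promise<string> {
  return `result of ${name}(${JSON.stringify(args)})`;
}

// One roundtrip: run every requested lookup, returning the results that the
// LLM (or the rendering layer) needs to assemble the full response.
async function completeAnswer(answer: InitialAnswer): Promise<string[]> {
  return Promise.all(
    answer.toolRequests.map((req) => runTool(req.name, req.args))
  );
}
```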
To prevent the user from having to wait until all external services have responded, Sidekick uses the concept of "cards", placeholders that stand in for the pending data. Sidekick renders the initial response received from the LLM, including any placeholders, and replaces each placeholder with the corresponding information once the additional requests complete.
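The sketch below illustrates the general idea behind such cards: the answer skeleton is rendered immediately, and each placeholder is swapped for real content whenever its asynchronous lookup resolves. All names are hypothetical and stand in for whatever rendering layer Sidekick actually uses.

```typescript
// Placeholder ("card") rendering sketch: show the full answer skeleton now,
// fill in each card as soon as its asynchronous lookup resolves.
type CardId = string;

interface RenderTarget {
  renderText(text: string): void;
  renderCardPlaceholder(id: CardId): void;
  replaceCard(id: CardId, content: string): void;
}

async function renderResponse(
  initialAnswer: Array<{ text?: string; card?: { id: CardId; fetch: () => Promise<string> } }>,
  target: RenderTarget
): Promise<void> {
  const pending: Promise<void>[] = [];

  for (const part of initialAnswer) {
    if (part.text !== undefined) {
      // Plain text from the LLM is shown immediately.
      target.renderText(part.text);
    } else if (part.card !== undefined) {
      // Show a placeholder now; fill it in whenever the lookup completes.
      const { id, fetch } = part.card;
      target.renderCardPlaceholder(id);
      pending.push(fetch().then((content) => target.replaceCard(id, content)));
    }
  }

  // The user already sees the full answer skeleton; this only waits for the
  // remaining cards to be filled in.
  await Promise.all(pending);
}
```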
The solution implemented in Sidekick fully exploits the asynchronicity inherent in this workflow and integrates the response demultiplexing step with the Markdown buffering parser. If you are interested in the full details of their solution, do not miss Goral's original article.