At Amazon's re:MARS AI conference in Las Vegas, Alexa vice president Rohit Prasad demonstrated a new conversational model for the Alexa smart assistant. In this new model, Alexa can seamlessly transition between skills and remember the context of the conversation to resolve ambiguous references.
Alexa users are no doubt familiar with the concept of skills, which are the building blocks of Alexa's functionality. Users choose a skill by asking Alexa to "open" or "launch" it. Once a skill is active, Alexa's conversational abilities are constrained by the requirements of the skill. Like many other chatbots, Alexa uses an intent-classification and slot-filling model. An example of an intent/slot model might be a movie-ticket purchasing skill, which would have the intent PurchaseTickets with slots theaterCity, theaterName, movieTitle, showTime, and numberOfTickets.
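As a rough illustration, the output of such a model can be pictured as an intent plus a partially filled set of slots. The structure below is a hypothetical sketch reusing the PurchaseTickets example above, not the actual Alexa Skills Kit representation:

```python
# Hypothetical sketch of an intent/slot frame after classifying the
# utterance "Buy two tickets for the 7 pm showing". The intent name and
# slot keys mirror the article's example; the data layout is invented.

def missing_slots(frame):
    """Return the slot names Alexa would still need to elicit from the user."""
    return [name for name, value in frame["slots"].items() if value is None]

frame = {
    "intent": "PurchaseTickets",
    "slots": {
        "theaterCity": None,
        "theaterName": None,
        "movieTitle": None,
        "showTime": "19:00",
        "numberOfTickets": 2,
    },
}

print(missing_slots(frame))  # slots the skill would ask about next
```

After each user turn, the skill would re-run this check and prompt for whichever slots remain empty until the intent can be fulfilled.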
Each phrase spoken by the user is analyzed for the action the user wants Alexa to take (the intent), and the specific parameters of the action (the slots). Once an intent is identified, much of the subsequent conversation with Alexa is to fill in any empty slots for the intent. One shortcoming of this model is that Alexa does not remember any details of the conversation. In particular, items that have been identified as slot values aren't available for use by other skills a user might select later. For example, if a user purchases movie tickets and then wants to use Alexa to schedule a ride-share to the theater, it's not currently possible to simply say, "I need a ride there." Instead, the user must explicitly open a ride-share skill and fill in the destinationAddress and arrivalTime slots of that skill. The skill isn't able to infer these from the context of the theaterName and showTime slots filled for the previous skill.
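The limitation, and what cross-skill carryover would change, can be sketched as a mapping from slots filled in one skill to compatible slots in another. The mapping below is invented for illustration; the new Alexa model learns such correspondences rather than hard-coding them:

```python
# Hypothetical sketch: carrying slot values from a completed movie-ticket
# dialog into a ride-share skill's empty slots. The slot-name mapping
# here is hand-written purely for illustration.

CARRYOVER_MAP = {
    "theaterName": "destinationAddress",  # the theater name resolves to a destination
    "showTime": "arrivalTime",            # the user wants to arrive by showtime
}

def carry_over(previous_slots, next_slots):
    """Copy compatible values from the previous skill into the next skill's empty slots."""
    filled = dict(next_slots)
    for src, dst in CARRYOVER_MAP.items():
        if previous_slots.get(src) is not None and filled.get(dst) is None:
            filled[dst] = previous_slots[src]
    return filled

movie = {"theaterName": "Regal Cinemas Downtown", "showTime": "19:00"}
ride = {"destinationAddress": None, "arrivalTime": None}
print(carry_over(movie, ride))
```

With this kind of carryover in place, "I need a ride there" could be resolved without the user re-stating the destination or arrival time.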
During the re:MARS keynote, Prasad showed a video in which a user carried on an extended conversation with Alexa, purchasing movie tickets, making reservations at a restaurant, and scheduling a ride-share, all using a new dialog model in which Alexa can maintain contextual information, switch between skills, and fill in slots using the remembered context. According to the Alexa team, "with every round of dialog, the system produces a vector...that represents the context and the semantic content of the conversation."
This demo represents the culmination of several research efforts, and the inner workings were hinted at last year in a blog post and conference paper. The system creates an embedding for slots, which places semantically similar slots close together in embedding space. This allows Alexa to recognize when slots from one skill might be fillable with values from slots in another skill. There is also an LSTM neural network that determines whether to "carry over" a slot value to the next step in the dialog.
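The idea of grouping semantically similar slots can be sketched with toy embedding vectors and cosine similarity. The vectors and threshold below are made up for illustration; in the real system the embeddings are learned from data, and the carryover decision is made by the LSTM rather than a fixed cutoff:

```python
import math

# Toy 3-dimensional slot embeddings (invented values; the real system
# learns high-dimensional embeddings). Similar slots get similar vectors.
EMBEDDINGS = {
    "theaterName": [0.9, 0.1, 0.0],
    "destinationAddress": [0.8, 0.2, 0.1],
    "numberOfTickets": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def carryover_candidate(src_slot, dst_slot, threshold=0.8):
    """Flag dst_slot as plausibly fillable from src_slot when their embeddings are close."""
    return cosine(EMBEDDINGS[src_slot], EMBEDDINGS[dst_slot]) >= threshold

print(carryover_candidate("theaterName", "destinationAddress"))  # semantically close
print(carryover_candidate("theaterName", "numberOfTickets"))     # unrelated slots
```

Here a theater name and a ride-share destination sit close together in the embedding space, so the value is a carryover candidate, while a ticket count is not.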
Although this technology hasn't been fully released to the Alexa developer community, there is a Developer Preview program called Alexa Conversations. Developers can apply for "early access to upcoming cross-topic capabilities." Mark Tucker, an Alexa Champion, said on Twitter:
Alexa Conversations preview promises less coding & a more conversational experience. Every developer should start looking. Amazon has focused heavily on Dialog Management as opposed to State Management & Intent Context. Will this be the solution?
However, not everyone is excited by the new conversation model. A commenter on Hacker News noted:
I wish it would move toward a more explicit list of items -> descend -> new list of items approach. The area where Alexa is most likely to show its warts is when you fall for the trap of trying to engage in a normal dialog with it.