Apple Reveals the Inner Workings of Siri's New Intonation

Apple has explained how it uses deep learning to make Siri's intonation sound more natural.

iPhone owners can interact with Siri by asking questions in natural language, and Siri responds by voice. Siri's voice is available in 21 languages and localized for 36 countries. At WWDC 2017, Apple announced that Siri would use a new text-to-speech engine in iOS 11. In August 2017, Apple's Machine Learning Journal explained how the company made Siri sound more human.

To generate speech, your iPhone stitches together pre-recorded human speech. Many hours of recorded speech are broken down into words, and these words are broken down into their most elemental components: phonemes. Whenever a sentence must be generated, recordings of the appropriate phonemes are selected and stitched together.

Selecting which recording to use for each phoneme is a big challenge. Each selected recording has to match the sound that must be pronounced, and it also has to fit the recordings selected around it. An old navigation system typically has only a few recordings per phoneme, which is why its voice sounds unnatural. Apple decided to use deep learning to determine what properties a sound must have to be appropriate in a sentence.

Every iOS device contains a small database of pre-recorded phonemes. Each recording is associated with audio properties: spectrum, pitch, and duration. A so-called "deep mixture density network" is trained to predict a distribution over each of the features a phoneme must have to fit in a natural-sounding sentence. To train this network, Apple designed a cost function that takes two aspects into account: how well a phoneme matches what must be pronounced, and how well it fits into the sentence.
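The two aspects of the cost can be illustrated with a small sketch. This is not Apple's implementation: it assumes, for simplicity, that the network predicts a single Gaussian (mean and variance) per feature rather than a full mixture, and the names gaussian_nll, target_cost, and join_cost are hypothetical. A recording is scored by its negative log-likelihood, so a lower cost means a better fit.

```python
import math


def gaussian_nll(value, mean, variance):
    """Negative log-likelihood of `value` under a 1-D Gaussian."""
    return 0.5 * (math.log(2 * math.pi * variance) + (value - mean) ** 2 / variance)


def target_cost(unit, predicted):
    """How well a recording matches what must be pronounced.

    `unit` maps feature names to measured values; `predicted` maps the same
    names to a (mean, variance) pair. Both layouts are assumptions made for
    this sketch.
    """
    return sum(
        gaussian_nll(unit[name], mean, variance)
        for name, (mean, variance) in predicted.items()
    )


def join_cost(prev_unit, next_unit, variance=1.0):
    """How well two neighbouring recordings fit together.

    Hypothetical simplification: only the pitch jump at the boundary is
    penalized, using a fixed variance.
    """
    return gaussian_nll(next_unit["pitch"] - prev_unit["pitch"], 0.0, variance)


# Illustrative numbers only (pitch in Hz, duration in seconds).
predicted = {"pitch": (120.0, 25.0), "duration": (0.08, 0.0004)}
close_match = {"pitch": 121.0, "duration": 0.08}
poor_match = {"pitch": 150.0, "duration": 0.15}
print(target_cost(close_match, predicted) < target_cost(poor_match, predicted))
```

Under these assumptions, a recording whose features lie near the predicted means receives a lower target cost, and two recordings with similar pitch at their boundary receive a lower join cost.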

After determining what exactly to look for, your phone searches through its database using the Viterbi search algorithm. The best path through the recorded phonemes is selected, and the recordings are concatenated and played.
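A minimal sketch of such a Viterbi search follows. It assumes every position in the sentence has a non-empty list of candidate recordings, and that two caller-supplied functions (here called target_cost and join_cost, both hypothetical names) score a candidate on its own and against its predecessor. The function viterbi_select and the numbers in the usage example are illustrative and not taken from Apple's system.

```python
def viterbi_select(candidates, target_cost, join_cost):
    """Return (total_cost, path) for the cheapest sequence of recordings.

    candidates[i] is the non-empty list of candidate units for position i.
    target_cost(i, unit) scores a unit at position i on its own;
    join_cost(prev_unit, unit) scores how well two neighbours fit together.
    """
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j],
    #               index of the predecessor in candidates[i - 1])
    best = [[(target_cost(0, unit), None) for unit in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, unit), back))
        best.append(row)

    # Start from the cheapest final unit and follow the back-pointers.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return total, path


# Toy usage: each unit is represented only by its pitch.
wanted = [100, 130]
candidates = [[100, 130], [131, 160]]
total, path = viterbi_select(
    candidates,
    target_cost=lambda i, unit: abs(unit - wanted[i]),
    join_cost=lambda prev, unit: 2 * abs(unit - prev),
)
print(total, path)
```

In this toy example the first position's best individual match (100) is not part of the cheapest overall path, because joining it to either second candidate is too costly. That is the reason for searching over whole paths instead of picking the best recording for each phoneme independently.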

An alternative would be to generate the sound waves directly, without concatenating recorded sound. In September 2016, Alphabet's DeepMind unveiled a computer-generated text-to-speech engine called WaveNet. The downside is that it is slow to generate speech: even a fast desktop computer would take a long time. Siri won't be replaced by directly generated speech soon.
