Microsoft announced Satin, a new audio codec that leverages AI techniques to outperform Skype's Silk codec over ultra-low bandwidth and highly constrained network conditions.
While high-speed connectivity is widely available today, 3G and 4G cellular networks often limit the quality of conversation, with packet loss above 50% and sporadic loss of coverage, explain Microsoft's Jigar Dani and Sriram Srinivasan. Additionally, reducing the bandwidth required by an audio conversation frees up bandwidth for other concurrent tasks run by the same user or by other people sharing the same network connection.
After all these years, it turns out that utilization of available bitrate is every bit as important today as it was in the dial-up world.
According to Dani and Srinivasan, Satin can cover frequencies up to 16 kHz, a range dubbed "super wideband" that doubles the usual 8 kHz bandwidth used for human speech, at just 6 kbps. By comparison, Silk could only provide a 4 kHz bandwidth at the same 6 kbps bitrate.
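To put these figures in perspective, the short calculation below applies the Nyquist criterion (the sampling rate must be at least twice the audio bandwidth) and compares the resulting uncompressed bitrate with the 6 kbps quoted above. The 16-bit mono PCM baseline is an assumption made here for illustration; it is not stated by Microsoft.

```python
def required_sample_rate_hz(audio_bandwidth_hz: int) -> int:
    """Minimum sampling rate for a given audio bandwidth (Nyquist criterion)."""
    return 2 * audio_bandwidth_hz


def raw_pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int = 16, channels: int = 1) -> int:
    """Bitrate of uncompressed PCM audio (16-bit mono is an assumed baseline)."""
    return sample_rate_hz * bits_per_sample * channels


def compression_ratio(raw_bps: int, coded_bps: int) -> float:
    """How many times smaller the coded stream is than the raw one."""
    return raw_bps / coded_bps


CODED_BITRATE_BPS = 6_000  # the 6 kbps operating point quoted in the article

for label, bandwidth_hz in (("4 kHz bandwidth", 4_000), ("16 kHz bandwidth", 16_000)):
    rate = required_sample_rate_hz(bandwidth_hz)
    raw = raw_pcm_bitrate_bps(rate)
    ratio = compression_ratio(raw, CODED_BITRATE_BPS)
    print(f"{label}: sample rate {rate} Hz, raw PCM {raw} bps, ratio {ratio:.1f}x")
```

Under these assumptions, carrying a 16 kHz bandwidth at 6 kbps means squeezing roughly 512 kbps of raw audio by a factor of about 85, four times the ratio needed for a 4 kHz bandwidth at the same bitrate.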
This improvement results from a two-fold approach. On the encoding end, only a sparse representation of the audio signal is processed, which is enabled by more advanced models of speech production and psychoacoustics. On the decoding end, instead:
Satin uses deep neural networks to estimate the high band parameters from the received low band parameters, and a minimal amount of side information sent over the wire.
Microsoft hasn't released many additional details about its approach to bandwidth extension through a deep neural network (DNN), but a similar technique is used by Google's WaveNet text-to-speech synthesizer. In WaveNet's case, a DNN is trained to upsample to 24 kHz an audio signal filtered through an 8 kHz codec. The signal reconstructed using this DNN has only slightly lower quality than that of a 16 kHz-filtered audio signal.
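Since the actual model is not public, the data flow on the decoding end can only be illustrated with a toy. In the sketch below, a single linear layer stands in for the DNN, band energies stand in for the codec parameters, and a single gain stands in for the side information. All names (`predict_high_band`, `decode_frame`, `side_gain`) and all numbers are hypothetical and do not describe Satin's real format or network.

```python
from typing import List


def predict_high_band(low_band: List[float],
                      weights: List[List[float]],
                      bias: List[float]) -> List[float]:
    """Stand-in for the DNN: one linear layer mapping low-band parameters
    to high-band parameters. A real system would use a trained deep network."""
    return [
        sum(w * x for w, x in zip(row, low_band)) + b
        for row, b in zip(weights, bias)
    ]


def decode_frame(low_band: List[float],
                 side_gain: float,
                 weights: List[List[float]],
                 bias: List[float]) -> List[float]:
    """Rebuild a full-band parameter vector from the received low band.

    `side_gain` plays the role of the small amount of side information sent
    over the wire; here it simply rescales the predicted high band.
    """
    high_band = [side_gain * h for h in predict_high_band(low_band, weights, bias)]
    return low_band + high_band


# Hypothetical values, chosen only to make the data flow visible.
received_low_band = [1.0, 0.5]
toy_weights = [[0.5, 0.0], [0.0, 0.5]]
toy_bias = [0.0, 0.0]

full_band = decode_frame(received_low_band, side_gain=2.0,
                         weights=toy_weights, bias=toy_bias)
print(full_band)
```

The point of the sketch is only that the high band is never transmitted in full: the decoder derives it from what it already received, and the side information corrects the estimate.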
Microsoft's approach with Satin, whatever it looks like in detail, poses a computational challenge, explain Microsoft engineers, since both the encoding and the decoding steps are computationally intensive.
To solve this, the team then focused on both algorithmic optimizations as well as techniques like loop vectorization beyond what the compiler could achieve. This achieved nearly 40% reduction in computational complexity and allowed us to run on all our users’ devices.
To evaluate Satin's impact, Microsoft carried out A/B tests showing that, at low bitrates, Satin enabled longer calls and received a mean opinion score (MOS) 1.7 points higher than Silk.
As a final remark, Satin also outperforms Silk under packet loss, which is quite a common occurrence over Wi-Fi and mobile networks. Here, Microsoft's engineers explain, Satin's approach of encoding each packet separately is what makes the difference, since losing a packet does not affect subsequent packets.
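The effect of independent packets can be seen in a toy simulation. The sketch below compares a scheme where every packet carries a complete frame with a simple delta scheme where each packet only carries the difference from the previous frame. Both schemes, and the integer "frames", are assumptions made for illustration; neither is the actual Satin or Silk bitstream, and the delta decoder deliberately has no concealment or resynchronization, which real codecs do provide.

```python
from typing import List, Optional

Packet = Optional[int]  # None marks a packet lost in transit


def encode_independent(frames: List[int]) -> List[int]:
    """Each packet carries the complete value of its own frame."""
    return list(frames)


def decode_independent(packets: List[Packet]) -> List[Packet]:
    """A lost packet costs exactly one frame; the others decode on their own."""
    return list(packets)


def encode_delta(frames: List[int]) -> List[int]:
    """Each packet carries only the difference from the previous frame."""
    packets, previous = [], 0
    for value in frames:
        packets.append(value - previous)
        previous = value
    return packets


def decode_delta(packets: List[Packet]) -> List[Packet]:
    """Once a packet is lost, the decoder state is unknown, so every later
    frame is lost too (this toy has no resynchronization)."""
    decoded: List[Packet] = []
    previous, state_valid = 0, True
    for packet in packets:
        if packet is None:
            state_valid = False
        if state_valid:
            previous += packet
            decoded.append(previous)
        else:
            decoded.append(None)
    return decoded


def drop(packets: List[int], lost_index: int) -> List[Packet]:
    """Simulate the network losing a single packet."""
    return [None if i == lost_index else p for i, p in enumerate(packets)]


def lost_frames(decoded: List[Packet]) -> int:
    return sum(1 for frame in decoded if frame is None)


frames = [10, 12, 15, 15, 20, 22]
independent = decode_independent(drop(encode_independent(frames), lost_index=2))
dependent = decode_delta(drop(encode_delta(frames), lost_index=2))

print("independent packets, frames lost:", lost_frames(independent))
print("delta packets, frames lost:", lost_frames(dependent))
```

With one packet dropped out of six, the independent scheme loses a single frame, while the toy delta scheme loses that frame and every frame after it.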
At the moment, Satin is being used for Teams and Skype calls and will extend to Teams meetings soon. In the future, says Microsoft, it will support full-band stereo music at a maximum sampling rate of 48 kHz, starting at a bitrate of 17 kbps.