GPU manufacturer NVIDIA announced its Maxine platform for AI-enhanced video-conferencing services, which includes a technology that can reduce bandwidth requirements by an order of magnitude. Because much of the data processing runs in the cloud, end users can take advantage of the compression without needing specialized hardware.
NVIDIA CEO Jensen Huang described the platform and its applications in his keynote address during the recent GPU Technology Conference (GTC). Maxine's video compression uses a generative adversarial network (GAN) on the receiver side to reconstruct images of human faces from the position information of only a few key points taken from the sender's images. Because only these points are sent, instead of pixel data, bandwidth requirements are reduced by up to 10x compared to the H.264 compression standard. Maxine also provides several other features, including face alignment and animated avatars. With Maxine and NVIDIA's Jarvis framework for conversational AI, Huang said:
We have an opportunity to revolutionize video conferencing of today and invent the virtual presence of tomorrow.
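A rough back-of-envelope calculation shows why sending keypoint positions instead of pixel data can yield savings on this scale. The numbers below are illustrative assumptions (a typical 720p H.264 call bitrate and a common 68-point facial-landmark count), not figures published by NVIDIA:

```python
# Back-of-envelope comparison: keypoint stream vs. a typical H.264 video call.
# All constants are illustrative assumptions, not NVIDIA's published numbers.

H264_BITRATE_KBPS = 1_500      # assumed bitrate for a 720p H.264 call
FPS = 30
NUM_KEYPOINTS = 68             # assumed facial-landmark count; Maxine's exact count is unspecified
BYTES_PER_KEYPOINT = 2 * 4     # (x, y) coordinates as 32-bit floats

keypoint_kbps = NUM_KEYPOINTS * BYTES_PER_KEYPOINT * 8 * FPS / 1_000
print(f"keypoint stream: ~{keypoint_kbps:.0f} kbps")                    # ~130 kbps, before any entropy coding
print(f"reduction vs. H.264: ~{H264_BITRATE_KBPS / keypoint_kbps:.0f}x")  # roughly an order of magnitude
```

Even without further compression of the keypoint stream, the payload per frame is a few hundred bytes, which is consistent with the order-of-magnitude reduction NVIDIA claims.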
The core AI algorithm in Maxine is based on NVIDIA's research on GANs. GANs use two deep-learning models: a generator, which learns to create "realistic" data, and a discriminator, which learns to distinguish between real data and the generator's output. Once trained, the generator can produce very convincing output. In a paper presented at the 2019 Computer Vision and Pattern Recognition (CVPR) conference, an NVIDIA research team described a model that can convert simple drawings into photorealistic images using "style transfer." The team also created a demo app of the technology dubbed GauGAN, which allows users to make their own drawing and apply one of many available reference images as a style.
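The adversarial training setup can be summarized in a few lines. The sketch below is a generic toy GAN in PyTorch, not NVIDIA's architecture; it only illustrates the generator-versus-discriminator training loop described above:

```python
# Minimal GAN sketch (generic toy example, not NVIDIA's model):
# the generator learns to produce data the discriminator cannot tell apart from real samples.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64

generator = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, data_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(data_dim, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1), nn.Sigmoid(),
)

loss_fn = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_batch: torch.Tensor):
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # Discriminator step: learn to label real samples 1 and generated samples 0.
    noise = torch.randn(batch_size, latent_dim)
    fake_batch = generator(noise).detach()
    d_loss = loss_fn(discriminator(real_batch), real_labels) + \
             loss_fn(discriminator(fake_batch), fake_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: learn to make the discriminator label fakes as real.
    noise = torch.randn(batch_size, latent_dim)
    g_loss = loss_fn(discriminator(generator(noise)), real_labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

After enough alternating steps on real data, the generator's output becomes difficult for the discriminator to reject, which is the property Maxine exploits to synthesize convincing face images on the receiver.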
Recently, GauGAN co-developer Ming-Yu Liu and other NVIDIA colleagues realized they could apply the technique to video conferencing. Most video compression algorithms take advantage of the fact that not all of the image data changes between frames, occasionally transmitting full keyframe images and then sending only the changes between that frame and subsequent frames. Maxine also requires a keyframe, or reference image, of the transmitting user's face. Subsequent frames in the source video are analyzed to locate "keypoints" of the sender's face in the image. Instead of differences between images, only the locations of the keypoints are sent. The Maxine software on the receiving side then reconstructs the sender's face by applying style transfer of the initial reference image to the simple keypoint "drawing" of the face.
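The sender/receiver split described above can be sketched as follows. The helper names (`detect_keypoints`, `reconstruct_face`) are hypothetical stand-ins; the actual Maxine models and APIs are not public, and the placeholder bodies only show where a landmark detector and a trained generator would plug in:

```python
# Sketch of the keyframe + keypoint pipeline (hypothetical helper names; not NVIDIA's API).
import numpy as np

def detect_keypoints(frame: np.ndarray) -> np.ndarray:
    """Stand-in for a facial-landmark detector: returns (N, 2) keypoint coordinates."""
    return np.zeros((68, 2), dtype=np.float32)  # placeholder output

def reconstruct_face(reference_frame: np.ndarray, keypoints: np.ndarray) -> np.ndarray:
    """Stand-in for the GAN: renders the reference face in the pose given by the keypoints."""
    return reference_frame  # placeholder: a trained generator would synthesize a new frame

# --- Sender side ---
reference_frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # keyframe, transmitted once

def encode(frame: np.ndarray) -> np.ndarray:
    return detect_keypoints(frame)                           # only keypoints go over the network

# --- Receiver side ---
def decode(keypoints: np.ndarray) -> np.ndarray:
    return reconstruct_face(reference_frame, keypoints)      # full frame rebuilt locally

current_frame = np.zeros((720, 1280, 3), dtype=np.uint8)
payload = encode(current_frame)   # a few hundred bytes per frame instead of pixel data
rebuilt = decode(payload)
```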
Besides reducing the bandwidth required for video conferencing, Maxine provides several other features, including improving images taken in low light and removing background noise. Additionally, the platform can "re-align" a sender's face in the video image. Many video-conferencing participants tend to look at their own screens instead of directly at the camera, with the result that they are not making "eye contact" with the other participants. Maxine can reconstruct the video of the sender such that they appear to be making eye contact. Maxine also supports animated virtual "avatars." By using a different reference image---for example, a cartoon character's face instead of the sender's actual face---Maxine will "reconstruct" the sender as an animated version of that character. Using conversational AI services from NVIDIA's Jarvis, video conferences can also include real-time closed-captioning and language translation.
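In terms of the earlier sketch, the avatar feature amounts to driving the same reconstruction with a different reference image. The snippet below reuses the hypothetical `reconstruct_face` helper from the previous example purely to illustrate that idea:

```python
# Continuing the earlier sketch: the same keypoint stream, rendered against a different reference.
avatar_reference = np.zeros((720, 1280, 3), dtype=np.uint8)  # e.g., a cartoon character's face

def decode_as_avatar(keypoints: np.ndarray) -> np.ndarray:
    # Same keypoints, different reference image: the avatar mirrors the sender's motion.
    return reconstruct_face(avatar_reference, keypoints)
```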
In a discussion on Hacker News, many users pointed out the similarity between Maxine's algorithm and "deepfakes," raising concerns about abuse of the technology. Others pointed out the positive possibilities:
I think the ability to...have yourself look a bit tidier than you actually are (working from home) could be a huge benefit. I mean taking away focus on things that [do not] matter in a virtual meeting such as where you are sitting [or] your daily hair style status...
NVIDIA's Maxine platform is currently in closed beta, and developers can apply for early access. The GauGAN model code is available on GitHub.