Google has open-sourced a new component for its MediaPipe framework that aims to bring real-time hand detection and tracking to mobile devices.
Google's algorithm uses machine learning (ML) techniques to detect 21 keypoints of a hand from a single frame and can track multiple hands at once. Its ability to run in real time on mobile devices sets it apart from competing approaches, which require desktop-class hardware, Google says. The new component is integrated within MediaPipe, a graph-based framework for building applied machine learning pipelines involving video, audio, and sensor data.
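MediaPipe also exposes this hand-tracking capability through its Python solution API. The following is a minimal sketch, assuming the mediapipe and opencv-python packages are installed; the file name hands.jpg is a placeholder:

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

# The Hands solution wraps the palm-detection and landmark models described
# below; max_num_hands enables tracking more than one hand at a time.
with mp_hands.Hands(static_image_mode=True,
                    max_num_hands=2,
                    min_detection_confidence=0.5) as hands:
    image = cv2.imread("hands.jpg")  # placeholder file name
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    for hand in results.multi_hand_landmarks or []:
        print(len(hand.landmark))  # 21 keypoints per detected hand
        tip = hand.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP]
        print(f"index fingertip: x={tip.x:.2f}, y={tip.y:.2f}, z={tip.z:.2f}")
```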
Google's approach is based on three ML models working in a pipeline. The first model, called BlazePalm, detects an oriented hand bounding box. The detected bounding box is fed to a second model that detects 3D hand keypoints, which a third model then classifies into a discrete set of gestures. The output of the pipeline is shown in the following image.
(Image from Google blog)
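Conceptually, the pipeline can be pictured as three chained functions. The names and signatures in this Python sketch are purely illustrative and do not correspond to MediaPipe's actual calculators:

```python
from typing import List, Tuple

OrientedBox = Tuple[float, float, float, float, float]  # x, y, w, h, rotation
Keypoint = Tuple[float, float, float]                   # x, y, z

# Hypothetical stand-ins for the three models and the cropping step.
def blazepalm_detect(frame) -> OrientedBox: ...
def crop_and_rotate(frame, box: OrientedBox): ...
def hand_landmark_model(hand_crop) -> List[Keypoint]: ...
def gesture_classifier(keypoints: List[Keypoint]) -> str: ...

def hand_pipeline(frame) -> str:
    box = blazepalm_detect(frame)          # stage 1: oriented palm bounding box
    crop = crop_and_rotate(frame, box)     # align the hand before landmarking
    keypoints = hand_landmark_model(crop)  # stage 2: 21 3D keypoints
    return gesture_classifier(keypoints)   # stage 3: discrete gesture label
```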
According to Google researchers, a key element of their approach is the accurate palm crop produced by the BlazePalm component:
Providing the accurately cropped palm image to the hand landmark model drastically reduces the need for data augmentation (e.g. rotations, translation and scale) and instead allows the network to dedicate most of its capacity towards coordinate prediction accuracy.
This architecture is similar to the one used in their face mesh pipeline, which is also available in MediaPipe. In comparison with face detection, hand detection is made harder by the lack of high-contrast patterns, so BlazePalm relies on additional context, such as arm, body, or person features, to improve hand localization. Their approach achieves a 95.7% average precision in palm detection, Google says. To train the second-stage model, Google annotated about 30K real-world images with 21 keypoints each, which were used along with an unspecified number of synthetic hand images.
Among the use cases Google suggests for this technology are sign language understanding and device control through hand gestures.
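As a rough illustration of gesture-driven device control, the sketch below maps a naive finger-count heuristic over MediaPipe's 21 hand landmarks to made-up device commands; neither the heuristic nor the command names reflect Google's actual gesture classifier:

```python
import mediapipe as mp

mp_hands = mp.solutions.hands

# Count extended fingers by comparing each fingertip to the joint two
# landmark indices below it (its PIP joint). A simplistic heuristic for
# demonstration purposes only; the thumb is ignored.
FINGER_TIPS = [mp_hands.HandLandmark.INDEX_FINGER_TIP,
               mp_hands.HandLandmark.MIDDLE_FINGER_TIP,
               mp_hands.HandLandmark.RING_FINGER_TIP,
               mp_hands.HandLandmark.PINKY_TIP]

def extended_finger_count(hand_landmarks) -> int:
    count = 0
    for tip in FINGER_TIPS:
        pip = tip - 2  # the PIP joint precedes the tip by two indices
        # Image y grows downward, so an extended fingertip sits above its PIP.
        if hand_landmarks.landmark[tip].y < hand_landmarks.landmark[pip].y:
            count += 1
    return count

def gesture_to_command(hand_landmarks) -> str:
    # Hypothetical command names, for illustration only.
    return {0: "pause", 2: "volume_up", 4: "play"}.get(
        extended_finger_count(hand_landmarks), "no_op")
```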
In the future, Google researchers plan to increase the number of gestures that can be recognized and to support dynamic gestures that unfold over time.
MediaPipe is a cross-platform framework for mobile devices, workstations, and servers, with support for mobile GPU acceleration. It allows developers to build processing pipelines from ML-enabled components. Currently, Google MediaPipe provides components for hand tracking, face detection, hair segmentation, and object detection.
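The other components follow the same solution-style pattern in the Python API. For example, a minimal face-detection sketch, again assuming the mediapipe and opencv-python packages and a placeholder image file:

```python
import cv2
import mediapipe as mp

mp_face = mp.solutions.face_detection

# The face-detection solution exposes the same process() pattern as Hands.
with mp_face.FaceDetection(min_detection_confidence=0.5) as detector:
    image = cv2.imread("photo.jpg")  # placeholder file name
    results = detector.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    for detection in results.detections or []:
        box = detection.location_data.relative_bounding_box
        print(f"face at ({box.xmin:.2f}, {box.ymin:.2f}), "
              f"score {detection.score[0]:.2f}")
```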