Holistic tracking is a new feature in MediaPipe that enables the simultaneous detection of body pose, hand, and face landmarks on mobile devices. The three capabilities were previously available only as separate solutions; they are now combined in a single, highly optimized solution.
MediaPipe Holistic consists of a new pipeline with optimized pose, face, and hand components that each run in real time, with minimal memory transfer between their inference backends, and adds support for interchanging the three components, depending on the desired quality/speed trade-off.
One feature of the pipeline is that it adapts its input to each model's requirements. For example, pose estimation works on a 256x256 frame, a resolution that would not preserve enough detail for the hand tracking model.
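To make the trade-off concrete, here is a back-of-the-envelope illustration; the frame size and hand proportion below are assumptions chosen for illustration, not figures from Google:

```python
# Back-of-the-envelope: why a 256x256 pose input is too coarse for hands.
# The 1920-pixel frame width and the 10% hand span are illustrative
# assumptions, not figures from Google.
frame_width = 1920     # source frame width in pixels
hand_fraction = 0.10   # hand spans ~10% of the frame width

hand_px_source = frame_width * hand_fraction  # ~192 px in the original frame
hand_px_pose = 256 * hand_fraction            # ~26 px in the 256x256 pose input

print(f"hand width in source frame:       {hand_px_source:.0f} px")
print(f"hand width in 256x256 pose input: {hand_px_pose:.0f} px")
# A ~26-pixel-wide hand leaves little detail for 21 hand keypoints, which
# is why the pipeline re-crops hand regions from the high-resolution frame.
```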
According to Google engineers, combining the detection of human pose, hand tracking, and face landmarks is a very complex problem that requires the use of multiple, dependent neural networks.
As the engineers explain: "MediaPipe Holistic requires coordination between up to 8 models per frame — 1 pose detector, 1 pose landmark model, 3 re-crop models and 3 keypoint models for hands and face. While building this solution, we optimized not only machine learning models, but also pre- and post-processing algorithms."
The first model in the pipeline is the pose detector. The results of this inference are used to locate both hands and the face, and to crop the original, high-resolution frame accordingly. The resulting crops are then passed to the hand and face models.
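The dataflow can be sketched as follows. This is a minimal illustration of the description above, with stub functions standing in for the actual neural networks; none of the names below are real MediaPipe APIs:

```python
import numpy as np

# Stub models standing in for the real neural networks; all names here are
# hypothetical placeholders, not MediaPipe APIs.

def run_pose(frame_small: np.ndarray) -> dict:
    """Stub pose model: returns normalized (x, y, size) ROIs per region."""
    return {"face": (0.40, 0.05, 0.20),
            "left_hand": (0.10, 0.50, 0.15),
            "right_hand": (0.75, 0.50, 0.15)}

def run_keypoints(part: str, crop: np.ndarray) -> str:
    """Stub keypoint model for one cropped region."""
    return f"{part}: landmarks from a {crop.shape[1]}x{crop.shape[0]} crop"

def holistic_frame(frame_hires: np.ndarray) -> list:
    h, w = frame_hires.shape[:2]

    # 1. The pose model runs first, on a small fixed-size input
    #    (naive stride-based downscale; a real pipeline resizes properly).
    frame_small = frame_hires[::max(1, h // 256), ::max(1, w // 256)]
    rois = run_pose(frame_small)

    # 2. Each region is then cropped from the ORIGINAL high-resolution
    #    frame, so the hand and face models see far more pixel detail
    #    than the small pose input contains.
    results = []
    for part, (x, y, size) in rois.items():
        x0, y0, s = int(x * w), int(y * h), int(size * w)
        results.append(run_keypoints(part, frame_hires[y0:y0 + s, x0:x0 + s]))
    return results

for line in holistic_frame(np.zeros((1080, 1920, 3), dtype=np.uint8)):
    print(line)
```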
To achieve maximum performance, the pipeline assumes that the subject does not move significantly from frame to frame, so the result of the previous frame's analysis, i.e., the body region of interest, can be used to seed inference on the new frame. During fast movements, however, the tracker can lose the target, in which case the pose detector is run again on the frame to re-acquire it.
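A sketch of that temporal logic might look like this; the function names, the stub behavior, and the 0.5 threshold are all assumptions for illustration, not MediaPipe internals:

```python
import random

# Sketch of the tracking logic described above: reuse the previous frame's
# region of interest (ROI) and only fall back to the expensive pose detector
# when the tracker loses the target.

def detect_body_roi(frame_id: int) -> str:
    print(f"  frame {frame_id}: running the full pose detector")
    return f"roi@{frame_id}"

def track_landmarks(frame_id: int, roi: str) -> tuple:
    """Stub tracker: returns landmarks and a tracking confidence."""
    return f"landmarks@{frame_id}", random.random()

def run_stream(num_frames: int) -> None:
    roi = None
    for frame_id in range(num_frames):
        if roi is None:                       # first frame: must detect
            roi = detect_body_roi(frame_id)
        landmarks, conf = track_landmarks(frame_id, roi)
        if conf < 0.5:                        # e.g. a fast movement
            roi = detect_body_roi(frame_id)   # re-acquire the target
            landmarks, conf = track_landmarks(frame_id, roi)
        print(f"frame {frame_id}: tracking confidence {conf:.2f}")
        # roi is carried over and seeds inference on the next frame

random.seed(1)
run_stream(5)
```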
Thanks to this approach, Google engineers say, holistic tracking is able to detect over 540 keypoints (33 pose landmarks, 468 face landmarks, and 21 landmarks per hand) while providing near real-time performance.
The holistic tracking API allows developers to set a number of input parameters, such as whether input images should be treated as frames of a video stream or as unrelated, independent images; whether inference should cover the full body or only the upper body; minimum detection and tracking confidence thresholds; and so on. Additionally, it allows developers to specify precisely which output landmarks the inference should provide.
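Assuming the Python solution, a minimal video-stream setup might look like the following. The parameter names match the MediaPipe Python API around the Holistic release; later versions changed some of them, for example replacing upper_body_only with model_complexity:

```python
import cv2
import mediapipe as mp

holistic = mp.solutions.holistic.Holistic(
    static_image_mode=False,      # inputs are frames of a video stream
    upper_body_only=False,        # full-body rather than upper-body landmarks
    smooth_landmarks=True,        # filter landmark jitter across frames
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5)

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB input; OpenCV captures BGR.
    results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    # Each output field is None when the corresponding part is not visible.
    if results.pose_landmarks:
        print("pose:", len(results.pose_landmarks.landmark), "landmarks")
    if results.face_landmarks:
        print("face:", len(results.face_landmarks.landmark), "landmarks")
    if results.left_hand_landmarks or results.right_hand_landmarks:
        print("at least one hand detected")

cap.release()
holistic.close()
```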
According to Google, the unification of pose, hand tracking, and facial expression detection will enable new applications, including remote gesture interfaces, full-body augmented reality, sign language recognition, and more. As an example, Google engineers developed a remote control interface that runs in the browser and lets the user manipulate objects on the screen, type on a virtual keyboard, and so on, using gestures.
MediaPipe Holistic is available on-device for mobile (Android, iOS) and desktop. Ready-to-use solutions are also provided in Python and JavaScript to accelerate developer adoption.