11. MediaPipe development

MediaPipe GitHub: https://github.com/google/mediapipe

MediaPipe official website: https://google.github.io/mediapipe/

dlib official website: http://dlib.net/

dlib GitHub: https://github.com/davisking/dlib

11.1. Introduction

MediaPipe is a framework, developed and open-sourced by Google, for building machine learning applications that process streaming data. It provides a graph-based data processing pipeline for applications that consume many forms of data sources, such as video, audio, sensor data, and any other time series data. MediaPipe is cross-platform: it runs on embedded platforms (Raspberry Pi, etc.), mobile devices (iOS and Android), workstations and servers, and it supports mobile GPU acceleration. It provides cross-platform, customizable ML solutions for live and streaming media.

The core framework of MediaPipe is implemented in C++, with support for other languages such as Java and Objective-C. Its main concepts are Packet, Stream, Calculator, Graph and Subgraph.
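To make these concepts more concrete, the following is a toy model in plain Python. It does not use MediaPipe's API: the classes `Packet`, `Calculator` and `Graph` are hypothetical stand-ins written only for illustration, and the graph is assumed to be a simple linear chain. The sketch shows timestamped packets flowing along a stream through a sequence of calculators.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class Packet:
    """A timestamped unit of data travelling along a stream (toy model)."""
    timestamp: int
    payload: Any


class Calculator:
    """A node that turns each input packet into one output packet (toy model)."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def process(self, packet: Packet) -> Packet:
        # Keep the timestamp so downstream nodes can stay synchronized.
        return Packet(packet.timestamp, self.fn(packet.payload))


class Graph:
    """A linear chain of calculators; a real graph may branch and merge."""

    def __init__(self, calculators: List[Calculator]) -> None:
        self.calculators = calculators

    def run(self, stream: Iterable[Packet]) -> Iterator[Packet]:
        for packet in stream:
            for calculator in self.calculators:
                packet = calculator.process(packet)
            yield packet


# Input stream: three "frames", represented here by plain integers.
input_stream = [Packet(t, value) for t, value in enumerate([1, 2, 3])]
graph = Graph([Calculator(lambda x: x * 2), Calculator(lambda x: x + 1)])
outputs = list(graph.run(input_stream))
print([(p.timestamp, p.payload) for p in outputs])  # [(0, 3), (1, 5), (2, 7)]
```

In this picture, a Subgraph would simply be a `Graph` that is reused as a single node inside a larger graph.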

Features of MediaPipe:

Deep Learning Solutions in MediaPipe

Solutions:

Face Detection
Face Mesh
Iris
Hands
Pose
Holistic
Selfie Segmentation
Hair Segmentation
Object Detection
Box Tracking
Instant Motion Tracking
Objectron
KNIFT
AutoFlip
MediaSequence
YouTube 8M

Platforms listed for these solutions: Android, iOS, C++, Python, JS, Coral

11.2. Use

In the process of use, it should be noted that

01. Hand detection
02. Pose detection
03. Holistic detection
04. Face detection
05. Face recognition
06. Face special effects
07. Face detection
08. 3D object recognition
09. Brush
10. Finger control
11. Gesture recognition

11.3. MediaPipe Hands

MediaPipe Hands is a high-fidelity hand and finger tracking solution. It uses machine learning (ML) to infer 21 3D landmarks of a hand from a single frame.

After palm detection has been run on the entire image, the hand landmark model localizes the 21 3D hand-knuckle coordinates precisely inside the detected hand region by regression, that is, by direct coordinate prediction. The model learns a consistent internal representation of hand pose and is robust even to partially visible hands and self-occlusion.

To obtain ground truth data, about 30K real-world images were manually annotated with 21 3D coordinates, as shown below (the Z-value is taken from the image depth map where one exists for the corresponding coordinate). To better cover the possible hand poses and to provide additional supervision on the nature of hand geometry, high-quality synthetic hand models were also rendered over various backgrounds and mapped to the corresponding 3D coordinates.

Figure: hand_landmarks (the 21 hand landmarks)
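The 21 landmarks are returned with x and y normalized to the image width and height, so a common first step is to convert them to pixel coordinates. The sketch below assumes the legacy `mediapipe.solutions` Python API and OpenCV are installed; the function names `to_pixel` and `detect_hand_pixels` are hypothetical helpers introduced here, not part of MediaPipe.

```python
from typing import List, Tuple


def to_pixel(x_norm: float, y_norm: float, width: int, height: int) -> Tuple[int, int]:
    """Convert a normalized landmark to pixel coordinates, clamped to the image."""
    px = min(max(int(x_norm * width), 0), width - 1)
    py = min(max(int(y_norm * height), 0), height - 1)
    return px, py


def detect_hand_pixels(image_path: str) -> List[List[Tuple[int, int]]]:
    """Return, for each detected hand, its 21 landmarks in pixel coordinates."""
    # Imported lazily so to_pixel() can be used without these packages.
    import cv2
    import mediapipe as mp

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    height, width = image.shape[:2]

    with mp.solutions.hands.Hands(static_image_mode=True,
                                  max_num_hands=2,
                                  min_detection_confidence=0.5) as hands:
        # MediaPipe expects RGB input, while OpenCV loads images as BGR.
        results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    hands_px = []
    # multi_hand_landmarks is None when no hand is found.
    for hand in results.multi_hand_landmarks or []:
        hands_px.append([to_pixel(lm.x, lm.y, width, height) for lm in hand.landmark])
    return hands_px
```

Landmarks can lie slightly outside the [0, 1] range when a hand is partly out of frame, which is why `to_pixel` clamps its result. In the returned lists, index 0 is the wrist and index 8 is the tip of the index finger.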

11.4. MediaPipe Pose

MediaPipe Pose is an ML solution for high-fidelity body pose tracking. It infers 33 3D landmarks and a full-body background segmentation mask from RGB video frames, building on the BlazePose research that also powers the ML Kit Pose Detection API.

The landmark model in MediaPipe Pose predicts the locations of 33 pose landmarks (see figure below).

Figure: pose_tracking_full_body_landmarks (the 33 pose landmarks)
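A typical use of the 33 landmarks is to measure a joint angle from three of them. The sketch below is one possible approach, not the method used by the cases in this chapter: `joint_angle` and `left_elbow_angle` are hypothetical helpers, and the input is assumed to be a list of 33 (x, y) points already converted to pixel coordinates, since angles computed on normalized coordinates are distorted when the image is not square. The indices 11, 13 and 15 are the left shoulder, elbow and wrist in MediaPipe's PoseLandmark enum.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]

# Landmark indices as defined by MediaPipe's PoseLandmark enum.
LEFT_SHOULDER, LEFT_ELBOW, LEFT_WRIST = 11, 13, 15


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Return the angle at vertex b, in degrees within [0, 180]."""
    angle = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    angle = abs(angle)
    return 360.0 - angle if angle > 180.0 else angle


def left_elbow_angle(points: Sequence[Point]) -> float:
    """Angle between upper arm and forearm, given all 33 pose landmarks."""
    if len(points) != 33:
        raise ValueError("expected 33 pose landmarks")
    return joint_angle(points[LEFT_SHOULDER], points[LEFT_ELBOW], points[LEFT_WRIST])


# Synthetic landmarks: upper arm pointing up, forearm pointing right.
points = [(0.0, 0.0)] * 33
points[LEFT_SHOULDER] = (100.0, 100.0)
points[LEFT_ELBOW] = (100.0, 200.0)
points[LEFT_WRIST] = (200.0, 200.0)
print(round(left_elbow_angle(points), 1))  # 90.0
```

With real data, the points would come from `results.pose_landmarks.landmark` of the legacy `mediapipe.solutions.pose.Pose` API, with each landmark's x and y multiplied by the image width and height.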

11.5. dlib

The corresponding case is 06, Face special effects.

dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real-world problems. It is used in both industry and academia in a wide range of domains, including robotics, embedded devices, mobile phones, and large high-performance computing environments.

dlib marks the important parts of the face with 68 points; for example, points 18-22 mark the right eyebrow and points 49-68 mark the mouth. Use the get_frontal_face_detector function of the dlib library to detect faces, and use the pretrained shape_predictor_68_face_landmarks.dat model to predict the facial landmark positions.

Figure: face_landmarks (the 68 facial landmarks)
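The detection and landmark steps can be sketched as follows. This is a minimal example, assuming the dlib Python bindings are installed and that shape_predictor_68_face_landmarks.dat has been downloaded to the working directory. `FACE_REGIONS`, `region_indices` and `detect_landmarks` are hypothetical names introduced here. The region table uses the 1-based numbering of the 68-point scheme, while dlib's `shape.part(i)` is 0-based.

```python
from typing import Dict, List, Tuple

# 1-based, inclusive point ranges of the 68-point annotation scheme.
FACE_REGIONS: Dict[str, Tuple[int, int]] = {
    "jaw": (1, 17),
    "right_eyebrow": (18, 22),
    "left_eyebrow": (23, 27),
    "nose": (28, 36),
    "right_eye": (37, 42),
    "left_eye": (43, 48),
    "mouth": (49, 68),
}


def region_indices(name: str) -> List[int]:
    """Return the 0-based indices, as used by shape.part(i), of a face region."""
    first, last = FACE_REGIONS[name]
    return list(range(first - 1, last))


def detect_landmarks(
    image_path: str,
    model_path: str = "shape_predictor_68_face_landmarks.dat",
) -> List[List[Tuple[int, int]]]:
    """Return the 68 (x, y) landmarks of every face found in the image."""
    # Imported lazily so the region helpers work without dlib installed.
    import dlib

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor(model_path)
    image = dlib.load_rgb_image(image_path)

    faces = []
    for rect in detector(image, 1):  # 1 = upsample the image once
        shape = predictor(image, rect)
        faces.append([(shape.part(i).x, shape.part(i).y)
                      for i in range(shape.num_parts)])
    return faces
```

For example, `[face[i] for i in region_indices("mouth")]` selects the 20 mouth points of one detected face.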