Monocular 3D head tracking is a core technique for designing intelligent human-computer interfaces. Despite a decade of progress, long-term tracking in complex environments remains a challenging problem. In this thesis, we investigate this problem by presenting two alternative frameworks and explore their potential applications in human-computer interaction. The first framework is a robust implementation of the differential tracking approach that uses a 3D ellipsoid for geometric reasoning. It recursively estimates the head pose over time from a motion prior and dynamically updates a template model built beforehand. This makes it robust to appearance changes and leads to smooth estimates. However, these design choices also bring two severe problems: the system can only handle slowly moving targets, and the continual model updating makes it prone to drift; together, these make tracking over long periods of time impossible. To overcome these limitations, the second part of this thesis turns to a novel tracking-by-detection approach. In contrast to the first framework, it requires an offline modeling and learning procedure, but performs tracking without a motion prior or dynamic updating. Tracking is performed by matching features detected in the input images against reference features via a novel multi-view learning scheme. The learning relies on face texture synthesis to produce training examples, stable class detection, and multi-view selection, all executed within a simple head-modeling system. Extensive experiments show that this prevents drift while successfully tracking natural head motions. To further improve performance, we also integrate optical-flow correspondences to enforce temporal consistency and incorporate a color prior to identify possible outlier features. Fusing all these components yields a system suitable for human-computer interaction. Lastly, we present two applications of the proposed 3D head tracking system.
The first estimates the user's gaze direction in the presence of natural head rotations. The second transfers facial expressions from a user to their online avatar. Integrating these two functions into a virtual collaborative system greatly improves communication between two remote partners.
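To illustrate the trade-off described above, the following is a minimal, hypothetical 1-D sketch of the recursive estimate-then-update loop behind differential tracking, not the thesis implementation (which operates on a 3D ellipsoid model). It shows both properties the abstract notes: the small search window means only slow motion can be followed, and blending each matched patch back into the template is the mechanism that can accumulate drift.

```python
import numpy as np

def differential_track(frames, template, alpha=0.1):
    """Toy 1-D differential tracker (illustrative only).

    Each frame, search a small window around the previous pose for the
    offset that best aligns the template, then blend the matched patch
    back into the template (the "dynamic update" step).
    """
    pose = 0            # hypothetical "head pose" = horizontal offset
    poses = []
    for frame in frames:
        best, best_err = pose, np.inf
        # small search window around the prior => only slow motion is handled
        for d in range(pose - 3, pose + 4):
            if d < 0 or d + len(template) > len(frame):
                continue
            err = np.sum((frame[d:d + len(template)] - template) ** 2)
            if err < best_err:
                best, best_err = d, err
        pose = best
        # dynamic template update: adapts to appearance changes,
        # but repeated blending is exactly what makes drift possible
        template = (1 - alpha) * template + alpha * frame[pose:pose + len(template)]
        poses.append(pose)
    return poses
```

A slowly translating target stays inside the search window and is followed frame by frame; a fast one falls outside it and is lost, which is why the second framework drops the motion prior entirely.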