English Abstract | Perceiving human motion and user intent through the visual system is the primary means of intelligent human-computer interaction in virtual scenarios. Within this area, vision-based three-dimensional (3D) hand modeling aims to reconstruct the geometric shape of a user's hands in real time using camera systems. This technology enables precise collision interaction between virtual 3D objects and the user's hands, enhancing the realism of the user experience. Unlike previous 3D hand reconstruction methods, this research focuses on alleviating the depth ambiguity and self-occlusion of hands by leveraging the stereo vision information provided by multi-camera systems, in order to achieve high-precision 3D hand reconstruction. Through efficient 3D hand modeling and real-time multi-camera 3D hand reconstruction, this study advances research in two aspects:
(1) Efficient hand modeling: This research extends the general framework of implicit neural geometric representation and proposes a method for explicit geometric modeling of hands based on a Multi-Layer Perceptron (MLP) architecture. The method employs a divide-and-conquer strategy to decompose the highly non-convex hand geometry into a series of convex components, decoupling the geometry so that it can be reconstructed bone by bone. Furthermore, a three-axis modeling approach is proposed for each local geometry, extending the single-value regression framework of implicit neural geometric representation to a multi-value regression framework for explicit point cloud representation. With this framework, the research achieves efficient modeling of local point cloud geometry. Experimental results on multiple datasets demonstrate the effectiveness of the proposed 3D hand point cloud modeling method. Specifically, compared with hand modeling methods based on implicit geometric representation, the proposed method achieves sub-millimeter precision while improving inference speed by nearly a thousand times. Compared with learning-based explicit mesh modeling methods, the proposed method is significantly better in both accuracy and inference overhead: a single inference takes only 5.6 milliseconds, and modeling precision reaches the sub-millimeter level.
(2) Real-time multi-view 3D hand reconstruction: This research further extends the proposed 3D hand point cloud representation and proposes a geometrically denser and more lightweight 3D hand mesh representation. Based on this representation, a real-time multi-view hand reconstruction method is proposed. The method first estimates the 3D hand skeleton and extracts visual features from multi-view visual information, then injects the visual information into the trained 3D hand mesh representation model in a noise-free manner, so that the prediction is enhanced by visual information. Experiments on multiple public datasets demonstrate the efficiency of the proposed method. Compared with the state-of-the-art multi-view hand reconstruction baseline, the proposed method achieves comparable accuracy while doubling the inference speed, reaching 60 FPS. This demonstrates the rationality and efficiency of the proposed method. |