Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection
Cong Pan; Junran Peng; Zhaoxiang Zhang
Source Publication: IEEE/CAA Journal of Automatica Sinica
Abstract: Monocular 3D object detection is challenging due to the lack of accurate depth information. Some methods estimate pixel-wise depth maps with off-the-shelf depth estimators and use them as an additional input to augment the RGB images. These depth-based methods either convert the estimated depth maps to pseudo-LiDAR and apply LiDAR-based object detectors, or focus on learning a fusion of image and depth features. However, their performance and efficiency are limited by inaccurate depth and by complex convolution-based fusion. In contrast to these approaches, our proposed depth-guided vision transformer with normalizing flows (NF-DVT) uses normalizing flows to build priors in depth maps and thereby obtain more accurate depth information. We then develop a novel Swin-Transformer-based backbone with a fusion module that processes RGB image patches and depth map patches in two separate branches and fuses them with cross-attention so that the branches exchange information. Furthermore, using the pixel-wise relative depth values in the depth maps, we develop new relative position embeddings for the cross-attention mechanism to capture a more accurate ordering of the input tokens. Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection. Experimental results on the KITTI dataset and the challenging Waymo Open dataset show the effectiveness of the proposed method and its superior performance over previous counterparts.
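The cross-attention fusion mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the NF-DVT implementation: the function name `depth_biased_cross_attention`, the single-head layout, the one-directional fusion (RGB queries attending to depth tokens), and the linear penalty on relative depth are all assumptions made for illustration. The paper describes learned relative position embeddings derived from relative depth, which this sketch replaces with a fixed bias.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def depth_biased_cross_attention(rgb_tokens, depth_tokens, patch_depth,
                                 w_q, w_k, w_v, bias_scale=1.0):
    """Hypothetical single-head cross-attention from RGB tokens to depth tokens.

    rgb_tokens, depth_tokens: (N, C) patch embeddings from the two branches.
    patch_depth: (N,) one depth value per patch (e.g. the patch mean).
    w_q, w_k, w_v: (C, C) projection matrices.
    """
    q = rgb_tokens @ w_q      # queries come from the RGB branch
    k = depth_tokens @ w_k    # keys come from the depth branch
    v = depth_tokens @ w_v    # values come from the depth branch
    scores = q @ k.T / np.sqrt(q.shape[-1])

    # Assumed stand-in for a relative position embedding: pairs of patches
    # that are far apart in depth receive a lower attention score.
    rel_depth = np.abs(patch_depth[:, None] - patch_depth[None, :])
    scores = scores - bias_scale * rel_depth

    attn = softmax(scores, axis=-1)
    fused = rgb_tokens + attn @ v   # residual connection keeps the RGB features
    return fused, attn


rng = np.random.default_rng(0)
N, C = 6, 8                                   # 6 patches, 8 channels
rgb = rng.normal(size=(N, C))
dep = rng.normal(size=(N, C))
depth = rng.uniform(1.0, 50.0, size=N)        # made-up per-patch depths
w_q, w_k, w_v = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))
fused, attn = depth_biased_cross_attention(rgb, dep, depth, w_q, w_k, w_v)
```

Each row of `attn` is a probability distribution over depth patches, and `fused` has the same shape as the RGB input. A two-way exchange, as the abstract describes, would run the same operation a second time with the roles of the two branches swapped.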
Keywords: Monocular 3D object detection; normalizing flows; Swin Transformer
Document Type: Journal article
Collection: Academic Journals_IEEE/CAA Journal of Automatica Sinica
Recommended Citation
GB/T 7714: Cong Pan, Junran Peng, Zhaoxiang Zhang. Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection[J]. IEEE/CAA Journal of Automatica Sinica, 2024, 11(3): 673-689.
APA: Cong Pan, Junran Peng, & Zhaoxiang Zhang. (2024). Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection. IEEE/CAA Journal of Automatica Sinica, 11(3), 673-689.
MLA: Cong Pan, et al. "Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection." IEEE/CAA Journal of Automatica Sinica 11.3 (2024): 673-689.
Files in This Item:
File Name/Size: JAS-2023-0177.pdf (37784 KB); DocType: Journal article; Version: Published version; Access: Open access; License: CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.