Research on a Visual Localization System for Complex Indoor Scenes Based on C/S Architecture
王超
2020-05
Pages: 82
Degree type: Master's
Chinese Abstract

With the growing popularity of mobile robots, the demand for indoor robot localization keeps increasing: mobile robots are expected to localize and navigate accurately indoors and then carry out a series of tasks. Indoor localization has broad applications in factories, homes, shopping malls and other indoor environments. To achieve accurate indoor localization, vision-based localization is a feasible solution, in particular estimating the camera's six-degree-of-freedom position and orientation in the world coordinate system from a single image captured by the camera, which is convenient to operate. However, because indoor environments change in complex ways and a single image carries less information than a video, improving the accuracy, robustness and efficiency of a single-image localization system remains a challenging task. This thesis conducts a series of studies on visual localization in complex indoor scenes from a single image. The main contributions are as follows:

1. A medium-scale test dataset of complex indoor scenes is constructed. To address the problems that existing public datasets are small and their test samples lack ground-truth poses, the laser-visual SLAM scanning device Navvis was used to scan about 8000 square meters over three floors of Wanda Plaza in Shijingshan, Beijing, acquiring 1567 panoramas of the scene together with the corresponding encoded depth maps; a total station electronic rangefinder assisted the survey to improve the accuracy of the generated 3D point-cloud model. From each panorama, 36 perspective views with different viewing directions are generated by view synthesis, and the 3D point coordinates corresponding to every pixel of a perspective view are obtained from the 3D point-cloud model, yielding a scene 3D database containing 56,412 images. In addition, 4701 images of the scene were captured with three different mobile phone models as the test set; the intrinsics of each phone were calibrated separately with a 2D checkerboard, and the extrinsics between the phone and the Navvis device were estimated with EPnP, which provides the ground-truth poses of the test images.

2. A visual localization pipeline based on image retrieval is designed. To address the low efficiency and low accuracy of current visual localization methods, the system first compares three mainstream image retrieval algorithms, BOW, Disloc and Inloc, analyzes their respective advantages and disadvantages, and selects the BOW model as the final image retrieval scheme. To further improve localization efficiency, GPU-accelerated RootSIFT feature extraction is introduced on top of BOW similarity retrieval; the retrieval results are re-ranked according to the RootSIFT feature matching results, and the 10 most similar images together with their feature matches are fed into the Perspective-n-Point localization algorithm to estimate the pose of the query image. Experiments show that this method is more than 10 times faster than traditional localization algorithms, while remaining accurate, efficient and robust.

3. A complete indoor visual localization system based on a client-server (C/S) architecture is implemented. The system localizes in 200 ms with a position accuracy of 6 cm, an angular accuracy of 0.32°, and a success rate of 91.6%, which meets the requirements of practical applications. The system consists of the following modules: a client module, which captures the scene image, compresses and uploads it to the server, and receives the returned localization result; and a server module, which loads the 3D scene model reconstructed offline, performs similarity retrieval and pose estimation on the received client image, and sends the estimated pose back to the client.

English Abstract

With the increasing popularity of mobile robots, the demand for indoor robot localization is growing as well. Mobile robots are expected to achieve accurate indoor localization and navigation and then perform a series of tasks. Indoor localization has a wide range of applications in factories, homes, shopping malls and other indoor environments. To achieve accurate indoor localization, visual localization technology is a feasible solution, especially estimating the camera's six-degree-of-freedom position and orientation in the world coordinate system from a single image captured by the camera, which is easy to operate. However, due to complex indoor environmental changes, and because a single image contains less information than a video, improving the accuracy, robustness and efficiency of a localization system based on a single image is still challenging. This thesis carries out a series of studies on visual localization of complex indoor scenes based on a single image. The main contributions are as follows:

1. A medium-sized test dataset of complex indoor scenes is constructed. In view of the problems that existing public datasets are small and their test samples have no ground-truth poses, this thesis uses the laser-visual SLAM scanning device Navvis to scan a total area of 8000 square meters over three floors of Wanda Plaza in Shijingshan, Beijing, obtaining 1567 panoramas and the corresponding encoded depth maps. In the meantime, a total station electronic rangefinder is used to assist the survey and improve the accuracy of the 3D point-cloud model. From each panorama, 36 perspective views with different viewing directions are generated by view synthesis, and the 3D point coordinates corresponding to each pixel of a perspective view are obtained from the 3D point-cloud model; a 3D scene database containing 56,412 images is thus constructed. In addition, 4701 images of the scene are collected with three different models of mobile phone as the test set. The intrinsic parameters of each phone are calibrated individually with the 2D checkerboard method, and the extrinsic parameters between the phone and the Navvis device are estimated with EPnP, which yields the ground-truth poses of the test images.
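The ground-truth pose pipeline rests on two standard steps: calibrating each phone's intrinsics from chessboard images and estimating its pose with EPnP from 2D-3D correspondences. The sketch below illustrates both steps in Python with OpenCV; the board size, square size, and the way the correspondences are supplied are illustrative assumptions rather than values taken from the thesis.

# Hypothetical sketch (Python/OpenCV): chessboard calibration of a phone's
# intrinsics, then EPnP for its pose relative to the scene model.
import numpy as np
import cv2

def calibrate_intrinsics(images, board_size=(9, 6), square_size=0.025):
    """Estimate camera intrinsics from chessboard photos (sizes are assumptions)."""
    # 3D corner coordinates in the chessboard's own frame (z = 0 plane).
    objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2) * square_size

    obj_points, img_points, image_size = [], [], None
    for img in images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        image_size = gray.shape[::-1]
        found, corners = cv2.findChessboardCorners(gray, board_size)
        if found:
            obj_points.append(objp)
            img_points.append(corners)
    _, K, dist, _, _ = cv2.calibrateCamera(obj_points, img_points, image_size, None, None)
    return K, dist

def estimate_pose_epnp(points_3d, points_2d, K, dist):
    """Recover a 6-DoF camera pose from 2D-3D correspondences with EPnP."""
    ok, rvec, tvec = cv2.solvePnP(points_3d, points_2d, K, dist, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix mapping world points into the camera frame
    return R, tvec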

2. A visual localization pipeline based on image retrieval is designed. In view of the low efficiency and low accuracy of current visual localization methods, the system first compares three mainstream image retrieval algorithms, BOW, Disloc and Inloc, analyzes their advantages and disadvantages, and selects the BOW model as the final image retrieval scheme. To further improve localization efficiency, GPU-accelerated RootSIFT feature extraction is introduced on top of BOW similarity retrieval. The retrieval results are re-ranked according to the RootSIFT feature matching results, and the 10 most similar images together with their feature matches are fed into the Perspective-n-Point localization algorithm to estimate the pose of the query image. Experimental results show that this method is more than 10 times faster than traditional algorithms, while offering high precision, high efficiency and strong robustness.
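To make the re-ranking and pose-estimation stage concrete, the following Python/OpenCV sketch computes RootSIFT descriptors, matches the query against a single retrieved database image with a ratio test, turns the matches into 2D-3D correspondences through the per-pixel point-cloud coordinates, and solves PnP with RANSAC. It is a minimal single-image illustration under stated assumptions, not the thesis implementation: BOW retrieval, GPU acceleration, and the exact thresholds are omitted, and db_points_3d (an H x W x 3 array of 3D coordinates for the database image) and the 0.8 ratio are assumptions.

# Hypothetical sketch (Python/OpenCV) of RootSIFT matching + PnP with RANSAC.
import numpy as np
import cv2

sift = cv2.SIFT_create()

def rootsift(gray):
    """SIFT keypoints with RootSIFT descriptors (L1-normalize, then square root)."""
    kps, desc = sift.detectAndCompute(gray, None)
    desc = desc / (desc.sum(axis=1, keepdims=True) + 1e-7)
    return kps, np.sqrt(desc).astype(np.float32)

def localize(query_gray, db_gray, db_points_3d, K):
    """Match the query against one retrieved database image and solve PnP."""
    q_kps, q_desc = rootsift(query_gray)
    d_kps, d_desc = rootsift(db_gray)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(q_desc, d_desc, k=2)
            if m.distance < 0.8 * n.distance]  # Lowe ratio test

    # Every database pixel has a 3D point from the point-cloud model, so each
    # 2D-2D match becomes a 2D-3D correspondence for the query image.
    pts_2d = np.float32([q_kps[m.queryIdx].pt for m in good])
    pts_3d = np.float32([db_points_3d[int(round(d_kps[m.trainIdx].pt[1])),
                                      int(round(d_kps[m.trainIdx].pt[0]))]
                         for m in good])

    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K, None,
                                                 flags=cv2.SOLVEPNP_EPNP)
    return rvec, tvec, inliers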

3. A complete indoor visual localization system based on a client-server (C/S) architecture is implemented. The localization time of the system is 200 ms, the position accuracy is 6 cm, the angular accuracy is 0.32°, and the localization success rate is 91.6%, which meets the needs of practical applications. The system includes the following modules: the client module, which captures the scene image, compresses and uploads it to the server, and receives the returned localization result; and the server module, which loads the 3D scene model reconstructed offline, performs similarity retrieval and pose estimation on the received client images, and sends the estimated pose back to the client.
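A minimal round trip consistent with this client-server description could look like the sketch below, assuming an HTTP transport and a JSON response (the abstract does not specify the protocol); retrieve_and_estimate is a hypothetical placeholder for the server-side retrieval and pose-estimation pipeline.

# Minimal client/server sketch of the C/S workflow (transport is an assumption).
import cv2
import numpy as np
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)

def retrieve_and_estimate(gray):
    """Hypothetical stand-in: run retrieval and PnP, return the 6-DoF pose."""
    rvec, tvec = np.zeros((3, 1)), np.zeros((3, 1))  # placeholder result
    return rvec, tvec

# ---- server side: receive the uploaded image, return the estimated pose ----
@app.route("/localize", methods=["POST"])
def localize_endpoint():
    buf = np.frombuffer(request.data, dtype=np.uint8)
    gray = cv2.imdecode(buf, cv2.IMREAD_GRAYSCALE)  # decode the uploaded JPEG
    rvec, tvec = retrieve_and_estimate(gray)
    return jsonify({"rotation": rvec.ravel().tolist(),
                    "translation": tvec.ravel().tolist()})

# ---- client side: compress the captured image and query the server ---------
def query_pose(image_bgr, server_url="http://localhost:8080/localize"):
    _, jpeg = cv2.imencode(".jpg", image_bgr, [cv2.IMWRITE_JPEG_QUALITY, 80])
    reply = requests.post(server_url, data=jpeg.tobytes(),
                          headers={"Content-Type": "application/octet-stream"})
    return reply.json()

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)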

Keywords: data acquisition, image retrieval, camera pose estimation, C/S architecture, indoor visual localization
Language: Chinese
Sub-direction classification (of seven major directions): 3D Vision
Document type: Degree thesis
Identifier: http://ir.ia.ac.cn/handle/173211/39255
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems, Robot Vision
Recommended citation (GB/T 7714):
王超. Research on a Visual Localization System for Complex Indoor Scenes Based on C/S Architecture [D]. Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences, 2020.
Files in this item: 基于CS架构的室内复杂场景视觉定位系统研 (14,603 KB), degree thesis, open access, license CC BY-NC-SA