|Thesis Advisor|谭民; 赵晓光|
|Place of Conferral|Institute of Automation, Chinese Academy of Sciences (中科院自动化所)|
|Keyword|Video Summarization; Robot Learning; Machine Vision; Robotic Arm Grasping Control|
With the continued advance of artificial intelligence, robots are gradually being deployed across many areas of industry, enhancing human productivity. Machine vision plays an important role in enabling robots to acquire and perceive information about the external environment, and it remains one of the key approaches to improving robot intelligence. However, the visual data a robot captures typically contain large amounts of redundancy and noise, which directly degrade both the robot's learning ability and its learning efficiency. Video Summarization (VS) techniques reduce this redundancy and noise by extracting key content from massive amounts of information; conditioning VS on text further allows a robot to efficiently retrieve specific information for multiple topics. This thesis explores the use of VS to support robot learning, giving robots the ability to autonomously and dynamically extract information from large amounts of multi-task data and to learn and complete a specific task, thereby raising the level of robot intelligence.
By simulating the human observational learning process, which starts at a general level and proceeds to progressively finer-grained levels, we propose to extract the key content of the videos captured by the robot through a hierarchical VS pipeline with three stages: general VS, which selects key frames containing information about different tasks; keyword-conditioned VS (with keywords describing specific tasks), which obtains task-related video segments and filters the content down to a specific task; and fine-grained object-level VS, which extracts object-motion clips restricted to instance-level results of that task. The robot then derives a rule from the hierarchical summarization results and learns to complete the specific task. The main contributions of the thesis are as follows (minimal illustrative sketches of these components appear after the list):
1. For the task of general VS, a dilated temporal relational adversarial network is proposed. This framework extracts key frames, removes redundancy and noise, and improves the accuracy of the summarization results compared with previous approaches. It first introduces a dilated temporal relational module, which models the global multi-scale temporal information of a video, to compensate for the limited memory of the Long Short-Term Memory network and to improve the quality of the temporal encoding. In the training phase, it further introduces a three-player adversarial objective, a richer loss that learns to encode the higher-order statistics of the ground-truth summaries and thereby improves performance, together with a supervised regularization term that makes the predicted summaries more robust. The learned model is evaluated on three public datasets, and the experimental results demonstrate the effectiveness of the proposed approach.
2. For the task of keyword-conditioned VS, a query-conditioned adversarial network is proposed that encodes each keyword as a task selector to guide the summary predictions. The approach introduces a query-conditioned three-player loss that jointly encodes visual and textual features, together with a length regularization term and a summary regularization term for adversarial training, so that the model produces summaries that are compact and match the input keyword. A second model based on deep reinforcement learning is also proposed for this task. It first introduces a mapping network from the visual space (frames) to the query (keyword) space, which yields distance measures between the two modalities. A deep reinforcement learning summarization network then integrates this mapping mechanism and combines three rewards, relatedness, representativeness, and diversity, to maximize the cumulative reward and guide the agent toward a correct sequence of decisions. Both learned models are evaluated on a public dataset, and the experimental results demonstrate the effectiveness of the proposed approaches.
3. For the task of fine-grained VS, an online motion autoencoder network is proposed that extracts the key moving targets and their key motion information from a given video, yielding finer-grained, instance-level summaries. The approach first applies a series of pre-processing steps, including object detection, multi-target tracking, motion segmentation, and context feature extraction, to obtain a feature representation for each object-motion clip. A stacked sparse Long Short-Term Memory autoencoder network then takes these clip representations as input and uses online dictionary learning to store information and update the model simultaneously. The reconstruction errors computed by the model are used to predict the key object-motion clips that form the summary. The learned model is evaluated on a surveillance dataset newly collected for the fine-grained VS task, and a modified version of the framework is additionally tested on available benchmark datasets for general VS. The experimental results demonstrate the effectiveness of the proposed approach.
4. For the task of robot learning, a robot learning method based on hierarchical VS is proposed. It applies general, keyword-conditioned, and fine-grained VS in sequence and then uses the extracted key information as knowledge to guide the robot in learning and completing a specific task, thereby simulating the human observational learning process, which progresses from a general to a finer-grained level. The framework first removes redundancy and keeps only the meaningful information via general summarization, then obtains the video segments related to a specific task via keyword-conditioned VS, and finally refines the results into key instance-level object-motion clips. The robot then learns a rule from these summaries that guides it to learn and complete the task. In this way, the robot can autonomously extract the required key knowledge, making it possible to learn different tasks in a more flexible and robust manner. The approach is evaluated on a robot platform, and the experimental results demonstrate its effectiveness on a table-tidying task.
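To illustrate contribution 1, the following is a minimal sketch of a dilated temporal relational block. It is not the thesis's exact architecture: the layer sizes, dilation rates, and the final per-frame scoring head are assumptions chosen only to show how dilated temporal convolutions can capture multi-scale relations among frame features.

```python
# Hypothetical sketch of a dilated temporal relational (DTR) block over frame features.
# Input shape is assumed to be (batch, T, feat_dim); all sizes are illustrative.
import torch
import torch.nn as nn

class DilatedTemporalRelation(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256, dilations=(1, 2, 4, 8)):
        super().__init__()
        # One temporal convolution per dilation rate captures frame relations at a different scale.
        self.branches = nn.ModuleList([
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        # Fuse the multi-scale responses into a single per-frame importance score.
        self.fuse = nn.Conv1d(hidden * len(dilations), 1, kernel_size=1)

    def forward(self, frames):              # frames: (batch, T, feat_dim)
        x = frames.transpose(1, 2)          # -> (batch, feat_dim, T) for Conv1d
        multi_scale = [torch.relu(branch(x)) for branch in self.branches]
        scores = self.fuse(torch.cat(multi_scale, dim=1))    # (batch, 1, T)
        return torch.sigmoid(scores).squeeze(1)              # per-frame scores in [0, 1]
```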
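For the reinforcement learning model in contribution 2, the sketch below shows one plausible way to combine the three rewards (relatedness, representativeness, and diversity) over frames already mapped into the query space. The specific similarity and distance formulas, the equal weighting, and the function signature are assumptions made for illustration, not the thesis's definition of the rewards.

```python
# Illustrative combination of the three rewards used to score a set of selected frames.
import torch
import torch.nn.functional as F

def combined_reward(frame_emb, query_emb, selected_idx):
    """frame_emb: (T, d) frame features mapped into the query space;
    query_emb: (d,) keyword embedding; selected_idx: indices chosen by the agent."""
    sel = F.normalize(frame_emb[selected_idx], dim=1)         # (k, d)
    q = F.normalize(query_emb, dim=0)                         # (d,)

    # Relatedness: selected frames should be close to the query embedding.
    r_rel = (sel @ q).mean()

    # Representativeness: every frame should be near some selected frame
    # (negative mean distance to the nearest selected frame).
    all_f = F.normalize(frame_emb, dim=1)
    r_rep = -torch.cdist(all_f, sel).min(dim=1).values.mean()

    # Diversity: selected frames should be dissimilar to one another.
    k = sel.shape[0]
    if k > 1:
        sim = sel @ sel.t()
        r_div = 1.0 - sim[~torch.eye(k, dtype=torch.bool)].mean()
    else:
        r_div = sel.new_tensor(0.0)

    return r_rel + r_rep + r_div
```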
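For contribution 3, the following simplified sketch shows how reconstruction error from an LSTM autoencoder can be used to score object-motion clips. It uses a single-layer autoencoder and omits the stacking, sparsity, and online dictionary learning of the actual method; feature dimensions and the scoring function are assumptions.

```python
# Simplified sketch: score object-motion clips by LSTM-autoencoder reconstruction error.
import torch
import torch.nn as nn

class MotionAutoencoder(nn.Module):
    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, feat_dim, batch_first=True)

    def forward(self, clips):                 # clips: (batch, T, feat_dim)
        z, _ = self.encoder(clips)            # compressed temporal representation
        recon, _ = self.decoder(z)            # reconstruction of the input features
        return recon

def clip_scores(model, clips):
    """Higher reconstruction error marks a clip as less routine, i.e. a candidate key clip."""
    with torch.no_grad():
        recon = model(clips)
        return ((recon - clips) ** 2).mean(dim=(1, 2))   # one error value per clip
```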
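Finally, the hierarchical pipeline of contribution 4 can be summarized at a high level as below. The three summarizer interfaces and the rule-extraction step are hypothetical placeholders standing in for the models described above, not the thesis's actual API.

```python
# High-level sketch of the hierarchical summarization pipeline driving robot learning.
def hierarchical_summarize(video_frames, keyword,
                           general_vs, keyword_vs, object_vs, learn_rule):
    # 1. General VS: drop redundant/noisy frames, keep key frames covering all tasks.
    key_frames = general_vs(video_frames)

    # 2. Keyword-conditioned VS: keep only the segments related to the requested task.
    task_segments = keyword_vs(key_frames, keyword)

    # 3. Fine-grained object-level VS: extract instance-level object-motion clips.
    object_clips = object_vs(task_segments)

    # 4. Robot learning: derive an executable rule (e.g. a grasp-and-place order) from the clips.
    return learn_rule(object_clips)
```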
|张宇佳 (Zhang Yujia). Research on Robot Learning Methods Based on Video Summarization [D]. Institute of Automation, Chinese Academy of Sciences, 2019.|