|Thesis Advisor|谭民; 赵晓光|
|Place of Conferral|Institute of Automation, Chinese Academy of Sciences (中科院自动化所)|
|Keyword|Video Summarization; Robot Learning; Machine Vision; Robotic Arm Grasping Control|
With the continued advance of artificial intelligence, robots are gradually being deployed across many areas of industry, enhancing human productivity. Machine vision plays an important role in enabling robots to acquire and perceive information about the external environment, and it remains one of the key approaches to improving robot intelligence. However, the visual data a robot captures typically contain large amounts of redundancy and noise, which directly degrade both the robot's learning ability and its learning efficiency. Video Summarization (VS) techniques reduce this redundancy and noise by extracting key content from massive amounts of information; conditioning VS on text further allows a robot to efficiently retrieve specific information for multiple topics. This thesis explores the use of VS to support robot learning, giving robots the ability to autonomously and dynamically extract information from large amounts of multi-task data and to learn and complete a specific task, thereby raising the level of robot intelligence.
By simulating the human observational learning process, which starts at a general level and proceeds to progressively finer-grained levels, we propose to extract the key content of the videos captured by the robot through a hierarchical VS pipeline with three stages: general VS, which selects key frames containing information about different tasks; keyword-conditioned VS (with keywords describing specific tasks), which obtains task-related video segments and filters the content down to a specific task; and fine-grained object-level VS, which extracts object-motion clips restricted to instance-level results of that task. The robot then derives a rule from the hierarchical summarization results and learns to complete the specific task. The main contributions of the thesis are as follows (minimal illustrative sketches of these components appear after the list):
1. For the task of general VS, a dilated temporal relational adversarial network is proposed. This framework extracts key frames, removes redundancy and noise, and improves the accuracy of the summarization results compared with previous approaches. It first introduces a dilated temporal relational module, which models the global multi-scale temporal information of a video, to compensate for the limited memory of the Long Short-Term Memory network and to improve the quality of the temporal encoding. In the training phase, it further introduces a three-player adversarial objective, a richer loss that learns to encode the higher-order statistics of the ground-truth summaries and thereby improves performance, together with a supervised regularization term that makes the predicted summaries more robust. The learned model is evaluated on three public datasets, and the experimental results demonstrate the effectiveness of the proposed approach.
2. For the task of keyword-conditioned VS, a query-conditioned adversarial network is proposed that encodes each keyword as a task selector to guide the summary predictions. The approach introduces a query-conditioned three-player loss that jointly encodes visual and textual features, together with a length regularization term and a summary regularization term for adversarial training, so that the model produces summaries that are compact and match the input keyword. A second model based on deep reinforcement learning is also proposed for this task. It first introduces a mapping network from the visual space (frames) to the query (keyword) space, which yields distance measures between the two modalities. A deep reinforcement learning summarization network then integrates this mapping mechanism and combines three rewards, relatedness, representativeness, and diversity, to maximize the cumulative reward and guide the agent toward a correct sequence of decisions. Both learned models are evaluated on a public dataset, and the experimental results demonstrate the effectiveness of the proposed approaches.
3. For the task of fine-grained VS, an online motion autoencoder network is proposed that extracts the key moving targets and their key motion information from a given video, yielding finer-grained, instance-level summaries. The approach first applies a series of pre-processing steps, including object detection, multi-target tracking, motion segmentation, and context feature extraction, to obtain a feature representation for each object-motion clip. A stacked sparse Long Short-Term Memory autoencoder network then takes these clip representations as input and uses online dictionary learning to store information and update the model simultaneously. The reconstruction errors computed by the model are used to predict the key object-motion clips that form the summary. The learned model is evaluated on a surveillance dataset newly collected for the fine-grained VS task, and a modified version of the framework is additionally tested on available benchmark datasets for general VS. The experimental results demonstrate the effectiveness of the proposed approach.
4. For the task of robot learning, a robot learning method based on hierarchical VS is proposed. It applies general, keyword-conditioned, and fine-grained VS in sequence and then uses the extracted key information as knowledge to guide the robot in learning and completing a specific task, thereby simulating the human observational learning process, which progresses from a general to a finer-grained level. The framework first removes redundancy and keeps only the meaningful information via general summarization, then obtains the video segments related to a specific task via keyword-conditioned VS, and finally refines the results into key instance-level object-motion clips. The robot then learns a rule from these summaries that guides it to learn and complete the task. In this way, the robot can autonomously extract the required key knowledge, making it possible to learn different tasks in a more flexible and robust manner. The approach is evaluated on a robot platform, and the experimental results demonstrate its effectiveness on a table-tidying task.
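To illustrate contribution 1, the following is a minimal sketch of a dilated temporal relational block. It is not the thesis's exact architecture: the layer sizes, dilation rates, and the final per-frame scoring head are assumptions chosen only to show how dilated temporal convolutions can capture multi-scale relations among frame features.

```python
# Hypothetical sketch of a dilated temporal relational (DTR) block over frame features.
# Input shape is assumed to be (batch, T, feat_dim); all sizes are illustrative.
import torch
import torch.nn as nn

class DilatedTemporalRelation(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256, dilations=(1, 2, 4, 8)):
        super().__init__()
        # One temporal convolution per dilation rate captures frame relations at a different scale.
        self.branches = nn.ModuleList([
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        # Fuse the multi-scale responses into a single per-frame importance score.
        self.fuse = nn.Conv1d(hidden * len(dilations), 1, kernel_size=1)

    def forward(self, frames):              # frames: (batch, T, feat_dim)
        x = frames.transpose(1, 2)          # -> (batch, feat_dim, T) for Conv1d
        multi_scale = [torch.relu(branch(x)) for branch in self.branches]
        scores = self.fuse(torch.cat(multi_scale, dim=1))    # (batch, 1, T)
        return torch.sigmoid(scores).squeeze(1)              # per-frame scores in [0, 1]
```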
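For the reinforcement learning model in contribution 2, the sketch below shows one plausible way to combine the three rewards (relatedness, representativeness, and diversity) over frames already mapped into the query space. The specific similarity and distance formulas, the equal weighting, and the function signature are assumptions made for illustration, not the thesis's definition of the rewards.

```python
# Illustrative combination of the three rewards used to score a set of selected frames.
import torch
import torch.nn.functional as F

def combined_reward(frame_emb, query_emb, selected_idx):
    """frame_emb: (T, d) frame features mapped into the query space;
    query_emb: (d,) keyword embedding; selected_idx: indices chosen by the agent."""
    sel = F.normalize(frame_emb[selected_idx], dim=1)         # (k, d)
    q = F.normalize(query_emb, dim=0)                         # (d,)

    # Relatedness: selected frames should be close to the query embedding.
    r_rel = (sel @ q).mean()

    # Representativeness: every frame should be near some selected frame
    # (negative mean distance to the nearest selected frame).
    all_f = F.normalize(frame_emb, dim=1)
    r_rep = -torch.cdist(all_f, sel).min(dim=1).values.mean()

    # Diversity: selected frames should be dissimilar to one another.
    k = sel.shape[0]
    if k > 1:
        sim = sel @ sel.t()
        r_div = 1.0 - sim[~torch.eye(k, dtype=torch.bool)].mean()
    else:
        r_div = sel.new_tensor(0.0)

    return r_rel + r_rep + r_div
```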
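For contribution 3, the following simplified sketch shows how reconstruction error from an LSTM autoencoder can be used to score object-motion clips. It uses a single-layer autoencoder and omits the stacking, sparsity, and online dictionary learning of the actual method; feature dimensions and the scoring function are assumptions.

```python
# Simplified sketch: score object-motion clips by LSTM-autoencoder reconstruction error.
import torch
import torch.nn as nn

class MotionAutoencoder(nn.Module):
    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, feat_dim, batch_first=True)

    def forward(self, clips):                 # clips: (batch, T, feat_dim)
        z, _ = self.encoder(clips)            # compressed temporal representation
        recon, _ = self.decoder(z)            # reconstruction of the input features
        return recon

def clip_scores(model, clips):
    """Higher reconstruction error marks a clip as less routine, i.e. a candidate key clip."""
    with torch.no_grad():
        recon = model(clips)
        return ((recon - clips) ** 2).mean(dim=(1, 2))   # one error value per clip
```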
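Finally, the hierarchical pipeline of contribution 4 can be summarized at a high level as below. The three summarizer interfaces and the rule-extraction step are hypothetical placeholders standing in for the models described above, not the thesis's actual API.

```python
# High-level sketch of the hierarchical summarization pipeline driving robot learning.
def hierarchical_summarize(video_frames, keyword,
                           general_vs, keyword_vs, object_vs, learn_rule):
    # 1. General VS: drop redundant/noisy frames, keep key frames covering all tasks.
    key_frames = general_vs(video_frames)

    # 2. Keyword-conditioned VS: keep only the segments related to the requested task.
    task_segments = keyword_vs(key_frames, keyword)

    # 3. Fine-grained object-level VS: extract instance-level object-motion clips.
    object_clips = object_vs(task_segments)

    # 4. Robot learning: derive an executable rule (e.g. a grasp-and-place order) from the clips.
    return learn_rule(object_clips)
```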
|张宇佳 (Zhang Yujia). Research on Robot Learning Methods Based on Video Summarization [D]. Institute of Automation, Chinese Academy of Sciences, 2019.|