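Online Change Detection for Datastreams in Nonstationary Environments
Author: Li Bu (卜丽)
Degree type: Doctor of Engineering
Advisor: Zhao Dongbin (赵冬斌)
2017-05-25
Degree-granting institution: University of Chinese Academy of Sciences
Place of conferral: Beijing
Keywords: nonstationary change; datastream; Just-In-Time learning; online change detection; least squares density difference estimation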
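Abstract: Most machine learning techniques currently assume that the underlying process is stationary, so that a model obtained in the initial training phase remains effective over the long term and its performance keeps meeting the designer's expectations. In real applications, however, internal and external factors such as sensor aging and drift, software and hardware faults, and changes in the monitored environment or platform mean that the system no longer satisfies the stationarity assumption; that is, a nonstationary change has occurred. Such changes drive the statistical properties of the data or the target variable in unpredictable directions over time, so predictions, classifications, or decisions based on the original model no longer suit the current process.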
 
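Learning in nonstationary environments is an active research topic in machine learning and has attracted wide attention. In particular, active learning, which comprises a change detection module and a model updating module, can detect changes promptly and update the model, effectively reducing the impact of nonstationary changes and safeguarding the model's learning and decision performance. However, many problems in learning in nonstationary environments remain unsolved, both in methodology and in application, such as adaptive model updating and database management. Furthermore, existing change detection methods often assume a form for the data distribution, require offline detection, or handle only one-dimensional data; they cannot perform online change detection on multidimensional datastreams, that is, they cannot cope with unknown data distributions, unknown parameters, or unknown change types. How to control computational complexity and reduce training and detection time so as to meet real-time requirements is another difficulty. Finally, the lack of theoretical proofs for change detection methods and the limits on their applicability also constrain research on, and application of, learning in nonstationary environments. This thesis therefore takes continuously distributed datastreams in nonstationary environments as its main object of study, designs a learning framework for them, and focuses on change detection methods with wider applicability and a firmer theoretical basis. The thesis comprises the following work and contributions: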
 
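1. For learning in nonstationary environments, a Just-In-Time (JIT) learning framework is introduced. It has an adaptive information management mechanism that takes into account both the performance gain from supervised data and the significant impact of concept drift. Two change detection methods and two classifier models are introduced, and experiments compare the performance of the four resulting JIT classification models and the influence of the change detection results on that performance.
2. Given the influence of the change detection method on the performance and applicability of the classification model in active learning, an online change detection method for datastreams based on least squares density difference (LSDD) estimation is proposed, the LSDD-based Change Detection Test (LSDD-CDT). A reservoir sampling algorithm is introduced to manage old and new samples in the data window, and a hierarchical change detection mechanism is proposed to estimate the change location more precisely. The method needs no prior knowledge and no assumption about the change type; it directly estimates the distribution difference between the earlier and later data windows, which makes it an online change detection method suited to datastreams. Comparative experiments on simulated datasets with different distribution and change types and on real applications confirm its wide applicability and detection accuracy.
3. To address the effect on performance of the randomness in selecting a limited number of samples for the reference set and of poorly chosen parameters, several ensemble-based change detection schemes are studied and proposed. They explore different ways of expressing diversity: how training subsets are obtained, how the reference set is updated, and how detection results are combined. Experiments show that ensemble-based change detection with sufficiently expressed diversity achieves higher detection accuracy than LSDD-CDT.
4. To address the lack of theoretical study of the preceding change detection methods, the thesis takes the lead in proving the distribution of the feature statistic and the effect of window size on the false positive and false negative rates. On this basis, a threshold adjustment strategy for an enlarged reference window is proposed, which makes full use of known information to update the threshold online without retraining the model, and an incremental, iterative computation of the feature statistic is introduced to speed up detection. The algorithm is compared extensively with other state-of-the-art change detection methods, with several hypothesis tests on detection accuracy; the results show wider applicability, better detection accuracy, and shorter computation time.
5. For change detection in massive-data settings, a method that combines the KS test with the least squares density difference is proposed. It separates the windows of the change detection layer from those of the feature extraction layer: KS-test-based change detection allows the window to be adjusted adaptively, while the LSDD-based feature extraction keeps to small windows that are fast to compute. Experiments confirm that the algorithm effectively controls the false positive rate and can detect small changes.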
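The reservoir sampling step named in contribution 2 can be sketched as follows. This is a minimal illustration of the standard reservoir sampling scheme (Algorithm R), not the thesis's own implementation, which this record does not show; the function name `reservoir_update` and its parameters are hypothetical.

```python
import random


def reservoir_update(reservoir, sample, n_seen, capacity, rng=random):
    """Offer one new stream sample to a fixed-capacity reservoir.

    reservoir: list holding the current window (modified in place)
    n_seen:    number of samples offered before this one
    Returns the updated count of samples offered so far.
    (Hypothetical helper for illustration only.)
    """
    if len(reservoir) < capacity:
        # Still filling the window: keep every sample.
        reservoir.append(sample)
    else:
        # Pick a slot uniformly among all n_seen + 1 samples seen so far;
        # the new sample replaces an old one with probability
        # capacity / (n_seen + 1).
        j = rng.randint(0, n_seen)  # both endpoints are inclusive
        if j < capacity:
            reservoir[j] = sample
    return n_seen + 1


# Usage: maintain a 50-sample window over a stream of 1000 values.
rng = random.Random(0)
window, seen = [], 0
for value in range(1000):
    seen = reservoir_update(window, value, seen, capacity=50, rng=rng)
```

Under this scheme every sample seen so far has the same chance of being in the window, so the window stays a uniform random subset of the stream while memory use stays fixed.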
Alternative abstract: Traditional learning models are generally designed under the assumption that the data-generating process is stationary, which implies that a model acquired during the training phase keeps working effectively over a long period. This assumption rarely holds in practice. Aging of the transducer's readout electronics, soft and hard faults in the sensor unit, and changes in the phenomenon under observation (e.g., a plant) all introduce changes in the process. Such concept drift means that the statistical properties of the target variable change over time in unforeseen ways, so application performance degrades unless adaptive strategies are adopted.
 
Learning in nonstationary environments is an active research topic in machine learning. In particular, active learning, which comprises change detection and model updating modules, can effectively reduce the influence of changes and improve the performance of learning models. However, many problems remain unsolved in the theory and application of active learning, such as adaptive model updating and database management. In addition, existing change detection methods can hardly deal with changes in multidimensional datastreams, since they either assume that the underlying data distribution is available or can only operate on scalar streams. Controlling computational complexity, so that computation time is low enough for online use, is a further difficulty. Finally, the lack of a theoretical foundation for change detection methods and the limits on their applicability have become key restrictions. In this thesis, we take datastreams in nonstationary conditions as the research object, design a learning framework, and propose pdf-free change detection tests with a comprehensive theoretical basis and wide applicability. The main contributions of this thesis fall into the following five parts.
 
1. We introduce a Just-In-Time learning framework that has an adaptive information management mechanism. It takes into account both the possible performance gain from newly arrived supervised information and the significant effect of concept drift. Two change detection tests (CDTs) and two classifiers are introduced, and the four resulting combinations are compared in a complete experimental setup, which also reveals how change detection performance influences the classifiers.
2. Given the importance of the change detection test in active learning, we propose an online test based on least squares density difference (LSDD) estimation. The test requires no assumptions about the underlying data distribution and, by adopting a reservoir sampling mechanism, can operate immediately after being configured. A hierarchical threshold mechanism is proposed to make the test more sensitive to changes and to estimate the change location more accurately. Comprehensive experiments on simulated and real data validate the proposed method in terms of both detection promptness and accuracy.
3. We propose a family of LSDD-based methods, built by applying different ensemble options to the basic CDT procedure, in order to obtain higher-performing change detection mechanisms and to investigate how these options influence detection performance. Experiments show that ensemble methods with sufficient diversity detect changes better than their ensemble-free counterpart.
4. To address the lack of a theoretical basis for the aforementioned change detection test, we take the lead in proving the distribution of the derived statistics and the influence of the window sizes on detection performance. As a consequence, the test can adaptively enlarge the window size to improve detection performance without any retraining phase. Furthermore, the proposed test operates online, with the required estimates and thresholds computed incrementally as fresh samples arrive. Comprehensive experiments validate the improved performance and effectiveness of the test in both detection promptness and accuracy.
5. We extend our work by combining the LSDD CDT with the Kolmogorov-Smirnov (KS) test. Independent windows are used to derive independent, one-dimensional LSDD values; the KS test is then applied to these statistics to detect possible changes, instead of comparing them with a threshold as usual. The method can work with small windows for LSDD, which reduces execution time. Experiments confirm that the method detects changes with a controllable false positive (FP) rate and can also detect small changes accurately.
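The LSDD quantity on which these tests rely can be illustrated with a short sketch. It follows the standard Gaussian-kernel least squares density difference estimator and is not taken from the thesis; the function name `lsdd`, the choice of every pooled sample as a kernel center, and the fixed defaults `sigma=1.0` and `lam=0.1` are assumptions made for illustration. How the thesis selects parameters and thresholds is not described in this record.

```python
import numpy as np


def lsdd(x, y, sigma=1.0, lam=0.1):
    """Estimate the squared L2 distance between the densities of x and y.

    x, y:  arrays of shape (n, d) and (m, d), or 1-D arrays for d = 1
    sigma: Gaussian kernel width (assumed fixed here)
    lam:   ridge regularization strength (assumed fixed here)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    centers = np.vstack([x, y])  # assumption: every sample is a center
    d = centers.shape[1]

    def sqdist(a, b):
        # Pairwise squared Euclidean distances, shape (len(a), len(b)).
        return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

    # H[l, l'] is the integral of the product of two Gaussian kernels.
    H = (np.pi * sigma**2) ** (d / 2) * np.exp(
        -sqdist(centers, centers) / (4 * sigma**2))
    # h[l] is the difference of the mean kernel responses of x and y.
    h = (np.exp(-sqdist(x, centers) / (2 * sigma**2)).mean(axis=0)
         - np.exp(-sqdist(y, centers) / (2 * sigma**2)).mean(axis=0))
    # Regularized least squares fit of the density-difference model.
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return float(2 * h @ theta - theta @ H @ theta)


# Usage: compare a reference window with an unchanged and a shifted window.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(200, 2))
unchanged = rng.normal(0.0, 1.0, size=(200, 2))
shifted = rng.normal(3.0, 1.0, size=(200, 2))
d_same = lsdd(reference, unchanged)
d_shift = lsdd(reference, shifted)
```

A change detection test built on this estimate would compare values such as `d_shift` against a threshold, or, as in contribution 5, feed a sequence of such values to a KS test; neither step is shown here.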
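Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/14841
Collection: Graduates_Doctoral Theses
Author affiliation: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences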
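Recommended citation
GB/T 7714
Li Bu. Online Change Detection for Datastreams in Nonstationary Environments [D]. Beijing: University of Chinese Academy of Sciences, 2017.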
Files in this item
File name/size: 终版-Thesis_LiBu.pdf (10095 KB)
Document type: Thesis
Access type: Not yet open
License: CC BY-NC-SA