The Design and Implementation of Massive Structured Data Fast Query System

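	The Design and Implementation of Massive Structured Data Fast Query System (海量结构化数据快速检索系统的设计与实现)
Alternative title	The Design and Implementation of Massive Structured Data Fast Query System
	An Fengchun (安丰春)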
	2014-05-20
Degree type	Master of Engineering
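Abstract (translated from Chinese)	The big data era severely tests our ability to store and process data, and the many different workloads in a big data environment call for different storage and processing platforms. Although almost all current platforms are built on distributed environments, no single platform suits every business need; to make the most of available resources, different application platforms must be developed for different scenarios. The massive structured data fast query system proposed in this thesis targets fast multi-condition retrieval over massive data. Today's mainstream KV databases can retrieve data only by row key, and mainstream data warehouse platforms are best at analysis, statistics, and mining performed after a brute-force scan of the whole table. This system focuses on multi-condition retrieval, and its extension features let it be combined easily with mainstream data warehouse systems, so that it also supports data analysis, statistics, and mining. The main work is as follows: (1) Partitioned organization of data tables: each table is divided into small tables, which distributes storage and computation across different data nodes. Replica redundancy among the small tables ensures reliability. The system converts a query on a table into queries on all of its small tables. (2) Multi-condition retrieval within a small table: the system exposes an SQL-like (Structured Query Language) query interface; it analyzes the retrieval conditions of a query to generate a syntax tree, derives the final retrieval execution tree from it, and uses an inverted index to perform multi-condition retrieval on the small table. (3) A multi-master mechanism for the cluster: the system supports multiple master nodes and uses an election mechanism so that, when the master serving external requests goes down, another master can take over the cluster promptly, minimizing the impact on the business. (4) Extension support for MapReduce (the distributed computing platform of the Hadoop ecosystem) and Hive (the data warehouse platform of the Hadoop ecosystem): these extensions let the system also support data analysis, statistics, and mining. Testing shows that the system can answer multi-condition queries and can carry out computing tasks on the data through the MapReduce and Hive platforms.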
English abstract	The Big Data era is greatly challenging people's capacity to store and process data, and the many different businesses in the Big Data field need different platforms. Although nearly all current platforms are based on distributed environments, no single platform can serve every business need. In order to maximize the use of resources, we need to develop different platforms for different application scenarios. This thesis presents the Massive Structured Data Fast Query System, which is designed to address fast retrieval of Big Data under compound (multi-condition) queries. Current popular KV databases can retrieve data only by the primary key, and current popular data warehouse platforms are good mainly at data analysis, statistics, and data mining, all of which are based on a brute-force scan of the whole table. This thesis focuses on compound queries and provides extended functionality that enables the system to be easily integrated with popular data warehouse systems, so that it can also support data analysis, statistics, and data mining. The specific work is as follows: (1) Data organization: Every table is partitioned into many tablelets, which can be stored on different data nodes. The reliability of the data is ensured by replication. When a table is queried, the system broadcasts the query to all tablelets of the table. (2) Compound query patterns for a tablelet: The system supports an SQL-like (Structured Query Language) language as its query interface. It first analyzes the query pattern and generates a syntax tree, and then transforms the syntax tree into the final retrieval execution tree. (3) Multi-master nodes for the cluster: The system supports multiple master nodes with an election mechanism. Once the main master crashes, the election mechanism ensures that another master can quickly take over the cluster and become the new main master. This minimizes the impact on the business. (4) Extended functionality for MapReduce (the distributed computing platform of the Hadoop ecosystem) and Hive (the data warehouse platform of the Hadoop ecosystem): This enables the system to support data analysis, statistics, and data mining. The tests show that the Massive Structured Data Fast Query System can quickly respond to compound query patterns and can also complete computing tasks through the MapReduce platform or the Hive platform.
Keywords	Big Data; Distributed System; Hadoop; Inverted Index; Query
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/7735
Collection	Graduates_Master's Theses
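Recommended citation (GB/T 7714)	An Fengchun. The Design and Implementation of Massive Structured Data Fast Query System[D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2014.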
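
The abstract describes a table split into tablelets, a query broadcast to every tablelet, and multi-condition retrieval inside each tablelet through an execution tree over an inverted index. The following is a minimal sketch of that idea only, not the thesis's implementation: the names `Tablelet`, `evaluate`, and `query_table`, the tuple-based condition tree, and the sample rows are all hypothetical, and the condition tree is built by hand rather than parsed from an SQL-like statement.

```python
from collections import defaultdict


class Tablelet:
    """Hypothetical partition of a table with one inverted index per column."""

    def __init__(self, rows):
        self.rows = list(rows)
        # index[column][value] -> set of row ids that hold that value
        self.index = defaultdict(lambda: defaultdict(set))
        for row_id, row in enumerate(self.rows):
            for column, value in row.items():
                self.index[column][value].add(row_id)

    def evaluate(self, node):
        """Walk a condition tree and return the ids of the matching rows."""
        op = node[0]
        if op == "eq":
            _, column, value = node
            # .get avoids creating empty index entries for unknown keys
            return set(self.index.get(column, {}).get(value, set()))
        if op == "and":
            # assumes at least one child condition
            return set.intersection(*(self.evaluate(c) for c in node[1:]))
        if op == "or":
            return set.union(*(self.evaluate(c) for c in node[1:]))
        raise ValueError(f"unknown operator: {op}")

    def query(self, node):
        return [self.rows[i] for i in sorted(self.evaluate(node))]


def query_table(tablelets, node):
    """Send the same condition tree to every tablelet and merge the results."""
    results = []
    for tablelet in tablelets:
        results.extend(tablelet.query(node))
    return results


# Made-up sample data: one table stored as two tablelets.
tablelets = [
    Tablelet([
        {"city": "Beijing", "year": 2013, "type": "A"},
        {"city": "Shanghai", "year": 2014, "type": "B"},
    ]),
    Tablelet([
        {"city": "Beijing", "year": 2014, "type": "B"},
        {"city": "Beijing", "year": 2014, "type": "A"},
    ]),
]

# Roughly: WHERE city = 'Beijing' AND (year = 2014 OR type = 'A')
condition = ("and",
             ("eq", "city", "Beijing"),
             ("or", ("eq", "year", 2014), ("eq", "type", "A")))

matches = query_table(tablelets, condition)
print(len(matches))
```

In this sketch each leaf condition is answered by an index lookup and the AND/OR nodes combine row-id sets, so no tablelet is scanned row by row. Replication, master election, and the MapReduce/Hive extensions mentioned in the abstract are outside its scope.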

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_2011E800906101 (1654KB)			Not yet open	CC BY-NC-SA