This paper focuses on Web information mining and systematically presents the theory and methods of Web page mining. The goal is to manage and query the Web using an RDBMS. The basic idea is to apply data mining techniques to map semi-structured data into relational data and store it in a database. In addition, path traversal pattern mining and the discovery of topics of interest are discussed as support for decision making. First, the paper introduces the background of Web information mining and the value of its applications. An application model is proposed, and for every module in the model the corresponding techniques and application requirements are analyzed. Next, the approach to mining the structure of Web pages is presented and the mining algorithm is given. The algorithm can effectively mine the typical data structure of a domain from a group of Web document objects of the same type. Then, to discover domain information records in Web documents, we map each Web document into a tag-tree according to the nested hierarchy of its HTML tags. Based on the mapped tag-tree, we locate the domain information in the Web document with the support of a domain knowledge base and some heuristics. Extracting data from tables is discussed next: we analyze the HTML tags and attributes that affect table structure, and give the corresponding algorithm. We model the discovered domain information with the Object Exchange Model (OEM), and then mine the typical domain data structure from the OEM data. Mining the typical domain data structure allows storage space to be used effectively and reduces the number of tables and null fields. Next, the method for mapping semi-structured data into relational data is studied. The last two sections discuss the discovery of user access patterns and topics of interest. User access pattern mining is compared with association rule mining and the differences between them are analyzed, and an algorithm suitable for access pattern mining is proposed. Finally, the discovery of topics of interest is introduced only in general terms.
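
The tag-tree mapping mentioned above can be illustrated with a minimal sketch. This is not the paper's algorithm: it only shows one plausible way to turn the nested hierarchy of HTML tags into a tree, using Python's standard `html.parser` module. The names `TagNode`, `TagTreeBuilder`, `build_tag_tree`, and `outline` are hypothetical and chosen for illustration, and the handling of void and stray tags is an assumption, since the abstract does not specify it.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagNode:
    """One node of the tag-tree (hypothetical structure for illustration)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # non-blank text fragments found directly inside this tag


class TagTreeBuilder(HTMLParser):
    """Builds a tag-tree that mirrors the nesting of HTML tags."""

    def __init__(self):
        super().__init__()
        self.root = TagNode("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend: later tags nest under this one

    def handle_endtag(self, tag):
        # Walk up to the nearest open ancestor with this tag; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def build_tag_tree(html):
    """Parse an HTML string and return the root of its tag-tree."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as a list of indented tag names, one per node."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

Calling `outline(build_tag_tree(page))` on an HTML string `page` lists the tags with indentation that reflects their nesting. Locating domain information would then be a traversal of this tree guided by the domain knowledge base and heuristics, which this sketch does not attempt.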