Metasynthesis is a methodology for Open Complex Giant Systems (OCGSs) originally proposed by Chinese scientists. The Cyberspace for Workshop of Meta-synthetic Engineering (CWME) is a type of workspace that embodies this methodology. It is an intelligent system for solving complex problems through human-computer cooperation, and its key goal is to synthesize the wisdom of experts, the intelligence of computers, and all sorts of information and knowledge into a whole. In CWME, experts express their opinions and exchange domain knowledge through online discussion, supported by human knowledge from different disciplines and by the intelligence of the computer system. Web resources relevant to the discussion topic can greatly inspire the experts' creative thinking, provided they are introduced into CWME in a timely and precise manner, so how to organize and use this information is an important problem. As a component of the CWME system, the Active Information Retrieval Prototype System (AIRPS) provides experts in CWME with webpages from the Internet. However, the webpages collected by AIRPS sometimes include useless pages, and the collection process ignores the interests of the experts. We therefore studied webpage classification, webpage body text extraction, and an interest-based expert modeling method. More specifically, this paper addresses the following issues:

1. Webpage body text extraction with the ability to classify webpages as well. Because webpages on the Internet take many forms, the amount of information they carry differs greatly. Generally, webpages of category "topic" are the most helpful to experts in CWME. The content extracted from these pages is plain text, which can be recommended directly to relieve the time pressure on participants and is also easy for computers to process. This paper therefore proposes a webpage body text extraction method that meets this requirement.
Based on the characteristics of HTML pages, the method classifies a web page as "useful" or "useless" by analyzing two features: the ratio of the number of characters in anchor text to the number of characters in the whole page, and the number of anchor texts. It then extracts the body text of the "useful" pages with a hybrid algorithm that combines character statistics and HTML tag analysis. Experimental results showed that the proposed web page classification algorithm generally performed better than threshold-based methods. And the proposed w...
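The two classification features named above (the share of page characters that sit inside anchor text, and the number of anchor texts) can be illustrated with a minimal sketch. This is not the paper's implementation: the class `AnchorStats`, the function `anchor_features`, the choice to count only non-whitespace characters, and the decision to skip `script` and `style` content are all assumptions made for illustration. The sketch only computes the features; how the paper turns them into a "useful" or "useless" decision is not specified here and is left out.

```python
from html.parser import HTMLParser


class AnchorStats(HTMLParser):
    """Hypothetical helper: counts characters in anchor text and in all visible text."""

    SKIP = {"script", "style"}  # assumption: non-visible content is ignored

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while inside an <a> element
        self.skip_depth = 0     # > 0 while inside <script> or <style>
        self.anchor_chars = 0   # characters found inside anchor text
        self.total_chars = 0    # characters found in the whole page
        self.anchor_count = 0   # number of <a> elements seen

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "a":
            self.anchor_depth += 1
            self.anchor_count += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        # Assumption: whitespace is not counted as a character.
        n = len("".join(data.split()))
        self.total_chars += n
        if self.anchor_depth:
            self.anchor_chars += n


def anchor_features(html):
    """Return (anchor-character ratio, number of anchors) for an HTML string."""
    parser = AnchorStats()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0, parser.anchor_count
    return parser.anchor_chars / parser.total_chars, parser.anchor_count
```

A navigation or index page would typically yield a high ratio and many anchors, while an article page would yield a low ratio; a classifier could take these two numbers as input.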