Knowledge Commons of Institute of Automation,CAS
A WebPage Content Block Detection Method Based on Layout Features and Languages Features | |
Han Xianpei; Liu Kang | |
Journal | Chinese Journal of Computers |
2008 | |
Issue | 22; Pages: 15-21 |
Abstract | This paper analyzes the different feature types of web-page blocks and presents a web-page content block detection method based on layout features and language features, which effectively resolves the trade-off between detection accuracy and model generality across different types of web pages. The method represents a web page as a vision-block tree, builds two separate classifiers, one for the page's layout features and one for its language features, and combines the two classifiers using different strategies. Experimental results show that, with content block detection recall held above 90%, the combined classifiers' accuracy can reach 85%, which is 5% higher than the classifier using only layout features and 15% higher than the classifier using only language features. The results also show that the combined classifiers achieve good detection performance on five selected websites, which indicates that the method generalizes well. |
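Keywords | Web-page Cleaning |
Document type | Journal article |
Item identifier | http://ir.ia.ac.cn/handle/173211/40979 |
Collection | State Key Laboratory of Multimodal Artificial Intelligence Systems_Natural Language Processing |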
Recommended citation GB/T 7714 | Han Xianpei,Liu Kang,Zhao Jun. A WebPage Content Block Detection Method Based on Layout Features and Languages Features[J]. Chinese Journal of Computers,2008(22):15-21. |
APA | Han Xianpei,Liu Kang,&Zhao Jun.(2008).A WebPage Content Block Detection Method Based on Layout Features and Languages Features.Chinese Journal of Computers(22),15-21. |
MLA | Han Xianpei,et al."A WebPage Content Block Detection Method Based on Layout Features and Languages Features".Chinese Journal of Computers .22(2008):15-21. |
Files in this item | There are no files associated with this item. |
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.