Content Extraction from Deep Web Query Result Pages Using Visual and Tag Information
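Funding:

National Natural Science Foundation of China (61103114); Chongqing Higher Education Teaching Reform Research Key Project (112023); Fundamental Research Funds for the Central Universities (CDJXS11181164); Phase III Construction of the "211 Project" (S 10218)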


English title: Combining vision information and tag information to extract Deep Web result pages content
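    Abstract (translated from the Chinese):

    This paper proposes DVS, a method that combines a page's visual information and tag information to extract its content structure. DVS first analyzes the page's CSS style information and DOM tree to obtain visual and tag information, and builds an initial visual tree of the page. It then applies a tree path similarity algorithm that considers both tag information and visual information to compute the similarity of blocks in the tree, clusters the blocks, and produces the final visual tree, that is, the content structure of the page. The main features of DVS are that it extracts the content structure from both visual and tag information, and that it represents visual information as a tree, turning the analysis of visual information into the analysis of a "visual attribute" tree. Experiments on the UIUC TEL data set compare DVS with the WTS and VIPS algorithms; the proposed algorithm achieves higher accuracy.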
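    The first step, building an initial tree that carries both tag and visual information, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes visual attributes come only from inline `style` attributes (a real system would resolve the full CSS cascade), and the names `Node`, `parse_style`, `VisualTreeBuilder`, and `build_visual_tree` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One tree node holding tag information and visual attributes."""

    def __init__(self, tag, style):
        self.tag = tag          # tag information, e.g. "div"
        self.style = style      # visual information, e.g. {"color": "red"}
        self.children = []


def parse_style(text):
    """Turn 'color: red; font-size: 12px' into a dict of declarations."""
    style = {}
    for decl in text.split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            style[name.strip().lower()] = value.strip().lower()
    return style


class VisualTreeBuilder(HTMLParser):
    """Builds a Node tree from HTML. Assumption: inline styles only."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", {})
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parse_style(dict(attrs).get("style") or ""))
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_visual_tree(html):
    builder = VisualTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

    Calling `build_visual_tree` on a result page's HTML returns a root node whose descendants each expose `tag`, `style`, and `children`, which is the kind of structure a later similarity and clustering step could operate on.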

    Abstract:

    Extracting content from Deep Web pages is challenging because of the intricate underlying structure of such pages. This paper proposes DVS, an approach based on vision and tag information that uses both kinds of information from Deep Web result pages to extract their content structure. The approach has two steps. First, vision information and tag information are obtained by analyzing the Cascading Style Sheets (CSS) and the DOM tree, yielding an initial visual tree of the Deep Web result page. Second, the Path Shingle (PS) algorithm, taking both vision and tag information into account, computes the similarity of the blocks in the visual tree; the blocks are then clustered by similarity to produce the final visual tree, i.e., the content structure of the page. DVS has two distinguishing features: it uses both vision information and tag information to extract the content structure, and it stores vision information as a tree, so that analyzing vision information becomes analyzing a vision attribute tree. Experiments were conducted on UIUC's TEL, a large set of Web databases. The results show that DVS achieves higher precision than the WTS algorithm and the VIPS algorithm.
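    The second step, comparing blocks by their paths and clustering similar ones, can be sketched as follows. The abstract does not give the details of the Path Shingle algorithm, so this sketch substitutes generic w-shingling with Jaccard similarity and a simple greedy clustering pass. The block representation (`tag_paths`, `visual_paths`), the weight `alpha`, the window size `w`, and the `threshold` value are all assumptions made for illustration.

```python
def shingles(paths, w=2):
    """Return the set of length-w windows over every root-to-leaf path."""
    out = set()
    for path in paths:
        if len(path) < w:
            out.add(tuple(path))  # keep short paths as a single shingle
        for i in range(len(path) - w + 1):
            out.add(tuple(path[i:i + w]))
    return out


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def block_similarity(block_a, block_b, alpha=0.5, w=2):
    """Weighted mix of tag-path and visual-path similarity (alpha is assumed)."""
    tag_sim = jaccard(shingles(block_a["tag_paths"], w),
                      shingles(block_b["tag_paths"], w))
    vis_sim = jaccard(shingles(block_a["visual_paths"], w),
                      shingles(block_b["visual_paths"], w))
    return alpha * tag_sim + (1 - alpha) * vis_sim


def cluster_blocks(blocks, threshold=0.6, alpha=0.5, w=2):
    """Greedy clustering: join the first cluster whose representative
    (its first member) is similar enough, otherwise start a new cluster."""
    clusters = []
    for i, block in enumerate(blocks):
        for cluster in clusters:
            if block_similarity(blocks[cluster[0]], block, alpha, w) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Two result records with the same layout, and one unrelated block.
record_1 = {"tag_paths": [["div", "a"], ["div", "span"]],
            "visual_paths": [["block", "bold"], ["block", "gray"]]}
record_2 = {"tag_paths": [["div", "a"], ["div", "span"]],
            "visual_paths": [["block", "bold"], ["block", "gray"]]}
banner = {"tag_paths": [["table", "tr", "td"]],
          "visual_paths": [["wide", "wide", "small"]]}

groups = cluster_blocks([record_1, record_2, banner])
```

    Here the two records land in one cluster and the banner in another. Scoring tag paths and visual paths separately and then mixing them keeps the sketch in line with the abstract's point that both kinds of information contribute to the similarity.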

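Cite this article

Feng Yong, Tang Li. Combining vision information and tag information to extract Deep Web result pages content[J]. Journal of Chongqing University, 2012, 35(6): 117-124.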
