Abstract: Extracting content from Deep Web pages is challenging because of the intricate underlying structure of such pages. This paper proposes DVS, an approach based on vision and tags that uses the visual information and tag information on Deep Web result pages to extract the content structure of those pages. The approach has two steps. First, visual and tag information is obtained by analyzing the Cascading Style Sheets and the DOM tree, and is used to generate an initial visual tree of the Deep Web result page. Second, the Path Shingle (PS) algorithm, which considers both the visual and the tag information, computes the similarity between blocks in the visual tree; the blocks are then clustered according to these similarities to produce the final visual tree, i.e., the content structure of the page. DVS makes two contributions: it uses both visual and tag information on Deep Web pages to extract the content structure, and it stores the visual information as a tree, so that analyzing the visual information becomes analyzing a vision attribute tree. Experiments are conducted on a large set of Web databases called UIUC’s TEL. The results show that the approach achieves higher precision than the WTS and VIPS algorithms.
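
The second step, similarity computation followed by clustering, can be illustrated with a minimal sketch. The abstract does not specify the details of the PS algorithm, so everything below is an assumption made for illustration: tag paths are compared by generic k-shingling with Jaccard similarity, visual attributes are compared by the fraction of matching style properties, and the names `shingles`, `block_similarity`, `cluster_blocks`, the weight `tag_weight`, and the threshold `0.7` are hypothetical and not taken from DVS.

```python
from typing import Dict, List, Sequence, Set, Tuple


def shingles(path: Sequence[str], k: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous k-tag windows (shingles) of a tag path."""
    if len(path) < k:
        # Paths shorter than k yield a single shingle (or none if empty).
        return {tuple(path)} if path else set()
    return {tuple(path[i:i + k]) for i in range(len(path) - k + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def block_similarity(b1: Dict, b2: Dict, tag_weight: float = 0.5, k: int = 2) -> float:
    """Weighted mix of tag-path similarity and visual-attribute similarity.

    Assumed block layout (hypothetical): {"path": [tags...], "style": {prop: value}}.
    """
    tag_sim = jaccard(shingles(b1["path"], k), shingles(b2["path"], k))
    keys = set(b1["style"]) | set(b2["style"])
    if keys:
        same = sum(1 for key in keys if b1["style"].get(key) == b2["style"].get(key))
        vis_sim = same / len(keys)
    else:
        vis_sim = 1.0
    return tag_weight * tag_sim + (1.0 - tag_weight) * vis_sim


def cluster_blocks(blocks: List[Dict], threshold: float = 0.7) -> List[List[int]]:
    """Greedy clustering: a block joins the first cluster whose first member
    is at least `threshold` similar to it, otherwise it starts a new cluster."""
    clusters: List[List[int]] = []
    for idx, block in enumerate(blocks):
        for cluster in clusters:
            if block_similarity(blocks[cluster[0]], block) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


if __name__ == "__main__":
    record = {"path": ["html", "body", "div", "ul", "li"],
              "style": {"font-size": "12px", "color": "#000"}}
    banner = {"path": ["html", "body", "div", "table", "tr"],
              "style": {"font-size": "16px", "color": "#000"}}
    print(cluster_blocks([record, dict(record), banner]))
```

In this sketch, two blocks that share the same tag path and style fall into one cluster, while a block with a different path suffix and font size is placed in a cluster of its own. The greedy single-pass grouping is chosen only for brevity; the clustering procedure actually used by DVS may differ.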