Y. Wu (PRC) and H. Yokota (Japan)
Information Storage and Retrieval, Web Page Analysis, Table, List
Mining the Web for desired information has been a hot topic in recent years. Reflecting the way people write, Web pages contain all kinds of tables and lists, and these hold a great deal of useful relational information. Analyzing and recognizing them is therefore an important task in Web content mining. In this paper, we present a recognition method based on logical structure analysis. It can recognize all kinds of tables and lists, even when they are marked up with different HTML tags. We give a formal definition of a “repeated structure” to describe the logical structural characteristics of tables and lists on the Web, design a special data structure, called the WPS-tree, for Web page analysis, and then develop an algorithm for constructing the tree and an algorithm for searching for repeated structures. Finally, we report our experiments.
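To make the idea of tag-independent recognition concrete, the following is a minimal sketch and not the method of the paper: the abstract does not define the WPS-tree or the formal "repeated structure", so the sketch substitutes a plain tag tree and treats a run of adjacent sibling subtrees with identical shape as a repeated structure. The names `Node`, `TreeBuilder`, `find_repeated`, and `repeated_structures` are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not become "open" parents.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a plain tag tree (an assumption, not the paper's WPS-tree)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def signature(self):
        # Shape of the subtree: tag name plus the shapes of all children.
        # Text content and attributes are deliberately ignored.
        inner = ",".join(child.signature() for child in self.children)
        return f"{self.tag}({inner})"


class TreeBuilder(HTMLParser):
    """Builds the tag tree from start and end tags only."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def find_repeated(node, min_repeat=2, found=None):
    """Collect (parent tag, child tag, run length) for runs of same-shaped siblings."""
    if found is None:
        found = []
    kids = node.children
    sigs = [kid.signature() for kid in kids]
    i = 0
    while i < len(sigs):
        j = i
        while j + 1 < len(sigs) and sigs[j + 1] == sigs[i]:
            j += 1
        if j - i + 1 >= min_repeat:
            found.append((node.tag, kids[i].tag, j - i + 1))
        i = j + 1
    for kid in kids:
        find_repeated(kid, min_repeat, found)
    return found


def repeated_structures(html, min_repeat=2):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return find_repeated(builder.root, min_repeat)
```

Under these assumptions, `repeated_structures("<ul><li><b>a</b></li><li><b>b</b></li><li><b>c</b></li></ul>")` reports one run of three `li` items under `ul`, and a list written as `<div><p><a></a></p><p><a></a></p></div>` is reported in the same way even though it uses no list tags. The sketch is deliberately naive: it only matches exactly identical adjacent subtrees, so it misses patterns that repeat with a period longer than one sibling, and it does not repair HTML with omitted closing tags.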