OLERA: OnLine Extraction Rule Analysis for Semi-structured Documents

C.-H. Chang and S.-C. Kuo (Taiwan)

Keywords

information extraction, semi-structured documents, string alignment, approximate matching

Abstract

Information extraction (IE) from semi-structured Web documents plays an important role for a variety of information agents. Over the past decade, researchers have developed a rich family of generic IE techniques based on supervised approaches, which learn extraction rules from user-labelled training examples. However, annotating training data can be expensive when data must be extracted from many sources. In this article, we introduce annotation-free IE using pattern mining and string alignment techniques. We describe OLERA, a semi-supervised IE system that produces extraction rules by aligning similar contents of multiple input records and presents the result in a spreadsheet-like table. Users therefore do not need to annotate the input documents; they only need to specify the scheme for the extracted data after the extraction pattern is discovered. Another advantage is that this approach works not only for multi-record Web pages (a limitation of some unsupervised IE approaches) but also for single-record Web pages.
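
As a rough illustration of the alignment idea described above, the following minimal sketch aligns two tokenized records column by column and reports which columns vary between them. It is not OLERA's algorithm: it uses Python's standard `difflib.SequenceMatcher` as a stand-in for the paper's string alignment and approximate matching, and the function names (`align_records`, `variable_columns`), the gap symbol, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher


def align_records(a, b, gap="-"):
    """Align two token lists into (token_a, token_b) columns.

    Assumption: difflib's matching-block heuristic stands in for the
    alignment technique used by the actual system.
    """
    columns = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left, right = list(a[i1:i2]), list(b[j1:j2])
        # Pad the shorter side with gap symbols so both sides have equal width.
        width = max(len(left), len(right))
        left += [gap] * (width - len(left))
        right += [gap] * (width - len(right))
        columns.extend(zip(left, right))
    return columns


def variable_columns(columns):
    """Return indices of columns whose tokens differ (candidate data slots)."""
    return [i for i, (x, y) in enumerate(columns) if x != y]


# Hypothetical records, already split into tag and text tokens.
rec1 = ["<b>", "Title A", "</b>", "<i>", "1999", "</i>"]
rec2 = ["<b>", "Title B", "</b>", "<i>", "2001", "</i>"]

table = align_records(rec1, rec2)
slots = variable_columns(table)
```

In this sketch, columns where the tokens agree correspond to fixed template markup, while the columns listed in `slots` are the ones a user would name when specifying the scheme for the extracted data.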
