Distributed Multi-Lingual Content based Text Mining DML – CBTM

S. Chitrakala and D. Manjula (India)

Keywords

Multilingual text mining, content-based mining, association mining, text content classification, Language-Wise Keyword Repository

Abstract

With the explosion of information on the internet, and with media-based data such as images, audio streams and videos replacing textual data, extracting knowledge is becoming more complex. A comprehensive methodology covering all forms of data is therefore needed, one that can provide the contents of the data in a short period of time. Text mining tools and algorithms are becoming increasingly popular as many books, texts and documents are converted to soft-copy versions and made globally accessible. Though this trend is predominantly in the English language, the need has arisen for such an approach in other languages too, as many ancient and out-of-print texts in different languages are being converted to soft-copy versions for preservation and for the extraction of information and knowledge. In the context of Indian languages this need is more pronounced, as many texts are available in different languages, scripts and dialects, and in material forms ranging from palm leaves to stone cuttings, holding a wealth of information in a variety of disciplines. In this paper, we propose a novel content-based approach, termed CBTM (Content-Based Text Mining), for knowledge discovery in multilingual texts, and demonstrate it on textual data in the first instance. The proposed methodology employs a content-based approach using keywords and patterns stored in the form of gif strings, so that extensions to other forms of data are possible. Potential applications of this approach in a distributed environment are also highlighted. We have used newspaper advertisements to demonstrate the system.
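As a rough illustration of keyword-driven classification over a language-wise keyword repository, the following minimal sketch may help. It is not the authors' implementation: the repository contents, the category names, and the functions `tokenize` and `classify` are hypothetical, and keywords are held as plain Unicode strings rather than in the gif-string form mentioned in the abstract.

```python
# Hypothetical language-wise keyword repository: language -> category -> keywords.
# All entries are illustrative assumptions and are not taken from the paper.
KEYWORD_REPOSITORY = {
    "english": {
        "real_estate": {"house", "rent", "flat", "plot"},
        "jobs": {"wanted", "salary", "vacancy", "experience"},
    },
    "hindi": {
        "real_estate": {"मकान", "किराया", "फ्लैट"},
        "jobs": {"वेतन", "नौकरी", "अनुभव"},
    },
}

PUNCTUATION = ".,;:!?()\"'।"


def tokenize(text):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    tokens = (token.strip(PUNCTUATION) for token in text.lower().split())
    return [token for token in tokens if token]


def classify(text, repository=KEYWORD_REPOSITORY):
    """Return (language, category, hits) for the best-matching keyword set.

    Returns (None, None, 0) when no keyword from the repository occurs.
    """
    tokens = tokenize(text)
    best = (None, None, 0)
    for language, categories in repository.items():
        for category, keywords in categories.items():
            hits = sum(1 for token in tokens if token in keywords)
            if hits > best[2]:
                best = (language, category, hits)
    return best


if __name__ == "__main__":
    print(classify("Flat for rent, 2 bedroom house"))
    print(classify("नौकरी के लिए अनुभव आवश्यक"))
```

Under these assumptions, the first advertisement matches three keywords in the English real-estate set and the second matches two in the Hindi jobs set. A single pass over the repository identifies language and category together, which keeps the sketch short; a real system would likely separate language identification from content classification.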
