On ontology-driven document clustering using core semantic features


References (44)

Publisher
Springer Journals
Copyright
Copyright © 2011 by Springer-Verlag London Limited
Subject
Computer Science; Business Information Systems; Information Systems and Communication Service
ISSN
0219-1377
eISSN
0219-3116
DOI
10.1007/s10115-010-0370-4

Abstract

Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.
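
The feature-reduction idea in the abstract can be illustrated with a minimal sketch. This is not the authors' method: it assumes a tiny hand-written lexicon (the hypothetical `NOUN_CONCEPTS` table) in place of a real ontology, keeps only the nouns listed there, and maps synonymous nouns to a shared concept identifier. It omits word-sense disambiguation and the information-gain measurement the abstract describes, and the concept labels are made up for illustration.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for an ontology. Each known noun maps
# to an invented concept identifier, so synonyms collapse into one feature.
NOUN_CONCEPTS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "engine": "engine",
    "motor": "engine",
    "bank": "bank",  # polysemous in practice; treated as one sense here
    "loan": "loan",
}

PUNCTUATION = ".,;:!?"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(PUNCTUATION).lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def bag_of_words(text):
    """Baseline representation: every token is a feature."""
    return Counter(tokenize(text))


def concept_features(text):
    """Reduced representation: only known nouns, merged by concept."""
    return Counter(
        NOUN_CONCEPTS[tok] for tok in tokenize(text) if tok in NOUN_CONCEPTS
    )


def vocabulary(vectors):
    """Collect the set of distinct features used across all documents."""
    vocab = set()
    for vector in vectors:
        vocab.update(vector)
    return vocab


docs = [
    "The car has a powerful engine.",
    "An automobile needs a reliable motor.",
    "The bank approved the loan quickly.",
]

full = [bag_of_words(d) for d in docs]
core = [concept_features(d) for d in docs]

full_vocab = vocabulary(full)
core_vocab = vocabulary(core)
reduction = 1 - len(core_vocab) / len(full_vocab)

print(f"all-token features: {len(full_vocab)}")
print(f"concept features:   {len(core_vocab)}")
print(f"feature reduction:  {reduction:.0%}")
```

In this toy corpus the first two documents share no surface words apart from "a", yet they receive identical concept vectors, which is the kind of effect a clustering algorithm can exploit with far fewer features. The reduction figure printed here reflects only the toy data and says nothing about the 90% result reported in the abstract.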

Journal

Knowledge and Information Systems (Springer Journals)

Published: Jan 29, 2011
