A new benchmark dataset with production methodology for short text semantic similarity algorithms

Publisher: Association for Computing Machinery
Copyright: © 2013 ACM Inc.
ISSN: 1550-4875
DOI: 10.1145/2537046

Abstract

A New Benchmark Dataset with Production Methodology for Short Text Semantic Similarity Algorithms

JAMES O'SHEA, ZUHAIR BANDAR, and KEELEY CROCKETT, Manchester Metropolitan University

This research presents a new benchmark dataset for evaluating Short Text Semantic Similarity (STSS) measurement algorithms, together with the methodology used for its creation. The power of the dataset is evaluated by using it to compare two established algorithms, STASIS and Latent Semantic Analysis. The dataset focuses on measures for use in Conversational Agents; other potential applications include email processing and data mining of social networks. Such applications involve integrating the STSS algorithm into a complex system, but STSS algorithms must be evaluated in their own right and compared with others for their effectiveness before systems integration. Semantic similarity is an artifact of human perception; therefore its evaluation is inherently empirical and requires benchmark datasets derived from human similarity ratings. The new dataset of 64 sentence pairs, STSS-131, has been designed to meet these requirements, drawing on a range of resources from traditional grammar to cognitive neuroscience. The human ratings are obtained from a set of trials using new and improved experimental methods, with validated measures and statistics. The results illustrate the increased challenge and the …
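The kind of evaluation the abstract describes pits an STSS algorithm's scores against mean human ratings over a set of sentence pairs. The sketch below is not from the paper: the sentence pairs, human ratings, and the word-overlap `similarity` function are all placeholders, and Pearson's r is used here as a typical correlation measure for such comparisons, not necessarily the statistic used in the STSS-131 trials.

```python
# Minimal sketch (assumptions, not the paper's method): scoring a toy STSS
# measure against hypothetical human benchmark ratings.
from scipy.stats import pearsonr

# Hypothetical benchmark entries: (sentence_a, sentence_b, mean_human_rating).
# The rating scale and the sentence pairs are invented for illustration only.
benchmark = [
    ("The weather is fine today.", "It is a pleasant day outside.", 3.1),
    ("He bought a new laptop.", "She sold her old bicycle.", 0.6),
    ("The train leaves at noon.", "The train departs at twelve o'clock.", 3.8),
]

def similarity(a: str, b: str) -> float:
    """Placeholder STSS measure returning a score in [0, 1].

    A real evaluation would call an algorithm such as STASIS or LSA here;
    this toy version uses word overlap (Jaccard) purely for illustration.
    """
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

machine_scores = [similarity(a, b) for a, b, _ in benchmark]
human_scores = [rating for _, _, rating in benchmark]

# Correlation between machine scores and mean human ratings is a common
# headline figure when comparing STSS algorithms on a benchmark of this kind.
r, p_value = pearsonr(machine_scores, human_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```

Keeping the similarity function as a swappable callable makes it straightforward to run the same benchmark over several algorithms and compare their correlations with the human ratings.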

Journal

ACM Transactions on Speech and Language Processing (TSLP), Association for Computing Machinery

Published: Dec 1, 2013
