Siblings and cousins — statistical methods for spoken language analysis

27 pages

Publisher
Taylor & Francis
Copyright
Copyright Taylor & Francis Group, LLC
ISSN
1949-0763
eISSN
0374-0463
DOI
10.1080/03740463.2004.10415468

Abstract

In this paper we discuss two simple statistical methods for analyzing spoken language as represented in transcription corpora. The methods are strictly data-driven, in the sense that no preset grammatical category system and no lexical knowledge is assumed. The first method (‘Siblings’) is used for word type clustering within one language; the second method (‘Cousins’) is used for translation between two cognate languages. Exploiting two large Scandinavian speech corpora (one Danish and one Swedish), we show how bilingual dictionary entries can be derived from raw transcription data directly. Our investigations shed new light on the so-called ‘disfluencies’ typical of the spoken language, showing them to be syntax errors only from the viewpoint of written language grammar. Finally, we discuss how the proposed methods could be applied to languages that do not have a writing system at all and no record of linguistic description.
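The abstract does not say how the ‘Siblings’ clustering is computed, so the following is not the authors' method. It is only a minimal sketch of one generic way to group word types from raw transcriptions without any grammatical categories or lexicon: compare words by the left and right neighbours they occur between. The function names, the cosine measure, and the four toy Danish utterances are all assumptions made for illustration; none of them come from the paper or its corpora.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_profiles(utterances):
    """Map each word type to a Counter of its (left, right) neighbour pairs."""
    profiles = defaultdict(Counter)
    for utt in utterances:
        # Pad with boundary markers so utterance-initial and -final words get contexts.
        tokens = ["<s>"] + utt.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            profiles[tokens[i]][(tokens[i - 1], tokens[i + 1])] += 1
    return profiles


def cosine(a, b):
    """Cosine similarity between two context-count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest_types(word, profiles, n=3):
    """Return the n word types whose context profiles are most similar to `word`."""
    target = profiles.get(word, Counter())
    scores = [(other, cosine(target, prof))
              for other, prof in profiles.items() if other != word]
    # Sort by descending similarity, then alphabetically for a stable order.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:n]


# Toy utterances invented for this sketch (not taken from any corpus).
toy = [
    "jeg har en hund",
    "jeg har en kat",
    "du har en hund",
    "du ser en kat",
]
profiles = context_profiles(toy)
print(nearest_types("hund", profiles, n=1))
```

In this toy data "hund" and "kat" occur in identical neighbour contexts, so they come out as each other's closest word type. A real transcription corpus would call for frequency thresholds and a more robust similarity measure.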

Journal

Acta Linguistica Hafniensia: International, Taylor & Francis

Published: Jan 1, 2004
