Siblings and cousins — statistical methods for spoken language analysis
Abstract
In this paper we discuss two simple statistical methods for analyzing spoken language as represented in transcription corpora. The methods are strictly data-driven, in the sense that no preset grammatical category system and no lexical knowledge is assumed. The first method (‘Siblings’) is used for word type clustering within one language; the second method (‘Cousins’) is used for translation between two cognate languages. Exploiting two large Scandinavian...