Volume 72 (2021): Issue 2 (December 2021)
NLP, Corpus Linguistics and Interdisciplinarity
Journal Details
License
Format: Journal
eISSN: 1338-4287
First Published: 05 Mar 2010
Publication timeframe: 2 times per year
Languages: English
Access type: Open Access

Designing a Corpus of Czech Monologues: Orator v2

Published Online: 30 Dec 2021
Page range: 520 - 530
Abstract

ORATOR v2 is a new 1.5-million-word corpus of Czech monologues delivered to a live audience in semi-formal to formal settings. It was designed to chart the space of naturally occurring monologues that can be obtained for corpus processing. As such, it aims for diversity but does not attempt any balancing of subcategories, recognizing that some types of data are inherently easier to obtain in high volume than others. The transcription guidelines and annotation tools employed are the same as those used for other recent spoken corpora published by the Czech National Corpus (CNC), which facilitates comparisons between various types of spoken Czech. The present paper sketches three case studies that compare ORATOR with the informal conversations of ORTOFON v2 in terms of the frequency of demonstratives, the frequency of hesitations, and lexical richness.


