Volume 70 (2019): Issue 2 (December 2019)
Journal Details
Format: Journal
First Published: 05 Mar 2010
Publication timeframe: 2 times per year
Languages: English
Access type: Open Access

From the National Corpus of Polish to the Polish Corpus Infrastructure

Published Online: 21 Dec 2019
Page range: 315 - 323

The National Corpus of Polish emerged as the cumulative result of many years of work on large reference corpora by computer scientists and linguists in Poland. Although its impact on research in linguistics, the humanities and language technology has been highly significant, construction of the national corpus was halted in 2011. In this paper we call on the research community and funding institutions to mobilize around the construction of a corpus infrastructure with the national corpus at its heart. We argue that, on the verge of an artificial intelligence revolution, the envisaged Polish Corpus Infrastructure would provide reliable language data, combine available resources and allow easy integration of new ones.
