Volume 72 (2021): Issue 2 (December 2021)
    NLP, Corpus Linguistics and Interdisciplinarity
Journal Details
Format: Journal
eISSN: 1338-4287
First Published: 05 Mar 2010
Publication timeframe: 2 times per year
Languages: English
Access type: Open Access

An HMM-Based PoS Tagger for Old Church Slavonic

Published Online: 30 Dec 2021
Page range: 556 - 567
Abstract

We present a hybrid HMM-based PoS tagger for Old Church Slavonic. The training corpus is a portion of a single text, Codex Marianus (40k), annotated with Universal Dependencies UPOS tags in the UD-PROIEL treebank. We perform a number of experiments in within-domain and out-of-domain settings: the remaining part of Codex Marianus serves as the within-domain test set, and Kiev Folia is used as the out-of-domain test set. Analysing by-PoS-class precision and sensitivity in each run, we combine a simple context-free n-gram-based approach with a Hidden Markov Model (HMM) and add linguistic rules for specific cases such as punctuation and digits. While the model achieves a rather unimpressive accuracy of 81% in the within-domain setting, we observe an accuracy of 51% in the out-of-domain evaluation, which is comparable to the results of large neural architectures based on pre-trained contextual embeddings.
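
To make the combination described in the abstract more concrete, the following is a minimal Python sketch of one way such a hybrid could be wired together. It is not the authors' implementation. The names `train`, `rule_tag`, `log_emission`, `viterbi` and `hybrid_tag` are hypothetical, the toy sentences are invented for illustration and are not taken from the treebank, and several design choices are assumptions: a bigram HMM with add-one smoothing, character trigrams as the context-free n-gram component for unseen words, and rules applied as a final override.

```python
import math
from collections import Counter, defaultdict

def trigrams(word):
    """Character trigrams of a word padded with boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(tagged_sents):
    """Collect transition, emission and character-trigram counts."""
    trans = defaultdict(Counter)  # previous tag -> counts of next tag
    emit = defaultdict(Counter)   # tag -> counts of words
    grams = defaultdict(Counter)  # character trigram -> counts of tags
    for sent in tagged_sents:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            for g in trigrams(word):
                grams[g][tag] += 1
            prev = tag
    return trans, emit, grams

def rule_tag(word):
    """Hand-written rules (assumed) for digits and punctuation."""
    if word.isdigit():
        return "NUM"
    if word and all(not ch.isalnum() for ch in word):
        return "PUNCT"
    return None

def log_emission(word, tag, emit, grams, vocab):
    """Smoothed word emission; unseen words back off to trigram counts."""
    if word in vocab:
        total = sum(emit[tag].values())
        return math.log((emit[tag][word] + 1) / (total + len(vocab)))
    hits = sum(grams[g][tag] for g in trigrams(word))
    total = sum(sum(grams[g].values()) for g in trigrams(word))
    return math.log((hits + 1) / (total + len(emit)))

def viterbi(words, trans, emit, grams):
    """Most probable tag sequence under the bigram HMM."""
    tags = sorted(emit)
    vocab = {w for t in tags for w in emit[t]}

    def log_trans(prev, tag):
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + 1) / (total + len(tags)))

    # best[i][tag] = (log-probability of best path ending in tag, back-pointer)
    best = [{t: (log_trans("<s>", t)
                 + log_emission(words[0], t, emit, grams, vocab), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_emission(words[i], t, emit, grams, vocab)
            column[t] = max((best[i - 1][p][0] + log_trans(p, t) + e, p)
                            for p in tags)
        best.append(column)

    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]

def hybrid_tag(words, model):
    """Run the HMM, then let the rules override its output."""
    if not words:
        return []
    path = viterbi(words, *model)
    return [rule_tag(w) or t for w, t in zip(words, path)]

# Invented toy data, for illustration only.
TOY = [
    [("i", "CCONJ"), ("rece", "VERB"), ("isus", "PROPN"), (".", "PUNCT")],
    [("i", "CCONJ"), ("pride", "VERB"), ("petr", "PROPN"), (".", "PUNCT")],
]
model = train(TOY)
print(hybrid_tag(["i", "rece", "42", "."], model))
```

Applying the rules after decoding keeps the sketch short. Constraining the Viterbi search itself at rule-matched positions would be an alternative, and the abstract does not say which route the authors took.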
