The Effect of (Historical) Language Variation on the East Slavic Lects Lemmatisers Performance

The need to develop tools for historical and regional language varieties is becoming more urgent in natural language processing. In this paper, we present two candidate systems for lemmatising historical East Slavic lects (Late Old East Slavic and Middle Russian) as well as modern regional East Slavic lects (Belogornoje and Megra): a BERT-based end-to-end pipeline with language-specific heuristics and a sequence-to-sequence BART-based encoder-decoder. To evaluate their predictions, we use the accuracy score and string similarity measures, such as Levenshtein distance. The BERT-based model is better suited to the regional data, where it achieves 85% accuracy, compared with only 74% on the historical data. The BART-based model reaches 92.6% accuracy on the historical data, yet only 80% on the regional data. We provide an error analysis and discuss ways to enhance the models, such as dictionary lookup and a spellchecker.
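The two kinds of evaluation measure named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `levenshtein` and `evaluate`, the unit edit costs, and the choice to average the distance over tokens are all assumptions made here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution (assumed costs)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def evaluate(predicted: list[str], gold: list[str]) -> dict[str, float]:
    """Hypothetical helper: exact-match accuracy and mean edit distance per token."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot evaluate an empty sample")
    pairs = list(zip(predicted, gold))
    exact = sum(p == g for p, g in pairs)
    total_distance = sum(levenshtein(p, g) for p, g in pairs)
    return {
        "accuracy": exact / len(gold),
        "mean_levenshtein": total_distance / len(gold),
    }
```

Reporting a string distance alongside accuracy may help separate near misses (a lemma that is one character off) from predictions that are entirely wrong, which exact-match accuracy alone treats identically.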

eISSN: 1338-4287
Language: English

Publication timeframe: 2 times per year
Journal Subjects: Linguistics and Semiotics, Theoretical Frameworks and Disciplines, Linguistics, other

Published Online: Dec 25, 2023

Page range: 225 - 233

DOI: https://doi.org/10.2478/jazcas-2023-0040

Keywords
East Slavic, language variation, lemmatisation, dialectology, historical linguistics, historical NLP

© 2023 Ilia Afanasev et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
