Volume 16 (2021): Issue 2 (December 2021)
Journal Details
Format: Journal
eISSN: 1857-8462
First Published: 01 Jul 2005
Publication timeframe: 2 times per year
Languages: English
Access type: Open Access

Morphological Tagging and Lemmatization in the Albanian Language

Published Online: 30 Dec 2021
Volume & Issue: Volume 16 (2021) - Issue 2 (December 2021)
Page range: 3 - 16
Abstract

An important element of natural language processing is part-of-speech tagging. Fine-grained word-class annotations enrich the word forms in a text and can be used in downstream processes such as dependency parsing. The improved search options that tagged data offers also greatly benefit linguists and lexicographers. Natural language processing research is becoming increasingly popular and important as unsupervised learning methods are developed. Some aspects of the Albanian language make the creation of a part-of-speech tag set challenging.

This paper discusses those linguistic phenomena and proposes a part-of-speech tag set that can adequately represent them.

The corpus contains more than 250,000 tokens, each annotated with a medium-sized tag set. The Albanian language’s syntagmatic aspects are adequately represented. Additionally, this paper presents morphologically and part-of-speech tagged corpora for the Albanian language, as well as a lemmatizer and a neural morphological tagger trained on these corpora. On the held-out evaluation set, the model achieves 93.65% accuracy on part-of-speech tagging, 85.31% on morphological tagging, and 88.95% on lemmatization. Furthermore, the TF-IDF technique is used to weight terms, and the resulting scores highlight words that carry additional information in the Albanian corpus.
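The TF-IDF weighting mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which TF-IDF variant the authors used, so this example assumes one common form: term frequency normalized by document length, multiplied by idf = log(N / df). The function name `tf_idf` and the three-sentence toy corpus are hypothetical and are not taken from the paper's data.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document.

    Assumed variant (not confirmed by the paper):
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus of pre-tokenized Albanian sentences.
docs = [
    ["gjuha", "shqipe", "është", "e", "bukur"],
    ["gjuha", "është", "mjet", "komunikimi"],
    ["korpusi", "është", "i", "madh"],
]
weights = tf_idf(docs)

# Print the terms of the first document, highest weight first.
for term, score in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{score:.4f}")
```

Under this variant a word that occurs in every document, such as "është" here, receives a weight of zero, while words confined to a single document receive the highest weights. That is the sense in which the scores highlight words carrying additional information.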


