Volume 72 (2021): Issue 2 (December 2021)
NLP, Corpus Linguistics and Interdisciplinarity
Journal Details
Format: Journal
eISSN: 1338-4287
First Published: 05 Mar 2010
Publication timeframe: 2 times per year
Languages: English
Access type: Open Access

Linguistic Annotation of Translated Chinese Texts: Coordinating Theory, Algorithms and Data

Abstract

The article addresses the problems of linguistic annotation of the Chinese texts in Ruzhcorp, the Russian-Chinese Parallel Corpus of the RNC, and ways to solve them. Particular attention is paid to the processing of Russian loanwords. On the one hand, we present a theoretical comparison of the widespread standards of Chinese text processing. On the other hand, we describe our experiments in three fields (word segmentation, grapheme-to-phoneme conversion, and PoS-tagging) on specific corpus data that contains many transliterations and loanwords. As a result, we propose a preprocessing pipeline for the Chinese texts that will be implemented in Ruzhcorp.

