Word translation for Indo-Aryan languages using different retrieval techniques

Kiranjeet Kaur, Shweta Chauhan

Abstract


Word embeddings have transformed Natural Language Processing, enabling advanced language models to understand and generate human-like text. In this article, we examine word embeddings in depth, covering their underlying principles, methodologies, and applications. Word translation, and the bilingual dictionaries that support it, is an important component of many multilingual processing tasks, and translation from one language to another has traditionally relied on bilingual dictionaries or parallel data. This work addresses that problem and also aims to generate the best cross-lingual word embeddings for several language pairs, using document-aligned or sentence-aligned corpora, or a bilingual dictionary, for the analysis. For the most frequent words, we assume that the intra-lingual similarity distributions of the source and target corpora are comparable and that the two embedding spaces are approximately isometric. The resulting cross-lingual word embeddings can be used for cross-lingual transfer learning and unsupervised neural machine translation. This research aims to improve the accuracy and efficiency of word translation between language pairs by employing different retrieval techniques, and it analyzes the effectiveness of these techniques on several language pairs, including English-Hindi, English-Punjabi, English-Gujarati, English-Bengali, and English-Marathi. The work is expected to contribute to the field of language translation by introducing innovative methods and further applications.
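To make the alignment and retrieval steps sketched above concrete, the following minimal Python example (an illustration under stated assumptions, not the authors' exact pipeline) learns an orthogonal source-to-target mapping with the Procrustes solution and compares two common retrieval techniques, plain nearest-neighbour search and CSLS (cross-domain similarity local scaling). The monolingual embeddings are replaced by synthetic vectors, and the seed-dictionary size, dimensionality, and neighbourhood size are assumed values chosen only for the demonstration.

import numpy as np


def normalize(M):
    # Length-normalize rows so dot products equal cosine similarities.
    return M / np.linalg.norm(M, axis=1, keepdims=True)


def procrustes(X_src, Y_tgt):
    # Orthogonal W minimizing ||X W - Y||_F (valid under the isometry assumption).
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt


def nn_retrieve(query, tgt_emb):
    # Plain cosine nearest-neighbour retrieval.
    return int(np.argmax(tgt_emb @ query))


def csls_retrieve(query, tgt_emb, r_tgt):
    # CSLS: penalize "hub" target words by their mean similarity r_tgt to mapped queries.
    return int(np.argmax(2.0 * (tgt_emb @ query) - r_tgt))


# Toy illustration with synthetic vectors standing in for real monolingual embeddings.
rng = np.random.default_rng(0)
d, n, n_seed = 50, 200, 40

src = normalize(rng.normal(size=(n, d)))
# Target space: a rotation of the source space plus noise (the isometry assumption).
R_true, _ = np.linalg.qr(rng.normal(size=(d, d)))
tgt = normalize(src @ R_true + 0.01 * rng.normal(size=(n, d)))

# Seed dictionary: source word i translates to target word i for the first n_seed words.
W = procrustes(src[:n_seed], tgt[:n_seed])
mapped = normalize(src @ W)

# Mean similarity of every target word to its 10 nearest mapped queries (needed by CSLS).
r_tgt = np.sort(tgt @ mapped.T, axis=1)[:, -10:].mean(axis=1)

# Precision@1 on the held-out pairs under both retrieval techniques.
held_out = range(n_seed, n)
nn_p1 = sum(nn_retrieve(mapped[i], tgt) == i for i in held_out) / len(held_out)
csls_p1 = sum(csls_retrieve(mapped[i], tgt, r_tgt) == i for i in held_out) / len(held_out)
print(f"P@1 nearest neighbour: {nn_p1:.2f}")
print(f"P@1 CSLS:              {csls_p1:.2f}")

On such near-isometric synthetic data both techniques recover most held-out pairs; CSLS mainly helps when "hub" target words attract many queries, which is the motivation for comparing retrieval techniques on real embeddings.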


Keywords


cross-lingual embedding; word embedding; retrieval techniques; unsupervised word translation


DOI: https://doi.org/10.32629/jai.v7i4.1455



Copyright (c) 2024 Kiranjeet Kaur, Shweta Chauhan

License URL: https://creativecommons.org/licenses/by-nc/4.0/