
Speech data collection system for KUI, a low-resourced tribal language

Subrat Kumar Nayak, Ajit Kumar Nayak, Smitaprava Mishra, Prithviraj Mohanty, Nrusingha Tripathy, Abhilash Pati, Amrutanshu Panigrahi

Abstract


A new generation of speech translation technology is being developed to enable natural cross-language communication. To meet the varying demands of speech recognition applications, research must address large vocabularies, spontaneous speech, and speaker variability; these issues must be resolved before voice recognition can be deployed reliably in realistic settings. Most low-resource languages have no speech data at all, and creating a speech corpus is difficult and time-consuming. KUI is one such low-resource language. In this paper, we develop a speech dataset for the KUI language to document and preserve its speakers' culture, tradition, and history for future generations. We describe the corpus design, data collection procedures, and implementation, and outline the research directions that the KUI dataset enables. The paper focuses on a GUI application and a method for collecting KUI speech more quickly, and reports statistics on the people who contributed to the dataset. This study details a novel method of gathering data that applies to any speech dataset. Using this process, we collected 60 hours of speech data, sampled at 16 kHz on three devices (a Zoom recorder, a mobile phone, and a laptop), from 80 speakers, each of whom contributed 500 sentences in the KUI language. We also propose a set of guidelines for collecting the KUI speech dataset, all based on the real-time experience our team members gained during the data collection process.
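The figures in the abstract imply 80 speakers × 500 sentences = 40,000 utterances, averaging roughly 5.4 seconds each across the 60 recorded hours. As a minimal sketch of how such 16 kHz mono recordings could be stored, using only the Python standard library (the `save_utterance` helper and its file-naming scheme are hypothetical illustrations, not the authors' actual implementation):

```python
import struct
import wave

SAMPLE_RATE = 16_000  # 16 kHz, the sampling rate used for the KUI corpus
SAMPLE_WIDTH = 2      # 16-bit PCM
N_CHANNELS = 1        # mono

def save_utterance(samples, speaker_id, sentence_id, device="laptop"):
    """Write one recorded utterance to a WAV file.

    `samples` is a sequence of 16-bit signed integer PCM values; the file
    name encodes the speaker, sentence, and recording device so utterances
    from the three capture devices can be kept apart.
    """
    path = f"kui_{speaker_id:03d}_{sentence_id:03d}_{device}.wav"
    with wave.open(path, "wb") as wf:
        wf.setnchannels(N_CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return path

# One second of silence stands in for a real recording here:
path = save_utterance([0] * SAMPLE_RATE, speaker_id=1, sentence_id=1)
```

A consistent naming scheme like this lets the 40,000 files be indexed by speaker and prompt without a separate metadata table, which simplifies later train/test splits by speaker.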


Keywords


automatic speech recognition; speaker design; KUI dataset; corpus design; audio recording






DOI: https://doi.org/10.32629/jai.v7i1.1121



Copyright (c) 2023 Subrat Kumar Nayak, Ajit Kumar Nayak, Smitaprava Mishra, Prithviraj Mohanty, Nrusingha Tripathy, Abhilash Pati, Amrutanshu Panigrahi

License URL: https://creativecommons.org/licenses/by-nc/4.0/