
An investigation and analysis on automatic speech recognition systems

R. V. Siva Balan, K. Vignesh, Teena Jose, P. Kalpana, Jothikumar R.

Abstract


A crucial part of building a Speech Recognition System (SRS) is applying the latest technology to its most fundamental modules. While the fundamentals provide basic insight into the system, recent technologies applied to them open up more ways of exploring and exploiting those fundamentals to upgrade the system itself. These upgrades, in turn, yield more specific ways to enhance the scope of an SRS. Algorithms such as the Hidden Markov Model (HMM), Artificial Neural Networks (ANN), hybrid HMM-ANN combinations, and Recurrent Neural Networks (RNN) are used to achieve high performance in SRSs. Given the application domain of an SRS, the algorithm selection criteria play a critical role in enhancing its performance. The algorithm chosen should ultimately work hand in hand with a language model that conforms to the constraints of the natural language. Each language model follows a variety of methods according to the application domain, and hybrid constraints are considered in the case of geography-specific dialects.
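To make the HMM family of algorithms named above concrete, the following is a minimal, illustrative sketch of Viterbi decoding over a toy two-state model. The states, transition, and emission probabilities are invented for illustration and do not come from the paper; a real SRS would use many more states (e.g. phonemes) and probabilities estimated from acoustic data.

```python
import numpy as np

# Toy illustration: Viterbi decoding over a 2-state hidden Markov model.
# All probabilities below are made-up assumptions, not values from the paper.
states = ["silence", "speech"]
start = np.array([0.6, 0.4])                 # P(initial state)
trans = np.array([[0.7, 0.3],                # P(next state | current state)
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],                 # P(observation | state);
                 [0.2, 0.8]])                # obs 0 = low energy, 1 = high

def viterbi(obs):
    """Return the most likely hidden state sequence for a list of observations."""
    prob = start * emit[:, obs[0]]           # best path probability per state
    back = []                                # backpointers for path recovery
    for o in obs[1:]:
        scores = prob[:, None] * trans       # scores[i, j]: best path via i to j
        back.append(scores.argmax(axis=0))   # best predecessor for each state
        prob = scores.max(axis=0) * emit[:, o]
    path = [int(prob.argmax())]              # best final state
    for ptr in reversed(back):               # walk backpointers to the start
        path.append(int(ptr[path[-1]]))
    path.reverse()
    return [states[i] for i in path]

print(viterbi([0, 0, 1, 1, 0]))
# → ['silence', 'silence', 'speech', 'speech', 'silence']
```

The same dynamic-programming recursion underlies the acoustic decoding stage of classical HMM-based recognizers, where a language model then re-scores the candidate word sequences.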


Keywords


speech recognition system; natural language; speech processing; language model; speech technology; ensemble methods


DOI: https://doi.org/10.32629/jai.v7i3.1060



Copyright (c) 2024 R. V. Siva Balan, K. Vignesh, Teena Jose, P. Kalpana, Jothikumar R.

License URL: https://creativecommons.org/licenses/by-nc/4.0/