An investigation and analysis of automatic speech recognition systems
Abstract
A crucial part of building a speech recognition system (SRS) is applying the latest technology to its most fundamental modules. While the fundamentals provide basic insight into the system, recent technologies applied to them open more ways of exploring and exploiting those fundamentals to upgrade the system itself, and these upgrades in turn reveal more specific ways to broaden the scope of the SRS. Algorithms such as the Hidden Markov Model (HMM), the Artificial Neural Network (ANN), hybrid HMM-ANN combinations, and Recurrent Neural Networks (RNNs), among others, are used to achieve high performance in SRSs. Given the application domain of an SRS, the algorithm selection criteria play a critical role in enhancing its performance. The chosen algorithm must ultimately work hand in hand with a language model that conforms to the constraints of the natural language, and each language model follows different methods depending on the application domain. Hybrid constraints are considered in the case of geography-specific dialects.
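As an illustrative sketch (not part of the original article), the toy Python example below shows one way an HMM-style acoustic score can be combined with a weighted language-model score during Viterbi decoding, mirroring the abstract's point that the chosen acoustic algorithm must work hand in hand with the language model. All states, probabilities, and the lm_weight value are hypothetical placeholders.

# Illustrative sketch only: toy Viterbi decoding that adds a weighted
# language-model term to HMM transition and emission scores.
# All values below are made-up toy numbers, not data from the article.
import math

states = ["sil", "k", "ae", "t"]                      # hypothetical silence + phones of "cat"
log_init = {s: math.log(0.25) for s in states}        # uniform start probabilities
log_trans = {s: {t: math.log(0.25) for t in states} for s in states}

# Toy acoustic emission scores: log P(frame | state) for four frames.
log_emit = [
    {"sil": -0.2, "k": -2.0, "ae": -2.5, "t": -2.5},
    {"sil": -2.0, "k": -0.3, "ae": -1.5, "t": -2.0},
    {"sil": -2.5, "k": -1.8, "ae": -0.2, "t": -1.5},
    {"sil": -2.5, "k": -2.0, "ae": -1.5, "t": -0.3},
]

# Toy bigram language-model term applied per transition (an assumption made for
# illustration; real systems apply the LM over word sequences, not phones).
lm_weight = 0.5
log_lm = {s: {t: math.log(0.25) for t in states} for s in states}

def viterbi(frames):
    """Return the most likely state sequence under acoustic + weighted LM scores."""
    vit = [{s: (log_init[s] + frames[0][s], [s]) for s in states}]
    for frame in frames[1:]:
        step = {}
        for s in states:
            # Best predecessor combines path score, HMM transition, and LM term.
            prev, path = max(
                ((vit[-1][p][0] + log_trans[p][s] + lm_weight * log_lm[p][s], vit[-1][p][1])
                 for p in states),
                key=lambda x: x[0],
            )
            step[s] = (prev + frame[s], path + [s])
        vit.append(step)
    return max(vit[-1].values(), key=lambda x: x[0])[1]

print(viterbi(log_emit))   # toy output: ['sil', 'k', 'ae', 't']

In this sketch the interpolation weight lm_weight plays the role of the language-model scaling factor; tuning it per application domain reflects the abstract's observation that algorithm and language-model choices depend on the domain.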
DOI: https://doi.org/10.32629/jai.v7i3.1060
Copyright (c) 2024 R. V. Siva Balan, K. Vignesh, Teena Jose, P. Kalpana, Jothikumar R.
License URL: https://creativecommons.org/licenses/by-nc/4.0/