Cross-domain synergy: Leveraging image processing techniques for enhanced sound classification through spectrogram analysis using CNNs
Abstract
This paper reviews an innovative approach to sound classification that exploits image processing techniques applied to spectrogram representations of audio signals. The study shows that well-established image processing methodologies, such as filtering, segmentation, and pattern recognition, can enhance feature extraction and classification performance when audio signals are transformed into spectrograms. An overview is provided of the mathematical methods shared by image processing and spectrogram-based audio processing, focusing on the commonalities between the two domains in their underlying principles, techniques, and algorithms. The proposed methodology leverages, in particular, the power of convolutional neural networks (CNNs) to extract and classify time-frequency features from spectrograms, capitalizing on their hierarchical feature learning and their robustness to translation and scale variations. Other deep learning architectures and advanced techniques are also discussed in the course of the analysis. We examine the benefits and limitations of transforming audio signals into spectrograms, including human interpretability, compatibility with image processing techniques, and flexibility in time-frequency resolution. By bridging the gap between image processing and audio processing, spectrogram-based audio deep learning offers a deeper perspective on sound classification and fundamental insights that can serve as a foundation for interdisciplinary research and applications in both domains.
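The pipeline outlined above, treating a spectrogram as a one-channel image and classifying it with a CNN, can be sketched in a few lines of code. The following is a minimal illustration rather than the authors' implementation: the library choices (librosa for the mel spectrogram, PyTorch for the network) and all hyperparameters (sample rate, number of mel bands, layer sizes, class count) are assumptions made for the example.

```python
# Minimal sketch of the spectrogram-based classification pipeline:
# an audio clip is converted to a log-scaled mel spectrogram and then
# classified by a small CNN, exactly as one would classify an image.
# All settings below are illustrative assumptions, not the paper's setup.
import librosa
import numpy as np
import torch
import torch.nn as nn

def audio_to_logmel(path, sr=22050, n_mels=64):
    """Load an audio file and return a log-scaled mel spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, frames)

class SpectrogramCNN(nn.Module):
    """Tiny CNN that treats the spectrogram as a one-channel image."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # fixed-size output for any clip length
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):  # x: (batch, 1, n_mels, frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)

# Usage with a hypothetical file "clip.wav":
# spec = audio_to_logmel("clip.wav")
# x = torch.from_numpy(spec).float()[None, None]  # add batch/channel dims
# logits = SpectrogramCNN()(x)
```

The adaptive average pooling step is one simple way to obtain a fixed-length feature vector from clips of varying duration, echoing the abstract's point about the hierarchical feature learning of CNNs and their robustness to translation and scale variations in the time-frequency plane.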
DOI: https://doi.org/10.32629/jai.v6i3.678
Copyright (c) 2023 Valentina Franzoni
License URL: https://creativecommons.org/licenses/by-nc/4.0/