Enhanced feature selection with bacterial foraging and rough set analysis for document clustering

S. Periyasamy, R. Kaniezhil

Abstract


Many applications, such as information retrieval and natural language processing (NLP), use document clustering to improve their analysis. A document contains many features that are used to distinguish similar documents from dissimilar ones. However, traditional techniques incur high computational cost and suffer from convergence problems when analyzing high-dimensional data. This study addresses these difficulties with Bacterial Foraging and Rough Set Analysis (BF-RSA). TF-IDF features are extracted to represent the documents. The extracted features are explored using the Bacterial Foraging Optimization (BFO) approach, whose exploration and exploitation characteristics improve the overall clustering quality. The collected documents are analyzed with a rough set approach that generates a discernibility matrix, which helps to identify similar and dissimilar features. The bacterial foraging method then computes fitness values according to the bacteria's behavior to identify the optimal solution. The selected feature set is further analyzed under the rough set approximation conditions to reduce uncertainty and interpretability issues. The integration of bacterial foraging and the rough set approach maximizes feature selection accuracy and improves clustering accuracy (97.05%) with a minimal convergence time (0.063 s).
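
The abstract does not spell out how the TF-IDF features feed the rough set step, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It computes plain TF-IDF weights, binarizes them with a hypothetical `threshold` parameter, and builds a discernibility matrix whose entry for a pair of documents is the set of terms on which the two documents differ. The function names, the toy corpus, and the binarization rule are all illustrative assumptions.

```python
import math
from itertools import combinations


def tf_idf(docs):
    """Return (vocab, rows); rows[d][k] is the TF-IDF weight of vocab[k] in document d.

    Assumes every document contains at least one whitespace-separated token.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    rows = []
    for toks in tokenized:
        row = []
        for t in vocab:
            tf = toks.count(t) / len(toks)      # relative term frequency
            idf = math.log(n_docs / df[t])      # unsmoothed inverse document frequency
            row.append(tf * idf)
        rows.append(row)
    return vocab, rows


def discernibility_matrix(vocab, rows, threshold=0.0):
    """Map each document pair (i, j), i < j, to the set of terms that discern them.

    A term counts as present when its weight exceeds `threshold`
    (an assumed binarization rule, chosen only for illustration).
    """
    binary = [[w > threshold for w in row] for row in rows]
    matrix = {}
    for i, j in combinations(range(len(binary)), 2):
        matrix[(i, j)] = {
            t for t, a, b in zip(vocab, binary[i], binary[j]) if a != b
        }
    return matrix


# Toy corpus (hypothetical, not from the paper).
docs = [
    "rough set feature selection",
    "bacterial foraging feature selection",
    "rough set approximation",
]
vocab, rows = tf_idf(docs)
dm = discernibility_matrix(vocab, rows)
for pair, terms in sorted(dm.items()):
    print(pair, sorted(terms))
```

In a sketch like this, small entries in the matrix indicate document pairs that share most of their informative terms, and terms that appear in many entries are the ones that do the most to separate documents. A search procedure such as BFO could then score candidate term subsets by how many document pairs they still discern, although the fitness function actually used in the study is not given in the abstract.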


Keywords


document clustering; information retrieval; bacterial foraging and rough set analysis (BF-RSA); uncertainty and interpretability


DOI: https://doi.org/10.32629/jai.v7i5.1631


Copyright (c) 2024 S. Periyasamy, R. Kaniezhil

License URL: https://creativecommons.org/licenses/by-nc/4.0/