
STRTrans: An accurate scene text recognition based on improved transformer network

Prabu Selvam, Saravanan Palani, Marimuthu M, Elakkiya Rajasekar


Text recognition is a significant research area in computer vision. Scene text recognition (STR), the identification of text in real-world scenes, poses a distinctive set of challenges: scene text must capture attention immediately, it is often distorted, and the image capture process introduces occlusion and noise. Together, these factors make recognizing text in scenes considerably harder. In this paper, we introduce STRTrans, a modified Transformer network designed to improve STR performance. It addresses two shortcomings of the existing model: lower accuracy and difficulty recognizing irregular text. We modify the encoder structure by using two consecutive layers of the self-attention (SA) mechanism and reducing the point-wise feed-forward layer, with the aim of enabling the network to interpret the semantic arrangement better. We validated our approach experimentally on three publicly available datasets and benchmarked it against other advanced methods. The results show robust performance on all three benchmarks, with recognition accuracies of 90.60%, 86.20%, and 86.90% on the IC15, SVT-P, and CUTE datasets, respectively. Moreover, the improved model surpasses the existing approaches it was benchmarked against.
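To make the encoder modification concrete, the following is a minimal NumPy sketch of one encoder block with two consecutive self-attention sublayers followed by a narrowed point-wise feed-forward layer. It is an illustration under stated assumptions, not the STRTrans implementation: the post-norm residual layout, the reading of "reduction" as a smaller hidden width, and all names and sizes (`encoder_block`, `init_block`, `d_model`, `d_ff`, `num_heads`) are hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Layer normalization without learned scale/shift, for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # Multi-head scaled dot-product self-attention over a (seq_len, d_model) input.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o

def init_block(d_model, d_ff, rng):
    # Random parameters for one block (hypothetical initialization).
    def mat(rows, cols):
        return rng.standard_normal((rows, cols)) / np.sqrt(rows)

    return {
        "sa1": [mat(d_model, d_model) for _ in range(4)],
        "sa2": [mat(d_model, d_model) for _ in range(4)],
        "w1": mat(d_model, d_ff), "b1": np.zeros(d_ff),
        "w2": mat(d_ff, d_model), "b2": np.zeros(d_model),
    }

def encoder_block(x, p, num_heads):
    # Assumption: post-norm residual connections around each sublayer.
    x = layer_norm(x + self_attention(x, *p["sa1"], num_heads))  # first SA layer
    x = layer_norm(x + self_attention(x, *p["sa2"], num_heads))  # second SA layer
    # Assumption: "reduced" feed-forward means a narrower hidden width d_ff.
    hidden = np.maximum(0.0, x @ p["w1"] + p["b1"])
    return layer_norm(x + hidden @ p["w2"] + p["b2"])

rng = np.random.default_rng(0)
d_model, d_ff, num_heads, seq_len = 64, 64, 4, 10  # illustrative sizes only
params = init_block(d_model, d_ff, rng)
features = rng.standard_normal((seq_len, d_model))  # stand-in for image features
out = encoder_block(features, params, num_heads)
print(out.shape)  # (10, 64)
```

Here `d_ff` equals `d_model`, whereas the original Transformer uses a hidden width four times the model width; the actual reduction used in STRTrans is not stated in this section, so this value is a placeholder.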


text recognition; deep learning; transformer; attention; image rectification

