Tabular text embedding for Vietnamese text-based person search
DOI: https://doi.org/10.54939/1859-1043.j.mst.93.2024.128-136
Keywords: Text-based person search; Tabular data; TabTransformer; CNN; Bi-LSTM
Abstract
Vietnamese text-based person search remains a challenging problem because datasets of Vietnamese descriptions are limited. The currently popular approach to this problem is Deep Neural Networks (DNNs), and transformer networks have recently been favored because they outperform CNN and RNN networks on both vision and natural language processing tasks. However, DNNs and transformer networks require a large amount of training data and computing time to learn visual and textual features efficiently, which makes them a burden to apply to Vietnamese text-based person search. To build a Vietnamese text-based person search system with low computing cost on a scarce-resource dataset of Vietnamese descriptive sentences, in this work we propose to apply the transformer-based architecture TabTransformer for contextual embedding of the noun phrases chunked from the Vietnamese descriptive sentences. This is the first time the TabTransformer network has been deployed together with CNN and RNN architectures for Vietnamese text-based person search. Experimental results on the limited 3000VnPersonSearch dataset show that the proposed method improves recognition accuracy over the baseline method by about 7.5% at Rank 1. In addition, our method is more efficient in computing time than the baseline method.