Real-time Siamese visual object tracking using attention and anchor-free mechanism



  • Hoang Dinh Thang Military Information Technology Institute, Academy of Military Science and Technology
  • Do Ngoc Tuan Military Information Technology Institute, Academy of Military Science and Technology
  • Thai Trung Kien Military Information Technology Institute, Academy of Military Science and Technology
  • Tran Quoc Long (Corresponding Author) VNU University of Engineering and Technology



Visual object tracking; Attention mechanism; Anchor-free mechanism.


Siamese-network-based trackers have consistently demonstrated superior performance in visual object tracking. Most existing trackers compute the features of the target template and the search image independently, then estimate the target's scale and aspect ratio using either a multi-scale search scheme or pre-defined anchor boxes. This paper proposes a Siamese attention network for visual object tracking. An attention fusion mechanism is built from pixel-level matching between template and search features. The proposed framework is anchor-free, making it both simple and effective. Extensive experiments on the visual tracking benchmarks VOT2018 and UAV123 demonstrate that our tracker runs at 42 fps and achieves state-of-the-art performance.
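The abstract gives no equations, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. Pixel-level matching is illustrated with generic cross-attention between flattened feature maps, and the anchor-free step with a direct decoding of per-location distances to the four box sides. The function names, the additive fusion, and the stride of 8 are hypothetical and are not taken from the paper.

```python
import numpy as np


def pixel_level_attention(template, search):
    """Fuse template information into every search location.

    template: (C, Hz, Wz) feature map; search: (C, Hx, Wx) feature map.
    Generic cross-attention, assumed for illustration only.
    """
    c, hx, wx = search.shape
    z = template.reshape(c, -1)                    # (C, Nz) template pixels
    x = search.reshape(c, -1)                      # (C, Nx) search pixels
    sim = z.T @ x / np.sqrt(c)                     # (Nz, Nx) pixel-to-pixel similarity
    sim = sim - sim.max(axis=0, keepdims=True)     # stabilise the softmax
    weights = np.exp(sim)
    weights /= weights.sum(axis=0, keepdims=True)  # softmax over template pixels
    attended = z @ weights                         # (C, Nx) template summary per search pixel
    return (x + attended).reshape(c, hx, wx)       # additive fusion (an assumption)


def decode_anchor_free(scores, offsets, stride=8):
    """Turn the best-scoring location into a box without anchor boxes.

    scores: (H, W) classification map.
    offsets: (4, H, W) distances to the left, top, right and bottom box sides, in pixels.
    stride: assumed ratio between image pixels and score-map cells.
    """
    row, col = np.unravel_index(np.argmax(scores), scores.shape)
    cx, cy = col * stride, row * stride            # location mapped back to the image
    left, top, right, bottom = offsets[:, row, col]
    return (cx - left, cy - top, cx + right, cy + bottom)
```

Because the box is read directly from four predicted distances, no anchor scales or aspect ratios need to be chosen in advance, which is the simplification the abstract attributes to the anchor-free design.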






How to Cite

Hoang Dinh, T., T. Do Ngoc, K. Thai Trung, and L. Tran Quoc. “Real-Time Siamese Visual Object Tracking Using Attention and Anchor-Free Mechanism”. Journal of Military Science and Technology, no. 80, June 2022, pp. 132-41, doi:10.54939/1859-1043.j.mst.80.2022.132-141.



Research Articles