QAGA-Net: enhanced vision transformer-based object detection for remote sensing images

Huaxiang Song (School of Geography Science and Tourism, Hunan University of Arts and Science, Changde, China)
Hanjun Xia (School of Geography Science and Tourism, Hunan University of Arts and Science, Changde, China)
Wenhui Wang (School of Geography Science and Tourism, Hunan University of Arts and Science, Changde, China)
Yang Zhou (School of Geography Science and Tourism, Hunan University of Arts and Science, Changde, China)
Wanbo Liu (School of Geography Science and Tourism, Hunan University of Arts and Science, Changde, China)
Qun Liu (School of Geography Science and Tourism, Hunan University of Arts and Science, Changde, China)
Jinling Liu (School of Geography Science and Tourism, Hunan University of Arts and Science, Changde, China)

International Journal of Intelligent Computing and Cybernetics

ISSN: 1756-378X

Article publication date: 13 November 2024

Abstract

Purpose

Vision transformer (ViT) detectors excel at processing natural images. However, when processing remote sensing images (RSIs), ViT methods generally exhibit inferior accuracy compared to approaches based on convolutional neural networks (CNNs). Recently, researchers have proposed various structural optimization strategies to enhance the performance of ViT detectors, but progress has been limited. We contend that the frequent scarcity of RSI samples is the primary cause of this problem and that model modifications alone cannot solve it.

Design/methodology/approach

To address this, we introduce a Faster R-CNN-based approach, termed QAGA-Net, which significantly enhances the performance of ViT detectors in RSI recognition. First, we propose a novel quantitative augmentation learning (QAL) strategy to address the sparse data distribution in RSIs. This strategy is implemented as the QAL module, a plug-and-play component active exclusively during the model's training phase. Second, we enhance the feature pyramid network (FPN) with two efficient modules: a global attention (GA) module to model long-range feature dependencies and strengthen multi-scale information fusion, and an efficient pooling (EP) module to improve the model's capability to understand both high- and low-frequency information. Importantly, QAGA-Net has a compact model size and strikes a balance between computational efficiency and accuracy.
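The abstract specifies only the pattern the QAL module follows: a plug-and-play component that is active exclusively during training and requires no structural modification to the ViT backbone. The PyTorch sketch below illustrates that pattern under stated assumptions; the class name, the Gaussian-noise perturbation, and all parameters are illustrative placeholders, not the authors' QAL algorithm.

import torch
import torch.nn as nn

class QuantitativeAugmentation(nn.Module):
    """Hypothetical sketch of a plug-and-play, training-only module.

    Follows the pattern described in the abstract: perturb features
    during training to densify a sparse sample distribution, and act
    as an identity at inference. The Gaussian-noise augmentation is a
    placeholder, not the authors' QAL algorithm.
    """

    def __init__(self, noise_std: float = 0.1):
        super().__init__()
        self.noise_std = noise_std

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Active exclusively during the training phase.
            return x + torch.randn_like(x) * self.noise_std
        # Identity at inference: no change to the deployed detector.
        return x

# Usage: insert between any ViT backbone and the detection head.
features = torch.randn(2, 256, 32, 32)   # dummy backbone feature map
qal = QuantitativeAugmentation()
qal.train()
augmented = qal(features)                # perturbed during training
qal.eval()
unchanged = qal(features)                # exact pass-through at inference

Because a module of this kind reduces to an identity at inference, it adds no parameters or latency to the deployed detector, which is consistent with the abstract's claim that the method needs no structural modifications to the ViT backbone.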

Findings

We verified the performance of QAGA-Net using two different efficient ViT models as the detector's backbone. Extensive experiments on the NWPU-10 and DIOR20 datasets demonstrate that QAGA-Net achieves superior accuracy compared to 23 other ViT- or CNN-based methods in the literature. Specifically, QAGA-Net improves mAP by 2.1% and 2.6% on the challenging DIOR20 dataset compared to the top-ranked CNN and ViT detectors, respectively.

Originality/value

This paper highlights the impact of sparse data distribution on ViT detection performance. To address it, we introduce a fundamentally data-driven approach: the QAL module. Additionally, we introduce two efficient modules to enhance the performance of the FPN. More importantly, our strategy can be combined with other ViT detectors, as the proposed method does not require any structural modifications to the ViT backbone.

Acknowledgements

This work was supported by the Research Foundation of Hunan University of Arts and Science (Geography Subject [2022] 351).

Citation

Song, H., Xia, H., Wang, W., Zhou, Y., Liu, W., Liu, Q. and Liu, J. (2024), "QAGA-Net: enhanced vision transformer-based object detection for remote sensing images", International Journal of Intelligent Computing and Cybernetics, Vol. ahead-of-print No. ahead-of-print. https://doi.org/10.1108/IJICC-08-2024-0383

Publisher

Emerald Publishing Limited

Copyright © 2024, Emerald Publishing Limited
