A Hierarchical Vision Transformer Using Overlapping Patch and Self-Supervised Learning
Transformer-based architectures have gradually replaced convolutional neural networks in computer vision. Compared with convolutional neural networks, Transformers can learn global image information and have stronger feature extraction capability. However, because they lack the inductive biases of convolutions, vision Transformers such as ViT require large amounts of data for pre-training. Local (window-based) Transformers effectively reduce computational complexity, but they struggle to establish long-range dependencies and perform worse on small-scale datasets. To address these problems, we propose the OPSe Transformer. A global attention module is appended to each stage of the vision Transformer; it uses slightly larger, overlapping key and value patches to enhance information exchange between adjacent windows and to aggregate global information within the local Transformer. In addition, a self-supervised proxy task is added to the architecture, and its corresponding loss function constrains training on the dataset, so that the vision Transformer learns spatial information within an image and trains more effectively. Comparative experiments on Tiny-ImageNet, CIFAR-10/100, and other datasets show that, compared with the baseline algorithm, our model improves accuracy by up to 3.91%.
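To make the overlapping key/value idea concrete, below is a minimal PyTorch sketch of window attention in which queries come from non-overlapping windows while keys and values are taken from slightly larger, overlapping patches around each window. The module name, the window/overlap parameters, and the use of F.unfold are illustrative assumptions, not the authors' exact implementation.

# Sketch only: illustrates overlapping key/value window attention.
# Names (OverlappingKVWindowAttention, window, overlap) are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OverlappingKVWindowAttention(nn.Module):
    def __init__(self, dim, num_heads=4, window=7, overlap=2):
        super().__init__()
        self.dim, self.heads = dim, num_heads
        self.window, self.overlap = window, overlap
        self.kv_size = window + 2 * overlap          # enlarged, overlapping key/value patch
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = (dim // num_heads) ** -0.5

    def forward(self, x):                            # x: (B, H, W, C), H and W divisible by window
        B, H, W, C = x.shape
        w, k = self.window, self.kv_size
        # Queries: non-overlapping w x w windows.
        q = self.q(x).reshape(B, H // w, w, W // w, w, C)
        q = q.permute(0, 1, 3, 2, 4, 5).reshape(B, -1, w * w, C)
        # Keys/values: overlapping k x k patches, one per window (stride = w, padded borders).
        kv = self.kv(x).permute(0, 3, 1, 2)          # (B, 2C, H, W)
        kv = F.unfold(kv, kernel_size=k, stride=w, padding=self.overlap)
        kv = kv.reshape(B, 2 * C, k * k, -1).permute(0, 3, 2, 1)   # (B, nWin, k*k, 2C)
        key, val = kv.chunk(2, dim=-1)

        def split_heads(t):                          # (B, nWin, N, C) -> (B, nWin, heads, N, C/heads)
            return t.reshape(*t.shape[:3], self.heads, C // self.heads).transpose(2, 3)

        q, key, val = split_heads(q), split_heads(key), split_heads(val)
        attn = (q @ key.transpose(-2, -1)) * self.scale          # (B, nWin, heads, w*w, k*k)
        out = (attn.softmax(dim=-1) @ val).transpose(2, 3).reshape(B, -1, w * w, C)
        out = out.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return self.proj(out)

For example, with window=7 and overlap=2 each 7x7 query window attends to an 11x11 key/value patch around it, so adjacent windows share a band of keys and values; calling the module on a (2, 28, 28, 96) feature map returns a tensor of the same shape.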