Open Access
Vision Transformer with Key-Select Routing Attention for Single Image Dehazing

Lihan TONG, Weijia LI, Qingxia YANG, Liyuan CHEN, Peng CHEN


Summary:

We present Ksformer, which uses Multi-scale Key-select Routing Attention (MKRA) to intelligently select key areas through multi-channel, multi-scale windows with a top-k operator, and a Lightweight Frequency Processing Module (LFPM) to enhance high-frequency features; Ksformer outperforms other dehazing methods in our tests.

Publication
IEICE TRANSACTIONS on Information Vol.E107-D No.11 pp.1472-1475
Publication Date
2024/11/01
Publicized
2024/07/01
Online ISSN
1745-1361
DOI
10.1587/transinf.2024EDL8043
Type of Manuscript
LETTER
Category
Image Recognition, Computer Vision

1.  Introduction

Single image dehazing [3]-[5] aims to restore clear, high-quality images from hazy ones, which is essential for applications such as object detection [2] and semantic segmentation [1]. Traditional prior-based methods [6], [8], [28] often fail to produce ideal results because their hand-crafted priors cannot cover all scenarios [15]. With the rise of deep learning, convolutional neural networks (CNNs) [4], [5], [16] have been widely applied to image dehazing and have achieved good results. However, CNNs cannot capture long-range dependencies, which limits further improvement in dehazing quality. Recently, Transformers [17], [18], [20], [21] have been widely used in computer vision tasks because they can capture long-range dependencies, but their computational complexity grows quadratically with image resolution. Many efforts [21]-[24] address this issue by introducing handcrafted sparsity; however, such sparsity is content-agnostic and therefore discards some useful information.

We propose Ksformer, which consists of MKRA and LFPM. MKRA estimates queries in windows of different sizes and then uses a top-k operator to select the k most important queries; this improves computational efficiency and makes the attention content-aware, while the multi-scale windows handle degradations of varying sizes. LFPM extracts spectral features with a very small number of parameters. The contributions of this work are summarized as follows:

  • Ksformer is content-aware, selecting key-value pairs with important information to minimize content loss, while also capturing long-range dependencies and reducing computational complexity.

  • Ksformer extracts spectral features with ultra-lightweight parameters, performing MKRA in both spatial and frequency domains and then fusing them, which narrows the gap between clean and hazy images in terms of both space and spectrum.

  • Ksformer achieves a PSNR of 39.4 and an SSIM of 0.994 with only 5.8M parameters, significantly outperforming other state-of-the-art methods.

Fig. 1  The architecture of the proposed Ksformer.

2.  Method

2.1  Image Dehazing

For a compact model, we use three encoders and three decoders and downsample the features by a factor of \(4 \times 4\). To reduce computational complexity, we apply the Multi-scale Key-select Routing Attention Module (MKRAM) only at the smaller feature scales. To ease training [25], [26], we strengthen the exchange of information between layers and use skip connections at both the feature and image levels.
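
To make this layout concrete, the following is a minimal PyTorch sketch of a three-encoder, three-decoder network with feature-level and image-level skip connections. The module names, channel widths, and the plain convolution standing in for MKRAM at the smallest scale are our own assumptions for illustration, not the authors' exact configuration.

import torch
import torch.nn as nn

class KsformerSketch(nn.Module):
    # Minimal three-encoder / three-decoder sketch; channel widths are assumed.
    def __init__(self, base_ch=24):
        super().__init__()
        self.stem = nn.Conv2d(3, base_ch, 3, padding=1)
        self.enc1 = nn.Conv2d(base_ch, base_ch, 3, padding=1)                # full resolution
        self.down1 = nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1)
        self.enc2 = nn.Conv2d(base_ch * 2, base_ch * 2, 3, padding=1)        # 1/2 resolution
        self.down2 = nn.Conv2d(base_ch * 2, base_ch * 4, 3, stride=2, padding=1)
        # Placeholder for the attention blocks (MKRAM), applied only at the smallest scale.
        self.enc3 = nn.Conv2d(base_ch * 4, base_ch * 4, 3, padding=1)        # 1/4 resolution
        self.dec3 = nn.Conv2d(base_ch * 4, base_ch * 4, 3, padding=1)
        self.up2 = nn.ConvTranspose2d(base_ch * 4, base_ch * 2, 2, stride=2)
        self.dec2 = nn.Conv2d(base_ch * 2, base_ch * 2, 3, padding=1)
        self.up1 = nn.ConvTranspose2d(base_ch * 2, base_ch, 2, stride=2)
        self.dec1 = nn.Conv2d(base_ch, base_ch, 3, padding=1)
        self.out = nn.Conv2d(base_ch, 3, 3, padding=1)

    def forward(self, x):
        f1 = self.enc1(self.stem(x))
        f2 = self.enc2(self.down1(f1))
        f3 = self.enc3(self.down2(f2))
        d3 = self.dec3(f3)
        d2 = self.dec2(self.up2(d3) + f2)   # feature-level skip connection
        d1 = self.dec1(self.up1(d2) + f1)   # feature-level skip connection
        return self.out(d1) + x             # image-level skip connection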

2.2  Multi-Scale Key-Select Routing Attention

MKRA uses a top-k operator to select the most important key-value pairs, balancing content awareness against computational complexity. Given an input feature map \(X \in R^{H \times W \times C}\), we first divide it into four parts along the channel dimension, with window sizes of \(2 \times 2\), \(4 \times 4\), \(8 \times 8\), and \(64 \times 64\). Each part is then divided into \(S \times S\) non-overlapping regions, each containing \(\frac{H W}{S^{2}}\) feature vectors. After this step, \(X\) is reshaped into \(X^{r} \in R^{S^{2} \times \frac{H W}{S^{2}} \times \frac{C}{4}}\). We then apply linear projections to obtain \(Q, K, V \in R^{S^{2} \times \frac{H W}{S^{2}} \times \frac{C}{4}}\):

\[\begin{equation*} Q=X^{r} W^{q}, K=X^{r} W^{k}, V=X^{r} W^{v}, \tag{1} \end{equation*}\]

Here, \(W^{q}, W^{k}, W^{v} \in R^{\frac{C}{4} \times \frac{C}{4}}\) are the projection weights for \(Q\), \(K\), and \(V\), respectively. We then identify the regions that contain the important key-value pairs. In short, we average the feature vectors of each region to obtain region-level queries and keys, \(Q_{r}, K_{r} \in R^{S^{2} \times \frac{C}{4}}\), and derive the region-to-region importance matrix as

\[\begin{equation*} A_{r}^{n \times n}=\operatorname{Softmax}\left(\frac{Q_{r}\left(K_{r}\right)^{T}}{\sqrt{\frac{C}{4}}}\right), \tag{2} \end{equation*}\]

Here, \(A_{r}^{n \times n} \in R^{S^{2} \times \frac{C}{4}}\) represents the degree of association between regions, and \(n \times n\) denotes the window size. We concatenate the \(A_{r}^{n \times n}\) of the four branches along the channel dimension to obtain \(A_{r} \in R^{S^{2} \times C}\). We then retain the top-\(k\) most important queries using the top-k operator and prune the association graph to obtain the index matrix:

\[\begin{equation*} I_{r}=\operatorname{topk}\left(A_{r}\right), \tag{3} \end{equation*}\]

Here, \(I_{r} \in R^{S^{2} \times k}\); the \(i\)-th row of \(I_{r}\) contains the indices of the \(k\) regions most relevant to the \(i\)-th region. Using the importance index matrix \(I_{r}\), we can capture long-range dependencies, remain content-aware, and reduce computational complexity. Each query token in region \(i\) attends to all key-value pairs in the union of the \(k\) important regions indexed by \(I^{(i, 1)}_{r}, I^{(i, 2)}_{r}, \ldots, I^{(i, k)}_{r}\). We therefore gather the key and value tensors:

\[\begin{equation*} K^{g}=\operatorname{gather}\left(K, I_{r}\right), V^{g}=\operatorname{gather}\left(V, I_{r}\right), \tag{4} \end{equation*}\]

Here, \(K^{g}, V^{g} \in R^{S^{2} \times \frac{k H W}{S^{2}} \times C}\). Finally, we apply attention to the gathered key-value pairs:

\[\begin{equation*} \text {Output }=\operatorname{Attention}\left(Q, K^{g}, V^{g}\right). \tag{5} \end{equation*}\]
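
The following is a single-scale PyTorch sketch of Eqs. (1)-(5), written for illustration only: the class name, the region size, and the use of full-channel projections (rather than the paper's four \(C/4\) channel groups) are our assumptions.

import torch
import torch.nn as nn

class KeySelectRoutingAttention(nn.Module):
    # Single-scale sketch of key-select routing attention (Eqs. (1)-(5)).
    # `region` is the side length of the non-overlapping regions (so S = H / region),
    # and `topk` is the number of routed regions kept per query region.
    def __init__(self, dim, region=8, topk=4):
        super().__init__()
        self.region, self.topk, self.scale = region, topk, dim ** -0.5
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                      # x: (B, H, W, C); H, W divisible by region
        B, H, W, C = x.shape
        r = self.region
        Sh, Sw = H // r, W // r                # S x S grid of regions
        # Partition into regions: (B, S^2, r*r, C).
        xr = x.view(B, Sh, r, Sw, r, C).permute(0, 1, 3, 2, 4, 5).reshape(B, Sh * Sw, r * r, C)
        q, k, v = self.wq(xr), self.wk(xr), self.wv(xr)                      # Eq. (1)
        # Region-level queries/keys via averaging, then the affinity matrix (Eq. (2)).
        qr, kr = q.mean(dim=2), k.mean(dim=2)                                # (B, S^2, C)
        Ar = torch.softmax(qr @ kr.transpose(-1, -2) * self.scale, dim=-1)   # (B, S^2, S^2)
        # Keep the indices of the top-k most relevant regions per region (Eq. (3)).
        Ir = Ar.topk(self.topk, dim=-1).indices                              # (B, S^2, k)
        # Gather the key/value tokens of the routed regions (Eq. (4)).
        idx = Ir[..., None, None].expand(-1, -1, -1, r * r, C)               # (B, S^2, k, r*r, C)
        kg = torch.gather(k.unsqueeze(1).expand(-1, Sh * Sw, -1, -1, -1), 2, idx)
        vg = torch.gather(v.unsqueeze(1).expand(-1, Sh * Sw, -1, -1, -1), 2, idx)
        kg = kg.reshape(B, Sh * Sw, self.topk * r * r, C)
        vg = vg.reshape(B, Sh * Sw, self.topk * r * r, C)
        # Token-to-token attention restricted to the gathered pairs (Eq. (5)).
        attn = torch.softmax(q @ kg.transpose(-1, -2) * self.scale, dim=-1)
        out = attn @ vg                                                      # (B, S^2, r*r, C)
        return out.view(B, Sh, Sw, r, r, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

For example, with dim=32, region=8, and topk=4, a (1, 64, 64, 32) input returns an output of the same shape; the multi-scale module would split the channels into four groups and run this routine with different window sizes in parallel.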

3.  Lightweight Frequency Processing Module

Past approaches often employ wavelet [29]-[32] or Fourier [33] transforms to divide image features into multiple frequency sub-bands. These approaches add computational cost for the inverse transformation and do not emphasize the key frequency components. To address this, we introduce a lightweight module for spectral feature extraction and modulation: it efficiently splits the spectrum into different frequency bands and uses a small number of learnable parameters to emphasize the most informative ones.
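
Since the letter does not spell out LFPM's internals, the sketch below only illustrates the stated idea: split the spectrum into low- and high-frequency bands with an FFT and re-weight each band with a handful of learnable per-channel gains. The band mask, the gain parameterization, and the class name are all assumptions.

import torch
import torch.nn as nn

class LightweightFrequencyModuleSketch(nn.Module):
    # Hedged sketch of a frequency split-and-reweight step, not the authors' exact LFPM.
    def __init__(self, channels, low_ratio=0.25):
        super().__init__()
        self.low_ratio = low_ratio
        # Two learnable gains per channel: one for the low band, one for the high band.
        self.low_gain = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.high_gain = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        spec = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
        # Square low-frequency window centred on the shifted spectrum.
        h, w = int(H * self.low_ratio), int(W * self.low_ratio)
        mask = torch.zeros(1, 1, H, W, device=x.device)
        mask[..., H // 2 - h // 2:H // 2 + h // 2, W // 2 - w // 2:W // 2 + w // 2] = 1.0
        low, high = spec * mask, spec * (1.0 - mask)
        # Emphasize the more informative band with the learned gains, then invert.
        spec_mod = low * self.low_gain + high * self.high_gain
        out = torch.fft.ifft2(torch.fft.ifftshift(spec_mod, dim=(-2, -1)), norm="ortho")
        return out.real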

4.  Multi-Scale Key-Select Routing Attention Module

As shown in Fig. 2, to improve model efficiency, our MKRAM module processes the spatial domain, low frequencies, and high frequencies in parallel and then fuses the three outputs.
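
A minimal sketch of this three-branch arrangement is given below. The fusion by concatenation and a 1x1 convolution, and the residual connection, are our assumptions; the letter only states that the spatial, low-frequency, and high-frequency outputs are computed in parallel and fused.

import torch
import torch.nn as nn

class MKRAMSketch(nn.Module):
    # Each branch would be a key-select routing attention block operating on the spatial
    # features, the low-frequency band, and the high-frequency band, respectively.
    def __init__(self, channels, spatial_branch=None, low_branch=None, high_branch=None):
        super().__init__()
        self.spatial = spatial_branch if spatial_branch is not None else nn.Identity()
        self.low = low_branch if low_branch is not None else nn.Identity()
        self.high = high_branch if high_branch is not None else nn.Identity()
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x):                                     # x: (B, C, H, W)
        outs = [self.spatial(x), self.low(x), self.high(x)]   # three parallel branches
        return self.fuse(torch.cat(outs, dim=1)) + x          # fuse and keep a residual path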

Fig. 2  (a) is the architecture of the proposed MKRAM. (b) is the architecture of the proposed LFPM.

5.  Experiments

5.1  Implementation Details

All experiments are implemented in PyTorch 1.11.0 and run on four NVIDIA RTX 4090 GPUs. During training, images are randomly cropped into 320 \(\times\) 320 pixel patches; computational complexity is measured at a resolution of 128 \(\times\) 128 pixels. We use the Adam optimizer with decay rates \(\beta_1 = 0.9\) and \(\beta_2 = 0.999\), an initial learning rate of 0.00015 scheduled by cosine annealing, and a batch size of 64. The penalty parameters \(\lambda\) and \(\gamma\) are empirically set to 0.2 and 0.25, and the model is trained for 80,000 iterations.
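
The stated optimizer and schedule map directly onto standard PyTorch calls; the snippet below is only an illustrative reconstruction of those settings, with a stand-in model, a dummy data source, and a simple L1 loss in place of the unspecified \(\lambda\)/\(\gamma\)-weighted training loss.

import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(3, 3, 3, padding=1)                   # stand-in for Ksformer
optimizer = torch.optim.Adam(model.parameters(), lr=1.5e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80_000)

def dummy_batches(batch_size=64, crop=320):
    # Placeholder data source; replace with Haze4K / RESIDE loading and random 320x320 crops.
    while True:
        yield torch.rand(batch_size, 3, crop, crop), torch.rand(batch_size, 3, crop, crop)

batches = dummy_batches()
for iteration in range(80_000):
    hazy, clean = next(batches)
    loss = F.l1_loss(model(hazy), clean)                      # stand-in for the full loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()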

5.2  Quantitative and Qualitative Experiments

Visual Comparison. To thoroughly assess our method, we test it on both the synthetic Haze4K [27] dataset and the real-world RTTS [34] dataset. As shown in Fig. 3 and Fig. 4, our method clearly outperforms the others in edge sharpness, color fidelity, texture detail, and handling of sky regions, on both synthetic and real data.

Quantitative Comparison. We quantitatively compare Ksformer with current state-of-the-art methods on the SOTS indoor [34] and Haze4K [27] datasets. As shown in Table 1, on SOTS indoor [34], Ksformer achieves a PSNR of 39.40 and an SSIM of 0.994, a 0.09 dB PSNR improvement over the second-best method with only \(30\%\) of its parameters. On Haze4K [27], Ksformer reaches a PSNR of 33.74 and an SSIM of 0.98. These quantitative comparisons demonstrate that Ksformer outperforms other state-of-the-art methods.
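
For reference, PSNR over [0, 1]-normalized images can be computed as below; SSIM is usually taken from an existing implementation, and this helper is only an illustrative definition, not the exact evaluation code used in the paper.

import torch

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio between a dehazed output and its ground truth.
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)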

Fig. 3  Visual results comparisons on RTTS [34] dataset. Zoom in for best view.

Fig. 4  Visual results comparisons on Haze4K dataset [27]. Zoom in for best view.

Table 1  Quantitative comparisons with SOTA methods on the RESIDE-Indoor [34] and Haze4K [27] datasets.

5.3  Ablation Study

To verify the effectiveness of our method, we conduct an ablation study. We first build a U-Net as the base network and then gradually add our modules to this baseline. As shown in Table 2, both PSNR and SSIM improve as the modules are added step by step, and the metrics reach their best values when the proposed modules are combined.

Table 2  Ablation study of our Ksformer on the Haze4k dataset [27].

6.  Conclusion

This paper introduces Ksformer, which combines a top-k operator with multi-scale windows, making the network content-aware while keeping its complexity low. At the same time, it extracts spectral features with ultra-lightweight parameters, narrowing the spectral gap between clean and hazy images. On the SOTS indoor [34] dataset, it achieves a PSNR of 39.4 and an SSIM of 0.994 with only 5.8M parameters.

Although Ksformer has a relatively small parameter count of just 5.8 million, its high GFLOPs currently prevent deployment on embedded systems. We plan to further explore the balance between performance and computational complexity; by appropriately reducing the number of channels and modules, we aim to make Ksformer suitable for embedded systems so that it can serve a broader range of applications.

Acknowledgments

This work was supported in part by the Youth Science and Technology Innovation Program of Xiamen Ocean and Fisheries Development Special Funds (23ZHZB039QCB24) and the Xiamen Ocean and Fisheries Development Special Funds (22CZB013HJ04).

References

[1] S. Hao, Y. Zhou, and Y. Guo, “A brief survey on semantic segmentation with deep learning,” Neurocomputing, vol.406, pp.302-321, 2020.

[2] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” Proc. IEEE, vol.111, no.3, pp.257-276, 2023.

[3] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng, “Aod-net: All-in-one dehazing network,” Proc. IEEE Int. Conf. Comput. Vis., pp.4770-4778, 2017.

[4] X. Qin, Z. Wang, Y. Bai, X. Xie, and H. Jia, “FFA-Net: Feature fusion attention network for single image dehazing,” Proc. AAAI Conf. Artif. Intell., vol.34, no.7, pp.11908-11915, 2020.

[5] H. Wu, Y. Qu, S. Lin, J. Zhou, R. Qiao, Z. Zhang, Y. Xie, and L. Ma, “Contrastive learning for compact single image dehazing,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.10551-10560, 2021.

[6] K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” IEEE Trans. Pattern Anal. Mach. Intell., vol.33, no.12, pp.2341-2353, 2011.

[7] X. Liu, Y. Ma, Z. Shi, and J. Chen, “Griddehazenet: Attention-based multi-scale network for image dehazing,” Proc. IEEE/CVF Int. Conf. Comput. Vis., pp.7314-7323, 2019.

[8] D. Berman, T. Treibitz, and S. Avidan, “Non-local image dehazing,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp.1674-1682, 2016.

[9] H. Dong, J. Pan, L. Xiang, Z. Hu, X. Zhang, F. Wang, and M.-H. Yang, “Multi-scale boosted dehazing network with dense feature fusion,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.2157-2167, 2020.

[10] T. Ye, M. Jiang, Y. Zhang, L. Chen, E. Chen, P. Chen, and Z. Lu, “Perceiving and modeling density is all you need for image dehazing,” arXiv preprint arXiv:2111.09733, 2021.

[11] B. Li, W. Ren, D. Fu, D. Tao, D. Feng, W. Zeng, and Z. Wang, “Benchmarking single-image dehazing and beyond,” IEEE Trans. Image Process., vol.28, no.1, pp.492-505, 2019.

[12] Y. Liu, L. Zhu, S. Pei, H. Fu, J. Qin, Q. Zhang, L. Wan, and W. Feng, “From synthetic to real: Image dehazing collaborating with unlabeled real data,” Proc. 29th ACM Int. Conf. Multimedia, pp.50-58, 2021.

[13] M. Hong, Y. Xie, C. Li, and Y. Qu, “Distilling image dehazing with heterogeneous task imitation,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.3462-3471, 2020.

[14] C. Guo, Q. Yan, S. Anwar, R. Cong, W. Ren, and C. Li, “Image dehazing transformer with transmission-aware 3D position embedding,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.5812-5820, 2022.

[15] K. Zhang, W. Ren, W. Luo, W.-S. Lai, B. Stenger, M.-H. Yang and H. Li, “Deep image deblurring: A survey,” Int. J. Comput. Vis., vol.130, no.9, pp.2103-2130, 2022.

[16] Y. Cui, Y. Tao, Z. Bing, W. Ren, X. Gao, X. Cao, K. Huang, and A. Knoll, “Selective frequency network for image restoration,” Proc. 11th Int. Conf. Learn. Represent., 2022.

[17] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran, “Image transformer,” Proc. Int. Conf. Mach. Learn., PMLR, pp.4055-4064, 2018.

[18] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao, “Pre-trained image processing transformer,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.12299-12310, 2021.

[19] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” Proc. IEEE/CVF Int. Conf. Comput. Vis., pp.1833-1844, 2021.

[20] J. Ke, Q. Wang, Y. Wang, P. Milanfar, and F. Yang, “Musiq: Multi-scale image quality transformer,” Proc. IEEE/CVF Int. Conf. Comput. Vis., pp.5148-5157, 2021.

[21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” Proc. IEEE/CVF Int. Conf. Comput. Vis., pp.10012-10022, 2021.

[22] Z. Wang, X. Cun, J. Bao, W. Zhou, J. Liu, and H. Li, “Uformer: A general u-shaped transformer for image restoration,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.17683-17693, 2022.

[23] S.W. Zamir, A. Arora, S. Khan, M. Hayat, F.S. Khan, and M.-H. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.5728-5739, 2022.

[24] Y. Qiu, K. Zhang, C. Wang, W. Luo, H. Li, and Z. Jin, “MB-TaylorFormer: Multi-branch efficient transformer expanded by Taylor formula for image dehazing,” Proc. IEEE/CVF Int. Conf. Comput. Vis., pp.12802-12813, 2023.

[25] X. Mao, Y. Liu, W. Shen, Q. Li, and Y. Wang, “Deep residual Fourier transformation for single image deblurring,” arXiv preprint arXiv:2111.11745, 2021.

[26] Z. Tu, H. Talebi, H. Zhang, F. Yang, P. Milanfar, A. Bovik, and Y. Li, “Maxim: Multi-axis mlp for image processing,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.5769-5780, 2022.

[27] Y. Liu, L. Zhu, S. Pei, H. Fu, J. Qin, Q. Zhang, L. Wan, and W. Feng, “From synthetic to real: Image dehazing collaborating with unlabeled real data,” Proc. 29th ACM Int. Conf. Multimedia, pp.50-58, 2021.

[28] Q. Zhu, J. Mai, and L. Shao, “Single image dehazing using color attenuation prior,” Proc. BMVC, pp.1-10, Citeseer, 2014.

[29] I.W. Selesnick, R.G. Baraniuk, and N.C. Kingsbury, “The dual-tree complex wavelet transform,” IEEE Signal Process. Mag., vol.22, no.6, pp.123-151, 2005.

[30] H.H. Yang and Y. Fu, “Wavelet u-net and the chromatic adaptation transform for single image dehazing,” Proc. IEEE Int. Conf. Image Process. (ICIP), pp.2736-2740, 2019.

[31] W.-T. Chen, H.-Y. Fang, C.-L. Hsieh, C.-C. Tsai, I.-H. Chen, J.-J. Ding, and S.-Y. Kuo, “All snow removed: Single image desnowing algorithm using hierarchical dual-tree complex wavelet representation and contradict channel loss,” Proc. IEEE/CVF Int. Conf. Comput. Vis., pp.4196-4205, 2021.

[32] W. Zou, M. Jiang, Y. Zhang, L. Chen, Z. Lu, and Y. Wu, “Sdwnet: A straight dilated network with wavelet transformation for image deblurring,” Proc. IEEE/CVF Int. Conf. Comput. Vis., pp.1895-1904, 2021.

[33] H. Yu, N. Zheng, M. Zhou, J. Huang, Z. Xiao, and F. Zhao, “Frequency and spatial dual guidance for image dehazing,” Proc. Eur. Conf. Comput. Vis., vol.13679, pp.181-198, 2022.

[34] B. Li, W. Ren, D. Fu, D. Tao, D. Feng, W. Zeng, and Z. Wang, “Benchmarking single-image dehazing and beyond,” IEEE Trans. Image Process., vol.28, no.1, pp.492-505, 2019.

Authors

Lihan TONG
  Jimei University
Weijia LI
  Jimei University
Qingxia YANG
  Jimei University
Liyuan CHEN
  Jimei University
Peng CHEN
  Jimei University
