1. Introduction
Single image dehazing [3], [5] aims to restore clear, high-quality images from hazy ones, which is essential for applications such as object detection [2] and semantic segmentation [1]. Traditional methods [6], [8], [28] often fail to produce ideal dehazing results because they cannot cover all scenarios [15]. With the rise of deep learning, convolutional neural networks (CNNs) [4], [5], [16] have been widely applied to image dehazing and have achieved good results. However, CNNs cannot capture long-range dependencies, which limits further improvement in dehazing quality. Recently, Transformers [17], [18], [20], [21] have been widely adopted in computer vision tasks because they can capture long-range dependencies, but their computational complexity grows quadratically with image resolution. Many efforts [21]–[24] have addressed this issue by introducing hand-crafted sparsity. However, such hand-designed sparsity is not related to the image content, causing some loss of information.
We propose Ksformer, which is composed of MKRA and LFPM. MKRA estimates queries in windows of different sizes and then uses a top-k operator to select the k most important queries. This approach improves computational efficiency and makes the attention content-aware, while the multi-scale windows handle haze of varying sizes. LFPM employs lightweight parameters to extract spectral features. The contributions of this work are summarized as follows:

Ksformer is content-aware, selecting key-value pairs that carry important information to minimize content loss, while also capturing long-range dependencies and reducing computational complexity.

Ksformer extracts spectral features with ultra-lightweight parameters, performing MKRA in both the spatial and frequency domains and then fusing the results, which narrows the gap between clean and hazy images in terms of both space and spectrum.

Ksformer achieves a PSNR of 39.4 and an SSIM of 0.994 with only 5.8M parameters, significantly outperforming other state-of-the-art methods.
2. Method
2.1 Image Dehazing
We use three encoders and three decoders with \(4 \times\) downsampling for a compact model. The Multi-scale Key-select Routing Attention Module (MKRAM) is applied only at the smaller feature dimensions to reduce computational complexity. To lower the difficulty of training [25], [26], we strengthen the exchange of information between layers and use skip connections at both the feature and image levels.
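The layout above can be sketched as a minimal three-level encoder-decoder with feature- and image-level skips. The block widths, activation choice, and exact placement of the attention module are assumptions for illustration; the text only specifies three encoders, three decoders, 4× downsampling, and skips at both levels.

```python
import torch
import torch.nn as nn

class EncoderDecoderSketch(nn.Module):
    """Illustrative 3-level encoder-decoder with feature- and image-level
    skip connections. Channel widths are hypothetical, not the paper's."""
    def __init__(self, ch=24):
        super().__init__()
        self.enc = nn.ModuleList([
            nn.Conv2d(3, ch, 3, stride=1, padding=1),           # full res
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1),      # 1/2 res
            nn.Conv2d(ch * 2, ch * 4, 3, stride=2, padding=1),  # 1/4 res
        ])
        self.dec = nn.ModuleList([
            nn.ConvTranspose2d(ch * 4, ch * 2, 2, stride=2),
            nn.ConvTranspose2d(ch * 2, ch, 2, stride=2),
            nn.Conv2d(ch, 3, 3, padding=1),
        ])

    def forward(self, x):
        feats = []
        h = x
        for e in self.enc:
            h = torch.relu(e(h))
            feats.append(h)
        # feature-level skip connections between encoder and decoder
        h = torch.relu(self.dec[0](h)) + feats[1]
        h = torch.relu(self.dec[1](h)) + feats[0]
        # image-level skip connection: predict a residual over the input
        return self.dec[2](h) + x

out = EncoderDecoderSketch()(torch.randn(1, 3, 64, 64))
```

The image-level skip makes the network predict a residual over the hazy input, which is a common way to ease training of restoration models.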
2.2 Multi-Scale Key-Select Routing Attention
MKRA uses a top-k operator to select the most important key-value pairs, balancing content awareness with low computational complexity. Given an input feature map \(X \in R^{H \times W \times C}\), we first divide it into four parts along the channel dimension, with window sizes of \(2 \times 2\), \(4 \times 4\), \(8 \times 8\), and \(64 \times 64\). Each part is then divided into \(S \times S\) non-overlapping regions, each containing \(\frac{H W}{S^{2}}\) feature vectors. After this step, \(X\) is reshaped into \(X^{r} \in R^{S^{2} \times \frac{H W}{S^{2}} \times \frac{C}{4}}\). We then apply linear projections to derive \(Q, K, V \in R^{S^{2} \times \frac{H W}{S^{2}} \times \frac{C}{4}}\).
\[\begin{equation*} Q=X^{r} W^{q}, K=X^{r} W^{k}, V=X^{r} W^{v}, \tag{1} \end{equation*}\] 
Here, \(W^{q}, W^{k}, W^{v} \in R^{\frac{C}{4} \times \frac{C}{4}}\) represent the projection weights for \(Q, K\), and \(V\), respectively. We construct an attention module to identify the regions where important key-value pairs are located. In simple terms, we average the feature vectors of each region to derive region-level queries and keys, \(Q_{r}, K_{r} \in R^{S^{2} \times \frac{C}{4}}\). We then derive the region-to-region importance association matrix using the following formula.
\[\begin{equation*} A_{r}^{n \times n}=\operatorname{Softmax}\left(\frac{Q_{r}\left(K_{r}\right)^{T}}{\sqrt{C / 4}}\right), \tag{2} \end{equation*}\] 
Here, \(A_{r}^{n \times n} \in R^{S^{2} \times S^{2}}\) represents the degree of association between each pair of regions, and \(n \times n\) denotes the window size. Next, we concatenate the \(A_{r}^{n \times n}\) of the four branches along the channel dimension to obtain \(A_{r}\). We then retain the top-\(k\) most important queries using the top-k operator and prune the association graph to derive the index matrix.
\[\begin{equation*} I_{r}=\operatorname{topk}\left(A_{r}\right), \tag{3} \end{equation*}\] 
Here, \(I_{r} \in R^{S^{2} \times k}\), so the \(i\)-th row of \(I_{r}\) contains the indices of the \(k\) regions most relevant to the \(i\)-th region. Using the importance index matrix \(I_{r}\), we can capture long-range dependencies, remain content-aware, and reduce computational complexity. Each query token in region \(i\) attends to all key-value pairs in the union of the \(k\) important regions indexed by \(I^{(i, 1)}_{r}, I^{(i, 2)}_{r}, \ldots, I^{(i, k)}_{r}\). We then collect the key and value tensors.
\[\begin{equation*} K^{g}=\operatorname{gather}\left(K, I_{r}\right), V^{g}=\operatorname{gather}\left(V, I_{r}\right), \tag{4} \end{equation*}\] 
Here, \(K^{g}, V^{g} \in R^{S^{2} \times \frac{k H W}{S^{2}} \times \frac{C}{4}}\). Finally, we apply attention over the collected key-value pairs.
\[\begin{equation*} \text {Output }=\operatorname{Attention}\left(Q, K^{g}, V^{g}\right). \tag{5} \end{equation*}\] 
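Eqs. (1)-(5) for a single branch (one of the four channel splits) can be sketched as follows. The region partitioning via `view`/`permute`, the use of per-region mean pooling for \(Q_r, K_r\), and the `gather` implementation are assumptions consistent with the equations, not the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mkra_branch(x, S, k):
    """Sketch of one MKRA branch following Eqs. (1)-(5): partition the map
    into S x S regions, project Q/K/V, route each region to its k most
    relevant regions, and attend only over the gathered key-value pairs.
    x: (B, H, W, C4) with C4 = C/4 for this branch; H, W divisible by S."""
    B, H, W, C4 = x.shape
    M = (H * W) // (S * S)                        # tokens per region

    # reshape into S^2 non-overlapping regions of M tokens each
    xr = x.view(B, S, H // S, S, W // S, C4)
    xr = xr.permute(0, 1, 3, 2, 4, 5).reshape(B, S * S, M, C4)
    Wq, Wk, Wv = (nn.Linear(C4, C4, bias=False) for _ in range(3))
    Q, K, V = Wq(xr), Wk(xr), Wv(xr)              # Eq. (1)

    # region-level routing, Eqs. (2)-(3)
    Qr, Kr = Q.mean(dim=2), K.mean(dim=2)         # (B, S^2, C4)
    Ar = F.softmax(Qr @ Kr.transpose(-2, -1) / C4 ** 0.5, dim=-1)
    Ir = Ar.topk(k, dim=-1).indices               # (B, S^2, k)

    # gather the keys/values of the k routed regions, Eq. (4)
    R = S * S
    idx = Ir[..., None, None].expand(B, R, k, M, C4)
    Kg = K.unsqueeze(1).expand(B, R, R, M, C4).gather(2, idx).reshape(B, R, k * M, C4)
    Vg = V.unsqueeze(1).expand(B, R, R, M, C4).gather(2, idx).reshape(B, R, k * M, C4)

    # token-level attention restricted to the gathered pairs, Eq. (5)
    attn = F.softmax(Q @ Kg.transpose(-2, -1) / C4 ** 0.5, dim=-1)
    return attn @ Vg                              # (B, S^2, M, C4)

out = mkra_branch(torch.randn(1, 32, 32, 8), S=4, k=4)
```

Because each query region attends to only \(k\) of the \(S^2\) regions, the token-level attention cost drops from \(O((HW)^2)\) to \(O(HW \cdot \frac{kHW}{S^2})\), while the routing step keeps the selection content-dependent.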
3. Lightweight Frequency Processing Module
Past approaches often employed wavelet [29]–[32] or Fourier [33] transformations to divide image features into multiple frequency sub-bands. These approaches increased the computational load of the inverse transformation and did not boost key frequency components. To address this, we add a lightweight module for spectral feature extraction and modulation. It efficiently splits the spectrum into different frequency bands and uses a small number of learnable parameters to emphasize the most informative ones.
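A minimal sketch of such a lightweight frequency module is shown below, assuming a centered low/high-frequency split of the 2D Fourier spectrum and two learnable scalars per channel as the modulation weights. The cutoff ratio and band layout are illustrative; the text does not specify Ksformer's exact design.

```python
import torch
import torch.nn as nn

class LFPMSketch(nn.Module):
    """Hedged sketch of a lightweight frequency-processing module: split the
    spectrum into low/high bands with a centered radial mask and re-weight
    each band with a handful of learnable per-channel parameters."""
    def __init__(self, channels, cutoff_ratio=0.25):
        super().__init__()
        self.cutoff = cutoff_ratio
        # ultra-lightweight: only 2 scalars per channel
        self.w_low = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.w_high = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):
        B, C, H, W = x.shape
        # centered spectrum over the spatial dims
        spec = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
        # radial low-frequency mask around the spectrum center
        yy = torch.arange(H, device=x.device, dtype=torch.float32) - H // 2
        xx = torch.arange(W, device=x.device, dtype=torch.float32) - W // 2
        r = (yy[:, None] ** 2 + xx[None, :] ** 2).sqrt()
        low = r <= self.cutoff * min(H, W)
        # per-channel re-weighting of the two bands
        spec = spec * low * self.w_low + spec * (~low) * self.w_high
        return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1)),
                               norm="ortho").real

m = LFPMSketch(channels=8)
y = m(torch.randn(2, 8, 32, 32))
```

With this layout the module costs only \(2C\) learnable parameters per instance, which is what makes the frequency branch essentially free in parameter terms.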
4. Multi-Scale Key-Select Routing Attention Module
As shown in Fig. 2, to improve model efficiency, our MKRAM module processes the spatial domain, low frequencies, and high frequencies in parallel and then fuses the three outputs.
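The three-branch layout can be sketched as follows; since the fusion operator is not specified in the text, a 1×1 convolution over the concatenated branch outputs is assumed here for illustration.

```python
import torch
import torch.nn as nn

# Illustrative fusion of the three MKRAM branches (spatial domain,
# low frequencies, high frequencies), each producing a feature map of the
# same shape. The branches here are stand-in tensors.
C = 16
spatial = torch.randn(1, C, 32, 32)
low_freq = torch.randn(1, C, 32, 32)
high_freq = torch.randn(1, C, 32, 32)

# assumed fusion: concatenate along channels, then project back with 1x1 conv
fuse = nn.Conv2d(3 * C, C, kernel_size=1)
fused = fuse(torch.cat([spatial, low_freq, high_freq], dim=1))
```

Running the three branches in parallel means the frequency processing adds latency only up to the slowest branch rather than stacking sequentially.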
5. Experiments
5.1 Implementation Details
During our experiments, we use PyTorch 1.11.0 and four NVIDIA RTX 4090 GPUs for all tests. In the training phase, images are randomly cropped into 320 \(\times\) 320 pixel patches. For assessing the model's computational complexity, we adopt a size of 128 \(\times\) 128 pixels. We use the Adam optimizer with decay rates of 0.9 for \(\beta_1\) and 0.999 for \(\beta_2\). The initial learning rate is 0.00015, scheduled with a cosine annealing strategy. The batch size is 64. We empirically set the penalty parameters \(\lambda\) to 0.2 and \(\gamma\) to 0.25, and train for 80,000 iterations.
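The optimizer and scheduler settings above translate directly into PyTorch as follows. The model and loss are placeholders; the paper's loss function (where \(\lambda\) and \(\gamma\) enter) is not reproduced here.

```python
import torch

# Placeholder model standing in for Ksformer.
model = torch.nn.Conv2d(3, 3, 3, padding=1)

# Adam with the reported betas and initial learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1.5e-4, betas=(0.9, 0.999))
# Cosine annealing over the full 80,000-iteration schedule.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80_000)

for step in range(3):  # 80,000 iterations in the paper; 3 here for brevity
    batch = torch.randn(4, 3, 320, 320)   # random 320x320 crops (batch 64 in the paper)
    loss = model(batch).abs().mean()      # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```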
5.2 Quantitative and Qualitative Experiments
Visual Comparison. To thoroughly assess our method, we tested it on both the synthetic Haze4K [27] dataset and the real-world RTTS [34] dataset. As shown in Fig. 3 and Fig. 4, our method clearly outperforms the others in edge sharpness, color fidelity, clarity of texture details, and handling of sky regions, on both synthetic and real datasets.

Quantitative Comparison. We quantitatively compared Ksformer with current state-of-the-art methods on the SOTS indoor [34] and Haze4K [27] datasets. As shown in Table 1, on the SOTS indoor [34] dataset, Ksformer achieves a PSNR of 39.40 and an SSIM of 0.994, a 0.09 PSNR improvement over the second-best method, with only \(30\%\) of its parameter count. On the Haze4K [27] dataset, Ksformer reaches a PSNR of 33.74 and an SSIM of 0.98. These quantitative comparisons demonstrate that Ksformer outperforms other state-of-the-art methods.
5.3 Ablation Study
To verify the effectiveness of our method, we conducted an ablation study. We first built a U-Net as the base network and then gradually added our modules to this baseline. As shown in Table 2, both PSNR and SSIM improve as the modules are added step by step, and the metrics reach their best values when the proposed modules are combined.
6. Conclusion
This paper introduces Ksformer, which combines a top-k operator with multi-scale windows, giving the network content awareness and low complexity. At the same time, it extracts spectral features with ultra-lightweight parameters, narrowing the spectral gap between clean and hazy images. On the SOTS indoor [34] dataset, it achieves a PSNR of 39.4 and an SSIM of 0.994 with only 5.8M parameters.
Although Ksformer has a relatively small parameter count of just 5.8 million, its high GFLOPs currently prevent deployment on embedded systems. We plan to further explore the balance between performance and computational complexity; by appropriately reducing the number of channels and modules, we aim to make Ksformer suitable for embedded systems, allowing it to serve a broader range of applications.
Acknowledgments
This work was supported in part by the Youth Science and Technology Innovation Program of Xiamen Ocean and Fisheries Development Special Funds (23ZHZB039QCB24), Xiamen Ocean and Fisheries Development Special Funds (22CZB013HJ04).
References
[1] S. Hao, Y. Zhou, and Y. Guo, “A brief survey on semantic segmentation with deep learning,” Neurocomputing, vol.406, pp.302-321, 2020.
[2] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” Proc. IEEE, vol.111, no.3, pp.257-276, 2023.
[3] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng, “AOD-Net: All-in-one dehazing network,” Proc. IEEE Int. Conf. Comput. Vis., pp.4770-4778, 2017.
[4] X. Qin, Z. Wang, Y. Bai, X. Xie, and H. Jia, “FFA-Net: Feature fusion attention network for single image dehazing,” Proc. AAAI Conf. Artif. Intell., vol.34, no.7, pp.11908-11915, 2020.
[5] H. Wu, Y. Qu, S. Lin, J. Zhou, R. Qiao, Z. Zhang, Y. Xie, and L. Ma, “Contrastive learning for compact single image dehazing,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.10551-10560, 2021.
[6] K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” IEEE Trans. Pattern Anal. Mach. Intell., vol.33, no.12, pp.2341-2353, 2011.
[7] X. Liu, Y. Ma, Z. Shi, and J. Chen, “GridDehazeNet: Attention-based multi-scale network for image dehazing,” Proc. IEEE/CVF Int. Conf. Comput. Vis., pp.7314-7323, 2019.
[8] D. Berman, T. Treibitz, and S. Avidan, “Non-local image dehazing,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp.1674-1682, 2016.
[9] H. Dong, J. Pan, L. Xiang, Z. Hu, X. Zhang, F. Wang, and M.H. Yang, “Multi-scale boosted dehazing network with dense feature fusion,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.2157-2167, 2020.
[10] T. Ye, M. Jiang, Y. Zhang, L. Chen, E. Chen, P. Chen, and Z. Lu, “Perceiving and modeling density is all you need for image dehazing,” arXiv preprint arXiv:2111.09733, 2021.
[11] B. Li, W. Ren, D. Fu, D. Tao, D. Feng, W. Zeng, and Z. Wang, “Benchmarking single-image dehazing and beyond,” IEEE Trans. Image Process., vol.28, no.1, pp.492-505, 2019.
[12] Y. Liu, L. Zhu, S. Pei, H. Fu, J. Qin, Q. Zhang, L. Wan, and W. Feng, “From synthetic to real: Image dehazing collaborating with unlabeled real data,” Proc. 29th ACM Int. Conf. Multimedia, pp.50-58, 2021.
[13] M. Hong, Y. Xie, C. Li, and Y. Qu, “Distilling image dehazing with heterogeneous task imitation,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.3462-3471, 2020.
[14] C. Guo, Q. Yan, S. Anwar, R. Cong, W. Ren, and C. Li, “Image dehazing transformer with transmission-aware 3D position embedding,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.5812-5820, 2022.
[15] K. Zhang, W. Ren, W. Luo, W.S. Lai, B. Stenger, M.H. Yang, and H. Li, “Deep image deblurring: A survey,” Int. J. Comput. Vis., vol.130, no.9, pp.2103-2130, 2022.
[16] Y. Cui, Y. Tao, Z. Bing, W. Ren, X. Gao, X. Cao, K. Huang, and A. Knoll, “Selective frequency network for image restoration,” Proc. 11th Int. Conf. Learn. Represent., 2022.
[17] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran, “Image transformer,” Proc. Int. Conf. Mach. Learn., PMLR, pp.4055-4064, 2018.
[18] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao, “Pre-trained image processing transformer,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.12299-12310, 2021.
[19] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “SwinIR: Image restoration using Swin transformer,” Proc. IEEE/CVF Int. Conf. Comput. Vis., pp.1833-1844, 2021.
[20] J. Ke, Q. Wang, Y. Wang, P. Milanfar, and F. Yang, “MUSIQ: Multi-scale image quality transformer,” Proc. IEEE/CVF Int. Conf. Comput. Vis., pp.5148-5157, 2021.
[21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” Proc. IEEE/CVF Int. Conf. Comput. Vis., pp.10012-10022, 2021.
[22] Z. Wang, X. Cun, J. Bao, W. Zhou, J. Liu, and H. Li, “Uformer: A general U-shaped transformer for image restoration,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.17683-17693, 2022.
[23] S.W. Zamir, A. Arora, S. Khan, M. Hayat, F.S. Khan, and M.H. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.5728-5739, 2022.
[24] Y. Qiu, K. Zhang, C. Wang, W. Luo, H. Li, and Z. Jin, “MB-TaylorFormer: Multi-branch efficient transformer expanded by Taylor formula for image dehazing,” Proc. IEEE/CVF Int. Conf. Comput. Vis., pp.12802-12813, 2023.
[25] X. Mao, Y. Liu, W. Shen, Q. Li, and Y. Wang, “Deep residual Fourier transformation for single image deblurring,” arXiv preprint arXiv:2111.11745, 2021.
[26] Z. Tu, H. Talebi, H. Zhang, F. Yang, P. Milanfar, A. Bovik, and Y. Li, “MAXIM: Multi-axis MLP for image processing,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp.5769-5780, 2022.
[27] Y. Liu, L. Zhu, S. Pei, H. Fu, J. Qin, Q. Zhang, L. Wan, and W. Feng, “From synthetic to real: Image dehazing collaborating with unlabeled real data,” Proc. 29th ACM Int. Conf. Multimedia, pp.50-58, 2021.
[28] Q. Zhu, J. Mai, and L. Shao, “Single image dehazing using color attenuation prior,” Proc. BMVC, pp.1-10, Citeseer, 2014.
[29] I.W. Selesnick, R.G. Baraniuk, and N.C. Kingsbury, “The dual-tree complex wavelet transform,” IEEE Signal Process. Mag., vol.22, no.6, pp.123-151, 2005.
[30] H.H. Yang and Y. Fu, “Wavelet U-net and the chromatic adaptation transform for single image dehazing,” Proc. IEEE Int. Conf. Image Process. (ICIP), pp.2736-2740, 2019.
[31] W.T. Chen, H.Y. Fang, C.L. Hsieh, C.C. Tsai, I.H. Chen, J.J. Ding, and S.Y. Kuo, “All snow removed: Single image desnowing algorithm using hierarchical dual-tree complex wavelet representation and contradict channel loss,” Proc. IEEE/CVF Int. Conf. Comput. Vis., pp.4196-4205, 2021.
[32] W. Zou, M. Jiang, Y. Zhang, L. Chen, Z. Lu, and Y. Wu, “SDWNet: A straight dilated network with wavelet transformation for image deblurring,” Proc. IEEE/CVF Int. Conf. Comput. Vis., pp.1895-1904, 2021.
[33] H. Yu, N. Zheng, M. Zhou, J. Huang, Z. Xiao, and F. Zhao, “Frequency and spatial dual guidance for image dehazing,” Proc. Eur. Conf. Comput. Vis., vol.13679, pp.181-198, 2022.
[34] B. Li, W. Ren, D. Fu, D. Tao, D. Feng, W. Zeng, and Z. Wang, “Benchmarking single-image dehazing and beyond,” IEEE Trans. Image Process., vol.28, no.1, pp.492-505, 2019.