
IEICE TRANSACTIONS on Information

Open Access
Channel Pruning via Improved Grey Wolf Optimizer Pruner

Xueying WANG, Yuan HUANG, Xin LONG, Ziji MA


Summary:

In recent years, the increasing complexity of deep network structures has hindered their application on small, resource-constrained hardware. There is therefore an urgent need to compress and accelerate deep network models. Channel pruning is an effective method for compressing deep neural networks. However, most existing channel pruning methods are prone to falling into local optima. In this paper, we propose a channel pruning method based on an Improved Grey Wolf Optimizer Pruner, called IGWO-Pruner, to prune redundant channels of convolutional neural networks. It identifies the pruning ratio of each layer using the Improved Grey Wolf algorithm and then fine-tunes the pruned network model. In the experimental section, we evaluate the proposed method on the CIFAR datasets and ILSVRC-2012 with several classical networks, including VGGNet, GoogLeNet, and ResNet-18/34/56/152. The experimental results demonstrate that the proposed method prunes a large number of redundant channels and parameters with little performance loss.

Publication
IEICE TRANSACTIONS on Information Vol.E107-D No.7 pp.894-897
Publication Date
2024/07/01
Publicized
2024/03/07
Online ISSN
1745-1361
DOI
10.1587/transinf.2024EDL8007
Type of Manuscript
LETTER
Category
Fundamentals of Information Systems

1.  Introduction

In recent years, deep learning, especially deep neural networks, has played an important role in many aspects of society [1], [2]. However, the increasing demands of deep networks on computing power and memory footprint have hindered their application on small, resource-constrained hardware. Considerable effort has been devoted to this problem, including compact architecture design [3]-[5], parameter decomposition [6], knowledge distillation [7]-[9], quantization [10]-[12], and pruning [13]-[16]. Among them, channel pruning is considered one of the most effective methods for model compression and, unlike the other approaches, is easy to implement for convolutional neural networks.

The goal of channel pruning is to compress the number of channels in each layer of the original structure while minimizing the accuracy degradation, or even improving the accuracy, of the overall network. In this paper, we propose a novel channel pruning method based on the Improved Grey Wolf algorithm [17], called IGWO-Pruner, to prune redundant channels of convolutional neural networks. It identifies the pruning ratio of each layer using the Improved Grey Wolf algorithm and then fine-tunes the pruned network to obtain the final compact model. The flow chart of the proposed algorithm is shown in Fig. 1.

Fig. 1  Pipeline of IGWO-Pruner.

In Fig. 1, Adaptive Batch Normalization (Adaptive BN) is taken from Ref. [18], which proposed a method that can quickly estimate the performance of pruned models through Adaptive BN. In Fig. 1, the circles represent channel elements; different colors indicate that the elements have different pruning ratios, which affects whether they are pruned and thus shapes the pruning strategy.
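As an illustration (not code from the paper), the Adaptive BN step of [18] can be sketched in PyTorch roughly as follows: the running statistics of the pruned model's BN layers are re-estimated on a few calibration batches before the model is evaluated.

```python
import torch

def adaptive_bn_recalibrate(model, calib_loader, num_batches=100, device="cuda"):
    """Re-estimate BatchNorm running statistics of a pruned model by
    forwarding a few calibration batches (following the idea of [18])."""
    model.to(device)
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # use a cumulative moving average
    model.train()  # BN layers update running stats only in train mode
    with torch.no_grad():
        for i, (images, _) in enumerate(calib_loader):
            if i >= num_batches:
                break
            model(images.to(device))
    model.eval()
    return model
```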

2.  The Proposed IGWO-Pruner

In this section, we present our channel pruning method, called the Improved Grey Wolf Optimizer Pruner (IGWO-Pruner). The pipeline is shown in Fig. 1. Assume a general deep convolutional network S with n layers; its original structure is represented as \(S=[c_{1},c_{2},\cdots ,c_{n}]\), where \(c_{i}\) (\(i = 1, \ldots, n\)) denotes the number of channels in the i-th layer of the network.

2.1  Description of Deep Network Pruning Problem

The network structure obtained after pruning the original network is \(S'=[c_{1}',c_{2}',\cdots ,c_{n}']\), where \(c_{i}' \leq c_{i}\). The pruning rate of each layer is \(R=[r_{1},r_{2},\cdots ,r_{n}]\), where \(r_{i}=c_{i}'/c_{i}\). The optimization problem is to find the pruning rate r that yields the best inference accuracy of the pruned network on a given training and test set. This problem can be summarized as follows:

\[\begin{align} R^{\ast} = \mathop{\arg\max}_{r}\, acc(S'(r)) \tag{1} \end{align}\]

where \(acc(S'(r))\) denotes the inference accuracy obtained after pruning each layer of the model with pruning rate r and applying adaptive batch normalization. However, obtaining the optimal solution of the above equation involves high-dimensional optimization and a very complex process. To reduce the search cost, the proposed method constrains each pruning rate \(r_{i}\) (the pruning rate of the \(i\)-th layer, expressed in percent) to the set {0, 10%, 20%, …, 100%}, turning the problem into a combinatorial search for the optimal combination. This constraint greatly reduces the number of candidate pruning structures, so the search set contains \(11^{n}\) elements, making the solution of Eq. (1) more efficient. To solve this optimization problem, we use an Improved Grey Wolf algorithm to search for the optimal pruning rate.
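For illustration, the fitness \(acc(S'(r))\) of Eq. (1) under the 10% grid constraint might be organized as in the following sketch; the pruning, recalibration, and evaluation callables are placeholders for the corresponding pipeline steps, not code from the paper.

```python
import copy

# Candidate per-layer pruning rates are restricted to a 10% grid, so the
# search space of Eq. (1) has 11**n elements for an n-layer network.
RATE_GRID = [i / 10 for i in range(11)]  # {0.0, 0.1, ..., 1.0}

def fitness(original_model, rates, prune_by_rates, recalibrate_bn, evaluate_accuracy):
    """Fitness acc(S'(r)) of Eq. (1): prune a copy of the model with the
    per-layer rates, recalibrate its BN statistics, and measure accuracy.
    The three callables are hypothetical placeholders for the pipeline steps."""
    pruned = prune_by_rates(copy.deepcopy(original_model), rates)
    pruned = recalibrate_bn(pruned)
    return evaluate_accuracy(pruned)
```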

2.2  Pruning Algorithm Process

To avoid the local optima that previous channel pruning algorithms based on channel importance tend to fall into, we propose an automatic search pruning algorithm that considers network structure pruning from the perspective of the entire network. We use the Improved Grey Wolf algorithm to search for the optimal pruning rate of each convolutional layer. The search process of the algorithm is shown in Fig. 2.

Fig. 2  Diagram of the Improved Grey Wolf algorithm search process.

Search initialization: Because the Improved Grey Wolf algorithm is not very sensitive to initial values, we use random numbers to generate the initial search population:

\[\begin{align} X_{i,j}\sim U(lb,ub) \tag{2} \end{align}\]

where \(X_{i,j}\) represents the gray wolf population involved in the search, \(i \in \{1,2,\ldots,N\}\) indexes the gray wolves, j is the population dimension (the number of network layers), U denotes the uniform distribution, and \(lb\) and \(ub\) are the lower and upper bounds of the search interval (the search space is {0%, 10%, 20%, …, 100%}).
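A minimal sketch of the initialization in Eq. (2) could look as follows; snapping the sampled positions onto the 10% grid is our reading of the search-space constraint.

```python
import numpy as np

def init_population(num_wolves, num_layers, lb=0.0, ub=1.0, rng=None):
    """Eq. (2): draw each wolf position X_{i,j} ~ U(lb, ub) and snap it
    to the 10% pruning-rate grid {0.0, 0.1, ..., 1.0}."""
    rng = rng or np.random.default_rng()
    positions = rng.uniform(lb, ub, size=(num_wolves, num_layers))
    return np.round(positions * 10) / 10  # quantize to the search grid
```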

Search process: All gray wolves approach their prey as follows, which leads toward an optimal solution:

\[\begin{align} & D=\vert C\cdot X_{p}(t)-X(t)\vert \tag{3} \\ & X(t+1)=X_{p}(t)-A\cdot D \tag{4} \end{align}\]

where \(D\) represents the distance between the gray wolf and its prey, t is the current iteration number, \(X_{p}(t)\) and \(X(t)\) are the prey position and the wolf position, and A and C are adjustment coefficients calculated as follows:

\[\begin{align} & A=2a\cdot r_{1}-a \tag{5} \\ & C=2\cdot r_{2} \tag{6} \end{align}\]

where \(r_{1}\) and \(r_{2}\) are random vectors with components between 0 and 1, \(a=2e^{-t/T}\), and T is the maximum number of iterations set by the algorithm.
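For reference, Eqs. (5)-(6) with the decaying control parameter \(a=2e^{-t/T}\) can be sketched as follows (the function name is ours):

```python
import numpy as np

def gwo_coefficients(t, T, dim, rng):
    """Eqs. (5)-(6): A = 2*a*r1 - a and C = 2*r2, with the exponentially
    decaying control parameter a = 2*exp(-t/T) of the improved GWO."""
    a = 2.0 * np.exp(-t / T)
    r1 = rng.random(dim)
    r2 = rng.random(dim)
    A = 2.0 * a * r1 - a
    C = 2.0 * r2
    return A, C
```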

Parameter update: The position vector of each gray wolf gives the pruning rate of every network layer, and the L1 norm of each channel in each layer is computed. Channels with lower L1-norm values in each layer of the network are pruned. The pruned model is then processed with the adaptive batch normalization method to update its batch normalization layers. The resulting inference accuracy is used as the fitness of each gray wolf individual: the best individual is marked as \(W\), the second-best as \(Y\), and the third-best as \(Z\); the remaining individuals are marked as \(V\).
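A possible sketch of the per-layer L1-norm channel selection described above is given below; rebuilding the pruned convolution and propagating the kept indices to the following layer are omitted.

```python
import torch

def select_channels_to_keep(conv_weight, prune_rate):
    """Rank the output channels of a Conv2d weight (out, in, kH, kW) by their
    L1 norm and keep the top (1 - prune_rate) fraction, as described above."""
    num_out = conv_weight.size(0)
    num_keep = max(1, int(round(num_out * (1.0 - prune_rate))))
    l1 = conv_weight.abs().sum(dim=(1, 2, 3))          # L1 norm per output channel
    keep_idx = torch.argsort(l1, descending=True)[:num_keep]
    return torch.sort(keep_idx).values                  # keep original channel order
```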

During the update, candidate gray wolves move toward the leading individuals by computing the distances \(D_{W}\), \(D_{Y}\), and \(D_{Z}\) to gray wolves \(W\), \(Y\), and \(Z\), respectively. The relevant formulas are as follows:

\[\begin{align} & D_{W}=\vert C_{1}\cdot X_{W}-X(t)\vert \tag{7} \\ & D_{Y}=\vert C_{2}\cdot X_{Y}-X(t)\vert \tag{8} \\ & D_{Z}=\vert C_{3}\cdot X_{Z}-X(t)\vert \tag{9} \\ & X(t+1)=(1/3)( (X_{W}-A_{1}D_{W})+(X_{Y}-A_{2}D_{Y}) \nonumber\\ & \hphantom{X(t+1)={}} +(X_{Z}-A_{3}D_{Z})) (1-t/T) \nonumber\\ & \hphantom{X(t+1)={}} +(X_{W}-A_{1}D_{W})\cdot (t/T) \tag{10} \end{align}\]

where \(X(t)\) represents the current candidate gray wolf position, \(X(t+1)\) is its next position, and \(X_{W}\), \(X_{Y}\), and \(X_{Z}\) represent the current positions of wolves W, Y, and Z, respectively. \(A_{1}\), \(A_{2}\), \(A_{3}\) and \(C_{1}\), \(C_{2}\), \(C_{3}\), like \(A\) and \(C\) in the search process above, are random coefficients.
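Putting Eqs. (7)-(10) together, the position update of a candidate wolf might be implemented as in the following sketch; clipping and re-quantizing the new position to the pruning-rate grid is our assumption.

```python
import numpy as np

def update_position(X, X_W, X_Y, X_Z, t, T, rng):
    """Eqs. (7)-(10): move a candidate wolf towards the three leading wolves,
    shifting weight from their average to the best wolf W as t approaches T."""
    def coeffs():
        a = 2.0 * np.exp(-t / T)                 # decaying control parameter
        return 2.0 * a * rng.random(X.shape) - a, 2.0 * rng.random(X.shape)
    A1, C1 = coeffs()
    A2, C2 = coeffs()
    A3, C3 = coeffs()
    X1 = X_W - A1 * np.abs(C1 * X_W - X)         # Eq. (7) folded into the move
    X2 = X_Y - A2 * np.abs(C2 * X_Y - X)         # Eq. (8)
    X3 = X_Z - A3 * np.abs(C3 * X_Z - X)         # Eq. (9)
    X_new = (X1 + X2 + X3) / 3.0 * (1.0 - t / T) + X1 * (t / T)   # Eq. (10)
    # Clip back into [0, 1] and snap to the 10% pruning-rate grid (our assumption).
    return np.round(np.clip(X_new, 0.0, 1.0) * 10) / 10
```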

2.3  Channel Pruning and Fine-Tuning

We add an L1-norm penalty on the weights to the loss function. From the perspective of the optimization objective, the L1 norm drives most of the weights toward 0, which makes the weights of network channels sparse so that the corresponding channels can be pruned. The objective function is:

\[\begin{align} \mathit{Loss} = \mathit{Loss} + \gamma \sum\nolimits_{w\in K} \| w \|_{1} \tag{11} \end{align}\]

where Loss is the standard loss function of the deep network, K is the set of network weights, w is an element of K, \(\| \bullet \|_1\) is the L1 norm, and \(\gamma\) is the penalty factor. Pruning additional channels leads to some accuracy loss when the pruning ratio is very high. As shown in the experimental section, this loss can be largely recovered by a fine-tuning process that requires relatively few training epochs and little time.
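For illustration, Eq. (11) corresponds to adding an L1 penalty to the task loss; in the sketch below the penalty is restricted to convolutional weight tensors and the value of \(\gamma\) is illustrative, neither choice is specified by the paper.

```python
import torch

def l1_regularized_loss(base_loss, model, gamma=1e-4):
    """Eq. (11): add an L1 penalty on the (here: convolutional) weights to
    the task loss so that channel weights are driven towards sparsity.
    gamma is an illustrative value, not taken from the paper."""
    l1_penalty = sum(p.abs().sum()
                     for p in model.parameters()
                     if p.dim() == 4)   # 4-D tensors = Conv2d weights
    return base_loss + gamma * l1_penalty
```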

3.  Experiments

In this section, we evaluate the effectiveness of the proposed method on several representative networks and datasets. We implement our method in PyTorch.

3.1  Implementation Details

In this paper, we conduct experiments on CIFAR-10 and ILSVRC-2012 [19]. The standard data augmentation strategy of [20] is adopted. For network architectures, we evaluate the proposed method on several frequently used networks: VGGNet, GoogLeNet, and ResNet-18/34/56/152. During training, we fine-tune with stochastic gradient descent (SGD) using momentum 0.9 and a batch size of 256. We also introduce several evaluation metrics used in the following parts, namely the channel number, FLOPs (floating-point operations), and the number of parameters, which measure the degree of network pruning and compression.
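A sketch of this fine-tuning setup in PyTorch, with the stated momentum and batch size, could look as follows; the learning rate, weight decay, and number of workers are illustrative assumptions.

```python
import torch
from torch.utils.data import DataLoader

def make_finetune_setup(pruned_model, train_set):
    """SGD fine-tuning with momentum 0.9 and batch size 256 as stated above;
    lr, weight_decay, and num_workers are illustrative, not from the paper."""
    optimizer = torch.optim.SGD(pruned_model.parameters(),
                                lr=0.01, momentum=0.9, weight_decay=1e-4)
    loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=4)
    return optimizer, loader
```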

3.2  Results and Discussions

CIFAR-10: We conduct experiments on CIFAR-10 with three deep networks: VGGNet, GoogLeNet, and ResNet-56. The results are shown in Table 1.

Table 1  Accuracy and pruning results on CIFAR-10.

As shown in Table 1, the proposed method achieves significant pruning of channel numbers, parameters, and computational complexity with minimal accuracy degradation.

ILSVRC-2012: We further conduct experiments on ILSVRC-2012 with several deep networks, including ResNet-18/34/152. The results are shown in Table 2.

Table 2  Accuracy and pruning results on ILSVRC-2012.

As shown in Table 2, ILSVRC-2012 is a large-scale dataset with 1,000 categories and is much more complex than CIFAR-10, which has only 10 categories, so the performance drops on ILSVRC-2012 are larger than those on CIFAR-10. On the other hand, the proposed method obtains higher pruning rates and smaller accuracy drops as the network depth increases.

Comparison with Other Methods: Reference [21] provides a good review and summary of deep neural network pruning; from [21] we selected several methods ([22]-[24]) that are close to our application field and pruning approach for performance comparison. In addition, we selected several state-of-the-art methods ([25]-[28]) for comparison. The results in Table 3 show that IGWO-Pruner obtains a larger FLOPs reduction or a smaller accuracy loss, suggesting that the proposed channel pruning method provides a good tradeoff between model size and performance.

Table 3  Performance comparison with state-of-the-art methods.

4.  Conclusions

In this paper, we propose a novel channel pruning method based on the improved Grey Wolf algorithm, called IGWO-Pruner, to prune redundant channels of convolutional neural networks. It identifies a proper pruning ratio for each layer with an intelligent search algorithm and then fine-tunes the pruned network model to compensate for the accuracy loss. Experimental results show that the proposed method achieves better pruning results than existing pruning methods.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62207030.

References

[1] A. Krizhevsky, I. Sutskever, and G.E. Hinton, “ImageNet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, pp.1097-1105, 2012.

[2] L. Yang, Y. Wang, X. Xiong, J. Yang, and A.K. Katsaggelos, “Efficient video object segmentation via network modulation,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp.6499-6507, 2018.

[3] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “ShuffleNet V2: Practical guidelines for efficient CNN architecture design,” Computer Vision - ECCV 2018, ed. V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, pp.122-138, Springer International Publishing, Cham, 2018.

[4] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient convolutional neural network for mobile devices,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp.6848-6856, 2018.

[5] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp.4510-4520, 2018.

[6] X. Yu, T. Liu, X. Wang, and D. Tao, “On compressing deep models by low rank and sparse decomposition,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pp.67-76, 2017.

[7] S. Zhou, Y. Wang, D. Chen, J. Chen, X. Wang, C. Wang, and J. Bu, “Distilling holistic knowledge with graph neural networks,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, pp.10367-10376, 2021.

[8] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, “Face model compression by distilling knowledge from neurons,” Proc. Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, vol.30, no.1, pp.3560-3566, 2016.

[9] L. Lu, M. Guo, and S. Renals, “Knowledge distillation for small-footprint highway networks,” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, pp.4820-4824, 2017.

[10] M. Courbariaux, Y. Bengio, and J. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” Proc. 28th International Conference on Neural Information Processing Systems - Volume 2, Montreal, Canada, pp.3123-3131, 2015.

[11] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” Computer Vision - ECCV 2016, ed. B. Leibe, J. Matas, N. Sebe, and M. Welling, pp.525-542, Springer International Publishing, Cham, 2016.

[12] Z. Li, B. Ni, W. Zhang, X. Yang, and W. Gao, “Performance guaranteed network acceleration via high-order residual quantization,” 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp.2603-2611, 2017.

[13] S. Han, J. Pool, J. Tran, and W.J. Dally, “Learning both weights and connections for efficient neural networks,” Proc. 28th International Conference on Neural Information Processing Systems - Volume 1, Cambridge, MA, USA, pp.1135-1143, 2015.

[14] J. Guo and M. Potkonjak, “Pruning ConvNets online for efficient specialist models,” 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, pp.430-437, 2017.

[15] X. Long, Z. Ben, X. Zeng, Y. Liu, M. Zhang, and D. Zhou, “Learning sparse convolutional neural networks via quantization with low rank regularization,” IEEE Access, vol.7, pp.51866-51876, 2019.

[16] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp.2755-2763, 2017.

[17] S. Mirjalili, S.M. Mirjalili, and A. Lewis, “Grey Wolf optimizer,” Adv. Eng. Softw., vol.69, pp.46-61, 2014.

[18] B. Li, B. Wu, J. Su, and G. Wang, “EagleEye: Fast sub-net evaluation for efficient neural network pruning,” Computer Vision - ECCV 2020, ed. A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, pp.639-654, Springer International Publishing, Cham, 2020.

[19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol.115, no.3, pp.211-252, 2015.

[20] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H.P. Graf, “Pruning filters for efficient ConvNets,” arXiv preprint arXiv:1608.08710, 2016.

[21] S. Vadera and S. Ameen, “Methods for pruning deep neural networks,” IEEE Access, vol.10, pp.63280-63300, 2022.

[22] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp.1398-1406, 2017.

[23] Z. Huang and N. Wang, “Data-driven sparse structure selection for deep neural networks,” Computer Vision - ECCV 2018, ed. V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, pp.317-334, Springer International Publishing, Cham, 2018.

[24] Z. Liu, H. Mu, X. Zhang, Z. Guo, X. Yang, K.-T. Cheng, and J. Sun, “MetaPruning: Meta learning for automatic neural network channel pruning,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), pp.3295-3304, 2019.

[25] Z. Hou, M. Qin, F. Sun, X. Ma, K. Yuan, Y. Xu, Y.-K. Chen, R. Jin, Y. Xie, and S.-Y. Kung, “CHEX: CHannel EXploration for CNN model compression,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp.12277-12288, 2022.

[26] M. Lin, R. Ji, Y. Zhang, B. Zhang, Y. Wu, and Y. Tian, “Channel pruning via automatic structure search,” Proc. 29th Int. Joint Conf. Artif. Intell., pp.673-679, July 2020.

[27] K. Zhang and G. Liu, “Layer pruning for obtaining shallower ResNets,” IEEE Signal Process. Lett., vol.29, pp.1172-1176, 2022.

[28] J. Wu, D. Zhu, L. Fang, Y. Deng, and Z. Zhong, “Efficient layer compression without pruning,” IEEE Trans. Image Process., vol.32, pp.4689-4699, 2023.

Authors

Xueying WANG
  National University of Defense Technology
Yuan HUANG
  National University of Defense Technology
Xin LONG
  National University of Defense Technology
Ziji MA
  Hunan University
