1. Introduction
Since 2019, respiratory infectious diseases, represented by the COVID-19 epidemic, have had a profound impact on global finance and the social economy and have posed great challenges to human health and safety. Respiratory droplets are the main route of COVID-19 transmission [1], so wearing a mask correctly can effectively block respiratory infectious diseases and serves as the first line of defense [2]. In daily work and life, whether facing respiratory infectious diseases, PM2.5, dust, or harmful particles produced during factory production, wearing a mask correctly protects life and reduces contact with hazard sources. Especially in crowded places, real-time detection of whether masks are worn, and worn correctly, is of great theoretical and practical significance.
With the rapid development of computer vision, image processing, object detection, and video denoising have become important research directions. Face mask-wearing detection falls within the scope of object detection. Over the past twenty years, object detection has roughly evolved from traditional methods to methods based on deep learning [3]. Traditional detection algorithms typically rely on manually designed features before they can be applied to face detection. For example, Viola et al. [4] quickly computed rectangular features through integral images and used the AdaBoost method to obtain classifiers with small errors; Lienhart et al. [5] proposed a face detection method based on Haar-like features that can recognize inclined faces, but it is severely disturbed by external factors and cannot identify dense targets. These methods depend on manual feature extraction, adapt poorly to targets with different characteristics, and their feature-extraction process is complex, time-consuming, and labor-intensive. In 2014, Girshick et al. [6] put forward the first deep-learning object detection model, Regions with CNN Features (R-CNN), which raised the mAP from 35.1% to 53.3%, a major breakthrough that opened a new era of deep learning for object detection. Subsequently, Fast R-CNN [7] built on R-CNN by incorporating Spatial Pyramid Pooling (SPP) [8] to support multi-scale input. Because Fast R-CNN's selective search for candidate boxes is time-consuming, researchers then added a Region Proposal Network (RPN) to it, yielding Faster R-CNN [9]. Different from the above networks, Redmon et al. [10] proposed You Only Look Once (YOLO) in 2016, which treats object detection as a regression problem and directly outputs predicted bounding boxes and class probabilities through a single neural network, thereby achieving end-to-end object detection that is better suited to real-time detection tasks.
The task of mask-wearing detection places strict requirements on the detection model. First, because the computing resources of embedded platforms in complex environments are limited, the model needs to be as lightweight as possible while improving detection performance. Second, the model must strike a balance between detection speed and accuracy when facing small targets, low resolution, high density, and complex backgrounds. Finally, considering practical deployment, improper wearing greatly reduces the protective effect, so targets that do not wear masks correctly should also be detected. To address these problems, this paper proposes a mask-wearing detection algorithm based on an improved YOLOv7-Tiny. The main contributions of this paper are summarized as follows:
- Replacing the \(3\times 3\) convolutions in the backbone and head with DSConv, which greatly reduces computation and improves detection accuracy.
- Adopting the MPDIoU loss function, which further reduces the bounding-box coordinate regression loss and improves detection performance.
- Introducing the GSConv and VoVGSCSP modules into the feature fusion layer, which improves detection accuracy while reducing model size and computational cost, facilitating deployment of mask-wearing detection in complex environments.
- Adding a P6 detection layer to the model structure, which greatly improves detection accuracy for small targets and lowers both the missed-detection and false-detection rates.
- Creating and annotating a mask-wearing dataset of 9,600 images across multiple environments. It contains three categories: wearing masks correctly (R-mask), not wearing masks correctly (W-mask), and not wearing masks (N-mask). The images cover a variety of occupations, environments, and scales, realistically reflecting crowd scenes in different real-life settings.
The rest of the paper is organized as follows: Sect. 2 outlines related work. Section 3 introduces the principles of the YOLOv7-Tiny algorithm. Section 4 presents the calculation of MPDIoU and the structures of DSConv, GSConv, VoVGSCSP, and the improved YOLOv7-Tiny. Section 5 describes the experimental environment, datasets, training and test configuration, evaluation metrics, training results, comparative experiments, ablation experiments, and visualization. Section 6 summarizes the advantages of our research and outlines directions for further study.
2. Related Works
In recent years, to better support epidemic prevention and control, many research results on mask-wearing detection have emerged. They fall mainly into two categories: “two-stage” detection algorithms represented by Faster R-CNN, and “one-stage” detection algorithms represented by the YOLO series [10]-[15].
Traditional “two-stage” detection follows a coarse-to-fine process. Researchers have proposed face mask detection systems based on deep transfer learning that achieve good accuracy, but their detection speed is slow and they struggle with real-time tasks [16], [17]. Considering the necessity of real-time processing, Gupta et al. [18] designed an Ex-Mask R-CNN structure to speed up detection. Dewantara et al. [19] proposed an adaptive boosting and cascade classifier to detect whether a target is wearing a mask, but did not consider the case of improper mask wearing.
End-to-end “one-stage” algorithms excel at real-time object detection because of their fast detection speed. Zhao et al. [20] demonstrated that the Convolutional Block Attention Module (CBAM) can enhance the key feature points of mask-wearing detection and suppress useless information. Han et al. [21] combined network-structure optimization with the K-means++ clustering algorithm to improve detection accuracy. Xiao et al. [22] introduced Efficient Channel Attention (ECA) to trade off effectiveness against efficiency. Yu et al. [23] used a two-way feature pyramid network (FPN) and the SIoU loss function to reduce the model's false-detection rate. Youssry et al. [24] improved detection accuracy by normalizing and adding noise during image preprocessing, adding negative samples, and augmenting the data, reaching an mAP of 84.8%. Guo et al. [25] introduced the coordinate attention (CA) mechanism into the feature fusion of YOLOv5 and then used a bi-directional feature pyramid network (BiFPN) as a new feature pyramid to improve the model's feature extraction and fusion. Wang et al. [26] studied the effects of different attention mechanisms (CBAM, SE, and CA) on YOLOv5s and explored how different bounding-box loss functions (GIoU, CIoU, and DIoU) affect the accuracy of mask-wearing detection. Wang et al. [27] applied Content-Aware ReAssembly of FEatures (CARAFE) to YOLOv7, enlarging the receptive field to improve the model's training speed. Wang et al. [28] enhanced the model's ability to capture small targets by adding a small-object detection layer.
The above methods can basically meet the requirements of mask-wearing detection and the prevention and control of respiratory infectious diseases, and they have achieved good detection results. However, shortcomings remain: real-time performance is often poor, so the required balance between detection speed and accuracy is not met, and detection of targets wearing masks incorrectly is lacking.
3. Principle of YOLOv7-Tiny Algorithm
YOLOv7-Tiny is a lightweight version of YOLOv7, consisting of four parts: Input, Backbone, Neck, and Head. The Input part uses Mosaic data augmentation and adaptive anchor box computation for preprocessing to meet the needs of the feature-extraction network. The Backbone consists of several CBL modules, ELAN modules, and MP modules. The CBL module consists of a convolution layer, a Batch Normalization (BN) layer, and a LeakyReLU activation. CBL performs the initial feature extraction, ELAN further learns from those features, and the MP module splices features of different scales. Because the ELAN module removes two groups of feature blocks compared with YOLOv7, feature extraction is faster but its capability is reduced. The Neck combines the strong semantic information passed down from high levels by the FPN with the strong localization information passed bottom-up by the Path Aggregation Network (PANet) [29], fully fusing feature information at different levels to realize multi-scale learning. However, the fusion network does not pay enough attention to the feature information of small targets, which leads to its loss. The Head performs detection with IDetect [30], which introduces an implicit representation strategy to refine the predictions, and outputs predictions at three different scales based on the fused feature maps.
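As an illustration of the CBL building block just described, a minimal PyTorch sketch follows. The kernel size, stride, and LeakyReLU slope of 0.1 are common YOLOv7-Tiny conventions assumed here, not values taken from this paper.

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv -> BatchNorm -> LeakyReLU, the basic CBL block described above."""
    def __init__(self, c1, c2, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)  # same-padding conv
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.LeakyReLU(0.1, inplace=True)                # slope 0.1 is an assumption

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# e.g. a stride-2 CBL that halves spatial resolution while doubling channels
print(CBL(64, 128, k=3, s=2)(torch.randn(1, 64, 80, 80)).shape)   # torch.Size([1, 128, 40, 40])
```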
4. Improvement of YOLOv7-Tiny Algorithm
4.1 Distribution Shifting Convolution (DSConv)
The YOLOv7 model has tens of millions of parameters; even with GPU computing it takes longer to run and requires more storage. Most of the storage and computation needed to run a convolutional neural network is spent in the convolution layers, which means that to make the network run faster and more efficiently, the efficiency of the convolution layers must be improved. In traditional convolution, the kernel is applied to every channel of the input feature map and the per-channel results are summed to produce a single output feature map; repeating this for all kernels yields multiple output feature maps, which incurs a large computational load. In this paper, DSConv [31] is introduced to replace the \(3\times 3\) traditional convolutions in YOLOv7-Tiny to achieve lower memory usage and higher computing speed. The overall structure of DSConv is shown in Fig. 1.
DSConv decomposes the traditional convolution kernel into two components: a Variable Quantized Kernel (VQK) and distribution shifts. The overall goal is to mimic the behavior of the convolution layer using quantization and distribution shifting: storing only integer values in the VQK yields lower memory usage and higher speed, while applying kernel- and channel-based distribution shifts preserves the same output as the original convolution. The VQK stores only variable-bit-length integer values and has the same size (cho, chi, k, k) as the original convolution tensor; as the quantized part of DSConv, it enables faster multiplication and improves storage efficiency. The purpose of the distribution shifts is to move the distribution of the VQK so that it mimics the distribution of the original convolution kernel, using two tensors in two domains: the Kernel Distribution Shifter (KDS), which shifts the distribution within each kernel slice, and the Channel Distribution Shifter (CDS), which shifts the distribution within each channel slice. Compared with ordinary convolution and depthwise separable convolution, DSConv significantly reduces computational cost while maintaining similar detection performance, which makes it well suited to mobile and edge devices with limited computing power and memory in complex environments.
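To make the VQK/KDS idea concrete, here is a toy numerical sketch (not the authors' or the DSConv paper's implementation): a float kernel is split into an integer-valued VQK plus per-block scale factors that play the role of the KDS, and the shifted VQK is checked to behave like the original kernel. The CDS and the bit-length optimization of [31] are omitted, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def dsconv_decompose(w, bits=4, block=16):
    """Split a float conv weight (cho, chi, k, k) into an integer-valued VQK
    and per-block scales acting as a simplified Kernel Distribution Shifter."""
    qmax = 2 ** (bits - 1) - 1
    vqk = torch.zeros_like(w)
    kds = []
    for start in range(0, w.shape[1], block):
        blk = w[:, start:start + block]
        # one scale per output channel and depth block
        s = blk.abs().amax(dim=(1, 2, 3), keepdim=True).clamp(min=1e-8) / qmax
        vqk[:, start:start + block] = torch.round(blk / s)   # integer values only
        kds.append(s)
    return vqk, kds

def dsconv_reconstruct(vqk, kds, block=16):
    """Apply the per-block shifters to the VQK to mimic the original kernel."""
    w_hat = torch.empty_like(vqk)
    for i, start in enumerate(range(0, vqk.shape[1], block)):
        w_hat[:, start:start + block] = vqk[:, start:start + block] * kds[i]
    return w_hat

# sanity check: the shifted VQK approximates the original 3x3 convolution
w = torch.randn(32, 64, 3, 3)
x = torch.randn(1, 64, 40, 40)
vqk, kds = dsconv_decompose(w)
err = (F.conv2d(x, w, padding=1) -
       F.conv2d(x, dsconv_reconstruct(vqk, kds), padding=1)).abs().mean()
print(f"mean abs output difference: {err.item():.4f}")
```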
4.2 Minimum Points Distance (MPDIoU)
Bounding Box Regression (BBR) is an important part of the object detection loss function, and a well-designed BBR loss improves the performance of the detection model. The loss function of the YOLOv7-Tiny network model is shown in Formula (1):
\[\begin{equation*} Loss_{all}=Loss_{bbox}+Loss_{obj}+Loss_{class} \tag{1} \end{equation*}\]
where \(Loss_{bbox}\) represents the coordinate regression loss of the bounding box, \(Loss_{obj}\) the confidence loss, and \(Loss_{class}\) the category classification loss.
The YOLOv7-Tiny coordinate regression loss was calculated using CIoU [32] with the following formula:
\[\begin{align} & \mathcal{L}_{CIoU}=1-IoU+\frac{\rho^{2}(B_{prd},B_{gt})}{c^{2}}+\alpha\upsilon \tag{2} \\ & \upsilon=\frac{4}{\pi^{2}}\left(\arctan\frac{w_{gt}}{h_{gt}}-\arctan\frac{w}{h}\right)^{2} \tag{3} \\ & \alpha=\frac{\upsilon}{(1-IoU)+\upsilon} \tag{4} \end{align}\]
where \(B_{gt}\) represents the ground-truth bounding box; \(B_{prd}\) the predicted bounding box; \(\rho^{2}(B_{prd},B_{gt})\) the squared Euclidean distance between the center points of the predicted and ground-truth boxes; and \(c\) the diagonal length of the minimum enclosing rectangle. \(\alpha\) is a positive trade-off parameter and \(\upsilon\) measures the consistency of the aspect ratio. When the predicted and ground-truth boxes have the same aspect ratio but different widths and heights, the aspect-ratio penalty term does not work and CIoU loses its effectiveness, which limits the convergence speed and precision of the network model.
Inspired by the geometry of bounding boxes, a rectangle can be uniquely defined by the coordinates of its upper-left and lower-right corners. MPDIoU [33] directly minimizes the distances between the upper-left and lower-right points of the predicted and ground-truth bounding boxes; it handles overlapping bounding-box regression and works equally well for non-overlapping boxes. This method combines the advantages of GIoU [34], DIoU [35], CIoU, and EIoU [36], achieving higher efficiency and precision. The parameters of the MPDIoU loss function are shown in Fig. 2. All the factors of existing bounding-box regression loss functions can therefore be expressed through four point coordinates, using the following conversions:
\[\begin{equation}\begin{aligned} |C| = &\left(\max\left(x_{2}^{gt}, x_{2}^{prd}\right)-\min\left(x_{1}^{gt}, x_{1}^{prd}\right)\right) \\ &\times\left(\max\left(y_{2}^{gt}, y_{2}^{prd}\right)-\min\left(y_{1}^{gt}, y_{1}^{prd}\right)\right) \end{aligned}\tag{5}\end{equation}\]
\[\begin{equation}\begin{aligned} & x_{c}^{gt}=\frac{x_{1}^{gt}+x_{2}^{gt}}{2},\quad y_{c}^{gt}=\frac{y_{1}^{gt}+y_{2}^{gt}}{2}\\ & x_{c}^{prd}=\frac{x_{1}^{prd}+x_{2}^{prd}}{2},\quad y_{c}^{prd}=\frac{y_{1}^{prd}+y_{2}^{prd}}{2} \end{aligned}\tag{6}\end{equation}\]
\[\begin{equation}\begin{aligned} & w_{gt}=x_{2}^{gt}-x_{1}^{gt},\quad h_{gt}=y_{2}^{gt}-y_{1}^{gt}\\ & w_{prd}=x_{2}^{prd}-x_{1}^{prd},\quad h_{prd}=y_{2}^{prd}-y_{1}^{prd} \end{aligned}\tag{7}\end{equation}\]
where \(|C|\) represents the area of the minimum enclosing rectangle of \(B_{gt}\) and \(B_{prd}\); \((x^{gt}_{c},y^{gt}_{c})\) and \((x^{prd}_{c},y^{prd}_{c})\) represent the center coordinates of the ground-truth and predicted bounding boxes, respectively; \(w_{gt}\) and \(h_{gt}\) represent the width and height of the ground-truth box; and \(w_{prd}\) and \(h_{prd}\) represent the width and height of the predicted box.
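For reference, the MPDIoU itself and its loss can then be written as follows. This is our transcription of the definition in [33], included for completeness (the notation of the original paper may differ slightly), with \(w\) and \(h\) here denoting the width and height of the input image:

\[\begin{aligned} d_{1}^{2} &= \left(x_{1}^{prd}-x_{1}^{gt}\right)^{2}+\left(y_{1}^{prd}-y_{1}^{gt}\right)^{2}, \qquad d_{2}^{2} = \left(x_{2}^{prd}-x_{2}^{gt}\right)^{2}+\left(y_{2}^{prd}-y_{2}^{gt}\right)^{2},\\ MPDIoU &= \frac{\left|B_{gt}\cap B_{prd}\right|}{\left|B_{gt}\cup B_{prd}\right|}-\frac{d_{1}^{2}}{w^{2}+h^{2}}-\frac{d_{2}^{2}}{w^{2}+h^{2}}, \qquad \mathcal{L}_{MPDIoU}=1-MPDIoU. \end{aligned}\]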
4.3 GSConv
For real-time mask detection in complex environments, the model must be lightweight enough for easy deployment while maintaining detection speed and accuracy. Standard convolution (SConv) operates on all input channels simultaneously: the number of convolution kernels equals the number of output channels, and each kernel spans all input channels, so using too many standard convolutions to extract image features causes parameters to accumulate and features to become redundant. The deeper the network, the more pronounced this effect.
MobileNets [37] uses \(1\times 1\) convolutions to fuse independently computed channel information, which incurs a considerable computational load; ShuffleNets [38] uses channel shuffling for the interaction of channel information, but its results still fall short of those of SConv; GhostConv [39] generates feature maps through a small number of convolutions plus cheap linear transformations and splices them together for output, but it loses a large amount of channel information. To overcome these difficulties, Li et al. [40] proposed the GSConv module. The structure of GSConv is shown in Fig. 3, where PWConv and DWConv denote the pointwise and depthwise convolutions of a depthwise separable convolution, respectively. With C1 input channels and C2 output channels, the input first passes through a standard convolution that reduces the channel count to C2/2, then through a depthwise separable convolution in which the channel count remains unchanged; the two results are concatenated and their channel information is shuffled evenly. This enhances the extracted semantic information, strengthens feature fusion, and improves the expressive power of image features. GSConv combines the accuracy of dense computation with the lightweight nature of depthwise computation, making it an efficient and lightweight convolution method. In addition, based on GSConv, the cross-stage partial network module VoVGSCSP is designed with a one-shot aggregation method, which offers faster inference. The structure of VoVGSCSP is shown in Fig. 4.
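A minimal PyTorch sketch of the GSConv structure described above follows (standard convolution to C2/2 channels, a depthwise convolution, concatenation, and a channel shuffle). The depthwise kernel size and activation are assumptions based on common implementations rather than the exact configuration in [40].

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Sketch of GSConv: half the channels come from a standard conv, the other
    half from a depthwise conv on that result, then the halves are shuffled."""
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        c_ = c2 // 2
        self.cv1 = nn.Sequential(                            # standard conv branch
            nn.Conv2d(c1, c_, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_), nn.LeakyReLU(0.1, inplace=True))
        self.cv2 = nn.Sequential(                            # depthwise conv branch (kernel 5 assumed)
            nn.Conv2d(c_, c_, 5, 1, 2, groups=c_, bias=False),
            nn.BatchNorm2d(c_), nn.LeakyReLU(0.1, inplace=True))

    def forward(self, x):
        x1 = self.cv1(x)
        x2 = self.cv2(x1)
        y = torch.cat((x1, x2), dim=1)                       # (b, c2, h, w)
        b, c, h, w = y.shape
        # channel shuffle: interleave the dense and depthwise halves
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

print(GSConv(64, 128, k=3, s=2)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 40, 40])
```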
When the YOLOv7-Tiny network performs feature fusion in the Neck, semantic information is continuously passed downwards; as the width, height, and channel count of the feature maps are repeatedly compressed and expanded, some semantic information is lost, affecting the final detection results. In this paper, the GSConv module is introduced into the Neck to replace standard convolution for up-sampling and down-sampling, and VoVGSCSP replaces the ELAN module of the original Neck. This makes the model lightweight while preserving detection accuracy.
4.4 P6 Detection Layer
The YOLOv7-Tiny network has three detection layers: P3, P4, and P5. It realizes multi-scale learning by fusing feature information at different levels in the Neck, but the fusion network does not pay enough attention to small-target features, which easily causes feature loss and missed detections of small targets. In this paper, the input image size is \(640\times 640\). Considering the large number of dense small targets in the mask-wearing detection task, we add a P6 detection layer. The P6 layer consists of an MP module and an ELAN module, and the network uses the K-means algorithm to cluster the labels and adaptively generate 12 anchor boxes, as sketched below. As a result, detection precision for small targets is greatly improved, and the false-detection and missed-detection rates are reduced.
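To make the anchor-generation step concrete, the following is a hedged sketch of K-means clustering over labeled box (w, h) pairs with an IoU-based distance, producing the 12 anchors needed for four detection layers. It simplifies what YOLO-style auto-anchor routines do (they additionally apply genetic refinement), and the placeholder data and function name are illustrative.

```python
import numpy as np

def kmeans_anchors(wh, n_anchors=12, iters=100, seed=0):
    """Cluster (width, height) pairs into anchors using 1 - IoU as distance."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), n_anchors, replace=False)]
    for _ in range(iters):
        inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
                np.minimum(wh[:, None, 1], centers[None, :, 1])
        union = wh[:, None].prod(-1) + centers[None, :].prod(-1) - inter
        assign = (inter / union).argmax(1)            # nearest center by IoU
        for k in range(n_anchors):
            if np.any(assign == k):
                centers[k] = wh[assign == k].mean(0)  # update center
    return centers[np.argsort(centers.prod(1))]       # sorted by area

# wh: (width, height) of all labeled boxes at 640x640; placeholder data here
wh = np.abs(np.random.randn(5000, 2)) * 60 + 10
print(kmeans_anchors(wh).round(1))
```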
The improved YOLOv7-Tiny network structure is shown in Fig. 5. Replacing the \(3\times 3\) traditional convolutions in the Backbone and Head with DSConv to construct DBL modules greatly reduces computation and memory access. The traditional convolutions in the Neck are replaced by GSConv to construct GBL modules, and the ELAN module is replaced by VoVGSCSP to speed up inference. The P6 detection layer is added in the Head, and the MPDIoU loss function is used.
5. Experimental Results and Analysis
5.1 Datasets
In all experiments, a total of 9,600 images were collected from the open-source MAFA and WIDER Face datasets, web crawling, and self-collection, comprising 7,600 images for the training set, 1,000 for the test set, and 1,000 for the validation set. The dataset was annotated with the LabelImg tool, producing YOLO-format txt files grouped into three categories: wearing masks correctly (R-mask), not wearing masks correctly (W-mask), and not wearing masks (N-mask). The images cover a variety of occupations, environments, and scales, thereby realistically reflecting crowd scenes in different real-life settings. The LabelImg annotation is shown in Fig. 6.
Each generated txt file contains the category index, the normalized horizontal center coordinate \(x\), the normalized vertical center coordinate \(y\), and the normalized width \(\omega\) and height \(h\) of the box, computed as follows:
\[\begin{align} & x = \frac{1}{i}\left(\frac{x_{\max}+x_{\min}}{2}\right) \tag{8} \\ & y = \frac{1}{j}\left(\frac{y_{\max}+y_{\min}}{2}\right) \tag{9} \\ & \omega = \frac{x_{\max}-x_{\min}}{i} \tag{10} \\ & h = \frac{y_{\max}-y_{\min}}{j} \tag{11} \end{align}\]
where \((x_{min},y_{min})\) is the upper-left corner of the label box, \((x_{max},y_{max})\) is its lower-right corner, and \(i\) and \(j\) denote the image width and height, respectively.
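As a concrete illustration of Eqs. (8)-(11), the following minimal Python helper converts a LabelImg-style corner box into the normalized YOLO format, taking \(i\) and \(j\) as the image width and height; the function name and example values are illustrative.

```python
def to_yolo_format(box, img_w, img_h):
    """Convert a pixel corner box (x_min, y_min, x_max, y_max) into
    normalized YOLO (x, y, w, h) per Eqs. (8)-(11)."""
    x_min, y_min, x_max, y_max = box
    x = (x_min + x_max) / 2 / img_w   # Eq. (8)
    y = (y_min + y_max) / 2 / img_h   # Eq. (9)
    w = (x_max - x_min) / img_w       # Eq. (10)
    h = (y_max - y_min) / img_h       # Eq. (11)
    return x, y, w, h

# e.g. a 200x300-pixel face box at (100, 150) in a 640x640 image
print(to_yolo_format((100, 150, 300, 450), 640, 640))  # (0.3125, 0.46875, 0.3125, 0.46875)
```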
To enrich data diversity and backgrounds, reduce the risk of over-fitting and the cost of training, and improve the generalization ability of the model, we use Mixup and Mosaic data augmentation. The results are shown in Fig. 7.
5.2 Configuration of Experiment and Test Environment
The configuration of the training and test environment is shown in Table 1. An input resolution of \(640\times 640\) was adopted. Due to the GPU memory limitations of the experimental platform, a batch size of 64 was chosen, and the model was trained for 200 epochs to ensure convergence. A data augmentation scheme combining Mosaic and Mixup was adopted to improve the generalization ability of the model. In addition, hyperparameters play a crucial role in determining convergence speed and final performance: the initial learning rate (Lr0) was set to 0.01 and the final learning rate (Lrf) to 0.1, with a momentum of 0.937 and a weight decay of 0.0005 to regulate the model's training dynamics.
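For readers who wish to reproduce the setup, the settings above can be summarized as the following Python dictionary in the style of a YOLO hyperparameter file. The field names are our own shorthand; only the values come from the text and Table 1.

```python
# Training configuration sketch mirroring the values stated in Sect. 5.2.
train_cfg = {
    "imgsz": 640,            # input resolution 640x640
    "batch_size": 64,        # limited by GPU memory
    "epochs": 200,           # trained until convergence
    "lr0": 0.01,             # initial learning rate
    "lrf": 0.1,              # final learning rate (factor in YOLO-style configs)
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "mosaic": True,          # Mosaic augmentation enabled
    "mixup": True,           # Mixup augmentation enabled
}
```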
5.3 Evaluation Metrics
The main evaluation metrics used in this paper are Precision (P), Recall (R), mean average precision (mAP), Frames Per Second (FPS), and Floating-point Operations (FLOPs). The mAP at an IoU threshold of 0.5 is used for evaluation and is denoted mAP@.5.
\[\begin{align} & Precision = \frac{TP}{TP+FP} \tag{12} \\ & Recall = \frac{TP}{TP+FN} \tag{13} \end{align}\]
where \(TP\) is the number of positive targets correctly identified, \(TN\) the number of negative targets correctly identified, \(FN\) the number of positive targets incorrectly identified as negative, and \(FP\) the number of negative targets incorrectly identified as positive.
\[\begin{equation*} FPS = \frac{Framenum}{ElapsedTime} \tag{14} \end{equation*}\]
where \(Framenum\) is the total number of images detected and \(ElapsedTime\) is the total time spent on detection. The higher the FPS, the faster the detection and the better the real-time performance of the model.
\[\begin{equation*} mAP = \frac{1}{c}\sum_{i=1}^{c}AP_{i}\times 100\% \tag{15} \end{equation*}\]
where \(c\) is the total number of image categories, \(i\) indexes the categories, and \(AP_{i}\) is the average precision of a single category.
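To make the metric computation concrete, here is a small Python sketch of Eqs. (12), (13), and (15), together with an all-point-interpolation AP helper; that helper is a common implementation choice and not necessarily the exact evaluation protocol used in our experiments.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Eqs. (12)-(13): precision and recall from confusion counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall, precision):
    """AP as the area under the P-R curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.flip(np.maximum.accumulate(np.flip(p)))   # monotone precision envelope
    return float(np.sum(np.diff(r) * p[1:]))

def mean_average_precision(ap_per_class):
    """Eq. (15): mAP as the mean of per-class APs, expressed as a percentage."""
    return 100.0 * float(np.mean(ap_per_class))

print(precision_recall(tp=90, fp=10, fn=15))        # (0.9, 0.857...)
print(mean_average_precision([0.91, 0.85, 0.88]))   # 88.0
```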
5.4 Training Result
After training the model for 200 epochs, the training results with different modules are shown in Fig. 8. As can be seen from the figure, introducing DSConv plays a positive role and improves mAP@.5. After introducing MPDIoU, the bounding-box coordinate regression loss is further reduced. After adopting the lightweight GSConv+VoVGSCSP modules, mAP@.5 does not decrease. After adding the P6 detection layer, mAP@.5 is further improved, while the bounding-box regression loss, confidence loss, and classification loss are all greatly reduced. In short, all of the improved modules have a positive effect on the model.
5.5 Ablation Experiments
To verify the effectiveness and generalization of each improved module, and to measure the effect of integrating them, we conducted ablation experiments on the 1,000-image test set, comparing the YOLOv7-Tiny network with the improved network proposed in this paper. The results are shown in Table 2. The area under the P-R curve gives the average precision (AP), and mAP is its mean over categories; the higher the mAP, the higher the detection accuracy. Model A denotes YOLOv7-Tiny. After introducing DSConv in Model B, Precision increases by 2.7%, mAP@.5 by 1.3%, and FLOPs decrease by 8.5G, showing that DSConv improves detection accuracy and greatly reduces computation through the VQK and KDS. In Model C, MPDIoU replaces the CIoU of the original model, and Precision reaches 87.2%. After adopting the GSConv+VoVGSCSP modules in Model D, detection accuracy does not drop noticeably, while the model becomes lighter. Model E adds the P6 detection layer, which improves detection accuracy in complex environments. When the improved modules are integrated, their effects remain positive and no strong mutual exclusion is observed. Finally, when DSConv, MPDIoU, GSConv+VoVGSCSP, and the P6 detection layer are used together, mAP@.5 reaches 88.1%, demonstrating the superiority of the improved algorithm proposed in this paper.
5.6 Comparative Experiments
Under the same experimental environment and initial training parameters, this paper verifies the advancement of the improved model by comparing the improved YOLOv7-Tiny with YOLOv3, YOLOv4, YOLOv5s, and the model of Ref. [25]. The comparison results are shown in Table 3. As can be seen from Table 3, YOLOv3 and YOLOv4 both deliver good detection performance, but the models are redundant and computationally heavy, making them difficult to deploy in complex environments. Compared with YOLOv5s and Ref. [25], the improved algorithm proposed in this paper has lower computational complexity, faster detection speed, and higher precision. Compared with the original YOLOv7-Tiny network model, the proposed model significantly reduces FLOPs at the cost of a small amount of FPS, while Precision, Recall, and mAP@.5 are all markedly improved, verifying the advancement of the algorithm and its suitability for real-time mask-wearing detection in complex environments.
5.7 Visualization of Experimental Results
To visually demonstrate the advancement of the improved algorithm, inference experiments were conducted; the accompanying figures show the detection results of YOLOv7-Tiny and the improved algorithm in different scenarios. Figure 9 shows the detection of complex targets: for correctly worn and absent masks, the improved algorithm achieves noticeably better detection performance and higher accuracy. Figure 10 shows its advantage when detecting incorrectly worn masks, as can be seen by comparing Fig. 10 (b) and (c). Figure 11 shows that when detecting occluded targets, the improved algorithm detects more targets with higher accuracy.
Fig. 9 Detection of complex targets. (a) Initial image sample; (b) YOLOv7-Tiny detection result; (c) detection result of this algorithm.
Fig. 10 Detection of targets not wearing masks correctly. (a) Initial image sample; (b) YOLOv7-Tiny detection result; (c) detection result of this algorithm.
Fig. 11 Detection of occluded targets. (a) Initial image sample; (b) YOLOv7-Tiny detection result; (c) detection result of this algorithm.
In addition, to verify that the proposed method focuses more on the important regions for mask-wearing detection, visualization experiments were conducted with Grad-CAM; the results are shown in Fig. 12. The darker the red, the more attention the model pays to that region, followed by the yellow areas; blue areas contribute little to detection and recognition and are treated by the model as redundant information. As Fig. 12 shows, compared with the original YOLOv7-Tiny model, the proposed method focuses more accurately on the important regions of mask-wearing detection, with less redundant information.
Fig. 12 Grad-CAM heat maps. (a) Initial image sample; (b) YOLOv7-Tiny result; (c) result of this algorithm.
In summary, compared with the YOLOv7-Tiny algorithm, the improved algorithm proposed in this paper detects more targets and achieves higher accuracy and stronger target attention when facing small targets, low resolution, high density, and background blur.
6. Conclusions
This paper proposes an enhanced YOLOv7-Tiny mask-wearing detection algorithm to aid the prevention and control of respiratory infectious diseases and reduce exposure to harmful substances. The algorithm replaces the \(3\times 3\) convolutions in the Backbone and Head of the original model with DSConv, reducing computational complexity for easier deployment on mobile embedded devices. It replaces the original CIoU loss with MPDIoU, decreasing the coordinate regression loss and improving detection performance. To make better use of feature semantic information, the GSConv and VoVGSCSP modules are incorporated, making the model more adaptable. Furthermore, to improve the detection accuracy of small targets and reduce missed detections, a P6 detection layer is added to the Head. The improved YOLOv7-Tiny algorithm is evaluated on a self-annotated mask-wearing dataset, and the experimental results demonstrate excellent mask detection performance with high accuracy and low computational complexity: Precision and Recall improve by 5.4% and 1.8%, respectively; mAP@.5 increases by 3%, reaching 88.1%; mAP@.5:.95 improves by 1.7%; and FLOPs are reduced to 8.3G, enabling real-time, accurate mask-wearing detection and easier deployment on embedded devices. There is still room for improvement, particularly in accelerating inference speed, which future research can address.
Acknowledgments
This work was supported by the Natural Science Foundation of Shanxi Province (No. 202203021211198) and the Doctoral Initiation Fund Project of Taiyuan University of Science and Technology (No. 20222026).
References
[1] M. Liao, H. Liu, X. Wang, X. Hu, and Y. Huang, “A technical review of face mask wearing in preventing respiratory COVID-19 transmission,” Current Opinion in Colloid & Interface Science, vol.52, p.101417, 2021.
[2] Y. Liu, A.A. Gayle, A. Wilder-Smith, and J. Rocklöv, “The reproductive number of COVID-19 is higher compared to SARS coronavirus,” Journal of Travel Medicine, vol.27, no.2, pp.1-4, 2020.
[3] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” Proceedings of the IEEE, vol.111, no.3, pp.257-276, 2023.
[4] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” CVPR'01, vol.1, pp.I-I, 2001.
[5] R. Lienhart and J. Maydt, “An extended set of haar-like features for rapid object detection,” Proceedings international conference on image processing, vol.1, pp.I-I, 2002.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” CVPR’14, pp.580-587, 2014.
[7] R. Girshick, “Fast R-CNN,” CVPR'15, pp.1440-1448, 2015.
[8] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol.37, no.9, pp.1904-1916, 2015.
[9] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol.39, no.6, pp.1137-1149, 2017.
[10] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” CVPR’16, pp.779-788, 2016.
[11] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” CVPR’17, pp.6517-6525, 2017.
[12] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
[13] A. Bochkovskiy, C.Y. Wang, and H.Y.M. Liao, “YOLOv4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.
[14] C. Li, L. Li, H. Jiang, K. Weng, and Y. Geng, “YOLOv6: A single-stage object detection framework for industrial applications,” arXiv preprint arXiv:2209.02976, 2022.
[15] C.-Y. Wang, A. Bochkovskiy, and H.-Y.M. Liao, “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” CVPR’23, pp.7464-7475, 2023.
[16] M. Balasubramanian, K. Ramyadevi, and R. Geetha, “Deep transfer learning based real time face mask detection system with computer vision,” Multimedia Tools and Applications, vol.83, no.6, pp.17511-17530, 2023.
[17] F. Li, X. Wang, Y. Sun, T. Li, and J. Ge, “Transfer learning based cascaded deep learning network and mask recognition for COVID-19,” World Wide Web, vol.26, pp.2931-2946, 2023.
[18] P. Gupta, V. Sharma, and S. Varma, “A novel algorithm for mask detection and recognizing actions of human,” Expert Systems with Applications, vol.198, p.116823, 2022.
[19] B.S. Bayu Dewantara and D. Twinda Rhamadhaningrum, “Detecting multi-pose masked face using adaptive boosting and cascade classifier,” IES’6, pp.436-441, 2020.
[20] G. Zhao, S. Zou, and H. Wu, “Improved Algorithm for Face Mask Detection Based on YOLO-v4,” International Journal of Computational Intelligence Systems, vol.16, no.1, p.104, 2023.
[21] Z. Han, H. Huang, Q. Fan, Y. Li, and Y. Li, “SMD-YOLO: An efficient and lightweight detection method for mask wearing status during the COVID-19 pandemic,” Computer Methods and Programs in Biomedicine, vol.221, p.106888, 2022.
[22] H. Xiao, B. Wang, J. Zheng, L. Liu, and C.L.P. Chen, “A Fine-grained Detector of Face Mask Wearing Status Based on Improved YOLOX,” IEEE Transactions on Artificial Intelligence, pp.1-15, 2023.
[23] F. Yu, G. Zhang, F. Zhao, X. Wang, H. Liu, P. Lin, and Y. Chen, “Improved YOLO-v5 model for boosting face mask recognition accuracy on heterogeneous IoT computing platforms,” Internet of Things, vol.23, p.100881, 2023.
[24] N. Youssry and A. Khattab, “Accurate real-time face mask detection framework using YOLOv5,” DTS’4, pp.1-6, 2022.
[25] S. Guo, L. Li, T. Guo, Y. Cao, and Y. Li, “Research on mask-wearing detection algorithm based on improved YOLOv5,” Sensors, vol.22, no.13, p.4933, 2022.
[26] Z. Wang, W. Sun, Q. Zhu, and P. Shi, “Face Mask-Wearing Detection Model Based on Loss Function and Attention Mechanism,” Computational Intelligence and Neuroscience, vol.2022, pp.1-10, 2022.
[27] C. Wang, B. Zhang, Y. Cao, M. Sun, K. He, Z. Cao, and M. Wang, “Mask Detection Method Based on YOLO-GBC Network,” Electronics, vol.12, no.2, p.408, 2023.
[28] J. Wang, J. Wang, X. Zhang, and N. Yu, “A Mask-Wearing Detection Model in Complex Scenarios Based on YOLOv7-CPCSDSA,” Electronics, vol.12, no.14, p.3128, 2023.
[29] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” CVPR’18, pp.8759-8768, 2018.
[30] C.Y. Wang, I.H. Yeh, and H.Y.M. Liao, “You only learn one representation: Unified network for multiple tasks,” arXiv preprint arXiv:2105.04206, 2021.
[31] M.G.D. Nascimento, V. Prisacariu, and R. Fawcett, “DSConv: Efficient convolution operator,” ICCV'19, pp.5148-5157, 2019.
[32] Z. Zheng, P. Wang, D. Ren, W. Liu, R. Ye, Q. Hu, and W. Zuo, “Enhancing geometric factors in model learning and inference for object detection and instance segmentation,” IEEE Trans. Cybern., vol.52, no.8, pp.8574-8586, 2021.
[33] M. Siliang and X. Yong, “MPDIoU: A loss for efficient and accurate bounding box regression,” arXiv preprint arXiv:2307.07662, 2023.
[34] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” CVPR’19, pp.658-666, 2019.
[35] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, “Distance-IoU loss: Faster and better learning for bounding box regression,” Proceedings of the AAAI Conference on Artificial Intelligence, vol.34, no.7, pp.12993-13000, 2020.
[36] Y.-F. Zhang, W. Ren, Z. Zhang, Z. Jia, L. Wang, and T. Tan, “Focal and efficient IOU loss for accurate bounding box regression,” Neurocomputing, vol.506, pp.146-157, 2022.
[37] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, and W. Wang, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[38] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient convolutional neural network for mobile devices,” CVPR'18, pp.6848-6856, 2018.
[39] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, “GhostNet: More Features from Cheap Operations,” CVPR’20, pp.1580-1589, 2020.
[40] H. Li, J. Li, H. Wei, Z. Liu, and Z. Zhan, “Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles,” arXiv preprint arXiv:2206.02424, 2022.