1. Introduction
As the prevalence of intelligent cars increases, image segmentation technology is increasingly utilized for detecting and perceiving the driving environment in autonomous vehicles. The development of image segmentation technology has undergone two stages. In the first stage, segmentation was based on traditional machine learning, and most methods relied heavily on manually crafted features designed using prior knowledge, such as the Dimensionality Reduction Algorithm (DRA) [1], K-Means [2], and C-Means [3]. However, these methods were mostly heuristic. In the second stage, segmentation was based on deep learning. Following the pathbreaking work of Fully Convolutional Networks (FCN) [4], which inspired many subsequent works [5]-[7], deep learning has become the primary design choice for autonomous driving scene image segmentation.
Given the limited computing power of autonomous driving hardware systems, there is still high potential for the development of lightweight and efficient image segmentation schemes. The macro design concept of the multi-branch network structure advocates using multiple branches to apply differentiated designs to diverse feature extraction tasks, resulting in a lightweight and efficient segmentation network. This concept is exemplified by the well-known BiSeNet v2 [8], which proposes a bilateral branches network that integrates spatial details and deep semantic features. The detail branch deploys wide channels and a shallow network to acquire low-level details and generate high-resolution feature representations. In contrast, the semantic branch focuses solely on obtaining deep, high-level semantic context information. The spatial details and class semantics are processed separately and then fused using the bilateral guided aggregation layer to achieve high precision and efficiency in real-time semantic segmentation. For the detail branch, the spatial detail information in the image is very important for retaining boundaries. The conventional 3\(\times\)3 convolution kernel used in the detail branch to extract spatial detail information has limited effectiveness, because the conventional convolution kernel prioritizes local features in the receptive field, overlooking the surrounding and global context information that can help retain spatial details and improve segmentation accuracy. Furthermore, the conventional 3\(\times\)3 convolution contains considerable parameter redundancy. For the semantic branch, a larger receptive field is crucial to learn complex correlations between objects. BiSeNet v2's semantic branch, however, uses an inverted bottleneck residual module that combines regular convolution and depthwise separable convolution to extract deep contextual semantic information. This module only weakly correlates spatial and channel features, resulting in a small receptive field and leaving plenty of room for improvement in segmentation accuracy.
This article proposes a network called "BiConvNet," which builds on BiSeNet v2, specifically for image segmentation tasks in autonomous driving scenarios. To improve the detail branch's ability to learn spatial detail information, BiConvNet introduces PCSD convolution modules for feature extraction. These modules combine the advantages of dilated convolution and strip convolution to encode local features and contextual features from three receptive fields, thereby enhancing the ability to preserve spatial details. In addition, BiConvNet reconstructs the semantic branch of BiSeNet v2 through several modifications. Instead of the original inverted bottleneck residual module combining conventional convolution and depthwise separable convolution, BiConvNet uses a simple depthwise separable convolution for downsampling at each stage. It also uses the inverted bottleneck convolution module from ConvNeXt to encode semantic information at each stage, increasing the correlation between spatial and channel features and enlarging the receptive field. These modifications lead to improved segmentation accuracy. The bilateral feature aggregation layer of BiConvNet employs the bilateral guided aggregation (BGA) module proposed in BiSeNet v2 and fine-tunes it to enhance segmentation accuracy without increasing computational complexity. Through ablation experiments and comparisons with mainstream algorithms, we demonstrate the efficacy and feasibility of the proposed improvement scheme for BiSeNet v2. Moreover, our results confirm that the BiConvNet algorithm outperforms commonly used autonomous driving image segmentation algorithms in terms of both accuracy and model size.
The paper’s primary contributions are as follows:
1) This paper introduces a PCSD convolution module to enhance the detail branch of BiSeNet v2 for improved spatial detail extraction. The semantic branch is reconstructed with depthwise separable convolution and ConvNeXt modules to enhance deep-level semantic feature encoding, and fine-tuning of the BGA module further improves segmentation accuracy, completing the construction of BiConvNet.
2) This article examines the impact of strip convolution and dilated convolutions of various sizes on image segmentation accuracy. Experimental results show that the proposed PCSD convolution module outperforms conventional convolutions, achieving higher accuracy on segmentation datasets, and that BiConvNet is strongly competitive with recent commonly used segmentation algorithms.
The subsequent content is organized as follows. The related work on lightweight network construction schemes is introduced in Sect. 2. The construction of the BiConvNet network (Sect. 3) is first introduced in its entirety, including the overall architecture and instance parameters, followed by the construction of the detail branch (Sect. 3.1), the semantic branch (Sect. 3.2), and the aggregation layer (Sect. 3.3). In the experimental study (Sect. 4), the Cityscapes [10] and BDD100K [11] datasets and the experimental environment are introduced (Sect. 4.1), and the effectiveness of the improvements to the detail branch, semantic branch, and aggregation layer of BiSeNet v2 is verified through ablation experiments (Sect. 4.2). Then, the superiority of BiConvNet is verified through comparative experiments with existing autonomous driving image segmentation algorithms (Sect. 4.3). Finally, the overall work is summarized, and future prospects are proposed (Sect. 5).
2. Related Work
As deep learning-based image segmentation gains traction in practical applications, researchers seek lightweight, high-precision network solutions. Depthwise separable convolution, known for its smaller size and lower computational cost, is pivotal in efficient neural network designs [12], [13]. MobileNets [14], a widely used backbone, blends depthwise separable and regular convolutions for speed in embedded systems. MobileNets v2 [15] confirms an 8-9 times reduction in computational cost with depthwise separable convolutions, offering a substitute for standard convolutions. It introduces a module combining inverted bottleneck residual structures with depthwise separable convolutions that balances accuracy and speed, applied in networks such as Fast-SCNN [22] and ContextNet [20]. In CGNet [16], to address the limitations of regular convolutions, a dual-path module merges regular and dilated convolutions, extracting both local and surrounding contextual features.
In addition to employing lightweight convolution modules for network construction, a prevalent approach involves rethinking convolutional kernel design. ERFNet [17] challenges the effectiveness of stacking conventional convolutions to increase depth, citing significant computational costs for minimal accuracy gains. It introduces the "Non-bottleneck-1D" module, based on one-dimensional strip convolutions, reducing parameters by 33% compared to conventional 3\(\times\)3 convolutions and achieving model compactness and computational efficiency by minimizing redundancy. Similarly, SegNeXt [18] utilizes a multi-branch convolutional attention module, combining one-dimensional strip convolutions of various sizes to capture multi-scale contextual information from local to global scales. This highlights the lightweight nature of strip convolutions, which are particularly beneficial for extracting features of strip-like objects in segmentation scenes, such as people and utility poles. CCNet [19] introduces the criss-cross attention module, which collects contextual information along criss-cross paths and, through iteration, captures full-image dependencies for all pixels. This method reduces GPU memory usage by 11 times, enhancing computational efficiency and yielding promising results on autonomous driving datasets.
In addition to optimizing convolutional modules and reconstructing kernels at the microscopic level, researchers are exploring multi-branch network architectures. ContextNet [20] validates the effectiveness of combining deep network branches with low-resolution ones to aggregate contextual information from multiple resolutions. This captures high-resolution segmentation details while incorporating global contextual information. Image Cascade Network (ICNet) [21] introduces a multi-resolution branch network generating rough prediction maps from low-resolution images through semantic perception. Cascade feature fusion units and label-guided strategies integrate middle and high-resolution features, gradually refining the rough semantic map. In contrast, Fast-SCNN [22] downsamples a single input image before constructing a dual-branch network for spatial detail and deeper semantic information extraction with higher receptive fields.
This article proposes an improved BiConvNet image segmentation network based on BiSeNet v2, comprising detail and semantic branches. Through comparison experiments with Fast-SCNN and BiSeNet v2, BiConvNet demonstrates higher segmentation accuracy.
3. BiConvNet Network Construction
The BiConvNet network framework, illustrated in Fig. 1, comprises three main components: the detail branch trunk, which extracts spatial detail information; the semantic branch trunk, which extracts advanced deep-level semantics; and the aggregation layer, which integrates the dual-branch feature maps. The reconstructed detail branch is composed of three stages, with each stage utilizing a 3\(\times\)3 regular convolution for downsampling and a PCSD convolution module for feature extraction. The image size is halved at each stage, the output feature channel ratio is (64:64:128), and the PCSD convolution quantity ratio is (1:2:2). The reconstructed semantic branch is composed of five stages, with each stage using lightweight depthwise separable convolution for downsampling and the ConvNeXt inverted bottleneck convolution module to generate dense semantic feature information. The ConvNeXt convolution module quantity ratio varies between stages (3:3:9:6:3), and the output feature channels are (16:32:96:128:128). The aggregation layer is fine-tuned based on the bilateral guided aggregation layer of BiSeNet v2. It multiplies the upsampled and downsampled feature maps of the two branches pixel-wise after sigmoid activation and applies a regular convolution to the pixel-wise sum of the two feature maps to complete the feature fusion.
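The stage layout above can be summarized as plain data. The channel and block counts follow the text, while the resolution bookkeeping below assumes each stage halves the spatial size (a sketch for illustration, not the authors' code; the helper name is ours):

```python
# Stage layout of BiConvNet as described in the text (illustrative sketch).
# Detail branch: 3 stages, each halving the resolution.
detail_channels = [64, 64, 128]
detail_pcsd_blocks = [1, 2, 2]

# Semantic branch: 5 stages, each halving the resolution.
semantic_channels = [16, 32, 96, 128, 128]
convnext_blocks = [3, 3, 9, 6, 3]

def output_strides(num_stages):
    """Cumulative downsampling factor after each stage (x2 per stage)."""
    return [2 ** (s + 1) for s in range(num_stages)]

# With a 512x512 input, the detail branch ends at stride 8 (64x64 maps)
# and the semantic branch at stride 32 (16x16 maps).
print(output_strides(3))  # [2, 4, 8]
print(output_strides(5))  # [2, 4, 8, 16, 32]
```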
Table 1 shows the parameters for each stage of the BiConvNet network’s detail and semantic branches. Each stage S contains one or more operations, such as the PCSD Block, regular convolution with Conv2d, ConvNeXt Block, and DSConv depthwise separable convolution. Each operation has an output channel c, as well as other parameters like the number of repetitions r.
3.1 Detail Branch Construction
The detail branch of the BiConvNet network is responsible for processing the spatial details of shallow, low-level semantic features. These spatial details are critical for preserving object boundaries in images. Therefore, this branch requires an abundance of channel capacity, as well as more efficient convolution modules, to encode rich spatial details in a shallow network context.
BiSeNet v2 uses conventional 2D convolution with a 3\(\times\)3 kernel to extract spatial details in the detail branch, but 3\(\times\)3 convolution performs poorly in extracting fine detail information. However, any 2D convolution can be represented by a combination of 1D convolutions [23]. Non-bottleneck-1D [17] uses strip convolutions with 1\(\times\)3 and 3\(\times\)1 kernels instead of a 3\(\times\)3 convolution kernel to extract local features of the target, as shown in Fig. 2 (a). Experiments have shown that this method reduces parameters by 33% compared to a 3\(\times\)3 convolution kernel, further improving computational efficiency. Let \({W_{2D}} \in {\Re ^{C \times {d^h} \times {d^v} \times F}}\) denote the weight of a 2D convolutional layer and \({W_{1D}} \in {\Re ^{C \times d \times F}}\) the weight of a one-dimensional convolution, where \(C\) is the number of input channels, \(F\) is the number of output channels, and \(d^h\times d^v\) is the size of the convolutional kernel, typically \(d^h\equiv d^v\equiv d\). The output of the original 2D residual module can be expressed as:
\[\begin{equation*} y=F(x,{W_{2D}})+I_x, \tag{1} \end{equation*}\]
where \(I_x\) represents the identity mapping in residual networks and \(F(x,{W_i})\) represents the residual mapping to be learned. The output of the residual module that uses 1D strip convolutions can be expressed as:
\[\begin{equation*} y=F(x,W_{1\times3},W_{3\times1})+I_{x}, \tag{2} \end{equation*}\]
where \(W_{1\times3}\) and \(W_{3\times1}\) represent the weights of the 1\(\times\)3 and 3\(\times\)1 strip convolutions, respectively.
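The 33% parameter saving quoted above follows directly from counting kernel weights: a 3\(\times\)3 kernel has nine weights per channel pair, while the 1\(\times\)3 plus 3\(\times\)1 pair has six. A quick numerical check (plain Python, illustrative only; the channel count is an arbitrary example):

```python
def conv2d_params(c_in, c_out, kh, kw):
    """Weight count of a 2D convolution layer, ignoring bias terms."""
    return c_in * c_out * kh * kw

c_in = c_out = 64
full = conv2d_params(c_in, c_out, 3, 3)                              # one 3x3 kernel
strip = conv2d_params(c_in, c_out, 1, 3) + conv2d_params(c_in, c_out, 3, 1)

# 1 - 6/9 = 1/3: the 33% reduction cited for Non-bottleneck-1D.
print(1 - strip / full)
```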
Fig. 2 Comparison of the structures of the Non-bottleneck-1D and PCSD convolution modules. (a) Non-bottleneck-1D. (b) PCSD block
The PCSD convolution module proposed in this paper uses the Non-bottleneck-1D convolution module as the local feature encoding branch. It focuses on local feature information within one receptive field and adds BN normalization after the second and fourth convolution kernels to reduce computation and prevent overfitting. The orange branch in Fig. 2 (b) illustrates this branch. In addition, the PCSD convolution module contains three surrounding context encoding branches. These branches use dilated convolutions (DConv) with kernel sizes of 3\(\times\)3, 5\(\times\)5, and 7\(\times\)7 and a dilation rate of 2 to encode the surrounding context features of the target from three different perspectives. This improves the learning of spatial details and increases the receptive field of the network model.
Dilated convolution is a technique for increasing the effective receptive field of a convolutional neural network by inserting gaps between the kernel elements. This allows the network to capture more contextual information from the input feature maps. The output of a dilated convolution operation can be defined as follows:
\[\begin{equation*} D=\sum_{h=1}^H\sum_{w=1}^W x\big(i+ar\times h,\,j+ar\times w\big)\times W_d, \tag{3} \end{equation*}\]
where \(H\) and \(W\) represent the height and width of the convolution kernel, \(x(i,j)\) denotes the feature value at position \((i,j)\), \(ar\) represents the dilation rate (PCSD uses a dilation rate of 2 for its dilated convolutions), \(D\) represents the output of the dilated convolution, and \({W_{d}} \in {\Re ^{C \times {d^h} \times {d^v} \times F}}\) represents the dilated convolution weight. After the feature maps from the three surrounding context encoding branches are obtained, they are concatenated along the channel dimension and fused with the local branch output \(y\):
\[\begin{equation*} P=\sigma(y+F_{1\times 1}(Z_{\mathrm{c}})), \tag{4} \end{equation*}\]
where \(\sigma\) is the ReLU activation function, \(F_{1\times1}\) denotes convolution with a 1\(\times\)1 kernel, and \(Z_{c}\) is the result of concatenating the feature maps from the three dilated convolution branches.
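Putting Eqs. (2)-(4) together, the PCSD block can be sketched in PyTorch as below. This is our reading of the structure in Fig. 2 (b); the exact placement of BN/ReLU layers and the layout of the 1\(\times\)1 fusion are assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class PCSDBlock(nn.Module):
    """Sketch of the PCSD block: a Non-bottleneck-1D local branch plus three
    dilated context branches (kernels 3/5/7, dilation 2), fused per Eq. (4)."""
    def __init__(self, c):
        super().__init__()
        # Local branch: factorized 1x3 / 3x1 strip convolutions, with BN after
        # the second and fourth kernels as described in the text.
        self.local = nn.Sequential(
            nn.Conv2d(c, c, (1, 3), padding=(0, 1)), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, (3, 1), padding=(1, 0)), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, (1, 3), padding=(0, 1)), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, (3, 1), padding=(1, 0)), nn.BatchNorm2d(c),
        )
        # Context branches: dilated convolutions with rate 2; padding k-1 keeps
        # the spatial size (effective kernel 2k-1).
        self.ctx = nn.ModuleList(
            [nn.Conv2d(c, c, k, padding=k - 1, dilation=2) for k in (3, 5, 7)]
        )
        self.fuse = nn.Conv2d(3 * c, c, 1)  # F_{1x1} over the concatenated maps
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.local(x) + x                       # Eq. (2): residual local branch
        z = torch.cat([b(x) for b in self.ctx], 1)  # Z_c: concatenated context maps
        return self.act(y + self.fuse(z))           # Eq. (4)

out = PCSDBlock(32)(torch.randn(1, 32, 64, 64))
print(out.shape)  # torch.Size([1, 32, 64, 64]) -- spatial size preserved
```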
The comparison of the detail branch before and after reconstruction is presented in Table 2, where the reference is the detail branch of the BiSeNet v2 network and Conv2d denotes the conventional 3\(\times\)3 convolution. The reconstructed detail branch keeps its output channels and convolution module counts aligned with those of the BiSeNet v2 detail branch, namely (64:64:128) and (2:3:3), respectively. To enhance the detail branch, the PCSD convolution module replaces the second conventional convolution module in each stage of the original detail branch. The reconstructed detail branch achieves an mIoU of 63.98%, which is 4.62% higher than the original detail branch's 59.36%, while the model size increases only slightly, by 0.18M, providing further evidence of the effectiveness of the detail branch reconstruction.
Finally, the accuracy gain contributed by each context encoding branch is investigated, and combinations of one, two, and all three dilated convolution branches with the local feature encoding branch are discussed, with experimental comparisons conducted on the Cityscapes dataset. Additionally, experiments comparing the accuracy of the PCSD convolution module with the conventional 3\(\times\)3 convolution [8], the MSCA convolution module [18], the CCA convolution module [19], and the CG convolution module [16] are performed. These comparisons confirm that the PCSD convolution module exhibits high accuracy and superior performance.
3.2 Semantic Branch Construction
The semantic branch adopts a deep-level classification semantic extraction method with narrow channels to capture deeper and more advanced semantics. Since the detail branch is present, the semantic branch does not require excessive channel capacity or complex downsampling for feature extraction in the shallow layers, which would increase model computation. Instead, it focuses solely on deep-level and high-level features. The channel capacity of each stage of the semantic branch is (16:32:96:128:128). The high-resolution shallow stages employ a lower channel capacity for preliminary feature encoding, and more channel resources are allocated to learning low-resolution, deep-level, high-level semantic features.
The detail branch aims to reduce feature loss during downsampling, while the semantic branch focuses on deeper semantic features. To reduce computational costs during downsampling, the semantic branch uses depthwise separable convolution [14] instead of the conventional 3\(\times\)3 convolution used in the detail branch. Depthwise separable convolution decomposes standard convolution into a depthwise convolution for filtering and a 1\(\times\)1 pointwise convolution for combination. In the semantic branch, the downsampling convolution has a stride of 2, image padding of 1, and a kernel size of 3. When the input feature map \(T\) has a length, width, and input channel size of \(D_i \times D_i \times M\) and the output feature map \(G\) has a length, width, and target output channel size of \(D_o \times D_o \times N\), the computational cost of the regular convolution is \(C_n\):
\[\begin{equation*} C_n=D_o^2\cdot{D_k}^2\cdot M\cdot N, \tag{5} \end{equation*}\]
where \(D_k\) is the kernel size. While for depthwise separable convolution, the computation cost \(C_d\) is:
\[\begin{equation*} C_d=D_o^2\cdot D_k^2\cdot M+D_o^2\cdot N\cdot M, \tag{6} \end{equation*}\]
the ratio of the computational cost of depthwise separable convolution to that of standard convolution is therefore:
\[\begin{equation*} \frac{{D_{o}}^{2}\cdot{D_{k}}^{2}\cdot M+{D_{o}}^{2}\cdot N\cdot M}{{D_{o}}^{2}\cdot{D_{k}}^{2}\cdot M\cdot N}=\frac{1}{N}+\frac{1}{{D_{k}}^{2}}. \tag{7} \end{equation*}\]
The depthwise separable convolution used in the downsampling of the semantic branch has a kernel size of 3, which reduces the computational cost by 8 to 9 times compared to the standard convolution used in BiSeNet v2.
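Plugging the semantic branch's kernel size \(D_k=3\) into Eq. (7) reproduces the 8-9\(\times\) figure: for example, with \(N=128\) output channels the ratio is \(1/128+1/9\approx0.119\), i.e. roughly an 8.4\(\times\) saving. A quick numerical check (channel counts taken from the semantic branch configuration, purely as illustrative inputs):

```python
def cost_ratio(n_out, k):
    """Ratio of depthwise-separable to standard convolution cost, Eq. (7)."""
    return 1 / n_out + 1 / (k ** 2)

# For N well above D_k^2 the speedup approaches k^2 = 9x, matching the
# 8-9x reduction cited for 3x3 kernels.
for n in (32, 96, 128):
    r = cost_ratio(n, 3)
    print(f"N={n:3d}: ratio={r:.3f}, speedup={1 / r:.1f}x")
```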
The convolution module structures used by the semantic branch for semantic information extraction are compared in Fig. 3, where \(d_{n \times n}\) represents a convolution with an \(n\times n\) kernel, DS represents a depthwise separable convolution, and C is the number of channels in the feature map. In each stage of the semantic branch, the ConvNeXt convolution module [9] is used after downsampling to generate dense contextual semantic information. The module comprises an inverted bottleneck residual structure with one 7\(\times\)7 and two 1\(\times\)1 convolution kernels and outperforms conventional convolution [9]. To restructure the original semantic branch of BiSeNet v2, this paper combines the simple and efficient ConvNeXt convolution module with depthwise separable convolution.
Fig. 3 Schematic diagram of the structures of the GE and ConvNeXt convolution modules. (a) GE block of BiSeNet v2. (b) ConvNeXt block
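For reference, the ConvNeXt inverted bottleneck block described above (one 7\(\times\)7 depthwise convolution followed by two 1\(\times\)1 convolutions that expand and then reduce the channels) can be sketched as follows. Details such as the 4\(\times\) expansion ratio and LayerNorm follow the ConvNeXt paper [9], not this article:

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Sketch of a ConvNeXt block: 7x7 depthwise conv -> LayerNorm ->
    1x1 expand (4x) -> GELU -> 1x1 reduce, with a residual connection."""
    def __init__(self, c):
        super().__init__()
        self.dwconv = nn.Conv2d(c, c, 7, padding=3, groups=c)  # depthwise 7x7
        self.norm = nn.LayerNorm(c)       # applied over the channel dimension
        self.pw1 = nn.Linear(c, 4 * c)    # 1x1 conv expressed as a linear layer
        self.act = nn.GELU()
        self.pw2 = nn.Linear(4 * c, c)

    def forward(self, x):
        r = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)         # NCHW -> NHWC for LayerNorm/Linear
        x = self.pw2(self.act(self.pw1(self.norm(x))))
        return r + x.permute(0, 3, 1, 2)  # back to NCHW, residual add

out = ConvNeXtBlock(96)(torch.randn(1, 96, 16, 16))
print(out.shape)  # torch.Size([1, 96, 16, 16])
```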
Table 3 provides a parameter comparison before and after the reconstruction of the original semantic branch in BiSeNet v2. At each stage, depthwise separable convolution (DSConv) is used for downsampling, followed by dense semantic feature encoding using ConvNeXt Block convolutions. The number of ConvNeXt convolutions in each stage is (3:3:9:6:3), chosen to balance accuracy and model performance. The proposed semantic branch reconstruction achieves a 2.94% improvement in mIoU over the original semantic branch of BiSeNet v2 on the Cityscapes dataset while increasing the model parameters by only 0.44M. This experiment confirms the effectiveness of the proposed semantic branch reconstruction.
3.3 Bilateral Aggregation Layer Optimization
The feature representations of the bilateral branches are complementary, with each branch unaware of the other's information, and the outputs of the two branches have different levels of feature representation. Therefore, the aggregation layer aims to merge these two types of feature representations. The BiSeNet v2 algorithm designed the Bilateral Guided Aggregation (BGA) layer, which achieved good performance. This paper makes minor adjustments to the aggregation layer based on the BGA. Figure 4 shows the modifications made to the BGA layer: the red parts indicate elements deleted from the BGA, the green parts indicate elements added or modified, and the black text remains consistent with the original BGA. In the original BGA structure, the semantic branch features pass through a sigmoid activation and then gate the features of the detail branch. The proposed approach applies the sigmoid activation function to the upsampled and downsampled results of both branches before multiplying them pixel-wise at the restored feature map size. Additionally, the original-size feature map of the detail branch, previously processed by depthwise separable convolution, is now processed by a conventional 3\(\times\)3 convolution to reduce the impact on the original detail features. The improved aggregation layer achieves a 1.83% improvement in mIoU over the original BGA module, with the number of parameters unchanged.
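Our reading of the fine-tuned aggregation layer can be sketched as below. This is a rough interpretation of the text and Fig. 4 only: both cross-scale products pass through a sigmoid before the pixel-wise multiplication, the detail-resolution path uses a plain 3\(\times\)3 convolution, and a final convolution fuses the pixel-wise sum. Channel counts, the stride-4 scale gap, and the exact conv layout are all assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModifiedBGA(nn.Module):
    """Hypothetical sketch of the fine-tuned bilateral guided aggregation."""
    def __init__(self, c):
        super().__init__()
        self.detail_conv = nn.Conv2d(c, c, 3, padding=1)            # replaces DSConv
        self.detail_down = nn.Conv2d(c, c, 3, stride=4, padding=1)  # to semantic scale
        self.fuse = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, d, s):
        # d: detail features (high res); s: semantic features (1/4 of d here).
        s_up = torch.sigmoid(
            F.interpolate(s, size=d.shape[2:], mode="bilinear", align_corners=False)
        )
        hi = self.detail_conv(d) * s_up                   # high-res gated product
        lo = torch.sigmoid(self.detail_down(d)) * s       # low-res gated product
        lo = F.interpolate(lo, size=d.shape[2:], mode="bilinear", align_corners=False)
        return self.fuse(hi + lo)                         # conv over pixel-wise sum

out = ModifiedBGA(128)(torch.randn(1, 128, 64, 64), torch.randn(1, 128, 16, 16))
print(out.shape)  # torch.Size([1, 128, 64, 64])
```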
4. Experimental Study
In this section, the dataset and implementation details are first introduced. Next, the impact of each surrounding context encoding branch in the PCSD convolution module on accuracy is further studied, and the reliability and feasibility of the proposed PCSD are verified by comparison with commonly used convolution modules. Then, through overall ablation experiments, we demonstrate the impact of each component of our proposed semantic branch, detail branch, and aggregation layer improvement method on the accuracy of the Cityscapes validation set. Finally, we report the final accuracy compared with other algorithms to verify the advancement of the proposed BiConvNet and the effectiveness of the improvements to BiSeNet v2.
4.1 Datasets and Experimental Environment
The Cityscapes dataset focuses on the semantic understanding of urban street scenes from a car's perspective and contains a set of high-resolution images from 50 different cities in Europe. The dataset is divided into a training set, a validation set, and a test set, with 2,975, 500, and 1,525 images, respectively. In our experiments, we only use the finely annotated images, which cover 19 segmentation classes, to validate the effectiveness of our proposed method. The BDD100K image segmentation dataset is a large-scale and diverse driving dataset designed for autonomous driving research. It includes 10,000 image frames covering various weather conditions, times of day, and geographical locations. Each image is finely annotated at the pixel level across 19 major categories, such as lane markings, traffic signs, pedestrians, and vehicles. Of these images, 7,000 are used for training, 1,000 for validation, and 2,000 for testing. The batch size during training is 2, and the actual input image size is (512\(\times\)512\(\times\)3). Training runs for 400 iterations to smooth out error fluctuations, and the highest accuracy attained during training is reported as the final accuracy. The segmentation accuracy metric is the standard mean Intersection over Union (mIoU). The experiments are implemented in PyTorch 1.13 with the MMSegmentation framework. Inference is performed on a GPU with 12GB of memory (NVIDIA GeForce RTX 3060) under CUDA 11.6. Training uses the AdamW [24] optimizer with a weight decay of 0.05. Inspired by MobileNet [14], BiConvNet uses the poly learning rate policy, with a base of 0.045 and a power of 1. For data augmentation during training, BiConvNet uses operations such as random resizing between 0.5 and 2, random cropping, horizontal flipping, photometric image enhancement, and normalization.
The model uses cross-entropy loss as the head layer loss during training, with a loss weight of 1.0.
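The poly learning rate policy mentioned above decays the rate to zero over training (linearly, since the power is 1). A minimal sketch, using the base rate of 0.045 from the text and an arbitrary illustrative iteration budget:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=1.0):
    """Poly learning-rate policy: base_lr * (1 - iter/max_iter) ** power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

base, total = 0.045, 400  # total is an illustrative value, not the paper's schedule
print(poly_lr(base, 0, total))    # 0.045 at the start
print(poly_lr(base, 200, total))  # 0.0225 halfway through, with power 1
```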
4.2 Ablation Experiment of PCSD Convolution Module
This section introduces the ablation experiment of the PCSD convolution module to validate the effectiveness of each surrounding context encoding branch of the proposed PCSD. The experimental design is as follows: Non-bottleneck-1D is used in the detail branch to extract spatial detail information under the constructed BiConvNet network framework, and dilated convolutions are introduced step by step. For instance, experimental group 0 in Table 4 uses only the Non-bottleneck-1D convolution module in the detail branch of the proposed BiConvNet network. Experimental group 1 adds one dilated convolution surrounding context encoding branch, with a 3\(\times\)3 kernel and an interval of 1, to the Non-bottleneck-1D convolution module. Experimental group 4 adds two dilated convolution surrounding context encoding branches, with 3\(\times\)3 and 5\(\times\)5 kernels and an interval of 1, to the Non-bottleneck-1D convolution module. Finally, experimental group 7 is the full PCSD convolution module, consisting of the three types of dilated convolutions and the Non-bottleneck-1D convolution module. These experiments demonstrate the effectiveness of each surrounding context encoding branch of the proposed PCSD convolution module.
Based on the experiments, it is clear that using only local dense convolutions to extract spatial detail features is not sufficient, and encoding surrounding contextual information contributes positively to the final accuracy. When only one context encoding branch is combined with Non-bottleneck-1D, the 3\(\times\)3 dilated convolution with the smaller receptive field achieves higher accuracy. However, when two context encoding branches are used, the combination of 5\(\times\)5 and 7\(\times\)7 dilated convolutions achieves higher accuracy. The PCSD convolution module combines these three dilated convolutions with different receptive fields to encode surrounding contextual information and achieves the highest accuracy, which validates the reliability and feasibility of the proposed PCSD convolution module.
4.3 Convolutional Module Comparison Experiment
The design inspiration for the PCSD convolution module comes from various other convolution modules, such as the regular 3\(\times\)3 convolution [8], Non-bottleneck-1D [17], MSCA [18], CCA [19], and CG [16]. This section compares the performance of the PCSD convolution module with these other convolution modules when used in the detail branch of BiConvNet on the Cityscapes validation set. Experiments are conducted by replacing the spatial detail encoding convolution module in the detail branch of BiConvNet with each of the aforementioned convolution modules and comparing their performance.
The experimental results presented in Table 5 indicate that, compared to the conventional 3\(\times\)3 convolution, improvement schemes such as Non-bottleneck-1D strip convolution and CCA cross convolution are effective. Non-bottleneck-1D offers only a limited accuracy improvement over the conventional Conv2d convolution, and a significant gap remains compared to the other three convolution modules. However, the PCSD convolution module proposed in this paper, which builds on Non-bottleneck-1D by combining strip convolution and dilated convolution, achieves an 8.09% improvement in accuracy and maintains the highest segmentation accuracy across all categories. This once again confirms that strip convolution is more efficient than conventional convolution and that encoding surrounding contextual information plays a vital role in preserving spatial detail information, demonstrating the high accuracy and advanced nature of the proposed PCSD convolution module.
4.4 BiConvNet Network Ablation Experiment
In this section, we present ablation experiments to verify the effectiveness of each component of our proposed BiConvNet algorithm. The experimental plan is to gradually improve the detail branch, semantic branch, and aggregation layer of BiSeNet v2, which serves as the reference model, until it becomes the BiConvNet network proposed in this paper. As shown in Table 6, the “Baseline” represents the original detail and semantic branches in the BiSeNet v2 algorithm, “BGA” represents the bilateral guided aggregation proposed in BiSeNet v2, and “Improved” represents the improvement and reconstruction schemes of the detail branch, semantic branch, and aggregation layer proposed in this paper.
The results demonstrate that replacing the regular Conv2d convolutions in BiSeNet v2 with PCSD convolution modules increases the parameter count by only 0.18M while raising accuracy by 4.62% (Experiment Group 2). A further accuracy improvement of 2.94% is achieved by improving the semantic branch with ConvNeXt and depthwise separable convolution (Experiment Group 3), with a slight parameter increase of only 0.44M. After fine-tuning the BGA aggregation layer (Experiment Group 4), the final accuracy of the BiConvNet network increases by 9.39 percentage points compared to BiSeNet v2, reaching 68.75%, with only a slight increase in parameters, which meets the design principle of being lightweight. These ablation experiments confirm the importance of the proposed PCSD detail branch, the semantic branch consisting of depthwise separable convolution and ConvNeXt convolution modules, and the fine-tuning of the BGA aggregation layer in the BiConvNet algorithm. As shown in Fig. 5, BiConvNet greatly improves edge detail retention compared with BiSeNet v2.
4.5 Algorithm Comparison Experiment
This section presents experiments comparing the accuracy of BiConvNet with other multi-branch image segmentation algorithms, as well as common purely convolutional and some transformer-based image segmentation algorithms, on the Cityscapes and BDD100K datasets. Different algorithms may achieve higher accuracy in their original papers because the authors incorporate additional network layers and more advanced training strategies; this article primarily compares the backbone networks in order to highlight their relative strengths. The purpose of the experiments is to evaluate the accuracy of the different algorithmic frameworks under the same experimental conditions and parameters, while also comparing their model parameters. According to Table 7, BiSeNet v2 achieves an accuracy of 59.36%, which is 5.35% higher than the similar bilateral-branch image segmentation network Fast-SCNN [22]. However, bilateral-branch algorithms such as BiSeNet v2 and Fast-SCNN typically show weaker urban road segmentation capability than common single-branch network algorithms such as ConvNeXt [9], SegNeXt [18], SegFormer [25], CGNet [16], and STDC [26]. The BiConvNet algorithm proposed in this paper on the basis of BiSeNet v2 ranks first in the majority of per-class IoU scores on the Cityscapes dataset and achieves an mIoU of 68.75%, confirming the effectiveness and advanced performance of the BiConvNet bilateral-branch image segmentation algorithm. Table 8 shows that the proposed BiConvNet algorithm also outperforms algorithms such as BiSeNet v2, Fast-SCNN, and SegFormer in accuracy on the BDD100K dataset.
5. Conclusions
This work aimed to improve the performance of the BiSeNet v2 bilateral-branch image segmentation algorithm and proposed a novel real-time semantic segmentation model called BiConvNet. The approach combines the strengths of strip convolution and dilated convolution to create the Pixel-Contextual Similarity Dilated (PCSD) convolution module, which captures local spatial details and surrounding contextual information in the detail branch of the model. We explored the contributions of dilated convolution and stripe convolution to image segmentation accuracy and compared them with common convolutions, confirming the superiority of the PCSD convolution module.
To further improve the model’s ability to extract high-level semantic information, the semantic branch is reconstructed using depth-wise separable convolution and ConvNeXt convolution modules. The depth-wise separable convolution helps to reduce the number of model parameters, making it more computationally efficient, while the ConvNeXt convolution module helps to enhance feature representations by exploiting interdependencies between channels. Finally, we fine-tuned the BGA aggregation layer of BiSeNet v2 to achieve additional accuracy gains.
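A minimal PyTorch sketch of the building block this describes, combining a depthwise convolution (one filter per channel, hence few parameters) with a pointwise inverted bottleneck that mixes information across channels, in the style of a ConvNeXt block. The kernel size, expansion ratio, and normalization choice follow the public ConvNeXt design and are assumptions about how the semantic branch is assembled here.

```python
import torch
import torch.nn as nn

class ConvNeXtStyleBlock(nn.Module):
    """Sketch of a ConvNeXt-style block built on depthwise separable
    convolution. Widths and norm placement are illustrative assumptions."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        # Depthwise 7x7: groups=dim gives one filter per channel (cheap, large kernel)
        self.dwconv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # applied over channels (NHWC layout)
        # Pointwise inverted bottleneck: expand, nonlinearity, project back
        self.pw1 = nn.Linear(dim, expansion * dim)
        self.act = nn.GELU()
        self.pw2 = nn.Linear(expansion * dim, dim)

    def forward(self, x):                        # x: (N, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # NCHW -> NHWC for LayerNorm/Linear
        x = self.pw2(self.act(self.pw1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to NCHW
        return x + residual
```

The depthwise stage costs only \(7\times7\times C\) weights rather than \(7\times7\times C^2\), while the two pointwise layers carry the channel interdependencies, which is exactly the parameter/accuracy trade-off argued for above.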
The effectiveness of the proposed PCSD convolution module and the overall improvement scheme of BiSeNet v2 are validated through ablation experiments and comparative experiments, demonstrating the advanced nature of BiConvNet. In particular, our experiments indicate that the accuracy of BiConvNet is significantly higher than that of BiSeNet v2 and Fast-Scnn, two similar dual-branch image segmentation networks. The experiments also validate the improvements introduced by the proposed detail branch, semantic branch, and aggregation layer, bringing substantial accuracy gains in image segmentation across all categories in the Cityscapes dataset.
Future work includes further investigating the functions of the detail and semantic branches, with the goal of improving their ability to extract spatial detail information and deep semantic information under a differentiated design. Additionally, multi-task learning objectives such as image depth estimation will be introduced to expand the model’s applicability to real-world problems in autonomous driving.
References
[1] S. Grewal and C. Rama Krishna, “Dimensionality reduction for face recognition using principal component analysis based big bang big crunch optimization algorithm,” 2nd International Conference on Electrical and Electronics Engineering, pp.949-955, Jan 2021.
[2] K. He, F. Wen, and J. Sun, “K-Means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes,” 26th IEEE Conference on Computer Vision and Pattern Recognition, pp.2938-2945, June 2013.
[3] K.H. Memon, S. Memon, M.A. Qureshi, M.B. Alvi, D. Kumar, and R.A. Shah, “Kernel Possibilistic Fuzzy c-Means Clustering with Local Information for Image Segmentation,” International Journal of Fuzzy Systems, vol.21, no.1, pp.321-332, 2018.
[4] E. Shelhamer, J. Long, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol.39, no.4, pp.640-651, 2017.
[5] Z.G. Wu and Y. Z, “SWformer-VO: A monocular visual odometry model based on swin transformer,” IEEE Robot. Autom. Lett., vol.9, no.5, pp.4766-4773, 2024.
[6] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.3141-3149, June 2019.
[7] M. Yin, Z. Yao, Y. Cao, X. Li, Z. Zhang, S. Lin, and H. Hu, “Disentangled Non-local Neural Networks,” 16th European Conference on Computer Vision, pp.191-207, Aug. 2020.
[8] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, “BiSeNet V2: Bilateral Network with Guided Aggregation for Real-Time Semantic Segmentation,” International Journal of Computer Vision, vol.129, no.11, pp.3051-3068, 2021.
[9] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A ConvNet for the 2020s,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.11966-11976, June 2022.
[10] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” 29th IEEE Conference on Computer Vision and Pattern Recognition, pp.3213-3223, June 2016.
[11] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell, “Bdd100k: A diverse driving dataset for heterogeneous multitask learning,” Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.2636-2645, 2020.
[12] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” 30th IEEE Conference on Computer Vision and Pattern Recognition, pp.1800-1807, July 2017.
[13] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices,” 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.6848-6856, June 2018.
[14] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, https://arxiv.org/abs/1704.04861, April 2017.
[15] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.4510-4520, Dec. 2018.
[16] T. Wu, S. Tang, R. Zhang, J. Cao, and Y. Zhang, “CGNet: A light-weight context guided network for semantic segmentation,” IEEE Trans. Image Process., vol.30, pp.1169-1179, 2021.
[17] E. Romera, J.M. Alvarez, L.M. Bergasa, and R. Arroyo, “ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation,” IEEE Trans. Intell. Transp. Syst., vol.19, no.1, pp.263-272, 2018.
[18] M.-H. Guo, C.-Z. Lu, et al., “SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation,” arXiv:2209.08575, https://arxiv.org/abs/2209.08575, Sept. 2022.
[19] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “CCNet: Criss-Cross Attention for Semantic Segmentation,” IEEE/CVF International Conference on Computer Vision, pp.603-612, Oct. 2019.
[20] R.K. Poudel, U. Bonde, et al., “ContextNet: Exploring context and detail for semantic segmentation in real-time,” British Machine Vision Conference, pp.1-12, Sept. 2018.
[21] H. Zhao, X. Qi, et al., “ICNet for Real-Time Semantic Segmentation on High-Resolution Images,” 15th European Conference on Computer Vision, pp.418-434, Sept. 2018.
[22] R.K. Poudel, S. Liwicki, and R. Cipolla, “Fast-SCNN: Fast semantic segmentation network,” 30th British Machine Vision Conference, pp.1-9, Sept. 2019.
[23] J. Alvarez and L. Petersson, “DecomposeMe: Simplifying ConvNets for end-to-end learning,” arXiv:1606.05426, https://arxiv.org/abs/1606.05426, June 2016.
[24] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv:1711.05101, https://arxiv.org/abs/1711.05101, Jan. 2019.
[25] E. Xie, W. Wang, et al., “SegFormer: Simple and efficient design for semantic segmentation with transformers,” 35th Conference on Neural Information Processing Systems, pp.12077-12090, Dec. 2021.
[26] M. Fan, S. Lai, J. Huang, X. Wei, Z. Chai, J. Luo, and X. Wei, “Rethinking BiSeNet For Real-Time Semantic Segmentation,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.9711-9720, April 2021.