Yuan LI Tingting HU Ryuji FUCHIKAMI Takeshi IKENAGA
1 millisecond (1-ms) vision systems are gaining increasing attention in diverse fields like factory automation and robotics, as the ultra-low delay ensures seamless and timely responses. Superpixel segmentation is a pivotal preprocessing to reduce the number of image primitives for subsequent processing. Recently, there has been a growing emphasis on leveraging deep network-based algorithms to pursue superior performance and better integration into other deep network tasks. Superpixel Sampling Network (SSN) employs a deep network for feature generation and employs differentiable SLIC for superpixel generation. SSN achieves high performance with a small number of parameters. However, implementing SSN on FPGAs for ultra-low delay faces challenges due to the final layer’s aggregation of intermediate results. To address this limitation, this paper proposes an aggregated to pipelined structure for FPGA implementation. The final layer is decomposed into individual final layers for each intermediate result. This architectural adjustment eliminates the need for memory to store intermediate results. Concurrently, the proposed structure leverages decomposed layers to facilitate a pipelined structure with pixel streaming input to achieve ultra-low latency. To cooperate with the pipelined structure, layer-partitioned memory architecture is proposed. Each final layer has dedicated memory for storing superpixel center information, allowing values to be read and calculated from memory without conflicts. Calculation results of each final layer are accumulated, and the result of each pixel is obtained as the stream reaches the last layer. Evaluation results demonstrate that boundary recall and under-segmentation error remain comparable to SSN, with an average label consistency improvement of 0.035 over SSN. From a hardware performance perspective, the proposed system processes 1000 FPS images with a delay of 0.947 ms/frame.
This article focuses on improving the BiSeNet v2 bilateral branch image segmentation network structure, enhancing its learning ability for spatial details and overall image segmentation accuracy. A modified network called “BiconvNet” is proposed. Firstly, to extract shallow spatial details more effectively, a parallel concatenated strip and dilated (PCSD) convolution module is proposed and used to extract local features and surrounding contextual features in the detail branch. Continuing on, the semantic branch is reconstructed using the lightweight capability of depth separable convolution and high performance of ConvNet, in order to enable more efficient learning of deep advanced semantic features. Finally, fine-tuning is performed on the bilateral guidance aggregation layer of BiSeNet v2, enabling better fusion of the feature maps output by the detail branch and semantic branch. The experimental part discusses the contribution of stripe convolution and different sizes of empty convolution to image segmentation accuracy, and compares them with common convolutions such as Conv2d convolution, CG convolution and CCA convolution. The experiment proves that the PCSD convolution module proposed in this paper has the highest segmentation accuracy in all categories of the Cityscapes dataset compared with common convolutions. BiConvNet achieved a 9.39% accuracy improvement over the BiSeNet v2 network, with only a slight increase of 1.18M in model parameters. A mIoU accuracy of 68.75% was achieved on the validation set. Furthermore, through comparative experiments with commonly used autonomous driving image segmentation algorithms in recent years, BiConvNet demonstrates strong competitive advantages in segmentation accuracy on the Cityscapes and BDD100K datasets.
Xiangrun LI Qiyu SHENG Guangda ZHOU Jialong WEI Yanmin SHI Zhen ZHAO Yongwei LI Xingfeng LI Yang LIU
Automated tongue segmentation plays a crucial role in the realm of computer-aided tongue diagnosis. The challenge lies in developing algorithms that achieve higher segmentation accuracy and maintain less memory space and swift inference capabilities. To relieve this issue, we propose a novel Pool-unet integrating Pool-former and Multi-task mask learning for tongue image segmentation. First of all, we collected 756 tongue images taken in various shooting environments and from different angles and accurately labeled the tongue under the guidance of a medical professional. Second, we propose the Pool-unet model, combining a hierarchical Pool-former module and a U-shaped symmetric encoder-decoder with skip-connections, which utilizes a patch expanding layer for up-sampling and a patch embedding layer for down-sampling to maintain spatial resolution, to effectively capture global and local information using fewer parameters and faster inference. Finally, a Multi-task mask learning strategy is designed, which improves the generalization and anti-interference ability of the model through the Multi-task pre-training and self-supervised fine-tuning stages. Experimental results on the tongue dataset show that compared to the state-of-the-art method (OET-NET), our method has 25% fewer model parameters, achieves 22% faster inference times, and exhibits 0.91% and 0.55% improvements in Mean Intersection Over Union (MIOU), and Mean Pixel Accuracy (MPA), respectively.
2D and 3D semantic segmentation play important roles in robotic scene understanding. However, current 3D semantic segmentation heavily relies on 3D point clouds, which are susceptible to factors such as point cloud noise, sparsity, estimation and reconstruction errors, and data imbalance. In this paper, a novel approach is proposed to enhance 3D semantic segmentation by incorporating 2D semantic segmentation from RGB-D sequences. Firstly, the RGB-D pairs are consistently segmented into 2D semantic maps using the tracking pipeline of Simultaneous Localization and Mapping (SLAM). This process effectively propagates object labels from full scans to corresponding labels in partial views with high probability. Subsequently, a novel Semantic Projection (SP) block is introduced, which integrates features extracted from localized 2D fragments across different camera viewpoints into their corresponding 3D semantic features. Lastly, the 3D semantic segmentation network utilizes a combination of 2D-3D fusion features to facilitate a merged semantic segmentation process for both 2D and 3D. Extensive experiments conducted on public datasets demonstrate the effective performance of the proposed 2D-assisted 3D semantic segmentation method.
Existing weakly-supervised segmentation approaches based on image-level annotations may focus on the most activated region in the image and tend to identify only part of the target object. Intuitively, high-level semantics among objects of the same category in different images could help to recognize corresponding activated regions of the query. In this study, a scheme called Cycle-Consistency of Semantics Network (CyCSNet) is proposed, which can enhance the activation of the potential inactive regions of the target object by utilizing the cycle-consistent semantics from images of the same category in the training set. Moreover, a Dynamic Correlation Feature Selection (DCFS) algorithm is derived to reduce the noise from pixel-wise samples of low relevance for better training. Experiments on the PASCAL VOC 2012 dataset show that the proposed CyCSNet achieves competitive results compared with state-of-the-art weakly-supervised segmentation approaches.
Ji XI Yue XIE Pengxu JIANG Wei JIANG
Currently, a significant portion of acoustic scene categorization (ASC) research is centered around utilizing Convolutional Neural Network (CNN) models. This preference is primarily due to CNN’s ability to effectively extract time-frequency information from audio recordings of scenes by employing spectrum data as input. The expression of many dimensions can be achieved by utilizing 2D spectrum characteristics. Nevertheless, the diverse interpretations of the same object’s existence in different positions on the spectrum map can be attributed to the discrepancies between spectrum properties and picture qualities. The lack of distinction between different aspects of input information in ASC-based CNN networks may result in a decline in system performance. Considering this, a feature pyramid segmentation (FPS) approach based on CNN is proposed. The proposed approach involves utilizing spectrum features as the input for the model. These features are split based on a preset scale, and each segment-level feature is then fed into the CNN network for learning. The SoftMax classifier will receive the output of all feature scales, and these high-level features will be fused and fed to it to categorize different scenarios. The experiment provides evidence to support the efficacy of the FPS strategy and its potential to enhance the performance of the ASC system.
Xiaosheng YU Jianning CHI Ming XU
Accurate segmentation of fundus vessel structure can effectively assist doctors in diagnosing eye diseases. In this paper, we propose a fundus blood vessel segmentation network combined with cross-modal features and verify our method on the public data set OCTA-500. Experimental results show that our method has high accuracy and robustness.
Ryohei KANKE Masanobu TAKAHASHI
Amodal Instance Segmentation (AIS) aims to segment the regions of both visible and invisible parts of overlapping objects. The mainstream Mask R-CNN-based methods are unsuitable for thin objects with large overlaps because of their object proposal features with bounding boxes for three reasons. First, capturing the entire shapes of overlapping thin objects is difficult. Second, the bounding boxes of close objects are almost identical. Third, a bounding box contains many objects in most cases. In this paper, we propose a box-free AIS method, Seed-to-Mask, for thin objects with large overlaps. The method specifies a target object using a seed and iteratively extends the segmented region. We have achieved better performance in experiments on artificial data consisting only of thin objects.
Chunhua QIAN Xiaoyan QIN Hequn QIANG Changyou QIN Minyang LI
The segmentation performance of fresh tea sprouts is inadequate due to the uncontrollable posture. A novel method for Fresh Tea Sprouts Segmentation based on Capsule Network (FTS-SegCaps) is proposed in this paper. The spatial relationship between local parts and whole tea sprout is retained and effectively utilized by a deep encoder-decoder capsule network, which can reduce the effect of tea sprouts with uncontrollable posture. Meanwhile, a patch-based local dynamic routing algorithm is also proposed to solve the parameter explosion problem. The experimental results indicate that the segmented tea sprouts via FTS-SegCaps are almost coincident with the ground truth, and also show that the proposed method has a better performance than the state-of-the-art methods.
Chee Siang LEOW Hideaki YAJIMA Tomoki KITAGAWA Hiromitsu NISHIZAKI
Text detection is a crucial pre-processing step in optical character recognition (OCR) for the accurate recognition of text, including both fonts and handwritten characters, in documents. While current deep learning-based text detection tools can detect text regions with high accuracy, they often treat multiple lines of text as a single region. To perform line-based character recognition, it is necessary to divide the text into individual lines, which requires a line detection technique. This paper focuses on the development of a new approach to single-line detection in OCR that is based on the existing Character Region Awareness For Text detection (CRAFT) model and incorporates a deep neural network specialized in line segmentation. However, this new method may still detect multiple lines as a single text region when multi-line text with narrow spacing is present. To address this, we also introduce a post-processing algorithm to detect single text regions using the output of the single-line segmentation. Our proposed method successfully detects single lines, even in multi-line text with narrow line spacing, and hence improves the accuracy of OCR.
Ann Jelyn TIEMPO Yong-Jin JEONG
Field Programmable Gate Array (FPGA) is gaining popularity because of their reconfigurability which brings in security concerns like inserting hardware trojan. Various detection methods to overcome this threat have been proposed but in the ASIC's supply chain and cannot directly apply to the FPGA application. In this paper, the authors aim to implement a structural feature-based detection method for detecting hardware trojan in a cell-level netlist, which is not well explored yet, where the nets are segmented into smaller groups based on their interconnection and further analyzed by looking at their structural similarities. Experiments show positive performance with an average detection rate of 95.41%, an average false alarm rate of 2.87% and average accuracy of 96.27%.
Yu KASHIHARA Takashi MATSUBARA
The diffusion model has achieved success in generating and editing high-quality images because of its ability to produce fine details. Its superior generation ability has the potential to facilitate more detailed segmentation. This study presents a novel approach to segmentation tasks using an inverse heat dissipation model, a kind of diffusion-based models. The proposed method involves generating a mask that gradually shrinks to fit the shape of the desired segmentation region. We comprehensively evaluated the proposed method using multiple datasets under varying conditions. The results show that the proposed method outperforms existing methods and provides a more detailed segmentation.
Various haze removal methods based on the atmospheric scattering model have been presented in recent years. Most methods have targeted strong haze images where light is scattered equally in all color channels. This paper presents a haze removal method using near-infrared (NIR) images for relatively weak haze images. In order to recover the lost edges, the presented method first extracts edges from an appropriately weighted NIR image and fuses it with the color image. By introducing a wavelength-dependent scattering model, our method then estimates the transmission map for each color channel and recovers the color more naturally from the edge-recovered image. Finally, the edge-recovered and the color-recovered images are blended. In this blending process, the regions with high lightness, such as sky and clouds, where unnatural color shifts are likely to occur, are effectively estimated, and the optimal weighting map is obtained. Our qualitative and quantitative evaluations using 59 pairs of color and NIR images demonstrated that our method can recover edges and colors more naturally in weak haze images than conventional methods.
Shangdong LIU Chaojun MEI Shuai YOU Xiaoliang YAO Fei WU Yimu JI
The thermal imaging pedestrian segmentation system has excellent performance in different illumination conditions, but it has some drawbacks(e.g., weak pedestrian texture information, blurred object boundaries). Meanwhile, high-performance large models have higher latency on edge devices with limited computing performance. To solve the above problems, in this paper, we propose a real-time thermal infrared pedestrian segmentation method. The feature extraction layers of our method consist of two paths. Firstly, we utilize the lossless spatial downsampling to obtain boundary texture details on the spatial path. On the context path, we use atrous convolutions to improve the receptive field and obtain more contextual semantic information. Then, the parameter-free attention mechanism is introduced at the end of the two paths for effective feature selection, respectively. The Feature Fusion Module (FFM) is added to fuse the semantic information of the two paths after selection. Finally, we accelerate method inference through multi-threading techniques on the edge computing device. Besides, we create a high-quality infrared pedestrian segmentation dataset to facilitate research. The comparative experiments on the self-built dataset and two public datasets with other methods show that our method also has certain effectiveness. Our code is available at https://github.com/mcjcs001/LEIPNet.
Jeyoen KIM Takumi SOMA Tetsuya MANABE Aya KOJIMA
This paper attempts to identify which side of the road a bicycle is currently riding on using a common camera for realizing an advanced bicycle navigation system and bicycle riding safety support system. To identify the roadway area, the proposed method performs semantic segmentation on a front camera image captured by a bicycle drive recorder or smartphone. If the roadway area extends from the center of the image to the right, the bicyclist is riding on the left side of the roadway (i.e., the correct riding position in Japan). In contrast, if the roadway area extends to the left, the bicyclist is on the right side of the roadway (i.e., the incorrect riding position in Japan). We evaluated the accuracy of the proposed method on various road widths with different traffic volumes using video captured by riding bicycles in Tsuruoka City, Yamagata Prefecture, and Saitama City, Saitama Prefecture, Japan. High accuracy (>80%) was achieved for any combination of the segmentation model, riding side identification method, and experimental conditions. Given these results, we believe that we have realized an effective image segmentation-based method to identify which side of the roadway a bicycle riding is on.
William-Fabrice BROU Quang-Thang DUONG Minoru OKADA
Parallel line feeder (PLF) consisting of a two-wire transmission line operating in the MHz band has been proposed as a wide-coverage short-distance wireless charging. In the MHz band, a PLF of several meters suffers from standing wave effect, resulting in fluctuation in power transfer efficiency accordingly to the receiver's position. This paper studies a modified version of the system, where the PLF is divided into individually compensated segments to mitigate the standing wave effect. Modelling the PLF as a lossy transmission line, this paper theoretically shows that if the segments' lengths are properly determined, it is able to improve and stabilize the efficiency for all positions. Experimental results at 27.12 MHz confirm the theoretical analysis and show that a fairly high efficiency of 70% can be achieved.
He LI Yutaro IWAMOTO Xianhua HAN Lanfen LIN Akira FURUKAWA Shuzo KANASAKI Yen-Wei CHEN
Convolutional neural networks (CNNs) have become popular in medical image segmentation. The widely used deep CNNs are customized to extract multiple representative features for two-dimensional (2D) data, generally called 2D networks. However, 2D networks are inefficient in extracting three-dimensional (3D) spatial features from volumetric images. Although most 2D segmentation networks can be extended to 3D networks, the naively extended 3D methods are resource-intensive. In this paper, we propose an efficient and accurate network for fully automatic 3D segmentation. Specifically, we designed a 3D multiple-contextual extractor to capture rich global contextual dependencies from different feature levels. Then we leveraged an ROI-estimation strategy to crop the ROI bounding box. Meanwhile, we used a 3D ROI-attention module to improve the accuracy of in-region segmentation in the decoder path. Moreover, we used a hybrid Dice loss function to address the issues of class imbalance and blurry contour in medical images. By incorporating the above strategies, we realized a practical end-to-end 3D medical image segmentation with high efficiency and accuracy. To validate the 3D segmentation performance of our proposed method, we conducted extensive experiments on two datasets and demonstrated favorable results over the state-of-the-art methods.
Yuanfa JI Sisi SONG Xiyan SUN Ning GUO Youming LI
In order to improve the frequency band utilization and avoid mutual interference between signals, the BD3 satellite signals adopt Binary Offset Carrier (BOC) modulation. On one hand, BOC modulation has a narrow main peak width and strong anti-interference ability; on the other hand, the phenomenon of false acquisition locking caused by the multi-peak characteristic of BOC modulation itself needs to be resolved. In this context, this paper proposes a new BOC(n,n) unambiguous acquisition algorithm based on segmentation reconstruction. The algorithm is based on splitting the local BOC signal into four parts in each subcarrier period. The branch signal and the received signal are correlated with the received signal to generate four branch correlation signals. After a series of combined reconstructions, the final signal detection function completely eliminates secondary peaks. A simulation shows that the algorithm can completely eliminate the sub-peak interference for the BOC signals modulated by subcarriers with different phase. The characteristics of narrow correlation peak are retained. Experiments show that the proposed algorithm has superior performance in detection probability and peak-to-average ratio.
Ann Jelyn TIEMPO Yong-Jin JEONG
Using third-party intellectual properties (3PIP) has been a norm in IC design development process to meet the time-to-market demand and at the same time minimizing the cost. But this flow introduces a threat, such as hardware trojan, which may compromise the security and trustworthiness of underlying hardware, like disclosing confidential information, impeding normal execution and even permanent damage to the system. In years, different detections methods are explored, from just identifying if the circuit is infected with hardware trojan using conventional methods to applying machine learning where it identifies which nets are most likely are hardware trojans. But the performance is not satisfactory in terms of maximizing the detection rate and minimizing the false positive rate. In this paper, a new hardware trojan detection approach is proposed where gate-level netlist is segmented into regions first before analyzing which nets might be hardware trojans. The segmentation process depends on the nets' connectivity, more specifically by looking on each fanout points. Then, further analysis takes place by means of computing the structural similarity of each segmented region and differentiate hardware trojan nets from normal nets. Experimental results show 100% detection of hardware trojan nets inserted on each benchmark circuits and an overall average of 1.38% of false positive rates which resulted to a higher accuracy with an average of 99.31%.
Chongren ZHAO Yinhui ZHANG Zifen HE Yunnan DENG Ying HUANG Guangchen CHEN
Aiming at the problem of spatial focus regions distribution dispersion and dislocation in feature pyramid networks and insufficient feature dependency acquisition in both spatial and channel dimensions, this paper proposes a spatial-temporal aggregated shuffle attention for video instance segmentation (STASA-VIS). First, an mixed subsampling (MS) module to embed activating features from the low-level target area of feature pyramid into the high-level is designed, so as to aggregate spatial information on target area. Taking advantage of the coherent information in video frames, STASA-VIS uses the first ones of every 5 video frames as the key-frames and then propagates the keyframe feature maps of the pyramid layers forward in the time domain, and fuses with the non-keyframe mixed subsampled features to achieve time-domain consistent feature aggregation. Finally, STASA-VIS embeds shuffle attention in the backbone to capture the pixel-level pairwise relationship and dimensional dependencies among the channels and reduce the computation. Experimental results show that the segmentation accuracy of STASA-VIS reaches 41.2%, and the test speed reaches 34FPS, which is better than the state-of-the-art one stage video instance segmentation (VIS) methods in accuracy and achieves real-time segmentation.