
Keyword Search Result

[Keyword] Ti(30728hit)

301-320hit(30728hit)

  • Robust Visual Tracking Using Hierarchical Vision Transformer with Shifted Windows Multi-Head Self-Attention

    Peng GAO  Xin-Yue ZHANG  Xiao-Li YANG  Jian-Cheng NI  Fei WANG  

     
    LETTER-Image Recognition, Computer Vision

      Publicized:
    2023/10/20
      Vol:
    E107-D No:1
      Page(s):
    161-164

    Although Siamese trackers have attracted much attention in recent years due to their scalability and efficiency, researchers have ignored the background appearance, which makes these trackers unsuitable for recognizing arbitrary target objects under diverse variations, especially in complex scenarios with background clutter and distractors. In this paper, we present a simple yet effective Siamese tracker, in which shifted windows multi-head self-attention is used to learn the characteristics of a specific given target object for visual tracking. To validate the effectiveness of our proposed tracker, we use the Swin Transformer as the backbone network and introduce an auxiliary feature enhancement network. Extensive experimental results on two evaluation datasets demonstrate that the proposed tracker outperforms other baselines.

  • A CNN-Based Multi-Scale Pooling Strategy for Acoustic Scene Classification

    Rong HUANG  Yue XIE  

     
    LETTER-Speech and Hearing

      Publicized:
    2023/10/17
      Vol:
    E107-D No:1
      Page(s):
    153-156

    Acoustic scene classification (ASC) is a fundamental classification task in artificial intelligence. ASC systems commonly employ models based on convolutional neural networks (CNNs) that take log-Mel spectrograms as input for gathering acoustic features. In this paper, we design a CNN-based multi-scale pooling (MSP) strategy for ASC. The log-Mel spectrogram used as the CNN input is partitioned into four segments along the frequency axis, and four CNN channels receive inputs from these distinct frequency ranges. The high-level features extracted from the outputs in the various frequency bands are integrated through frequency pyramid average pooling layers at multiple levels. Subsequently, a softmax classifier is employed to classify different scenes. Tests on two acoustic datasets show that the designed model significantly improves performance.

  • Shared Latent Embedding Learning for Multi-View Subspace Clustering

    Zhaohu LIU  Peng SONG  Jinshuai MU  Wenming ZHENG  

     
    LETTER-Artificial Intelligence, Data Mining

      Publicized:
    2023/10/17
      Vol:
    E107-D No:1
      Page(s):
    148-152

    Most existing multi-view subspace clustering approaches only capture the inter-view similarities between different views and ignore the optimal local geometric structure of the original data. To this end, in this letter, we put forward a novel method named shared latent embedding learning for multi-view subspace clustering (SLE-MSC), which can efficiently capture a better latent space. To be specific, we introduce a pseudo-label constraint to capture the intra-view similarities within each view. Meanwhile, we utilize a novel optimal graph Laplacian to learn the consistent latent representation, in which the common manifold is considered as the optimal manifold to obtain a more reasonable local geometric structure. Comprehensive experimental results indicate the superiority and effectiveness of the proposed method.

  • Negative Learning to Prevent Undesirable Misclassification

    Kazuki EGASHIRA  Atsuyuki MIYAI  Qing YU  Go IRIE  Kiyoharu AIZAWA  

     
    LETTER-Artificial Intelligence, Data Mining

      Publicized:
    2023/10/05
      Vol:
    E107-D No:1
      Page(s):
    144-147

    We propose a novel classification problem setting where Undesirable Classes (UCs) are defined for each class. A UC is a class that an input should specifically not be misclassified as. To address this setting, we propose a framework that reduces the probabilities for UCs while increasing the probability for the correct class.
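
The idea of lowering the probabilities of undesirable classes while raising the probability of the correct class can be illustrated with a small loss function. This is a minimal sketch, not the authors' framework: the name `uc_aware_loss`, the `weight` parameter, and the specific penalty term `-log(1 - p_u)` are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def uc_aware_loss(logits, target, undesirable, weight=1.0, eps=1e-12):
    """Cross-entropy on the correct class plus a penalty on undesirable classes.

    Hypothetical formulation: the penalty -log(1 - p_u) grows as the
    probability p_u of an undesirable class u approaches 1.
    """
    probs = softmax(logits)
    ce = -math.log(max(probs[target], eps))
    penalty = -sum(math.log(max(1.0 - probs[u], eps)) for u in undesirable)
    return ce + weight * penalty
```

With an empty `undesirable` list the function reduces to ordinary cross-entropy, so the extra term only affects classes explicitly marked as undesirable for the given target.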

  • Inference Discrepancy Based Curriculum Learning for Neural Machine Translation

    Lei ZHOU  Ryohei SASANO  Koichi TAKEDA  

     
    PAPER-Natural Language Processing

      Publicized:
    2023/10/18
      Vol:
    E107-D No:1
      Page(s):
    135-143

    In practice, even a well-trained neural machine translation (NMT) model can still make biased inferences on the training set due to distribution shifts. In human learning, if we cannot reproduce something correctly after learning it multiple times, we consider it to be more difficult. Likewise, a training example causing a large discrepancy between inference and reference implies higher learning difficulty for the MT model. Therefore, we propose to adopt the inference discrepancy of each training example as the difficulty criterion and to rank training examples from easy to hard accordingly. In this way, a trained model can guide the curriculum learning process of an initial model identical to itself. We liken this training scheme to guiding the learning process of a curriculum NMT model by a pretrained vanilla model. In this paper, we assess the effectiveness of the proposed training scheme and examine the influence of translation direction, evaluation metrics and different curriculum schedules. Experimental results on translation benchmarks WMT14 English ⇒ German, WMT17 Chinese ⇒ English and Multitarget TED Talks Task (MTTT) English ⇔ German, English ⇔ Chinese, English ⇔ Russian demonstrate that our proposed method consistently improves the translation performance against the advanced Transformer baseline.
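
Ranking training examples by inference discrepancy can be sketched as follows. The abstract does not say how the discrepancy is measured, so this sketch assumes a simple stand-in (1 minus unigram F1 between the model output and the reference); `discrepancy`, `rank_easy_to_hard`, and the `translate` callable are hypothetical names, not the authors' code.

```python
from collections import Counter

def discrepancy(hypothesis, reference):
    """Stand-in difficulty score: 1 - unigram F1 between output and reference."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 1.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 1.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 1.0 - 2 * precision * recall / (precision + recall)

def rank_easy_to_hard(examples, translate):
    """Sort (source, reference) pairs by ascending discrepancy.

    `translate` is any callable mapping a source sentence to the trained
    model's output (assumed here; a real system would call the NMT model).
    """
    return sorted(examples, key=lambda ex: discrepancy(translate(ex[0]), ex[1]))
```

A curriculum schedule would then feed the sorted examples to the new model in order, starting with those the trained model already reproduces well.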

  • Multi-Task Learning of Japanese How-to Tip Machine Reading Comprehension by a Generative Model

    Xiaotian WANG  Tingxuan LI  Takuya TAMURA  Shunsuke NISHIDA  Takehito UTSURO  

     
    PAPER-Natural Language Processing

      Publicized:
    2023/10/23
      Vol:
    E107-D No:1
      Page(s):
    125-134

    In research on machine reading comprehension for Japanese how-to tip QA tasks, conventional extractive machine reading comprehension methods have difficulty dealing with cases in which the answer string spans multiple locations in the context. Fine-tuning the BERT model for machine reading comprehension tasks is not suitable for such cases. In this paper, we trained a generative machine reading comprehension model for Japanese how-to tips by constructing a generative dataset based on the website “wikihow” as a source of information. We then proposed two methods for multi-task learning to fine-tune the generative model. The first method is multi-task learning with a generative and extractive hybrid training dataset, where a single model is trained on both generative and extractive datasets simultaneously. The second method is multi-task learning with inter-sentence semantic similarity and answer generation, where, drawing upon the answer generation task, the model additionally learns the distance between the sentences of the question/context and the answer in the training examples. The evaluation results showed that both multi-task learning methods significantly outperformed the single-task learning method on generative question-and-answer examples. Of the two multi-task learning methods, the one with inter-sentence semantic similarity and answer generation performed best in the manual evaluation. The data and the code are available at https://github.com/EternalEdenn/multitask_ext-gen_sts-gen.

  • Improved Head and Data Augmentation to Reduce Artifacts at Grid Boundaries in Object Detection

    Shinji UCHINOURA  Takio KURITA  

     
    PAPER-Image Recognition, Computer Vision

      Publicized:
    2023/10/23
      Vol:
    E107-D No:1
      Page(s):
    115-124

    We investigated the influence of horizontal shifts of the input images on one-stage object detection methods. We found that the object detector's class scores drop when the target object center is at the grid boundary. Many approaches have focused on reducing the aliasing effect of down-sampling to achieve shift-invariance. However, down-sampling does not completely solve this problem at the grid boundary; it is necessary to suppress the dispersion of features in pixels close to the grid boundary into adjacent grid cells. Therefore, this paper proposes two approaches focused on the grid boundary to improve this weak point of current object detection methods. One is the Sub-Grid Feature Extraction Module, in which the sub-grid features are added to the input of the classification head. The other is Grid-Aware Data Augmentation, where augmented data are generated by the grid-level shifts and are used in training. The effectiveness of the proposed approaches is demonstrated using the COCO validation set after applying the proposed method to the FCOS architecture.

  • Efficient Action Spotting Using Saliency Feature Weighting

    Yuzhi SHI  Takayoshi YAMASHITA  Tsubasa HIRAKAWA  Hironobu FUJIYOSHI  Mitsuru NAKAZAWA  Yeongnam CHAE  Björn STENGER  

     
    PAPER-Image Processing and Video Processing

      Publicized:
    2023/10/17
      Vol:
    E107-D No:1
      Page(s):
    105-114

    Action spotting is a key component in high-level video understanding. The large number of similar frames poses a challenge for recognizing actions in videos. In this paper, we use frame saliency to represent the importance of frames for guiding the model to focus on keyframes. We propose the frame saliency weighting module to improve frame saliency and video representation at the same time. Our proposed model contains two encoders, for pre-action and post-action time windows, to encode video context. We validate our design choices and the generality of the proposed method in extensive experiments. On the public SoccerNet-v2 dataset, the method achieves an average mAP of 57.3%, improving over the state of the art. Using embedding features obtained from multiple feature extractors, the average mAP further increases to 75%. We show that reducing the model size by over 90% does not significantly impact performance. Additionally, we use ablation studies to demonstrate the effectiveness of the saliency weighting module. Further, we show that our frame saliency weighting strategy is applicable to existing methods on more general action datasets, such as SoccerNet-v1, ActivityNet v1.3, and UCF101.
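
Weighting frames by saliency so that keyframes dominate the video representation can be sketched with a softmax-weighted pooling step. This is an illustrative assumption rather than the paper's module: in the paper the saliency is produced by a learned module, whereas here the scores are simply passed in, and `saliency_weighted_pool` is a hypothetical name.

```python
import math

def saliency_weighted_pool(frame_features, saliency_scores):
    """Pool per-frame feature vectors using softmax-normalized saliency scores.

    frame_features: list of equal-length feature vectors, one per frame.
    saliency_scores: one raw score per frame (higher means more important).
    Returns the pooled vector and the normalized weights.
    """
    m = max(saliency_scores)
    exps = [math.exp(s - m) for s in saliency_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frame_features[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights
```

With equal scores this reduces to plain average pooling; as one frame's score grows, the pooled vector approaches that frame's features, which is how many near-duplicate frames can be prevented from diluting a keyframe.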

  • Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis

    Kenichi FUJITA  Atsushi ANDO  Yusuke IJIMA  

     
    PAPER-Speech and Hearing

      Publicized:
    2023/10/06
      Vol:
    E107-D No:1
      Page(s):
    93-104

    This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the essential factors among speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments, speaker embeddings generation, speech synthesis with generated embeddings, and embedding space analysis, to evaluate the performance. The proposed method demonstrated a moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech with speech rhythm closer to the target speaker than the conventional method. We also visualized the embeddings to evaluate the relationship between the distance of the embeddings and the perceptual similarity. The visualization of the embedding space and the relation analysis between the closeness indicated that the distribution of embeddings reflects the subjective and objective similarity.

  • Research on Lightweight Acoustic Scene Perception Method Based on Drunkard Methodology

    Wenkai LIU  Lin ZHANG  Menglong WU  Xichang CAI  Hongxia DONG  

     
    PAPER-Artificial Intelligence, Data Mining

      Publicized:
    2023/10/23
      Vol:
    E107-D No:1
      Page(s):
    83-92

    The goal of Acoustic Scene Classification (ASC) is to simulate human analysis of the surrounding environment and make accurate decisions promptly. Extracting useful information from audio signals in real-world scenarios is challenging and can lead to suboptimal performance in acoustic scene classification, especially in environments with relatively homogeneous backgrounds. To address this problem, we model the sobering-up process of “drunkards” in real-life and the guiding behavior of normal people, and construct a high-precision lightweight model implementation methodology called the “drunkard methodology”. The core idea includes three parts: (1) designing a special feature transformation module based on the different mechanisms of information perception between drunkards and ordinary people, to simulate the process of gradually sobering up and the changes in feature perception ability; (2) studying a lightweight “drunken” model that matches the normal model's perception processing process. The model uses a multi-scale class residual block structure and can obtain finer feature representations by fusing information extracted at different scales; (3) introducing a guiding and fusion module of the conventional model to the “drunken” model to speed up the sobering-up process and achieve iterative optimization and accuracy improvement. Evaluation results on the official dataset of DCASE2022 Task1 demonstrate that our baseline system achieves 40.4% accuracy and 2.284 loss under the condition of 442.67K parameters and 19.40M MAC (multiply-accumulate operations). After adopting the “drunkard” mechanism, the accuracy is improved to 45.2%, and the loss is reduced by 0.634 under the condition of 551.89K parameters and 23.6M MAC.

  • A Novel Double-Tail Generative Adversarial Network for Fast Photo Animation

    Gang LIU  Xin CHEN  Zhixiang GAO  

     
    PAPER-Artificial Intelligence, Data Mining

      Publicized:
    2023/09/28
      Vol:
    E107-D No:1
      Page(s):
    72-82

    Photo animation transforms photos of real-world scenes into anime style images, which is a challenging task in AIGC (AI Generated Content). Although previous methods have achieved promising results, they often introduce noticeable artifacts or distortions. In this paper, we propose a novel double-tail generative adversarial network (DTGAN) for fast photo animation. DTGAN is the third version of the AnimeGAN series and is therefore also called AnimeGANv3. The generator of DTGAN has two output tails, a support tail for outputting coarse-grained anime style images and a main tail for refining coarse-grained anime style images. In DTGAN, we propose a novel learnable normalization technique, termed linearly adaptive denormalization (LADE), to prevent artifacts in the generated images. In order to improve the visual quality of the generated anime style images, two novel loss functions suitable for photo animation are proposed: 1) the region smoothing loss function, which is used to weaken the texture details of the generated images to achieve anime effects with abstract details; 2) the fine-grained revision loss function, which is used to eliminate artifacts and noise in the generated anime style image while preserving clear edges. Furthermore, the generator of DTGAN is a lightweight generator framework with only 1.02 million parameters in the inference phase. The proposed DTGAN can be easily trained end-to-end with unpaired training data. Extensive experiments have been conducted to qualitatively and quantitatively demonstrate that our method can produce high-quality anime style images from real-world photos and perform better than the state-of-the-art models.

  • Node-to-Set Disjoint Paths Problem in Cross-Cubes

    Rikuya SASAKI  Hiroyuki ICHIDA  Htoo Htoo Sandi KYAW  Keiichi KANEKO  

     
    PAPER-Fundamentals of Information Systems

      Publicized:
    2023/10/06
      Vol:
    E107-D No:1
      Page(s):
    53-59

    The increasing demand for high-performance computing in recent years has led to active research on massively parallel systems. The interconnection network in a massively parallel system interconnects hundreds of thousands of processing elements so that they can process large tasks while communicating with one another. By regarding the processing elements as nodes and the links between processing elements as edges, we can discuss various problems of interconnection networks in the framework of graph theory. Many topologies have been proposed for interconnection networks of massively parallel systems. The hypercube is a very popular topology and it has many variants. The cross-cube is such a topology, which can be obtained by adding one extra edge to each node of the hypercube. The cross-cube reduces the diameter of the hypercube, and allows cycles of odd lengths. Therefore, we focus on the cross-cube and propose an algorithm that constructs disjoint paths from a node to a set of nodes. We give a proof of correctness of the algorithm. Also, we show that the time complexity and the maximum path length of the algorithm are O(n^3 log n) and 2n - 3, respectively. Moreover, we estimate that the average execution time of the algorithm is O(n^2) based on a computer experiment.

  • CQTXNet: A Modified Xception Network with Attention Modules for Cover Song Identification

    Jinsoo SEO  Junghyun KIM  Hyemi KIM  

     
    LETTER

      Publicized:
    2023/10/02
      Vol:
    E107-D No:1
      Page(s):
    49-52

    Song-level feature summarization is fundamental for the browsing, retrieval, and indexing of digital music archives. This study proposes a deep neural network model, CQTXNet, for extracting a song-level feature summary for cover song identification. CQTXNet incorporates depth-wise separable convolution, residual network connections, and attention models to extend previous approaches. An experimental evaluation of the proposed CQTXNet was performed on two publicly available cover song datasets by varying the number of network layers and the type of attention modules.

  • A Coded Aperture as a Key for Information Hiding Designed by Physics-in-the-Loop Optimization

    Tomoki MINAMATA  Hiroki HAMASAKI  Hiroshi KAWASAKI  Hajime NAGAHARA  Satoshi ONO  

     
    PAPER

      Publicized:
    2023/09/28
      Vol:
    E107-D No:1
      Page(s):
    29-38

    This paper proposes a novel application of coded apertures (CAs) for visual information hiding. CA is one of the representative computational photography techniques, in which a patterned mask is attached to a camera as an alternative to a conventional circular aperture. With image processing in the post-processing phase, various functions such as omnifocal image capturing and depth estimation can be performed. In general, a watermark embedded as high-frequency components is difficult to extract if captured outside the focal length, and defocus blur occurs. Installation of a CA into the camera is a simple solution to mitigate the difficulty, and several attempts have been made to find a better design for stable extraction. In contrast, our motivation is to design a specific CA as well as an information hiding scheme; the secret information can only be decoded if an image with hidden information is captured with the key aperture at a certain distance outside the focus range. The proposed technique designs the key aperture patterns and information hiding scheme through evolutionary multi-objective optimization so as to minimize the decryption error of a hidden image when using the key aperture while minimizing the accuracy when using other apertures. During the optimization process, solution candidates, i.e., key aperture patterns and information hiding schemes, are evaluated on actual devices to account for disturbances that cannot be considered in optical simulations. Experimental results have shown that decoding can be performed with the designed key aperture and similar ones, that decrypted image quality deteriorates as the similarity between the key and the aperture used for decryption decreases, and that the proposed information hiding technique works on actual devices.

  • CASEformer — A Transformer-Based Projection Photometric Compensation Network

    Yuqiang ZHANG  Huamin YANG  Cheng HAN  Chao ZHANG  Chaoran ZHU  

     
    PAPER

      Publicized:
    2023/09/29
      Vol:
    E107-D No:1
      Page(s):
    13-28

    In this paper, we present a novel photometric compensation network named CASEformer, which is built upon the Swin module. For the first time, we combine coordinate attention and channel attention mechanisms to extract rich features from input images. Employing a multi-level encoder-decoder architecture with skip connections, we establish multiscale interactions between projection surfaces and projection images, achieving precise inference and compensation. Furthermore, through an attention fusion module, which simultaneously leverages both coordinate and channel information, we enhance the global context of feature maps while preserving enhanced texture coordinate details. The experimental results demonstrate the superior compensation effectiveness of our approach compared to the current state-of-the-art methods. Additionally, we propose a method for multi-surface projection compensation, further enriching our contributions.

  • Frameworks for Privacy-Preserving Federated Learning

    Le Trieu PHONG  Tran Thi PHUONG  Lihua WANG  Seiichi OZAWA  

     
    INVITED PAPER

      Publicized:
    2023/09/25
      Vol:
    E107-D No:1
      Page(s):
    2-12

    In this paper, we explore privacy-preserving techniques in federated learning, including those that can be used with both neural networks and decision trees. We begin by identifying how information can be leaked in federated learning, after which we present methods to address this issue by introducing two privacy-preserving frameworks that encompass many existing privacy-preserving federated learning (PPFL) systems. Through experiments with publicly available financial, medical, and Internet of Things datasets, we demonstrate the effectiveness of privacy-preserving federated learning and its potential to develop highly accurate, secure, and privacy-preserving machine learning systems in real-world scenarios. The findings highlight the importance of considering privacy in the design and implementation of federated learning systems and suggest that privacy-preserving techniques are essential in enabling the development of effective and practical machine learning systems.

  • Quality and Transferred Data Based Video Bitrate Control Method for Web-Conferencing [Open Access]

    Masahiro YOKOTA  Kazuhisa YAMAGISHI  

     
    PAPER-Multimedia Systems for Communications

      Publicized:
    2023/10/13
      Vol:
    E107-B No:1
      Page(s):
    272-285

    In this paper, a quality and transferred data based video bitrate control method for web-conferencing services is proposed, aiming to reduce transferred data by suppressing excessive quality. In web-conferencing services, the video bitrate is generally controlled in accordance with the network conditions (e.g., jitter and packet loss rate) to improve the quality experienced by users. However, under such control, the bitrate becomes excessively high when the network conditions are sufficiently good (e.g., high throughput and low jitter), which increases the transferred data volume. The increased volume of transferred data leads to increased operational costs, such as network costs for service providers. To solve this problem, we developed a method to control the video bitrate of each user to achieve the required quality determined by the service provider. This method was implemented in an actual web-conferencing system and evaluated under various conditions. The results showed that the bitrate could be controlled in accordance with the required quality to reduce the transferred data volume.
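
Controlling the bitrate toward a required quality, rather than toward whatever the network allows, can be sketched as picking the lowest bitrate that still meets the target. Everything specific here is assumed for illustration: the quality model in `estimate_quality`, the bitrate ladder, and the function names are hypothetical and not taken from the paper.

```python
import math

def estimate_quality(bitrate_kbps):
    """Hypothetical monotone quality model on a 1-5 scale (not from the paper)."""
    return 1.0 + 4.0 * (1.0 - math.exp(-bitrate_kbps / 800.0))

def select_bitrate(ladder_kbps, required_quality, available_kbps):
    """Return the lowest bitrate meeting the required quality within the budget.

    If no bitrate within the available throughput reaches the target, fall
    back to the highest affordable one; if nothing is affordable, use the
    lowest rung of the ladder.
    """
    feasible = [b for b in sorted(ladder_kbps) if b <= available_kbps]
    if not feasible:
        return min(ladder_kbps)
    for b in feasible:
        if estimate_quality(b) >= required_quality:
            return b
    return feasible[-1]
```

On a good network, a purely network-driven controller would choose the top rung of the ladder, while this selection stops at the first rung that satisfies the provider's target, which is where the reduction in transferred data comes from.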

  • Optimal Design of Multiuser mmWave LOS MIMO Systems Using Hybrid Arrays of Subarrays

    Zhaohu PAN  Hang LI  Xiaojing HUANG  

     
    PAPER-Wireless Communication Technologies

      Publicized:
    2023/09/26
      Vol:
    E107-B No:1
      Page(s):
    262-271

    In this paper, we investigate the optimal design of millimeter-wave (mmWave) multiuser line-of-sight multiple-input-multiple-output (LOS MIMO) systems using hybrid arrays of subarrays based on a hybrid block diagonalization (BD) precoding and combining scheme. By introducing a general 3D geometric channel model, the optimal subarray separation products of the transmitter and receiver for maximizing the sum-rate are designed in terms of two regular configurations of adjacent subarrays and interleaved subarrays for different users, respectively. We analyze the sensitivity of the optimal design parameters on performance in terms of a deviation factor, and derive expressions for the eigenvalues of the multiuser equivalent LOS MIMO channel matrix, which are also valid for non-optimal design. Simulation results show that the interleaved subarrays can support longer distance communication than the adjacent subarrays given the appropriate fixed subarray deployment.

  • Performance of Collaborative MIMO Reception with User Grouping Schemes

    Eiku ANDO  Yukitoshi SANADA  

     
    PAPER-Wireless Communication Technologies

      Publicized:
    2023/10/23
      Vol:
    E107-B No:1
      Page(s):
    253-261

    This paper proposes user equipment (UE) grouping schemes and evaluates the performance of a scheduling scheme for each formed group in collaborative multiple-input multiple-output (MIMO) reception. In previous research, the criterion for UE grouping and the effects of group scheduling have never been presented. In the UE grouping scheme, two criteria, the base station (BS)-oriented one and the UE-oriented one, are presented. The BS-oriented full search scheme achieves ideal performance though it requires knowledge of the relative positions of all UEs. Therefore, the UE-oriented local search scheme is also proposed. As the scheduling scheme, proportional fairness scheduling is used in resource allocation for each formed group. When the number of total UEs increases, the difference in the number of UEs among groups enlarges. Numerical results obtained through computer simulation show that the throughput per user increases and the fairness among users decreases when the number of UEs in a cell increases in the proposed schemes compared to those of the conventional scheme.
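
Proportional fairness scheduling, mentioned above for resource allocation, is commonly implemented by serving in each slot the candidate with the largest ratio of instantaneous rate to average throughput. The sketch below shows that generic rule for individual users; applying it to formed UE groups as in the paper, the averaging `window`, and the name `pf_schedule` are assumptions for illustration.

```python
def pf_schedule(rates_per_slot, window=10.0):
    """Generic proportional fairness scheduler.

    rates_per_slot[t][u] is the instantaneous achievable rate of user u in
    slot t. Returns the index of the user served in each slot.
    """
    num_users = len(rates_per_slot[0])
    avg = [1e-6] * num_users  # small positive start avoids division by zero
    served = []
    for rates in rates_per_slot:
        # Serve the user whose current rate is largest relative to its average.
        chosen = max(range(num_users), key=lambda u: rates[u] / avg[u])
        served.append(chosen)
        # Exponentially weighted update of every user's average throughput.
        for u in range(num_users):
            got = rates[u] if u == chosen else 0.0
            avg[u] = (1.0 - 1.0 / window) * avg[u] + got / window
    return served
```

Unlike a max-rate scheduler, which would always pick the user with the best channel, this rule also serves users with weaker channels once their average throughput falls behind, trading some total throughput for fairness.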

  • Backhaul Prioritized Point-to-Multi-Point Wireless Transmission Using Orbital Angular Momentum Multiplexing

    Tomoya KAGEYAMA  Jun MASHINO  Doohwan LEE  

     
    PAPER-Wireless Communication Technologies

      Publicized:
    2023/09/21
      Vol:
    E107-B No:1
      Page(s):
    232-243

    Orbital angular momentum (OAM) multiplexing technology is being investigated for high-capacity point-to-point (PtP) wireless transmission toward beyond 5G systems. OAM multiplexing is a spatial multiplexing technique that utilizes the twisting of electromagnetic waves. Its advantage is that it reduces the computational complexity of the signal processing on spatial multiplexing. Meanwhile, point-to-multipoint (PtMP) wireless transmission, such as integrated access and backhaul (IAB), is expected to simultaneously accommodate a high-capacity prioritized backhaul-link and access-links. In this paper, we study the extension of OAM multiplexing transmission from PtP to PtMP to meet the above requirements. We propose a backhaul prioritized resource control algorithm that maximizes the received signal-to-interference and noise ratio (SINR) of the access-links while maintaining the backhaul-link. The proposed algorithm features adaptive mode selection that takes into account the difference in the received power of each OAM mode depending on the user equipment position and the guaranteed power allocation of the backhaul capacity. We then evaluate the performance of the proposed method through computer simulation. The results show that the throughput of the access-links improved compared with the conventional multi-beam multi-user multi-input multi-output (MIMO) techniques while maintaining the throughput of the backhaul-link above the required value with minimal feedback information.
