This paper proposes a voice conversion (VC) method based on a model that links linguistic and acoustic representations via latent phonological distinctive features. Our method, called speech chain VC, is inspired by the concept of the speech chain, in which speech communication consists of a chain of events linking the speaker's brain with the listener's brain. We assume that speaker identity information, which appears at the acoustic level, is embedded in two steps: where phonological information is encoded into articulatory movements (linguistic to physiological) and where articulatory movements generate sound waves (physiological to acoustic). Speech chain VC represents these event links by using an adaptive restricted Boltzmann machine (ARBM) that introduces phoneme labels and acoustic features as two classes of visible units, and latent phonological distinctive features associated with articulatory movements as hidden units. Subjective evaluation experiments showed that the intelligibility of the converted speech was significantly improved compared with the conventional ARBM-based method. The speaker-identity conversion quality of the proposed method was comparable to that of a Gaussian mixture model (GMM)-based method. Analyses of the hidden-layer representations of the speech chain VC model supported the hypothesis that some of the hidden units actually correspond to phonological distinctive features. The final part of this paper proposes approaches to achieve one-shot VC by using the speech chain VC model. Subjective evaluation experiments showed that when the target speaker is the same gender as the source speaker, the proposed methods can achieve one-shot VC based on a single utterance from each of the source and target speakers.
Ryosuke OZAKI Tomohiro KAGAWA Tsuneki YAMASAKI
In this paper, we analyze the pulse responses of a dispersion medium with periodically conducting strips by using a fast inversion Laplace transform (FILT) method combined with the point matching method (PMM) for both the TM and TE cases. Specifically, we investigate the influence of the width and number of the conducting strips on the pulse response and on the distribution of the electric field.
Sumika OMATA Motoi SHIRAI Takatoshi SUGIYAMA
Spectrum suppressed transmission, which increases the frequency utilization efficiency (defined as throughput/bandwidth) by suppressing the required bandwidth, has been proposed. It is one of the most effective schemes for addressing the exhaustion of frequency bandwidths. However, in spectrum suppressed transmission, the transmission quality potentially degrades because of the inter-symbol interference (ISI) that arises when the bandwidth is made narrower than the Nyquist bandwidth. In this paper, in order to mitigate this degradation of transmission quality, we propose spectrum suppressed transmission that applies both forward error correction (FEC) and linear equalization (LE). Moreover, we propose a new channel allocation scheme for spectrum suppressed transmission in multi-channel environments over a satellite transponder. Our computer simulation results show that the proposed schemes are more effective at increasing the system throughput than the scheme without spectrum suppression.
Dongliang CHEN Peng SONG Wenjing ZHANG Weijian ZHANG Bingui XU Xuan ZHOU
In this letter, we propose a novel robust transferable subspace learning (RTSL) method for cross-corpus facial expression recognition. On the one hand, we present a novel distance metric algorithm, which jointly considers local and global distance distribution measures, to reduce the cross-corpus mismatch. On the other hand, we design a label guidance strategy to improve the discriminative ability of the subspace. As a result, RTSL is much more robust to the cross-corpus recognition problem than traditional transfer learning methods. We conduct extensive experiments on several facial expression corpora to evaluate the recognition performance of RTSL. The results demonstrate the superiority of the proposed method over some state-of-the-art methods.
Toshihiro AKAGI Tetsuya ARAKI Shin-ichi NAKANO
The dispersion problem is a variant of the facility location problem. Given a set $P$ of $n$ points and an integer $k$, we intend to find a subset $S$ of $P$ with $|S| = k$ such that the cost $\min_{p \in S}\{\mathrm{cost}(p)\}$ is maximized, where $\mathrm{cost}(p)$ is the sum of the distances from $p$ to the nearest $c$ points in $S$. We call the problem the dispersion problem with partial $c$ sum cost, or the PcS-dispersion problem. In this paper we present two algorithms to solve the P2S-dispersion problem ($c = 2$) if all points of $P$ are on a line. The running times of the algorithms are $O(kn^2 \log n)$ and $O(n \log n)$, respectively. We also present an algorithm to solve the PcS-dispersion problem if all points of $P$ are on a line. The running time of the algorithm is $O(kn^{c+1})$.
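To make the objective concrete, the following is a minimal brute-force sketch of the PcS-dispersion objective for points on a line. It is not one of the algorithms from the abstract: it simply enumerates every size-k subset, so it is exponential and only useful for checking tiny instances. The function names are hypothetical, and the sketch assumes distinct points and k > c, so that every chosen point has at least c other chosen points.

```python
from itertools import combinations


def pcs_cost(i, S, c):
    """Sum of distances from S[i] to its c nearest other points in S."""
    dists = sorted(abs(S[i] - S[j]) for j in range(len(S)) if j != i)
    return sum(dists[:c])


def brute_force_pcs_dispersion(points, k, c=2):
    """Return (best objective value, best subset) by trying every k-subset.

    Assumes distinct 1-D points and k > c. Exponential in general;
    intended only as a reference for very small inputs.
    """
    best_value, best_subset = float("-inf"), None
    for S in combinations(sorted(points), k):
        # Objective: the smallest cost among the chosen points.
        value = min(pcs_cost(i, S, c) for i in range(k))
        if value > best_value:
            best_value, best_subset = value, S
    return best_value, best_subset


if __name__ == "__main__":
    print(brute_force_pcs_dispersion([0, 1, 9, 10], k=4, c=2))
```

With all four points of `[0, 1, 9, 10]` selected and c = 2, the two inner points each have cost 1 + 8 = 9 and the two outer points each have cost 1 + 9 = 10, so the objective value is 9.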
In this paper, we propose a notion for high-dimensional generalizations of mutually orthogonal Latin squares (MOLS) and mutually orthogonal diagonal Latin squares (MODLS), called mutually dimensionally orthogonal d-cubes (MOC) and mutually dimensionally orthogonal diagonal d-cubes (MODC). Systematic constructions for MOC and MODC by using polynomials over finite fields are investigated. In particular, for 3-dimensional cubes, the results for the maximum possible number of MODC are improved by adopting the proposed construction.
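As background for the polynomial constructions mentioned above, the sketch below shows the classical two-dimensional case only: for a prime p, the linear polynomials a*x + y over the integers modulo p (with a = 1, ..., p-1) yield p-1 mutually orthogonal Latin squares. This is not the d-cube construction proposed in the paper, and all function names are hypothetical.

```python
from itertools import combinations, product


def latin_square(a, p):
    """Square with entry (a*x + y) mod p at row x, column y (p prime, 1 <= a < p)."""
    return [[(a * x + y) % p for y in range(p)] for x in range(p)]


def is_latin(square):
    """Every row and every column contains each symbol exactly once."""
    n = len(square)
    symbols = set(range(n))
    rows_ok = all(set(row) == symbols for row in square)
    cols_ok = all({square[x][y] for x in range(n)} == symbols for y in range(n))
    return rows_ok and cols_ok


def are_orthogonal(A, B):
    """Superimposing A and B yields every ordered pair of symbols exactly once."""
    n = len(A)
    pairs = {(A[x][y], B[x][y]) for x, y in product(range(n), repeat=2)}
    return len(pairs) == n * n


def mols(p):
    """The p-1 squares from the multipliers a = 1, ..., p-1 (p assumed prime)."""
    return [latin_square(a, p) for a in range(1, p)]


if __name__ == "__main__":
    squares = mols(5)
    print(all(is_latin(s) for s in squares))
    print(all(are_orthogonal(A, B) for A, B in combinations(squares, 2)))
```

Orthogonality holds because, for distinct multipliers a and b, the pair (a*x + y, b*x + y) determines (a - b)*x and therefore x, and then y.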
Yasuhiro MOCHIDA Takayuki NAKACHI Takahiro YAMAGUCHI
High frame rate (HFR) video is attracting strong interest since it is considered as a next step toward providing Ultra-High Definition video service. For instance, the Association of Radio Industries and Businesses (ARIB) standard, the latest broadcasting standard in Japan, defines a 120 fps broadcasting format. The standard stipulates temporally scalable coding and hierarchical transmission by MPEG Media Transport (MMT), in which the base layer and the enhancement layer are transmitted over different paths for flexible distribution. We have developed the first ever MMT transmitter/receiver module for 4K/120fps temporally scalable video. The module is equipped with a newly proposed encapsulation method of temporally scalable bitstreams with correct boundaries. It is also designed to be tolerant to severe network constraints, including packet loss, arrival timing offset, and delay jitter. We conducted a hierarchical transmission experiment for 4K/120fps temporally scalable video. The experiment demonstrated that the MMT module was successfully fabricated and capable of dealing with severe network constraints. Consequently, the module has excellent potential as a means to support HFR video distribution in various network situations.
Ryuta SHINGAI Yuria HIRAGA Hisakazu FUKUOKA Takamasa MITANI Takashi NAKADA Yasuhiko NAKASHIMA
Modern deep learning has significantly improved performance and has been used in a wide variety of applications. Since the amount of computation required for neural network inference is large, inference is typically performed not at the data acquisition location, such as a surveillance camera, but on a server with abundant computing power installed in a data center. Edge computing is attracting considerable attention as a way to solve this problem. However, edge computing can provide only limited computation resources. Therefore, we assumed a divided/distributed neural network model that uses both the edge device and the server. By processing part of the convolution layers on the edge, the amount of communication becomes smaller than that of the sensor data. In this paper, we evaluated AlexNet and eight other models in the distributed environment and estimated FPS values with Wi-Fi, 3G, and 5G communication. To reduce communication costs, we also introduced a compression process before communication. This compression may degrade the object recognition accuracy. As necessary conditions, we set the FPS to 30 or faster and the object recognition accuracy to 69.7% or higher. This accuracy threshold is based on that of an approximation model that binarizes the activations of the neural network. We constructed performance and energy models to find the optimal configuration that consumes minimum energy while satisfying the necessary conditions. Through a comprehensive evaluation, we found the optimal configurations of all nine models. For small models, such as AlexNet, processing the entire model on the edge was the best. On the other hand, for huge models, such as VGG16, processing the entire model on the server was the best. For medium-size models, the distributed models were good candidates. We confirmed that our model found the most energy-efficient configuration while satisfying the FPS and accuracy requirements, and that the distributed models reduced the energy consumption by up to 48.6%, and by 6.6% on average. We also found that HEVC compression is important before transferring the input data or the feature data between the distributed inference processes.
Fumiya ISHIKAWA Keiki SHIMADA Yoshihisa KISHIYAMA Kenichi HIGUCHI
In this paper, we propose a decentralized probabilistic frequency-block activation control method for the cellular downlink. The aim of the proposed method is to increase the downlink system throughput within the system coverage by adaptively controlling the individual activation of each frequency block at all base stations (BSs) to achieve inter-cell interference coordination (ICIC) and traffic load balancing. The proposed method does not rely on complicated inter-BS cooperation. It uses only the inter-BS information exchange regarding the observed system throughput levels with the neighboring BSs. Based on the shared temporal system throughput information, each BS independently controls online the activation of their respective frequency blocks in a probabilistic manner, which autonomously achieves ICIC and load balancing among BSs. Simulation results show that the proposed method achieves greater system throughput and a faster convergence rate than the conventional online probabilistic activation/deactivation control method. We also show that the proposed method successfully tracks dynamic changes in the user distribution generated due to mobility.
Junjun ZHENG Hiroyuki OKAMURA Tadashi DOHI
In this paper, we present non-Markovian availability models for capturing the dynamics of an operational software system that undergoes aperiodic time-based software rejuvenation and checkpointing. Two availability models with rejuvenation are considered, taking account of the procedure after the completion of the rollback recovery operation. We then investigate whether there exists an optimal rejuvenation schedule that maximizes the steady-state system availability. The availability is derived by means of the phase expansion technique, because the resulting models are not trivial stochastic models such as semi-Markov processes or Markov regenerative processes, and are therefore hard to solve with common approaches such as the Laplace-Stieltjes transform and embedded Markov chain techniques. Numerical experiments are conducted to determine the optimal rejuvenation trigger timing that maximizes the steady-state system availability for each availability model, and to compare the two models.
Khilda AFIFAH Nicodimus RETDIAN
Hum noise such as power line interference is one of the critical problems in biomedical signal acquisition. Various techniques have been proposed to suppress power line interference. However, some of these techniques require more components and higher power consumption. In conventional N-path notch filter circuits, achieving sufficient notch depth requires a larger number of paths and a higher switch off-resistance, which makes the conventional N-path notch filter less efficient at suppressing hum noise. This work proposes a new N-path notch filter for hum noise suppression in biomedical signal acquisition. The new N-path notch filter achieves a notch depth above 40 dB at sampling frequencies of 50 Hz and 60 Hz, even though the proposed circuit uses fewer paths and a lower switch off-resistance. The proposed circuit has been verified using an artificial ECG signal contaminated by hum noise at 50 Hz and 60 Hz. The output of the N-path notch filter is a noise-free signal even if the sampling frequency changes.
Junxing ZHANG Shuo YANG Chunjuan BO Huimin LU
Vehicle logo detection is one of the research directions in the application of intelligent transportation systems. It is an important extension of detection technology based on license plates and motorcycle types. A vehicle logo is characterized by uniqueness, conspicuousness, and diversity. Therefore, thorough research is important in both theory and application. Although there are some related works on object detection, most of them cannot achieve real-time detection across different scenes. Meanwhile, some single-stage real-time detection methods perform poorly when detecting small objects. To address the scarcity of training samples, our work constructs a vehicle logo dataset (VLD-45-S) and improves detection through multi-stage pre-training, multi-scale prediction, feature fusion between deeper and shallower layers, dimension clustering of the bounding boxes, and multi-scale detection training. While maintaining speed, this article improves the detection precision for vehicle logos. The generalization of the detection model and its anti-interference capability in real scenes are improved by data enrichment. Experimental results show that the accuracy and speed of the detection algorithm are improved for small objects.
In this paper, we propose a secure computation of sparse coding and its application to Encryption-then-Compression (EtC) systems. The proposed scheme introduces secure sparse coding that allows computation of an Orthogonal Matching Pursuit (OMP) algorithm in an encrypted domain. We prove theoretically that the proposed method estimates exactly the same sparse representations that the OMP algorithm for non-encrypted computation does. This means that there is no degradation of the sparse representation performance. Furthermore, the proposed method can control the sparsity without decoding the encrypted signals. Next, we propose an EtC system based on the secure sparse coding. The proposed secure EtC system can protect the private information of the original image contents while performing image compression. It provides the same rate-distortion performance as that of sparse coding without encryption, as demonstrated on both synthetic data and natural images.
In this paper, we propose a deep model for visual recognition based on a hybrid KPCA Network (H-KPCANet), which combines a one-stage KPCANet and a two-stage KPCANet. The proposed model consists of four types of basic components: the input layer, the one-stage KPCANet, the two-stage KPCANet, and the fusion layer. The role of the one-stage KPCANet is to calculate the KPCA filters for the convolution layer, while the two-stage KPCANet learns PCA filters in the first stage and KPCA filters in the second stage. After binary quantization mapping and block-wise histogram computation, the features from the two types of KPCANets are fused in the fusion layer. The final feature of the input image is obtained by a weighted serial combination of the two types of features. The proposed algorithm is tested on digit recognition and object classification, and the experimental results on the MNIST and CIFAR-10 visual recognition benchmarks validate the performance of the proposed H-KPCANet.
Yuki SAITO Kei AKUZAWA Kentaro TACHIBANA
This paper presents a method for many-to-one voice conversion using phonetic posteriorgrams (PPGs) based on an adversarial training of deep neural networks (DNNs). A conventional method for many-to-one VC can learn a mapping function from input acoustic features to target acoustic features through separately trained DNN-based speech recognition and synthesis models. However, 1) the differences among speakers observed in PPGs and 2) an over-smoothing effect of generated acoustic features degrade the converted speech quality. Our method performs a domain-adversarial training of the recognition model for reducing the PPG differences. In addition, it incorporates a generative adversarial network into the training of the synthesis model for alleviating the over-smoothing effect. Unlike the conventional method, ours jointly trains the recognition and synthesis models so that they are optimized for many-to-one VC. Experimental evaluation demonstrates that the proposed method significantly improves the converted speech quality compared with conventional VC methods.
Yao GE Rui CHEN Ying TONG Xuehong CAO Ruiyu LIANG
We combine the siamese network and the recurrent regression network to propose a two-stage tracking framework termed SiamReg. Our method solves the problem that the classic siamese network cannot judge the target size precisely, and it simplifies the regression procedures in the training and testing processes. We perform experiments on three challenging tracking datasets: VOT2016, OTB100, and VOT2018. The results indicate that, after offline training, SiamReg can obtain a higher expected average overlap measure.
Van Giang TRINH Kunihiko HIRAISHI
Boolean networks (BNs) are popular formal models for the dynamics of gene regulatory networks. There are many different types of BNs, depending on their updating scheme (synchronous, asynchronous, deterministic, or non-deterministic), such as Classical Random Boolean Networks (CRBNs), Asynchronous Random Boolean Networks (ARBNs), Generalized Asynchronous Random Boolean Networks (GARBNs), Deterministic Asynchronous Random Boolean Networks (DARBNs), and Deterministic Generalized Asynchronous Random Boolean Networks (DGARBNs). An important long-term behavior of BNs, the so-called attractor, can provide valuable insights into systems biology (e.g., the origins of cancer). In the previous paper [1], we studied properties of attractors of GARBNs and their relations with attractors of CRBNs, and also proposed different algorithms for attractor detection. In this paper, we propose a new algorithm based on SAT-based bounded model checking to overcome inherent problems in these algorithms. Experimental results demonstrate the effectiveness of the new algorithm. We also show that studying attractors of GARBNs can pave potential ways to study attractors of ARBNs.
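To illustrate what an attractor is, here is a minimal sketch that finds all attractors of a small Boolean network under the synchronous, deterministic updating scheme by exhaustively simulating every initial state. It is not the SAT-based bounded model checking algorithm of the abstract and does not cover GARBN semantics. The function names and the three-node example network are hypothetical.

```python
from itertools import product


def synchronous_step(state, update_functions):
    """Apply every node's update function to the current state at once."""
    return tuple(f(state) for f in update_functions)


def find_attractors(update_functions):
    """Enumerate all 2^n states and collect the cycles they fall into.

    Each attractor is returned as a tuple of states, rotated so that its
    smallest state comes first. Exponential in the number of nodes.
    """
    n = len(update_functions)
    attractors = set()
    for start in product((0, 1), repeat=n):
        seen = {}  # state -> position along the trajectory
        state = start
        while state not in seen:
            seen[state] = len(seen)
            state = synchronous_step(state, update_functions)
        # 'state' is the first revisited state, so the cycle starts there.
        trajectory = list(seen)  # dicts preserve insertion order
        cycle = trajectory[seen[state]:]
        i = cycle.index(min(cycle))
        attractors.add(tuple(cycle[i:] + cycle[:i]))
    return attractors


# Hypothetical network: x0' = x1, x1' = x0, x2' = x0 AND x1.
example_network = [
    lambda s: s[1],
    lambda s: s[0],
    lambda s: s[0] & s[1],
]

if __name__ == "__main__":
    for attractor in sorted(find_attractors(example_network)):
        print(attractor)
```

For this example network the search finds two fixed points, (0, 0, 0) and (1, 1, 1), and one cycle of length two that alternates between (0, 1, 0) and (1, 0, 0).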
Hatoon S. ALSAGRI Mourad YKHLEF
Social media channels, such as Facebook, Twitter, and Instagram, have altered our world forever. People are now more connected than ever and reveal a sort of digital persona. Although social media certainly has several remarkable features, its demerits are undeniable as well. Recent studies have indicated a correlation between high usage of social media sites and increased depression. The present study aims to exploit machine learning techniques for detecting a probably depressed Twitter user based on both his/her network behavior and tweets. For this purpose, we trained and tested classifiers to distinguish whether a user is depressed or not using features extracted from his/her activities in the network and tweets. The results showed that the more features are used, the higher the accuracy and F-measure scores in detecting depressed users. This method is a data-driven, predictive approach for early detection of depression or other mental illnesses. This study's main contribution is the exploration of the features and their impact on detecting the depression level.
Jianmei ZHANG Pengyu WANG Feiyang GONG Hongqing ZHU Ning CHEN
Finding the correspondence between two images of the same object or scene is an active research field in computer vision. This paper develops a rapid and effective Content-based Superpixel Image matching and Stitching (CSIS) scheme, which utilizes the content of superpixels through a multi-feature fusion technique. Unlike popular keypoint-based matching methods, our approach uses a superpixel internal feature-based scheme to implement image matching. First, we make use of a novel superpixel generation algorithm based on content-based feature representation, named the Content-based Superpixel Segmentation (CSS) algorithm. Superpixels are generated in terms of a new distance metric using color, spatial, and gradient feature information. The metric is designed to balance the compactness and the boundary adherence of the resulting superpixels. Then, we calculate the entropy of each superpixel to separate out superpixels with significant characteristics. Next, for each selected superpixel, its multi-feature descriptor is generated by extracting and fusing local features of the selected superpixel itself. Finally, we compare the matching features of candidate superpixels and their own neighborhoods to estimate the correspondence between two images. We evaluated superpixel matching and image stitching on complex and deformable surfaces using our superpixel region descriptors, and the results show that the new method is effective in terms of matching accuracy and execution speed.
Kota KUDO Yuichi TAKANO Ryo NOMURA
This paper addresses the problem of selecting a significant subset of candidate features to use for multiple linear regression. Bertsimas et al. [5] recently proposed the discrete first-order (DFO) algorithm to efficiently find near-optimal solutions to this problem. However, this algorithm is unable to escape from locally optimal solutions. To resolve this, we propose a stochastic discrete first-order (SDFO) algorithm for feature subset selection. In this algorithm, random perturbations are added to a sequence of candidate solutions as a means to escape from locally optimal solutions, which broadens the range of discoverable solutions. Moreover, we derive the optimal step size in the gradient-descent direction to accelerate convergence of the algorithm. We also make effective use of the L2-regularization term to improve the predictive performance of a resultant subset regression model. The simulation results demonstrate that our algorithm substantially outperforms the original DFO algorithm. Our algorithm was superior in predictive performance to lasso and forward stepwise selection as well.
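For orientation, the sketch below illustrates the baseline discrete first-order idea as it is commonly described: take a gradient step on the least-squares loss, then keep only the k largest-magnitude coefficients. It is written under stated assumptions and is not the authors' SDFO implementation: it omits the random perturbations, the optimal step size, and the L2-regularization term. The function names, the fixed step 1/L with L taken as the squared Frobenius norm of X (an upper bound on the largest eigenvalue of X^T X), and the toy data are all assumptions made for illustration.

```python
def gradient(X, y, beta):
    """Gradient of 0.5 * ||y - X beta||^2, which equals -X^T (y - X beta)."""
    residual = [
        yi - sum(xij * bj for xij, bj in zip(row, beta))
        for row, yi in zip(X, y)
    ]
    return [
        -sum(row[j] * r for row, r in zip(X, residual))
        for j in range(len(beta))
    ]


def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and set the rest to zero."""
    order = sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)
    keep = set(order[:k])
    return [v[j] if j in keep else 0.0 for j in range(len(v))]


def dfo(X, y, k, iterations=500):
    """Projected-gradient iteration with at most k nonzero coefficients."""
    p = len(X[0])
    # Assumed step size: squared Frobenius norm bounds lambda_max(X^T X).
    L = sum(x * x for row in X for x in row)
    beta = [0.0] * p
    for _ in range(iterations):
        g = gradient(X, y, beta)
        beta = hard_threshold([b - gj / L for b, gj in zip(beta, g)], k)
    return beta


if __name__ == "__main__":
    # Hypothetical toy data with orthonormal columns.
    X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
    y = [3.0, 0.1, -2.0, 0.0]
    print(dfo(X, y, k=2))
```

On this toy data the iteration settles on the first and third features, whose coefficients approach 3 and -2, while the second coefficient is always thresholded to zero. A deterministic iteration of this kind can stall at a locally optimal support, which is the limitation the abstract's random perturbations are meant to address.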