Kensuke SUMOTO Kenta KANAKOGI Hironori WASHIZAKI Naohiko TSUDA Nobukazu YOSHIOKA Yoshiaki FUKAZAWA Hideyuki KANUKA
Security-related issues have become more significant due to the proliferation of IT. Collating security-related information in a database improves security. For example, Common Vulnerabilities and Exposures (CVE) is a security knowledge repository containing descriptions of vulnerabilities about software or source code. Although the descriptions include various entities, there is not a uniform entity structure, making security analysis difficult using individual entities. Developing a consistent entity structure will enhance the security field. Herein we propose a method to automatically label select entities from CVE descriptions by applying the Named Entity Recognition (NER) technique. We manually labeled 3287 CVE descriptions and conducted experiments using a machine learning model called BERT to compare the proposed method to labeling with regular expressions. Machine learning using the proposed method significantly improves the labeling accuracy. It has an f1 score of about 0.93, precision of about 0.91, and recall of about 0.95, demonstrating that our method has potential to automatically label select entities from CVE descriptions.
Li HE Xiaowu ZHANG Jianyong DUAN Hao WANG Xin LI Liang ZHAO
Chinese spelling correction (CSC) models detect and correct a text typo based on the misspelled character and its context. Recently, Bert-based models have dominated the research of Chinese spelling correction. However, these methods only focus on the semantic information of the text during the pretraining stage, neglecting the learning of correcting spelling errors. Moreover, when multiple incorrect characters are in the text, the context introduces noisy information, making it difficult for the model to accurately detect the positions of the incorrect characters, leading to false corrections. To address these limitations, we apply the multimodal pre-trained language model ChineseBert to the task of spelling correction. We propose a self-distillation learning-based pretraining strategy, where a confusion set is used to construct text containing erroneous characters, allowing the model to jointly learns how to understand language and correct spelling errors. Additionally, we introduce a single-channel masking mechanism to mitigate the noise caused by the incorrect characters. This mechanism masks the semantic encoding channel while preserving the phonetic and glyph encoding channels, reducing the noise introduced by incorrect characters during the prediction process. Finally, experiments are conducted on widely used benchmarks. Our model achieves superior performance against state-of-the-art methods by a remarkable gain.
Yasuhiko IKEMATSU Tsunekazu SAITO
Multivariate public key cryptosystems (MPKC) are constructed based on the problem of solving multivariate quadratic equations (MQ problem). Among various multivariate schemes, UOV is an important signature scheme since it is underlying some signature schemes such as MAYO, QR-UOV, and Rainbow which was a finalist of NIST PQC standardization project. To analyze the security of a multivariate scheme, it is necessary to analyze the first fall degree or solving degree for the system of polynomial equations used in specific attacks. It is known that the first fall degree or solving degree often relates to the Hilbert series of the ideal generated by the system. In this paper, we study the Hilbert series of the UOV scheme, and more specifically, we study the Hilbert series of ideals generated by quadratic polynomials used in the central map of UOV. In particular, we derive a prediction formula of the Hilbert series by using some experimental results. Moreover, we apply it to the analysis of the reconciliation attack for MAYO.
Hiroyoshi NAGAO Koshiro TAMURA Marie KATSURAI
Danmaku commenting has become popular for co-viewing on video-sharing platforms, such as Nicovideo. However, many irrelevant comments usually contaminate the quality of the information provided by videos. Such an information pollutant problem can be solved by a comment classifier trained with an abstention option, which detects comments whose video categories are unclear. To improve the performance of this classification task, this paper presents Nicovideo-specific language representations. Specifically, we used sentences from Nicopedia, a Japanese online encyclopedia of entities that possibly appear in Nicovideo contents, to pre-train a bidirectional encoder representations from Transformers (BERT) model. The resulting model named Nicopedia BERT is then fine-tuned such that it could determine whether a given comment falls into any of predefined categories. The experiments conducted on Nicovideo comment data demonstrated the effectiveness of Nicopedia BERT compared with existing BERT models pre-trained using Wikipedia or tweets. We also evaluated the performance of each model in an additional sentiment classification task, and the obtained results implied the applicability of Nicopedia BERT as a feature extractor of other social media text.
Binggang ZHUO Masaki MURATA Qing MA
Paragraph segmentation is a text segmentation task. Iikura et al. achieved excellent results on paragraph segmentation by introducing focal loss to Bidirectional Encoder Representations from Transformers. In this study, we investigated paragraph segmentation on Daily News and Novel datasets. Based on the approach proposed by Iikura et al., we used auxiliary loss to train the model to improve paragraph segmentation performance. Consequently, the average F1-score obtained by the approach of Iikura et al. was 0.6704 on the Daily News dataset, whereas that of our approach was 0.6801. Our approach thus improved the performance by approximately 1%. The performance improvement was also confirmed on the Novel dataset. Furthermore, the results of two-tailed paired t-tests indicated that there was a statistical significance between the performance of the two approaches.
Natsuki UENO Shoichi KOYAMA Hiroshi SARUWATARI
We propose a useful formulation for ill-posed inverse problems in Hilbert spaces with nonlinear clipping effects. Ill-posed inverse problems are often formulated as optimization problems, and nonlinear clipping effects may cause nonconvexity or nondifferentiability of the objective functions in the case of commonly used regularized least squares. To overcome these difficulties, we present a tractable formulation in which the objective function is convex and differentiable with respect to optimization variables, on the basis of the Bregman divergence associated with the primitive function of the clipping function. By using this formulation in combination with the representer theorem, we need only to deal with a finite-dimensional, convex, and differentiable optimization problem, which can be solved by well-established algorithms. We also show two practical examples of inverse problems where our theory can be applied, estimation of band-limited signals and time-harmonic acoustic fields, and evaluate the validity of our theory by numerical simulations.
Chansu HAN Jumpei SHIMAMURA Takeshi TAKAHASHI Daisuke INOUE Jun'ichi TAKEUCHI Koji NAKAO
With the rapid evolution and increase of cyberthreats in recent years, it is necessary to detect and understand it promptly and precisely to reduce the impact of cyberthreats. A darknet, which is an unused IP address space, has a high signal-to-noise ratio, so it is easier to understand the global tendency of malicious traffic in cyberspace than other observation networks. In this paper, we aim to capture global cyberthreats in real time. Since multiple hosts infected with similar malware tend to perform similar behavior, we propose a system that estimates a degree of synchronizations from the patterns of packet transmission time among the source hosts observed in unit time of the darknet and detects anomalies in real time. In our evaluation, we perform our proof-of-concept implementation of the proposed engine to demonstrate its feasibility and effectiveness, and we detect cyberthreats with an accuracy of 97.14%. This work is the first practical trial that detects cyberthreats from in-the-wild darknet traffic regardless of new types and variants in real time, and it quantitatively evaluates the result.
A b-symbol read channel is a channel model in which b consecutive symbols are read at once. As special cases, it includes a symbol-pair read channel (b=2) and an ordinary channel (b=1). The sphere packing bound, the Gilbert-Varshamov (G-V) bound, and the asymptotic G-V bound for symbol-pair read channels are known for b=1 and 2. In this paper, we derive these three bounds for b-symbol read channels with b≥1. From analysis of the proposed G-V bound, it is confirmed that the achievable rate is higher for b-symbol read channels compared with those for ordinary channels based on the Hamming metric. Furthermore, it is shown that the optimal value of b that maximizes the asymptotic G-V bound is finitely determined depending on the fractional minimum distance.
Nguyen VIET HA Kazumi KUMAZOE Masato TSURU
The Transmission Control Protocol (TCP) with Network Coding (TCP/NC) was proposed to introduce packet loss recovery ability at the sink without TCP retransmission, which is realized by proactively sending redundant combination packets encoded at the source. Although TCP/NC is expected to mitigate the goodput degradation of TCP over lossy networks, the original TCP/NC does not work well in burst loss and time-varying channels. No apparent scheme was provided to decide and change the network coding-related parameters (NC parameters) to suit the diverse and changeable loss conditions. In this paper, a solution to support TCP/NC in adapting to mentioned conditions is proposed, called TCP/NC with Loss Rate and Loss Burstiness Estimation (TCP/NCwLRLBE). Both the packet loss rate and burstiness are estimated by observing transmitted packets to adapt to burst loss channels. Appropriate NC parameters are calculated from the estimated probability of successful recoverable transmission based on a mathematical model of packet losses. Moreover, a new mechanism for coding window handling is developed to update NC parameters in the coding system promptly. The proposed scheme is implemented and validated in Network Simulator 3 with two different types of burst loss model. The results suggest the potential of TCP/NCwLRLBE to mitigate the TCP goodput degradation in both the random loss and burst loss channels with the time-varying conditions.
The solution of the standard 2-norm-based multiple kernel regression problem and the theoretical limit of the considered model space are discussed in this paper. We prove that 1) The solution of the 2-norm-based multiple kernel regressor constructed by a given training data set does not generally attain the theoretical limit of the considered model space in terms of the generalization errors, even if the training data set is noise-free, 2) The solution of the 2-norm-based multiple kernel regressor is identical to the solution of the single kernel regressor under a noise free setting, in which the adopted single kernel is the sum of the same kernels used in the multiple kernel regressor; and it is also true for a noisy setting with the 2-norm-based regularizer. The first result motivates us to develop a novel framework for the multiple kernel regression problems which yields a better solution close to the theoretical limit, and the second result implies that it is enough to use the single kernel regressors with the sum of given multiple kernels instead of the multiple kernel regressors as long as the 2-norm based criterion is used.
Wa SI Xun PAN Harutoshi OGAI Katsumi HIRAI
In lighting control systems, accurate data of artificial light (lighting coefficients) are essential for the illumination control accuracy and energy saving efficiency. This research proposes a novel Lambertian-Radial Basis Function Neural Network (L-RBFNN) to realize modeling of both lighting coefficients and the illumination environment for an office. By adding a Lambertian neuron to represent the rough theoretical illuminance distribution of the lamp and modifying RBF neurons to regulate the distribution shape, L-RBFNN successfully solves the instability problem of conventional RBFNN and achieves higher modeling accuracy. Simulations of both single-light modeling and multiple-light modeling are made and compared with other methods such as Lambertian function, cubic spline interpolation and conventional RBFNN. The results prove that: 1) L-RBFNN is a successful modeling method for artificial light with imperceptible modeling error; 2) Compared with other existing methods, L-RBFNN can provide better performance with lower modeling error; 3) The number of training sensors can be reduced to be the same with the number of lamps, thus making the modeling method easier to apply in real-world lighting systems.
This paper proposes a new class of Hilbert pairs of almost symmetric orthogonal wavelet bases. For two wavelet bases to form a Hilbert pair, the corresponding scaling lowpass filters are required to satisfy the half-sample delay condition. In this paper, we design simultaneously two scaling lowpass filters with the arbitrarily specified flat group delay responses at ω=0, which satisfy the half-sample delay condition. In addition to specifying the number of vanishing moments, we apply the Remez exchange algorithm to minimize the difference of frequency responses between two scaling lowpass filters, in order to improve the analyticity of complex wavelets. The equiripple behavior of the error function can be obtained through a few iterations. Therefore, the resulting complex wavelets are orthogonal and almost symmetric, and have the improved analyticity. Finally, some examples are presented to demonstrate the effectiveness of the proposed design method.
Akira TANAKA Hirofumi TAKEBAYASHI Ichigaku TAKIGAWA Hideyuki IMAI Mineichi KUDO
For the last few decades, learning with multiple kernels, represented by the ensemble kernel regressor and the multiple kernel regressor, has attracted much attention in the field of kernel-based machine learning. Although their efficacy was investigated numerically in many works, their theoretical ground is not investigated sufficiently, since we do not have a theoretical framework to evaluate them. In this paper, we introduce a unified framework for evaluating kernel regressors with multiple kernels. On the basis of the framework, we analyze the generalization errors of the ensemble kernel regressor and the multiple kernel regressor, and give a sufficient condition for the ensemble kernel regressor to outperform the multiple kernel regressor in terms of the generalization error in noise-free case. We also show that each kernel regressor can be better than the other without the sufficient condition by giving examples, which supports the importance of the sufficient condition.
Fan FAN Tapan K. SARKAR Changwoo PARK Jinhwan KOH
A new approach to reconstructing antenna far-field patterns from the missing part of the pattern is presented in this paper. The antenna far-field pattern can be reconstructed by utilizing the iterative Hilbert transform, which is based on the relationship between the real and imaginary part of the Hilbert transform. A moving average filter is used to reduce the errors in the restored signal as well as the computation load. Under the constraint of the causality of the current source in space, we could successfully reconstruct the data. Several examples dealing with line source antennas and antenna arrays are simulated to illustrate the applicability of this approach.
In non-destructive testing (NDT), ultrasonic echo is often an overlapping multi-echo signals with noise. However, the accurate estimation of ultrasonic time-of-flight (TOF) is essential in NDT. In this letter, a novel method for TOF estimation through envelope is proposed. Firstly, the wavelet denoising technique is applied to the noisy echo to improve the estimation accuracy. Then, the Hilbert transform (HT) is used in ultrasonic signal processing in order to extract the envelope of the echo. Finally, the TOF of each component of multi-echo signals is estimated by the local maximum point of signal envelope. Furthermore, the time resolution of time-overlapping ultrasonic echoes is discussed. Numerical simulation has been carried out to show the performances of the proposed method in estimating TOF of ultrasonic signal.
Hee-Do KANG Il-Young OH Tong-Ho CHUNG Jong-Gwan YOOK
In this paper, penetration phenomenon of an early-time (E1) high altitude electromagnetic pulse (HEMP) into dispersive underground multilayer structures is analyzed using electromagnetic modeling of wave propagation in frequency dependent lossy media. The electromagnetic pulse is dealt with in the power spectrum ranging from 100kHz to the 100MHz band, considering the fact that the power spectrum of the E1 HEMP rapidly decreases 30dB below its maximum value beyond the 100MHz band. In addition, the propagation channel consisting of several dielectric materials is modeled with the dispersive relative permittivity of each medium. Based on source and channel models, the propagation phenomenon is analyzed in the frequency and time domains. The attenuation levels at a 100m underground point are observed to be about 15 and 20dB at 100kHz and 1MHz, respectively, and the peak level of the penetrating electric field is found 5.6kV/m. To ensure the causality of the result, we utilize the Hilbert transform.
Masashi SUGIYAMA Makoto YAMADA
The Hilbert-Schmidt independence criterion (HSIC) is a kernel-based statistical independence measure that can be computed very efficiently. However, it requires us to determine the kernel parameters heuristically because no objective model selection method is available. Least-squares mutual information (LSMI) is another statistical independence measure that is based on direct density-ratio estimation. Although LSMI is computationally more expensive than HSIC, LSMI is equipped with cross-validation, and thus the kernel parameter can be determined objectively. In this paper, we show that HSIC can actually be regarded as an approximation to LSMI, which allows us to utilize cross-validation of LSMI for determining kernel parameters in HSIC. Consequently, both computational efficiency and cross-validation can be achieved.
Teruyoshi SASAYAMA Tetsuo KOBAYASHI
We developed a novel movement-imagery-based brain-computer interface (BCI) for untrained subjects without employing machine learning techniques. The development of BCI consisted of several steps. First, spline Laplacian analysis was performed. Next, time-frequency analysis was applied to determine the optimal frequency range and latencies of the electroencephalograms (EEGs). Finally, trials were classified as right or left based on β-band event-related synchronization using the cumulative distribution function of pretrigger EEG noise. To test the performance of the BCI, EEGs during the execution and imagination of right/left wrist-bending movements were measured from 63 locations over the entire scalp using eight healthy subjects. The highest classification accuracies were 84.4% and 77.8% for real movements and their imageries, respectively. The accuracy is significantly higher than that of previously reported machine-learning-based BCIs in the movement imagery task (paired t-test, p < 0.05). It has also been demonstrated that the highest accuracy was achieved even though subjects had never participated in movement imageries.
Pulung WASKITO Shinobu MIWA Yasue MITSUKURA Hironori NAKAJO
In off-line analysis, the demand for high precision signal processing has introduced a new method called Empirical Mode Decomposition (EMD), which is used for analyzing a complex set of data. Unfortunately, EMD is highly compute-intensive. In this paper, we show parallel implementation of Empirical Mode Decomposition on a GPU. We propose the use of “partial+total” switching method to increase performance while keeping the precision. We also focused on reducing the computation complexity in the above method from O(N) on a single CPU to O(N/P log (N)) on a GPU. Evaluation results show our single GPU implementation using Tesla C2050 (Fermi architecture) achieves a 29.9x speedup partially, and a 11.8x speedup totally when compared to a single Intel dual core CPU.
Generally, two problems of bag-of-features in image retrieval are still considered unsolved: one is that spatial information about descriptors is not employed well, which affects the accuracy of retrieval; the other is that the trade-off between vocabulary size and good precision, which decides the storage and retrieval performance. In this paper, we propose a novel approach called Hilbert scan based bag-of-features (HS-BoF) for image retrieval. Firstly, Hilbert scan based tree representation (HSBT) is studied, which is built based on the local descriptors while spatial relationships are added into the nodes by a novel grouping rule, resulting of a tree structure for each image. Further, we give two ways of codebook production based on HSBT: multi-layer codebook and multi-size codebook. Owing to the properties of Hilbert scanning and the merits of our grouping method, sub-regions of the tree are not only flexible to the distribution of local patches but also have hierarchical relations. Extensive experiments on caltech-256, 13-scene and 1 million ImageNet images show that HS-BoF obtains higher accuracy with less memory usage.