Speaker change detection identifies the time indices in an audio stream at which the identity of the speaker changes. This paper proposes novel measures for speaker change detection over the centroid model, which divides the feature space into non-overlapping clusters for effective speaker-change comparison. The centroid model is a computationally efficient variant of the widely used mixture-distribution-based background models for speaker recognition. Experiments were performed on both synthetic and real-world data; the proposed approach yields promising results compared with conventional statistical measures.
Xuemin ZHAO Yuhong GUO Jian LIU Yonghong YAN Qiang FU
In this paper, a logarithmic adaptive quantization projection (LAQP) algorithm for digital watermarking is proposed. Conventional quantization index modulation uses a fixed quantization step in the watermarking embedding procedure, which leads to poor fidelity. Moreover, the conventional methods are sensitive to value-metric scaling attack. The LAQP method combines the quantization projection scheme with a perceptual model. In comparison to some conventional quantization methods with a perceptual model, the LAQP only needs to calculate the perceptual model in the embedding procedure, avoiding the decoding errors introduced by the difference of the perceptual model used in the embedding and decoding procedure. Experimental results show that the proposed watermarking scheme keeps a better fidelity and is robust against the common signal processing attack. More importantly, the proposed scheme is invariant to value-metric scaling attack.
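The scaling sensitivity of conventional fixed-step quantization index modulation mentioned above can be illustrated with a minimal sketch. This is not the LAQP algorithm; it is a plain scalar QIM with hypothetical function names (`quantize`, `qim_decode`) and an arbitrary step of 1.0, shown only to make the problem concrete.

```python
def quantize(x, bit, step):
    """Snap x to the nearest point of the lattice assigned to `bit`.

    Bit 0 uses multiples of `step`; bit 1 uses the same lattice shifted
    by step / 2. (Illustrative scalar QIM, not the LAQP scheme.)
    """
    offset = bit * step / 2.0
    return round((x - offset) / step) * step + offset


def qim_decode(y, step):
    """Return the bit whose lattice lies closest to the received value y."""
    d0 = abs(y - quantize(y, 0, step))
    d1 = abs(y - quantize(y, 1, step))
    return 0 if d0 <= d1 else 1


step = 1.0
marked = quantize(10.3, 1, step)      # embeds bit 1 -> 10.5
clean = qim_decode(marked, step)      # decoder recovers bit 1
scaled = qim_decode(marked * 1.05, step)  # a 5% gain change flips the bit
```

With a fixed step, a 5% amplitude change moves the marked value from the bit-1 lattice onto the bit-0 lattice, which is the kind of value-metric scaling failure the abstract refers to.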
Huy-Binh LE Sang-Gug LEE Seung-Tak RYU
A 20 kHz audio-band ADC with a single pair of power and ground pads is implemented for a digital electret microphone. Under the limited power/ground pad condition, the switching noise effect on the signal quality is estimated via post simulations with parasitic models. Performance degradation is minimized by time-domain noise isolation with sufficient time-spacing between the sampling edge and the output transition. The prototype ADC was implemented in a 0.18 µm CMOS process. It operates under a minimum supply voltage of 1.6 V with total current of 420 µA. Operating at 2.56 MHz clock frequency, it achieves 84 dB dynamic range and a 64 dB peak signal-to-(noise+distortion) ratio. The measured power supply rejection at a 100 mVpp 217 Hz square wave is -72 dB.
This letter proposes a spread spectrum audio watermarking robust against playback speed modification (PSM) attack which introduces both time-scale modification and pitch shifting. Two important improvements are exploited to achieve this robustness. The first one is selecting an embedding region according to the stable characteristic of the audio energy. The second one is stretching the pseudo-random noise sequence to match the length of the embedding region before embedding and detection. Experimental results show that our method is highly robust to common audio signal processing attacks and synchronization attacks including PSM, cropping, trimming and jittering.
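The second improvement, stretching the pseudo-random noise sequence to the length of the embedding region, can be sketched as follows. The letter does not state how the stretching is done; nearest-neighbour resampling, the function names, the seed, and the embedding strength `alpha` are all assumptions made for illustration.

```python
import random


def pn_sequence(length, seed):
    """Generate a reproducible +/-1 pseudo-random sequence."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]


def stretch(seq, new_length):
    """Resample seq to new_length by nearest-neighbour index mapping.

    Assumed interpolation; the letter may use a different method.
    """
    n = len(seq)
    return [seq[i * n // new_length] for i in range(new_length)]


def correlate(signal, pn):
    """Normalised correlation used as the detection statistic."""
    return sum(s * p for s, p in zip(signal, pn)) / len(pn)


# Embed into a silent region of 100 samples whose length differs from
# the 64-chip base sequence, e.g. after a playback speed change.
alpha = 0.1
base = pn_sequence(64, seed=1)
stretched = stretch(base, 100)
marked = [alpha * p for p in stretched]
score = correlate(marked, stretched)  # close to alpha when the mark is present
```

Because the detector stretches the same base sequence to the observed region length before correlating, embedder and detector stay aligned even when the region length changes.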
Yuta NAKASHIMA Ryosuke KANETO Noboru BABAGUCHI
Recently, a number of location-based services such as navigation and mobile advertising have been proposed. Such services require real-time user positions. Since a global positioning system (GPS), which is one of the most well-known techniques for real-time positioning, is unsuitable for indoor uses due to unavailability of GPS signals, many indoor positioning systems (IPSs) using WLAN, radio frequency identification tags, and so forth have been proposed. However, most of them suffer from high installation costs. In this paper, we propose a novel IPS for real-time positioning that utilizes a digital audio watermarking technique. The proposed IPS first embeds watermarks into an audio signal to generate watermarked signals, each of which is then emitted from a corresponding speaker installed in a target environment. A user of the proposed IPS receives the watermarked signals with a mobile device equipped with a microphone, and the watermarks are detected in the received signal. For positioning, we model various effects upon watermarks due to propagation in the air, i.e., delays, attenuation, and diffraction. The model enables the proposed IPS to accurately locate the user based on the watermarks detected in the received signal. The proposed IPS can be easily deployed with a low installation cost because the IPS can work with off-the-shelf speakers that have been already installed in most of the indoor environments such as department stores, amusement arcades, and airports. We experimentally evaluate the accuracy of positioning and show that the proposed IPS locates the user in a 6 m by 7.5 m room with root mean squared error of 2.25 m on average. The results also demonstrate the potential capability of real-time positioning with the proposed IPS.
Md. TARIQUZZAMAN Jin Young KIM Seung You NA Hyoung-Gook KIM Dongsoo HAR
In this paper, a novel visual signal reliability (VSR) measure is proposed to consider video degradation at the signal level in audio-visual speaker identification (AVSI). The VSR estimation is formulated using a Gaussian fuzzy membership function (GFMF) to measure lighting variations. The variance parameters of GFMF are optimized in order to maximize the performance of the overall AVSI. The experimental results show that the proposed method outperforms the score-based reliability measuring technique.
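A Gaussian fuzzy membership function has the standard form exp(-(x - c)^2 / (2 sigma^2)). The sketch below applies it to a hypothetical lighting measure (mean frame brightness); the choice of input feature, the centre value, and the spread are assumptions, since the abstract does not specify them.

```python
import math


def gaussian_membership(x, center, sigma):
    """Standard Gaussian fuzzy membership: 1.0 at the centre, decaying with distance."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


# Hypothetical use: reliability of the visual stream as a function of
# mean frame brightness, with 128 taken as the nominal lighting level.
nominal = gaussian_membership(128.0, center=128.0, sigma=40.0)  # 1.0
dim = gaussian_membership(48.0, center=128.0, sigma=40.0)       # exp(-2)
```

The spread `sigma` plays the role of the variance parameter that the paper optimizes against overall identification performance.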
Jeong-Hun SEO Inyong CHOI Sang Bae CHON Koeng-Mo SUNG
The adequate evaluation of sound quality is an important issue for lossy compression codecs such as MP3. ITU-R Rec. BS.1387-1 (PEAQ, Perceptual Evaluation of Audio Quality) is the most widely used method to evaluate sound quality objectively. However, PEAQ can only be used for mono signals or two-channel stereo signals, because it considers only timbral factors when assessing sound quality. This paper introduces an improved objective quality assessment method that can be used for both mono signals and multichannel audio signals and that considers both “spatial” and “timbral” factors. The “spatial” factors, which measure perceptual distortions in spatial impression, are important for evaluating the quality of multichannel sounds.
The ability to find the speaker's face region in a video is useful for various applications. In this work, we develop a novel technique to find this region within different time windows, which is robust against the changes of view, scale, and background. The main thrust of our technique is to integrate audiovisual correlation analysis into a video segmentation framework. We analyze the audiovisual correlation locally by computing quadratic mutual information between our audiovisual features. The computation of quadratic mutual information is based on the probability density functions estimated by kernel density estimation with adaptive kernel bandwidth. The results of this audiovisual correlation analysis are incorporated into graph cut-based video segmentation to resolve a globally optimum extraction of the speaker's face region. The setting of any heuristic threshold in this segmentation is avoided by learning the correlation distributions of speaker and background by expectation maximization. Experimental results demonstrate that our method can detect the speaker's face region accurately and robustly for different views, scales, and backgrounds.
Noritsugu EGI Takanori HAYASHI Akira TAKAHASHI
We propose a parametric packet-layer model for monitoring audio quality in multimedia streaming services such as Internet protocol television (IPTV). This model estimates audio quality of experience (QoE) on the basis of quality degradation due to coding and packet loss of an audio sequence. The input parameters of this model are audio bit rate, sampling rate, frame length, packet-loss frequency, and average burst length. Audio bit rate, packet-loss frequency, and average burst length are calculated from header information in received IP packets. For sampling rate, frame length, and audio codec type, the values or the names used in monitored services are input into this model directly. We performed a subjective listening test to examine the relationships between these input parameters and perceived audio quality. The codec used in this test was Advanced Audio Coding-Low Complexity (AAC-LC), which is one of the international standards for audio coding. On the basis of the test results, we developed an audio quality evaluation model. The verification results indicate that audio quality estimated by the proposed model has a high correlation with perceived audio quality.
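How two of the loss-related inputs might be derived from packet headers can be sketched as below. The sketch assumes that packet-loss frequency means the number of loss events (runs of consecutive missing packets) and that average burst length is the number of lost packets per event; the abstract does not define either term, and sequence-number wraparound is ignored for brevity.

```python
def loss_stats(received_seq_numbers):
    """Return (loss_events, average_burst_length) from received sequence numbers.

    A loss event is a gap between two consecutive received sequence
    numbers; its burst length is the number of packets missing in the gap.
    """
    seqs = sorted(set(received_seq_numbers))
    events = 0
    lost = 0
    for prev, cur in zip(seqs, seqs[1:]):
        gap = cur - prev - 1
        if gap > 0:
            events += 1
            lost += gap
    average_burst = lost / events if events else 0.0
    return events, average_burst


# Packets 3, 4 and 7 are missing: two loss events, three lost packets.
events, average_burst = loss_stats([1, 2, 5, 6, 8])
```

Both quantities depend only on header fields, which is what allows a packet-layer model to run without decoding the audio payload.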
Zul Azri BIN MUHAMAD NOH Takahiro SUZUKI Shuji TASAKA
This paper proposes a cross-layer packet scheduling scheme for QoS support in audio-video transmission with IEEE 802.11e HCCA and assesses application-level QoS and QoE of the scheduling scheme under lossy channel conditions. In the proposed scheme, the access point (AP) basically allocates transmission opportunity (TXOP) for each station in a service interval (SI) like the reference scheduler of the IEEE 802.11e standard, which is referred to as the TGe scheme in this paper. In the proposed scheme, however, the AP calculates the number of MAC service data units (MSDUs) that arrived in an SI, considering the inter-arrival time of audio samples and that of video frames, which are referred to as media units (MUs), at the application layer. The AP then gives additional TXOP duration in the SI to stations which had audio or video MAC protocol data units (MPDUs) in their source buffers at the end of the previous TXOP. In addition, utilizing video frame information from the application layer, we propose video frame skipping at the MAC level of a source station. If a station fails to transmit a video MPDU, it drops all the following video MPDUs in the source buffer until the next intra-coded frame comes to the head of the buffer. We compare the reference scheduler (TGe scheme) and the proposed packet scheduling scheme with and without the video frame skipping at the source in terms of application-level QoS and QoE. We discuss the effectiveness of the proposed packet scheduling scheme from a viewpoint of QoE as well as QoS. Numerical results reveal that the proposed packet scheduling scheme can achieve higher quality than the TGe scheme under lossy channel conditions. We also show that the proposed scheduling scheme can improve the QoS and QoE by using the video frame skipping at the source. Furthermore, we examine the effect of SI on the QoS and QoE of the proposed packet scheduling scheme and find that the appropriate value of SI is equal to the inter-arrival time of video frames.
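The video frame skipping rule described above can be sketched as a small buffer operation. The sketch assumes a video-only source buffer in which each MPDU is tagged with its frame type ("I" for intra-coded, anything else for predicted frames); the tagging and the function name are hypothetical.

```python
from collections import deque


def drop_until_intra(video_buffer):
    """After a failed video MPDU transmission, discard MPDUs from the head
    of the buffer until an intra-coded frame is at the head.

    Returns the number of MPDUs dropped. The buffer is modified in place.
    """
    dropped = 0
    while video_buffer and video_buffer[0] != "I":
        video_buffer.popleft()
        dropped += 1
    return dropped


# Two predicted frames are queued ahead of the next intra-coded frame.
buffer = deque(["P", "P", "I", "P"])
dropped = drop_until_intra(buffer)
```

Predicted frames that depend on a lost frame cannot be decoded correctly, so dropping them at the source frees TXOP time for MPDUs that can still be used.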
Bing-Fei WU Hao-Yu HUANG Yen-Lin CHEN Hsin-Yuan PENG Jia-Hsiung HUANG
This study presents several optimization approaches for the MPEG-2/4 Advanced Audio Coding (AAC) Low Complexity (LC) encoding and decoding processes. Considering the power consumption and the peripherals required for consumer electronics, this study adopts the TI OMAP5912 platform for portable devices. An important optimization issue for implementing an AAC codec on embedded and mobile devices is to reduce computational complexity and memory consumption. Due to power saving issues, most embedded and mobile systems can only provide very limited computational power and memory resources for the coding process. As a result, modifying and simplifying only one or two blocks is insufficient for optimizing the AAC encoder and enabling it to work well on embedded systems. It is therefore necessary to enhance the computational efficiency of other important modules in the encoding algorithm. This study focuses on optimizing the Temporal Noise Shaping (TNS), Mid/Side (M/S) Stereo, Modified Discrete Cosine Transform (MDCT) and Inverse Quantization (IQ) modules in the encoder and decoder. Furthermore, we also propose an efficient memory reduction approach that provides a satisfactory balance between the reduction of memory usage and the expansion of the encoded files. In the proposed design, both the AAC encoder and decoder are built with fixed-point arithmetic operations and implemented on a DSP processor combined with an ARM core for peripheral control. Experimental results demonstrate that the proposed AAC codec is computationally effective, has low memory consumption, and is suitable for low-cost embedded and mobile applications.
Ji-Soo KEUM Hyon-Soo LEE Masafumi HAGIWARA
In this letter, we propose an improved speech/nonspeech classification method to effectively classify a multimedia source. To improve performance, we introduce a feature based on spectral duration analysis and combine it with recently proposed features such as high zero crossing rate ratio (HZCRR), low short time energy ratio (LSTER), and pitch ratio (PR). According to the results of our experiments on speech, music, and environmental sounds, the proposed method achieved high classification accuracy compared with conventional approaches.
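Two of the combined features can be sketched from per-frame measurements. The thresholds used here (1.5 times the window-average zero crossing rate for HZCRR, 0.5 times the window-average short time energy for LSTER) follow the commonly cited definitions of these features and are an assumption; the letter itself does not state them.

```python
def hzcrr(zcr_per_frame):
    """Fraction of frames whose zero crossing rate exceeds 1.5x the window average."""
    average = sum(zcr_per_frame) / len(zcr_per_frame)
    high = sum(1 for z in zcr_per_frame if z > 1.5 * average)
    return high / len(zcr_per_frame)


def lster(energy_per_frame):
    """Fraction of frames whose short time energy is below 0.5x the window average."""
    average = sum(energy_per_frame) / len(energy_per_frame)
    low = sum(1 for e in energy_per_frame if e < 0.5 * average)
    return low / len(energy_per_frame)


# One analysis window of four frames (illustrative values only).
window_hzcrr = hzcrr([10, 10, 10, 50])      # one frame above 30
window_lster = lster([1.0, 1.0, 1.0, 9.0])  # three frames below 1.5
```

Speech alternates between voiced, unvoiced, and silent frames, so both ratios tend to be higher for speech than for music under these definitions.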
This letter suggests a novel high capacity robust audio watermarking algorithm by using the high frequency band of the wavelet decomposition, for which the human auditory system (HAS) is not very sensitive to alteration. The main idea is to divide the high frequency band into frames and then, for embedding, the wavelet samples are changed based on the average of the relevant frame. The experimental results show that the method has very high capacity (about 5.5 kbps), without significant perceptual distortion (ODG in [-1, 0] and SNR about 33 dB) and provides robustness against common audio signal processing such as added noise, filtering, echo and MPEG compression (MP3).
This paper proposes a novel robust audio watermarking algorithm to embed data and extract it in a bit-exact manner based on changing the magnitudes of the FFT spectrum. The key point is selecting a frequency band for embedding based on the comparison between the original and the MP3 compressed/decompressed signal and on a suitable scaling factor. The experimental results show that the method has a very high capacity (about 5 kbps), without significant perceptual distortion (ODG about -0.25) and provides robustness against common audio signal processing such as added noise, filtering and MPEG compression (MP3). Furthermore, the proposed method has a larger capacity (ratio of the number of embedded bits to the number of host bits) than recent image data hiding methods.
Young Han LEE Deok Su KIM Hong Kook KIM Jongmo SUNG Mi Suk LEE Hyun Joo BAE
In this paper, we propose a bandwidth-scalable stereo audio coding method based on a layered structure. The proposed stereo coding method encodes super-wideband (SWB) stereo signals and is able to decode either wideband (WB) stereo signals or SWB stereo signals, depending on the network congestion. The performance of the proposed stereo coding method is then compared with that of a conventional stereo coding method that separately decodes WB or SWB stereo signals, in terms of subjective quality, algorithmic delay, and computational complexity. Experimental results show that when stereo audio signals sampled at a rate of 32 kHz are compressed to 64 kbit/s, the proposed method provides significantly better audio quality with a 64-sample shorter algorithmic delay, and comparable computational complexity.
A method for accurate scene segmentation using two kinds of directed graph obtained by object matching and audio features is proposed. Generally, in audiovisual materials, such as broadcast programs and movies, there are repeated appearances of similar shots that include frames of the same background, object or place, and such shots are included in a single scene. Many scene segmentation methods based on this idea have been proposed; however, since they use color information as visual features, they cannot provide accurate scene segmentation results if the color features change in different shots for which frames include the same object due to camera operations such as zooming and panning. In order to solve this problem, scene segmentation by the proposed method is realized by using two novel approaches. In the first approach, object matching is performed between two frames that are each included in different shots. By using these matching results, repeated appearances of shots for which frames include the same object can be successfully found and represented as a directed graph. The proposed method also generates another directed graph that represents the repeated appearances of shots with similar audio features in the second approach. By combined use of these two directed graphs, degradation of scene segmentation accuracy, which results from using only one kind of graph, can be avoided in the proposed method and thereby accurate scene segmentation can be realized. Experimental results performed by applying the proposed method to actual broadcast programs are shown to verify the effectiveness of the proposed method.
Takahiro SUZUKI Shuji TASAKA Atsunori NOGUCHI
This paper assesses application-level QoS and Quality of Experience (QoE) in the case where audio and video streams are transferred with the enhanced distributed channel access (EDCA) of the IEEE 802.11e MAC. In EDCA, a station can transmit multiple MAC frames during a transmission opportunity (TXOP); this is referred to as TXOP-bursting. By simulation, we first compare application-level QoS with the TXOP-bursting scheme and that without the scheme for various distances between access point (AP) and stations. In this paper, we suppose that the bit error rate (BER) becomes larger as the distance increases. Numerical results show that TXOP-bursting can improve many metrics of video quality such as average media unit (MU) delay, MU loss ratio, and media synchronization quality, particularly when the AP sends audio and video streams to stations in the downlink direction. We then examine the effect of TXOPLimit on the video quality. Simulation results show that the video quality can be degraded if the value of TXOPLimit is too small. Furthermore, we assess QoE by the method of successive categories, which is a psychometric method. Numerical results show that TXOP-bursting can also improve the QoE. We also perform QoS mapping between application-level and user-level with principal component analysis and multiple regression analysis.
Yoko YAMAKATA Michiaki KATSUMOTO Toshiyuki KIMURA
In this paper, we propose a new system for controlling radiated sound directivity. The proposed system induces a bending vibration on a planar diaphragm by driving it with multiple vibrators. Because the bending vibration in this case is determined by not one but all of the accelerated vibrations, the vibration of the diaphragm can be controlled by modulating the accelerated vibration waveforms relative to one another at each frequency. As a consequence, the directivity of the radiated sound is also varied. To investigate the feasibility of this system, we constructed a prototype with three vibrators and a circular plate as the diaphragm, the circular plate being one of the most typical shapes considered in discussions of plate vibration. The measurement data showed visually that with this system, surface vibration and sound directivity change depending on the phases of the accelerated vibrations.
Hitoshi OHNISHI Kaname MOCHIZUKI
Transmission delay in audio communications is a well-known obstacle to achieving smooth communication. However, it is not known what kinds of effects are caused by small delays. We hypothesized that the small delay in the listener's responses disturbs the speaker's "verbal conditioning," where the verbal behavior of the speaker varies in accordance with the listener's responses. We examined whether the small delays in the listener's responses disturb the speaker's verbal conditioning using an artificial-grammar learning task. The results suggested that a 300-ms delay disturbed the participants' verbal conditioning although they were not adequately aware of the delay.
Yasuyuki MATSUYA Takahiro MESUDA
We propose a stereo transmission technique using infrared rays and pulse density modulation (PDM) for digital wireless audio headphone systems. The main feature of the proposed technique is the use of two channels for transmission: the PDM data channel and the synchronized clock channel. This technique improves receiver characteristics to a noise floor of -80 dB and a second distortion of 62 dB and achieves a very low power consumption of 3.5 mW.
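Pulse density modulation represents amplitude as the local density of 1 bits in a one-bit stream. The sketch below uses a generic first-order sigma-delta loop, which is one common way to produce PDM data; the abstract does not describe the modulator actually used, and the function name and test signal are assumptions.

```python
def pdm_encode(samples):
    """Encode samples in [-1.0, 1.0] as a one-bit PDM stream.

    Generic first-order sigma-delta loop: integrate the error between the
    input and the previous output level, then emit 1 if the integrator is
    non-negative and 0 otherwise.
    """
    bits = []
    integrator = 0.0
    feedback = 0.0
    for x in samples:
        integrator += x - feedback
        bit = 1 if integrator >= 0.0 else 0
        feedback = 1.0 if bit else -1.0
        bits.append(bit)
    return bits


# A constant input of 0.5 should give a ones density near (1 + 0.5) / 2.
stream = pdm_encode([0.5] * 1000)
density = sum(stream) / len(stream)
```

A receiver can recover the signal by low-pass filtering the bit stream, and sending the bit clock on a separate synchronized channel, as the proposed technique does, removes the need for clock recovery at the receiver.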