Objective assessment of image and video quality should be based on a correct understanding of subjective assessment by human observers. Previous models have incorporated the mechanisms of early visual processing in image quality metrics, enabling us to evaluate the visibility of errors from the original images. However, to understand how human observers perceive image quality, one should also consider higher stages of visual processing where perception is established. In higher stages, the visual system presumably represents a visual scene as a collection of meaningful components such as objects and events. Our recent psychophysical studies suggest two principles related to this level of processing. First, the human visual system integrates shape and color signals along perceived motion trajectories in order to improve visibility of the shape and color of moving objects. Second, the human visual system estimates surface reflectance properties like glossiness using simple image statistics rather than by inverse computation of image formation optics. Although the underlying neural mechanisms are still under investigation, these computational principles are potentially useful for the development of effective image processing technologies and for quality assessment. Ideally, if a model can specify how a given image is transformed into high-level scene representations in the human brain, it would predict many aspects of subjective image quality, including fidelity and naturalness.
In this paper, we present a Double-Anchoring Based Tone Mapping (DABTM) algorithm for displaying high dynamic range (HDR) images. First, two anchoring values are obtained using the double-anchoring theory. Second, we use the two values to formulate the compressing operator, which can achieve the aim of tone mapping directly. A new method based on accelerated K-means for the decomposition of HDR images into groups (frameworks) is proposed. Most importantly, a group of piecewise-overlap linear functions is put forward to define the belongingness of pixels to their locating frameworks. Experiments show that our algorithm is capable of achieving dynamic range compression, while preserving fine details and avoiding common artifacts such as gradient reversals, halos, or loss of local contrast.
Chang Ha LEE Youngmin KIM Amitabh VARSHNEY
The comprehensibility of large and complex 3D models can be greatly enhanced by guiding the viewer's attention to important regions. Lighting is crucial to our perception of shape. Careful lighting has been widely used in art, scientific illustration, and computer graphics to guide visual attention. In this paper, we explore how the saliency of 3D objects can be used to guide lighting to emphasize important regions and suppress less important ones.
Yuki HONGOH Shinichi KITA Yoshiharu SOETA
We examined how spatial disparity between auditory and visual stimuli modulated the audio-visual (A-V) prior entry effect. Spatial and temporal proximity of multisensory stimuli are crucial factors for multisensory perception in most cases (e.g. [1],[2]). However, our previous research [3],[4] suggested that this well-accepted hypothesis was not applicable to the A-V prior entry effect. In order to examine the effect of spatial disparity on the A-V prior entry effect, six loudspeakers and two light emitting diodes (LEDs) were used as stimuli. The loudspeakers were located at 10, 25, and 90 degrees from the midline of the participants on both the right and left sides. A preceding sound was presented from one of these six loudspeakers. After the preceding sound, two visual targets were presented successively at a short interval and participants judged which visual target was presented first. Two colour-changeable ('red' or 'green') LEDs were used for the visual targets, and participants judged the order of the visual targets by their colour, not by their side, in order to avoid response bias as much as possible. In Experiment 1, the visual targets were situated at 10 degrees or 25 degrees from the participants' midline on both the right and left. Results showed a biased judgment that the visual target on the side where the sound was presented appeared first. The amplitude of the A-V prior entry effect was greater when the preceding sound source was farther from the midline of the participants. This effect of spatial separation indicated that the clarity with which the preceding sound could be assigned to the right or left side enhanced the amplitude of the A-V prior entry effect (Experiment 2). These results challenge the belief that the spatial proximity of multisensory stimuli is a crucial factor for multisensory perception.
Sylvain TOURANCHEAU Patrick LE CALLET Dominique BARBA
In this paper, the impact of the display on quality assessment is addressed. Subjective quality assessment experiments have been performed on both LCD and CRT displays. Two sets of still images and two sets of moving pictures have been assessed using either an ACR or a SAMVIQ protocol. Altogether, eight experiments have been conducted. Results are presented and discussed, and some differences are pointed out. Concerning moving pictures, these differences seem to be mainly due to LCD motion artefacts such as motion blur. LCD motion blur has been measured both objectively and with psychophysical experiments. A motion-blur metric based on the temporal characteristics of the LCD can be defined. A prediction model has then been designed which predicts the differences in perceived quality between CRT and LCD. This motion-blur-based model enables the estimation of perceived quality on LCD with respect to the perceived quality on CRT. Technical solutions to LCD motion blur can thus be evaluated on natural content by this means.
Te-Yuan HUANG Kuan-Ta CHEN Polly HUANG Chin-Laung LEI
Quantifying user satisfaction is essential, because the results can help service providers deliver better services. In this work, we propose a generalizable methodology, based on survival analysis, to quantify user satisfaction in terms of session times, i.e., the length of time users stay with an application. Unlike subjective human surveys, our methodology is based solely on passive measurement, which is more cost-efficient and better able to capture subconscious reactions. Furthermore, by using session times rather than a specific performance indicator, such as the level of distortion of voice signals, the developed models can also capture the effects of other factors, such as loudness and sidetone. Like survival analysis, our methodology is characterized by low complexity and a simple model-developing process. The feasibility of our methodology is demonstrated through case studies of ShenZhou Online, a commercial MMORPG in Taiwan, and the most prevalent VoIP application in the world, namely Skype. Through the model development process, we can also identify the most significant performance factors and their impacts on user satisfaction and discuss how they can be exploited to improve user experience and optimize resource allocation.
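As an illustration of how session times can be treated as survival data, the following minimal sketch computes a Kaplan-Meier survival curve. This is one standard survival-analysis estimator and is not necessarily the model used in the work above; the function name, the sample session times, and the censoring flags are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time where a session end was observed.

    durations: session lengths (any consistent time unit).
    observed:  True if the end of the session was observed,
               False if the session was censored (still running when
               measurement stopped).
    """
    pairs = sorted(zip(durations, observed))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = 0   # sessions that ended at time t
        removed = 0  # sessions leaving the risk set at time t
        while i < len(pairs) and pairs[i][0] == t:
            if pairs[i][1]:
                events += 1
            removed += 1
            i += 1
        if events > 0:
            survival *= 1.0 - events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Hypothetical session times in minutes; False marks a censored session.
sessions = [5, 8, 8, 12, 20]
ended = [True, True, False, True, False]
curve = kaplan_meier(sessions, ended)
for t, s in curve:
    print(f"t={t:>3} min  S(t)={s:.2f}")
```

Comparing such curves between groups of sessions (for example, sessions with high and low network delay) is one simple way to see whether a factor shortens the time users stay.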
Lin YANG Jianping ZHANG Jian SHAO Yonghong YAN
This letter evaluates the relative contributions of temporal fine structure cues in various frequency bands to Mandarin tone perception using novel "auditory chimaeras". Our results confirm the importance of temporal fine structure cues to lexical tone perception and identify the dominant region for lexical tone perception: the second to fifth harmonics contribute no less than the fundamental frequency itself.
David GAVILAN Hiroki TAKAHASHI Suguru SAITO Masayuki NAKAJIMA
A method for evaluating image segmentation methods is proposed in this paper. The method is based on a perception model where the drawing act is used to represent visual mental percepts. Each segmented image is represented by a minimal set of features and the segmentation method is tested against a set of sketches that represent a subset of the original image database, using the Mahalanobis distance function. The covariance matrix is set using a collection of sketches drawn by different users. The different drawings are demonstrated to be consistent across users. This evaluation method can be used to solve the problem of parameter selection in image segmentation, as well as to show the goodness or limitations of the different segmentation algorithms. Different well-known color segmentation algorithms are analyzed with the proposed method and the nature of each one is discussed. This evaluation method is also compared with heuristic functions that serve for the same purpose, showing the importance of using users' pictorial knowledge.
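The Mahalanobis distance mentioned above can be sketched for a two-dimensional feature vector as follows. The two-feature restriction, the function name, and the sample values are assumptions made for illustration; the features actually used in the method are not specified here.

```python
import math


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance between a 2-D feature vector x and a mean,
    given a 2x2 covariance matrix cov (tuple of two rows)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mean[0], x[1] - mean[1])
    # d^2 = dx^T * inv * dx
    t0 = inv[0][0] * dx[0] + inv[0][1] * dx[1]
    t1 = inv[1][0] * dx[0] + inv[1][1] * dx[1]
    return math.sqrt(dx[0] * t0 + dx[1] * t1)


# Hypothetical example: a segmentation feature vector compared with the
# mean of the sketch features, with variance 4 on the first feature.
print(mahalanobis_2d((2.0, 0.0), (0.0, 0.0), ((4.0, 0.0), (0.0, 1.0))))
```

Because the covariance is estimated from sketches drawn by different users, features on which users disagree strongly (large variance) contribute less to the distance than features they draw consistently.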
Takafumi KANAZAWA Toshimitsu USHIO
In evolutionary game theory, to the best of our knowledge, individuals' perceptions have not been taken into consideration explicitly. When an individual interacts with another individual under the coexistence of heterogeneous sub-populations, the individual may be willing to change his/her strategy depending on the sub-population the other individual belongs to. Moreover, in such a situation, each individual may make an error about the sub-population the other individual belongs to. In this paper, we propose a multi-population model with such erroneous perceptions. We define an evolutionarily stable strategy (ESS) and formulate replicator dynamics in this model, and prove several properties of the proposed model. Moreover, we focus on a two-population chicken game with erroneous perceptions and discuss characteristics of equilibrium points of its replicator dynamics.
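For readers unfamiliar with replicator dynamics, the following sketch integrates the standard single-population replicator equation for a chicken game. It is a deliberate simplification: it has one population and no erroneous perceptions, so it is not the multi-population model proposed above. The payoff values, step size, and function names are hypothetical.

```python
def chicken_payoff_gap(x, both_dare=-10.0, win=1.0, lose=-1.0, both_swerve=0.0):
    """Expected payoff of 'dare' minus that of 'swerve' when a fraction x
    of the population plays 'dare'."""
    f_dare = x * both_dare + (1.0 - x) * win
    f_swerve = x * lose + (1.0 - x) * both_swerve
    return f_dare - f_swerve


def replicator(x0, dt=0.01, steps=20000):
    """Forward-Euler integration of dx/dt = x (1 - x) (f_dare - f_swerve)."""
    x = x0
    for _ in range(steps):
        x += dt * x * (1.0 - x) * chicken_payoff_gap(x)
    return x


# With these payoffs the gap is 1 - 10 x, so the interior equilibrium is x = 0.1.
for start in (0.02, 0.5, 0.9):
    print(start, "->", round(replicator(start), 4))
```

With these payoffs, every interior starting point approaches the mixed equilibrium in which one tenth of the population dares, which is the kind of equilibrium point whose stability such analyses examine.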
Feng-Cheng CHANG Hsueh-Ming HANG
Content-based image search has long been considered a difficult task. Making correct conjectures about the user's intention (perception) based on the query images is a critical step in content-based search. One key concept in this paper is how we find the user-preferred low-level image characteristics from the multiple positive samples provided by the user. The second key concept is how we generate a set of consistent "pseudo images" when the user does not provide a sufficient number of samples. The notion of image feature stability is thus introduced. The third key concept is how we use negative images as a pruning criterion. In realizing the preceding concepts, an image search scheme is developed using weighted low-level image features. Finally, quantitative simulation results are used to show the effectiveness of these concepts.
In this paper, we introduce a new method for depth perception from a 2D natural scene using scale variation of patterns. As the surface from a 2D scene gets farther away from us, the texture appears finer and smoother. Texture gradient is one of the monocular depth cues which can be represented by gradual scale variations of textured patterns. To extract feature vectors from textured patterns, higher order local autocorrelation functions are utilized at each scale step. The hierarchical linear discriminant analysis is employed to classify the scale rate of the feature vector which can be divided into subspaces by recursively grouping the overlapped classes. In the experiment, relative depth perception of 2D natural scenes is performed on the proposed method and it is expected to play an important role in natural scene analysis.
Hideaki TAKADA Shiro SUYAMA Kenji NAKAZAWA
We are developing a simple three-dimensional (3-D) display method that uses only two transparent images on luminance-division displays, without any extra equipment. This method can be applied not only to electronic displays but also to printed sheets. The method utilizes a 3-D visual illusion in which two ordinary images with many edges can be perceived as an apparent 3-D image with continuous depth between the two image planes, when two identical images are overlapped as seen from the midpoint between the observer's eyes and their optical-density ratio is changed according to the desired image depths. We can use transparent printed sheets or transparent liquid crystal displays to display two overlapping transparent images using this 3-D display method. Subjective test results show that the perceived depths changed continuously as the optical-density ratio changed. Deviations of the perceived depths from the average for each observer were sufficiently small. The depths perceived by all six observers coincided well.
Automatic labeling of prosodic features is an important topic when constructing large speech databases for speech synthesis or analysis purposes. Perceptually-related F0 parameters are proposed with the aim of automatically classifying phrase final tones. Analyses are conducted to verify how consistently subjects are able to categorize phrase final tones, and how perceptual features are related with the categories. Three types of acoustic parameters are proposed and analyzed for representing the perceptual features related to the tone categories: one related to pitch movement within the phrase final, one related to pitch reset prior to the phrase final, and one related to the length of the phrase final. A classification tree is constructed to evaluate automatic classification of phrase final tones, resulting in 79.2% accuracy for the consistently categorized samples, using the best combination among the proposed acoustic parameters.
Visual defects, called mura in the field, sometimes occur during the manufacturing of flat-panel liquid crystal displays. In this paper we propose an automatic inspection method that reliably detects and quantifies TFT-LCD region-mura defects. The method consists of two phases. In the first phase, we segment candidate region-muras from TFT-LCD panel images using modified regression diagnostics and Niblack's thresholding. In the second phase, based on the human eye's sensitivity to mura, we quantify the mura level for each candidate, which is used to identify real muras by grading them as pass or fail. The performance of the proposed method is evaluated on real TFT-LCD panel samples.
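Niblack's thresholding, used in the first phase above, sets a local threshold T = m + k * s from the mean m and standard deviation s of the pixels in a window around each pixel. The sketch below shows only this classic rule on a small grey-level image; the regression-diagnostics step is not reproduced, and the window radius, the value of k, and the sample image are hypothetical.

```python
import math


def niblack_threshold(image, radius=1, k=0.2):
    """Per-pixel Niblack threshold T = mean + k * std over a square window
    of the given radius, clipped at the image border."""
    h, w = len(image), len(image[0])
    thresholds = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            mean = sum(window) / len(window)
            var = sum((v - mean) ** 2 for v in window) / len(window)
            thresholds[y][x] = mean + k * math.sqrt(var)
    return thresholds


def binarize(image, thresholds):
    """Mark pixels brighter than their local threshold as candidates (1)."""
    return [[1 if p > t else 0 for p, t in zip(row, trow)]
            for row, trow in zip(image, thresholds)]


# Hypothetical 3x3 panel patch with one bright blemish in the centre.
patch = [[10, 10, 10],
         [10, 100, 10],
         [10, 10, 10]]
mask = binarize(patch, niblack_threshold(patch))
print(mask)
```

A positive k favours pixels brighter than their neighbourhood; a negative k with the comparison reversed would select dark regions instead.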
Steven GREENBERG Takayuki ARAI
Classical models of speech recognition assume that a detailed, short-term analysis of the acoustic signal is essential for accurately decoding the speech signal and that this decoding process is rooted in the phonetic segment. This paper presents an alternative view, one in which the time scales required to accurately describe and model spoken language are both shorter and longer than the phonetic segment, and are inherently wedded to the syllable. The syllable reflects a singular property of the acoustic signal -- the modulation spectrum -- which provides a principled, quantitative framework to describe the process by which the listener proceeds from sound to meaning. The ability to understand spoken language (i.e., intelligibility) vitally depends on the integrity of the modulation spectrum within the core range of the syllable (3-10 Hz) and reflects the variation in syllable emphasis associated with the concept of prosodic prominence ("accent"). A model of spoken language is described in which the prosodic properties of the speech signal are embedded in the temporal dynamics associated with the syllable, a unit serving as the organizational interface among the various tiers of linguistic representation.
Shiro SUYAMA Hideaki TAKADA Sakuichi OHTSUKA
We propose a novel three-dimensional (3-D) display using only two 2-D images displayed at different depths. It is based on a new perceptual phenomenon induced by the human binocular visual system and enables an observer using no extra equipment to perceive an apparent 3-D image of continuous depth when the luminance is divided between the 2-D images according to the 3-D image depth. Our prototype direct-vision 3-D display using this mechanism can easily produce moving 3-D color images by using conventional 2-D color displays.
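The idea of dividing luminance between the front and rear images according to depth can be sketched as below. The linear split, the normalized depth parameter, and the function name are assumptions made for illustration; the actual mapping between luminance ratio and perceived depth is not specified in the text above.

```python
def split_luminance(total, depth):
    """Divide a pixel's luminance between the front and rear image planes.

    total: luminance of the pixel in the desired 3-D image.
    depth: normalized depth, 0.0 at the front plane and 1.0 at the rear
           plane (a hypothetical parameterization).
    Returns (front_luminance, rear_luminance); their sum equals total.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must lie in [0, 1]")
    rear = total * depth
    front = total - rear
    return front, rear


# A pixel meant to appear a quarter of the way from the front plane.
print(split_luminance(100.0, 0.25))  # (75.0, 25.0)
```

Keeping the sum of the two luminances constant means the overlapped images look like the original 2-D image, while the ratio alone carries the depth information.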
Jeffrey C. BAMBER Paul E. BARBONE Nigel L. BUSH David O. COSGROVE Marvin M. DOYELY Frank G. FUECHSEL Paul M. MEANEY Naomi R. MILLER Tsuyoshi SHIINA Francois TRANQUART
A digest is provided of work carried out at the Institute of Cancer Research to develop freehand elastography and apply it to breast investigations. Topics covered include the development of freehand elastography and its relationship to other methods, a description of the system for off-line clinical evaluation of the freehand method, comparison of the physical performances of freehand and mechanical elastography, early clinical results on 70 breast tumours, real-time imaging, quantitative elastography and psychophysical aspects of the detection and assessment of elastic lesions. Progress in developing this new medical imaging modality is occurring rapidly throughout the world and its future looks promising.
Tsutomu MIYASATO Haruo NOMA Fumio KISHINO
This paper describes the results of tests that measured the allowable delay between images and tactile information via a force feedback device. In order to investigate the allowable delay, two experiments were performed: 1) subjective evaluation in real space and 2) subjective evaluation in virtual space using a force feedback device.
Sumio OHNO Keikichi HIROSE Hiroya FUJISAKI
In conventional word-spotting methods for automatic recognition of continuous speech, individual frames or segments of the input speech are assigned labels and local likelihood scores solely on the basis of their own acoustic characteristics. On the other hand, experiments on human speech perception conducted by the present authors and others show that human perception of words in connected speech is based not only on the acoustic characteristics of individual segments, but also on the acoustic and linguistic contexts in which these segments occur. In other words, individual segments are not correctly perceived by humans unless they are accompanied by their context. These findings on the process of human speech perception have to be applied in automatic speech recognition in order to improve its performance. From this point of view, the present paper proposes a new scheme for detecting words in continuous speech based on template matching, where the likelihood of each segment of a word is determined not only by its own characteristics but also by the likelihood of its context within the framework of a word. This is accomplished by modifying the likelihood score of each segment by the likelihood score of its phonetic context, the latter representing the degree of similarity of the context to that of a candidate word in the lexicon. Higher enhancement is given to the segmental likelihood score if the likelihood score of its context is higher. The advantage of the proposed scheme over conventional schemes is demonstrated by an experiment on constructing a word lattice using connected speech of Japanese uttered by a male speaker. The result indicates that the scheme is especially effective in giving correct recognition in cases where there are two or more candidate words which are almost equal in raw segmental likelihood scores.
Kenya UOMORI Shinji MURAKAMI Mitsuho YAMADA Mitsuru FUJII Hiroshi YOSHIMATSU Norihito NAKANO Hitoshi HONGO Jiro MIYAZAWA Keiichi UENO Ryo FUKATSU Naohiko TAKAHATA
To clarify the stereopsis disturbance in patients with Alzheimer's disease (AD), we analyzed binocular eye movement when subjects shifted their gaze between targets at different depths. The subjects were patients with Alzheimer's disease, multi-infarct dementia (MID), or olivopontocerebellar atrophy (OPCA), and healthy controls. Targets were arranged in two ways: along the median plane and asymmetrically crossing the median plane, at distances from the eyes of 1000 mm and 300 mm. When the targets were switched at the onset of a beep, the subjects shifted their gaze to the lit target. The experiment was conducted in a dimly lit room whose structure was capable of providing good binocular cues for depth. In AD subjects, especially in the subjects whose symptoms were moderate (advanced stage), vergence was limited and the change in the convergence angle was small, unstable, and non-uniform. These results are different from those of other patients (MID and OPCA) or healthy controls and suggest a disturbance of stereopsis in the parietal lobe, where AD patients typically have dysfunctions.
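To give a sense of the size of the vergence change involved, the following sketch computes the geometric convergence angle for a target on the median plane at the two distances given above. The interpupillary distance of 65 mm is an assumed typical value, not a figure from the study, and the function name is hypothetical.

```python
import math


def convergence_angle_deg(distance_mm, ipd_mm=65.0):
    """Angle between the two lines of sight, in degrees, when both eyes
    fixate a target on the median plane at the given distance.
    ipd_mm is an assumed interpupillary distance."""
    return math.degrees(2.0 * math.atan((ipd_mm / 2.0) / distance_mm))


far = convergence_angle_deg(1000.0)
near = convergence_angle_deg(300.0)
print(f"far target:  {far:.2f} deg")
print(f"near target: {near:.2f} deg")
print(f"required change in convergence: {near - far:.2f} deg")
```

Under this assumption, a full gaze shift between the two targets requires a change in convergence angle of several degrees, which is the quantity reported as small and unstable in the AD subjects.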