Seiichi SERIKAWA Teruo SHIMOMURA
A new gloss-extracting method is proposed in this study. A spatial filter with variable resolution is used for the extraction of glossiness. Various spheres and cylinders with curvature radii from 4 to mm are used as the specimens. In all samples, a strong correlation, with a correlation coefficient of more than 0.98, has been observed between psychological glossiness Gph perceived by the human eye and glossiness Gfm extracted by this method. This method is useful for plane specimens as well as spherical and cylindrical ones.
The potential attenuation process of a charged human body (HB) is analyzed. A two-dimensional circuit model is presented for predicting the potential attenuation characteristics of an HB charged on the floor. The theoretical equation for the HB potential is derived in closed form in the Laplace transform domain, and the numerical inverse Laplace transform is used to compute it. The half-life, or relaxation time, of the decaying HB potential is numerically examined with respect to the electrical parameters of shoes. An experiment is also conducted to verify the validity of the computed results.
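The computation described above, inverting a Laplace-domain potential numerically, can be sketched with the Gaver-Stehfest algorithm. The single-RC discharge model and all parameter values below (body capacitance, shoe leakage resistance, initial voltage) are illustrative assumptions, not the paper's two-dimensional circuit model:

```python
import math

def stehfest_invert(F, t, N=12):
    """Gaver-Stehfest numerical inverse Laplace transform of F(s) at time t."""
    h = math.log(2.0) / t
    total = 0.0
    for k in range(1, N + 1):
        v = 0.0
        for j in range((k + 1) // 2, min(k, N // 2) + 1):
            v += (j ** (N // 2) * math.factorial(2 * j)) / (
                math.factorial(N // 2 - j) * math.factorial(j)
                * math.factorial(j - 1) * math.factorial(k - j)
                * math.factorial(2 * j - k))
        total += (-1) ** (k + N // 2) * v * F(k * h)
    return h * total

# Hypothetical lumped parameters: body capacitance C, shoe leakage resistance R.
R, C, V0 = 1.0e9, 100.0e-12, 10.0e3      # ohms, farads, volts (illustrative)
tau = R * C                               # relaxation time of the simple model
F = lambda s: V0 / (s + 1.0 / tau)        # HB potential in the Laplace domain

t = 0.05
v_num = stehfest_invert(F, t)             # numerical inversion
v_exact = V0 * math.exp(-t / tau)         # closed-form check for this model
half_life = tau * math.log(2.0)           # time for the potential to halve
```

For this smooth exponential decay the Stehfest inversion agrees with the closed form to many digits; the same routine applies when only the Laplace-domain expression is available in closed form.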
Hisako IGARASHI Jun NORITAKE Nobuyasu FURUTA Kuniharu SHINDO Kiyoyuki YAMAZAKI Katsuro OKAMOTO Atsuya YOSHIDA Takami YAMAGUCHI
We are studying a novel concept for an on-line hospital system that uses a virtual environment, called the "Hyper Hospital." The "Hyper Hospital" is a medical care system constructed in a distributed manner on an electronic information network, using virtual reality (VR) as a human interface. In the present report, we studied the physiological and psychological responses of healthy subjects induced by use of the VR system, in terms of fatigue. Twenty healthy young male subjects were exposed to the virtual reality system and performed psychological tasks with a virtual nurse for 30 minutes. Several parameters of physiological, psychological, and subjective fatigue were measured. None of the physiological or psychological parameters, such as urinary catecholamine release and ECG, showed significant fatigue induced by our VR system. However, a standard questionnaire did reveal some kinds of subjective fatigue, which we take to indicate a direction of improvement for our VR system.
Considering the trend towards adopting high-efficiency picture coding schemes in digital broadcasting services, we investigate objective picture quality scales for evaluating digitally encoded still and moving pictures. First, our study on the objective picture quality scale for high-definition still pictures coded by the JPEG scheme is summarized. This scale is derived from consideration of the following distortion factors: 1) noise weighted by the spatial frequency characteristics and masking effects of human vision, 2) block distortion, and 3) mosquito noise. Next, an objective picture quality scale for motion pictures of standard television coded by the hybrid DCT scheme is studied. In addition to the above distortion factors, the temporal frequency characteristics of vision are also considered. Furthermore, since all of these distortions vary over time in motion pictures, methods for determining a single objective picture quality value for this time-varying distortion are examined. As a result, a generally applicable objective picture quality scale is obtained that correlates extremely well with the subjective picture quality scale for both still and motion pictures, irrespective of picture content. Having an objective scale facilitates automated picture quality evaluation and control.
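The first distortion factor, noise weighted by the spatial frequency sensitivity of vision, can be sketched as follows. The DFT-bin weights standing in for the contrast sensitivity function are invented illustrative values, not the weights derived in the study:

```python
import cmath

# Hypothetical CSF-like weights per DFT bin of an 8-sample error signal
# (band-pass: reduced sensitivity at DC and at the highest frequency).
CSF = [0.5, 1.0, 1.0, 0.6, 0.2, 0.6, 1.0, 1.0]

def dft(x):
    """Plain O(N^2) discrete Fourier transform."""
    n_total = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_total)
                for n in range(n_total)) for k in range(n_total)]

def weighted_noise_power(error):
    """Noise power after weighting the error spectrum by visual sensitivity."""
    X = dft(error)
    return sum(CSF[k] * abs(X[k]) ** 2 for k in range(len(X))) / len(X)

# An error concentrated at the highest frequency has raw power 8.0 (Parseval),
# but is weighted down to 0.2 * 8.0 = 1.6 by the low sensitivity at that bin.
err_high = [1.0, -1.0] * 4
wp = weighted_noise_power(err_high)
```

The idea is that coding noise the eye barely sees should barely lower the objective score; a full implementation would use a 2-D spectrum and add the masking, block-distortion and mosquito-noise terms.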
Atsuya YOSHIDA Takami YAMAGUCHI Kiyoyuki YAMAZAKI
The "Hyper Hospital" is a novel medical care system which will be constructed on an electronic information network. The human interface of the Hyper Hospital, based on modern virtual reality technology, is expected to enhance patients' ability to heal by providing computer-supported on-line visual consultations. In order to investigate the effects and features of on-line visual consultations in the Hyper Hospital, we conducted an experiment to clarify the influence of electronic interviews on the talking behavior of interviewees in the context of simulated doctor-patient interactions. Four types of distant-confrontation interviews were conducted with voluntary subjects, and their verbal and non-verbal responses were analyzed from a behavioral point of view. The interviews comprised three types of electronic media-mediated interviews and one live face-to-face interview. In the media-mediated interviews, both the latency and the duration of interviewees' utterances in answering questions tended to increase compared with the live face-to-face interviews. These results suggest that interviewees became more verbose or talkative in the media-mediated interviews than in the live interviews. However, the interviewees' psychological tension was generally heightened in the media-mediated interviews, as suggested by the delayed initiation of conversations compared to conventional face-to-face interviews. We also discuss the applicability of media-mediated interviews conducted by an electronic doctor, which we are studying as a functional unit of our Hyper Hospital, a network-based virtual reality space for medical care.
Akitoshi TSUKAMOTO Chil-Woo LEE Saburo TSUJI
This paper describes a new method for pose estimation of a human face moving abruptly in the real world. The virtue of this method is that it uses a very simple calculation, disparity among multiple model images, and does not rely on any facial features such as facial organs. Since the disparity between the input image and a model image increases monotonically with the change of facial pose (view direction), we can estimate the pose of the face in the input image by calculating the disparity against various model images of the face. To overcome the weakness arising from changes in facial patterns due to individuality or expression, the first model image of the face is detected by employing a qualitative feature model of the frontal face. This model contains statistical information about brightness, observed from many facial images, and is used in a model-based approach. These features are examined everywhere in the input image to calculate the "faceness" of each region, and the region with the highest "faceness" is taken as the initial model image of the face. To obtain new model images for other poses of the face, temporary model images are synthesized by texture mapping using a previous model image and a 3-D graphic model of the face. When the pose changes, the most appropriate region for a new model image is found by calculating disparity using the temporary model images. In this serial process, the obtained model images are used not only as templates for tracking the face in the following image sequence, but also as texture images for synthesizing new temporary model images. The acquired model images are accumulated in memory space, and their permissible extent of rotation or scale change is evaluated. In the latter part of the paper, we show experimental results on the robustness of the qualitative facial model used to detect the frontal face, and on the pose estimation algorithm tested on a long sequence of real images including a moving human face.
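The core idea, picking the pose whose model image has minimum disparity to the input, can be sketched in a few lines. The tiny 2x2 "images", the pose labels, and the use of sum-of-squared-differences as the disparity measure are illustrative assumptions:

```python
def disparity(img_a, img_b):
    """Sum of squared differences between two equally sized grey-level images."""
    return sum((a - b) ** 2
               for ra, rb in zip(img_a, img_b) for a, b in zip(ra, rb))

def estimate_pose(input_img, models):
    """Return the pose label whose model image has minimum disparity."""
    return min(models, key=lambda pose: disparity(input_img, models[pose]))

# Toy 2x2 model images for two poses (hypothetical intensity values).
models = {"frontal": [[200, 200], [100, 100]],
          "profile": [[50, 200], [50, 100]]}
observed = [[60, 195], [55, 105]]
pose = estimate_pose(observed, models)   # closest to the "profile" model
```

Because disparity grows monotonically with pose change, the minimum over the accumulated model set localizes the current view direction without detecting any facial organs.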
Seiichiro DAN Toshiyasu NAKAO Tadahiro KITAHASHI
We can understand and recover a scene even from a single picture or line drawing. A number of methods have been developed for this problem, but they have scarcely addressed scenes of multiple objects, although they can recognize the three-dimensional shape of each object. In this paper, addressing this problem, we describe a method for determining the configurations of multiple objects. The method employs the assumption of coplanarity and the constraint of occlusion: the coplanarity assumption generates candidate configurations of the multiple objects, and the occlusion constraint prunes impossible ones. By combining this method with a method of shape recovery for individual objects, we have implemented a system that acquires three-dimensional information of a scene including multiple objects from a monocular image.
Noriko SUZUKI Taroh SASAKI Ryuji KOHNO Hideki IMAI
This paper proposes and investigates an intelligent error-controlling scheme that allocates protection according to the differing importance of segments of information. In particular, the scheme is designed for facial images encoded by model-based coding, a kind of intelligent compression coding. Intelligent communication systems must treat the contents of transmitted information with extremely high compression and reliability. After highly efficient information compression by model-based coding, errors in the compressed information lead to severe semantic errors. The proposed scheme reduces the semantic errors of the information delivered to the receiver. In this paper, we consider an Action Unit (AU) as a segment of a model-based coded facial image and define an importance for each AU. According to its importance, an AU is encoded by an appropriate code among codes with different error-correcting capabilities. For encoding with different error-controlling codes, we use three constructions of unequal error protection (UEP) codes: one is the direct sum construction, and the other two are proposed constructions based on joint and double coding. These UEP codes can have a higher code rate than other UEP codes when the minimum Hamming distance is small. Using these UEP codes, the proposed intelligent error-controlling scheme protects information segment by segment so as to reduce semantic errors compared with a conventional scheme in which all information is uniformly protected by a single error-correcting code.
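The principle of unequal error protection can be illustrated with the simplest possible codes. Repetition codes of two strengths stand in here for the paper's direct-sum, joint and double-coding constructions; the AU bit values and the choice of rates 1/5 and 1/3 are illustrative assumptions:

```python
# Important AU bits get a rate-1/5 repetition code (corrects 2 errors per
# bit); less important bits get a rate-1/3 code (corrects 1 error per bit).

def encode(bits, reps):
    """Repeat each information bit `reps` times."""
    return [b for b in bits for _ in range(reps)]

def decode(coded, reps):
    """Majority-vote decode a repetition-coded bit stream."""
    return [1 if sum(coded[i:i + reps]) * 2 > reps else 0
            for i in range(0, len(coded), reps)]

important = [1, 0, 1]          # high-importance AU parameters (hypothetical)
minor = [0, 1]                 # low-importance ones
channel = encode(important, 5) + encode(minor, 3)

# Channel noise: two flips inside the first strongly protected symbol,
# one flip inside the first weakly protected symbol.
channel[0] ^= 1
channel[3] ^= 1
channel[15] ^= 1

rx_important = decode(channel[:15], 5)
rx_minor = decode(channel[15:], 3)
```

Both groups decode correctly here, but a double error would have destroyed a minor bit while the important bits survive; that asymmetry, bought at a lower overall rate for important segments, is the point of UEP.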
Saprangsit MRUETUSATORN Hirotsugu KINOSHITA Yoshinori SAKAI
This paper discusses a new image resolution conversion method which converts not only spatial resolution but also amplitude resolution. This method involves considering impulse responses of image devices and human visual characteristics, and can preserve high image quality. This paper considers a system that digitizes the multilevel input image with high spatial resolution and low amplitude resolution using an image scanner, and outputs the image with low spatial resolution and high amplitude resolution on a CRT display. The algorithm thus reduces the number of pixels while increasing the number of brightness levels. Since a CRT display is chosen as the output device, the distribution of each spot in the display, which is modeled as a Gaussian function, is taken as the impulse response. The output image is then expressed as the summation of various amplitudes of the impulse response. Furthermore, human visual perception, which bears a nonlinear relationship to the spatial frequency component, is simplified and modeled with a cascade combination of low-pass and high-pass filters. The output amplitude is determined so that the error between the output image and the input image, after passing through the visual perception filter, is minimized. According to the results of a simulation, it is shown that image quality can be largely preserved by the proposed method, while significant image information is lost by conventional methods.
Masaji YAMASHITA Koichi SHIMIZU Goro MATSUMOTO
To study the biological effects of the ion current commonly found under ultra-high-voltage DC transmission lines, a technique was developed to evaluate human exposure to the ion-current field. The technique is based on numerical analysis using the boundary element method. The difficulty of handling the space charge in the calculation was overcome by assuming a lumped-source ion current. The technique is applicable to a three-dimensionally complex object such as a human body. In comparison with theoretical values, the accuracy of the technique was evaluated to be satisfactory for our purposes. It was then applied to a human body in an ion-current field, and the distribution of the electric field along the body surface was obtained. The general characteristics of the field distribution were essentially the same as those without space charges; however, the strength of the field concentration was found to be significantly enhanced by the space charges. Further, the field exposure when a human body is charged by an ion current was evaluated. As the charged voltage increases, the position of the field concentration moves from the head toward the legs, and the shock of micro-sparks increases. This technique provides a useful tool for the study of the biological effects of ion-current fields and of their safety standards.
Seiichi SERIKAWA Teruo SHIMOMURA
Although gloss is ultimately a matter of human visual perception, several methods have been proposed for extracting glossiness from curved surfaces by machine rather than by human observation. Glossiness as defined in these methods, however, does not correspond with the psychological glossiness perceived by the human eye over the wide range from relatively low gloss to high gloss. In addition, the glossiness obtained by these methods changes remarkably when the curvature radius of a high-gloss object becomes larger than 10 mm, whereas psychological glossiness does not change. Furthermore, these methods are applicable only to spherical objects. A new method for extracting glossiness is proposed in this study. For the new definition of glossiness, a spatial filter which simulates the function of the human retina is utilized. The light intensity distribution of the curved object is convolved with the spatial filter. The maximum value Hmax of the convolved distribution has a high correlation with psychological glossiness Gph. From the relationship between Gph and Hmax, a new glossiness Gf is defined. The gloss-extraction equipment consists of a light source, a TV camera, an image processor and a personal computer. Cylinders with curvature radii of 3-30 mm are used as specimens in addition to spherical balls. For all specimens, a strong correlation, with a correlation coefficient of more than 0.97, has been observed between Gf and Gph over a wide range. The new glossiness Gf conforms to Gph even if the curvature radius is more than 10 mm. Based on these findings, this method proves useful for extracting the glossiness of spherical and cylindrical objects over a wide range from relatively low gloss to high gloss.
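The convolve-then-take-the-maximum step can be sketched in one dimension. The centre-surround kernel standing in for the retina-like spatial filter and the two luminance profiles are illustrative assumptions, not the filter or data of the study:

```python
# Hypothetical centre-surround kernel (excitatory centre, inhibitory surround).
KERNEL = [-0.25, -0.25, 1.0, -0.25, -0.25]

def convolve_valid(signal, kernel):
    """1-D convolution, keeping only fully overlapping (valid) positions."""
    n = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(n))
            for i in range(len(signal) - n + 1)]

def h_max(profile):
    """Maximum of the filtered luminance profile (the Hmax of the method)."""
    return max(convolve_valid(profile, KERNEL))

glossy = [1, 1, 2, 40, 2, 1, 1]      # sharp specular highlight
matte = [6, 7, 7, 7, 7, 7, 6]        # diffuse reflection, no sharp peak
```

A sharp highlight survives the centre-surround filtering and gives a large Hmax, while a flat diffuse profile is almost cancelled; Gf is then obtained from Hmax through its empirically fitted relationship with Gph.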
Yoshiyuki HARA Tsuneo NITTA Hiroyoshi SAITO Ken'ichiro KOBAYASHI
Text-to-speech synthesis (TTS) is currently one of the most important media conversion techniques. In this paper, we describe a Japanese TTS card developed for constructing a personal-computer-based multimedia platform, and a TTS software package developed for a workstation-based multimedia platform. Some applications of this hardware and software are also discussed. The TTS consists of a linguistic processing stage for converting text into phonetic and prosodic information, and a speech processing stage for producing speech from the phonetic and prosodic symbols. The linguistic processing stage uses morphological analysis, rewriting rules for accent movement and pause insertion, and other techniques to impart correct accentuation and a natural-sounding intonation to the synthesized speech. The speech processing stage employs the cepstrum method with consonant-vowel (CV) syllables as the synthesis unit to achieve clear and smooth synthesized speech. All of the processing for converting Japanese text (consisting of mixed Japanese Kanji and Kana characters) to synthesized speech is done internally on the TTS card. This allows the card to be used widely in various applications, including electronic mail and telephone service systems without placing any processing burden on the personal computer. The TTS software was used for an E-mail reading tool on a workstation.
The paper indicates the importance of suitability assessment in speech synthesis applications. Human factors involved in the use of synthetic speech are first discussed on the basis of an example of a newspaper company where synthetic speech is extensively used as an aid for proofreading manuscripts. Some findings obtained from perceptual experiments on subjects' preferences for the paralinguistic properties of synthetic speech are then described, focusing primarily on the suitability of pitch characteristics, speaker's gender, and speaking rate in a task where subjects proofread a printed text while listening to the speech. The paper finally argues the need for a flexible speech synthesis system which helps users create their own synthetic speech.
Recent non-von Neumann chip architectures are mainly classified into AI architectures and neural architectures. We focus on these two categories and introduce representatives of each with a brief history. As long as it remains language-oriented, the AI chip architecture can hardly escape from the von Neumann architecture in essence. The neural architecture, however, may yield an essentially new computer architecture once new device technologies support it. In particular, optoelectronics and quantum electronics promise a number of powerful technologies.
Tsuneo KATSUYAMA Hajime KAMATA Satoshi OKUYAMA Toshimitsu SUZUKI You MINAKUCHI Katsutoshi YANO
Broadband multimedia information environments are part of the next big advance in communications and computer technology. The use of multimedia infrastructures in offices is becoming very important. This paper deals with a service concept and human interfaces based on a paper metaphor. The proposed service offers the advantages of paper and eliminates the disadvantages. The power of multimedia's expressiveness, user interaction, and hypermedia technology are key points of our solution. We propose a system configuration for implementing the service/human interface.
Yoichi TAKEBAYASHI Hiroyuki TSUBOI Hiroshi KANAZAWA Yoichi SADAMOTO Hideki HASHIMOTO Hideaki SHINCHI
This paper describes a task-oriented speech dialogue system based on spontaneous speech understanding and response generation (TOSBURG). The system has been developed for a fast-food ordering task using speaker-independent, keyword-based spontaneous speech understanding. Its purpose being to understand the user's intention from spontaneous speech, the system consists of a noise-robust keyword spotter, a semantic keyword lattice parser, a user-initiated dialogue manager and a multimodal response generator. After noise-immune keyword spotting is performed, the spotted keyword candidates are analyzed by the keyword lattice parser to extract the semantic content of the input speech. Then, referring to the dialogue history and context, the dialogue manager interprets this semantic content. In cases where the interpretation is ambiguous or uncertain, the dialogue manager invites the user to verbally confirm the system's understanding of the speech input. The system's response to the user throughout the dialogue is multimodal; that is, several modes of communication (synthesized speech, text, animated facial expressions and ordered food items) are used to convey the system's state to the user. The object here is to emulate the multimodal interaction that occurs between humans, and so achieve more natural and efficient human-computer interaction. The real-time dialogue system has been constructed using two general-purpose workstations and four DSP accelerators (520 MFLOPS). Experimental results have shown the effectiveness of the newly developed speech dialogue system.
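The flow from spotted keywords to a confirming response can be sketched as toy slot filling. The vocabulary, slot names, and confirmation logic below are hypothetical stand-ins for the system's semantic keyword lattice parser and dialogue manager:

```python
# Hypothetical fast-food vocabulary for the sketch.
ITEMS = {"hamburger", "cheeseburger", "cola"}
NUMBERS = {"one": 1, "two": 2, "three": 3}

def understand(spotted_keywords):
    """Fill order slots from (keyword, score) candidates; ignore fillers."""
    order = {"item": None, "quantity": None}
    for word, score in spotted_keywords:
        if word in ITEMS and order["item"] is None:
            order["item"] = word
        elif word in NUMBERS and order["quantity"] is None:
            order["quantity"] = NUMBERS[word]
    return order

def next_response(order):
    """Ask for a missing slot, or ask the user to confirm a complete order."""
    if order["item"] is None:
        return "What would you like to order?"
    if order["quantity"] is None:
        return "How many %s?" % order["item"]
    return "Confirm: %d %s?" % (order["quantity"], order["item"])

order = understand([("uh", 0.2), ("two", 0.9), ("hamburger", 0.95)])
reply = next_response(order)
```

The real system additionally weighs keyword scores in a lattice, consults dialogue history, and renders the reply through several modalities at once; the sketch only shows why spotting keywords, rather than full recognition, suffices for a narrow ordering task.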
This paper presents unique specification environments for LOTOS, one of the FDTs (Formal Description Techniques) developed by ISO. We first discuss the large gap, in terms of syntax and semantics, between informal specifications at the early stage of specification design and formal specifications based on an FDT such as LOTOS. Thus far, this gap has been bridged by intelligent human work. In order to bridge it systematically, we have designed user-friendly specification environments for FDTs. The outlines of SEGL (Specification Environment for G-LOTOS), CBP (Concept-Based Programming environment) and MBP (Model-Based Programming environment) are described. The effectiveness of software development under such environments is demonstrated using application examples from OSI and non-OSI protocols.
Yoshinori KITAHARA Yoh'ichi TOHKURA
For speech output to serve as an ideal man-machine interface, emotion production is an important issue, not only for improving naturalness but also for achieving more sophisticated speech interaction between man and machine. Speech has two aspects: prosodic information and phonetic features. With a view to application in natural, high-quality speech synthesis, the role of prosody in speech perception has been studied. In this paper, the prosodic components which contribute to the expression of emotions and their intensity are identified by analyzing emotional speech and by conducting listening tests with synthetic speech. The analysis is performed by substituting the components of neutral speech (i.e., speech with no particular emotion) with those of emotional speech, preserving the temporal correspondence by means of DTW. It has been confirmed that the prosodic components, which comprise pitch structure, temporal structure and amplitude structure, contribute more to the expression of emotions than the spectral structure of speech does. The results of listening tests using prosody-substituted speech show that temporal structure is the most important for the expression of anger, while all three components matter much more for the intensity of anger. Pitch structure also plays a significant role in the expression of joy and sadness and their intensity. These results make it possible to convert neutral utterances into utterances expressing various emotions, and they can also be applied to controlling the emotional characteristics of speech in synthesis by rule.
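The temporal-correspondence step relies on dynamic time warping. A minimal DTW sketch is shown below; the pitch-contour values are illustrative, and a real prosody substitution would backtrack the warping path to map emotional frames onto neutral ones:

```python
def dtw_distance(a, b):
    """Classic DTW with absolute-difference local cost."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

neutral = [100, 110, 120, 110]                # hypothetical pitch contour (Hz)
emotional = [100, 100, 110, 120, 120, 110]    # same shape, stretched timing
d = dtw_distance(neutral, emotional)          # 0.0 despite the timing change
```

Because DTW absorbs the timing difference, the two contours align with zero residual cost, which is exactly what allows prosodic components to be swapped between utterances of different durations.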
The development of computers capable of handling complex objects requires nonverbal interfaces that can bidirectionally mediate nonverbal communication, including the gestures of both people and computers. Nonverbal expressions are powerful media for enriching and facilitating human-computer interaction when used as interface languages. Four gestural modes are appropriate for human-computer interaction: the sign, indication, illustration and manipulation modes. All these modes can be conveyed by a generalized gesture interface that has a specific processor for each mode. The basic component of the generalized gesture interface, a gesture dictionary, is proposed. The dictionary can accept sign and indicating gestures in which postures or body shapes are significant, pass their meaning to a computer, and display gestures from the computer. For this purpose it converts body shapes into gestural codes by means of two code systems and, moreover, performs bidirectional conversions among several gesture representations. The dictionary is applied to the translation of Japanese into sign language: it displays an actor who speaks given Japanese sentences through gestures of sign words and finger alphabets. The performance of this application confirms the adequacy and usefulness of the gesture dictionary.
The new notion of a "multiuser interface", an interface for groups working together in a shared workspace, originated from the expansion of CSCW research and the spread of the groupware concept. This paper introduces a new multiuser interface design approach based on the translucent video overlay technique. This approach was realized in the multimedia desktop conference system TeamWorkStation. TeamWorkStation demonstrates that the translucent video overlay technique can achieve two different goals: (1) fused overlay for realizing an open shared workspace, and (2) selective overlay for effectively using limited screen space. This paper first describes the concept of the open shared workspace and its implementation based on the fused overlay technique. The shared work window of TeamWorkStation is created by overlaying translucent individual workspace images. Each video layer is originally physically separated; however, because of the spatial relationships among marks on each layer, the set of overlaid layers provides users with sufficient semantics to fuse them into one image. The usefulness of this cognitive fusion was demonstrated through actual use in design sessions. Second, the problem of screen space limitation is described. To solve this problem, the idea of ClearFace, based on selective overlay, is introduced: translucent live face video windows are laid over the shared work window. Informal observations of experimental use in design sessions showed little difficulty in switching the focus of attention between the face images and the drawing objects; the theory of selective looking accounts for this flexible perception mechanism. Although users can see drawn objects behind a face without difficulty, we found that they hesitate to draw figures or write text over face images. Because of this behavior, we devised the "movable" face window strategy.
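At the pixel level, a translucent overlay is alpha blending of two video layers. The sketch below uses tiny grey-level "images" and an arbitrary alpha of 0.5; both are illustrative assumptions rather than TeamWorkStation's actual video mixing hardware:

```python
def overlay(face, work, alpha=0.5):
    """Blend two equally sized grey-level images; alpha weights the face layer."""
    return [[alpha * f + (1.0 - alpha) * w for f, w in zip(rf, rw)]
            for rf, rw in zip(face, work)]

face_layer = [[255.0, 0.0]]    # translucent face video layer
work_layer = [[0.0, 200.0]]    # shared work window layer
blended = overlay(face_layer, work_layer, alpha=0.5)
```

With alpha at 0.5 both layers remain visible in every pixel, which is what lets users read drawn marks through a face image; raising alpha makes the face layer dominate, lowering it favours the work window.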