To cope with complicated interference scenarios in realistic acoustic environments, supervised deep neural networks (DNNs) have been investigated to estimate different user-defined targets. Such techniques can be broadly categorized into magnitude estimation and time-frequency mask estimation. Further, a mask such as the Wiener gain can be estimated directly or derived from the estimated interference power spectral density (PSD) or the estimated signal-to-interference ratio (SIR). In this paper, we propose to incorporate multi-task learning into DNN-based single-channel speech enhancement by using the speech presence probability (SPP) as a secondary target to assist the target estimation in the main task. Domain-specific information is shared between the two tasks to learn a more generalizable representation. Since the performance of the multi-task network is sensitive to the weights in the loss function, homoscedastic uncertainty is introduced to learn the weights adaptively, which is shown to outperform fixed weighting. Simulation results show that the proposed multi-task scheme improves overall speech enhancement performance compared to conventional single-task methods, and that joint direct mask and SPP estimation yields the best performance among all the considered techniques.
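The homoscedastic-uncertainty weighting mentioned above can be sketched in a few lines: each task loss is scaled by a learned log-variance term, following the standard formulation L = Σᵢ exp(−sᵢ)·Lᵢ + sᵢ with sᵢ = log σᵢ². A minimal illustration (the per-task loss values and log-variances below are hypothetical, not the paper's):

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses using homoscedastic uncertainty.

    Each task i contributes exp(-s_i) * L_i + s_i, where s_i = log(sigma_i^2)
    is a trainable parameter. Tasks with high estimated uncertainty are
    down-weighted automatically, replacing hand-tuned fixed weights.
    """
    return sum(math.exp(-s) * loss + s
               for loss, s in zip(task_losses, log_vars))

# Hypothetical losses for a mask-estimation main task and an SPP secondary task.
total = uncertainty_weighted_loss([0.8, 0.3], [0.0, 0.0])
# With s_i = 0 the combination reduces to a plain sum of the two losses.
```

In a real network the `log_vars` would be trainable parameters updated by backpropagation along with the network weights; this sketch only shows the form of the combined objective.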
We propose a new method for improving the recognition performance of phonemes, speech emotions, and music genres using multi-task learning. When tasks are closely related, multi-task learning can improve the performance of each task by learning a common feature representation for all tasks. However, the recognition tasks considered in this study demand different input signals, speech and music, at different time scales, resulting in input features with different characteristics. In addition, no training dataset with multiple labels for all information sources is available. Considering these issues, we conduct multi-task learning in a sequential training process using input features, each with a single label for one information source. A comparative evaluation confirms that the proposed multi-task learning method provides higher performance for all recognition tasks than the conventional approach of learning each task individually.
Noriyuki TONAMI Keisuke IMOTO Ryosuke YAMANISHI Yoichi YAMASHITA
Sound event detection (SED) and acoustic scene classification (ASC) are important research topics in environmental sound analysis. Many research groups have addressed SED and ASC using neural-network-based methods, such as the convolutional neural network (CNN), recurrent neural network (RNN), and convolutional recurrent neural network (CRNN). Conventional methods address SED and ASC separately even though sound events and acoustic scenes are closely related to each other. For example, in the acoustic scene “office,” the sound events “mouse clicking” and “keyboard typing” are likely to occur. Therefore, information on sound events and acoustic scenes is expected to be mutually beneficial for SED and ASC. In this paper, we propose multitask learning for joint analysis of sound events and acoustic scenes, in which the parts of the networks holding information common to sound events and acoustic scenes are shared. Experimental results obtained using the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets indicate that the proposed method improves the performance of SED and ASC by 1.31 and 1.80 percentage points, respectively, in terms of the F-score, compared with the conventional CRNN-based method.
Longfei CHEN Yuichi NAKAMURA Kazuaki KONDO Dima DAMEN Walterio MAYOL-CUEVAS
We propose a novel framework for integrating beginners' machine operational experiences with those of experts to obtain a detailed task model. Beginners can provide valuable information for operation guidance and task design; for example, from the operations that are easy or difficult for them, the mistakes they make, and the strategies they tend to choose. However, beginners' experiences often vary widely and are difficult to integrate directly. Thus, we consider an operational experience as a sequence of hand-machine interactions at hotspots. Then, a few experts' experiences and a sufficient number of beginners' experiences are unified using two aggregation steps that align and integrate the sequences of interactions. We applied our method to more than 40 experiences of a sewing task. The results demonstrate good potential for modeling the task and obtaining its important properties.
Kazuya URAZOE Nobutaka KUROKI Yu KATO Shinya OHTANI Tetsuya HIROSE Masahiro NUMA
This paper presents an image super-resolution technique using a convolutional neural network (CNN) and multi-task learning for multiple image categories. The image categories include natural, manga, and text images, whose features differ from each other. However, conventional super-resolution CNNs are trained on a single category; if the input image category differs from that of the training images, super-resolution performance is degraded. There are two possible solutions for managing multiple categories with conventional CNNs. The first is to prepare a CNN for every category. This solution, however, requires a category classifier to select the appropriate CNN. The second is to learn all categories with a single CNN, in which case the CNN cannot optimize its internal behavior for each category. Therefore, this paper presents a super-resolution CNN architecture for multiple image categories. The proposed CNN has two parallel outputs: a high-resolution image and a category label. The main CNN for the high-resolution image is a standard three-layer convolutional architecture, and the sub-network for the category label branches out from its middle layer and consists of two fully-connected layers. This architecture can simultaneously learn the high-resolution image and its category using multi-task learning, and the category information is used to optimize the super-resolution. In an applied setting, the proposed CNN can automatically estimate the input image category and change its internal behavior accordingly. Experimental results for 2× image magnification show that the average peak signal-to-noise ratio of the proposed method is approximately 0.22 dB higher than that of conventional super-resolution, with no increase in processing time or parameter count. We have confirmed that the proposed method is useful when the input image category varies.
Keisuke MAEDA Kazaha HORII Takahiro OGAWA Miki HASEYAMA
A multi-task convolutional neural network (CNN) achieving both high performance and interpretability via attribute estimation is presented in this letter. Our method can provide interpretations of the classification results of CNNs by outputting, in a middle layer, attributes that explain elements of objects as the basis for the CNN's judgement. Furthermore, the proposed network uses the estimated attributes for the subsequent prediction of classes. Consequently, a novel multi-task CNN with improvements in both interpretability and classification performance is realized.
In this paper, we present a joint multi-patch and multi-task convolutional neural networks (JMM-CNNs) framework to learn a more descriptive and robust face representation for face recognition. In the proposed JMM-CNNs, a set of multi-patch CNNs and a feature fusion network are constructed to learn and fuse global and local facial features; a multi-task learning algorithm, comprising a face recognition task and a pose estimation task, is then applied to the fused feature to obtain a pose-invariant face representation for the face recognition task. To further enhance the pose insensitivity of the learned face representation, we also introduce a similarity regularization term on the features of the two tasks as a regularization loss. Moreover, a simple but effective patch sampling strategy is applied so that the JMM-CNNs have an end-to-end network architecture. Experiments on the Multi-PIE dataset demonstrate the effectiveness of the proposed method, and we achieve competitive performance compared with state-of-the-art methods on Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and the MegaFace Challenge.
Toshiro NUNOME Suguru KAEDE Shuji TASAKA
In this paper, we propose a user-assisted QoS control scheme that utilizes media adaptive buffering to enhance QoE of audiovisual and haptic IP communications. The scheme consists of two modes: a manual mode and an automatic mode. It enables users to switch between these two modes according to their inclinations. We compare four QoS control schemes: the manual mode only, the automatic mode only, the switching scheme starting with the manual mode, and the switching scheme starting with the automatic mode. We assess the effects of the four schemes, user attributes, and tasks on QoE through a subjective experiment which provides information on users' behavior in addition to QoE scores. As a result of the experiment, we show that the user-assisted QoS control scheme can enhance QoE. Furthermore, we notice that the proper QoS control scheme depends on user attributes and tasks.
Named Entity Recognition (NER) systems are often realized by supervised methods such as CRFs and neural networks, which require large amounts of annotated data. In domains where only small annotated training sets are available, multi-domain or multi-task learning methods are often used. In this paper, we explore methods that use the news domain and the Chinese Word Segmentation (CWS) task to improve the performance of Chinese named entity recognition in the Weibo domain. We first propose two baseline models combining multi-domain and multi-task information. The two baseline models share information between different domains and tasks simply through shared parameters. Then, we propose a Double ADVersarial model (DoubADV model). The model uses two adversarial networks considering the shared and private features in different domains and tasks. Experimental results show that our DoubADV model outperforms the baseline models and achieves state-of-the-art performance compared with previous works in the multi-domain and multi-task situation.
Hideaki OHASHI Yasuhito ASANO Toshiyuki SHIMIZU Masatoshi YOSHIKAWA
Peer assessments, in which people review the works of peers and have their own works reviewed by peers, are useful for assessing homework. In conventional peer assessment systems, works are usually allocated to people before the assessment begins; therefore, if people drop out (abandoning reviews) during an assessment period, an imbalance occurs between the number of works a person reviews and the number of peers who have reviewed that person's work. When the total imbalance increases, some people who diligently complete reviews may suffer from a lack of reviews and be discouraged from participating in future peer assessments. Therefore, in this study, we adopt a new adaptive allocation approach in which people are allocated works to review only when requested, and we propose an algorithm for allocating works to people that reduces the total imbalance. To show the effectiveness of the proposed algorithm, we provide an upper bound on the total imbalance that the proposed algorithm yields. In addition, we extend the above algorithm to consider reviewing ability. The extended algorithm avoids the problem that only unskilled (or only skilled) reviewers are allocated to a given work. We show the effectiveness of the two proposed algorithms compared with existing algorithms through experiments using simulation data.
Yuki KUROSAWA Shinya MOCHIDUKI Yuko HOSHINO Mitsuho YAMADA
We measured eye movements at gaze points while subjects performed calculation tasks and examined the relationship between these eye movements and the fatigue and/or internal state of a subject during the tasks. The results suggest that the fatigue and/or internal state of a subject affects eye movements at gaze points, and that these states can be measured in real time from eye movements at gaze points.
Takafumi HIGASHI Hideaki KANAI
To improve the cutting skills of learners, we developed a method for improving the skill involved in creating paper cuttings based on a steering task in the field of human-computer interaction. We made patterns using the white and black boundaries that make up a picture. The index of difficulty (ID) is a numerical value based on the width and distance terms of the steering law. First, we evaluated novice and expert pattern-cutters, measuring their movement time (MT), error rate, and compliance with the steering law, and confirmed that MT and error rate are affected by pattern width and distance. Moreover, we quantified the skills of novices and experts using ID- and MT-based models. We then observed changes in the cutting skills of novices who practiced with various widths and evaluated the impact of the difficulty level on skill improvement. Patterns considered to be moderately difficult for novices led to a significant improvement in skills.
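The steering-law quantities used above are easy to compute: for a straight path of constant width, the index of difficulty is ID = D/W, and movement time follows the linear model MT = a + b·ID. A minimal sketch (the regression coefficients below are hypothetical placeholders, not values fitted in the study):

```python
def steering_id(distance, width):
    """Index of difficulty for a straight tunnel of constant width: ID = D / W."""
    return distance / width

def predicted_mt(distance, width, a, b):
    """Steering-law movement-time model: MT = a + b * ID."""
    return a + b * steering_id(distance, width)

# A 100 mm cut of width 2 mm is five times harder than one of width 10 mm.
assert steering_id(100, 2) == 5 * steering_id(100, 10)
```

In practice `a` and `b` are obtained by linear regression of measured MT against ID for a given person, which is what allows skill levels of novices and experts to be compared numerically.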
Takashi NAKADA Hiroyuki YANAGIHASHI Kunimaro IMAI Hiroshi UEKI Takashi TSUCHIYA Masanori HAYASHIKOSHI Hiroshi NAKAMURA
Near real-time periodic tasks, which are popular in multimedia streaming applications, have deadline periods that are longer than their input intervals thanks to buffering. For such applications, conventional frame-based scheduling cannot realize optimal schedules because of its shortsighted deadline assumptions. To realize globally energy-efficient execution of these applications, we propose a novel task scheduling algorithm that takes advantage of the long deadline period. We confirm that our approach exploits the longer deadline period and reduces average power consumption by up to 18%.
Hand-dorsa vein recognition is addressed based on the convolutional activations of a pre-trained deep convolutional neural network (DCNN). Specifically, a novel task-specific cross-convolutional-layer pooling is proposed to obtain a more representative and discriminative feature representation. Rigorous experiments on a self-established database achieve state-of-the-art recognition results, demonstrating the effectiveness of the proposed model.
In recent years, the rise of healthy eating has led to various food management applications with image recognition functions that record everyday meals automatically. However, most of the image recognition functions in existing applications are not directly applicable to multiple-dish food photos and cannot automatically estimate food calories. Meanwhile, image recognition methodology has advanced greatly with the advent of the Convolutional Neural Network (CNN), which has improved the accuracy of various image recognition tasks such as classification and object detection. We therefore propose CNN-based food calorie estimation for multiple-dish food photos. Our method estimates dish locations and food calories simultaneously through multi-task learning of food dish detection and food calorie estimation with a single CNN, and this simultaneous estimation in a single network is expected to achieve high speed and a small network size. Because there is currently no dataset of multiple-dish food photos annotated with both bounding boxes and food calories, in this work we use two types of datasets alternately for training the single CNN: multiple-dish food photos annotated with bounding boxes and single-dish food photos annotated with food calories. Our results show that the multi-task method achieves higher accuracy, higher speed, and a smaller network size than a sequential model of food detection followed by food calorie estimation.
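The alternating use of two differently annotated datasets can be sketched as a simple batch-interleaving scheme, where each batch carries only the loss its dataset supports (detection or calorie regression). The batch names below are hypothetical placeholders; this is an illustration of the alternation idea, not the authors' training code:

```python
from itertools import chain, zip_longest

def interleave(det_batches, cal_batches):
    """Alternate batches from a bounding-box-annotated dataset and a
    calorie-annotated dataset so that a single network is trained on
    both tasks in turn; leftover batches from the longer list are kept."""
    mixed = chain.from_iterable(zip_longest(det_batches, cal_batches))
    return [b for b in mixed if b is not None]

order = interleave(["det1", "det2"], ["cal1", "cal2", "cal3"])
# order == ["det1", "cal1", "det2", "cal2", "cal3"]
```

During training, a detection batch would update only the detection head's loss and a calorie batch only the calorie head's loss, while the shared backbone is updated by both.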
Takuya KUWAHARA Takayuki KURODA Manabu NAKANOYA Yutaka YAKUWA Hideyuki SHIMONISHI
As IT systems, including network systems using SDN/NFV technologies, become large-scale and complicated, the cost of system management also increases rapidly. Network operators have to maintain their workflows in constructing and consistently updating such complex systems, so it is desirable to automate the management tasks involved in generating system update plans. Declarative system update with state-space search is a promising approach to enabling this automation; however, current methods are not scalable enough for practical systems. In this paper, we propose a novel heuristic approach that greatly reduces the computation time needed to solve system update procedures for practical systems. Our heuristic accounts for structural bottlenecks of the system update and advances the search so as to resolve the bottlenecks of the current system state. This paper includes the following contributions: (1) a formal definition of a novel heuristic function specialized to system update for the A* search algorithm, (2) proofs that our heuristic function is consistent, i.e., that the A* algorithm with our heuristic returns a correct optimal solution and can avoid repeated expansion of nodes in the search space, and (3) results of a performance evaluation of our heuristic. We evaluate the proposed algorithm in two cases: upgrading a running hypervisor and a rolling update of running VMs. The results show that the computation time to solve a system update plan for a system with 100 VMs does not exceed several minutes, whereas the conventional algorithm is applicable only to very small systems.
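The benefit of a consistent heuristic claimed above (each node expanded at most once, result still optimal) can be illustrated with a generic A* sketch. The toy graph and zero heuristic below are stand-ins, not the authors' system-update state space or bottleneck heuristic:

```python
import heapq

def a_star(start, goal, neighbors, h):
    """Generic A* search.

    If h is consistent (h(u) <= cost(u, v) + h(v) for every edge u->v),
    each node needs to be expanded at most once, and the cost returned
    on reaching the goal is optimal.
    """
    dist = {start: 0}
    frontier = [(h(start), start)]  # priority = g + h
    closed = set()
    while frontier:
        _, u = heapq.heappop(frontier)
        if u == goal:
            return dist[u]
        if u in closed:
            continue  # consistency guarantees re-expansion is unnecessary
        closed.add(u)
        for v, cost in neighbors(u):
            nd = dist[u] + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(frontier, (nd + h(v), v))
    return None  # goal unreachable

# Toy weighted graph; h = 0 is trivially consistent.
graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}
shortest = a_star("a", "c", lambda u: graph[u], lambda u: 0)
# shortest == 2 (path a -> b -> c)
```

A stronger consistent heuristic, such as the structural-bottleneck function the paper defines, prunes far more of the frontier than h = 0 while preserving the same optimality guarantee.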
Longfei CHEN Yuichi NAKAMURA Kazuaki KONDO Walterio MAYOL-CUEVAS
This paper presents an approach to analyze and model tasks of machines being operated. The executions of the tasks were captured through egocentric vision. Each task was decomposed into a sequence of physical hand-machine interactions, which are described with touch-based hotspots and interaction patterns. Modeling the tasks was achieved by integrating the experiences of multiple experts and using a hidden Markov model (HMM). Here, we present the results of more than 70 recorded egocentric experiences of the operation of a sewing machine. Our methods show good potential for the detection of hand-machine interactions and modeling of machine operation tasks.
Chanyoung OH Saehanseul YI Youngmin YI
As energy efficiency has become a major design constraint or objective, heterogeneous manycore architectures have emerged as mainstream target platforms, not only in server systems but also in embedded systems. Manycore accelerators such as GPUs are also becoming popular in embedded domains, as are heterogeneous CPU cores. However, as the number of cores in an embedded GPU is far smaller than that of a server GPU, it is important to utilize both the heterogeneous multi-core CPUs and the GPU to achieve the desired throughput with minimal energy consumption. In this paper, we present a case study of mapping LBP-based face detection onto a recent CPU-GPU heterogeneous embedded platform, exploiting both task parallelism and data parallelism to achieve maximal energy efficiency under a real-time constraint. We first present the parallelization technique of each task for GPU execution; we then propose performance and energy models for both task-parallel and data-parallel execution on heterogeneous processors, which are used in design space exploration for the optimal mapping. The design space is huge, since not only processor heterogeneity such as CPU-GPU and big.LITTLE but also the various data partitioning ratios for data-parallel execution on these heterogeneous processors are considered. In our case study of LBP face detection on the Exynos 5422, the estimation errors of the proposed performance and energy models were on average -2.19% and -3.67%, respectively. By systematically finding the optimal mappings with the proposed models, we achieved 28.6% less energy consumption than the manual mapping while still meeting the real-time constraint.
Liang CHEN Dongyi CHEN Xiao CHEN
Touch screens have become the mainstream manipulation technique on handheld devices. However, their innate limitations, e.g., the occlusion problem and the fat finger problem, lower the user experience in many usage scenarios on handheld displays. Back-of-device interaction, which makes use of input units on the rear of a device, is one of the most promising approaches to addressing these problems. In this paper, we present the findings of a user study in which we explored users' pointing performance with two types of touch input on handheld devices. The results indicate that front-of-device touch input is on average about twice as fast as back-of-device touch input, but with higher error rates, especially when acquiring narrower targets. Based on the results of our study, we argue that, on the premise of keeping the functionalities and layouts of current mainstream user interfaces, back-of-device touch input should be treated as a supplement to front-of-device touch input rather than a replacement.
This paper discusses VDT syndrome from the point of view of the viewing distance between a computer screen and the user's eyes. We conducted a series of experiments to show the impact of the viewing distance on task performance. In the experiments, two viewing distances, 50 cm and 350 cm, with the same viewing angle of 30 degrees, were considered. The results show that the long viewing distance enables people to manipulate the mouse more slowly, more correctly, and more precisely than the short distance.