Keyword Search Result

[Keyword] dataset (12 hits)

1-12 of 12 hits
  • Dataset of Functionally Equivalent Java Methods and Its Application to Evaluating Clone Detection Tools Open Access

    Yoshiki HIGO

    PAPER-Software System
    Publicized: 2024/02/21  Vol: E107-D No:6  Page(s): 751-760

    Modern high-level programming languages have rich grammars and allow the same functionality to be implemented in different ways. The authors believe that a large amount of code implementing the same functionality in different ways exists even in open source software, where the source code is publicly available, and that collecting such code yields a useful dataset for various studies in software engineering. In this study, we construct a dataset of pairs of Java methods that have the same functionality but different structures from approximately 314 million lines of source code. To construct this dataset, the authors used an automated test generation technique, EvoSuite. Test cases produced by automated test generation techniques have the property that they always succeed on the method they were generated from. Exploiting this property, the test cases generated from two methods were executed against each other to automatically determine whether the two methods behave the same to some extent. Pairs of methods for which all cross-executed test cases succeeded were then manually inspected to confirm that they are functionally equivalent. This paper also reports the results of an accuracy evaluation of code clone detection tools using the constructed dataset. The purpose of this evaluation is to assess how accurately code clone detection tools can find functionally equivalent methods, not to assess the accuracy of detecting ordinary clones. The constructed dataset is available on GitHub (https://github.com/YoshikiHigo/FEMPDataset).
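
    A minimal sketch of the cross-running idea described above, assuming toy Python functions and hand-written tests as stand-ins for Java methods and EvoSuite output; names are illustrative, not from the FEMPDataset.

      # Tests generated from one implementation are executed against another
      # implementation; a pair is flagged as a functional-equivalence candidate
      # only if every cross-run passes (candidates are then inspected manually).

      def abs_a(x: int) -> int:          # implementation A
          return x if x >= 0 else -x

      def abs_b(x: int) -> int:          # implementation B: same functionality, different structure
          return max(x, -x)

      # Stand-ins for generated tests: each records an input and the output
      # observed on the method the test was generated from.
      tests_from_a = [(5, abs_a(5)), (-3, abs_a(-3)), (0, abs_a(0))]
      tests_from_b = [(7, abs_b(7)), (-1, abs_b(-1))]

      def cross_run(tests, target) -> bool:
          """Return True iff the target method reproduces every expected output."""
          return all(target(x) == expected for x, expected in tests)

      # Candidate pair iff cross-running succeeds in both directions.
      is_candidate = cross_run(tests_from_a, abs_b) and cross_run(tests_from_b, abs_a)
      print("functional-equivalence candidate:", is_candidate)  # True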

  • Dataset Distillation Using Parameter Pruning Open Access

    Guang LI  Ren TOGO  Takahiro OGAWA  Miki HASEYAMA

    LETTER-Image
    Publicized: 2023/09/06  Vol: E107-A No:6  Page(s): 936-940

    In this study, we propose a novel dataset distillation method based on parameter pruning. The proposed method can synthesize more robust distilled datasets and improve distillation performance by pruning difficult-to-match parameters during the distillation process. Experimental results on two benchmark datasets show the superiority of the proposed method.
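
    A hedged sketch of one possible reading of the pruning step (not the authors' code): parameters that are hardest to match between the network trained on real data and the one trained on the distilled data are dropped before the matching loss is computed. The threshold and array shapes are illustrative assumptions.

      import numpy as np

      rng = np.random.default_rng(0)
      theta_real = rng.normal(size=1000)                            # parameters after training on real data
      theta_syn = theta_real + rng.normal(scale=0.1, size=1000)     # parameters after training on distilled data

      per_param_error = np.abs(theta_real - theta_syn)

      # Prune the 10% of parameters with the largest matching error.
      prune_ratio = 0.1
      threshold = np.quantile(per_param_error, 1.0 - prune_ratio)
      keep_mask = per_param_error <= threshold

      # Matching loss over the kept (easier-to-match) parameters; in a full
      # pipeline this loss would drive updates of the synthetic images.
      matching_loss = np.mean((theta_real[keep_mask] - theta_syn[keep_mask]) ** 2)
      print(f"kept {keep_mask.sum()} / {keep_mask.size} parameters, loss = {matching_loss:.4f}")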

  • GAN-SR Anomaly Detection Model Based on Imbalanced Data

    Shuang WANG  Hui CHEN  Lei DING  He SUI  Jianli DING

    PAPER-Data Engineering, Web Information Systems
    Publicized: 2023/04/13  Vol: E106-D No:7  Page(s): 1209-1218

    To address the low minority-class identification rate caused by data imbalance in anomaly detection tasks, a GAN-SR-based intrusion detection model for industrial control systems is proposed. First, to correct the imbalance of minority classes in the dataset, a generative adversarial network (GAN) processes the dataset and reconstructs new minority-class training samples. Second, high-dimensional features are extracted with a stacked nonsymmetric deep autoencoder (SNDAE) to achieve low reconstruction error while avoiding lengthy training times. After that, a random forest (RF) of decision trees is built, and intrusion detection is carried out using the features extracted by the SNDAE. Experimental validation on the UNSW-NB15, SWaT and Gas Pipeline datasets shows that the GAN-SR model outperforms SNDAE-SVM and SNDAE-KNN in terms of detection performance and stability.
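
    A minimal sketch of the described pipeline shape (rebalance minority classes, extract compact features, classify with a random forest). The GAN and SNDAE are replaced by simple stand-ins (random oversampling and PCA) so the example stays self-contained; the data are synthetic, not UNSW-NB15.

      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.decomposition import PCA
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import classification_report

      X, y = make_classification(n_samples=5000, n_features=40, weights=[0.95, 0.05], random_state=0)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

      # Step 1 (stand-in for the GAN): naively oversample the minority class.
      minority = X_tr[y_tr == 1]
      extra = minority[np.random.default_rng(0).integers(0, len(minority), size=len(minority) * 5)]
      X_bal = np.vstack([X_tr, extra])
      y_bal = np.concatenate([y_tr, np.ones(len(extra), dtype=int)])

      # Step 2 (stand-in for the SNDAE): unsupervised dimensionality reduction.
      feat = PCA(n_components=10, random_state=0).fit(X_bal)

      # Step 3: random forest on the extracted features.
      clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(feat.transform(X_bal), y_bal)
      print(classification_report(y_te, clf.predict(feat.transform(X_te))))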

  • Synthetic Scene Character Generator and Ensemble Scheme with the Random Image Feature Method for Japanese and Chinese Scene Character Recognition

    Fuma HORIE  Hideaki GOTO  Takuo SUGANUMA

    PAPER-Image Recognition, Computer Vision
    Publicized: 2021/08/24  Vol: E104-D No:11  Page(s): 2002-2010

    Scene character recognition has been intensively investigated for a couple of decades because it has great potential in many applications, including automatic translation, signboard recognition, and reading assistance for the visually impaired. However, scene characters are difficult to recognize with sufficient accuracy owing to various kinds of noise and image distortion. In addition, Japanese scene character recognition is more challenging and requires a large amount of character data for training because thousands of character classes exist in the language. Some researchers have proposed training data augmentation techniques using Synthetic Scene Character Data (SSCD) to compensate for the shortage of training data. In this paper, we propose the Random Filter, a new method for SSCD generation, and introduce an ensemble scheme with the Random Image Feature (RI-Feature) method. Since no large Japanese scene character dataset has been available for evaluating recognition systems, we have developed an open dataset, JPSC1400, which consists of a large number of real Japanese scene characters. It is shown that the accuracy is improved from 70.9% to 83.1% by introducing the RI-Feature method into the ensemble scheme.
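
    A hedged sketch of a random-filter-style augmentation for synthetic scene character data: convolve a clean glyph image with a small randomly generated kernel to mimic blur and degradation. Kernel size and normalization are illustrative assumptions, not the paper's exact formulation.

      import numpy as np
      from scipy.ndimage import convolve

      def random_filter(image: np.ndarray, kernel_size: int = 3, rng=None) -> np.ndarray:
          rng = rng or np.random.default_rng()
          kernel = rng.random((kernel_size, kernel_size))
          kernel /= kernel.sum()                     # keep overall brightness roughly stable
          return convolve(image, kernel, mode="reflect")

      # Toy 32x32 "glyph": a white vertical stroke on a black background.
      glyph = np.zeros((32, 32))
      glyph[:, 14:18] = 1.0
      augmented = [random_filter(glyph, rng=np.random.default_rng(seed)) for seed in range(10)]
      print(len(augmented), augmented[0].shape)      # 10 augmented variants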

  • A Novel Approach to Address External Validity Issues in Fault Prediction Using Bandit Algorithms

    Teruki HAYAKAWA  Masateru TSUNODA  Koji TODA  Keitaro NAKASAI  Amjed TAHIR  Kwabena Ebo BENNIN  Akito MONDEN  Kenichi MATSUMOTO

    LETTER-Software Engineering
    Publicized: 2020/10/30  Vol: E104-D No:2  Page(s): 327-331

    Various software fault prediction models have been proposed in the past twenty years. Many studies have compared and evaluated existing prediction approaches in order to identify the most effective ones. However, in most cases, such models and techniques provide varying results, and their outcomes do not yield the best possible performance across different datasets. This is mainly due to the diverse nature of software development projects; therefore, there is a risk that the selected models lead to inconsistent results across multiple datasets. In this work, we propose the use of bandit algorithms in cases where the accuracy of the models is inconsistent across multiple datasets. In the experiment discussed in this work, we used four conventional prediction models, tested them on three different datasets, and then selected the best possible model dynamically by applying bandit algorithms. We then compared our results with those obtained using majority voting. As a result, epsilon-greedy with ϵ=0.3 showed the best or second-best prediction performance compared with using only one prediction model and with majority voting. Our results show that bandit algorithms can provide promising outcomes when used in fault prediction.
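
    A minimal sketch of epsilon-greedy model selection in the spirit of the abstract (ϵ=0.3). The four models and their per-module reward signal (1 if a prediction was correct) are simulated; the reward function and accuracies are assumptions for illustration.

      import random

      def epsilon_greedy(n_arms, rewards_fn, n_rounds=1000, epsilon=0.3, seed=0):
          rng = random.Random(seed)
          counts = [0] * n_arms
          values = [0.0] * n_arms          # running mean reward per model ("arm")
          for _ in range(n_rounds):
              if rng.random() < epsilon:   # explore: try a random model
                  arm = rng.randrange(n_arms)
              else:                        # exploit: use the best model so far
                  arm = max(range(n_arms), key=lambda a: values[a])
              r = rewards_fn(arm, rng)
              counts[arm] += 1
              values[arm] += (r - values[arm]) / counts[arm]
          return values, counts

      # Simulated accuracies of four conventional prediction models.
      true_acc = [0.62, 0.71, 0.66, 0.58]
      values, counts = epsilon_greedy(4, lambda a, rng: 1.0 if rng.random() < true_acc[a] else 0.0)
      print("estimated accuracies:", [round(v, 3) for v in values])
      print("times each model was chosen:", counts)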

  • Analysis of Work Efficiency and Quality of Software Maintenance Using Cross-Company Dataset

    Masateru TSUNODA  Akito MONDEN  Kenichi MATSUMOTO  Sawako OHIWA  Tomoki OSHINO

    PAPER
    Publicized: 2020/08/31  Vol: E104-D No:1  Page(s): 76-90

    Software maintenance is an important activity in the software lifecycle. It does not only mean removing faults found after software release; software also needs extensions or modifications of its functions owing to changes in the business environment, and software maintenance covers these as well. To help users and service suppliers benchmark work efficiency for software maintenance, and to clarify the relationships between software quality, work efficiency, and the unit cost of staff, we used a dataset that includes 134 data points collected by the Economic Research Association in 2012, and analyzed the factors that affected the work efficiency of software maintenance. In the analysis, using a multiple regression model, we clarified the relationships between work efficiency and the programming language and productivity factors. To analyze the influence on quality, the relationships with the fault ratio were analyzed using correlation coefficients. The programming language and productivity factors affect work efficiency, whereas higher work efficiency and a higher unit cost of staff do not affect the quality of software maintenance.
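
    A hedged sketch of the kind of analysis described: a multiple regression of work efficiency on a programming-language indicator and a productivity factor, plus a simple correlation with the fault ratio. All data below are synthetic; the real 134-point dataset from the Economic Research Association is not reproduced, and the variable names are assumptions.

      import numpy as np
      import pandas as pd
      import statsmodels.api as sm

      rng = np.random.default_rng(1)
      n = 134
      df = pd.DataFrame({
          "uses_cobol": rng.integers(0, 2, n),             # illustrative language dummy
          "productivity_factor": rng.normal(1.0, 0.3, n),  # illustrative productivity score
      })
      df["work_efficiency"] = (0.8 - 0.2 * df["uses_cobol"]
                               + 0.5 * df["productivity_factor"]
                               + rng.normal(0, 0.1, n))

      X = sm.add_constant(df[["uses_cobol", "productivity_factor"]])
      model = sm.OLS(df["work_efficiency"], X).fit()
      print(model.summary())

      # Quality side of the analysis: correlation with a (synthetic) fault ratio.
      df["fault_ratio"] = rng.normal(0.05, 0.01, n)
      print(df[["work_efficiency", "fault_ratio"]].corr())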

  • Learning of Nonnegative Matrix Factorization Models for Inconsistent Resolution Dataset Analysis

    Masahiro KOHJIMA  Tatsushi MATSUBAYASHI  Hiroshi SAWADA

    INVITED PAPER
    Publicized: 2019/02/04  Vol: E102-D No:4  Page(s): 715-723

    Due to the need to protect personal information and the impracticality of exhaustive data collection, there is an increasing need to deal with datasets with various levels of granularity, such as user-individual data and user-group data. In this study, we propose a new method for jointly analyzing multiple datasets with different granularity. The proposed method is a probabilistic model based on nonnegative matrix factorization, which is derived by introducing latent variables that indicate the high-resolution data underlying the low-resolution data. Experiments on purchase logs show that the proposed method performs better than existing methods. Furthermore, by deriving an extension of the proposed method, we show that the proposed method is a new fundamental approach to analyzing datasets with different granularity.
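
    A minimal sketch of the building block only: plain NMF applied separately to a user-individual matrix and to its group-level aggregation. The paper's joint probabilistic model, which ties the two views together through latent high-resolution data, is not reproduced; shapes and the grouping below are illustrative assumptions.

      import numpy as np
      from sklearn.decomposition import NMF

      rng = np.random.default_rng(0)
      X_individual = rng.poisson(2.0, size=(100, 20)).astype(float)   # user x item counts

      # Low-resolution view: rows aggregated into 10 user groups of 10 users each.
      group_of = np.repeat(np.arange(10), 10)
      X_group = np.vstack([X_individual[group_of == g].sum(axis=0) for g in range(10)])

      # Independent factorizations of each view; the proposed method instead
      # links them via latent variables for the data underlying X_group.
      W_ind = NMF(n_components=5, init="nndsvda", random_state=0).fit_transform(X_individual)
      W_grp = NMF(n_components=5, init="nndsvda", random_state=0).fit_transform(X_group)
      print(W_ind.shape, W_grp.shape)   # (100, 5) (10, 5)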

  • Automatic Retrieval of Action Video Shots from the Web Using Density-Based Cluster Analysis and Outlier Detection

    Nga Hang DO  Keiji YANAI

    PAPER-Image Processing and Video Processing
    Publicized: 2016/07/21  Vol: E99-D No:11  Page(s): 2788-2795

    In this paper, we introduce a fully automatic approach to constructing action datasets from noisy Web video search results. The idea is based on combining cluster structure analysis with density-based outlier detection. For a specific action concept, we first download its top Web search videos and segment them into video shots. We then organize these shots into subsets using density-based hierarchical clustering. For each subset, we rank its shots by their outlier degrees, which are determined by their isolatedness with respect to their surroundings. Finally, we collect highly ranked shots as training data for the action concept. We demonstrate that with action models trained on our data, we can obtain promising precision rates in the task of action classification while offering the advantage of fully automatic, scalable learning. Experimental results on UCF11, a challenging action dataset, show the effectiveness of our method.
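
    A hedged sketch of the filtering idea only: cluster shot-level features with a density-based method, score each shot's isolation with respect to its neighborhood, and keep the least isolated clustered shots as training data. The features are random stand-ins for real shot descriptors, and DBSCAN/LOF are generic substitutes for the paper's specific procedures.

      import numpy as np
      from sklearn.cluster import DBSCAN
      from sklearn.neighbors import LocalOutlierFactor

      rng = np.random.default_rng(0)
      relevant = rng.normal(0.0, 0.5, size=(80, 16))    # shots that show the action
      noise = rng.uniform(-4, 4, size=(40, 16))         # unrelated shots from the Web results
      shots = np.vstack([relevant, noise])

      labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(shots)   # cluster structure analysis

      lof = LocalOutlierFactor(n_neighbors=10)
      lof.fit(shots)
      isolation = -lof.negative_outlier_factor_         # larger = more isolated

      # Keep shots that belong to a cluster and are least isolated.
      keep = [i for i in np.argsort(isolation)[:60] if labels[i] != -1]
      print(f"selected {len(keep)} shots as training data")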

  • History-Pattern Encoding for Large-Scale Dynamic Multidimensional Datasets and Its Evaluations

    Masafumi MAKINO  Tatsuo TSUJI  Ken HIGUCHI

    PAPER
    Publicized: 2016/01/14  Vol: E99-D No:4  Page(s): 989-999

    In this paper, we present a new encoding/decoding method for dynamic multidimensional datasets and its implementation scheme. Our method encodes an n-dimensional tuple into a pair of scalar values even when n is sufficiently large. The method also encodes and decodes tuples using only shift and bitwise AND/OR register instructions. One of the most serious problems in multidimensional-array-based tuple encoding is that the size of an encoded result may often exceed the machine word size for large-scale tuple sets. This problem is efficiently resolved in our scheme. We confirmed the advantages of our scheme through analytical and experimental evaluations. The experimental evaluations compared our prototype system with other systems: (1) a system based on a similar encoding scheme called history-offset encoding, and (2) the PostgreSQL RDBMS. In most cases, both the storage and retrieval costs of our system significantly outperformed those of the other systems.
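
    A hedged sketch of encoding an n-dimensional tuple into a scalar with only shifts and bitwise AND/OR. This is plain fixed-width bit packing, not the paper's history-pattern scheme; the per-dimension bit widths are assumptions.

      WIDTHS = [10, 12, 8, 16]   # bits reserved for each dimension (illustrative)

      def encode(tuple_values):
          code, shift = 0, 0
          for v, w in zip(tuple_values, WIDTHS):
              code |= (v & ((1 << w) - 1)) << shift   # mask, then place with shift/OR
              shift += w
          return code

      def decode(code):
          out, shift = [], 0
          for w in WIDTHS:
              out.append((code >> shift) & ((1 << w) - 1))
              shift += w
          return tuple(out)

      t = (513, 4000, 200, 65000)
      c = encode(t)
      assert decode(c) == t
      print(f"{t} -> {c:#x}")
      # Note: Python integers are unbounded, so the word-size overflow that the
      # paper handles (splitting the code into a pair of scalars) does not surface here.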

  • Efficient Large-Scale Video Retrieval via Discriminative Signatures

    Pengyi HAO  Sei-ichiro KAMATA

    PAPER-Image Processing and Video Processing
    Vol: E96-D No:8  Page(s): 1800-1810

    Retrieving videos that contain a desired person from a dataset using only the content of faces, without any help from textual information, has many interesting applications such as video surveillance, social networks, and video mining. However, traditional face matching against a huge number of detected faces leads to an unacceptable response time and may also reduce accuracy owing to large variations in facial expressions, poses, lighting, etc. Therefore, in this paper we propose a novel method to generate discriminative “signatures” for efficiently retrieving the videos that contain the same person as a query. In this research, a signature is defined as a compact, discriminative, reduced-dimensionality representation generated from a set of high-dimensional feature vectors of an individual. The desired videos are retrieved based on the similarities between the signature of the query and those of the individuals in the database. In particular, we make the following contributions. First, we give an algorithm for two-directional linear discriminant analysis with the maximum correntropy criterion (2DLDA-MCC) as an extension of our recently proposed maximum-correntropy-criterion-based linear discriminant analysis (LDA-MCC). Both algorithms are robust to outliers and noise. Second, we present an approach for transforming a set of exemplars into a fixed-length signature using LDA-MCC and 2DLDA-MCC, resulting in two kinds of signatures, called the 1D signature and the 2D signature. Finally, a novel video retrieval scheme is given based on the signatures, which has a low storage requirement and achieves fast search. Evaluations on a large dataset of videos show that the proposed signatures provide reliable measurement of similarities between the identities generated from videos. Experimental results also demonstrate that the proposed video retrieval scheme has the potential to substantially reduce the response time while slightly increasing the mean average precision of retrieval.
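
    A minimal sketch of signature-based retrieval in general terms: each person's set of face feature vectors is reduced to one fixed-length signature, and videos are ranked by similarity to the query signature. A plain mean of L2-normalized features stands in for the LDA-MCC / 2DLDA-MCC projections, and all descriptors are random stand-ins.

      import numpy as np

      def signature(face_features: np.ndarray) -> np.ndarray:
          normed = face_features / np.linalg.norm(face_features, axis=1, keepdims=True)
          s = normed.mean(axis=0)
          return s / np.linalg.norm(s)

      rng = np.random.default_rng(0)
      database = {f"video_{i}": signature(rng.normal(size=(50, 128))) for i in range(100)}

      query_faces = rng.normal(size=(20, 128))          # faces detected for the query person
      q = signature(query_faces)

      ranked = sorted(database.items(), key=lambda kv: float(q @ kv[1]), reverse=True)
      print("top-5 candidate videos:", [name for name, _ in ranked[:5]])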

  • Measuring the Degree of Synonymy between Words Using Relational Similarity between Word Pairs as a Proxy

    Danushka BOLLEGALA  Yutaka MATSUO  Mitsuru ISHIZUKA

    PAPER-Natural Language Processing
    Vol: E95-D No:8  Page(s): 2116-2123

    Two types of similarity between words have been studied in the natural language processing community: synonymy and relational similarity. A high degree of similarity exists between synonymous words, whereas a high degree of relational similarity exists between analogous word pairs. We present and empirically test a hypothesis that links these two types of similarity. Specifically, we propose a method to measure the degree of synonymy between two words using the relational similarity between word pairs as a proxy. Given two words, we first represent the semantic relations that hold between them using lexical patterns. We use a sequential pattern clustering algorithm to identify different lexical patterns that represent the same semantic relation. Second, we compute the degree of synonymy between the two words using an inter-cluster covariance matrix. We compare the proposed method for measuring the degree of synonymy against previously proposed methods on the Miller-Charles dataset and the WordSimilarity-353 dataset. Our proposed method outperforms all existing Web-based similarity measures, achieving a statistically significant Pearson correlation coefficient of 0.867 on the Miller-Charles dataset.
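
    A hedged sketch of the underlying representation only: a word pair is described by the lexical patterns that connect the two words in text, and relational similarity between two pairs is the cosine of their pattern vectors. The paper's sequential pattern clustering and inter-cluster covariance measure are not reproduced; the snippets below are invented for illustration.

      from collections import Counter
      import math

      def pattern_vector(snippets, x, y):
          """Count lexical patterns of the form '... X ... Y ...' around the pair."""
          counts = Counter()
          for s in snippets:
              s = s.lower()
              if x in s and y in s:
                  counts[s.replace(x, "X").replace(y, "Y")] += 1
          return counts

      def cosine(a: Counter, b: Counter) -> float:
          dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
          na = math.sqrt(sum(v * v for v in a.values()))
          nb = math.sqrt(sum(v * v for v in b.values()))
          return dot / (na * nb) if na and nb else 0.0

      v1 = pattern_vector(["car is a kind of automobile", "car, also called automobile"], "car", "automobile")
      v2 = pattern_vector(["gem is a kind of jewel", "gem, also called jewel"], "gem", "jewel")
      print("relational similarity:", cosine(v1, v2))   # 1.0 for these toy snippets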

  • Efficiently Finding Individuals from Video Dataset

    Pengyi HAO  Sei-ichiro KAMATA

    PAPER-Video Processing
    Vol: E95-D No:5  Page(s): 1280-1287

    We are interested in retrieving video shots or videos containing particular people from a video dataset. Owing to the large variations in pose, illumination conditions, occlusions, hairstyles and facial expressions, face tracks have recently been studied in the fields of face recognition, face retrieval and name labeling from videos. However, when the number of face tracks is very large, conventional methods, which match all or some pairs of faces in face tracks, are not effective. Therefore, in this paper, an efficient method for finding a given person in a video dataset is presented. In addition to studying face tracks within a single video, we also consider how to organize all the faces across the videos in a dataset and how to improve the search quality in the query process. Different videos may include the same person; thus, managing individuals across different videos is useful for their retrieval. The proposed method includes the following three points. (i) Face tracks of the same person appearing for a period in each video are first connected on the basis of scene information with a time constraint, and then all the people in one video are organized by a proposed hierarchical clustering method. (ii) After obtaining the organizational structure of all the people in one video, the people are organized into an upper layer by affinity propagation. (iii) Finally, in the query process, a re-measuring method based on the index structure of the videos is performed to improve the retrieval accuracy. We also build a video dataset that contains six types of videos: films, TV shows, educational videos, interviews, press conferences and domestic activities. The formation of face tracks in the six types of videos is first investigated, and then experiments are performed on this video dataset, which contains more than 1 million faces and 218,786 face tracks. The results show that the proposed approach achieves high search quality with a short search time.
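
    A hedged sketch of the upper-layer organization step (point ii): per-video person representatives are grouped across videos with affinity propagation, so the same individual appearing in different videos tends to land in one cluster. The face descriptors are random stand-ins, not features from the authors' dataset.

      import numpy as np
      from sklearn.cluster import AffinityPropagation

      rng = np.random.default_rng(0)
      identities = rng.normal(size=(5, 64))              # 5 "true" people

      # Each video contributes noisy representatives of a few of these people.
      video_reps, owner = [], []
      for vid in range(12):
          for person in rng.choice(5, size=3, replace=False):
              video_reps.append(identities[person] + rng.normal(scale=0.05, size=64))
              owner.append((vid, person))
      video_reps = np.array(video_reps)

      labels = AffinityPropagation(random_state=0).fit_predict(video_reps)
      for cluster in np.unique(labels):
          members = [owner[i] for i in np.where(labels == cluster)[0]]
          print(f"cluster {cluster}: (video, person) members {members[:4]}{'...' if len(members) > 4 else ''}")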