1-4hit |
Juntong HONG Eunjong CHOI Osamu MIZUNO
Code search is a task to retrieve the most relevant code given a natural language query. Several recent studies proposed deep learning based methods use multi-encoder model to parse code into multi-field to represent code. These methods enhance the performance of the model by distinguish between similar codes and utilizing a relation matrix to bridge the code and query. However, these models require more computational resources and parameters than single-encoder models. Furthermore, utilizing the relation matrix that solely relies on max-pooling disregards the delivery of word alignment information. To alleviate these problems, we propose a combined alignment model for code search. We concatenate the multi-code fields into one sequence to represent code and use one encoding model to encode code features. Moreover, we transform the relation matrix using trainable vectors to avoid information losses. Then, we combine intra-modal and cross-modal attention to assign the salient words while matching the corresponding code and query. Finally, we apply the attention weight to code/query embedding and compute the cosine similarity. To evaluate the performance of our model, we compare our model with six previous models on two popular datasets. The results show that our model achieves 0.614 and 0.687 Top@1 performance, outperforming the best comparison models by 12.2% and 9.3%, respectively.
Eunjong CHOI Norihiro YOSHIDA Katsuro INOUE
Although code clones (i.e. code fragments that have similar or identical code fragments in the source code) are regarded as a factor that increases the complexity of software maintenance, tools for supporting clone refactoring (i.e. merging a set of code clones into a single method or function) are not commonly used. To promote the development of refactoring tools that can be more widely utilized, we present an investigation of clone refactoring carried out in the development of open source software systems. In the investigation, we identified the most frequently used refactoring patterns and discovered how merged code clone token sequences and differences in token sequence lengths vary for each refactoring pattern.
Eunjong CHOI Norihiro YOSHIDA Yoshiki HIGO Katsuro INOUE
So far, many approaches for detecting code clones have been proposed based on the different degrees of normalizations (e.g. removal of white spaces, tokenization, and regularization of identifiers). Different degrees of normalizations lead to different granularities of source code to be detect as code clones. To investigate how the normalizations impact the code clone detection, this study proposes six approaches for detecting code clones with preprocessing input source files using different degrees of normalizations. More precisely, each normalization is applied to the input source files and then equivalence class partitioning is performed to the files in the preprocessing. After that, code clones are detected from a set of files that are representatives of each equivalence class using a token-based code clone detection tool named CCFinder. The proposed approaches can be categorized into two types, approaches with non-normalization and normalization. The former is the detection of only identical files without any normalization. Meanwhile, the latter category is the detection of identical files with different degrees of normalizations such as removal of all lines containing macros. From the case study, we observed that our proposed approaches detect code clones faster than the approach that uses only CCFinder. We also found the approach with non-normalization is the fastest among the proposed approaches in many cases.
Khine Yin MON Masanari KONDO Eunjong CHOI Osamu MIZUNO
Defect prediction approaches have been greatly contributing to software quality assurance activities such as code review or unit testing. Just-in-time defect prediction approaches are developed to predict whether a commit is a defect-inducing commit or not. Prior research has shown that commit-level prediction is not enough in terms of effort, and a defective commit may contain both defective and non-defective files. As the defect prediction community is promoting fine-grained granularity prediction approaches, we propose our novel class-level prediction, which is finer-grained than the file-level prediction, based on the files of the commits in this research. We designed our model for Python projects and tested it with ten open-source Python projects. We performed our experiment with two settings: setting with product metrics only and setting with product metrics plus commit information. Our investigation was conducted with three different classifiers and two validation strategies. We found that our model developed by random forest classifier performs the best, and commit information contributes significantly to the product metrics in 10-fold cross-validation. We also created a commit-based file-level prediction for the Python files which do not have the classes. The file-level model also showed a similar condition as the class-level model. However, the results showed a massive deviation in time-series validation for both levels and the challenge of predicting Python classes and files in a realistic scenario.