A Method of K-Means Clustering Based on TF-IDF for Software Requirements Documents Written in Chinese Language

Jing ZHU; Song HUANG; Yaqing SHI; Kaishun WU; Yanqiu WANG

doi:10.1587/transinf.2021EDP7144

Summary:

Currently, function points cannot be obtained automatically when applying the function point analysis (FPA) method, especially for requirements documents written in Chinese. Given the characteristics of Chinese grammar with respect to word segmentation, Chinese words must be segmented accurately so that subsequent entity recognition and disambiguation can be carried out over a smaller range, which lays a solid foundation for efficient automatic extraction of function points. This paper therefore proposes a K-Means clustering method based on TF-IDF and conducts experiments on 24 software requirements documents written in Chinese. The results show that the best clustering effect is achieved when 55% to 75% of the extracted information is retained and the number of clusters takes the middle value of the total number of clusters. The method and conclusions apply not only to Chinese: they also provide an important reference for automatic extraction of function points from software requirements documents written in other Oriental languages, and they fill a gap in data preprocessing at the early stage of automatic function point calculation.
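As a rough illustration of the pipeline the summary names (TF-IDF weighting followed by K-Means clustering), the following standard-library Python sketch may help. It is not the authors' implementation. It assumes the Chinese text has already been segmented into space-separated words by some external tool, uses a smoothed IDF variant chosen for illustration, and swaps the usual random centroid initialization for a deterministic farthest-first rule so the result is repeatable. All function names and the four sample requirements are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into L2-normalized TF-IDF vectors."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] / max(len(doc), 1) * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(vectors, k):
    """Deterministic initialization (an assumption, not from the paper):
    start from the first vector, then repeatedly add the vector that is
    farthest from all centroids chosen so far."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, n_iter=100):
    """Plain Lloyd-style K-Means; returns one cluster label per vector."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    centroids = farthest_first_init(vectors, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical, pre-segmented requirement sentences (space-separated words).
segmented = [
    "用户 登录 系统 输入 密码",
    "用户 登录 失败 重置 密码",
    "管理员 导出 月度 报表",
    "管理员 打印 年度 报表",
]
docs = [line.split() for line in segmented]
vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

On this toy input the two login-related requirements should share one label and the two report-related requirements the other. The retention ratio (55% to 75%) and the choice of cluster count discussed in the summary are not modeled here.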

Publication: IEICE TRANSACTIONS on Information Vol.E105-D No.4 pp.736-754

Publication Date: 2022/04/01

Publicized: 2021/12/28

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2021EDP7144

Type of Manuscript: PAPER

Category: Software Engineering

Authors

Jing ZHU
  Army Engineering University of PLA, Navy Command College
Song HUANG
  Army Engineering University of PLA
Yaqing SHI
  Army Engineering University of PLA
Kaishun WU
  Army Engineering University of PLA
Yanqiu WANG
  Baopo Technology Co., Ltd.

Keywords

Chinese, TF-IDF, K-means, clustering

Cite this

Jing ZHU, Song HUANG, Yaqing SHI, Kaishun WU, Yanqiu WANG, "A Method of K-Means Clustering Based on TF-IDF for Software Requirements Documents Written in Chinese Language" in IEICE TRANSACTIONS on Information, vol. E105-D, no. 4, pp. 736-754, April 2022, doi: 10.1587/transinf.2021EDP7144.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2021EDP7144/_p

@ARTICLE{e105-d_4_736,
author={Jing ZHU and Song HUANG and Yaqing SHI and Kaishun WU and Yanqiu WANG},
journal={IEICE TRANSACTIONS on Information},
title={A Method of K-Means Clustering Based on TF-IDF for Software Requirements Documents Written in Chinese Language},
year={2022},
volume={E105-D},
number={4},
pages={736--754},
keywords={Chinese, TF-IDF, K-means, clustering},
doi={10.1587/transinf.2021EDP7144},
ISSN={1745-1361},
month={April}
}

TY - JOUR
TI - A Method of K-Means Clustering Based on TF-IDF for Software Requirements Documents Written in Chinese Language
T2 - IEICE TRANSACTIONS on Information
SP - 736
EP - 754
AU - ZHU, Jing
AU - HUANG, Song
AU - SHI, Yaqing
AU - WU, Kaishun
AU - WANG, Yanqiu
PY - 2022
DO - 10.1587/transinf.2021EDP7144
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E105-D
IS - 4
JA - IEICE TRANSACTIONS on Information
Y1 - 2022/04/01/
ER -