In this paper, we propose a method to reduce the labeling cost while acquiring training data for a malicious domain name detection system using supervised machine learning. In the conventional systems, to train a classifier with high classification accuracy, large quantities of benign and malicious domain names need to be prepared as training data. In general, malicious domain names are observed less frequently than benign domain names. Therefore, it is difficult to acquire a large number of malicious domain names without a dedicated labeling method. We propose a method based on active learning that labels data around the decision boundary of classification, i.e., in the gray area, and we show that the classification accuracy can be improved by using approximately 1% of the training data used by the conventional systems. Another disadvantage of the conventional system is that if the classifier is trained with a small amount of training data, its generalization ability cannot be guaranteed. We propose a method based on ensemble learning that integrates multiple classifiers, and we show that the classification accuracy can be stabilized and improved. The combination of the two methods proposed here allows us to develop a new system for malicious domain name detection with high classification accuracy and generalization ability by labeling a small amount of training data.
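The two ideas in the abstract, labeling only samples near the decision boundary and combining several classifiers, can be illustrated with a minimal sketch. The abstract does not give the paper's actual selection rule or ensemble scheme, so everything below is an assumption: the "gray area" is taken to be the samples whose averaged score is closest to 0.5, the ensemble is a plain mean of scores, and the names `ensemble_score`, `select_gray_area`, and the toy classifiers are hypothetical.

```python
from statistics import mean


def ensemble_score(classifiers, x):
    """Average the malicious-probability scores of several classifiers.

    Assumption: each classifier is a callable returning a score in [0, 1].
    """
    return mean(clf(x) for clf in classifiers)


def select_gray_area(classifiers, unlabeled, budget):
    """Pick the `budget` samples whose ensemble score is closest to 0.5.

    Assumption: distance from 0.5 stands in for distance from the
    decision boundary; the paper may use a different criterion.
    """
    ranked = sorted(
        unlabeled,
        key=lambda x: abs(ensemble_score(classifiers, x) - 0.5),
    )
    return ranked[:budget]


# Hypothetical toy classifiers: each maps one numeric feature to a
# score clipped to [0, 1], with slightly different offsets.
classifiers = [
    lambda x: min(1.0, max(0.0, x)),
    lambda x: min(1.0, max(0.0, x + 0.1)),
    lambda x: min(1.0, max(0.0, x - 0.1)),
]

# Hypothetical unlabeled pool, one feature value per domain name.
unlabeled = [0.05, 0.45, 0.52, 0.95, 0.30]

# Only these samples would be sent to a human for labeling.
to_label = select_gray_area(classifiers, unlabeled, budget=2)
print(to_label)
```

In a full active-learning loop, the selected samples would be labeled, added to the training set, and the classifiers retrained before the next selection round.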
Naoki FUKUSHI (Waseda University)
Daiki CHIBA (NTT Corporation)
Mitsuaki AKIYAMA (NTT Corporation)
Masato UCHIDA (Waseda University)
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Naoki FUKUSHI, Daiki CHIBA, Mitsuaki AKIYAMA, Masato UCHIDA, "Exploration into Gray Area: Toward Efficient Labeling for Detecting Malicious Domain Names," in IEICE TRANSACTIONS on Communications, vol. E103-B, no. 4, pp. 375-388, April 2020, doi: 10.1587/transcom.2019NRP0005.
URL: https://global.ieice.org/en_transactions/communications/10.1587/transcom.2019NRP0005/_p
@ARTICLE{e103-b_4_375,
author={Naoki FUKUSHI and Daiki CHIBA and Mitsuaki AKIYAMA and Masato UCHIDA},
journal={IEICE TRANSACTIONS on Communications},
title={Exploration into Gray Area: Toward Efficient Labeling for Detecting Malicious Domain Names},
year={2020},
volume={E103-B},
number={4},
pages={375--388},
abstract={In this paper, we propose a method to reduce the labeling cost while acquiring training data for a malicious domain name detection system using supervised machine learning. In the conventional systems, to train a classifier with high classification accuracy, large quantities of benign and malicious domain names need to be prepared as training data. In general, malicious domain names are observed less frequently than benign domain names. Therefore, it is difficult to acquire a large number of malicious domain names without a dedicated labeling method. We propose a method based on active learning that labels data around the decision boundary of classification, i.e., in the gray area, and we show that the classification accuracy can be improved by using approximately 1% of the training data used by the conventional systems. Another disadvantage of the conventional system is that if the classifier is trained with a small amount of training data, its generalization ability cannot be guaranteed. We propose a method based on ensemble learning that integrates multiple classifiers, and we show that the classification accuracy can be stabilized and improved. The combination of the two methods proposed here allows us to develop a new system for malicious domain name detection with high classification accuracy and generalization ability by labeling a small amount of training data.},
keywords={},
doi={10.1587/transcom.2019NRP0005},
ISSN={1745-1345},
month={April},
}
TY  - JOUR
TI  - Exploration into Gray Area: Toward Efficient Labeling for Detecting Malicious Domain Names
T2  - IEICE TRANSACTIONS on Communications
SP  - 375
EP  - 388
AU  - Naoki FUKUSHI
AU  - Daiki CHIBA
AU  - Mitsuaki AKIYAMA
AU  - Masato UCHIDA
PY  - 2020
DO  - 10.1587/transcom.2019NRP0005
JO  - IEICE TRANSACTIONS on Communications
SN  - 1745-1345
VL  - E103-B
IS  - 4
JA  - IEICE TRANSACTIONS on Communications
Y1  - April 2020
AB  - In this paper, we propose a method to reduce the labeling cost while acquiring training data for a malicious domain name detection system using supervised machine learning. In the conventional systems, to train a classifier with high classification accuracy, large quantities of benign and malicious domain names need to be prepared as training data. In general, malicious domain names are observed less frequently than benign domain names. Therefore, it is difficult to acquire a large number of malicious domain names without a dedicated labeling method. We propose a method based on active learning that labels data around the decision boundary of classification, i.e., in the gray area, and we show that the classification accuracy can be improved by using approximately 1% of the training data used by the conventional systems. Another disadvantage of the conventional system is that if the classifier is trained with a small amount of training data, its generalization ability cannot be guaranteed. We propose a method based on ensemble learning that integrates multiple classifiers, and we show that the classification accuracy can be stabilized and improved. The combination of the two methods proposed here allows us to develop a new system for malicious domain name detection with high classification accuracy and generalization ability by labeling a small amount of training data.
ER  - 