Training Multiple Support Vector Machines for Personalized Web Content Filters

Dung Duc NGUYEN; Maike ERDMANN; Tomoya TAKEYOSHI; Gen HATTORI; Kazunori MATSUMOTO; Chihiro ONO

doi:10.1587/transinf.E96.D.2376

Summary:

The abundance of information published on the Internet makes filtering hazardous Web pages a difficult yet important task. Supervised learning methods such as Support Vector Machines (SVMs) can be used to identify hazardous Web content. However, scalability is a major challenge, especially when multiple classifiers must be trained because different policies exist on what kind of information is hazardous. We therefore propose two strategies for training multiple SVMs for personalized Web content filters. The first strategy identifies common data clusters and then performs optimization on these clusters to obtain good initial solutions for the individual problems. This initialization shortens the path to the optimal solutions and reduces the training time on the individual training sets. The second strategy trains all SVMs simultaneously. We introduce an SMO-based kernel-biased heuristic that balances the reduction rate of the individual objective functions against the computational cost of the kernel matrix. The heuristic relies primarily on the optimality conditions of all optimization problems and secondarily on the pre-calculated part of the whole kernel matrix. This strategy increases the amount of information shared among learning tasks and thus reduces the number of kernel calculations and the training time. In our experiments on inconsistently labeled training examples, both strategies predicted hazardous Web pages accurately (> 91%) with training times of only 26% and 18%, respectively, of that of normal sequential training.
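The idea of sharing kernel calculations among learning tasks can be illustrated with a minimal Python sketch. It is not the authors' SMO-based heuristic: it only shows why several SVMs defined over the same examples, but with different labels, can reuse one cache of kernel rows. The names `rbf_kernel`, `KernelRowCache`, `decision_values`, `policy_a`, and `policy_b` are hypothetical, and the labels and multipliers are hand-picked illustrative values rather than trained solutions.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel between two feature vectors (assumed kernel choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


class KernelRowCache:
    """Caches kernel rows over one data set and counts kernel evaluations."""

    def __init__(self, points, kernel):
        self.points = points
        self.kernel = kernel
        self.rows = {}
        self.evaluations = 0

    def row(self, i):
        # Compute row i only on first request; later requests reuse it.
        if i not in self.rows:
            self.rows[i] = [self.kernel(self.points[i], p) for p in self.points]
            self.evaluations += len(self.points)
        return self.rows[i]


def decision_values(cache, alphas, labels):
    """f(x_j) = sum_i alpha_i * y_i * K(x_i, x_j); the bias is omitted for brevity."""
    values = [0.0] * len(labels)
    for i, alpha in enumerate(alphas):
        if alpha == 0.0:
            continue  # only rows of examples with nonzero multipliers are needed
        for j, k_ij in enumerate(cache.row(i)):
            values[j] += alpha * labels[i] * k_ij
    return values


# Same examples, two labeling policies (illustrative values, not trained).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
tasks = {
    "policy_a": {"labels": [1, -1, 1, -1], "alphas": [0.5, 0.5, 0.0, 0.0]},
    "policy_b": {"labels": [1, 1, -1, -1], "alphas": [0.0, 0.7, 0.7, 0.0]},
}

# One cache shared by all tasks: the row of example 1 is computed once.
shared = KernelRowCache(points, rbf_kernel)
shared_values = {
    name: decision_values(shared, task["alphas"], task["labels"])
    for name, task in tasks.items()
}
shared_evals = shared.evaluations

# One cache per task, as in independent sequential training.
separate_evals = 0
for task in tasks.values():
    own = KernelRowCache(points, rbf_kernel)
    decision_values(own, task["alphas"], task["labels"])
    separate_evals += own.evaluations

print(shared_evals, separate_evals)
```

In this toy setting the shared cache computes three kernel rows while the per-task caches compute four in total, because both tasks need the row of the same example. The saving grows with the overlap between the examples that the individual problems touch during optimization.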

Publication: IEICE TRANSACTIONS on Information Vol.E96-D No.11 pp.2376-2384

Publication Date: 2013/11/01

Online ISSN: 1745-1361

DOI: 10.1587/transinf.E96.D.2376

Type of Manuscript: PAPER

Category: Artificial Intelligence, Data Mining

Authors

Dung Duc NGUYEN
  Vietnam Academy of Science and Technology
Maike ERDMANN
  KDDI R&D Laboratories
Tomoya TAKEYOSHI
  KDDI R&D Laboratories
Gen HATTORI
  KDDI R&D Laboratories
Kazunori MATSUMOTO
  KDDI R&D Laboratories
Chihiro ONO
  KDDI R&D Laboratories

Keywords

support vector machines, sequential minimal optimization, text categorization, Web content filtering

Cite this

Dung Duc NGUYEN, Maike ERDMANN, Tomoya TAKEYOSHI, Gen HATTORI, Kazunori MATSUMOTO, Chihiro ONO, "Training Multiple Support Vector Machines for Personalized Web Content Filters" in IEICE TRANSACTIONS on Information, vol. E96-D, no. 11, pp. 2376-2384, November 2013, doi: 10.1587/transinf.E96.D.2376.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E96.D.2376/_p

@ARTICLE{e96-d_11_2376,
author={Dung Duc NGUYEN and Maike ERDMANN and Tomoya TAKEYOSHI and Gen HATTORI and Kazunori MATSUMOTO and Chihiro ONO},
journal={IEICE TRANSACTIONS on Information},
title={Training Multiple Support Vector Machines for Personalized Web Content Filters},
year={2013},
volume={E96-D},
number={11},
pages={2376-2384},
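keywords={support vector machines, sequential minimal optimization, text categorization, Web content filtering},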
doi={10.1587/transinf.E96.D.2376},
ISSN={1745-1361},
month={November},}

TY - JOUR
TI - Training Multiple Support Vector Machines for Personalized Web Content Filters
T2 - IEICE TRANSACTIONS on Information
SP - 2376
EP - 2384
AU - Dung Duc NGUYEN
AU - Maike ERDMANN
AU - Tomoya TAKEYOSHI
AU - Gen HATTORI
AU - Kazunori MATSUMOTO
AU - Chihiro ONO
PY - 2013
DO - 10.1587/transinf.E96.D.2376
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E96-D
IS - 11
JA - IEICE TRANSACTIONS on Information
Y1 - November 2013
ER -