We consider a data set in which each example is an n-dimensional Boolean vector labeled as true or false. A pattern is a co-occurrence of a particular value combination of a given subset of the variables. If a pattern appears frequently in the true examples and infrequently in the false examples, we consider it a good pattern. In this paper, we discuss the problem of determining the data size needed for removing "deceptive" good patterns; in a data set of a small size, many good patterns may appear superficially, simply by chance, independently of the underlying structure. Our hypothesis is that, in order to remove such deceptive good patterns, the data set should contain a greater number of examples than that at which a random data set contains few good patterns. We justify this hypothesis by computational studies. We also derive a theoretical upper bound on the needed data size in view of our hypothesis.
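The notion of a good pattern described in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' code: the data set, the pattern, and the frequency thresholds are invented for the example.

```python
def pattern_matches(example, pattern):
    """A pattern fixes values on a subset of the variables; it matches an
    example if the example agrees on every fixed position."""
    return all(example[i] == v for i, v in pattern.items())

def is_good_pattern(pattern, true_examples, false_examples,
                    min_true_freq=0.6, max_false_freq=0.2):
    """Call a pattern 'good' if it appears frequently in the true examples
    and infrequently in the false examples (thresholds are illustrative,
    not taken from the paper)."""
    t = sum(pattern_matches(e, pattern) for e in true_examples) / len(true_examples)
    f = sum(pattern_matches(e, pattern) for e in false_examples) / len(false_examples)
    return t >= min_true_freq and f <= max_false_freq

# Tiny invented data set of 4-dimensional Boolean vectors.
true_ex  = [(1, 0, 1, 1), (1, 0, 0, 1), (1, 1, 1, 1)]
false_ex = [(0, 0, 1, 0), (0, 1, 0, 1), (1, 1, 0, 0)]

# The pattern "variable 0 = 1 and variable 3 = 1", encoded as {index: value}.
pattern = {0: 1, 3: 1}
print(is_good_pattern(pattern, true_ex, false_ex))  # → True
```

With a data set this small, many such patterns would look good purely by chance, which is exactly the "deceptive pattern" phenomenon the paper analyzes.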
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Kazuya HARAGUCHI, Mutsunori YAGIURA, Endre BOROS, Toshihide IBARAKI, "A Randomness Based Analysis on the Data Size Needed for Removing Deceptive Patterns" in IEICE TRANSACTIONS on Information,
vol. E91-D, no. 3, pp. 781-788, March 2008, doi: 10.1093/ietisy/e91-d.3.781.
URL: https://global.ieice.org/en_transactions/information/10.1093/ietisy/e91-d.3.781/_p
@ARTICLE{e91-d_3_781,
author={Kazuya HARAGUCHI and Mutsunori YAGIURA and Endre BOROS and Toshihide IBARAKI},
journal={IEICE TRANSACTIONS on Information},
title={A Randomness Based Analysis on the Data Size Needed for Removing Deceptive Patterns},
year={2008},
volume={E91-D},
number={3},
pages={781-788},
abstract={We consider a data set in which each example is an n-dimensional Boolean vector labeled as true or false. A pattern is a co-occurrence of a particular value combination of a given subset of the variables. If a pattern appears frequently in the true examples and infrequently in the false examples, we consider it a good pattern. In this paper, we discuss the problem of determining the data size needed for removing "deceptive" good patterns; in a data set of a small size, many good patterns may appear superficially, simply by chance, independently of the underlying structure. Our hypothesis is that, in order to remove such deceptive good patterns, the data set should contain a greater number of examples than that at which a random data set contains few good patterns. We justify this hypothesis by computational studies. We also derive a theoretical upper bound on the needed data size in view of our hypothesis.},
keywords={},
doi={10.1093/ietisy/e91-d.3.781},
ISSN={1745-1361},
month={March},}
TY - JOUR
TI - A Randomness Based Analysis on the Data Size Needed for Removing Deceptive Patterns
T2 - IEICE TRANSACTIONS on Information
SP - 781
EP - 788
AU - Kazuya HARAGUCHI
AU - Mutsunori YAGIURA
AU - Endre BOROS
AU - Toshihide IBARAKI
PY - 2008
DO - 10.1093/ietisy/e91-d.3.781
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E91-D
IS - 3
Y1 - 2008/03//
AB - We consider a data set in which each example is an n-dimensional Boolean vector labeled as true or false. A pattern is a co-occurrence of a particular value combination of a given subset of the variables. If a pattern appears frequently in the true examples and infrequently in the false examples, we consider it a good pattern. In this paper, we discuss the problem of determining the data size needed for removing "deceptive" good patterns; in a data set of a small size, many good patterns may appear superficially, simply by chance, independently of the underlying structure. Our hypothesis is that, in order to remove such deceptive good patterns, the data set should contain a greater number of examples than that at which a random data set contains few good patterns. We justify this hypothesis by computational studies. We also derive a theoretical upper bound on the needed data size in view of our hypothesis.
ER -