Small Sample Inference in Classification Using the CUD Bound

Eric Laber and Susan A. Murphy
University of Michigan, Ann Arbor
August 7, 2008

Introduction

What is this talk about?
- Constructing faithful confidence sets for the generalization error
- Coping with non-regularity
- Living in the small-sample world (no test set)

What is this talk not about?
- Mining complex and high-dimensional data
- Muppets
Setup

Classification in a nutshell:
- We observe iid (feature, label) pairs {(x_i, y_i)}_{i=1}^n from a distribution F (assume x_i ∈ R^p and y_i ∈ {−1, 1})
- Goal in classification: construct a "rule" which predicts the label Y from the feature X with high probability

Given a class of rules F we would like

    f̂ = arg min_{f ∈ F} P 1_{Y f(X) < 0}        (generalization error)

but we settle for

    f̂ = arg min_{f ∈ F} P_n l(x, y, f)          (surrogate loss)

for some smooth surrogate loss function l, e.g. l(x, y, f) = (y − f(x))^2.

Setup

Given the classification rule f̂ = arg min_{f ∈ F} P_n l(x, y, f), we want Q_{1−γ} so that

    Pr{ (P_n − P) 1_{y f̂(x) < 0} < Q_{1−γ} } ≥ 1 − γ,

where Q_{1−γ} is a function of the training data {(x_i, y_i)}_{i=1}^n (and, of course, of γ).

Things to note:
- We are operating in a data-impoverished setting: no test set
- No distributional assumptions are made about (P_n − P) 1_{y f̂(x) < 0} or about F
- √n (P_n − P) 1_{y f̂(x) < 0} need not converge to a tight limit

Relevant Work

"But I can think of at least a dozen things to try. Let's see, there's the bootstrap, cross-validation, normal approximation, Bayesian methods, VC bounds..."
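To make the setup concrete, here is a minimal sketch (not from the talk; all names are hypothetical) of the surrogate-loss step: with the squared-error surrogate l(x, y, f) = (y − f(x))^2 and linear rules f(x) = xᵀβ, minimizing P_n l is ordinary least squares, after which we can evaluate the empirical misclassification rate P_n 1{y f̂(x) < 0}.

```python
import numpy as np

def fit_surrogate(X, y):
    # Minimize the empirical surrogate loss P_n (y - f(x))^2 over linear
    # rules f(x) = X @ beta; for squared error this is ordinary least squares.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def empirical_error(X, y, beta):
    # Empirical generalization error P_n 1{y f(x) < 0}
    return float(np.mean(y * (X @ beta) < 0))

# Toy data in the small-sample regime the talk considers (n = 30)
rng = np.random.default_rng(0)
n, p = 30, 2
X = rng.normal(size=(n, p))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=n))  # labels in {-1, 1}

beta_hat = fit_surrogate(X, y)
err = empirical_error(X, y, beta_hat)
```

The quantity `err` is only the training-set error; the whole point of the talk is that, with no test set, bounding the gap between `err` and the true generalization error is the hard part.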
Relevant Work

A snapshot of the difficulties:
- Bootstrap: the sampling distribution of √n (P_n − P) 1_{y f̂(x) < 0} is non-smooth
- Distributional approximations: the sampling distribution of √n (P_n − P) 1_{y f̂(x) < 0} need not follow any particular distribution
- VC bounds: small samples make the bound too loose

A New Approach

The CUD Bound

Trivially we have the following inequality:

    (P_n − P) 1_{y f̂(x) < 0} ≤ sup_{f ∈ F} (P_n − P) 1_{y f(x) < 0}

- The bound above is too loose because the sup is taken over a huge space (we call this a uniform deviation bound)
- Taking the sup over a smaller space will lead to tighter bounds
- What space should we choose?
- Solution: take the sup over a contiguous ball centered at

    f* = arg min_{f ∈ F} P l(x, y, f)
The CUD Bound

Since f̂ minimizes P_n l(x, y, f) we have

    √n (P_n − P) 1_{y f̂(x) < 0} ≤ sup_{f ∈ F} √n (P_n − P) 1_{y f(x) < 0} × 1{ P_n( l(x, y, f*) − l(x, y, f) ) ≥ −1/α_n },

where (α_n) is a sequence of positive numbers tending to ∞ that depends on the surrogate loss l and the sample size n.

The non-smoothness of the indicator function is problematic, so we replace it with the smooth surrogate g(x) = (1 + x) ∧ 1 to obtain the CUD bound:

    √n (P_n − P) 1_{y f̂(x) < 0} ≤ sup_{f ∈ F} √n (P_n − P) 1_{y f(x) < 0} × g{ α_n P_n( l(x, y, f*) − l(x, y, f) ) }

What does CUD stand for? Constrained Uniform Deviation.

Theorem. If Q_{1−γ} is the 1 − γ quantile of

    sup_{f ∈ F} √n (P_n − P) 1_{y f(x) < 0} × g{ α_n P_n( l(x, y, f*) − l(x, y, f) ) },

where (α_n) is any sequence of positive numbers, then

    Pr{ √n (P_n − P) 1_{y f̂(x) < 0} < Q_{1−γ} } ≥ 1 − γ.
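As a small numeric illustration (not from the talk), the surrogate g keeps weight near 1 on rules whose empirical surrogate risk is close to that of f*, and sharply down-weights clearly suboptimal rules; here we use the n/log(n) rate for α_n discussed later in the talk, with n = 30.

```python
import numpy as np

def g(x):
    # The smooth surrogate from the slide: g(x) = (1 + x) ∧ 1
    return np.minimum(1.0 + x, 1.0)

n = 30
alpha_n = n / np.log(n)  # the n/log(n) rate discussed later in the talk

# P_n(l(f*) - l(f)) is near 0 for near-optimal f and negative for worse f
near_optimal = g(alpha_n * -0.001)   # close to 1: barely down-weighted
clearly_worse = g(alpha_n * -0.5)    # negative: heavily down-weighted
```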
The CUD Bound

Issues:
- How do we choose α_n?
- How do we use this result?

Answers:
- Choose α_n so that the CUD bound converges to a tight limit; with a linear classifier and a strictly convex l(x, y, f) we have α_n = n/log(n)
- In practice we must estimate the CUD bound using the bootstrap

Theorem. If F is the space of linear classifiers and l(x, y, f) is strictly convex (subject to additional regularity conditions), then

    sup_{f ∈ F} √n (P̂_n − P_n) 1_{y f(x) < 0} g{ (n/log(n)) P̂_n( l(x, y, f̂) − l(x, y, f) ) }

converges weakly to a tight limit in l^∞(R^p). (Here P̂_n denotes the bootstrap empirical measure.)
Note: so in the bootstrap case, α_n should be chosen to be n/log(n).

Computation

How do we compute the sup? At first glance the required optimization problem

    sup_{f ∈ F} √n (P̂_n − P_n) 1_{y f(x) < 0} g{ (n/log(n)) P̂_n( l(x, y, f̂) − l(x, y, f) ) }

seems infeasible.

- It turns out that when the classifier is linear, the sup can be computed by solving a series of 2n MINLP (mixed-integer nonlinear programming) problems
- When l is convex, the sup in the preceding display can be computed efficiently and exactly

The CUD Bound: Simulations

We compare the performance of the CUD bound with several methods from the literature:
- A mix of real datasets (from the UCI repository) and simulated datasets
- Linear classifiers with squared error loss
- Coverage recorded over 1000 generations of each dataset
- Competitors:
  - Two bootstrap methods
  - Two distributional methods
  - One cross-validation method
  - An approximation to Langford's inverse binomial method
- See www.stat.lsa.umich.edu/∼laber for details
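Putting the pieces together, the bootstrap estimate of the CUD quantile can be sketched as follows. This is an illustration only, not the authors' implementation: it uses squared-error loss, replaces the exact MINLP computation of the sup with a crude finite grid of linear rules around f̂, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # Smooth surrogate from the slides: g(x) = (1 + x) ∧ 1
    return np.minimum(1.0 + x, 1.0)

def sq_loss(X, y, beta):
    # Squared-error surrogate loss l(x, y, f) = (y - f(x))^2 with f(x) = X @ beta
    return (y - X @ beta) ** 2

def err_ind(X, y, beta):
    # Misclassification indicator 1{y f(x) < 0}
    return (y * (X @ beta) < 0).astype(float)

def cud_quantile(X, y, gamma=0.05, B=200, grid_size=100):
    """Bootstrap estimate of the (1 - gamma) quantile of the CUD statistic,
    with the sup over F approximated by a finite grid of linear rules
    (the talk computes the sup exactly; the grid is a simplification)."""
    n, p = X.shape
    alpha_n = n / np.log(n)                            # rate from the theory
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # surrogate minimizer f-hat
    grid = beta_hat + rng.normal(scale=1.0, size=(grid_size, p))
    stats = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)               # bootstrap sample -> P-hat_n
        Xb, yb = X[idx], y[idx]
        dev = np.sqrt(n) * np.array(
            [err_ind(Xb, yb, b).mean() - err_ind(X, y, b).mean() for b in grid])
        weight = g(alpha_n * np.array(
            [sq_loss(Xb, yb, beta_hat).mean() - sq_loss(Xb, yb, b).mean()
             for b in grid]))
        stats.append((dev * weight).max())
    return np.quantile(stats, 1 - gamma)

# Toy data: n = 30, p = 2, label depends on the first feature
n, p = 30, 2
X = rng.normal(size=(n, p))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=n))
q = cud_quantile(X, y, B=100, grid_size=50)
```

The returned `q` plays the role of Q_{1−γ}: an upper confidence bound on √n (P_n − P) 1_{y f̂(x) < 0}, which converts directly into a one-sided confidence set for the generalization error.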
Simulation Results

Table: Average coverage of the new method (CUD) and competitors, n = 30. Target coverage level was .950; in the original slides, methods meeting or exceeding this level were highlighted in orange.

Data        CUD   BS1   BS2   K     M     Y     L
Spam        1.00  .478  .983  .782  .632  .996  .636
Donut       .999  .880  .908  .631  .633  .937  .620
Ionosphere  .998  .605  .926  .816  .757  .994  .747
Heart       .999  .406  .981  .718  .475  .998  .458
Diabetes    1.00  .654  .900  .912  .986  .997  .984
Outlier     .988  .884  .736  .808  .838  .910  .961
3 Pt.       .993  .827  .717  .844  .896  .753  .952
Abalone     .983  .963  .737  .972  .991  .998  .997
Liver       .961  .964  .878  .980  1.00  1.00  1.00
Magic       .998  .918  .920  .961  .982  .991  .992
χ2          .985  .958  .786  .961  .982  .996  1.00
Mammogram   1.00  .678  .994  .646  .426  .983  .406

Simulation Results

Table: Average diameter of the confidence sets constructed using the new method and competitors, n = 30. In the original slides, the method with the smallest diameter among those with correct coverage was highlighted in yellow.

Data        CUD   BS1   BS2   K     M     Y     L
Spam        .469  .432  .432  .315  .314  .451  .354
Donut       .485  .590  .590  .324  .324  .414  .364
Ionosphere  .437  .429  .429  .301  .296  .501  .337
Heart       .459  .468  .468  .319  .319  .433  .359
Diabetes    .433  .305  .305  .307  .310  .312  .350
Outlier     .555  .491  .491  .328  .329  .456  .368
3 Pt.       .508  .476  .476  .318  .317  .457  .356
Abalone     .617  .314  .314  .331  .331  .504  .371
Liver       .559  .372  .372  .327  .326  .485  .366
Magic       .606  .307  .307  .300  .283  .464  .324
χ2          .595  .330  .330  .329  .330  .501  .351
Mammogram   .464  .532  .532  .321  .321  .421  .361

Conclusions

- The CUD bound provided the appropriate coverage on all datasets
- The CUD bound was conservative but not trivial (smallest in diameter for two of the datasets)
- Future work includes trying to correct the conservatism and to generalize the approach
Thanks!

References:
- B. Efron. Estimating the error rate of a prediction rule: Improvement on cross-validation. J. Amer. Statist. Assoc., 78(382):316–331, 1983.
- B. Efron and R. Tibshirani. Cross-validation and the bootstrap: Estimating the error of a prediction rule. Technical report, 1995.
- R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, pages 1137–1145, 1995.
- J. Langford. Combining train set and test set bounds. In ICML, 2002.
- J. Langford. Tutorial on practical prediction theory for classification. J. Machine Learning Research, 6:273–306, 2005.
- J. K. Martin and D. S. Hirschberg. Small sample statistics for classification error rates II: Confidence intervals and significance tests. Technical Report ICS-TR-96-22, 1996.
- R. A. Schiavo and D. J. Hand. Ten more years of error rate research. International Statistical Review, 68(3):295–310, 2000.
- Y. Yang. Comparing learning methods for classification. Statistica Sinica, 16(2):635–657, 2006.
