Machine Learning and Data Mining
Linear regression
(adapted from) Prof. Alexander Ihler

Supervised learning
• Notation
  – Features x
  – Targets y
  – Predictions ŷ
  – Parameters θ
• Learning algorithm
  – A program (the "learner"), characterized by some parameters θ
  – A procedure (using θ) that outputs a prediction from the features
  – Training data (examples): features plus feedback / target values
  – Score performance with a "cost function"; change θ to improve performance

Linear regression
• "Predictor": evaluate a line
      ŷ = θ0 + θ1 x
  [Figure: training data and fitted line; Target y vs. Feature x]
• Define the form of the function f(x) explicitly
• Find a good f(x) within that family
(c) Alexander Ihler

More dimensions?
  [Figures: two views of a plane fit to data points; y vs. features x1, x2]

Notation
• Define "feature" x0 = 1 (constant). Then
      ŷ = θ0 x0 + θ1 x1 + … + θn xn = θ · x

Measuring error
• Error or "residual": e = y − ŷ  (observation minus prediction)
  [Figure: data points, fitted line, and vertical residuals]

Mean squared error
• How can we quantify the error?
      J(θ) = (1/m) Σi ( y(i) − θ · x(i) )²
• Could choose something else, of course…
  – Computationally convenient (more later)
  – Measures the variance of the residuals
  – Corresponds to likelihood under a Gaussian model of the "noise"

MSE cost function
• Rewrite using matrix form (Matlab):
      >> e = y' - th*X';  J = e*e'/m;

Visualizing the cost function
  [Figure: slices and contours of the cost surface J(θ) over θ0, θ1]

Finding good parameters
• Want to find parameters which minimize our error…
• Think of a cost "surface": the error residual for each θ…

Machine Learning and Data Mining
Linear regression: direct minimization
(adapted from) Prof. Alexander Ihler

MSE Minimum
• Consider a simple problem
  – One feature, two data points
  – Two unknowns: θ0, θ1
  – Two equations:
        y1 = θ0 + θ1 x1
        y2 = θ0 + θ1 x2
• Can solve this system directly for θ0, θ1
• However, most of the time, m > n
  – There may be no linear function that hits all the data exactly
  – Instead, solve directly for the minimum of the MSE function

SSE Minimum
• Setting the gradient to zero and reordering, we have
      θ = yᵀ X (Xᵀ X)⁻¹
• X (Xᵀ X)⁻¹ is called the "pseudo-inverse"
• If X is square with independent columns, this is just the inverse
• If m > n: overdetermined; gives the minimum-MSE fit

Matlab SSE
• This is easy to solve in Matlab…
      % y = [y1 ; … ; ym]
      % X = [x1_0 … x1_n ; x2_0 … x2_n ; …]   (one example per row)
      % Solution 1: "manual"
      th = y' * X * inv(X' * X);
      % Solution 2: "mrdivide"
      th = y' / X';    % solves th*X' = y'

Effects of MSE choice
• Sensitivity to outliers: a residual of 16 contributes 16² = 256 to the cost
  – Heavy penalty for large errors
  [Figures: L2 fit pulled toward an outlier vs. L1 fits (original and outlier data), which are much less affected]

Cost functions for regression
• Mean squared error (MSE)
• Mean absolute error (MAE)
• Something else entirely… (???)
• "Arbitrary" cost functions can't be solved in closed form…
  – use gradient descent

Machine Learning and Data Mining
Linear regression: nonlinear features
(adapted from) Prof. Alexander Ihler

Nonlinear functions
• What if our hypotheses are not lines?
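The closed-form (pseudo-inverse) solution above can be sketched in NumPy as well as Matlab. This is a minimal translation, not the course's official code; the toy data values and variable names are made up for illustration:

```python
import numpy as np

# Toy data: m = 5 examples, n = 2 features, with x0 = 1 the constant feature.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x1), x1])   # each row is one example [x0, x1]
y = 2.0 + 3.0 * x1                            # targets lying exactly on a line

# "Manual" solution: theta = y^T X (X^T X)^{-1}   (theta as a row vector)
theta = y @ X @ np.linalg.inv(X.T @ X)

# Numerically preferred: a least-squares solve, avoiding the explicit inverse
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta)      # both recover intercept 2 and slope 3
```

In practice `lstsq` (or Matlab's mrdivide) is preferred over forming `inv(X'X)` explicitly, since the explicit inverse is slower and less numerically stable.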
  – Ex: higher-order polynomials
  [Figures: order-1 and order-3 polynomial fits to the same data]

Nonlinear functions
• Single feature x, predict target y:
      ŷ = θ0 + θ1 x + θ2 x² + θ3 x³
• Add features: [1, x, x², x³]
• Linear regression in the new features
• Sometimes useful to think of a "feature transform" Φ:
      ŷ = θ · Φ(x),   Φ(x) = [1, x, x², x³]

Higher-order polynomials
• Fit in the same way
• More "features"
  [Figures: order-1, order-2, and order-3 polynomial fits]

Features
• In general, can use any features we think are useful
• Other information about the problem
  – Sq. footage, location, age, …
• Polynomial functions
  – Features [1, x, x², x³, …]
• Other functions
  – 1/x, sqrt(x), x1 * x2, …
• "Linear regression" = linear in the parameters
  – The features can be as complex as we want!

Higher-order polynomials
• Are more features better?
• "Nested" hypotheses
  – 2nd order is more general than 1st,
  – 3rd order more general than 2nd, …
• Fits the observed data better

Overfitting and complexity
• More complex models will always fit the training data better
• But they may "overfit" the training data, learning complex relationships that are not really present
  [Figures: a complex model chasing noisy data vs. a simple model; Y vs. X]

Test data
• After training the model
• Go out and get more data from the world
  – New observations (x, y)
• How well does our model perform?
  [Figure: training data and new "test" data]

Training versus test error
• Plot MSE as a function of model complexity
  – Polynomial order
• Training error decreases
  – A more complex function fits the training data better
• What about new data?
• 0th to 1st order
  – Error decreases
  – Underfitting
• Higher order
  – Error increases
  – Overfitting
  [Figure: training MSE and new "test" MSE vs. polynomial order 0–5]
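The feature-transform idea and the training-versus-test behavior above can be sketched in NumPy. This is an illustrative experiment, not the course's code: the sin-plus-noise targets, sample sizes, and the polynomial orders compared are all assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up for illustration): y = sin(x) + Gaussian noise.
x_train = rng.uniform(0.0, 6.0, 15)
y_train = np.sin(x_train) + 0.3 * rng.normal(size=15)
x_test = rng.uniform(0.0, 6.0, 200)
y_test = np.sin(x_test) + 0.3 * rng.normal(size=200)

def poly_features(x, order):
    """Feature transform Phi(x) = [1, x, x^2, ..., x^order]."""
    return np.vander(x, order + 1, increasing=True)

def fit_mse(x, y, order):
    """Minimum-MSE linear regression in the transformed features."""
    theta, *_ = np.linalg.lstsq(poly_features(x, order), y, rcond=None)
    return theta

def mse(theta, x, y, order):
    e = y - poly_features(x, order) @ theta
    return float(e @ e) / len(y)

train_err, test_err = {}, {}
for order in (1, 3, 12):
    theta = fit_mse(x_train, y_train, order)
    train_err[order] = mse(theta, x_train, y_train, order)
    test_err[order] = mse(theta, x_test, y_test, order)
    print(order, train_err[order], test_err[order])
```

Because the hypothesis classes are nested, training MSE can only go down as the order increases; test MSE typically falls and then rises again as the high-order fit starts chasing the noise.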

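The slides note that "arbitrary" cost functions have no closed-form minimizer and must be handled by gradient descent. Here is a minimal sketch of that loop, run on the MSE cost (which does have a closed form) purely so the answer can be checked; the data and learning rate are made up:

```python
import numpy as np

# Gradient descent on J(theta) = (1/m) sum_i (theta . x_i - y_i)^2.
# The same loop applies to costs with no closed-form minimum.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x1), x1])   # x0 = 1 constant feature
y = 1.0 + 2.0 * x1                            # targets on an exact line
m = len(y)

theta = np.zeros(2)
alpha = 0.02                       # learning rate (hand-tuned for this data)
for _ in range(20000):
    e = X @ theta - y              # residuals at the current theta
    grad = (2.0 / m) * (X.T @ e)   # gradient of the MSE cost
    theta -= alpha * grad          # step downhill

print(theta)   # approaches the exact solution [1., 2.]
```

The step size matters: too large and the iterates diverge, too small and convergence is slow. Later lectures in the course treat gradient descent in its own right.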
© Copyright 2018