• Two basic types of statistics:
1. Descriptive stats – methods for organizing and summarizing information
– Stats in sports are a great example
– Usually we use graphs, charts, and tables showing averages and associated measures of variation
2. Inferential stats – methods for drawing and measuring the reliability of conclusions about a population based on information obtained from a sample of the population
– When we poll a small sample of potential voters (sample) we can infer something about the sentiment of the entire voting population
• Both types are interrelated because we use descriptive stats to organize and summarize sample information to carry out an inferential analysis
• We can conduct a census, obtain info from entire population, but that is usually time consuming, costly, or impossible
– Instead we can do a survey where we take information from a sample
• Variable – a characteristic that varies from one person or thing to another
– For people: height, weight, number of siblings, gender, marital status, eye color
– The first 3 variables yield numerical info and are called quantitative variables
– The last 3 yield non‐numerical info and are called qualitative or categorical variables
• There are 2 types of quantitative variables:
1. Discrete variables have possible values that can be listed, even though the list may continue indefinitely
– The variable may have only a finite number of possible values or its values are some collection of whole numbers
– Usually a count of something as in number of siblings
2. Continuous variables have possible values that form some interval of numbers
– Usually a measurement of something as in height or weight of a person
• Grouping discrete quantitative data
– To get a clear picture of trends in a list of observations (together called a data set), we need to group by classes
70 62 75 57 51 64 38 56 53 36
99 67 71 47 63 55 70 51 50 66
64 60 99 55 85 89 69 68 81 79
87 78 95 80 83 65 39 86 98 70
– First, decide on class intervals (10’s, 100’s, etc. are a good start)
Days to Maturity   Tally        No. of Investments
30‐39              III          3
40‐49              I            1
50‐59              IIII III     8
60‐69              IIII IIII    10
70‐79              IIII II      7
80‐89              IIII II      7
90‐100             IIII         4
Total                           40
– We can see several pieces of info that are important, most importantly there were more investments in the 60‐69‐day range than any other
– Generally:
• Number of classes should be between 5 and 20, although fewer may be used for categorical data
• Classes should have the same width
• Frequency and relative frequency
– Number of observations that fall into each class is called the frequency (or count)
– A table that provides all classes and their frequencies is called a frequency distribution
– Many times, we are interested in relative frequency or percentage of each class, so divide the frequency of each class by the total number of observations
• In Table 2, for class 50‐59:
relative frequency = 8 / 40 = 0.20
percentage = 0.20 × 100 = 20%
• 0.20 is the relative frequency and 20% is the percentage of observations in class 50‐59
• Now, we can construct a relative frequency distribution table
Days to Maturity   Relative Frequency   Percentage
30‐39              3/40 = 0.075         0.075 × 100 = 7.5%
40‐49              1/40 = 0.025         0.025 × 100 = 2.5%
50‐59              0.200                20.0%
60‐69              0.250                25.0%
70‐79              0.175                17.5%
80‐89              0.175                17.5%
90‐100             0.100                10.0%
Total              1.000                100.0%
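The grouping steps above can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` and the way classes are written as inclusive (low, high) pairs are assumptions made for this sketch, not part of the lecture.

```python
# Days to maturity for the 40 short-term investments listed in the notes.
days = [70, 62, 75, 57, 51, 64, 38, 56, 53, 36,
        99, 67, 71, 47, 63, 55, 70, 51, 50, 66,
        64, 60, 99, 55, 85, 89, 69, 68, 81, 79,
        87, 78, 95, 80, 83, 65, 39, 86, 98, 70]

# Inclusive class limits as used in the notes (the last class runs to 100).
classes = [(30, 39), (40, 49), (50, 59), (60, 69),
           (70, 79), (80, 89), (90, 100)]


def frequency_table(data, class_limits):
    """Return (label, frequency, relative frequency) for each class."""
    n = len(data)
    rows = []
    for low, high in class_limits:
        freq = sum(low <= x <= high for x in data)  # count observations in class
        rows.append((f"{low}-{high}", freq, freq / n))
    return rows


table = frequency_table(days, classes)
for label, freq, rel in table:
    # Columns: class, frequency, relative frequency, percentage
    print(f"{label:>7}  {freq:>2}  {rel:.3f}  {rel:.1%}")
```

Dividing each frequency by the total (40) gives the relative frequency, and multiplying by 100 gives the percentage, exactly as in the hand calculation.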
• Single‐value grouping
– In some cases, we are interested in classes that represent a single possible value
– This is usually the case for discrete data where there are only a few possible values
Number of TV sets in 50 households
1 3 3 0 3 1 2 1 3 2
1 1 1 1 1 2 5 4 2 2
6 2 3 1 1 3 1 2 2 1
3 3 2 3 3 4 6 2 1 1
2 2 2 1 5 4 2 3 3 1
Single‐value grouped‐data table
No. of TVs   Frequency   Relative Frequency
0            1           0.020
1            16          0.320
2            14          0.280
3            12          0.240
4            3           0.060
5            2           0.040
6            2           0.040
Total        50          1.000
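Single‐value grouping is just counting how often each distinct value occurs. A minimal sketch using the standard‐library `collections.Counter` (the variable names are illustrative assumptions):

```python
from collections import Counter

# Number of TV sets in the 50 households, in the order listed in the notes.
tv_sets = [1, 3, 3, 0, 3, 1, 2, 1, 3, 2,
           1, 1, 1, 1, 1, 2, 5, 4, 2, 2,
           6, 2, 3, 1, 1, 3, 1, 2, 2, 1,
           3, 3, 2, 3, 3, 4, 6, 2, 1, 1,
           2, 2, 2, 1, 5, 4, 2, 3, 3, 1]

counts = Counter(tv_sets)   # each distinct value becomes its own class
n = len(tv_sets)

# value -> (frequency, relative frequency), ordered by value
single_value_table = {v: (counts[v], counts[v] / n) for v in sorted(counts)}

for value, (freq, rel) in single_value_table.items():
    print(f"{value}  {freq:>2}  {rel:.3f}")
```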
• Grouping continuous quantitative data
– An important difference between grouping discrete and continuous data is that you must decide on exactly where to distinguish classes along the continuum of real numbers
– For data in Table 6, we could use 120‐139.5 or 120‐139.9 as a first group…we’ll use 139.9
129.2 155.2 167.3 191.1 161.7 278.8 146.4 149.9 185.3 170.0
161.0 150.7 170.1 175.6 209.1 158.6 218.1 151.3 178.7 187.0
165.8 188.7 175.4 182.5 187.5 165.0 173.7 214.6 132.1 182.0
142.8 145.6 172.5 178.2 136.7 158.5 173.6
• Also, note that relative frequency may not always sum to 1.0 due to rounding error (0.999, below)
– If we carried each relative frequency to more decimal places we may be able to sum to 1.0, but it’s not of great importance
Weight (lb)   Frequency   Relative Frequency
120‐139.9     3           0.081
140‐159.9     9           0.243
160‐179.9     14          0.378
180‐199.9     7           0.189
200‐219.9     3           0.081
220‐239.9     0           0.000
240‐259.9     0           0.000
260‐280       1           0.027
Total         37          0.999
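The weight table, including the 0.999 rounding effect, can be reproduced with a short sketch. The helper `class_index` and the choice to compute the class from the lower cutpoint are assumptions made for illustration; they match the 20‐lb classes used in the notes.

```python
# Weights (lb) of the 37 males, in the order listed in the notes.
weights = [129.2, 155.2, 167.3, 191.1, 161.7, 278.8, 146.4, 149.9, 185.3, 170.0,
           161.0, 150.7, 170.1, 175.6, 209.1, 158.6, 218.1, 151.3, 178.7, 187.0,
           165.8, 188.7, 175.4, 182.5, 187.5, 165.0, 173.7, 214.6, 132.1, 182.0,
           142.8, 145.6, 172.5, 178.2, 136.7, 158.5, 173.6]

lower_cutpoints = list(range(120, 280, 20))   # 120, 140, ..., 260 (8 classes)


def class_index(w):
    """Index of the 20-lb class containing w; the last class also holds 280."""
    return min(int((w - 120) // 20), len(lower_cutpoints) - 1)


counts = [0] * len(lower_cutpoints)
for w in weights:
    counts[class_index(w)] += 1

n = len(weights)
rel_rounded = [round(c / n, 3) for c in counts]   # rounded to 3 decimals

print(counts)                          # class frequencies
print(rel_rounded)                     # rounded relative frequencies
print(round(sum(rel_rounded), 3))      # sum of the rounded values
```

The unrounded relative frequencies sum to exactly 1; only the rounded ones fall short.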
• Grouping qualitative data
– Classes are simply the observed values of the corresponding variable; for example, the answers to “Are you male or female?” (Male, Female) are the classes
– For a survey of political affiliation for Intro to Stats students:
D D D D O R O R O R
O R O D D R D D D R
R O R D R R O R R R
R R O O R R D R D D
– Classes will be Democratic, Republican, or Other
• Relative frequency distribution table:
Party        Frequency   Relative Frequency
Democratic   13          0.325
Republican   18          0.450
Other        9           0.225
Total        40          1.000
– From these data, we can say that the largest group of stat students is Republican at 45.0% (a plurality, not a majority), fewer are Democrats at 32.5%, and 22.5% have other political affiliations
• Next, we’ll see how to visually represent summarized data
• Frequency histograms are graphs that display class on the horizontal axis and the frequencies of the classes on the vertical axis
– Frequency of each class is represented by a vertical bar whose height is equal to the frequency of the class
– Histograms are used for quantitative data to visualize the actual distribution of data across a scale
– For data in tables 1‐3:
[Figure: frequency histogram; horizontal axis: Days to maturity (classes 30‐39 through 90‐100); vertical axis: Frequency (0 to 12)]
• We can also graph relative frequency, but the overall shape will be the same because they are proportional
– For classes including a range of values, the range should be listed under each bar or cut‐off values should be placed at each “tick” mark
[Figure: relative frequency histogram; horizontal axis: Days to maturity (classes 30‐39 through 90‐100); vertical axis: Relative Frequency (0 to 0.3)]
• For single‐value grouped data (TV data in Tables 4 and 5), place the middle of each histogram bar directly over the single value represented by the class
[Figure: frequency histogram and relative frequency histogram for the TV data; horizontal axis: No. of TVs (0 to 6); vertical axes: Frequency (0 to 18) and Relative Frequency (0 to 0.35)]
• Graphical displays for qualitative data
– Bar graphs are used for qualitative data because there is no hierarchy (ascending levels) or scale of values
• We could order classes in any way to visualize relative frequency (not so with quantitative data)
• Bars should be separated and class labels should be centered underneath
[Figure: bar graph; horizontal axis: Party (Democratic, Republican, Other); vertical axis: Relative frequency (0 to 0.5)]
– Pie charts have wedge‐shaped pieces that are proportional to relative frequencies
[Figure: pie chart, “Political Party Affiliations”; Democratic 32%, Republican 45%, Other 23%]
MA 2113 Lecture 1

Table 1. Days to maturity for 40 short‐term investments.
70 62 75 57 51 64 38 56 53 36
99 67 71 47 63 55 70 51 50 66
64 60 99 55 85 89 69 68 81 79
87 78 95 80 83 65 39 86 98 70

Table 2. Classes and counts for above data.
Days to Maturity   No. of Investments/Frequency
30‐39              3
40‐49              1
50‐59              8
60‐69              10
70‐79              7
80‐89              7
90‐100             4
Total              40

Equation 1. Relative frequency and percentage of a class.
8 / 40 = 0.20
0.20 × 100 = 20%

Table 3. Relative frequency distribution and percentage for above data.
Days to Maturity   Relative Frequency   Percentage
30‐39              3/40 = 0.075         0.075 × 100 = 7.5%
40‐49              1/40 = 0.025         0.025 × 100 = 2.5%
50‐59              0.200                20.0%
60‐69              0.250                25.0%
70‐79              0.175                17.5%
80‐89              0.175                17.5%
90‐100             0.100                10.0%
Total              1.000                100.0%

Table 4. Number of TV sets in 50 randomly selected households.
1 1 1 2 6 3 3 4 2 3
2 1 5 2 1 3 6 2 3 1
1 4 3 2 2 2 2 0 3 1
2 1 2 3 1 1 3 2 1 2
1 1 3 1 5 4 2 3 3 1

Table 5. Grouped‐data table for number of TV sets.
No. of TVs   Frequency   Relative Frequency
0            1           0.020
1            16          0.320
2            14          0.280
3            12          0.240
4            3           0.060
5            2           0.040
6            2           0.040
Total        50          1.000

Table 6. Weights of 37 males, aged 18‐24 years.
129.2 185.3 218.1 182.5 142.8
155.2 170.0 151.3 187.5 145.6
167.3 161.0 178.7 165.0 172.5
191.1 150.7 187.0 173.7 178.2
161.7 170.1 165.8 214.6 136.7
278.8 175.6 188.7 132.1 158.5
146.4 209.1 175.4 182.0 173.6
149.9 158.6

Table 7. Grouped‐data table for the weights of 37 males, aged 18‐24 years.
Weight (lb)   Frequency   Relative Frequency
120‐139.9     3           0.081
140‐159.9     9           0.243
160‐179.9     14          0.378
180‐199.9     7           0.189
200‐219.9     3           0.081
220‐239.9     0           0.000
240‐259.9     0           0.000
260‐280       1           0.027
Total         37          0.999

Table 8. Political party affiliations of the students in introductory stats. (Dem, Rep, or Other)
D R O R R R R R D O
R D O O R D D R O D
R R O R D O D D D R
O D O R D R R R R D

Table 9. Frequency and relative frequency distribution tables for political party affiliations.
Party        Frequency   Relative Frequency
Democratic   13          0.325
Republican   18          0.450
Other        9           0.225
Total        40          1.000
More descriptive stats
• Measures of center – descriptive measures that indicate where the center or most typical value of a data set lies
– Most commonly used is the mean, the sum of observations divided by number of observations
$$\text{Mean} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$
– For Table 1:
300 300 450 300 400 800 300 300 450 940 400 1050 300
300 + 300 + 300 + 940 + … + 1050 = 6290
n = # of observations = 13
Mean 1 = 6290/13 = $483.85
• For Table 2:
300 400 300 300 940 300 450 1050 400 300
Mean 2 = 4740/10 = $474.00
– We may also use the median, the number that divides the bottom 50% of data from the top 50%
• If # of observations is odd, median is the observation exactly in the middle of the ordered list
• If # of observations is even, median is the mean of the 2 middle observations of the ordered list
– For Table 1 (don’t forget to order data small to large):
300 300 300 300 300 300 400 400 450 450 800 940 1050
median = 400 (the 7th of the 13 ordered observations)
– For Table 2:
300 300 300 300 300 400 400 450 940 1050
median = (300 + 400)/2 = 350
– When we compare mean and median, we see that mean is greater than median
• This is because the mean is strongly affected by a few relatively large incomes ($940, $1,050) but the median is not
• Any time there are a few relatively large or small values in the data set, the mean will be pulled toward those values
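A quick check of the two measures of center for both office staffs, using Python's standard `statistics` module (the list names are illustrative assumptions):

```python
from statistics import mean, median

# Weekly incomes ($) from Tables 1 and 2 of the Lecture 2 handout.
staff1 = [300, 300, 300, 940, 300, 300, 400, 300, 400, 450, 800, 450, 1050]
staff2 = [300, 300, 940, 450, 400, 400, 300, 300, 1050, 300]

for name, incomes in (("Staff 1", staff1), ("Staff 2", staff2)):
    # The mean uses every value; the median only depends on the middle
    # of the ordered list, so a few large incomes barely move it.
    print(f"{name}: mean = {mean(incomes):.2f}, median = {median(incomes)}")
```

In both data sets the mean sits well above the median, which is what a few large incomes would be expected to do.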
• Summation notation
– In stats, letters such as x, y and z are used to denote variables
• If we take height and weight data from people, x = variable “height” and y = variable “weight”
• Measures of variation
– Two data sets can have the same mean and median but differ in other ways (other than measure of center)
– Consider heights (inches) of 5 starting players on 2 basketball teams:
Team 1: 72, 73, 76, 76, 78
Team 2: 67, 72, 76, 76, 84
mean for both = 75.0 median for both = 76
– Heights on Team 2 vary more than those on Team 1, and we need to describe that difference quantitatively
• The sample standard deviation (SD) is most commonly used to quantify variation in a data set
– SD measures variation by indicating how far, on average, observations are from the mean
• For a data set with a small amount of variation (Team 1), observations will, on average, be closer to the mean and SD is smaller
• For a data set with a large amount of variation (Team 2), observations will, on average, be farther from the mean and SD is larger
• It seems natural to divide the sum of squared deviations by the # of observations, n, but statistical theory shows that this tends to underestimate the population variance
– Instead we divide by n‐1, which gives, on average, a better estimate of the population variance; the result is the sample variance, denoted by s²
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
– For Team 1 heights:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{24}{5 - 1} = 6$$
• But remember, sample variance is in units that are the square of the original units
– We want a descriptive measure expressed in original units, so to get sample standard deviation, SD, take the square root of sample variance
– SD is denoted by s
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
– For Team 1 heights, simply take the square root of s2
$$s = \sqrt{6} \approx 2.4$$
– This is interpreted as, on average, the heights of players on Team 1 vary from the mean height of 75 inches by 2.4 inches
– Team 2 standard deviation (s):
x      xi − x̄    (xi − x̄)²
67     -8        64
72     -3        9
76     1         1
76     1         1
84     9         81
Sum              156
$$s^2 = \frac{156}{5 - 1} = 39$$
$$s = \sqrt{39} \approx 6.2$$
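The variance and standard deviation calculations for both teams follow the same recipe, so they can be sketched as one small function. The name `sample_variance` is an illustrative assumption; Python's `statistics.variance` and `statistics.stdev` compute the same n‐1 quantities.

```python
from math import sqrt


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)


team1 = [72, 73, 76, 76, 78]   # heights in inches
team2 = [67, 72, 76, 76, 84]

for name, heights in (("Team 1", team1), ("Team 2", team2)):
    s2 = sample_variance(heights)
    print(f"{name}: s^2 = {s2:.1f}, s = {sqrt(s2):.1f}")
```

Both teams have the same mean, but the larger s for Team 2 quantifies the wider spread of heights.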
Quartiles and Boxplots
• Descriptive measures of variation based on quartiles
– Remember, the median divides data into the bottom 50% and the top 50%
– Percentiles divide data into hundredths or 100 equal parts
• Percentile one, P1, divides the bottom 1% of data from the top 99%
– Deciles divide data in 10 equal parts and quartiles into 4 equal parts or quarters
• We will focus on quartiles
– A data set has 3 quartiles or dividing lines, Q1, Q2, Q3
– Q1 is the number that divides the bottom 25% from the top 75%, Q2 is the median, and Q3 divides bottom 75% from the top 25%
• For Table 7 data:
25 66 34 30 41 35 26 38 27 31
32 30 32 15 38 20 43 5 16 21
– First, arrange in increasing order and get median, Q2:
5 15 16 20 21 25 26 27 30 30 31 32 32 34 35 38 38 41 43 66
Median = 30.5
– Now, get median of lower 50% (Q1) and upper (Q3) 50% of data:
5 15 16 20 21 25 26 27 30 30     Q1 = 23.0
31 32 32 34 35 38 38 41 43 66    Q3 = 36.5
– We interpret this as:
• 25% of TV‐viewing times are <23 hours
• 25% are between 23 and 30.5 hours
• 25% are between 30.5 and 36.5 hours
• 25% are >36.5 hours
– With median as our measure of center, we like to use the Interquartile Range (IQR) as the associated measure of variation
• IQR is the difference between the 1st and 3rd quartiles or
IQR = Q3 – Q1
• For Table 7 data: IQR = 36.5 – 23.0 = 13.5 hours
– We can also get measures of variation for the two middle quarters similarly, Q2 – Q1 and Q3 – Q2
– But, that tells us nothing about variation in quarters 1 and 4
• We use minimum value with Q1 and maximum value with Q3 to get variation for those quarters:
Q1 – Min and Max – Q3
– To summarize the data set, we use a Five‐number Summary:
Min, Q1, Q2, Q3, Max or 5, 23, 30.5, 36.5, 66
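The quartile procedure above (median of the lower half, median of the whole, median of the upper half) can be sketched as follows. The function name is an illustrative assumption. For an odd number of observations the notes do not say how to split the data; this sketch assumes the median is included in both halves, which is one common convention.

```python
from statistics import median


def five_number_summary(data):
    """Return (Min, Q1, Q2, Q3, Max) using medians of the two halves."""
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2          # assumption: odd n shares the median
    lower, upper = xs[:half], xs[n - half:]
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])


# Weekly TV-viewing hours for the 20 people in Table 7.
tv_hours = [25, 66, 34, 30, 41, 35, 26, 38, 27, 31,
            32, 30, 32, 15, 38, 20, 43, 5, 16, 21]

summary = five_number_summary(tv_hours)
iqr = summary[3] - summary[1]    # IQR = Q3 - Q1
print(summary, iqr)
```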
MA 2113 Lecture 2

Table 1. Weekly income ($) for Office Staff 1.
300 300 300 940 300 300 400 300 400 450 800 450 1050

Table 2. Weekly income ($) for Office Staff 2.
300 300 940 450 400 400 300 300 1050 300

Equation 1. Mean of a data set.
$$\text{Mean} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}, \quad \text{where } n = \text{number of observations}$$

Equation 2. Mean of a data set with summation notation.
$$\text{Mean} = \bar{x} = \frac{\sum x_i}{n}$$

Table 3. Height (ft) of sweetgum trees on a selected study site on Noxubee National Wildlife Refuge.
70
105
73
86
22
31
48
27
40
62
45
45
62
35
75
18
25
32
37
34
25
28
18
20
37
50
40
72
25
40
45
70
35
40
25
22
25
45
45
70
60
45
70
45
37
27
60
50
Table 4. Deviations from mean for heights of players from Team 1.
Height x   Deviation from mean (xi − x̄)
72         -3
73         -2
76         1
76         1
78         3

Table 5. Sum of squared deviations for heights of players from Team 1.
Height x   Deviation from mean (xi − x̄)   Squared deviation (xi − x̄)²
72         -3                              9
73         -2                              4
76         1                               1
76         1                               1
78         3                               9
Sum                                        24

Equation 3. Sample variance.
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

Equation 4. Sample standard deviation.
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

Table 6. Sum of squared deviations for heights of players from Team 2.
x      xi − x̄    (xi − x̄)²
67     -8        64
72     -3        9
76     1         1
76     1         1
84     9         81
Sum              156

Table 7. Weekly number of hours of TV watched by 20 Americans from Nielsen ratings.
25 41 27 32 43 66 35 31 15 5
34 26 32 38 16 30 38 30 20 21
Outliers
• Observations that fall well outside the overall pattern of the data
• Reasons for outliers:
– Measurement or recording error
– An observation from a different population
– An unusually extreme observation
• We can use the IQR to identify outliers
– First, we need to define the lower limit and upper limit of a data set
• Observations that lie below the lower limit or above the upper limit are potential outliers
Lower limit = Q1 – (1.5 X IQR)
Upper limit = Q3 + (1.5 X IQR)
– For Table 1 (TV‐viewing) data:
IQR = 13.5
Q1 = 23.0
Q3 = 36.5
Lower limit = 23.0 – (1.5 X 13.5) = 23.0 – 20.25 = 2.75 hrs
Upper limit = 36.5 + (1.5 X 13.5) = 36.5 + 20.25 = 56.75 hrs
– Any observation below the lower limit or above the upper limit is a potential outlier
– In Table 1, the only value outside these limits is 66, and we consider it an unusually extreme observation
• One person in the sample watches much more TV than the others
• We can easily see this in a histogram
[Figure: relative frequency histogram of the TV‐viewing data; horizontal axis: classes 1‐9, 10‐19, 20‐29, 30‐39, 40‐49, 50‐59, 60‐69; vertical axis: 0 to 0.5]
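The outlier rule for the TV‐viewing data can be checked with a short sketch. The function name `outlier_limits` is an illustrative assumption; the quartiles are the values found in the previous lecture.

```python
def outlier_limits(q1, q3):
    """Lower and upper limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Weekly TV-viewing hours (Table 1 of the Lecture 3 handout).
tv_hours = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
            34, 26, 32, 38, 16, 30, 38, 30, 20, 21]

lower, upper = outlier_limits(q1=23.0, q3=36.5)

# Observations outside the limits are potential outliers.
outliers = [x for x in tv_hours if x < lower or x > upper]

# Adjacent values: most extreme observations still inside the limits.
inside = [x for x in tv_hours if lower <= x <= upper]
adjacent = (min(inside), max(inside))

print(lower, upper, outliers, adjacent)
```

The adjacent values computed at the end are the ones needed for the whiskers of a boxplot.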
Boxplots
• Also called box‐and‐whisker diagram
• Based on the five‐number summary to graphically display the center and variation in a data set
– Additionally, we need to identify adjacent values, the most extreme observations that still lie within lower and upper limits
– If there are no outliers, adjacent values are the min and max
• Steps to construct box plots:
1. Determine quartiles
6/16/2014
2. Determine outliers and adjacent values
3. Above an x‐axis, draw marks for quartiles (long lines) and adjacent values (short lines) with vertical lines
4. Connect quartile lines to make a box, then connect box to adjacent value lines
5. Plot each outlier with an asterisk
[Figure: boxplot of the TV‐viewing data on an axis from 0 to 70 hours, with the outlier plotted as an asterisk]
• With ≥2 samples, we can compare boxplots among samples to visualize the difference in median values and variation in data sets
– 5‐number summary for Table 2a and 2b:
2a: 3.0 4.85 6.3 7.4 8.8
2b: 5.2 12.3 18.25 21.55 29.4
[Figure: boxplots for Tables 2a and 2b, each drawn on an axis from 0 to 35]
Linear Equations
• Often, it is important to know if 2 or more variables are related and how they’re related
– Linear equations are a good way to assess relationships and even predict future values
• As an example, we could examine height and shoe size of a sample group of people and determine if there is any relationship between the 2 variables
– Also, we can determine the strength of the relationship… is it a strong or weak connection?
• General form of a linear equation:
y = b0 + b1x
– b0 and b1 are constants (fixed numbers)
– x is the independent variable
– y is the dependent variable
• The graph of a linear equation with one independent variable is a straight line
– Linear equations are one of the most commonly used statistical tools in practically all fields of research/business (management, marketing, physical and mathematical sciences, etc.)
• Intercept and slope, b0 and b1
– b0 is the y‐value of the point where the line crosses the y‐axis so we call it the y‐intercept
– b1 measures the steepness of the line or, in other words, how much the y‐value changes when the x‐value increases by 1 unit on a graph, so it is the slope of the equation
• Some example equations and their graphs:
• A practical example of application in business
– Business Services offers word processing at $20/hr plus a $25 disk charge
– Total cost depends on number of hours to complete a job
• For Table 3 (word processing):
Time (hr), x   Cost ($), y
5.0            125
7.5            175
15.0           325
20.0           425
22.5           475
– The total cost, y, of a job that takes x hours is
y = 25 + 20x
b0 = 25 and b1 = 20
– If we know # of hours required we can predict cost
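As a small illustration of predicting cost from hours with y = 25 + 20x, here is a sketch (the function name `total_cost` and its default arguments are assumptions for this example):

```python
def total_cost(hours, b0=25.0, b1=20.0):
    """Linear equation y = b0 + b1*x: $25 disk charge plus $20 per hour."""
    return b0 + b1 * hours


# Job times (hr) from the word-processing table.
job_hours = [5.0, 7.5, 15.0, 20.0, 22.5]
costs = [total_cost(h) for h in job_hours]

for h, c in zip(job_hours, costs):
    print(f"{h:>5} hr -> ${c:.0f}")
```

Because every job lies exactly on the line, the predictions reproduce the costs in the table; the intercept (25) is the cost at 0 hours and the slope (20) is the added cost per extra hour.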
• To graph a linear equation, you only need 2 values of x
– To graph the equation y = 5 – 3x, let’s use x‐values of 1 and 3 (it can be any 2 values but use some logic and consider scale)
– Also, do not forget that the y‐intercept (where x = 0) is a value you can graph
y = 5 – (3 × 1) = 2, so (x, y) = (1, 2)
y = 5 – (3 × 3) = -4, so (x, y) = (3, -4)
The Regression Equation
• Rarely are applications of the linear equation as simple as the word processing example where one variable (cost) can be predicted exactly in terms of another variable (time)
– So, many times we must rely on “rough” predictions from a sample data set
– We can’t predict the exact price, y, of a make and model of used car just by knowing the age, x
• We have to rely on a rough prediction using an estimate of the mean price of a sample of other cars of the same age
• For Table 4 data (cars):
Car   Age (yr), x   Price ($100), y
1     5             85
2     4             103
3     6             70
4     5             82
5     5             89
6     5             98
7     6             66
8     6             95
9     2             169
10    7             70
11    7             48
• To visualize the relationship, if any, between age and price we will use a scatterplot
– A scatterplot is a graph of data from 2 quantitative variables
[Figure: scatterplot; horizontal axis: Age (yr), 0 to 8; vertical axis: Price ($100), 0 to 180]
• Although the age‐price data points do not fall exactly on a line, they do appear to cluster around a line (there appears to be a relationship)
– With regression, we can “fit” a line (equation) to the sample data and use that line to predict, or give a rough estimate of, the price of a used Orion car based on its age
MA 2113 Lecture 3

Table 1. Weekly number of hours of TV watched by 20 Americans from Nielsen ratings.
25 41 27 32 43 66 35 31 15 5
34 26 32 38 16 30 38 30 20 21

Equation 1. Lower and upper limits to identify potential outliers in a data set.
Lower limit = Q1 – (1.5 × IQR)
Upper limit = Q3 + (1.5 × IQR)

Table 2a. Skinfold thickness (mm) for sample of elite runners.
7.3 3.0 7.8 5.4 3.7 6.7 5.1 3.8 6.4 7.5 8.7 8.8 6.2 6.3 4.6

Table 2b. Skinfold thickness (mm) for random people of similar age.
24 28 9.3 9.6 12.4 19.9 29.4 18.1 19.4 5.2
7.5 20.3 22.8 16.3 12.2 18.4 19 24.2 16.3 15.6

Equation 2. The general formula for a linear equation.
y = b0 + b1x

Table 3. Times and costs for five word‐processing jobs.
Time (hr), x   Cost ($), y
5.0            125
7.5            175
15.0           325
20.0           425
22.5           475

Table 4. Age and price data for a sample of 11 used Orion cars.
Car   Age (yr), x   Price ($100), y
1     5             85
2     4             103
3     6             70
4     5             82
5     5             89
6     5             98
7     6             66
8     6             95
9     2             169
10    7             70
11    7             48

The Regression Equation
• In the last lecture, we demonstrated that you can place a straight line (with a given equation) through a scatterplot in an attempt to “fit” it to the data
– We can come up with many candidate equations and lines
– We can compare how well each line fits the data by comparing error values between equations
• The error value will measure how far the observed y‐value for each data point is from the predicted y‐value given by the equation
• An example with a small data set, Table 1:
x   y
1   1
1   2
2   2
4   6
– First, a scatterplot of the data to check for a linear trend
– Next, let’s propose 2 candidate equations and determine which best fits the real data
Line A: y = 0.50 + 1.25x
Line B: y = ‐0.25 + 1.50x
– When we graph the linear equations with the scatterplot of real data, we see both seem to fit the data well, but which is best?
– First, calculate differences between the real values of y and the predicted value of y from the equation
• When value of x = 2, the real value for y = 2 (from Table 1)
• The predicted values of y when x = 2, which we will denote as ŷ, from the equations for Lines A and B are:
Line A: ŷ = 0.50 + 1.25 (2) = 3
Line B: ŷ = ‐0.25 + 1.50 (2) = 2.75
• And differences between real and predicted values are:
Line A: Error = y – ŷ = 2 – 3 = ‐1
Line B: Error = y – ŷ = 2 – 2.75 = ‐0.75
• Now, we can make a table of error values for both equations and determine which is the “better” equation
– Next, construct table to see which equation will provide the lowest value of sum of squares for error values
Line A
x   y   ŷ      y − ŷ    (y − ŷ)²
1   1   1.75   -0.75    0.5625
1   2   1.75   0.25     0.0625
2   2   3.00   -1.00    1.0000
4   6   5.50   0.50     0.2500
Sum                     1.8750

Line B
x   y   ŷ      y − ŷ    (y − ŷ)²
1   1   1.25   -0.25    0.0625
1   2   1.25   0.75     0.5625
2   2   2.75   -0.75    0.5625
4   6   5.75   0.25     0.0625
Sum                     1.2500
• The bottom value in the 5th column is the sum of squared errors or Σ (y – ŷ)2
• We can see that Line B provides the smallest value of sum of squared errors, so it fits the data better
• We still do not know if Line B is the “best” line because there are many more candidate lines to compare
– In fact we can propose an infinite number of lines
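Comparing candidate lines by their sum of squared errors is easy to automate, which also makes it easy to try other candidates. A minimal sketch (the function name `sse` is an illustrative assumption):

```python
# Small (x, y) data set from Table 1.
x = [1, 1, 2, 4]
y = [1, 2, 2, 6]


def sse(b0, b1, xs, ys):
    """Sum of squared errors, sum((y - y_hat)^2), for the line y_hat = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))


sse_a = sse(0.50, 1.25, x, y)    # Line A: y = 0.50 + 1.25x
sse_b = sse(-0.25, 1.50, x, y)   # Line B: y = -0.25 + 1.50x

print(f"Line A: {sse_a:.4f}")
print(f"Line B: {sse_b:.4f}")
print("Better fit:", "Line B" if sse_b < sse_a else "Line A")
```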
– Now, we introduce an equation that will give us the best line, the Regression Equation
• We will calculate the best b1 and b0 for the equation
– First, some notation we need to know:
$$S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$$
$$S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$$
• As an example, let’s look back at the used car data (Table 2) from last week in Lecture 3
• Step 1: Construct a table with the following columns:
Car   Age (yr), x   Price ($100), y
1     5             85
2     4             103
3     6             70
4     5             82
5     5             89
6     5             98
7     6             66
8     6             95
9     2             169
10    7             70
11    7             48

x     y      xy     x²
5     85     425    25
4     103    412    16
6     70     420    36
5     82     410    25
5     89     445    25
5     98     490    25
6     66     396    36
6     95     570    36
2     169    338    4
7     70     490    49
7     48     336    49
Sums:
58    975    4732   326
– The second table will provide all the numbers we need to complete the Regression Equation
– The Regression Equation for a set of n data points is
$$\hat{y} = b_0 + b_1 x$$
– where $b_1 = \dfrac{S_{xy}}{S_{xx}}$ and $b_0 = \dfrac{1}{n}\left(\sum y_i - b_1 \sum x_i\right)$
– Step 2: Calculate b1, slope of the regression line
$$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n} = \frac{4732 - (58)(975)/11}{326 - (58)^2/11} = -20.26$$
– Step 3: Calculate b0, the y‐intercept
$$b_0 = \frac{1}{n}\left(\sum y_i - b_1 \sum x_i\right) = \frac{1}{11}\left[975 - (-20.26)(58)\right] = 195.47$$
– Step 4: Fill in Regression Equation
ŷ = 195.47 – 20.26 x
• Once you have the regression equation, you can graph the line over a scatterplot of observed data
ŷ = 195.47 – 20.26x
– Predicted values of y for x = 2, 4, and 6
(2, 154.95) (4, 114.43) (6, 73.91)
[Figure: scatterplot of the used car data with the regression line ŷ = 195.47 – 20.26x drawn through it; horizontal axis 0 to 8, vertical axis 0 to 180]
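Steps 1 through 4 can be sketched as a single function that computes Sxy, Sxx, b1, and b0 from the data. The function name `least_squares` is an illustrative assumption; the formulas are the ones given in the notes.

```python
# Used Orion car data: age in years (x) and price in $100 (y).
age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]


def least_squares(xs, ys):
    """Return (b0, b1) for the regression line y_hat = b0 + b1*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    b1 = sxy / sxx                    # slope
    b0 = (sum_y - b1 * sum_x) / n     # y-intercept
    return b0, b1


b0, b1 = least_squares(age, price)
print(f"y_hat = {b0:.2f} + ({b1:.2f})x")

# Predicted price (in $100) for a 4-year-old car.
print(round(b0 + b1 * 4, 2))
```

The code keeps full precision for b1 and b0, so predictions can differ in the last decimal place from hand calculations made with the rounded coefficients.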
MA 2113 Lecture 4

Table 1. Simple (x, y) data set.
x   y
1   1
1   2
2   2
4   6

Equations 1 and 2. Preliminary calculations for Regression Equation:
$$S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$$
$$S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$$

Equation 3. The Regression Equation:
$$\hat{y} = b_0 + b_1 x$$

Table 2. Age and price data for a sample of 11 used Orion cars.
Car   Age (yr), x   Price ($100), y
1     5             85
2     4             103
3     6             70
4     5             82
5     5             89
6     5             98
7     6             66
8     6             95
9     2             169
10    7             70
11    7             48

Equation 4. Slope of the regression line, b1.
$$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n}$$

Equation 5. The y‐intercept, b0.
$$b_0 = \frac{1}{n}\left(\sum y_i - b_1 \sum x_i\right)$$
The Coefficient of Determination
• Now, we can check the usefulness of the regression equation by using some diagnostic techniques
– For our used car example, is the regression equation useful for predicting price, or could we do just as well by ignoring age?
– One method is to determine % of variation in observed values of the response variable (y) that is explained by the regression
– To find this %, we need to calculate 2 measures of variation:
a) Total variation in observed values of the response variable
b) Amount of variation in the observed values of the response variable that is explained by the regression
• Step 1: Get sum of squared deviations of observed values of y from the mean of y (remember sample variance of heights of basketball players)
– This is total sum of squares, $SST = \sum (y_i - \bar{y})^2$
• Step 2: Get sum of squared deviations of predicted values of y from mean of y
– This is regression sum of squares, $SSR = \sum (\hat{y}_i - \bar{y})^2$
• Step 3: Now, we use SST and SSR to get % of variation in observed values of y that is explained by regression, or the coefficient of determination denoted by r2
r2 = SSR/SST
– r2 always lies between 0 and 1
– A value near 0 suggests the regression equation is not very useful for predictions
– A value near 1 suggests the regression equation is very useful for predictions
• Let’s return to the used car example (Table 1) and calculate r2
– Table for computing SST
ȳ = 88.6
x     y      y − ȳ     (y − ȳ)²
5     85     -3.6      13.0
4     103    14.4      207.4
6     70     -18.6     346.0
5     82     -6.6      43.6
5     89     0.4       0.2
5     98     9.4       88.4
6     66     -22.6     510.8
6     95     6.4       41.0
2     169    80.4      6464.2
7     70     -18.6     346.0
7     48     -40.6     1648.4
Sum   975              9708.6
– Table for computing SSR
x     y      ŷ        ŷ − ȳ     (ŷ − ȳ)²
5     85     94.2     5.6       31.0
4     103    114.4    25.8      667.2
6     70     73.9     -14.7     215.8
5     82     94.2     5.6       31.0
5     89     94.2     5.6       31.0
5     98     94.2     5.6       31.0
6     66     73.9     -14.7     215.8
6     95     73.9     -14.7     215.8
2     169    155.0    66.4      4402.3
7     70     53.7     -35.0     1221.5
7     48     53.7     -35.0     1221.5
Sum                             8284.0
r2 = SSR/SST = 8284.0/9708.6 = 0.853
0.853 X 100 = 85.3%
– This is a very good regression equation for predicting price of this type of used car based on age
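The SST, SSR, and r² calculations can be sketched as below. The variable names are illustrative assumptions. Because the code works with unrounded values of ȳ and ŷ, its sums can differ slightly from the rounded table entries, but r² still rounds to the same 0.853.

```python
# Used Orion car data: age in years (x) and price in $100 (y).
age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

n = len(age)
sum_x, sum_y = sum(age), sum(price)

# Regression coefficients from the Sxy / Sxx formulas of the previous lecture.
sxy = sum(x * y for x, y in zip(age, price)) - sum_x * sum_y / n
sxx = sum(x * x for x in age) - sum_x ** 2 / n
b1 = sxy / sxx
b0 = (sum_y - b1 * sum_x) / n

y_bar = sum_y / n
y_hat = [b0 + b1 * x for x in age]              # predicted values

sst = sum((y - y_bar) ** 2 for y in price)      # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)    # regression sum of squares
r_squared = ssr / sst                           # coefficient of determination

print(f"SST = {sst:.1f}, SSR = {ssr:.1f}, r^2 = {r_squared:.3f}")
```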
MA 2113 Lecture 5

Equation 1. Total sum of squares, SST.
$$SST = \sum (y_i - \bar{y})^2$$

Equation 2. Regression sum of squares, SSR.
$$SSR = \sum (\hat{y}_i - \bar{y})^2$$

Equation 3. Coefficient of Determination, r².
$$r^2 = SSR/SST$$

Table 1. Age and price data for a sample of 11 used Orion cars.
Car   Age (yr), x   Price ($100), y
1     5             85
2     4             103
3     6             70
4     5             82
5     5             89
6     5             98
7     6             66
8     6             95
9     2             169
10    7             70
11    7             48
Linear Correlation
• With the last procedure, we assessed fit of a constructed linear equation with a set of (x, y) data points
– Now, we are going to assess the correlation between two variables with a linear correlation coefficient, r
– It is also called the Pearson product moment correlation coefficient
– r will always lie between ‐1 and 1
• Some properties of r:
1. The value of r will reflect slope of the scatterplot
– It is positive when the scatterplot shows a positive slope and negative when the scatterplot shows a negative slope
2. The magnitude of r indicates the strength of the linear relationship
– A value close to ‐1 or 1 indicates a strong relationship and the variable x is a good linear predictor of y
– A value near 0 indicates a weak relationship between x and y
3. The sign of r suggests the type of linear relationship
– A positive value means that y tends to increase as x increases and the tendency is greater as the value approaches 1
– A negative value means that y tends to decrease as x increases and the tendency is greater as the value approaches ‐1
• The formula for r:
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}}$$
• You will notice that for the denominator we have most of the equation for the standard deviation for x and y
• For an example we return to the used Orion car data where x is age of car in years and y is price in units of $100
x    y     (x − x̄)   (y − ȳ)    (x − x̄)(y − ȳ)   (x − x̄)²   (y − ȳ)²
5    85    -0.273    -3.636     0.9926           0.075      13.220
4    103   -1.273    14.364     -18.2854         1.621      206.324
6    70    0.727     -18.636    -13.5484         0.529      347.300
5    82    -0.273    -6.636     1.8116           0.075      44.036
5    89    -0.273    0.364      -0.0994          0.075      0.132
5    98    -0.273    9.364      -2.5564          0.075      87.684
6    66    0.727     -22.636    -16.4564         0.529      512.388
6    95    0.727     6.364      4.6266           0.529      40.500
2    169   -3.273    80.364     -263.0314        10.713     6458.372
7    70    1.727     -18.636    -32.1844         2.983      347.300
7    48    1.727     -40.636    -70.1784         2.983      1651.284
Sum                             -408.9091        20.1818    9708.5455
• The only column we have not computed in the past is column 5
• Now plug sums from table into the equation
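$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} = \frac{-408.909}{\sqrt{20.182}\,\sqrt{9708.545}} \approx \frac{-408.909}{(4.4924)(98.532)} \approx \frac{-408.909}{442.65} \approx -0.924$$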
• Also, the coefficient of determination, r2, equals the square of the linear correlation coefficient, r
For this example: r² = (-0.924)² ≈ 0.854, which agrees with the 0.853 obtained from SSR/SST up to rounding
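The correlation calculation can be sketched directly from the formula for r. The function name `pearson_r` is an illustrative assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Linear correlation coefficient r from sums of deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    return sum_xy / (sqrt(sum_xx) * sqrt(sum_yy))


# Used Orion car data: age in years (x) and price in $100 (y).
age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

r = pearson_r(age, price)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

Squaring the unrounded r gives about 0.853; squaring the rounded value -0.924 gives 0.854, so small differences in the third decimal are only rounding.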
MA 2113 Lecture 6

Equation 1. Formula for r, Pearson correlation coefficient.
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}}$$

Table 1. Orion used car data where x is age of car (years) and y is price (× $100) with columns for calculating r.
x    y     (x − x̄)   (y − ȳ)    (x − x̄)(y − ȳ)   (x − x̄)²   (y − ȳ)²
5    85    -0.273    -3.636     0.9926           0.075      13.220
4    103   -1.273    14.364     -18.2854         1.621      206.324
6    70    0.727     -18.636    -13.5484         0.529      347.300
5    82    -0.273    -6.636     1.8116           0.075      44.036
5    89    -0.273    0.364      -0.0994          0.075      0.132
5    98    -0.273    9.364      -2.5564          0.075      87.684
6    66    0.727     -22.636    -16.4564         0.529      512.388
6    95    0.727     6.364      4.6266           0.529      40.500
2    169   -3.273    80.364     -263.0314        10.713     6458.372
7    70    1.727     -18.636    -32.1844         2.983      347.300
7    48    1.727     -40.636    -70.1784         2.983      1651.284
Sum                             -408.9091        20.1818    9708.5455

Random Variables
• A quantitative variable whose value depends on chance
– As an example, I can ask each student how many siblings they have
• Number of siblings will vary among students and if I select a student at random, the value of the variable is random
• The value depends on chance of which student is selected
– A discrete random variable is a random variable whose possible values can be listed
• Notation is a bit different for random variables vs. variables
– Instead of x, y, and z, we use upper‐case letters X, Y, and Z
– If X is the number of siblings of a student, then P(X=2) is the notation for the probability that a student has 2 siblings
• Like earlier, we can take values we obtain and construct a probability distribution and then graph the info for a probability histogram
– Before, we called it a relative frequency distribution and histogram
– Note that the sum of the probabilities will equal 1.0
• Example: Enrollment data for U.S. public schools by grade level
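| Grade (y) | Freq | P(Y=y) |
|---|---|---|
| 0 | 4,248 | 0.126 |
| 1 | 3,615 | 0.107 |
| 2 | 3,595 | 0.107 |
| 3 | 3,654 | 0.109 |
| 4 | 3,696 | 0.110 |
| 5 | 3,728 | 0.111 |
| 6 | 3,770 | 0.112 |
| 7 | 3,722 | 0.111 |
| 8 | 3,619 | 0.108 |
| Total | 33,647 | 1.000 |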
• What is P(Y=5)?
– 3,728/33,647 = 0.111 so 11.1% of elementary school students in the U.S. are in 5th grade
• A bit more complex example:
– We toss a dime 3 times, giving 8 equally likely outcomes. Our random variable of interest, X, is the total # of heads obtained in 3 tosses
Outcomes: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT

| # of heads (x) | P(X=x) |
|---|---|
| 0 | 0.125 |
| 1 | 0.375 |
| 2 | 0.375 |
| 3 | 0.125 |

• What is P(X=2)? 3 of the 8 outcomes (HHT, HTH, THH) have exactly 2 heads, so P(X=2) = 3/8 = 0.375
• What is P(X≤2)? P(X≤2) = P(X=0) + P(X=1) + P(X=2) = 0.125 + 0.375 + 0.375 = 0.875
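The coin-toss distribution can be rebuilt by listing every outcome and counting heads. This is a minimal sketch; the variable names are chosen here for illustration and assume a fair coin, as in the example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 8 equally likely outcomes of 3 tosses.
outcomes = ["".join(t) for t in product("HT", repeat=3)]

# X = number of heads; count how many outcomes give each value of X.
counts = Counter(o.count("H") for o in outcomes)
dist = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in dist.items():
    print(f"P(X={x}) = {p} = {float(p):.3f}")

# Cumulative probability: add the probabilities for X = 0, 1, 2.
p_at_most_2 = sum(p for x, p in dist.items() if x <= 2)
print(f"P(X<=2) = {float(p_at_most_2):.3f}")
```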
• Now, let’s compare a real example with the probability table we just constructed
– We flipped 3 dimes 1,000 times and recorded results for number of heads we observed
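| # of heads | Freq | Proportion |
|---|---|---|
| 0 | 136 | 0.136 |
| 1 | 377 | 0.377 |
| 2 | 368 | 0.368 |
| 3 | 119 | 0.119 |
| Total | 1000 | 1.00 |

[Figure: histograms comparing Expected Results (probability distribution) with Experimental Results (observed proportions)]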
Mean and standard deviation of a discrete random variable
• Notation for the mean of a sample versus a population
– For a sample: $\bar{x} = \frac{\sum x_i}{n}$
– For a population: $\mu = \frac{\sum x_i}{N}$
• We can express the mean in terms of the probability distribution of X, here the ages of 8 students (treated as a population, so we use μ)
Ages: 19, 21, 20, 27, 20, 20, 19, 21
μ = 167/8 = 20.88
– The probability distribution for X:
| Age (x) | P(X=x) |
|---|---|
| 19 | 0.250 |
| 20 | 0.375 |
| 21 | 0.250 |
| 27 | 0.125 |

μ = [19 ∙ P(X=19)] + [20 ∙ P(X=20)] + [21 ∙ P(X=21)] + [27 ∙ P(X=27)]

| Age (x) | P(X=x) | x ∙ P(X=x) |
|---|---|---|
| 19 | 0.250 | 4.750 |
| 20 | 0.375 | 7.500 |
| 21 | 0.250 | 5.250 |
| 27 | 0.125 | 3.375 |
| Total | 1.000 | 20.88 |

– So, the mean of a discrete random variable is $\mu = \sum x \cdot P(X = x)$
• Standard deviation and variance
– For a sample: variance $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$ and SD $s = \sqrt{s^2}$
– For a population (discrete random variable): variance $\sigma^2 = \sum (x_i - \mu)^2 \cdot P(x_i)$ and SD $\sigma = \sqrt{\sigma^2}$
– For the data for age of 8 students (μ = 20.88):
| Age (x) | P(X=x) | x − μ | (x − μ)² | (x − μ)² ∙ P(x) |
|---|---|---|---|---|
| 19 | 0.250 | −1.88 | 3.534 | 0.8836 |
| 20 | 0.375 | −0.88 | 0.774 | 0.2904 |
| 21 | 0.250 | 0.12 | 0.014 | 0.0036 |
| 27 | 0.125 | 6.12 | 37.454 | 4.6818 |
| Total | 1.000 | | | 5.8594 |

– Using the formula instead of the table:
σ² = [(19 – 20.88)² ∙ 0.250] + [(20 – 20.88)² ∙ 0.375] + [(21 – 20.88)² ∙ 0.250] + [(27 – 20.88)² ∙ 0.125] = 5.86
and σ = √σ² = 2.42
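The same mean, variance, and SD can be computed directly from the probability distribution. This is a small sketch using the age distribution from the example; it uses the unrounded mean (20.875), so intermediate values differ slightly from the table, which rounds μ to 20.88.

```python
import math

# Probability distribution of X = age of a randomly selected student
dist = {19: 0.250, 20: 0.375, 21: 0.250, 27: 0.125}

# Mean: sum of x * P(X = x)
mu = sum(x * p for x, p in dist.items())

# Variance: sum of (x - mu)^2 * P(X = x); SD is its square root
var = sum((x - mu) ** 2 * p for x, p in dist.items())
sigma = math.sqrt(var)

print(f"mu = {mu:.3f}, variance = {var:.3f}, sigma = {sigma:.3f}")
```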
The Normal Distribution
• In life, we deal with a variety of variables and many of them have a common distribution in the shape of a bell‐shaped curve
– We call it a “normal” curve because researchers found that it was a common occurrence for a variable to have this distribution
– If a population variable is normally distributed we say that we have a normally distributed population
– But, in practice we rarely see a distribution that is exactly in this shape so we often say a variable is approximately normally distributed
– A normal distribution is determined by the mean and SD, so we call these measurements the parameters of the normal curve
• Given just these 2 parameters we can graph any normally distributed variable
• Frequency and relative‐frequency distributions for heights of female college students (n = 3,264) at a small mid‐western college
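μ = 64.4, σ = 2.4

| Ht (in) | Freq | Rel Freq |
|---|---|---|
| 56 | 3 | 0.0009 |
| 57 | 6 | 0.0018 |
| 58 | 26 | 0.0080 |
| 59 | 74 | 0.0227 |
| 60 | 147 | 0.0450 |
| 61 | 247 | 0.0757 |
| 62 | 382 | 0.1170 |
| 63 | 483 | 0.1480 |
| 64 | 559 | 0.1713 |
| 65 | 514 | 0.1575 |
| 66 | 359 | 0.1100 |
| 67 | 240 | 0.0735 |
| 68 | 122 | 0.0374 |
| 69 | 65 | 0.0199 |
| 70 | 24 | 0.0074 |
| 71 | 7 | 0.0021 |
| 72 | 5 | 0.0015 |
| 73 | 1 | 0.0003 |
| Total | 3264 | 1.0000 |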
• The relative frequency distribution graph of the data and the normal curve with parameters μ = 64.4 and σ = 2.4
[Figure: relative‐frequency histogram of height, Ht (in), with the normal curve overlaid; vertical axis Rel Freq from 0 to 0.18]
• Remember, when we add up all proportions of the relative freq bars we get 1.0
– The same applies to the area under the curve, area equals 1.0
• Equation for the distribution of a Normal Random Variable x:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(1/2)\left[(x-\mu)/\sigma\right]^2}$$
– μ = mean of the normal random variable x
– σ = standard deviation
– π = 3.141…
– e = 2.718…
For the prior example on women’s height:

| x | f(x) |
|---|---|
| 56 | 0.000364 |
| 57 | 0.001433 |
| 58 | 0.004748 |
| 59 | 0.013225 |
| 60 | 0.030963 |
| 61 | 0.060939 |
| 62 | 0.100821 |
| 63 | 0.14022 |
| 64 | 0.163933 |
| 65 | 0.161112 |
| 66 | 0.133103 |
| 67 | 0.092438 |
| 68 | 0.053966 |
| 69 | 0.026484 |
| 70 | 0.010926 |
| 71 | 0.003789 |
| 72 | 0.001105 |
| 73 | 0.000271 |
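The f(x) column can be reproduced by coding the normal density directly. This is a minimal sketch; the function name `normal_pdf` is an illustrative choice, and the parameters μ = 64.4 and σ = 2.4 are the ones given for the height data.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and SD sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

MU, SIGMA = 64.4, 2.4  # parameters from the women's height example
for height in (60, 64, 67):
    print(height, round(normal_pdf(height, MU, SIGMA), 6))
```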
• Now for example, what is P(X=67)?
– According to relative freq, it is 0.0735 or 7.35%, the cross‐hatched area of the bar
– If we superimpose the normal curve over the actual distribution we find that the blue‐shaded area approximates the area of the cross‐hatched bar
Standardizing a Normally Distributed Variable
• Now, how do we find areas under a normal curve?
– We would need a table of areas for each conceivable normal curve (all σ and μ), an infinite number
– So we standardize, or transform, every normal distribution into one in particular, the Standard Normal distribution
• This distribution will have mean of 0 and a SD of 1.0
• We will do this by transforming our observed variables into z‐scores
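$$z = \frac{x-\mu}{\sigma}$$
• For a series of numbers x = −1, 3, 3, 3, 5, 5:
μ = 3.0 and σ = 2
$$z = \frac{x-\mu}{\sigma} = \frac{x-3}{2}, \qquad z_1 = \frac{-1-3}{2} = -2, \quad z_2 = \frac{3-3}{2} = 0, \text{ etc.}$$

| x | z |
|---|---|
| −1 | −2 |
| 3 | 0 |
| 3 | 0 |
| 3 | 0 |
| 5 | 1 |
| 5 | 1 |

• Now, treat z as any variable when you compute mean and SD and you will always get μ = 0 and σ = 1
$$\mu_z = \frac{\sum z_i}{N} = \frac{-2+0+0+0+1+1}{6} = 0$$
• For SD of z:
$$\sigma_z = \sqrt{\frac{\sum (z_i-\mu_z)^2}{N}} = \sqrt{\frac{6}{6}} = 1$$
In the numerator, (−2 − 0)² + (0 − 0)² + … + (1 − 0)² = 6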
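The standardization can be verified numerically. This is a short sketch using the six values from the example; `pstdev` is the population SD (dividing by N), which matches the formula used here.

```python
import statistics

x = [-1, 3, 3, 3, 5, 5]
mu = statistics.fmean(x)
sigma = statistics.pstdev(x)  # population SD (divides by N)

# Convert every observation to a z-score
z = [(xi - mu) / sigma for xi in x]

print("z-scores:", z)
print("mean of z:", statistics.fmean(z))
print("SD of z:", statistics.pstdev(z))
```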
• For our example normal curves earlier:
• For practical use, this tells us that if we have any variable (x) that is normally distributed with mean (μ) and SD (σ) we can use the z‐conversion to find areas of interest under the standardized curve
– We will use a standard normal table to find values for area
– If we want the area between point a and b (real numbers) where a<b:
• We can find the % of all possible observations of x that lie between a and b by calculating (a‐μ)/σ and (b‐μ)/σ and looking up the values in our z‐table
• For our example of heights of college women:
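– What is the probability of selecting a woman who is ≤ 67 inches in height (that is, under 68 inches)?
– From the relative frequencies: 0.0009 + … + 0.0735 = 0.9314
– or from the frequencies: (3 + … + 240)/3,264 = 3,040/3,264 = 0.9314
– or from the normal curve, using 68 as the upper boundary: z = (x − μ)/σ = (68 − 64.4)/2.4 = 1.5
• In the z‐table, 1.5 corresponds to an area of 0.9332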
• We have found the area under the normal curve to the left of 68 inches: about 93.3% of observations are below 68 inches
[Figure: normal curve plotted over heights from 55 to 75 inches; vertical axis from 0 to 0.18]
Finding areas under the Normal Curve
• Properties of the Standard Normal Curve (SNC)
– Total area under the curve is 1.0
– It is symmetric about 0
– Almost all area under the curve lies between ‐3 and 3
• Now, we will calculate areas under the curve under various scenarios:
– To the right
– To the left
– Between values
1st curve: simply the area to left of z
2nd curve: 1 – (area to left of z)
3rd curve: (area to left of z2) – (area to left of z1)
• Examples:
For z = 0.76
For z = 1.23
For z = 1.82
For z = −0.68
• Examples with real data
– Intelligence quotients (IQs) are normally distributed with μ = 100 and σ = 16
– What is the % of people who have IQs between 115 and 140?
• For x = 115, z = (115 – μ)/σ = (115 – 100)/16 = 0.94
• For x = 140, z = (140 – μ)/σ = (140 – 100)/16 = 2.50
• Area to left of 0.94 (from z‐table) = 0.8264
• Area to left of 2.50 = 0.9938
• So, 0.9938 – 0.8264 = 0.1674
• And, we can say that 16.74% of people have IQ between 115 and 140
• We can also ask about a certain area, find a z‐score, then calculate a value (reversing the process we just covered)
– What is the IQ that is the cutoff for the top 10% for all people?
• In the z‐table, we see that 0.8997 is the value closest to 0.90
• That corresponds to a z‐score = 1.28
z = 1.28 = (x – 100)/16
Multiply both sides by SD: 1.28 ∙ 16 = x – 100
20.48 = x – 100
Add mean to both sides: 100 + 20.48 = x
x = 120.48
• So, 90% of people have IQs below 120.48 and 10% have higher IQs
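Both directions of the IQ example (value to area, and area to value) can be sketched with Python's `statistics.NormalDist`, which stands in for the z‐table here. Because it does not round z to two decimals, its answers differ slightly from the table-based ones (about 0.168 instead of 0.1674, and about 120.5 instead of 120.48).

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)

# Area between 115 and 140: area left of 140 minus area left of 115
p_between = iq.cdf(140) - iq.cdf(115)

# Cutoff for the top 10%: the value with 90% of the area to its left
cutoff = iq.inv_cdf(0.90)

print(f"P(115 < X < 140) = {p_between:.4f}")
print(f"top-10% cutoff = {cutoff:.2f}")
```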
• Determine the area under the SNC that lies to the left of: 2.24, −1.56, 0, −0.5
• Determine the area under the SNC that lies to the right of: −1.07, 0.6, 1.8
• Determine the area under the SNC that lies between:
– −2.18 and 1.44
– −2 and −1.5
– 0.59 and 1.51
• Find the z‐score for which the area under the SNC to its left is 0.025 or 2.5%
• Find the z‐score that has area of 0.70 to its right
• Using the same heights of female college students (n = 3,264; μ = 64.4, σ = 2.4) and the frequency table given earlier:
– What is the % of female students with heights between 65 and 70 inches?
z2 = (70 – 64.4)/2.4 = 2.3
z1 = (65 – 64.4)/2.4 = 0.25
Area to left of z2 = 0.9893
Area to left of z1 = 0.5987
• Area between = 0.9893 – 0.5987 = 0.3906 or 39.1% of female students will have heights 65‐70 inches
• In the table, relative frequency between 65‐70 inches = 0.1575 + 0.1100 + … + 0.0074 = 0.4056 or 40.6%
Sample Distribution and Sampling Error
• We’ve talked about how much time and money we can save by taking a sample from a large population instead of census
– But a sample guarantees a certain amount of sampling error (x̄ and s will generally not equal the population mean and SD)
• If we are sampling from a normal population, we can expect that our sample will be normally distributed and that x̄ will be close to μ
– For a variable x and a given sample size, the distribution of the variable x̄ is called the sampling distribution of the sample mean
• Example: Population is 5 starting players on a men’s basketball team, players A, B, C, D, and E
| Player | A | B | C | D | E |
|---|---|---|---|---|---|
| Ht (in) | 76 | 78 | 79 | 81 | 86 |

$$\mu = \frac{\sum x_i}{N} = \frac{76 + 78 + 79 + 81 + 86}{5} = 80.0$$
– If we take a random sample of 2 players, we have the following sampling distribution of x̄ with 10 possible combinations

| Sample | Hts | x̄ |
|---|---|---|
| A,B | 76,78 | 77 |
| A,C | 76,79 | 77.5 |
| A,D | 76,81 | 78.5 |
| A,E | 76,86 | 81 |
| B,C | 78,79 | 78.5 |
| B,D | 78,81 | 79.5 |
| B,E | 78,86 | 82 |
| C,D | 79,81 | 80 |
| C,E | 79,86 | 82.5 |
| D,E | 81,86 | 83.5 |
• We see that the mean height of 2 players isn’t likely to equal μ = 80.0; only 1 of the 10 sample means equals 80
– So, we can say that there is a 1/10 = 10% chance that x̄ = μ
[Figure: dotplot of the 10 sample means on an axis from 76 to 84]
• Also, 3/10 samples have means within 1 inch of μ, so we can say that the probability is 0.3 or there is a 30% chance that a sample mean will be within 1 inch of μ
• Now, let’s choose 4 players at random (only 5 possibilities):
| Sample | x̄ |
|---|---|
| A,B,C,D | 78.50 |
| A,B,C,E | 79.75 |
| A,B,D,E | 80.25 |
| A,C,D,E | 80.50 |
| B,C,D,E | 81.00 |

• A graph for the distribution of sample means:
[Figure: dotplot of the 5 sample means on an axis from 76.00 to 84.00]
– Now, none of the sample means equal μ, but 4/5 or 80% are within 1 inch of μ, or P(79.0 ≤ x̄ ≤ 81.0) = 0.8
– Graphs of sampling distributions as sample size increases
[Figure: dotplots of the sampling distribution of x̄ for n = 1, n = 2, n = 4, and n = 5, each on an axis from 76 to 86]
• This demonstrates that sampling error tends to be smaller for larger samples
– This is what we see with our basketball player example
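| Sample size (n) | # possible samples | # within 1" of μ | % within 1" of μ | # within 0.5" of μ | % within 0.5" of μ |
|---|---|---|---|---|---|
| 1 | 5 | 2 | 40% | 0 | 0% |
| 2 | 10 | 3 | 30% | 2 | 20% |
| 3 | 10 | 5 | 50% | 2 | 20% |
| 4 | 5 | 4 | 80% | 3 | 60% |
| 5 | 1 | 1 | 100% | 1 | 100% |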
• There is a simple relationship between the mean of the variable x̄ and the population mean, μ:
$$\mu_{\bar{x}} = \mu$$
• This means that if we take all possible x̄ for any sample size and take their mean ($\mu_{\bar{x}}$, the mean of the sampling distribution), we will get μ, the mean of the entire population
– For our basketball example for sample size 4:
$$\mu_{\bar{x}} = \frac{78.50 + 79.75 + 80.25 + 80.50 + 81.00}{5} = 80.0$$
• There is also a relationship between the standard deviation (SD) of the variable x̄ and the population SD, σ:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
• For our basketball example for sample size 4:
$$\sigma_{\bar{x}} = \sqrt{\frac{(78.50-80)^2 + (79.75-80)^2 + (80.25-80)^2 + (80.50-80)^2 + (81.00-80)^2}{5}}$$
$$\sigma_{\bar{x}} = \sqrt{\frac{2.25 + 0.0625 + 0.0625 + 0.25 + 1}{5}} = \sqrt{\frac{3.625}{5}} = 0.85$$
– Note: when sampling is done without replacement from a finite population (as in the basketball example), the formula $\sigma_{\bar{x}} = \sigma/\sqrt{n}$ will not give the exact value of $\sigma_{\bar{x}}$ (here σ/√n = 3.41/√4 ≈ 1.70, while the exact value is 0.85)
• For all possible sample sizes:
| Sample size (n) | SD of x̄ |
|---|---|
| 1 | 3.41 |
| 2 | 2.09 |
| 3 | 1.39 |
| 4 | 0.85 |
| 5 | 0 |
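The basketball tables can be regenerated by listing every possible sample of each size. This is a minimal sketch; the helper name `sample_means` is made up for illustration, and samples are drawn without replacement, as in the example.

```python
from itertools import combinations
from statistics import fmean, pstdev

heights = {"A": 76, "B": 78, "C": 79, "D": 81, "E": 86}
mu = fmean(heights.values())  # population mean, 80.0

def sample_means(n):
    """Means of every possible sample of size n drawn without replacement."""
    return [fmean(s) for s in combinations(heights.values(), n)]

print("n  #samples  #within 1 in  mean of x-bar  SD of x-bar")
for n in range(1, 6):
    means = sample_means(n)
    within_1 = sum(abs(m - mu) <= 1 for m in means)
    print(n, len(means), within_1, round(fmean(means), 2), round(pstdev(means), 2))
```

For every sample size the mean of the sample means comes out at 80.0, while the SD of x̄ shrinks as n grows.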
• Example from U.S. Census Bureau
– Mean living space, μ, for a single‐family detached home is 1,742 sq. ft. and SD, σ, is 568 sq. ft.
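a) For a sample of 25 single‐family homes, what are the mean and SD of the variable x̄?
$$\mu_{\bar{x}} = \mu = 1{,}742$$
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{568}{\sqrt{25}} = \frac{568}{5} = 113.6$$
b) For a sample size of 500?
$$\mu_{\bar{x}} = \mu = 1{,}742$$
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{568}{\sqrt{500}} = 25.4$$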
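The effect of sample size on the SD of x̄ in the living-space example can be checked in a few lines. This is a small sketch; the function name `standard_error` is an illustrative choice for σ/√n.

```python
import math

def standard_error(sigma, n):
    """SD of the sample mean, sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

SIGMA = 568  # population SD of living space (sq. ft.)
for n in (25, 500):
    print(f"n = {n}: SD of x-bar = {standard_error(SIGMA, n):.1f}")
```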
Confidence Intervals for a Population Mean
• Remember, when we get a sample mean (x̄), we are getting an estimate of the population mean (μ), which we will call a point estimate
– A sample mean is usually not equal to the population mean because we will have sampling error
– Now, we will attach information to the estimate that will indicate the accuracy of that estimate, a confidence interval estimate for μ, or CI
– The more information we have for our sample (greater sample size) the more confident we will be that we are close to μ
• We should now know how to compute areas under the standard normal curve (SNC) between two critical values
– You should be able to do this whether you start with z‐scores or values of x
– From lecture 9:
– Computing a 95% CI is similar to finding the critical values that are the boundaries for the middle 95% of data for a population (as opposed to a sample)
• The difference is we are computing critical z‐values that are the boundaries for a CI surrounding our sample mean
– In other words, how confident are we that the population mean (μ) lies within the CI we have constructed around our sample mean (x̄)?
– If we use calculations for a 95% CI then we are 95% confident that μ is within the CI
– An increase in sample size will lead to a narrower CI surrounding x̄
• For a 95% CI, the critical z‐values will always be ‐1.96 to the left and 1.96 to the right
– 1 – 0.95 = 0.05, and the area of 0.05 is divided by 2 (for right and left side), so 0.05/2 = 0.0250
– In z‐table, the z‐value of ‐1.96 corresponds to the area under the SNC of 0.0250
– The area of equal size to the right is always the same z‐value but positive
• These critical values of z are given a special notation depending on our desired level of confidence (CL)
– We need to write the CL in the form of 1 – α, where α is the number that must be subtracted from 1 to get the CL
– If we want a 95% CI, 1 – 0.95 = 0.05 = α and the associated z‐value is $z_{0.05}$
– But, because we want to split the area between the left and right sides, we want $z_{\alpha/2}$, or $z_{0.05/2} = z_{0.025}$
– This gives us the formula: α/2 = (1 – CL)/2
• For a CL of 90%: α/2 = (1 – 0.90)/2 = 0.05
– So, we need critical cut‐off values of z0.05
– In the z‐table, the area under the SNC of 0.05 corresponds to a z‐value of ‐1.64
– In most cases, we are interested in CL’s of 90%, 95%, and 99% (z = 1.64, 1.96, and 2.57, respectively)
Confidence Intervals with known σ
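• If we know the population standard deviation (σ), the following formulas will give us the CI surrounding x̄:
$$\text{lower limit: } \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \qquad \text{and} \qquad \text{upper limit: } \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
or, more compactly, $\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$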
• Example: Take a sample of verbal SAT scores from n = 40 high school students. We know the population SD is σ = 83.8 and our sample mean is x̄ = 450.5
– Construct a 95% CI
– First we need critical values for –zα/2 and zα/2
$$\alpha/2 = (1 - CL)/2 = (1 - 0.95)/2 = 0.025, \qquad z_{0.025} = 1.96$$
– For our 95% CI:
$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 450.5 - 1.96 \times \frac{83.8}{\sqrt{40}} = 450.5 - 1.96 \times 13.25 = 450.5 - 25.97 = 424.5$$
$$\bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 450.5 + 1.96 \times \frac{83.8}{\sqrt{40}} = 450.5 + 1.96 \times 13.25 = 450.5 + 25.97 = 476.5$$
– This is interpreted as, we are 95% confident that the population mean (μ) lies within the range of 424.5 – 476.5
– This also implies that there is a 5% chance that μ does not fall into this interval
– Actually, μ = 456.7 from a population of 728 students
– Graphically, the 95% CI would appear like this:
[Figure: number line from 420 to 480 showing the 95% CI with x̄ and μ marked]
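The known‐σ interval for the SAT example can be sketched as a small function. The function name and signature are illustrative assumptions; the critical value comes from `statistics.NormalDist` rather than a printed z‐table, so it is 1.95996… instead of the rounded 1.96.

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """CI for the population mean when sigma is known."""
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    margin = z_crit * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Verbal SAT example: x-bar = 450.5, sigma = 83.8, n = 40
lower, upper = z_confidence_interval(450.5, 83.8, 40)
print(f"95% CI: {lower:.1f} to {upper:.1f}")
```

Passing `confidence=0.90` or `confidence=0.99` gives the other commonly used intervals.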
Confidence Intervals with unknown σ
• This is very similar to CI with known σ, but there are 2 things we must do differently
1. First, we have to estimate σ using our sample data to calculate a sample SD (s)
2. Next, we can’t use the z‐table (that’s only for large samples), so we use a t‐table (for small samples)
• The numbers in the body (middle) of the table represent an estimation of z‐scores but for smaller samples
• To get values of t, we need 2 things: α and degrees of freedom (df) which is n – 1 (n = sample size)
• In the example above, n = 14 so df = 13, and if we’re looking for the critical value of tα, we look for t0.05 across the top of the table and df along the side to get 1.771
• We find α/2 with the same equation: α/2 = (1 – CL)/2
• The equations to calculate CI’s are solved in the same way, but we use a t‐value instead of a z‐value and the sample SD (s) instead of σ
$$\bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}} \qquad \text{and} \qquad \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}$$
• We will use the same data set for verbal SAT scores for an example (sample of 40 scores)
654, 573, 596, 459, 374, 440, 460, 368, 582, 459
324, 430, 519, 340, 553, 464, 337, 381, 427, 484
292, 590, 368, 334, 456, 371, 418, 331, 463, 537
308, 372, 544, 579, 412, 399, 490, 406, 653, 472
– From the beginning of the semester, we used these formulas to calculate s:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}, \qquad s = \sqrt{s^2}$$
– The sample SD from this sample data set is s = 96.9 and the sample mean is the same, x̄ = 450.5
– The t‐value for a 95% CI is $t_{0.025}$ (remember α/2) with df = 39, so critical $t_{0.025}$ = 2.023
– We put these values into the new equations:
$$\bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}} = 450.5 - 2.02 \times \frac{96.9}{\sqrt{40}} = 450.5 - 2.02 \times 15.33 = 450.5 - 30.97 = 419.5$$
$$\bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}} = 450.5 + 2.02 \times \frac{96.9}{\sqrt{40}} = 450.5 + 2.02 \times 15.33 = 450.5 + 30.97 = 481.5$$
– This has the same interpretation: we are 95% confident that the population mean (μ) lies within the range of 419.5 – 481.5
– But, notice the 95% CI is wider using a sample SD
– Remember, μ = 456.7 from a population of 728 students
– Graphically, let’s compare the 95% CI with and without knowing σ:
[Figure: two number lines from 415 to 485, one showing the 95% CI with σ (424.5 to 476.5) and one showing the 95% CI with s (419.5 to 481.5)]
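The unknown‐σ interval can be recomputed from the 40 raw scores. This is a minimal sketch: Python's standard library has no t‐distribution, so the critical value 2.023 (df = 39) is simply copied from the t‐table value quoted in the example, and the variable names are illustrative. Small differences from the hand calculation come from rounding.

```python
import math
import statistics

# The 40 verbal SAT scores from the example
scores = [
    654, 573, 596, 459, 374, 440, 460, 368, 582, 459,
    324, 430, 519, 340, 553, 464, 337, 381, 427, 484,
    292, 590, 368, 334, 456, 371, 418, 331, 463, 537,
    308, 372, 544, 579, 412, 399, 490, 406, 653, 472,
]
T_CRIT = 2.023  # t_{0.025} with df = 39, taken from the t-table (assumed, not computed)

n = len(scores)
x_bar = statistics.fmean(scores)
s = statistics.stdev(scores)  # sample SD (divides by n - 1)
margin = T_CRIT * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(f"x-bar = {x_bar:.1f}, s = {s:.1f}")
print(f"95% CI: {lower:.1f} to {upper:.1f}")
```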
Hypothesis Testing
• One of the primary uses of calculated statistics is to make decisions about the value of a parameter (inferential stats)
– Examples:
• Is the mean weight, μ, of bags of pretzels produced by a company really 454 grams as advertised?
• Has the mean age of all cars in use increased from the 1995 mean of 8.5 years?
• Has the mean weight of white‐tail deer does in a particular area of Mississippi increased or decreased from year to year?
• We can attempt to answer these questions by setting up a null hypothesis, H0, to test
– If we have a null hypothesis, we need at least one alternative hypothesis, H1 or Ha, to test against
– For the pretzel example:
H0: mean wt. of bags is 454 g; H0: μ = 454 g
Ha: mean wt. of bags is not 454 g; Ha: μ ≠ 454 g
– We call this a two‐tailed test because Ha can be > or <
– There are also one‐tailed tests where Ha is either > or <
• We will concentrate on two‐tailed tests
– We reject our H0 if our calculated z‐value falls out of the “do not reject region” of the standard normal curve (SNC)
– Notice that the two‐tailed SNC above is identical to the CI curve we discussed in Lectures 9 and 10 (below)
• For a hypothesis test, we call the associated α value the significance level
– Example: If we want to be 95% confident that our sample mean, x̄, is or is not equal to our hypothesis mean, μ, we use a significance level of α = 0.05
– Recall from Lecture 9, the associated critical z‐value for α = 0.05 with two‐tails (so α/2) is ± 1.96
• Hypothesis test for population mean with known σ
– The equation to calculate the z‐value (the test statistic that we compare with the critical z‐values) is:
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
• For an example we will use the company producing the bags of pretzels at 454 g each
– For quality control, we took 25 bags at random from the production line to test our hypothesis, H0: μ = 454 g
465, 456, 438, 454, 447
449, 442, 449, 446, 447
468, 433, 454, 463, 450
446, 447, 456, 452, 444
447, 456, 456, 435, 450
– We will assume that population standard deviation, σ, is 7.8 g
– We wish to be 95% sure that our hypothesis is correct so we will use a significance level of 5% (α = 0.05)
• So, our critical z‐values are ± 1.96
• To test our results, if our calculated z‐value is < ‐1.96 or >1.96 then we are 95% sure that our population mean, μ, is not 454 g as we thought
• Our sample mean, x̄, is 450.0 (from sample of 25)
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{450.0 - 454.0}{7.8/\sqrt{25}} = \frac{-4}{1.56} = -2.56$$
– Because our calculated z‐value of ‐2.56 is < ‐1.96, we reject our null hypothesis (H0) and conclude that we are 95% sure that the population mean for weight of our bags of pretzels is not 454 g
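[Figure: standard normal curve with critical values −1.96 and 1.96 marked; the calculated z = −2.56 lies to the left of −1.96]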
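The pretzel test can be run end to end from the 25 weights. This is a sketch under the example's assumptions (σ = 7.8 g known, α = 0.05, two‐tailed); the critical value is taken from `statistics.NormalDist` in place of the z‐table, and the variable names are illustrative.

```python
import math
from statistics import NormalDist, fmean

# Weights (g) of the 25 sampled bags
weights = [
    465, 456, 438, 454, 447, 449, 442, 449, 446, 447,
    468, 433, 454, 463, 450, 446, 447, 456, 452, 444,
    447, 456, 456, 435, 450,
]
MU_0, SIGMA, ALPHA = 454, 7.8, 0.05  # hypothesized mean, known SD, significance level

x_bar = fmean(weights)
z = (x_bar - MU_0) / (SIGMA / math.sqrt(len(weights)))
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)  # about 1.96 for a two-tailed test

reject = abs(z) > z_crit
print(f"x-bar = {x_bar:.1f}, z = {z:.2f}, critical = +/-{z_crit:.2f}, reject H0: {reject}")
```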
• Hypothesis test for population mean with unknown σ
– The equation to calculate the t‐value (the test statistic that we compare with the critical t‐values) is:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
• Example: In 2002, the average consumer spent $1,749 on apparel and services (U.S. Bureau of Labor Statistics). That same year, a sample of 25 consumers in the Northeast spent an average of $1,935.76 on apparel and services with a SD = $350.90.
– At the 5% significance level, do the data provide sufficient evidence to conclude that the 2002 mean annual expenditure on apparel and services in the Northeast differed from the national mean of $1,749?
H0: μ = $1,749
Ha: μ ≠ $1,749
– We wish to be 95% sure that our hypothesis is correct so we will use a significance level of 5% (α = 0.05)
• So, our critical t‐values are ± 2.064 (df = 24, t0.025)
• To test our results, if our calculated t‐value is < ‐2.064 or >2.064 then we are 95% sure that our Northeastern population mean, μ, is not $1,749 as we hypothesized
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{1935.76 - 1749.00}{350.90/\sqrt{25}} = \frac{186.76}{70.18} = 2.66$$
– Because our calculated t‐value (2.66) is > 2.064, we reject our null hypothesis (H0) and conclude that we are 95% sure that the population mean annual expenditure on apparel and services in the Northeast is not $1,749
[Figure: t curve on an axis from −3 to 3 with critical values −2.064 and 2.064 marked; the calculated t = 2.66 lies to the right of 2.064]
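The t‐test arithmetic for the apparel example can be checked from the summary statistics. This is a minimal sketch: the critical value 2.064 (df = 24) is copied from the t‐table value given in the example because the standard library has no t‐distribution, and the variable names are illustrative.

```python
import math

# Summary statistics from the Northeast apparel example
x_bar, mu_0, s, n = 1935.76, 1749.00, 350.90, 25
T_CRIT = 2.064  # t_{0.025} with df = 24, taken from the t-table (assumed, not computed)

t = (x_bar - mu_0) / (s / math.sqrt(n))
reject = abs(t) > T_CRIT  # two-tailed test at alpha = 0.05

print(f"t = {t:.2f}, critical = +/-{T_CRIT}, reject H0: {reject}")
```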