Published on 18th October 2019

Cal State Northridge, 427, Ainsworth

Correlation and Regression

Major Points - Correlation

- Questions answered by correlation
- Scatterplots
- An example
- The correlation coefficient
- Other kinds of correlations
- Factors affecting correlations
- Testing for significance

The Question

- Are two variables related?
  - Does one increase as the other increases? (e.g. skills and income)
  - Does one decrease as the other increases? (e.g. health problems and nutrition)
- How can we get a numerical measure of the degree of relationship?

Scatterplots

- AKA scatter diagram or scattergram.
- Graphically depicts the relationship between two variables in two-dimensional space.

Direct Relationship

Inverse Relationship

An Example

- Does smoking cigarettes increase systolic blood pressure?
- Plotting number of cigarettes smoked per day against systolic blood pressure
  - Fairly moderate relationship
  - Relationship is positive

Trend?

Smoking and BP

- Note relationship is moderate, but real.
- Why do we care about relationship?
  - What would we conclude if there were no relationship?
  - What if the relationship were near perfect?
  - What if the relationship were negative?

Heart Disease and Cigarettes

- Data on heart disease and cigarette smoking in 21 developed countries (Landwehr and Watkins, 1987)
- Data have been rounded for computational convenience.
  - The results were not affected.

The Data

Surprisingly, the U.S. is the first country on the list, the country with the highest consumption and highest mortality.

Scatterplot of Heart Disease

- CHD Mortality goes on ordinate (Y axis)
  - Why?
- Cigarette consumption on abscissa (X axis)
  - Why?
- What does each dot represent?
- Best fitting line included for clarity

{X=6, Y= 11}

What Does the Scatterplot Show?

- As smoking increases, so does coronary heart disease mortality.
- Relationship looks strong
- Not all data points on line.
  - This gives us “residuals” or “errors of prediction”
  - To be discussed later

Correlation

- Co-relation
- The relationship between two variables
- Measured with a correlation coefficient
- Most popularly seen correlation coefficient: Pearson Product-Moment Correlation

Types of Correlation

- Positive correlation
  - High values of X tend to be associated with high values of Y.
  - As X increases, Y increases
- Negative correlation
  - High values of X tend to be associated with low values of Y.
  - As X increases, Y decreases
- No correlation
  - No consistent tendency for values on Y to increase or decrease as X increases

Correlation Coefficient

- A measure of degree of relationship.
- Between 1 and -1
- Sign refers to direction.
- Based on covariance
  - Measure of degree to which large scores on X go with large scores on Y, and small scores on X go with small scores on Y
  - Think of it as variance, but with 2 variables instead of 1 (What does that mean??)

Covariance

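- Remember that variance is:

$$s_X^2 = \frac{\sum (X - \bar{X})^2}{N - 1} = \frac{\sum (X - \bar{X})(X - \bar{X})}{N - 1}$$

- The formula for covariance is:

$$cov_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$$

- How does this work, and why?
- When would cov_XY be large and positive? Large and negative?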
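A minimal Python sketch of the covariance idea, assuming the sample (N - 1) form. The five data pairs are hypothetical, chosen only so the arithmetic is easy to follow; they are not the slide data.

```python
def covariance(xs, ys):
    """Sample covariance: sum of cross-products of deviations, over N - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross_products / (n - 1)

# Hypothetical data (not from the slides).
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]      # large X tends to go with large Y
y_down = [5, 4, 5, 4, 2]    # large X tends to go with small Y

cov_up = covariance(x, y_up)      # positive
cov_down = covariance(x, y_down)  # negative
var_x = covariance(x, x)          # covariance of X with itself is the variance of X

print(cov_up, cov_down, var_x)
```

The last line of the data section shows the "variance with 2 variables" idea: replacing Y with X turns the covariance formula back into the variance formula.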

Example

What the heck is a covariance? I thought we were talking about correlation?

Correlation Coefficient

- Pearson’s Product Moment Correlation
- Symbolized by r
- Covariance ÷ (product of the 2 SDs)

$$r = \frac{cov_{XY}}{s_X s_Y}$$

- Correlation is a standardized covariance

Calculation for Example

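- cov_XY = 11.12
- s_X = 2.33
- s_Y = 6.69

$$r = \frac{cov_{XY}}{s_X s_Y} = \frac{11.12}{(2.33)(6.69)} = \frac{11.12}{15.59} = .713$$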

Example

- Correlation = .713
- Sign is positive
  - Why?
- If sign were negative
  - What would it mean?
  - Would not alter the degree of relationship.
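The calculation can be checked in a few lines of Python. This is only a sketch using the three summary values quoted on the slide; the variable names are illustrative.

```python
# Summary values quoted on the slide.
cov_xy = 11.12   # covariance of cigarette consumption and CHD mortality
s_x = 2.33       # standard deviation of cigarette consumption
s_y = 6.69       # standard deviation of CHD mortality

# Correlation is the covariance divided by the product of the two SDs.
r = cov_xy / (s_x * s_y)

# A negative covariance would flip the sign but not the degree of relationship.
r_if_negative = -cov_xy / (s_x * s_y)

print(round(r, 3), round(r_if_negative, 3))
```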

Other calculations

- Z-score method
- Computational (Raw Score) Method
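The slide's formulas did not survive extraction, so the sketch below uses the standard textbook forms of the two methods, which is an assumption about what the slide showed: the z-score method averages the products of z-scores over N - 1, and the raw-score method works from sums of X, Y, XY, X² and Y². The data are hypothetical.

```python
import math
import statistics

def r_zscore(xs, ys):
    """Z-score method: sum of z_x * z_y, divided by N - 1."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs (N - 1)
    products = (((x - mean_x) / s_x) * ((y - mean_y) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

def r_raw_score(xs, ys):
    """Computational (raw score) method: uses only sums of raw scores."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_z = r_zscore(x, y)
r_raw = r_raw_score(x, y)
print(round(r_z, 4), round(r_raw, 4))
```

Both routes give the same r; they differ only in how much intermediate work is done by hand.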

Other Kinds of Correlation

- Spearman Rank-Order Correlation Coefficient (r_sp)
  - used with 2 ranked/ordinal variables
  - uses the same Pearson formula
- Point biserial correlation coefficient (r_pb)
  - used with one continuous scale and one nominal or ordinal or dichotomous scale.
  - uses the same Pearson formula
- Phi coefficient (φ)
  - used with two dichotomous scales.
  - uses the same Pearson formula
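To illustrate "uses the same Pearson formula", here is a small sketch in which one Pearson function is applied to ranks (Spearman) and to 0/1 codes (phi). The data are hypothetical, and the `ranks` helper assumes there are no tied values.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def ranks(values):
    """Rank from 1 (smallest). Assumes no ties, to keep the sketch short."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

# Hypothetical scores: Y always rises with X, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
r_pearson = pearson_r(x, y)                  # below 1: relationship is not linear
r_spearman = pearson_r(ranks(x), ranks(y))   # 1: the rank orders agree perfectly

# Phi: the same formula on two dichotomous variables coded 0/1.
a = [0, 0, 1, 1, 1, 0]
b = [0, 0, 1, 1, 0, 1]
phi = pearson_r(a, b)

print(round(r_pearson, 3), r_spearman, round(phi, 3))
```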

Factors Affecting r

- Range restrictions
  - Looking at only a small portion of the total scatter plot (looking at a smaller portion of the scores’ variability) decreases r.
  - Reducing variability reduces r
- Nonlinearity
  - The Pearson r (and its relatives) measure the degree of linear relationship between two variables
  - If a strong non-linear relationship exists, r will provide a low, or at least inaccurate, measure of the true relationship.
- Heterogeneous subsamples
  - Everyday examples (e.g. height and weight using both men and women)
- Outliers
  - Overestimate Correlation
  - Underestimate Correlation
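Two of these factors are easy to demonstrate with made-up numbers. The sketch below uses hypothetical data built so the effects are visible: restricting the range of X lowers r, and a perfect but curved relationship can give r = 0.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Range restriction: hypothetical data where Y follows X with some scatter.
x_full = list(range(1, 11))
y_full = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
r_full = pearson_r(x_full, y_full)

# Keep only the cases with X <= 4 (a truncated range).
kept = [(x, y) for x, y in zip(x_full, y_full) if x <= 4]
r_restricted = pearson_r([x for x, _ in kept], [y for _, y in kept])

# Nonlinearity: Y is completely determined by X, but the relation is a curve.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [x ** 2 for x in x_curve]
r_curve = pearson_r(x_curve, y_curve)

print(round(r_full, 3), round(r_restricted, 3), r_curve)
```

The scatter around the trend is the same size in both the full and the restricted data; only the spread of X has changed.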

[Figure: Countries With Low Consumptions. Data with restricted range, truncated at 5 cigarettes per day. X axis: Cigarette Consumption per Adult per Day; Y axis: CHD Mortality per 10,000.]

Truncation

Non-linearity

Heterogeneous samples

Outliers

Testing Correlations

- So you have a correlation. Now what?
- In terms of magnitude, how big is big?
  - Small correlations in large samples are “big.”
  - Large correlations in small samples aren’t always “big.”
- Depends upon the magnitude of the correlation coefficient AND the size of your sample.

Testing r

- Population parameter = ρ
- Null hypothesis H0: ρ = 0
  - Test of linear independence
  - What would a true null mean here?
  - What would a false null mean here?
- Alternative hypothesis (H1): ρ ≠ 0
  - Two-tailed

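Tables of Significance

- We can convert r to t and test for significance:

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}$$

- Where DF = N - 2
- In our example r was .71
- N - 2 = 21 - 2 = 19

$$t = \frac{.71\sqrt{19}}{\sqrt{1 - .71^2}} = \frac{3.09}{.704} = 4.39$$

- t-crit(19) = 2.09
- Since 4.39 is larger than 2.09, reject H0: ρ = 0.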
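A short sketch of the conversion, assuming the standard formula t = r√(N - 2) / √(1 - r²) with N - 2 degrees of freedom. The critical value is the tabled one quoted on the slide; the function name is illustrative.

```python
import math

def t_from_r(r, n):
    """Convert a correlation to a t statistic with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.71          # correlation from the example
n = 21            # number of countries
t = t_from_r(r, n)

t_crit = 2.09     # two-tailed critical value for df = 19, read from a t table
reject_null = abs(t) > t_crit

print(round(t, 2), reject_null)
```

With these inputs the statistic comes out near 4.39, well past the critical value, so the null hypothesis of no linear relationship is rejected.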

Computer Printout

Printout gives test of significance.

Regression

What is regression?

- How do we predict one variable from another?
- How does one variable change as the other changes?
- Influence

Linear Regression

- A technique we use to predict the most likely score on one variable from those on another variable
- Uses the nature of the relationship (i.e. correlation) between two variables to enhance your prediction

Linear Regression: Parts

- Y: the variable you are predicting
  - i.e. dependent variable
- X: the variable you are using to predict
  - i.e. independent variable
- Ŷ: your predictions (also known as Y’)

Why Do We Care?

- We may want to make a prediction.
- More likely, we want to understand the relationship.
  - How fast does CHD mortality rise with a one unit increase in smoking?
  - Note: we speak about predicting, but often don’t actually predict.

An Example

- Cigarettes and CHD Mortality again
  - Data repeated on next slide
- We want to predict level of CHD mortality in a country averaging 10 cigarettes per day.

The Data

- Based on the data we have, what would we predict the rate of CHD to be in a country that smoked 10 cigarettes on average?
- First, we need to establish a prediction of CHD from smoking…

Regression Line

[Figure: scatterplot with the regression line. Annotation: for a country that smokes 6 C/A/D, we predict a CHD rate of about 14.]

Regression Line

- Formula:

$$\hat{Y} = bX + a$$

- Ŷ = the predicted value of Y (e.g. CHD mortality)
- X = the predictor variable (e.g. average cig./adult/country)

Regression Coefficients

- “Coefficients” are a and b
- b = slope
  - Change in predicted Y for one unit change in X
- a = intercept
  - Value of Ŷ when X = 0

Calculation

- Slope

$$b = \frac{cov_{XY}}{s_X^2}$$

- Intercept

$$a = \bar{Y} - b\bar{X}$$

For Our Data

- cov_XY = 11.12
- s²_X = 2.33² = 5.447
- b = 11.12/5.447 = 2.042
- a = 14.524 - 2.042*5.952 = 2.37
- See SPSS printout on next slide

Answers are not exact due to rounding error and desire to match SPSS.
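The same arithmetic as a Python sketch, using only the summary values quoted on the slide. It assumes the usual one-predictor formulas, slope = covariance / variance of X and intercept = mean of Y minus slope times mean of X. Because the inputs are already rounded, the third decimal place can differ slightly from the SPSS values.

```python
# Summary values quoted on the slide.
cov_xy = 11.12    # covariance of cigarette consumption and CHD mortality
var_x = 5.447     # variance of cigarette consumption
mean_x = 5.952    # mean cigarette consumption
mean_y = 14.524   # mean CHD mortality

# Slope: change in predicted Y for a one unit change in X.
b = cov_xy / var_x

# Intercept: chosen so the line passes through (mean_x, mean_y).
a = mean_y - b * mean_x

print(round(b, 2), round(a, 2))
```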

SPSS Printout

Note:

- The values we obtained are shown on printout.
- The intercept is the value in the B column labeled “constant”.
- The slope is the value in the B column labeled by name of predictor variable.

Making a Prediction

- Second, once we know the relationship we can predict
- We predict 22.77 people/10,000 in a country with an average of 10 C/A/D will die of CHD

Accuracy of Prediction

- Finnish smokers smoke 6 C/A/D
- We predict: Ŷ = 14.619
- They actually have 23 deaths/10,000
- Our error (“residual”) = 23 - 14.619 = 8.38
  - a large error
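A sketch of the prediction and its residual. The slope 2.042 is the slide's value; the intercept 2.367 is an assumption, inferred from the slide's prediction of 14.619 at X = 6 rather than quoted directly.

```python
slope = 2.042       # slope b from the slides
intercept = 2.367   # ASSUMPTION: inferred from the slide's prediction of 14.619 at X = 6

def predict(x):
    """Predicted CHD mortality per 10,000 for x cigarettes per adult per day."""
    return slope * x + intercept

cigarettes = 6        # consumption in the example
observed = 23         # actual deaths per 10,000 in the example

predicted = predict(cigarettes)
residual = observed - predicted   # error of prediction: observed minus predicted

print(round(predicted, 3), round(residual, 2))
```

A positive residual means the country's actual mortality sits above the regression line.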

[Figure: scatterplot. X axis: Cigarette Consumption per Adult per Day; Y axis: CHD Mortality per 10,000.]

Residuals

- When we predict Ŷ for a given X, we will sometimes be in error.
- Y – Ŷ for any X is an error of estimate
  - Also known as: a residual
- We want Σ(Y – Ŷ) to be as small as possible.
  - BUT, there are infinitely many lines that can do this.
  - Just draw ANY line that goes through the mean of the X and Y values.
- Minimize Errors of Estimate… How?

Minimizing Residuals

- Again, the problem lies with this definition of the mean:

$$\sum (X - \bar{X}) = 0$$

- So, how do we get rid of the 0’s?
- Square them.

Regression Line:A Mathematical Definition

- The regression line is the line which, when drawn through your data set, produces the smallest value of:

$$\sum (Y - \hat{Y})^2$$

- Called the Sum of Squared Residuals, or SS_residual
- Regression line is also called a “least squares line.”
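A small sketch of why squaring matters, using hypothetical data. Three lines are drawn through the point of means: a flat one, a steep one, and the least squares one. All three have residuals that sum to zero, but only the least squares line has the smallest sum of squared residuals.

```python
# Hypothetical data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Least squares slope: sum of cross-products over sum of squared X deviations.
ls_slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
            / sum((xi - mean_x) ** 2 for xi in x))

def summarize(slope):
    """Residual sum and sum of squared residuals for a line through the means."""
    intercept = mean_y - slope * mean_x          # forces the line through the means
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return sum(residuals), sum(e ** 2 for e in residuals)

sum_flat, ss_flat = summarize(0.0)        # horizontal line at the mean of Y
sum_steep, ss_steep = summarize(1.2)      # an arbitrary steeper line
sum_best, ss_best = summarize(ls_slope)   # the least squares line

print(sum_flat, sum_steep, sum_best)
print(ss_flat, ss_steep, ss_best)
```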

Summarizing Errors of Prediction

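- Residual variance
  - The variability of the observed values around the predicted values

$$s_{Y - \hat{Y}}^2 = \frac{\sum (Y - \hat{Y})^2}{N - 2} = \frac{SS_{residual}}{N - 2}$$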

Standard Error of Estimate

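- Standard error of estimate
  - The standard deviation of the observed values around the predicted values

$$s_{Y - \hat{Y}} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}} = \sqrt{\frac{SS_{residual}}{N - 2}}$$

  - A common measure of the accuracy of our predictions
  - We want it to be as small as possible.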

Example

Regression and Z Scores

- When your data are standardized (linearly transformed to z-scores), the slope of the regression line is called β
- DO NOT confuse this β with the β associated with type II errors. They’re different.
- When we have one predictor, r = β
- Ẑ_y = βZ_x, since a now equals 0
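This can be checked numerically. The sketch below standardizes a small hypothetical data set, fits the regression line to the z-scores, and compares the slope with r; the helper names are illustrative.

```python
import math
import statistics

def slope_and_intercept(xs, ys):
    """Least squares slope and intercept for predicting ys from xs."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    return b, mean_y - b * mean_x

def z_scores(values):
    """Standardize to mean 0 and (sample) standard deviation 1."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def pearson_r(xs, ys):
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
beta, intercept_z = slope_and_intercept(z_scores(x), z_scores(y))

print(round(r, 4), round(beta, 4), round(intercept_z, 4))
```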

Partitioning Variability

- Sums of square deviations
  - Total
  - Regression
  - Residual (we already covered)
  - SS_total = SS_regression + SS_residual
- Degrees of freedom
  - Total: df_total = N - 1
  - Regression: df_regression = number of predictors
  - Residual: df_residual = df_total - df_regression
  - df_total = df_regression + df_residual
- Variance (or Mean Square)
  - Total variance: s²_total = SS_total / df_total
  - Regression variance: s²_regression = SS_regression / df_regression
  - Residual variance: s²_residual = SS_residual / df_residual
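A worked sketch of the partition on hypothetical data. It assumes the usual definitions: SS_total sums squared deviations of Y from its mean, SS_regression sums squared deviations of the predictions from the mean of Y, and SS_residual sums squared deviations of Y from the predictions.

```python
# Hypothetical data (not from the slides), one predictor.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares line.
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x
predicted = [a + b * xi for xi in x]

# Sums of squares.
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_regression = sum((p - mean_y) ** 2 for p in predicted)
ss_residual = sum((yi - p) ** 2 for yi, p in zip(y, predicted))

# Degrees of freedom.
df_total = n - 1
df_regression = 1                       # number of predictors
df_residual = df_total - df_regression

# Variances (mean squares).
ms_total = ss_total / df_total
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

print(ss_total, ss_regression, ss_residual)
print(ms_total, ms_regression, ms_residual)
```

Note that the sums of squares and the degrees of freedom add up, but the mean squares do not.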

Example

Coefficient of Determination

- It is a measure of the percent of predictable variability
- The percentage of the total variability in Y explained by X

r² for our example

- r = .713
- r² = .713² = .508
- or: approximately 50% of the variability in incidence of CHD mortality is associated with variability in smoking.

Coefficient of Alienation

- It is defined as 1 - r², or

$$1 - r^2 = \frac{SS_{residual}}{SS_{total}}$$

- Example: 1 - .508 = .492

r², SS and s_{Y-Y’}

- r² * SS_total = SS_regression
- (1 - r²) * SS_total = SS_residual
- We can also use r² to calculate the standard error of estimate as:

$$s_{Y - \hat{Y}} = s_Y \sqrt{(1 - r^2)\,\frac{N - 1}{N - 2}}$$
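A sketch tying these relations together with the example's summary values (r = .713, s_Y = 6.69, N = 21). It assumes SS_total = s_Y² × (N - 1) and a standard error of estimate of √(SS_residual / (N - 2)); because the inputs are rounded, the results will not match a printout to the last decimal.

```python
import math

# Summary values from the example.
r = 0.713
s_y = 6.69
n = 21

# Total sum of squares recovered from the variance of Y.
ss_total = s_y ** 2 * (n - 1)

# Split the total using r squared.
ss_regression = r ** 2 * ss_total
ss_residual = (1 - r ** 2) * ss_total

# Standard error of estimate, computed two equivalent ways.
se_from_ss = math.sqrt(ss_residual / (n - 2))
se_from_r = s_y * math.sqrt((1 - r ** 2) * (n - 1) / (n - 2))

print(round(ss_regression, 2), round(ss_residual, 2))
print(round(se_from_ss, 3), round(se_from_r, 3))
```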

Testing Overall Model

- We can test for the overall prediction of the model by forming the ratio:

$$F = \frac{s^2_{regression}}{s^2_{residual}}$$

- If the calculated F value is larger than a tabled value (F-Table) we have a significant prediction
- Example: F-Table – F critical is found using 2 things:
  - df_regression (numerator) and df_residual (denominator)
- F-Table: our F-crit(1, 19) = 4.38
- 19.594 > 4.38, significant overall
- Should all sound familiar…
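A sketch of the F ratio built from r alone, assuming one predictor and F = regression variance / residual variance. SS_total appears in both variances and cancels, so only r and N are needed. With the rounded r = .713 the ratio lands close to, but not exactly on, the 19.594 quoted above.

```python
import math

r = 0.713
n = 21

df_regression = 1                         # one predictor
df_residual = (n - 1) - df_regression     # 19

# Each variance is a share of SS_total divided by its df; SS_total cancels in the ratio.
regression_variance_share = r ** 2 / df_regression
residual_variance_share = (1 - r ** 2) / df_residual
f_ratio = regression_variance_share / residual_variance_share

f_crit = 4.38                             # tabled F for (1, 19) df, from the slide
significant = f_ratio > f_crit

# With one predictor, F is the square of the t statistic for r.
t_for_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(f_ratio, 2), significant, round(t_for_r ** 2, 2))
```

This is why the overall test "should all sound familiar": with a single predictor it is the same test as the t test on r.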

SPSS output

Testing Slope and Intercept

- The regression coefficients can be tested for significance
- Each coefficient divided by its standard error equals a t value that can also be looked up in a t-table
- Each coefficient is tested against 0

Testing the Slope

- With only 1 predictor, the standard error for the slope is:

$$se_b = \frac{s_{Y - \hat{Y}}}{s_X \sqrt{N - 1}}$$

- For our Example:

Testing Slope and Intercept

These are given in computer printout as a t test.

Testing

- The t values in the second from right column are tests on slope and intercept.
- The associated p values are next to them.
- The slope is significantly different from zero, but not the intercept.
- Why do we care?
- What does it mean if slope is not significant?
  - How does that relate to test on r?
- What if the intercept is not significant?
- Does significant slope mean we predict quite well?


Correlation - California State University, Northridge