Correlation & Regression





~Correlation is a number between -1 and 1, inclusive, that describes the degree of relationship between two variables (bivariate data)

~The relationship could be linear or nonlinear. Scatter plots can be used to visualize the type of relationship. To make a scatter plot, treat the paired data as points (x,y) and plot them in the plane (usually y vs. x, i.e., vertical vs. horizontal)

1) strong positive linear: points would appear close to a straight line with positive slope

2) strong negative linear: points would appear close to a straight line with negative slope

3) nonlinear: points appear to be on a curved path (nonlinear path)

4) none: no pattern appears (the points seem to be scattered randomly in the plane)

~We will confine our discussion to linear correlation. The number that measures the degree of strength of a linear relationship is called the linear correlation coefficient or, formally, the Pearson product-moment correlation coefficient. Its symbol is  r .

Assumptions:
 
1) the sample of paired data (x,y) is a random sample of quantitative (measured) data.

2) the distribution of each variable is approximately normal.

~Correlation vs. Causation~


 

~Two variables could have a high correlation without being causally related (one doesn’t cause the other to happen)

~It could be the case that the two variables are strongly related to other variables which are not considered. This happens when both variables have a strong tendency to increase or decrease simultaneously (e.g., teacher salaries and liquor sales, money wagered on horse racing and college enrollment, and so on). Even though the variables seem to be unrelated, one could be a good predictor of the other.

~Formal definition: (mainly used in advanced courses)

r = Cov(x,y)/(SxSy), where Cov(x,y) (the covariance) = [∑(x - Xm)(y - Ym)]/(n - 1), and

Sx = sqrt[∑(x - Xm)²/(n - 1)] & Sy = sqrt[∑(y - Ym)²/(n - 1)], where Xm & Ym are the means of the x and y data sets, and n is the number of pairs in our sample (the sample size).

~We will not use this definition to compute r. It can be shown that the following formula for r is equivalent to the one in the formal definition:

             r = [n∑xy - (∑x)(∑y)] / sqrt{[n(∑x²) - (∑x)²][n(∑y²) - (∑y)²]}, where r is between -1 & 1 (inclusive)

~Given the paired data (x,y), all of the quantities in the formula can be found easily; just compute carefully.
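As a concrete sketch of this computation in pure Python (the data set is made up purely for illustration):

```python
import math

def pearson_r(xs, ys):
    # Computational formula for r from these notes:
    # r = [n∑xy - (∑x)(∑y)] / sqrt{[n∑x² - (∑x)²][n∑y² - (∑y)²]}
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den

# Made-up paired data (not from the text):
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)   # about 0.775 for this data
```

A quick sanity check: perfectly linear data such as xs = [1, 2, 3], ys = [2, 4, 6] gives r = 1.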

~Fortunately, the TI-83 will do it for us under the menu STAT, TESTS, LinRegTTest. Just clear your lists L1 & L2, then carefully enter the X's of your data in L1 and the corresponding Y's in L2. Make sure they correspond. Then press STAT, TESTS, go to menu item E, and press ENTER. The display starts off by giving the equation of the regression line (see regression, in this discussion) and the values for a & b in that line. Keep scrolling until you see the correlation coefficient, r. Be careful: r² is right above it, and many a student mistakenly gives that value for r.

~Note: For correlation it doesn't make any difference whether you call the first or second numbers in your paired data X or Y, since X correlated with Y gives the same result as Y correlated with X. However, it makes a very important difference in the equation of the regression line, so when getting that, you must make a distinction: the regression of Y on X gives a different line than the regression of X on Y (the two lines coincide only when |r| = 1).

~Interpretation of the value of r~


 

~If |r| (the absolute value of r) is close to 1, we can usually conclude that there is a strong linear tie between x & y. It could be positive or negative. However, there is a way of testing the population correlation, ρ, using the sample correlation r at a desired significance level & sample size. Table A-6 does this for us. If |r| exceeds the critical value in Table A-6, we conclude that there is significant linear correlation; otherwise, we conclude there is not sufficient evidence to support a significant linear correlation. This tests the null hypothesis Ho: ρ = 0 against H1: ρ ≠ 0, where ρ (pronounced "rho") is the correlation of the population.
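The decision rule itself can be sketched in code. The critical value must still be read from Table A-6; the 0.632 below is the commonly tabulated value for n = 10 at the 0.05 significance level (verify it in your own copy of the table):

```python
def significant_correlation(r, critical_value):
    # Test Ho: rho = 0 against H1: rho != 0.
    # Reject Ho (conclude significant linear correlation)
    # exactly when |r| exceeds the Table A-6 critical value.
    return abs(r) > critical_value

crit = 0.632   # Table A-6, n = 10, alpha = 0.05 (check your table)
print(significant_correlation(0.75, crit))   # True: significant
print(significant_correlation(0.40, crit))   # False: not enough evidence
```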

~Regression~



~Assuming we have a collection of paired data values, (x,y), we would like to find a straight line that best fits or approximates them.

~The points (x,y) should have somewhat of a linear tendency. We can recognize this from a scatter plot of the points. Therefore, there should be somewhat of a linear correlation for the data. The higher the correlation, the better our line would be for prediction purposes.

~This "best fitting" line is called the regression line  or least-squares line, from the method used to find it. The form used in statistics is y = a + bx. Our job is to find the numbers a & b of our line that best fits our linear data. We will always assume that y is the dependent variable (vertical axis) and x the independent variable (horizontal axis).

~Note: The data points in list L1 are the X's & those in L2 are the Y's. Be careful, it's Y that we are predicting (if there is a significant correlation by table A-6).

~So, for that reason, this line is also called the regression line of Y on X & is used for estimating or predicting  Y for given values of X.

~Treating X as dependent  (which is usually not the case) will give us a different result.

~The best such line is the "least-squares" regression line, however, getting the values of a and b for our line is way beyond the scope of this course.  One popular method uses knowledge of calculus III to make the sum of squares of certain distances as small as possible.

~We calculate the squares of the vertical distances each point is from the regression line (using y=a+bx for our line). We then form the sum of these squared distances.

~We then use calculus to minimize this sum.

~The result of this gives us 2 equations in a and b, as follows:

   (1)          na + b∑x = ∑y and a∑x + b∑x² = ∑xy

~By solving these equations simultaneously, we can find a and b.

~Then we can substitute these values into  y = a + bx to get our line.

~The end result would be :

 b = [n∑xy - (∑x)(∑y)]/[n∑x² - (∑x)²], and

 a = [(∑y)(∑x²) - (∑x)(∑xy)]/[n∑x² - (∑x)²]

Not very pleasant looking! However, the TI-83 will do it all for us!
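Still, the two formulas can be computed directly; a sketch in pure Python (the data set is a made-up illustration):

```python
def regression_coefficients(xs, ys):
    # a and b for the least-squares line y = a + bx,
    # using the closed-form formulas above.
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    d = n * sx2 - sx ** 2            # common denominator
    b = (n * sxy - sx * sy) / d
    a = (sy * sx2 - sx * sxy) / d
    return a, b

xs = [1, 2, 3, 4, 5]   # made-up data
ys = [2, 4, 5, 4, 5]
a, b = regression_coefficients(xs, ys)   # a = 2.2, b = 0.6 here
```

One quick sanity check: the line always passes through the point of means, so a = (mean of y) - b·(mean of x); here that is 4 - 0.6·3 = 2.2.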

~Predicting using our regression line:  If the correlation r for the data is significant (i.e., there is a linear correlation using table A-6), then, we can use this line for predicting a y value for a given x value.

~If there is no significant correlation, then the best predicted value of y, for any x,  is the MEAN for y.

~In the first case, just substitute the x value into the regression equation to find y. In the second case, just give the value of the y mean.
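The two-case prediction rule can be sketched as follows (all the numbers, including the regression coefficients and the critical value, are illustrative assumptions, not values from the text):

```python
def predict_y(x, a, b, r, critical_value, y_mean):
    # If the correlation is significant (|r| exceeds the Table A-6
    # critical value), predict with the regression line y = a + bx;
    # otherwise the best prediction for ANY x is the mean of y.
    if abs(r) > critical_value:
        return a + b * x
    return y_mean

# Illustrative values:
p1 = predict_y(4, a=2.2, b=0.6, r=0.90, critical_value=0.632, y_mean=4.0)  # uses the line: about 4.6
p2 = predict_y(4, a=2.2, b=0.6, r=0.30, critical_value=0.632, y_mean=4.0)  # falls back to the mean: 4.0
```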

~We will use the TI-83 for most of the analysis. Use the menu:  STAT, TESTS, LinRegTTest(E).

~Coefficient of Determination~


~Coefficient of determination = r². This is defined as the proportion of the total variation in the observed y-values that is accounted for by using the regression line to predict them, instead of predicting each y-value by the average of the y's. (That is why it is referred to as the proportion of the total variation in the observed y-values that is EXPLAINED by the regression line.) The closer r² is to 1, the more useful the regression equation is for prediction.

~Technically, r² = 1 - [∑(y - yL)²]/[∑(y - yA)²], where yL is the y-value predicted from the regression line & yA is the average of the observed y-values.

~In words, r² = [Explained Variation] divided by [Total Variation]

~The following will give you the motivation for this definition

~Since y - yA = (yL - yA) + (y - yL), we can square both sides to get:

(y - yA)² = [(yL - yA) + (y - yL)]², or

(y - yA)² = (yL - yA)² + 2(yL - yA)(y - yL) + (y - yL)²

~Summing over all the data points (using properties of ∑), we get:
∑(y - yA)² = ∑(yL - yA)² + 2∑(yL - yA)(y - yL) + ∑(y - yL)²

~The middle term on the right side can be shown to equal 0 in the derivation of the "least-squares" line, using the equations denoted by (1) above.

~We then have ∑(y - yA)² = ∑(yL - yA)² + ∑(y - yL)²,

which is the identity that gives us

~Total Variation = Explained Variation + Unexplained Variation, or

~Explained Variation = Total Variation - Unexplained Variation

~Dividing both sides by the total variation, we get:

~[Explained Variation]/[Total Variation] =

1 - [Unexplained Variation]/[Total Variation], which is r²
(the equation for r² given initially)

~The Unexplained Variation, ∑(y - yL)², can be due to other variables in our problem that were not considered.
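The decomposition Total = Explained + Unexplained, and r² = Explained/Total, can be verified numerically; a sketch with made-up data:

```python
def variation_decomposition(xs, ys):
    # Fit the least-squares line, then split the total variation
    # in the observed y's into explained and unexplained parts.
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n                                 # yA in the notes
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    a = y_mean - b * x_mean
    y_line = [a + b * x for x in xs]                     # the yL values
    total = sum((y - y_mean) ** 2 for y in ys)           # ∑(y - yA)²
    explained = sum((p - y_mean) ** 2 for p in y_line)   # ∑(yL - yA)²
    unexplained = sum((y - p) ** 2 for y, p in zip(ys, y_line))  # ∑(y - yL)²
    return total, explained, unexplained

xs = [1, 2, 3, 4, 5]   # made-up data
ys = [2, 4, 5, 4, 5]
total, explained, unexplained = variation_decomposition(xs, ys)
# total ≈ explained + unexplained, and explained/total = r² (0.6 here)
```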

~Fitting a data set with a non-linear model often arises in practical applications. See the following link for details:

Curve Fitting