Correlation & Regression

~Correlation is a number between -1 and 1, inclusive, that
describes the degree of relationship between two variables (bivariate
data).
~The relationship could be linear or nonlinear. Scatter plots can be
used to visualize the type of relationship. To get a
scatter plot, treat the paired data as points (x,y) and plot them in
the plane (usually y vs. x, i.e., vertical vs. horizontal).
1) strong positive linear: points would appear close to a straight line with positive slope
2) strong negative linear: points would appear close to a straight line with negative slope
3) nonlinear: points appear to be on a curved path (nonlinear path)
4) none: no pattern appears (the points seem to be scattered randomly in the plane)
~We will confine our discussion to linear correlation. This number
(which measures the degree of strength of a linear relationship) is
called the linear correlation coefficient or formally, the Pearson
product moment correlation coefficient. The symbol is r .
Assumptions:
1) the sample paired data (x,y) are a random sample of quantitative (measured) data.
2) the distribution of each variable is approximately normal.
~Correlation vs. Causation~
~Two variables could have a high correlation without being causally related (one doesn’t cause the other to happen)
~It could be the case that the two variables are strongly related to
other variables which are not considered. This happens when both
variables have a strong tendency to increase or decrease
simultaneously (e.g., teacher salaries and liquor sales, money wagered
on horse racing and college enrollment, & so on). Even though they
seem to be unrelated, one could be a good predictor of the other.
~Formal definition: (mainly used in advanced courses)
r = Cov(x,y)/(SxSy), where Cov(x,y) (the covariance) = [∑(x-Xm)(y-Ym)]/(n-1),
Sx = sqrt[∑(x-Xm)^2/(n-1)]
& Sy = sqrt[∑(y-Ym)^2/(n-1)], where Xm & Ym are the means of the x & y data sets and n is the number of pairs in our sample (the sample size).
~We will not use this definition to compute r. It can be shown that the
following formula for r is equivalent to the one in the formal
definition:
r = [n∑xy - (∑x)(∑y)] / sqrt{[n(∑x^2) - (∑x)^2][n(∑y^2) - (∑y)^2]}, where r is between -1 & 1 (inclusive)
~Given the paired data (x,y), all of the quantities in the formula can be found easily; just compute carefully.
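~As a check (not required for the course), both versions of r can be computed directly in Python; the small data set below is made up for illustration, and the two formulas give the same value:

```python
# Sketch: compute r two ways for a small made-up data set and
# confirm the shortcut formula matches the formal definition.
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
n = len(x)
xm = sum(x) / n          # mean of the x's (Xm)
ym = sum(y) / n          # mean of the y's (Ym)

# Formal definition: r = Cov(x,y) / (Sx * Sy)
cov = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / (n - 1)
sx = sqrt(sum((xi - xm) ** 2 for xi in x) / (n - 1))
sy = sqrt(sum((yi - ym) ** 2 for yi in y) / (n - 1))
r_formal = cov / (sx * sy)

# Shortcut (computational) formula
sxy = sum(xi * yi for xi, yi in zip(x, y))
sx2 = sum(xi * xi for xi in x)
sy2 = sum(yi * yi for yi in y)
r_short = (n * sxy - sum(x) * sum(y)) / sqrt(
    (n * sx2 - sum(x) ** 2) * (n * sy2 - sum(y) ** 2))

print(round(r_formal, 4), round(r_short, 4))  # both print 0.8
```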
~Fortunately, the TI-83 will do it for us under the menu STAT, TESTS, LinRegTTest. Just clear your lists L1 & L2, then carefully enter the x's of your data in L1 and the corresponding y's in L2.
Make sure they correspond. Then press STAT, TESTS, go to menu item E, and press ENTER.
The display starts off by giving the equation of the regression line
(see Regression, in this discussion) and the values of a & b in that
line. Keep scrolling until you see the correlation coefficient, r. Be
careful: r^2 is right above it, & many a student mistakenly gives this for r.
~Note: It doesn't make any difference whether you call the first or
second numbers in your paired data X or Y, since X correlated with Y
gives the same result as Y correlated with X. However, it makes a very
important difference in the equation of the regression line, so, when
getting that, you must make a distinction: Y vs. X (or Y "on" X) gives
a different line compared to X vs. Y (X "on" Y). The two lines
coincide only when |r| = 1.
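~To see the note above in numbers, here is a quick Python sketch (the data set is made up). Swapping the roles of x & y leaves r unchanged, but the two regression slopes differ; a known identity is that the two slopes multiply to r^2:

```python
# Sketch: the slope for predicting Y from X differs from the slope
# for predicting X from Y, but their product equals r^2.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 8]
n = len(x)

def slope(u, v):
    """Least-squares slope b for regressing v on u."""
    suv = sum(ui * vi for ui, vi in zip(u, v))
    su2 = sum(ui * ui for ui in u)
    return (n * suv - sum(u) * sum(v)) / (n * su2 - sum(u) ** 2)

b_y_on_x = slope(x, y)   # slope when predicting Y from X
b_x_on_y = slope(y, x)   # slope when predicting X from Y
print(round(b_y_on_x, 4), round(b_x_on_y, 4))
print(round(b_y_on_x * b_x_on_y, 4))  # this product equals r^2
```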
~Interpretation of the value of r~
~If |r|
(the absolute value of r) is close to 1, we can usually conclude that
there is a strong linear tie between x & y. It could be positive or
negative. However, there is a way of testing the population
correlation, ρ (pronounced "rho"), using the sample
correlation r at a desired significance level & sample size. Table
A-6 does this for us. If |r| exceeds the critical value in Table
A-6, we conclude that there is a significant linear correlation; otherwise,
we conclude there is not sufficient evidence to support a significant
correlation. This tests the null hypothesis Ho: ρ = 0 against H1: ρ ≠ 0, where ρ is the correlation of the population.
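~As background (not needed for using the table): the critical values in Table A-6 come from a standard t test on r. Under Ho: ρ = 0, the statistic t = r·sqrt(n-2)/sqrt(1-r^2) has n-2 degrees of freedom, so the critical r is t_crit/sqrt(t_crit^2 + n - 2). A quick sketch, using the known two-tailed t critical value 2.306 for df = 8 at the 0.05 level:

```python
from math import sqrt

# Recover Table A-6's critical value of r for n = 10 pairs at the
# 0.05 significance level from the t distribution's critical value.
n = 10
t_crit = 2.306    # two-tailed t critical value, df = n - 2 = 8, alpha = 0.05
r_crit = t_crit / sqrt(t_crit ** 2 + (n - 2))
print(round(r_crit, 3))   # about 0.632, matching Table A-6

# Decision rule with a made-up sample correlation:
sample_r = 0.80
significant = abs(sample_r) > r_crit
print(significant)        # True: conclude a significant linear correlation
```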
~Regression~
~Assuming we have a collection of paired data values, (x,y), we would
like to find a straight line that best fits or approximates them.
~The points (x,y) should have somewhat of a linear tendency. We can
recognize this from a scatter plot of the points. Therefore, there
should be somewhat of a linear correlation for the data. The higher the
correlation, the better our line would be for prediction purposes.
~This "best fitting" line is called the regression line or
least-squares line, from the method used to find it. The form used in
statistics is y = a + bx. Our job is to find the numbers a & b of
our line that best fits our linear data. We will always assume that y
is the dependent variable (vertical axis) and x the independent
variable (horizontal axis).
~Note: The data points in list L1 are the x's & those in L2 are the y's. Be careful: it's y that we are predicting (if there is a significant correlation by Table A-6).
~So, for that reason, this line is also called the regression line of Y
on X & is used for estimating or predicting Y for given
values of X.
~Treating X as dependent (which is usually not the case) will give us a different result.
~The best such line is the "least-squares" regression line; however,
deriving the values of a and b for our line is beyond the scope of
this course. The standard method uses multivariable calculus (Calculus III) to
make the sum of squares of certain distances as small as possible.
~We calculate the squares of the vertical distances each point is from
the regression line (using y=a+bx for our line). We then form the sum
of these squared distances.
~We then use calculus to minimize this sum.
~The result of this gives us 2 equations in a and b, as follows:
(1) na + b∑x = ∑y and (2) a∑x + b∑x^2 = ∑xy
~By solving these equations simultaneously, we can find a and b.
~Then we can substitute these values into y = a + bx to get our line.
~The end result would be:
b = [n∑xy - (∑x)(∑y)] / [n∑x^2 - (∑x)^2], and
a = [(∑y)(∑x^2) - (∑x)(∑xy)] / [n∑x^2 - (∑x)^2]
Not very pleasant looking! However, the TI-83 will do it all for us!
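~The two formulas can also be checked in a few lines of Python (the data set is made up). A handy sanity check is that the least-squares line always passes through the point of means (Xm, Ym):

```python
from math import isclose

# Compute a & b from the closed-form least-squares formulas.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 8]
n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y * sum_x2 - sum_x * sum_xy) / (n * sum_x2 - sum_x ** 2)
print(round(a, 4), round(b, 4))   # prints 0.5 1.3

# The line y = a + b*x passes through the point of means (Xm, Ym):
xm, ym = sum_x / n, sum_y / n
print(isclose(a + b * xm, ym))    # True
```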
~Predicting using our regression line: If the correlation r for
the data is significant (i.e., there is a linear correlation by
Table A-6), then we can use this line for predicting a y value for a
given x value.
~If there is no significant correlation, then the best predicted value of y, for any x, is the MEAN for y.
~In the first case, just substitute the x value into the regression
equation to find y. In the second case, just give the value of the y
mean.
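~The prediction rule above can be sketched as a small function. The critical value would come from Table A-6 for the given sample size and significance level; the 0.632 used below assumes n = 10 at the 0.05 level, and the other numbers are made up for illustration:

```python
def predict_y(x_new, a, b, r, r_crit, y_mean):
    """Best predicted y: use the regression line only when the
    correlation is significant; otherwise use the mean of y."""
    if abs(r) > r_crit:
        return a + b * x_new   # substitute x into y = a + b*x
    return y_mean              # no significant correlation

# Significant correlation: use the line.
print(predict_y(6, a=0.5, b=1.3, r=0.89, r_crit=0.632, y_mean=4.4))  # 8.3
# Not significant: best prediction is the mean of y.
print(predict_y(6, a=0.5, b=1.3, r=0.30, r_crit=0.632, y_mean=4.4))  # 4.4
```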
~We will use the TI-83 for most of the analysis. Use the menu: STAT, TESTS, LinRegTTest(E).
~Coefficient of Determination~