Pearson's Correlation Coefficient
This is one in a series of tutorials using examples from WINKS SDA.
Definition: Measures
the strength of the linear relationship between two variables.
Assumptions: Both
variables (often called X and Y) are interval/ratio and approximately
normally distributed, and their joint distribution is bivariate normal.
Characteristics:
Pearson's Correlation Coefficient for a sample is usually denoted r (the
corresponding population value is denoted rho), and can take on values from
-1.0 to 1.0, where -1.0 is a perfect negative (inverse) correlation, 0.0 is
no linear correlation, and 1.0 is a perfect positive correlation.
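The range of r described above can be illustrated with a short sketch. This is a minimal example using the standard textbook formula for the sample correlation, not the WINKS implementation; the function name pearson_r and the toy data are hypothetical and chosen only to show the two extreme cases.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data (not from WINKS)
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # points on an increasing line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # points on a decreasing line
print(r_up, r_down)
```

Points lying exactly on an increasing line give the upper limit of r, and points on a decreasing line give the lower limit.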
Related statistics:
R^{2} (called the coefficient of determination or r squared) can be
interpreted as the proportion of variance in Y that is explained by its
linear relationship with X.
Tests: The
statistical significance of r is tested using a t-test. The
hypotheses for this test are:
H_{0}: rho = 0
H_{a}: rho <> 0
A low p-value for this test
(less than 0.05 for example) means that there is evidence to reject the null
hypothesis in favor of the alternative hypothesis, that is, that there is a
statistically significant linear relationship between the two variables.
Note: This test is
equivalent to the test of no slope in the simple linear regression
procedure.
Location in WINKS:
Pearson's correlation coefficient is found in the following locations:
1. Regression and Correlation: The Correlation procedure produces both
Pearson and Spearman correlation coefficients. The t-test for statistical
significance of r is calculated. R^{2} is also reported.
2. Regression and Correlation: The Simple Linear Regression procedure
reports the Pearson correlation coefficient and the t-test. R^{2} is also
reported.
3. Regression and Correlation: The Correlation Matrix procedure produces a
matrix of correlations for a number of pairs of variables at a time, and
includes the p-value for the test of significance of r.
Graphs: An important
part of interpreting r is to observe a scatterplot of the data. Scatterplots
are available from the Graphs option, as a part of Simple Linear Regression
and in the Graphical Correlation Matrix option in Regression and
Correlation.
Example: Use the
Correlation procedure to calculate r for the two variables HP
(horsepower) and WEIGHT in the WINKS "CAR" database. The results from WINKS (in part) are:
Variables used: HP and WEIGHT
Number of cases used: 38
Pearson's r (Correlations Coefficient) = 0.9172
R-Square = 0.8413
Test of hypothesis to determine significance of relationship:
H(null): Slope = 0 or H(null): r = 0
(Pearson's) t = 13.81425 with 36 d.f. p < 0.001
(A low p-value implies that the slope does not = 0.)
Spearman's Rank Correlation Coefficient = 0.9071
(Spearman's) t = 12.93361 with 36 d.f. p < 0.001
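The reported R-Square and Pearson t statistic can be checked by hand from r and the number of cases. The sketch below assumes the usual textbook relationship t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom; how WINKS computes the value internally is not stated here. Because it starts from r rounded to four decimals, the t value agrees with the reported one only to about two decimal places.

```python
import math

r = 0.9172   # Pearson's r as reported in the output above
n = 38       # number of cases used

r_squared = r ** 2                             # coefficient of determination
df = n - 2                                     # degrees of freedom for the t-test
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)  # assumed textbook formula

print(f"R-Square = {r_squared:.4f}")
print(f"t = {t:.2f} with {df} d.f.")
```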
A scatterplot of this data shows the positive correlation: cars with
higher horsepower tend to weigh more.
An example of writing up these results:
Narrative: "An evaluation was made of the linear relationship
between horsepower and vehicle weight using Pearson's correlation."
Results:
"An analysis using Pearson's correlation coefficient indicates a
statistically significant linear relationship between horsepower and vehicle
weight, r(36) = 0.92, p < 0.001. For these data, the mean (SD) for horsepower
is 101.7 (26.4) and for weight 2.86 (0.71)."
Warning: There is a
temptation to infer cause and effect when observing a correlation. However,
the ability to assign causality depends on the creation of an experiment
specifically designed to provide this kind of inference.
Related topics:
Spearman's Correlation Coefficient is the nonparametric counterpart to r.
See also simple linear regression, multiple regression, and polynomial
regression.
Exercise: Correlation
At the beginning of an
introductory engineering course, 10 students were given a pretest to
determine their initial mathematical ability. The following table lists each
student's pretest score and final grade in the class:
Student Number    PreTest    Course Grade
 1                45         92
 2                23         86
 3                50         97
 4                46         95
 5                33         87
 6                21         76
 7                13         72
 8                30         84
 9                34         85
10                50         98
1. Calculate Pearson's
Correlation Coefficient (r) on this data.
r =
2. What statistical test is
used to determine if this value of r is statistically significant?
3. Is the correlation seen
in this data statistically significant? Why?
4. Display a scatterplot of
the data. Does the data appear linearly correlated? Do there seem to be any
outlier values?
5. Suppose an 11th student
were added to the data, with a pretest score of 40 and a Course Grade of
70. How would this affect r?
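For readers who want to check their hand calculations for questions 1 and 5 outside WINKS, the following is a minimal sketch. It uses the standard textbook formula for the sample correlation and the scores from the exercise table; the function and variable names are hypothetical. It prints r for the original 10 students and again after the 11th student is added, so the two values can be compared.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Scores for students 1 to 10, taken from the exercise table
pretest = [45, 23, 50, 46, 33, 21, 13, 30, 34, 50]
grade = [92, 86, 97, 95, 87, 76, 72, 84, 85, 98]

r_10 = pearson_r(pretest, grade)

# Question 5: add the 11th student (pretest 40, course grade 70)
r_11 = pearson_r(pretest + [40], grade + [70])

print(f"r with 10 students: {r_10:.4f}")
print(f"r with 11 students: {r_11:.4f}")
```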
