
WINKS Online Manual


Chapter 4 Part 5

Regression and Correlation

Simple Linear Regression

Simple linear regression is used for predicting a value of a dependent variable using an independent variable. Multiple regression is used for predicting the value of a dependent variable using one or more independent variables. Correlation is used to measure the strength of association between two variables. For example, you may be interested in relating advertising to orders received. The question you are asking is, "Is there a relationship between the amount of money spent on advertising and the number of orders received?" It is also possible to relate more than two variables at a time using multiple regression. For example, you may be interested in how the combination of radio advertising costs, direct mail costs and commissions relates to the number of orders received.

Both regression and correlation measure the linear relationship between the variables. In the case of Spearman's correlation, the relationship measured is an association between the ranks of the data. When the data are plotted (scatterplot), the points for highly associated variables should fall "scattered" about a straight line. You should check this assumption using the scatterplot or residual plot options available in the WINKS regression and correlation procedures.

Regression procedures also assume that for a fixed X value (a fixed value of the independent variable), the population of Y values (values of the dependent variable) is normally distributed and that all these normal distributions have equal variances. You can use the residual plot options on the regression procedures to check this assumption. If the residuals plotted against an independent variable show a pattern other than a band of points randomly scattered about zero, these assumptions may be violated.

Data for this example of simple linear regression are Homicide Rate and Handgun Licenses Issued per 100,000 population for the years 1961 to 1973 in Detroit (Fisher, 1976, reprinted from Gunst and Mason, 1980).

Data for simple linear regression (handgun study)

Year    Homicide    Handguns
        Rate        Registered
1961      8.60        178.15
1962      8.90        156.41
1963      8.52        198.02
1964      8.89        222.10
1965     13.07        301.92
1966     14.57        391.22
1967     21.36        665.56
1968     28.03       1131.21
1969     31.49        837.60
1970     37.39        794.90
1971     46.26        817.74
1972     47.24        583.17
1973     52.33        709.59

Since you want to compare the homicide rate with handguns registered, you need a database with only these two sets of numbers (you can exclude the year). The data for this example are stored on your disk as HANDGUNS.DBF with the variables HOM_RATE and HAND_REG. To perform a simple linear regression using these data, follow these steps:

Step 1: Open the database named HANDGUNS.DBF. Or, you can create a database using the pre-defined database description “Simple Linear Regression.”

Step 2: From the Analyze menu, select Regression & Correlation and "Simple Linear Regression."

Step 3: Select HOM_RATE as the DEPENDENT (Y) variable first, then select HAND_REG as the INDEPENDENT (X) variable.

Step 4: WINKS will display preliminary results and ask if you want to Predict or Continue. The Predict option is used if you want to use the regression equation to calculate new values of Y by entering values of X. For this example, select Continue.

Regression results will be displayed in the viewer.

Dependent variable is HOM_RATE, 1 independent variables, 13 cases.
---------------------------------------------------------------------------
Variable 	Coefficient 	St. Error 	t-value 	p(2 tail)
---------------------------------------------------------------------------
Intercept 	4.9105126 	6.6274622 	.7409341 	0.474
HAND_REG 	.0376114 	.0107324 	3.5044807 	0.005
---------------------------------------------------------------------------
R-Square = 0.5275 	Adjusted R-Square = 0.4846

Analysis of Variance to Test Regression Relation

Source 		Sum of Sqs 	df 	Mean Sq 	F 	    p-value
---------------------------------------------------------------------------
Regression 	1699.557 	1 	1699.557 	12.281385    0.005
Error 		1522.2328 	11 	138.3848
---------------------------------------------------------------------------
Total 		3221.7897 	12

A low p-value suggests that the dependent variable HOM_RATE
may be linearly related to independent variable(s).

---------------------------------------------------------------------------
MEAN X = 537.507 	S.D. X = 316.415 	CORR XSS = 1201423.0
MEAN Y = 25.127 	S.D. Y = 16.385 	CORR YSS = 3221.788
REGRESSION MS= 1699.557 	RESIDUAL MS= 138.385
---------------------------------------------------------------------------

Pearson's r (Correlation Coefficient)= 0.7263

The linear regression equation is: 
	HOM_RATE = 4.910512 + 3.761144E-02 * HAND_REG

Test of hypothesis to determine significance of relationship:
	H(null): Slope = 0 or H(null): r = 0 (two-tailed test)
	t = 3.5 with 11 degrees of freedom p = 0.005

Note: A low p-value implies that the slope does not = 0.

The table at the top of the output gives the intercept and the coefficient for each independent variable. These can be used to create a prediction equation, as explained below. Pearson's correlation coefficient (r) is also reported. Pearson's r ranges from -1 to 1; the further r is from 0, the stronger the correlation. In this case r = 0.7263, which is neither especially strong nor weak, although the test of hypothesis shows that r is significantly different from zero (p = 0.005). How substantial a correlation of this strength is depends on the situation and the judgment of the researcher. R-Square ranges from 0 to 1; the closer R-Square is to 1, the better the regression line fits the data. Here R-Square = 0.5275. The linear regression equation is a mathematical representation of a straight line that passes through a plot of the data, and it can be used to predict the dependent variable (HOM_RATE) given a value for the independent variable (HAND_REG). In this case the linear regression equation is:

HOM_RATE = 4.910512 + 3.761144E-02 * HAND_REG

(The E notation is scientific notation. Thus, 3.7E-02 means 0.037.) If you want to predict the homicide rate for 300 handguns registered, you would use the equation:

HOM_RATE = 4.910512 + 3.761144E-02 * 300

which gives a predicted homicide rate of about 16.19.
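The arithmetic in the prediction equation can be checked with a short Python sketch. The function name `predict_hom_rate` is illustrative only and is not part of WINKS; the two coefficients are copied from the regression output above.

```python
# Coefficients copied from the WINKS output for the handgun example.
INTERCEPT = 4.910512
SLOPE = 3.761144e-02  # scientific notation for 0.03761144


def predict_hom_rate(hand_reg):
    """Return the predicted homicide rate for a given handgun registration rate."""
    return INTERCEPT + SLOPE * hand_reg


prediction = predict_hom_rate(300)
print(round(prediction, 2))  # about 16.19
```

As with the Predict option, values far outside the observed range of the independent variable should be treated with caution.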

A t-test is performed to test the statistical significance of the linear relationship between the two variables. A low p-value means that the two variables are significantly related. In this case p = 0.005, which is quite small, so the null hypothesis (Slope = 0) is rejected and you conclude that the regression line has a slope significantly different from zero. That is, there is a significant linear relationship between homicides and handguns for the years 1961 to 1973 in Detroit, within the observed range of about 156 to 1131 handguns registered. Consult a text on regression for warnings about how to use (or not to use) this kind of information for prediction purposes.
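For readers who want to see where the reported slope, intercept, r and t-value come from, the following Python sketch recomputes them from the data table using the ordinary least-squares formulas. It is a sketch of the textbook calculation, not a description of how WINKS is implemented internally; the variable names are chosen here for illustration.

```python
import math

# Data from the handgun table, 1961 to 1973.
hom_rate = [8.60, 8.90, 8.52, 8.89, 13.07, 14.57, 21.36,
            28.03, 31.49, 37.39, 46.26, 47.24, 52.33]
hand_reg = [178.15, 156.41, 198.02, 222.10, 301.92, 391.22, 665.56,
            1131.21, 837.60, 794.90, 817.74, 583.17, 709.59]

n = len(hom_rate)
mean_x = sum(hand_reg) / n
mean_y = sum(hom_rate) / n

# Corrected sums of squares and cross-products.
sxx = sum((x - mean_x) ** 2 for x in hand_reg)
syy = sum((y - mean_y) ** 2 for y in hom_rate)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hand_reg, hom_rate))

# Least-squares estimates and Pearson's r.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
r = sxy / math.sqrt(sxx * syy)

# Standard error of the slope and the t statistic for H0: slope = 0.
residual_ms = (syy - slope * sxy) / (n - 2)
se_slope = math.sqrt(residual_ms / sxx)
t_value = slope / se_slope

print(f"slope={slope:.7f} intercept={intercept:.4f}")
print(f"r={r:.4f} t={t_value:.3f} df={n - 2}")
```

The printed values should agree with the WINKS output to the number of digits shown there.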

Step 5: Click Graph to view regression plots. You may choose to view a scatterplot of the original data with the fitted regression line (Regression Plot), or a plot of the residual values, by choosing from the combo box displayed and then clicking Ok.

Plots are helpful in visually examining the relationship between the variables. It is important to verify that the relationship is indeed a straight line. If it is not, a non-linear pattern should emerge from the scatterplot of the data, or the plot of the residuals should show a pattern other than a random horizontal band centered at zero. See Neter and Wasserman's book for a good description of residual plots. If the relationship is not linear, the data could possibly be transformed to make the relationship linear.
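The residuals shown in a residual plot are simply the observed values minus the values predicted by the regression equation. The Python sketch below computes them for the handgun data using the coefficients from the output; the variable names are illustrative, and the listing is meant only to show what is being plotted, not how WINKS draws the plot.

```python
# Data from the handgun table, 1961 to 1973.
years = list(range(1961, 1974))
hom_rate = [8.60, 8.90, 8.52, 8.89, 13.07, 14.57, 21.36,
            28.03, 31.49, 37.39, 46.26, 47.24, 52.33]
hand_reg = [178.15, 156.41, 198.02, 222.10, 301.92, 391.22, 665.56,
            1131.21, 837.60, 794.90, 817.74, 583.17, 709.59]

# Coefficients copied from the WINKS regression output.
INTERCEPT = 4.910512
SLOPE = 3.761144e-02

# Residual = observed Y minus predicted Y.
residuals = [y - (INTERCEPT + SLOPE * x) for x, y in zip(hand_reg, hom_rate)]

# The sum of squared residuals should match the Error sum of squares
# in the analysis of variance table.
sse = sum(e ** 2 for e in residuals)

for year, x, e in zip(years, hand_reg, residuals):
    print(f"{year}  HAND_REG={x:8.2f}  residual={e:7.2f}")
print(f"Sum of squared residuals = {sse:.2f}")
```

Plotting the residuals against HAND_REG gives the residual plot described above.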

Prediction Intervals in Simple Regression

After performing a simple linear regression, you can select the "Predict" button on the text viewer to calculate predicted "Y" values using the regression equation. Beginning with version 4.5, the predicted Y values are accompanied by a 95% prediction interval on those values.
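The manual does not state which formula WINKS uses for the prediction interval. The Python sketch below assumes the standard textbook 95% prediction interval for a single new observation in simple linear regression, and takes its inputs from the summary statistics in the handgun output (MEAN X, CORR XSS, RESIDUAL MS) together with the tabled value t(.975, 11) = 2.201. The function name is hypothetical.

```python
import math

# Summary statistics copied from the handgun regression output.
N = 13
MEAN_X = 537.507
SXX = 1201423.0         # corrected sum of squares of X (CORR XSS)
RESIDUAL_MS = 138.385
INTERCEPT = 4.910512
SLOPE = 3.761144e-02
T_CRIT = 2.201          # t(.975, 11), the two-sided 95% critical value


def prediction_interval(x0):
    """Return (lower, upper) 95% prediction limits for a new Y at X = x0.

    Assumes the usual textbook formula; WINKS may differ in detail.
    """
    y_hat = INTERCEPT + SLOPE * x0
    se_pred = math.sqrt(RESIDUAL_MS * (1 + 1 / N + (x0 - MEAN_X) ** 2 / SXX))
    return y_hat - T_CRIT * se_pred, y_hat + T_CRIT * se_pred


lower, upper = prediction_interval(300)
print(f"95% prediction interval at X=300: {lower:.2f} to {upper:.2f}")
```

The interval widens as the X value moves away from MEAN X, which is one reason to be careful when predicting near or beyond the edges of the observed data.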


Multiple Regression Analysis

Multiple regression is an extension of simple linear regression into several dimensions (several independent variables). In the multiple regression procedure, you must enter a list of the independent variables and a single dependent variable on which you wish to perform the regression analysis. In WINKS you may use up to 20 independent variables in this option. Multiple regression can be complicated. Refer to a good text on the subject before making any conclusions about your results.

WINKS calculates and displays several results, including the coefficients and intercept of the regression "line". A significance test is performed to determine the significance of the contribution of the different variables or factors to the model (mathematical representation). Also displayed is R-square (R2), as well as adjusted R-square. R-square varies from 0.0 to 1.0, with 0.0 meaning no relationship (model is not good) and 1.0 meaning the regression equation perfectly describes the sample data.

An analysis of variance is performed to determine the overall significance of the model. If the ANOVA reveals a significant relationship (that is, if the p-value is small), the model may be a good representation of the sample data. A plot of residuals from the fit is available. You may plot the residuals against any of the variables. Look for patterns in the residuals. Patterns other than a horizontal band about zero suggest that the assumptions necessary for regression analysis may be violated.
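The F statistic in the analysis of variance table is the regression mean square divided by the error mean square, where each mean square is a sum of squares divided by its degrees of freedom. The Python sketch below checks this against the two ANOVA tables in this chapter; the function name is illustrative.

```python
def f_statistic(ss_regression, df_regression, ss_error, df_error):
    """Return F = (regression mean square) / (error mean square)."""
    ms_regression = ss_regression / df_regression
    ms_error = ss_error / df_error
    return ms_regression / ms_error


# Sums of squares and degrees of freedom copied from the WINKS output.
f_handguns = f_statistic(1699.557, 1, 1522.2328, 11)
f_longley = f_statistic(184173843.173, 6, 834982.83, 9)

print(f"Handgun example F = {f_handguns:.4f}")
print(f"Longley example F = {f_longley:.3f}")
```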

Longley (1967) introduced a data set which has often been used in comparing multiple linear regression procedures in the literature. The variables refer to economic factors.  This example uses the LONGLEY database on the WINKS disk. Follow these steps to perform a multiple linear regression:

Step 1: Open the database named LONGLEY.

Step 2: From the Analyze menu, select “Regression and Correlation,” then "Multiple Regression Analysis."

Step 3: The LONGLEY database consists of 7 fields. Select TOTAL as the DEPENDENT variable and DEFLATOR, GNP, UNEMP, ARMED, POP, TIME as the INDEPENDENT variables, then click on Ok.

Step 4: WINKS will display preliminary results and ask if you want to Predict or Continue. The Predict option is used if you want to use the regression equation to calculate new values of Y by entering values of X. For this example, select Continue.

Regression results will be displayed in the viewer.

Dependent variable is TOTAL, 6 independent variables, 16 cases.
---------------------------------------------------------------------------
Variable     Coefficient         St. Error     t-value      p(2 tail)
---------------------------------------------------------------------------
Intercept  -3482258.6349         889652.92    -3.914177     0.004
DEFLATOR       15.061872         84.841736      .177529     0.863
GNP            -.0358192          .0334621    -1.070439     0.312
UNEMP           -2.02023          .4879787    -4.139996     0.003
ARMED          -1.033227          .2140895    -4.826145     <.001
POP            -.0511041          .2258783     -.2262461    0.826
TIME           1829.1515         455.08592     4.0193541    0.003
---------------------------------------------------------------------------
R-Square = 0.9955     Adjusted R-Square = 0.9925

Analysis of Variance to Test Regression Relation

Source         Sum of Sqs     df     Mean Sq         F          p-value
---------------------------------------------------------------------------
Regression     184173843.173   6     30695640.5288  330.85802    <.001
Error             834982.83    9     92775.87
---------------------------------------------------------------------------
Total            185008826.   15

A low p-value suggests that the dependent variable TOTAL
may be linearly related to independent variable(s).

The table at the top of the output tells you the intercept value and the coefficient values for each of the independent variables. These can be used to create an equation for prediction of the dependent variable. In this case, the equation is:

TOTAL = -3482258.6349 + DEFLATOR*(15.061872) + GNP*(-0.0358192) + 
UNEMP*(-2.02023) + ARMED*(-1.033227) + POP*(-0.0511041) + 
TIME*(1829.1515)

Note: Although the coefficients are reported to many decimal places, it is usually not appropriate or necessary to use this many decimal places.

The t-value associated with each coefficient tests its significance in the equation. You can use the p-value associated with each coefficient to make a decision about the validity of having that variable in the equation. A low p-value suggests that the dependent variable, TOTAL, is related to the independent variable whose p-value you are examining. In this case, you might question the validity of having DEFLATOR (p=0.863), GNP (p=0.312) and POP (p=0.826) in the equation.
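Each t-value in the coefficient table is the coefficient divided by its standard error. The Python sketch below reproduces the t-value column from the Longley output; the dictionary layout is just one convenient way to hold the numbers and is not a WINKS data format.

```python
# (coefficient, standard error) pairs copied from the Longley output.
coefficients = {
    "Intercept": (-3482258.6349, 889652.92),
    "DEFLATOR": (15.061872, 84.841736),
    "GNP": (-0.0358192, 0.0334621),
    "UNEMP": (-2.02023, 0.4879787),
    "ARMED": (-1.033227, 0.2140895),
    "POP": (-0.0511041, 0.2258783),
    "TIME": (1829.1515, 455.08592),
}

# t-value = coefficient / standard error.
t_values = {name: coef / se for name, (coef, se) in coefficients.items()}

for name, t in t_values.items():
    print(f"{name:10s} t = {t:9.4f}")
```

Coefficients whose t-values are close to zero (here DEFLATOR, GNP and POP) are the ones with large p-values in the table.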

In choosing the variables to have in such an equation, you also need to consider such questions as multicollinearity, heteroscedasticity and parsimony. There are also other ways to approach the selection of variables for a multiple regression equation. Refer to a good text on regression. If you wish to delete some variables from the equation, you can do so by redoing the analysis and leaving some of the variables out of the equation.

WINKS also reports R-Square, which gives you a measure of how well the regression "line" fits the data, and the adjusted R-Square, which adjusts R-Square for how many variables there are in the equation. R-Square ranges from 0 to 1; the closer to one (1.0) R2 is, the better fit the "line" is to the data. In this example, when all six variables are included, R-Square is 0.9955 and the adjusted R-Square is 0.9925, indicating a good fit.
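The manual does not give the formula for adjusted R-Square. The Python sketch below assumes the usual definition, which penalizes R-Square according to the number of independent variables relative to the number of cases; under that assumption it reproduces the adjusted values reported in both regression examples in this chapter. The function name is illustrative.

```python
def adjusted_r_square(r_square, n_cases, n_predictors):
    """Adjusted R-Square under the usual definition (an assumption here)."""
    return 1 - (1 - r_square) * (n_cases - 1) / (n_cases - n_predictors - 1)


# R-Square values, case counts and predictor counts from the WINKS output.
adj_longley = adjusted_r_square(0.9955, 16, 6)
adj_handguns = adjusted_r_square(0.5275, 13, 1)

print(f"Longley:  {adj_longley:.4f}")
print(f"Handguns: {adj_handguns:.4f}")
```

Because the inputs are the rounded R-Square values from the output, the last digit may differ slightly from what WINKS prints.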

Step 5: Click on Graph to view the residual plots. It is a good idea to view plots of residuals. The plots are helpful to determine if regression analysis is appropriate. A pattern other than a random horizontal band about zero indicates that the assumptions necessary for a regression procedure may be violated. You have options of producing plots of the residuals, and/or predicting values for the dependent variable based on values of the independent variable(s).


Correlation Analysis

The correlation coefficient is a measure of the strength of the linear relationship between two variables. WINKS allows you to find both Pearson's and Spearman's (rank) correlation coefficients of two variables.

Like regression analysis, correlation assumes that the relationship between the two variables is linear. That is, when one of the variables is plotted against the other, the data points should show a straight line pattern (no curves). Unlike regression, correlation does not designate one variable as dependent and the other as independent; the two variables are treated symmetrically, and neither is assumed to cause or influence the other. You may wish to create a scatterplot of the data to check for linearity.

The correlation coefficient takes on values between -1 and 1, with values close to -1 or 1 indicating a strong relationship between the two variables. A value close to 0 indicates a weak or non-existent relationship. A negative value shows a negative, or inverse, relationship: as one variable increases, the other decreases. A positive value shows a positive, or direct, relationship: as one variable increases, the other also increases.

Pearson's correlation coefficient (Pearson's r) assumes that both populations are well approximated by a normal distribution, and that their joint distribution is bivariate normal. WINKS calculates Pearson's r and R-Square (the coefficient of determination, equal to r squared), performs a t-test on the significance of rho (the population correlation coefficient), and reports a p-value. This t-test is a test of the hypotheses:

Ho: rho = 0  
Ha: rho <> 0

A low p-value (less than 0.05, for example) is usually taken to indicate that the correlation is significant (Ho is rejected).
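The t statistic for testing rho = 0 can be computed directly from r and the number of cases. The Python sketch below uses the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom; the manual does not print this formula, so treat it as an assumption, although it agrees with the t-value WINKS reports for the handgun data (r = 0.7263, 13 cases). The function name is illustrative.

```python
import math


def correlation_t(r, n):
    """Return the t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# r and the number of cases from the handgun example.
t_handguns = correlation_t(0.7263, 13)
print(f"t = {t_handguns:.3f} with {13 - 2} d.f.")
```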

WINKS also calculates and reports Spearman's rank correlation coefficient. This result does not assume normality and is based on the ranks of the data rather than the data values themselves. Spearman's r (rs) is calculated by ranking the data within each of the two variables, then finding the Pearson correlation of the ranks. Thus, rs measures the linear relationship between the ranked data and therefore the monotonic relationship between the original variables, that is, whether one variable increases or decreases consistently as the other increases. Spearman's rank correlation coefficient falls between -1 and 1, like Pearson's r, and is interpreted similarly.
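The rank-then-correlate description can be sketched in Python as follows, using the handgun data from the simple linear regression example. The helper names `ranks` and `pearson` are illustrative; the ranking helper assumes there are no tied values, which holds for this data set but would need extra handling (for example, average ranks) in general.

```python
import math

# Data from the handgun table, 1961 to 1973.
hom_rate = [8.60, 8.90, 8.52, 8.89, 13.07, 14.57, 21.36,
            28.03, 31.49, 37.39, 46.26, 47.24, 52.33]
hand_reg = [178.15, 156.41, 198.02, 222.10, 301.92, 391.22, 665.56,
            1131.21, 837.60, 794.90, 817.74, 583.17, 709.59]


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


pearson_r = pearson(hom_rate, hand_reg)
spearman_rs = pearson(ranks(hom_rate), ranks(hand_reg))

print(f"Pearson's r   = {pearson_r:.4f}")
print(f"Spearman's rs = {spearman_rs:.4f}")
```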

This example uses the HANDGUNS.DBF data from the Simple Linear Regression example. You may be interested in measuring the strength of the linear association between the numbers of registered handguns and the number of homicides. Both variables may well be influenced by other factors, but you want to know if they tend to be related. Follow these steps to calculate correlation coefficients:

Step 1: Open the database named HANDGUNS.

Step 2: From the Analyze menu, select “Regression and Correlation” and “Correlation - Pearson and Spearman.”

Step 3: Select HAND_REG and HOM_RATE as the two variables to analyze and click Ok. WINKS will perform the calculations and display the results in the viewer:

Variables used : HOM_RATE and HAND_REG

Number of cases used: 13 

Pearson's r (Correlations Coefficient) = 0.7263 R-Square = 0.5275

Test of hypothesis to determine significance of relationship:
H(null): Slope = 0 or H(null): r = 0

(Pearson's) t = 3.504481 with 11 d.f. p = 0.005
(A low p-value implies that the slope does not = 0.)

Spearman's Rank Correlation Coefficient = 0.7527

In this example, Pearson's r is 0.7263 and R-Square is 0.5275. The t-test of significance of the relationship has a low p-value (0.005), indicating that the correlation is significantly different from zero. Spearman's rank correlation coefficient is 0.7527. The investigator must determine whether or not these correlations are large enough to be important. How substantial a correlation of 0.7263 is depends on the specific situation and the judgment of the researcher. To check whether the two variables are linearly related, you may wish to produce a scatterplot. To do so, you can use the "Graphical Correlation Matrix" or "Simple Linear Regression" option, or you can use the XY plot in the Graphs menu.


Correlation Matrix Analysis

If you are working with several variables and would like to have the Pearson's correlation coefficient of each pair of the several variables in question, you can use the "Correlation matrix" option in the “Regression and Correlation” menu.

For example, to display the correlation matrix of the Longley data, use these steps:

Step 1: Open the database named LONGLEY.DBF.

Step 2: From the Analyze menu, select “Regression and Correlation” then choose the “Correlation Matrix" option.

Step 3: Select all the fields for this analysis and click Ok. WINKS will perform the calculations and display the 7 by 7 matrix of correlations.

Matrix of Correlation Coefficients C:\WINKS46P\LONGLEY.DBF

          DEFLAT  GNP     UNEMP   ARMED   POP     TIME    TOTAL
DEFLATOR          .992    .621    .465    .979    .991    .971
                  ( 0.0)  (0.01)  (0.07)  ( 0.0)  ( 0.0)  ( 0.0)
                  [ 16]   [ 16]   [ 16]   [ 16]   [ 16]   [ 16]

GNP                       .604    .446    .991    .995    .984
                          (.013)  (.083)  ( 0.0)  ( 0.0)  ( 0.0)
                          [ 16]   [ 16]   [ 16]   [ 16]   [ 16]

UNEMP                             -.177   .687    .668    .502
                                  (.511)  (.003)  (.005)  (.047)
                                  [ 16]   [ 16]   [ 16]   [ 16]

ARMED                                     .364    .417    .457
                                          (.165)  (.108)  (.075)
                                          [ 16]   [ 16]   [ 16]

POP                                               .994    .96
                                                  ( 0.0)  ( 0.0)
                                                  [ 16]   [ 16]

TIME                                                      .971
                                                          ( 0.0)
                                                          [ 16]

Key: Correlation
(p-value)
[count]

Only half of the array is displayed since the other half is a mirror image. The diagonal entries are also omitted since they are all one; a variable is always perfectly correlated with itself. Each entry in the array consists of three numbers.

The first (upper) is the Pearson's correlation coefficient for the two (row and column) variables of that entry. The second (middle) number, in parentheses, is the p-value of the t-test for Ho: rho = 0 vs. Ha: rho <> 0. The third (bottom) number, in brackets, is the sample size, or number of paired observations used in the calculations.

Both the correlation coefficient and the p-value are interpreted as they are for any correlation of two variables. In this array, for example, POP and TIME are highly correlated (r=0.994, p=0.00) but POP and ARMED are not (r=0.364, p=0.17). Notice that ARMED and UNEMP have a negative correlation (r=-0.177); as one increases, the other tends to decrease. However, since p is large (0.51), we cannot conclude that this correlation is significantly different from zero. Care must be taken when running a multitude of tests at a given significance level. As the number of tests increases, the chance of finding a significant relationship when none really exists also increases.
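To give a rough sense of the multiple-testing warning, the Python sketch below counts the pairwise tests in a 7-variable correlation matrix and computes the chance of at least one false positive at the 0.05 level. It assumes, purely for illustration, that the tests are independent, which is not true for correlations that share variables, so the figure is only indicative. The Bonferroni threshold shown is one common adjustment; it is not something the manual says WINKS applies.

```python
from math import comb

n_variables = 7
alpha = 0.05

# Number of distinct pairs of variables (entries above the diagonal).
n_tests = comb(n_variables, 2)

# Chance of at least one false positive if all null hypotheses are true
# and the tests were independent (an illustrative assumption).
familywise_error = 1 - (1 - alpha) ** n_tests

# Bonferroni-adjusted per-test significance level.
bonferroni_alpha = alpha / n_tests

print(f"Number of tests: {n_tests}")
print(f"Chance of at least one false positive: {familywise_error:.3f}")
print(f"Bonferroni per-test level: {bonferroni_alpha:.5f}")
```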


Graphical Correlation Matrix

You may display an array of scatterplots (XY plots) to see on one screen the relationships among up to ten variables. For example, to perform this analysis on the LONGLEY data, follow these steps:

Step 1: Open the database named LONGLEY.

Step 2: From the Analyze menu select “Regression and Correlation” then  choose the "Graphical Correlation Matrix" option.

Step 3: Select all the fields for this analysis. WINKS will perform the calculations and display the matrix of scatterplots.

 

These scatterplots are a visual way of examining the relationships between pairs of variables. They allow you to determine whether a relationship exists between the variables and whether that relationship is linear. The more highly correlated two variables are, the more tightly clustered about a straight line are the points on the scatterplot. In this case you can see that GNP looks highly correlated with TIME (we know r = 0.995 from a previous example), whereas GNP and UNEMP do not look as strongly related, although their correlation is still statistically different from 0 (r = 0.604). Notice also that ARMED and TIME look related, but not in a linear fashion (r = 0.417). You can use this graphical correlation matrix to examine the relationships between variables before using them in a multiple regression analysis.


Regression Through the Origin

A standard simple linear regression procedure calculates coefficients for an intercept and slope term in the linear equation. (See the main WINKS manual, pages 4-43 and following.) However, sometimes your knowledge about the true nature of the relationship between your two variables tells you that the intercept coefficient should be 0 (zero). That is, the line should pass through the origin. When this is the case, you can force WINKS to fit the least-squares line through the origin, estimating only the slope that best fits the scatter of points. The following example illustrates how to perform a linear regression with a forced zero intercept.

Step 1: Open the database named ORIGIN.DBF. This database contains two variables, VAR1 and VAR2.

Step 2: From the Analyze menu, select "Regression and Correlation" then "Simple Linear Regression."

Step 3: You will be prompted to select the dependent and independent variables for the regression. In this case, select VAR1 as the Independent variable and VAR2 as the Dependent variable.

Step 4: Important: Before clicking the Okay button, click on the radio button labeled "Regression through origin option." Then click on Okay.

The output that will be displayed is:

-------------------------------------------------------------------
Dependent variable is VAR2, 1 independent variables, 12 cases.
---------------------------------------------------------------------
Variable Coefficient St. Error t-value p(2 tail)
---------------------------------------------------------------------
VAR1 4.6852741 .034205 136.9762 <.001

A 95% Confidence (using t(.975, 11) = 2.201 interval for VAR1 is:

( 4.609988, 4.76056)

The estimated regression function is:

VAR2-hat = 4.6852741106916 * VAR1


The output above gives the estimated linear regression equation (with zero intercept) as

VAR2-hat = 4.6852741106916 * VAR1

Notice that the usual intercept term is missing. The value 4.685 is the slope of the line through the scatter of points. A t-test yields a t-value of 136.98 and a p-value < 0.001, which tells you that the slope is significantly different from zero. The 95% confidence interval indicates that the slope is plausibly between about 4.61 and 4.76.

The use and interpretation of the resulting equation is similar to that for a Simple Linear Regression.
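For a line forced through the origin, the least-squares slope is the sum of the X*Y products divided by the sum of the squared X values. The Python sketch below shows that calculation on a small made-up data set (the contents of ORIGIN.DBF are not listed in this manual, so the demonstration values are hypothetical), and then recomputes the confidence interval from the coefficient, standard error and t value printed in the output above.

```python
def slope_through_origin(x, y):
    """Least-squares slope of the line Y = b * X (no intercept term)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Hypothetical demonstration data; NOT the contents of ORIGIN.DBF.
x_demo = [1.0, 2.0, 3.0, 4.0]
y_demo = [2.1, 3.9, 6.2, 7.8]
demo_slope = slope_through_origin(x_demo, y_demo)
print(f"Demo slope = {demo_slope:.4f}")

# Confidence interval rebuilt from the values in the WINKS output.
COEFFICIENT = 4.6852741
ST_ERROR = 0.034205
T_CRIT = 2.201  # t(.975, 11)

ci_lower = COEFFICIENT - T_CRIT * ST_ERROR
ci_upper = COEFFICIENT + T_CRIT * ST_ERROR
print(f"95% confidence interval: ({ci_lower:.6f}, {ci_upper:.6f})")
```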

Point Bi-serial Correlation

The point bi-serial correlation is used when one of the measures is continuous and one is dichotomous (0,1). For example, using the database on disk named POINTBS.DBF, select SCORE as the "X" (first) field and YESNO as the "Y" (0,1 type) field. The results are (edited):

Dependent variable =SCORE

Independent variable = YESNO

Point bi-serial correlation = 0.5948

t = 6.535 with 78 degrees of freedom. p <= 0.001

When the p-value is small (less than 0.05), you can conclude that there is evidence to reject the null hypothesis and to support the alternative hypothesis. For this example, since p is small, you can conclude that there is a significant relationship between SCORE and YESNO. You might note that the t-test in this case is equivalent to an independent group t-test using the 0,1 variable as the grouping variable.
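The equivalence with an independent group t-test can be illustrated with a short Python sketch. The scores and 0/1 codes below are made up for the demonstration (POINTBS.DBF is not listed in this manual); the pooled-variance t-test and the formula t = r * sqrt(n - 2) / sqrt(1 - r^2) are the standard textbook versions and are assumed here rather than taken from the WINKS documentation.

```python
import math

# Hypothetical scores and 0/1 group codes; NOT the contents of POINTBS.DBF.
scores = [12.0, 15.0, 11.0, 14.0, 18.0, 20.0, 17.0, 21.0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
n = len(scores)

# Point bi-serial correlation = Pearson's r between the score and the 0/1 code.
mean_s = sum(scores) / n
mean_g = sum(groups) / n
sxy = sum((g - mean_g) * (s - mean_s) for g, s in zip(groups, scores))
sxx = sum((g - mean_g) ** 2 for g in groups)
syy = sum((s - mean_s) ** 2 for s in scores)
r_pb = sxy / math.sqrt(sxx * syy)
t_from_r = r_pb * math.sqrt(n - 2) / math.sqrt(1 - r_pb ** 2)

# Independent group t-test with pooled variance, grouping on the 0/1 code.
group0 = [s for g, s in zip(groups, scores) if g == 0]
group1 = [s for g, s in zip(groups, scores) if g == 1]
mean0 = sum(group0) / len(group0)
mean1 = sum(group1) / len(group1)
ss0 = sum((s - mean0) ** 2 for s in group0)
ss1 = sum((s - mean1) ** 2 for s in group1)
pooled_var = (ss0 + ss1) / (n - 2)
t_pooled = (mean1 - mean0) / math.sqrt(
    pooled_var * (1 / len(group0) + 1 / len(group1)))

print(f"r = {r_pb:.4f}, t from r = {t_from_r:.4f}, pooled t = {t_pooled:.4f}")

# The same formula applied to the values reported in the WINKS output.
t_reported = 0.5948 * math.sqrt(80 - 2) / math.sqrt(1 - 0.5948 ** 2)
print(f"t for r = 0.5948 with 78 d.f.: {t_reported:.3f}")
```

The two t statistics for the made-up data agree, and the last line reproduces the t-value shown in the output above from its reported correlation.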

 



     

