
WINKS Online Manual


Chapter 5 Part 3

Advanced Regression Procedures


Polynomial Regression

Polynomial regression is considered in a situation in which the relationship between predictor and response variables is curvilinear. A straight line fitted through the data will not look appropriate. The data given in the following table are the ages of 29 players and their scores on a new video game (generated data).

A plot reveals that the relationship between AGE and SCORE is not clearly linear, but that a quadratic term may be helpful in describing the relationship. That is, a model such as

Y = b0 + b1X + b2X^2 + e

might be considered in this case.

The GAME.DBF database (partial listing).

RECORD AGE SCORE CENTERED  
------ --- ----- --------  
 1     6.9 24710   -11.75  
 2     7.4 26730   -11.25  
 3     7.9 25920   -10.75  
 4     8.3 27510   -10.35  
 :      :    :        :  
 28   41.6  9850    22.95  
 29   43.1  6000    24.45  
   
Such a polynomial model can be recognized as a form of a multiple linear regression model with two predictor variables, X and X^2.

In fitting a polynomial regression model, all lower order terms must be included. That is, the first-order term is used, and higher order terms are used only if the first-order term is not sufficient. A cubic term is used only if both the linear (first-order) and quadratic (second-order) terms are included. When using Polynomial Regression in WINKS, you are asked to specify the order of the polynomial you wish to fit.

As with any multiple regression analysis, care must be taken to avoid collinearities between the predictors. That is, if the predictors are highly correlated, the coefficient estimates may contain considerable error (see, e.g., Montgomery and Peck, 1982, p. 184). Since X and X^2 may be highly correlated, the collinearities can be reduced by expressing these predictors as deviations from the sample mean. That is, define the predictors as (X-Xbar) and (X-Xbar)^2.

Centering the data may not always sufficiently reduce collinearities, in which case the data should be standardized (divide the centered values by the standard deviation of the predictor variable values).
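
As a rough illustration of why centering helps, the following Python sketch compares the correlation between X and X^2 with the correlation between the centered values and their squares. The ages are made-up values (not the GAME.DBF data), and the helper name pearson_r is hypothetical; this is not how WINKS itself computes anything.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ages, for illustration only (not the values in GAME.DBF).
ages = [7.0, 8.5, 10.0, 12.0, 15.0, 18.0, 22.0, 27.0, 33.0, 41.0]

# Center the predictor by subtracting its sample mean.
mean_age = sum(ages) / len(ages)
centered = [a - mean_age for a in ages]

r_raw = pearson_r(ages, [a ** 2 for a in ages])
r_centered = pearson_r(centered, [c ** 2 for c in centered])

print(f"corr(X, X^2)                 = {r_raw:.3f}")
print(f"corr(X - Xbar, (X - Xbar)^2) = {r_centered:.3f}")
```

With skewed data such as these, the centered correlation is usually smaller than the raw one but not zero, which is why standardizing may still be needed.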

There are various approaches for determining the order of the model. One method is a "forward selection" procedure in which the first-order (linear) term is fit and then higher order terms are added sequentially until the F-test for a non-zero coefficient is not significant for the highest order term. Another method is a "backward elimination" procedure in which an appropriately high-order polynomial model is fit and terms are deleted one at a time from high to low order until the highest order term of the remaining terms results in a significant F-test. These two methods may not result in the same model.

WINKS fits a model of the order you select and reports the coefficients of each term, including an intercept term, up to that order. The results of the tests of significance of these coefficients are also reported. A small p-value indicates that the corresponding coefficient is significantly different from zero. Residual analysis is also useful for investigating the appropriateness of the model selected. WINKS also reports the Analysis of Variance for the entire regression fit, as well as R-Square and adjusted R-Square, as it does in the regular Linear Regression module.

In general, in regression analysis simpler models are preferred. It may be possible to transform the predictor in some way so that higher order terms are not necessary. Terms higher than second or third order are not usually used unless there is some reason. It is always possible to fit a high enough order model, but such a model is difficult to interpret and not generally recommended.

As with any regression model, extrapolation is risky and should be avoided. While a polynomial model may adequately model the relationship between variables within the range of the data used in the analysis, it is extremely risky to assume that relationship continues to exist outside the range of the data. Refer to a standard text, such as Neter and Wasserman or Montgomery and Peck, for more information about polynomial regression.

As noted earlier, in order to reduce collinearities, the predictor variable values are standardized or, in this case, centered. That is, the mean of AGE, 18.65, is subtracted from each value of AGE to create a new variable, CENTERED, also contained in GAME.DBF. While the correlation coefficient of AGE and AGE^2 is 0.98, that of CENTERED and CENTERED^2 is 0.78, so CENTERED is used in the regression analysis.

Follow these steps to perform this analysis:

Step 1: Open the database named GAME.DBF.

Step 2: From the Analyze menu, select "Advanced Regression," then "Polynomial Regression Analysis."

Step 3: Select the field SCORE as the dependent variable and select CENTERED as the independent variable.

Step 4: When prompted to specify the order, select "1st order, linear regression". Results will be displayed in the viewer.

WINKS reports an F-statistic of 59.32 (p<0.001) for the first-order term of CENTERED, that is, for (AGE-Mean AGE). From the options menu, you can also see a plot of this linear fit to the data. The plot will look similar to Figure 1, except that the CENTERED variable is used rather than AGE. Nevertheless, the first-order line clearly does not fit the data well. If you redo the analysis, this time selecting "2nd order, quadratic regression," the results are as follows: the first-order term, CENTERED^1, has an F-statistic of 11.51 (p=0.002) and the second-order term has an F-statistic of 292.58 (p<0.001). The quadratic term is highly significant; it adds greatly to the model after the first-order term has been fit.


Figure 1

Since the quadratic term is significant, the next step is to try adding a cubic term using "3rd order, cubic regression." However, the results of a cubic regression show the F-statistic for CENTERED^3 to be 1.65 (p=0.210), which is not significant, and hence the cubic term is not useful to the model. Thus the process ends and the selected model is

SCORE = 28645.88 - 91.68(AGE-Mean AGE) - 32.67(AGE-Mean AGE)^2

In order to express the model as SCORE = b0 + b1(AGE) + b2(AGE)^2, the relationships between the two sets of coefficients

   b0 = b0* - b1*(Xbar) + b2*(Xbar)^2

   b1 = b1* - 2b2*(Xbar)

   b2 = b2*

are used, where b0*, b1* and b2* are the coefficients in the model using the CENTERED predictor variable and Xbar is the mean of AGE. In this example, then,

   b0 = 28645.88 + 91.68(18.65) - 32.67(18.65)^2

        = 18992.35

   b1 = -91.68 -2(-32.67)(18.65)

        = 1126.91

   b2 = -32.67

and the model in terms of AGE is

SCORE = 18992.35 + 1126.91(AGE) - 32.67(AGE)^2.

Using this regression equation to predict the response, SCORE, given values of the predictor variable, AGE, the score for someone 32 years old is predicted to be 21599.39 while that of a 16-year-old is 28659.39.
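
The conversion from centered to uncentered coefficients, and the two predictions, can be reproduced with a short Python sketch. The coefficient values and the mean of AGE are taken from the text; the function names uncenter and predict_score are hypothetical and are not part of WINKS.

```python
# Coefficients of the quadratic model fitted to CENTERED, as reported in the text.
B0_STAR = 28645.88
B1_STAR = -91.68
B2_STAR = -32.67
MEAN_AGE = 18.65  # mean of AGE that was subtracted to create CENTERED

def uncenter(b0_star, b1_star, b2_star, xbar):
    """Convert centered-model coefficients to coefficients in the raw predictor."""
    b0 = b0_star - b1_star * xbar + b2_star * xbar ** 2
    b1 = b1_star - 2 * b2_star * xbar
    b2 = b2_star
    return b0, b1, b2

def predict_score(age, b0, b1, b2):
    """Evaluate SCORE = b0 + b1*AGE + b2*AGE^2."""
    return b0 + b1 * age + b2 * age ** 2

b0, b1, b2 = uncenter(B0_STAR, B1_STAR, B2_STAR, MEAN_AGE)
print("b0 =", round(b0, 2), " b1 =", round(b1, 2), " b2 =", b2)

for age in (16, 32):
    print("AGE", age, "-> predicted SCORE", round(predict_score(age, b0, b1, b2), 2))
```

Because the sketch carries unrounded coefficients, its predictions can differ from the values quoted above by a few hundredths.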

Variable Selection Procedures

In a multiple regression setting, the final step in choosing a model is the selection of predictor variables to be included in the model. Of the available predictors, it is often desirable and sometimes necessary to use only a subset in the model, in the interest of parsimony or to reduce collinearities.  That is, predictors that do not add significantly to the model's ability to explain the variability in the response might very well be left out of the model, and variables that are redundant should not both (all) be included. The goal is to adequately, but concisely, model the relationship between the predictors and the response.

It should be emphasized that variable selection is the final step in the regression analysis, and is done after necessary preliminary steps are taken, such as verifying assumptions, residual analysis, and detecting outliers and collinearities.

Scatterplots can be helpful in assessing whether the relationship between each of the predictors and the response is linear. The relationships between pairs of predictors should also be investigated. Transformations of some or all of the predictors or of the response may be necessary to meet the assumption of linear relationships. Residual analysis is helpful in investigating the relationships between variables, as well as in verifying homoscedasticity (constant variance) of errors.

It is also important to verify that the model errors follow a normal distribution with constant variance. Transformations can be helpful in meeting the assumption of constant variance. This assumption can be verified using residual plots of the predicted response values under the model.

Before selecting a final model, the data (after transformations) should also be investigated for influential observations (outliers) and for collinearities. Scatterplots are a first step in these investigations. Two predictor variables whose scatterplot has all the points close to a straight line and whose correlation coefficient is very high (>>0.90) may be redundant, and only one is necessary in the model. It is also important to take into account any theoretical or practical reasons in the problem definition for including or excluding certain variables.

After the necessary preliminary steps have been taken, the variables are selected for the final model. There are several techniques commonly used for variable selection. WINKS's Advanced Regression module performs "all possible" and "stepwise" selection procedures.

Other selection procedures, known as "backward elimination"  and "forward selection", can be performed with up to ten predictor variables using the Multiple Regression option in the regular WINKS program. Backward elimination is the process of including all predictor variables and then eliminating one by one those whose coefficients are found upon testing to not be significantly different from zero. Forward selection is the process of adding variables one at a time as they are found to be most significant to the model. The stepwise procedure used by WINKS is a combination of forward selection and backward elimination.


All Possible Regressions

Also known as "best subset selection", this procedure consists of considering all possible combinations of the predictor variables. It is then possible to compare all possible models and choose the "best" one. Comparison can be based on a number of criteria, including mean squared error (MSE), Mallows' Cp, and R-Square.

The calculated MSE is an estimate of the variance of the errors in the full model. A smaller error variance is desirable, so different models can be compared based on MSE, with those having a smaller MSE preferred.

R-Square, the coefficient of determination, is a measure of how much of the variability in the response is explained in the model, provided the model has been arrived at properly. A model with a larger R-Square is preferred to one with a much smaller R-Square.

Mallows' Cp is a statistic which is a function of the error sum of squares for the full model and that for the reduced model. The formula is Cp = (SSEp/s^2) - (N - 2p), where SSEp is the error sum of squares for the reduced model with p terms, s^2 is the MSE for the full model and N is the number of observations. Under the correct model, Cp is approximately equal to p and otherwise is greater than p, reflecting bias in the parameter estimates in the regression equation. Thus, it is desirable to select a model in which the value of Cp is close to the number of terms, including the constant term, in the model.
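
A minimal Python sketch of the Cp formula follows. The function name mallows_cp and all numbers in the example are hypothetical, chosen only to show the arithmetic; they are not WINKS output.

```python
def mallows_cp(sse_p, s2_full, n_obs, p_terms):
    """Cp = SSEp / s^2 - (N - 2p), with p counting the constant term."""
    return sse_p / s2_full - (n_obs - 2 * p_terms)

# Hypothetical values, for illustration only.
n_obs = 16       # number of observations
p_full = 7       # terms in the full model (six predictors plus the constant)
sse_full = 0.9   # error sum of squares of the full model

# s^2 is the MSE of the full model: SSE divided by its error degrees of freedom.
s2_full = sse_full / (n_obs - p_full)

# For the full model itself, Cp reduces to p by construction.
cp_full = mallows_cp(sse_full, s2_full, n_obs, p_full)

# A reduced model with 4 terms and a somewhat larger error sum of squares.
cp_reduced = mallows_cp(1.3, s2_full, n_obs, 4)

print("Cp (full model)    =", round(cp_full, 3))
print("Cp (reduced model) =", round(cp_reduced, 3))
```

In this made-up case the reduced model has Cp near its own p of 4, which is the pattern the text describes as desirable.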

These three criteria are typically used to compare combinations, or subsets, of the predictor variables. When WINKS reports the results of the All Possible Regressions procedure, it reports all three of these criteria. Of course, you should also take into account any theoretical criteria specific to the problem for including or excluding variables, as well as be careful not to include redundant variables, which may introduce collinearities. It is often helpful to consider which variables consistently appear in the better models. The better models can then be analyzed using the Multiple Regression option of Regression and Correlations, and the results of tests for significant coefficients considered in the final decision. Residual plots of predicted values under the chosen model should show a random scatter of points.

Clearly, comparing all possible models is generally the best method for making a decision about a "best" model since it provides the most information about the available choices. However, the number of subsets to examine grows quickly, and the "all possible" procedure can become quite large with just a moderate number of predictors. WINKS has the capability to perform All Possible Regressions on a maximum of eight predictor variables. With eight variables, there are 2^8-1, or 255, possible subsets, and the procedure can take some time.

As an example of the All Possible Regressions procedure, consider the Longley data. Follow these steps to perform this analysis:

Step 1: Open the database named LONGLEY. 

Step 2: From the ANALYZE menu, choose Advanced Regression then "All Possible Regressions". 

Step 3: When asked to specify the fields to use, select TOTAL as the dependent field and all others as independent fields.

Step 4: The results will be displayed in the viewer. WINKS reports the results of all six single-variable models, then all 15 two-variable models (15 = six taken two at a time), 20 three-variable models, and so forth. For each of the 2^6-1 = 63 models, WINKS reports p, the number of terms in the model including the intercept term, the degrees of freedom, sum of squares for error and mean-squared error for the full model, R-Square and Mallows' Cp.
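
The subset counts quoted in Step 4 can be checked with a small Python sketch that enumerates every non-empty combination of the six Longley predictors. It only lists the candidate models and does not fit any regressions; the variable names follow the field names used in the listing for this example.

```python
from itertools import combinations

# The six predictors in the Longley example (TOTAL is the response).
predictors = ["DEFLATOR", "GNP", "UNEMP", "ARMED", "POP", "TIME"]

# Every non-empty subset, ordered by size as in the WINKS report.
subsets = [
    combo
    for size in range(1, len(predictors) + 1)
    for combo in combinations(predictors, size)
]

# Count how many candidate models there are of each size.
counts_by_size = {}
for combo in subsets:
    counts_by_size[len(combo)] = counts_by_size.get(len(combo), 0) + 1

print("total models:", len(subsets))
print("models by number of predictors:", counts_by_size)
```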

For the Longley data, the eight models with the lowest MSE, highest R-Square and Cp approximately equal to p are listed below. All of the 55 other models have very large values of Cp.

Model                                    R-Square     Cp
(UNEMP, ARMED, TIME)                       .993       6.2
(DEFLATOR, UNEMP, ARMED, TIME)             .993       8.2
(GNP, UNEMP, ARMED, TIME)                  .995       3.2
(UNEMP, ARMED, POP, TIME)                  .995       4.6
(DEF., GNP, UNEMP, ARMED, TIME)            .995       5.1
(DEF., UNEMP, ARMED, POP, TIME)            .995       6.1
(GNP, UNEMP, ARMED, POP, TIME)             .995       5.0
(DEF., GNP, UNEMP, ARM, POP, TIME)         .995       7.0


Of these eight models, all include variables #3,4,5 (UNEMP, ARMED, TIME). The pairwise scatterplots and pairwise correlation coefficients of the six predictors show that TIME, DEFLATOR, GNP and POP are all highly correlated (r>0.90). Therefore, only one of these four variables needs to be included in the model. Furthermore, note that the models with more variables than UNEMP, ARMED and TIME have only trivially larger values of R-Square. It is helpful to display the correlation matrix for these variables to see how variables are correlated.

The multiple regression on the model including UNEMP, ARMED and TIME confirms all three coefficients to be significant. Therefore, an appropriate model is:

TOTAL = -1797221.11 - 1.47(UNEMP) - 0.77(ARMED) + 956.38(TIME)

Use the multiple regression procedures to run a full regression analysis on these terms.

Stepwise Selection for Multiple Regression

For a large number of predictors, or if for other reasons the All Possible Regressions variable selection procedure is not practical, an alternative is the Stepwise variable selection procedure. The WINKS Stepwise option can consider up to 49 variables, and can define a model using up to 20 of those variables. As noted earlier, the Stepwise procedure is a combination of "forward selection" and "backward elimination" techniques.

At the first step, the model consisting of all variables is considered, and the variable testing "most significant", i.e., having the largest F-statistic, becomes the first variable included in the model. In the second step, the variable selected in the first step is forced into the model and the other variables are then fit. A cut-off p-value is used as the selection criterion to determine whether any more variables should be included. You can set this cut-off yourself; otherwise WINKS uses its default p-value of 0.25 for the F-tests. Of those variables meeting the selection criterion at step two, the one showing the most significance, i.e., having the largest F-statistic, is added to the model consisting of the variable selected in the first step.

The two-variable model is then "checked" and if the coefficients of both variables are shown to be significantly different from zero (having small p-values), the process continues. Again, the cut-off p-value can be set by you, or else the default is 0.25. At the third step, the two already chosen variables are forced into the model and the other variables then fit. If any remaining variables meet the selection criteria, the "most significant" of those is added, and the three-variable model checked. The process continues as long as all selected variables satisfy the "checking" procedure, and as long as at least one remaining variable meets the selection criteria and is added to the model at each "forward" step. The operator is also given the opportunity at each step to continue or to stop the procedure.
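
The alternation of "forward" and "backward" steps described above can be sketched in Python as follows. This is only an outline under stated assumptions, not the WINKS implementation: the function stepwise_select and the caller-supplied p_values function are hypothetical, the regression fitting itself is left to the caller, and the prompt that lets the operator stop at each step is omitted. Selecting the smallest p-value stands in for selecting the largest F-statistic.

```python
from typing import Callable, Dict, List, Sequence

def stepwise_select(
    candidates: Sequence[str],
    p_values: Callable[[List[str]], Dict[str, float]],
    p_enter: float = 0.25,
    p_remove: float = 0.25,
    max_steps: int = 100,
) -> List[str]:
    """Alternate forward and backward steps until no variable can be added.

    p_values(model) is a hypothetical caller-supplied function that fits the
    regression on the variables in `model` and returns, for each of them,
    the p-value of the F-test on its coefficient.
    """
    selected: List[str] = []
    for _ in range(max_steps):  # guard against endless add/drop cycles
        # Forward step: try each remaining variable alongside those chosen so far.
        best_var, best_p = None, None
        for var in candidates:
            if var in selected:
                continue
            p = p_values(selected + [var])[var]
            if best_p is None or p < best_p:
                best_var, best_p = var, p
        if best_var is None or best_p > p_enter:
            break  # no remaining variable meets the entry cut-off
        selected.append(best_var)

        # Backward step: re-test the variables that were already in the model.
        current = p_values(selected)
        earlier = selected[:-1]
        if earlier:
            worst = max(earlier, key=lambda v: current[v])
            if current[worst] > p_remove:
                selected.remove(worst)
    return selected


# Toy stand-in for a real regression fit: fixed, made-up p-values.
fake_p = {"A": 0.001, "B": 0.10, "C": 0.60}
chosen = stepwise_select(list(fake_p), lambda model: {v: fake_p[v] for v in model})
print(chosen)
```

With the toy p-values, A and B enter the model and C is rejected because its p-value exceeds the 0.25 cut-off.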

The data used in this example are contained in the CRIME.DBF database. Each of the 141 records contains U.S. Census Bureau information on one metropolitan area in one year. The response variable, CRIMES, is the total number of crimes. There are nine predictors:

AREA: number of square miles
POP: total population
CITY: percent of population in central cities
OVER65: percent of population age 65 and older
DRS: number of active physicians
HOSP: number of hospital beds
HSGRAD: percent of adult population having completed high school
LABOR: number of persons in civilian labor force
INCOME: total income received

Preliminaries (Transformations, Indicator Variables, Outliers)

Scatterplots of the quantitative predictors against the response (which can be easily displayed using the graphical correlation matrix procedure) raise doubts about the linearity of some of these relationships. Natural log transformations of the quantitative predictor variables as well as the response variable result in approximately linear relationships. Therefore, these transformed variables are used in the analysis, and are included in the database, CRIME.DBF, as LNCRIMES, LNAREA, LNPOP, etc.

In this example, none of the observations has been excluded from the analysis, but there are a few data points which might be considered questionable. There are a few observations with exceptionally large values of the response variable. It is a difficult judgment whether to exclude observations from the analysis and such action should be taken only with justification. 

Refer to a standard regression textbook, or especially to Belsley, Kuh and Welsch (1980), for discussion of techniques for identifying influential observations, or outliers, and for discussion of other considerations which are preliminary steps to variable selection.

For example, follow these steps to perform a Stepwise analysis:

Step 1: Open the database named LNCRIME.DBF. It contains log values of the fields to be used from the original CRIME.DBF database.

Step 2: From the ANALYZE menu select Advanced Regression then "Stepwise Regression". 

Step 3: You will be prompted to enter which fields to use. Select LNCRIMES as the dependent variable, and LNAREA, LNPOP, LNCITY, LNOVER65, LNDRS, LNHOSP, LNHSGRAD, LNLABOR and LNINCOME as nine independent variables. 

Step 4: You are prompted to indicate any variables you want to force the model to include. Simply press Enter to indicate none. Then you are asked to specify the cut-off p-values for adding variables in the "forward" steps and for dropping variables in the "backward" steps. Press enter to select the defaults of 0.25 in both cases.

Step 5: WINKS begins by performing the regression using the full set of nine predictors, and selects LNPOP as the "most significant" variable (F=598.86, p<.001). Continuing the procedure, WINKS considers the eight two-variable models, each consisting of LNPOP and one of the other eight predictors. Of these, the variable LNHSGRAD has the largest F-statistic (11.61, p=.001) when fit after LNPOP, so LNHSGRAD is added to the model previously consisting only of LNPOP and the constant term.

If you continue the procedure, WINKS then tests this two-variable model to make sure that the term previously included, LNPOP, remains significant after LNHSGRAD is added. The coefficient of LNPOP now has an F-statistic of about 691.82, so neither variable is eliminated in the first "backward" step. Continuing, the next forward step considers the seven three-variable models, each consisting of LNPOP, LNHSGRAD and one of the remaining seven predictors. LNLABOR is found to have the largest F-statistic (2.83, p = 0.099) and is added to the model.

Again, the backward procedure tests this model consisting of LNPOP, LNHSGRAD and LNLABOR, but does not eliminate any of them since none of the corresponding F-tests result in p-values greater than 0.25. 

At each step you are asked to continue or stop the procedure. There is some concern in this data about collinearities between some of the predictors. It would be advisable to display pairwise correlations of the variables using the WINKS Regression module. If two highly correlated variables enter the equation, you might want to run the stepwise procedure again, leaving one of them out. The final model selected by this run of the Stepwise procedure is:

LNCRIMES = .0150363 + 1.5690325 (LNPOP) + 0.7512479 (LNHSGRAD) - .4524823 (LNLABOR)

with R-Square equal to 0.9426, adjusted R-Square equal to 0.9389, and MSE equal to 0.038.


Simple Logistic Regression

Logistic Regression is used to analyze the relationship between two variables when the dependent variable is binary. This differs from normal simple linear regression where the dependent variable is a continuous numeric variable. The logistic regression model can be described by

logit(p_i) = log(p_i / (1 - p_i)) = b0 + b1 * x_i

                    where

p_i is the event probability (the response to be modeled) for the ith observation
b0 is the intercept parameter
b1 is the slope parameter
x_i is the value of the independent variable for the ith observation

The logistic model expresses the logit transformation of the ith observation's event probability, p_i, as a linear function of the independent variable x_i. Thus, for a binary dependent variable and a continuous independent variable, the WINKS program will calculate the coefficients for the logistic equation that best fits the data.

In WINKS, there are two ways to enter data for use in the logistic procedure. You may enter your data as raw data or summarized data. In the summarized data method, you need at least three fields in your database: the independent variable (X), the number of observations for each value of the independent variable (Nj), and a count of positive outcomes from the dependent variable (Cj). For example, suppose you are testing coupons that offer discounts of 5, 10, 15, 20, and 30 percent off. You give away 400 of each kind of coupon and observe how many are redeemed.

Xj = discount value of coupon
Nj = 400 for each value of the coupon
Cj = How many coupons for value j were redeemed

The program will calculate the proportion of coupons redeemed (Pj) for the information above. For example, suppose your data for this experiment is as follows:

Discount   Given out   Redeemed
--------   ---------   --------
    5         400          57
   10         400          93
   15         400         145
   20         400         209
   30         400         305

To analyze this data, follow these steps:

Step 1: Open the database named logistic.dbf. This database contains the data in the table above.

Step 2: From the Analyze menu, select Advanced Regression, then Logistic.

Step 3: You must carefully select the field names in the correct order. First select DISCOUNT, then click Add.

Step 4: Select GIVEN, then click Add.

Step 5: Select USED, then click Dep. Var.

Step 6: Click OK. Optionally enter numbers for use in prediction. You will be given a chance to enter values you want to predict, and the calculations will be performed and reported in the output. (See discussion below.)

Dependent variable is USED
Independent variable is DISCOUNT
Weights variable is GIVEN
Number of cases is 5
------------------------------------------------------------------------
Variable Coefficient St. Error t-value p(2 tail)
------------------------------------------------------------------------
Intercept -2.361179 .0587289 -40.20472 <.001
GIVEN .1193188 .0031445 37.945333 <.001
The fitted transformed logistic response function is
P’ = -2.361179 + .1193188 * DISCOUNT

Step 7: Using the following equation, estimate the proportion of 15% coupons that will be redeemed:

P = e^(b0 + b1 * Discount) / (1 + e^(b0 + b1 * Discount))

where

b0 + b1 * Discount = -2.361179 + .1193188 * 15
                   = -0.571397

Putting -0.571397 into the equation yields the value

P = 0.360915

Thus, you estimate that about 36% of the 15% off discount coupons will be redeemed.
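
The calculation in Step 7 can be repeated for any discount with a few lines of Python. The coefficients are the ones reported in the output above; the function name redemption_probability is hypothetical.

```python
import math

# Fitted coefficients from the WINKS output shown above.
B0 = -2.361179   # intercept
B1 = 0.1193188   # slope for DISCOUNT

def redemption_probability(discount, b0=B0, b1=B1):
    """Inverse logit: P = e^z / (1 + e^z), where z = b0 + b1 * discount."""
    z = b0 + b1 * discount
    return math.exp(z) / (1.0 + math.exp(z))

p15 = redemption_probability(15)
print(f"Estimated redemption proportion at 15% off: {p15:.4f}")

# The same function can be applied to the other discount levels in the data.
for discount in (5, 10, 20, 30):
    print(discount, round(redemption_probability(discount), 4))
```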

Reference:
See Neter, J., Wasserman, W. and Kutner, M.H., Applied Linear Statistical Models, Richard D. Irwin, 1990.

Using Raw Data In Logistic Regression

A second way to read in data in the Logistic Regression procedure is to read in only two fields: the independent variable (X) and a 0/1 (binary) dependent variable. For the coupon data, this database would look something like this:

COUPON   USED
------   ----
5        1
5        0
15       0
15       0
Etc...   1

In this database, each coupon given out has its own record. With 400 coupons at each of the 5, 10, 15, 20 and 30 percent discount levels, that makes a total of 400*5 = 2,000 records. If your data is in this raw form, use the Tabulation procedure to calculate counts for each group. For example, using the data in the lograw.dbf file, you will get the following table:

------------------------------
|        |        USED       |
|        |-------------------|
|        |     0   |    1    |
|--------|---------|---------|
|COUPON  |         |         |
|--------|         |         |
|5       |      343|       57|
|--------+---------+---------+
|10      |      307|       93|
|--------+---------+---------+
|15      |      255|      145|
|--------+---------+---------+
|20      |      191|      209|
|--------+---------+---------+
|30      |       95|      305|
|--------+---------+---------+

Use the counts in the USED = 1 column, together with the number of coupons given out at each discount level, to create a database usable for the logistic procedure as shown in the preceding example.
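
For readers who prepare their data outside WINKS, the same raw-to-summarized conversion can be sketched in Python. The raw records below are rebuilt from the counts in the tabulation above rather than read from lograw.dbf, and the function name summarize is hypothetical.

```python
from collections import defaultdict

# Counts from the tabulation above: COUPON -> (USED = 0, USED = 1).
tabulated = {5: (343, 57), 10: (307, 93), 15: (255, 145), 20: (191, 209), 30: (95, 305)}

# Rebuild raw (COUPON, USED) records, one per coupon given out.
raw_records = []
for coupon, (not_used, used) in tabulated.items():
    raw_records += [(coupon, 0)] * not_used + [(coupon, 1)] * used

def summarize(records):
    """Collapse raw 0/1 records into (discount, given out, redeemed) rows."""
    given = defaultdict(int)
    redeemed = defaultdict(int)
    for coupon, used in records:
        given[coupon] += 1
        redeemed[coupon] += used
    return [(coupon, given[coupon], redeemed[coupon]) for coupon in sorted(given)]

summary = summarize(raw_records)
print("raw records:", len(raw_records))
for row in summary:
    print(row)
```

Each printed row corresponds to one record of the summarized database (discount, number given out, number redeemed).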




 
Continue to Chapter 5 Part 4. (Time Series Analysis)

     

