WINKS Online Manual
Chapter 5 Part 3
Advanced Regression Procedures
Polynomial Regression
Polynomial regression is considered in a situation in which the
relationship between predictor and response variables is curvilinear. A straight
line fitted through the data will not look appropriate. The data given in the
following table are the ages of 29 players and their scores on a new video game
(generated data).
A plot reveals that the relationship between AGE and SCORE is not
clearly linear, but that a quadratic term may be helpful in describing the
relationship. That is, a model such as
Y = b0 + b1X + b2X^2 + e
might be considered in this case.
The GAME.DBF database (partial listing).
RECORD   AGE    SCORE   CENTERED
------   ----   -----   --------
     1    6.9   24710     -11.75
     2    7.4   26730     -11.25
     3    7.9   25920     -10.75
     4    8.3   27510     -10.35
     :      :       :          :
    28   41.6    9850      22.95
    29   43.1    6000      24.45
In fitting a polynomial regression model, all lower order terms must be
included. That is, the first-order term is used, and higher order terms are used
only if the first-order term is not sufficient. A cubic term is used only if
both the linear (first-order) and quadratic (second-order) terms are included.
When using Polynomial Regression in WINKS, you are asked to specify the order of
the polynomial you wish to fit.
As with any multiple regression analysis, care must be taken to avoid
collinearities between the predictors. That is, if the predictors are highly
correlated, the coefficient estimates may contain considerable error (see,
e.g., Montgomery and Peck, 1982, p. 184). Since X and X^2 may be highly correlated,
the collinearities can be reduced by expressing these predictors as deviations
from the sample mean. That is, define the predictors as
(X - Mean X) and (X - Mean X)^2.
Centering the data may not always sufficiently reduce collinearities, in
which case the data should be standardized (divide the centered values by the
standard deviation of the predictor variable values).
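The effect of centering can be checked numerically. The following Python sketch is only an illustration: it uses ten evenly spaced made-up values (an assumption, not the GAME.DBF ages) and a helper named pearson_r that is introduced here and is not part of WINKS. It compares the correlation between X and X^2 before and after centering, and also computes the standardized values described above.

```python
import math

def pearson_r(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(ss_u * ss_v)

# Illustrative predictor values (an assumption, not the GAME.DBF data).
x = [float(i) for i in range(1, 11)]
mean_x = sum(x) / len(x)

# Centered values: deviations from the sample mean.
centered = [xi - mean_x for xi in x]

# Standardized values: centered values divided by the sample standard deviation.
sd_x = math.sqrt(sum(c ** 2 for c in centered) / (len(x) - 1))
standardized = [c / sd_x for c in centered]

r_raw = pearson_r(x, [xi ** 2 for xi in x])
r_centered = pearson_r(centered, [c ** 2 for c in centered])

print(f"corr(X, X^2), raw values:      {r_raw:.3f}")
print(f"corr(X, X^2), centered values: {r_centered:.3f}")
```

Because these illustrative values are symmetric about their mean, the centered correlation is essentially zero. With real data, centering typically reduces the correlation without eliminating it.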
There are various approaches for determining the order of the model. One
method is a "forward selection" procedure in which the first-order
(linear) term is fit and then higher order terms are added sequentially until
the F-test for a non-zero coefficient is not significant for the highest order
term. Another method is a "backward elimination" procedure in which an
appropriately high-order polynomial model is fit and terms are deleted one at a
time from high to low order until the highest order term of the remaining terms
results in a significant F-test. These two methods may not result in the same
model.
WINKS fits a model of the order you select and reports the coefficients
of each term, including an intercept term, up to that order. The results of the
tests of significance of these coefficients are also reported. A small p-value
indicates that the corresponding coefficient is significantly different from
zero. Residual analysis is also useful for investigating the appropriateness of
the model selected. WINKS also reports the Analysis of Variance for the entire
regression fit, as well as R-Square and adjusted R-Square, as it does in the
regular Linear Regression module.
In general, in regression analysis simpler models are preferred. It may
be possible to transform the predictor in some way so that higher order terms
are not necessary. Terms higher than second or third order are not usually used
unless there is some reason. It is
As with any regression model, extrapolation is risky and should be
avoided. While a polynomial model may adequately model the relationship between
variables within the range of the data used in the analysis, it is extremely
risky to assume that relationship holds outside that range.
As noted earlier, in order to reduce collinearities, the predictor
variable values are standardized or, in this case, centered. That is, the mean
of AGE, 18.65, is subtracted from each value of AGE to create a new variable,
CENTERED, also contained in GAME.DBF. While the correlation coefficient of AGE
and AGE^2 is 0.98, that of
Follow these steps to perform this analysis:
Step 1: Open the database named GAME.DBF.
Step 2: From the Analyze menu, select "Advanced Regression," then "Polynomial Regression Analysis."
Step 3: Select the field SCORE as the dependent variable and select CENTERED as the independent variable.
Step 4: When prompted to specify the order, select "1st order, linear regression". Results will be displayed in the viewer.
WINKS reports an F-statistic of 59.32 (p<0.001) for the first-order
term of CENTERED, that is, for (AGE-Mean AGE). From the options menu, you can also
see a plot of

Figure 1
Since the quadratic term is significant, the next step is to try adding
a cubic term using "3rd order, cubic regression." However, the results
of a cubic regression show the F-statistic for CENTERED^3 to be 1.65 (p=0.210),
which is not significant, and hence the cubic term is not useful to the model.
Thus the process ends and the selected model is
SCORE = 28645.88 - 91.68(AGE-Mean AGE) - 32.67(AGE-Mean AGE)^2 .
In order to express the model as SCORE = b0 + b1(AGE) + b2(AGE)^2 ,
the relationships between the two sets of coefficients
b0 = b0* - (b1*)(Mean X) + (b2*)(Mean X)^2
b1 = b1* - 2(b2*)(Mean X)
b2 = b2*
are used, where b0*, b1* and b2* are
the coefficients in the model using the CENTERED predictor variable and Mean X
is the sample mean of the predictor (here, the mean of AGE, 18.65). In this
example, then,
b0 = 28645.88 + 91.68(18.65) - 32.67(18.65)^2 = 18992.35
b1 = -91.68 - 2(-32.67)(18.65) = 1126.91
b2 = -32.67
and the model in terms of AGE is
SCORE = 18992.35 + 1126.91(AGE) - 32.67(AGE)^2 .
Using this regression equation to predict the response, SCORE, given
values of the predictor variable, AGE, the score for someone 32 years old is
predicted to be 21599.39 while that of a 16-year-old is 28659.39.
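The conversion from the centered coefficients to coefficients in terms of AGE, and the two predictions, can be reproduced with a short Python sketch. The function names uncenter_quadratic and predict are introduced here for illustration only; the numeric inputs are the values reported above.

```python
def uncenter_quadratic(b0_c, b1_c, b2_c, x_mean):
    """Convert coefficients of a quadratic in (X - x_mean) to coefficients in X."""
    b0 = b0_c - b1_c * x_mean + b2_c * x_mean ** 2
    b1 = b1_c - 2 * b2_c * x_mean
    b2 = b2_c
    return b0, b1, b2

def predict(coeffs, x):
    """Evaluate b0 + b1*x + b2*x^2."""
    b0, b1, b2 = coeffs
    return b0 + b1 * x + b2 * x ** 2

# Coefficients of the model fitted on CENTERED, and the mean of AGE.
centered_coeffs = (28645.88, -91.68, -32.67)
mean_age = 18.65

raw_coeffs = uncenter_quadratic(*centered_coeffs, mean_age)
print("b0, b1, b2 =", [round(c, 2) for c in raw_coeffs])

for age in (16, 32):
    print(f"AGE {age}: predicted SCORE = {predict(raw_coeffs, age):.2f}")
```

The sketch carries unrounded coefficients through the calculation, so its predictions may differ in the last digits from the values above, which were computed from coefficients rounded to two decimals.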
Variable Selection Procedures
In a multiple regression setting, the final step in choosing a model is
the selection of predictor variables to be included in the model. Of the
available predictors, it is often desirable and sometimes necessary to use only
a subset in the model, in the interest of parsimony or to reduce collinearities.
That is, predictors that do not add
significantly to the model's ability to explain the variability in the response
might very well be left out.
It should be emphasized that variable selection is the final step in the
regression analysis, and is done after necessary preliminary steps are taken,
such as verifying assumptions, residual analysis, and detecting outliers and
collinearities.
Scatterplots can be helpful in assessing whether the relationship
between each of the predictors and the response is linear. The relationships
between pairs of predictors should also be investigated. Transformations of some
or all of the predictors or of the response may be necessary to meet the
assumption of linear relationships. Residual analysis is helpful in
investigating the relationships between variables, as well as in verifying
It is also important to verify that the model errors follow a normal
distribution with constant variance. Transformations can be helpful in meeting
the assumption of constant variance. This assumption can be verified using
residual plots of the predicted response values under the model.
Before selecting a final model, the data (after transformations) should
also be investigated for influential observations (outliers) and for
collinearities. Scatterplots are a first step in these investigations. Two
predictor variables whose scatterplot has all the points close to a straight
line and whose correlation coefficient is very high (>0.90) may be
redundant, and only one is necessary in the model. It is also important to take
into account any theoretical or practical reasons in the problem definition for
including or excluding certain variables.
After the necessary preliminary steps have been taken, the variables are
selected for the final model. There are several techniques commonly used for
variable selection. WINKS's
Other selection procedures, known as "backward elimination"
and "forward selection", can be performed with up to ten
predictor variables using the Multiple Regression option in the regular WINKS
program. Backward elimination is the process of including all predictor
variables and then eliminating one by one those whose coefficients are found
upon testing to not be significantly different from zero. Forward selection is
the process of adding variables one at a time as they are found to be most
significant to the model. The stepwise procedure used by WINKS is a combination
of forward selection and backward elimination.
All Possible Regressions
Also known as "best subset selection", this procedure consists
of considering all possible combinations of the predictor variables. It is then
possible to compare all possible models and choose the "best" one.
Comparison can be based on a number of criteria, including mean squared error,
Mallows' Cp, and R-Square.
The calculated MSE is an estimate of the variance of the errors in the
full model. A smaller error variance is desirable, so different models can be
compared based on MSE, with those having a smaller MSE preferred.
R-Square, the coefficient of determination, is a measure of how much of
the variability in the response is explained in the model, provided the model
has been arrived at properly. A model with larger R-Square is preferred to
one with a much smaller R-Square.
Mallows' Cp is a statistic which is a function of the error
sum of squares for the full model and that for the reduced model. The formula
for Cp is
Cp = (SSEp / s^2) - (N - 2p),
where SSEp is the error sum of squares for the reduced model with p terms, s^2 is
the estimate of MSE for the full model and N is the number of observations.
Under the correct model, Cp is approximately equal to p and otherwise
is greater than p, reflecting bias in the parameter estimates in the regression
equation. Thus, it is desirable to select a model in which the value of Cp is
close to the number of terms, including the constant term, in the model.
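As a worked illustration of the Cp formula, the following Python sketch evaluates it for two reduced models. All numbers are hypothetical and chosen only to make the arithmetic easy to follow; the function name mallows_cp is introduced here and is not a WINKS routine.

```python
def mallows_cp(sse_p, s2_full, n_obs, p_terms):
    """Mallows' Cp = SSEp / s^2 - (N - 2p).

    sse_p   : error sum of squares of the reduced model
    s2_full : MSE of the full model
    n_obs   : number of observations N
    p_terms : number of terms in the reduced model, including the constant
    """
    return sse_p / s2_full - (n_obs - 2 * p_terms)

# Hypothetical values: 16 observations, full-model MSE of 2.0.
n_obs = 16
s2_full = 2.0

# A reduced model with 4 terms whose SSE equals s^2 * (N - p): Cp equals p.
cp_good = mallows_cp(sse_p=24.0, s2_full=s2_full, n_obs=n_obs, p_terms=4)

# A reduced model with 3 terms and a much larger SSE: Cp is well above p.
cp_biased = mallows_cp(sse_p=40.0, s2_full=s2_full, n_obs=n_obs, p_terms=3)

print("4-term model: Cp =", cp_good)
print("3-term model: Cp =", cp_biased)
```

In this made-up case the four-term model would be preferred, since its Cp is equal to its number of terms while the three-term model's Cp is far larger than 3.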
These three criteria are typically used to compare combinations, or
subsets, of the predictor variables. When WINKS reports the results of the All
Possible Regressions procedure, it reports all three of these criteria. Of
course, you should also take into
Clearly, comparing all possible models is generally the best method for making a decision about a "best" model since it provides the most information about the available choices. However, the number of models in the "all possible" subsets procedure can become quite large with just a moderate number of predictors. WINKS has the capability to perform All Possible Regressions on a maximum of eight predictor variables. With eight variables, there are 2^8 - 1, or 255, possible subsets, and the procedure can take some time.
As an example of the All Possible Regressions procedure, consider the Longley data. Follow these steps to perform this analysis:
Step 1: Open the database named LONGLEY.
Step 2: From the ANALYZE menu, choose Advanced Regression then "All Possible Regressions".
Step 3: When asked to specify the fields to use, select TOTAL as the dependent field and all others as independent fields.
Step 4: The results will be displayed in the viewer. WINKS reports the results of all six single-variable models, then all 15 two-variable models (15 = six taken two at a time), 20 three-variable models, and so forth. For each of the 2^6 - 1 = 63 models, WINKS reports p, the number of terms in the model including the intercept term, the degrees of freedom, sum of squares for error and mean-squared error for the full model, R-square and Mallows' Cp.
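The model counts quoted above can be verified with a few lines of Python. This is only a counting sketch: it enumerates the subsets of the six Longley predictor names used in this example and does not fit any regressions.

```python
from itertools import combinations
from math import comb

# The six Longley predictors (TOTAL is the dependent field).
predictors = ["DEFLATOR", "GNP", "UNEMP", "ARMED", "POP", "TIME"]
n = len(predictors)

# Number of models of each size: n taken k at a time.
counts = {k: comb(n, k) for k in range(1, n + 1)}
total = sum(counts.values())

# Explicit list of every non-empty subset of the predictors.
all_subsets = [
    subset
    for k in range(1, n + 1)
    for subset in combinations(predictors, k)
]

for k, count in counts.items():
    print(f"{k}-variable models: {count}")
print("total models:", total)
print("with eight predictors:", 2 ** 8 - 1)
```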
For the Longley data, the eight models with lowest MSE, highest R-square and Cp approximately equal to p are listed below. All of the 55 other models have very large values of Cp.
Model                                   R2     Cp
(UNEMP, ARMED, TIME)                    .993   6.2
(DEFLATOR, UNEMP, ARMED, TIME)          .993   8.2
(GNP, UNEMP, ARMED, TIME)               .995   3.2
(UNEMP, ARMED, POP, TIME)               .995   4.6
(DEF., GNP, UNEMP, ARMED, TIME)         .995   5.1
(DEF., UNEMP, ARMED, POP, TIME)         .995   6.1
(GNP, UNEMP, ARMED, POP, TIME)          .995   5.0
(DEF., GNP, UNEMP, ARMED, POP, TIME)    .995   7.0
Of these eight models, all include variables #3,4,5 (UNEMP, ARMED, TIME). The pairwise scatterplots and pairwise correlation coefficients of the six predictors show that TIME, DEFLATOR, GNP and POP are all highly correlated (r>0.90). Therefore, only one of these four variables needs to be included in the model. Furthermore, note that the models with more variables than UNEMP, ARMED and TIME have only trivially larger values of R-Square. It is helpful to display the correlation matrix for these variables to see how they are correlated.
The multiple regression on the model including UNEMP, ARMED and TIME confirms all three coefficients to be significant. Therefore, an appropriate model is:
TOTAL = -1797221.11 - 1.47(UNEMP) - 0.77(ARMED) + 956.38(TIME)
Use the multiple regression procedures to run a full regression analysis on these terms.
Stepwise Selection for Multiple Regression
For a large number of predictors, or if for other reasons the All Possible Regressions variable selection procedure is not practical, an alternative is the Stepwise variable selection procedure. The WINKS Stepwise option can consider up to 49 variables, and can define a model using up to 20 of those variables. As noted earlier, the Stepwise procedure is a combination of "forward selection" and "backward elimination" techniques.
At the first step, the model consisting of all variables is considered, and the variable testing "most significant", i.e., having the largest F-statistic, becomes the first variable included in the model. In the second step, the variable selected in the first step is forced into the model and the other variables are then fit. A cut-off p-value is used as the selection criterion to determine whether any more variables should be included. You can designate this cut-off p-value; otherwise WINKS uses a default p-value of 0.25 for the F-tests. Of those variables meeting the selection criterion at step two, the one showing the most significance, i.e., having the largest F-statistic, is added to the model consisting of the variable selected in the first step.
The two-variable model is then "checked" and if the coefficients of both variables are shown to be significantly different from zero (having small p-values), the process continues. Again, the cut-off p-value can be set by you, or else the default is 0.25. At the third step, the two already chosen variables are forced into the model and the other variables then fit. If any remaining variables meet the selection criterion, the "most significant" of those is added, and the three-variable model checked. The process continues as long as all selected variables satisfy the "checking" procedure, and as long as at least one remaining variable meets the selection criterion and is added to the model at each "forward" step. The operator is also given the opportunity at each step to continue or to stop the procedure.
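The forward and backward steps described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the WINKS implementation: it uses fixed F-statistic thresholds (f_enter and f_remove, both hypothetical parameters) in place of the cut-off p-values, because the Python standard library has no F distribution. The helper names and the synthetic data are also made up for the example.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination (assumes a is non-singular)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def sse(columns, y):
    """Error sum of squares for least squares of y on an intercept plus columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, r)) for r in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def partial_f(y, data, base, extra):
    """F-statistic for `extra` when fit after the variables in `base`."""
    sse_small = sse([data[v] for v in base], y)
    sse_big = sse([data[v] for v in base + [extra]], y)
    df_error = len(y) - (len(base) + 2)  # intercept + base variables + extra
    return (sse_small - sse_big) / (sse_big / df_error)

def stepwise(y, data, f_enter=4.0, f_remove=4.0):
    """Forward selection with a backward check after each addition."""
    selected = []
    while True:
        remaining = [v for v in data if v not in selected]
        if not remaining:
            break
        scores = {v: partial_f(y, data, selected, v) for v in remaining}
        best = max(scores, key=scores.get)
        if scores[best] < f_enter:
            break  # no remaining variable meets the entry criterion
        selected.append(best)
        # Backward check: drop any variable that is no longer significant.
        for v in list(selected):
            others = [u for u in selected if u != v]
            if partial_f(y, data, others, v) < f_remove:
                selected.remove(v)
    return selected

def column(j):
    """Synthetic +1/-1 column; columns with different j are uncorrelated."""
    return [(-1.0) ** bin(i & j).count("1") for i in range(8)]

# Synthetic data: y depends on x1 and x2 but not on x3.
data = {"x1": column(1), "x2": column(2), "x3": column(3)}
noise = column(4)
y = [1 + 2 * a + 3 * b + 0.5 * e for a, b, e in zip(data["x1"], data["x2"], noise)]

selected = stepwise(y, data)
print("selected variables, in order of entry:", selected)
```

Keeping f_enter at least as large as f_remove means a variable that has just entered is not removed immediately in the backward check.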
The data used in this example are contained in the CRIME.DBF database. Each of the 141 records contains U.S. Census Bureau information on one metropolitan area in one year. The response variable, CRIMES, is the total number of crimes. There are nine predictors:
AREA: number of square miles
POP: total population
CITY: percent of population in central cities
OVER65: percent of population age 65 and older
DRS: number of active physicians
HOSP: number of hospital beds
HSGRAD: percent of adult population having completed high school
LABOR: number of persons in civilian labor force
INCOME: total income received
Preliminaries (Transformations, Indicator Variables, Outliers)
Scatterplots of the quantitative predictors against the response (which can be easily displayed using the graphical correlation matrix procedure) raise doubts about the linearity of some of these relationships. Natural log transformations of the quantitative predictor variables as well as the response variable result in approximately linear relationships. Therefore, these transformed variables are used in the analysis, and are included in the database, CRIME.DBF, as LNCRIMES, LNAREA, LNPOP, etc.
In this example, none of the observations has been excluded from the analysis, but there are a few data points which might be considered questionable. There are a few observations with exceptionally large values of the response variable. It is a difficult judgment whether to exclude observations from the analysis and such action should be taken only with justification.
Refer to a standard regression textbook, or especially to Belsley, Kuh and Welsch (1980), for discussion of techniques for identifying influential observations, or outliers, and for discussion of other considerations which are preliminary steps to variable selection.
For example, follow these steps to perform a Stepwise analysis:
Step 1: Open the database named LNCRIME.DBF. It contains log values of the fields to be used from the original CRIME.DBF database.
Step 2: From the ANALYZE menu select Advanced Regression then "Stepwise Regression".
Step 3: You will be prompted to enter which fields to use. Select LNCRIMES as the dependent variable, and LNAREA, LNPOP, LNCITY, LNOVER65, LNDRS, LNHOSP, LNHSGRAD, LNLABOR and LNINCOME as nine independent variables.
Step 4: You are prompted to indicate any variables you want to force the model to include. Simply press Enter to indicate none. Then you are asked to specify the cut-off p-values for adding variables in the "forward" steps and for dropping variables in the "backward" steps. Press enter to select the defaults of 0.25 in both cases.
Step 5: WINKS begins by performing the regression using the full set of nine predictors, and selects LNPOP as the "most significant" variable (F=598.86, p<.001). Continuing the procedure, WINKS considers the eight two-variable models, each consisting of LNPOP and one of the other eight predictors. Of these, the variable LNHSGRAD has the largest F-statistic (11.61, p=.001) when fit after LNPOP, so LNHSGRAD is added to the model previously consisting only of LNPOP and the constant term.
If you continue the procedure, WINKS then tests this two-variable model to make sure that the term previously included, LNPOP, remains significant after LNHSGRAD is added. The coefficient of LNPOP now has an F-statistic of about 691.82 (p<.001), so neither variable is eliminated in the first "backward" step. Continuing, the next forward step considers the seven three-variable models, each consisting of LNPOP, LNHSGRAD and one of the remaining seven predictors. LNLABOR is found to have the largest F-statistic (2.83, p = 0.099) and is added to the model.
Again, the backward procedure tests this model consisting of LNPOP, LNHSGRAD and LNLABOR, but does not eliminate any of them since none of the corresponding F-tests result in p-values greater than 0.25.
At each step you are asked to continue or stop the procedure. There is some concern in this data about collinearities between some of the predictors. It would be advisable to display pairwise correlations of the variables using the WINKS Regression module. If two highly correlated variables enter the equation, you might want to run the stepwise procedure again, leaving one of them out. The final model selected by this run of the Stepwise procedure is:
LNCRIMES = .0150363 + 1.5690325 (LNPOP) + 0.7512479 (LNHSGRAD) - .4524823 (LNLABOR)
with R-Square equal to 0.9426, adjusted R-Square equal to 0.9389, and MSE equal to 0.038.
Logistic Regression is used to analyze the relationship between two variables when the dependent variable is binary. This differs from normal simple linear regression where the dependent variable is a continuous numeric variable. The logistic regression model can be described by
logit(p_i) = log(p_i / (1 - p_i)) = b0 + b1 * x_i
where
p_i is the probability of the event for the ith observation (the response to be modeled)
b0 is the intercept parameter
b1 is the slope parameter
x_i is the value of the independent variable for the ith observation
The logistic model expresses the logit transformation of the ith observation's event probability, p_i, as a linear function of the independent variable x_i. Thus, for a binary dependent variable and a continuous independent variable, the WINKS program will calculate the coefficients for the logistic equation that best fits the data.
In WINKS, there are two ways to enter data for use in the logistic procedure. You may enter your data as raw data or summarized data. In the summarized data method, you need at least three fields in your database: the independent variable (X), the number of observations for each value of the independent variable (Nj), and a count of positive outcomes from the dependent variable. For example, suppose you are testing coupons that offer discounts of 5, 10, 15, 20, and 30 percent off. You give away 400 of each kind of coupon and observe how many are redeemed.
Xj = discount value of coupon
Nj = 400 for each value of the coupon
Cj = How many coupons for value j were redeemed
The program will calculate the proportion of coupons redeemed (Pj) for the information above. For example, suppose your data for this experiment is as follows:
| Discount | Given out | Redeemed |
| 5        | 400       | 57       |
| 10       | 400       | 93       |
| 15       | 400       | 145      |
| 20       | 400       | 209      |
| 30       | 400       | 305      |
To analyze this data, follow these steps:
Step 1: Open the database named logistic.dbf. This database contains the data in the table above.
Step 2: From the Analyze menu, select Advanced Regression, then Logistic.
Step 3: You must carefully select the field names in the correct order. First select DISCOUNT, then click Add.
Step 4: Select GIVEN, then click Add.
Step 5: Select USED, then click Dep. Var.
Step 6: Click OK. Optionally enter numbers for use in prediction. You will be given a chance to enter values you want to predict, and the calculations will be performed and reported in the output. (See discussion below.)
Dependent variable is USED
Independent variable is DISCOUNT
Weights variable is GIVEN
Number of cases is 5
------------------------------------------------------------------------
Variable Coefficient St. Error t-value p(2 tail)
------------------------------------------------------------------------
Intercept -2.361179 .0587289 -40.20472 <.001
GIVEN .1193188 .0031445 37.945333 <.001
The fitted transformed logistic response function is
P’ = -2.361179 + .1193188 * DISCOUNT
so the predicted proportion redeemed is
P = e^(b0 + b1 * Discount) / (1 + e^(b0 + b1 * Discount))
where, for a 15 percent discount,
b0 + b1 * Discount = -2.361179 + .1193188 * 15 = -0.571397.
Putting the -0.571397 into the equation yields the value
P = 0.360915
Thus, you estimate that about 36% of the 15% off discount coupons will be
redeemed.
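The same prediction can be computed for any discount value with a short Python sketch. The function name predicted_proportion is introduced here for illustration; the coefficients are those reported in the output above.

```python
import math

def predicted_proportion(b0, b1, x):
    """Inverse logit: P = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    logit = b0 + b1 * x
    return math.exp(logit) / (1.0 + math.exp(logit))

# Coefficients from the WINKS output for the coupon data.
b0 = -2.361179
b1 = 0.1193188

p15 = predicted_proportion(b0, b1, 15)
print(f"logit at 15% discount:        {b0 + b1 * 15:.6f}")
print(f"predicted proportion at 15%:  {p15:.6f}")

# Predicted proportions at each discount level used in the experiment.
for discount in (5, 10, 15, 20, 30):
    print(discount, round(predicted_proportion(b0, b1, discount), 3))
```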
Reference: See Neter, J., Wasserman, W. and Kutner, M.H.,
Applied Linear Statistical Models, Richard D. Irwin, 1990.
Using Raw Data In Logistic Regression
A second way to read in data in the Logistic Regression procedure is to read in only two fields — the independent variable (X) and a 0/1 (binary) dependent variable. For the coupon data, this database would look something like this:
| COUPON | USED |
| 5      | 1    |
| 5      | 0    |
| 15     | 0    |
| 15     | 0    |
| Etc... | 1    |
In this database, each coupon has its own record, so with 400 coupons at each of the 5, 10, 15, 20 and 30 percent off levels there is a total of (400*5) = 2,000 records. If your data is in this raw form, use the Tabulation procedure to calculate counts for each group. For example, using the data in the lograw.dbf file, you will get the following table:
------------------------------
|        |       USED        |
|        |-------------------|
|        |    0    |    1    |
|--------|---------|---------|
|COUPON  |         |         |
|--------|         |         |
|5       |      343|       57|
|--------+---------+---------+
|10      |      307|       93|
|--------+---------+---------+
|15      |      255|      145|
|--------+---------+---------+
|20      |      191|      209|
|--------+---------+---------+
|30      |       95|      305|
|--------+---------+---------+
Use the counts in the USED = 1 column, together with the total for each coupon value, to create a database usable for the logistic procedure as shown in the preceding example.
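Outside WINKS, the same collapse from raw 0/1 records to summarized rows can be sketched in Python. This is an illustration only: the function name summarize is made up, and the raw records are rebuilt here from the counts in the table above rather than read from lograw.dbf.

```python
from collections import defaultdict

def summarize(records):
    """Collapse (coupon, used) records into (coupon, given, redeemed) rows."""
    given = defaultdict(int)
    redeemed = defaultdict(int)
    for coupon, used in records:
        given[coupon] += 1        # one record per coupon handed out
        redeemed[coupon] += used  # used is 1 if redeemed, 0 otherwise
    return [(c, given[c], redeemed[c]) for c in sorted(given)]

# Counts of USED = 0 and USED = 1 for each coupon value, from the tabulation.
table = {5: (343, 57), 10: (307, 93), 15: (255, 145), 20: (191, 209), 30: (95, 305)}

# Rebuild the raw records: one (coupon, used) pair per coupon.
records = []
for coupon, (not_used, used) in table.items():
    records += [(coupon, 0)] * not_used + [(coupon, 1)] * used

rows = summarize(records)
print("records:", len(records))
for coupon, n_given, n_redeemed in rows:
    print(coupon, n_given, n_redeemed)
```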