/*
PR0165.SAS
COMPUTING NUMERATOR SUMS OF SQUARES
IN UNBALANCED ANALYSIS OF VARIANCE:
Three-Way Case
Donald B. Macnaughton
donmac@matstat.com
TABLE OF CONTENTS
Introductory Comments
- abstract
- introduction
Preliminary Steps
- load the Searle data into a SAS dataset and print the data
- start PROC IML and read the data from the SAS dataset into
IML
- generate the three main effect submatrices of the design
matrix
- generate the four interaction submatrices of the design
matrix
- make the SS subroutine available to the program
- set the values of the three secondary arguments of the SS
subroutine
Compute the Twenty-Two Sums of Squares Using the SS Subroutine
- compute the six sequential (SAS Type I) sums of squares
- compute the sum of squares for the highest-level interac-
tion
- compute the six SAS Type II sums of squares
- compute the six HTI = SAS Type III sums of squares
- compute three HTO sums of squares
- discuss the HTOS sums of squares
Quit from IML
Run PROC GLM to Compute Nineteen of the Sums of Squares (for
comparison with the values generated above)
Appendix: Steps to Run the Program
References
Output from PROC PRINT
Output from PROC GLM
ABSTRACT
This SAS program illustrates a conceptual point of view and the
matrix arithmetic for computing the following types of analysis
of variance numerator sums of squares:
- HTO (Higher-level Terms are Omitted)
= SAS Type II in the two-way case
= SPSS ANOVA Experimental
- SAS Type II
- HTOS (Higher-level Terms are Omitted unless Significant)
= a superset of SAS Type II and HTO
- HTI (Higher-Level Terms are Included)
= SAS Type III
= SPSS ANOVA UNIQUE
= the default method in many analysis of variance programs
- sequential
= SAS Type I
= SPSS ANOVA Hierarchical in the two-way case.
The conceptual point of view is one of computing an analysis of
variance sum of squares by computing the difference between the
residual sums of squares of two model equations (Yates 1934).
The matrix arithmetic is simple, and is specified directly in
terms of the conceptual point of view.
Computations are illustrated using data from a 4 x 3 x 2 unbal-
anced experiment discussed by Shayle Searle (1987, 392).
INTRODUCTION
Discussion in this program is an extension of discussion in an-
other program, which computes sums of squares for an unbalanced
2 x 3 experiment (Macnaughton 1998). In that program I note
that most researchers use analysis of variance to obtain the
resulting p-values, which are used to help in detecting rela-
tionships between variables. In order to compute analysis of
variance p-values, it is mathematically necessary to first com-
pute certain "numerator sums of squares".
The other program illustrates the three best-known conceptual
methods (HTO, HTI, and sequential) for computing numerator sums
of squares when there are two discrete-valued predictor vari-
ables in the experiment. When there are three or more dis-
crete-valued predictor variables, other methods of computing
numerator sums of squares become available. The present pro-
gram illustrates some of these methods.
BEGINNING OF THE PROGRAM STATEMENTS
(Note: If you wish to run this program on your computer, see
the checklist in the appendix.)
Define the title for the output and set SAS system options.
*/
title 'IML/GLM 4 x 3 x 2 Unbalanced ANOVA, Searle (1987, 392)';
options linesize=80 nodate probsig=2;
/*
Load the Searle data into a SAS dataset.
*/
data Searle_2 (keep = a b c y);
input a b c n @;
do i = 1 to n;
input y @;
output;
end;
cards;
1 1 1 1 10
1 1 2 2 4 6
1 2 1 2 3 5
1 2 2 3 2 3 7
1 3 1 3 1 2 3
1 3 2 3 4 5 9
2 1 1 2 5 9
2 1 2 1 8
2 2 1 1 5
2 2 2 2 6 8
2 3 1 2 6 10
2 3 2 1 2
3 1 1 4 2 3 3 4
3 1 2 3 3 4 8
3 2 1 3 3 4 8
3 2 2 1 4
3 3 1 1 5
3 3 2 1 6
4 1 1 1 4
4 1 2 3 5 7 9
4 2 1 4 1 1 3 7
4 2 2 1 19
4 3 1 1 8
4 3 2 2 20 24
;
/*
Print the data.
*/
proc print data=Searle_2;
run;
/*
BEGINNING OF THE IML COMPUTATIONS
Start PROC IML and reset options that control the destination
and appearance of the IML output.
*/
proc iml;
reset log spaces=3;
/*
Read the data from the SAS dataset into IML. Each of the four
variables in the dataset (i.e., a, b, c, and y) becomes a col-
umn vector in IML. The column vectors inherit the names of the
respective dataset variables.
*/
use Searle_2;
read all var _all_;
/*
Compute the main-effect submatrices of the overall design ma-
trix using the IML DESIGNF function.
*/
aD = designf(a);
bD = designf(b);
cD = designf(c);
/*
Compute the interaction submatrices of the overall design ma-
trix using the IML HDIR (horizontal direct product) function.
*/
abD = hdir(aD,bD);
acD = hdir(aD,cD);
bcD = hdir(bD,cD);
abcD = hdir(abD,cD);
/*
Include the subroutine that will be used to do the computa-
tions, but don't print it.
*/
%include 'D:\PROGS\SS.SAS' / NOsource2;
/*
Set values of the three secondary arguments for the SS subrou-
tine. The settings instruct the subroutine to
- use the first method of computing sums of squares
- print the values of the computed sums of squares, but
- omit printing the intermediate results.
*/
level = 1;
printss = 1;
printh = 0;
/*
BEGINNING OF CALLS TO THE SS SUBROUTINE
To facilitate comparisons, sums of squares are computed in this
program in the order in which they appear in the output from
SAS PROC GLM. This order does not reflect the order of impor-
tance of the sums of squares since (as I shall discuss further
in later material) the SAS Type I sums of squares have a seri-
ous problem.
(The SAS Type IV sums of squares are omitted because [as illus-
trated in later GLM output from this program] they are identi-
cal to the HTI (= SAS Type III) sums of squares for the Searle
data. Type IV sums of squares differ from HTI sums of squares
only in the infrequent case in which there are empty cells.)
As I discuss in the program for the two-way case (1998), it is
useful to view the computation of analysis of variance sums of
squares in terms of the operation of computing the difference
between the residual sums of squares of two overparameterized
model equations. The two equations are specified in terms of
two matrices, XE and XR.
XE is the submatrix of the (full-column-rank) design matrix for
the effect being tested. XR is the horizontal concatenation of
the submatrices of the design matrix for all the other terms
(excluding the error term) on the right side of the two model
equations whose residual sums of squares we wish to difference.
SEQUENTIAL (SAS TYPE I) SUMS OF SQUARES
Sequential sums of squares are based on a chosen sequence for
computing the sums of squares for the various effects. The se-
quence begins with the main effects and works upward through
the interactions. The sequence starts with only the constant
term mu in the model, and terms are added to the model, one at
a time, as in stepwise regression. Once a term is added to the
model, it stays.
(A problem with the sequential method is that it is not clear
which particular sequence is best. For example, SAS and SPSS
use different logical sequences for their sequential sum of
squares.)
The sums of squares computed in this section demonstrate the
sequence used by SAS in computing the Type I sum of squares.
For each sum of squares I show (in abbreviated form) the two
model equations whose residual sums of squares are being dif-
ferenced. You may wish to confirm that the definition of XE
and XR in each case is consistent with the two model equations.
Sequential A
y = m
y = m + a */
XE = aD;
XR = J(48,1);
call SS(result, y, XE, XR, level,printss,printh);
/* Note how the value of RESULT given above is identical in
all available digits with the Type I sum of squares for A in
the GLM output generated later in this program. (Similarly,
all the following sums of squares computed in this IML program
are identical to the corresponding sums of squares [when avail-
able] from SAS GLM.)
The values NDIGITSS, NDIGITSR, and NDIGITSI are rough indica-
tors of the number of digits of accuracy of the values in the
"projection matrix", which I discuss in the other program
(1998).
Sequential B
y = m + a
y = m + a + b */
XE = bD;
XR = J(48,1) || aD;
call SS(result, y, XE, XR, level,printss,printh);
/* Sequential A x B
y = m + a + b
y = m + a + b + ab */
XE = abD;
XR = J(48,1) || aD || bD;
call SS(result, y, XE, XR, level,printss,printh);
/* Sequential C
y = m + a + b + ab
y = m + a + b + c + ab */
XE = cD;
XR = J(48,1) || aD || bD || abD;
call SS(result, y, XE, XR, level,printss,printh);
/* Sequential A x C
y = m + a + b + c + ab
y = m + a + b + c + ab + ac */
XE = acD;
XR = J(48,1) || aD || bD || cD || abD;
call SS(result, y, XE, XR, level,printss,printh);
/* Sequential B x C
y = m + a + b + c + ab + ac
y = m + a + b + c + ab + ac + bc */
XE = bcD;
XR = J(48,1) || aD || bD || cD || abD || acD;
call SS(result, y, XE, XR, level,printss,printh);
/* A x B x C
Note that the sum of squares for the three-way A x B x C inter-
action is computed only once in the IML portion of this program
because it is the same for all the types of sums of squares.
y = m + a + b + c + ab + ac + bc
y = m + a + b + c + ab + ac + bc + abc */
XE = abcD;
XR = J(48,1) || aD || bD || cD || abD || acD || bcD;
call SS(result, y, XE, XR, level,printss,printh);
/*
SAS TYPE II SUMS OF SQUARES
Consider a definition
An analysis of variance effect is *contained* in an-
other effect if the name of the former effect can be
obtained from the name of the latter by deleting terms.
For example, the A effect is contained in the A x B effect be-
cause the name of the A effect can be obtained by deleting the
term B from the name of the A x B effect.
The XR matrix for a SAS Type II sum of squares is the horizon-
tal concatenation of
- all the submatrices of the design matrix for terms in the
model equation at the same level as the effect being tested
plus
- all the submatrices (if any) of the design matrix for terms
at lower levels than the effect being tested plus
- all the remaining submatrices of the design matrix whose ef-
fects do NOT contain the effect being tested.
For example, the XR matrix for Type II A effect is the horizon-
tal concatenation of
- the submatrices for the two effects at the same level as A --
i.e., B and C.
- (no submatrices for effects at lower levels than A because a
main effect is the lowest level possible)
- the submatrix for the B x C interaction (since this interac-
tion does not contain the A effect).
Thus the sum of squares for the Type II A effect is computed as
follows:
Type II A
y = m + b + c + bc
y = m + a + b + c + bc */
XE = aD;
XR = J(48,1) || bD || cD || bcD;
call SS(result, y, XE, XR, level,printss,printh);
/* Type II B
y = m + a + c + ac
y = m + a + b + c + ac */
XE = bD;
XR = J(48,1) || aD || cD || acD;
call SS(result, y, XE, XR, level,printss,printh);
/* Type II A x B
y = m + a + b + c + ac + bc
y = m + a + b + c + ab + ac + bc */
XE = abD;
XR = J(48,1) || aD || bD || cD || acD || bcD;
call SS(result, y, XE, XR, level,printss,printh);
/* Type II C
y = m + a + b + ab
y = m + a + b + c + ab */
XE = cD;
XR = J(48,1) || aD || bD || abD;
call SS(result, y, XE, XR, level,printss,printh);
/* Type II A x C
y = m + a + b + c + ab + bc
y = m + a + b + c + ab + ac + bc */
XE = acD;
XR = J(48,1) || aD || bD || cD || abD || bcD;
call SS(result, y, XE, XR, level,printss,printh);
/* Type II B x C
y = m + a + b + c + ab + ac
y = m + a + b + c + ab + ac + bc */
XE = bcD;
XR = J(48,1) || aD || bD || cD || abD || acD;
call SS(result, y, XE, XR, level,printss,printh);
/*
HTI = SAS Type III SUMS OF SQUARES
The XR matrix for an HTI (Higher-level Terms Included) sum of
squares always consists of the horizontal concatenation of sub-
matrices for ALL the terms on the right side of the "saturated"
version of the overparameterized model equation (except the er-
ror term and except the term specified in the specification of
XE). (The saturated version of a model equation is simply the
model that contains all the possible terms.) Thus for the
Searle data the XR matrix for an HTI sum of squares is a hori-
zontal concatenation of seven matrices in each of the six calls
to SS that follow.
HTI A
y = m + b + c + ab + ac + bc + abc
y = m + a + b + c + ab + ac + bc + abc */
XE = aD;
XR = J(48,1) || bD || cD || abD || acD || bcD || abcD;
call SS(result, y, XE, XR, level,printss,printh);
/* HTI B
y = m + a + c + ab + ac + bc + abc
y = m + a + b + c + ab + ac + bc + abc */
XE = bD;
XR = J(48,1) || aD || cD || abD || acD || bcD || abcD;
call SS(result, y, XE, XR, level,printss,printh);
/* HTI A x B
y = m + a + b + c + ac + bc + abc
y = m + a + b + c + ab + ac + bc + abc */
XE = abD;
XR = J(48,1) || aD || bD || cD || acD || bcD || abcD;
call SS(result, y, XE, XR, level,printss,printh);
/* HTI C
Searle's exact answer (1987, 394) is 62
y = m + a + b + ab + ac + bc + abc
y = m + a + b + c + ab + ac + bc + abc */
XE = cD;
XR = J(48,1) || aD || bD || abD || acD || bcD || abcD;
call SS(result, y, XE, XR, level,printss,printh);
/* HTI A x C
y = m + a + b + c + ab + bc + abc
y = m + a + b + c + ab + ac + bc + abc */
XE = acD;
XR = J(48,1) || aD || bD || cD || abD || bcD || abcD;
call SS(result, y, XE, XR, level,printss,printh);
/* HTI B x C
Searle's exact answer (1987, 395) is 192(1678)/11505
= 28.00312 90743 15514 99
y = m + a + b + c + ab + ac + abc
y = m + a + b + c + ab + ac + bc + abc */
XE = bcD;
XR = J(48,1) || aD || bD || cD || abD || acD || abcD;
call SS(result, y, XE, XR, level,printss,printh);
/*
HTO SUMS OF SQUARES
The XR matrix for an HTO sum of squares consists of the hori-
zontal concatenation of the submatrices of the design matrix
for all the effects at the same level as and at lower levels
than the effect being tested. That is, Higher-level Terms are
Omitted (HTO).
Note the similarity and difference between the HTO and the SAS
Type II methods: In a two-way experiment the HTO and SAS Type
II sums of squares are identical. But in the three-way case
and higher there are differences. In the three-way case the
differences are only in the main effects. That is, for an HTO
main effect the submatrices for all the higher-level effects
are omitted from XR. On the other hand, for a SAS Type II main
effect the submatrices for higher-level effects that do not
contain the effect being tested are included in XR.
For example, the XR matrix for the SAS Type II A main effect
(as discussed above in the Type II section) is
XR = J(48,1) || bD || cD || bcD.
But the XR matrix for the HTO A main effect is
XR = J(48,1) || bD || cD.
In cases in which there is no B x C interaction extant in the
population, the HTO sum of squares provides a slightly more
powerful statistical test of the A effect than the SAS Type II
sum of squares. (In cases in which there *is* an extant B x C
interaction, the HTO approach should be not used to test for
the A effect because, as I shall demonstrate in later material,
an extant B x C interaction can "contaminate" the HTO A statis-
tical test.)
Following are the three cases where the HTO sums of squares and
the SAS Type II sums of squares differ. (The HTO sums of
squares that appear below do not appear in the output from PROC
GLM because GLM cannot directly compute HTO sums of squares.)
HTO A
y = m + b + c
y = m + a + b + c */
XE = aD;
XR = J(48,1) || bD || cD;
call SS(result, y, XE, XR, level,printss,printh);
/* HTO B
y = m + a + c
y = m + a + b + c */
XE = bD;
XR = J(48,1) || aD || cD;
call SS(result, y, XE, XR, level,printss,printh);
/* HTO C
y = m + a + b
y = m + a + b + c */
XE = cD;
XR = J(48,1) || aD || bD;
call SS(result, y, XE, XR, level,printss,printh);
/*
HTOS SUMS OF SQUARES
I discuss HTOS sums of squares in a paper (1997, appendix D).
As with all the other types of sums of squares, the XE matrix
for an HTOS sum of squares consists of the submatrix of the de-
sign matrix for the effect being tested.
Recall the definition of one effect containing another effect
given in the discussion above of the SAS Type II sums of
squares. The XR matrix for an HTOS sum of squares consists of
the horizontal concatenation of
- the submatrices for all the effects at the same level as the
effect being tested (just like the HTO, Type II, and HTI sums
of squares)
- the submatrices for all the effects (if any) at lower levels
than the effect being tested (just like the HTO, Type II,
HTI, and sequential sum of squares).
- the submatrices for non-containing higher-level interactions
(just like the SAS Type II sums of squares) *if evidence of
these interactions is found in the data*.
Note that the HTOS approach is a conditional approach, with the
formula for computing a sum of squares depending on whether
evidence of higher-level non-containing interactions is found
in the data.
In the three-way case there are only three sums of squares in
which the HTOS sums of squares are conceptually (but not compu-
tationally) different from other sums of squares computed
above, namely the three main-effect sums of squares. If evi-
dence of a non-containing higher-level interaction is found,
the HTOS sum of squares for a main effect is identical to the
corresponding SAS Type II sums of squares, as discussed and
computed above. But if no evidence of a non-containing inter-
action is found, the HTOS sum of squares is identical to the
corresponding HTO sum of squares, as also discussed and com-
puted above.
The HTOS approach has two useful features:
1. If we are testing an effect and there is evidence that a
non-containing higher-level interaction exists in the data,
the HTOS approach correctly takes account of the interaction
in the computation, thereby ensuring that the statistical
test is valid.
2. On the other hand, if there is no evidence that the non-con-
taining interaction exists, the HTOS approach provides a
valid statistical test for the existence of a relationship
between the response variable and the relevant predictor
variable(s). This test is slightly more powerful than the
HTI and SAS Type II tests. I shall discuss the validity and
power of this approach in later material.
QUIT FROM PROC IML
This ends the computation of sums of squares in PROC IML.
Quit from IML.
*/
quit;
/*
GLM ANALYSIS
Analyze the data with PROC GLM for comparison with the above
output from IML. (The output from GLM comes in a separate out-
put file.)
Examination of the GLM output reveals that all the GLM sums of
squares are identical in all available digits to the sums of
squares computed above in IML.
*/
proc glm data=Searle_2;
class a b c;
model y = a | b | c / ss1 ss2 ss3 ss4;
quit;
options date linesize=80 probsig=2;
title ' ';
/*
APPENDIX: STEPS TO RUN THIS PROGRAM
1. Ensure that the STAT and IML components of the SAS system
are available on your computer. Information about the SAS
system is available at http://www.sas.com
2. Ensure that you have the source version of this program,
which is called PR0165.SAS (not the HTML version, which is
called PR0165.HTM). You can obtain a copy of the source
version in the "Computer Programs" section of the page at
http://www.matstat.com/ss/
3. Install a copy of the SS subroutine on your computer. This
subroutine does the actual computations of sums of squares
and is available at the above MatStat web site.
4. Edit the %INCLUDE statement above to correctly point to the
location of the SS.SAS subroutine file on your computer.
That is, change the
D:\PROGS\SS.SAS
in the statement to the location where SS.SAS is stored on
your computer.
5. (Optional.) Modify the two OPTIONS statements in the pro-
gram that set the DATE, LINESIZE, and PROBSIG options.
6. Submit the program to SAS.
REFERENCES
Macnaughton, D. B. 1997. Which sums of squares are best in un-
balanced analysis of variance? Available at
http://www.matstat.com/ss/
Macnaughton, D. B. 1998. PR0139.HTM: Computing numerator sums
of squares in unbalanced analysis of variance: Two-way
case). Available in the "Computer Programs" section at
http://www.matstat.com/ss/
Searle, S. R. 1987. _Linear Models for Unbalanced Data._ New
York: Wiley.
Yates, F. 1934. The analysis of multiple classifications with
unequal numbers in the different classes. _Journal of the
American Statistical Association_ 29, 51-66.
version of June 20, 1998
(end of program pr0165.sas) */