The following is self-documenting output from a SAS IML computer program:
NOTE: SAS (r) Proprietary Software Release 6.12 TS020
Licensed to DONALD B. MACNAUGHTON, Site 0025250001.
1 /*
2 PR0139.SAS
3
4 COMPUTING NUMERATOR SUMS OF SQUARES
5 IN UNBALANCED ANALYSIS OF VARIANCE:
6 Two-Way Case
7
8 Donald B. Macnaughton
9 donmac@matstat.com
10
11
12 TABLE OF CONTENTS
13
14 Abstract >
15
16 Introduction >
17 - preliminary notes >
18 - introduction to the Searle example >
19 - a goal of experiments and analysis of variance >
20 - a controversy about sums of squares >
21 - model equations >
22 - the residual sum of squares of a model equation >
23 - an interpretation of analysis of variance sums of squares
24 in terms of model equations >
25 - summing up >
26 - program overview >
27
28 Preliminary Steps >
29 - start PROC IML >
30 - define the Searle data >
31        - generate the main effect submatrices of the design matrix >
32        - generate the interaction submatrix of the design matrix >
33 - obtain the SS subroutine and list it >
34 - abstract >
35 - uses >
36 - main arguments: y, XE, and XR >
37 - method >
38 - details of the method >
39            - the hypothesis matrix H as a function of XE and XR >
40 - the projection matrix PM as a function of H >
41 - yp, the projection of y by PM >
42 - the sum of squares as the squared length of yp >
43 - other methods of computing the sum of squares >
44 - general comments >
45 - secondary arguments >
46 - notes >
47 - executable statements >
48 - references >
49 - set the values of the three secondary arguments of the SS
50 subroutine >
51
52 Compute the Seven Sums of Squares Using the SS Subroutine >
53 - HTO A sum of squares >
54 - specify the matrix XE for the effect being tested >
55 - specify the matrix XR for the other effects in the
56 two models >
57 - call SS to compute the sum of squares >
58 - HTO B sum of squares >
59 - HTI A sum of squares >
60 - HTI B sum of squares >
61 - sequential A sum of squares when A is entered first >
62 - sequential B sum of squares when B is entered first >
63 - interaction sum of squares >
64
65 Save the Data in a SAS Dataset and Quit from IML >
66
67 Compute the Sums of Squares using PROC GLM (for comparison with
68 the values generated above) >
69
70 Summary >
71
72 Notes >
73
74 Appendix: Steps to Run the Program >
75
76 References >
77
78 Output from PROC GLM >
79
80
81 ABSTRACT
82
83 This SAS program illustrates a conceptual point of view and the
84 matrix arithmetic for computing the following types of analysis
85 of variance numerator sums of squares:
86
87 - HTO (Higher-level Terms are Omitted)
88 = SPSS ANOVA Experimental
89 = SAS Type II in the two-way case
90
91 - HTI (Higher-level Terms are Included)
92 = SAS Type III
93 = SPSS ANOVA Unique
94 = the default approach in many analysis of variance pro-
95 grams
96
97 - sequential
98 = SAS Type I
99 = SPSS ANOVA Hierarchical in the two-way case.
100
101 The conceptual point of view is one of computing an analysis of
102 variance sum of squares by computing the difference between the
103 residual sums of squares of two model equations (Yates 1934).
104 The matrix arithmetic is simple and is specified directly in
105 terms of the conceptual point of view.
106
107 The program is heavily annotated. Computations are illustrated
108 using data from a 2 x 3 unbalanced experiment discussed by
109 Shayle Searle (1987, 79).
110
111
112 PRELIMINARY NOTES
113
114 If you are not familiar with SAS, you can tell the difference
115 between my comments and the SAS program statements as follows:
116 My comments begin with the two symbols /* and end with the two
117 symbols */ /* Anything outside these symbols (except
118 blanks) is a program statement, which SAS will try to execute.
119
120 To lay some groundwork, I begin the program not with SAS pro-
121 gram statements but instead with about 500 lines of my own com-
122 ments. These set the stage for the later program statements.
123
124
125 INTRODUCTION TO THE SEARLE EXAMPLE
126
127 Analysis of variance is a broadly used method for analyzing the
128 results of scientific experiments. The method was invented by
129 Sir Ronald Aylmer Fisher (1925, 1935) and is generally viewed
130 as the most powerful and versatile available method for scien-
131 tifically inferring causation.
132
133 A current controversy in analysis of variance pertains to ana-
134 lyzing the data from "unbalanced" experiments. The controversy
135 is important because (for various reasons) a large proportion
136 of real-world experiments end up being unbalanced. Shayle
137 Searle addresses the controversy in his book LINEAR MODELS FOR
138 UNBALANCED DATA (1987) in which he discusses both mathematical
139 and philosophical issues. Although I disagree with some of
140 Professor Searle's philosophical conclusions, I am in awe of
141 his mathematical work. It is with deep respect that I offer
142 the following analysis of an important example in his book.
143
144 Searle's example is of an experiment to test whether "type of
145 potting soil" influences "time to germination" in three varie-
146 ties of carrot seed (1987, 78-79).
147
148 (Searle's example is from the field of agriculture. However,
149 the discussion is not limited to experiments in the field of
150 agriculture. Instead, both Searle's discussion and my discus-
151 sion apply to experiments in all fields of empirical research.
152 In particular, the discussion applies to a majority of the ex-
153 periments in the physical, biological, and social sciences.)
154
155 Clearly, the response variable in Searle's experiment is "time
156 to germination", and the two predictor variables are "soil
157 type" and "seed variety". Searle presents the following data
158 as possible results of the experiment:
159
160 TABLE 1
161 Time in Days to First Germination
162 of Three Varieties of Carrot Seed
163 Grown in Two Different Potting Soils
164 -------------------
165 Seed
166 Soil Variety (B)
167 Type ------------
168 (A) 1 2 3
169 -------------------
170 1 6 13 14
171 10 15 22
172 11
173
174 2 12 31 18
175 15 9
176 19 12
177 18
178 -------------------
179
180 The first number in the body of the table (i.e., 6) indicates
181 that in one of the fifteen trials in the experiment it took six
182 days for seeds of variety 1 to germinate when they were planted
183 in soil of type 1.
184
185 Searle's experiment is unbalanced because, as the table shows,
186 the number of values of the response variable available in the
187 various "cells" in the experiment differs from cell to cell.
188 For example, three values of the response variable are avail-
189 able in the (1,1) cell, but only two values are available in
190 the (1,2) cell.
191
192
193 A GOAL OF EXPERIMENTS AND ANALYSIS OF VARIANCE
194
195 When discussing analysis of variance, it is important to be
196 aware of both the goal of scientific experiments and the role
197 of analysis of variance in achieving the goal. Otherwise, the
198 discussion may become an arbitrary mathematical exercise. One
199 useful way of characterizing the goal is
200
201 The goal of experiments and analysis of variance is to
202 obtain knowledge about relationships between variables.
203
204 In any scientific experiment, an important step in achieving
205 this goal is to determine, in as unequivocal a way as possible,
206 whether a relationship actually *exists* between the variables
207 under study. In particular, we wish to determine whether the
208 response variable in an experiment "depends" on one or more of
209 the predictor variables. If a relationship is found between
210 the variables, a second goal is to determine the nature of the
211 relationship.
212
213 Thus in the Searle data we wish to determine whether "time to
214 germination" depends on "soil type", or whether "time to germi-
215 nation" depends on "seed variety".
216
217 It is invariably the case in experimental research that we wish
218 to determine whether the dependence exists in the *population*
219 of entities under study, not just in the sample of entities
220 that participated in the experiment. (In Searle's example the
221 entities are trials, or "carrot seed plantings".)
222
223 In an experiment with two predictor variables, the nature of
224 the relationship between the response variable and the predic-
225 tor variables can be either
226 - no (detectable) relationship or
227 - one or two "simple" relationships ("main effects") or
228 - an "interaction".
229
230 Interactions were invented by Fisher (1935, ch. VI). Interac-
231 tions provide a comprehensive means for detecting any (detect-
232 able) form of relationship that might exist between the re-
233 sponse variable and the predictor variables in an experiment.
234 In particular, interactions help us to detect complicated rela-
235 tionships between the response variable and a predictor vari-
236 able in which the specific form of the relationship depends on
237 the level of one or more *other* predictor variables. I give
238 formal definitions of the concepts of 'relationship between
239 variables', 'interaction', and 'simple relationship' in a paper
240 (1997, sec. 6).
241
242 The use of analysis of variance to detect relationships between
243 variables is (at a high level) straightforward: We submit the
244 data summarizing the results of an experiment (e.g., the data
245 in table 1) to an analysis of variance program, and the program
246 computes a set of "p-values". Assuming the experiment was
247 properly designed, the program provides a p-value for each sim-
248 ple relationship (main effect) and a p-value for each interac-
249 tion. If the p-value for a main effect or interaction is low
250 enough (and if there is no reasonable alternative explanation),
251 we conclude that the particular relationship between variables
252 associated with the p-value is extant in the population of en-
253 tities under study.
254
255 I discuss the general scientific study of relationships between
256 variables (including the notion of a p-value) further in two
257 papers (1996a, 1996b).
258
259
260 A CONTROVERSY ABOUT SUMS OF SQUARES
261
262 To compute an analysis of variance p-value from the results of
263 an experiment, it is mathematically necessary to first compute
264 certain "sums of squares". All analysis of variance programs
265 compute these sums of squares as an intermediate step in com-
266 puting p-values. Currently controversy exists about how sums
267 of squares should be computed. Controversy exists about both
268 the numerator sum of squares and the denominator sum of squares
269 used in the "F-ratio" to compute a p-value. The present dis-
270 cussion focuses exclusively on computing numerator sums of
271 squares for unbalanced experiments.
272
273 (Since balanced experiments are an "internal" limiting case of
274 unbalanced experiments, the discussion below also applies to
275 balanced experiments.)
276
277 This program illustrates the computations of the three best-
278 known conceptual methods for computing numerator sums of
279     squares.  The program also illustrates the mathematical aspects
280     of computing numerator sums of squares by demonstrating a simple
281     mathematical algorithm that can carry out all three conceptual
282     methods.
283
284 The purpose of this program is not to judge the various methods
285 of computing sums of squares, but rather to illustrate them.
286 Therefore, I make no judgments here about the merits of the
287 various methods. However, I do make judgments in the paper
288 (1997) and I shall extend these judgments in material I shall
289 publish later.
290
291
292 MODEL EQUATIONS
293
294 To understand sums of squares, it is useful to understand the
295 notion of a "model equation" of the relationship between the
296 response variable and the predictor variables in an experiment.
297 Two types of model equations (often called simply "models") are
298 in use today: "overparameterized" models and "cell-means" mod-
299 els. I discuss both types in the paper (1997, sec. 9). The
300 following discussion is in terms of overparameterized models.
301
302 Consider a two-way experiment (e.g., the Searle experiment)
303 that has predictor variables A and B. We can model the rela-
304 tionship between the response variable, called y, and the pre-
305 dictor variables with the following model equation:
306
307 y(i,j,k) = mu + alpha(i) + beta(j) + phi(i,j) + e(i,j,k). (1)
308
309 (Normally the i's, j's and k's in the parentheses would be sub-
310 scripts, and the five terms on the right side of the equation
311 would be Greek letters, but these features are not yet avail-
312 able for comments in computer programs.)
313
314 The terms in the equation have the following interpretations:
315
316 y(i,j,k) = value of the response variable for the kth entity in
317 the (i,j) treatment group in the experiment
318
319 mu = grand mean of the values of the response variable
320 for all the entities in the population
321
322 alpha(i) = simple ("main") effect of predictor variable A on y
323 when A is at level i
324
325 beta(j) = simple ("main") effect of predictor variable B on y
326 when B is at level j
327
328 phi(i,j) = the joint (interaction) effect of predictor vari-
329 ables A and B on y when they are at levels i and j
330 respectively
331
332 e(i,j,k) = the "error" term, which takes account of the vari-
333 ation in y that cannot be accounted for by the other
334 four terms on the right side of the equation.
335
336 Model (1) gives us a succinct picture of how the value of the
337 response variable "depends" on the values of the two predictor
338 variables, and how it also depends on other unknown factors,
339 which are taken account of by the error term. Model (1) is
340 called the "saturated" model for a two-way experiment because
341 it contains all the possible terms for such an experiment.
342
343 As we shall see, it often makes sense to use a reduced version
344 of (1) in which certain terms are omitted from the right side.
345 For example, if we omit the interaction term, phi, we get
346
347 y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k). (2)
348
349
350 THE RESIDUAL SUM OF SQUARES OF A MODEL EQUATION
351
352 Once we have the data from an experiment (e.g., the data in ta-
353 ble 1), we can "fit" various models to the data. That is, we
354 can use a mathematical algorithm to compute values for the pa-
355 rameters associated with the terms in the model [i.e., values
356 for mu, the alpha(i)s, the beta(j)s, and the phi(i,j)s]. The
357 algorithm operates by choosing values for the parameters asso-
358 ciated with the four terms so that the resulting model gives
359 the "best" predictions of the values of the response variable y
360 for all the values of y obtained in the experiment.
361
362 The fitting of the terms is usually done by the method of least
363 squares or by the closely related method of maximum likelihood.
364 In the standard case, both methods yield "identical" estimates
365 for the values of the parameters associated with the terms
366 (excluding the error term e) on the right side of the equation
367 in the sense that
368 the sum of the squared differences between the values
369 of the response variable predicted by the equation and
370 the *actual* values of the response variable in the ex-
371 periment
372 is the lowest possible value.
373
374 After we have fitted a model to the results of an experiment
375 and obtained estimates for the values of the parameters, we can
376 then use the model to compute the predicted value for each of
377 the values of the response variable obtained in the experiment.
378 Then we can subtract each predicted value from the correspond-
379 ing actual value to get a number called the "residual". If we
380 square each of these residuals and add the squared residuals
381 together, we get the "residual sum of squares" for the model
382 for the experimental data at hand.
383
384 (Of course, the residual sum of squares is the number that was
385 itself minimized two paragraphs above in order to determine the
386 estimates of the values of the parameters -- a piece of mathe-
387 matical bootstrapping that still amazes me.)
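
     To make this concrete, here is a minimal IML sketch (not part of
     the executable program below) of computing the residual sum of
     squares for the saturated model (1).  The sketch assumes the re-
     sponse vector y and the design submatrices Adesign, Bdesign, and
     ABdesign that are defined and generated in the preliminary steps
     further below; the names X1, est1, and R1 are used only for il-
     lustration:

        X1   = j(15,1) || Adesign || Bdesign || ABdesign
                                      <-- design matrix for model (1)
        est1 = ginv(X1) * y           <-- least-squares estimates of
                                          the parameter values
        R1   = ssq(y - X1 * est1)     <-- residual sum of squares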
388
389
390 AN INTERPRETATION OF ANALYSIS OF VARIANCE SUMS OF SQUARES
391 IN TERMS OF MODEL EQUATIONS
392
393 We can view all standard analysis of variance numerator sums of
394 squares as being the value we obtain if we subtract the resid-
395 ual sum of squares for one model from the residual sum of
396 squares for another model (Yates 1934, 63). For example, con-
397 sider the following two models for the results of a two-way ex-
398 periment:
399
400 y(i,j,k) = mu + beta(j) + e(i,j,k) (3)
401
402 y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k). (2)
403
404 I shall use the term Rn to denote the residual sum of squares
405 for model (n). Thus R3 denotes the residual sum of squares for
406 (3).
407
408 Suppose we perform a two-way experiment (such as the Searle ex-
409 periment), and suppose we (separately) fit models (2) and (3)
410 to the results of the experiment, yielding R2 and R3. If we
411 subtract R2 from R3, this difference is identical to the HTO (=
412 SAS Type II = SPSS ANOVA Experimental) sum of squares for the A
413 main effect.
414
415 It stands to reason that the numerical difference R3 - R2
416 should equal a sum of squares for the A main effect since the
417 only conceptual difference between the two models is that (2)
418 contains alpha, the term for predictor variable A, and (3) does
419 not.
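
     As an illustration (again not part of the executable program be-
     low), the HTO A sum of squares could be computed directly from
     this characterization with IML statements such as the following,
     assuming the vector y and the submatrices Adesign and Bdesign
     generated in the preliminary steps below:

        X3 = j(15,1) || Bdesign             <-- model (3): mu + beta
        X2 = j(15,1) || Adesign || Bdesign  <-- model (2): mu + alpha
                                                + beta
        R3 = ssq(y - X3 * ginv(X3) * y)     <-- residual SS for (3)
        R2 = ssq(y - X2 * ginv(X2) * y)     <-- residual SS for (2)
        htoA = R3 - R2                      <-- HTO A sum of squares

     (The SS subroutine introduced below arrives at the same value
     through a projection matrix rather than through two explicit
     model fits.)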
420
421 Consider the following two models:
422
423 y(i,j,k) = mu + beta(j) + phi(i,j) + e(i,j,k) (4)
424
425 y(i,j,k) = mu + alpha(i) + beta(j) + phi(i,j) + e(i,j,k). (1)
426
427 Note that (4) and (1) are the same as (3) and (2) respectively,
428 except that both (4) and (1) have an extra term, namely
429 phi(i,j). R4 - R1 is identical to the HTI (= SAS Type III =
430 SPSS ANOVA Unique) sum of squares for the A main effect.
431
432 We can see the source of the names HTI and HTO by studying the
433 two pairs of models above. For the HTI sum of squares the
434 Higher-level interaction Term [phi(i,j)] is Included (HTI) in
435 both models. For the HTO sum of squares the Higher-level in-
436 teraction Term is Omitted (HTO) from both models.
437
438 Finally, consider the following two models:
439
440 y(i,j,k) = mu + e(i,j,k) (5)
441
442 y(i,j,k) = mu + alpha(i) + e(i,j,k). (6)
443
444 Note that (5) and (6) are the same as (3) and (2) respectively,
445 except that both (5) and (6) lack the term beta(j). R5 - R6 is
446 identical to the sequential (= SAS Type I = SPSS ANOVA Hierar-
447 chical) sum of squares for the A main effect when A is entered
448 first (after the mean) into the model.
449
450 If we compute the difference between the residual sums of
451 squares of two models, the difference is called the "sum of
452 squares for the effect being tested". The "effect" is the term
453 that is present in one of the two models, but absent from the
454 other.
455
456 Each sum of squares has associated with it a number of "degrees
457 of freedom".
458
459 (For any main effect, the number of degrees of freedom is one
460 less than the number of values assumed by the associated pre-
461 dictor variable in the experiment. For example, in the Searle
462 data, predictor variable B assumes three different values in
463 the experiment so the number of degrees of freedom for B is
464 two. For an interaction, the number of degrees of freedom is
465 the product of the numbers of degrees of freedom for the main
466 effects for each of the predictor variables associated with the
467 interaction.)
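
     Equivalently, in terms of the full-column-rank design submatri-
     ces generated in the preliminary steps below, the degrees of
     freedom for an effect is simply the number of columns in its
     submatrix of the design matrix.  A minimal IML sketch, assuming
     those submatrices:

        dfA  = ncol(Adesign)          <-- 2 levels - 1 = 1
        dfB  = ncol(Bdesign)          <-- 3 levels - 1 = 2
        dfAB = ncol(ABdesign)         <-- 1 x 2 = 2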
468
469 If we divide a sum of squares for a particular effect by its
470 degrees of freedom, we get the "mean square" for the effect.
471 Similarly, if we divide the residual sum of squares for the
472 saturated model by *its* degrees of freedom, we get the resid-
473 ual mean square. The reason we might *want* to compute any of
474 these mean squares rests on three facts
475
476 - If there is *no* relationship in the population between the
477 response variable and predictor variable A, clearly the cor-
478 rect value for alpha(i) is zero for all values of i in all
479 the equations above. In this case, it can be shown that the
480 three mean squares for the A effect can be expected (under
481 certain often-satisfiable assumptions) to equal the "residual
482 variance" in the experiment. (The residual variance is esti-
483 mated by the residual mean square.)
484
485 - If there *is* a relationship between A and the response vari-
486 able, the three mean squares for the A effect can usually be
487 expected to be *greater* than the residual variance.
488
489 - Thus to determine whether there is evidence of a relationship
490 between the response variable and predictor variable A, we
491 need only determine whether the appropriate effect mean
492 square is significantly greater than the residual mean
493 square. The p-value is simply an easy-to-interpret result of
494 this determination.
495
496 These facts, buttressed by mathematical arguments (Searle 1987,
497 sec. 8.6), imply that the three approaches to computing sums of
498 squares provide (with certain limitations) valid candidates for
499 the numerator sums of squares in F-ratios used to compute p-
500     values for testing the existence of a relationship between
501 the response variable and predictor variable A.
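
     To make the last step concrete, here is a minimal IML sketch of
     converting an effect sum of squares into a p-value.  The names
     are illustrative; ssA, dfA, ssResid, and dfResid are assumed to
     hold the A sum of squares, its degrees of freedom, the residual
     sum of squares for the saturated model, and its degrees of free-
     dom:

        msA     = ssA / dfA               <-- mean square for A
        msResid = ssResid / dfResid       <-- residual mean square
        F       = msA / msResid           <-- F-ratio
        p       = 1 - probf(F, dfA, dfResid)
                                          <-- p-value for the A effect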
502
503 (The technique I discuss above for detecting relationships be-
504 tween variables is undeniably complex. A questioning reader
505 might wonder whether a simpler technique [with reasonable power
506 and objectivity] might be found. So far, no such technique has
507 been found.)
508
509 The above discussion states that three popular types of numera-
510 tor sums of squares (HTO, HTI, and sequential) can be computed
511 in a two-way experiment by computing the difference between the
512 residual sums of squares of two model equations. This general-
513 izes:
514
515 All standard analysis of variance numerator sums of
516 squares for two-way, three-way, and higher experiments
517 can be viewed as reflecting the difference between the
518 residual sums of squares of two overparameterized model
519 equations.
520
521
522 SUMMING UP
523
524 - an important use of analysis of variance is to help research-
525 ers detect relationships between variables in populations of
526 entities
527
528 - we can detect relationships between variables by studying the
529 p-values obtained by applying analysis of variance to the re-
530 sults of an experiment
531
532 - analysis of variance computer programs compute numerator sums
533 of squares as an intermediate step in computing p-values
534
535 - controversy exists about which method of computing numerator
536 sums of squares is preferred in unbalanced experiments
537
538 - one easy-to-understand way of viewing all standard approaches
539 to computing numerator sums of squares is to view them as re-
540 flecting the difference between the residual sums of squares
541 of two different overparameterized model equations for the
542 relationship between the response variable and predictor
543 variables in the experiment.
544
545
546 PROGRAM OVERVIEW
547
548 The program statements below demonstrate a simple algorithm for
549 computing analysis of variance numerator sums of squares. The
550 algorithm takes three pieces of information as input
551
552 1. a specification of the two model equations whose residual
553 sums of squares we wish to "difference" in order to compute
554        an analysis of variance numerator sum of squares [e.g., (3)
555 and (2) above]
556
557 2. a specification of the layout of the experiment
558
559 3. the response variable data vector containing the values of
560 the response variable obtained in the experiment.
561
562 The algorithm uses the input to compute the specified sum of
563 squares.
564
565 To show that the algorithm works properly, the program computes
566 seven analysis of variance sums of squares for the Searle data
567 given above.
568
569 The program below is organized into five parts:
570
571 1. a set of six preliminary steps to get things ready for com-
572 puting the sums of squares
573
574 2. seven repetitions of three simple steps to compute the seven
575 sums of squares
576
577 3. a recomputation of the seven sums of squares with SAS PROC
578 GLM for comparison
579
580 4. a summary
581
582 5. notes, an appendix, and references.
583
584
585 PRELIMINARY STEP 1: START PROC IML AND RESET SOME OPTIONS
586
587 PROC IML (Interactive Matrix Language) is an easy-to-use com-
588 puter language for general matrix arithmetic. It is an add-on
589 component of the SAS system and has many built-in functions to
590 facilitate matrix operations particular to statistical analy-
591 sis.
592
593 The PROC IML statement below initiates the IML environment.
594 After that statement is executed, the statements that follow
595 are statements in the IML language until we reach the QUIT
596 statement, which takes us back into a standard SAS program.
597 */
598
599 proc iml;
IML Ready
600
601 /*
602 The options in the following RESET statement control the desti-
603 nation and appearance of the IML output.
604 */
605
606 reset log fw=3 spaces=3;
607
608
609 /*
610 PRELIMINARY STEP 2: DEFINE THE SEARLE DATA
611
612 To get the Searle data into IML, we define two column vectors
613 (a and b) of values for the two predictor variables, and we de-
614 fine a column vector y, containing the values of the response
615 variable. Following are the three IML statements that define
616 the vectors a, b, and y:
617 */
618
619 a = { 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,2, 2 };
620
621 b = { 1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 2, 3,3, 3 };
622
623 y = { 6,10,11, 13,15, 14,22, 12,15,19,18, 31, 18,9,12 };
624 /*
625 Note how the values in the three vectors "line up" correctly
626 with each other to reflect exactly the same information as is
627 given near the beginning of this program in table 1. For exam-
628 ple, the last three values in y (i.e., 18, 9, and 12) each line
629 up with a value of 2 in vector a and a value of 3 in vector b,
630 reflecting the fact that the last three values in y are associ-
631 ated with the (2,3) cell in table 1.
632
633 The set of fifteen numbers between the braces in each of the
634 three IML statements above is called a "matrix literal". Al-
635 though the above three matrix literals are laid out horizon-
636 tally, they do not specify row vectors but instead specify
637 *column* vectors. They specify column vectors because each of
638 the numbers (except the last) in each matrix literal is fol-
639 lowed by a comma, which in an IML matrix literal indicates the
640 end of a row. (Numbers *within* a row of an IML matrix literal
641 are not separated by commas, but by blanks.)
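
     For example, the following two purely illustrative statements
     define a 1 x 3 row vector and a 3 x 1 column vector respec-
     tively:

        r = {4 7 9}        <-- blanks separate columns within a row
        c = {4, 7, 9}      <-- commas end rows: a column vector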
642
643
644 PRELIMINARY STEP 3: GENERATE THE MAIN EFFECT SUBMATRICES
645
646 Before we can begin computing sums of squares, we must first
647 generate the "full-column-rank" submatrices of the overall
648 "design matrix" for the experiment. These submatrices (which
649 are surprisingly simple) are used in the computation of the
650 sums of squares.
651
652 Any analysis of variance has a separate submatrix of the design
653 matrix for each of its main effects (i.e., for each predictor
654 variable) and a separate submatrix for each interaction in the
655 experiment. Each submatrix has N rows, where N equals the num-
656 ber of elements (entities, rows, cases, observations) in the
657 data. (In the Searle data N is fifteen since there were fif-
658 teen trials in the experiment; these trials are reflected in
659 the fifteen elements in each of vectors a, b, and y defined
660 above.) Each submatrix has DF columns, where DF is the number
661 of degrees of freedom associated with the main effect or inter-
662 action associated with the submatrix.
663
664 The full-column-rank submatrix for any main effect can be gen-
665 erated with a single statement in IML. For example, to gener-
666 ate the submatrix for the main effect of predictor variable B,
667 and to store that submatrix in the matrix called Bdesign, we
668 use the following statement:
669
670 Bdesign = designf(b)
671
672 where b is the column vector containing the raw values of pre-
673 dictor variable B used in the experiment.
674
675 DESIGNF is a built-in function in IML. The function returns a
676 matrix of zeros and (positive and negative) ones, with N rows
677 and DF columns, where N and DF are as described above.
678
679 Consider the main effect for predictor variable B (seed vari-
680 ety) in the Searle data. Here is a copy of the definition
681 (given above) of the (column) vector showing the different val-
682 ues of predictor variable B for the fifteen trials in the ex-
683 periment:
684
685 b = {
686 1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 2, 3, 3, 3 }.
687
688 Following is the IML statement to compute the submatrix of the
689 design matrix for predictor variable B. This is followed by a
690 statement to print the newly computed matrix. (The matrix is
691 printed transposed for ease of study.)
692 */
693
694 Bdesign = designf(b);
695
696 print (Bdesign`);
#TEM1001
1 1 1 0 0 -1 -1 1 1 1 1 0 -1 -1 -1
0 0 0 1 1 -1 -1 0 0 0 0 1 -1 -1 -1
697 /*
698 Consider the first row of 1's, zeros, and -1's in the output
699 above. Compare the values in this row with the values in vec-
700 tor B given further above. Note how the 1's identify the cases
701 in which B has the value 1. Similarly, the zeros identify the
702 cases in which B has the value 2. Finally, the -1's identify
703 the cases in which B has the value 3.
704
705 Similarly, in the second row note how the 1's identify the
706 cases in which B has the value 2. And the -1's identify
707 (again) the cases in which B has the value 3. The zeros iden-
708 tify the cases in which B has the value 1.
709
710 This generalizes: Suppose some (discrete-valued) predictor
711 variable P has assumed m different values in an experiment.
712 (The value of m must be greater than or equal to 2 for P to be
713 a valid predictor variable.) Then the (transposed) submatrix
714 for P will have m-1 rows, each row being associated with one of
715 the m values, and with one of the m values being "left out"
716 with no row of its own. Within each row the elements are +1 if
717 this case received the associated value of predictor variable
718 P; -1 if the case received the "left out" value of predictor
719 variable P; and zero if m is greater than 2 and the case re-
720 ceived one of the other m-2 values of predictor variable P.
721
722 Design matrices are mysterious because, in the case of comput-
723 ing the standard analysis of variance sums of squares, they are
724 not unique. That is, we could get exactly the same results in
725 the output below (apart from slight differences due to roundoff
726 errors) if we specified a new submatrix of the design matrix
727 for predictor variable B provided that the two columns in the
728 new matrix were themselves any two independent "linear combina-
729 tions" of the two columns of the submatrix for B given above.
730
731 How is it possible to get the same results if a submatrix of
732 the design matrix is replaced with a transformed version of it-
733 self? This puzzling aspect of design matrices is explained by
734 the fact that the relevant information (for computing a sum of
735 squares) in a submatrix of a design matrix is not stored in the
736 *values* of the elements in the matrix -- it is stored in the
737 *relationships among* the values. Independent linear combina-
738 tions of the columns in a submatrix preserve this relevant in-
739 formation.
740
741 (Technically, the linear combinations preserve the information
742 in the sense that all transformed sets of columns of a sub-
743 matrix of the design matrix so-defined are generating vectors
744 of the same relevant subspace of the N-dimensional vector space
745 under study.)
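
     Purely as an illustration (not something done in this program),
     the following statements would build such an alternative subma-
     trix from Bdesign, where T may be any non-singular 2 x 2 matrix:

        T        = {1 1,
                    0 1}        <-- an arbitrary non-singular matrix
        Bdesign2 = Bdesign * T  <-- columns are independent linear
                                    combinations of the columns of
                                    Bdesign

     Using Bdesign2 in place of Bdesign in the statements below would
     (apart from roundoff differences) reproduce the same sums of
     squares.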
746
747 Now let us compute the submatrix of the design matrix for pre-
748 dictor variable A (soil type). Here is a copy of the (column)
749 vector of values from the definition given above for this vari-
750 able:
751
752 a = {
753 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2 }
754
755 Following are the statements to compute and print the submatrix
756 of the design matrix for predictor variable A. (The matrix is
757 again printed transposed for ease of study.)
758 */
759
760 Adesign = designf(a);
761
762 print (Adesign`);
#TEM1001
1 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 -1 -1
763 /*
764 Since predictor variable A assumed only two values in the ex-
765 periment, the (transposed) submatrix of the design matrix has
766 only one row.
767
768 As was the case with predictor variable B, note how the 1's and
769 -1's above in Adesign line up respectively with the 1's and 2's
770 in the column vector of the values of predictor variable A.
771
772
773 PRELIMINARY STEP 4: GENERATE THE INTERACTION SUBMATRIX
774
775 Each interaction in an experiment has its own submatrix of the
776 design matrix. We can generate an interaction submatrix by
777 computing the "horizontal direct product" of the design matri-
778 ces for all the main effects for the predictor variables asso-
779 ciated with the interaction.
780
781 Consider two matrices P and Q with n rows and np and nq col-
782 umns, respectively. The horizontal direct product of P and Q
783 has n rows and np x nq columns. Each column in the horizontal
784 direct product is associated with a unique pair of columns, one
785 from P and one from Q. The elements in a given column in the
786     horizontal direct product are the products of the corre-
787 sponding elements in the two associated columns of P and Q.
788
789 For example, consider the following two matrices:
790
791 P = {1 2, and Q = {1 6,
792 3 4, 2 5,
793 5 6 } 3 4 }.
794
795 The horizontal direct product of P and Q is
796
797 { 1 6 2 12,
798 6 15 8 20,
799 15 20 18 24 }.
800
801 The IML function HDIR(P,Q) computes the horizontal direct prod-
802 uct of two matrices P and Q.
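
     If you wish to verify the small example above, the following
     purely illustrative statements (not part of this program) repro-
     duce it:

        P  = {1 2, 3 4, 5 6}
        Q  = {1 6, 2 5, 3 4}
        PQ = hdir(P, Q)          <-- the 3 x 4 matrix shown above
        print PQ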
803
804 Following is a statement to generate the submatrix of the de-
805 sign matrix for the A x B interaction in the Searle data. The
806 submatrix is generated by computing the horizontal direct prod-
807 uct of the two design submatrices (Adesign and Bdesign) gener-
808 ated above. This is followed by a statement to print the in-
809 teraction submatrix (again transposed for ease of study).
810 */
811
812 ABdesign = hdir(Adesign, Bdesign);
813
814 print (ABdesign`);
#TEM1001
1 1 1 0 0 -1 -1 -1 -1 -1 -1 0 1 1 1
0 0 0 1 1 -1 -1 0 0 0 0 -1 1 1 1
815
816
817 /*
818 COMPUTATION STEP 1: FOR A GIVEN SUM OF SQUARES SPECIFY XE
819 (WE SHALL FIRST COMPUTE THE HTO SUM OF SQUARES FOR A)
820
821 Having computed the three submatrices of the design matrix, we
822 can begin computing sums of squares. Let us first compute the
823 HTO sum of squares for predictor variable A (soil type).
824
825 The actual computing of sums of squares in this program is done
826 by a subroutine called SS. Using a subroutine is efficient be-
827 cause it consolidates the code to compute sums of squares into
828 a reusable and portable package. I discuss the SS subroutine
829 in detail below.
830
831 In order to compute a sum of squares, the SS subroutine re-
832 quires three pieces of input:
833
834 1. the values of the response variable obtained in the experi-
835 ment
836
837 2. information about which sum of squares we wish to compute,
838 as specified by two overparameterized model equations, as
839 discussed above
840
841 3. information about the layout of the experiment.
842
843 The first piece of input, the values of the response variable,
844 was defined above in the definition of the vector y. Thus we
845 need only pass the vector y to the subroutine for the subrou-
846 tine to use the values of the response variable.
847
848 The second and third pieces of input are passed to the subrou-
849 tine through two matrices that are obtained from the submatri-
850 ces of the design matrix discussed above. The two matrices are
851 called XE and XR.
852
853 The first matrix, XE, is simply the submatrix of the design ma-
854 trix for the particular effect we wish to test. In the present
855 case we are testing the A main effect. Thus we specify XE as
856 follows:
857 */
858
859 XE = Adesign;
860
861
862 /*
863 COMPUTATION STEP 2: FOR A GIVEN SUM OF SQUARES SPECIFY XR
864
865 In the present example we wish to compute the HTO numerator sum
866 of squares for the A main effect. Thus (as discussed above) we
867 wish to compute the difference between the residual sums of
868 squares of the following two model equations:
869
870 y(i,j,k) = mu + beta(j) + e(i,j,k) (3)
871
872 y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k). (2)
873
874 We have already specified the conceptual difference between (3)
875 and (2) in the specification of XE above. Thus all that re-
876 mains in order to fully specify the two equations is to list
877 the "other" terms on the right side of (3) and (2).
878
879 The "other" terms on the right side of (3) and (2) (excluding
880 the error term) are mu and beta. We specify the other terms by
881 "horizontally concatenating" the submatrices for these terms to
882 form the matrix XR.
883
884 As suggested by the name, the horizontal concatenation of two
885 matrices P and Q is the matrix formed by appending the rows of
886 Q to the corresponding rows of P. (Horizontal concatenation of
887 two matrices is possible only if the matrices have the same
888 number of rows.) In SAS IML the expression P || Q is used to
889 specify the horizontal concatenation of two matrices P and Q.
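
     For example, the following purely illustrative statements con-
     catenate a 2 x 2 matrix and a 2 x 1 column vector to form a
     2 x 3 matrix:

        P = {1 2,
             3 4}
        Q = {5,
             6}
        print (P || Q)           <-- prints the matrix {1 2 5,
                                                        3 4 6}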
890
891 Since mu is a constant in the two equations, the submatrix of
892 the design matrix for mu is a column vector of 1's. We can
893 specify a column vector of fifteen 1's with the IML function
894
895 j(15,1).
896
897 Thus we specify the XR matrix for the HTO A (soil type) sum of
898 squares in the experiment with the following statement:
899 */
900
901 XR = j(15,1) || Bdesign;
902
903 /*
904 Here is the (transposed) XR matrix:
905 */
906
907 print (XR`);
#TEM1001
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 0 0 -1 -1 1 1 1 1 0 -1 -1 -1
0 0 0 1 1 -1 -1 0 0 0 0 1 -1 -1 -1
908
909 /*
910     Note that XE and XR contain two types of information:
911
912 - information about which analysis of variance sum of squares
913 we wish to compute (as captured by the *choice* of the sub-
914 matrices we have used to define XE and XR)
915
916 - information about the layout of the experiment (as captured
917 in the *layout* of the submatrices of the design matrix used
918 to define XE and XR).
919
920 The E in XE stands for the Effect being tested. The R in XR
921 stands for otheR terms on the right side of the two model equa-
922 tions.
923
924
925 PRELIMINARY STEP 5:
926 MAKE THE SS SUBROUTINE AVAILABLE TO THE PROGRAM
927
928 (I have delayed this preliminary step until after defining XE
929 and XR, since this will help you to understand the subroutine.)
930
931 Since we have now defined y, XE, and XR for the HTO A (soil
932 type) effect, we are ready to call subroutine SS to compute the
933 sum of squares. In fact, we shall call the subroutine momen-
934 tarily. But before we can actually call the subroutine, we
935 must first make it known to IML. We do this with the %INCLUDE
936 statement that follows.
937
938 Because the SOURCE2 option is specified in the %INCLUDE state-
939 ment, SAS lists all of the statements in the subroutine in the
940 output below immediately after the %INCLUDE statement.
941 */
942
943 %include 'D:\PROGS\SS.SAS' / source2;
NOTE: %INCLUDE (level 1) file D:\PROGS\SS.SAS is file D:\PROGS\SS.SAS.
944 + /*********************************************************/
945 +
946 + start SS (result, y, XE, XR, level,printss,printh);
947 +
948 + /*
949 + SS.SAS
950 +
951 + SUBROUTINE TO COMPUTE ANALYSIS OF VARIANCE
952 + NUMERATOR SUMS OF SQUARES
953 +
954 + Donald B. Macnaughton
955 + donmac@matstat.com
956 +
957 + ABSTRACT
958 +
959 + This SAS IML subroutine computes analysis of variance nu-
960 + merator sums of squares for statistical tests in unbalanced
961 + analysis of variance. The subroutine can compute the fol-
962 + lowing types of sums of squares:
963 +
964 + - HTO (Higher-level Terms are Omitted)
965 + = SPSS ANOVA Experimental
966 + = SAS Type II in the two-way case
967 +
968 + - SAS Type II
969 +
970 + - HTOS (Higher-level Terms are Omitted unless Significant)
971 + = superset of SAS Type II and HTO
972 +
973 + - HTI (Higher-level Terms are Included)
974 + = SAS Type III
975 + = SPSS ANOVA Unique
976 + = the default method in many analysis of variance pro-
977 + grams
978 +
979 + - sequential
980 + = SAS Type I
981 + = SPSS ANOVA Hierarchical in the two-way case
982 +
983 + - SPSS ANOVA Hierarchical
984 +
985 + - other types of sums of squares (including those in analy-
986 + sis of covariance) that can be specified as being equal to
987 + the difference between the residual sums of squares of two
988 + overparameterized model equations.
989 +
990 + Each time the subroutine is called it computes the single
991 + sum of squares specified by the calling arguments and re-
992 + turns the result to the calling program through the returned
993 + argument RESULT.
994 +
995 +
996 + SUBROUTINE USES
997 +
998 + The subroutine
999 +
1000 + - provides an easy way of computing numerator sums of
1001 + squares that cannot be computed in some statistical pack-
1002 + ages
1003 +
1004 + - provides a way of checking how a particular sum of squares
1005 + is computed in a statistical package
1006 +
1007 + - illustrates the mathematical aspects of computing sums of
1008 + squares
1009 +
1010 + - assists in performing simulation tests of conjectures
1011 + about analysis of variance sums of squares.
1012 +
1013 +
1014 + SUBROUTINE MAIN ARGUMENTS
1015 +
1016 + When calling this subroutine you must specify three main ar-
1017 + guments: y, XE, and XR.
1018 +
1019 + The first main argument, y, must be a column vector contain-
1020 + ing the values of the response variable obtained in the ex-
1021 + periment.
1022 +
1023 + The second and third main arguments, XE and XR, must be sub-
1024 + matrices of the (full-column-rank) design matrix for the ex-
1025 + periment. These submatrices jointly specify the particular
1026 + sum of squares to be computed. XE must contain the subma-
1027 + trix associated with the effect being tested. XR must con-
1028 + tain the horizontal concatenation of the submatrices associ-
1029 + ated with all the other terms (except the error term) on the
1030 + right side of the two model equations whose residual sums of
1031 + squares you wish to difference.
1032 +
1033 + I give examples of specification of y, XE, and XR in two
1034 + sample calling programs (1998a, 1998b).
1035 +
1036 +
1037 + SUBROUTINE METHOD
1038 +
1039 + As noted above, the subroutine is supplied with the vector y
1040 + and the matrices XE and XR. Here is the code (all four
1041 + lines) for computing the requested sum of squares and stor-
1042 + ing it in the variable RESULT:
1043 +
1044 + H = XE - XR * ginv(XR) * XE <-- hypothesis matrix
1045 + (dimensions are N x DF)
1046 + PM = H * ginv(H) <-- projection matrix (N x N)
1047 + yp = PM * y <-- projection of y (N x 1)
1048 + result = ssq(yp) <-- squared length of the
1049 + projection (1 x 1).
1050 +
1051 + If you are familiar with matrix algebra, the above code will
1052 + probably be understandable, even if you are not familiar
1053 + with SAS IML. But here are definitions for functions that
1054 + might be puzzling:
1055 +
1056 + ginv(X) = Moore-Penrose generalized inverse of X
1057 +
1058 + ssq(x) = sum of the squares of the elements in x.
1059 +
1060 + The subroutine method is derived from discussion and proofs
1061 + by Hocking (1985) and Searle (1987). Aldrich (1998) dis-
1062 + cusses the history of the methods. I discuss the method in
1063 + more detail below.
1064 +
1065 +
1066 + SUMMARY OF THE REST OF THE SUBROUTINE
1067 +
1068 + The remaining lines in this subroutine (717 lines) contain
1069 + - a discussion of details of the method discussed above
1070 + - a discussion of details of the subroutine operation and
1071 + - the executable version of the four statements given above.
1072 + I recommend that you omit reading these lines unless you are
1073 + interested in the details. (To find the end of the subrou-
1074 + tine to return to reading the calling program, search for
1075 + the row of asterisks.)
1076 +
1077 +
1078 + THE HYPOTHESIS MATRIX H
1079 +
1080 + As noted above, the first step of this subroutine is to gen-
1081 + erate the hypothesis matrix H, using the following state-
1082 + ment:
1083 +
1084 + H = XE - XR * ginv(XR) * XE.
1085 +
1086 + Actually, the above statement is one of two different meth-
1087 + ods that the subroutine can use to generate H. (I describe
1088 + how to specify which method is to be used in the Secondary
1089 + Arguments section below.) The other method is
1090 +
1091 + H = XE - XR * inv(XR` * XR) * XR` * XE.
1092 +
1093 + Algebraically, both methods yield exactly the same result
1094 + for H, as shown by Graybill's theorem 6.2.16 (1983, 112).
1095 +
1096 + (At several places in this subroutine I state that two meth-
1097 + ods yield exactly the same result. Although my statements
1098 + are algebraically correct, if we use the two methods in a
1099 + computer program, usually differences between the values
1100 + computed by the two methods will occur, due to differences
1101 + in roundoff errors. One should be aware of these differ-
1102 + ences but they can usually be ignored because they are usu-
1103 + ally restricted to just the last few significant digits in
1104 + computations that are performed in SAS in sixteen decimal
1105 + digit precision.)
1106 +
1107 + H has the following properties:
1108 +
1109 + 1. H has N rows, where N is the number of rows in y, XE,
1110 + and XR.
1111 +
1112 + 2. H has DF columns, where DF is the number of degrees of
1113 + freedom of the effect being tested.
1114 +
1115 + 3. Choose any cell in the table that summarizes the layout
1116 + of the experiment. (For an example of such a table, see
1117 + Searle's carrot seed germination data [1987, 79], repro-
1118 + duced in table 1 in Macnaughton [1998a].) All the rows
1119 + in H associated with the chosen cell are identical to
1120 + each other.
1121 +
1122 + 4. The sum of each column of H equals zero. This is sur-
1123 + prising because the sums of the columns of XE and XR,
1124 + which are used to generate H, generally do NOT equal
1125 + zero in unbalanced experiments.
1126 +
1127 + 5. If H has only one column (i.e., DF = 1), we can view the
1128 + elements in H as a statement of the hypothesis being
1129 + tested. That is, the elements in H are (indirectly)
1130 + multiplied by the corresponding elements in the response
1131 + data vector (in a "contrast") as a step in computing the
1132 + sum of squares.
1133 +
1134 + 6. If H has more than one column, the columns of H are
1135 + "linearly independent".
1136 +
1137 + 7. If H has more than one column, the elements in each col-
1138 + umn of H can be viewed as a "portion" of the hypothesis
1139 + being tested. That is, each column of H represents a
1140 + separate contrast that is applied to the response data
1141 + vector. The results of these contrasts are mathemati-
1142 + cally combined to compute the sum of squares. (The con-
1143 + trasts are not directly applied to the response data
1144 + vector, but only indirectly through PM, as discussed be-
1145 + low.)
1146 +
1147 + 8. Surprisingly, H is not unique. That is, for any given H
1148 + we can replace it by any of an infinite number of "re-
1149 + lated" matrices (which have, of course, the same number
1150 + of rows and columns as H), and this subroutine will re-
1151 + turn exactly the same value for the sum of squares. H
1152 + does not need to be unique to yield a unique sum of
1153 + squares because the columns of H are not defining them-
1154 + selves. Instead, the columns are defining a unique sub-
1155 + space of the N-dimensional vector space under study.
1156 + Linear algebra shows that other versions of the matrix H
1157 + can be used to define the same unique subspace. More
1158 + specifically, suppose T is ANY "non-singular" matrix
1159 + with DF rows and DF columns. Then we can replace H with
1160 + H1 = H * T and this subroutine will return exactly the
1161 + same sum of squares.
1162 +
1163 +
1164 + THE PROJECTION MATRIX PM
1165 +
1166 + As noted above in the "Method" section, the next step after
1167 + computing H is to compute the projection matrix PM, using
1168 + the following statement:
1169 +
1170 + PM = H * ginv(H).
1171 +
1172 + Actually (as with H), the above statement is one of two dif-
1173 + ferent methods that this subroutine can use to generate PM.
1174 + The other method is
1175 +
1176 + PM = H * inv(H` * H) * H`.
1177 +
1178 + As (again) shown by Graybill's theorem 6.2.16, both methods
1179 + yield exactly the same value for PM.
1180 +
1181 + The method immediately above is equivalent to computing
1182 + Hocking's P(c) in (6.168) (1985, 153). This method is par-
1183 + tially shown to be valid in Graybill's Theorem 4.4.1 (1983,
1184 + 73). Harville proves some useful results for projection ma-
1185 + trices (1997, sec 12.3).
1186 +
1187 + Note how the above two equations imply that PM is a "normal-
1188 + ized" version of the hypothesis matrix H. PM has the fol-
1189 + lowing properties:
1190 +
1191 + 1. PM has N rows and N columns.
1192 +
1193 + 2. Choose any cell in the table that summarizes the layout
1194 + of the experiment. All the rows in PM associated with
1195 + that cell are identical to each other.
1196 +
1197 + 3. PM is symmetric. That is, the first row has exactly the
1198 + same elements (in left-to-right order) as the first col-
1199 + umn (in top-to-bottom order), and the second row has ex-
1200 + actly the same elements as the second column, and so on.
1201 +
1202 + 4. The sum of the elements in each row of PM (and the sum
1203 + of the elements in each column) is equal to zero. This
1204 + implies that if a row of PM is multiplied by the vector
1205 + y, it produces a "contrast" of the values in y.
1206 +
1207 + 5. The trace of PM (i.e., the sum of the elements on the
1208 + descending diagonal) is equal to DF, the number of de-
1209 + grees of freedom associated with the hypothesis being
1210 + tested.
1211 +
1212 + 6. The rank of PM (i.e., the number of linearly independent
1213 + rows or columns in PM) is equal to DF. Thus if DF is
1214 + equal to 1, any row in PM is a multiple of any other
1215 + row. If DF is equal to k, any row in PM can be gener-
1216 + ated as a linear combination of any k mutually independ-
1217 + ent other rows.
1218 +
1219 + 7. The sum of the squares of the elements in PM is equal to
1220 + DF.
1221 +
1222 + 8. The rows (and columns) of PM are linear combinations of
1223 + the columns of H.
1224 +
1225 + 9. PM is unique. That is, suppose a projection matrix PM
1226 + is computed from a given hypothesis matrix H. Suppose
1227 + that then H is multiplied by any non-singular matrix T
1228 + with DF rows and DF columns to yield H1. Suppose that
1229 + then a new projection matrix PM1 is computed from H1.
1230 + Then PM1 = PM.
1231 +
1232 + 10. PM is idempotent. That is, if we multiply PM by itself,
1233 + the answer we get is PM.
1234 +
1235 + 11. If (as discussed below) we project an arbitrary vector x
1236 + through PM to yield x1, and if we then project x1
1237 + through PM to yield x2, we will find that x1 and x2 are
1238 + identical, although they will generally differ from x.
1239 +
1240 + 12. PM has DF eigenvalues (characteristic values) that are
1241 + equal to +1 and the remaining eigenvalues are equal to
1242 + zero. The eigenvectors (characteristic vectors) of PM
1243 + that correspond to the non-zero eigenvalues can be used
1244 + as columns to form a valid version of the hypothesis ma-
1245 + trix H.
1246 +
1247 +
1248 + yp, THE PROJECTION OF y BY PM
1249 +
1250 + Thus far, we have only used information about the *predic-
1251 + tor* variables in the experiment to derive the following
1252 + four new matrices: XE, XR, H, and PM. That is, we have not
1253 + yet taken any account of the values of the response variable
1254 + stored in the vector y. The next step in computing the sum
1255 + of squares is to mathematically marry the response variable
1256 + and the predictor variables. We do so by using the projec-
1257 + tion matrix PM to "project" y. That is, we postmultiply the
1258 + projection matrix by y to yield a new vector, called yp, as
1259 + follows:
1260 +
1261 + yp = PM * y.
1262 +
1263 + The vector yp has the following properties:
1264 +
1265 + 1. Like y, yp is an N x 1 column vector.
1266 +
1267 + 2. Choose any cell in the table that summarizes the layout
1268 + of the experiment. All the elements in yp associated
1269 + with that cell are identical to each other.
1270 +
1271 +
1272 + THE SUM OF SQUARES AS THE SQUARED LENGTH OF yp
1273 +
1274 + The desired sum of squares is simply the squared length of
1275 + yp, and is computed as
1276 +
1277 + result = ssq(yp).
1278 +
1279 + Thus the sum of squares is simply the squared length of the
1280 + projection of the vector y by the projection matrix PM. PM
1281 + has two important properties related to the projection of y
1282 +
1283 + 1. If no relationship exists between the response variable
1284 + and the particular set of predictor variables associated
1285 + with XE, (and if certain well-known assumptions are ade-
1286 + quately satisfied), the length of the projection of y can
1287 + be expected to be "short"; it will be roughly equal to a
1288 + known length (i.e., a length that can be computed from
1289 + the data). On the other hand if a relationship between
1290 + the variables *does* exist, the length of the projection
1291 + of y will tend to be longer than the known length. Thus
1292 + computing a p-value is simply computing whether the pro-
1293 + jection of y is longer than could be reasonably expected
1294 + if no relationship exists.
1295 +
1296 + 2. If we are studying a balanced experiment, and if we com-
1297 + pute projection matrices PM1 and PM2 for any two of the
1298 + effects (main effects or interactions) in the experiment,
1299 + PM1 * PM2 = 0, where 0 is an N x N matrix of zeros.
1300 + (This impressive result does not generally occur in un-
1301 + balanced experiments.) This means that in a balanced ex-
1302 + periment the projection of any vector y by PM1 is "or-
1303 + thogonal" to (i.e., at right angles to) the projection of
1304 + the same vector (or any other vector y1) by PM2. It also
1305 + means that if an experiment is balanced, none of the ef-
1306 + fects in the experiment can "contaminate" the statistical
1307 + tests of other effects in the experiment. (This contami-
1308 + nation, which I shall demonstrate in later material, oc-
1309 + curs with some statistical tests in unbalanced experi-
1310 + ments.)
1311 +
1312 +
1313 + OTHER METHODS OF COMPUTING THE SUM OF SQUARES
1314 +
1315 + Simple linear algebra implies that we can also compute the
1316 + desired sum of squares as
1317 +
1318 + result = yp` * yp.
1319 +
1320 + Also, Searle (1987) shows in (82) on page 264 and (90) on
1321 + page 272 and (90) on page 318 that we can compute the de-
1322 + sired sum of squares directly from PM as a quadratic form
1323 +
1324 + result = y` * PM * y.
1325 +
1326 + I find the projection approach (a geometric approach) easier
1327 + to understand than the quadratic form approach (an algebraic
1328 + approach). I visualize the response vector as an arrow in
1329 + the N-dimensional vector space that is "projected through"
1330 + the projection matrix to generate another arrow in a sub-
1331 + space of the first space. The length of the second arrow is
1332 + related to selected properties of the first arrow. In par-
1333 + ticular, the projection matrix is specifically designed so
1334 + that the length of the second arrow shows (in a way that is
1335 + as mathematically "clear" as possible) the strength of sup-
1336 + port (provided by the y-values obtained in the experiment)
1337 + for the hypothesis that the relationship between variables
1338 + associated with the projection matrix exists in the popula-
1339 + tion.
1340 +
1341 +
1342 + GENERAL COMMENTS ABOUT THE SUBROUTINE METHOD
1343 +
1344 + It is helpful to review what this subroutine accomplishes.
1345 + In essence, the calling program passes the matrices XE and
1346 + XR to the subroutine. These matrices contain the specifica-
1347 + tions of the layout of the experiment and the specification
1348 + of two model equations. The calling program also passes the
1349 + vector y to the subroutine. This vector contains the values
1350 + of the response variable obtained in the experiment. Fol-
1351 + lowing Yates' characterization of analysis of variance sums
1352 + of squares (1934, 63), the subroutine uses the three argu-
1353 + ments to compute the difference between the residual sums of
1354 + squares of the two model equations. Depending on how the
1355 + calling program specifies the values of XE and XR, this al-
1356 + lows the subroutine to compute a sum of squares using any of
1357 + the seven approaches to computing analysis of variance sums
1358 + of squares named in the abstract of the subroutine. (I il-
1359 + lustrate how to call the subroutine to compute sums of
1360 + squares for some of the approaches in two computer programs
1361 + [1998a, 1998b].)
1362 +
1363 +
1364 + SUBROUTINE SECONDARY ARGUMENTS
1365 +
1366 + The remaining lines in this subroutine (419 lines) contain
1367 + - details of the subroutine operation and
1368 + - the executable version of the statements given above.
1369 + You can find the end of the subroutine to return to reading
1370 + the calling program by searching for the row of asterisks.
1371 +
1372 + To use this subroutine you must supply values for three sec-
1373 + ondary arguments: LEVEL, PRINTSS, and PRINTH. These argu-
1374 + ments control details of how the subroutine performs the
1375 + computation and prints the results.
1376 +
1377 + The argument LEVEL controls which method the subroutine uses
1378 + to compute the hypothesis matrix H and the projection matrix
1379 + PM. If you set LEVEL to 1, the two matrices are computed
1380 + using the standard inverse. If you set LEVEL to 2, the two
1381 + matrices are computed using the generalized inverse (as
1382 + shown in the Method section). Using LEVEL = 1 seems to
1383 + yield solutions that are slightly more accurate.
1384 +
1385 + The argument PRINTSS controls whether the subroutine prints
1386 + the sum of squares it has computed. If you set PRINTSS to
1387 + 1, the subroutine prints the value of the sum of squares in
1388 + 25.15 format. If you set PRINTSS to zero, the subroutine
1389 + does not print the sum of squares but instead only returns
1390 + the value to the calling program through the argument
1391 + RESULT. If you set PRINTSS to 2, the subroutine prints all
1392 + the printable digits of the computed sum of squares in E-no-
1393 + tation for possible comparison against other computed val-
1394 + ues. If PRINTSS is 1 or 2, the subroutine also prints the
1395 + results of tests of the integrity of the projection matrix
1396 + PM. (I describe the tests below.)
1397 +
1398 + The argument PRINTH controls whether the subroutine prints
1399 + the following three intermediate results:
1400 + - the hypothesis matrix H
1401 + - the projection matrix PM and
1402 + - the projection, yp, of the response data vector.
1403 + If you set PRINTH to 1, the subroutine prints the intermedi-
1404 + ate results. If you set PRINTH to 2, the subroutine prints
1405 + the intermediate results but prints the hypothesis matrix
1406 + and the projection of the response vector transposed, which
1407 + can sometimes save space in the output. If you set PRINTH
1408 + to zero, the subroutine does not print intermediate results.
1409 +
1410 +
1411 + SUBROUTINE NOTES
1412 +
1413 + The numerical method of computing sums of squares used by
1414 + this subroutine is efficient for supporting the conceptual
1415 + approach to analysis of variance sums of squares of comput-
1416 + ing the difference between the residual sums of squares of
1417 + two overparameterized model equations. However, the method
1418 + is generally not the most efficient method in terms of
1419 + - economy of computation in memory required
1420 + - economy of computation in time required
1421 + - minimization of roundoff errors for large datasets or ill-
1422 + conditioned data.
1423 + Nevertheless, the subroutine generally yields highly accu-
1424 + rate sums of squares in minimal time and is therefore more
1425 + than adequate for most applications that cannot be handled
1426 + by a general analysis of variance program.
1427 +
1428 + One inefficiency of the method used by this subroutine re-
1429 + lates to the inclusion of identical rows within each of the
1430 + following five arrays:
1431 + - XE and XR, the relevant submatrices of the design matrix
1432 + - H, the hypothesis matrix
1433 + - PM, the projection matrix
1434 + - yp, the projection of vector y.
1435 + That is, (as noted above) for any given cell in the table
1436 + that summarizes the layout of the experiment, all the rows
1437 + in each of the five arrays that correspond to that cell are
1438 + (within the array) identical -- one row for each value of
1439 + the response variable associated with the cell. Sunwoo and
1440 + Kim (1997) discuss an approach to analyzing unbalanced ex-
1441 + periments that eliminates this duplication. For very large
1442 + experiments, this subroutine could be enhanced to take ac-
1443 + count of the Sunwoo and Kim approach, thereby substantially
1444 + increasing the computational efficiency (at the expense of
1445 + an increase in complexity).
1446 +
1447 + (The enhancement might work as follows: Compute y1, which
1448 + is an m-dimensional column vector containing the cell means
1449 + [or possibly cell totals] of the values of the response
1450 + variable, where m is the number of cells in the experiment.
1451 + Compute XE, XR, H, and PM as above but on the basis of an
1452 + equivalent experiment with only one observation per cell.
1453 + Compute yp1 as the projection of y1 by PM. Scale yp1 with
1454 + the Sunwoo and Kim T matrix to [in effect] yield yp, and
1455 + compute the desired sum of squares from yp as before.)
1456 +
1457 + The hypothesis matrix H I use in this subroutine has dimen-
1458 + sions N x DF.  This is the transpose of the hypothesis ma-
1459 + trix that Ronald Hocking discusses in his book (1985, 153)
1460 + and that I discuss in a paper (1997, appendix C).  I have
1461 + used the transpose because it yields slightly simpler notation.
1462 +
1463 + To conserve memory, the subroutine erases XE and XR after
1464 + using them.
1465 +
1466 + If you wish to run this subroutine on an EBCDIC system
1467 + (e.g., an IBM mainframe), see the note near the end.
1468 +
1469 +
1470 + SUBROUTINE EXECUTABLE STATEMENTS
1471 +
1472 + First, check the arguments passed to the subroutine and stop
1473 + if a problem is found.
1474 + */
1475 +
1476 + if level ^= 1 & level ^= 2 then do;
1477 + print '***ERROR*** in call to SS subroutine.';
1478 + print 'Value of LEVEL is' level;
1479 + print 'LEVEL must be 1 or 2. Execution terminated.';
1480 + abort;
1481 + end;
1482 +
1483 + if printss ^= 0 & printss ^= 1 & printss ^= 2 then do;
1484 + print '***ERROR*** in call to SS subroutine.';
1485 + print 'Value of PRINTSS is' printss;
1486 + print 'Value must be 0, 1, or 2. Execution terminated.';
1487 + abort;
1488 + end;
1489 +
1490 + if printh ^= 0 & printh ^= 1 & printh ^= 2 then do;
1491 + print '***ERROR*** in call to SS subroutine.';
1492 + print 'Value of PRINTH is' printh;
1493 + print 'PRINTH must be 0, 1, or 2. Execution terminated.';
1494 + abort;
1495 + end;
1496 +
1497 + if type(y) = 'U' | type(XE) = 'U' | type(XR) = 'U' then do;
1498 + print '***ERROR*** in call to SS subroutine.';
1499 + string = 'One or more of the matrices y, XE';
1500 + string = string + ', and XR do not exist. You must ';
1501 + string = string + 'specify the three matrices before ';
1502 + string = string + 'calling the SS subroutine.';
1503 + print string;
1504 + print 'Execution terminated.';
1505 + abort;
1506 + end;
1507 +
1508 + n = nrow(y);
1509 + if nrow(XE) ^= n | nrow(XR) ^= n then do;
1510 + string = '***ERROR*** in call to SS subroutine. ';
1511 + string = string + 'Discrepancy found between the ';
1512 + string = string + 'number of rows in y, XE, and XR:';
1513 + print string;
1514 + nrow_y = n;
1515 + nrow_XE = nrow(XE);
1516 + nrow_XR = nrow(XR);
1517 + print nrow_y nrow_XE nrow_XR;
1518 + print 'Execution terminated.';
1519 + abort;
1520 + end;
1521 +
1522 + /*
1523 + Compute Searle's M1 as defined in his (76) on pages 263 and
1524 + 318 of his book (1987) as
1525 +
1526 + M1 = I(n) - XR * ginv(XR)
1527 +
1528 + where
1529 +
1530 + I(n) = the identity matrix (of dimension n x n with 1's
1531 + on the diagonal and zeros elsewhere).
1532 +
1533 + Then compute the hypothesis matrix
1534 +
1535 + H = M1 * XE.
1536 +
1537 + Note that XE in this subroutine is equivalent to Searle's X2
1538 + and XR is equivalent to Searle's X1.
1539 +
1540 + The following statements perform the above arithmetic but
1541 + bypass the intermediate step of computing M1. The chosen
1542 + method depends on the value of LEVEL.
1543 + */
1544 +
1545 + if level = 1 then
1546 + H = XE - XR * inv(XR` * XR) * XR` * XE;
1547 + else
1548 + H = XE - XR * ginv(XR) * XE;
1549 +
1550 + /*
1551 + Since they are no longer needed, erase XE and XR to conserve
1552 + memory. Note that erasing the two matrices here means that
1553 + after returning from this subroutine the values in the ma-
1554 + trices will not be available in the calling program.
1555 +
1556 + To conserve memory, other matrices are also erased below as
1557 + soon as they are no longer needed.
1558 + */
1559 +
1560 + free XE XR;
1561 +
1562 + /*
1563 + Compute the projection matrix PM using the appropriate
1564 + method, as determined by LEVEL.
1565 + */
1566 +
1567 + if level = 1 then
1568 + PM = H * inv(H` * H) * H`;
1569 + else
1570 + PM = H * ginv(H);
1571 +
1572 + if printh = 0 then free H;
1573 +
1574 + /*
1575 + Compute the projection of y.
1576 + */
1577 +
1578 + yp = PM * y;
1579 +
1580 + if printss = 0 then free PM;
1581 +
1582 + /*
1583 + Compute the desired sum of squares as the squared length of
1584 + the projection yp.
1585 + */
1586 +
1587 + result = ssq(yp);
1588 +
1589 + if printh = 0 then free yp;
1590 +
1591 + /*
1592 + If requested, print the computed sum of squares.
1593 + */
1594 +
1595 + if printss = 1 then print result [format=25.15];
1596 + if printss = 2 then print result [format=e23.];
1597 +
1598 + /*
1599 + If requested, print the intermediate results:
1600 + - the hypothesis matrix
1601 + - the projection matrix
1602 + - the projection of y.
1603 +
1604 + Print the hypothesis matrix and the projection of y untrans-
1605 + posed or transposed, depending on the value of PRINTH.
1606 + */
1607 +
1608 + if printh = 1 then do;
1609 + print ,
1610 + 'Hypothesis matrix H = XE - XR * ginv(XR) * XE:', H;
1611 + print ,'Projection matrix PM = H * ginv(H):', PM;
1612 + print ,'Projection of y: yp = PM * y:', yp;
1613 + end;
1614 +
1615 +
1616 + if printh = 2 then do;
1617 + print ,
1618 + 'Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:',
1619 + (H`);
1620 + print ,'Projection matrix PM = H * ginv(H):', PM;
1621 + print ,'Transpose of projection of y: yp = PM * y:',
1622 + (yp`);
1623 + end;
1624 +
1625 + /*
1626 + If the argument PRINTSS requested printing of sums of
1627 + squares then, as a way of (indirectly) checking the accuracy
1628 + of the subroutine, perform checks on PM to see how close it
1629 + is to having the following properties:
1630 + - symmetric
1631 + - rows sum to zero
1632 + - idempotent.
1633 + */
1634 +
1635 + if printss > 0 then do;
1636 +
1637 + /*
1638 + The checks are done by first computing the largest relative
1639 + difference between corresponding values in two matrices that
1640 + should be identical. The relative difference between two
1641 + matrices P and Q (called the "relative error") is defined as
1642 +
1643 + E = (P - Q) / P.
1644 +
1645 + Both the subtraction and the division in the above expres-
1646 + sion are done on an element-by-element basis. Thus E is it-
1647 + self a matrix with the same dimensions as P and Q. The sub-
1648 + routine computes L, the largest absolute value of the ele-
1649 + ments in E as a measure of the equality of the two matrices.
1650 +
1651 + The subroutine then converts L to a rough number of digits
1652 + of accuracy with the formula
1653 +
1654 + ndigits = -log10(L).
1655 +
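       (For example, if an element of P is 1.0000000 and the cor-
       responding element of Q is 0.9999999, the relative error
       for that element is about 1E-7; if that is the largest such
       error, then L = 1E-7 and ndigits = -log10(1E-7) = 7, sug-
       gesting agreement to roughly seven decimal digits.)
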
1656 + First generate PMD, the divisor matrix for two of the rela-
1657 + tive errors. PMD is simply PM, except that we ensure that
1658 + none of the values in PMD is zero.
1659 + */
1660 +
1661 + if all(PM ^= 0) then PMD = PM;
1662 + else do;
1663 + PMD = PM + ((PM = 0) * .6e-78);
1664 + n = sum(PM = 0);
1665 + string = 'elements of PM were zero. .6E-78 ';
1666 + string = string + 'was added to these elements ';
1667 + string = string + 'to avoid dividing by zero in ';
1668 + string = string + 'computing relative errors.';
1669 + print n string;
1670 + end;
1671 +
1672 + /*
1673 + Second, set DECACC, which is the number of decimal digits of
1674 + accuracy carried in the computations. This value depends on
1675 + the computer operating system and language used to run the
1676 + subroutine. For my (Windows) computer the correct value
1677 + with SAS is 16.
1678 + */
1679 +
1680 + decacc = 16;
1681 +
1682 + /*
1683 + Third, compute the maximum absolute relative difference be-
1684 + tween corresponding elements in PM and its transpose as a
1685 + measure of the symmetry of PM.
1686 + */
1687 +
1688 + mrelerrS = max( abs( (PM - PM`)/PMD ) );
1689 + if mrelerrS = 0 then ndigitsS = decacc;
1690 + else ndigitsS = -log10(mrelerrS);
1691 +
1692 + /*
1693 + Fourth, compute the maximum absolute relative error between
1694 + the sum of elements in a row of PM and zero; use the average
1695 + of the absolute values of the elements in the row as the di-
1696 + visor. (See page 47 of the IML manual for subscript reduc-
1697 + tion operators [SAS Institute Inc., 1989].)
1698 + */
1699 +
1700 + D = abs(PM);
1701 + mrelerrR = max( abs( PM[,+] / D[,:] ) );
1702 + if mrelerrR = 0 then ndigitsR = decacc;
1703 + else ndigitsR = -log10(mrelerrR);
1704 +
1705 + /*
1706 + Fifth, compute the maximum absolute relative difference be-
1707 + tween corresponding elements in PM and PM-squared as a meas-
1708 + ure of how close PM is to being idempotent.
1709 + */
1710 +
1711 + mrelerrI = max( abs( (PM - (PM * PM)) / PMD ) );
1712 + if mrelerrI = 0 then ndigitsI = decacc;
1713 + else ndigitsI = -log10(mrelerrI);
1714 +
1715 + /*
1716 + Print the computed numbers of digits of accuracy.
1717 + */
1718 +
1719 + print ndigitsS [format=5.1]
1720 + ndigitsR [format=5.1]
1721 + ndigitsI [format=5.1];
1722 +
1723 + end;
1723 + /* of if printss > 0 then do */
1724 +
1725 + /*
1726 +
1727 + SUBROUTINE NOTE FOR USERS OF EBCDIC COMPUTER SYSTEMS
1728 +
1729 + The NOT operator (^) above in this subroutine is the correct
1730 + operator for ASCII systems. If the subroutine is run on an
1731 + EBCDIC system (e.g., an IBM mainframe), you may have to
1732 + change each occurrence of ^ to the EBCDIC logical NOT opera-
1733 + tor, which looks like a minus sign with a short vertical bar
1734 + dropping down from the right end (and which is EBCDIC hexa-
1735 + decimal 5F).
1736 +
1737 +
1738 + SUBROUTINE REFERENCES
1739 +
1740 + Aldrich, J. 1998. Doing least squares: Perspectives from
1741 + Gauss and Yule. _International Statistical Review_ 66,
1742 + 61-81.
1743 +
1744 + Graybill, F. A. 1983. _Matrices with Applications in
1745 + Statistics_ 2d ed. Belmont, CA: Wadsworth.
1746 +
1747 + Harville, D. A. 1997. _Matrix Algebra From a Statistician's
1748 + Perspective._ New York: Springer-Verlag.
1749 +
1750 + Hocking, R. R. 1985. _The Analysis of Linear Models._
1751 + Monterey, CA: Brooks/Cole.
1752 +
1753 + Macnaughton, D. B. 1997. Which sums of squares are best in
1754 + unbalanced analysis of variance. Available at
1755 + http://www.matstat.com/ss/
1756 +
1757 + Macnaughton, D. B. 1998a. PR0139.HTM: Computing numerator
1758 + sums of squares in unbalanced analysis of variance: Two-
1759 + way case. Available in the "Computer Programs" section
1760 + at http://www.matstat.com/ss/
1761 +
1762 + Macnaughton, D. B. 1998b. PR0165.HTM: Computing numerator
1763 + sums of squares in unbalanced analysis of variance:
1764 + Three-way case. Available in the "Computer Programs"
1765 + section at http://www.matstat.com/ss/
1766 +
1767 + SAS Institute Inc. 1989. _SAS/IML Software: Usage and Refer-
1768 + ence, Version 6, First Edition._ Cary, NC: author.
1769 +
1770 + Searle, S. R. 1987. _Linear Models for Unbalanced Data._
1771 + New York: Wiley.
1772 +
1773 + Sunwoo, H., and B. C. Kim. 1997. Analysis of the unbalanced
1774 + linear model based on the balanced model. _Journal of
1775 + Statistical Computation and Simulation_ 56, 373-385.
1776 +
1777 + Yates, F. 1934. The analysis of multiple classifications
1778 + with unequal numbers in the different classes. _Journal
1779 + of the American Statistical Association_ 29, 51-66.
1780 + */
1781 +
1782 +
1783 + finish SS;
NOTE: Module SS defined.
1783 + /* end of subroutine SS
1784 + version of June 19, 1998
1785 + ***********************************************************/
NOTE: %INCLUDE (level 1) ending.
1786
1787
1788 /*
1789 PRELIMINARY STEP 6:
1790 SET THE VALUES OF THE THREE SECONDARY ARGUMENTS
1791 OF THE SS SUBROUTINE
1792
1793 The values of the three secondary arguments of the SS subrou-
1794 tine are set immediately below. These arguments control de-
1795 tails of how the subroutine computes sums of squares and prints
1796 results. The values set below are used on every call to the
1797 subroutine in this program. The values instruct the subroutine
1798 to
1799 - use the first method of computing sums of squares
1800 - print the value of the computed sum of squares and
1801 - print intermediate results (showing the transposed versions
1802 of the matrix H and the vector yp).
1803 */
1804
1805 level = 1;
1806 printss = 1;
1807 printh = 2;
1808
1809
1810 /*
1811 COMPUTATION STEP 3: CALL SS
1812
1813 Now that we have made the SS subroutine known to IML, we can
1814 use it. Recall that we defined y, XE and XR above to allow us
1815 to compute the HTO sum of squares for the A (soil type) main
1816 effect in the Searle data. Thus we are now ready to actually
1817 call the SS subroutine to do the computation. But just before
1818 we do, note that Searle's exact answer for the HTO sum of
1819 squares for the A main effect (1987, 113, 114, 122 [typo]) is
1820
1821 83 127/141 = 83.90070 92198 58156 0
1822
1823 Here is the statement to call the SS subroutine to compute and
1824 print the HTO sum of squares for the A main effect:
1825 */
1826
1827 call SS(result, y, XE, XR, level,printss,printh);
RESULT
83.900709219858100
Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:
#TEM1001
1.1 1.1 1.1 .67 .67 1.2 1.2 -.9 -.9 -.9 -.9 -1 -.8 -.8 -.8
Projection matrix PM = H * ginv(H):
PM
.09 .09 .09 .05 .05 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.09 .09 .09 .05 .05 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.09 .09 .09 .05 .05 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.05 .05 .05 .03 .03 .06 .06 0 0 0 0 -.1 0 0 0
.05 .05 .05 .03 .03 .06 .06 0 0 0 0 -.1 0 0 0
0.1 0.1 0.1 .06 .06 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
0.1 0.1 0.1 .06 .06 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
-.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .08 .05 .05 .05
-.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .08 .05 .05 .05
-.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .08 .05 .05 .05
-.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .08 .05 .05 .05
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .08 .08 .08 .08 .12 .07 .07 .07
-.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .07 .04 .04 .04
-.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .07 .04 .04 .04
-.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .07 .04 .04 .04
Transpose of projection of y: yp = PM * y:
#TEM1001
-3 -3 -3 -2 -2 -3 -3 2.1 2.1 2.1 2.1 3.2 1.9 1.9 1.9
NDIGITSS NDIGITSR NDIGITSI
15.7 15.3 15.7
1828 /*
1829 The SS subroutine generated the twenty-seven printed lines
1830 above. First it printed the computed sum of squares, 83.9....
1831 Note how this value is very close to Searle's exact answer: the
1832 difference occurs in the fifteenth significant digit (which, if
1833 it is rounded, should be 2 instead of 1).
1834
1835 Next, the subroutine printed the transposed version of the hy-
1836 pothesis matrix H that was used to compute the sum of squares.
1837 (This matrix was generated solely from XE and XR, which were,
1838 in turn, generated solely from the values of the two predictor
1839 variables. That is, the response variable, y, played no role
1840 in generating H.)
1841
1842 Next, the subroutine printed the projection matrix PM. This
1843 matrix was generated directly from the hypothesis matrix H and
1844 is simply a different way of stating the hypothesis being
1845 tested.
1846
1847 (The zeros shown above in PM are not true zeros; they are shown
1848 as zeros only because the SAS printing algorithm must round the
1849 values to fit the narrow print fields.)
1850
1851 Note that PM is symmetric (i.e., PM equals its own transpose)
1852 and the values in the matrix change in step with the changes in
1853 the treatment groups in the experiment, as defined by the dif-
1854 ferent values in the a and b vectors specified earlier.
1855
1856 Next, the subroutine printed the vector yp, which is the pro-
1857 jection of the vector y obtained when y is multiplied by PM.
1858 The computed sum of squares (i.e., 83.9...) was computed by
1859 summing the squared elements of yp.
1860
1861 To help us judge the accuracy of the computed sum of squares,
1862 the subroutine concluded by printing the three numbers
1863 NDIGITSS, NDIGITSR, and NDIGITSI. These numbers are measures
1864 of the integrity of the projection matrix -- the smallest of
1865 these numbers (i.e., 15.3) gives a rough indication of the
1866 ceiling of the number of digits of accuracy in the numbers in
1867 the projection matrix. Since IML maintains a precision of
1868 roughly sixteen floating point decimal digits on my (Windows)
1869 computer, the maximum possible number of digits of accuracy on
1870 my computer is 16. (For curious readers, I describe the three
1871 numbers further in the comments near the end of the SS subrou-
1872 tine.)
1873
1874 Since the three measures of the integrity of the projection ma-
1875 trix suggest that the projection matrix is accurate to roughly
1876 fifteen significant digits, and since an accuracy of a sum of
1877 squares to four significant digits is fully adequate for com-
1878 puting a p-value, we can be confident that the computed sum of
1879 squares is sufficiently accurate.
1880
1881
1882 COMPUTE THE REMAINING SIX SUMS OF SQUARES
1883
1884 Following are steps to compute six other sums of squares for
1885 the Searle data. Note that each case requires only the follow-
1886 ing three lines of code:
1887 - a line to specify the effect being tested (via XE)
1888 - a line to specify the other terms in the two models (via XR)
1889 - a call to the subroutine (with the CALL statement).
1890
1891 In each case I first state the two model equations whose resid-
1892 ual sums of squares are being differenced. I state the models
1893 in an abbreviated form, omitting the subscripts on the terms
1894 and omitting the error term.
1895
1896
1897 SUM OF SQUARES FOR THE HTO B (SEED VARIETY) MAIN EFFECT
1898
1899 The two model equations for computing the HTO B sum of squares
1900 are
1901 y = m + a
1902 y = m + a + b.
1903
1904 Thus to compute the HTO B sum of squares we specify XE as
1905 BDESIGN -- the submatrix of the design matrix for main effect
1906 B.
1907 */
1908
1909 XE = Bdesign;
1910
1911 /*
1912 We specify XR as the horizontal concatenation of J(15,1) and
1913 ADESIGN, which are the submatrices for m and a, which are the
1914 other terms (excluding the error term) on the right side of the
1915 two model equations.
1916 */
1917
1918 XR = j(15,1) || Adesign;
1919
1920 /*
1921 Searle's exact answer for HTO B sum of squares (1987, 104, 113,
1922 114, 122) is
1923
1924 124 69/94 = 124.73404 25531 91489 4
1925 */
1926
1927 call SS(result, y, XE, XR, level,printss,printh);
RESULT
124.734042553191000
Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:
#TEM1001
.86 .86 .86 -.1 -.1 -1 -1 .88 .88 .88 .88 -.1 -1 -1 -1
0 0 0 1 1 -1 -1 .25 .25 .25 .25 1.3 -.8 -.8 -.8
Projection matrix PM = H * ginv(H):
PM
.09 .09 .09 -.1 -.1 0 0 .07 .07 .07 .07 -.1 -.1 -.1 -.1
.09 .09 .09 -.1 -.1 0 0 .07 .07 .07 .07 -.1 -.1 -.1 -.1
.09 .09 .09 -.1 -.1 0 0 .07 .07 .07 .07 -.1 -.1 -.1 -.1
-.1 -.1 -.1 .22 .22 -.1 -.1 0 0 0 0 .27 0 0 0
-.1 -.1 -.1 .22 .22 -.1 -.1 0 0 0 0 .27 0 0 0
0 0 0 -.1 -.1 .16 .16 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
0 0 0 -.1 -.1 .16 .16 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
.07 .07 .07 0 0 -.1 -.1 .07 .07 .07 .07 0 -.1 -.1 -.1
.07 .07 .07 0 0 -.1 -.1 .07 .07 .07 .07 0 -.1 -.1 -.1
.07 .07 .07 0 0 -.1 -.1 .07 .07 .07 .07 0 -.1 -.1 -.1
.07 .07 .07 0 0 -.1 -.1 .07 .07 .07 .07 0 -.1 -.1 -.1
-.1 -.1 -.1 .27 .27 -.1 -.1 0 0 0 0 .33 -.1 -.1 -.1
-.1 -.1 -.1 0 0 .13 .13 -.1 -.1 -.1 -.1 -.1 .12 .12 .12
-.1 -.1 -.1 0 0 .13 .13 -.1 -.1 -.1 -.1 -.1 .12 .12 .12
-.1 -.1 -.1 0 0 .13 .13 -.1 -.1 -.1 -.1 -.1 .12 .12 .12
Transpose of projection of y: yp = PM * y:
#TEM1001
-3 -3 -3 5.1 5.1 -.9 -.9 -2 -2 -2 -2 6.1 .19 .19 .19
NDIGITSS NDIGITSR NDIGITSI
15.7 15.0 15.5
1928
1929
1930 /*
1931 SUM OF SQUARES FOR THE HTI (HIGHER-LEVEL TERMS INCLUDED)
1932 A MAIN EFFECT
1933
1934 HTI sums of squares operate by including higher-level interac-
1935 tion terms in the two model equations whose residual sums of
1936 squares are differenced. Thus the two model equations for com-
1937 puting the HTI A sum of squares are
1938 y = m + b + p
1939 y = m + a + b + p.
1940 Note the appearance of the p (interaction) term in both equa-
1941 tions. This term was omitted above in the computation of the
1942 two HTO sums of squares. Recall that the submatrix of the de-
1943 sign matrix for the interaction is ABdesign.
1944
1945 Searle's exact answer for this sum of squares (1987, 91) is
1946
1947 123 27/35 = 123.77142 85714 28571 4
1948 */
1949
1950 XE = Adesign;
1951 XR = j(15,1) || Bdesign || ABdesign;
1952 call SS(result, y, XE, XR, level,printss,printh);
RESULT
123.771428571428000
Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:
#TEM1001
.69 .69 .69 1 1 1 1 -.5 -.5 -.5 -.5 -2 -.7 -.7 -.7
Projection matrix PM = H * ginv(H):
PM
.04 .04 .04 .06 .06 .06 .06 0 0 0 0 -.1 0 0 0
.04 .04 .04 .06 .06 .06 .06 0 0 0 0 -.1 0 0 0
.04 .04 .04 .06 .06 .06 .06 0 0 0 0 -.1 0 0 0
.06 .06 .06 .09 .09 .09 .09 0 0 0 0 -.2 -.1 -.1 -.1
.06 .06 .06 .09 .09 .09 .09 0 0 0 0 -.2 -.1 -.1 -.1
.06 .06 .06 .09 .09 .09 .09 0 0 0 0 -.2 -.1 -.1 -.1
.06 .06 .06 .09 .09 .09 .09 0 0 0 0 -.2 -.1 -.1 -.1
0 0 0 0 0 0 0 .02 .02 .02 .02 .09 .03 .03 .03
0 0 0 0 0 0 0 .02 .02 .02 .02 .09 .03 .03 .03
0 0 0 0 0 0 0 .02 .02 .02 .02 .09 .03 .03 .03
0 0 0 0 0 0 0 .02 .02 .02 .02 .09 .03 .03 .03
-.1 -.1 -.1 -.2 -.2 -.2 -.2 .09 .09 .09 .09 .34 .11 .11 .11
0 0 0 -.1 -.1 -.1 -.1 .03 .03 .03 .03 .11 .04 .04 .04
0 0 0 -.1 -.1 -.1 -.1 .03 .03 .03 .03 .11 .04 .04 .04
0 0 0 -.1 -.1 -.1 -.1 .03 .03 .03 .03 .11 .04 .04 .04
Transpose of projection of y: yp = PM * y:
#TEM1001
-2 -2 -2 -3 -3 -3 -3 1.6 1.6 1.6 1.6 6.5 2.2 2.2 2.2
NDIGITSS NDIGITSR NDIGITSI
16.0 16.0 15.7
1953
1954
1955 /*
1956 SUM OF SQUARES FOR THE HTI B MAIN EFFECT
1957
1958 The two models for computing the HTI B sum of squares are
1959 y = m + a + p
1960 y = m + a + b + p.
1961
1962 Searle does not give an exact answer for this sum of squares.
1963 The SAS Type III sum of squares for the B main effect is
1964
1965 192.12765 957
1966 */
1967
1968 XE = Bdesign;
1969 XR = j(15,1) || Adesign || ABdesign;
1970 call SS(result, y, XE, XR, level,printss,printh);
RESULT
192.127659574468000
Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:
#TEM1001
.98 .98 .98 -.1 -.1 -1 -1 .73 .73 .73 .73 -.2 -.9 -.9 -.9
.17 .17 .17 .77 .77 -1 -1 .13 .13 .13 .13 1.5 -.7 -.7 -.7
Projection matrix PM = H * ginv(H):
PM
0.1 0.1 0.1 -.1 -.1 -.1 -.1 .07 .07 .07 .07 -.1 -.1 -.1 -.1
0.1 0.1 0.1 -.1 -.1 -.1 -.1 .07 .07 .07 .07 -.1 -.1 -.1 -.1
0.1 0.1 0.1 -.1 -.1 -.1 -.1 .07 .07 .07 .07 -.1 -.1 -.1 -.1
-.1 -.1 -.1 .14 .14 -.1 -.1 0 0 0 0 .27 0 0 0
-.1 -.1 -.1 .14 .14 -.1 -.1 0 0 0 0 .27 0 0 0
-.1 -.1 -.1 -.1 -.1 0.2 0.2 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
-.1 -.1 -.1 -.1 -.1 0.2 0.2 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
.07 .07 .07 0 0 -.1 -.1 .06 .06 .06 .06 -.1 0 0 0
.07 .07 .07 0 0 -.1 -.1 .06 .06 .06 .06 -.1 0 0 0
.07 .07 .07 0 0 -.1 -.1 .06 .06 .06 .06 -.1 0 0 0
.07 .07 .07 0 0 -.1 -.1 .06 .06 .06 .06 -.1 0 0 0
-.1 -.1 -.1 .27 .27 -.1 -.1 -.1 -.1 -.1 -.1 .54 -.1 -.1 -.1
-.1 -.1 -.1 0 0 .13 .13 0 0 0 0 -.1 .09 .09 .09
-.1 -.1 -.1 0 0 .13 .13 0 0 0 0 -.1 .09 .09 .09
-.1 -.1 -.1 0 0 .13 .13 0 0 0 0 -.1 .09 .09 .09
Transpose of projection of y: yp = PM * y:
#TEM1001
-3 -3 -3 4.8 4.8 .16 .16 -2 -2 -2 -2 9.5 .11 .11 .11
NDIGITSS NDIGITSR NDIGITSI
15.6 14.9 15.4
1971
1972
1973 /*
1974 SUM OF SQUARES FOR THE SEQUENTIAL A MAIN EFFECT
1975 WHEN A IS ENTERED FIRST
1976
1977 The two models for computing the sequential A sum of squares
1978 when A is entered first are
1979 y = m
1980 y = m + a.
1981
1982 (When A is entered SECOND [i.e., after B], the sequential sum
1983 of squares for A is the same as the HTO sum of squares for A,
1984 which is computed above.)
1985
1986 Searle's exact answer for this sum of squares (1987, 93, 113,
1987 114, 122) is
1988
1989 52 1/2
1990 */
1991
1992 XE = Adesign;
1993 XR = j(15,1);
1994 call SS(result, y, XE, XR, level,printss,printh);
RESULT
52.500000000000000
Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:
#TEM1001
1.1 1.1 1.1 1.1 1.1 1.1 1.1 -.9 -.9 -.9 -.9 -.9 -.9 -.9 -.9
Projection matrix PM = H * ginv(H):
PM
.08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
Transpose of projection of y: yp = PM * y:
#TEM1001
-2 -2 -2 -2 -2 -2 -2 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8
NDIGITSS NDIGITSR NDIGITSI
16.0 15.2 16.0
1995
1996
1997 /*
1998 SUM OF SQUARES FOR THE SEQUENTIAL B MAIN EFFECT
1999 WHEN B IS ENTERED FIRST
2000
2001 The two models for computing the sequential B sum of squares
2002 when B is entered first are
2003 y = m
2004 y = m + b.
2005
2006 (When B is entered SECOND [i.e., after A], the sequential sum
2007 of squares for B is the same as the HTO sum of squares for B,
2008 which is computed above.)
2009
2010 Searle's exact answer for this sum of squares (1987, 95, 113,
2011 114, 122 [typo]) is
2012
2013 93 1/3
2014 */
2015
2016 XE = Bdesign;
2017 XR = j(15,1);
2018 call SS(result, y, XE, XR, level,printss,printh);
RESULT
93.333333333333300
Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:
#TEM1001
.87 .87 .87 -.1 -.1 -1 -1 .87 .87 .87 .87 -.1 -1 -1 -1
.13 .13 .13 1.1 1.1 -.9 -.9 .13 .13 .13 .13 1.1 -.9 -.9 -.9
Projection matrix PM = H * ginv(H):
PM
.08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1
.08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1
.08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1
-.1 -.1 -.1 .27 .27 -.1 -.1 -.1 -.1 -.1 -.1 .27 -.1 -.1 -.1
-.1 -.1 -.1 .27 .27 -.1 -.1 -.1 -.1 -.1 -.1 .27 -.1 -.1 -.1
-.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
-.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
.08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1
.08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1
.08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1
.08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1
-.1 -.1 -.1 .27 .27 -.1 -.1 -.1 -.1 -.1 -.1 .27 -.1 -.1 -.1
-.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
-.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
-.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
Transpose of projection of y: yp = PM * y:
#TEM1001
-2 -2 -2 4.7 4.7 0 0 -2 -2 -2 -2 4.7 0 0 0
NDIGITSS NDIGITSR NDIGITSI
15.4 15.1 15.4
2019
2020
2021 /*
2022 SUM OF SQUARES FOR THE A x B INTERACTION EFFECT
2023
2024 The two models for computing the A x B interaction sum of
2025 squares are
2026 y = m + a + b
2027 y = m + a + b + p.
2028
2029 Note that the sum of squares for the highest-level interaction
2030 in an experiment is the same under all the standard approaches
2031 to computing analysis of variance sums of squares.
2032
2033 Searle's exact answer for this sum of squares (1987, 113, 114)
2034 is
2035
2036 222 36/47 = 222.76595 74468 08510 6
2037 */
2038
2039 XE = ABdesign;
2040 XR = j(15,1) || Adesign || Bdesign;
2041 call SS(result, y, XE, XR, level,printss,printh);
RESULT
222.765957446808000
Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:
#TEM1001
.98 .98 .98 -.1 -.1 -1 -1 -.7 -.7 -.7 -.7 .19 .91 .91 .91
.17 .17 .17 .77 .77 -1 -1 -.1 -.1 -.1 -.1 -2 .68 .68 .68
Projection matrix PM = H * ginv(H):
PM
0.1 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .11 .06 .06 .06
0.1 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .11 .06 .06 .06
0.1 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .11 .06 .06 .06
-.1 -.1 -.1 .14 .14 -.1 -.1 .04 .04 .04 .04 -.3 .04 .04 .04
-.1 -.1 -.1 .14 .14 -.1 -.1 .04 .04 .04 .04 -.3 .04 .04 .04
-.1 -.1 -.1 -.1 -.1 0.2 0.2 .07 .07 .07 .07 .11 -.1 -.1 -.1
-.1 -.1 -.1 -.1 -.1 0.2 0.2 .07 .07 .07 .07 .11 -.1 -.1 -.1
-.1 -.1 -.1 .04 .04 .07 .07 .06 .06 .06 .06 -.1 0 0 0
-.1 -.1 -.1 .04 .04 .07 .07 .06 .06 .06 .06 -.1 0 0 0
-.1 -.1 -.1 .04 .04 .07 .07 .06 .06 .06 .06 -.1 0 0 0
-.1 -.1 -.1 .04 .04 .07 .07 .06 .06 .06 .06 -.1 0 0 0
.11 .11 .11 -.3 -.3 .11 .11 -.1 -.1 -.1 -.1 .54 -.1 -.1 -.1
.06 .06 .06 .04 .04 -.1 -.1 0 0 0 0 -.1 .09 .09 .09
.06 .06 .06 .04 .04 -.1 -.1 0 0 0 0 -.1 .09 .09 .09
.06 .06 .06 .04 .04 -.1 -.1 0 0 0 0 -.1 .09 .09 .09
Transpose of projection of y: yp = PM * y:
#TEM1001
-1 -1 -1 -4 -4 5.9 5.9 .93 .93 .93 .93 8.1 -4 -4 -4
NDIGITSS NDIGITSR NDIGITSI
15.7 15.2 15.6
2042
2043
2044 /*
2045 SAVE THE DATA IN A SAS DATASET
2046
2047 The next three IML statements create a SAS dataset (called
2048 "Searle_1") and then transfer the values of vectors a, b, and y
2049 to variables with the same names in the dataset. This enables
2050 us to compute all the sums of squares for the Searle data with
2051 SAS GLM.
2052 */
2053
2054 create Searle_1 var {a b y};
2055 append;
2056 close Searle_1;
NOTE: The data set WORK.SEARLE_1 has 15 observations and 3 variables.
2057
2058
2059 /*
2060 QUIT FROM IML
2061 */
2062
2063 quit;
Exiting IML.
NOTE: The PROCEDURE IML used 27.57 seconds.
2064
2065
2066 /*
2067 RUN PROC GLM TWICE
2068
2069 The following statements run SAS GLM on the data to enable com-
2070 paring the above IML output with output from an analysis of the
2071 data by GLM. (The output from GLM comes in a separate output
2072 file.)
(Note: The output appears below after the end of this log.)
2073
2074   Examination of the GLM output reveals that all the GLM sums of
2075   squares are identical in all available digits to the sums of
2076   squares produced by the SS subroutine.
2077   */
2078
2079   title 'IML/GLM 2 x 3 Unbalanced ANOVA, Searle (1987, 79)';
2080   options nodate linesize=80 probsig=2;
2081
2082   proc glm data=Searle_1;
2083     class a b;
2084     model y = a | b / ss1 ss2 ss3;
2085   quit;
NOTE: The PROCEDURE GLM used 3.06 seconds.
2086
2087   /*
2088   Run GLM a second time with a and b reversed in the model state-
2089   ment to get the sequential (Type I) sum of squares for B when B
2090   is entered first into the model.
2091   */
2092
2093   proc glm data=Searle_1;
2094     class a b;
2095     model y = b | a / ss1 ss2 ss3;
2096   quit;
NOTE: The PROCEDURE GLM used 1.32 seconds.
2097
2098   options date linesize=80 probsig=2;
2099   title;
2100
2101
2102   /*
2103   SUMMARY
2104
2105   This program illustrates three approaches to computing numera-
2106   tor sums of squares in unbalanced analysis of variance: HTO,
2107   HTI, and sequential.  The three approaches are specified in
2108   terms of computing the difference between the residual sums of
2109   squares of two overparameterized model equations.  The two mod-
2110   els are specified in terms of two matrices, XE and XR, which
2111   are made up of submatrices of the overall (full-column-rank)
2112   design matrix for the experiment.  XE is the submatrix for the
2113   effect being tested.  XR is the horizontal concatenation of the
2114   submatrices for all the other terms (except the error term) on
2115   the right side of the two model equations.
2116
2117   Note the conceptual economy of the approach:  After the data
2118   are known to IML, the submatrices of the design matrix can be
2119   specified with one simple statement per submatrix (using the
2120   DESIGNF and HDIR functions).  It takes just two more statements
2121   to specify (via XE and XR) a particular sum of squares to be
2122   computed.  And it takes only four more reusable statements (in
2123   the SS subroutine) to compute any standard numerator sum of
2124   squares.
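
       To make the economy concrete, here is a minimal sketch (not
       part of the program log above) of the full sequence of
       statements, assuming the vectors a, b, and y for a two-way
       layout have already been defined in IML.  The sketch uses
       the generalized-inverse (LEVEL = 2) form of the four state-
       ments in the SS subroutine and computes the HTO sum of
       squares for the A main effect:

          Adesign  = designf(a);              /* submatrix for A     */
          Bdesign  = designf(b);              /* submatrix for B     */
          ABdesign = hdir(Adesign, Bdesign);  /* submatrix for A x B */

          XE = Adesign;                       /* effect being tested */
          XR = j(nrow(y),1) || Bdesign;       /* other model terms   */

          H      = XE - XR * ginv(XR) * XE;   /* hypothesis matrix   */
          PM     = H * ginv(H);               /* projection matrix   */
          yp     = PM * y;                    /* projection of y     */
          result = ssq(yp);                   /* numerator SS        */

       The other sums of squares follow by changing only the XE and
       XR statements, as illustrated in the calls earlier in the
       program.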
2125
2126
2127   NOTES
2128
2129   If you wish to run this program on your computer, see the
2130   checklist in the appendix.
2131
2132   This program illustrates computation of numerator sums of
2133   squares in the two-way case -- i.e., the case with two dis-
2134   crete-valued predictor variables.  I discuss five approaches to
2135   computing analysis of variance numerator sums of squares in the
2136   three-way case in another program (1998).
2137
2138   I discuss in the paper (1997, sec. 17) the fact that statisti-
2139   cal tests provided by the HTO sums of squares are (in the rele-
2140   vant cases) generally more powerful than the corresponding
2141   tests provided by the HTI sums of squares.  This implies that
2142   if there is a relationship, an HTO sum of squares will usually
2143   be greater than the corresponding HTI sum of squares.  However,
2144   Searle's data twice illustrate the fact that an HTO sum of
2145   squares is not *always* greater than the corresponding HTI sum
2146   of squares.  Specifically, the HTO (SAS Type II) sum of squares
2147   for the A main effect in Searle's data is less than the HTI
2148   (SAS Type III) sum of squares for the same effect (83.9 versus
2149   123.8).  Similarly, the HTO sum of squares for the B main ef-
2150   fect in Searle's data is also less than the HTI sum of squares
2151   (124.7 versus 192.1).
2152
2153
2154   APPENDIX: STEPS TO RUN THIS PROGRAM
2155
2156   1. Ensure that the STAT and IML components of the SAS system
2157      are available on your computer.  Information about the SAS
2158      system is available at http://www.sas.com
2159
2160   2. Ensure that you have the source version of this program,
2161      which is called PR0139.SAS (not the HTML version, which is
2162      called PR0139.HTM).  You can obtain a copy of the source
2163      version in the "Computer Programs" section of the page at
2164      http://www.matstat.com/ss/
2165
2166   3. Install a copy of the SS subroutine on your computer.  This
2167      subroutine does the actual computations of sums of squares
2168      and is available from the above MatStat web page.
2169
2170   4. Edit the %INCLUDE statement in preliminary step 5 above to
2171      correctly point to the location of the SS.SAS subroutine
2172      file on your computer.  That is, change the
2173         D:\PROGS\SS.SAS
2174      in the statement to the location where SS.SAS is stored on
2175      your computer.
2176
2177   5. (Optional.)  Modify the two OPTIONS statements in the pro-
2178      gram that adjust the DATE, LINESIZE, and PROBSIG options.
2179
2180   6. Submit the program to SAS.
2181
2182
2183   REFERENCES
2184
2185   Fisher, R. A. 1925. _Statistical Methods for Research Workers._
2186      Edinburgh: Oliver and Boyd.  The 14th edition of this semi-
2187      nal work appears in Fisher (1990).
2188
2189   Fisher, R. A. 1935. _The Design of Experiments._ Edinburgh:
2190      Oliver and Boyd.  The 8th edition of this seminal work ap-
2191      pears in Fisher (1990).
2192
2193   Fisher, R. A. 1990. _Statistical Methods, Experimental Design,
2194      and Scientific Inference_ edited by J. H. Bennett.  Oxford:
2195      Oxford University Press.
2196
2197   Macnaughton, D. B. 1996a. The introductory statistics course:
2198      A new approach.  Available at http://www.matstat.com/teach/
2199
2200   Macnaughton, D. B. 1996b. The entity-property-relationship ap-
2201      proach to statistics: An introduction for students.  Avail-
2202      able at http://www.matstat.com/teach/
2203
2204   Macnaughton, D. B. 1997. Which sums of squares are best in un-
2205      balanced analysis of variance?  Available at
2206      http://www.matstat.com/ss/
2207
2208   Macnaughton, D. B. 1998. PR0165.HTM: Computing numerator sums
2209      of squares in unbalanced analysis of variance: Three-way
2210      case.  Available in the "Computer Programs" section at
2211      http://www.matstat.com/ss/
2212
2213   Searle, S. R. 1987. _Linear Models for Unbalanced Data._ New
2214      York: Wiley.
2215
2216   Yates, F. 1934. The analysis of multiple classifications with
2217      unequal numbers in the different classes. _Journal of the
2218      American Statistical Association_ 29, 51-66.
2219
2220      version of June 19, 1998
2221      (end of program pr0139.sas)  */
This is the end of the program log for the run of the program.
Following are the three analysis of variance tables that were output on the first of the two runs of PROC GLM:
Source    DF    Type I SS       Mean Square    F Value    Pr > F
A          1     52.50000000     52.50000000      3.94     .07851
B          2    124.73404255     62.36702128      4.68     .04048
A*B        2    222.76595745    111.38297872      8.35     .00889

Source    DF    Type II SS      Mean Square    F Value    Pr > F
A          1     83.90070922     83.90070922      6.29     .03339
B          2    124.73404255     62.36702128      4.68     .04048
A*B        2    222.76595745    111.38297872      8.35     .00889

Source    DF    Type III SS     Mean Square    F Value    Pr > F
A          1    123.77142857    123.77142857      9.28     .01386
B          2    192.12765957     96.06382979      7.20     .01355
A*B        2    222.76595745    111.38297872      8.35     .00889
Note that all the numbers given above in the "SS" column are identical in all available digits to the corresponding sums of squares computed earlier in the program by PROC IML.
(As most readers will know, the numbers in the "Mean Square" column are computed by dividing the sums of squares by their degrees of freedom. The F values are computed by dividing the mean squares by the residual mean square, which SAS gives earlier [not shown] as 13.33333333. The p-values, which SAS labels "Pr > F", are computed from the F values using the "cumulative distribution function" for the "central F-distribution". Assuming there is no reasonable alternative explanation for the results, the low p-value for the A × B interaction [i.e., .00889 in all three tables] provides good evidence that [1] a relationship exists between the response variable and predictor variables A and B and [2] the relationship is a two-way interaction.
(After discovering the interactive relationship, the researcher's next step is to study a graph of the mean values of the response variable for the six treatment groups to gain an understanding of the relationship. By convention, the response variable is plotted on the vertical axis of the graph, one of the predictor variables is plotted on the horizontal axis, and the values of the other predictor variable are reflected by different lines (or different-shaped points) on the graph. "Standard error of the mean" bars are sometimes plotted on such graphs to give the reader a sense of the stability of the mean value of the response variable in each of the treatment groups [cells].)
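As a hedged numeric check of the "Mean Square", "F Value", and "Pr > F" columns described above, the interaction row of the tables could be reproduced in IML as sketched below. The 9 residual degrees of freedom used here are inferred from the 15 observations and 6 cells; they are an assumption of this sketch because the portion of the GLM output that reports them is not reproduced above.

   ms_ab = 222.76595745 / 2;        /* interaction mean square      */
   f_ab  = ms_ab / 13.33333333;     /* F value, approximately 8.35  */
   p_ab  = 1 - probf(f_ab, 2, 9);   /* Pr > F, approximately .0089  */
   print ms_ab f_ab p_ab;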
Note the relationship between Type I and Type II sums of squares: the Type I sum of squares for B in the above tables is the same as the Type II sum of squares for B, but the Type I sum of squares for A is different from the Type II sum of squares for A.
Following are the three analysis of variance tables that were output on the second of the two runs of PROC GLM. This run had the same specifications as the first run except that the order of A and B was reversed in the MODEL statement. Comparison of the GLM output below with the GLM output above reveals that the only new value in the sums of squares column is the Type I value for B (i.e., 93.3...).
Source    DF    Type I SS       Mean Square    F Value    Pr > F
B          2     93.33333333     46.66666667      3.50     .07508
A          1     83.90070922     83.90070922      6.29     .03339
A*B        2    222.76595745    111.38297872      8.35     .00889

Source    DF    Type II SS      Mean Square    F Value    Pr > F
B          2    124.73404255     62.36702128      4.68     .04048
A          1     83.90070922     83.90070922      6.29     .03339
A*B        2    222.76595745    111.38297872      8.35     .00889

Source    DF    Type III SS     Mean Square    F Value    Pr > F
B          2    192.12765957     96.06382979      7.20     .01355
A          1    123.77142857    123.77142857      9.28     .01386
A*B        2    222.76595745    111.38297872      8.35     .00889
This is the end of the output from PR0139.SAS.