The following is self-documenting output from a SAS IML computer program:
NOTE: SAS (r) Proprietary Software Release 6.12 TS020
Licensed to DONALD B. MACNAUGHTON, Site 0025250001.
1 /*
2 PR0139.SAS
3
4 COMPUTING NUMERATOR SUMS OF SQUARES
5 IN UNBALANCED ANALYSIS OF VARIANCE:
6 Two-Way Case
7
8 Donald B. Macnaughton
9 donmac@matstat.com
10
11
12 TABLE OF CONTENTS
13
14 Abstract >
15
16 Introduction >
17 - preliminary notes >
18 - introduction to the Searle example >
19 - a goal of experiments and analysis of variance >
20 - a controversy about sums of squares >
21 - model equations >
22 - the residual sum of squares of a model equation >
23 - an interpretation of analysis of variance sums of squares
24 in terms of model equations >
25 - summing up >
26 - program overview >
27
28 Preliminary Steps >
29 - start PROC IML >
30 - define the Searle data >
31        - generate the main effect submatrices of the design matrix >
32        - generate the interaction submatrix of the design matrix >
33 - obtain the SS subroutine and list it >
34 - abstract >
35 - uses >
36 - main arguments: y, XE, and XR >
37 - method >
38 - details of the method >
39            - the hypothesis matrix H as a function of XE and XR >
40 - the projection matrix PM as a function of H >
41 - yp, the projection of y by PM >
42 - the sum of squares as the squared length of yp >
43 - other methods of computing the sum of squares >
44 - general comments >
45 - secondary arguments >
46 - notes >
47 - executable statements >
48 - references >
49 - set the values of the three secondary arguments of the SS
50 subroutine >
51
52 Compute the Seven Sums of Squares Using the SS Subroutine >
53 - HTO A sum of squares >
54 - specify the matrix XE for the effect being tested >
55 - specify the matrix XR for the other effects in the
56 two models >
57 - call SS to compute the sum of squares >
58 - HTO B sum of squares >
59 - HTI A sum of squares >
60 - HTI B sum of squares >
61 - sequential A sum of squares when A is entered first >
62 - sequential B sum of squares when B is entered first >
63 - interaction sum of squares >
64
65 Save the Data in a SAS Dataset and Quit from IML >
66
67 Compute the Sums of Squares using PROC GLM (for comparison with
68 the values generated above) >
69
70 Summary >
71
72 Notes >
73
74 Appendix: Steps to Run the Program >
75
76 References >
77
78 Output from PROC GLM >
79
80
81 ABSTRACT
82
83 This SAS program illustrates a conceptual point of view and the
84 matrix arithmetic for computing the following types of analysis
85 of variance numerator sums of squares:
86
87 - HTO (Higher-level Terms are Omitted)
88 = SPSS ANOVA Experimental
89 = SAS Type II in the two-way case
90
91 - HTI (Higher-level Terms are Included)
92 = SAS Type III
93 = SPSS ANOVA Unique
94 = the default approach in many analysis of variance pro-
95 grams
96
97 - sequential
98 = SAS Type I
99 = SPSS ANOVA Hierarchical in the two-way case.
100
101 The conceptual point of view is one of computing an analysis of
102 variance sum of squares by computing the difference between the
103 residual sums of squares of two model equations (Yates 1934).
104 The matrix arithmetic is simple and is specified directly in
105 terms of the conceptual point of view.
106
107 The program is heavily annotated. Computations are illustrated
108 using data from a 2 x 3 unbalanced experiment discussed by
109 Shayle Searle (1987, 79).
110
111
112 PRELIMINARY NOTES
113
114 If you are not familiar with SAS, you can tell the difference
115 between my comments and the SAS program statements as follows:
116 My comments begin with the two symbols /* and end with the two
117 symbols */ /* Anything outside these symbols (except
118 blanks) is a program statement, which SAS will try to execute.
119
120 To lay some groundwork, I begin the program not with SAS pro-
121 gram statements but instead with about 500 lines of my own com-
122 ments. These set the stage for the later program statements.
123
124
125 INTRODUCTION TO THE SEARLE EXAMPLE
126
127 Analysis of variance is a broadly used method for analyzing the
128 results of scientific experiments. The method was invented by
129 Sir Ronald Aylmer Fisher (1925, 1935) and is generally viewed
130 as the most powerful and versatile available method for scien-
131 tifically inferring causation.
132
133 A current controversy in analysis of variance pertains to ana-
134 lyzing the data from "unbalanced" experiments. The controversy
135 is important because (for various reasons) a large proportion
136 of real-world experiments end up being unbalanced. Shayle
137 Searle addresses the controversy in his book LINEAR MODELS FOR
138 UNBALANCED DATA (1987) in which he discusses both mathematical
139 and philosophical issues. Although I disagree with some of
140 Professor Searle's philosophical conclusions, I am in awe of
141 his mathematical work. It is with deep respect that I offer
142 the following analysis of an important example in his book.
143
144 Searle's example is of an experiment to test whether "type of
145 potting soil" influences "time to germination" in three varie-
146 ties of carrot seed (1987, 78-79).
147
148 (Searle's example is from the field of agriculture. However,
149 the discussion is not limited to experiments in the field of
150 agriculture. Instead, both Searle's discussion and my discus-
151 sion apply to experiments in all fields of empirical research.
152 In particular, the discussion applies to a majority of the ex-
153 periments in the physical, biological, and social sciences.)
154
155 Clearly, the response variable in Searle's experiment is "time
156 to germination", and the two predictor variables are "soil
157 type" and "seed variety". Searle presents the following data
158 as possible results of the experiment:
159
160 TABLE 1
161 Time in Days to First Germination
162 of Three Varieties of Carrot Seed
163 Grown in Two Different Potting Soils
164 -------------------
165 Seed
166 Soil Variety (B)
167 Type ------------
168 (A) 1 2 3
169 -------------------
170 1 6 13 14
171 10 15 22
172 11
173
174 2 12 31 18
175 15 9
176 19 12
177 18
178 -------------------
179
180 The first number in the body of the table (i.e., 6) indicates
181 that in one of the fifteen trials in the experiment it took six
182 days for seeds of variety 1 to germinate when they were planted
183 in soil of type 1.
184
185 Searle's experiment is unbalanced because, as the table shows,
186 the number of values of the response variable available in the
187 various "cells" in the experiment differs from cell to cell.
188 For example, three values of the response variable are avail-
189 able in the (1,1) cell, but only two values are available in
190 the (1,2) cell.
191
192
193 A GOAL OF EXPERIMENTS AND ANALYSIS OF VARIANCE
194
195 When discussing analysis of variance, it is important to be
196 aware of both the goal of scientific experiments and the role
197 of analysis of variance in achieving the goal. Otherwise, the
198 discussion may become an arbitrary mathematical exercise. One
199 useful way of characterizing the goal is
200
201 The goal of experiments and analysis of variance is to
202 obtain knowledge about relationships between variables.
203
204 In any scientific experiment, an important step in achieving
205 this goal is to determine, in as unequivocal a way as possible,
206 whether a relationship actually *exists* between the variables
207 under study. In particular, we wish to determine whether the
208 response variable in an experiment "depends" on one or more of
209 the predictor variables. If a relationship is found between
210 the variables, a second goal is to determine the nature of the
211 relationship.
212
213 Thus in the Searle data we wish to determine whether "time to
214 germination" depends on "soil type", or whether "time to germi-
215 nation" depends on "seed variety".
216
217 It is invariably the case in experimental research that we wish
218 to determine whether the dependence exists in the *population*
219 of entities under study, not just in the sample of entities
220 that participated in the experiment. (In Searle's example the
221 entities are trials, or "carrot seed plantings".)
222
223 In an experiment with two predictor variables, the nature of
224 the relationship between the response variable and the predic-
225 tor variables can be either
226 - no (detectable) relationship or
227 - one or two "simple" relationships ("main effects") or
228 - an "interaction".
229
230 Interactions were invented by Fisher (1935, ch. VI). Interac-
231 tions provide a comprehensive means for detecting any (detect-
232 able) form of relationship that might exist between the re-
233 sponse variable and the predictor variables in an experiment.
234 In particular, interactions help us to detect complicated rela-
235 tionships between the response variable and a predictor vari-
236 able in which the specific form of the relationship depends on
237 the level of one or more *other* predictor variables. I give
238 formal definitions of the concepts of 'relationship between
239 variables', 'interaction', and 'simple relationship' in a paper
240 (1997, sec. 6).
241
242 The use of analysis of variance to detect relationships between
243 variables is (at a high level) straightforward: We submit the
244 data summarizing the results of an experiment (e.g., the data
245 in table 1) to an analysis of variance program, and the program
246 computes a set of "p-values". Assuming the experiment was
247 properly designed, the program provides a p-value for each sim-
248 ple relationship (main effect) and a p-value for each interac-
249 tion. If the p-value for a main effect or interaction is low
250 enough (and if there is no reasonable alternative explanation),
251 we conclude that the particular relationship between variables
252 associated with the p-value is extant in the population of en-
253 tities under study.
254
255 I discuss the general scientific study of relationships between
256 variables (including the notion of a p-value) further in two
257 papers (1996a, 1996b).
258
259
260 A CONTROVERSY ABOUT SUMS OF SQUARES
261
262 To compute an analysis of variance p-value from the results of
263 an experiment, it is mathematically necessary to first compute
264 certain "sums of squares". All analysis of variance programs
265 compute these sums of squares as an intermediate step in com-
266 puting p-values. Currently controversy exists about how sums
267 of squares should be computed. Controversy exists about both
268 the numerator sum of squares and the denominator sum of squares
269 used in the "F-ratio" to compute a p-value. The present dis-
270 cussion focuses exclusively on computing numerator sums of
271 squares for unbalanced experiments.
272
273 (Since balanced experiments are an "internal" limiting case of
274 unbalanced experiments, the discussion below also applies to
275 balanced experiments.)
276
277 This program illustrates the computations of the three best-
278 known conceptual methods for computing numerator sums of
279     squares.  The program also illustrates the mathematical aspects
280     of computing numerator sums of squares by demonstrating a simple
281     mathematical algorithm that can carry out all three conceptual
282     methods.
283
284 The purpose of this program is not to judge the various methods
285 of computing sums of squares, but rather to illustrate them.
286 Therefore, I make no judgments here about the merits of the
287 various methods. However, I do make judgments in the paper
288 (1997) and I shall extend these judgments in material I shall
289 publish later.
290
291
292 MODEL EQUATIONS
293
294 To understand sums of squares, it is useful to understand the
295 notion of a "model equation" of the relationship between the
296 response variable and the predictor variables in an experiment.
297 Two types of model equations (often called simply "models") are
298 in use today: "overparameterized" models and "cell-means" mod-
299 els. I discuss both types in the paper (1997, sec. 9). The
300 following discussion is in terms of overparameterized models.
301
302 Consider a two-way experiment (e.g., the Searle experiment)
303 that has predictor variables A and B. We can model the rela-
304 tionship between the response variable, called y, and the pre-
305 dictor variables with the following model equation:
306
307 y(i,j,k) = mu + alpha(i) + beta(j) + phi(i,j) + e(i,j,k). (1)
308
309 (Normally the i's, j's and k's in the parentheses would be sub-
310 scripts, and the five terms on the right side of the equation
311 would be Greek letters, but these features are not yet avail-
312 able for comments in computer programs.)
313
314 The terms in the equation have the following interpretations:
315
316 y(i,j,k) = value of the response variable for the kth entity in
317 the (i,j) treatment group in the experiment
318
319 mu = grand mean of the values of the response variable
320 for all the entities in the population
321
322 alpha(i) = simple ("main") effect of predictor variable A on y
323 when A is at level i
324
325 beta(j) = simple ("main") effect of predictor variable B on y
326 when B is at level j
327
328 phi(i,j) = the joint (interaction) effect of predictor vari-
329 ables A and B on y when they are at levels i and j
330 respectively
331
332 e(i,j,k) = the "error" term, which takes account of the vari-
333 ation in y that cannot be accounted for by the other
334 four terms on the right side of the equation.
335
336 Model (1) gives us a succinct picture of how the value of the
337 response variable "depends" on the values of the two predictor
338 variables, and how it also depends on other unknown factors,
339 which are taken account of by the error term. Model (1) is
340 called the "saturated" model for a two-way experiment because
341 it contains all the possible terms for such an experiment.
342
343 As we shall see, it often makes sense to use a reduced version
344 of (1) in which certain terms are omitted from the right side.
345 For example, if we omit the interaction term, phi, we get
346
347 y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k). (2)
348
349
350 THE RESIDUAL SUM OF SQUARES OF A MODEL EQUATION
351
352 Once we have the data from an experiment (e.g., the data in ta-
353 ble 1), we can "fit" various models to the data. That is, we
354 can use a mathematical algorithm to compute values for the pa-
355 rameters associated with the terms in the model [i.e., values
356 for mu, the alpha(i)s, the beta(j)s, and the phi(i,j)s]. The
357 algorithm operates by choosing values for the parameters asso-
358 ciated with the four terms so that the resulting model gives
359 the "best" predictions of the values of the response variable y
360 for all the values of y obtained in the experiment.
361
362 The fitting of the terms is usually done by the method of least
363 squares or by the closely related method of maximum likelihood.
364 In the standard case, both methods yield "identical" estimates
365 for the values of the parameters associated with the terms
366 (excluding the error term e) on the right side of the equation
367 in the sense that
368 the sum of the squared differences between the values
369 of the response variable predicted by the equation and
370 the *actual* values of the response variable in the ex-
371 periment
372 is the lowest possible value.
373
374 After we have fitted a model to the results of an experiment
375 and obtained estimates for the values of the parameters, we can
376 then use the model to compute the predicted value for each of
377 the values of the response variable obtained in the experiment.
378 Then we can subtract each predicted value from the correspond-
379 ing actual value to get a number called the "residual". If we
380 square each of these residuals and add the squared residuals
381 together, we get the "residual sum of squares" for the model
382 for the experimental data at hand.
383
384 (Of course, the residual sum of squares is the number that was
385 itself minimized two paragraphs above in order to determine the
386 estimates of the values of the parameters -- a piece of mathe-
387 matical bootstrapping that still amazes me.)
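
     To make this concrete, here is a minimal IML sketch (not part of
     the executable program below) of computing the residual sum of
     squares for the saturated model (1).  The sketch assumes the re-
     sponse vector y and the design submatrices Adesign, Bdesign, and
     ABdesign that are defined and generated in the preliminary steps
     further below; the names X1, est1, and R1 are used only for il-
     lustration:

        X1   = j(15,1) || Adesign || Bdesign || ABdesign
                                      <-- design matrix for model (1)
        est1 = ginv(X1) * y           <-- least-squares estimates of
                                          the parameter values
        R1   = ssq(y - X1 * est1)     <-- residual sum of squares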
388
389
390 AN INTERPRETATION OF ANALYSIS OF VARIANCE SUMS OF SQUARES
391 IN TERMS OF MODEL EQUATIONS
392
393 We can view all standard analysis of variance numerator sums of
394 squares as being the value we obtain if we subtract the resid-
395 ual sum of squares for one model from the residual sum of
396 squares for another model (Yates 1934, 63). For example, con-
397 sider the following two models for the results of a two-way ex-
398 periment:
399
400 y(i,j,k) = mu + beta(j) + e(i,j,k) (3)
401
402 y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k). (2)
403
404 I shall use the term Rn to denote the residual sum of squares
405 for model (n). Thus R3 denotes the residual sum of squares for
406 (3).
407
408 Suppose we perform a two-way experiment (such as the Searle ex-
409 periment), and suppose we (separately) fit models (2) and (3)
410 to the results of the experiment, yielding R2 and R3. If we
411 subtract R2 from R3, this difference is identical to the HTO (=
412 SAS Type II = SPSS ANOVA Experimental) sum of squares for the A
413 main effect.
414
415 It stands to reason that the numerical difference R3 - R2
416 should equal a sum of squares for the A main effect since the
417 only conceptual difference between the two models is that (2)
418 contains alpha, the term for predictor variable A, and (3) does
419 not.
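
     As an illustration (again not part of the executable program be-
     low), the HTO A sum of squares could be computed directly from
     this characterization with IML statements such as the following,
     assuming the vector y and the submatrices Adesign and Bdesign
     generated in the preliminary steps below:

        X3 = j(15,1) || Bdesign             <-- model (3): mu + beta
        X2 = j(15,1) || Adesign || Bdesign  <-- model (2): mu + alpha
                                                + beta
        R3 = ssq(y - X3 * ginv(X3) * y)     <-- residual SS for (3)
        R2 = ssq(y - X2 * ginv(X2) * y)     <-- residual SS for (2)
        htoA = R3 - R2                      <-- HTO A sum of squares

     (The SS subroutine introduced below arrives at the same value
     through a projection matrix rather than through two explicit
     model fits.)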
420
421 Consider the following two models:
422
423 y(i,j,k) = mu + beta(j) + phi(i,j) + e(i,j,k) (4)
424
425 y(i,j,k) = mu + alpha(i) + beta(j) + phi(i,j) + e(i,j,k). (1)
426
427 Note that (4) and (1) are the same as (3) and (2) respectively,
428 except that both (4) and (1) have an extra term, namely
429 phi(i,j). R4 - R1 is identical to the HTI (= SAS Type III =
430 SPSS ANOVA Unique) sum of squares for the A main effect.
431
432 We can see the source of the names HTI and HTO by studying the
433 two pairs of models above. For the HTI sum of squares the
434 Higher-level interaction Term [phi(i,j)] is Included (HTI) in
435 both models. For the HTO sum of squares the Higher-level in-
436 teraction Term is Omitted (HTO) from both models.
437
438 Finally, consider the following two models:
439
440 y(i,j,k) = mu + e(i,j,k) (5)
441
442 y(i,j,k) = mu + alpha(i) + e(i,j,k). (6)
443
444 Note that (5) and (6) are the same as (3) and (2) respectively,
445 except that both (5) and (6) lack the term beta(j). R5 - R6 is
446 identical to the sequential (= SAS Type I = SPSS ANOVA Hierar-
447 chical) sum of squares for the A main effect when A is entered
448 first (after the mean) into the model.
449
450 If we compute the difference between the residual sums of
451 squares of two models, the difference is called the "sum of
452 squares for the effect being tested". The "effect" is the term
453 that is present in one of the two models, but absent from the
454 other.
455
456 Each sum of squares has associated with it a number of "degrees
457 of freedom".
458
459 (For any main effect, the number of degrees of freedom is one
460 less than the number of values assumed by the associated pre-
461 dictor variable in the experiment. For example, in the Searle
462 data, predictor variable B assumes three different values in
463 the experiment so the number of degrees of freedom for B is
464 two. For an interaction, the number of degrees of freedom is
465 the product of the numbers of degrees of freedom for the main
466 effects for each of the predictor variables associated with the
467 interaction.)
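
     Equivalently, in terms of the full-column-rank design submatri-
     ces generated in the preliminary steps below, the degrees of
     freedom for an effect is simply the number of columns in its
     submatrix of the design matrix.  A minimal IML sketch, assuming
     those submatrices:

        dfA  = ncol(Adesign)          <-- 2 levels - 1 = 1
        dfB  = ncol(Bdesign)          <-- 3 levels - 1 = 2
        dfAB = ncol(ABdesign)         <-- 1 x 2 = 2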
468
469 If we divide a sum of squares for a particular effect by its
470 degrees of freedom, we get the "mean square" for the effect.
471 Similarly, if we divide the residual sum of squares for the
472 saturated model by *its* degrees of freedom, we get the resid-
473 ual mean square. The reason we might *want* to compute any of
474 these mean squares rests on three facts
475
476 - If there is *no* relationship in the population between the
477 response variable and predictor variable A, clearly the cor-
478 rect value for alpha(i) is zero for all values of i in all
479 the equations above. In this case, it can be shown that the
480 three mean squares for the A effect can be expected (under
481 certain often-satisfiable assumptions) to equal the "residual
482 variance" in the experiment. (The residual variance is esti-
483 mated by the residual mean square.)
484
485 - If there *is* a relationship between A and the response vari-
486 able, the three mean squares for the A effect can usually be
487 expected to be *greater* than the residual variance.
488
489 - Thus to determine whether there is evidence of a relationship
490 between the response variable and predictor variable A, we
491 need only determine whether the appropriate effect mean
492 square is significantly greater than the residual mean
493 square. The p-value is simply an easy-to-interpret result of
494 this determination.
495
496 These facts, buttressed by mathematical arguments (Searle 1987,
497 sec. 8.6), imply that the three approaches to computing sums of
498 squares provide (with certain limitations) valid candidates for
499 the numerator sums of squares in F-ratios used to compute p-
500     values for testing the existence of a relationship between
501 the response variable and predictor variable A.
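
     To make the last step concrete, here is a minimal IML sketch of
     converting an effect sum of squares into a p-value.  The names
     are illustrative; ssA, dfA, ssResid, and dfResid are assumed to
     hold the A sum of squares, its degrees of freedom, the residual
     sum of squares for the saturated model, and its degrees of free-
     dom:

        msA     = ssA / dfA               <-- mean square for A
        msResid = ssResid / dfResid       <-- residual mean square
        F       = msA / msResid           <-- F-ratio
        p       = 1 - probf(F, dfA, dfResid)
                                          <-- p-value for the A effect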
502
503 (The technique I discuss above for detecting relationships be-
504 tween variables is undeniably complex. A questioning reader
505 might wonder whether a simpler technique [with reasonable power
506 and objectivity] might be found. So far, no such technique has
507 been found.)
508
509 The above discussion states that three popular types of numera-
510 tor sums of squares (HTO, HTI, and sequential) can be computed
511 in a two-way experiment by computing the difference between the
512 residual sums of squares of two model equations. This general-
513 izes:
514
515 All standard analysis of variance numerator sums of
516 squares for two-way, three-way, and higher experiments
517 can be viewed as reflecting the difference between the
518 residual sums of squares of two overparameterized model
519 equations.
520
521
522 SUMMING UP
523
524 - an important use of analysis of variance is to help research-
525 ers detect relationships between variables in populations of
526 entities
527
528 - we can detect relationships between variables by studying the
529 p-values obtained by applying analysis of variance to the re-
530 sults of an experiment
531
532 - analysis of variance computer programs compute numerator sums
533 of squares as an intermediate step in computing p-values
534
535 - controversy exists about which method of computing numerator
536 sums of squares is preferred in unbalanced experiments
537
538 - one easy-to-understand way of viewing all standard approaches
539 to computing numerator sums of squares is to view them as re-
540 flecting the difference between the residual sums of squares
541 of two different overparameterized model equations for the
542 relationship between the response variable and predictor
543 variables in the experiment.
544
545
546 PROGRAM OVERVIEW
547
548 The program statements below demonstrate a simple algorithm for
549 computing analysis of variance numerator sums of squares. The
550 algorithm takes three pieces of information as input
551
552 1. a specification of the two model equations whose residual
553 sums of squares we wish to "difference" in order to compute
554        an analysis of variance numerator sum of squares [e.g., (3)
555 and (2) above]
556
557 2. a specification of the layout of the experiment
558
559 3. the response variable data vector containing the values of
560 the response variable obtained in the experiment.
561
562 The algorithm uses the input to compute the specified sum of
563 squares.
564
565 To show that the algorithm works properly, the program computes
566 seven analysis of variance sums of squares for the Searle data
567 given above.
568
569 The program below is organized into five parts:
570
571 1. a set of six preliminary steps to get things ready for com-
572 puting the sums of squares
573
574 2. seven repetitions of three simple steps to compute the seven
575 sums of squares
576
577 3. a recomputation of the seven sums of squares with SAS PROC
578 GLM for comparison
579
580 4. a summary
581
582 5. notes, an appendix, and references.
583
584
585 PRELIMINARY STEP 1: START PROC IML AND RESET SOME OPTIONS
586
587 PROC IML (Interactive Matrix Language) is an easy-to-use com-
588 puter language for general matrix arithmetic. It is an add-on
589 component of the SAS system and has many built-in functions to
590 facilitate matrix operations particular to statistical analy-
591 sis.
592
593 The PROC IML statement below initiates the IML environment.
594 After that statement is executed, the statements that follow
595 are statements in the IML language until we reach the QUIT
596 statement, which takes us back into a standard SAS program.
597 */
598
599 proc iml;
IML Ready
600
601 /*
602 The options in the following RESET statement control the desti-
603 nation and appearance of the IML output.
604 */
605
606 reset log fw=3 spaces=3;
607
608
609 /*
610 PRELIMINARY STEP 2: DEFINE THE SEARLE DATA
611
612 To get the Searle data into IML, we define two column vectors
613 (a and b) of values for the two predictor variables, and we de-
614 fine a column vector y, containing the values of the response
615 variable. Following are the three IML statements that define
616 the vectors a, b, and y:
617 */
618
619 a = { 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,2, 2 };
620
621 b = { 1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 2, 3,3, 3 };
622
623 y = { 6,10,11, 13,15, 14,22, 12,15,19,18, 31, 18,9,12 };
624 /*
625 Note how the values in the three vectors "line up" correctly
626 with each other to reflect exactly the same information as is
627 given near the beginning of this program in table 1. For exam-
628 ple, the last three values in y (i.e., 18, 9, and 12) each line
629 up with a value of 2 in vector a and a value of 3 in vector b,
630 reflecting the fact that the last three values in y are associ-
631 ated with the (2,3) cell in table 1.
632
633 The set of fifteen numbers between the braces in each of the
634 three IML statements above is called a "matrix literal". Al-
635 though the above three matrix literals are laid out horizon-
636 tally, they do not specify row vectors but instead specify
637 *column* vectors. They specify column vectors because each of
638 the numbers (except the last) in each matrix literal is fol-
639 lowed by a comma, which in an IML matrix literal indicates the
640 end of a row. (Numbers *within* a row of an IML matrix literal
641 are not separated by commas, but by blanks.)
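
     For example, the following two purely illustrative statements
     define a 1 x 3 row vector and a 3 x 1 column vector respec-
     tively:

        r = {4 7 9}        <-- blanks separate columns within a row
        c = {4, 7, 9}      <-- commas end rows: a column vector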
642
643
644 PRELIMINARY STEP 3: GENERATE THE MAIN EFFECT SUBMATRICES
645
646 Before we can begin computing sums of squares, we must first
647 generate the "full-column-rank" submatrices of the overall
648 "design matrix" for the experiment. These submatrices (which
649 are surprisingly simple) are used in the computation of the
650 sums of squares.
651
652 Any analysis of variance has a separate submatrix of the design
653 matrix for each of its main effects (i.e., for each predictor
654 variable) and a separate submatrix for each interaction in the
655 experiment. Each submatrix has N rows, where N equals the num-
656 ber of elements (entities, rows, cases, observations) in the
657 data. (In the Searle data N is fifteen since there were fif-
658 teen trials in the experiment; these trials are reflected in
659 the fifteen elements in each of vectors a, b, and y defined
660 above.) Each submatrix has DF columns, where DF is the number
661 of degrees of freedom associated with the main effect or inter-
662 action associated with the submatrix.
663
664 The full-column-rank submatrix for any main effect can be gen-
665 erated with a single statement in IML. For example, to gener-
666 ate the submatrix for the main effect of predictor variable B,
667 and to store that submatrix in the matrix called Bdesign, we
668 use the following statement:
669
670 Bdesign = designf(b)
671
672 where b is the column vector containing the raw values of pre-
673 dictor variable B used in the experiment.
674
675 DESIGNF is a built-in function in IML. The function returns a
676 matrix of zeros and (positive and negative) ones, with N rows
677 and DF columns, where N and DF are as described above.
678
679 Consider the main effect for predictor variable B (seed vari-
680 ety) in the Searle data. Here is a copy of the definition
681 (given above) of the (column) vector showing the different val-
682 ues of predictor variable B for the fifteen trials in the ex-
683 periment:
684
685 b = {
686 1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 2, 3, 3, 3 }.
687
688 Following is the IML statement to compute the submatrix of the
689 design matrix for predictor variable B. This is followed by a
690 statement to print the newly computed matrix. (The matrix is
691 printed transposed for ease of study.)
692 */
693
694 Bdesign = designf(b);
695
696 print (Bdesign`);
#TEM1001
1 1 1 0 0 -1 -1 1 1 1 1 0 -1 -1 -1
0 0 0 1 1 -1 -1 0 0 0 0 1 -1 -1 -1
697 /*
698 Consider the first row of 1's, zeros, and -1's in the output
699 above. Compare the values in this row with the values in vec-
700 tor B given further above. Note how the 1's identify the cases
701 in which B has the value 1. Similarly, the zeros identify the
702 cases in which B has the value 2. Finally, the -1's identify
703 the cases in which B has the value 3.
704
705 Similarly, in the second row note how the 1's identify the
706 cases in which B has the value 2. And the -1's identify
707 (again) the cases in which B has the value 3. The zeros iden-
708 tify the cases in which B has the value 1.
709
710 This generalizes: Suppose some (discrete-valued) predictor
711 variable P has assumed m different values in an experiment.
712 (The value of m must be greater than or equal to 2 for P to be
713 a valid predictor variable.) Then the (transposed) submatrix
714 for P will have m-1 rows, each row being associated with one of
715 the m values, and with one of the m values being "left out"
716 with no row of its own. Within each row the elements are +1 if
717 this case received the associated value of predictor variable
718 P; -1 if the case received the "left out" value of predictor
719 variable P; and zero if m is greater than 2 and the case re-
720 ceived one of the other m-2 values of predictor variable P.
721
722 Design matrices are mysterious because, in the case of comput-
723 ing the standard analysis of variance sums of squares, they are
724 not unique. That is, we could get exactly the same results in
725 the output below (apart from slight differences due to roundoff
726 errors) if we specified a new submatrix of the design matrix
727 for predictor variable B provided that the two columns in the
728 new matrix were themselves any two independent "linear combina-
729 tions" of the two columns of the submatrix for B given above.
730
731 How is it possible to get the same results if a submatrix of
732 the design matrix is replaced with a transformed version of it-
733 self? This puzzling aspect of design matrices is explained by
734 the fact that the relevant information (for computing a sum of
735 squares) in a submatrix of a design matrix is not stored in the
736 *values* of the elements in the matrix -- it is stored in the
737 *relationships among* the values. Independent linear combina-
738 tions of the columns in a submatrix preserve this relevant in-
739 formation.
740
741 (Technically, the linear combinations preserve the information
742 in the sense that all transformed sets of columns of a sub-
743 matrix of the design matrix so-defined are generating vectors
744 of the same relevant subspace of the N-dimensional vector space
745 under study.)
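
     Purely as an illustration (not something done in this program),
     the following statements would build such an alternative subma-
     trix from Bdesign, where T may be any non-singular 2 x 2 matrix:

        T        = {1 1,
                    0 1}        <-- an arbitrary non-singular matrix
        Bdesign2 = Bdesign * T  <-- columns are independent linear
                                    combinations of the columns of
                                    Bdesign

     Using Bdesign2 in place of Bdesign in the statements below would
     (apart from roundoff differences) reproduce the same sums of
     squares.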
746
747 Now let us compute the submatrix of the design matrix for pre-
748 dictor variable A (soil type). Here is a copy of the (column)
749 vector of values from the definition given above for this vari-
750 able:
751
752 a = {
753 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2 }
754
755 Following are the statements to compute and print the submatrix
756 of the design matrix for predictor variable A. (The matrix is
757 again printed transposed for ease of study.)
758 */
759
760 Adesign = designf(a);
761
762 print (Adesign`);
#TEM1001
1 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 -1 -1
763 /*
764 Since predictor variable A assumed only two values in the ex-
765 periment, the (transposed) submatrix of the design matrix has
766 only one row.
767
768 As was the case with predictor variable B, note how the 1's and
769 -1's above in Adesign line up respectively with the 1's and 2's
770 in the column vector of the values of predictor variable A.
771
772
773 PRELIMINARY STEP 4: GENERATE THE INTERACTION SUBMATRIX
774
775 Each interaction in an experiment has its own submatrix of the
776 design matrix. We can generate an interaction submatrix by
777 computing the "horizontal direct product" of the design matri-
778 ces for all the main effects for the predictor variables asso-
779 ciated with the interaction.
780
781 Consider two matrices P and Q with n rows and np and nq col-
782 umns, respectively. The horizontal direct product of P and Q
783 has n rows and np x nq columns. Each column in the horizontal
784 direct product is associated with a unique pair of columns, one
785 from P and one from Q. The elements in a given column in the
786     horizontal direct product are the products of the corre-
787 sponding elements in the two associated columns of P and Q.
788
789 For example, consider the following two matrices:
790
791 P = {1 2, and Q = {1 6,
792 3 4, 2 5,
793 5 6 } 3 4 }.
794
795 The horizontal direct product of P and Q is
796
797 { 1 6 2 12,
798 6 15 8 20,
799 15 20 18 24 }.
800
801 The IML function HDIR(P,Q) computes the horizontal direct prod-
802 uct of two matrices P and Q.
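
     If you wish to verify the small example above, the following
     purely illustrative statements (not part of this program) repro-
     duce it:

        P  = {1 2, 3 4, 5 6}
        Q  = {1 6, 2 5, 3 4}
        PQ = hdir(P, Q)          <-- the 3 x 4 matrix shown above
        print PQ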
803
804 Following is a statement to generate the submatrix of the de-
805 sign matrix for the A x B interaction in the Searle data. The
806 submatrix is generated by computing the horizontal direct prod-
807 uct of the two design submatrices (Adesign and Bdesign) gener-
808 ated above. This is followed by a statement to print the in-
809 teraction submatrix (again transposed for ease of study).
810 */
811
812 ABdesign = hdir(Adesign, Bdesign);
813
814 print (ABdesign`);
#TEM1001
1 1 1 0 0 -1 -1 -1 -1 -1 -1 0 1 1 1
0 0 0 1 1 -1 -1 0 0 0 0 -1 1 1 1
815
816
817 /*
818 COMPUTATION STEP 1: FOR A GIVEN SUM OF SQUARES SPECIFY XE
819 (WE SHALL FIRST COMPUTE THE HTO SUM OF SQUARES FOR A)
820
821 Having computed the three submatrices of the design matrix, we
822 can begin computing sums of squares. Let us first compute the
823 HTO sum of squares for predictor variable A (soil type).
824
825 The actual computing of sums of squares in this program is done
826 by a subroutine called SS. Using a subroutine is efficient be-
827 cause it consolidates the code to compute sums of squares into
828 a reusable and portable package. I discuss the SS subroutine
829 in detail below.
830
831 In order to compute a sum of squares, the SS subroutine re-
832 quires three pieces of input:
833
834 1. the values of the response variable obtained in the experi-
835 ment
836
837 2. information about which sum of squares we wish to compute,
838 as specified by two overparameterized model equations, as
839 discussed above
840
841 3. information about the layout of the experiment.
842
843 The first piece of input, the values of the response variable,
844 was defined above in the definition of the vector y. Thus we
845 need only pass the vector y to the subroutine for the subrou-
846 tine to use the values of the response variable.
847
848 The second and third pieces of input are passed to the subrou-
849 tine through two matrices that are obtained from the submatri-
850 ces of the design matrix discussed above. The two matrices are
851 called XE and XR.
852
853 The first matrix, XE, is simply the submatrix of the design ma-
854 trix for the particular effect we wish to test. In the present
855 case we are testing the A main effect. Thus we specify XE as
856 follows:
857 */
858
859 XE = Adesign;
860
861
862 /*
863 COMPUTATION STEP 2: FOR A GIVEN SUM OF SQUARES SPECIFY XR
864
865 In the present example we wish to compute the HTO numerator sum
866 of squares for the A main effect. Thus (as discussed above) we
867 wish to compute the difference between the residual sums of
868 squares of the following two model equations:
869
870 y(i,j,k) = mu + beta(j) + e(i,j,k) (3)
871
872 y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k). (2)
873
874 We have already specified the conceptual difference between (3)
875 and (2) in the specification of XE above. Thus all that re-
876 mains in order to fully specify the two equations is to list
877 the "other" terms on the right side of (3) and (2).
878
879 The "other" terms on the right side of (3) and (2) (excluding
880 the error term) are mu and beta. We specify the other terms by
881 "horizontally concatenating" the submatrices for these terms to
882 form the matrix XR.
883
884 As suggested by the name, the horizontal concatenation of two
885 matrices P and Q is the matrix formed by appending the rows of
886 Q to the corresponding rows of P. (Horizontal concatenation of
887 two matrices is possible only if the matrices have the same
888 number of rows.) In SAS IML the expression P || Q is used to
889 specify the horizontal concatenation of two matrices P and Q.
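
     For example, the following purely illustrative statements con-
     catenate a 2 x 2 matrix and a 2 x 1 column vector to form a
     2 x 3 matrix:

        P = {1 2,
             3 4}
        Q = {5,
             6}
        print (P || Q)           <-- prints the matrix {1 2 5,
                                                        3 4 6}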
890
891 Since mu is a constant in the two equations, the submatrix of
892 the design matrix for mu is a column vector of 1's. We can
893 specify a column vector of fifteen 1's with the IML function
894
895 j(15,1).
896
897 Thus we specify the XR matrix for the HTO A (soil type) sum of
898 squares in the experiment with the following statement:
899 */
900
901 XR = j(15,1) || Bdesign;
902
903 /*
904 Here is the (transposed) XR matrix:
905 */
906
907 print (XR`);
#TEM1001
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 0 0 -1 -1 1 1 1 1 0 -1 -1 -1
0 0 0 1 1 -1 -1 0 0 0 0 1 -1 -1 -1
908
909 /*
910     Note that XE and XR contain two types of information:
911
912 - information about which analysis of variance sum of squares
913 we wish to compute (as captured by the *choice* of the sub-
914 matrices we have used to define XE and XR)
915
916 - information about the layout of the experiment (as captured
917 in the *layout* of the submatrices of the design matrix used
918 to define XE and XR).
919
920 The E in XE stands for the Effect being tested. The R in XR
921 stands for otheR terms on the right side of the two model equa-
922 tions.
923
924
925 PRELIMINARY STEP 5:
926 MAKE THE SS SUBROUTINE AVAILABLE TO THE PROGRAM
927
928 (I have delayed this preliminary step until after defining XE
929 and XR, since this will help you to understand the subroutine.)
930
931 Since we have now defined y, XE, and XR for the HTO A (soil
932 type) effect, we are ready to call subroutine SS to compute the
933 sum of squares. In fact, we shall call the subroutine momen-
934 tarily. But before we can actually call the subroutine, we
935 must first make it known to IML. We do this with the %INCLUDE
936 statement that follows.
937
938 Because the SOURCE2 option is specified in the %INCLUDE state-
939 ment, SAS lists all of the statements in the subroutine in the
940 output below immediately after the %INCLUDE statement.
941 */
942
943 %include 'D:\PROGS\SS.SAS' / source2;
NOTE: %INCLUDE (level 1) file D:\PROGS\SS.SAS is file D:\PROGS\SS.SAS.
944 + /*********************************************************/
945 +
946 + start SS (result, y, XE, XR, level,printss,printh);
947 +
948 + /*
949 + SS.SAS
950 +
951 + SUBROUTINE TO COMPUTE ANALYSIS OF VARIANCE
952 + NUMERATOR SUMS OF SQUARES
953 +
954 + Donald B. Macnaughton
955 + donmac@matstat.com
956 +
957 + ABSTRACT
958 +
959 + This SAS IML subroutine computes analysis of variance nu-
960 + merator sums of squares for statistical tests in unbalanced
961 + analysis of variance. The subroutine can compute the fol-
962 + lowing types of sums of squares:
963 +
964 + - HTO (Higher-level Terms are Omitted)
965 + = SPSS ANOVA Experimental
966 + = SAS Type II in the two-way case
967 +
968 + - SAS Type II
969 +
970 + - HTOS (Higher-level Terms are Omitted unless Significant)
971 + = superset of SAS Type II and HTO
972 +
973 + - HTI (Higher-level Terms are Included)
974 + = SAS Type III
975 + = SPSS ANOVA Unique
976 + = the default method in many analysis of variance pro-
977 + grams
978 +
979 + - sequential
980 + = SAS Type I
981 + = SPSS ANOVA Hierarchical in the two-way case
982 +
983 + - SPSS ANOVA Hierarchical
984 +
985 + - other types of sums of squares (including those in analy-
986 + sis of covariance) that can be specified as being equal to
987 + the difference between the residual sums of squares of two
988 + overparameterized model equations.
989 +
990 + Each time the subroutine is called it computes the single
991 + sum of squares specified by the calling arguments and re-
992 + turns the result to the calling program through the returned
993 + argument RESULT.
994 +
995 +
996 + SUBROUTINE USES
997 +
998 + The subroutine
999 +
1000 + - provides an easy way of computing numerator sums of
1001 + squares that cannot be computed in some statistical pack-
1002 + ages
1003 +
1004 + - provides a way of checking how a particular sum of squares
1005 + is computed in a statistical package
1006 +
1007 + - illustrates the mathematical aspects of computing sums of
1008 + squares
1009 +
1010 + - assists in performing simulation tests of conjectures
1011 + about analysis of variance sums of squares.
1012 +
1013 +
1014 + SUBROUTINE MAIN ARGUMENTS
1015 +
1016 + When calling this subroutine you must specify three main ar-
1017 + guments: y, XE, and XR.
1018 +
1019 + The first main argument, y, must be a column vector contain-
1020 + ing the values of the response variable obtained in the ex-
1021 + periment.
1022 +
1023 + The second and third main arguments, XE and XR, must be sub-
1024 + matrices of the (full-column-rank) design matrix for the ex-
1025 + periment. These submatrices jointly specify the particular
1026 + sum of squares to be computed. XE must contain the subma-
1027 + trix associated with the effect being tested. XR must con-
1028 + tain the horizontal concatenation of the submatrices associ-
1029 + ated with all the other terms (except the error term) on the
1030 + right side of the two model equations whose residual sums of
1031 + squares you wish to difference.
1032 +
1033 + I give examples of specification of y, XE, and XR in two
1034 + sample calling programs (1998a, 1998b).
1035 +
1036 +
1037 + SUBROUTINE METHOD
1038 +
1039 + As noted above, the subroutine is supplied with the vector y
1040 + and the matrices XE and XR. Here is the code (all four
1041 + lines) for computing the requested sum of squares and stor-
1042 + ing it in the variable RESULT:
1043 +
1044 + H = XE - XR * ginv(XR) * XE <-- hypothesis matrix
1045 + (dimensions are N x DF)
1046 + PM = H * ginv(H) <-- projection matrix (N x N)
1047 + yp = PM * y <-- projection of y (N x 1)
1048 + result = ssq(yp) <-- squared length of the
1049 + projection (1 x 1).
1050 +
1051 + If you are familiar with matrix algebra, the above code will
1052 + probably be understandable, even if you are not familiar
1053 + with SAS IML. But here are definitions for functions that
1054 + might be puzzling:
1055 +
1056 + ginv(X) = Moore-Penrose generalized inverse of X
1057 +
1058 + ssq(x) = sum of the squares of the elements in x.
1059 +
1060 + The subroutine method is derived from discussion and proofs
1061 + by Hocking (1985) and Searle (1987). Aldrich (1998) dis-
1062 + cusses the history of the methods. I discuss the method in
1063 + more detail below.
1064 +
1065 +
1066 + SUMMARY OF THE REST OF THE SUBROUTINE
1067 +
1068 + The remaining lines in this subroutine (717 lines) contain
1069 + - a discussion of details of the method discussed above
1070 + - a discussion of details of the subroutine operation and
1071 + - the executable version of the four statements given above.
1072 + I recommend that you omit reading these lines unless you are
1073 + interested in the details. (To find the end of the subrou-
1074 + tine to return to reading the calling program, search for
1075 + the row of asterisks.)
1076 +
1077 +
1078 + THE HYPOTHESIS MATRIX H
1079 +
1080 + As noted above, the first step of this subroutine is to gen-
1081 + erate the hypothesis matrix H, using the following state-
1082 + ment:
1083 +
1084 + H = XE - XR * ginv(XR) * XE.
1085 +
1086 + Actually, the above statement is one of two different meth-
1087 + ods that the subroutine can use to generate H. (I describe
1088 + how to specify which method is to be used in the Secondary
1089 + Arguments section below.) The other method is
1090 +
1091 + H = XE - XR * inv(XR` * XR) * XR` * XE.
1092 +
1093 + Algebraically, both methods yield exactly the same result
1094 + for H, as shown by Graybill's theorem 6.2.16 (1983, 112).
1095 +
1096 + (At several places in this subroutine I state that two meth-
1097 + ods yield exactly the same result. Although my statements
1098 + are algebraically correct, if we use the two methods in a
1099 + computer program, usually differences between the values
1100 + computed by the two methods will occur, due to differences
1101 + in roundoff errors. One should be aware of these differ-
1102 + ences but they can usually be ignored because they are usu-
1103 + ally restricted to just the last few significant digits in
1104 + computations that are performed in SAS in sixteen decimal
1105 + digit precision.)
1106 +
1107 + H has the following properties:
1108 +
1109 + 1. H has N rows, where N is the number of rows in y, XE,
1110 + and XR.
1111 +
1112 + 2. H has DF columns, where DF is the number of degrees of
1113 + freedom of the effect being tested.
1114 +
1115 + 3. Choose any cell in the table that summarizes the layout
1116 + of the experiment. (For an example of such a table, see
1117 + Searle's carrot seed germination data [1987, 79], repro-
1118 + duced in table 1 in Macnaughton [1998a].) All the rows
1119 + in H associated with the chosen cell are identical to
1120 + each other.
1121 +
1122 + 4. The sum of each column of H equals zero. This is sur-
1123 + prising because the sums of the columns of XE and XR,
1124 + which are used to generate H, generally do NOT equal
1125 + zero in unbalanced experiments.
1126 +
1127 + 5. If H has only one column (i.e., DF = 1), we can view the
1128 + elements in H as a statement of the hypothesis being
1129 + tested. That is, the elements in H are (indirectly)
1130 + multiplied by the corresponding elements in the response
1131 + data vector (in a "contrast") as a step in computing the
1132 + sum of squares.
1133 +
1134 + 6. If H has more than one column, the columns of H are
1135 + "linearly independent".
1136 +
1137 + 7. If H has more than one column, the elements in each col-
1138 + umn of H can be viewed as a "portion" of the hypothesis
1139 + being tested. That is, each column of H represents a
1140 + separate contrast that is applied to the response data
1141 + vector. The results of these contrasts are mathemati-
1142 + cally combined to compute the sum of squares. (The con-
1143 + trasts are not directly applied to the response data
1144 + vector, but only indirectly through PM, as discussed be-
1145 + low.)
1146 +
1147 + 8. Surprisingly, H is not unique. That is, for any given H
1148 + we can replace it by any of an infinite number of "re-
1149 + lated" matrices (which have, of course, the same number
1150 + of rows and columns as H), and this subroutine will re-
1151 + turn exactly the same value for the sum of squares. H
1152 + does not need to be unique to yield a unique sum of
1153 + squares because the columns of H are not defining them-
1154 + selves. Instead, the columns are defining a unique sub-
1155 + space of the N-dimensional vector space under study.
1156 + Linear algebra shows that other versions of the matrix H
1157 + can be used to define the same unique subspace. More
1158 + specifically, suppose T is ANY "non-singular" matrix
1159 + with DF rows and DF columns. Then we can replace H with
1160 + H1 = H * T and this subroutine will return exactly the
1161 + same sum of squares.
1162 +
1163 +
1164 + THE PROJECTION MATRIX PM
1165 +
1166 + As noted above in the "Method" section, the next step after
1167 + computing H is to compute the projection matrix PM, using
1168 + the following statement:
1169 +
1170 + PM = H * ginv(H).
1171 +
1172 + Actually (as with H), the above statement is one of two dif-
1173 + ferent methods that this subroutine can use to generate PM.
1174 + The other method is
1175 +
1176 + PM = H * inv(H` * H) * H`.
1177 +
1178 + As (again) shown by Graybill's theorem 6.2.16, both methods
1179 + yield exactly the same value for PM.
1180 +
1181 + The method immediately above is equivalent to computing
1182 + Hocking's P(c) in (6.168) (1985, 153). This method is par-
1183 + tially shown to be valid in Graybill's Theorem 4.4.1 (1983,
1184 + 73). Harville proves some useful results for projection ma-
1185 + trices (1997, sec 12.3).
1186 +
1187 + Note how the above two equations imply that PM is a "normal-
1188 + ized" version of the hypothesis matrix H. PM has the fol-
1189 + lowing properties:
1190 +
1191 + 1. PM has N rows and N columns.
1192 +
1193 + 2. Choose any cell in the table that summarizes the layout
1194 + of the experiment. All the rows in PM associated with
1195 + that cell are identical to each other.
1196 +
1197 + 3. PM is symmetric. That is, the first row has exactly the
1198 + same elements (in left-to-right order) as the first col-
1199 + umn (in top-to-bottom order), and the second row has ex-
1200 + actly the same elements as the second column, and so on.
1201 +
1202 + 4. The sum of the elements in each row of PM (and the sum
1203 + of the elements in each column) is equal to zero. This
1204 + implies that if a row of PM is multiplied by the vector
1205 + y, it produces a "contrast" of the values in y.
1206 +
1207 + 5. The trace of PM (i.e., the sum of the elements on the
1208 + descending diagonal) is equal to DF, the number of de-
1209 + grees of freedom associated with the hypothesis being
1210 + tested.
1211 +
1212 + 6. The rank of PM (i.e., the number of linearly independent
1213 + rows or columns in PM) is equal to DF. Thus if DF is
1214 + equal to 1, any row in PM is a multiple of any other
1215 + row. If DF is equal to k, any row in PM can be gener-
1216 + ated as a linear combination of any k mutually independ-
1217 + ent other rows.
1218 +
1219 + 7. The sum of the squares of the elements in PM is equal to
1220 + DF.
1221 +
1222 + 8. The rows (and columns) of PM are linear combinations of
1223 + the columns of H.
1224 +
1225 + 9. PM is unique. That is, suppose a projection matrix PM
1226 + is computed from a given hypothesis matrix H. Suppose
1227 + that then H is multiplied by any non-singular matrix T
1228 + with DF rows and DF columns to yield H1. Suppose that
1229 + then a new projection matrix PM1 is computed from H1.
1230 + Then PM1 = PM.
1231 +
1232 + 10. PM is idempotent. That is, if we multiply PM by itself,
1233 + the answer we get is PM.
1234 +
1235 + 11. If (as discussed below) we project an arbitrary vector x
1236 + through PM to yield x1, and if we then project x1
1237 + through PM to yield x2, we will find that x1 and x2 are
1238 + identical, although they will generally differ from x.
1239 +
1240 + 12. PM has DF eigenvalues (characteristic values) that are
1241 + equal to +1 and the remaining eigenvalues are equal to
1242 + zero. The eigenvectors (characteristic vectors) of PM
1243 + that correspond to the non-zero eigenvalues can be used
1244 + as columns to form a valid version of the hypothesis ma-
1245 + trix H.
1246 +
1247 +
1248 + yp, THE PROJECTION OF y BY PM
1249 +
1250 + Thus far, we have only used information about the *predic-
1251 + tor* variables in the experiment to derive the following
1252 + four new matrices: XE, XR, H, and PM. That is, we have not
1253 + yet taken any account of the values of the response variable
1254 + stored in the vector y. The next step in computing the sum
1255 + of squares is to mathematically marry the response variable
1256 + and the predictor variables. We do so by using the projec-
1257 + tion matrix PM to "project" y. That is, we postmultiply the
1258 + projection matrix by y to yield a new vector, called yp, as
1259 + follows:
1260 +
1261 + yp = PM * y.
1262 +
1263 + The vector yp has the following properties:
1264 +
1265 + 1. Like y, yp is an N x 1 column vector.
1266 +
1267 + 2. Choose any cell in the table that summarizes the layout
1268 + of the experiment. All the elements in yp associated
1269 + with that cell are identical to each other.
1270 +
1271 +
1272 + THE SUM OF SQUARES AS THE SQUARED LENGTH OF yp
1273 +
1274 + The desired sum of squares is simply the squared length of
1275 + yp, and is computed as
1276 +
1277 + result = ssq(yp).
1278 +
1279 + Thus the sum of squares is simply the squared length of the
1280 + projection of the vector y by the projection matrix PM. PM
1281 + has two important properties related to the projection of y
1282 +
1283 + 1. If no relationship exists between the response variable
1284 + and the particular set of predictor variables associated
1285 + with XE, (and if certain well-known assumptions are ade-
1286 + quately satisfied), the length of the projection of y can
1287 + be expected to be "short"; it will be roughly equal to a
1288 + known length (i.e., a length that can be computed from
1289 + the data). On the other hand if a relationship between
1290 + the variables *does* exist, the length of the projection
1291 + of y will tend to be longer than the known length. Thus
1292 + computing a p-value is simply computing whether the pro-
1293 + jection of y is longer than could be reasonably expected
1294 + if no relationship exists.
1295 +
1296 + 2. If we are studying a balanced experiment, and if we com-
1297 + pute projection matrices PM1 and PM2 for any two of the
1298 + effects (main effects or interactions) in the experiment,
1299 + PM1 * PM2 = 0, where 0 is an N x N matrix of zeros.
1300 + (This impressive result does not generally occur in un-
1301 + balanced experiments.) This means that in a balanced ex-
1302 + periment the projection of any vector y by PM1 is "or-
1303 + thogonal" to (i.e., at right angles to) the projection of
1304 + the same vector (or any other vector y1) by PM2. It also
1305 + means that if an experiment is balanced, none of the ef-
1306 + fects in the experiment can "contaminate" the statistical
1307 + tests of other effects in the experiment. (This contami-
1308 + nation, which I shall demonstrate in later material, oc-
1309 + curs with some statistical tests in unbalanced experi-
1310 + ments.)
1311 +
1312 +
1313 + OTHER METHODS OF COMPUTING THE SUM OF SQUARES
1314 +
1315 + Simple linear algebra implies that we can also compute the
1316 + desired sum of squares as
1317 +
1318 + result = yp` * yp.
1319 +
1320 + Also, Searle (1987) shows in (82) on page 264 and (90) on
1321 + page 272 and (90) on page 318 that we can compute the de-
1322 + sired sum of squares directly from PM as a quadratic form
1323 +
1324 + result = y` * PM * y.
1325 +
1326 + I find the projection approach (a geometric approach) easier
1327 + to understand than the quadratic form approach (an algebraic
1328 + approach). I visualize the response vector as an arrow in
1329 + the N-dimensional vector space that is "projected through"
1330 + the projection matrix to generate another arrow in a sub-
1331 + space of the first space. The length of the second arrow is
1332 + related to selected properties of the first arrow. In par-
1333 + ticular, the projection matrix is specifically designed so
1334 + that the length of the second arrow shows (in a way that is
1335 + as mathematically "clear" as possible) the strength of sup-
1336 + port (provided by the y-values obtained in the experiment)
1337 + for the hypothesis that the relationship between variables
1338 + associated with the projection matrix exists in the popula-
1339 + tion.
1340 +
1341 +
1342 + GENERAL COMMENTS ABOUT THE SUBROUTINE METHOD
1343 +
1344 + It is helpful to review what this subroutine accomplishes.
1345 + In essence, the calling program passes the matrices XE and
1346 + XR to the subroutine. These matrices contain the specifica-
1347 + tions of the layout of the experiment and the specification
1348 + of two model equations. The calling program also passes the
1349 + vector y to the subroutine. This vector contains the values
1350 + of the response variable obtained in the experiment. Fol-
1351 + lowing Yates' characterization of analysis of variance sums
1352 + of squares (1934, 63), the subroutine uses the three argu-
1353 + ments to compute the difference between the residual sums of
1354 + squares of the two model equations. Depending on how the
1355 + calling program specifies the values of XE and XR, this al-
1356 + lows the subroutine to compute a sum of squares using any of
1357 + the seven approaches to computing analysis of variance sums
1358 + of squares named in the abstract of the subroutine. (I il-
1359 + lustrate how to call the subroutine to compute sums of
1360 + squares for some of the approaches in two computer programs
1361 + [1998a, 1998b].)
1362 +
1363 +
1364 + SUBROUTINE SECONDARY ARGUMENTS
1365 +
1366 + The remaining lines in this subroutine (419 lines) contain
1367 + - details of the subroutine operation and
1368 + - the executable version of the statements given above.
1369 + You can find the end of the subroutine to return to reading
1370 + the calling program by searching for the row of asterisks.
1371 +
1372 + To use this subroutine you must supply values for three sec-
1373 + ondary arguments: LEVEL, PRINTSS, and PRINTH. These argu-
1374 + ments control details of how the subroutine performs the
1375 + computation and prints the results.
1376 +
1377 + The argument LEVEL controls which method the subroutine uses
1378 + to compute the hypothesis matrix H and the projection matrix
1379 + PM. If you set LEVEL to 1, the two matrices are computed
1380 + using the standard inverse. If you set LEVEL to 2, the two
1381 + matrices are computed using the generalized inverse (as
1382 + shown in the Method section). Using LEVEL = 1 seems to
1383 + yield solutions that are slightly more accurate.
1384 +
1385 + The argument PRINTSS controls whether the subroutine prints
1386 + the sum of squares it has computed. If you set PRINTSS to
1387 + 1, the subroutine prints the value of the sum of squares in
1388 + 25.15 format. If you set PRINTSS to zero, the subroutine
1389 + does not print the sum of squares but instead only returns
1390 + the value to the calling program through the argument
1391 + RESULT. If you set PRINTSS to 2, the subroutine prints all
1392 + the printable digits of the computed sum of squares in E-no-
1393 + tation for possible comparison against other computed val-
1394 + ues. If PRINTSS is 1 or 2, the subroutine also prints the
1395 + results of tests of the integrity of the projection matrix
1396 + PM. (I describe the tests below.)
1397 +
1398 + The argument PRINTH controls whether the subroutine prints
1399 + the following three intermediate results:
1400 + - the hypothesis matrix H
1401 + - the projection matrix PM and
1402 + - the projection, yp, of the response data vector.
1403 + If you set PRINTH to 1, the subroutine prints the intermedi-
1404 + ate results. If you set PRINTH to 2, the subroutine prints
1405 + the intermediate results but prints the hypothesis matrix
1406 + and the projection of the response vector transposed, which
1407 + can sometimes save space in the output. If you set PRINTH
1408 + to zero, the subroutine does not print intermediate results.
1409 +
1410 +
1411 + SUBROUTINE NOTES
1412 +
1413 + The numerical method of computing sums of squares used by
1414 + this subroutine is efficient for supporting the conceptual
1415 + approach to analysis of variance sums of squares of comput-
1416 + ing the difference between the residual sums of squares of
1417 + two overparameterized model equations. However, the method
1418 + is generally not the most efficient method in terms of
1419 + - economy of computation in memory required
1420 + - economy of computation in time required
1421 + - minimization of roundoff errors for large datasets or ill-
1422 + conditioned data.
1423 + Nevertheless, the subroutine generally yields highly accu-
1424 + rate sums of squares in minimal time and is therefore more
1425 + than adequate for most applications that cannot be handled
1426 + by a general analysis of variance program.
1427 +
1428 + One inefficiency of the method used by this subroutine re-
1429 + lates to the inclusion of identical rows within each of the
1430 + following five arrays:
1431 + - XE and XR, the relevant submatrices of the design matrix
1432 + - H, the hypothesis matrix
1433 + - PM, the projection matrix
1434 + - yp, the projection of vector y.
1435 + That is, (as noted above) for any given cell in the table
1436 + that summarizes the layout of the experiment, all the rows
1437 + in each of the five arrays that correspond to that cell are
1438 + (within the array) identical -- one row for each value of
1439 + the response variable associated with the cell. Sunwoo and
1440 + Kim (1997) discuss an approach to analyzing unbalanced ex-
1441 + periments that eliminates this duplication. For very large
1442 + experiments, this subroutine could be enhanced to take ac-
1443 + count of the Sunwoo and Kim approach, thereby substantially
1444 + increasing the computational efficiency (at the expense of
1445 + an increase in complexity).
1446 +
1447 + (The enhancement might work as follows: Compute y1, which
1448 + is an m-dimensional column vector containing the cell means
1449 + [or possibly cell totals] of the values of the response
1450 + variable, where m is the number of cells in the experiment.
1451 + Compute XE, XR, H, and PM as above but on the basis of an
1452 + equivalent experiment with only one observation per cell.
1453 + Compute yp1 as the projection of y1 by PM. Scale yp1 with
1454 + the Sunwoo and Kim T matrix to [in effect] yield yp, and
1455 + compute the desired sum of squares from yp as before.)
1456 +
1457 + The hypothesis matrix H I use in this subroutine has dimen-
1458 + sions N x DF.  This is the transpose of the hypothesis ma-
1459 + trix that Ronald Hocking discusses in his book (1985, 153)
1460 + and that I discuss in a paper (1997, appendix C).  I have
1461 + used the transpose because it yields slightly simpler notation.
1462 +
1463 + To conserve memory, the subroutine erases XE and XR after
1464 + using them.
1465 +
1466 + If you wish to run this subroutine on an EBCDIC system
1467 + (e.g., an IBM mainframe), see the note near the end.
1468 +
1469 +
1470 + SUBROUTINE EXECUTABLE STATEMENTS
1471 +
1472 + First, check the arguments passed to the subroutine and stop
1473 + if a problem is found.
1474 + */
1475 +
1476 + if level ^= 1 & level ^= 2 then do;
1477 + print '***ERROR*** in call to SS subroutine.';
1478 + print 'Value of LEVEL is' level;
1479 + print 'LEVEL must be 1 or 2. Execution terminated.';
1480 + abort;
1481 + end;
1482 +
1483 + if printss ^= 0 & printss ^= 1 & printss ^= 2 then do;
1484 + print '***ERROR*** in call to SS subroutine.';
1485 + print 'Value of PRINTSS is' printss;
1486 + print 'Value must be 0, 1, or 2. Execution terminated.';
1487 + abort;
1488 + end;
1489 +
1490 + if printh ^= 0 & printh ^= 1 & printh ^= 2 then do;
1491 + print '***ERROR*** in call to SS subroutine.';
1492 + print 'Value of PRINTH is' printh;
1493 + print 'PRINTH must be 0, 1, or 2. Execution terminated.';
1494 + abort;
1495 + end;
1496 +
1497 + if type(y) = 'U' | type(XE) = 'U' | type(XR) = 'U' then do;
1498 + print '***ERROR*** in call to SS subroutine.';
1499 + string = 'One or more of the matrices y, XE';
1500 + string = string + ', and XR do not exist. You must ';
1501 + string = string + 'specify the three matrices before ';
1502 + string = string + 'calling the SS subroutine.';
1503 + print string;
1504 + print 'Execution terminated.';
1505 + abort;
1506 + end;
1507 +
1508 + n = nrow(y);
1509 + if nrow(XE) ^= n | nrow(XR) ^= n then do;
1510 + string = '***ERROR*** in call to SS subroutine. ';
1511 + string = string + 'Discrepancy found between the ';
1512 + string = string + 'number of rows in y, XE, and XR:';
1513 + print string;
1514 + nrow_y = n;
1515 + nrow_XE = nrow(XE);
1516 + nrow_XR = nrow(XR);
1517 + print nrow_y nrow_XE nrow_XR;
1518 + print 'Execution terminated.';
1519 + abort;
1520 + end;
1521 +
1522 + /*
1523 + Compute Searle's M1 as defined in his (76) on pages 263 and
1524 + 318 of his book (1987) as
1525 +
1526 + M1 = I(n) - XR * ginv(XR)
1527 +
1528 + where
1529 +
1530 + I(n) = the identity matrix (of dimension n x n with 1's
1531 + on the diagonal and zeros elsewhere).
1532 +
1533 + Then compute the hypothesis matrix
1534 +
1535 + H = M1 * XE.
1536 +
1537 + Note that XE in this subroutine is equivalent to Searle's X2
1538 + and XR is equivalent to Searle's X1.
1539 +
1540 + The following statements perform the above arithmetic but
1541 + bypass the intermediate step of computing M1. The chosen
1542 + method depends on the value of LEVEL.
1543 + */
1544 +
1545 + if level = 1 then
1546 + H = XE - XR * inv(XR` * XR) * XR` * XE;
1547 + else
1548 + H = XE - XR * ginv(XR) * XE;
1549 +
1550 + /*
1551 + Since they are no longer needed, erase XE and XR to conserve
1552 + memory. Note that erasing the two matrices here means that
1553 + after returning from this subroutine the values in the ma-
1554 + trices will not be available in the calling program.
1555 +
1556 + To conserve memory, other matrices are also erased below as
1557 + soon as they are no longer needed.
1558 + */
1559 +
1560 + free XE XR;
1561 +
1562 + /*
1563 + Compute the projection matrix PM using the appropriate
1564 + method, as determined by LEVEL.
1565 + */
1566 +
1567 + if level = 1 then
1568 + PM = H * inv(H` * H) * H`;
1569 + else
1570 + PM = H * ginv(H);
1571 +
1572 + if printh = 0 then free H;
1573 +
1574 + /*
1575 + Compute the projection of y.
1576 + */
1577 +
1578 + yp = PM * y;
1579 +
1580 + if printss = 0 then free PM;
1581 +
1582 + /*
1583 + Compute the desired sum of squares as the squared length of
1584 + the projection yp.
1585 + */
1586 +
1587 + result = ssq(yp);
1588 +
1589 + if printh = 0 then free yp;
1590 +
1591 + /*
1592 + If requested, print the computed sum of squares.
1593 + */
1594 +
1595 + if printss = 1 then print result [format=25.15];
1596 + if printss = 2 then print result [format=e23.];
1597 +
1598 + /*
1599 + If requested, print the intermediate results:
1600 + - the hypothesis matrix
1601 + - the projection matrix
1602 + - the projection of y.
1603 +
1604 + Print the hypothesis matrix and the projection of y untrans-
1605 + posed or transposed, depending on the value of PRINTH.
1606 + */
1607 +
1608 + if printh = 1 then do;
1609 + print ,
1610 + 'Hypothesis matrix H = XE - XR * ginv(XR) * XE:', H;
1611 + print ,'Projection matrix PM = H * ginv(H):', PM;
1612 + print ,'Projection of y: yp = PM * y:', yp;
1613 + end;
1614 +
1615 +
1616 + if printh = 2 then do;
1617 + print ,
1618 + 'Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:',
1619 + (H`);
1620 + print ,'Projection matrix PM = H * ginv(H):', PM;
1621 + print ,'Transpose of projection of y: yp = PM * y:',
1622 + (yp`);
1623 + end;
1624 +
1625 + /*
1626 + If the argument PRINTSS requested printing of sums of
1627 + squares then, as a way of (indirectly) checking the accuracy
1628 + of the subroutine, perform checks on PM to see how close it
1629 + is to having the following properties:
1630 + - symmetric
1631 + - rows sum to zero
1632 + - idempotent.
1633 + */
1634 +
1635 + if printss > 0 then do;
1636 +
1637 + /*
1638 + The checks are done by first computing the largest relative
1639 + difference between corresponding values in two matrices that
1640 + should be identical. The relative difference between two
1641 + matrices P and Q (called the "relative error") is defined as
1642 +
1643 + E = (P - Q) / P.
1644 +
1645 + Both the subtraction and the division in the above expres-
1646 + sion are done on an element-by-element basis. Thus E is it-
1647 + self a matrix with the same dimensions as P and Q. The sub-
1648 + routine computes L, the largest absolute value of the ele-
1649 + ments in E as a measure of the equality of the two matrices.
1650 +
1651 + The subroutine then converts L to a rough number of digits
1652 + of accuracy with the formula
1653 +
1654 + ndigits = -log10(L).
1655 +
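       (For example, if an element of P is 1.0000000 and the cor-
       responding element of Q is 0.9999999, the relative error
       for that element is about 1E-7; if that is the largest such
       error, then L = 1E-7 and ndigits = -log10(1E-7) = 7, sug-
       gesting agreement to roughly seven decimal digits.)
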
1656 + First generate PMD, the divisor matrix for two of the rela-
1657 + tive errors. PMD is simply PM, except that we ensure that
1658 + none of the values in PMD is zero.
1659 + */
1660 +
1661 + if all(PM ^= 0) then PMD = PM;
1662 + else do;
1663 + PMD = PM + ((PM = 0) * .6e-78);
1664 + n = sum(PM = 0);
1665 + string = 'elements of PM were zero. .6E-78 ';
1666 + string = string + 'was added to these elements ';
1667 + string = string + 'to avoid dividing by zero in ';
1668 + string = string + 'computing relative errors.';
1669 + print n string;
1670 + end;
1671 +
1672 + /*
1673 + Second, set DECACC, which is the number of decimal digits of
1674 + accuracy carried in the computations. This value depends on
1675 + the computer operating system and language used to run the
1676 + subroutine. For my (Windows) computer the correct value
1677 + with SAS is 16.
1678 + */
1679 +
1680 + decacc = 16;
1681 +
1682 + /*
1683 + Third, compute the maximum absolute relative difference be-
1684 + tween corresponding elements in PM and its transpose as a
1685 + measure of the symmetry of PM.
1686 + */
1687 +
1688 + mrelerrS = max( abs( (PM - PM`)/PMD ) );
1689 + if mrelerrS = 0 then ndigitsS = decacc;
1690 + else ndigitsS = -log10(mrelerrS);
1691 +
1692 + /*
1693 + Fourth, compute the maximum absolute relative error between
1694 + the sum of elements in a row of PM and zero; use the average
1695 + of the absolute values of the elements in the row as the di-
1696 + visor. (See page 47 of the IML manual for subscript reduc-
1697 + tion operators [SAS Institute Inc., 1989].)
1698 + */
1699 +
1700 + D = abs(PM);
1701 + mrelerrR = max( abs( PM[,+] / D[,:] ) );
1702 + if mrelerrR = 0 then ndigitsR = decacc;
1703 + else ndigitsR = -log10(mrelerrR);
1704 +
1705 + /*
1706 + Fifth, compute the maximum absolute relative difference be-
1707 + tween corresponding elements in PM and PM-squared as a meas-
1708 + ure of how close PM is to being idempotent.
1709 + */
1710 +
1711 + mrelerrI = max( abs( (PM - (PM * PM)) / PMD ) );
1712 + if mrelerrI = 0 then ndigitsI = decacc;
1713 + else ndigitsI = -log10(mrelerrI);
1714 +
1715 + /*
1716 + Print the computed numbers of digits of accuracy.
1717 + */
1718 +
1719 + print ndigitsS [format=5.1]
1720 + ndigitsR [format=5.1]
1721 + ndigitsI [format=5.1];
1722 +
1723 + end;
1723 + /* of if printss > 0 then do */
1724 +
1725 + /*
1726 +
1727 + SUBROUTINE NOTE FOR USERS OF EBCDIC COMPUTER SYSTEMS
1728 +
1729 + The NOT operator (^) above in this subroutine is the correct
1730 + operator for ASCII systems. If the subroutine is run on an
1731 + EBCDIC system (e.g., an IBM mainframe), you may have to
1732 + change each occurrence of ^ to the EBCDIC logical NOT opera-
1733 + tor, which looks like a minus sign with a short vertical bar
1734 + dropping down from the right end (and which is EBCDIC hexa-
1735 + decimal 5F).
1736 +
1737 +
1738 + SUBROUTINE REFERENCES
1739 +
1740 + Aldrich, J. 1998. Doing least squares: Perspectives from
1741 + Gauss and Yule. _International Statistical Review_ 66,
1742 + 61-81.
1743 +
1744 + Graybill, F. A. 1983. _Matrices with Applications in
1745 + Statistics_ 2d ed. Belmont, CA: Wadsworth.
1746 +
1747 + Harville, D. A. 1997. _Matrix Algebra From a Statistician's
1748 + Perspective._ New York: Springer-Verlag.
1749 +
1750 + Hocking, R. R. 1985. _The Analysis of Linear Models._
1751 + Monterey, CA: Brooks/Cole.
1752 +
1753 + Macnaughton, D. B. 1997. Which sums of squares are best in
1754 + unbalanced analysis of variance. Available at
1755 + http://www.matstat.com/ss/
1756 +
1757 + Macnaughton, D. B. 1998a. PR0139.HTM: Computing numerator
1758 + sums of squares in unbalanced analysis of variance: Two-
1759 + way case. Available in the "Computer Programs" section
1760 + at http://www.matstat.com/ss/
1761 +
1762 + Macnaughton, D. B. 1998b. PR0165.HTM: Computing numerator
1763 + sums of squares in unbalanced analysis of variance:
1764 + Three-way case. Available in the "Computer Programs"
1765 + section at http://www.matstat.com/ss/
1766 +
1767 + SAS Institute Inc. 1989. _SAS/IML Software: Usage and Refer-
1768 + ence, Version 6, First Edition._ Cary, NC: author.
1769 +
1770 + Searle, S. R. 1987. _Linear Models for Unbalanced Data._
1771 + New York: Wiley.
1772 +
1773 + Sunwoo, H., and B. C. Kim. 1997. Analysis of the unbalanced
1774 + linear model based on the balanced model. _Journal of
1775 + Statistical Computation and Simulation_ 56, 373-385.
1776 +
1777 + Yates, F. 1934. The analysis of multiple classifications
1778 + with unequal numbers in the different classes. _Journal
1779 + of the American Statistical Association_ 29, 51-66.
1780 + */
1781 +
1782 +
1783 + finish SS;
NOTE: Module SS defined.
1783 + /* end of subroutine SS
1784 + version of June 19, 1998
1785 + ***********************************************************/
NOTE: %INCLUDE (level 1) ending.
1786
1787
1788 /*
1789 PRELIMINARY STEP 6:
1790 SET THE VALUES OF THE THREE SECONDARY ARGUMENTS
1791 OF THE SS SUBROUTINE
1792
1793 The values of the three secondary arguments of the SS subrou-
1794 tine are set immediately below. These arguments control de-
1795 tails of how the subroutine computes sums of squares and prints
1796 results. The values set below are used on every call to the
1797 subroutine in this program. The values instruct the subroutine
1798 to
1799 - use the first method of computing sums of squares
1800 - print the value of the computed sum of squares and
1801 - print intermediate results (showing the transposed versions
1802 of the matrix H and the vector yp).
1803 */
1804
1805 level = 1;
1806 printss = 1;
1807 printh = 2;
1808
1809
1810 /*
1811 COMPUTATION STEP 3: CALL SS
1812
1813 Now that we have made the SS subroutine known to IML, we can
1814 use it. Recall that we defined y, XE and XR above to allow us
1815 to compute the HTO sum of squares for the A (soil type) main
1816 effect in the Searle data. Thus we are now ready to actually
1817 call the SS subroutine to do the computation. But just before
1818 we do, note that Searle's exact answer for the HTO sum of
1819 squares for the A main effect (1987, 113, 114, 122 [typo]) is
1820
1821 83 127/141 = 83.90070 92198 58156 0
1822
1823 Here is the statement to call the SS subroutine to compute and
1824 print the HTO sum of squares for the A main effect:
1825 */
1826
1827 call SS(result, y, XE, XR, level,printss,printh);
RESULT
83.900709219858100
Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:
#TEM1001
1.1 1.1 1.1 .67 .67 1.2 1.2 -.9 -.9 -.9 -.9 -1 -.8 -.8 -.8
Projection matrix PM = H * ginv(H):
PM
.09 .09 .09 .05 .05 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.09 .09 .09 .05 .05 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.09 .09 .09 .05 .05 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.05 .05 .05 .03 .03 .06 .06 0 0 0 0 -.1 0 0 0
.05 .05 .05 .03 .03 .06 .06 0 0 0 0 -.1 0 0 0
0.1 0.1 0.1 .06 .06 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
0.1 0.1 0.1 .06 .06 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
-.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .08 .05 .05 .05
-.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .08 .05 .05 .05
-.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .08 .05 .05 .05
-.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .08 .05 .05 .05
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .08 .08 .08 .08 .12 .07 .07 .07
-.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .07 .04 .04 .04
-.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .07 .04 .04 .04
-.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .07 .04 .04 .04
Transpose of projection of y: yp = PM * y:
#TEM1001
-3 -3 -3 -2 -2 -3 -3 2.1 2.1 2.1 2.1 3.2 1.9 1.9 1.9
NDIGITSS NDIGITSR NDIGITSI
15.7 15.3 15.7
1828 /*
1829 The SS subroutine generated the twenty-seven printed lines
1830 above. First it printed the computed sum of squares, 83.9....
1831 Note how this value is very close to Searle's exact answer: the
1832 difference occurs in the fifteenth significant digit (which, if
1833 it is rounded, should be 2 instead of 1).
1834
1835 Next, the subroutine printed the transposed version of the hy-
1836 pothesis matrix H that was used to compute the sum of squares.
1837 (This matrix was generated solely from XE and XR, which were,
1838 in turn, generated solely from the values of the two predictor
1839 variables. That is, the response variable, y, played no role
1840 in generating H.)
1841
1842 Next, the subroutine printed the projection matrix PM. This
1843 matrix was generated directly from the hypothesis matrix H and
1844 is simply a different way of stating the hypothesis being
1845 tested.
1846
1847 (The zeros shown above in PM are not true zeros; they are shown
1848 as zeros only because the SAS printing algorithm must round the
1849 values to fit the narrow print fields.)
1850
1851 Note that PM is symmetric (i.e., PM equals its own transpose)
1852 and the values in the matrix change in step with the changes in
1853 the treatment groups in the experiment, as defined by the dif-
1854 ferent values in the a and b vectors specified earlier.
1855
1856 Next, the subroutine printed the vector yp, which is the pro-
1857 jection of the vector y obtained when y is multiplied by PM.
1858 The computed sum of squares (i.e., 83.9...) was computed by
1859 summing the squared elements of yp.
1860
1861 To help us judge the accuracy of the computed sum of squares,
1862 the subroutine concluded by printing the three numbers
1863 NDIGITSS, NDIGITSR, and NDIGITSI. These numbers are measures
1864 of the integrity of the projection matrix -- the smallest of
1865 these numbers (i.e., 15.3) gives a rough indication of the
1866 ceiling of the number of digits of accuracy in the numbers in
1867 the projection matrix. Since IML maintains a precision of
1868 roughly sixteen floating point decimal digits on my (Windows)
1869 computer, the maximum possible number of digits of accuracy on
1870 my computer is 16. (For curious readers, I describe the three
1871 numbers further in the comments near the end of the SS subrou-
1872 tine.)
1873
1874 Since the three measures of the integrity of the projection ma-
1875 trix suggest that the projection matrix is accurate to roughly
1876 fifteen significant digits, and since an accuracy of a sum of
1877 squares to four significant digits is fully adequate for com-
1878 puting a p-value, we can be confident that the computed sum of
1879 squares is sufficiently accurate.
1880
1881
1882 COMPUTE THE REMAINING SIX SUMS OF SQUARES
1883
1884 Following are steps to compute six other sums of squares for
1885 the Searle data. Note that each case requires only the follow-
1886 ing three lines of code:
1887 - a line to specify the effect being tested (via XE)
1888 - a line to specify the other terms in the two models (via XR)
1889 - a call to the subroutine (with the CALL statement).
1890
1891 In each case I first state the two model equations whose resid-
1892 ual sums of squares are being differenced. I state the models
1893 in an abbreviated form, omitting the subscripts on the terms
1894 and omitting the error term.
1895
1896
1897 SUM OF SQUARES FOR THE HTO B (SEED VARIETY) MAIN EFFECT
1898
1899 The two model equations for computing the HTO B sum of squares
1900 are
1901 y = m + a
1902 y = m + a + b.
1903
1904 Thus to compute the HTO B sum of squares we specify XE as
1905 BDESIGN -- the submatrix of the design matrix for main effect
1906 B.
1907 */
1908
1909 XE = Bdesign;
1910
1911 /*
1912 We specify XR as the horizontal concatenation of J(15,1) and
1913 ADESIGN, which are the submatrices for m and a, which are the
1914 other terms (excluding the error term) on the right side of the
1915 two model equations.
1916 */
1917
1918 XR = j(15,1) || Adesign;
1919
1920 /*
1921 Searle's exact answer for HTO B sum of squares (1987, 104, 113,
1922 114, 122) is
1923
1924 124 69/94 = 124.73404 25531 91489 4
1925 */
1926
1927 call SS(result, y, XE, XR, level,printss,printh);
RESULT
124.734042553191000
Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:
#TEM1001
.86 .86 .86 -.1 -.1 -1 -1 .88 .88 .88 .88 -.1 -1 -1 -1
0 0 0 1 1 -1 -1 .25 .25 .25 .25 1.3 -.8 -.8 -.8
Projection matrix PM = H * ginv(H):
PM
.09 .09 .09 -.1 -.1 0 0 .07 .07 .07 .07 -.1 -.1 -.1 -.1
.09 .09 .09 -.1 -.1 0 0 .07 .07 .07 .07 -.1 -.1 -.1 -.1
.09 .09 .09 -.1 -.1 0 0 .07 .07 .07 .07 -.1 -.1 -.1 -.1
-.1 -.1 -.1 .22 .22 -.1 -.1 0 0 0 0 .27 0 0 0
-.1 -.1 -.1 .22 .22 -.1 -.1 0 0 0 0 .27 0 0 0
0 0 0 -.1 -.1 .16 .16 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
0 0 0 -.1 -.1 .16 .16 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
.07 .07 .07 0 0 -.1 -.1 .07 .07 .07 .07 0 -.1 -.1 -.1
.07 .07 .07 0 0 -.1 -.1 .07 .07 .07 .07 0 -.1 -.1 -.1
.07 .07 .07 0 0 -.1 -.1 .07 .07 .07 .07 0 -.1 -.1 -.1
.07 .07 .07 0 0 -.1 -.1 .07 .07 .07 .07 0 -.1 -.1 -.1
-.1 -.1 -.1 .27 .27 -.1 -.1 0 0 0 0 .33 -.1 -.1 -.1
-.1 -.1 -.1 0 0 .13 .13 -.1 -.1 -.1 -.1 -.1 .12 .12 .12
-.1 -.1 -.1 0 0 .13 .13 -.1 -.1 -.1 -.1 -.1 .12 .12 .12
-.1 -.1 -.1 0 0 .13 .13 -.1 -.1 -.1 -.1 -.1 .12 .12 .12
Transpose of projection of y: yp = PM * y:
#TEM1001
-3 -3 -3 5.1 5.1 -.9 -.9 -2 -2 -2 -2 6.1 .19 .19 .19
NDIGITSS NDIGITSR NDIGITSI
15.7 15.0 15.5
1928
1929
1930 /*
1931 SUM OF SQUARES FOR THE HTI (HIGHER-LEVEL TERMS INCLUDED)
1932 A MAIN EFFECT
1933
1934 HTI sums of squares operate by including higher-level interac-
1935 tion terms in the two model equations whose residual sums of
1936 squares are differenced. Thus the two model equations for com-
1937 puting the HTI A sum of squares are
1938 y = m + b + p
1939 y = m + a + b + p.
1940 Note the appearance of the p (interaction) term in both equa-
1941 tions. This term was omitted above in the computation of the
1942 two HTO sums of squares. Recall that the submatrix of the de-
1943 sign matrix for the interaction is ABdesign.
1944
1945 Searle's exact answer for this sum of squares (1987, 91) is
1946
1947 123 27/35 = 123.77142 85714 28571 4
1948 */
1949
1950 XE = Adesign;
1951 XR = j(15,1) || Bdesign || ABdesign;
1952 call SS(result, y, XE, XR, level,printss,printh);
RESULT
123.771428571428000
Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:
#TEM1001
.69 .69 .69 1 1 1 1 -.5 -.5 -.5 -.5 -2 -.7 -.7 -.7
Projection matrix PM = H * ginv(H):
PM
.04 .04 .04 .06 .06 .06 .06 0 0 0 0 -.1 0 0 0
.04 .04 .04 .06 .06 .06 .06 0 0 0 0 -.1 0 0 0
.04 .04 .04 .06 .06 .06 .06 0 0 0 0 -.1 0 0 0
.06 .06 .06 .09 .09 .09 .09 0 0 0 0 -.2 -.1 -.1 -.1
.06 .06 .06 .09 .09 .09 .09 0 0 0 0 -.2 -.1 -.1 -.1
.06 .06 .06 .09 .09 .09 .09 0 0 0 0 -.2 -.1 -.1 -.1
.06 .06 .06 .09 .09 .09 .09 0 0 0 0 -.2 -.1 -.1 -.1
0 0 0 0 0 0 0 .02 .02 .02 .02 .09 .03 .03 .03
0 0 0 0 0 0 0 .02 .02 .02 .02 .09 .03 .03 .03
0 0 0 0 0 0 0 .02 .02 .02 .02 .09 .03 .03 .03
0 0 0 0 0 0 0 .02 .02 .02 .02 .09 .03 .03 .03
-.1 -.1 -.1 -.2 -.2 -.2 -.2 .09 .09 .09 .09 .34 .11 .11 .11
0 0 0 -.1 -.1 -.1 -.1 .03 .03 .03 .03 .11 .04 .04 .04
0 0 0 -.1 -.1 -.1 -.1 .03 .03 .03 .03 .11 .04 .04 .04
0 0 0 -.1 -.1 -.1 -.1 .03 .03 .03 .03 .11 .04 .04 .04
Transpose of projection of y: yp = PM * y:
#TEM1001
-2 -2 -2 -3 -3 -3 -3 1.6 1.6 1.6 1.6 6.5 2.2 2.2 2.2
NDIGITSS NDIGITSR NDIGITSI
16.0 16.0 15.7
1953
1954
1955 /*
1956 SUM OF SQUARES FOR THE HTI B MAIN EFFECT
1957
1958 The two models for computing the HTI B sum of squares are
1959 y = m + a + p
1960 y = m + a + b + p.
1961
1962 Searle does not give an exact answer for this sum of squares.
1963 The SAS Type III sum of squares for the B main effect is
1964
1965 192.12765 957
1966 */
1967
1968 XE = Bdesign;
1969 XR = j(15,1) || Adesign || ABdesign;
1970 call SS(result, y, XE, XR, level,printss,printh);
RESULT
192.127659574468000
Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:
#TEM1001
.98 .98 .98 -.1 -.1 -1 -1 .73 .73 .73 .73 -.2 -.9 -.9 -.9
.17 .17 .17 .77 .77 -1 -1 .13 .13 .13 .13 1.5 -.7 -.7 -.7
Projection matrix PM = H * ginv(H):
PM
0.1 0.1 0.1 -.1 -.1 -.1 -.1 .07 .07 .07 .07 -.1 -.1 -.1 -.1
0.1 0.1 0.1 -.1 -.1 -.1 -.1 .07 .07 .07 .07 -.1 -.1 -.1 -.1
0.1 0.1 0.1 -.1 -.1 -.1 -.1 .07 .07 .07 .07 -.1 -.1 -.1 -.1
-.1 -.1 -.1 .14 .14 -.1 -.1 0 0 0 0 .27 0 0 0
-.1 -.1 -.1 .14 .14 -.1 -.1 0 0 0 0 .27 0 0 0
-.1 -.1 -.1 -.1 -.1 0.2 0.2 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
-.1 -.1 -.1 -.1 -.1 0.2 0.2 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
.07 .07 .07 0 0 -.1 -.1 .06 .06 .06 .06 -.1 0 0 0
.07 .07 .07 0 0 -.1 -.1 .06 .06 .06 .06 -.1 0 0 0
.07 .07 .07 0 0 -.1 -.1 .06 .06 .06 .06 -.1 0 0 0
.07 .07 .07 0 0 -.1 -.1 .06 .06 .06 .06 -.1 0 0 0
-.1 -.1 -.1 .27 .27 -.1 -.1 -.1 -.1 -.1 -.1 .54 -.1 -.1 -.1
-.1 -.1 -.1 0 0 .13 .13 0 0 0 0 -.1 .09 .09 .09
-.1 -.1 -.1 0 0 .13 .13 0 0 0 0 -.1 .09 .09 .09
-.1 -.1 -.1 0 0 .13 .13 0 0 0 0 -.1 .09 .09 .09
Transpose of projection of y: yp = PM * y:
#TEM1001
-3 -3 -3 4.8 4.8 .16 .16 -2 -2 -2 -2 9.5 .11 .11 .11
NDIGITSS NDIGITSR NDIGITSI
15.6 14.9 15.4
1971
1972
1973 /*
1974 SUM OF SQUARES FOR THE SEQUENTIAL A MAIN EFFECT
1975 WHEN A IS ENTERED FIRST
1976
1977 The two models for computing the sequential A sum of squares
1978 when A is entered first are
1979 y = m
1980 y = m + a.
1981
1982 (When A is entered SECOND [i.e., after B], the sequential sum
1983 of squares for A is the same as the HTO sum of squares for A,
1984 which is computed above.)
1985
1986 Searle's exact answer for this sum of squares (1987, 93, 113,
1987 114, 122) is
1988
1989 52 1/2
1990 */
1991
1992 XE = Adesign;
1993 XR = j(15,1);
1994 call SS(result, y, XE, XR, level,printss,printh);
RESULT
52.500000000000000
Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:
#TEM1001
1.1 1.1 1.1 1.1 1.1 1.1 1.1 -.9 -.9 -.9 -.9 -.9 -.9 -.9 -.9
Projection matrix PM = H * ginv(H):
PM
.08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
.08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
-.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06
Transpose of projection of y: yp = PM * y:
#TEM1001
-2 -2 -2 -2 -2 -2 -2 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8
NDIGITSS NDIGITSR NDIGITSI
16.0 15.2 16.0
1995
1996
1997 /*
1998 SUM OF SQUARES FOR THE SEQUENTIAL B MAIN EFFECT
1999 WHEN B IS ENTERED FIRST
2000
2001 The two models for computing the sequential B sum of squares
2002 when B is entered first are
2003 y = m
2004 y = m + b.
2005
2006 (When B is entered SECOND [i.e., after A], the sequential sum
2007 of squares for B is the same as the HTO sum of squares for B,
2008 which is computed above.)
2009
2010 Searle's exact answer for this sum of squares (1987, 95, 113,
2011 114, 122 [typo]) is
2012
2013 93 1/3
2014 */
2015
2016 XE = Bdesign;
2017 XR = j(15,1);
2018 call SS(result, y, XE, XR, level,printss,printh);
RESULT
93.333333333333300
Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:
#TEM1001
.87 .87 .87 -.1 -.1 -1 -1 .87 .87 .87 .87 -.1 -1 -1 -1
.13 .13 .13 1.1 1.1 -.9 -.9 .13 .13 .13 .13 1.1 -.9 -.9 -.9
Projection matrix PM = H * ginv(H):
PM
.08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1
.08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1
.08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1
-.1 -.1 -.1 .27 .27 -.1 -.1 -.1 -.1 -.1 -.1 .27 -.1 -.1 -.1
-.1 -.1 -.1 .27 .27 -.1 -.1 -.1 -.1 -.1 -.1 .27 -.1 -.1 -.1
-.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
-.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
.08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1
.08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1
.08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1
.08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1
-.1 -.1 -.1 .27 .27 -.1 -.1 -.1 -.1 -.1 -.1 .27 -.1 -.1 -.1
-.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
-.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
-.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13
Transpose of projection of y: yp = PM * y:
#TEM1001
-2 -2 -2 4.7 4.7 0 0 -2 -2 -2 -2 4.7 0 0 0
NDIGITSS NDIGITSR NDIGITSI
15.4 15.1 15.4
2019
2020
2021 /*
2022 SUM OF SQUARES FOR THE A x B INTERACTION EFFECT
2023
2024 The two models for computing the A x B interaction sum of
2025 squares are
2026 y = m + a + b
2027 y = m + a + b + p.
2028
2029 Note that the sum of squares for the highest-level interaction
2030 in an experiment is the same under all the standard approaches
2031 to computing analysis of variance sums of squares.
2032
2033 Searle's exact answer for this sum of squares (1987, 113, 114)
2034 is
2035
2036 222 36/47 = 222.76595 74468 08510 6
2037 */
2038
2039 XE = ABdesign;
2040 XR = j(15,1) || Adesign || Bdesign;
2041 call SS(result, y, XE, XR, level,printss,printh);
RESULT
222.765957446808000
Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:
#TEM1001
.98 .98 .98 -.1 -.1 -1 -1 -.7 -.7 -.7 -.7 .19 .91 .91 .91
.17 .17 .17 .77 .77 -1 -1 -.1 -.1 -.1 -.1 -2 .68 .68 .68
Projection matrix PM = H * ginv(H):
PM
0.1 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .11 .06 .06 .06
0.1 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .11 .06 .06 .06
0.1 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .11 .06 .06 .06
-.1 -.1 -.1 .14 .14 -.1 -.1 .04 .04 .04 .04 -.3 .04 .04 .04
-.1 -.1 -.1 .14 .14 -.1 -.1 .04 .04 .04 .04 -.3 .04 .04 .04
-.1 -.1 -.1 -.1 -.1 0.2 0.2 .07 .07 .07 .07 .11 -.1 -.1 -.1
-.1 -.1 -.1 -.1 -.1 0.2 0.2 .07 .07 .07 .07 .11 -.1 -.1 -.1
-.1 -.1 -.1 .04 .04 .07 .07 .06 .06 .06 .06 -.1 0 0 0
-.1 -.1 -.1 .04 .04 .07 .07 .06 .06 .06 .06 -.1 0 0 0
-.1 -.1 -.1 .04 .04 .07 .07 .06 .06 .06 .06 -.1 0 0 0
-.1 -.1 -.1 .04 .04 .07 .07 .06 .06 .06 .06 -.1 0 0 0
.11 .11 .11 -.3 -.3 .11 .11 -.1 -.1 -.1 -.1 .54 -.1 -.1 -.1
.06 .06 .06 .04 .04 -.1 -.1 0 0 0 0 -.1 .09 .09 .09
.06 .06 .06 .04 .04 -.1 -.1 0 0 0 0 -.1 .09 .09 .09
.06 .06 .06 .04 .04 -.1 -.1 0 0 0 0 -.1 .09 .09 .09
Transpose of projection of y: yp = PM * y:
#TEM1001
-1 -1 -1 -4 -4 5.9 5.9 .93 .93 .93 .93 8.1 -4 -4 -4
NDIGITSS NDIGITSR NDIGITSI
15.7 15.2 15.6
2042
2043
2044 /*
2045 SAVE THE DATA IN A SAS DATASET
2046
2047 The next three IML statements create a SAS dataset (called
2048 "Searle_1") and then transfer the values of vectors a, b, and y
2049 to variables with the same names in the dataset. This enables
2050 us to compute all the sums of squares for the Searle data with
2051 SAS GLM.
2052 */
2053
2054 create Searle_1 var {a b y};
2055 append;
2056 close Searle_1;
NOTE: The data set WORK.SEARLE_1 has 15 observations and 3 variables.
2057
2058
2059 /*
2060 QUIT FROM IML
2061 */
2062
2063 quit;
Exiting IML.
NOTE: The PROCEDURE IML used 27.57 seconds.
2064
2065
2066 /*
2067 RUN PROC GLM TWICE
2068
2069 The following statements run SAS GLM on the data to enable com-
2070 paring the above IML output with output from an analysis of the
2071 data by GLM. (The output from GLM comes in a separate output
2072 file.)
(Note: The output appears below after the end of this log.)
2073
2074   Examination of the GLM output reveals that all the GLM sums of
2075   squares are identical in all available digits to the sums of
2076   squares produced by the SS subroutine.
2077   */
2078
2079   title 'IML/GLM 2 x 3 Unbalanced ANOVA, Searle (1987, 79)';
2080   options nodate linesize=80 probsig=2;
2081
2082   proc glm data=Searle_1;
2083     class a b;
2084     model y = a | b / ss1 ss2 ss3;
2085   quit;
NOTE: The PROCEDURE GLM used 3.06 seconds.
2086
2087   /*
2088   Run GLM a second time with a and b reversed in the model state-
2089   ment to get the sequential (Type I) sum of squares for B when B
2090   is entered first into the model.
2091   */
2092
2093   proc glm data=Searle_1;
2094     class a b;
2095     model y = b | a / ss1 ss2 ss3;
2096   quit;
NOTE: The PROCEDURE GLM used 1.32 seconds.
2097
2098   options date linesize=80 probsig=2;
2099   title;
2100
2101
2102   /*
2103   SUMMARY
2104
2105   This program illustrates three approaches to computing numera-
2106   tor sums of squares in unbalanced analysis of variance: HTO,
2107   HTI, and sequential.  The three approaches are specified in
2108   terms of computing the difference between the residual sums of
2109   squares of two overparameterized model equations.  The two mod-
2110   els are specified in terms of two matrices, XE and XR, which
2111   are made up of submatrices of the overall (full-column-rank)
2112   design matrix for the experiment.  XE is the submatrix for the
2113   effect being tested.  XR is the horizontal concatenation of the
2114   submatrices for all the other terms (except the error term) on
2115   the right side of the two model equations.
2116
2117   Note the conceptual economy of the approach:  After the data
2118   are known to IML, the submatrices of the design matrix can be
2119   specified with one simple statement per submatrix (using the
2120   DESIGNF and HDIR functions).  It takes just two more statements
2121   to specify (via XE and XR) a particular sum of squares to be
2122   computed.  And it takes only four more reusable statements (in
2123   the SS subroutine) to compute any standard numerator sum of
2124   squares.
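
       To make the economy concrete, here is a minimal sketch (not
       part of the program log above) of the full sequence of
       statements, assuming the vectors a, b, and y for a two-way
       layout have already been defined in IML.  The sketch uses
       the generalized-inverse (LEVEL = 2) form of the four state-
       ments in the SS subroutine and computes the HTO sum of
       squares for the A main effect:

          Adesign  = designf(a);              /* submatrix for A     */
          Bdesign  = designf(b);              /* submatrix for B     */
          ABdesign = hdir(Adesign, Bdesign);  /* submatrix for A x B */

          XE = Adesign;                       /* effect being tested */
          XR = j(nrow(y),1) || Bdesign;       /* other model terms   */

          H      = XE - XR * ginv(XR) * XE;   /* hypothesis matrix   */
          PM     = H * ginv(H);               /* projection matrix   */
          yp     = PM * y;                    /* projection of y     */
          result = ssq(yp);                   /* numerator SS        */

       The other sums of squares follow by changing only the XE and
       XR statements, as illustrated in the calls earlier in the
       program.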
2125
2126
2127   NOTES
2128
2129   If you wish to run this program on your computer, see the
2130   checklist in the appendix.
2131
2132   This program illustrates computation of numerator sums of
2133   squares in the two-way case -- i.e., the case with two dis-
2134   crete-valued predictor variables.  I discuss five approaches to
2135   computing analysis of variance numerator sums of squares in the
2136   three-way case in another program (1998).
2137
2138   I discuss in the paper (1997, sec. 17) the fact that statisti-
2139   cal tests provided by the HTO sums of squares are (in the rele-
2140   vant cases) generally more powerful than the corresponding
2141   tests provided by the HTI sums of squares.  This implies that
2142   if there is a relationship, an HTO sum of squares will usually
2143   be greater than the corresponding HTI sum of squares.  However,
2144   Searle's data twice illustrate the fact that an HTO sum of
2145   squares is not *always* greater than the corresponding HTI sum
2146   of squares.  Specifically, the HTO (SAS Type II) sum of squares
2147   for the A main effect in Searle's data is less than the HTI
2148   (SAS Type III) sum of squares for the same effect (83.9 versus
2149   123.8).  Similarly, the HTO sum of squares for the B main ef-
2150   fect in Searle's data is also less than the HTI sum of squares
2151   (124.7 versus 192.1).
2152
2153
2154   APPENDIX: STEPS TO RUN THIS PROGRAM
2155
2156   1. Ensure that the STAT and IML components of the SAS system
2157      are available on your computer.  Information about the SAS
2158      system is available at http://www.sas.com
2159
2160   2. Ensure that you have the source version of this program,
2161      which is called PR0139.SAS (not the HTML version, which is
2162      called PR0139.HTM).  You can obtain a copy of the source
2163      version in the "Computer Programs" section of the page at
2164      http://www.matstat.com/ss/
2165
2166   3. Install a copy of the SS subroutine on your computer.  This
2167      subroutine does the actual computations of sums of squares
2168      and is available from the above MatStat web page.
2169
2170   4. Edit the %INCLUDE statement in preliminary step 5 above to
2171      correctly point to the location of the SS.SAS subroutine
2172      file on your computer.  That is, change the
2173         D:\PROGS\SS.SAS
2174      in the statement to the location where SS.SAS is stored on
2175      your computer.
2176
2177   5. (Optional.)  Modify the two OPTIONS statements in the pro-
2178      gram that adjust the DATE, LINESIZE, and PROBSIG options.
2179
2180   6. Submit the program to SAS.
2181
2182
2183   REFERENCES
2184
2185   Fisher, R. A. 1925. _Statistical Methods for Research Workers._
2186      Edinburgh: Oliver and Boyd.  The 14th edition of this semi-
2187      nal work appears in Fisher (1990).
2188
2189   Fisher, R. A. 1935. _The Design of Experiments._ Edinburgh:
2190      Oliver and Boyd.  The 8th edition of this seminal work ap-
2191      pears in Fisher (1990).
2192
2193   Fisher, R. A. 1990. _Statistical Methods, Experimental Design,
2194      and Scientific Inference_ edited by J. H. Bennett.  Oxford:
2195      Oxford University Press.
2196
2197   Macnaughton, D. B. 1996a. The introductory statistics course:
2198      A new approach.  Available at http://www.matstat.com/teach/
2199
2200   Macnaughton, D. B. 1996b. The entity-property-relationship ap-
2201      proach to statistics: An introduction for students.  Avail-
2202      able at http://www.matstat.com/teach/
2203
2204   Macnaughton, D. B. 1997. Which sums of squares are best in un-
2205      balanced analysis of variance?  Available at
2206      http://www.matstat.com/ss/
2207
2208   Macnaughton, D. B. 1998. PR0165.HTM: Computing numerator sums
2209      of squares in unbalanced analysis of variance: Three-way
2210      case.  Available in the "Computer Programs" section at
2211      http://www.matstat.com/ss/
2212
2213   Searle, S. R. 1987. _Linear Models for Unbalanced Data._ New
2214      York: Wiley.
2215
2216   Yates, F. 1934. The analysis of multiple classifications with
2217      unequal numbers in the different classes. _Journal of the
2218      American Statistical Association_ 29, 51-66.
2219
2220      version of June 19, 1998
2221      (end of program pr0139.sas)  */
This is the end of the program log for the run of the program.
Following are the three analysis of variance tables that were output on the first of the two runs of PROC GLM:
Source    DF    Type I SS       Mean Square    F Value    Pr > F
A          1     52.50000000     52.50000000      3.94     .07851
B          2    124.73404255     62.36702128      4.68     .04048
A*B        2    222.76595745    111.38297872      8.35     .00889

Source    DF    Type II SS      Mean Square    F Value    Pr > F
A          1     83.90070922     83.90070922      6.29     .03339
B          2    124.73404255     62.36702128      4.68     .04048
A*B        2    222.76595745    111.38297872      8.35     .00889

Source    DF    Type III SS     Mean Square    F Value    Pr > F
A          1    123.77142857    123.77142857      9.28     .01386
B          2    192.12765957     96.06382979      7.20     .01355
A*B        2    222.76595745    111.38297872      8.35     .00889
Note that all the numbers given above in the "SS" column are identical in all available digits to the corresponding sums of squares computed earlier in the program by PROC IML.
(As most readers will know, the numbers in the "Mean Square" column are computed by dividing the sums of squares by their degrees of freedom. The F values are computed by dividing the mean squares by the residual mean square, which SAS gives earlier [not shown] as 13.33333333. The p-values, which SAS labels "Pr > F", are computed from the F values using the "cumulative distribution function" for the "central F-distribution". Assuming there is no reasonable alternative explanation for the results, the low p-value for the A × B interaction [i.e., .00889 in all three tables] provides good evidence that [1] a relationship exists between the response variable and predictor variables A and B and [2] the relationship is a two-way interaction.
(After discovering the interactive relationship, the researcher's next step is to study a graph of the mean values of the response variable for the six treatment groups to gain an understanding of the relationship. By convention, the response variable is plotted on the vertical axis of the graph, one of the predictor variables is plotted on the horizontal axis, and the values of the other predictor variable are reflected by different lines (or different-shaped points) on the graph. "Standard error of the mean" bars are sometimes plotted on such graphs to give the reader a sense of the stability of the mean value of the response variable in each of the treatment groups [cells].)
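As a hedged numeric check of the "Mean Square", "F Value", and "Pr > F" columns described above, the interaction row of the tables could be reproduced in IML as sketched below. The 9 residual degrees of freedom used here are inferred from the 15 observations and 6 cells; they are an assumption of this sketch because the portion of the GLM output that reports them is not reproduced above.

   ms_ab = 222.76595745 / 2;        /* interaction mean square      */
   f_ab  = ms_ab / 13.33333333;     /* F value, approximately 8.35  */
   p_ab  = 1 - probf(f_ab, 2, 9);   /* Pr > F, approximately .0089  */
   print ms_ab f_ab p_ab;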
Note the relationship between Type I and Type II sums of squares: the Type I sum of squares for B in the above tables is the same as the Type II sum of squares for B, but the Type I sum of squares for A is different from the Type II sum of squares for A.
Following are the three analysis of variance tables that were output on the second of the two runs of PROC GLM. This run had the same specifications as the first run except that the order of A and B was reversed in the MODEL statement. Comparison of the GLM output below with the GLM output above reveals that the only new value in the sums of squares column is the Type I value for B (i.e., 93.3...).
Source    DF    Type I SS       Mean Square    F Value    Pr > F
B          2     93.33333333     46.66666667      3.50     .07508
A          1     83.90070922     83.90070922      6.29     .03339
A*B        2    222.76595745    111.38297872      8.35     .00889

Source    DF    Type II SS      Mean Square    F Value    Pr > F
B          2    124.73404255     62.36702128      4.68     .04048
A          1     83.90070922     83.90070922      6.29     .03339
A*B        2    222.76595745    111.38297872      8.35     .00889

Source    DF    Type III SS     Mean Square    F Value    Pr > F
B          2    192.12765957     96.06382979      7.20     .01355
A          1    123.77142857    123.77142857      9.28     .01386
A*B        2    222.76595745    111.38297872      8.35     .00889
This is the end of the output from PR0139.SAS.