/* PR0165.SAS COMPUTING NUMERATOR SUMS OF SQUARES IN UNBALANCED ANALYSIS OF VARIANCE: Three-Way Case Donald B. Macnaughton donmac@matstat.com TABLE OF CONTENTS Introductory Comments - abstract - introduction Preliminary Steps - load the Searle data into a SAS dataset and print the data - start PROC IML and read the data from the SAS dataset into IML - generate the three main effect submatrices of the design matrix - generate the four interaction submatrices of the design matrix - make the SS subroutine available to the program - set the values of the three secondary arguments of the SS subroutine Compute the Twenty-Two Sums of Squares Using the SS Subroutine - compute the six sequential (SAS Type I) sums of squares - compute the sum of squares for the highest-level interac- tion - compute the six SAS Type II sums of squares - compute the six HTI = SAS Type III sums of squares - compute three HTO sums of squares - discuss the HTOS sums of squares Quit from IML Run PROC GLM to Compute Nineteen of the Sums of Squares (for comparison with the values generated above) Appendix: Steps to Run the Program References Output from PROC PRINT Output from PROC GLM ABSTRACT This SAS program illustrates a conceptual point of view and the matrix arithmetic for computing the following types of analysis of variance numerator sums of squares: - HTO (Higher-level Terms are Omitted) = SAS Type II in the two-way case = SPSS ANOVA Experimental - SAS Type II - HTOS (Higher-level Terms are Omitted unless Significant) = a superset of SAS Type II and HTO - HTI (Higher-Level Terms are Included) = SAS Type III = SPSS ANOVA UNIQUE = the default method in many analysis of variance programs - sequential = SAS Type I = SPSS ANOVA Hierarchical in the two-way case. The conceptual point of view is one of computing an analysis of variance sum of squares by computing the difference between the residual sums of squares of two model equations (Yates 1934). The matrix arithmetic is simple, and is specified directly in terms of the conceptual point of view. Computations are illustrated using data from a 4 x 3 x 2 unbal- anced experiment discussed by Shayle Searle (1987, 392). INTRODUCTION Discussion in this program is an extension of discussion in an- other program, which computes sums of squares for an unbalanced 2 x 3 experiment (Macnaughton 1998). In that program I note that most researchers use analysis of variance to obtain the resulting p-values, which are used to help in detecting rela- tionships between variables. In order to compute analysis of variance p-values, it is mathematically necessary to first com- pute certain "numerator sums of squares". The other program illustrates the three best-known conceptual methods (HTO, HTI, and sequential) for computing numerator sums of squares when there are two discrete-valued predictor vari- ables in the experiment. When there are three or more dis- crete-valued predictor variables, other methods of computing numerator sums of squares become available. The present pro- gram illustrates some of these methods. BEGINNING OF THE PROGRAM STATEMENTS (Note: If you wish to run this program on your computer, see the checklist in the appendix.) Define the title for the output and set SAS system options. */ title 'IML/GLM 4 x 3 x 2 Unbalanced ANOVA, Searle (1987, 392)'; options linesize=80 nodate probsig=2; /* Load the Searle data into a SAS dataset. */ data Searle_2 (keep = a b c y); input a b c n @; do i = 1 to n; input y @; output; end; cards; 1 1 1 1 10 1 1 2 2 4 6 1 2 1 2 3 5 1 2 2 3 2 3 7 1 3 1 3 1 2 3 1 3 2 3 4 5 9 2 1 1 2 5 9 2 1 2 1 8 2 2 1 1 5 2 2 2 2 6 8 2 3 1 2 6 10 2 3 2 1 2 3 1 1 4 2 3 3 4 3 1 2 3 3 4 8 3 2 1 3 3 4 8 3 2 2 1 4 3 3 1 1 5 3 3 2 1 6 4 1 1 1 4 4 1 2 3 5 7 9 4 2 1 4 1 1 3 7 4 2 2 1 19 4 3 1 1 8 4 3 2 2 20 24 ; /* Print the data. */ proc print data=Searle_2; run; /* BEGINNING OF THE IML COMPUTATIONS Start PROC IML and reset options that control the destination and appearance of the IML output. */ proc iml; reset log spaces=3; /* Read the data from the SAS dataset into IML. Each of the four variables in the dataset (i.e., a, b, c, and y) becomes a col- umn vector in IML. The column vectors inherit the names of the respective dataset variables. */ use Searle_2; read all var _all_; /* Compute the main-effect submatrices of the overall design ma- trix using the IML DESIGNF function. */ aD = designf(a); bD = designf(b); cD = designf(c); /* Compute the interaction submatrices of the overall design ma- trix using the IML HDIR (horizontal direct product) function. */ abD = hdir(aD,bD); acD = hdir(aD,cD); bcD = hdir(bD,cD); abcD = hdir(abD,cD); /* Include the subroutine that will be used to do the computa- tions, but don't print it. */ %include 'D:\PROGS\SS.SAS' / NOsource2; /* Set values of the three secondary arguments for the SS subrou- tine. The settings instruct the subroutine to - use the first method of computing sums of squares - print the values of the computed sums of squares, but - omit printing the intermediate results. */ level = 1; printss = 1; printh = 0; /* BEGINNING OF CALLS TO THE SS SUBROUTINE To facilitate comparisons, sums of squares are computed in this program in the order in which they appear in the output from SAS PROC GLM. This order does not reflect the order of impor- tance of the sums of squares since (as I shall discuss further in later material) the SAS Type I sums of squares have a seri- ous problem. (The SAS Type IV sums of squares are omitted because [as illus- trated in later GLM output from this program] they are identi- cal to the HTI (= SAS Type III) sums of squares for the Searle data. Type IV sums of squares differ from HTI sums of squares only in the infrequent case in which there are empty cells.) As I discuss in the program for the two-way case (1998), it is useful to view the computation of analysis of variance sums of squares in terms of the operation of computing the difference between the residual sums of squares of two overparameterized model equations. The two equations are specified in terms of two matrices, XE and XR. XE is the submatrix of the (full-column-rank) design matrix for the effect being tested. XR is the horizontal concatenation of the submatrices of the design matrix for all the other terms (excluding the error term) on the right side of the two model equations whose residual sums of squares we wish to difference. SEQUENTIAL (SAS TYPE I) SUMS OF SQUARES Sequential sums of squares are based on a chosen sequence for computing the sums of squares for the various effects. The se- quence begins with the main effects and works upward through the interactions. The sequence starts with only the constant term mu in the model, and terms are added to the model, one at a time, as in stepwise regression. Once a term is added to the model, it stays. (A problem with the sequential method is that it is not clear which particular sequence is best. For example, SAS and SPSS use different logical sequences for their sequential sum of squares.) The sums of squares computed in this section demonstrate the sequence used by SAS in computing the Type I sum of squares. For each sum of squares I show (in abbreviated form) the two model equations whose residual sums of squares are being dif- ferenced. You may wish to confirm that the definition of XE and XR in each case is consistent with the two model equations. Sequential A y = m y = m + a */ XE = aD; XR = J(48,1); call SS(result, y, XE, XR, level,printss,printh); /* Note how the value of RESULT given above is identical in all available digits with the Type I sum of squares for A in the GLM output generated later in this program. (Similarly, all the following sums of squares computed in this IML program are identical to the corresponding sums of squares [when avail- able] from SAS GLM.) The values NDIGITSS, NDIGITSR, and NDIGITSI are rough indica- tors of the number of digits of accuracy of the values in the "projection matrix", which I discuss in the other program (1998). Sequential B y = m + a y = m + a + b */ XE = bD; XR = J(48,1) || aD; call SS(result, y, XE, XR, level,printss,printh); /* Sequential A x B y = m + a + b y = m + a + b + ab */ XE = abD; XR = J(48,1) || aD || bD; call SS(result, y, XE, XR, level,printss,printh); /* Sequential C y = m + a + b + ab y = m + a + b + c + ab */ XE = cD; XR = J(48,1) || aD || bD || abD; call SS(result, y, XE, XR, level,printss,printh); /* Sequential A x C y = m + a + b + c + ab y = m + a + b + c + ab + ac */ XE = acD; XR = J(48,1) || aD || bD || cD || abD; call SS(result, y, XE, XR, level,printss,printh); /* Sequential B x C y = m + a + b + c + ab + ac y = m + a + b + c + ab + ac + bc */ XE = bcD; XR = J(48,1) || aD || bD || cD || abD || acD; call SS(result, y, XE, XR, level,printss,printh); /* A x B x C Note that the sum of squares for the three-way A x B x C inter- action is computed only once in the IML portion of this program because it is the same for all the types of sums of squares. y = m + a + b + c + ab + ac + bc y = m + a + b + c + ab + ac + bc + abc */ XE = abcD; XR = J(48,1) || aD || bD || cD || abD || acD || bcD; call SS(result, y, XE, XR, level,printss,printh); /* SAS TYPE II SUMS OF SQUARES Consider a definition An analysis of variance effect is *contained* in an- other effect if the name of the former effect can be obtained from the name of the latter by deleting terms. For example, the A effect is contained in the A x B effect be- cause the name of the A effect can be obtained by deleting the term B from the name of the A x B effect. The XR matrix for a SAS Type II sum of squares is the horizon- tal concatenation of - all the submatrices of the design matrix for terms in the model equation at the same level as the effect being tested plus - all the submatrices (if any) of the design matrix for terms at lower levels than the effect being tested plus - all the remaining submatrices of the design matrix whose ef- fects do NOT contain the effect being tested. For example, the XR matrix for Type II A effect is the horizon- tal concatenation of - the submatrices for the two effects at the same level as A -- i.e., B and C. - (no submatrices for effects at lower levels than A because a main effect is the lowest level possible) - the submatrix for the B x C interaction (since this interac- tion does not contain the A effect). Thus the sum of squares for the Type II A effect is computed as follows: Type II A y = m + b + c + bc y = m + a + b + c + bc */ XE = aD; XR = J(48,1) || bD || cD || bcD; call SS(result, y, XE, XR, level,printss,printh); /* Type II B y = m + a + c + ac y = m + a + b + c + ac */ XE = bD; XR = J(48,1) || aD || cD || acD; call SS(result, y, XE, XR, level,printss,printh); /* Type II A x B y = m + a + b + c + ac + bc y = m + a + b + c + ab + ac + bc */ XE = abD; XR = J(48,1) || aD || bD || cD || acD || bcD; call SS(result, y, XE, XR, level,printss,printh); /* Type II C y = m + a + b + ab y = m + a + b + c + ab */ XE = cD; XR = J(48,1) || aD || bD || abD; call SS(result, y, XE, XR, level,printss,printh); /* Type II A x C y = m + a + b + c + ab + bc y = m + a + b + c + ab + ac + bc */ XE = acD; XR = J(48,1) || aD || bD || cD || abD || bcD; call SS(result, y, XE, XR, level,printss,printh); /* Type II B x C y = m + a + b + c + ab + ac y = m + a + b + c + ab + ac + bc */ XE = bcD; XR = J(48,1) || aD || bD || cD || abD || acD; call SS(result, y, XE, XR, level,printss,printh); /* HTI = SAS Type III SUMS OF SQUARES The XR matrix for an HTI (Higher-level Terms Included) sum of squares always consists of the horizontal concatenation of sub- matrices for ALL the terms on the right side of the "saturated" version of the overparameterized model equation (except the er- ror term and except the term specified in the specification of XE). (The saturated version of a model equation is simply the model that contains all the possible terms.) Thus for the Searle data the XR matrix for an HTI sum of squares is a hori- zontal concatenation of seven matrices in each of the six calls to SS that follow. HTI A y = m + b + c + ab + ac + bc + abc y = m + a + b + c + ab + ac + bc + abc */ XE = aD; XR = J(48,1) || bD || cD || abD || acD || bcD || abcD; call SS(result, y, XE, XR, level,printss,printh); /* HTI B y = m + a + c + ab + ac + bc + abc y = m + a + b + c + ab + ac + bc + abc */ XE = bD; XR = J(48,1) || aD || cD || abD || acD || bcD || abcD; call SS(result, y, XE, XR, level,printss,printh); /* HTI A x B y = m + a + b + c + ac + bc + abc y = m + a + b + c + ab + ac + bc + abc */ XE = abD; XR = J(48,1) || aD || bD || cD || acD || bcD || abcD; call SS(result, y, XE, XR, level,printss,printh); /* HTI C Searle's exact answer (1987, 394) is 62 y = m + a + b + ab + ac + bc + abc y = m + a + b + c + ab + ac + bc + abc */ XE = cD; XR = J(48,1) || aD || bD || abD || acD || bcD || abcD; call SS(result, y, XE, XR, level,printss,printh); /* HTI A x C y = m + a + b + c + ab + bc + abc y = m + a + b + c + ab + ac + bc + abc */ XE = acD; XR = J(48,1) || aD || bD || cD || abD || bcD || abcD; call SS(result, y, XE, XR, level,printss,printh); /* HTI B x C Searle's exact answer (1987, 395) is 192(1678)/11505 = 28.00312 90743 15514 99 y = m + a + b + c + ab + ac + abc y = m + a + b + c + ab + ac + bc + abc */ XE = bcD; XR = J(48,1) || aD || bD || cD || abD || acD || abcD; call SS(result, y, XE, XR, level,printss,printh); /* HTO SUMS OF SQUARES The XR matrix for an HTO sum of squares consists of the hori- zontal concatenation of the submatrices of the design matrix for all the effects at the same level as and at lower levels than the effect being tested. That is, Higher-level Terms are Omitted (HTO). Note the similarity and difference between the HTO and the SAS Type II methods: In a two-way experiment the HTO and SAS Type II sums of squares are identical. But in the three-way case and higher there are differences. In the three-way case the differences are only in the main effects. That is, for an HTO main effect the submatrices for all the higher-level effects are omitted from XR. On the other hand, for a SAS Type II main effect the submatrices for higher-level effects that do not contain the effect being tested are included in XR. For example, the XR matrix for the SAS Type II A main effect (as discussed above in the Type II section) is XR = J(48,1) || bD || cD || bcD. But the XR matrix for the HTO A main effect is XR = J(48,1) || bD || cD. In cases in which there is no B x C interaction extant in the population, the HTO sum of squares provides a slightly more powerful statistical test of the A effect than the SAS Type II sum of squares. (In cases in which there *is* an extant B x C interaction, the HTO approach should be not used to test for the A effect because, as I shall demonstrate in later material, an extant B x C interaction can "contaminate" the HTO A statis- tical test.) Following are the three cases where the HTO sums of squares and the SAS Type II sums of squares differ. (The HTO sums of squares that appear below do not appear in the output from PROC GLM because GLM cannot directly compute HTO sums of squares.) HTO A y = m + b + c y = m + a + b + c */ XE = aD; XR = J(48,1) || bD || cD; call SS(result, y, XE, XR, level,printss,printh); /* HTO B y = m + a + c y = m + a + b + c */ XE = bD; XR = J(48,1) || aD || cD; call SS(result, y, XE, XR, level,printss,printh); /* HTO C y = m + a + b y = m + a + b + c */ XE = cD; XR = J(48,1) || aD || bD; call SS(result, y, XE, XR, level,printss,printh); /* HTOS SUMS OF SQUARES I discuss HTOS sums of squares in a paper (1997, appendix D). As with all the other types of sums of squares, the XE matrix for an HTOS sum of squares consists of the submatrix of the de- sign matrix for the effect being tested. Recall the definition of one effect containing another effect given in the discussion above of the SAS Type II sums of squares. The XR matrix for an HTOS sum of squares consists of the horizontal concatenation of - the submatrices for all the effects at the same level as the effect being tested (just like the HTO, Type II, and HTI sums of squares) - the submatrices for all the effects (if any) at lower levels than the effect being tested (just like the HTO, Type II, HTI, and sequential sum of squares). - the submatrices for non-containing higher-level interactions (just like the SAS Type II sums of squares) *if evidence of these interactions is found in the data*. Note that the HTOS approach is a conditional approach, with the formula for computing a sum of squares depending on whether evidence of higher-level non-containing interactions is found in the data. In the three-way case there are only three sums of squares in which the HTOS sums of squares are conceptually (but not compu- tationally) different from other sums of squares computed above, namely the three main-effect sums of squares. If evi- dence of a non-containing higher-level interaction is found, the HTOS sum of squares for a main effect is identical to the corresponding SAS Type II sums of squares, as discussed and computed above. But if no evidence of a non-containing inter- action is found, the HTOS sum of squares is identical to the corresponding HTO sum of squares, as also discussed and com- puted above. The HTOS approach has two useful features: 1. If we are testing an effect and there is evidence that a non-containing higher-level interaction exists in the data, the HTOS approach correctly takes account of the interaction in the computation, thereby ensuring that the statistical test is valid. 2. On the other hand, if there is no evidence that the non-con- taining interaction exists, the HTOS approach provides a valid statistical test for the existence of a relationship between the response variable and the relevant predictor variable(s). This test is slightly more powerful than the HTI and SAS Type II tests. I shall discuss the validity and power of this approach in later material. QUIT FROM PROC IML This ends the computation of sums of squares in PROC IML. Quit from IML. */ quit; /* GLM ANALYSIS Analyze the data with PROC GLM for comparison with the above output from IML. (The output from GLM comes in a separate out- put file.) Examination of the GLM output reveals that all the GLM sums of squares are identical in all available digits to the sums of squares computed above in IML. */ proc glm data=Searle_2; class a b c; model y = a | b | c / ss1 ss2 ss3 ss4; quit; options date linesize=80 probsig=2; title ' '; /* APPENDIX: STEPS TO RUN THIS PROGRAM 1. Ensure that the STAT and IML components of the SAS system are available on your computer. Information about the SAS system is available at http://www.sas.com 2. Ensure that you have the source version of this program, which is called PR0165.SAS (not the HTML version, which is called PR0165.HTM). You can obtain a copy of the source version in the "Computer Programs" section of the page at http://www.matstat.com/ss/ 3. Install a copy of the SS subroutine on your computer. This subroutine does the actual computations of sums of squares and is available at the above MatStat web site. 4. Edit the %INCLUDE statement above to correctly point to the location of the SS.SAS subroutine file on your computer. That is, change the D:\PROGS\SS.SAS in the statement to the location where SS.SAS is stored on your computer. 5. (Optional.) Modify the two OPTIONS statements in the pro- gram that set the DATE, LINESIZE, and PROBSIG options. 6. Submit the program to SAS. REFERENCES Macnaughton, D. B. 1997. Which sums of squares are best in un- balanced analysis of variance? Available at http://www.matstat.com/ss/ Macnaughton, D. B. 1998. PR0139.HTM: Computing numerator sums of squares in unbalanced analysis of variance: Two-way case). Available in the "Computer Programs" section at http://www.matstat.com/ss/ Searle, S. R. 1987. _Linear Models for Unbalanced Data._ New York: Wiley. Yates, F. 1934. The analysis of multiple classifications with unequal numbers in the different classes. _Journal of the American Statistical Association_ 29, 51-66. version of June 20, 1998 (end of program pr0165.sas) */