/* PR0139.SAS

   COMPUTING NUMERATOR SUMS OF SQUARES IN UNBALANCED ANALYSIS OF VARIANCE:
   Two-Way Case

   Donald B. Macnaughton
   donmac@matstat.com


                          TABLE OF CONTENTS

   Abstract
   Introduction
   - preliminary notes
   - introduction to the Searle example
   - a goal of experiments and analysis of variance
   - a controversy about sums of squares
   - model equations
   - the residual sum of squares of a model equation
   - an interpretation of analysis of variance sums of squares in terms of
     model equations
   - summing up
   - program overview
   Preliminary Steps
   - start PROC IML
   - define the Searle data
   - generate the main effect submatrices of the design matrix
   - generate the interaction submatrix of the design matrix
   - obtain the SS subroutine and list it
     - abstract
     - uses
     - main arguments: y, XE, and XR
     - method
     - details of the method
       - the hypothesis matrix H as a function of XE and XR
       - the projection matrix PM as a function of H
       - yp, the projection of y by PM
       - the sum of squares as the squared length of yp
       - other methods of computing the sum of squares
       - general comments
     - secondary arguments
     - notes
     - executable statements
     - references
   - set the values of the three secondary arguments of the SS subroutine
   Compute the Seven Sums of Squares Using the SS Subroutine
   - HTO A sum of squares
     - specify the matrix XE for the effect being tested
     - specify the matrix XR for the other effects in the two models
     - call SS to compute the sum of squares
   - HTO B sum of squares
   - HTI A sum of squares
   - HTI B sum of squares
   - sequential A sum of squares when A is entered first
   - sequential B sum of squares when B is entered first
   - interaction sum of squares
   Save the Data in a SAS Dataset and Quit from IML
   Compute the Sums of Squares using PROC GLM
     (for comparison with the values generated above)
   Summary
   Notes
   Appendix: Steps to Run the Program
   References
   Output from PROC GLM


                              ABSTRACT

This SAS program illustrates a conceptual point of view and the matrix
arithmetic for computing the following types of analysis of variance
numerator sums of squares:

- HTO (Higher-level Terms are Omitted) = SPSS ANOVA Experimental
  = SAS Type II in the two-way case
- HTI (Higher-level Terms are Included) = SAS Type III = SPSS ANOVA Unique
  = the default approach in many analysis of variance programs
- sequential = SAS Type I = SPSS ANOVA Hierarchical in the two-way case.

The conceptual point of view is one of computing an analysis of variance
sum of squares by computing the difference between the residual sums of
squares of two model equations (Yates 1934).  The matrix arithmetic is
simple and is specified directly in terms of the conceptual point of view.
The program is heavily annotated.  Computations are illustrated using data
from a 2 x 3 unbalanced experiment discussed by Shayle Searle (1987, 79).


PRELIMINARY NOTES

If you are not familiar with SAS, you can tell the difference between my
comments and the SAS program statements as follows:  My comments begin with
the two symbols /* and end with the two symbols  */   /*  Anything outside
these symbols (except blanks) is a program statement, which SAS will try to
execute.

To lay some groundwork, I begin the program not with SAS program statements
but instead with about 500 lines of my own comments.  These set the stage
for the later program statements.


INTRODUCTION TO THE SEARLE EXAMPLE

Analysis of variance is a broadly used method for analyzing the results of
scientific experiments.
The method was invented by Sir Ronald Aylmer Fisher (1925, 1935) and is
generally viewed as the most powerful and versatile available method for
scientifically inferring causation.

A current controversy in analysis of variance pertains to analyzing the data
from "unbalanced" experiments.  The controversy is important because (for
various reasons) a large proportion of real-world experiments end up being
unbalanced.

Shayle Searle addresses the controversy in his book LINEAR MODELS FOR
UNBALANCED DATA (1987), in which he discusses both mathematical and
philosophical issues.  Although I disagree with some of Professor Searle's
philosophical conclusions, I am in awe of his mathematical work.  It is with
deep respect that I offer the following analysis of an important example in
his book.

Searle's example is of an experiment to test whether "type of potting soil"
influences "time to germination" in three varieties of carrot seed (1987,
78-79).

(Searle's example is from the field of agriculture.  However, the discussion
is not limited to experiments in the field of agriculture.  Instead, both
Searle's discussion and my discussion apply to experiments in all fields of
empirical research.  In particular, the discussion applies to a majority of
the experiments in the physical, biological, and social sciences.)

Clearly, the response variable in Searle's experiment is "time to
germination", and the two predictor variables are "soil type" and "seed
variety".  Searle presents the following data as possible results of the
experiment:

                 TABLE 1
     Time in Days to First Germination
     of Three Varieties of Carrot Seed
    Grown in Two Different Potting Soils

        -------------------
                Seed
        Soil    Variety (B)
        Type    -----------
        (A)      1   2   3
        -------------------
         1       6  13  14
                10  15  22
                11
         2      12  31  18
                15       9
                19      12
                18
        -------------------

The first number in the body of the table (i.e., 6) indicates that in one of
the fifteen trials in the experiment it took six days for seeds of variety 1
to germinate when they were planted in soil of type 1.

Searle's experiment is unbalanced because, as the table shows, the number of
values of the response variable available in the various "cells" in the
experiment differs from cell to cell.  For example, three values of the
response variable are available in the (1,1) cell, but only two values are
available in the (1,2) cell.


A GOAL OF EXPERIMENTS AND ANALYSIS OF VARIANCE

When discussing analysis of variance, it is important to be aware of both
the goal of scientific experiments and the role of analysis of variance in
achieving the goal.  Otherwise, the discussion may become an arbitrary
mathematical exercise.  One useful way of characterizing the goal is

   The goal of experiments and analysis of variance is to obtain
   knowledge about relationships between variables.

In any scientific experiment, an important step in achieving this goal is to
determine, in as unequivocal a way as possible, whether a relationship
actually *exists* between the variables under study.  In particular, we wish
to determine whether the response variable in an experiment "depends" on one
or more of the predictor variables.  If a relationship is found between the
variables, a second goal is to determine the nature of the relationship.

Thus in the Searle data we wish to determine whether "time to germination"
depends on "soil type", or whether "time to germination" depends on "seed
variety".
It is invariably the case in experimental research that we wish to determine
whether the dependence exists in the *population* of entities under study,
not just in the sample of entities that participated in the experiment.  (In
Searle's example the entities are trials, or "carrot seed plantings".)

In an experiment with two predictor variables, the nature of the
relationship between the response variable and the predictor variables can
be either

- no (detectable) relationship or
- one or two "simple" relationships ("main effects") or
- an "interaction".

Interactions were invented by Fisher (1935, ch. VI).  Interactions provide a
comprehensive means for detecting any (detectable) form of relationship that
might exist between the response variable and the predictor variables in an
experiment.  In particular, interactions help us to detect complicated
relationships between the response variable and a predictor variable in
which the specific form of the relationship depends on the level of one or
more *other* predictor variables.  I give formal definitions of the concepts
of 'relationship between variables', 'interaction', and 'simple
relationship' in a paper (1997, sec. 6).

The use of analysis of variance to detect relationships between variables is
(at a high level) straightforward:  We submit the data summarizing the
results of an experiment (e.g., the data in table 1) to an analysis of
variance program, and the program computes a set of "p-values".  Assuming
the experiment was properly designed, the program provides a p-value for
each simple relationship (main effect) and a p-value for each interaction.
If the p-value for a main effect or interaction is low enough (and if there
is no reasonable alternative explanation), we conclude that the particular
relationship between variables associated with the p-value is extant in the
population of entities under study.

I discuss the general scientific study of relationships between variables
(including the notion of a p-value) further in two papers (1996a, 1996b).


A CONTROVERSY ABOUT SUMS OF SQUARES

To compute an analysis of variance p-value from the results of an
experiment, it is mathematically necessary to first compute certain "sums of
squares".  All analysis of variance programs compute these sums of squares
as an intermediate step in computing p-values.  Currently controversy exists
about how sums of squares should be computed.  Controversy exists about both
the numerator sum of squares and the denominator sum of squares used in the
"F-ratio" to compute a p-value.  The present discussion focuses exclusively
on computing numerator sums of squares for unbalanced experiments.  (Since
balanced experiments are an "internal" limiting case of unbalanced
experiments, the discussion below also applies to balanced experiments.)

This program illustrates the computations of the three best-known conceptual
methods for computing numerator sums of squares.  The program also
illustrates the mathematical aspects of computing numerator sums of squares
by illustrating a simple mathematical algorithm that can carry out all three
of the conceptual methods.

The purpose of this program is not to judge the various methods of computing
sums of squares, but rather to illustrate them.  Therefore, I make no
judgments here about the merits of the various methods.  However, I do make
judgments in the paper (1997), and I shall extend these judgments in
material I shall publish later.
MODEL EQUATIONS

To understand sums of squares, it is useful to understand the notion of a
"model equation" of the relationship between the response variable and the
predictor variables in an experiment.  Two types of model equations (often
called simply "models") are in use today:  "overparameterized" models and
"cell-means" models.  I discuss both types in the paper (1997, sec. 9).  The
following discussion is in terms of overparameterized models.

Consider a two-way experiment (e.g., the Searle experiment) that has
predictor variables A and B.  We can model the relationship between the
response variable, called y, and the predictor variables with the following
model equation:

   y(i,j,k) = mu + alpha(i) + beta(j) + phi(i,j) + e(i,j,k).       (1)

(Normally the i's, j's, and k's in the parentheses would be subscripts, and
the five terms on the right side of the equation would be Greek letters, but
these features are not yet available for comments in computer programs.)

The terms in the equation have the following interpretations:

   y(i,j,k) = value of the response variable for the kth entity in the
              (i,j) treatment group in the experiment
   mu       = grand mean of the values of the response variable for all
              the entities in the population
   alpha(i) = simple ("main") effect of predictor variable A on y when A
              is at level i
   beta(j)  = simple ("main") effect of predictor variable B on y when B
              is at level j
   phi(i,j) = the joint (interaction) effect of predictor variables A and
              B on y when they are at levels i and j respectively
   e(i,j,k) = the "error" term, which takes account of the variation in y
              that cannot be accounted for by the other four terms on the
              right side of the equation.

Model (1) gives us a succinct picture of how the value of the response
variable "depends" on the values of the two predictor variables, and how it
also depends on other unknown factors, which are taken account of by the
error term.  Model (1) is called the "saturated" model for a two-way
experiment because it contains all the possible terms for such an
experiment.

As we shall see, it often makes sense to use a reduced version of (1) in
which certain terms are omitted from the right side.  For example, if we
omit the interaction term, phi, we get

   y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k).                  (2)


THE RESIDUAL SUM OF SQUARES OF A MODEL EQUATION

Once we have the data from an experiment (e.g., the data in table 1), we can
"fit" various models to the data.  That is, we can use a mathematical
algorithm to compute values for the parameters associated with the terms in
the model [i.e., values for mu, the alpha(i)s, the beta(j)s, and the
phi(i,j)s].  The algorithm operates by choosing values for the parameters
associated with the four terms so that the resulting model gives the "best"
predictions of the values of the response variable y for all the values of y
obtained in the experiment.

The fitting of the terms is usually done by the method of least squares or
by the closely related method of maximum likelihood.  In the standard case,
both methods yield "identical" estimates for the values of the parameters
associated with the terms (excluding the error term e) on the right side of
the equation in the sense that the sum of the squared differences between
the values of the response variable predicted by the equation and the
*actual* values of the response variable in the experiment is the lowest
possible value.
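(As a rough illustration -- a minimal sketch that is not executed anywhere
in this program -- model (2) could be fitted by least squares in IML roughly
as follows, using the response vector y and the full-column-rank design
submatrices Adesign and Bdesign that are generated in the preliminary steps
below.  The names X2, betah, and yhat are made up for this sketch only:

   X2    = j(15,1) || Adesign || Bdesign;   * columns for mu, the alpha(i)s, and the beta(j)s;
   betah = solve(X2`*X2, X2`*y);            * least-squares estimates of the parameters;
   yhat  = X2*betah;                        * values of y predicted by the fitted model;

SOLVE and J are built-in IML functions; SOLVE can be used here because the
design matrix X2 has full column rank.)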
After we have fitted a model to the results of an experiment and obtained
estimates for the values of the parameters, we can then use the model to
compute the predicted value for each of the values of the response variable
obtained in the experiment.  Then we can subtract each predicted value from
the corresponding actual value to get a number called the "residual".  If we
square each of these residuals and add the squared residuals together, we
get the "residual sum of squares" for the model for the experimental data at
hand.

(Of course, the residual sum of squares is the number that was itself
minimized two paragraphs above in order to determine the estimates of the
values of the parameters -- a piece of mathematical bootstrapping that still
amazes me.)


AN INTERPRETATION OF ANALYSIS OF VARIANCE SUMS OF SQUARES IN TERMS OF MODEL
EQUATIONS

We can view all standard analysis of variance numerator sums of squares as
being the value we obtain if we subtract the residual sum of squares for one
model from the residual sum of squares for another model (Yates 1934, 63).
For example, consider the following two models for the results of a two-way
experiment:

   y(i,j,k) = mu + beta(j) + e(i,j,k)                              (3)
   y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k).                  (2)

I shall use the term Rn to denote the residual sum of squares for model (n).
Thus R3 denotes the residual sum of squares for (3).

Suppose we perform a two-way experiment (such as the Searle experiment), and
suppose we (separately) fit models (2) and (3) to the results of the
experiment, yielding R2 and R3.  If we subtract R2 from R3, this difference
is identical to the HTO (= SAS Type II = SPSS ANOVA Experimental) sum of
squares for the A main effect.

It stands to reason that the numerical difference R3 - R2 should equal a sum
of squares for the A main effect, since the only conceptual difference
between the two models is that (2) contains alpha, the term for predictor
variable A, and (3) does not.

Consider the following two models:

   y(i,j,k) = mu + beta(j) + phi(i,j) + e(i,j,k)                   (4)
   y(i,j,k) = mu + alpha(i) + beta(j) + phi(i,j) + e(i,j,k).       (1)

Note that (4) and (1) are the same as (3) and (2) respectively, except that
both (4) and (1) have an extra term, namely phi(i,j).  R4 - R1 is identical
to the HTI (= SAS Type III = SPSS ANOVA Unique) sum of squares for the A
main effect.

We can see the source of the names HTI and HTO by studying the two pairs of
models above.  For the HTI sum of squares the Higher-level interaction Term
[phi(i,j)] is Included (HTI) in both models.  For the HTO sum of squares the
Higher-level interaction Term is Omitted (HTO) from both models.

Finally, consider the following two models:

   y(i,j,k) = mu + e(i,j,k)                                        (5)
   y(i,j,k) = mu + alpha(i) + e(i,j,k).                            (6)

Note that (5) and (6) are the same as (3) and (2) respectively, except that
both (5) and (6) lack the term beta(j).  R5 - R6 is identical to the
sequential (= SAS Type I = SPSS ANOVA Hierarchical) sum of squares for the A
main effect when A is entered first (after the mean) into the model.

If we compute the difference between the residual sums of squares of two
models, the difference is called the "sum of squares for the effect being
tested".  The "effect" is the term that is present in one of the two models,
but absent from the other.

Each sum of squares has associated with it a number of "degrees of freedom".
(For any main effect, the number of degrees of freedom is one less than the
number of values assumed by the associated predictor variable in the
experiment.
For example, in the Searle data, predictor variable B assumes three
different values in the experiment, so the number of degrees of freedom for
B is two.  For an interaction, the number of degrees of freedom is the
product of the numbers of degrees of freedom for the main effects for each
of the predictor variables associated with the interaction.)

If we divide a sum of squares for a particular effect by its degrees of
freedom, we get the "mean square" for the effect.  Similarly, if we divide
the residual sum of squares for the saturated model by *its* degrees of
freedom, we get the residual mean square.

The reason we might *want* to compute any of these mean squares rests on
three facts:

- If there is *no* relationship in the population between the response
  variable and predictor variable A, clearly the correct value for alpha(i)
  is zero for all values of i in all the equations above.  In this case, it
  can be shown that the three mean squares for the A effect can be expected
  (under certain often-satisfiable assumptions) to equal the "residual
  variance" in the experiment.  (The residual variance is estimated by the
  residual mean square.)

- If there *is* a relationship between A and the response variable, the
  three mean squares for the A effect can usually be expected to be
  *greater* than the residual variance.

- Thus to determine whether there is evidence of a relationship between the
  response variable and predictor variable A, we need only determine whether
  the appropriate effect mean square is significantly greater than the
  residual mean square.  The p-value is simply an easy-to-interpret result
  of this determination.

These facts, buttressed by mathematical arguments (Searle 1987, sec. 8.6),
imply that the three approaches to computing sums of squares provide (with
certain limitations) valid candidates for the numerator sums of squares in
F-ratios used to compute p-values used to test for the existence of a
relationship between the response variable and predictor variable A.

(The technique I discuss above for detecting relationships between variables
is undeniably complex.  A questioning reader might wonder whether a simpler
technique [with reasonable power and objectivity] might be found.  So far,
no such technique has been found.)

The above discussion states that three popular types of numerator sums of
squares (HTO, HTI, and sequential) can be computed in a two-way experiment
by computing the difference between the residual sums of squares of two
model equations.  This generalizes:  All standard analysis of variance
numerator sums of squares for two-way, three-way, and higher experiments can
be viewed as reflecting the difference between the residual sums of squares
of two overparameterized model equations.
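(To make the "difference between two residual sums of squares" view
concrete, here is a rough sketch -- not executed anywhere in this program,
and not necessarily the method used inside the SS subroutine below -- that
computes the HTO sum of squares for A directly from models (3) and (2).  It
uses the vector y and the design submatrices Adesign and Bdesign that are
defined in the preliminary steps below, and a made-up helper module named
RSS:

   start RSS(y, X);               * residual sum of squares of the model with design matrix X;
      bhat = solve(X`*X, X`*y);   * least-squares parameter estimates;
      return ( ssq(y - X*bhat) ); * sum of the squared residuals;
   finish;

   R3 = RSS(y, j(15,1) || Bdesign);             * model (3): mu + beta;
   R2 = RSS(y, j(15,1) || Adesign || Bdesign);  * model (2): mu + alpha + beta;
   print (R3 - R2);               * HTO sum of squares for A -- about 83.90, as computed below;

SOLVE, SSQ, and J are built-in IML functions.)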
SUMMING UP

- an important use of analysis of variance is to help researchers detect
  relationships between variables in populations of entities
- we can detect relationships between variables by studying the p-values
  obtained by applying analysis of variance to the results of an experiment
- analysis of variance computer programs compute numerator sums of squares
  as an intermediate step in computing p-values
- controversy exists about which method of computing numerator sums of
  squares is preferred in unbalanced experiments
- one easy-to-understand way of viewing all standard approaches to computing
  numerator sums of squares is to view them as reflecting the difference
  between the residual sums of squares of two different overparameterized
  model equations for the relationship between the response variable and
  predictor variables in the experiment.


PROGRAM OVERVIEW

The program statements below demonstrate a simple algorithm for computing
analysis of variance numerator sums of squares.  The algorithm takes three
pieces of information as input:

1. a specification of the two model equations whose residual sums of squares
   we wish to "difference" in order to compute an analysis of variance
   numerator sum of squares [e.g., (3) and (2) above]
2. a specification of the layout of the experiment
3. the response variable data vector containing the values of the response
   variable obtained in the experiment.

The algorithm uses the input to compute the specified sum of squares.

To show that the algorithm works properly, the program computes seven
analysis of variance sums of squares for the Searle data given above.

The program below is organized into five parts:

1. a set of six preliminary steps to get things ready for computing the sums
   of squares
2. seven repetitions of three simple steps to compute the seven sums of
   squares
3. a recomputation of the seven sums of squares with SAS PROC GLM for
   comparison
4. a summary
5. notes, an appendix, and references.


PRELIMINARY STEP 1: START PROC IML AND RESET SOME OPTIONS

PROC IML (Interactive Matrix Language) is an easy-to-use computer language
for general matrix arithmetic.  It is an add-on component of the SAS system
and has many built-in functions to facilitate matrix operations particular
to statistical analysis.

The PROC IML statement below initiates the IML environment.  After that
statement is executed, the statements that follow are statements in the IML
language until we reach the QUIT statement, which takes us back into a
standard SAS program.  */

proc iml;

/* The options in the following RESET statement control the destination and
appearance of the IML output.  */

reset log fw=3 spaces=3;

/* PRELIMINARY STEP 2: DEFINE THE SEARLE DATA

To get the Searle data into IML, we define two column vectors (a and b) of
values for the two predictor variables, and we define a column vector y,
containing the values of the response variable.  Following are the three IML
statements that define the vectors a, b, and y:  */

a = { 1, 1, 1,  1, 1,  1, 1,  2, 2, 2, 2,  2,  2, 2, 2 };
b = { 1, 1, 1,  2, 2,  3, 3,  1, 1, 1, 1,  2,  3, 3, 3 };
y = { 6,10,11, 13,15, 14,22, 12,15,19,18, 31, 18, 9,12 };

/* Note how the values in the three vectors "line up" correctly with each
other to reflect exactly the same information as is given near the beginning
of this program in table 1.
For example, the last three values in y (i.e., 18, 9, and 12) each line up
with a value of 2 in vector a and a value of 3 in vector b, reflecting the
fact that the last three values in y are associated with the (2,3) cell in
table 1.

The set of fifteen numbers between the braces in each of the three IML
statements above is called a "matrix literal".  Although the above three
matrix literals are laid out horizontally, they do not specify row vectors
but instead specify *column* vectors.  They specify column vectors because
each of the numbers (except the last) in each matrix literal is followed by
a comma, which in an IML matrix literal indicates the end of a row.
(Numbers *within* a row of an IML matrix literal are not separated by
commas, but by blanks.)


PRELIMINARY STEP 3: GENERATE THE MAIN EFFECT SUBMATRICES

Before we can begin computing sums of squares, we must first generate the
"full-column-rank" submatrices of the overall "design matrix" for the
experiment.  These submatrices (which are surprisingly simple) are used in
the computation of the sums of squares.

Any analysis of variance has a separate submatrix of the design matrix for
each of its main effects (i.e., for each predictor variable) and a separate
submatrix for each interaction in the experiment.  Each submatrix has N
rows, where N equals the number of elements (entities, rows, cases,
observations) in the data.  (In the Searle data N is fifteen, since there
were fifteen trials in the experiment; these trials are reflected in the
fifteen elements in each of vectors a, b, and y defined above.)  Each
submatrix has DF columns, where DF is the number of degrees of freedom
associated with the main effect or interaction associated with the
submatrix.

The full-column-rank submatrix for any main effect can be generated with a
single statement in IML.  For example, to generate the submatrix for the
main effect of predictor variable B, and to store that submatrix in the
matrix called Bdesign, we use the following statement:

   Bdesign = designf(b)

where b is the column vector containing the raw values of predictor variable
B used in the experiment.  DESIGNF is a built-in function in IML.  The
function returns a matrix of zeros and (positive and negative) ones, with N
rows and DF columns, where N and DF are as described above.

Consider the main effect for predictor variable B (seed variety) in the
Searle data.  Here is a copy of the definition (given above) of the (column)
vector showing the different values of predictor variable B for the fifteen
trials in the experiment:

   b = { 1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 2, 3, 3, 3 }.

Following is the IML statement to compute the submatrix of the design matrix
for predictor variable B.  This is followed by a statement to print the
newly computed matrix.  (The matrix is printed transposed for ease of
study.)  */

Bdesign = designf(b);
print (Bdesign`);

/* Consider the first row of 1's, zeros, and -1's in the output above.
Compare the values in this row with the values in vector b given further
above.  Note how the 1's identify the cases in which B has the value 1.
Similarly, the zeros identify the cases in which B has the value 2.
Finally, the -1's identify the cases in which B has the value 3.

Similarly, in the second row note how the 1's identify the cases in which B
has the value 2.  And the -1's identify (again) the cases in which B has the
value 3.  The zeros identify the cases in which B has the value 1.
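(If you would like to see the pattern on a small standalone example of your
own -- this sketch is not executed anywhere in this program, and v is a
scratch name used only here -- you could submit statements such as

   v = {1, 2, 3, 2};
   print (designf(v));

Per the description above, the expected output is a 4 x 2 matrix whose rows
are

    1   0
    0   1
   -1  -1
    0   1

with the highest value of v, namely 3, serving as the "left out" value.)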
This generalizes:  Suppose some (discrete-valued) predictor variable P has
assumed m different values in an experiment.  (The value of m must be
greater than or equal to 2 for P to be a valid predictor variable.)  Then
the (transposed) submatrix for P will have m-1 rows, each row being
associated with one of the m values, and with one of the m values being
"left out" with no row of its own.  Within each row the elements are +1 if
this case received the associated value of predictor variable P; -1 if the
case received the "left out" value of predictor variable P; and zero if m is
greater than 2 and the case received one of the other m-2 values of
predictor variable P.

Design matrices are mysterious because, in the case of computing the
standard analysis of variance sums of squares, they are not unique.  That
is, we could get exactly the same results in the output below (apart from
slight differences due to roundoff errors) if we specified a new submatrix
of the design matrix for predictor variable B, provided that the two columns
in the new matrix were themselves any two independent "linear combinations"
of the two columns of the submatrix for B given above.

How is it possible to get the same results if a submatrix of the design
matrix is replaced with a transformed version of itself?  This puzzling
aspect of design matrices is explained by the fact that the relevant
information (for computing a sum of squares) in a submatrix of a design
matrix is not stored in the *values* of the elements in the matrix -- it is
stored in the *relationships among* the values.  Independent linear
combinations of the columns in a submatrix preserve this relevant
information.

(Technically, the linear combinations preserve the information in the sense
that all transformed sets of columns of a submatrix of the design matrix
so-defined are generating vectors of the same relevant subspace of the
N-dimensional vector space under study.)

Now let us compute the submatrix of the design matrix for predictor variable
A (soil type).  Here is a copy of the (column) vector of values from the
definition given above for this variable:

   a = { 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2 }

Following are the statements to compute and print the submatrix of the
design matrix for predictor variable A.  (The matrix is again printed
transposed for ease of study.)  */

Adesign = designf(a);
print (Adesign`);

/* Since predictor variable A assumed only two values in the experiment, the
(transposed) submatrix of the design matrix has only one row.  As was the
case with predictor variable B, note how the 1's and -1's above in Adesign
line up respectively with the 1's and 2's in the column vector of the values
of predictor variable A.


PRELIMINARY STEP 4: GENERATE THE INTERACTION SUBMATRIX

Each interaction in an experiment has its own submatrix of the design
matrix.  We can generate an interaction submatrix by computing the
"horizontal direct product" of the design matrices for all the main effects
for the predictor variables associated with the interaction.

Consider two matrices P and Q with n rows and np and nq columns,
respectively.  The horizontal direct product of P and Q has n rows and
np x nq columns.  Each column in the horizontal direct product is associated
with a unique pair of columns, one from P and one from Q.  The elements in a
given column of the horizontal direct product are the products of the
corresponding elements in the two associated columns of P and Q.
For example, consider the following two matrices:

   P = {1 2,     and     Q = {1 6,
        3 4,                   2 5,
        5 6 }                  3 4 }.

The horizontal direct product of P and Q is

   {  1   6   2  12,
      6  15   8  20,
     15  20  18  24 }.

The IML function HDIR(P,Q) computes the horizontal direct product of two
matrices P and Q.

Following is a statement to generate the submatrix of the design matrix for
the A x B interaction in the Searle data.  The submatrix is generated by
computing the horizontal direct product of the two design submatrices
(Adesign and Bdesign) generated above.  This is followed by a statement to
print the interaction submatrix (again transposed for ease of study).  */

ABdesign = hdir(Adesign, Bdesign);
print (ABdesign`);

/* COMPUTATION STEP 1: FOR A GIVEN SUM OF SQUARES SPECIFY XE
   (WE SHALL FIRST COMPUTE THE HTO SUM OF SQUARES FOR A)

Having computed the three submatrices of the design matrix, we can begin
computing sums of squares.  Let us first compute the HTO sum of squares for
predictor variable A (soil type).

The actual computing of sums of squares in this program is done by a
subroutine called SS.  Using a subroutine is efficient because it
consolidates the code to compute sums of squares into a reusable and
portable package.  I discuss the SS subroutine in detail below.

In order to compute a sum of squares, the SS subroutine requires three
pieces of input:

1. the values of the response variable obtained in the experiment
2. information about which sum of squares we wish to compute, as specified
   by two overparameterized model equations, as discussed above
3. information about the layout of the experiment.

The first piece of input, the values of the response variable, was defined
above in the definition of the vector y.  Thus we need only pass the vector
y to the subroutine for the subroutine to use the values of the response
variable.

The second and third pieces of input are passed to the subroutine through
two matrices that are obtained from the submatrices of the design matrix
discussed above.  The two matrices are called XE and XR.

The first matrix, XE, is simply the submatrix of the design matrix for the
particular effect we wish to test.  In the present case we are testing the A
main effect.  Thus we specify XE as follows:  */

XE = Adesign;

/* COMPUTATION STEP 2: FOR A GIVEN SUM OF SQUARES SPECIFY XR

In the present example we wish to compute the HTO numerator sum of squares
for the A main effect.  Thus (as discussed above) we wish to compute the
difference between the residual sums of squares of the following two model
equations:

   y(i,j,k) = mu + beta(j) + e(i,j,k)                              (3)
   y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k).                  (2)

We have already specified the conceptual difference between (3) and (2) in
the specification of XE above.  Thus all that remains in order to fully
specify the two equations is to list the "other" terms on the right side of
(3) and (2).

The "other" terms on the right side of (3) and (2) (excluding the error
term) are mu and beta.  We specify the other terms by "horizontally
concatenating" the submatrices for these terms to form the matrix XR.

As suggested by the name, the horizontal concatenation of two matrices P and
Q is the matrix formed by appending the rows of Q to the corresponding rows
of P.  (Horizontal concatenation of two matrices is possible only if the
matrices have the same number of rows.)  In SAS IML the expression P || Q is
used to specify the horizontal concatenation of two matrices P and Q.
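(As a quick check of these two operations -- a sketch that is not executed
anywhere in this program, with P and Q used only as scratch names -- you
could enter the small matrices from the example above and print both
results:

   P = {1 2, 3 4, 5 6};
   Q = {1 6, 2 5, 3 4};
   print (hdir(P,Q));   * should reproduce the 3 x 4 horizontal direct product shown above;
   print (P || Q);      * the 3 x 4 horizontal concatenation {1 2 1 6, 3 4 2 5, 5 6 3 4};

HDIR and the || operator are built-in IML features.)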
Since mu is a constant in the two equations, the submatrix of the design
matrix for mu is a column vector of 1's.  We can specify a column vector of
fifteen 1's with the IML function j(15,1).  Thus we specify the XR matrix
for the HTO A (soil type) sum of squares in the experiment with the
following statement:  */

XR = j(15,1) || Bdesign;

/* Here is the (transposed) XR matrix:  */

print (XR`);

/* Note that XE and XR contain two types of information:

- information about which analysis of variance sum of squares we wish to
  compute (as captured by the *choice* of the submatrices we have used to
  define XE and XR)
- information about the layout of the experiment (as captured in the
  *layout* of the submatrices of the design matrix used to define XE and
  XR).

The E in XE stands for the Effect being tested.  The R in XR stands for the
otheR terms on the right side of the two model equations.


PRELIMINARY STEP 5: MAKE THE SS SUBROUTINE AVAILABLE TO THE PROGRAM

(I have delayed this preliminary step until after defining XE and XR, since
this will help you to understand the subroutine.)

Since we have now defined y, XE, and XR for the HTO A (soil type) effect, we
are ready to call subroutine SS to compute the sum of squares.  In fact, we
shall call the subroutine momentarily.  But before we can actually call the
subroutine, we must first make it known to IML.  We do this with the
%INCLUDE statement that follows.

Because the SOURCE2 option is specified in the %INCLUDE statement, SAS lists
all of the statements in the subroutine in the output below immediately
after the %INCLUDE statement.  */

%include 'D:\PROGS\SS.SAS' / source2;

/* PRELIMINARY STEP 6: SET THE VALUES OF THE THREE SECONDARY ARGUMENTS OF
   THE SS SUBROUTINE

The values of the three secondary arguments of the SS subroutine are set
immediately below.  These arguments control details of how the subroutine
computes sums of squares and prints results.  The values set below are used
on every call to the subroutine in this program.  The values instruct the
subroutine to

- use the first method of computing sums of squares
- print the value of the computed sum of squares and
- print intermediate results (showing the transposed versions of the matrix
  H and the vector yp).  */

level = 1;
printss = 1;
printh = 2;

/* COMPUTATION STEP 3: CALL SS

Now that we have made the SS subroutine known to IML, we can use it.  Recall
that we defined y, XE, and XR above to allow us to compute the HTO sum of
squares for the A (soil type) main effect in the Searle data.  Thus we are
now ready to actually call the SS subroutine to do the computation.

But just before we do, note that Searle's exact answer for the HTO sum of
squares for the A main effect (1987, 113, 114, 122 [typo]) is

   83 127/141 = 83.90070 92198 58156 0

Here is the statement to call the SS subroutine to compute and print the HTO
sum of squares for the A main effect:  */

call SS(result, y, XE, XR, level, printss, printh);

/* The SS subroutine generated the twenty-seven printed lines above.  First
it printed the computed sum of squares, 83.9....  Note how this value is
very close to Searle's exact answer:  the difference occurs in the fifteenth
significant digit (which, if it is rounded, should be 2 instead of 1).

Next, the subroutine printed the transposed version of the hypothesis matrix
H that was used to compute the sum of squares.  (This matrix was generated
solely from XE and XR, which were, in turn, generated solely from the values
of the two predictor variables.  That is, the response variable, y, played
no role in generating H.)
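(For readers who would like to see one way the arithmetic could be
organized, here is a rough sketch of the four steps listed in the table of
contents -- form a hypothesis matrix H from XE and XR, form a projection
matrix PM from H, project y to obtain yp, and take the squared length of yp
as the sum of squares.  These are not necessarily the exact statements used
inside SS.SAS, and the sketch is not executed anywhere in this program:

   H  = XE - XR*ginv(XR`*XR)*XR`*XE;   * the part of XE that is orthogonal to the columns of XR;
   PM = H * ginv(H`*H) * H`;           * the projection matrix formed from H;
   yp = PM * y;                        * the projection of y by PM;
   ss = ssq(yp);                       * the sum of squares as the squared length of yp;

GINV and SSQ are built-in IML functions.  This projection formulation is
mathematically equivalent to differencing the residual sums of squares of
the two model equations.)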
Next, the subroutine printed the projection matrix PM.  This matrix was
generated directly from the hypothesis matrix H and is simply a different
way of stating the hypothesis being tested.  (The zeros shown above in PM
are not true zeros but are only shown as zeros because the SAS printing
algorithm must round them to fit the narrow print fields.)  Note that PM is
symmetric (i.e., each row equals the corresponding column) and the values in
the matrix change in step with the changes in the treatment groups in the
experiment, as defined by the different values in the a and b vectors
specified earlier.

Next, the subroutine printed the vector yp, which is the projection of the
vector y obtained when y is multiplied by PM.  The computed sum of squares
(i.e., 83.9...) was computed by summing the squared elements of yp.

To help us judge the accuracy of the computed sum of squares, the subroutine
concluded by printing the three numbers NDIGITSS, NDIGITSR, and NDIGITSI.
These numbers are measures of the integrity of the projection matrix -- the
smallest of these numbers (i.e., 15.3) gives a rough indication of the
ceiling of the number of digits of accuracy in the numbers in the projection
matrix.  Since IML maintains a precision of roughly sixteen floating point
decimal digits on my (Windows) computer, the maximum possible number of
digits of accuracy on my computer is 16.  (For curious readers, I describe
the three numbers further in the comments near the end of the SS
subroutine.)

Since the three measures of the integrity of the projection matrix suggest
that the projection matrix is accurate to roughly fifteen significant
digits, and since an accuracy of a sum of squares to four significant digits
is fully adequate for computing a p-value, we can be confident that the
computed sum of squares is sufficiently accurate.


COMPUTE THE REMAINING SIX SUMS OF SQUARES

Following are steps to compute six other sums of squares for the Searle
data.  Note that each case requires only the following three lines of code:

- a line to specify the effect being tested (via XE)
- a line to specify the other terms in the two models (via XR)
- a call to the subroutine (with the CALL statement).

In each case I first state the two model equations whose residual sums of
squares are being differenced.  I state the models in an abbreviated form,
omitting the subscripts on the terms and omitting the error term.


SUM OF SQUARES FOR THE HTO B (SEED VARIETY) MAIN EFFECT

The two model equations for computing the HTO B sum of squares are

   y = m + a
   y = m + a + b.

Thus to compute the HTO B sum of squares we specify XE as BDESIGN -- the
submatrix of the design matrix for main effect B.  */

XE = Bdesign;

/* We specify XR as the horizontal concatenation of J(15,1) and ADESIGN,
which are the submatrices for m and a, which are the other terms (excluding
the error term) on the right side of the two model equations.  */

XR = j(15,1) || Adesign;

/* Searle's exact answer for the HTO B sum of squares (1987, 104, 113, 114,
122) is

   124 69/94 = 124.73404 25531 91489 4
*/

call SS(result, y, XE, XR, level, printss, printh);

/* SUM OF SQUARES FOR THE HTI (HIGHER-LEVEL TERMS INCLUDED) A MAIN EFFECT

HTI sums of squares operate by including higher-level interaction terms in
the two model equations whose residual sums of squares are differenced.
Thus the two model equations for computing the HTI A sum of squares are

   y = m + b + p
   y = m + a + b + p.
Note the appearance of the p (interaction) term in both equations.  This
term was omitted above in the computation of the two HTO sums of squares.
Recall that the submatrix of the design matrix for the interaction is
ABdesign.

Searle's exact answer for this sum of squares (1987, 91) is

   123 27/35 = 123.77142 85714 28571 4
*/

XE = Adesign;
XR = j(15,1) || Bdesign || ABdesign;
call SS(result, y, XE, XR, level, printss, printh);

/* SUM OF SQUARES FOR THE HTI B MAIN EFFECT

The two models for computing the HTI B sum of squares are

   y = m + a + p
   y = m + a + b + p.

Searle does not give an exact answer for this sum of squares.  The SAS Type
III sum of squares for the B main effect is

   192.12765 957
*/

XE = Bdesign;
XR = j(15,1) || Adesign || ABdesign;
call SS(result, y, XE, XR, level, printss, printh);

/* SUM OF SQUARES FOR THE SEQUENTIAL A MAIN EFFECT WHEN A IS ENTERED FIRST

The two models for computing the sequential A sum of squares when A is
entered first are

   y = m
   y = m + a.

(When A is entered SECOND [i.e., after B], the sequential sum of squares for
A is the same as the HTO sum of squares for A, which is computed above.)

Searle's exact answer for this sum of squares (1987, 93, 113, 114, 122) is

   52 1/2
*/

XE = Adesign;
XR = j(15,1);
call SS(result, y, XE, XR, level, printss, printh);

/* SUM OF SQUARES FOR THE SEQUENTIAL B MAIN EFFECT WHEN B IS ENTERED FIRST

The two models for computing the sequential B sum of squares when B is
entered first are

   y = m
   y = m + b.

(When B is entered SECOND [i.e., after A], the sequential sum of squares for
B is the same as the HTO sum of squares for B, which is computed above.)

Searle's exact answer for this sum of squares (1987, 95, 113, 114, 122
[typo]) is

   93 1/3
*/

XE = Bdesign;
XR = j(15,1);
call SS(result, y, XE, XR, level, printss, printh);

/* SUM OF SQUARES FOR THE A x B INTERACTION EFFECT

The two models for computing the A x B interaction sum of squares are

   y = m + a + b
   y = m + a + b + p.

Note that the sum of squares for the highest-level interaction in an
experiment does not differ from one standard approach to computing analysis
of variance sums of squares to the next.

Searle's exact answer for this sum of squares (1987, 113, 114) is

   222 36/47 = 222.76595 74468 08510 6
*/

XE = ABdesign;
XR = j(15,1) || Adesign || Bdesign;
call SS(result, y, XE, XR, level, printss, printh);

/* SAVE THE DATA IN A SAS DATASET

The next three IML statements create a SAS dataset (called "Searle_1") and
then transfer the values of vectors a, b, and y to variables with the same
names in the dataset.  This enables us to compute all the sums of squares
for the Searle data with SAS GLM.  */

create Searle_1 var {a b y};
append;
close Searle_1;

/* QUIT FROM IML  */

quit;

/* RUN PROC GLM TWICE

The following statements run SAS GLM on the data to enable comparing the
above IML output with output from an analysis of the data by GLM.  (The
output from GLM comes in a separate output file.)  Examination of the GLM
output reveals that all the GLM sums of squares are identical in all
available digits to the sums of squares produced by the SS subroutine.  */

title 'IML/GLM 2 x 3 Unbalanced ANOVA, Searle (1987, 79)';
options nodate linesize=80 probsig=2;

proc glm data=Searle_1;
   class a b;
   model y = a | b / ss1 ss2 ss3;
quit;

/* Run GLM a second time with a and b reversed in the model statement to get
the sequential (Type I) sum of squares for B when B is entered first into
the model.
*/

proc glm data=Searle_1;
   class a b;
   model y = b | a / ss1 ss2 ss3;
quit;

options date linesize=80 probsig=2;
title;

/* SUMMARY

This program illustrates three approaches to computing numerator sums of
squares in unbalanced analysis of variance:  HTO, HTI, and sequential.  The
three approaches are specified in terms of computing the difference between
the residual sums of squares of two overparameterized model equations.  The
two models are specified in terms of two matrices, XE and XR, which are made
up of submatrices of the overall (full-column-rank) design matrix for the
experiment.  XE is the submatrix for the effect being tested.  XR is the
horizontal concatenation of the submatrices for all the other terms (except
the error term) on the right side of the two model equations.

Note the conceptual economy of the approach:  After the data are known to
IML, the submatrices of the design matrix can be specified with one simple
statement per submatrix (using the DESIGNF and HDIR functions).  It takes
just two more statements to specify (via XE and XR) a particular sum of
squares to be computed.  And it takes only four more reusable statements (in
the SS subroutine) to compute any standard numerator sum of squares.


NOTES

If you wish to run this program on your computer, see the checklist in the
appendix.

This program illustrates computation of numerator sums of squares in the
two-way case -- i.e., the case with two discrete-valued predictor variables.
I discuss five approaches to computing analysis of variance numerator sums
of squares in the three-way case in another program (1998).

I discuss in the paper (1997, sec. 17) the fact that statistical tests
provided by the HTO sums of squares are (in the relevant cases) generally
more powerful than the corresponding tests provided by the HTI sums of
squares.  This implies that if there is a relationship, an HTO sum of
squares will usually be greater than the corresponding HTI sum of squares.
However, Searle's data twice illustrate the fact that an HTO sum of squares
is not *always* greater than the corresponding HTI sum of squares.
Specifically, the HTO (SAS Type II) sum of squares for the A main effect in
Searle's data is less than the HTI (SAS Type III) sum of squares for the
same effect (83.9 versus 123.8).  Similarly, the HTO sum of squares for the
B main effect in Searle's data is also less than the HTI sum of squares
(124.7 versus 192.1).


APPENDIX: STEPS TO RUN THIS PROGRAM

1. Ensure that the STAT and IML components of the SAS system are available
   on your computer.  Information about the SAS system is available at
   http://www.sas.com

2. Ensure that you have the source version of this program, which is called
   PR0139.SAS (not the HTML version, which is called PR0139.HTM).  You can
   obtain a copy of the source version in the "Computer Programs" section of
   the page at http://www.matstat.com/ss/

3. Install a copy of the SS subroutine on your computer.  This subroutine
   does the actual computations of sums of squares and is available from the
   above MatStat web page.

4. Edit the %INCLUDE statement in preliminary step 5 above to correctly
   point to the location of the SS.SAS subroutine file on your computer.
   That is, change the D:\PROGS\SS.SAS in the statement to the location
   where SS.SAS is stored on your computer.

5. (Optional.)  Modify the two OPTIONS statements in the program that adjust
   the DATE, LINESIZE, and PROBSIG options.

6. Submit the program to SAS.


REFERENCES
Fisher, R. A. 1925.  _Statistical Methods for Research Workers._  Edinburgh:
   Oliver and Boyd.  The 14th edition of this seminal work appears in Fisher
   (1990).

Fisher, R. A. 1935.  _The Design of Experiments._  Edinburgh: Oliver and
   Boyd.  The 8th edition of this seminal work appears in Fisher (1990).

Fisher, R. A. 1990.  _Statistical Methods, Experimental Design, and
   Scientific Inference,_ edited by J. H. Bennett.  Oxford: Oxford
   University Press.

Macnaughton, D. B. 1996a.  The introductory statistics course:  A new
   approach.  Available at http://www.matstat.com/teach/

Macnaughton, D. B. 1996b.  The entity-property-relationship approach to
   statistics:  An introduction for students.  Available at
   http://www.matstat.com/teach/

Macnaughton, D. B. 1997.  Which sums of squares are best in unbalanced
   analysis of variance?  Available at http://www.matstat.com/ss/

Macnaughton, D. B. 1998.  PR0165.HTM:  Computing numerator sums of squares
   in unbalanced analysis of variance:  Three-way case.  Available in the
   "Computer Programs" section at http://www.matstat.com/ss/

Searle, S. R. 1987.  _Linear Models for Unbalanced Data._  New York: Wiley.

Yates, F. 1934.  The analysis of multiple classifications with unequal
   numbers in the different classes.  _Journal of the American Statistical
   Association_ 29, 51-66.

version of June 19, 1998

(end of program pr0139.sas)  */