/*********************************************************/

start SS (result, y, XE, XR, level, printss, printh);

/* SS.SAS

SUBROUTINE TO COMPUTE ANALYSIS OF VARIANCE NUMERATOR SUMS OF
SQUARES

Donald B. Macnaughton
donmac@matstat.com

ABSTRACT
This SAS IML subroutine computes analysis of variance
numerator sums of squares for statistical tests in unbalanced
analysis of variance.  The subroutine can compute the
following types of sums of squares:
- HTO (Higher-level Terms are Omitted)
  = SPSS ANOVA Experimental
  = SAS Type II in the two-way case
- SAS Type II
- HTOS (Higher-level Terms are Omitted unless Significant)
  = superset of SAS Type II and HTO
- HTI (Higher-level Terms are Included)
  = SAS Type III
  = SPSS ANOVA Unique
  = the default method in many analysis of variance programs
- sequential
  = SAS Type I
  = SPSS ANOVA Hierarchical in the two-way case
- SPSS ANOVA Hierarchical
- other types of sums of squares (including those in analysis
  of covariance) that can be specified as being equal to the
  difference between the residual sums of squares of two
  overparameterized model equations.

Each time the subroutine is called it computes the single sum
of squares specified by the calling arguments and returns the
result to the calling program through the returned argument
RESULT.

SUBROUTINE USES
The subroutine
- provides an easy way of computing numerator sums of squares
  that cannot be computed in some statistical packages
- provides a way of checking how a particular sum of squares
  is computed in a statistical package
- illustrates the mathematical aspects of computing sums of
  squares
- assists in performing simulation tests of conjectures about
  analysis of variance sums of squares.

SUBROUTINE MAIN ARGUMENTS
When calling this subroutine you must specify three main
arguments: y, XE, and XR.

The first main argument, y, must be a column vector
containing the values of the response variable obtained in
the experiment.

The second and third main arguments, XE and XR, must be
submatrices of the (full-column-rank) design matrix for the
experiment.  These submatrices jointly specify the particular
sum of squares to be computed.  XE must contain the submatrix
associated with the effect being tested.  XR must contain the
horizontal concatenation of the submatrices associated with
all the other terms (except the error term) on the right side
of the two model equations whose residual sums of squares you
wish to difference.

I give examples of specification of y, XE, and XR in two
sample calling programs (1998a, 1998b).

SUBROUTINE METHOD
As noted above, the subroutine is supplied with the vector y
and the matrices XE and XR.  Here is the code (all four
lines) for computing the requested sum of squares and storing
it in the variable RESULT:

  H = XE - XR * ginv(XR) * XE  <-- hypothesis matrix
                                   (dimensions are N x DF)
  PM = H * ginv(H)             <-- projection matrix (N x N)
  yp = PM * y                  <-- projection of y (N x 1)
  result = ssq(yp)             <-- squared length of the
                                   projection (1 x 1).

If you are familiar with matrix algebra, the above code will
probably be understandable, even if you are not familiar with
SAS IML.  But here are definitions for functions that might
be puzzling:

  ginv(X) = Moore-Penrose generalized inverse of X
  ssq(x)  = sum of the squares of the elements in x.

The subroutine method is derived from discussion and proofs
by Hocking (1985) and Searle (1987).  Aldrich (1998)
discusses the history of the methods.  I discuss the method
in more detail below.
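To make the four statements concrete, here is a minimal
standalone sketch using hypothetical data: an unbalanced
one-way layout with five observations in two treatment
groups.  (The arrow annotations follow the display convention
used above; remove them before running the code in PROC
IML.)

  proc iml;
    y  = {3, 5, 4, 7, 8};      <-- response vector (N = 5)
    XR = {1, 1, 1, 1, 1};      <-- grand-mean column only
    XE = {1, 1, 1, -1, -1};    <-- treatment contrast
                                   (DF = 1)
    H  = XE - XR * ginv(XR) * XE;
    PM = H * ginv(H);
    yp = PM * y;
    result = ssq(yp);
    print result;              <-- prints 14.7
  quit;

The printed value agrees with the classical between-groups
sum of squares for these data: the group means are 4 and 7.5,
the grand mean is 5.4, and
3(4 - 5.4)**2 + 2(7.5 - 5.4)**2 = 14.7.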
SUMMARY OF THE REST OF THE SUBROUTINE
The remaining lines in this subroutine (717 lines) contain
- a discussion of details of the method discussed above
- a discussion of details of the subroutine operation and
- the executable version of the four statements given above.

I recommend that you omit reading these lines unless you are
interested in the details.  (To find the end of the
subroutine to return to reading the calling program, search
for the row of asterisks.)

THE HYPOTHESIS MATRIX H
As noted above, the first step of this subroutine is to
generate the hypothesis matrix H, using the following
statement:

  H = XE - XR * ginv(XR) * XE.

Actually, the above statement is one of two different methods
that the subroutine can use to generate H.  (I describe how
to specify which method is to be used in the Secondary
Arguments section below.)  The other method is

  H = XE - XR * inv(XR` * XR) * XR` * XE.

Algebraically, both methods yield exactly the same result for
H, as shown by Graybill's theorem 6.2.16 (1983, 112).

(At several places in this subroutine I state that two
methods yield exactly the same result.  Although my
statements are algebraically correct, if we use the two
methods in a computer program, differences between the values
computed by the two methods will usually occur, due to
differences in roundoff errors.  One should be aware of these
differences, but they can usually be ignored because they are
usually restricted to just the last few significant digits in
computations that SAS performs in sixteen-decimal-digit
precision.)

H has the following properties:
1. H has N rows, where N is the number of rows in y, XE, and
   XR.
2. H has DF columns, where DF is the number of degrees of
   freedom of the effect being tested.
3. Choose any cell in the table that summarizes the layout of
   the experiment.  (For an example of such a table, see
   Searle's carrot seed germination data [1987, 79],
   reproduced in table 1 in Macnaughton [1998a].)  All the
   rows in H associated with the chosen cell are identical to
   each other.
4. The sum of each column of H equals zero.  This is
   surprising because the sums of the columns of XE and XR,
   which are used to generate H, generally do NOT equal zero
   in unbalanced experiments.
5. If H has only one column (i.e., DF = 1), we can view the
   elements in H as a statement of the hypothesis being
   tested.  That is, the elements in H are (indirectly)
   multiplied by the corresponding elements in the response
   data vector (in a "contrast") as a step in computing the
   sum of squares.
6. If H has more than one column, the columns of H are
   "linearly independent".
7. If H has more than one column, the elements in each column
   of H can be viewed as a "portion" of the hypothesis being
   tested.  That is, each column of H represents a separate
   contrast that is applied to the response data vector.  The
   results of these contrasts are mathematically combined to
   compute the sum of squares.  (The contrasts are not
   directly applied to the response data vector, but only
   indirectly through PM, as discussed below.)
8. Surprisingly, H is not unique.  That is, for any given H
   we can replace it by any of an infinite number of
   "related" matrices (which have, of course, the same number
   of rows and columns as H), and this subroutine will return
   exactly the same value for the sum of squares.  H does not
   need to be unique to yield a unique sum of squares because
   the columns of H do not define the hypothesis by
   themselves.  Instead, the columns define a unique subspace
   of the N-dimensional vector space under study.  Linear
   algebra shows that other versions of the matrix H can be
   used to define the same unique subspace.  More
   specifically, suppose T is ANY "non-singular" matrix with
   DF rows and DF columns.  Then we can replace H with

     H1 = H * T

   and this subroutine will return exactly the same sum of
   squares.
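Continuing the hypothetical one-way sketch given earlier, the
following lines (run after the statements in that sketch) can
be used to illustrate properties 4 and 8; the matrix T below
is an arbitrary choice:

  print (H[+,]);               <-- column sums of H: all zero
  T   = {-3};                  <-- any non-singular DF x DF
                                   matrix (here DF = 1)
  H1  = H * T;
  PM1 = H1 * ginv(H1);
  print (max(abs(PM1 - PM)));  <-- (near) zero: H and H1
                                   yield the same PM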
THE PROJECTION MATRIX PM
As noted above in the "Method" section, the next step after
computing H is to compute the projection matrix PM, using the
following statement:

  PM = H * ginv(H).

Actually (as with H), the above statement is one of two
different methods that this subroutine can use to generate
PM.  The other method is

  PM = H * inv(H` * H) * H`.

As (again) shown by Graybill's theorem 6.2.16, both methods
yield exactly the same value for PM.

The method immediately above is equivalent to computing
Hocking's P(c) in (6.168) (1985, 153).  This method is
partially shown to be valid in Graybill's theorem 4.4.1
(1983, 73).  Harville proves some useful results for
projection matrices (1997, sec. 12.3).

Note how the above two equations imply that PM is a
"normalized" version of the hypothesis matrix H.  PM has the
following properties:
1. PM has N rows and N columns.
2. Choose any cell in the table that summarizes the layout of
   the experiment.  All the rows in PM associated with that
   cell are identical to each other.
3. PM is symmetric.  That is, the first row has exactly the
   same elements (in left-to-right order) as the first column
   (in top-to-bottom order), the second row has exactly the
   same elements as the second column, and so on.
4. The sum of the elements in each row of PM (and the sum of
   the elements in each column) is equal to zero.  This
   implies that if a row of PM is multiplied by the vector y,
   it produces a "contrast" of the values in y.
5. The trace of PM (i.e., the sum of the elements on the
   descending diagonal) is equal to DF, the number of degrees
   of freedom associated with the hypothesis being tested.
6. The rank of PM (i.e., the number of linearly independent
   rows or columns in PM) is equal to DF.  Thus if DF is
   equal to 1, any row in PM is a multiple of any other row.
   If DF is equal to k, any row in PM can be generated as a
   linear combination of any k other linearly independent
   rows.
7. The sum of the squares of the elements in PM is equal to
   DF.
8. The rows (and columns) of PM are linear combinations of
   the columns of H.
9. PM is unique.  That is, suppose a projection matrix PM is
   computed from a given hypothesis matrix H.  Suppose that H
   is then multiplied by any non-singular matrix T with DF
   rows and DF columns to yield H1.  Suppose that a new
   projection matrix PM1 is then computed from H1.  Then
   PM1 = PM.
10. PM is idempotent.  That is, if we multiply PM by itself,
    the answer we get is PM.
11. If (as discussed below) we project an arbitrary vector x
    through PM to yield x1, and if we then project x1 through
    PM to yield x2, we will find that x1 and x2 are
    identical, although they will generally differ from x.
12. PM has DF eigenvalues (characteristic values) that are
    equal to +1, and the remaining eigenvalues are equal to
    zero.  The eigenvectors (characteristic vectors) of PM
    that correspond to the non-zero eigenvalues can be used
    as columns to form a valid version of the hypothesis
    matrix H.
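Several of these properties are easy to check numerically.
Continuing the hypothetical sketch, the following lines
illustrate properties 3, 5, 10, and 12:

  print (max(abs(PM - PM`)));    <-- (near) zero: symmetric
  print (trace(PM));             <-- equals DF (here 1)
  print (max(abs(PM*PM - PM)));  <-- (near) zero: idempotent
  ev = eigval(PM);
  print (sum(ev > .5));          <-- DF eigenvalues equal +1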
yp, THE PROJECTION OF y BY PM
Thus far, we have only used information about the *predictor*
variables in the experiment to derive the following four new
matrices: XE, XR, H, and PM.  That is, we have not yet taken
any account of the values of the response variable stored in
the vector y.  The next step in computing the sum of squares
is to mathematically marry the response variable and the
predictor variables.  We do so by using the projection matrix
PM to "project" y.  That is, we postmultiply the projection
matrix by y to yield a new vector, called yp, as follows:

  yp = PM * y.

The vector yp has the following properties:
1. Like y, yp is an N x 1 column vector.
2. Choose any cell in the table that summarizes the layout of
   the experiment.  All the elements in yp associated with
   that cell are identical to each other.

THE SUM OF SQUARES AS THE SQUARED LENGTH OF yp
The desired sum of squares is simply the squared length of
yp, and is computed as

  result = ssq(yp).

Thus the sum of squares is simply the squared length of the
projection of the vector y by the projection matrix PM.

PM has two important properties related to the projection of
y:
1. If no relationship exists between the response variable
   and the particular set of predictor variables associated
   with XE (and if certain well-known assumptions are
   adequately satisfied), the length of the projection of y
   can be expected to be "short"; it will be roughly equal to
   a known length (i.e., a length that can be computed from
   the data).  On the other hand, if a relationship between
   the variables *does* exist, the length of the projection
   of y will tend to be longer than the known length.  Thus
   computing a p-value is simply computing whether the
   projection of y is longer than could be reasonably
   expected if no relationship exists.
2. If we are studying a balanced experiment, and if we
   compute projection matrices PM1 and PM2 for any two of the
   effects (main effects or interactions) in the experiment,
   then PM1 * PM2 = 0, where 0 is an N x N matrix of zeros.
   (This impressive result does not generally occur in
   unbalanced experiments.)  This means that in a balanced
   experiment the projection of any vector y by PM1 is
   "orthogonal" to (i.e., at right angles to) the projection
   of the same vector (or any other vector y1) by PM2.  It
   also means that if an experiment is balanced, none of the
   effects in the experiment can "contaminate" the
   statistical tests of other effects in the experiment.
   (This contamination, which I shall demonstrate in later
   material, occurs with some statistical tests in unbalanced
   experiments.)

OTHER METHODS OF COMPUTING THE SUM OF SQUARES
Simple linear algebra implies that we can also compute the
desired sum of squares as

  result = yp` * yp.

Also, Searle (1987) shows in (82) on page 264, (90) on page
272, and (90) on page 318 that we can compute the desired sum
of squares directly from PM as a quadratic form

  result = y` * PM * y.

I find the projection approach (a geometric approach) easier
to understand than the quadratic form approach (an algebraic
approach).  I visualize the response vector as an arrow in
the N-dimensional vector space that is "projected through"
the projection matrix to generate another arrow in a subspace
of the first space.  The length of the second arrow is
related to selected properties of the first arrow.  In
particular, the projection matrix is specifically designed so
that the length of the second arrow shows (in a way that is
as mathematically "clear" as possible) the strength of
support (provided by the y-values obtained in the experiment)
for the hypothesis that the relationship between variables
associated with the projection matrix exists in the
population.
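Continuing the hypothetical sketch, the following lines
illustrate that the three formulas yield the same value:

  r1 = ssq(yp);      <-- squared length of the projection
  r2 = yp` * yp;     <-- inner product of yp with itself
  r3 = y` * PM * y;  <-- quadratic form in y
  print r1 r2 r3;    <-- all three print 14.7 for the
                         sketch data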
GENERAL COMMENTS ABOUT THE SUBROUTINE METHOD
It is helpful to review what this subroutine accomplishes.
In essence, the calling program passes the matrices XE and XR
to the subroutine.  These matrices contain the specifications
of the layout of the experiment and the specification of two
model equations.  The calling program also passes the vector
y to the subroutine.  This vector contains the values of the
response variable obtained in the experiment.  Following
Yates' characterization of analysis of variance sums of
squares (1934, 63), the subroutine uses the three arguments
to compute the difference between the residual sums of
squares of the two model equations.  Depending on how the
calling program specifies the values of XE and XR, this
allows the subroutine to compute a sum of squares using any
of the seven approaches to computing analysis of variance
sums of squares named in the abstract of the subroutine.  (I
illustrate how to call the subroutine to compute sums of
squares for some of the approaches in two computer programs
[1998a, 1998b].)

SUBROUTINE SECONDARY ARGUMENTS
The remaining lines in this subroutine (419 lines) contain
- details of the subroutine operation and
- the executable version of the statements given above.
You can find the end of the subroutine to return to reading
the calling program by searching for the row of asterisks.

To use this subroutine you must supply values for three
secondary arguments: LEVEL, PRINTSS, and PRINTH.  These
arguments control details of how the subroutine performs the
computation and prints the results.

The argument LEVEL controls which method the subroutine uses
to compute the hypothesis matrix H and the projection matrix
PM.  If you set LEVEL to 1, the two matrices are computed
using the standard inverse.  If you set LEVEL to 2, the two
matrices are computed using the generalized inverse (as shown
in the Method section).  Using LEVEL = 1 seems to yield
solutions that are slightly more accurate.

The argument PRINTSS controls whether the subroutine prints
the sum of squares it has computed.  If you set PRINTSS to 1,
the subroutine prints the value of the sum of squares in
25.15 format.  If you set PRINTSS to zero, the subroutine
does not print the sum of squares but instead only returns
the value to the calling program through the argument RESULT.
If you set PRINTSS to 2, the subroutine prints all the
printable digits of the computed sum of squares in E-notation
for possible comparison against other computed values.  If
PRINTSS is 1 or 2, the subroutine also prints the results of
tests of the integrity of the projection matrix PM.  (I
describe the tests below.)

The argument PRINTH controls whether the subroutine prints
the following three intermediate results:
- the hypothesis matrix H
- the projection matrix PM and
- the projection, yp, of the response data vector.
If you set PRINTH to 1, the subroutine prints the
intermediate results.  If you set PRINTH to 2, the subroutine
prints the intermediate results but prints the hypothesis
matrix and the projection of the response vector transposed,
which can sometimes save space in the output.  If you set
PRINTH to zero, the subroutine does not print intermediate
results.
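For example, assuming y, XE, and XR have already been defined
as described above, the following hypothetical statement
computes the sum of squares using the generalized-inverse
method (LEVEL = 2), prints it in 25.15 format (PRINTSS = 1),
and suppresses the intermediate results (PRINTH = 0):

  run SS(result, y, XE, XR, 2, 1, 0);

(Note that, as discussed below, the subroutine erases XE and
XR, so they are no longer available after the call.)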
SUBROUTINE NOTES
The numerical method of computing sums of squares used by
this subroutine is efficient for supporting the conceptual
approach to analysis of variance sums of squares of computing
the difference between the residual sums of squares of two
overparameterized model equations.  However, the method is
generally not the most efficient method in terms of
- economy of computation in memory required
- economy of computation in time required
- minimization of roundoff errors for large datasets or
  ill-conditioned data.
Nevertheless, the subroutine generally yields highly accurate
sums of squares in minimal time and is therefore more than
adequate for most applications that cannot be handled by a
general analysis of variance program.

One inefficiency of the method used by this subroutine
relates to the inclusion of identical rows within each of the
following five arrays:
- XE and XR, the relevant submatrices of the design matrix
- H, the hypothesis matrix
- PM, the projection matrix
- yp, the projection of vector y.
That is (as noted above), for any given cell in the table
that summarizes the layout of the experiment, all the rows in
each of the five arrays that correspond to that cell are
(within the array) identical -- one row for each value of the
response variable associated with the cell.  Sunwoo and Kim
(1997) discuss an approach to analyzing unbalanced
experiments that eliminates this duplication.  For very large
experiments, this subroutine could be enhanced to take
account of the Sunwoo and Kim approach, thereby substantially
increasing the computational efficiency (at the expense of an
increase in complexity).

(The enhancement might work as follows:  Compute y1, which is
an m-dimensional column vector containing the cell means [or
possibly cell totals] of the values of the response variable,
where m is the number of cells in the experiment.  Compute
XE, XR, H, and PM as above, but on the basis of an equivalent
experiment with only one observation per cell.  Compute yp1
as the projection of y1 by PM.  Scale yp1 with the Sunwoo and
Kim T matrix to [in effect] yield yp, and compute the desired
sum of squares from yp as before.)

The hypothesis matrix H I use in this subroutine has
dimensions N x DF.  This is the transpose of the hypothesis
matrix that Ronald Hocking discusses in his book (1985, 153)
and that I discuss in a paper (1997, appendix C).  I have
used the transpose because it yields slightly simpler
notation.

To conserve memory, the subroutine erases XE and XR after
using them.

If you wish to run this subroutine on an EBCDIC system (e.g.,
an IBM mainframe), see the note near the end.

SUBROUTINE EXECUTABLE STATEMENTS
First, check the arguments passed to the subroutine and stop
if a problem is found.  */

if level ^= 1 & level ^= 2 then do;
  print '***ERROR*** in call to SS subroutine.';
  print 'Value of LEVEL is' level;
  print 'LEVEL must be 1 or 2.  Execution terminated.';
  abort;
  end;

if printss ^= 0 & printss ^= 1 & printss ^= 2 then do;
  print '***ERROR*** in call to SS subroutine.';
  print 'Value of PRINTSS is' printss;
  print 'Value must be 0, 1, or 2.  Execution terminated.';
  abort;
  end;

if printh ^= 0 & printh ^= 1 & printh ^= 2 then do;
  print '***ERROR*** in call to SS subroutine.';
  print 'Value of PRINTH is' printh;
  print 'PRINTH must be 0, 1, or 2.  Execution terminated.';
  abort;
  end;

if type(y) = 'U' | type(XE) = 'U' | type(XR) = 'U' then do;
  print '***ERROR*** in call to SS subroutine.';
  string = 'One or more of the matrices y, XE';
  string = string + ', and XR do not exist.  You must ';
  string = string + 'specify the three matrices before ';
  string = string + 'calling the SS subroutine.';
  print string;
  print 'Execution terminated.';
  abort;
  end;

n = nrow(y);
if nrow(XE) ^= n | nrow(XR) ^= n then do;
  string = '***ERROR*** in call to SS subroutine.  ';
  string = string + 'Discrepancy found between the ';
  string = string + 'number of rows in y, XE, and XR:';
  print string;
  nrow_y = n;
  nrow_XE = nrow(XE);
  nrow_XR = nrow(XR);
  print nrow_y nrow_XE nrow_XR;
  print 'Execution terminated.';
  abort;
  end;

/* Compute Searle's M1, as defined in his (76) on pages 263
and 318 of his book (1987), as

  M1 = I(n) - XR * ginv(XR)

where I(n) = the identity matrix (of dimension n x n, with
1's on the diagonal and zeros elsewhere).  Then compute the
hypothesis matrix

  H = M1 * XE.

Note that XE in this subroutine is equivalent to Searle's X2
and XR is equivalent to Searle's X1.

The following statements perform the above arithmetic but
bypass the intermediate step of computing M1.  The chosen
method depends on the value of LEVEL.  */

if level = 1 then H = XE - XR * inv(XR` * XR) * XR` * XE;
else H = XE - XR * ginv(XR) * XE;

/* Since they are no longer needed, erase XE and XR to
conserve memory.  Note that erasing the two matrices here
means that after returning from this subroutine the values in
the matrices will not be available in the calling program.
To conserve memory, other matrices are also erased below as
soon as they are no longer needed.  */

free XE XR;

/* Compute the projection matrix PM using the appropriate
method, as determined by LEVEL.  */

if level = 1 then PM = H * inv(H` * H) * H`;
else PM = H * ginv(H);

if printh = 0 then free H;

/* Compute the projection of y.  PM is retained if it is
needed below, either for printing (PRINTH > 0) or for the
integrity checks (PRINTSS > 0).  */

yp = PM * y;

if printss = 0 & printh = 0 then free PM;

/* Compute the desired sum of squares as the squared length
of the projection yp.  */

result = ssq(yp);

if printh = 0 then free yp;

/* If requested, print the computed sum of squares.  */

if printss = 1 then print result [format=25.15];
if printss = 2 then print result [format=e23.];

/* If requested, print the intermediate results:
- the hypothesis matrix
- the projection matrix
- the projection of y.
Print the hypothesis matrix and the projection of y
untransposed or transposed, depending on the value of PRINTH.
*/

if printh = 1 then do;
  print , 'Hypothesis matrix H = XE - XR * ginv(XR) * XE:', H;
  print , 'Projection matrix PM = H * ginv(H):', PM;
  print , 'Projection of y: yp = PM * y:', yp;
  end;
if printh = 2 then do;
  print ,
    'Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:',
    (H`);
  print , 'Projection matrix PM = H * ginv(H):', PM;
  print , 'Transpose of projection of y: yp = PM * y:', (yp`);
  end;

/* If the argument PRINTSS requested printing of sums of
squares then, as a way of (indirectly) checking the accuracy
of the subroutine, perform checks on PM to see how close it
is to having the following properties:
- symmetric
- rows sum to zero
- idempotent.  */

if printss > 0 then do;

/* The checks are done by first computing the largest
relative difference between corresponding values in two
matrices that should be identical.  The relative difference
between two matrices P and Q (called the "relative error") is
defined as

  E = (P - Q) / P.

Both the subtraction and the division in the above expression
are done on an element-by-element basis.  Thus E is itself a
matrix with the same dimensions as P and Q.  The subroutine
computes L, the largest absolute value of the elements in E,
as a measure of the equality of the two matrices.  The
subroutine then converts L to a rough number of digits of
accuracy with the formula

  ndigits = -log10(L).

First generate PMD, the divisor matrix for two of the
relative errors.  PMD is simply PM, except that we ensure
that none of the values in PMD is zero.  */
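/* As a hypothetical numerical illustration of the NDIGITS
formula above: if the largest absolute relative error L
between PM and its transpose is 1e-13, then
ndigits = -log10(1e-13) = 13; that is, the two matrices agree
to roughly thirteen significant digits.  */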
if all(PM ^= 0) then PMD = PM;
else do;
  PMD = PM + ((PM = 0) * .6e-78);
  n = sum(PM = 0);
  string = 'elements of PM were zero.  .6E-78 ';
  string = string + 'was added to these elements ';
  string = string + 'to avoid dividing by zero in ';
  string = string + 'computing relative errors.';
  print n string;
  end;

/* Second, set DECACC, which is the number of decimal digits
of accuracy carried in the computations.  This value depends
on the computer operating system and language used to run the
subroutine.  For my (Windows) computer the correct value with
SAS is 16.  */

decacc = 16;

/* Third, compute the maximum absolute relative difference
between corresponding elements in PM and its transpose as a
measure of the symmetry of PM.  */

mrelerrS = max( abs( (PM - PM`) / PMD ) );
if mrelerrS = 0 then ndigitsS = decacc;
else ndigitsS = -log10(mrelerrS);

/* Fourth, compute the maximum absolute relative error
between the sum of the elements in a row of PM and zero; use
the average of the absolute values of the elements in the row
as the divisor.  (See page 47 of the IML manual for the
subscript reduction operators [SAS Institute Inc. 1989].)  */

D = abs(PM);
mrelerrR = max( abs( PM[,+] / D[,:] ) );
if mrelerrR = 0 then ndigitsR = decacc;
else ndigitsR = -log10(mrelerrR);

/* Fifth, compute the maximum absolute relative difference
between corresponding elements in PM and PM-squared as a
measure of how close PM is to being idempotent.  */

mrelerrI = max( abs( (PM - (PM * PM)) / PMD ) );
if mrelerrI = 0 then ndigitsI = decacc;
else ndigitsI = -log10(mrelerrI);

/* Print the computed numbers of digits of accuracy.  */

print ndigitsS [format=5.1] ndigitsR [format=5.1]
      ndigitsI [format=5.1];

end;  /* of if printss > 0 then do */

/* SUBROUTINE NOTE FOR USERS OF EBCDIC COMPUTER SYSTEMS
The NOT operator (^) above in this subroutine is the correct
operator for ASCII systems.  If the subroutine is run on an
EBCDIC system (e.g., an IBM mainframe), you may have to
change each occurrence of ^ to the EBCDIC logical NOT
operator, which looks like a minus sign with a short vertical
bar dropping down from the right end (and which is EBCDIC
hexadecimal 5F).

SUBROUTINE REFERENCES

Aldrich, J. 1998. Doing least squares: Perspectives from
  Gauss and Yule. _International Statistical Review_ 66,
  61-81.

Graybill, F. A. 1983. _Matrices with Applications in
  Statistics._ 2d ed. Belmont, CA: Wadsworth.

Harville, D. A. 1997. _Matrix Algebra From a Statistician's
  Perspective._ New York: Springer-Verlag.

Hocking, R. R. 1985. _The Analysis of Linear Models._
  Monterey, CA: Brooks/Cole.

Macnaughton, D. B. 1997. Which sums of squares are best in
  unbalanced analysis of variance. Available at
  http://www.matstat.com/ss/

Macnaughton, D. B. 1998a. PR0139.HTM: Computing numerator
  sums of squares in unbalanced analysis of variance: Two-way
  case. Available in the "Computer Programs" section at
  http://www.matstat.com/ss/

Macnaughton, D. B. 1998b. PR0165.HTM: Computing numerator
  sums of squares in unbalanced analysis of variance:
  Three-way case. Available in the "Computer Programs"
  section at http://www.matstat.com/ss/

SAS Institute Inc. 1989. _SAS/IML Software: Usage and
  Reference, Version 6, First Edition._ Cary, NC: author.

Searle, S. R. 1987. _Linear Models for Unbalanced Data._ New
  York: Wiley.

Sunwoo, H., and B. C. Kim. 1997. Analysis of the unbalanced
  linear model based on the balanced model. _Journal of
  Statistical Computation and Simulation_ 56, 373-385.

Yates, F. 1934. The analysis of multiple classifications with
  unequal numbers in the different classes. _Journal of the
  American Statistical Association_ 29, 51-66.
*/

finish SS;

/* end of subroutine SS, version of June 19, 1998
***********************************************************/