Subject: Re: Experimental Research In Education: The Most Exciting Talk at the 2005 Joint Statistical Meetings
To: EdStat E-Mail List, ApStat E-Mail List, Teaching Statistics E-Mail List, sci.stat.edu Usenet Newsgroup
From: Donald B. Macnaughton <email@example.com>
Date: Sunday February 11, 2007
-----------------------------------------------------------------

Some Basic Ideas of Science

This post replies to comments by Mike Palij. But first, to support the discussion, it is helpful to propose definitions of some basic ideas of science, as follows:

   EMPIRICAL RESEARCH is any activity in which measurements
   (observations) are gathered from some area of experience and
   then reasonable conclusions are drawn from the measurements.

Measurements in empirical research are usefully viewed as reflecting the values of VARIABLES, which reflect the measured values of properties of entities. (Entities may be people, other living things, physical objects, or any other type of object or thing.)

   An EMPIRICAL RESEARCH PROJECT (or logical sub-unit of an
   empirical research project) is usually usefully viewed as
   studying the relationship between a single response variable
   and one or more predictor variables in the entities in the
   population of interest. This study is performed by analyzing
   the measured values of the variables obtained from a sample of
   entities from the population.

   An EXPERIMENT is an empirical research project in which at
   least one of the predictor variables is “manipulated” (i.e.,
   caused to take certain values in the entities in the sample)
   by the researcher.

   An OBSERVATIONAL RESEARCH PROJECT is an empirical research
   project in which none of the predictor variables is
   manipulated by the researcher; the values of the predictor
   variables are simply observed.

I discuss the ideas in more detail in a paper (2002).

The description of an empirical research project is qualified with the word “usually”.
This implies that some empirical research projects (or sub-units) don’t satisfy the description -- that is, they can’t be reasonably viewed as studying the relationship between a single response variable and one or more predictor variables. My experience suggests that research projects that can’t be reasonably viewed as satisfying the description appear in less than three percent of published empirical research reports (in science, technology, and business). I discuss some of these research projects in the paper (2002, Appendix I.2). For economy of words, this small group of research projects is mostly ignored in the discussion below.

Novelty of Experiments in Education

Quoting my August 25, 2005 post, Mike Palij wrote (on August 26 in the EdStat e-mail list):

< snip >
> Experimental studies in education are not a new idea.

I agree. I suspect that the idea of performing proper experiments in education can be traced back to the 1920s or 1930s, when Fisher, Neyman, Pearson, and other statisticians first explained and debated the idea of a proper experiment. However, proper experiments in education haven’t often been performed.

What is a “proper” experiment? Here is a reasonable definition:

   An experiment (or randomized trial) is a PROPER EXPERIMENT if
   it has been performed according to widely accepted principles
   of scientific practice, experimental design, and data
   analysis, as described by Bailar and Mosteller (1992), Box,
   Hunter, and Hunter (2005), Fleiss (1986), Kirk (1995), Winer,
   Brown, and Michels (1991), and many others.

I suspect that proper experiments haven’t often been performed in education for two reasons: (a) proper experiments in education are difficult, and (b) perhaps simply due to tradition, researchers in education have lacked a strong connection to the ideas of experimental research.
Problematic Nature of Experiments in Education

< snip >
> There are a number of reasons why experiments are problematic
> in educational settings

I fully agree. Some key problems in performing a good education experiment to compare teaching approaches are:

1. What is the RESPONSE VARIABLE (or variables) that we will use to compare the teaching approaches? Will it be marks, or attitudes, or aptitudes, or some other measure of the students (or the classes of students) under study?

2. What are the PREDICTOR VARIABLES that we will measure in the research? One predictor variable will reflect the different teaching approaches under study. What other predictor variables (e.g., students’ age or gender) should we measure to assist our understanding? Is it reasonable to use students’ race as a predictor variable?

3. How can we design the experiment in a way that will ELIMINATE teacher effects and other REASONABLE ALTERNATIVE EXPLANATIONS of any significant differences we find in the values of the response variable between the teaching approaches?

4. How can we design the experiment so that it is as LIKELY as possible that we will FIND STRONG EVIDENCE of the relationship we are looking for between the response variable and the predictor variable(s) (assuming that the relationship actually exists)?

5. How can we RECRUIT TEACHERS to participate in the experiment?

6. How can we OBTAIN FUNDING to pay for the experiment?

7. After we have performed the experiment, how can we ANALYZE the RESULTS and DRAW SCIENTIFICALLY VALID CONCLUSIONS?

8. While properly addressing the preceding seven problems, how can we MINIMIZE the COSTS of the experiment?

Appendix A expands the eight problems and discusses some solutions.

Necessity of Experiments in Education

< snip >
> As for the necessity of experimental research to provide the
> basis for either (a) science or (b) valid conclusions, the
> astronomers say “Hi!
> We’ve been engaging in science for hundreds of years without
> having performed a single experiment!”

Mike suggests that if astronomers don’t perform experiments, then education researchers also don’t need to perform experiments. I think that this is a thought-provoking point, but an invalid argument. The argument is invalid because astronomers CAN’T perform experiments: they can’t manipulate distant astronomical events. If they could manipulate these events (at a reasonable cost), it seems doubtless that they would. That is, they would perform proper experiments just like scientists in disciplines that regularly perform experiments, such as most branches of physics, chemistry, engineering, medicine, biology, and psychology.

The fact that astronomers can’t perform astronomical experiments makes astronomy an “observational” discipline. Other observational disciplines that generally can’t perform experiments (due to the remoteness or untouchability of the phenomena they study) include anthropology, archaeology, economics, epidemiology, geology, paleontology, and some areas of sociology. Such observational disciplines base their inferences on careful observational empirical research, often studying relationships between variables. (In historical disciplines, observational research is sometimes [due to the paucity of data] reduced to careful consideration of physical or anecdotal information about entities, properties of entities, or variables, without focusing on the concept of ‘relationship between variables’.)

Proper observational research often enables us to reliably PREDICT the values of the response variable (in new situations), which is an important benefit. However, in education research we would generally like to learn how to reliably CONTROL the values of the response variable (in new situations). That is, we would like to learn how to structure an education program so that it provides students with the BEST education.
Unfortunately, observational research is almost always equivocal about control -- subject to multiple competing reasonable explanations.

For example, suppose we are presented with the results of an observational research project in education that suggests that a certain teaching approach A is better (in the sense of exhibiting significantly better average values in students of the chosen response variable) than another teaching approach B. In this case (due to the nature of observational research) it is almost always possible to find a reasonable alternative explanation of the research finding, and this explanation implies that approach A may NOT be better than approach B. But if we find such an explanation, this implies that the research is equivocal. This means that the research is of substantially less value because it can’t reliably help us to decide which teaching approach is better.

(A reasonable alternative explanation in observational education research is often in terms of “confounding” of the teaching approaches under study in the research with some other aspect [i.e., variable] of the research situation. Then it is generally possible that this other variable can fully account for the difference between the average values of the response variable under the different teaching approaches. For example, an observational research project might confound two teaching approaches with two different schools -- school 1 uses teaching approach A, and school 2 uses teaching approach B. In this case, if we find a significant difference in the average values of the response variable between the two approaches, it is possible that the teaching approaches have no differential effect on the values of the response variable.
That is, [unless the School variable is appropriately (and expensively) taken account of in the design of the research] it is possible that a certain difference between the SCHOOLS caused the observed significant differences in the average values of the response variable [in students or classes] between the teaching approaches.)

In contrast, suppose that a proper EXPERIMENT provides good evidence that teaching approach A is better than teaching approach B. In this case the finding is unequivocal. This is because proper experiments are explicitly designed to eliminate confounding and other reasonable alternative explanations. Thus in this case we can safely (tentatively) conclude that approach A will be better than approach B in new situations (if the relevant conditions are sufficiently similar to those of the experiment). Thus proper experiments are preferred to observational research projects in education research.

(The equivocation in observational research relates to drawing conclusions about causation. That is, we are interested in whether teaching approach A CAUSES students to do better than teaching approach B. Evidence about relationships between variables obtained in observational research is generally equivocal about causation. In contrast, evidence about relationships obtained in proper experimental research is unequivocal about causation.)

Changing Attitudes Toward Experiments in Education

Mike noted that members of the respected American Educational Research Association (AERA) have carefully considered the issue of experimental research in education. Mike’s point is directly reflected in the opening sentence of the official description of the theme of the 2006 AERA annual meeting:

   Current social and political pressures on education research
   suggest that research must meet the demands of evidence-based
   and scientifically based inquiry (Ladson-Billings and Tate
   2006).
The idea of “current” pressures reflects the fact that the pressures on education research are new, having arrived over the last decade or so. The sentence implies that education researchers are moving toward “evidence-based and scientifically based” research. This suggests that education researchers should generally perform proper experiments (because observational research results are generally equivocal).

(Having acknowledged the importance of proper research, the description of the theme of the 2006 AERA meeting turns to the theme itself, which pertains to education research in the public interest -- education research that will “increase the commonwealth”. The discussion of the theme is available at http://www.aera.net/annualmeeting/?id=694 )

Opportunities for Experiments in Education

The preceding discussion suggests that (a) the area of experimental studies in education is only now beginning to open up and (b) this area will become the mainstream of education research as granting agencies and journal editors reinforce the point that proper experiments are preferred to observational research. Because the area is opening up, it has many opportunities for thoughtful researchers.

To perform a proper education experiment a researcher must be familiar with the principles of experimental design, power analysis, and (often) repeated measurements analysis of variance. Some education researchers are less familiar with these topics. They may find it helpful to follow the path of many medical researchers, who collaborate with a statistician with experience in the topics. To ensure that the research design is efficient, I recommend that this collaboration begin early in the design phase of the research.

Opportunities exist for statisticians to present courses to education researchers about the statistical and scientific aspects of education research. I propose topics for such a course in appendix B.
I believe that the movement toward experimentally based education research will yield a body of reliable research results that will substantially improve education.

Don Macnaughton

Donald B. Macnaughton
firstname.lastname@example.org

Appendices
Appendix A: Eight Problems in Experiments in Education
Appendix B: Courses About Experiments For Education Researchers
Appendix C: Can Human Performance or Behavior Be Predicted from a Person’s Race?
Appendix D: Specifying a Repeated Measurements Analysis of Variance

Appendix A: Eight Problems in Experiments in Education

The body of this post lists eight problems that arise in experiments in education. This appendix briefly expands the problems and discusses some general solutions.

Problem 1: Choosing the Response Variable

Suppose we are designing an education experiment to compare two teaching approaches. In choosing the response variable we can reasonably begin by answering the following question: What would we like the teaching approaches under study to do (accomplish)? That is, what is our teaching goal? This question can be answered at a high level by deciding which of the following is the main goal:

1. maximize student knowledge and understanding of the subject area
2. maximize certain student attitudes toward the subject area
3. maximize student aptitudes in the use of the subject area in practical applications
4. optimize some other property (or combination of properties) of the students.

The definition of the teaching goal in an education experiment depends on the particular topic or discipline being taught, on the type of students being taught, and on the course designer’s and researcher’s interests. The definition of the goal deserves careful attention because it lies at the heart of the research.
After we have defined the teaching goal, we can choose the response variable by addressing a second important question: How can we best measure the effectiveness of a teaching approach at satisfying the teaching goal?

The answer to this question defines the response variable. For example, if the goal is to maximize students’ knowledge and understanding, the main response variable in an education experiment will be a measure of each student’s knowledge and understanding, typically a weighted average of marks on assignments or tests. In this case it is important to devise a “fair” measure of students’ knowledge and understanding by devising “fair” assignments or tests -- a challenging but doable task.

Similarly, if the teaching goal is to maximize certain student attitudes toward the subject area of the course, the response variable(s) will be one or more measures of students’ attitudes. Attitudes are important because they play a key role in people’s decisions. It is generally unnecessary to devise a test of attitudes toward the subject area of a course because reliable standardized attitude tests (which can be administered in less than twenty minutes) are available for many subject areas.

(You can find tests of students’ attitudes toward a subject area by searching the Web of Science [available in some university libraries], other science literature databases, and the world wide web for articles and books that have the word “attitude” [or “attitudes”] and the name of your subject area in their titles or keywords. You may also be able to find acceptable generic attitude tests.
If you are studying a subject area in which an acceptable test of student attitudes toward the area is not available, and if you are comfortable with statistical ideas, you can study other attitude tests and then use the principles of attitude scale development [Aiken 2002; Krosnick, Judd, and Wittenbrink 2005] and the statistical procedure of exploratory factor analysis [Thompson 2004] to develop such a test.)

Similarly, if the teaching goal is to maximize students’ aptitudes (which is a key goal in courses that teach applied skills), the response variable will be a measure of students’ aptitudes. Standardized measures of aptitudes are available in some areas and can be found with the methods in the first sentence of the preceding paragraph.

Problem 2: Choosing the Predictor Variables

Choosing the predictor variables in an education experiment to compare teaching approaches requires first that we choose the two (or more) teaching approaches that we wish to compare. These teaching approaches define the values of the “teaching approach” predictor variable. This variable is manipulated in the students in the sense that some students receive one of the teaching approaches, and other students receive the other. Two reasonable teaching approaches to compare are (a) the traditional approach to teaching some topic or discipline and (b) the top contender to replace the traditional approach. The experiment stages a fair contest between the two approaches.

As noted above, in addition to choosing the main predictor variable, we must also decide which other variables of the situation under study we will measure. Generally, the more “relevant” predictor variables we measure in an experiment, the better the understanding we obtain of the relationship between variables we are studying. Thus it may be useful to measure each student’s age and gender.
It may also be relevant to measure students’ previous experience with the material, intelligence, socioeconomic status, and years of experience speaking the language that the course is taught in. Appendix C discusses the use of “race” as a predictor variable in research studying measures of human performance or behavior.

Problem 3: Eliminating Reasonable Alternative Explanations

The need to eliminate reasonable alternative explanations of a research finding stems from the sensible principle that good research must be unequivocal. Eliminating reasonable alternative explanations is difficult because many forms of reasonable explanations are possible, and some are hard to recognize. Thus experienced researchers spend considerable time trying to think of reasonable alternative explanations of research results, especially results of their own planned research. If the elimination of reasonable alternative explanations is properly done (generally through careful research design), it (mostly) eliminates the possibility that the associated research conclusion will be ruled invalid due to a reasonable alternative explanation that someone thinks of later.

Confounding alternative explanations are eliminated by random assignment of experimental entities to treatments. That is, in an education experiment we randomly assign students (or classes of students) to teaching approaches. Such random assignment helps to ensure (in a probabilistic sense) that the different teaching approaches are not confounded with schools or with many other variables.

For logistical reasons, random assignment is sometimes difficult in education research. In the case of randomly assigning students to different teaching approaches, the problem can often be solved (albeit at some expense) by running both (or all) the teaching approaches together at the same time on the same day of the week in nearby (similar) facilities.
(If enough resources exist, this concurrent presentation of the set of treatments can be repeated on different days of the week or at different locations.) This means that students can be readily randomly assigned to the treatments (teaching approaches) without confounding because none of the treatments has extraneous time or location advantages.

I recommend that the researcher (a) discourage students from switching between classes, but permit such switches, and (b) identify any students who switch, perhaps to be with a friend. The students who switch classes after the course has begun can be studied and then (due to unrepresentativeness) be omitted from the analysis.

Problems 4, 7, and 8: Power, Analysis, and Minimizing Costs

The fourth, seventh, and eighth problems respectively pertain to maximizing the power of the statistical tests in an experiment, analyzing the data obtained in the experiment, and minimizing the costs of the experiment. Detailed technical help with these problems is available from the field of statistics, which has efficient general methods for the design and analysis of powerful but inexpensive experiments. I give an introduction to the methods of statistics in the paper (2002).

One can get practical help with problems 4, 7, and 8 by studying the medical research techniques of multicenter clinical trials. These techniques can help because proper experiments in education must be performed in parallel at several teaching institutions (or at least in several classes) to eliminate teacher effects and to provide power and generality. This is similar to how medical research is performed in parallel in several hospitals in a multicenter clinical trial.

(Technical Aside: In medical research the patient is often the entity [unit] of analysis, but in education research the class of students is often the [implicit] entity of analysis.
A class of students in education research is analogous to a hospital [or other grouping] of patients in a multicenter clinical trial. A researcher can often substantially increase the power of statistical tests in education research [with often only a minor increase in costs] by designing the research so that the student [instead of the class] is the entity of analysis. This can be done by using a pre-post [i.e., repeated measurements] experimental design. That is, the value of the response variable is measured in each student in the experiment both before and after the students experience their assigned teaching approach. Such a design is feasible with many but not all response variables.)

Problem 5: Recruiting Teachers

Recruiting teachers or teaching departments to participate in experimental research in education is facilitated if the researcher explains how the research will provide important educational benefits. If this is carefully done, and if the disruption of the existing program is not too great, appropriate participants can generally be recruited, just as appropriate medical personnel at different hospitals are recruited as a first step in a multicenter clinical trial.

Problem 6: Obtaining Funding

To obtain research funding a researcher submits a carefully written grant proposal to an appropriate funding agency. The proposal competes with other proposals for funds from the pool of funds distributed by the agency. Research proposals are judged on the following criteria:

- the reasonableness of the hypothesized phenomenon that the research will study
- the clarity of thinking in the rationale and implications of the research
- the potential of the hypothesized phenomenon to make a worthwhile contribution to the field under study, and
- the conformity of the proposal to the correct style and protocol for grant proposals to the agency.

The research projects whose proposals best satisfy the above criteria generally receive funding.
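As a closing illustration for this appendix, Problem 7 (analyzing the results) can be made concrete with a minimal sketch of the core analysis-of-variance arithmetic for a two-approach experiment. This sketch is not from the original post: the marks below are invented, and a real analysis would use standard statistical software rather than hand-rolled code.

```python
# Sketch: the arithmetic behind a one-way analysis of variance
# comparing marks under two teaching approaches.
# The marks below are invented for illustration.
from statistics import mean

marks = {
    "approach_A": [72.0, 78.0, 81.0, 69.0, 75.0, 80.0],
    "approach_B": [65.0, 70.0, 68.0, 74.0, 66.0, 71.0],
}

groups = list(marks.values())
grand_mean = mean(x for g in groups for x in g)
n_total = sum(len(g) for g in groups)
k = len(groups)

# Between-groups sum of squares: variation of the group means
# around the grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Within-groups sum of squares: variation of individual marks
# around their own group mean.
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

df_between = k - 1
df_within = n_total - k
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

The F statistic compares the variation between the teaching approaches to the variation among students within each approach; statistical software would also report the associated p-value.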
Appendix B: Courses About Experiments For Education Researchers

I recommend that courses about experimental research for education researchers discuss the eight problems discussed in appendix A in terms of examples of good and bad education experiments. I recommend that teachers discuss real or realistic experiments (as opposed to abstract experiments) because realistic experiments enable students to consider the specific research goal of each example. Considering researchers’ goals helps students to formulate the goals of their own research.

In discussing a “good” education experiment it is important to convey to students the practical benefits that are provided by the results, because these benefits generally justify the care taken to perform the experiment.

In view of the usefulness of a pre-post experimental design for increasing power, I recommend that this type of design, and the proper analysis of the results of this type of experiment, be discussed in detail.

Discussion of data analysis can best omit all mathematical concepts and focus on interpreting the output from the computer analysis of the data. Most experiments to compare teaching approaches have a continuous, reasonably well-behaved response variable (e.g., marks or attitude scores), and the main predictor variable (i.e., “teaching approach”) is discrete. Thus the results of these experiments are best analyzed with analysis of variance (which becomes repeated measurements analysis of variance if a pre-post design is used).

(Repeated measurements analysis of variance is also called “mixed model” analysis because the right-hand side of the model equation of the relationship between the variables contains a mixture of (a) “fixed” terms (associated with the predictor variables) and (b) “random” terms (associated with unaccounted-for variation in the values of the response variable). However, I prefer the term “repeated measurements” because it is more intuitive for beginners.)
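The power benefit of the pre-post design can be shown with a small simulation. This sketch is not from the original post: the sample size, mark scale, and variance components are invented assumptions. The point it illustrates is that analyzing per-student gains removes the large stable between-student component of variance, so the same treatment effect stands out against much less noise.

```python
# Sketch: why a pre-post (gain-score) analysis can raise power.
# Sample size, effect size, and variance components are assumptions.
import random
import statistics

random.seed(1)

N = 200          # students per teaching approach
EFFECT = 3.0     # assumed true advantage of approach A, in marks

def simulate_group(effect):
    """Return (pre, post) marks; post = pre + effect + small noise."""
    pre = [random.gauss(70, 10) for _ in range(N)]        # stable ability
    post = [p + effect + random.gauss(0, 4) for p in pre]  # within-student noise
    return pre, post

pre_a, post_a = simulate_group(EFFECT)  # approach A
pre_b, post_b = simulate_group(0.0)     # approach B (no advantage)

# Post-only analysis: the large between-student spread (sd ~ 10) is noise.
sd_post = statistics.stdev(post_a + post_b)

# Gain-score analysis: stable ability cancels, leaving only the
# within-student noise (sd ~ 4).
gains_a = [po - pr for pr, po in zip(pre_a, post_a)]
gains_b = [po - pr for pr, po in zip(pre_b, post_b)]
sd_gain = statistics.stdev(gains_a + gains_b)

gain_diff = statistics.mean(gains_a) - statistics.mean(gains_b)
print(f"sd seen by post-only analysis:  {sd_post:.1f}")
print(f"sd seen by gain-score analysis: {sd_gain:.1f}")
print(f"estimated advantage of A:       {gain_diff:.1f}")
```

Because the gain-score analysis faces much less residual variation, the same true advantage yields a larger test statistic, which is the sense in which the pre-post design increases power.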
I recommend that discussion of data analysis be in terms of examples of good experiments in education research. I recommend that the discussion cover the following topics:

1. For each example, a discussion of the research hypothesis (or research question) under study, a discussion of the conduct of the experiment, and a discussion of the layout of the data table that was obtained in the experiment and that is the basis of the data analysis.

2. What the computer (software) must be told in order to perform the analysis of variance. We must tell it
   - the location of the data table (often in a file on the computer)
   - which variable in the data table is the response variable in the analysis
   - which variable(s) in the table is (are) the predictor variable(s)
   - in repeated measurements experiments, which of the predictor variables vary within experimental entities and which vary between experimental entities (or, equivalently, which variable uniquely identifies the experimental entities)
   - in more complicated experiments, details of the relationship between variables we are studying, such as the hypothesized form of the model equation of the relationship.

3. How to tell the computer what it must be told. This varies among different software products, and is discussed further in appendix D.

4. How to interpret each p-value in an analysis of variance table produced by the computer from the analysis of the data.

5. How to understand tables of means of the response variable for effects with low p-values, as produced by the computer.

6. How to graphically illustrate means of the response variable for effects with low p-values for ease of understanding of the results.

7. How to understand other output from the computer, such as (a) a measure of strength or “effect size” of each relationship between variables studied in the experiment and (b) the estimates of the values of the parameters of the model equation.

8. How the researcher must confirm (to avoid embarrassing errors) that the underlying assumptions of a statistical analysis are adequately satisfied before drawing final conclusions. (This includes (a) checking the univariate distribution of each of the variables for anomalies and (b) appropriate verification that the data have certain necessary properties. The specific properties depend on which statistical procedure is used to study the relationship.) These assumptions are often, but not always, adequately satisfied.

9. How to interpret the analysis in terms of the research hypothesis.

Some statistics courses omit or minimize the first and last topics, which pertain to the research hypothesis. This omission occurs because some courses are more focused either on data analysis or on statistical theory, which are both vast topics. However, consideration of the implications of the research for the research hypothesis is clearly important in courses aimed at education researchers.

In high-level terms, the results of a data analysis have one of three implications for the research hypothesis:

1. the results support the research hypothesis (and there is no reasonable alternative explanation of the results)
2. the results neither support nor refute the research hypothesis (either because the research found no good evidence of the sought-after relationship between variables or because a reasonable alternative explanation of the results is available)
3. the results refute (contradict) the research hypothesis.

Because of the positive way that research hypotheses are framed (e.g., drug D reduces cancer), researchers performing a research project almost always hope that the first outcome will occur. That is, they hope that the results of the research project will support their research hypothesis. In this case, if the hypothesis was carefully thought out, the finding will make a contribution (perhaps substantial) to the associated field of study.
Unfortunately, the second outcome sometimes occurs, perhaps because the research hypothesis is false, or because the research was poorly designed, or due to the whims of chance. Although the third outcome is generally possible, it rarely occurs in practice.

After students understand the basic ideas of drawing conclusions from data analysis, I recommend that they learn how to use the computer to generate realistic artificial data. Such data generation is not difficult if students are given the appropriate instructions (or program templates) and if they are shown how to generalize the instructions as necessary.

The ability to generate realistic artificial data has three benefits. First, generating artificial data helps students to understand the postulated model equation of the relationship between the variables that is under study. This understanding comes because generating data is most easily understood through writing a simple computer program that substitutes the values of the predictor variables into the model equation and then evaluates the equation to generate the predicted value of the response variable. (This evaluation is done repeatedly through the use of program loops and with one or more random number generators to generate the values of the random term[s] in the equation.) From a theoretical perspective the model equation is the essence of a relationship between variables, and the students’ experience in programming the essence helps them to understand it.

Second, the ability to generate artificial data gives students a source of “tame” data, which they can then analyze with data analysis procedures. Because the students generate the data, they know exactly what relationship(s) is (are) present (or not) in the data. This allows them to see how the data analysis procedures work to detect and characterize relationships between variables. This helps students to develop knowledge and trust of the procedures.
Third, students can be encouraged to use their ability to generate artificial data when designing their own research projects. That is, during the design phase of a research project students can generate sets of realistic artificial data that resemble the data they expect to obtain in the research. Then they can carefully analyze these artificial data with the planned data analysis procedure. This gives students a thorough review of the planned research design and data analysis procedure before the design and analysis are put into practice. Such a review often helps to eliminate serious problems later and to increase efficiency, especially for beginning researchers.

(Statistical software vendors can help users to generate artificial data by providing easy-to-follow instructions for generating data for all the common types of relationships between variables. I recommend that these instructions be placed prominently in the software documentation so that beginners can easily find this important resource. A component of the software that can generate a data table from fill-in-the-blanks specifications might also be helpful for beginners.)

I believe that the topics in this appendix, when developed in appropriate detail, give prospective researchers a reasonable introduction to how to perform experiments in education research.

Appendix C: Can Human Performance or Behavior Be Predicted from a Person's Race?

The Model Equation of a Relationship Between Performance and Predictors of Performance

Suppose that the variable y is a particular measure of human performance or human behavior. For example, y might reflect students' grade point averages or it might reflect some measure of athletes' ability. Under the scientific approach we think that y "depends" on a number of other variables.
For example, we think that each student's grade point average probably depends on the student's intelligence, on their motivation, on their parents' style of parenting, on the attitudes of their friends, perhaps on their diet, and on various other variables.

The relationship between y and the other variables can be written in a general model equation as

   y = f(x1, x2, ..., xn) + ε.

The performance measure, y (e.g., grade point average), is the response variable in the relationship, and x1, x2, ..., xn are the relevant predictor variables (e.g., intelligence, motivation, etc.). The notation f(...) stands for a mathematical function that outputs the estimated value of y for a person when the values of the x's for the person are substituted into it. [The detailed mathematical form of f(...) is discovered through analysis of relevant empirical research data.] The symbol n indicates the number of predictor variables under consideration, which is typically between one and five.

The Greek letter ε (epsilon) at the right end of the equation is the "error" term. It takes account of the fact that f(...) generally can't perfectly predict the actual measured value of y for a person from the x's. The error term is a "random" variable because it has a different, seemingly random value each time the equation is applied. Usually ε is sensibly modeled as being greater than zero half the time and symmetrically less than zero half the time, so it has an average value of zero. Invariably ε is modeled as being more often closer to zero than farther away.

(Technical Aside: In any particular [standard] instance of a variable [and at a given time] the variable [just like the property behind the variable] has a single value. [For simplicity, I ignore here (a) the idea that a particular value of a variable may be "missing" and (b) the more general but infrequent case of variables that are vectors.]
A variable can be classified as being either a continuous variable or a discrete variable. If a variable is continuous, its value in a particular instance can theoretically be any value between the minimum and maximum permissible values. Continuous variables almost always have numeric values. For example, grade point average is a continuous variable that for a given student [in some schools] can have any value between 0.00 and 4.00, such as 3.82. The values of variables that are obtained from conventional measuring instruments [of any type, e.g., ruler, stopwatch] are usually continuous variables. [The values of any continuous variable are limited to a certain maximum number of significant digits (often between two and four) due to limitations in the accuracy of the measuring instrument that is used to measure the values.]

In contrast, if a variable is discrete, its value in a particular instance can be one of only a limited number of different values, usually fewer than thirty, and sometimes as few as two. [Discrete variables can be ordinal -- with an implicit ordering -- or categorical.] For example, the variable "likes to dance" is reasonably viewed as an [ordinal] discrete variable that for a given person has one of a range of five [or perhaps seven] possible values indicating different levels of liking or disliking to dance. [The limitation to five or seven values generally occurs in raw variables that reflect human judgments or opinions, as discussed by Miller, 1956.] For simplicity, the discussion in this appendix assumes that a continuous response variable is always used in research projects because using continuous response variables is [when feasible] the more efficient and more common approach.
If the response variable in a particular empirical research project is discrete, some of the technical ideas behind model equations of relationships between variables change, but the main principles in this appendix still apply, as shown by the theory of generalized linear models [McCullagh and Nelder, 1989].)

We (i.e., society) can use the model equation of a (properly verified) relationship between variables to help us predict, and sometimes control, the values of the response variable in a new situation on the basis of measuring or controlling the values of the predictor variables in the situation. If the variables are carefully chosen, the ability to predict or control can be of substantial value. For example, if we can find an (ethical) method to control (i.e., raise) grade point averages in students by controlling the values of other variables, we can use the method to help students to excel.

As research into a relationship between variables advances, more predictor variables may be discovered that can be (correctly) included in the function f(...) for the relationship, which makes the predictions (or control) made by the function more accurate. As the predictions become more accurate, the average (absolute) size of the error term ε in the model equation for the relationship becomes accordingly smaller. In some areas of research (e.g., in many areas of the hard sciences) the model equations make almost perfect predictions. Thus the error terms in these equations are so small that they are often sensibly ignored.

The Concept of 'Race'

The discussion below uses the concept of 'race'. Most people above the age of ten or so have a reasonable intuitive understanding of this concept in the sense that they can reliably (though not perfectly) assign themselves and other people to racial categories.
(The assignments are "reliable" in the sense that different people generally agree with each other [at a mutually acceptable level of classification] about the assignments.)

Although most people understand the concept of 'race' at an intuitive level, formal definitions of the concept are difficult. The definitions break into three classes:

1. definitions in terms of a person's biological ancestry (e.g., in terms of classification of the person's genetic DNA)

2. definitions in terms of a person's self-reported race

3. definitions in terms of a person's observable attributes such as skin color, hair color, facial characteristics, and speech characteristics.

Each of the three classes contains various definitions of race. Each definition provides (at least in theory) a way of assigning people to racial categories. The categories are usually discrete categories (e.g., Asian, Black, Mixed, Native American, White, Other) rather than reflecting one or more continuous scales. For example, using the self-report approach a researcher might ask each person studied in a research project which of the above six racial categories they belong to. Thus race would be defined and measured in terms of the six categories. A second researcher might define race in terms of eight or ten or even more categories reflecting the many identifiable groups of people in the world.

The various definitions of race are closely associated, but they are different because assignments to racial categories under one definition will sometimes disagree with assignments under another. For example, a person may ancestrally belong in whole or in large part to one race, but may report or appear as belonging to another.

Definitions of race in the first class (biological ancestry) are generally preferred to definitions in the other two classes because the first class seems basic, and the other two classes seem to be merely less accurate reflections of it.
However, definitions in the first (and third) class are often difficult to implement in practice in research. (Some of the difficulties arise from respected ethical considerations.) Thus if a research project is performed on people in which each person's race is measured, the researcher will often measure race in terms of self-reported race.

Is Performance "Causally Dependent" on Race?

Suppose that a research project is carried out to study the relationship between (a) a measure of human performance (or behavior) as the response variable and (b) a set of other variables that are predictor variables. The predictor variables may reflect a person's attributes and may reflect manipulations applied to the person in an experiment. Suppose the researcher includes in the research a predictor variable that reflects race (or ethnicity). And suppose that the research project finds good evidence of a relationship between performance and race -- the average level of performance of people from one race is significantly higher than the average level of performance of people from another race.

This raises a key question: Can we conclude from this relationship between variables that human performance depends to some extent on one's race? In other words, can we conclude that differences in race cause differences in performance?

For example, evidence exists that Asians score somewhat higher (on average) on intelligence tests than Whites, who in turn score somewhat higher (on average) than Blacks. Does this imply that a person's intelligence depends (partly) on their race?

No. The evidence of the relationship doesn't imply dependence or a causal relationship because this relationship between variables is invariably studied with observational research, as opposed to proper experimental research. Observational research must be used because it is impossible in a practical sense to manipulate "race" in a proper experiment.
That is, unlike assigning treatments to people (or people to treatments) in an experiment, a researcher can't assign races to people (or people to races) because race has already been assigned. Because research projects studying the relationship between performance and race are invariably (in that aspect) observational research projects, the results of the research are open to reasonable alternative explanations, as discussed in problem 3 in appendix A.

For example, the relationship between intelligence test scores and race was found through observational research. This relationship could easily be accounted for by other causal variables that are confounded with race and that have (unfortunately) been omitted from the analysis (typically because they are unknown or are deemed unimportant). For example, due to a history of oppression that began with Whites' enslavement of Blacks, many Blacks throughout the world have had less access to opportunities and resources, which might account for the differences in average intelligence test scores between Blacks and Whites. Thus if variables that properly reflect the relevant types of oppression (including relevant historical effects) are included in the analysis, the Black/White aspect of the relationship between intelligence test scores and race might easily vanish. Similarly, Asians might score higher (on average) on intelligence tests than Blacks and Whites due to cultural childhood influences among Asians that emphasize disciplined logical thinking. Thus if a variable reflecting childhood encouragement of logical thinking is included in the analysis, this second aspect of the relationship between intelligence test scores and race might also vanish.

To minimize expensive errors, science demands unequivocal evidence of causation before causation can be inferred.
But, as noted, the results of research projects studying the relationship between performance and race are generally equivocal because they are open to reasonable alternative explanations. Therefore, it is generally scientifically impossible to infer that performance in humans is causally dependent on a person's race.

The word "generally" in the preceding paragraph indicates the possibility of exceptions. That is, it is conceivable that an ingenious researcher might find a way to perform a proper experiment or find a way to deal with all of the confounding variables and all of the alternative explanations. This researcher might still find good evidence of a causal relationship between a certain response variable reflecting performance and a person's race. Then we could conclude that a causal relationship exists between performance and race. However, the chance of this occurring is low because

1. Eliminating all reasonable alternative explanations would be very difficult or impossible.

2. Knowing that performance depends to a small extent on race is not of much obvious theoretical or practical use. Therefore, there is little scientific incentive to study this type of relationship. (In contrast, knowing that certain other response variables depend on race is sometimes quite useful, such as in the prevention and treatment of diseases.)

3. Knowing that performance depends on race may suggest to some people a basis for racial discrimination. Discrimination is undesirable because it usually harms everyone involved. Therefore, there is a social disincentive to study this type of relationship.

Can Performance Be Predicted in Individuals on the Basis of Their Race?

Although it may be true that performance doesn't depend on race, it is still true that certain relationships exist between performance and race. For example, as noted, a relationship exists between intelligence test scores and race.
Thus perhaps race could be used to predict (and thereby indirectly control) performance, even though a direct causal relationship between these variables may not exist. For example, Asians perform better (on average) on intelligence tests than Blacks or Whites. Therefore, a company president wishing to maximize the intelligence of the employees of the company might decide to hire only Asians as employees. Although such an approach to hiring is unethical, it is instructive to temporarily ignore the ethics and consider it from a strictly scientific point of view. Is it scientifically sensible to hire on the basis of a known relationship between some important measure of performance and race?

No. It is generally inefficient to predict individual human performance from race because other predictor variables are substantially more accurate. In particular, instead of using race, a company will hire more effective employees if it bases its hiring decisions on each job candidate's education and experience, together with the candidate's performance in an interview, and perhaps the candidate's performance on empirically valid aptitude tests. This approach to hiring selects more effective employees because any differences in performance between the races are (if present at all) very small when compared to the vast (and identifiable) differences in performance that occur within each racial group.

(Perhaps an employer could properly use education, experience, interview performance, and aptitude test scores in a model equation to predict job performance, but still reasonably include a predictor variable reflecting race in the equation. Perhaps including race would significantly improve the predictions made by the equation, even though all the other variables are also properly used in the equation. That is, including the other variables in the relationship might make the relationship between performance and race more "visible" in the analysis.
This is theoretically possible if a certain "interactive" type of relationship between variables occurs. However, at present the far more common outcome in social research when more predictor variables are added is that the individual predictor variables become weaker rather than stronger due to relationships [confoundings] among them. Since race is at best a very weak predictor of performance, it too is likely to become weaker or nonexistent in a model equation as the number of predictor variables is increased.)

Therefore, even if ethical considerations are ignored, it is generally not scientifically reasonable to predict performance in individuals on the basis of a known relationship between performance and race.

The Social Taboo Against Concluding That Performance Depends On or Is Predictable From Race

Despite the preceding points, there is a tendency among some people to think that one race (or religion) is superior to others in one or more areas of performance. Unfortunately, this point of view can lead to appalling undeserved human suffering. Therefore, civilized society uses another important incentive to work in concert with the somewhat complicated logical arguments in the preceding paragraphs that performance can't be reasonably viewed as depending on (or as predictable from) race. This incentive operates in the ethical realm and exists in the form of a strong social taboo against concluding that performance depends on race. This taboo exists without any need for justification in the sense that many people accept it on an intuitive level without questioning it (because it is fair).

The taboo is vividly illustrated by the experience of Glayde Whitney, a behavior geneticist with a record of distinguished research in the genetics of mouse taste, and who was the 1995 President of the Behavior Genetics Association (BGA).
In view of the BGA's name, many of its members have considered the idea of relationships between performance and race. Interestingly, most behavior geneticists believe that no such relationships exist. Thus Whitney astounded the association by suggesting in his Presidential Address that race plays a role in causing murders. He presented (in a speech at an evening banquet) reliable evidence that the murder rate in the United States was significantly higher among non-Whites than among Whites. He then said

   Like it or not, it is a reasonable scientific hypothesis
   that some, perhaps much, of the race difference in murder
   rate is caused by genetic differences in contributory
   variables such as low intelligence, lack of empathy,
   aggressive acting out, and impulsive lack of foresight
   (1995, p. 336).

The next morning Whitney was shunned at the meeting of the BGA Executive Committee, and the committee voted (with Whitney abstaining) to issue an official statement denouncing his comments. Also, the editor of the BGA journal declined (contrary to standard policy) to publish the text of the Presidential Address in the journal (Whitney, 1995). After the meeting the incoming 1996 BGA president circulated an open letter calling Whitney's comments "nonscientific, misleading, and cruel," and urging Whitney to resign from the association ("Specter at the Feast," 1995).

Whitney's hypothesis is that race exerts a causal influence on murder, and he was correct in saying that this hypothesis is a "reasonable scientific hypothesis". However, due to the possibility of reasonable alternative explanations (perhaps in terms of poverty and alienation), he erred in believing that the murder statistics properly support the hypothesis.
In view of the error in scientific logic and in view of the taboo against concluding that performance depends on race, the members of the Behavior Genetics Association moved quickly to distance themselves from Whitney's scientifically unfounded and socially inappropriate causal conclusion.

(A similar taboo pertains to concluding that performance in individuals depends on their sex [gender]. We [society] allow measures of physical performance to depend on sex because sufficient obvious differences exist between the sexes in determinants of physical performance [e.g., in average body weight] to justify such differences. We also generally allow differences in "emotional" performance between the sexes, although the distinction may be diminishing. However, we have a justified strong social prohibition against concluding that intellectual performance depends on sex because such differences might be used by some people as a basis for sex discrimination.)

Does Performance Not Depend on Race?

The discussion above suggests that we can't reasonably conclude that performance depends on race. It is instructive to consider the negation of this idea. That is, can we conclude that performance doesn't depend on race?

Many people believe that human performance doesn't directly depend on race. (I am in this group.) However, the statement that performance doesn't depend on race is a statement of a scientific "null hypothesis" -- a statement that something doesn't exist. (Here the null hypothesis says that no causal relationship exists in humans between a given measure of performance and race.) It is impossible to scientifically prove that something that is logically possible doesn't exist (assuming that the size of the thing isn't specified). Thus a null hypothesis can't be directly empirically supported, and we can't scientifically prove that performance doesn't depend (in some perhaps very small way) on race.
Despite the preceding point, scientific logic dictates (through the principle of parsimony) that we assume that a null hypothesis is true until (if ever) incontrovertible empirical evidence to the contrary is brought forward. Thus (since no incontrovertible evidence is presently available) we assume that performance doesn't depend on race, even though we can't prove that this is true.

Rejecting the Null Hypothesis

As a rule, scientists are highly interested in properly rejecting null hypotheses about causal relationships between variables. This rejection is performed by finding empirical evidence that implies the existence of the relationship. Scientists are interested in rejecting null hypotheses because the knowledge gained in rejecting a (carefully chosen) null hypothesis is generally of theoretical or practical use.

However, the case of relationships between performance and race is an important exception. In this case most scientists and other thoughtful people are not interested in trying to reject the null hypothesis because, as noted, rejection is not seen as being particularly scientifically useful, and rejection might be used by some people as a basis for racial discrimination.

Summing Up

The preceding discussion leads to a certain type of negative (null) conclusion. A conclusion of this type is often unstated because experienced scientists take such a conclusion for granted until (if ever) it is rejected. However, in view of the harmfulness of racial discrimination, the conclusion is worth stating: There is presently no convincing scientific evidence that performance (or behavior) in individuals can be reasonably predicted from their race or ethnicity.
Appendix D: Specifying a Repeated Measurements Analysis of Variance

The procedure for requesting a repeated measurements analysis of variance from a statistical analysis computer program is complicated because one must understand two somewhat complicated languages:

- the language of statistical ideas related to repeated measurements analysis of variance (i.e., variation, within- and between-entity variation, main effect, interaction, and p-value)

- the language of the computer program chosen to analyze the data. (In general, each program uses a different proprietary language to specify the required information.)

Also, requesting a repeated measurements analysis of variance is complicated because two layouts are available for organizing the data table, most software can analyze data organized according to only one of the layouts, and software manuals sometimes don't carefully distinguish between the layouts.

One layout for organizing the data is with one row of data per response-variable value. For example, suppose we perform a repeated measurements experiment to compare teaching approach A with teaching approach B using a measure of knowledge as the response variable. And suppose we measure the students' knowledge of the subject area before they are exposed to the teaching approaches and we measure their knowledge again after each student has had three months of exposure to one or the other of the approaches. Then our data table might be organized as follows:

   ------------------------------------------
              Teaching
   Student    Approach    Time      Knowledge
   ------------------------------------------
   Jack       A           Before       55
   Jack       A           After        65
   Mary       B           Before       63
   Mary       B           After        75
   Jean       A           Before       68
   Jean       A           After        69
   Bill       B           Before       49
   Bill       B           After        82
   etc.
   ------------------------------------------

The table indicates that Jack had a measured knowledge value of 55 before receiving teaching approach A and a measured knowledge value of 65 after receiving teaching approach A, and so on for the other students.

A second layout for organizing the data is with one row of data per experimental entity, i.e., one row per student in the present discussion. Under this layout the information in the above table could be organized as follows:

   ------------------------------------------
              Teaching    Knowledge  Knowledge
   Student    Approach    Before     After
   ------------------------------------------
   Jack       A           55         65
   Mary       B           63         75
   Jean       A           68         69
   Bill       B           49         82
   etc.
   ------------------------------------------

In this second layout the response variable (Knowledge) has multiple columns in the data table, with these columns reflecting the repeated measurements aspect of the research -- one column for each time the response variable was measured in the students. Traditionally this second layout has been used to organize repeated measurements data. (This may be because this layout is non-redundant and thus more compact than the first layout.) However, the first layout may be slightly easier to understand because each variable has only a single column in the data table and no variables are hidden or implicit. (In the table immediately above the variable Knowledge has two columns and the variable Time has no column -- time is implied by the two Knowledge columns.) I hope that statistical software developers will debate the advantages of the two layouts for organizing repeated measurements data and then standardize on the better layout (or perhaps make both layouts available).

It is easy for an expert to use statistical software to convert a data table from one of the two layouts for organizing repeated measurements data to the other.
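As a sketch of such a conversion, the following plain-Python fragment converts the first (one row per measurement) layout to the second (one row per student) layout. This is an illustration only; the column names follow the tables above, and users of a data-analysis library such as pandas would typically use its built-in reshaping functions (pivot and melt) instead:

```python
# First layout: one row of data per response-variable value.
long_rows = [
    {"Student": "Jack", "Approach": "A", "Time": "Before", "Knowledge": 55},
    {"Student": "Jack", "Approach": "A", "Time": "After",  "Knowledge": 65},
    {"Student": "Mary", "Approach": "B", "Time": "Before", "Knowledge": 63},
    {"Student": "Mary", "Approach": "B", "Time": "After",  "Knowledge": 75},
]

def long_to_wide(rows):
    """Convert to one row per student, with one Knowledge column per Time."""
    wide = {}
    for r in rows:
        w = wide.setdefault(r["Student"],
                            {"Student": r["Student"],
                             "Approach": r["Approach"]})
        w["Knowledge_" + r["Time"]] = r["Knowledge"]
    return list(wide.values())

wide_rows = long_to_wide(long_rows)
# wide_rows[0] -> {'Student': 'Jack', 'Approach': 'A',
#                  'Knowledge_Before': 55, 'Knowledge_After': 65}
```

Converting in the other direction (one row per student back to one row per measurement) is the reverse loop: emit one output row per Knowledge column of each input row.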
However, for a less experienced researcher this conversion can be surprisingly difficult in the details. Thus I recommend that less experienced researchers determine which data organization their software requires and then ensure that the data table is organized properly from the start.

References

Aiken, L. R. 2002. Attitudes and related psychosocial constructs: Theories, assessment, and research. Thousand Oaks, CA: Sage.

Bailar, J. C., III, and Mosteller, F., eds. 1992. Medical uses of statistics (2nd ed.). Boston: NEJM (New England Journal of Medicine) Books.

Box, G. E. P., Hunter, J. S., and Hunter, W. G. 2005. Statistics for experimenters (2nd ed.). New York: John Wiley.

Fleiss, J. L. 1986. The design and analysis of clinical experiments. New York: John Wiley.

Kirk, R. E. 1995. Experimental design: Procedures for the behavioral sciences (3rd ed.). Pacific Grove, CA: Brooks/Cole.

Krosnick, J. A., Judd, C. M., and Wittenbrink, B. 2005. The measurement of attitudes. In D. Albarracin, B. T. Johnson, and M. P. Zanna (Eds.), The handbook of attitudes (pp. 21-76). Mahwah, NJ: Lawrence Erlbaum.

Ladson-Billings, G., and Tate, W. 2006. American Education Research Association annual meeting theme: Education research in the public interest. Available at http://www.aera.net/annualmeeting/?id=694

Macnaughton, D. B. 2002. The introductory statistics course: The entity-property-relationship approach. Available at http://www.matstat.com/teach

McCullagh, P., and Nelder, J. A. 1989. Generalized linear models (2nd ed.). London: Chapman and Hall.

Miller, G. A. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review 63:81-97. Also available at http://www.well.com/~smalin/miller.html

Specter at the Feast. 1995 (July 7). Science 269:35.

Thompson, B. 2004. Exploratory and confirmatory factor analysis: Understanding concepts and applications. Washington, DC: American Psychological Association.
Whitney, G. 1995. Ideology and censorship in behavior genetics. Mankind Quarterly 35:327-342. Also available at http://www.lrainc.com/swtaboo/taboos/gw-icbg.html

Winer, B. J., Brown, D. R., and Michels, K. M. 1991. Statistical principles in experimental design (3rd ed.). New York: McGraw-Hill.