Subject: Re: Experimental Research In Education: The Most Exciting
Talk at the 2005 Joint Statistical Meetings

To: EdStat E-Mail List
ApStat E-Mail List
Teaching Statistics E-Mail List
sci.stat.edu Usenet Newsgroup

From: Donald B. Macnaughton < donmac@matstat.com >

Date: Sunday February 11, 2007

-----------------------------------------------------------------

Some Basic Ideas of Science

This post replies to comments by Mike Palij.  But first, to sup-
port the discussion, it is helpful to propose definitions of some
basic ideas of science, as follows:

EMPIRICAL RESEARCH is any activity in which measurements
(observations) are gathered from some area of experience
and then reasonable conclusions are drawn from the meas-
urements.

Measurements in empirical research are usefully viewed as
values of VARIABLES, which reflect properties of the
entities under study.  (Entities may be people, other
living things, physical objects, or any other type of
object or thing.)

An EMPIRICAL RESEARCH PROJECT (or logical sub-unit of an
empirical research project) is usually usefully viewed as
studying the relationship between a single response vari-
able and one or more predictor variables in the entities
in the population of interest.  This study is performed
by analyzing the measured values of the variables ob-
tained from a sample of entities from the population.

An EXPERIMENT is an empirical research project in which
at least one of the predictor variables is “manipulated”
(i.e., caused to take certain values in the entities in
the sample) by the researcher.

An OBSERVATIONAL RESEARCH PROJECT is an empirical re-
search project in which none of the predictor variables
is manipulated by the researcher, and the values of the
predictor variables are simply observed.

I discuss the ideas in more detail in a paper (2002).

The description of an empirical research project is qualified
with the word “usually”.  This implies that some empirical re-
search projects (or sub-units) don’t satisfy the description --
that is, they can’t be reasonably viewed as studying the rela-
tionship between a single response variable and one or more pre-
dictor variables.  My experience suggests that research projects
that can’t be reasonably viewed as satisfying the description ap-
pear in less than three percent of published empirical research
reports (in science, technology, and business).  I discuss some
of these research projects in the paper (2002, Appendix I.2).
For economy of words, this small group of research projects is
mostly ignored in the discussion below.

Novelty of Experiments in Education

Quoting my August 25, 2005 post, Mike Palij wrote (on August 26
in the EdStat e-mail list)

< snip >
> Experimental studies in education are not a new idea.

I agree.  I suspect that the idea of performing proper experi-
ments in education can be traced back to the 1920’s or 1930’s
when Fisher, Neyman, Pearson, and other statisticians first ex-
plained and debated the idea of a proper experiment.  However,
proper experiments in education haven’t often been performed.

What is a “proper” experiment?  Here is a reasonable definition:

An experiment (or randomized trial) is a PROPER EXPERI-
MENT if it has been performed according to widely ac-
cepted principles of scientific practice, experimental
design, and data analysis, as described by Bailar and
Mosteller (1992), Box, Hunter, and Hunter (2005), Fleiss
(1986), Kirk (1995), Winer, Brown, and Michels (1991),
and many others.

I suspect that proper experiments haven’t often been performed in
education for two reasons: (a) proper experiments in education
are difficult, and (b) perhaps simply due to tradition, research-
ers in education have lacked a strong connection to the ideas of
experimental research.

Problematic Nature of Experiments in Education

< snip >
> There are a number of reasons why experiments are problematic
> in educational settings

I fully agree.  Some key problems in performing a good education
experiment to compare teaching approaches are

1. What is the RESPONSE VARIABLE (or variables) that we will use
to compare the teaching approaches?  Will it be marks, or at-
titudes, or aptitudes, or some other measure of the students
(or the classes of students) under study?

2. What are the PREDICTOR VARIABLES that we will measure in the
research?  One predictor variable will reflect the different
teaching approaches under study.  What other predictor vari-
ables (e.g., students’ age or gender) should we measure to as-
sist our understanding?  Is it reasonable to use students’
race as a predictor variable?

3. How can we design the experiment in a way that will ELIMINATE
teacher effects and other REASONABLE ALTERNATIVE EXPLANATIONS
of any significant differences we find in the values of the
response variable between the teaching approaches?

4. How can we design the experiment so that it is as LIKELY as
possible that we will FIND STRONG EVIDENCE of the relationship
we are looking for between the response variable and the pre-
dictor variable(s) (assuming that the relationship actually
exists)?

5. How can we RECRUIT TEACHERS to participate in the experiment?

6. How can we OBTAIN FUNDING to pay for the experiment?

7. After we have performed the experiment, how can we ANALYZE the
RESULTS and DRAW SCIENTIFICALLY VALID CONCLUSIONS?

8. While properly addressing the preceding seven problems, how
can we MINIMIZE the COSTS of the experiment?

Appendix A expands the eight problems and discusses some solu-
tions.

Necessity of Experiments in Education

< snip >
> As for the necessity of experimental research to provide the
> basis for either (a) science or (b) valid conclusions, the as-
> tronomers say “Hi!  We’ve been engaging in science for hundreds
> of years without having performed a single experiment!”.

Mike suggests that if astronomers don’t perform experiments, then
education researchers also don’t need to perform experiments.  I
think that this is a thought-provoking point, but an invalid ar-
gument.  The argument is invalid because astronomers CAN'T per-
form experiments: they can't manipulate distant astronomical
events.  If they could manipulate these events (at a reasonable
cost), they doubtless would.  That is, they
would perform proper experiments just like scientists in disci-
plines that regularly perform experiments, such as in most
branches of physics, chemistry, engineering, medicine, biology,
and psychology.

The fact that astronomers can’t perform astronomical experiments
makes astronomy an “observational” discipline.  Other observa-
tional disciplines that generally can’t perform experiments (due
to the remoteness or untouchability of the phenomena they study)
include anthropology, archaeology, economics, epidemiology, geol-
ogy, paleontology, and some areas of sociology.  Such observa-
tional disciplines base their inferences on careful observational
empirical research, often studying relationships between vari-
ables.  (In historical disciplines, observational research is
sometimes [due to the paucity of data] reduced to careful consid-
eration of physical or anecdotal information about entities,
properties of entities, or variables, without focusing on the
concept of ‘relationship between variables’.)

Proper observational research often enables us to reliably
PREDICT the values of the response variable (in new situations),
which is an important benefit.  However, in education research we
would generally like to learn how to reliably CONTROL the values
of the response variable (in new situations).  That is, we would
like to learn how to structure an education program so that it
provides students with the BEST education.  Unfortunately, obser-
vational research is almost always equivocal about control --
subject to multiple competing reasonable explanations.

For example, suppose we are presented with the results of an ob-
servational research project in education that suggests that a
certain teaching approach A is better (in the sense of exhibiting
significantly better average values in students of the chosen re-
sponse variable) than another teaching approach B.  In this case
(due to the nature of observational research) it is almost always
possible to find a reasonable alternative explanation of the re-
search finding, and this explanation implies that approach A may
NOT be better than approach B.  But if we find such an explana-
tion, this implies that the research is equivocal.  This means
that the research is of substantially less value because it can’t
reliably help us to decide which teaching approach is better.

(A reasonable alternative explanation in observational education
research is often in terms of “confounding” of the teaching ap-
proaches under study in the research with some other aspect
[i.e., variable] of the research situation.  Then it is generally
possible that this other variable can fully account for the dif-
ference between the average values of the response variable under
the different teaching approaches.  For example, an observational
research project might confound two teaching approaches with two
different schools -- school 1 uses teaching approach A, and
school 2 uses teaching approach B.  In this case if we find a
significant difference in the average values of the response
variable between the two approaches, it is possible that the
teaching approaches have no differential effect on the values of
the response variable.  That is, [unless the School variable is
appropriately (and expensively) taken account of in the design of
the research] it is possible that a certain difference between
the SCHOOLS caused the observed significant differences in the
average values of the response variable [in students or classes]
between the teaching approaches.)

In contrast, suppose that a proper EXPERIMENT provides good evi-
dence that teaching approach A is better than teaching approach
B.  In this case the finding is unequivocal.  This is because
proper experiments are explicitly designed to eliminate confound-
ing and other reasonable alternative explanations.  Thus in this
case we can safely (tentatively) conclude that approach A will be
better than approach B in new situations (if the relevant condi-
tions are sufficiently similar to those of the experiment).  Thus
proper experiments are preferred to observational research pro-
jects in education research.

(The equivocation in observational research relates to drawing
conclusions about causation.  That is, we are interested in
whether teaching approach A CAUSES students to do better than
teaching approach B.  Evidence about relationships between vari-
ables obtained in observational research is generally equivocal
about causation, but evidence obtained in proper experimental
research is unequivocal about causation.)

Changing Attitudes Toward Experiments in Education

Mike noted that members of the respected American Educational Re-
search Association (AERA) have carefully considered the issue of
experimental research in education.  Mike’s point is directly re-
flected in the opening sentence of the official description of
the theme of the 2006 AERA annual meeting:

Current social and political pressures on education re-
search suggest that research must meet the demands of
evidence-based and scientifically based inquiry (Ladson-
Billings and Tate 2006).

The idea of “current” pressures reflects the fact that the pres-
sures on education research are new, having arrived over the last
decade or so.  The sentence implies that education researchers
are moving toward “evidence-based and scientifically based” re-
search.  This suggests that education researchers should gener-
ally perform proper experiments (because observational research
results are generally equivocal).

(Having acknowledged the importance of proper research, the de-
scription of the theme of the 2006 AERA meeting turns to the
theme itself, which pertains to education research in the public
interest, education research that will “increase the common-
wealth”.  The discussion of the theme is available at
http://www.aera.net/annualmeeting/?id=694 )

Opportunities for Experiments in Education

The preceding discussion suggests that (a) the area of experimen-
tal studies in education is only now beginning to open up and (b)
this area will become the mainstream of education research as
granting agencies and journal editors reinforce the point that
proper experiments are preferred to observational research.  Be-
cause the area is opening up, it has many opportunities for
thoughtful researchers.

To perform a proper education experiment a researcher must be fa-
miliar with the principles of experimental design, power analy-
sis, and (often) repeated measurements analysis of variance.
Some education researchers are less familiar with these topics.
They may find it helpful to follow the path of many medical re-
searchers who collaborate with a statistician with experience in
the topics.  To ensure that the research design is efficient, I
recommend that this collaboration begin early in the design phase
of the research.

Opportunities exist for statisticians to present courses to edu-
cation researchers about the statistical and scientific aspects
of education research.  I propose topics for such a course in ap-
pendix B.

I believe that the movement toward experimentally based education
research will yield a body of reliable research results that will
substantially improve education.

Don Macnaughton

Donald B. Macnaughton
donmac@matstat.com

Appendices

Appendix A:  Eight Problems in Experiments in Education

Appendix B:  Courses About Experiments For Education Researchers

Appendix C:  Can Human Performance or Behavior Be Predicted from
a Person’s Race?

Appendix D:  Specifying a Repeated Measurements Analysis of
Variance

Appendix A:  Eight Problems in Experiments in Education

The body of this post lists eight problems that arise in experi-
ments in education.  This appendix briefly expands the problems
and discusses some general solutions.

Problem 1:  Choosing the Response Variable

Suppose we are designing an education experiment to compare two
teaching approaches.  In choosing the response variable we can
reasonably begin by answering the following question:

What would we like the teaching approaches under study to
do (accomplish)?  That is, what is our teaching goal?

This question can be answered at a high level by deciding which
of the following is the main goal:

1. maximize student knowledge and understanding of the subject
area

2. maximize certain student attitudes toward the subject area

3. maximize student aptitudes in the use of the subject area in
practical applications

4. optimize some other property (or combination of properties) of
the students.

The definition of the teaching goal in an education experiment
depends on the particular topic or discipline being taught, on
the type of students being taught, and on the course designer’s
and researcher’s interests.  The definition of the goal deserves
careful attention because it lies at the heart of the research.

After we have defined the teaching goal, we can choose the re-
sponse variable by addressing a second important question, which
is

How can we best measure the effectiveness of a teaching
approach to satisfy the teaching goal?

The answer to this question defines the response variable.  For
example, if the goal is to maximize students’ knowledge and un-
derstanding, the main response variable in an education experi-
ment will be a measure of each student’s knowledge and under-
standing, typically a weighted average of marks on assignments or
tests.  In this case it is important to devise a “fair” measure
of students’ knowledge and understanding by devising “fair” as-
signments or tests -- a challenging but doable task.

Similarly, if the teaching goal is to maximize certain student
attitudes toward the subject area of the course, the response
variable(s) will be one or more measures of students’ attitudes.
Attitudes are important because they play a key role in people's
decisions.  It is generally unnecessary to devise a test of atti-
tudes toward the subject area of a course because reliable stan-
dardized attitude tests (which can be administered in less than
twenty minutes) are available for many subject areas.

(You can find tests of students’ attitudes toward a subject area
by searching the Web of Science [available in some university li-
braries], other science literature databases, and the world wide
web for articles and books that have the word “attitude” [or “at-
titudes”] and the name of your subject area in their titles or
keywords.  You may also be able to find acceptable generic atti-
tude tests.  If you are studying a subject area in which an ac-
ceptable test of student attitudes toward the area is not avail-
able, and if you are comfortable with statistical ideas, you can
study other attitude tests and then use the principles of atti-
tude scale development [Aiken 2002, Krosnick, Judd, and
Wittenbrink 2005] and the statistical procedure of exploratory
factor analysis [Thompson 2004] to develop such a test.)
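
(A standard step in scale development -- not discussed above, but
closely related -- is to check the internal consistency of a can-
didate attitude scale.  As a small illustration, here is a sketch
of Cronbach's alpha using only Python's standard library; the
ratings below are invented for the example:

```python
import statistics

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an attitude scale.  item_scores is a
    list of per-item score lists (one inner list per item, one
    entry per respondent).  Alpha near 1 suggests that the items
    measure a single underlying trait."""
    k = len(item_scores)
    # Sum of the sample variances of the individual items
    item_vars = sum(statistics.variance(item) for item in item_scores)
    # Each respondent's total score across all items
    totals = [sum(scores) for scores in zip(*item_scores)]
    return (k / (k - 1)) * (1 - item_vars / statistics.variance(totals))

# Hypothetical 5-point ratings: 3 items answered by 4 respondents
alpha = cronbach_alpha([[4, 5, 3, 4],
                        [5, 5, 4, 4],
                        [3, 4, 2, 3]])
```

A low alpha would suggest revising or dropping items before the
scale is used as a response variable in an experiment.)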

Similarly, if the teaching goal is to maximize students’ apti-
tudes (which is a key goal in courses that teach applied skills),
the response variable will be a measure of students’ aptitudes.
Standardized measures of aptitudes are available in some areas,
and can be found with the methods in the first sentence of the
preceding paragraph.

Problem 2:  Choosing the Predictor Variables

Choosing the predictor variables in an education experiment to
compare teaching approaches requires first that we choose the two
(or more) teaching approaches that we wish to compare.  These
teaching approaches define the values of the “teaching approach”
predictor variable.  This variable is manipulated in the students
in the sense that some students receive one of the teaching ap-
proaches, and other students receive the other.  Two reasonable
teaching approaches to compare are (a) the traditional approach
to teaching some topic or discipline and (b) the top contender to
replace the traditional approach.  The experiment stages a fair
contest between the two approaches.

As noted above, in addition to choosing the main predictor vari-
able, we must also decide which other variables of the situation
under study we will measure.  Generally, the more “relevant” pre-
dictor variables we measure in an experiment, the better the un-
derstanding we obtain of the relationship between variables we
are studying.  Thus it may be useful to measure each student’s
age and gender.  It may also be relevant to measure students’
previous experience with the material, intelligence, socioeco-
nomic status, and years of experience speaking the language that
the course is taught in.

Appendix C discusses the use of “race” as a predictor variable in
research studying measures of human performance or behavior.

Problem 3:  Eliminating Reasonable Alternative Explanations

The need to eliminate reasonable alternative explanations of a
research finding stems from the sensible principle that good re-
search must be unequivocal.  Eliminating reasonable alternative
explanations is difficult because many forms of reasonable expla-
nations are possible, and some are hard to recognize.  Thus ex-
perienced researchers spend considerable time trying to think of
reasonable alternative explanations of research results, espe-
cially results of their own planned research.  If the elimination
of reasonable alternative explanations is properly done (gener-
ally through careful research design), it (mostly) eliminates the
possibility that the associated research conclusion will be ruled
invalid due to a reasonable alternative explanation that someone
thinks of later.

Confounding alternative explanations are eliminated by random as-
signment of experimental entities to treatments.  That is, in an
education experiment we randomly assign students (or classes of
students) to teaching approaches.  Such random assignment helps
to ensure (in a probabilistic sense) that the different teaching
approaches could not be confounded with schools or with many
other variables.

For logistical reasons, random assignment is sometimes difficult
in education research.  In the case of randomly assigning stu-
dents to different teaching approaches, the problem can often be
solved (albeit at some expense) by running both (or all) the
teaching approaches together at the same time on the same day of
the week in nearby (similar) facilities.  (If enough resources
exist, this concurrent presentation of the set of treatments can
be repeated on different days of the week or at different loca-
tions.)  This means that students can be readily randomly as-
signed to the treatments (teaching approaches) without confound-
ing because none of the treatments has extraneous time or loca-
tion advantages.  I recommend that the researcher (a) discourage
students from switching between classes, but permit such
switches, and (b) identify any students who switch, perhaps to be
with a friend.  The students who switch classes after the course
has begun can be studied and then (due to unrepresentativeness)
be omitted from the analysis.
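
The random-assignment step described above is easy to carry out
(and to document) with a short computer program.  Here is a mini-
mal sketch; the student identifiers and approach names are hypo-
thetical:

```python
import random

def randomly_assign(students, treatments, seed=None):
    """Shuffle the students and deal them round-robin into the
    treatments, giving groups as equal in size as possible."""
    rng = random.Random(seed)
    shuffled = students[:]
    rng.shuffle(shuffled)
    assignment = {t: [] for t in treatments}
    for i, student in enumerate(shuffled):
        assignment[treatments[i % len(treatments)]].append(student)
    return assignment

# Hypothetical example: six students, two teaching approaches
groups = randomly_assign(["s1", "s2", "s3", "s4", "s5", "s6"],
                         ["approach_A", "approach_B"], seed=42)
```

Recording the seed makes the assignment reproducible, which is
useful when the assignment must later be audited.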

Problems 4, 7, and 8: Power, Analysis, and Minimizing Costs

The fourth, seventh, and eighth problems respectively pertain to
maximizing the power of the statistical tests in an experiment,
analyzing the data obtained in the experiment, and minimizing the
costs of the experiment.  Detailed technical help with these
problems is available from the field of statistics, which has ef-
ficient general methods for the design and analysis of powerful
but inexpensive experiments.  I discuss an introduction to the
methods of statistics in the paper (2002).

One can get practical help with problems 4, 7, and 8 by studying
the medical research techniques of multicenter clinical trials.
These techniques can help because proper experiments in education
must be performed in parallel at several teaching institutions
(or at least in several classes) to eliminate teacher effects and
to provide power and generality.  This is similar to how medical
research is performed in parallel in several hospitals in a mul-
ticenter clinical trial.
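
One practical way to address the power problem at the design
stage is simulation: generate many artificial experiments under
an assumed effect size and count how often the planned test de-
tects the effect.  Here is a sketch using only Python's standard
library; the effect size and standard deviation are invented
planning numbers, and a normal approximation stands in for the
exact t test:

```python
import math
import random
import statistics

def estimated_power(n_per_group, effect, sd, reps=2000, seed=1):
    """Monte Carlo estimate of the power of a two-group comparison:
    simulate many experiments and count how often the difference in
    group means exceeds its approximate two-sided 0.05 critical
    value (normal approximation, z = 1.96)."""
    rng = random.Random(seed)
    z_crit = 1.96
    hits = 0
    for _ in range(reps):
        a = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        b = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        se = math.sqrt(statistics.variance(a) / n_per_group +
                       statistics.variance(b) / n_per_group)
        if abs(statistics.mean(b) - statistics.mean(a)) > z_crit * se:
            hits += 1
    return hits / reps

# Hypothetical planning numbers: a 10-point effect, SD of 15 marks
power = estimated_power(n_per_group=40, effect=10.0, sd=15.0)
```

If the estimated power is too low, the researcher can increase
the sample size (or refine the design) before the experiment is
run, rather than after.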

(Technical Aside:  In medical research the patient is often the
entity [unit] of analysis, but in education research the class of
students is often the [implicit] entity of analysis.  A class of
students in education research is analogous to a hospital [or
other grouping] of patients in a multicenter clinical trial.  A
researcher can often substantially increase the power of statis-
tical tests in education research [with often only a minor in-
crease in costs] by designing the research so that the student
[instead of the class] is the entity of analysis.  This can be
done by using a pre-post [i.e., repeated measurements] experimen-
tal design.  That is, the value of the response variable is meas-
ured in each student in the experiment both before and after the
students experience their assigned teaching approach.  Such a de-
sign is feasible with many but not all response variables.)
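
The simplest analysis of a pre-post design examines each stu-
dent's gain score.  As a minimal sketch (the marks are invented,
and in practice the full repeated measurements analysis of vari-
ance discussed in this post would be used), the paired t statis-
tic can be computed as follows:

```python
import math
import statistics

def paired_t(pre, post):
    """Paired t statistic for a pre-post design: analyze each
    student's gain (post minus pre) against the null hypothesis
    of zero mean gain."""
    gains = [b - a for a, b in zip(pre, post)]
    n = len(gains)
    se = statistics.stdev(gains) / math.sqrt(n)
    return statistics.mean(gains) / se

# Hypothetical marks for six students before and after the course
pre  = [55, 60, 48, 70, 62, 58]
post = [63, 66, 55, 78, 64, 67]
t = paired_t(pre, post)  # compare t to a t distribution with n - 1 df
```

Because each student serves as his or her own control, the gain-
score analysis removes much of the between-student variation,
which is the source of the power increase noted above.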

Problem 5: Recruiting Teachers

Recruiting teachers or teaching departments to participate in ex-
perimental research in education is facilitated if the researcher
explains how the research will provide important educational
benefits.  If this is carefully done, and if the disruption of
the existing program is not too great, appropriate participants
can generally be recruited, just as appropriate medical personnel
at different hospitals are recruited as a first step in a multi-
center clinical trial.

Problem 6: Obtaining Funding

To obtain research funding a researcher submits a carefully writ-
ten grant proposal to an appropriate funding agency.  The pro-
posal competes with other proposals for funds from the pool of
funds distributed by the agency.  Research proposals are judged
on the following criteria:

- the reasonableness of the hypothesized phenomenon that the re-
search will study

- the clarity of thinking in the rationale and implications of
the research

- the potential of the hypothesized phenomenon to make a worth-
while contribution to the field under study, and

- the conformity of the proposal to the correct style and proto-
col for grant proposals to the agency.

The research projects whose proposals best satisfy the above cri-
teria are awarded funding.

Appendix B:  Courses About Experiments For Education Researchers

I recommend that courses about experimental research for educa-
tion researchers discuss the eight problems discussed in appendix
A in terms of examples of good and bad education experiments.  I
recommend that teachers discuss real or realistic experiments (as
opposed to abstract experiments) because realistic experiments
enable students to consider the specific research goal of each
example.  Considering researchers’ goals helps students to formu-
late the goals of their own research.

In discussing a “good” education experiment it is important to
convey to students the practical benefits that are provided by
the results because these benefits generally justify the care
taken to perform the experiment.

In view of the usefulness of a pre-post experimental design for
increasing power, I recommend that this type of design and the
proper analysis of the results of this type of experiment be dis-
cussed in detail.

Discussion of data analysis can best omit all mathematical con-
cepts and focus on interpreting the output from the computer
analysis of the data.  Most experiments to compare teaching ap-
proaches have a continuous reasonably-well-behaved response vari-
able (e.g., marks or attitude scores), and the main predictor
variable (i.e., “teaching approach”) is discrete.  Thus the re-
sults of these experiments are best analyzed with analysis of
variance (which becomes repeated measurements analysis of vari-
ance if a pre-post design is used).

(Repeated measurements analysis of variance is also called “mixed
model” analysis because the right-hand side of the model equation
of the relationship between the variables contains a mixture of (a)
“fixed” terms (associated with the predictor variables) and (b)
“random” terms associated with unaccounted-for variation in the
values of the response variable.  However, I prefer the term “re-
peated measurements” because it is more intuitive for beginners.)

I recommend that discussion of data analysis be in terms of exam-
ples of good experiments in education research.  I recommend that
the discussion cover the following topics:

1. For each example, a discussion of the research hypothesis (or
research question) under study, a discussion of the conduct of
the experiment, and a discussion of the layout of the data ta-
ble that was obtained in the experiment and that is the basis
of the data analysis.

2. What the computer (software) must be told in order to perform
the analysis of variance.  We must tell it

- the location of the data table (often in a file on the
computer)

- which variable in the data table is the response variable
in the analysis

- which variable(s) in the table is (are) the predictor
variable(s)

- in repeated measurements experiments which of the predic-
tor variables vary within experimental entities and which
vary between experimental entities (or, equivalently,
which variable uniquely identifies the experimental enti-
ties)

- in more complicated experiments details of the relation-
ship between variables we are studying, such as the hy-
pothesized form of the model equation of the relationship.

3. How to tell the computer what it must be told.  This varies
among different software products, and is discussed further in
appendix D.

4. How to interpret each p-value in an analysis of variance table
produced by the computer from the analysis of the data.

5. How to understand tables of means of the response variable for
effects with low p-values, as produced by the computer.

6. How to graphically illustrate means of the response variable
for effects with low p-values for ease of understanding of the
results.

7. How to understand other output from the computer such as (a) a
measure of strength or “effect size” of each relationship be-
tween variables studied in the experiment and (b) the esti-
mates of the values of the parameters of the model equation.

8. How (to avoid embarrassing errors) the researcher must confirm
that the underlying assumptions of a statistical analysis are
adequately satisfied before drawing final conclusions.  (This
includes (a) checking of the univariate distribution of each
of the variables for anomalies and (b) appropriate verifica-
tion that the data have certain necessary properties.  The
specific properties depend on which statistical procedure is
used to study the relationship.)

9. How to interpret the analysis in terms of the research hy-
pothesis.
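
Several of the topics above revolve around the analysis of vari-
ance table.  As a concrete illustration of where its F statistic
comes from, a simple one-way analysis of variance can be computed
by hand in a few lines; the marks below are invented:

```python
import statistics

def one_way_anova_F(groups):
    """F statistic for a one-way analysis of variance: the
    between-group mean square divided by the within-group mean
    square."""
    all_vals = [v for g in groups for v in g]
    grand = statistics.mean(all_vals)
    k = len(groups)
    n = len(all_vals)
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2
                     for g in groups)
    ss_within = sum((v - statistics.mean(g)) ** 2
                    for g in groups for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical marks under two teaching approaches
F = one_way_anova_F([[72, 68, 75, 70], [78, 81, 76, 83]])
```

The p-value reported by statistical software is the probability
of an F this large or larger (here with 1 and 6 degrees of free-
dom) if the teaching approaches actually have no differential
effect.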

Some statistics courses omit or minimize the first and last top-
ics, which pertain to the research hypothesis.  This omission oc-
curs because some courses are more focused either on data analy-
sis or on statistical theory, which are both vast topics.  How-
ever, consideration of the implications of the research for the
research hypothesis is clearly important in courses aimed at edu-
cation researchers.

In high-level terms, the results of a data analysis have one of
three implications for the research hypothesis, which are:

1. the results support the research hypothesis (and there is no
reasonable alternative explanation of the results)

2. the results neither support nor refute the research hypothesis
(either because the research found no good evidence of the
sought-after relationship between variables or because a rea-
sonable alternative explanation of the results is available)

3. the results refute (contradict) the research hypothesis.

Because of the positive way that research hypotheses are framed
(e.g., drug D reduces cancer), researchers performing a research
project almost always hope that the first outcome will occur.
That is, they hope that the results of the research project will
support their research hypothesis.  In this case, if the hypothe-
sis was carefully thought out, the finding will make a contribu-
tion (perhaps substantial) to the associated field of study.  Un-
fortunately, the second outcome sometimes occurs, perhaps because
the research hypothesis is false, or because the research was
poorly designed, or due to the whims of chance.  Although the
third outcome is generally possible, it rarely occurs in prac-
tice.

After students understand the basic ideas of drawing conclusions
from data analysis, I recommend that they learn how to use the
computer to generate realistic artificial data.  Such data gen-
eration is not difficult if students are given the appropriate
instructions (or program templates) and if they are shown how to
generalize the instructions as necessary.

The ability to generate realistic artificial data has three bene-
fits:  First, generating artificial data helps students to under-
stand the postulated model equation of the relationship between
the variables that is under study.  This understanding comes be-
cause generating data is most easily understood through writing a
simple computer program that substitutes the values of the pre-
dictor variables into the model equation and then evaluates the
equation to generate the predicted value of the response vari-
able.  (This evaluation is done repeatedly through the use of
program loops and with one or more random number generators to
generate the values of the random term[s] in the equation.)  From
a theoretical perspective the model equation is the essence of a
relationship between variables, and the students’ experience in
programming the essence helps them to understand it.
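
The data-generation program described above can be sketched in a
few lines.  The parameter values below are invented, and a simple
two-valued predictor stands in for the “teaching approach” vari-
able:

```python
import random

def generate_data(n, b0, b1, sd, seed=0):
    """Generate artificial (x, y) data from the model equation
    y = b0 + b1*x + e, where e is a normally distributed random
    term."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.choice([0, 1])   # e.g. 0 = approach A, 1 = approach B
        e = rng.gauss(0.0, sd)   # random term in the model equation
        y = b0 + b1 * x + e      # evaluate the model equation
        data.append((x, y))
    return data

# Hypothetical parameters: baseline mark 70, treatment effect 5,
# standard deviation 8
data = generate_data(n=100, b0=70.0, b1=5.0, sd=8.0)
```

Analyzing such generated data with the planned procedure and
checking that the known parameter values are recovered is a good
rehearsal of the full design-and-analysis cycle.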

Second, the ability to generate artificial data gives students a
source of “tame” data, which they can then analyze with data
analysis procedures.  Because the students generate the data,
they know exactly what relationship(s) is (are) present (or not)
in the data.  This allows them to see how the data analysis pro-
cedures work to detect and characterize relationships between
variables.  This helps students to develop knowledge and trust of
the procedures.
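
(The following self-contained Python sketch illustrates this
point: it generates "tame" data with a known intercept and slope
[2.0 and 0.5, illustrative assumptions] and then checks whether an
ordinary least-squares fit approximately recovers those values.)

```python
import random

# Generate "tame" data with a known relationship (intercept 2.0
# and slope 0.5 are illustrative assumptions), then analyze the
# data to see whether a least-squares fit detects and
# characterizes the relationship that we know is present.
random.seed(1)
xs, ys = [], []
for _ in range(200):
    x = random.uniform(0, 10)
    xs.append(x)
    ys.append(2.0 + 0.5 * x + random.gauss(0, 1.0))

# Ordinary least-squares estimates of the slope and intercept.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Because we generated the data ourselves, we know the estimates
# should be close to the true values 0.5 and 2.0.
print(round(slope, 2), round(intercept, 2))
```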

Third, students can be encouraged to use their ability to gener-
ate artificial data when designing their own research projects.
That is, during the design phase of a research project students
can generate sets of realistic artificial data that resemble the
data they expect to obtain in the research.  Then they can care-
fully analyze these artificial data with the planned data analy-
sis procedure.  This gives students a thorough review of the
planned research design and data analysis procedure before the
design and analysis are put into practice.  Such a review is of-
ten helpful in eliminating later serious problems and in increas-
ing efficiency, especially for beginning researchers.
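
(One way to conduct such a review is a rough simulation-based
power check: simulate the whole study many times and tabulate how
often the planned analysis detects the effect.  A minimal Python
sketch follows; the group sizes, effect size, error standard
deviation, and detection threshold are all illustrative
assumptions.)

```python
import random
import statistics

# Rough design-phase power check for a two-group experiment.
# Assumed design (illustrative): 25 entities per group, true
# treatment effect = 0.8, error standard deviation = 1.0, and
# |t| > 2 as a rough detection criterion.
random.seed(1)
N_PER_GROUP, EFFECT, SIGMA, RUNS = 25, 0.8, 1.0, 500

detections = 0
for _ in range(RUNS):
    control = [random.gauss(0.0, SIGMA) for _ in range(N_PER_GROUP)]
    treated = [random.gauss(EFFECT, SIGMA) for _ in range(N_PER_GROUP)]
    pooled_var = (statistics.variance(control)
                  + statistics.variance(treated)) / 2
    se = (2 * pooled_var / N_PER_GROUP) ** 0.5
    t = (statistics.mean(treated) - statistics.mean(control)) / se
    if abs(t) > 2:
        detections += 1

power = detections / RUNS
print(power)  # estimated power; roughly 0.8 under these assumptions
```

If the estimated power is too low, the design can be revised (for
example, by increasing the sample size) before any real data are
collected.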

(Statistical software vendors can help users to generate artifi-
cial data by providing easy-to-follow instructions for generating
data for all the common types of relationships between variables.
I recommend that these instructions be placed prominently in the
software documentation so that beginners can easily find this im-
portant resource.  A component of the software that can generate
a data table from fill-in-the-blanks specifications might also be
helpful.)

I believe that the topics in this appendix, when developed in ap-
propriate detail, give prospective researchers a reasonable in-
troduction to how to perform experiments in education research.

Appendix C:  Can Human Performance or Behavior
Be Predicted from a Person’s Race?

The Model Equation of a Relationship Between Performance and Pre-
dictors of Performance

Suppose that the variable y is a particular measure of human per-
formance or human behavior.  For example, y might reflect stu-
dents’ grade point averages or it might reflect some measure of
athletes’ ability.  Under the scientific approach we think that y
“depends” on a number of other variables.  For example, we think
that each student’s grade point average probably depends on the
students’ intelligence, on their motivation, on their parents’
style of parenting, on the attitudes of their friends, perhaps on
their diet, and on various other variables.

The relationship between y and the other variables can be written
in a general model equation as

y = f(x1, x2, ..., xn) + ε.

The performance measure, y (e.g., grade point average), is the
response variable in the relationship, and the x1, x2, ..., xn
are the relevant predictor variables (e.g., intelligence, motiva-
tion, etc.).  The notation f(...) stands for a mathematical func-
tion that outputs the estimated value of y for a person when the
values of the x’s for the person are substituted into it.  [The
detailed mathematical form of f(...) is discovered through analy-
sis of relevant empirical research data.]  The symbol n indicates
the number of predictor variables under consideration, which is
typically between one and five.

The Greek letter ε (epsilon) on the right end of the equation is
the “error” term.  It takes account of the fact that f(...) gen-
erally can’t perfectly predict the actual measured value of y for
a person from the x’s.  The error term is a “random” variable be-
cause it has a different seemingly random value each time the
equation is applied.  Usually ε is sensibly modeled as being half
the time greater than zero and half the time symmetrically less
than zero, so it has an average value of zero.  Invariably ε is
modeled as being more often closer to zero than farther away.

(Technical Aside:  In any particular [standard] instance of a
variable [and at a given time] the variable [just like the prop-
erty behind the variable] has a single value.  [For simplicity, I
ignore here (a) the idea that a particular value of a variable
may be “missing” and (b) the more general but infrequent case of
variables that are vectors.]  A variable can be classified as be-
ing either a continuous variable or a discrete variable.  If a
variable is a continuous variable, its value in a particular in-
stance can theoretically be any value between the minimum and
maximum permissible values.  Continuous variables almost always
have numeric values.  For example, grade point average is a con-
tinuous variable that for a given student [in some schools] can
have any value between 0.00 and 4.00, such as 3.82.  The values
of variables that are obtained from conventional measuring in-
struments [of any type, e.g., ruler, stopwatch] are usually con-
tinuous variables.  [The values of any continuous variable are
limited to a certain maximum number of significant digits (often
between two and four) due to limitations in the accuracy of the
measuring instrument that is used to measure the values.]  In
contrast, if a variable is a discrete variable, its value in a
particular instance can be one of only a limited number of dif-
ferent values, usually fewer than thirty and sometimes as few as
two.  [Discrete variables can be ordinal -- with an implicit
ordering -- or categorical.]  For example, the variable “likes to
dance” is reasonably viewed as an [ordinal] discrete variable that
for a given person has one of a range of five [or perhaps seven]
possible values indicating different levels of liking or dislik-
ing to dance.  [The limitation to five or seven values generally
occurs in raw variables that reflect human judgments or opinions,
as discussed by Miller, 1956.]  For simplicity, the discussion in
this appendix assumes that a continuous response variable is al-
ways used in research projects because using continuous response
variables is [when feasible] the more efficient and more common
approach.  If the response variable in a particular empirical re-
search project is discrete, some of the technical ideas behind
model equations of relationships between variables change, but
the main principles in this appendix still apply, as shown by the
theory of generalized linear models [McCullagh and Nelder,
1989].)

We (i.e., society) can use the model equation of a (properly
verified) relationship between variables to help us to predict
and sometimes control the values of the response variable in a
new situation on the basis of measuring or controlling the values
of the predictor variables in the situation.  If the variables
are carefully chosen, the ability to predict or control can be of
substantial value.  For example, if we can find an (ethical)
method to control (i.e., raise) grade point averages in students
by controlling the values of other variables, we can use the
method to help students to excel.

As research into a relationship between variables advances, more
predictor variables may be discovered that can be (correctly) in-
cluded in the function f(...) for the relationship, which makes
the predictions (or control) made by the function more accurate.
As the predictions become more accurate, the average (absolute)
size of the error term ε in the model equation for the relation-
ship becomes accordingly smaller.  In some areas of research
(e.g., in many areas of the hard sciences) the model equations
make almost perfect predictions.  Thus the error terms in these
equations are so small that they are often sensibly ignored.
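
(This shrinking of the error term can be demonstrated with
artificial data.  The sketch below compares the spread of the
prediction errors when f(...) uses one predictor with the spread
when f(...) uses two; all coefficients are illustrative
assumptions.)

```python
import random
import statistics

# Artificial data from the illustrative model equation
#   y = 1 + 2 * x1 + 3 * x2 + epsilon.
# Compare the prediction errors when f(...) uses only x1 (x2 is
# unknown, so its mean value 0.5 is substituted) with the errors
# when f(...) uses both x1 and x2.
random.seed(1)
errors_one, errors_two = [], []
for _ in range(1000):
    x1 = random.uniform(0, 1)
    x2 = random.uniform(0, 1)
    y = 1 + 2 * x1 + 3 * x2 + random.gauss(0, 0.1)
    errors_one.append(y - (1 + 2 * x1 + 3 * 0.5))  # x2 omitted
    errors_two.append(y - (1 + 2 * x1 + 3 * x2))   # x2 included

# With x2 in the function the errors are much smaller.
print(statistics.stdev(errors_one) > statistics.stdev(errors_two))  # True
```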

The Concept of ‘Race’

The discussion below uses the concept of ‘race’.  Most people
above the age of ten or so have a reasonable intuitive under-
standing of this concept in the sense that they can reliably
(though not perfectly) assign themselves and other people to ra-
cial categories.  (The assignments are “reliable” in the sense
that different people generally agree with each other [at a mutu-
ally acceptable level of classification] about the assignments.)

Although most people understand the concept of ‘race’ at an in-
tuitive level, formal definitions of the concept are difficult.
The definitions break into three classes, which are

1. definitions in terms of a person’s biological ancestry (e.g.,
in terms of classification of the person’s genetic DNA)

2. definitions in terms of a person’s self-reported race

3. definitions in terms of a person’s observable attributes such
as skin color, hair color, facial characteristics, and speech
characteristics.

Each of the three classes contains various definitions of race.
Each definition provides (at least in theory) a way of assigning
people to racial categories.  The categories are usually discrete
categories (e.g., Asian, Black, Mixed, Native American, White,
Other) rather than reflecting one or more continuous scales.

For example, using the self-report approach a researcher might
ask each person studied in a research project which of the above
six racial categories they belong to.  Thus race would be defined
and measured in terms of the six categories.  A second researcher
might define race in terms of eight or ten or even more catego-
ries reflecting the many identifiable groups of people in the
world.

The various definitions of race are closely associated, but are
different because assignments to racial categories by one of the
definitions will sometimes disagree with assignments by another.
For example, a person may ancestrally belong in whole or in large
part to one race, but may report or appear as belonging to an-
other.

Definitions of race in the first class (biological ancestry) are
generally preferred to definitions in the other two classes be-
cause the first class seems basic, and the other two classes seem
to be merely less accurate reflections of it.  However, defini-
tions in the first (and third) class are often difficult to im-
plement in practice in research.  (Some of the difficulties arise
from respected ethical considerations.)  Thus if a research pro-
ject is performed on people in which each person’s race is meas-
ured, the researcher will often measure race in terms of self-
reported race.

Is Performance “Causally Dependent” on Race?

Suppose that a research project is carried out to study the rela-
tionship between (a) a measure of human performance (or behavior)
as the response variable and (b) a set of other variables that
are predictor variables.  The predictor variables may reflect a
person’s attributes and may reflect manipulations applied to the
person in an experiment.  Suppose the researcher includes in the
research a predictor variable that reflects race (or ethnicity).
And suppose that the research project finds good evidence of a
relationship between performance and race -- the average level of
performance of people from one race is significantly higher than
the average level of performance of people from another race.
This raises a key question:  Can we conclude from this relation-
ship between variables that human performance depends to some ex-
tent on one’s race?  In other words, can we conclude that differ-
ences in race cause differences in performance?

For example, evidence exists that Asians score somewhat higher
(on average) on intelligence tests than Whites, who in turn score
somewhat higher (on average) than Blacks.  Does this imply that a
person’s intelligence depends (partly) on their race?

No.  The evidence of the relationship doesn’t imply dependence or
a causal relationship because this relationship between variables
is invariably studied with observational research, as opposed to
proper experimental research.  Observational research must be
used because it is impossible in a practical sense to manipulate
“race” in a proper experiment.  That is, unlike assigning treat-
ments to people (or people to treatments) in an experiment, a re-
searcher can’t assign races to people (or people to races) be-
cause race has already been assigned.  Because research projects
studying the relationship between performance and race are in-
variably (in that aspect) observational research projects, the
results of the research are open to reasonable alternative expla-
nations, as discussed in problem 3 in appendix A.

For example, the relationship between intelligence test scores
and race was found through observational research.  This rela-
tionship could easily be accounted for by other causal variables
that are confounded with race and that have (unfortunately) been
omitted from the analysis (typically because they are unknown or
are deemed unimportant).  For example, due to a history of
oppression that began with Whites’ enslavement of Blacks, many
Blacks have had reduced access to educational and economic
resources, which might account for the differences in average
intelligence test scores between Blacks and Whites.  Thus if vari-
ables that properly reflect the relevant types of oppression (in-
cluding relevant historical effects) are included in the analy-
sis, the Black/White aspect of the relationship between intelli-
gence test scores and race might easily vanish.  Similarly,
Asians might score higher (on average) on intelligence tests than
Blacks and Whites due to cultural childhood influences among
Asians that emphasize disciplined logical thinking.  Thus if a
variable reflecting childhood encouragement of logical thinking
is included in the analysis, this second aspect of the relation-
ship between intelligence test scores and race might also vanish.

To minimize expensive errors, science demands unequivocal evi-
dence of causation before causation can be inferred.  But, as
noted, the results of research projects studying the relationship
between performance and race are generally equivocal because they
are open to reasonable alternative explanations.  Therefore, it
is generally scientifically impossible to infer that performance
in humans is causally dependent on a person’s race.

The word “generally” in the preceding paragraph indicates the
possibility of exceptions.  That is, it is conceivable that an
ingenious researcher might find a way to perform a proper experi-
ment or find a way to deal with all of the confounding variables
and all of the alternative explanations.  This researcher might
still find good evidence of a causal relationship between a cer-
tain response variable reflecting performance and a person’s
race.  Then we could conclude that a causal relationship exists
between performance and race.  However, the chance of this occur-
ring is low because

1. Eliminating all reasonable alternative explanations would be
very difficult or impossible.

2. Knowing that performance depends to a small extent on race is
not of much obvious theoretical or practical use.  Therefore,
there is little scientific incentive to study this type of re-
lationship.  (In contrast, knowing that certain other response
variables depend on race is sometimes quite useful, such as in
the prevention and treatment of diseases.)

3. Knowing that performance depends on race may suggest to some
people a basis for racial discrimination.  Discrimination is
undesirable because it usually harms everyone involved.
Therefore, there is a social disincentive to study this type
of relationship.

Can Performance Be Predicted in Individuals on the Basis of Their
Race?

Although it may be true that performance doesn’t depend on race,
it is still true that certain relationships exist between per-
formance and race.  For example, as noted, a relationship exists
between intelligence test scores and race.  Thus perhaps race
could be used to predict (and thereby indirectly control) per-
formance, even though a direct causal relationship between these
variables may not exist.  For example, Asians perform better (on
average) on intelligence tests than Blacks or Whites.  Therefore,
a company president wishing to maximize the intelligence of the
employees of the company might decide to hire only Asians as em-
ployees.  Although such an approach to hiring is unethical, it is
instructive to temporarily ignore the ethics and consider it from
a strictly scientific point of view.  Is it scientifically sensi-
ble to hire on the basis of a known relationship between some im-
portant measure of performance and race?

No.  It is generally inefficient to predict individual human per-
formance from race because other predictor variables are substan-
tially more accurate.  In particular, instead of using race, a
company will hire more effective employees if it bases its hiring
decisions on each job candidate’s education and experience, to-
gether with the candidate’s performance in an interview, and per-
haps the candidate’s performance on empirically valid aptitude
tests.  This approach to hiring selects more effective employees
because any differences in performance between the races are (if
present at all) very small when compared to the vast (and identi-
fiable) differences in performance that occur within each racial
group.
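
(This point can be illustrated numerically.  In the sketch below
two groups differ by one point in mean score while individual
scores within each group vary with a standard deviation of
fifteen points; the numbers are assumptions chosen only to
illustrate the within- versus between-group comparison, not
estimates of real group differences.)

```python
import random
import statistics

# Illustrative simulation: two groups whose mean scores differ by
# one point, while individual scores within each group vary with
# a standard deviation of fifteen points (all numbers are
# assumptions for illustration only).
random.seed(1)
group_a = [random.gauss(100, 15) for _ in range(5000)]
group_b = [random.gauss(101, 15) for _ in range(5000)]

between = abs(statistics.mean(group_a) - statistics.mean(group_b))
within = statistics.stdev(group_a + group_b)

# The between-group difference is tiny relative to the
# within-group spread, so group membership is a very weak
# predictor of an individual's score.
print(between < within / 5)  # True under these assumptions
```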

(Perhaps an employer could properly use education, experience,
interview performance, and aptitude test scores in a model equa-
tion to predict job performance, but still reasonably include a
predictor variable reflecting race in the equation.  Perhaps in-
cluding race would significantly improve the predictions made by
the equation, even though all the other variables are also prop-
erly used in the equation.  That is, including the other vari-
ables in the relationship might make the relationship between
performance and race more “visible” in the analysis.  This is
theoretically possible if a certain “interactive” type of rela-
tionship between variables occurs.  However, at present the far
more common outcome in social research when more predictor vari-
ables are added is that the individual predictor variables become
weaker rather than stronger due to relationships [confoundings]
among them.  Since race is already only at best a very weak pre-
dictor of performance, it too is likely to become weaker or non-
existent in a model equation as the number of predictor variables
is increased.)
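
(The weakening of individual predictors due to confounding can
also be demonstrated with artificial data.  In the sketch below y
depends causally only on z, while x is merely correlated with z;
alone, x appears to predict y, but its coefficient shrinks when z
is added.  All coefficients and sample sizes are illustrative
assumptions.)

```python
import random

# Illustrative confounding simulation: y depends causally only on
# z, while x is correlated (confounded) with z.
random.seed(1)
n = 2000
zs = [random.gauss(0, 1) for _ in range(n)]
xs = [z + random.gauss(0, 0.5) for z in zs]    # x confounded with z
ys = [2 * z + random.gauss(0, 1) for z in zs]  # y depends only on z

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# Simple regression of y on x (through the origin; all variables
# have mean approximately zero).
b_simple = dot(xs, ys) / dot(xs, xs)

# Two-predictor regression of y on x and z: solve the 2-by-2
# normal equations by Cramer's rule.
sxx, sxz, szz = dot(xs, xs), dot(xs, zs), dot(zs, zs)
sxy, szy = dot(xs, ys), dot(zs, ys)
det = sxx * szz - sxz * sxz
b_x = (sxy * szz - szy * sxz) / det

# x's coefficient shrinks toward zero once z is included.
print(abs(b_x) < abs(b_simple))  # True
```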

Therefore, even if ethical considerations are ignored, it is gen-
erally not scientifically reasonable to predict performance in
individuals on the basis of a known relationship between perform-
ance and race.

The Social Taboo Against Concluding That Performance Depends On
or Is Predictable From Race

Despite the preceding points, there is a tendency among some peo-
ple to think that one race (or religion) is superior to others in
one or more areas of performance.  Unfortunately, this point of
view can lead to appalling undeserved human suffering.  There-
fore, civilized society uses another important incentive to work
in concert with the somewhat complicated logical arguments in the
preceding paragraphs that performance can’t be reasonably viewed
as depending on (or as predictable from) race.  This incentive
operates in the ethical realm and exists in the form of a strong
social taboo against concluding that performance depends on race.
This taboo exists without any need for justification in the sense
that many people accept it on an intuitive level without ques-
tioning it (because it is fair).

The taboo is vividly illustrated by the experience of Glayde
Whitney, a behavior geneticist with a record of distinguished re-
search in the genetics of mouse taste and the 1995 President of
the Behavior Genetics Association (BGA).  In view of
the BGA’s name, many of its members have considered the idea of
relationships between performance and race.  Interestingly, most
behavior geneticists believe that no such relationships exist.
Thus Whitney astounded the association by suggesting in his
Presidential Address that race plays a role in causing murders.
He presented (in a speech at an evening banquet) reliable evi-
dence that the murder rate in the United States was significantly
higher among non-Whites than among Whites.  He then said

Like it or not, it is a reasonable scientific hypothesis
that some, perhaps much, of the race difference in murder
rate is caused by genetic differences in contributory
variables such as low intelligence, lack of empathy, ag-
gressive acting out, and impulsive lack of foresight
(1995, p. 336).

The next morning Whitney was shunned at the meeting of the BGA
Executive Committee, and the committee voted (with Whitney ab-
staining) to issue an official statement denouncing his comments.
Also, the editor of the BGA journal declined (contrary to stan-
dard policy) to publish the text of the Presidential Address in
the journal (Whitney, 1995).  After the meeting the incoming 1996
BGA president circulated an open letter calling Whitney’s com-
ments “nonscientific, misleading, and cruel,” and urging Whitney
to resign from the association (“Specter at the Feast,” 1995).

Whitney’s hypothesis is that race exerts a causal influence on
murder, and he was correct in saying that this hypothesis is a
“reasonable scientific hypothesis”.  However, due to the possi-
bility of reasonable alternative explanations (perhaps in terms
of poverty and alienation), he erred in believing that the murder
statistics properly support the hypothesis.

In view of the error in scientific logic and in view of the taboo
against concluding that performance depends on race, the members
of the Behavior Genetics Association moved quickly to distance
themselves from Whitney’s scientifically unfounded and socially
inappropriate causal conclusion.

(A similar taboo pertains to concluding that performance in indi-
viduals depends on their sex [gender].  We [society] allow meas-
ures of physical performance to depend on sex because sufficient
obvious differences exist between the sexes in determinants of
physical performance [e.g., in average body weight] to justify
such differences.  We also generally allow differences in “emo-
tional” performance between the sexes, although the distinction
may be diminishing.  However, we have a justified strong social
prohibition against concluding that intellectual performance de-
pends on sex because such differences might be used by some peo-
ple as a basis for sex discrimination.)

Does Performance Not Depend on Race?

The discussion above suggests that we can’t reasonably conclude
that performance depends on race.  It is instructive to consider
the negation of this idea.  That is, can we conclude that per-
formance doesn’t depend on race?

Many people believe that human performance doesn’t directly de-
pend on race.  (I am in this group.)  However, the statement that
performance doesn’t depend on race is a statement of a scientific
“null hypothesis” -- a statement that something doesn’t exist.
(Here the null hypothesis says that no causal relationship exists
in humans between a given measure of performance and race.)  It
is impossible to scientifically prove that something that is
logically possible doesn’t exist (assuming that the size of the
thing isn’t specified).  Thus a null hypothesis can’t be directly
empirically supported.  Thus we can’t scientifically prove that
performance doesn’t depend (in some perhaps very small way) on
race.

Despite the preceding point, scientific logic dictates (through
the principle of parsimony) that we assume that a null hypothesis
is true until (if ever) incontrovertible empirical evidence to
the contrary is brought forward.  Thus (since no incontrovertible
evidence is presently available) we assume that performance
doesn’t depend on race, even though we can’t prove it is true.

Rejecting the Null Hypothesis

As a rule, scientists are highly interested in properly rejecting
null hypotheses about causal relationships between variables.
This rejection is performed by finding empirical evidence that
implies the existence of the relationship.  Scientists are inter-
ested in rejecting null hypotheses because the knowledge gained
in rejecting a (carefully chosen) null hypothesis is generally of
theoretical or practical use.

However, the case of relationships between performance and race
is an important exception.  In this case most scientists and
other thoughtful people are not interested in trying to reject
the null hypothesis because, as noted, rejection is not seen as
being particularly scientifically useful, and rejection might be
used by some people as a basis for racial discrimination.

Summing Up

The preceding discussion leads to a certain type of negative
(null) conclusion.  A conclusion of this type is often unstated
because experienced scientists take such a conclusion for granted
until (if ever) it is rejected.  However, in view of the harmful-
ness of racial discrimination, the conclusion is worth stating:
There is presently no convincing scientific evidence that per-
formance (or behavior) in individuals can be reasonably predicted
from their race or ethnicity.

Appendix D: Specifying a Repeated Measurements
Analysis of Variance

The procedure for requesting a repeated measurements analysis of
variance from a statistical analysis computer program is compli-
cated because one must understand two somewhat complicated lan-
guages:

- the language of statistical ideas related to repeated measure-
ments analysis of variance (i.e., variation, within- and be-
tween-entity variation, main effect, interaction, and p-value)

- the language of the computer program chosen to analyze the
data.  (In general, each program uses a different proprietary
language to specify the required information.)

Requesting a repeated measurements analysis of variance is fur-
ther complicated because two layouts are available for organizing
the data table, most software can analyze data organized accord-
ing to only one of the layouts, and software manuals sometimes
don’t carefully distinguish between the layouts.

One layout for organizing the data is with one row of data per
response-variable value.  For example, suppose we perform a re-
peated measurements experiment to compare teaching approach A
with teaching approach B using a measure of knowledge as the re-
sponse variable.  And suppose we measure the students’ knowledge
of the subject area before they are exposed to the teaching ap-
proaches and we measure their knowledge again after each student
has had three months of exposure to one or the other of the ap-
proaches.  Then our data table might be organized as follows:

--------------------------------------
          Teaching
Student   Approach   Time    Knowledge
--------------------------------------
Jack        A       Before      55
Jack        A       After       65
Mary        B       Before      63
Mary        B       After       75
Jean        A       Before      68
Jean        A       After       69
Bill        B       Before      49
Bill        B       After       82
etc.
--------------------------------------

The table indicates that Jack had a measured knowledge value of
55 before receiving teaching approach A and a measured knowledge
value of 65 after receiving teaching approach A, and so on for
the other students.

A second layout for organizing the data is with one row of data
per experimental entity, i.e., one row per student in the present
discussion.  Under this layout the information in the above table
could be organized as follows:

------------------------------------------
          Teaching   Knowledge   Knowledge
Student   Approach     Before      After
------------------------------------------
Jack        A          55          65
Mary        B          63          75
Jean        A          68          69
Bill        B          49          82
etc.
------------------------------------------

In this second layout for organizing the data the response vari-
able (Knowledge) has multiple columns in the data table, with
these columns reflecting the repeated measurements aspect of the
research, with one column for each time the response variable was
measured in the students.  Traditionally this second layout has
been used to organize repeated measurements data.  (This may be
because this layout is non-redundant and thus more compact than
the first layout.)  However, the first layout may be slightly
easier to understand because each variable has only a single col-
umn in the data table and no variables are hidden or implicit.
(In the table immediately above the variable Knowledge has two
columns in the table and the variable Time has no column -- time
is implied by the two Knowledge columns.)  I hope that statisti-
cal software developers will debate the advantages of the two
layouts for organizing repeated measurements data and then stan-
dardize on the better layout (or perhaps make both layouts avail-
able).

It is easy for an expert to use statistical software to convert a
data table from one of the two layouts for organizing repeated
measurements data to the other.  However, for a less experienced
researcher this conversion can be surprisingly difficult in the
details.  Thus I recommend that less experienced researchers de-
termine which data organization their software requires and then
ensure that the data table is organized properly from the start.
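
(For illustration, the conversion from the second layout [one row
per student] to the first layout [one row per response-variable
value] can be sketched in a few lines of Python; the column names
follow the example tables above.)

```python
# Convert repeated measurements data from the second layout (one
# row per student) to the first layout (one row per
# response-variable value).  Column names follow the example
# tables in the text.
wide_rows = [
    {"Student": "Jack", "Approach": "A", "Before": 55, "After": 65},
    {"Student": "Mary", "Approach": "B", "Before": 63, "After": 75},
    {"Student": "Jean", "Approach": "A", "Before": 68, "After": 69},
    {"Student": "Bill", "Approach": "B", "Before": 49, "After": 82},
]

long_rows = []
for row in wide_rows:
    for time in ("Before", "After"):   # the hidden Time variable
        long_rows.append({"Student": row["Student"],
                          "Approach": row["Approach"],
                          "Time": time,
                          "Knowledge": row[time]})

print(len(long_rows))  # 8 rows: one per response-variable value
```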

References

Aiken, L. R. 2002. Attitudes and related psychosocial constructs:
Theories, assessment, and research. Thousand Oaks, CA: Sage.

Bailar, J. C., III, and Mosteller, F., eds. 1992. Medical uses of
statistics (2nd ed.). Boston: NEJM (New England Journal of
Medicine) Books.

Box, G. E. P., Hunter, J. S., and Hunter, W. G. 2005. Statistics
for experimenters (2nd ed.). New York: John Wiley.

Fleiss, J. L. 1986. The design and analysis of clinical experi-
ments. New York: John Wiley.

Kirk, R. E. 1995. Experimental design: Procedures for the behav-
ioral sciences (3rd ed.). Pacific Grove, CA: Brooks/Cole.

Krosnick, J. A., Judd, C. M., and Wittenbrink, B. 2005. The meas-
urement of attitudes. In D. Albarracin, B. T. Johnson, and M.
P. Zanna (Eds.), The handbook of attitudes (pp. 21-76). Mah-
wah, NJ: Lawrence Erlbaum.

Ladson-Billings, G., and Tate, W. 2006. 2006 American Educational
Research Association annual meeting theme: Education research
in the public interest. Available at
http://www.aera.net/annualmeeting/?id=694

Macnaughton, D. B. 2002. The introductory statistics course: The
entity-property-relationship approach. Available at
http://www.matstat.com/teach

McCullagh, P., and Nelder, J. A. 1989. Generalized linear models
(2nd ed.). London: Chapman and Hall.

Miller, G. A. 1956. The magical number seven, plus or minus two:
Some limits on our capacity for processing information. Psy-
chological Review 63:81-97. Also available at
http://www.well.com/~smalin/miller.html

Specter at the Feast. 1995 (July 7). Science 269:35.

Thompson, B. 2004. Exploratory and confirmatory factor analysis:
Understanding concepts and applications. Washington, DC:
American Psychological Association.

Whitney, G. 1995. Ideology and censorship in behavior genetics.
Mankind Quarterly 35:327-342. Also available at
http://www.lrainc.com/swtaboo/taboos/gw-icbg.html

Winer, B. J., Brown, D. R., and Michels, K. M. 1991. Statistical
principles in experimental design (3rd ed.). New York: McGraw-
Hill.