Subject: Experimental Research In Education: The Most Exciting Talk 
         at the 2005 Joint Statistical Meetings 

     To: EdStat E-Mail List
         ApStat E-Mail List
         Teaching Statistics E-Mail List
         sci.stat.edu Usenet Newsgroup

   From: Donald B. Macnaughton < donmac@matstat.com >

   Date: Thursday August 25, 2005

     Cc: Mack Shelley, Candace Schau


For me, the most exciting talk at the 2005 Joint Statistical 
Meetings in Minneapolis (August 7-11) was an invited talk given 
by Mack Shelley, II of Iowa State University. The talk was enti-
tled "Education Research Meets the Gold Standard: Statistics, 
Education, and Research Methods after 'No Child Left Behind'". 
Here's the abstract: 

    This talk will inform the national debate over the role 
    of scientific standards for research in education, par-
    ticularly as those standards are influenced by statisti-
    cal methods and theory. It will bring together research-
    ers in statistics and education to discuss the dramati-
    cally changing context of contemporary education re-
    search. Standards for acceptable research in this key 
    area are affected greatly by creation of the Institute of 
    Education Sciences in the U.S. Department of Education 
    and passage of the No Child Left Behind Act of 2001 and 
    the Education Sciences Reform Act (H.R. 3801). These re-
    constituted federal support for research and dissemina-
    tion of information in education, are meant to foster 
    "scientifically valid research," and established the 
    "gold standard" for education research. Greater emphasis 
    in education research now is placed on quantification, as 
    well as the use of randomized trials and the selection of 
    valid control groups. This talk should sustain and ex-
    pand the dialogue between the statistical community and 
    those who implement the education research agenda. 

The PowerPoint slides for the talk are at 

Consider a definition: 

    An experiment (or randomized trial) is a "proper" experi-
    ment if it has been performed according to generally ac-
    cepted principles of scientific practice and experimental 
    design, as described by Bailar and Mosteller (1992), Box, 
    Hunter, and Hunter (2005), Fleiss (1986), Kirk (1995), 
    Winer, Brown, and Michels (1991), and many others. 

In the past, proper experiments in education research have gener-
ally not been done, mostly because such experiments are somewhat 
complicated. Instead, educators have relied on observational re-
search (and anecdotes) to support education policy decisions. 
However, this approach is unreliable, as illustrated by the New 
Math fiasco in the 1960's and early 1970's (Fang 1968; Kline 
1973; Miller 1990; Stein 1996, chap. 12). 

Proper experiments in education research have two important ad-
vantages over observational research: 

- Proper experiments are unequivocal, but observational research 
  is invariably equivocal. Thus proper experiments in education 
  research (for the most part) reliably increase our knowledge of 
  how best to conduct an education program. (This is fully 
  analogous to the way that proper experiments in medicine have 
  greatly increased our knowledge of how to promote wellness and 
  fight disease.) 

- Proper experiments focus attention on the important question of 
  what we would like education to do for us. (This focus is 
  through the choice of the response [outcome, dependent] vari-
  able[s] for the experiment.) 

Perhaps due to lack of knowledge, some education researchers con-
tinue to perform less rigorous education research, which leads to 
wastage of time, wastage of opportunities, and wastage of money. 
In view of this wastage, and as noted by Shelley in his talk, the 
"What Works" arm of the United States Department of Education is 
beginning to rate education research projects on how well they 
satisfy the requirements of proper research. Each evaluated re-
search project is given a rating on a three-point scale, with the 
levels being (1) Meets evidence standards, (2) Meets evidence 
standards with reservations, and (3) Does not meet evidence 

Researchers planning to perform education research may find it 
helpful to read about the evidence standards to review or learn 
what's needed for education research to be "proper". The What 
Works program is described at http://www.whatworks.ed.gov 

An effective way for a less experienced researcher to ensure that 
their research is proper is to collaborate with another re-
searcher who is familiar with experimental design and the pit-
falls of education research. 

I discuss some issues about proper research in the field of sta-
tistics education in appendices A and B, and I discuss a way to 
substantially increase the power of statistical tests in educa-
tion experiments in appendix C. 

Don Macnaughton 

Donald B. Macnaughton   MatStat Research Consulting Inc
donmac@matstat.com      Toronto, Canada


APPENDIX A

The discussion in the body of this essay applies to all areas of 
education. Of special interest to me is the area of STATISTICS 
education and in particular the introductory statistics course 
for students who aren't majoring in statistics. This course is 
important because statistics is a cornerstone of science, and 
thus proper understanding of the basic use of statistics in sci-
entific research will give students a better understanding of 

Unfortunately, many students fail to understand the introductory 
statistics course. We know this from the experience that most 
statistics teachers have with non-statisticians they meet, perhaps 
at a party -- many non-statisticians report that they took an in-
troductory statistics course, but were totally lost. 

It seems clear that we can improve the introductory statistics 
course through proper designed experiments aimed at determining 
which selection of topics and which teaching approaches give stu-
dents the greatest benefits. I discuss a useful response variable 
for experiments in statistics education in appendix B. 

Some researchers in statistics education do not perform experi-
ments and they state that proper experimental research in statis-
tics education is premature. If asked why experimental research 
is premature, these researchers give vague answers, such as say-
ing that "preliminary work" must be done. I hope that research-
ers who believe that proper experimental research in statistics 
education is premature will clearly spell out the steps they feel 
are necessary before this very important research can begin. It 
is wasteful to delay when so many students fail to understand our 


APPENDIX B

Candace Schau has developed the Survey of Attitudes Toward Sta-
tistics (SATS). This survey, which can be administered to stu-
dents in less than ten minutes, consists of 36 statements that 
students rate on a seven-point scale ranging from "strongly dis-
agree" to "strongly agree". For example, the fifth statement is 
"Statistics is worthless." The SATS provides six reliable scores 
for a student that reflect the student's attitudes toward statis-
tics on scales that are named Value, Affect, Cognitive Compe-
tence, Difficulty, Interest, and Effort. 

The Value scale is particularly important because it reflects how 
highly students value the field of statistics. We might say the 
"best" introductory statistics course for a group of students is 
the course that most improves the students' scores on the SATS 
Value scale. This is reasonable because a student's sense of the 
value of the field of statistics (as instilled by a course) is 
arguably more important than any statistical knowledge instilled 
by the course. Statistical knowledge (e.g., how to do a t-test) 
is generally forgotten shortly after the student completes the 
final exam, whereas the student's valuation of our field usually 
lasts a lifetime and drives his or her decisions and remarks 
about the field. 

Administering and scoring the SATS is easy. Therefore, I encour-
age every introductory statistics teacher to administer it to 
their students before and after each statistics course they 
teach. If you compute the average difference between the stu-
dents' "before" and "after" scores, you can determine whether the 
course tends to make students' attitudes better or worse, and by 
roughly how much. 
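As a minimal sketch of the before/after comparison just described, 
the following Python computes the mean change on one SATS scale 
and a rough standard error for it. The scores are made-up 
illustrative values (not real SATS data), and the function name 
mean_change is hypothetical. The sketch assumes that each student 
has one "before" and one "after" score on the scale, listed in the 
same student order.

```python
from math import sqrt
from statistics import mean, stdev


def mean_change(before, after):
    """Return (mean, standard error) of the paired differences after - before."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need paired scores for at least two students")
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs), stdev(diffs) / sqrt(len(diffs))


# Hypothetical Value-scale scores (1 to 7) for six students.
before = [4.0, 5.0, 3.5, 6.0, 4.5, 5.0]
after = [4.5, 5.0, 3.0, 6.5, 5.0, 5.5]

change, se = mean_change(before, after)
print(f"mean change = {change:+.2f}, standard error = {se:.2f}")
```

A positive mean change suggests that attitudes on the scale 
improved during the course, and a negative one that they worsened. 
The standard error gives a rough sense of how precisely the change 
is estimated.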

The SATS is available at 

(The results of the SATS can be disappointing because SATS scores 
in many courses are worse at the end than at the beginning. How-
ever, for a seriously committed teacher it is useful to be aware 
of this problem and its extent as a stimulus to search for im-

(On a technical matter, the differences between the students' 
"before" and "after" scores on a SATS scale are useful as a rudi-
mentary measure. However, if SATS scores are used in analysis of 
variance, for complete information the raw "before" and "after" 
scores for the analyzed scale should be included in the analysis 
instead of merely using the differences between them.) 


APPENDIX C

Measuring the response variable in each student before the course 
begins and again at the end of the course (as discussed in appen-
dix B) can substantially increase the power of the statistical 
tests in an experiment. This is because the repeated measurement 
of the response variable enables us (when other standard condi-
tions are met) to use the statistical procedure of repeated meas-
urements analysis of variance. This results in certain key sta-
tistical tests being based on "within-student" comparisons, which 
generally provide substantially more powerful tests than the be-
tween-class comparisons that may otherwise be necessary. 
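The gain in power can be illustrated with a small simulation. This 
is only a sketch under stated assumptions: it is not a repeated 
measurements analysis of variance, but a simpler stand-in that 
compares two programs on "after minus before" gain scores and on 
"after" scores alone. The spread of student ability, the noise 
level, the program effect, and all names are hypothetical choices 
made for illustration, and students are treated as independent 
(class and teacher effects are ignored).

```python
import random
from math import sqrt
from statistics import variance

random.seed(1)

N = 200            # students per program (hypothetical)
ABILITY_SD = 1.0   # spread of stable between-student differences
NOISE_SD = 0.3     # occasion-to-occasion measurement noise
EFFECT = 0.2       # extra improvement under program B


def simulate(effect):
    """Return ('after' scores, gain scores) for one simulated program."""
    after, gain = [], []
    for _ in range(N):
        ability = random.gauss(4.0, ABILITY_SD)
        pre = ability + random.gauss(0.0, NOISE_SD)
        post = ability + effect + random.gauss(0.0, NOISE_SD)
        after.append(post)
        gain.append(post - pre)  # the stable ability term cancels here
    return after, gain


def se_of_difference(x, y):
    """Standard error of mean(y) - mean(x) for two independent samples."""
    return sqrt(variance(x) / len(x) + variance(y) / len(y))


after_a, gain_a = simulate(0.0)
after_b, gain_b = simulate(EFFECT)

se_after = se_of_difference(after_a, after_b)
se_gain = se_of_difference(gain_a, gain_b)
print(f"SE using 'after' scores only: {se_after:.3f}")
print(f"SE using gain scores:         {se_gain:.3f}")
```

Because each student's stable ability cancels out of the gain 
score, the standard error of the estimated program difference is 
smaller, so the same program effect is easier to detect. The 
advantage depends on the "before" and "after" scores being 
strongly correlated, as they are in this simulated setup.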

(Despite the point in the preceding paragraph, it's usually nec-
essary to study more than two classes of students in experimental 
research comparing education programs. We need more than two 
classes to eliminate the possibility of reasonable alternative 
explanations muddying the interpretation of the results. For ex-
ample, we must eliminate the possibility of teacher differences 
accounting for significant differences in the mean values of the 
response variable for the different programs being compared. We 
can eliminate this possibility with multiple classes with multi-
ple teachers. And [if proper random assignment of students to 
classes can't be done] we also need multiple classes to reduce 
the chance of a student-class selection bias.) 
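One possible way to set up multiple classes with multiple teachers 
is sketched below. It assumes (this design is an illustration, not 
something prescribed above) that each teacher teaches one class 
section per program, and that a teacher's sections are assigned to 
the programs at random. The function name assign_sections, the 
teacher names, and the section labels are all hypothetical.

```python
import random


def assign_sections(sections_by_teacher, programs=("A", "B"), seed=None):
    """Randomly assign each teacher's sections to programs, one per program."""
    rng = random.Random(seed)
    assignment = {}
    for teacher, sections in sections_by_teacher.items():
        if len(sections) != len(programs):
            raise ValueError(f"{teacher} needs {len(programs)} sections")
        shuffled = list(programs)
        rng.shuffle(shuffled)  # random order of programs for this teacher
        for section, program in zip(sections, shuffled):
            assignment[section] = program
    return assignment


# Hypothetical teachers and section labels.
sections_by_teacher = {
    "Teacher 1": ["T1-morning", "T1-evening"],
    "Teacher 2": ["T2-morning", "T2-evening"],
    "Teacher 3": ["T3-morning", "T3-evening"],
}

for section, program in assign_sections(sections_by_teacher, seed=2005).items():
    print(section, "->", program)
```

Since every teacher teaches every program in this sketch, a 
difference between the programs cannot be explained simply by one 
program having had the better teachers.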


REFERENCES

Bailar, J. C., III, and Mosteller, F., eds. 1992. _Medical Uses 
   of Statistics_ (2d ed.). Boston: NEJM (New England Journal of 
   Medicine) Books. 

Box, G. E. P., Hunter, J. S., and Hunter, W. G. 2005. _Statistics 
   for Experimenters._ New York: John Wiley. 

Fang, J. 1968. _Numbers Racket; The Aftermath of "New Math"._ 
   Port Washington, NY: Kennikat Press. 

Fleiss, J. L. 1986. _The Design and Analysis of Clinical Experi-
   ments._ New York: John Wiley. 

Kirk, R. E. 1995. _Experimental Design: Procedures for the 
   Behavioral Sciences_ (3d ed.). Pacific Grove, CA: Brooks/Cole. 

Kline, M. 1973. _Why Johnny Can't Add: The Failure of the New 
   Math._ New York: St. Martin's Press. 

Miller, J. W. 1990. Whatever Happened to New Math? _American 
   Heritage,_ 41(8) (Dec), 76-83. 

Stein, S. K. 1996. _Strength in Numbers: Discovering the Joy and 
   Power of Mathematics in Everyday Life._ New York: John Wiley. 

Winer, B. J., Brown, D. R., and Michels, K. M. 1991. _Statistical 
   Principles in Experimental Design_ (3d ed.). New York: 
   McGraw-Hill. 
