The Introductory Statistics Course:
The Entity-Property-Relationship Approach
Donald B. Macnaughton
http://www.matstat.com/teach/eprt0130.pdf .
This paper proposes six concepts for discussion at the beginning of an introductory statistics course for students who are not majoring in statistics or mathematics. The concepts are (1) entities, (2) properties of entities, (3) variables, (4) a major goal of empirical research: to predict and control the values of variables, (5) relationships between variables as a key to prediction and control, and (6) statistical techniques for studying relationships between variables as a means to accurate prediction and control. After students have learned the six concepts they learn standard statistical topics in terms of the concepts. It is recommended that each concept be taught in a bottom-up fashion with emphasis on concrete practical examples. It is suggested that the approach gives students a lasting appreciation of the vital role of the field of statistics in empirical research.
KEY WORDS: Statistics education; Teaching; Role of statistics in empirical research.
Two former presidents of the American Statistical Association have stated that "students frequently view statistics as the worst course taken in college" (Hogg 1991, Iman 1994). A third former president has stated that the field of statistics is in a "crisis" and the subject has become "irrelevant to much of scientific enquiry" (Box 1995). The 2001 president has stated that statistics is "still among the most despised of college courses" (Scheaffer 2001). Many statisticians reluctantly agree with these remarks.
In contrast, many statisticians agree that the field of statistics is a fundamental tool of the scientific method, which plays a key role in modern society. Thus rather than being a worst course and possibly irrelevant, the introductory statistics course ought to be a friendly introduction to the simplicity, beauty, and truth of the scientific method. Teachers must therefore reshape the introductory course. Many teachers have already contributed to the reshaping, as noted below.
This paper proposes further changes. I focus on the introductory statistics course for students who are not majoring in statistics or mathematics, whom I call "non-statistics-majors". Most students who take an introductory statistics course are members of this group. The introductory course for non-statistics-majors is important because it is a main seedbed of public opinion about the field of statistics.
Section 2 defines the concept of 'empirical research', which appears throughout this paper. Section 3 recommends two goals for the introductory statistics course. Section 4 proposes six concepts for discussion at the beginning of an introductory course. Section 5 illustrates how the six concepts provide a deep and broad foundation on which we can build the field of statistics. Section 6 discusses testing the proposed approach. Section 7 identifies considerations for teachers wishing to use the approach, and Section 8 gives a summary.
2. A DEFINITION OF "EMPIRICAL RESEARCH"
At many points in this paper I refer to "empirical research", which thus deserves a definition:
Empirical research is any activity in which data are gathered from some area of experience and then conclusions are drawn from the data about the area of experience.
Empirical research is a crucial step of the scientific method, which is central to many areas of human endeavor, such as science, education, business, industry, law, and government. Section 5.5 discusses the scientific method.
3.1 The Value of Emphasizing Goals
Emphasizing the goals of an undertaking helps one to define and focus on what is most important. This helps substantially in structuring work efficiently. Thus, following Hogg (1990, 1992), I recommend that all introductory statistics courses have carefully considered goals. Ask yourself -- what are the goals of the introductory statistics course most familiar to you? I have observed goal-setting exercises in which the goals were given much attention for a brief period, but then were forgotten or ignored in the subsequent months and years -- a waste of a valuable resource. Because the course goals specify what the teacher believes is most important, I recommend that teachers regularly revisit their goals to ask (a) if the goals are still reasonable and (b) if the day-to-day operations are effectively serving the goals.
3.2 Topic-Based Goals Have a Significant Drawback
Many introductory statistics courses have what can be called "topic-based" goals. A teacher using such goals does not specify general goals, but instead simply specifies a list of statistical topics to be covered in the course, perhaps in the form of a syllabus. For example, a teacher of a traditional course might aim to cover (in specified amounts of detail) the topics of probability theory, distribution theory, point and interval estimation, and hypothesis testing. Similarly, a teacher of an activity-based course might make a list of statistical topics and then assign various activities to the students in order to cover the topics. Unfortunately, topic-based goals have a significant drawback: By emphasizing lower-level statistical topics, these goals usually fail to emphasize what is essential, which is to help students to appreciate the vital role or use of the field of statistics. Unless students understand and appreciate the general role of statistics, knowledge of statistical topics will be both of little interest to them and of little use.
3.3 Recommended Goals (A Lasting Appreciation of the Role of Statistics)
I recommend that the goals of an introductory statistics course for non-statistics-majors be
I suggest that it is more important to satisfy these goals than to satisfy goals stated in terms of statistical topics. How can we best satisfy these goals? First, it seems clear that students can appreciate the role of statistics only if they understand it. This leads one to ask, "What is the role of statistics?" The next section describes a sequence of simple concepts that give students a broad overview of the vital role of the field of statistics in empirical research. Second, but of equal importance, most students will appreciate the role of statistics only if they see the practical value of statistics. Thus I recommend that introductory teachers place heavy emphasis on practical examples of the use of statistics. Surprisingly, some introductory approaches use examples that are not practical. Practical examples are discussed at several places below, with general remarks in Section 7.5. Other authors (after Hogg) who discuss goals for the introductory course include Chromiak, Hoefler, Rossman, and Tesman (1992), Cobb (1992, 1993, 2000), Iversen (1992), Watkins, Burrill, Landwehr, and Scheaffer (1992), Hoerl and Snee (1995), Gal and Garfield (1997a, pp. 2 - 5), Moore (1997a), and Garfield (2000). This section describes six simple concepts I recommend for discussion at the beginning of an introductory statistics course for non-statistics-majors. These concepts help students to appreciate the role of statistics by highlighting a recurring simple pattern in the use of statistics across almost all empirical research. Most university and college students can learn the concepts in between one and three class sessions. To make the approach easy for teachers to use, the following six subsections present the six concepts as a condensed version of how they might be presented to students. Teachers using the approach in an introductory course will need to expand the discussion with more examples, as discussed in Sections 7.5 through 7.8. 
At a few points in this section I discuss pedagogical and statistical issues that are beyond the interest or understanding of most beginning students. These discussions are for teachers and are identified with initial asterisks. I recommend that this material be omitted in an introductory course for non-statistics-majors. I begin with what may be the most fundamental concept of human reality. If you study your train of thought, you will probably agree that you think about "things". For example, during the next few seconds you may think about, among other things, a friend, an appointment, today's weather, and an idea. Each of these things is an example of an entity. Many different types of entities exist. Some common types are
Clearly, the things in the list are diverse. Thus it may at first seem absurd to think that they have anything in common. But they all do have one important property in common -- they are all things (entities). *Appendix A discusses whether some of the things in the list are really things or entities. Entities are fundamental units of human reality because people unconsciously view everything (every thing) in our reality as being an entity. This dramatically simplifies our thinking because it allows us to view everything (at the most basic level) the same way, as discussed in the next subsection.
The External World. When considering entities it is useful to consider the concept of 'external world', which can be defined as follows: The external world is what is "out there" -- what we see when we look out the windows in our heads and what we sense through our other senses. People usually view entities as existing both in the external world and in our minds. We use the entities in our minds mainly to stand for entities in the external world, much as we use a map to stand for its territory. People learn to use the concept of 'entity' when they are infants. We use the concept unconsciously as a way of organizing the multitude of stimuli that enter our minds from the external world moment by moment while we are awake.
People Group Entities into Types (Populations). We learn as infants to group entities into types. For example, as infants we observe that the entities "mother" and "father" and other similar entities in the external world all have heads with two eyes, a mouth, and usually hair on top. We (unconsciously) group these entities into a type -- the type we call "people". Similarly, as infants we learn to group all inanimate physical objects (beginning with small familiar objects in the crib) into a type.
Grouping entities into types simplifies things because we recognize that all the entities of a given type have many things (properties) in common, as discussed in the next subsection. In statistics and empirical research the set of all the entities of a given type is called the "population" of entities of that type. For example, a web site on the Internet is an entity (of an electronic or computer-object sort), and the set of all the web sites on the Internet constitutes the population of web sites.
Entities and Language. In language people use nouns to stand for entities. For example, in the sentence "Peter Smith is in room 302" the proper noun "Peter Smith" identifies a particular entity that is a person. Similarly, the noun phrase "room 302" identifies an entity that is a physical location. People usually think of entities unconsciously. However, we sometimes do need to refer to them in general terms. In these situations statisticians and empirical researchers may refer to entities as members of the population, cases, elements, individuals, instances, items, objects, observations, specimens, subjects, things, or (experimental, observational, or survey) units. *I discuss why I recommend the term "entity" for general discussion in a paper (1998a, app. E.1).
*Why Discuss Entities? Most human thoughts can be expressed in sentences in language. Since most human sentences contain at least one noun, most human sentences refer directly to entities. Thus the concept of 'entity' pervades human thought. However, the concept is almost always at a deep unconscious level. Thus the concept is rarely directly discussed in everyday conversation, or in statistics, or in empirical research. This raises the question of whether we need to discuss the concept of 'entity' at the beginning of an introductory statistics course. If we are dealing with a specific problem in statistics or empirical research, we can usually omit direct discussion of entities.
This is because in specific problems we rarely need to drill down all the way to the foundational concept and discuss "things" at such a basic level. Instead, discussions invariably concern one or more specific types of entities, which are best referred to by their type names. For example, medical researchers study two familiar types of entities called "people" and "diseases". On the other hand, if we are dealing with the general problems of statistics and empirical research, it is reasonable to begin with the concept of 'entity'. This is because the easy-to-understand concept of 'entity' can be reasonably viewed as a unifying foundation for many other concepts of statistics and empirical research. This idea is directly illustrated in the following subsections and is extended in Section 5.3.
*Teaching the Concept of 'Entity'. The concept (principle) of 'entity' is basic and general (and abstract). Moore, in an important article about statistics education, notes that students in introductory statistics courses (and people in general) often have difficulty learning from basic general principles down to special cases (2001, sec. 4). I fully agree with Moore's point. Clearly, unless we bypass the "top-down" nature of the concept of 'entity', its inherent abstractness and generality will be fatal stumbling blocks for many less advanced students. To bypass the need for students to learn in a top-down fashion, and to ensure that students understand the concept of 'entity', I recommend that teachers develop it using a "bottom-up" sequence of ideas, beginning with familiar concrete examples of entities and working up at an appropriate pace for the students to the general concept. I further discuss the bottom-up approach to teaching statistical concepts in a Usenet post (forthcoming).
Every entity has associated with it a set of properties. For example, all people have thousands of different properties, two of which are "height" and "blood group".
Values of Properties.
For any particular entity, each of its properties has a value. We usually report the value of a property with a number, with words, or with a symbol. For example, on April 12, 2001, in a number (the value of) my height was 176.3 centimeters, in a word my height was "medium", and in a symbol my blood group was "O pos". We know or experience an entity by knowing or experiencing its properties and their values. (We know or experience a living entity by [in large part] knowing or experiencing its behavior. Researchers who study behavior usually view it as a complicated set of properties of the living entities they study.)
Variation in Values of Properties. The values of properties of entities generally vary. The values almost always vary from one entity to the next and they usually also vary within individual entities over time. For example, people's heights vary from person to person, and a person's height varies over time.
Measures of Values of Properties. A key operation of empirical research is to determine the values of selected properties of the entities under study. To determine the value of a property of an entity, researchers apply an appropriate measuring instrument to the entity. If the instrument is measuring correctly, it will return a measurement that is an estimate of the value of the property of the entity at the time of the measurement. For example, if we wish to know the (value of the) height (property) of a person, we can apply a height-measuring instrument (e.g., a tape measure) to the person, and the instrument will return a value that is (in the specified units) an estimate of the person's height. Measuring instruments (which are sometimes called "measures") are often physical devices such as a tape measure, a speedometer in a car, or litmus paper. But they may also be of other types, such as paper-and-pencil tests administered to students or subjective judgments provided by experts.
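The idea that a measuring instrument returns an estimate, rather than the value of the property itself, can be illustrated with a small simulation. The sketch below is not part of the original paper. The function name `measure_height`, the error model (normally distributed error with a standard deviation of 0.5 cm), and the fixed random seed are all assumptions chosen for illustration; only the height of 176.3 centimeters comes from the example above.

```python
import random
import statistics

TRUE_HEIGHT_CM = 176.3  # value from the text's example, treated here as the "true" value


def measure_height(true_value, rng, error_sd=0.5):
    """Return one simulated measurement: the true value plus random instrument error.

    The normal error model and the value of error_sd are illustrative assumptions.
    """
    return true_value + rng.gauss(0.0, error_sd)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
measurements = [measure_height(TRUE_HEIGHT_CM, rng) for _ in range(1000)]

# Individual measurements differ from one another and from the true value,
# but their average lies close to the true value under this error model.
mean_measurement = statistics.mean(measurements)
print("first three measurements:", [round(m, 1) for m in measurements[:3]])
print(f"mean of 1000 measurements: {mean_measurement:.2f} cm")
```

A real instrument may also be biased, in which case the average of repeated measurements would not approach the true value; the sketch assumes no such bias.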
Measuring instruments are important because all conclusions in empirical research are based directly on the estimates of values of properties obtained from measuring instruments, as discussed in the following subsections. The "True" Value of a Property of an Entity. Since a measuring instrument can generally only provide an estimate of the value of the property it measures, this leads to the question of what it means to speak of the "true" value of a property of an entity. Empirical researchers usually view the true value of a property in terms of commonly-agreed-upon measurement standards because this facilitates communication and understanding. For example, researchers in the physical sciences usually view the "true" values of the properties of the entities they study in terms of the definitions and standards maintained by the International Bureau of Weights and Measures (BIPM 2001). *I further discuss the idea of the true value of a property in a Usenet post (2001). Properties and Populations. Section 4.1 says that humans learn as infants to group entities into types (populations) such as "people" and "inanimate physical objects". An important aspect of these types is that we learn to (unconsciously) view all the entities of a given type as having exactly the same properties. However, as noted, the values of the properties generally vary from entity to entity and over time. For example, we unconsciously view all solid physical objects as having the properties "weight", "shape", "size", and "color(s)" although different solid physical objects generally have different values of these properties. Viewing all the entities of a given type as having (sharing) exactly the same properties is a key unifying principle of human reality. Properties in Language and Thought. In everyday language, people often use adjectives and adverbs to report the values of properties of entities. 
For example, if someone says "Peter Smith is tall", the adjective "tall" reports (an estimate of the value of) Peter's height. Similarly, if someone says "Peter Smith is very tall", the adverb "very" refines the report of Peter's height. Similarly, if someone says "The tiger is running quickly", the adverb "quickly" reports the value of the current "speed" property of the tiger. A property of an entity may also be called an ability, aspect, attribute, capability, capacity, character, characteristic, countenance, dimension, disposition, facet, factor, faculty, feature, finding, indication, indicator, nature, quality, quantity, scalar, trait, or vector. *Appendix B discusses why I recommend the term "property" for general discussion. *Appendix C discusses the evolution of entities and properties in human thought. *Teaching the Concept of 'Property'. As with the concept of 'entity', the main ideas associated with properties are basic, general, and abstract. Thus the ideas can be hard for less-advanced students to understand. Thus, to ensure understanding, I recommend that the ideas be introduced in a bottom-up sequence, beginning with familiar concrete examples of properties, values, variation, and measures, and working up at an appropriate pace for the students to the general ideas. As noted, empirical researchers use measuring instruments to determine estimates of the values of properties of entities. When these estimates are studied formally, statisticians and empirical researchers usually refer to them as "variables". A reasonable definition of the statistical concept of 'variable' is A variable is a formal representation of a property of entities. *Appendix D compares some dictionary definitions of the concept of 'variable'. Appendix E further discusses the distinction between properties and variables. Values of Variables. Like properties, variables have values. And like the values of properties, the values of variables generally vary. 
An important subgoal of all serious empirical research is to make the measured values of the variables as close as reasonably possible (at the time of the measurement) to the true values of the associated properties in the entities under study. Researchers do this by using measuring instruments and procedures that are as accurate as possible. This generally increases the accuracy of the conclusions they draw from empirical research. Clearly, time plays a key role in the idea of the value of a variable: In statistics and empirical research the value of a variable for an entity is generally viewed as an estimate or "snapshot" (possibly with distortion) of the true value of the associated property of the entity at a particular time (or perhaps over a particular time period). Data and Data Tables. The concepts of 'entity', 'property', and 'variable' lead directly to the concept of 'data', which can be defined as follows: Data are the (measured) values of one or more variables (properties) for one or more entities. (A single value of a single variable is called a "datum".) All empirical research projects generate data. The (raw) data from a research project (or from a logical unit of a larger research project) are invariably organized in a table. Each row in the table is associated with one entity of the type under study. Each column is associated with a different property of the entities (or a property of the entities' environment), as reflected in the values of the variable associated with the column. Each cell (intersection of a row and a column) in the table contains the value (at the time of measurement) of the variable associated with the column for the entity associated with the row. The data table (with appropriate footnotes) is the complete record of what was observed in an empirical research project. Thus the table is central to drawing reasonable conclusions from the project (as explained in the next three subsections). 
The table also provides a succinct summary of the design of the project. Thus when considering or planning an empirical research project it is helpful for students to study the data table, or a manageable number of rows if the table is large. To increase understanding I recommend that studied tables have carefully worded column headings, and that they contain realistic made-up data if real data are unavailable. *Realistic data are further discussed in Section 7.8. *Teaching the Concepts of 'Variable' and 'Data Table'. Discussion of the somewhat abstract material above is almost never enough for students to properly understand the concepts of 'variable' and 'data table'. Furthermore, as discussed in Section 5.10, students often misunderstand the concept of 'variable'. But the concept of 'variable' is central to virtually all statistical discussions. Thus students must understand variables (and data tables) if they are to understand statistics. As with entities and properties, students can readily understand the concepts of 'variable' and 'data table' if the ideas are developed in a bottom-up fashion, beginning with concrete examples and working up at an appropriate pace for the students to the general concepts. I recommend that teachers choose variables for discussion that are interesting, easy to understand, and that empirical researchers are seriously interested in studying. For example, automotive engineers are seriously interested in studying the variable "fuel usage per kilometer" in automobiles because appropriate study of this variable enables them to minimize fuel usage and thereby make automobiles less expensive to run. Thus a teacher might show students a data table with different types of automobiles representing the rows and relevant automobile variables (including "fuel usage per kilometer") representing the columns. 
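A data table of the kind just described can be sketched in code. The example below is an illustration and not part of the original paper: the automobile names, the variables other than "fuel usage per kilometer", and all of the values are made up, and representing the table as a list of dictionaries is just one convenient choice.

```python
# Each row (dictionary) stands for one entity: a type of automobile.
# Each key stands for one variable (column). All names and values are
# made up for illustration only.
data_table = [
    {"automobile": "Model A", "mass_kg": 1100, "engine_litres": 1.2, "fuel_litres_per_km": 0.055},
    {"automobile": "Model B", "mass_kg": 1400, "engine_litres": 1.8, "fuel_litres_per_km": 0.068},
    {"automobile": "Model C", "mass_kg": 1750, "engine_litres": 2.5, "fuel_litres_per_km": 0.084},
    {"automobile": "Model D", "mass_kg": 2100, "engine_litres": 3.0, "fuel_litres_per_km": 0.097},
]


def column(table, variable):
    """Return the values of one variable (one column) for every entity (row)."""
    return [row[variable] for row in table]


def cell(table, row_index, variable):
    """Return the value of one variable for one entity (one cell of the table)."""
    return table[row_index][variable]


variables = list(data_table[0].keys())
print("variables (columns):", variables)
print("entities (rows):", len(data_table))
print("fuel usage column:", column(data_table, "fuel_litres_per_km"))
print("mass of second automobile:", cell(data_table, 1, "mass_kg"))
```

Reading the table by row answers "what do we know about this entity?", while reading it by column answers "how does this variable vary across entities?".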
If the components of such a table are carefully discussed, students attain a concrete sense of the entities, properties, and variables associated with the table.
Definition of Empirical Research. To help in the meta-discussion, Section 2 above proposes a definition of the concept of 'empirical research'. For introductory statistics courses following the approach of this paper, that definition is properly presented at this point in the development of the ideas, after the introduction of the concept of 'data', which is used in the definition of "empirical research".
4.4 A Goal of Empirical Research: To Predict and Control the Values of Variables
A central idea in the definition of empirical research is that researchers "draw conclusions from data". Why do researchers wish to do this -- what are the goals of empirical research?
Prediction and Control. One important goal of empirical research is to discover how to predict and control (with maximum accuracy) the values of properties of entities. In other words, the goal is to discover how to predict and control the values of variables in entities. For example, an important goal of (empirical) medical research is to discover how to accurately predict and control the state of the human body, where the state is reflected in various medical properties or variables, such as blood pressure, white blood count, and other measures of health or disease. We seek the ability to predict and control the values of variables because it provides many social and commercial benefits. For example, if a medical researcher can discover how to better predict or control people's risk of heart attacks, this discovery provides the social benefit of saving lives.
Similarly, if an organization can discover how to better control variables that reflect important properties of its operations (e.g., customer satisfaction, product performance, product usefulness, product reliability), this discovery helps the organization to optimize its operations and thereby become more successful. Since the ability to predict and control the values of variables is of broad usefulness, many branches of society (in science, business, technology, education, and government) provide substantial support to empirical research aimed at learning how to predict or control the values of key properties or variables. *Explanation and Understanding. A second goal of empirical research is to explain and understand the area of experience under study in the research. However, examination of the concepts of 'explanation' and 'understanding' suggests that these concepts are subordinate to prediction and control for three reasons:
I support these points in two Usenet posts (1996a, 1997b). See also Section 5.5 below. *Prediction and Control Deserve Emphasis. The preceding discussion notes the substantial social and commercial benefits of accurate prediction and control of the values of variables. The discussion also notes the relationship between prediction and control on the one hand and explanation and understanding on the other. Later discussion (Section 5.4) illustrates how it is useful to characterize many of the standard statistical procedures as procedures for achieving accurate prediction or control of the values of variables. These points suggest it is reasonable to focus the introductory statistics course for non-statistics-majors on the use of statistics in empirical research as a means to accurate prediction and control. Where Do We Predict and Control? Before considering the main question of how to predict and control the values of variables, it is useful to consider the following preliminary question: Where do statisticians and empirical researchers predict and control the values of variables? We predict and control the values of variables in the entities in the population of entities under study. We seek the ability to predict and control the values of variables in entities in populations because this approach enables us to make our knowledge as general as possible. Such generality is desirable because the ability to predict or control the values of a variable in any entity in a broad population is almost always more useful than the ability to predict or control (with the same accuracy) the values of the same variable in some subset of the population. For example, in medical research the population of entities of interest is often all the people in the world. The goal of the research is to find ways to predict or control the values of important medical variables in any person in the population (ideally including all people living, dead, and unborn). 
Similarly, in organizational research the population of entities under study might be all the weeks in the life of a particular organization. Here the goal of the research might be to find ways to predict and control the values of important organizational variables in any week (especially later weeks) in the life of the organization. Thus if we include the concept of 'population', we can say:
A fundamental goal of empirical research is to discover how to predict and control (with maximum accuracy) the values of variables (properties) in entities in populations.
4.5 Relationships Between Variables as a Key to Prediction and Control
The Concept of 'Relationship Between Variables'. Given the goal of predicting and controlling the values of variables, a key question is:
How can we predict and control the values of variables?
The main answer is:
We can predict and control the values of variables by studying relationships between variables.
In a "relationship between variables" one variable (called the response variable) "depends" on one or more other variables (called the predictor variable[s]). Almost all prediction and control in all areas of empirical research is done on the basis of this simple idea. For example, medical researchers have discovered that a relationship exists between the amount of saturated fat ingested by a person (predictor variable) and the risk that a person will have a heart attack (response variable). The relationship is that more saturated fat is associated with a higher risk of a heart attack. Knowing this relationship helps doctors and patients to predict and control heart attacks. (Empirical research about the relationship between saturated fat and heart attacks is summarized by Kromhout 1999, Liebson and Amsterdam 1999, and de Lorgeril and Salen 2000.)
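The notion that a response variable "depends" on a predictor variable can be made concrete with a small numerical sketch. The code below is not from the original paper: the data are made up, the names `predictor` and `response` are hypothetical, and the Pearson correlation coefficient and least-squares line are used here as one common way (an assumption of this sketch, not the only way) to summarize a relationship in a sample and to predict the response from the predictor.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient: a measure of linear association, between -1 and +1."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def fit_line(x, y):
    """Least-squares slope and intercept for predicting y (response) from x (predictor)."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


# Made-up sample: one pair of values per entity.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
response = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

r = pearson_r(predictor, response)
slope, intercept = fit_line(predictor, response)

# Predict the response for a new entity whose predictor value is 7.0.
predicted = intercept + slope * 7.0
print(f"correlation: {r:.3f}")
print(f"predicted response at predictor = 7.0: {predicted:.2f}")
```

A correlation close to +1 or -1 indicates that the response values in the sample lie nearly on a straight line when plotted against the predictor values. Whether such a relationship holds in the wider population is a separate question that the sample alone does not settle.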
In addition to using the concept of 'dependence' to characterize relationships between variables, we can characterize them as follows: A relationship exists between two variables if we find that when the values of the predictor variable(s) "go up and down" in the entities under study (or in the entities' environment), the values of the response variable also go up and down (or down and up) somewhat "in step" with the values of the predictor variable(s). For many non-statistics-majors the above two informal characterizations of relationships are sufficient if they are properly illustrated with practical examples. For more advanced students I propose a formal definition of the concept of 'relationship between variables' in a paper for students (1997a, sec. 7.10) and I discuss an important alternative definition in a Usenet post (2002). Population and Sample. When empirical researchers study a relationship between variables they usually do not attempt to directly study the relationship in every entity in the population of interest because that would be impossible or prohibitively expensive. Instead, researchers study the relationship in a subset of the population, which is called a "sample". Researchers usually design empirical research projects with between 6 and 2000 entities in the sample. A View of Empirical Research. Examination of empirical research projects suggests that most can be reasonably viewed as attempting to make a correct generalization for a population of entities about a relationship between variables. The generalization is made from studying the relationship between the variables in the data table for the entities in the selected sample. The generalization (if made properly) enables us to accurately predict or control the values of the property associated with the response variable in new situations for any entity in the population. 
For example, the medical researchers who discovered the relationship between saturated fat consumption and heart attacks did so by studying data tables for samples of people. These tables have one or more variables for each person that reflect the person's heart attacks and also one or more variables that reflect the person's fat consumption. The researchers used statistical procedures to look for a relationship between fat consumption and heart attacks in the tables, and such a relationship has been reliably found. These findings (and other supporting information) lead doctors to believe that the relationship exists in all the people in the world. Accuracy of Predictions. Prediction or control that is done on the basis of relationships between variables is generally not perfectly accurate. However, if the prediction or control is done properly, mathematical proofs exist to show that it is of the highest possible accuracy given the available information. Examples of Relationships. We can show students the pervasiveness and usefulness of relationships between variables by discussing practical examples. For example, teachers and students can discuss (using appropriate graphics) whether a relationship exists between
Each of these examples identifies a possible relationship between variables. Each of these relationships (and relationships between any other pairs or larger sets of compatible variables) can be studied in an empirical research project. If the research project finds conclusive evidence of the relationship, we can use the knowledge of the relationship to predict and possibly control the values of the response variable in new entities from the population on the basis of their values of the predictor variable(s). Nine Questions About an Empirical Research Project. One can usually understand an empirical research project by considering it in terms of nine questions, which are
By considering sufficient practical examples, students recognize that most empirical (including most scientific) research projects can be reasonably understood by considering them in terms of the nine questions. Thus students recognize that most empirical research projects can be reasonably viewed as studies of relationships between variables in entities in samples, with the aim being to develop the ability to accurately predict or control the values of the property associated with the response variable in new situations for any entity in the population. *Appendix H discusses some possible counterexamples to the points in the preceding paragraph. Section 5.5 discusses the relationship between relationships between variables and some general concepts of science. Mosteller (1990) and Lipsey (1990) discuss the idea of a reasonable alternative explanation. I briefly discuss some history of the concept of 'relationship between variables' in a Usenet post (2001). Terminology. More than eighty terms are available to name the concept of 'relationship' in the phrase "relationship between variables". For example, we can speak of an "association" between variables, or a "relation" between variables, or a "dependence" between variables. *Appendix F discusses why I recommend the term "relationship" for general discussion. Similarly, several general terms are available to name the response variable and the predictor variable(s) in a relationship between variables. For example, a response variable may be called a "predicted" variable or a "dependent" variable, and a predictor variable may be called an "explanatory" variable or an "independent" variable. *Appendix G discusses why I recommend the terms "response" and "predictor" for general discussion. *Statistical Concepts. In the initial discussion of the above ideas I recommend that teachers not introduce any additional statistical concepts beyond some simple graphics to illustrate relationships. 
I recommend this approach because I believe it is important for students to develop (through practical examples) a strong understanding of the unifying concept of 'relationship between variables' before they try to understand the complicated statistical ideas behind the study of relationships.

4.6 Statistical Techniques for Studying Relationships Between Variables as a Means to Accurate Prediction and Control

Once students properly understand and appreciate the usefulness of relationships between variables as a means to prediction and control, we can then bring the field of statistics out onto the stage. We can introduce the role of statistics as follows:

Statistics is a set of optimal general techniques to help empirical researchers study variables and relationships between variables in entities in samples, mainly as a means to accurately predict and control the values of variables (properties) in entities in populations.

After developing this idea, we can spend the rest of the course and subsequent courses discussing standard statistical principles and methods in terms of it. This approach enables us to unify most discussion in statistics under the concepts of entities, properties, variables, and relationship between variables. Sections 5.3 through 5.9 further discuss this unification.

The preceding subsections propose six concepts for discussion at the beginning of an introductory statistics course for non-statistics-majors. The concepts are
After introducing the six concepts, the teacher spends the rest of the course covering statistical techniques for studying relationships between variables. The course is thus centered on the fundamental statistical concept of 'relationship between variables' as a means to accurate prediction and control. Depending on the level of the students, my experience suggests that the six concepts can be properly introduced in between one and eight class sessions. (As noted above, most university and college students can learn the high-level concepts in between one and three class sessions.) Study of the details of the sixth concept can last a lifetime. Teachers can ensure that students understand the concepts by developing them through bottom-up sequences of ideas, beginning with familiar concrete examples of each concept and working up at an appropriate pace for the students to the general concept. I call the approach to the introductory statistics course described above the "entity-property-relationship" (EPR) approach. Section 5 discusses evaluating the EPR approach, Section 6 discusses testing it, and Section 7 discusses implementing it. 5. EVALUATING THE EPR APPROACH This section presents material to help readers evaluate the entity-property-relationship approach to the introductory statistics course. 5.1 Main Differences Between the EPR Approach and Other Approaches The EPR approach differs from other approaches to the introductory course in the following ways:
Despite the above differences, the EPR approach is consistent with and thus compatible with most other approaches to the introductory statistics course -- the differences above are merely differences in ordering and emphasis of statistical topics. Section 5.14 illustrates the relationship between the EPR approach and several other popular approaches. 5.2 The Concepts of the Approach Are Easy to Understand Sections 4.1 and 4.2 imply that the concepts of 'entity' and 'property' pervade students' unconscious thought. Therefore, if we carefully bring these concepts into students' consciousness (through sufficient practical examples), students find the concepts easy to understand. Similarly, Sections 4.3 through 4.5 suggest that if we carefully develop the concepts of 'variable' and 'relationship between variables' for students with practical examples, these concepts are also easy for students to understand. The ease of understanding leads me to conjecture that the concepts of entities, properties, variables, and relationships can be taught at all levels of teaching statistics from late elementary school up, with only the teaching time and depth of coverage of the concepts varying at different levels. 5.3 The Approach Provides a Deep and Broad Foundation for Statistical Concepts Section 4.3 introduces the fundamental statistical concept of 'variable' in terms of the concepts of 'entity' and 'property'. Section 4.5 introduces the fundamental statistical concept of 'relationship between properties' (relationship between variables), which is clearly also built atop the concepts of 'entity' and 'property'. The concepts of 'entity', 'property', and 'relationship' can be used as a foundation for other statistical concepts. Here is a sequence of definitions that develop some basic statistical concepts from the three concepts:
The definitions cover many of the main statistical concepts. Each definition is built atop the concepts of 'entity', 'property', or 'relationship', or is built atop concepts that are themselves built atop the three concepts. Furthermore, the concepts of entities, properties, and relationships appear to be among the most fundamental concepts of human reality. Thus the EPR approach provides a deep and broad foundation for statistical concepts. 5.4 The Approach Unifies Statistical Methods Statistical methods can perform the following four groups of techniques to help empirical researchers study relationships between variables:
These techniques are of substantial help in answering important questions 5, 7, and 8 in Section 4.5. I discuss these techniques further in the paper for students (1997a, secs 8-13). (Logically, the four groups of techniques seem best listed in the above order. However, pedagogically, in an introductory statistics course it makes sense to discuss simple techniques for illustrating relationships before discussing techniques for detecting relationships.) The four groups of techniques raise the question: Which of the currently available statistical methods can actually perform these techniques? The following twenty-one statistical methods can perform one or more of the four groups of techniques:
Upon consideration, many statisticians will agree that the above list of twenty-one statistical methods contains almost all of the currently popular methods, including what most statisticians would view as the "main" methods. Many statisticians will also agree that the only techniques that the statistical methods in the list can perform are given in the four groups of techniques that appear in the first paragraph of this subsection. Appendix I provides support for these claims and explains why certain statistical methods are excluded from the list.

Because the list of twenty-one statistical methods contains almost all of the currently popular methods (including the main methods), and because each method in the list is fully explained (at a high level) by the four groups of statistical techniques that are emphasized in the EPR approach, the approach unifies statistical methods. That is, the EPR approach allows us to teach each new statistical method in terms of the same set of simple concepts: entities, properties, variables, and relationships between variables. Emphasizing the simple commonalities that exist among the methods makes the field of statistics substantially easier for students to understand.

5.5 The Approach Links Well with General Concepts of Science

The EPR approach has strong links with three important general concepts of science, as follows:

Scientific Method. The scientific method (also known as the hypothetico-deductive method of science) has apparently existed implicitly among skilled artisans and tradespeople since the dawn of civilization. It was brought into formal consideration by a long line of artisan-scientist-philosophers who have shown us that systematic observation and experimentation are keys to understanding any area of experience.
Fowler (1962) gives a concise overview of many of the people and events in the development of the formal scientific method, and Dunbar (1995) discusses the "pan-cultural" and "pan-species" nature of science. Dingle (1952) and Drake (1970) suggest that Galileo introduced fundamental advances to the understanding and practice of the method. Many philosophers and scientists have written about the method. It is reasonable to view the scientific method as consisting of four steps:
The scientific method is central to science because almost all modern scientific research (and most other empirical research) proceeds formally according to the method. The method is repeated over and over, as discussed by Box and Draper (1987, sec. 1.3). Interestingly, actual scientific research often proceeds quite differently from the first three steps above, with frequent reorderings of the steps, surprising serendipity, and many false starts. However, scientific research is usually formally viewed in terms of the scientific method because experience has shown that this point of view is generally the most efficient. Examination of instances of the use of the scientific method suggests that the implication in step 2 can usually be usefully viewed as a statement of a relationship between variables in some population of entities. This can be seen by applying the nine questions discussed in Section 4.5 to specific research projects that exemplify the method -- the questions almost always reasonably apply. Thus the key concept of the EPR approach of 'relationship between variables' plays a central (though often implicit) role in the scientific method. Appendix H discusses some possible counterexamples to the point in the preceding paragraph. I further discuss the scientific method in a Usenet post (2001, app. A). Scientific Explanation and Understanding. As suggested in Section 4.4, "explanation" and "understanding" play prominent roles in science. What are scientific explanation and understanding? To help understand this question I wrote a brief version of the accepted scientific explanation of a particular physical phenomenon -- the phenomenon of ocean tides. After writing the explanation I disassembled it to see what it consisted of (1997b). The disassembly suggests that the scientific explanation of ocean tides contains seven types of statements, as follows:
Because the seven types of statements are all quite general, and because experience suggests that many (all?) other scientific explanations can be given in terms of (at most) the seven types of statements, it appears that most (all?) scientific explanations consist of merely (at most) the seven types of statements. We can view scientific "understanding" as taking place in an individual person. A person has understanding of some state of affairs or phenomenon if they have learned to think and speak in terms of the "correct" explanation of it. The seven types of statements of a scientific explanation are all important, but the sixth type (about relationships between variables) is perhaps the most important. This is because statements of relationships directly enable accurate prediction and control. Thus the key concept of the EPR approach of 'relationship between variables' plays a central role in scientific explanation and understanding. Mathematical Equations. Mathematical equations are crucial on the theoretical side of many branches of science. But most (all?) mathematical equations in science (as opposed to equations in pure mathematics) are simply statements of known or hypothesized relationships between variables (relationships between properties of entities). Thus the key concept of the EPR approach of 'relationship between variables' again plays a central role. 5.6 The Approach Unifies Empirical Research As discussed in Section 4.5 and Appendix H, most empirical research projects can be usefully viewed as studying relationships between variables. Thus by focusing on the concept of 'relationship between variables' the EPR approach unifies most empirical research. 5.7 The Approach Links Well with General Concepts of Commerce A commercial product is an entity, as are instances of a product, as is a commodity, as is a financial instrument (e.g., a stock or a bond), as is a loan repayment or dividend, and as is an interaction with a customer. 
These and all other entities that are used in commerce are efficiently handled by the EPR approach. To achieve a general understanding of the logical constructs used in commerce it is helpful to study how commercial organizations store information. Almost all progressive commercial organizations use a computer "database" as their main repository for information. This is because databases have easy-to-use, versatile, reliable, and secure features that allow one to easily assemble information to generate reports, invoices, charts, and other graphical, statistical, and textual information as a broad and fundamental aid to operating an organization. A database consists of a set of one or more "tables". Each table holds information about entities of a particular type. For example, a manufacturing company might have one database table that holds information about its products, another that holds information about its customers, another that holds information about its invoices, and so on. The database (or databases) of a larger progressive organization may contain hundreds (or even thousands) of tables holding information about all the types of entities in which the organization has a serious interest. A database table is conceptually identical to a statistical data table, as described in Section 4.3 -- a rectangular array that contains one or more "rows" associated with the entities the table is tracking and one or more "columns" associated with properties of the entities. Each cell in the table contains the value of the property associated with the column for the entity associated with the row. The database of a progressive commercial organization will hold a substantial proportion of the organization's information because even "documents" (which are entities) can be stored in a database table to facilitate ready access. (A cell in a modern database table can hold an entire document.) 
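The correspondence between a database table and a statistical data table can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `products` table, its column names, and its values are all hypothetical, and Python's built-in `sqlite3` module stands in for whatever database system an organization actually uses.

```python
import sqlite3

# In-memory database with one hypothetical table. Each row is an entity
# (a product); each column is a property of that entity.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "product_id INTEGER PRIMARY KEY, name TEXT, "
    "unit_price REAL, units_sold INTEGER)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [(1, "widget", 2.50, 400), (2, "gadget", 10.00, 120), (3, "gizmo", 7.25, 210)],
)

# Read the table back as a rectangular data table: a list of rows,
# where each row maps property (column) names to values.
cursor = conn.execute(
    "SELECT product_id, name, unit_price, units_sold "
    "FROM products ORDER BY product_id"
)
columns = [description[0] for description in cursor.description]
data_table = [dict(zip(columns, row)) for row in cursor]
conn.close()


def cell(table, entity_index, property_name):
    """Value of the named property (column) for one entity (row)."""
    return table[entity_index][property_name]


print(columns)
print(cell(data_table, 1, "unit_price"))
```

The helper `cell` mirrors the description above: a cell holds the value of the property associated with the column for the entity associated with the row.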
As noted, database tables are the main repositories for information in commerce, and the rows in a database table are associated with entities and the columns are associated with properties of the entities. Thus the basic concepts of the EPR approach of 'entity' and 'property' play fundamental (implicit) roles throughout commerce. (Data mining is reasonably viewed as the study of relationships between variables reflected in the columns of database tables.)

5.8 The Approach Links Well with Language

Section 4.1 notes that nouns are used in language to denote entities and Section 4.2 notes that adjectives and adverbs are often used in language to denote the values of properties of entities. Since language is intimately tied to human thought, it is of interest to consider how the concepts of the EPR approach relate to other parts of speech. A verb usually expresses one of the following ideas:
When one views entities broadly, the acts, actions, occurrences, events, states, modes, values, properties, variables, and relationships in the list are themselves all entities. (Appendix A further discusses this idea.) Furthermore, most sentences that contain verbs also contain one or more nouns. (All verbs in coherent sentences have a subject, perhaps implicit, which is represented by a noun. Transitive verbs have an object, perhaps implicit, which is also represented by a noun.) The verbs describe various "things" about the entities denoted by the nouns including things about the properties of the entities. Thus when verbs are used in language, entities are invariably present and of central interest. (Verbs are also often linked to the concept of 'time'. We can view time [both duration and point in time] as a property of the entity that contains all other entities -- the entity we call "experience" or "reality". Alternatively, we can view time as a property of events.) The remaining parts of speech function as or support nouns, verbs, adjectives, and adverbs as follows:
Thus the concepts of the EPR approach link well with the various parts of speech, and hence the approach links well with language at a fundamental level.

5.9 The Concepts of the Approach Are Fundamental

Subsections 5.2 through 5.8 suggest that the concepts of entities, properties, variables, and relationships between variables
These points together with consideration of the fundamental statistical concepts discussed in Section 5.3 suggest that the concepts of the EPR approach are more fundamental than many (all?) of the other concepts that are traditionally discussed in statistics courses.

5.10 Easy-to-Understand Fundamental Concepts Should Be Taught First

What Should Come First? Concepts in a body of knowledge are generally easiest to understand and use if they are developed in a logical order beginning with the most fundamental. This is especially true if the fundamental concepts are themselves easy to understand. As suggested in Section 5.2, the concepts of 'entity' and 'property' and (to a lesser degree) 'variable' and 'relationship' are easy to understand. Furthermore, these simple concepts are (by virtue of their logical priority) substantially easier to understand than the various traditional statistical concepts that depend on them. In view of the ease of understanding, and in view of the logical priority, it is reasonable to carefully cover the concepts of 'entity', 'property', 'variable', and 'relationship' first, before introducing even the most rudimentary of the other traditional statistical topics. As suggested in Sections 5.3 through 5.9, this unifies and simplifies discussion of the traditional topics.

Must We Emphasize Entities and Properties? An associate editor (assumed male for expository convenience) noted that his students have little difficulty distinguishing the concepts of 'entity' and 'property'. (He refers to the concepts respectively as "case" and "variable" -- see my 1998a paper, App. E.1, and Appendices D and E below.) Thus he is noting that these concepts are easy for students to understand. In view of this he wonders whether it is necessary to emphasize entities and properties at the beginning of the introductory statistics course. Perhaps the concepts are simple enough that we need not discuss them at all.
To address this issue, let us consider the concept of 'variable'. This concept is arguably the most ubiquitous concept in statistics. Clearly, students must understand the concept of 'variable' before they can understand the concept of 'relationship between variables'. How do students usually learn this concept? Students usually first learn the concept of 'variable' in their first algebra class in grade 6 or later. The introductory algebra teacher usually does not teach the concept directly in terms of entities and properties. Instead, the teacher teaches the concept in terms of examples of entities and properties. For example, in the first discussion of the concept of 'variable' the algebra teacher may say that the variable x will represent the height of some person. The teacher will typically say (in a carefully worded discussion) that we do not know the value of x at the present time, but we do know the value of x plus a known constant. The teacher will then show the students how to use algebra to determine what the value of x must be. During the rest of the introductory algebra course, and during subsequent mathematics courses, students will encounter many other examples of variables that represent unknown values (of properties of entities), and students will learn various ways to manipulate and solve for these values. But they will usually not consciously understand the unifying concept that a variable is reasonably viewed as a formal representation of a property of entities. Consider some differences between the use of the concept of 'variable' in mathematics and the use in statistics:
Perhaps due to the above differences between the mathematical and statistical concepts of 'variable', and perhaps due to the mathematical (algebraic) genesis of the concept of 'variable' in students' minds, many non-statistics-majors have difficulty understanding the fundamental statistical concept of 'variable'. This can be seen by asking students to define the concept -- many students have difficulty giving a reasonable definition. Some students may say that a statistical variable is a "measurement of something", which (although vague) is certainly correct. But they are often unable to say, without prompting, what the "something" is -- both in the specific sense of voluntarily identifying the relevant entities in a given situation and in the general sense of linking the concept of 'variable' to the more fundamental concept of 'property of an entity'. The preceding five paragraphs suggest that many students entering the introductory statistics course lack a clear understanding of the statistical concept of 'variable'. But examination of currently popular textbooks for the introductory course suggests that most approaches assume entering students have a clear understanding of this concept. (Some books do introduce entities, properties, and variables, but spend only a page or two on these topics at the beginning and then never return to focused discussion of them. This approach forgoes the substantial unifying power of the concepts. For example, using the same concepts but different terminology Moore briefly discusses "individuals", "characteristics", and "variables" at the beginning of two of his introductory texts [1997b, 2000]. I discuss Moore's use of the concepts in a Usenet post [1997c].) Sections 4.1 and 4.2 above suggest that all students unconsciously learn the concepts of 'entity' and 'property' (the concepts not the words) when they are infants or young children. 
Thus if we carefully bring these concepts into students' consciousness, and if we then carefully build the important statistical concept of 'variable' atop these foundational concepts, we ensure that students have a proper understanding of the concept of 'variable'. This understanding helps students to understand the central statistical concept of 'relationship between variables'. Therefore, it is useful to spend time at the beginning of an introductory statistics course discussing the fundamental concepts of 'entity' and 'property'. 5.11 The Approach Gives Students a Lasting Appreciation of Statistics Section 3.3 recommends that the first goal of an introductory statistics course be to give students a lasting appreciation of the vital role of the field of statistics in empirical research. Does the EPR approach satisfy this goal? Consider:
These points suggest that the EPR approach gives students a lasting appreciation of the field of statistics and its vital role in empirical research. 5.12 The Approach Links Well with the Concept of 'Data Analysis' Many approaches to introductory statistics emphasize the concepts of 'data' and 'data analysis'. One can see this by noting the frequent occurrence of the word "data" in the preface and early chapters of many textbooks and other discussions. In contrast, the EPR approach does not emphasize the concepts of 'data' or 'data analysis' and instead emphasizes the concept of 'relationship between variables as a means to accurate prediction and control'. The EPR approach links well with the concepts of 'data' and 'data analysis'. This is because the exact operation of what is called "data analysis" is an essential step of the EPR approach. Data analysis is the step in which we actually study the relevant data to look for information about relationships between the variables. Tukey initiated emphasis on the concept of 'data analysis' in statistics education with his seminal book Exploratory Data Analysis (1977). Tukey and later reformers emphasize this concept to emphasize the important practical side of the field of statistics, which lies in helping researchers to analyze research data. Emphasis on data analysis has helped to eliminate the detrimental emphasis on mathematical statistics in teaching statistics to non-statistics-majors. As further discussed in Section 7.9, de-emphasizing the mathematical side of statistics makes the material easier for non-statistics-majors to understand. Because emphasis on mathematical statistics is now greatly diminished in introductory statistics courses for non-statistics-majors, and because these courses now generally focus on analyzing data, it is useful to ask whether central emphasis on the concept of 'data analysis' is still necessary or whether central emphasis on another concept is more effective. 
This question is important because the concept of 'data analysis', although fully correct, is functionally vague -- to a beginner, doing "data analysis" sounds like something medieval monks might do with great rigor in a remote monastery in the mountains, but with no known practical value. Instead of emphasizing data and data analysis, the EPR approach sharpens the focus by emphasizing the function of data analysis which, as discussed above, can be usefully viewed as being (mainly) to give us the ability to predict and control the values of variables through study of relationships between variables. This emphasis on function or usefulness (when coupled with sufficient practical examples) substantially increases beginning students' appreciation of the role of statistics. A teacher can show students the link between relationships between variables on the one hand and data analysis on the other by noting that relationships between variables are generally studied in terms of relationships between (the values of the variables in) the columns of data in a data table.

5.13 The Concepts Are Old But the Approach Is New

The concepts of entities, properties, and relationships are not new. Indeed, all statisticians and empirical researchers use these concepts implicitly throughout their thinking and discussion. However, as discussed in Section 5.10, the fundamental concepts of 'entity', 'property', 'variable', and 'relationship' are almost never carefully discussed in a unified approach in introductory statistics courses. I believe that the unfortunate omission of unified discussion of these concepts is the main reason why the field of statistics is so widely misunderstood. (Some leaders in statistics education have already independently adopted an important aspect of the EPR approach in that they emphasize relationships between variables in their introductory courses.
For example, using an idea developed by Gudmund Iversen, George Cobb teaches two introductory courses, both of which start with relationships -- one devoted to experimental design and applied analysis of variance and the other devoted to applied regression [G. Cobb, personal communication, August 21, 1996]. Similarly, Robin Lock teaches an introductory course devoted to time series analysis -- i.e., methods for studying relationships between variables when an important predictor variable is "time" [Cobb 1993, sec. 3.1].) 5.14 The Approach Links Well with Other Approaches to the Introductory Course Many helpful new approaches to teaching the introductory statistics course have recently been proposed. As suggested by Moore (1997a), these approaches fall neatly into two distinct groups: conceptual approaches and pedagogical approaches. Each of the conceptual approaches emphasizes a particular set of statistical concepts and de-emphasizes other statistical concepts. The EPR approach is a conceptual approach. Other (sometimes overlapping) conceptual approaches include
In contrast to the conceptual approaches, the pedagogical approaches to teaching the introductory statistics course emphasize new ways of teaching any set of statistical concepts. The pedagogical approaches include
I discuss some criteria for evaluating pedagogical approaches in a paper (1998a, sec. 7). Most introductory statistics teachers now use some combination of the above conceptual and pedagogical approaches. The main disagreement among teachers is only about the relative emphasis that each approach deserves. (It is possible to classify the use of multimedia, film, video, computers, and calculators as "technological" approaches to the introductory course, rather than as "pedagogical" approaches. However, it seems more reasonable to view technology as a means to better pedagogy rather than as an end in itself.) A simple relationship exists between the EPR conceptual approach to the introductory statistics course and the other approaches -- the EPR approach can be effectively used in conjunction with any (or any group) of them. Moore (1997a) reviews several of the new approaches to statistics education. Cox (1998) comments on some general aspects of statistics education. Gordon and Gordon (1992), Hoaglin and Moore (1992), and T. Moore (2000) give papers by leading statistics educators about teaching statistics. Hawkins, Jolliffe, and Glickman (1992) give a general discussion of teaching statistical concepts. 5.15 Responses to Criticisms of the EPR Approach The EPR approach has been criticized as being too "abstract" for students to understand. I discuss this important criticism in a Usenet post (forthcoming). I discuss some other insightful criticisms of the approach in a series of Usenet posts (1996-2001). It is interesting that statisticians, who are the keepers of the keys to empirical (scientific) research, perform almost no serious empirical research in statistics education. Instead, much of what is reported as "testing" of approaches in statistics education is anecdotal. That is, the author or proponents of a new approach use the approach one or more times in courses and then report that the approach was successful. 
Unfortunately, no matter how "successful" a course might appear to be, anecdotal reports do not reflect valid empirical research about the approach used in the course. This is because reasonable alternative explanations invariably exist that could explain why the course was as successful as it was. Some possible reasons why a course might be successful are
These alternative explanations (and other situation-specific alternative explanations) imply that anecdotal testing of approaches to teaching statistics is invariably equivocal. We can eliminate the equivocation of anecdotal testing by testing approaches with randomized experiments. Such experiments (when properly performed) provide clear comparative evidence of the effectiveness of different approaches to teaching statistics. Some readers may feel that experimentation in statistics education is not possible because too many confounding variables are present. For example, "instructor teaching ability" must be properly accounted for before unequivocal conclusions can be drawn. However, confounding variables can usually be accounted for in experimental research, albeit at the expense of increased cost and complexity. Furthermore, accounting for confounding variables in experimental research in statistics education would appear to be no more complicated than accounting for them in multicenter clinical trials, where accounting for confounding variables is standard practice. Some readers may feel it would be difficult to ensure protocol adherence by the multiple statistics teachers that are needed in a proper experimental trial of different approaches to statistics education. Clearly, this is a challenging problem, although perhaps no more difficult than ensuring protocol adherence in multicenter clinical trials, where various monitoring systems are used to ensure adherence. Some readers may feel that experimentation in statistics education may be ineffective because no reasonable response variable can be found that is sensitive enough to discriminate between different treatments. However, this is an empirical question that awaits serious attempts to address it. 
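One way to account for a confounding variable such as instructor teaching ability is to block on it: each instructor teaches one section under each approach, and chance decides which section gets which approach. The following is a minimal sketch of that randomization, not a procedure taken from this paper. The function name, the instructor and section labels, and the approach labels "EPR" and "standard" are all hypothetical.

```python
import random


def assign_within_blocks(sections_by_instructor, approaches, seed=0):
    """Randomly assign teaching approaches to sections, one block per instructor.

    Every instructor must teach exactly one section per approach, so that
    instructor ability is balanced across the approaches being compared.
    """
    rng = random.Random(seed)  # fixed seed so the assignment can be audited
    assignment = {}
    for instructor, sections in sections_by_instructor.items():
        if len(sections) != len(approaches):
            raise ValueError("each instructor needs one section per approach")
        shuffled = list(approaches)
        rng.shuffle(shuffled)  # chance, not the instructor, picks the pairing
        for section, approach in zip(sections, shuffled):
            assignment[section] = (instructor, approach)
    return assignment


# Hypothetical trial: three instructors, two sections each.
sections_by_instructor = {
    "instructor_1": ["sec_1a", "sec_1b"],
    "instructor_2": ["sec_2a", "sec_2b"],
    "instructor_3": ["sec_3a", "sec_3b"],
}
assignment = assign_within_blocks(sections_by_instructor, ["EPR", "standard"])
for section, (instructor, approach) in sorted(assignment.items()):
    print(section, instructor, approach)
```

Because each instructor appears once under every approach, a difference between approaches in the response variable cannot be attributed to differences between instructors. A real trial would need many more instructors and a plan for monitoring protocol adherence.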
I further discuss methods and problems of experimentally testing approaches to the introductory statistics course, including a recommendation that one use "attitude toward statistics" as a response variable, in a paper (1998a, app. A and B).

6.2 Testing of the EPR Approach

Despite my preceding comments, I regret that I cannot report proper experimental testing of the EPR approach -- such testing is beyond my resources. Although I cannot report proper testing of the EPR approach, I can report some enthusiastic remarks from three teachers who used a draft textbook for the approach (Macnaughton 1986) in their courses. They commented that

    ... students found the book enjoyable and easy to understand. Using a unique approach, Macnaughton has provided a comprehensive first-rate introduction to the material. I would highly recommend the book for use in introductory statistics courses ....
    - Professor Alexander Even, The Ontario Institute for Studies in Education

    ... students obtained a good understanding of the basic principles of statistical analysis. ... [the approach] substantially simplifies the material without sacrificing important concepts.
    - Professor John Flowers, School of Physical and Health Education, University of Toronto

    The absence of overt mathematics enables the underlying principles of scientific research ... to be more directly apprehended by persons who have ... weak grounding in mathematics. ... ... Students' comments have been uniformly favorable .... ... the book is to be commended to the instructor.
    - Professor Donald F. Burrill, The Ontario Institute for Studies in Education

These remarks are encouraging, but are far from being definitive about the effectiveness of the EPR approach. I hope that publication of this paper will facilitate proper testing of the approach.

7. IMPLEMENTING THE EPR APPROACH

Until textbooks based on the EPR approach are available, a teacher wishing to use the approach in an introductory course can use the paper for students (1997a) to reinforce class discussion of the six introductory concepts. The following twelve subsections discuss implementation considerations.

7.1 Motivating Students on the First Day of Class

The first day of class is important because if the lesson is properly designed, it will establish a positive attitude about the course in students' minds. What should be the first statistical idea we introduce to students on the first day of class? I recommend that the first idea be the promise that the course will teach students how to make accurate predictions. For example, we can promise students they will learn how to accurately (but generally not perfectly) predict
(Along with prediction methods, I recommend that the introductory course devote substantial attention to the methods of exercising accurate control through formal experimentation. However, for simplicity, I recommend that discussion of control and experimentation be omitted at the very beginning -- the promise of accurate predictions seems quite enough to engage students. Section 7.3 further discusses experimentation.)

If we promise students on the first day of class that they will learn how to make accurate predictions, we arouse their curiosity and set the stage for development of the six concepts discussed in Section 4. The promise also sets a practical tone for the course, which is more likely to impress most students than if we begin with mathematical discussion.

If we promise students on the first day of class that they will learn how to make accurate predictions, we must later fulfill our promise. In particular, the thoughtful student will be interested in whether we can demonstrate practical methods for making accurate predictions. Fortunately, statistical methods (broadly viewed) are universally accepted among statistically experienced researchers as the most practical and most accurate methods available for making predictions (and exercising control).

7.2 What Topics Should Follow the Six Concepts?

Section 4.6 recommends that after introducing the six concepts of the EPR approach the teacher spend the rest of the course expanding the sixth concept by covering standard statistical topics. The present and the next subsections discuss ways of covering the standard topics. As a general principle for choosing topics, I recommend that teachers cover topics that are used more frequently in empirical research first.

One easy way to implement the EPR approach is to follow discussion of the six concepts with material selected from an already existing introductory statistics course. The teacher can use the six concepts to introduce and unify the material.
This enables the teacher to use the EPR approach in an already existing course with only minimal modification to the course.

A more unified way of implementing the EPR approach is to break the course into five phases: an introductory phase, a practical-experience phase, a generalization phase, a specific-methods phase (optional), and a mathematics phase (also optional).

Introductory Phase. In this first phase the teacher introduces the six concepts discussed in Section 4.

Practical-Experience Phase. To solidify students' understanding of the six concepts, the teacher follows the introductory phase with a practical-experience phase in which students obtain hands-on experience with specific statistical methods. I recommend that the teacher begin the practical-experience phase with discussion of a commonly occurring simple type of research project -- the observational research project that studies the relationship between two continuous variables (2000 Hayden response, red tab 6). Possibly using the material in the paper for students (1997a) as an introduction, the teacher can discuss how to design an observational research project to study the relationship between two continuous variables, how to use a scatterplot to illustrate such a relationship, how to use statistical techniques to analyze data from such a research project to determine if a relationship is present between the variables, and how to use the model equation derived from such a relationship to make predictions. To reinforce the discussion, I recommend that students be given computer assignments to detect and study (practical) relationships between pairs of continuous variables in various sets of data. If time permits, the bivariate case can be extended to the multiple regression case.

Next, the teacher can discuss "experiments" and the associated statistical methods as a powerful tool for studying causal relationships between variables.
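The bivariate observational case described earlier in this subsection (derive a model equation, check whether a relationship is present, make a prediction) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not material from the paper. The variables (hours studied, exam score), the generating equation and its parameters, and the function names are hypothetical. A permutation test stands in for whatever test a given statistical package would report.

```python
import random


def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for the model y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Estimate how often shuffled data give a slope as extreme as the observed one.

    Shuffling y breaks any relationship with x, so the shuffled slopes show
    what "no relationship" looks like for these particular values.
    """
    rng = random.Random(seed)
    observed = abs(fit_line(x, y)[1])
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)
        if abs(fit_line(x, y_shuffled)[1]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical generated data: predictor = hours studied, response = exam score.
rng = random.Random(42)
hours = [rng.uniform(0, 20) for _ in range(40)]
score = [50 + 2.0 * h + rng.gauss(0, 8) for h in hours]

b0, b1 = fit_line(hours, score)
p = permutation_p_value(hours, score)
predicted = b0 + b1 * 10  # predicted score for a student who studies 10 hours

print(f"model equation: score = {b0:.1f} + {b1:.2f} * hours")
print(f"permutation p-value for the slope: {p:.4f}")
print(f"predicted score at 10 hours: {predicted:.1f}")
```

A small p-value suggests that a relationship is present between the two variables, and the fitted equation then gives predictions that are accurate but generally not perfect. In a classroom assignment, students would plot the data in a scatterplot before fitting anything.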
This topic is important because the study of causal relationships through experiments is the soul of empirical research. Discussion can be in terms of designing simple experiments to study causal relationships, illustrating causal relationships, detecting causal relationships in experimental data, and predicting and controlling on the basis of causal relationships. To reinforce the discussion, I recommend that students be given computer assignments to detect and study simple (practical) causal relationships between variables. If time permits, the fully randomized one-way case can be extended to the multi-way case, repeated measurements, blocking, analysis of covariance, and so on.

The length of the practical-experience phase should be adjusted to allow enough time at the end of the course to properly cover the material in the important next phase.

Generalization Phase. The teacher follows the practical-experience phase with a generalization phase that introduces important topics that span almost the entire field of statistics. The recommended topics are
A reasonable ending-point for an introductory statistics course is at the end of the generalization phase.

Specific-Methods Phase. For students who are likely to perform empirical or statistical research in their careers, I recommend that the preceding three phases be extended (in courses following the introductory course) with detailed discussion of selected statistical methods, such as those listed in Section 5.4 and Appendix I.2. For each method in Section 5.4, I recommend that the following topics be covered (when applicable):
Except for statistics or mathematics majors, I recommend that the use of mathematics be avoided in the specific-methods phase. Instead, I recommend that attention be focused on designing research projects and on correctly interpreting the relevant output from statistical software.

Mathematics Phase. For students pursuing careers in statistics or mathematics, I recommend that discussion in the earlier phases be interwoven and extended with discussion of the underlying mathematics. I illustrate an approach to such discussion in a paper about computing sums of squares in unbalanced analysis of variance (1998d).

I recommend that statistics majors be made (at a high level) as aware of the role of statistics in empirical research as students in other disciplines. This facilitates interaction between the theoretical and practical sides of statistics.

7.4 "Basis for Action" Versus "Decision Procedure"

In choosing topics for the introductory statistics course it is helpful to distinguish between a basis for action and a decision procedure. A basis for action is a statement that a relationship exists between a response variable that we wish to predict or control and one or more predictor variables. If the variables are appropriately chosen, such a relationship will often suggest that some action be taken. For example, if social researchers find that a relationship exists between secondary school practices with difficult students and the amount of subsequent delinquent behavior by these students, this suggests a basis for action in the choice of secondary school practices.

On the other hand, a decision procedure is a procedure that assists us to make some form of decision. Some possible forms of decision are
The second through fourth types of decision play important roles in statistics, but I focus here on the first. Procedures for making action decisions are studied in a branch of statistics called "decision theory", which was founded by Wald (1950). Such procedures must take account of many diverse inputs. Some typical inputs are one or more bases for action (i.e., relationships between variables), various social or commercial values (or goals, perhaps stated as objective or utility functions), alternative explanations, error sizes, costs, and side-effects. A procedure for making an action decision takes appropriate account of all these inputs and provides as output an "optimal" recommendation whether (and possibly how) to perform various actions. In view of the multiple diverse inputs, useful procedures for making action decisions are much more complex than relationships between variables. Perhaps due in part to this complexity, most action decisions are still made (in all areas) on the basis of informal and intuitive criteria rather than by formal decision procedures. Swets, Dawes, and Monahan (2000) and Edwards and Fasolo (2001) discuss some current work in decision procedures. (A formal procedure for making action decisions can be efficiently characterized as a set of relationships between variables in which the response variables are indicators of whether [or how] actions should be taken and the predictor variables reflect the various inputs to the decisions. Decision procedures for assisting with the second through fourth types of decisions above can be similarly characterized.) The question arises whether the introductory statistics course should discuss procedures for making action decisions. Because such procedures are complicated and infrequently used, I recommend that the introductory course omit this topic. 
And although the formal procedures for making optimal action decisions are an interesting and important area of study, it seems more in keeping with the standard use of statistics in empirical research to focus the introductory course on the study of relationships between variables. A relationship can suggest a basis for action in a substantive area although (except indirectly) relationships do not provide the final decision whether to act. Barnett discusses the distinction between statistical inference (which can often be reasonably viewed as inference about relationships between variables) and decision procedures (1982). Bordley (2001) discusses teaching decision theory in applied statistics courses. As noted at several points above, and following Jowett and Davies (1960), Scott (1976), Hunter (1977, pp. 16-17), Cobb (1987, sec. 4.2), and Willett and Singer (1992, p. 91), I recommend that any implementation of the EPR approach discuss each main concept in terms of numerous practical examples. Practical examples can appear in lectures, textbooks, multimedia courseware, class discussions, exercises, activities, and projects. A Criterion for Practicality of Research Projects. An important type of example in an introductory statistics course is an example of an empirical research project that studies a relationship between variables. To determine whether such an example is "practical", I recommend (after Deming [Wallis 1980, p. 321] and Scheaffer [1992, p. 69]) that the teacher consider the following question: Does an understanding of the relationship between variables in the example have an obvious potential social, scientific, or commercial benefit? That is, does knowledge of the relationship suggest some clear basis for action? If we choose examples that suggest a clear basis for action, and if we ensure that students see the practical benefits provided in the examples, we help students to appreciate the practical value of statistics. Finding Practical Examples. 
Examples of empirical research projects that suggest a clear basis for action are easy to find in most fields of empirical research. For example, consider research in medicine to study the relationship between AIDS symptoms (as reflected in a response variable) and a new treatment for AIDS (as reflected in a predictor variable, typically a measure of the dose of the new treatment). This research is very practical according to the criterion. That is, if AIDS research finds a new relationship between relevant variables, it suggests a clear basis for action in the treatment of AIDS. Similarly, research to study relationships between variables that help to make computer hardware or software more efficient or less expensive is also (in a commercial sense) very practical because if this research finds a new relevant relationship, it suggests a clear basis for action in designing computer hardware or software. Examples that fail to satisfy the practicality criterion seriously detract from the field of statistics because they associate the field with problems that appear to be frivolous (or at best inconsequential). For example, a research project that studies the relationship between people's forearm lengths and their foot lengths is a "frivolous" research project because students can see no obvious practical use of knowledge of this relationship. Study of frivolous research projects trivializes the field of statistics. (Interestingly, if one looks hard enough, one can find practical uses of most relationships between variables. For example, the relationship between forearm length and foot length is occasionally used in orthopedics, paleontology, and physical anthropology. If a particular group of students is likely to be impressed by an obscure example, this is a reasonable example for them. But if a complicated explanation is needed before students can see the practicality of an example, most students are unimpressed.) 
If the students in a particular introductory statistics course are all specializing in the same discipline, and if that discipline performs empirical research, we can almost certainly make the greatest impression on these students by discussing examples from among the milestone empirical research projects in the discipline. We can also impress students if we discuss practical examples of research projects that use response variables that students themselves are directly interested in predicting and controlling, such as variables reflecting student grades, student health, student skills, student happiness, student expenses, and student income.

Practical Research in Pure Science. How can we judge whether an empirical research project in pure science is practical? Here it seems less reasonable to view the word "practical" as meaning that the research must suggest a basis for direct physical action because pure science is not done with direct physical applications in mind (although such applications often arise). Instead, it seems more reasonable to say that empirical research in pure science is "practical" if and only if it suggests a basis for intellectual action -- if it has the potential to advance a scientific theory or to otherwise usefully advance scientific knowledge.

Practical Examples in Statistics Textbooks. Surprisingly, many examples of empirical research projects in some statistics textbooks are not practical. And when one studies such examples and asks "Would an enlightened empirical researcher ever actually do the research project discussed in the example?" the answer is (for various reasons) often a clear "No". On the other hand, some introductory textbook writers provide an abundance of excellent practical examples (e.g., Moore [2000], Freedman, Pisani, and Purves [1998]).

(The frequent use of impractical examples in some statistics textbooks is one reason why some writers insist that teachers and textbook writers use real data in examples.
Real data guarantee that at least one empirical researcher has judged the research to be in some sense practical. Unfortunately, insisting on real data does not ensure practicality according to the criterion given in the indented paragraph above. Section 7.8 contrasts real data with easier-to-obtain realistic data.)

Practical Student Projects. Many student projects in some statistics courses are impractical. However, such projects need not be impractical, as illustrated by some fascinating examples of practical projects discussed by Wardrop (2000, examples 1 - 24).

I further discuss the use of practical examples in the introductory statistics course in a paper (1998a, sec. 6) and in a Usenet post (forthcoming).

7.6 Generalization and Instantiation

Once students have studied a concept through a sufficient number of practical examples, I recommend that the teacher cement the appropriate generalizations about the concept in students' minds. This helps students to use the concept in new situations. For example, once students understand the concept of 'relationship between variables', the teacher can make the generalization that most empirical research projects can be usefully viewed as studying relationships between variables.

After stating a generalization, I recommend that the teacher assign exercises in which students identify details of the generalization in specific new instances. In particular, after discussing the idea that most empirical research projects can be usefully viewed as studying relationships between variables, I recommend that the teacher assign exercises in which students answer the nine questions given in Section 4.5 for various empirical research projects, including research projects of the students' own choosing. Answering these questions shows students that the questions almost always usefully apply. Appendix H supports the point that the questions almost always usefully apply.
Appendix I.2 discusses some infrequently occurring types of empirical research projects that lack a response variable. How many explanations, examples, exercises, or activities should a teacher provide or assign to ensure that students understand a particular generalization? This depends, of course, on the generalization and on the nature of the students and is often difficult to determine at the front line of teaching -- especially if a teacher is using a new approach. To reduce this difficulty, I recommend that teachers use feedback systems to assess whether students understand each main concept and generalization. Some effective feedback systems for assessing students' understanding are
Garfield (2000) discusses approaches to assessing students as an aid to improving their learning and understanding. Gal and Garfield (1997b, pt. 2) give four interesting essays by statistics educators about assessing students' understanding of statistical ideas.

7.8 Realistic Data Versus Real Data

Some statistics educators recommend that the introductory statistics course rely heavily on real data and not merely realistic data (Cobb 1987, 1992; Moore and Roberts 1989; Singer and Willett 1990; Willett and Singer 1992; Witmer 1997; Moore 1997, 2000b; American Statistical Association 2000; Ballman 2000; Hayden 2000). I give a detailed argument in a Usenet post why I believe real data are unnecessary and why realistic data should be broadly allowed in introductory courses (forthcoming). The main ideas of the argument are
In view of these points, I recommend the following criteria for data in examples (including exercises) in an introductory statistics course:
Generating Realistic Data. Perhaps the easiest way to obtain realistic data for a research project is to generate the data with statistical software using a model equation. That is, one uses random number generators or fixed values (as necessary) to generate values of predictor variables and one uses a properly parameterized model equation (with a random number generator for the error term) to generate values of the response variable from the values of the predictor variables. For realism, a teacher may wish to hand-adjust the generated data, possibly adding an outlier or two or including some missing values. Most general statistical software can be easily programmed to generate realistic research data using this method. Realistic Data in Assignments. Permitting realistic data allows teachers to assign extended exercises in which students are asked to pose a research hypothesis of their choosing and then provide a complete written proposal for an empirical research project that is capable of efficiently confirming the hypothesis (if the hypothesis is correct). To help students see value, I recommend that teachers stipulate that students' research hypotheses must be practical in the sense described in Section 7.5. I also recommend that the teacher describe numerous examples of earlier work by other students as an aid to students in choosing their own research hypotheses, as illustrated by Wardrop (2000). It is useful to have students present interim versions of their research proposals to the class, where the teacher and class may suggest improvements, as discussed by Chance (1997). After students have finished planning a research project, I recommend (following Hunter 1977) that the teacher provide them with appropriate realistic made-up data that the students might have obtained if they had actually performed their project. The students can then analyze these data and report the results. Providing students with realistic made-up data enables them |