Subject: Re: Eight Features of an Ideal Intro Stat Course (Response to comments by Gary Smith)
To: EdStat-L and sci.stat.edu
From: Donald B. Macnaughton <donmac@matstat.com>
Date: Monday November 23, 1998

Many introductory statistics courses cover the topic of univariate distributions early in the course. In a recent paper (1998) I recommend that teachers *omit* this topic because (I argue) univariate distributions are boring and of little obvious use to beginning students. In response, Gary Smith of Pomona College sent me three examples of univariate distributions that may be of interest in an introductory course. Here (with Gary's permission) are Gary's remarks and my reply.

Gary writes

> ( snip )
> As an economist, I'm very receptive to your emphasis on
> relationships among variables. That's pretty much what we do.

As I discuss further below, I believe that most empirical research done by economists and by all other empirical researchers is reasonably viewed as studying relationships between variables.

(Gary speaks of relationships "among" variables and I speak of relationships "between" variables. I discuss why I recommend the preposition "between" for general use in appendix A.)

Gary presents his three examples of potentially interesting univariate distributions as follows:

> On the other hand, what about, say, (a) predicting the fraction
> of the vote that Candidate A will receive in an upcoming
> election; (b) estimating the average body temperature of healthy
> humans; or (c) estimating the speed of light?

A. PREDICTING THE FRACTION OF THE VOTE CANDIDATE A WILL RECEIVE

The standard way of predicting the fraction of the vote a candidate will receive in an election is to ask a random sample of voters a question something like

   If the election were held today, who would you vote for: Candidate A, Candidate B, [etc.]?

The results of asking this question will be reflected in a single (nominal-level) variable, which we might call "Preferred Candidate". Each person in the sample will contribute one value of this variable.
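As a minimal sketch of what the univariate distribution of "Preferred Candidate" looks like in practice, the following tabulates the fraction of a sample naming each candidate. The function name `sample_fractions` and the ten responses are hypothetical and chosen only for illustration; they are not survey data.

```python
from collections import Counter


def sample_fractions(responses):
    """Return the fraction of respondents naming each candidate."""
    counts = Counter(responses)
    total = len(responses)
    return {candidate: count / total for candidate, count in counts.items()}


# Hypothetical responses from a sample of ten voters (not real data).
responses = ["A", "C", "B", "C", "A", "B", "C", "B", "C", "A"]
fractions = sample_fractions(responses)

for candidate in sorted(fractions):
    print(f"Candidate {candidate}: {fractions[candidate]:.0%} of the sample")
```

Each respondent contributes exactly one value, so the fractions sum to one and no second variable enters the calculation.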
If the sample is a simple random sample, we predict that Candidate A will receive the same fraction of the vote as the fraction of people in the sample who said they will vote for Candidate A. Thus the prediction is made on the basis of a univariate distribution, NOT on the basis of a relationship between variables. This may therefore be a situation in which a univariate distribution is important.

Clearly, in one sense the predictions made from the univariate distribution of "Preferred Candidate" are important because many people are interested in election predictions. (Otherwise the news media would not publish them.) But in another more practical sense the predictions are not very important. We can see this by noting that they provide relatively little *rational basis for action* on the part of any person or group. What person or group can infer a rational basis for action from predictions of the univariate fraction of the vote the candidates will receive in an election?

(I am not saying the predictions provide NO practical basis for action. For example, one obvious practical use of the predictions occurs if the survey predicts that Candidate A will do very poorly in the election. In this case [and barring special circumstances], Candidate A should drop out of the election [to minimize his or her expenses].)

In contrast to the simple survey of voters discussed above, we might perform a more sophisticated survey -- one in which we obtain the age of each respondent as well as the name of the candidate the respondent will vote for.
Such a survey will give us a table like the following:

       Predicted Percentage of the Vote
    To Be Received by Different Candidates
       Broken Down by Voter Age Groups
    ----------------------------------
    Voter          Candidate
    Age Group    -------------   TOTAL
    (years)       A     B     C
    ----------------------------------
    18 - 29       4     8    19     31
    30 - 39       6    11     7     24
    40 - 49       5     8     6     19
    50 and up    12     6     8     26
                ---   ---   ---   ----
    TOTAL        27    33    40    100
    ----------------------------------

The first row in the body of the table predicts that 31 percent of the voters will lie between the ages of 18 and 29, but relatively few people in this group will vote for Candidate A -- most of the people in the group will vote for Candidate C. The results of *this* survey provide a more useful basis for action. For example, Candidate A can see from the table that he does not have much support from younger voters. He may therefore decide to revise his campaign strategy to increase his popularity with this group.

But if Candidate A uses the information in the table, he is using information about a relationship between two variables, namely the relationship between "Voter Age Group" and "Preferred Candidate".

Thus, although the univariate distribution of the variable "Preferred Candidate" (as shown in the bottom row of the table) is of some interest, we can usually get much greater value if we study the same "univariate distribution", but we study it along with some *predictor* variables. That is, we can get greater value if we study the variable "Preferred Candidate" as the response variable in a relationship between variables.

Since we can generally get more useful election predictions from studying relationships between variables than from studying univariate distributions, I suggest that this example does not provide a strong reason for discussing univariate distributions near the beginning of the introductory statistics course.

B. ESTIMATING THE AVERAGE BODY TEMPERATURE OF HEALTHY HUMANS

Estimating the average body temperature of healthy humans is a problem of determining a "norm". When empirical researchers determine norms, they do directly study univariate distributions.

However, even here, other variables and relationships between variables play a key role because researchers determining norms usually hold other important variables at specific constant values. Otherwise the norms may be muddied by variation in the other variables causing (through a relationship between the variables) extra variation in the values of the variable being "normed".

For example, if medical researchers wish to estimate the average body temperature of healthy humans, they will usually ensure that each person whose temperature is measured to determine the norm has the same (usually sedentary) level of physical activity in the period before their temperature is measured. This is because the researchers know that a relationship exists in people between "level of physical activity" and "body temperature" -- as physical activity increases, so (slightly) does body temperature.

Since relationships between variables often play a key role in determining norms, it is reasonable to defer discussing norms in the introductory statistics course until students have a good sense of the concept of a relationship between variables.

C. ESTIMATING THE SPEED OF LIGHT

Estimating the speed of light is a problem of estimating the value of a *physical constant*. The hard sciences have defined many (perhaps one or two hundred) general physical constants such as the speed of light, Planck's constant, and the proton rest mass. In addition, the values of the various constant properties of the specific entities studied by the hard sciences are also physical constants (e.g., the thermal conductivity of titanium).
An important use of physical constants is to provide the values of some parameters in some models of relationships between variables. For example, the speed of light is represented by the symbol "c" and is the main parameter in Einstein's equation

   E = m c^2.

(This equation is generally viewed as a statement of a relationship between two variables -- the contained atomic energy [E] and the mass [m] of a piece of matter.)

Physicists and chemists estimate the values of some physical constants by simply solving for them in the model equations in which they appear and then by substituting appropriate empirically determined values for the variables into the equation. For example, the ideal gas law is a relationship between variables that is usually stated as

   pV = nRT.

This equation is a statement of the relationship between pressure (p), volume (V), amount (n), and absolute temperature (T) of a quantity of an ideal gas. The physical constant R in this equation is called the (universal) gas constant and is determined by solving the gas law equation for R and then substituting appropriately determined sets of values for p, V, n, and T into the equation to provide estimates of R. In this situation, clearly the concept of a relationship between variables plays an important role in estimating the value of a physical constant.

On the other hand, physicists estimate the speed of light and the values of many other physical constants more or less directly. In such cases, the concept of a relationship between variables plays no *direct* role.

However, although relationships play no *direct* role in estimating the values of some physical constants, relationships usually do play important *indirect* roles. For example, it is well known that the speed of light depends on the medium in which the light travels.
That is, there is a relationship between the two variables

- "the speed of a given light wave" and

- "the type of medium in which the light wave is travelling" (e.g., vacuum, air, water, or glass).

Since the estimated speed of light depends on the medium, a researcher attempting to estimate the speed must ensure that the medium in which the speed is measured is constant and well specified. Similarly, with many (all?) other physical constants, certain relationships between variables must be taken into account before the value of the constant can be properly estimated.

Since relationships between variables play key direct or indirect roles in estimating the values of many physical constants, it is reasonable to defer discussing physical constants (or their equivalents in other fields of empirical research) in the introductory statistics course until students have a good sense of the concept of a relationship between variables.

(I discuss the relationship between physical constants and the concepts of entities, properties, variables, and relationships between variables in appendix B. I discuss the role of constants [physical and otherwise] in empirical research in appendix C.)

>
> Do you consider these [three examples] to fit into the
> relationship rubric,

Gary's examples fit into the relationship rubric in the sense that each example has an important relationship between variables present in the background. (In each example Gary's variable appears as the *response* variable in the relationship.) I suggest that many other examples of the study of univariate distributions also have at least one relationship between variables that is important for the example lurking in the background. Thus to properly understand these examples, students must understand the concept of a relationship between variables.
Univariate distributions also fit into the rubric of relationships between variables in another important (but more theoretical) way: Consider an empirical study of some *pure* univariate distribution, so we are agreed that no predictor variables are present. (And thus no relationships between variables can be lurking in the background.) We can view this situation as a *special type* of relationship between variables. As usual, there is one response variable in the situation. But the number of predictor variables, instead of being one or more, is zero.

The preceding three sentences appear to be rigorously true as a limiting case in two senses

- in an empirical sense and

- in a strict mathematical sense.

That is, every empirical or mathematical procedure we use to study univariate distributions can be easily viewed as the limiting case (when the number of predictor variables is reduced to zero) of a similar (but more complicated) procedure we use (or could use) to study relationships between variables.

If we view the study of univariate distributions as a simple limiting case of the study of relationships between variables, this helps to highlight the rigorous links (empirical and mathematical) between the two types of study. This helps, in turn, to simplify the field of statistics.

(Although univariate distributions are a special case of relationships between variables, it is also true that relationships between variables are a *generalization* of univariate distributions. Thus why not teach univariate distributions first and then develop relationships as a generalization? As I discuss in the paper [1998], I recommend that univariate distributions be omitted because univariate distributions are boring and of little obvious use to beginning students. On the other hand, students find relationships between variables [when explained in terms of their ability to accurately predict and control] to be fascinating.)
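The limiting-case view described above can be illustrated with a small sketch. It assumes ordinary least squares as the procedure for studying a relationship (the post does not single out any particular procedure), and the function name `least_squares` and all data values are hypothetical. With one predictor the procedure fits a line; with the list of predictors empty, the very same procedure returns the sample mean of the response variable, which is a standard summary of its univariate distribution.

```python
from statistics import mean


def least_squares(predictors, response):
    """Fit response = b0 + b1*x1 + ... by ordinary least squares.

    `predictors` is a list of columns (possibly empty).
    Returns the coefficients [b0, b1, ...].
    """
    n = len(response)
    # Design matrix columns: a column of ones (intercept), then each predictor.
    columns = [[1.0] * n] + [list(map(float, col)) for col in predictors]
    k = len(columns)
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    aug = [
        [sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(k)]
        + [sum(a * y for a, y in zip(columns[i], response))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(k):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


temperature = [36.6, 36.8, 37.0, 36.9]  # hypothetical response values
activity = [1.0, 2.0, 3.0, 4.0]         # hypothetical predictor values

one_predictor = least_squares([activity], temperature)
zero_predictors = least_squares([], temperature)

print("one predictor  :", one_predictor)
print("zero predictors:", zero_predictors, "sample mean:", mean(temperature))
```

Nothing in the function treats the zero-predictor case specially; the design matrix simply shrinks to the intercept column, which is one concrete sense in which the univariate analysis is a limiting case of the analysis of a relationship.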
> or do you think that such questions are less important in an
> introductory statistics class?

For the sake of practicality, I believe the introductory statistics course should emphasize the main statistical activities of empirical researchers. If we survey empirical research (e.g., if we survey articles in recent issues of the multidisciplinary journals _Science_ and _Nature_), we quickly see that almost all empirical research projects can be reasonably viewed as studying relationships between variables. And only a very few empirical research projects can be reasonably viewed as studying univariate distributions.

(I believe most empirical research projects focus on relationships because relationships [properly discovered] give more accurate and more useful predictions [and control] than the predictions given by univariate distributions. I further discuss ways of viewing empirical research in appendix D. I discuss some examples of empirical research projects that do NOT study relationships between variables in a Usenet post [1997, appendix A].)

Since most empirical research projects can be reasonably viewed as studying relationships between variables, and not as studying univariate distributions, I believe that questions about univariate distributions are less important in an introductory statistics course.

I thank Gary Smith for his thought-provoking questions.

-------------------------------------------------------
Donald B. Macnaughton   MatStat Research Consulting Inc
donmac@matstat.com      Toronto, Canada
-------------------------------------------------------

APPENDIX A: ARE RELATIONSHIPS "BETWEEN" OR "AMONG" VARIABLES?

Should we speak of a relationship between variables or a relationship among variables?

Most (but not all) empirical research projects (or logical components of research projects) focus on a single response variable.
In this majority case it makes sense to view the relationship under study as being *between* the single response variable and the predictor variable(s). This helps to emphasize the important distinction between the response variable and the predictor variable(s) in the research, which the preposition "among" would downplay. Thus for this majority case it is reasonable to use the preposition "between".

In multivariate situations (which tend to be rare), such as multivariate regression and multivariate analysis of variance, *several* (two or more) response variables participate simultaneously in the analysis (generally along with one or more predictor variables). In these situations we can still view the response variable as being a single "variable", although in this case it is also a *vector*. That is, it is mathematically feasible and reasonable to view multivariate situations with multiple response variables as having a single (vector-valued) response variable.

Viewing multivariate situations as having a single response variable is also reasonable from an empirical research point of view because the response "variable" ought to represent some unity or property, even if it is a vector consisting of several individual variables. If the response "variable" is just a random conglomeration of properties (of, of course, the same entities), there is no obvious empirical sense in using it as the response "variable" in an analysis.

Thus in the majority univariate case and in the minority multivariate case (if we view the set of response variables as a [singular] vector) it is appropriate to use the preposition "between" in the phrase "relationship ... variables" because there is only one response "variable", and the relationship under study is *between* that variable and the predictor variable(s).

APPENDIX B: PHYSICAL CONSTANTS VERSUS ENTITIES, PROPERTIES, AND RELATIONSHIPS

What is the relationship between the concept of a physical constant and the concepts of entities, properties, variables, and relationships between variables?

Physical constants represent properties of entities like any other property except that physical constants are believed not to vary. For example, in any instance of a light transmission, the light being transmitted travels at a certain speed. This speed is a property of the (somewhat ethereal) "light" (or light wave) being transmitted.

Properties that are physical constants (such as the speed of light) are viewed as being "constant". But if we go into a laboratory and actually empirically measure the value of a physical constant, it will seem not constant at all. For example, if we repeatedly measure the speed of light in a vacuum as accurately as possible, we will find that we get a different value for the speed almost every time we measure it. (However, if we use modern instruments and are careful, the values will be *very close together*.)

Since we get a different value almost every time we measure the speed of light in a vacuum, we are not (directly) studying a constant but are instead studying a (quite narrow) univariate distribution.

However, physicists dismiss the small variation in the estimates of the speed of light in a vacuum as being due to inaccuracies in the measuring instruments. This dismissal is reasonable because

- the small amount of variation in the measured values is commensurate with the error rates of the instruments used to measure the speed and

- so far, physicists have been unable to find any evidence of a relationship between the speed of light in a vacuum and any other variable.

Thus, despite the variation in the measured values, physicists have inferred that the speed of light in a vacuum is constant.
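The reasoning above can be mimicked with a small simulation. This is only a sketch under stated assumptions: the readings are generated, not measured; the instrument error is assumed to be normally distributed with a hypothetical standard deviation of 50 m/s; and the names `simulate_measurements` and `C_VACUUM` are introduced here for illustration.

```python
import random
import statistics

C_VACUUM = 299_792_458.0  # m/s; conventional value of the speed of light in a vacuum


def simulate_measurements(n, instrument_sd, seed=0):
    """Simulate n readings of a true constant blurred by instrument error."""
    rng = random.Random(seed)
    return [C_VACUUM + rng.gauss(0.0, instrument_sd) for _ in range(n)]


# Hypothetical instrument error of 50 m/s (an assumption, not a real error rate).
readings = simulate_measurements(n=1000, instrument_sd=50.0)
spread = statistics.stdev(readings)

print(f"distinct values: {len(set(readings))} of {len(readings)}")
print(f"mean  = {statistics.mean(readings):.1f} m/s")
print(f"stdev = {spread:.1f} m/s (instrument sd assumed to be 50.0)")
```

In the simulation nearly every reading differs from every other, yet the spread of the readings is about the size of the assumed instrument error, which is the pattern that would justify treating the underlying quantity as a constant.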
It is conceivable that someday, when sufficiently sensitive measuring instruments are available, and when a physicist chooses the appropriate predictor variable, say P, he or she will find that the speed of light in a vacuum does depend (likely only to a small degree) on P. However, until such empirical evidence is brought forward, the principle of parsimony (which tells us to keep things as simple as possible) dictates that we assume that the speed of light in a vacuum is (a physical) constant.

(Musser [1998] discusses a research project proposed by Giovanni Amelino-Camelia and others to test the hypothesis that the speed of light in a vacuum is not constant but instead depends slightly on another variable, namely the wavelength of the light.)

APPENDIX C: THE ROLE OF CONSTANTS (PHYSICAL AND OTHERWISE) IN EMPIRICAL RESEARCH

Where do physical constants and other constants fit in empirical research?

Only a small proportion of empirical research (probably less than one percent) is *directly* involved in estimating the values of physical constants (or in estimating what might be viewed as the equivalent of physical constants in other branches of empirical research). Instead, as I suggest above, most empirical research can be easily viewed as directly studying relationships between variables.

On the other hand, much empirical research is *indirectly* involved in estimating the values of constants, since these constants are equivalent to the parameters in models of relationships between variables. The (constant) values of the parameters in the models of the relationships are important aids to the modeling.

APPENDIX D: WAYS OF VIEWING EMPIRICAL RESEARCH PROJECTS

I refer above to how we can *view* empirical research projects. That is, I do not say that research projects *are* a certain way -- I say they *can be viewed* in various ways.
In particular, I suggest above that it is efficient to view most empirical research projects as studying relationships between variables.

But there are other ways of viewing empirical research projects, for example

- We can view some research projects NOT as studying relationships between variables but as studying differences between subpopulations. I discuss in a paper why this point of view is less inclusive than viewing research projects as studying relationships between variables (1996, appendix B.4).

- We can view any research project that ostensibly studies a relationship between variables as "really" studying the univariate distribution of the response variable. That is, we focus on the response variable and we view the predictor variables as variables that may have an "effect" on the univariate distribution of the response variable, without invoking the concept of a relationship between variables. However, this approach makes it difficult to explain the use of "models" or "model equations", which are ubiquitous in empirical research. What is a model if it is not a model of a relationship between variables?

- We can view many empirical research projects in terms of determining the values of constants, where the constants are the constant values of the parameters in models. But once again we have models, and we must answer what the models are models of.

These points support the notion that the most efficient way to view most empirical research projects is in terms of the study of relationships between variables.

REFERENCES

Macnaughton, D. B. 1996. "The introductory statistics course: A new approach." This paper is available at http://www.matstat.com/teach/

Macnaughton, D. B. 1997. "Re: How should we *motivate* students in intro stat? (response to comments by John R. Vokey)." Posted to sci.stat.edu and EdStat-L on April 6, 1997 and revised on June 1, 1997. Available at http://www.matstat.com/teach/p0024.htm

Macnaughton, D. B. 1998. "Eight features of an ideal introductory statistics course." This paper is available at http://www.matstat.com/teach/

Musser, G. 1998. "String instruments: String theory may soon be testable." _Scientific American,_ 279 (4) (October), 24-28.
