Subject: Re: Eight Features of an Ideal Intro Stat Course
(Response to comments by Gary Smith)
To: EdStat-L and sci.stat.edu
From: Donald B. Macnaughton <donmac@matstat.com>
Date: Monday November 23, 1998
Many introductory statistics courses cover the topic of univari-
ate distributions early in the course. In a recent paper (1998)
I recommend that teachers *omit* this topic because (I argue)
univariate distributions are boring and of little obvious use to
beginning students. In response, Gary Smith of Pomona College
sent me three examples of univariate distributions that may be of
interest in an introductory course. Here (with Gary's permis-
sion) are Gary's remarks and my reply.
Gary writes:
> ( snip )
> As an economist, I'm very receptive to your emphasis on rela-
> tionships among variables. That's pretty much what we do.
As I discuss further below, I believe that most empirical re-
search done by economists and by all other empirical researchers
is reasonably viewed as studying relationships between variables.
(Gary speaks of relationships "among" variables and I speak of
relationships "between" variables. I discuss why I recommend the
preposition "between" for general use in appendix A.)
Gary presents his three examples of potentially interesting uni-
variate distributions as follows:
> On the other hand, what about, say, (a) predicting the fraction
> of the vote that Candidate A will receive in an upcoming elec-
> tion; (b) estimating the average body temperature of healthy
> humans; or (c) estimating the speed of light?
A. PREDICTING THE FRACTION OF THE VOTE CANDIDATE A WILL RECEIVE
The standard way of predicting the fraction of the vote a candi-
date will receive in an election is to ask a random sample of
voters a question something like
     If the election were held today, who would you vote for:
     Candidate A, Candidate B, [etc.]?
The results of asking this question will be reflected in a single
(nominal-level) variable, which we might call "Preferred Candi-
date". Each person in the sample will contribute one value of
this variable.
If the sample is a simple random sample, we predict that candi-
date A will receive the same fraction of the vote as the fraction
of people in the sample who said they will vote for candidate A.
Thus the prediction is made on the basis of a univariate
distribution, NOT on the basis of a relationship between
variables. So this may be a situation in which a univariate
distribution is important.
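As a minimal sketch of the prediction just described, assuming a simple random sample, the following Python fragment tallies the single variable "Preferred Candidate" and reports each candidate's fraction of the sample. The ten responses are invented for illustration (they are not survey data), and the function name is hypothetical.

```python
from collections import Counter

def predicted_vote_fractions(responses):
    """Return each candidate's fraction of the responses in the sample."""
    counts = Counter(responses)
    total = len(responses)
    return {candidate: count / total for candidate, count in counts.items()}

# Hypothetical sample of ten voters, invented for illustration only.
sample = ["A", "B", "C", "C", "B", "A", "C", "B", "C", "A"]

# Each value is one observation of the variable "Preferred Candidate".
fractions = predicted_vote_fractions(sample)
print(fractions)
```

Note that the whole computation uses one variable only: no predictor variable appears anywhere in it.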
Clearly, in one sense the predictions made from the univariate
distribution of "Preferred Candidate" are important because many
people are interested in election predictions. (Otherwise the
news media would not publish them.) But in another more practi-
cal sense the predictions are not very important. We can see
this by noting that they provide relatively little *rational ba-
sis for action* on the part of any person or group. What person
or group can infer a rational basis for action from predictions
of the univariate fraction of the vote the candidates will re-
ceive in an election?
(I am not saying the predictions provide NO practical basis for
action. For example, one obvious practical use of the predic-
tions occurs if the survey predicts that Candidate A will do very
poorly in the election. In this case [and barring special cir-
cumstances], Candidate A should drop out of the election [to
minimize his or her expenses].)
In contrast to the simple survey of voters discussed above, we
might perform a more sophisticated survey -- one in which we ob-
tain the age of each respondent as well as the name of the candi-
date the respondent will vote for. Such a survey will give us a
table like the following:
Predicted Percentage of the Vote
To Be Received by Different Candidates
Broken Down by Voter Age Groups
---------------------------------
Voter         Candidate
Age Group   -------------  TOTAL
(years)       A    B    C
---------------------------------
18 - 29       4    8   19     31
30 - 39       6   11    7     24
40 - 49       5    8    6     19
50 and up    12    6    8     26
            ---  ---  ---   ----
TOTAL        27   33   40    100
---------------------------------
The first row in the body of the table predicts that 31 percent
of the voters will lie between the ages of 18 and 29, but rela-
tively few people in this group will vote for Candidate A -- most
of the people in the group will vote for Candidate C. The re-
sults of *this* survey provide a more useful basis for action.
For example, Candidate A can see from the table that he does not
have much support from younger voters. He may therefore decide
to revise his campaign strategy to increase his popularity with
this group.
But if Candidate A uses the information in the table, he is using
information about a relationship between two variables, namely
the relationship between "Voter Age Group" and "Preferred Candi-
date".
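One way to make this relationship easier to see is to convert each row of the table into percentages of that age group's own total. The following Python fragment is a minimal sketch of that step. The numbers are copied from the table above; the function and variable names are hypothetical.

```python
# Percentages of all voters, copied from the table above.
table = {
    "18 - 29":   {"A": 4,  "B": 8,  "C": 19},
    "30 - 39":   {"A": 6,  "B": 11, "C": 7},
    "40 - 49":   {"A": 5,  "B": 8,  "C": 6},
    "50 and up": {"A": 12, "B": 6,  "C": 8},
}

def shares_within_age_group(table):
    """Convert each row to percentages of that age group's own total."""
    result = {}
    for age_group, row in table.items():
        row_total = sum(row.values())
        result[age_group] = {c: 100 * v / row_total for c, v in row.items()}
    return result

shares = shares_within_age_group(table)
for age_group, row in shares.items():
    print(age_group, {c: round(p, 1) for c, p in row.items()})
```

With these figures, Candidate A's share is about 13 percent among voters aged 18 - 29 but about 46 percent among voters aged 50 and up. The share changes with the age group, which is what it means for a relationship to exist between the two variables.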
Thus, although the univariate distribution of the variable
"Preferred Candidate" (as shown in the bottom row of the table) is
of some interest, we can usually get much greater value if we
study the same variable together with some *predictor* variables.
That is, we can get greater value if we study the variable
"Preferred Candidate" as the response variable in a relationship
between variables.
Since we can generally get more useful election predictions from
studying relationships between variables than from studying uni-
variate distributions, I suggest that this example does not pro-
vide a strong reason for discussing univariate distributions near
the beginning of the introductory statistics course.
B. ESTIMATING THE AVERAGE BODY TEMPERATURE OF HEALTHY HUMANS
Estimating the average body temperature of healthy humans is a
problem of determining a "norm". When empirical researchers de-
termine norms, they do directly study univariate distributions.
However, even here, other variables and relationships between
variables play a key role because researchers determining norms
usually hold other important variables at specific constant
values. Otherwise variation in the other variables may cause
(through a relationship between the variables) extra variation in
the values of the variable being "normed", which muddies the
norm.
For example, if medical researchers wish to estimate the average
body temperature of healthy humans, they will usually ensure that
each person whose temperature is measured to determine the norm
has the same (usually sedentary) level of physical activity in
the period before their temperature is measured. This is because
the researchers know that a relationship exists in people between
"level of physical activity" and "body temperature" -- as physi-
cal activity increases, so (slightly) does body temperature.
Since relationships between variables often play a key role in
determining norms, it is reasonable to defer discussing norms in
the introductory statistics course until students have a good
sense of the concept of a relationship between variables.
C. ESTIMATING THE SPEED OF LIGHT
Estimating the speed of light is a problem of estimating the
value of a *physical constant*. The hard sciences have defined
many (perhaps one or two hundred) general physical constants such
as the speed of light, Planck's constant, and the proton rest
mass. In addition, the values of the various constant properties
of the specific entities studied by the hard sciences are also
physical constants (e.g., the thermal conductivity of titanium).
An important use of physical constants is to provide the values
of some parameters in some models of relationships between vari-
ables. For example, the speed of light is represented by the
symbol "c" and is the main parameter in Einstein's equation
     E = m c^2.
(This equation is generally viewed as a statement of a
relationship between two variables -- the energy [E] contained in
a piece of matter and the mass [m] of that piece of matter.)
Physicists and chemists estimate the values of some physical con-
stants by simply solving for them in the model equations in which
they appear and then by substituting appropriate empirically de-
termined values for the variables into the equation. For exam-
ple, the ideal gas law is a relationship between variables that
is usually stated as
     pV = nRT.
This equation is a statement of the relationship between pressure
(p), volume (V), amount (n), and absolute temperature (T) of a
quantity of an ideal gas. The physical constant R in this equa-
tion is called the (universal) gas constant and is determined by
solving the gas law equation for R and then substituting appro-
priately determined sets of values for p, V, n, and T into the
equation to provide estimates of R. In this situation, clearly
the concept of a relationship between variables plays an impor-
tant role in estimating the value of a physical constant.
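The procedure just described can be sketched in a few lines of Python: solve the gas law for R, which gives R = pV / (nT), and then substitute each set of values. The three sets of values below are invented for illustration (they are not laboratory measurements) and were chosen to lie near the accepted value of R, about 8.314 J/(mol K). The function name is hypothetical.

```python
from statistics import fmean

def estimate_gas_constant(p, V, n, T):
    """Solve pV = nRT for R, given one set of values in SI units."""
    return (p * V) / (n * T)

# Hypothetical sets of values (p in Pa, V in m^3, n in mol, T in K),
# invented for illustration; they are not laboratory measurements.
observations = [
    (101325.0, 0.022414, 1.0, 273.15),
    (101325.0, 0.024465, 1.0, 298.15),
    (202650.0, 0.024790, 2.0, 302.00),
]

# Each set of values yields one estimate of R.
estimates = [estimate_gas_constant(p, V, n, T) for p, V, n, T in observations]

# Averaging the estimates is one simple way to combine them.
R_estimate = fmean(estimates)
print(estimates, R_estimate)
```

The individual estimates differ slightly from one another, so in practice a researcher would also report how widely they are spread.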
On the other hand, physicists estimate the speed of light and the
values of many other physical constants more or less directly.
In such cases, the concept of a relationship between variables
plays no *direct* role.
However, although relationships play no *direct* role in estimat-
ing the values of some physical constants, relationships usually
do play important *indirect* roles. For example, it is well
known that the speed of light depends on the medium in which the
light travels. That is, there is a relationship between the two
variables
- "the speed of a given light wave" and
- "the type of medium in which the light wave is travelling"
(e.g., vacuum, air, water, or glass).
Since the estimated speed of light depends on the medium, a re-
searcher attempting to estimate the speed must ensure that the
medium in which the speed is measured is constant and well speci-
fied. Similarly, with many (all?) other physical constants, cer-
tain relationships between variables must be taken into account
before the value of the constant can be properly estimated.
Since relationships between variables play key direct or indirect
roles in estimating the values of many physical constants, it is
reasonable to defer discussing physical constants (or their
equivalents in other fields of empirical research) in the intro-
ductory statistics course until students have a good sense of the
concept of a relationship between variables.
(I discuss the relationship between physical constants and the
concepts of entities, properties, variables, and relationships
between variables in appendix B. I discuss the role of constants
[physical and otherwise] in empirical research in appendix C.)
>
> Do you consider these [three examples] to fit into the rela-
> tionship rubric,
Gary's examples fit into the relationship rubric in the sense
that each example has an important relationship between variables
present in the background. (In each example Gary's variable ap-
pears as the *response* variable in the relationship.) I suggest
that many other examples of the study of univariate distributions
also have at least one relationship between variables that is im-
portant for the example lurking in the background. Thus to prop-
erly understand these examples, students must understand the con-
cept of a relationship between variables.
Univariate distributions also fit into the rubric of relation-
ships between variables in another important (but more theoreti-
cal) way: Consider an empirical study of some *pure* univariate
distribution, so we are agreed that no predictor variables are
present. (And thus no relationships between variables can be
lurking in the background.) We can view this situation as a
*special type* of relationship between variables. As usual,
there is one response variable in the situation. But the number
of predictor variables, instead of being one or more, is zero.
The preceding three sentences appear to be rigorously true as a
limiting case in two senses:
- in an empirical sense and
- in a strict mathematical sense.
That is, every empirical or mathematical procedure we use to
study univariate distributions can be easily viewed as the limit-
ing case (when the number of predictor variables is reduced to
zero) of a similar (but more complicated) procedure we use (or
could use) to study relationships between variables.
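As a minimal sketch of this limiting case, assume that the procedure for studying the relationship is ordinary least-squares regression (one choice among many, used here only for illustration). The same fitting routine that estimates an intercept and slopes when predictor variables are present returns the ordinary sample mean of the response variable when the list of predictor variables is empty. The data and the function name below are hypothetical.

```python
from statistics import fmean

def fit_least_squares(predictor_columns, response):
    """Fit response = b0 + b1*x1 + ... by ordinary least squares.

    predictor_columns is a list of equal-length lists, one per
    predictor variable; it may be empty. Returns [b0, b1, ...].
    """
    n = len(response)
    # Design matrix columns: a column of ones, then the predictors.
    columns = [[1.0] * n] + [[float(v) for v in col] for col in predictor_columns]
    k = len(columns)
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    aug = [
        [sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(k)]
        + [sum(a * y for a, y in zip(columns[i], response))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(k):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]

one_predictor = fit_least_squares([x], y)   # intercept and slope
zero_predictors = fit_least_squares([], y)  # intercept only

print(one_predictor)
print(zero_predictors, fmean(y))
```

With zero predictor variables the normal equations reduce to a single equation whose solution is the sum of the response values divided by their count, that is, the sample mean.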
If we view the study of univariate distributions as a simple lim-
iting case of the study of relationships between variables, this
helps to highlight the rigorous links (empirical and mathemati-
cal) between the two types of study. This helps, in turn, to
simplify the field of statistics.
(Although univariate distributions are a special case of rela-
tionships between variables, it is also true that relationships
between variables are a *generalization* of univariate
distributions. So why not teach univariate distributions first and
then develop relationships as a generalization? As I discuss in
the paper [1998], I recommend that univariate distributions be
omitted because univariate distributions are boring and of little
obvious use to beginning students. On the other hand, students
find relationships between variables [when explained in terms of
their ability to accurately predict and control] to be fascinat-
ing.)
> or do you think that such questions are less important in an
> introductory statistics class?
For the sake of practicality, I believe the introductory statis-
tics course should emphasize the main statistical activities of
empirical researchers. If we survey empirical research (e.g., if
we survey articles in recent issues of the multidisciplinary
journals _Science_ and _Nature_), we quickly see that almost all
empirical research projects can be reasonably viewed as studying
relationships between variables. And only a very few empirical
research projects can be reasonably viewed as studying univariate
distributions.
(I believe most empirical research projects focus on relation-
ships because relationships [properly discovered] give more accu-
rate and more useful predictions [and control] than the predic-
tions given by univariate distributions. I further discuss ways
of viewing empirical research in appendix D. I discuss some ex-
amples of empirical research projects that do NOT study relation-
ships between variables in a Usenet post [1997, appendix A].)
Since most empirical research projects can be reasonably viewed
as studying relationships between variables, and not as studying
univariate distributions, I believe that questions about univari-
ate distributions are less important in an introductory statis-
tics course.
I thank Gary Smith for his thought-provoking questions.
-------------------------------------------------------
Donald B. Macnaughton   MatStat Research Consulting Inc
donmac@matstat.com      Toronto, Canada
-------------------------------------------------------
APPENDIX A: ARE RELATIONSHIPS "BETWEEN" OR "AMONG" VARIABLES?
Should we speak of
     a relationship between variables
or
     a relationship among variables?
Most (but not all) empirical research projects (or logical compo-
nents of research projects) focus on a single response variable.
In this majority case it makes sense to view the relationship un-
der study as being *between* the single response variable and the
predictor variable(s). This helps to emphasize the important
distinction between the response variable and the predictor vari-
able(s) in the research, which the preposition "among" would
downplay. Thus for this majority case it is reasonable to use
the preposition "between".
In multivariate situations (which tend to be rare), such as mul-
tivariate regression and multivariate analysis of variance,
*several* (two or more) response variables participate simultane-
ously in the analysis (generally along with one or more predictor
variables). In these situations we can still view the response
variable as being a single "variable", although in this case it
is also a *vector*. That is, it is mathematically feasible and
reasonable to view multivariate situations with multiple response
variables as having a single (vector-valued) response variable.
Viewing multivariate situations as having a single response vari-
able is also reasonable from an empirical research point of view
because the response "variable" ought to represent some unity or
property, even if it is a vector consisting of several individual
variables. If the response "variable" is just a random conglom-
eration of properties (of, of course, the same entities), there
is no obvious empirical sense in using it as the response "vari-
able" in an analysis.
Thus in the majority univariate case and in the minority
multivariate case (if we view the set of response variables as a
[single] vector) it is appropriate to use the preposition "between"
in the phrase "relationship ... variables" because there is only
one response "variable", and the relationship under study is
*between* that variable and the predictor variable(s).
APPENDIX B: PHYSICAL CONSTANTS VERSUS ENTITIES, PROPERTIES, AND
RELATIONSHIPS
What is the relationship between the concept of a physical con-
stant and the concepts of entities, properties, variables, and
relationships between variables?
Physical constants represent properties of entities like any
other property except that physical constants are believed not to
vary. For example, in any instance of a light transmission, the
light being transmitted travels at a certain speed. This speed
is a property of the (somewhat ethereal) "light" (or light wave)
being transmitted.
Properties that are physical constants (such as the speed of
light) are viewed as being "constant". But if we go into a labo-
ratory and actually empirically measure the value of a physical
constant, it will seem not constant at all. For example, if we
repeatedly measure the speed of light in a vacuum as accurately
as possible, we will find that we get a different value for the
speed almost every time we measure it. (However, if we use mod-
ern instruments and are careful, the values will be *very close
together*.) Since we get a different value almost every time we
measure the speed of light in a vacuum, we are not (directly)
studying a constant but are instead studying a (quite narrow)
univariate distribution.
However, physicists dismiss the small variation in the estimates
of the speed of light in a vacuum as being due to inaccuracies in
the measuring instruments. This dismissal is reasonable because
- the small amount of variation in the measured values is commen-
surate with the error rates of the instruments used to measure
the speed and
- so far, physicists have been unable to find any evidence of a
relationship between the speed of light in a vacuum and any
other variable.
Thus, despite the variation in the measured values, physicists
have inferred that the speed of light in a vacuum is constant.
It is conceivable that someday, when sufficiently sensitive meas-
uring instruments are available, and when a physicist chooses the
appropriate predictor variable, say P, he or she will find that
the speed of light in a vacuum does depend (likely only to a
small degree) on P. However, until such empirical evidence is
brought forward, the principle of parsimony (which tells us to
keep things as simple as possible) dictates that we assume that
the speed of light in a vacuum is (a physical) constant.
(Musser [1998] discusses a research project proposed by Giovanni
Amelino-Camelia and others to test the hypothesis that the speed
of light in a vacuum is not constant but instead depends slightly
on another variable, namely the wavelength of the light.)
APPENDIX C: THE ROLE OF CONSTANTS (PHYSICAL AND OTHERWISE) IN
EMPIRICAL RESEARCH
Where do physical constants and other constants fit in empirical
research?
Only a small proportion of empirical research (probably less than
one percent) is *directly* involved in estimating the values of
physical constants (or in estimating what might be viewed as the
equivalent of physical constants in other branches of empirical
research). Instead, as I suggest above, most empirical research
can be easily viewed as directly studying relationships between
variables.
On the other hand, much empirical research is *indirectly* in-
volved in estimating the values of constants, since these con-
stants are equivalent to the parameters in models of relation-
ships between variables. The (constant) values of the parameters
in the models of the relationships are important aids to the mod-
eling.
APPENDIX D: WAYS OF VIEWING EMPIRICAL RESEARCH PROJECTS
I refer above to how we can *view* empirical research projects.
That is, I do not say that research projects *are* a certain way
-- I say they *can be viewed* in various ways. In particular, I
suggest above that it is efficient to view most empirical re-
search projects as studying relationships between variables.
But there are other ways of viewing empirical research projects,
for example:
- We can view some research projects NOT as studying relation-
ships between variables but as studying differences between
subpopulations. I discuss in a paper why this point of view is
less inclusive than viewing research projects as studying rela-
tionships between variables (1996, appendix B.4).
- We can view any research project that ostensibly studies a re-
lationship between variables as "really" studying the univari-
ate distribution of the response variable. That is, we focus
on the response variable and we view the predictor variables as
variables that may have an "effect" on the univariate distribu-
tion of the response variable, without invoking the concept of
a relationship between variables. However, this approach makes
it difficult to explain the use of "models" or "model equa-
tions", which are ubiquitous in empirical research. What is a
model if it is not a model of a relationship between variables?
- We can view many empirical research projects in terms of deter-
mining the values of constants, where the constants are the
constant values of the parameters in models. But once again we
have models, and we must answer what the models are models of.
These points support the notion that the most efficient way to
view most empirical research projects is in terms of the study of
relationships between variables.
REFERENCES
Macnaughton, D. B. 1996. "The introductory statistics course: A
new approach." This paper is available at
http://www.matstat.com/teach/
Macnaughton, D. B. 1997. "Re: How should we *motivate* students
in intro stat? (response to comments by John R. Vokey)."
Posted to sci.stat.edu and EdStat-L on April 6, 1997 and re-
vised on June 1, 1997. Available at
http://www.matstat.com/teach/p0024.htm
Macnaughton, D. B. 1998. "Eight features of an ideal introductory
statistics course." This paper is available at
http://www.matstat.com/teach/
Musser, G. 1998. "String instruments: String theory may soon be
testable." _Scientific American,_ 279 (4) (October), 24-28.