Eight Features: Smith Response

Subject: Re: Eight Features of an Ideal Intro Stat Course
         (Response to comments by Gary Smith)

     To: EdStat-L and sci.stat.edu

   From: Donald B. Macnaughton <donmac@matstat.com>

   Date: Monday November 23, 1998

Many introductory statistics courses cover the topic of univari-
ate distributions early in the course.  In a recent paper (1998) 
I recommend that teachers *omit* this topic because (I argue) 
univariate distributions are boring and of little obvious use to 
beginning students.  In response, Gary Smith of Pomona College 
sent me three examples of univariate distributions that may be of 
interest in an introductory course.  Here (with Gary's permis-
sion) are Gary's remarks and my reply.

Gary writes

>    ( snip )
> As an economist, I'm very receptive to your emphasis on rela-
> tionships among variables.  That's pretty much what we do.  

As I discuss further below, I believe that most empirical re-
search done by economists and by all other empirical researchers 
is reasonably viewed as studying relationships between variables.

(Gary speaks of relationships "among" variables and I speak of 
relationships "between" variables.  I discuss why I recommend the 
preposition "between" for general use in appendix A.)

Gary presents his three examples of potentially interesting uni-
variate distributions as follows:

> On the other hand, what about, say, (a) predicting the fraction
> of the vote that Candidate A will receive in an upcoming elec-
> tion; (b) estimating the average body temperature of healthy
> humans; or (c) estimating the speed of light?


A. PREDICTING THE FRACTION OF THE VOTE CANDIDATE A WILL RECEIVE

The standard way of predicting the fraction of the vote a candi-
date will receive in an election is to ask a random sample of 
voters a question something like

    If the election were held today, who would you vote for:  
    Candidate A, Candidate B, [etc.]?

The results of asking this question will be reflected in a single 
(nominal-level) variable, which we might call "Preferred Candi-
date".  Each person in the sample will contribute one value of 
this variable.  

If the sample is a simple random sample, we predict that 
Candidate A will receive the same fraction of the vote as the 
fraction of people in the sample who said they will vote for 
Candidate A.  The prediction is therefore made on the basis of a 
univariate distribution, NOT on the basis of a relationship 
between variables, so this may be a situation in which a 
univariate distribution is important.
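
A minimal sketch of this prediction in Python may help.  The 
sample of ten responses is made up for illustration, and the 
function name is hypothetical:

```python
from collections import Counter


def predicted_vote_fractions(responses):
    """Return each candidate's fraction of the sample responses."""
    counts = Counter(responses)
    total = len(responses)
    return {candidate: count / total for candidate, count in counts.items()}


# Made-up values of the variable "Preferred Candidate", one value
# per sampled voter (illustrative data only).
sample = ["A", "B", "C", "C", "B", "A", "C", "B", "C", "A"]

fractions = predicted_vote_fractions(sample)
for candidate in sorted(fractions):
    print(candidate, fractions[candidate])
```

The prediction uses nothing but the values of the single variable 
"Preferred Candidate".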

Clearly, in one sense the predictions made from the univariate 
distribution of "Preferred Candidate" are important because many 
people are interested in election predictions.  (Otherwise the 
news media would not publish them.)  But in another more practi-
cal sense the predictions are not very important.  We can see 
this by noting that they provide relatively little *rational ba-
sis for action* on the part of any person or group.  What person 
or group can infer a rational basis for action from predictions 
of the univariate fraction of the vote the candidates will re-
ceive in an election?

(I am not saying the predictions provide NO practical basis for 
action.  For example, one obvious practical use of the predic-
tions occurs if the survey predicts that Candidate A will do very 
poorly in the election.  In this case [and barring special cir-
cumstances], Candidate A should drop out of the election [to 
minimize his or her expenses].)

In contrast to the simple survey of voters discussed above, we 
might perform a more sophisticated survey -- one in which we ob-
tain the age of each respondent as well as the name of the candi-
date the respondent will vote for.  Such a survey will give us a 
table like the following:

                 Predicted Percentage of the Vote
              To Be Received by Different Candidates
                 Broken Down by Voter Age Groups
                ---------------------------------
                  Voter       Candidate
                Age Group   -------------   TOTAL
                 (years)      A    B    C
                ---------------------------------
                 18 - 29      4    8   19     31
                 30 - 39      6   11    7     24
                 40 - 49      5    8    6     19
                 50 and up   12    6    8     26
                            ---  ---  ---   ----
                   TOTAL     27   33   40    100
                ---------------------------------

The first row in the body of the table predicts that 31 percent 
of the voters will be between the ages of 18 and 29, and that 
relatively few people in this group (4 of the 31 percentage 
points) will vote for Candidate A -- most of the people in the 
group (19 of the 31) will vote for Candidate C.  The results of 
*this* survey provide a more useful basis for action.  
For example, Candidate A can see from the table that he or she 
does not have much support from younger voters.  Candidate A may 
therefore decide to revise the campaign strategy to become more 
popular with this group.

But if Candidate A uses the information in the table, he or she 
is using information about a relationship between two variables, 
namely the relationship between "Voter Age Group" and "Preferred 
Candidate".

Thus, although the univariate distribution of the variable "Pre-
ferred Candidate" (as shown in the bottom row of the table) is of 
some interest, we can usually get much greater value if we study 
the same "univariate distribution", but we study it along with 
some *predictor* variables.  That is, we can get greater value if 
we study the variable "Preferred Candidate" as the response vari-
able in a relationship between variables.
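
The contrast can be sketched in a few lines of Python.  The 
percentages are copied from the table above; the function names 
are hypothetical and the code is only one of many ways to do the 
arithmetic:

```python
# Percentages of all voters, copied from the table above.
table = {
    "18 - 29":   {"A": 4,  "B": 8,  "C": 19},
    "30 - 39":   {"A": 6,  "B": 11, "C": 7},
    "40 - 49":   {"A": 5,  "B": 8,  "C": 6},
    "50 and up": {"A": 12, "B": 6,  "C": 8},
}


def overall_share(table, candidate):
    """Univariate view: the candidate's percentage of all voters."""
    return sum(row[candidate] for row in table.values())


def share_within_group(table, candidate):
    """Relationship view: the candidate's percentage within each age group."""
    return {
        group: 100 * row[candidate] / sum(row.values())
        for group, row in table.items()
    }


overall_a = overall_share(table, "A")
shares_a = share_within_group(table, "A")

print("Overall:", overall_a)
for group, pct in shares_a.items():
    print(f"{group:10s} {pct:5.1f}")
```

The overall figure is the bottom-row value for Candidate A.  The 
within-group figures show how "Preferred Candidate" depends on 
"Voter Age Group", which is the information Candidate A can act 
on.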

Since we can generally get more useful election predictions from 
studying relationships between variables than from studying uni-
variate distributions, I suggest that this example does not pro-
vide a strong reason for discussing univariate distributions near 
the beginning of the introductory statistics course.


B. ESTIMATING THE AVERAGE BODY TEMPERATURE OF HEALTHY HUMANS

Estimating the average body temperature of healthy humans is a 
problem of determining a "norm".  When empirical researchers de-
termine norms, they do directly study univariate distributions.  
However, even here, other variables and relationships between 
variables play a key role because researchers determining norms 
usually hold other important variables at specific constant val-
ues.  Otherwise the norms may be muddied by variation in the 
other variables causing (through a relationship between the vari-
ables) extra variation in the values of the variable being 
"normed".  

For example, if medical researchers wish to estimate the average 
body temperature of healthy humans, they will usually ensure that 
each person whose temperature is measured to determine the norm 
has the same (usually sedentary) level of physical activity in 
the period before their temperature is measured.  This is because 
the researchers know that a relationship exists in people between 
"level of physical activity" and "body temperature" -- as physi-
cal activity increases, so (slightly) does body temperature.

Since relationships between variables often play a key role in 
determining norms, it is reasonable to defer discussing norms in 
the introductory statistics course until students have a good 
sense of the concept of a relationship between variables.  


C. ESTIMATING THE SPEED OF LIGHT

Estimating the speed of light is a problem of estimating the 
value of a *physical constant*.  The hard sciences have defined 
many (perhaps one or two hundred) general physical constants such 
as the speed of light, Planck's constant, and the proton rest 
mass.  In addition, the values of the various constant properties 
of the specific entities studied by the hard sciences are also 
physical constants (e.g., the thermal conductivity of titanium).

An important use of physical constants is to provide the values 
of some parameters in some models of relationships between vari-
ables.  For example, the speed of light is represented by the 
symbol "c" and is the main parameter in Einstein's equation 

                           E = m c^2.

(This equation is generally viewed as a statement of a 
relationship between two variables -- the rest energy [E] and 
the mass [m] of a piece of matter.)

Physicists and chemists estimate the values of some physical con-
stants by simply solving for them in the model equations in which 
they appear and then by substituting appropriate empirically de-
termined values for the variables into the equation.  For exam-
ple, the ideal gas law is a relationship between variables that 
is usually stated as 

                            pV = nRT.

This equation is a statement of the relationship between the 
pressure (p), volume (V), amount of substance (n), and absolute 
temperature (T) of a quantity of an ideal gas.  The physical 
constant R in this equation is called the (universal) gas 
constant.  It is determined by solving the gas law equation for 
R, giving

                          R = pV / (nT),

and then substituting appropriately determined sets of values 
for p, V, n, and T into the equation to provide estimates of R.  
In this situation, clearly the concept of a relationship between 
variables plays an important role in estimating the value of a 
physical constant.
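
As a rough sketch of this procedure in Python, assume three 
made-up sets of measurements in SI units.  The numbers are 
illustrative only (they are not real laboratory data), and 
averaging the individual estimates is just one simple way of 
combining them:

```python
from statistics import mean


def estimate_gas_constant(measurements):
    """Solve pV = nRT for R in each measurement set, then average."""
    estimates = [p * v / (n * t) for (p, v, n, t) in measurements]
    return mean(estimates), estimates


# Made-up measurement sets: (p in Pa, V in m^3, n in mol, T in K).
measurements = [
    (101325.0, 0.02241, 1.0, 273.15),
    (101325.0, 0.02447, 1.0, 298.15),
    (202650.0, 0.02446, 2.0, 298.15),
]

r_hat, estimates = estimate_gas_constant(measurements)
print("Individual estimates of R:", [round(e, 4) for e in estimates])
print("Averaged estimate of R:", round(r_hat, 4))
```

Each estimate of R comes directly from the model equation for the 
relationship between p, V, n, and T.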

On the other hand, physicists estimate the speed of light and the 
values of many other physical constants more or less directly.  
In such cases, the concept of a relationship between variables 
plays no *direct* role.

However, although relationships play no *direct* role in estimat-
ing the values of some physical constants, relationships usually 
do play important *indirect* roles.  For example, it is well 
known that the speed of light depends on the medium in which the 
light travels.  That is, there is a relationship between the two 
variables 
- "the speed of a given light wave" and 
- "the type of medium in which the light wave is travelling" 
  (e.g., vacuum, air, water, or glass).  

Since the estimated speed of light depends on the medium, a re-
searcher attempting to estimate the speed must ensure that the 
medium in which the speed is measured is constant and well speci-
fied.  Similarly, with many (all?) other physical constants, cer-
tain relationships between variables must be taken into account 
before the value of the constant can be properly estimated.  

Since relationships between variables play key direct or indirect 
roles in estimating the values of many physical constants, it is 
reasonable to defer discussing physical constants (or their 
equivalents in other fields of empirical research) in the intro-
ductory statistics course until students have a good sense of the 
concept of a relationship between variables.  

(I discuss the relationship between physical constants and the 
concepts of entities, properties, variables, and relationships 
between variables in appendix B.  I discuss the role of constants 
[physical and otherwise] in empirical research in appendix C.)

>
> Do you consider these [three examples] to fit into the rela-
> tionship rubric, 

Gary's examples fit into the relationship rubric in the sense 
that each example has an important relationship between variables 
present in the background.  (In each example Gary's variable ap-
pears as the *response* variable in the relationship.)  I suggest 
that many other examples of the study of univariate distributions 
also have at least one relationship between variables that is im-
portant for the example lurking in the background.  Thus to prop-
erly understand these examples, students must understand the con-
cept of a relationship between variables.

Univariate distributions also fit into the rubric of relation-
ships between variables in another important (but more theoreti-
cal) way:  Consider an empirical study of some *pure* univariate 
distribution, so we are agreed that no predictor variables are 
present.  (And thus no relationships between variables can be 
lurking in the background.)  We can view this situation as a 
*special type* of relationship between variables.  As usual, 
there is one response variable in the situation.  But the number 
of predictor variables, instead of being one or more, is zero.  
The preceding three sentences appear to be rigorously true as a 
limiting case in two senses
- in an empirical sense and
- in a strict mathematical sense.  

That is, every empirical or mathematical procedure we use to 
study univariate distributions can be easily viewed as the limit-
ing case (when the number of predictor variables is reduced to 
zero) of a similar (but more complicated) procedure we use (or 
could use) to study relationships between variables.
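
One way to see the limiting case is with ordinary least-squares 
fitting.  The sketch below is in Python; the data are made up, 
the function name is hypothetical, and least squares is only one 
of the procedures the argument covers.  With one predictor the 
procedure gives an intercept and a slope.  With zero predictors 
the same procedure gives just the mean of the response variable:

```python
def fit_least_squares(y, predictors=()):
    """Least-squares coefficients [intercept, slope_1, ...].

    Solves the normal equations (X'X) b = X'y by Gauss-Jordan
    elimination.  `predictors` is a sequence of equal-length lists.
    """
    n = len(y)
    # Design-matrix rows: a leading 1 for the intercept, then the
    # value of each predictor variable for that observation.
    rows = [[1.0] + [float(x[i]) for x in predictors] for i in range(n)]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    aug = [
        [sum(r[a] * r[b] for r in rows) for b in range(k)]
        + [sum(r[a] * yi for r, yi in zip(rows, y))]
        for a in range(k)
    ]
    for col in range(k):
        # Partial pivoting: bring the largest entry in this column up.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


y = [98.2, 98.6, 98.4, 98.8]   # made-up response values
x = [0.0, 1.0, 2.0, 3.0]       # made-up predictor values

with_predictor = fit_least_squares(y, [x])   # [intercept, slope]
no_predictor = fit_least_squares(y)          # [mean of y]

print("One predictor:  ", with_predictor)
print("Zero predictors:", no_predictor)
```

In the zero-predictor case the normal equations collapse to a 
single equation whose solution is the sum of the response values 
divided by their number, which is the usual univariate mean.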

If we view the study of univariate distributions as a simple lim-
iting case of the study of relationships between variables, this 
helps to highlight the rigorous links (empirical and mathemati-
cal) between the two types of study.  This helps, in turn, to 
simplify the field of statistics. 

(Although univariate distributions are a special case of rela-
tionships between variables, it is also true that relationships 
between variables are a *generalization* of univariate distribu-
tions.  Thus why not teach univariate distributions first and 
then develop relationships as a generalization?  As I discuss in 
the paper [1998], I recommend that univariate distributions be 
omitted because univariate distributions are boring and of little 
obvious use to beginning students.  On the other hand, students 
find relationships between variables [when explained in terms of 
their ability to accurately predict and control] to be fascinat-
ing.) 


> or do you think that such questions are less important in an
> introductory statistics class?

For the sake of practicality, I believe the introductory statis-
tics course should emphasize the main statistical activities of 
empirical researchers.  If we survey empirical research (e.g., if 
we survey articles in recent issues of the multidisciplinary 
journals _Science_ and _Nature_), we quickly see that almost all 
empirical research projects can be reasonably viewed as studying 
relationships between variables.  And only a very few empirical 
research projects can be reasonably viewed as studying univariate 
distributions.  

(I believe most empirical research projects focus on relation-
ships because relationships [properly discovered] give more accu-
rate and more useful predictions [and control] than the predic-
tions given by univariate distributions.  I further discuss ways 
of viewing empirical research in appendix D.  I discuss some ex-
amples of empirical research projects that do NOT study relation-
ships between variables in a Usenet post [1997, appendix A].)

Since most empirical research projects can be reasonably viewed 
as studying relationships between variables, and not as studying 
univariate distributions, I believe that questions about univari-
ate distributions are less important in an introductory statis-
tics course.


I thank Gary Smith for his thought-provoking questions.  

-------------------------------------------------------
Donald B. Macnaughton   MatStat Research Consulting Inc
donmac@matstat.com      Toronto, Canada
-------------------------------------------------------


APPENDIX A: ARE RELATIONSHIPS "BETWEEN" OR "AMONG" VARIABLES?

Should we speak of

                a relationship between variables

                               or

                 a relationship among variables?

Most (but not all) empirical research projects (or logical compo-
nents of research projects) focus on a single response variable.  
In this majority case it makes sense to view the relationship un-
der study as being *between* the single response variable and the 
predictor variable(s).  This helps to emphasize the important 
distinction between the response variable and the predictor vari-
able(s) in the research, which the preposition "among" would 
downplay.  Thus for this majority case it is reasonable to use 
the preposition "between".

In multivariate situations (which tend to be rare), such as mul-
tivariate regression and multivariate analysis of variance, 
*several* (two or more) response variables participate simultane-
ously in the analysis (generally along with one or more predictor 
variables).  In these situations we can still view the response 
variable as being a single "variable", although in this case it 
is also a *vector*.  That is, it is mathematically feasible and 
reasonable to view multivariate situations with multiple response 
variables as having a single (vector-valued) response variable.

Viewing multivariate situations as having a single response vari-
able is also reasonable from an empirical research point of view 
because the response "variable" ought to represent some unity or 
property, even if it is a vector consisting of several individual 
variables.  If the response "variable" is just a random 
conglomeration of properties (even if all are properties of the 
same entities), there is no obvious empirical sense in using it 
as the response "variable" in an analysis.  

Thus in the majority univariate case and in the minority multi-
variate case (if we view the set of response variables as a [sin-
gular] vector) it is appropriate to use the preposition "between" 
in the phrase "relationship ... variables" because there is only 
one response "variable", and the relationship under study is 
*between* that variable and the predictor variable(s).


APPENDIX B:  PHYSICAL CONSTANTS VERSUS ENTITIES, PROPERTIES, AND 
RELATIONSHIPS

What is the relationship between the concept of a physical con-
stant and the concepts of entities, properties, variables, and 
relationships between variables?

Physical constants represent properties of entities like any 
other property except that physical constants are believed not to 
vary.  For example, in any instance of a light transmission, the 
light being transmitted travels at a certain speed.  This speed 
is a property of the (somewhat ethereal) "light" (or light wave) 
being transmitted.

Properties that are physical constants (such as the speed of 
light) are viewed as being "constant".  But if we go into a labo-
ratory and actually empirically measure the value of a physical 
constant, it will seem not constant at all.  For example, if we 
repeatedly measure the speed of light in a vacuum as accurately 
as possible, we will find that we get a different value for the 
speed almost every time we measure it.  (However, if we use mod-
ern instruments and are careful, the values will be *very close 
together*.)  Since we get a different value almost every time we 
measure the speed of light in a vacuum, we are not (directly) 
studying a constant but are instead studying a (quite narrow) 
univariate distribution.  

However, physicists dismiss the small variation in the estimates 
of the speed of light in a vacuum as being due to inaccuracies in 
the measuring instruments.  This dismissal is reasonable because 

- the small amount of variation in the measured values is commen-
  surate with the error rates of the instruments used to measure 
  the speed and

- so far, physicists have been unable to find any evidence of a 
  relationship between the speed of light in a vacuum and any 
  other variable.  

Thus, despite the variation in the measured values, physicists 
have inferred that the speed of light in a vacuum is constant.  

It is conceivable that someday, when sufficiently sensitive meas-
uring instruments are available, and when a physicist chooses the 
appropriate predictor variable, say P, he or she will find that 
the speed of light in a vacuum does depend (likely only to a 
small degree) on P.  However, until such empirical evidence is 
brought forward, the principle of parsimony (which tells us to 
keep things as simple as possible) dictates that we assume that 
the speed of light in a vacuum is (a physical) constant.

(Musser [1998] discusses a research project proposed by Giovanni 
Amelino-Camelia and others to test the hypothesis that the speed 
of light in a vacuum is not constant but instead depends slightly 
on another variable, namely the wavelength of the light.)


APPENDIX C:  THE ROLE OF CONSTANTS (PHYSICAL AND OTHERWISE) IN 
EMPIRICAL RESEARCH

Where do physical constants and other constants fit in empirical 
research?

Only a small proportion of empirical research (probably less than 
one percent) is *directly* involved in estimating the values of 
physical constants (or in estimating what might be viewed as the 
equivalent of physical constants in other branches of empirical 
research).  Instead, as I suggest above, most empirical research 
can be easily viewed as directly studying relationships between 
variables.

On the other hand, much empirical research is *indirectly* in-
volved in estimating the values of constants, since these con-
stants are equivalent to the parameters in models of relation-
ships between variables.  The (constant) values of the parameters 
in the models of the relationships are important aids to the mod-
eling.


APPENDIX D:  WAYS OF VIEWING EMPIRICAL RESEARCH PROJECTS

I refer above to how we can *view* empirical research projects.  
That is, I do not say that research projects *are* a certain way 
-- I say they *can be viewed* in various ways.  In particular, I 
suggest above that it is efficient to view most empirical re-
search projects as studying relationships between variables.  

But there are other ways of viewing empirical research projects, 
for example

- We can view some research projects NOT as studying relation-
  ships between variables but as studying differences between 
  subpopulations.  I discuss in a paper why this point of view is 
  less inclusive than viewing research projects as studying rela-
  tionships between variables (1996, appendix B.4).

- We can view any research project that ostensibly studies a re-
  lationship between variables as "really" studying the univari-
  ate distribution of the response variable.  That is, we focus 
  on the response variable and we view the predictor variables as 
  variables that may have an "effect" on the univariate distribu-
  tion of the response variable, without invoking the concept of 
  a relationship between variables.  However, this approach makes 
  it difficult to explain the use of "models" or "model equa-
  tions", which are ubiquitous in empirical research.  What is a 
  model if it is not a model of a relationship between variables?  

- We can view many empirical research projects in terms of deter-
  mining the values of constants, where the constants are the 
  constant values of the parameters in models.  But once again we 
  have models, and we must answer what the models are models of. 

These points support the notion that the most efficient way to 
view most empirical research projects is in terms of the study of 
relationships between variables.


REFERENCES

Macnaughton, D. B. 1996. "The introductory statistics course: A 
   new approach."  This paper is available at 
   http://www.matstat.com/teach/

Macnaughton, D. B. 1997. "Re: How should we *motivate* students 
   in intro stat? (response to comments by John R. Vokey)."  
   Posted to sci.stat.edu and EdStat-L on April 6, 1997 and re-
   vised on June 1, 1997.  Available at 
   http://www.matstat.com/teach/p0024.htm

Macnaughton, D. B. 1998. "Eight features of an ideal introductory 
   statistics course."  This paper is available at 
   http://www.matstat.com/teach/

Musser, G. 1998. "String instruments: String theory may soon be 
   testable." _Scientific American,_ 279 (4) (October), 24-28.
