```
Subject: Re: Eight Features of an Ideal Intro Stat Course
(Response to comments by Herman Rubin)

To: EdStat-L and sci.stat.edu

From: Donald B. Macnaughton <donmac@matstat.com>

Date: Sunday May 16, 1999

Cc: Herman Rubin <hrubin@b.stat.purdue.edu>
```

```
Quoting a 98/7/23 post of mine, Herman Rubin writes (on 98/8/3):

> Donald Macnaughton ... wrote:
>
>> In a July 17 post I recommend that teachers emphasize the con-
>> cept of a relationship between variables and I recommend
>>
>>    a de-emphasis of less important topics such as univariate
>>    distributions ...
>
> As such, I agree about the point on univariate distributions.
> One does not need a catalog of the standard ones, nor [does one
> need to] be adept at calculating them.
>
> HOWEVER, on consideration of the actual problems, they are an
> essential tool.

I fully agree that univariate distributions are an essential tool
in actual statistical problems -- most statistical analyses de-
pend directly on concepts of univariate distributions.

However, as Herman may agree, the ubiquity of univariate distri-
butions in statistical analyses does NOT speak to whether a
teacher should discuss univariate distributions near the begin-
ning of an introductory statistics course when the course is
aimed at students who are NOT majoring in statistics.  In two
Usenet posts (1998a, 1999a) I explain why I recommend that
discussion of univariate distributions be omitted near the
beginning of such courses.

>
> The real problem is, what is needed to discuss relations?

Herman is using the term "relations" to refer to what I call "re-
lationships between variables".  I compare the terms "relation-
ship" and "relation" in appendix A.

I agree with Herman that an important problem is to clarify the
conceptual underpinnings of relation(ship)s between variables.

>
>>   ( snip )
>> For example, students in high school physics courses learn
>> about the relationship between acceleration (a) and force (f)
>> with the model equation
>>
>>                            f = ma
>>
>> where m is the mass of the body being accelerated.
>
> I agree that there is not TOO much problem with understanding
> this, but the way algebra is taught, I would not be that sure.

Many students seem to understand the relationship between vari-
ables implied by f = ma (Isaac Newton's second law of motion).
Students also understand other similar model equations they study
in science classes.  But, unfortunately, most students do NOT
seem to understand the broad importance of the *general* concept
of a relationship between variables.

(In three papers [1996, 1998b, 1999b] I discuss an approach to
teaching students the concept of a relationship between
variables.)

>
> But do RANDOM VARIABLES have relations like this?  Taking the
> classical Galton observations on heights of fathers and heights
> of sons, there is no such relation.

Herman's conclusion that there is no relation(ship) between the
heights of fathers and the heights of sons in the population of
families from which Galton (1886; 1889, chaps. 6-7) drew his
sample depends on how we define the concept of a relationship
between variables.  Herman recognizes this and proposes two types
of definition:

>
> The user who understands statistical problems is likely to for-
> mulate a relation as either a multivariate distribution or as a
> conditional univariate distribution, most of the time with un-
> known parameters.

I agree that it is possible to define the concept of 'relation-
ship between variables' in terms of multivariate or conditional
univariate distributions.  However, another, simpler way of
defining the concept is also available -- in terms of conditional
expected value:

   DEFINITION:  There is a *relationship* between the variables
   x and y if, for at least one value x' of x,

        E(y|x') ~= E(y)                        (1)

   where

        E(*) is the expected value operator,

        E(y|x') is the expected value of y given that x has
        the value x', and

        ~= stands for "is not equal to".

Defining the concept of 'relationship between variables' in
terms of conditional expected value leads to a simpler definition
than the definitions Herman proposes above because the expected-
value approach replaces the complicated concept of 'distribution'
with the simpler concept of 'expected value'.

Herman implies above that there is no relation(ship) between the
heights of the fathers and the heights of the sons in Galton's
population.  However, under the definition I give above, it can
be easily shown (in terms of a low p-value in a statistical test)
that there IS a "relationship" in Galton's population between the
heights of the fathers (x) and the heights of the sons (y).

(Although it is not necessary to take account of the concept of a
distribution in the definition of a relationship between vari-
ables, if we wish to *perform the statistical test* I refer to
above to check whether there is convincing evidence in Galton's
data of a relationship between the heights of the fathers and the
heights of the sons, we do need to take account of the distribu-
tions of the values of the response variable [i.e., y = "height
of the son"] for given values of the predictor variable [x = "height
of the father"].  Of course, much statistical machinery is avail-
able to take account of these distributions in performing the
statistical test.)

(I discuss issues pertaining to the choice of an appropriate sta-
tistical test for the Galton data in appendix B.)

>    ( snip )
> Considering the problems with interpreting multivariate data,
> not starting with a distributional type of assumption, even if
> the form of the distribution is largely unspecified, is likely
> to lead to quite inappropriate analysis.

Although the definition above of the concept of a relationship
between variables makes no reference to distributions, it leads
(as far as I can see) to fully appropriate analyses.  Further-
more, since the approach makes no use of multivariate distribu-
tions, it bypasses all "the problems with interpreting multivari-
ate data" Herman refers to.

I further discuss defining the concept of 'relationship between
variables' in terms of conditional expected value (and I propose
a definition of "expected value") in a paper (1996, sec. 7.10).

I thank Herman for his thought-provoking comments.

-------------------------------------------------------
Donald B. Macnaughton   MatStat Research Consulting Inc
-------------------------------------------------------

APPENDIX A: TERMINOLOGY: SHOULD IT BE "RELATIONSHIP" OR
"RELATION" BETWEEN VARIABLES?

In a paper I discuss whether we should use the preposition "be-
tween" or the preposition "among" in the phrase "relationship ...
variables" and I conclude that "between" is preferred in most
situations (1999b, app. C).

Similarly, following Herman's remarks above, we can ask whether
the phrase should be:

     relationship between variables

or

     relation between variables.

To help resolve this issue of terminology, let me first present
some dictionary definitions of the terms "relationship" and "re-
lation" since these definitions show how the terms are commonly
used by speakers of English.

The second edition of the Oxford English Dictionary (OED) defines
the relevant senses as:

   relationship
      The state of being related; a condition or character based
      upon this; kinship.

   relation
      3. That feature or attribute of things which is involved in
         considering them in comparison or contrast with each
         other; the particular way in which one thing is thought
         of in connexion with another; any connexion,
         correspondence, or association, which can be conceived
         as naturally existing between things.

Note that the OED lexicographers define the relevant sense of a
relationship first as a state, second as a condition, and last as
a character (i.e., a property).  On the other hand, they define
the relevant sense of a relation first as a property (feature or
attribute), second as a way of thinking, and last as a condition
or state (connexion, correspondence, or association).

The 1993 Random House Unabridged Electronic Dictionary defines
the relevant senses of the two terms as:

   relationship
      1. a connection, association, or involvement.

   relation
      1. an existing connection; a significant association
         between or among things: "the relation between cause and
         effect".

These definitions suggest that the Random House lexicographers do
not see much difference between the two terms.

Merriam-Webster's Collegiate Dictionary (tenth edition, 1993) de-
fines the relevant senses of the two terms as:

   relationship
      1. the state of being related or interrelated <studied the
         *relationship* between the variables>

   relation
      2. an aspect or quality (as resemblance) that connects two
         or more things or parts as being or belonging or working
         together or as being of the same kind <the *relation* of
         time and space>; specifically : a property (as one
         expressed by "is equal to", "is less than", or "is the
         brother of") that holds between an ordered pair of
         objects

Since the Merriam-Webster lexicographers actually cite the phrase
"relationship between variables", it is clear which word they
view as being more naturally used in the phrase.  Note that the
Merriam-Webster definitions and the OED definitions are essen-
tially the same -- a relationship is mainly a state and a rela-
tion is mainly a property (feature, attribute, aspect, or qual-
ity).

It seems more reasonable to me to view a relation(ship) between
variables as a *state* or condition than to view it as a property
of the situation, although the latter point of view is possible.
Thus the dictionary definitions (as they reflect common usage)
suggest to me that the word "relationship" is more appropriate
than the word "relation" for use in the phrase "relation(ship)
between variables".

However, the word "relation" is shorter than "relationship",
which I (as a writer) view as a significant advantage.  Also, the
use of the phrase "relation between variables" does not seem to
lead to confusion or misunderstanding.  Thus although I believe
the term "relationship" is currently preferred, it seems possible
(and reasonable) that idiom will migrate to the phrase "relation
between variables".

My informal sense of the frequency of use of the two terms in
relevant statistical contexts is that the term "relationship" re-
ceives substantially more use than the term "relation", but the
latter term is used by several writers of note.  For example, the
term "relation" occurs at several places in an important book ed-
ited by John Bailar and Frederick Mosteller (1992, pp. 27, 215,
294, 296, 306, 328), although they also allow their authors to
use the term "relationship" (pp. 10-11).

APPENDIX B: TESTING FOR A RELATIONSHIP BETWEEN VARIABLES IN
GALTON'S DATA

I suggest above that we can perform a statistical test on Gal-
ton's data to determine whether there is evidence of a relation-
ship in the population between the heights of the fathers (x) and
the heights of the sons (y).  Note that in actually performing
such a test we need not directly test the inequality stated above
in (1), since we can easily derive from (1) other conditions that
are sufficient for it and that we can test.  If any of these
other conditions is satisfied, we can easily show that (1) is
also satisfied.

For example, we can test whether

     E(y|x1) ~= E(y|x2)                    (2)

where we might choose x1 and x2 to be as far apart as possible
since (if the relationship is strictly monotonic, as many
relationships are) this will give us (with other things being
equal) a more powerful test of the existence of the relationship
than if we use x1 and x2 closer together.  If we can show that (2)
is satisfied, it follows that (1) is also satisfied, because
E(y|x1) and E(y|x2) cannot both equal E(y) if they differ from
each other.

Alternatively, if there is no compelling evidence that the best
line for the relationship between the two variables is not a
straight line, we can fit a straight line to the data and then
test the hypothesis that the slope of the line in the population
is zero.  If the data allow us to reject this hypothesis, it is
easy to show that (1) is satisfied, and thus we can conclude that
a relationship exists between the two variables.

On the other hand, if there is good evidence that the best line
is NOT straight, this is also evidence of a relationship between
the two variables in the sense that it also implies that (1) is
satisfied.

(NOTE:  In his 1886 and 1889 works, Galton focuses on the rela-
tionship between the height of the "mid-parent" and the height of
the son, where the height of the mid-parent is a weighted average
of the heights of the mother and father.  Galton does not focus
in these works on the relationship between the height of the
*father* and the height of the son.  However, the points Herman
and I discuss above are independent of whether we view "father's
height" or "mid-parent's height" as being the predictor variable
in the example.)

REFERENCES

Bailar, J. C., III, and Mosteller, F., eds. 1992. _Medical uses
of statistics._ 2d ed. Boston: NEJM (New England Journal of
Medicine) Books.

Galton, F. 1886. "Regression towards mediocrity in hereditary
stature." _Journal of the (Royal) Anthropological Institute,_
15, 246-263.

Galton, F. 1889. _Natural inheritance._ London: Macmillan.

Macnaughton, D. B. 1996.  "The entity-property-relationship ap-
proach to statistics: An introduction for students."  Avail-
able at http://www.matstat.com/teach/

Macnaughton, D. B. 1998a. "Re: Eight features of an ideal intro
stat course (response to comments by Dennis Roberts, Mark
Myatt, Rolf Dalin, Gary Smith, and Rossi Hassad)."  Posted to
sci.stat.edu and EdStat-L beginning on July 23, 1998.  Avail-
able at http://www.matstat.com/teach/

Macnaughton, D. B. 1998b.  "Eight features of an ideal introduc-
tory statistics course."  Available at
http://www.matstat.com/teach/

Macnaughton, D. B. 1999a. "Re: Eight features of an ideal intro
stat course (response to comments by Dennis Roberts and Karl
L. Wuensch)."  Posted to sci.stat.edu and EdStat-L on May 2
and May 9, 1999.  Available at http://www.matstat.com/teach/

Macnaughton, D. B. 1999b.  "The introductory statistics course:
The entity-property-relationship approach."  Available at
http://www.matstat.com/teach/

```
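
The straight-line test described in appendix B can be sketched in Python along the following lines. This is only an illustration under stated assumptions, not an analysis of Galton's data: the heights are simulated, the population slope of 0.5 and the noise level are arbitrary choices, and the function names `simulate_heights` and `slope_test` are hypothetical. The p-value uses a normal approximation to the t distribution, which is a simplification that is reasonable only for large samples.

```python
import math
import random
from statistics import NormalDist, mean


def simulate_heights(n, slope, seed):
    """Return (x, y): simulated father and son heights in inches.

    Hypothetical data: E(y|x) = 68 + slope * (x - 68), so a nonzero
    slope means E(y|x') differs from E(y) for some x', as in (1).
    """
    rng = random.Random(seed)
    x = [rng.gauss(68.0, 2.5) for _ in range(n)]
    y = [68.0 + slope * (xi - 68.0) + rng.gauss(0.0, 2.0) for xi in x]
    return x, y


def slope_test(x, y):
    """Fit y = a + b*x by least squares and test H0: b = 0.

    Returns (b, z, p), where z is the slope divided by its standard
    error and p is a two-sided p-value from a normal approximation.
    """
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                      # least-squares slope
    a = my - b * mx                    # least-squares intercept
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    z = b / se_b
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return b, z, p


x, y = simulate_heights(n=1000, slope=0.5, seed=1)
b, z, p = slope_test(x, y)
print(f"estimated slope = {b:.3f}, z = {z:.2f}, p = {p:.3g}")
```

If the p-value is low, we reject the hypothesis that the population slope is zero, which in turn implies that E(y|x) varies with x and thus that (1) is satisfied. For small samples the t distribution with n - 2 degrees of freedom should replace the normal approximation.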