% LaTeX source for Bowley, Elements of Statistics, Chap X
\documentclass{article}

\usepackage{amsmath}
\usepackage{times}

\newcommand{\z}{\phantom{0}}

\begin{document}

\setcounter{page}{1}

\begin{center}
  \Large{\textit{\textbf{KARL PEARSON'S APPROACH TO $\chi^2$}}}
\end{center}
  
\begin{center}
  CHAPTER X. \\
  \ \\
  \textit{TESTS OF CORRESPONDENCE BETWEEN DATA AND FORMUL\AE.}
\end{center}

\textsc{In} the general method of the representation of observations
by a mathematical formula, the question must arise how the adequacy of
the formula is to be tested, or, as it is frequently phrased, a test of
the goodness of fit is required. 

Consider for example the table used above (p. 310) of the weekly
expenditure on food per ``unit'' in 970 families. 
{\small
\begin{center}
  \begin{tabular}{lccccc}
    & \multicolumn{1}{c}{$m'$} 
    & \multicolumn{1}{c}{$m$}  \\
      \multicolumn{1}{c}{Expenditure.} &
      \multicolumn{1}{c}{number of} &
      \multicolumn{1}{c}{calculated} &
      \multicolumn{1}{c}{$e=m\sim m'$} &
      \multicolumn{1}{c}{Standard} &
      \multicolumn{1}{c}{$\underline{e_{\phantom{1}}^2}$} \\
    & \multicolumn{1}{c}{cases.} &
      \multicolumn{1}{c}{numbers.} &
      \multicolumn{1}{c}{difference.} &
      \multicolumn{1}{c}{deviations.} &
      \multicolumn{1}{c}{$m$} \\
    Not exceeding $5.5s$\dotfill & \z18  & \z22 & \z4 & \z4.6 & \z\z.7 \\
    $\phantom{0}5.5$\dotfill     & 107   & 123  & 16  & 10.4  & \z2.1  \\
    $\phantom{0}7.5$\dotfill     & 255   & 234  & 21  & 13.3  & \z1.9  \\
    $\phantom{0}9.5$\dotfill     & 245   & 249  & \z4 & 13.6  & \z\z.1 \\
    $11.5$\dotfill               & 173   & 168  & \z5 & 11.8  & \z\z.1 \\
    $13.5$\dotfill               & 101   & \z89 & 12  & \z9.0 & \z1.6  \\
    $15.5$\dotfill               & \z38  & \z51 & 13  & \z7.0 & \z3.3  \\
    $17.5$\dotfill               & \z17  & \z22 & \z5 & \z4.6 & \z1.1  \\
    $19.5$\dotfill               & \z\z9 & \z11 & \z2 & \z3.3 & \z\z.4 \\
    Over $21.5$\dotfill& \underline{\z\z7} & \underline{\z\z1} & 
      \underline{\z\z6} & \underline{\z\z?\z} & \underline{36.0} \\
    \multicolumn{1}{c}{Totals} & 970 & 970 & 88 & --- & $47.3$
  \end{tabular} 
\end{center} 
}

The calculated numbers are from the second approximation to the Law of
Great Numbers.  A rough method formerly used was to add the differences
between the calculated numbers and the numbers observed in each
compartment, irrespective of sign, and to express this total as a
percentage of the number of cases. The ``percentage misfit'' thus
calculated is $88 \div 9.70 = 9.1$ per cent. 

The weakness of this method is that it is not related to any measurement
of probability, and one cannot tell at sight whether the fit is good or
not. Of two competing formul\ae, the presumption is that that which gives
the lower percentage misfit is the better; also when we have several
sets of similar  observations we can tell roughly by this method which
is nearest to the formula, and in some cases in which set the
observations are most regular. 

The percentage misfit is generally diminished if compartments are merged
together. 

As regards the contents of individual compartments, we already have a
simple test. If $m_t$ is the calculated number in a compartment when
there are $N$ observations in all, the chance of finding $m_t + e_t$
observations in this compartment in 
\[ \frac{1}{\sigma\sqrt{\pi}}e^{-\frac{1}{2}.\frac{e_t^2}{\sigma^2}}
   \text{\ (formula (19)) where\ }
   \sigma^2=\frac{m_t}{N}\left(1-\frac{m_t}{N}\right)N, \]
and the probability of exceeding any assigned multiple or sub-multiple
of $\sigma$ is given by the table (p.\ 271). The standard deviation for 
each grade in the above example except the last is given, and it is seen 
that four out of nine errors are less than $\sigma$, their standard 
deviation, two are between $\sigma$ and $\frac{3\sigma}{2}$ and the 
remaining three less than $2\sigma$. No separate measurement is 
improbable, and therefore the whole grouping may be presumed to be not
improbable, except the final number, 7 above 
$21.5s$. 

That numbers in extreme grades should be discontinuous in relation to
middle grades is common in many classes of observations. 

The deviations are not independent, however, since their total must be
zero; and even if the deviation in one compartment taken by itself is
improbably large, it may yet not be improbable when all the compartments
are considered. A measurement which allows for this modification has
been devised by Professor Pearson, and part of the analysis in a 
simplified form, a brief table of the results, and some applications are
given in the following paragraphs (see \textit{The Philosophical 
Magazine}, No.\ 302, July, 1900, pp.\ 157--175). 

Suppose that a formula, which is presumed to represent the distribution
of observations, leads to the expectation of $m_1$, $m_2$ \dots $m_n$
observations in $n$ grades or compartments, when $N, = m_1 + m_2 +
\dots + m_n$, is the whole number of observations. 

In an experiment or group of observations. suppose that $(m_1 + e_1) \dots
(m_t + e_t) \dots (m_n + e_n)$ are found in the compartments, so that 
$e_1 + \dots + e_t + \dots + e_n = 0$. 

Write $p_1=\frac{m_1}{N} \dots p_t=\frac{m_t}{N}$ \dots.

Then $p_t$ is the chance that an observation from a group satisfying
perfectly the formula will fall into the $t^{\text{th}}$ grade. 

The chance that $m_t + e_t$ will fall into this grade when $N$ are 
chosen at random from an indefinitely large universe is 
\[ \frac{1}{\sigma\sqrt{\pi}}e^{-\frac{1}{2}.\frac{e_t^2}{\sigma_t^2}}, \]
where \quad $\sigma_t^2 = p_t(1 - p_t)N = p_tq_tN$, where $q_t=1-p_t$. 

It can be shown that the joint chance of the errors named is 
\[ Ke^{-\frac{1}{2}\text{X}^2},
   \text{\ where\ }
   X^2=\text{S}.\frac{e_t^2}{m_t},
   \text{\ and\ }Se_t=0, \]
$K$ being a constant. 

For, if there were only two compartments, $e_1 + e_2 = 0$, and the joint
chance equals the chance of either. 

Then \qquad $p=\frac{m_1}{N}$, $q=\frac{m_2}{N}$, $m_1+m_2=N$.

The chance is 
\[ \frac{N^{\frac{1}{2}}}{\sqrt{2\pi m_1m_2}}
   e^{-\frac{1}{2}\left(\frac{e_1^2}{m_1}+\frac{e_2^2}{m_2}\right)},
   \text{\ since\ }
   \frac{e_1^2N}{m_1m_2}=\frac{e_1^2(m_1+m_2)}{m_1m_2},
   \text{\ and\ }e_1^2=e_2^2. \]

If there are \textit{three} compartments 
\[ e_1+e_2+e_3=0,\quad m_1+m_2+m_3=N,\quad
   \sigma_1^2=\frac{m_1}{N}.\frac{m_1+m_2}{N}.N, \]
and similarly for $\sigma_2^2$ and $\sigma_3^2$.
\[ 2e_1e_2=e_3^2-e_1^2-e_2^2. \]
$r\sigma_1\sigma_2=
\text{mean\ }e_1e_2=\frac{1}{2}(\sigma_3^2-\sigma_1^2-\sigma_2^2)$
\begin{align*}
  &=\frac{1}{2N}\{m_3(m_1+m_2)-m_1(m_2+m_3)-m_2(m_1+m_3)\} \\
  = -\frac{m_1m_2}{N}.\quad &\text{(Compare p. 419.)}
\end{align*}

The chance of the concurrence of $e_1$ and $e_2$, and therefore of $e_3$
also, is given by the normal correlation surface as 
\[ \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}}
   e^{-\frac{1}{2(1-r^2)}\left(\frac{e_1^2}{\sigma_1^2}+
    \frac{e_1^2}{\sigma_1^2}-\frac{2re_1e_2}{\sigma_1\sigma_2}\right)}. \]

Now 
\[ \sigma_1^2\sigma_2^2(1-r^2)=
   \frac{m_1m_2(m_2+m_3)(m_1+m_3)}{N^2}-
   \frac{m_1^2m_2^2}{n^2}=
   \frac{m_1m_2m_3}{N}, \]
since $m_1+m_2+m_3 = N$. 

Hence the index of $e$ is 
\begin{align*}
  &\qquad\qquad -\frac{N}{2m_1m_2m_3}(e_1^2\sigma_2^2+e_2^2\sigma_1^2
   -2r\sigma_1\sigma_2e_1e_2) \\
  &=-\frac{N}{m_1m_2m_3}\left\{\frac{e_1^2m_2(m_1+m_3)}{N}
   +\frac{e_2^2m_1(m_2+m_3)}{N}+\frac{2e_1e_2m_1m_2}{N}\right\} \\
  &=-\frac{1}{2m_1m_2m_3}\left\{(e_1+e_2)^2m_1m_2+e_1^2m_2m_3+e_2^2m_1m_3
    \right\} \\
  &=-\frac{1}{2}\left(\frac{e_1^2}{m_1}+\frac{e_2^2}{m_2}+\frac{e_3^2}{m_3}
    \right),
    \text{\ since\ }e_1+e_2=-e_3. 
\end{align*}

Now if the second and third compartments had been merged into one
containing $M + E$ observations, where $M=m_2+m_3$ and $E = e_2 + e_3$,
the chance would have been 
\[ K_1e^{-\frac{1}{2}
  \left(\frac{e_1^2}{m_1}+\frac{E_{\phantom{1}}^2}{M}\right)}, \]
where $K_1$ is a constant.

The effect, therefore, of dividing the second compartment without
changing the first is to alter the constant and to replace $\frac{E^2}{M}$
by $\frac{e_2^2}{m_2}+\frac{e_3^2}{m_3}$ in the index.

Similarly if two compartments are given, the effect of dividing the
third compartment without changing the first two must be to  alter the
constant and to replace $\frac{e_3^2}{m_3}$ by $\frac{e_3^2}{m_3}+
\frac{e_4^2}{m_4}$ in the index, and so on. 

Hence for $n$ compartments the chance, $P$, of errors $e_l$, $e_2$
\dots $e_n$. is
\[ Ke^{-\frac{1}{2}X^2}, 
   \text{\ where\ }
   X^2 = \frac{e_1^2}{m_1} + \frac{e_2^2}{m_2} + \dots +
   \frac{e_n^2}{m_n}, \]
and\qquad\qquad\qquad $e_1 + e_2 + \dots e_n = 0\dotfill(130)$. 

Notice that $X^2$ is the same expression as is used in obtaining the
coefficient of contingency. 

[A proof of the formula, without the above method of induction is given
by Pearson, by the use of the multiple correlation equation.] 

If the selections in the compartments had been independent and without
the condition that $e_1+e_2+\dots=0$, the chance would have been 
\[ Ke^{-\frac{1}{2}X^2}\times
   e^{-\frac{1}{2}\left(\frac{e_1^2}{S-m_1}+\frac{e_2^2}{S-m_2}+\dots\right)}
   \]
for the index would have been 
\[ -\frac{1}{2}\left(\frac{e_1^2N}{m_1(N-m_1)}+\dots\right)=
   -\frac{1}{2}\left(\frac{e_1^2}{m_1}+\frac{e_1^2}{N-m_1}+\dots\right). \]

If there are many compartments and the largest of the fractions
$\frac{m_t}{N}$ is small, the second part of the index is negligible 
$N$ compared with the first, and the two expressions tend to  equality,
and the effect of the correlation is small. 

The chance of the occurrences if there is no correlation is less than
that when there is correlation, since the last factor, if not
negligible, is less than 1. (The constant is eliminated in further
processes.) Hence the aggregation of uncorrelated chances, which is
simpler than the present method, gives, an unduly unfavourable view of
the appropriateness of a formula.

The chance of every system of errors that gives a particular value of $X^2$
is the same. Now, when the probability of.a deviation from the mean in
normal frequency is in question, it is customary to measure the
probability that so great a deviation to left or right should have
occurred, viz., 
\[ 2\int_z^{\infty} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}z^2}dz. \]
Similarly here we may measure the chance of the occurrence of the system
of errors or a less probable system by evaluating 
\[ 2\iint\dots Ke^{-\frac{1}{2}X^2}d_X, 
   \text{\ where\ }d_X\text{\ is written for\ }de_l.de_2\dots de_{n-1} \]
and the integral is $\overline{n-1}$ fold and extended from $X$ to 
$\infty$, with the condition $e_1+e_2+\dots+e_n=0$, $K$ being so chosen 
that
\[ \int_{-\infty}^{\infty} Ke^{-\frac{1}{2}X^2}d\ =1. \]

The existence of this condition makes the integration complicated, and
reference should be made to Pearson's original analysis for its working
out. 

The result is that 
\[ P=\sqrt{\frac{2}{\pi}}\int_X^{\infty}e^{-\frac{1}{2}X^2}.d_X
     +\sqrt{\frac{2}{\pi}}e^{-\frac{1}{2}X^2}
     \left(\frac{X}{1}+\frac{X^2}{1.3}+\dots+
     \frac{X^{n-3}}{1.3.5-\overline{n-3}}\right) \]
when $n$ is even, and 
\[ \qquad P=e^{-\frac{1}{2}X^2}\left(1+\frac{X^2}{2}+\dots+
   \frac{X^{n-2}}{2.4\dots n-3}\right)\text{\ when $n$ is odd.
    \qquad\qquad}(131) \]

A table of the values of P for various values of x2 and n is given in
\textit{Biometrika}, Vol.\ 1, pp.\ 155 \textit{seq}.   We can, in a very
brief form, obtain a working rule for determining whether a formula does
or does not adequately represent an observed group by picking out values
of $x^2$ which for a given $n$ make $P = \frac{1}{2}$ or slightly more,
or, further up the scale of improbability, make $P = .0455$ or slightly
less, which corresponds to twice the standard deviation in the normal
curve. 
\begin{center}
  \begin{tabular}{r|rl|rc}
    $n.$ & $X\,.$ & $P.$ & $X^2.$ & $P.$ \\
     3 &  1 & .61  &  6 & .050 \\
     4 &  2 & .57  &  8 & .046 \\
     5 &  3 & .56  & 10 & .040 \\
     6 &  4 & .55  & 12 & .035 \\
     7 &  5 & .54  & 13 & .043 \\
     8 &  6 & .54  & 15 & .036 \\
     9 &  7 & .54  & 16 & .042 \\
    10 &  8 & .53  & 18 & .035 \\
    11 &  9 & .53  & 19 & .040 \\
    12 & 10 & .53  & 20 & .045 \\
    13 & 11 & .53  & 22 & .038 \\
    14 & 12 & .53  & 23 & .042 \\
    15 & 13 & .53  & 24 & .046 \\
    16 & 14 & .526 & 26 & .038 \\
    17 & 15 & .525 & 27 & .041 \\
    18 & 16 & .524 & 28 & .045 \\
    19 & 17 & .523 & 30 & .037 \\
    20 & 18 & .522 &    &      \\
    25 & 23 & .520 &    &      \\
    30 & 28 & .518 &    &      
  \end{tabular}
\end{center}

{\footnotesize If $X^2 2n$, the improbability is considerable.}

\medskip

Strictly, the test should be applied using as many compartments as are
given by the observations, for the merging of compartments affects the
resulting value of $P$; but it is often difficult to get back to
ungraded observations, and in the case of continuous variables, such as
height, the original grades would be as fine as the measurements could
be made. 

A more serious difficulty is that in any compartment the observed
$m_t+e_t$ must be integral, while $m_t$ is in general not integral, and
some value of $e_t$ would be found in the most perfect representation.
In consequence, the number to be expected in the least occupied
compartment must be reasonably large, or we obtain spurious
contributions to $X^2$. This in practice rules out detailed extreme
compartments, and in their rejection or fusion an element of
arbitrariness is introduced and no fine measurement is possible. 

On the other hand, when we are testing the applicability of the normal
curve of error, or the general law of great numbers, based on
Edgeworth's hypothesis (p. 298--9), there is no expectation of
closeness of fit on absciss\ae beyond a small multiple of the standard
deviation---the smaller as the number of independent elements that
contribute to the measurement diminishes---so that the test is only
applicable to the well-occupied central compartments ; but in choosing
the extent over which the test is made, the fineness of the method is
lost. 

Hence, only a broad, but often sufficiently definite, result can be obtained. 

\begin{center}
  \textit{Illustrations.}
\end{center}

If we neglect the extreme grade in Example 7, on p.\ 310, $X^2 = 11.3$,
$n = 9$, $P = .18$, and the formula ``2nd approx.'' is adequate. 

If we take the Pearsonian formula, on the same page, $X^2 = 21.4$, $n =
9$,  $P = .006$, but if we exclude the lowest as well as the highest
grade,  $X^2 = 4.1$, $n = 8$, $P = .77$; hence this formula expresses
the central eight grades but not either extreme. 

The same conclusions are reached if we simply take the standard
deviations of the grades separately. 

In the table on p.\ 309 relating to the ages of school children, $n =
8$. The normal curve gives $X^2 = 16.7$ and $P = .02$, which is not
satisfactory. The second approximation, however, gives $X^2 = .47$ and
$P$ is indistinguishable from 1. 

In the experiment on the numbers of letters in words (pp.\ 305--6), the
sum of 10 words, graded by 5 letters, gives $n = 13$, and with the
normal curve $X^2 = 33$, $P = .001$, or omitting the lowest and two
highest extreme grades, $n = 10$, $X^2 = 6.1$, $P = .73$. The second
approximation, however, including all grades, gives $X^2 = 8.4$, $P =
.74$.                 

The sums of 100 words graded by 20 letters give $n = 10$, $X^2 = 2.96$,
$P = .965$ with the normal curve, and no further approximation can
improve on this. 

An example of a different kind is found, when a distribution found by
sample is compared with the whole group from which the sample is taken,
to verify the rules of sampling or the adequacy of the method. 

\begin{center}
  \begin{tabular}{ccccc}
    \multicolumn{5}{c}{NUMBER OF COMPANIES PAYING DIVIDENDS AT VARIOUS RATES.}
    \\
    &&Relative&& \\
    & Number in & numbers in all & Standard & $\underline{e_{\phantom{1}}^2}$ \\
    & sample & companies & deviation. & $m$ \\
    & $m$. & $m$. && \\
    Below 3 per cent.\dotfill & \z34 & \z30 & 5.3 & \z.53\\ 
    3 per cent.\dotfill & 108 & 108.8 & 8.9 & 0\phantom{.00} \\
    4 '' '' \dotfill & 117 & 124.4 & 9.3 & \z.44 \\
    5 '' '' \dotfill & \z60 & \z70.8 & 7.4 & 1.65 \\
    6 per cent.\ to 8 per cent.\dotfill & \z48 & \z43.2 & 6.2 & \z.53 \\
    8 per cent.\dotfill & \underline{\z33} & \underline{\z22.8} & 
      \underline{4.6} & \underline{4.57} \\
    & 400 & 400\phantom{.0} & & 7.72
  \end{tabular}
\end{center}

{\footnotesize Here $n = 6$, $X^2 = 7.72$, $P = .185$. The result is
fairly good, but spoilt by the highest grade.}

\medskip

This test has been applied to the distribution in two dimensions, in the
experiment tabulated on p.\ 394. 

The 24 squares, .3 to left and right of centre, and 2 above and below
it, which contain in theory 11 or more observations, were taken as
separate compartments. Outlying squares were grouped in the 9 regions
shown by the thick lines, rather arbitrarily, so as to get contiguous
squares which aggregated to at least 9 expected observations in the
second approximation. The results are as follows:---
\begin{center}
  \begin{tabular}{lcccc}
    & \multicolumn{2}{c}{Normal surface.}
    & \multicolumn{2}{c}{2nd approximation} \\
    & $X^2.$ & $P.$ & $X^2.$ & $P.$ \\
    24 central squares & 20.3 & .59\z & 17.5 & .79 \\
     9 outlying regions & 27.8 & & 10.1 \\
    33 regions & 48.6 & .035 & 27.6 & .59 
  \end{tabular}
\end{center}

The improvement in the outlying regions by the use of the second
approximation is very marked. 

\bigskip

\noindent
From: A~L~Bowley, \textit{Elements of Statistics} (4th edn), London:
P~S~King 1920.

\end{document}

%