Relative Age Effect, Goodness-of-fit, and Chi-square Tests
One of the more interesting books I read recently is "Outliers" by New Yorker magazine staff writer Malcolm Gladwell. The book opens with a story about Canadian hockey players and their birthdays. Why would anyone care about the birthdays of hockey players (unless you subscribe to astrology and the zodiac)? It turns out something strange is going on when you look at their birthdays by month. For example, the following graph shows the birthdays of professional hockey players from the two professional leagues in Canada:
Here is a list of players who currently play in the NHL, which shows a similar pattern:
Guess what: if you are born near the beginning of the year, then you will have a better chance of playing professional hockey than someone who is born near the end of the year! In the Canadian league, the percentage of hockey players who are born in November is merely 2%. So you might as well just give up if you are born in November.
But wait a second: could this weird result be explained by more couples favoring "spring babies"? If you look at the data on birth rates, this is certainly not the case. So we are led to the curious conclusion that somehow, kids who are born near the beginning of the year have an "advantage" over those who are born near the end of the year. The sociologists who discovered this phenomenon called it the "Relative Age Effect", which has since been found in various other sports, such as soccer leagues in France:
Since the Relative Age Effect was first noticed in sports, people quickly started to look for it in other domains, such as career success, suicide, and divorce. While this is a fascinating idea, and people have different theories of why such an effect exists, all of this research relies on the statistical analysis that we will study in this chapter: namely, the Goodness-of-fit test based on the $\chi^2$ (chi-square, pronounced "kai-square") statistic.
Why the name "Goodness-of-fit"? The name tells us how this particular test works: we have some type of model that predicts the frequencies in categorical data. In the hockey example, if babies are equally likely to be born in each month, then we would expect the same number of babies to be born each month. So if the actual frequencies differ significantly from the all-equal expectation, then the data do NOT "fit" our intuitive model, and we can use that as good evidence that the Relative Age Effect exists.
Let's consider a simpler scenario in which we divide the birthdays into just 4 seasons. Suppose we are interested in showing that the fractions of births in the four seasons are not all the same. This would be the alternative hypothesis $H_1$, since it contains the "unequal" statement. If we use $p_1$ for the proportion of hockey players born in spring, $p_2$ for summer, and so on, we can write the null and alternative hypotheses as follows:
· $H_0$: The same fraction of players is born during each season: $p_1 = p_2 = p_3 = p_4 = 1/4$
· $H_1$: The fractions of births during the four seasons are not all the same. (Claim)
There are many ways of stating these hypotheses. Another way would be to state that
· $H_0$: Observed Frequencies = Expected Frequencies
· $H_1$: Observed Frequencies ≠ Expected Frequencies (for at least one season, the observed frequency is different from the expected)
Don’t worry too much about stating the hypotheses; just make sure that the claim winds up in the right place.
Now suppose our data look like the following:
| Season | Number of Players |
|--------|-------------------|
| Spring | 8 |
| Summer | 6 |
| Fall | 12 |
| Winter | 7 |
We will need to translate this data and the null hypothesis $H_0$ into a single test statistic, the $\chi^2$. To start with, we will differentiate the "Observed Frequency" (or simply "O") from the "Expected Frequency" (or simply "E"), because the former is the actual data, and the latter is the frequency we would expect to get if the null hypothesis (that all the fractions are 1/4) were true. We find these expected frequencies by multiplying the proportion for each category (1/4 in this case, but they may have different values, as seen in the team homework) by the sample size $n$ (33 in this case): for each season, $E = 33 \times 1/4 = 8.25$. Here's the augmented table:

| Season | Observed Frequency (O) | Expected Frequency (E) |
|--------|------------------------|------------------------|
| Spring | 8 | 8.25 |
| Summer | 6 | 8.25 |
| Fall | 12 | 8.25 |
| Winter | 7 | 8.25 |

It's very true that you can't have 8.25 people born in a season, but don't round the expected frequencies to whole numbers, because that would throw off the calculations. The expected frequencies have to add up to $n$ (except when rounding affects the sum slightly), which makes for a good check.
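If you'd like to double-check this step with a few lines of code, here is a minimal sketch in Python (our choice of language for illustration; the section itself uses GeoGebra and the TI calculator) that computes the expected frequencies under the null hypothesis and verifies the sum:

```python
# A minimal sketch (illustrative, not part of the original lesson):
# compute the expected frequencies under H0 and check that they add to n.
observed = {"Spring": 8, "Summer": 6, "Fall": 12, "Winter": 7}
n = sum(observed.values())                   # sample size, 33 here

p = 1 / 4                                    # H0: each season gets 1/4
expected = {season: n * p for season in observed}

print(expected)                              # each season: 8.25
print(sum(expected.values()) == n)           # True -- the sanity check
```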
We're working up to the $\chi^2$ statistic here, and the next step is to calculate each category's contribution. It has a formula with which you'll become very familiar:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Take the difference between the observed and the expected frequencies for a category, square that difference, and divide by the expected frequency. For Spring, it would be $(8 - 8.25)^2 / 8.25 \approx 0.008$. You will do that for each season, and add up all four seasons. The following table shows all the calculations leading up to the $\chi^2$ statistic:

| Season | O | E | $(O - E)^2 / E$ |
|--------|---|---|-----------------|
| Spring | 8 | 8.25 | 0.008 |
| Summer | 6 | 8.25 | 0.614 |
| Fall | 12 | 8.25 | 1.705 |
| Winter | 7 | 8.25 | 0.189 |
| Total | 33 | 33 | 2.516 |
Notice you can do this calculation in GeoGebra or on your TI calculator, but it's nice to have an intuitive sense of why the $\chi^2$ is large or small.
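Here is the same calculation as a short Python sketch (again, just an illustration; GeoGebra or the TI calculator gives the same numbers), so you can see each season's contribution separately:

```python
# Compute each season's (O - E)^2 / E contribution, then sum them up.
seasons = ["Spring", "Summer", "Fall", "Winter"]
observed = [8, 6, 12, 7]
expected = [8.25] * 4                        # 33 * 1/4 for each season

contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
for season, c in zip(seasons, contributions):
    print(f"{season}: {c:.3f}")              # Fall contributes the most

chi_square = sum(contributions)
print(f"chi-square = {chi_square:.3f}")      # 2.515 (2.516 if you round each term first)
```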
The sum, $\chi^2 = 2.516$, is the value of the test statistic for the goodness-of-fit test. We use it to find the likelihood that, if the null hypothesis were true, a group of 33 would produce frequencies as different from the expected as our group did, or even more different. This likelihood has exactly the same interpretation as the P-value we previously used in other hypothesis tests. Goodness-of-fit tests are almost always right-tailed. This is because if, say, the observed frequencies were exactly the same as the expected, every difference $O - E$ would be zero, as would every contribution $(O - E)^2 / E$ and the $\chi^2$ statistic itself. The more different the observed frequencies are from the expected, the bigger the $\chi^2$.
But how many degrees of freedom are there for this $\chi^2$? If you thought 32, then you made a smart mistake, because you concluded from previous work that the degrees of freedom are one less than the sample size, which was 33. However, in goodness-of-fit tests, the degrees of freedom are one less than the number of categories, which we label $k$. In this case, with four seasons, $k = 4$ and $df = k - 1 = 3$. So there are three degrees of freedom.
To find the P-value, we use the $\chi^2$ calculator in GeoGebra, as shown in the screenshot below. Since the $\chi^2$ statistic cannot take on any negative values (it only involves adding non-negative numbers), all the P-values under the $\chi^2$ distribution will be right-tailed:
As you can tell from the graph, the P-value $\approx 0.473$. The P-value has the following interpretation:
· Assuming that people are born into the 4 seasons with the same frequency ($p_1 = p_2 = p_3 = p_4 = 1/4$), the probability of getting a $\chi^2$ statistic greater than 2.516 from a sample of 33 people is about 47.3%.
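If you'd rather verify the GeoGebra result in code, here is a sketch using SciPy (an assumption on our part; any chi-square calculator works just as well):

```python
# Right-tail area of the chi-square distribution with df = k - 1 = 3,
# above the observed test statistic.
from scipy.stats import chi2

chi_square = 2.516                 # test statistic from the table above
df = 3                             # 4 seasons - 1

p_value = chi2.sf(chi_square, df)  # sf = survival function (right tail)
print(f"P-value = {p_value:.3f}")  # about 0.473
```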
Notice we cannot be as specific as we were with the P-value when we compared two means or two proportions. This is because the chi-square statistic compares two frequency distributions with each other (the Observed vs. the Expected). The chi-square statistic will be large when there is a significant discrepancy between the two, but it's hard to pinpoint which season was the culprit, unless you break down the $\chi^2$ by using the table we showed above.
So what is our conclusion, then? The P-value is huge, i.e., there's more than enough probability that seasons with equal births would produce a sample as unbalanced as ours, or even more unbalanced. We FAIL TO reject the null hypothesis. To summarize: there is not sufficient evidence to support the claim that the fractions of births during the four seasons are not all the same. We failed to make our case because the frequencies are not dramatically different, considering that our sample is relatively small. Let's ask for more research money so we can collect a larger sample!
If you would like to go with the traditional method, then you can use the $\chi^2$ calculator in GeoGebra as well. Just enter the significance level $\alpha$ as the (right-tail) area this time and solve for the critical value (remember it's always right-tailed, so no more worries about dividing $\alpha$ by 2).
In this case, the test statistic clearly lives outside the critical region, so we will reach the same conclusion: we do not reject $H_0$.
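Here is the critical-value calculation as a sketch, assuming the usual significance level of $\alpha = 0.05$ (the section does not state one explicitly):

```python
# Find the chi-square critical value that puts alpha in the right tail.
from scipy.stats import chi2

alpha = 0.05                              # assumed significance level
df = 3

critical_value = chi2.ppf(1 - alpha, df)  # ppf = inverse of the CDF
print(f"critical value = {critical_value:.3f}")   # about 7.815

# Our statistic, 2.516, is well below 7.815, so it falls outside the
# critical region and we do not reject H0 -- the same conclusion.
```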
Let's go back to the hockey example that led us to the discussion of the $\chi^2$ test. If you do the same type of analysis that we just did on the data presented at the beginning, you will be able to produce a definitive answer to the question "is there a Relative Age Effect in pro hockey?" (Hint: the answer is yes, but you should do it as an exercise.) There are ways that we can improve our test as well: for example, instead of using the "all-equal proportions" null hypothesis, we could use the actual proportions of the general population as the $p_i$ values in $H_0$ to calculate the Expected frequencies. For example, suppose in a particular area a lot of babies were born in the Fall, according to the following relative frequencies:
| Season | Relative Frequency |
|--------|--------------------|
| Spring | 20% |
| Summer | 20% |
| Fall | 40% |
| Winter | 20% |
I know this is a bit ridiculous. But suppose this "Fall baby" town is used as the basis of the expected frequencies; we just need to change each E accordingly:
| Season | Expected Frequency (E) |
|--------|------------------------|
| Spring | 0.2 × 33 = 6.6 |
| Summer | 0.2 × 33 = 6.6 |
| Fall | 0.4 × 33 = 13.2 |
| Winter | 0.2 × 33 = 6.6 |
This latter approach is fairly standard in research papers on the Relative Age Effect. Again, the important thing to keep in mind is that the sum of the O's and the sum of the E's must be equal; otherwise the assumption behind the chi-square distribution breaks down. This makes for a good check.
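In code, the unequal-proportions version only changes the expected frequencies. Here is a sketch using SciPy's chisquare function (again an assumed tool, not one the section uses), which accepts them through its f_exp argument:

```python
# Goodness-of-fit test against the "Fall baby" town's proportions.
from scipy.stats import chisquare

observed = [8, 6, 12, 7]                   # Spring, Summer, Fall, Winter
proportions = [0.2, 0.2, 0.4, 0.2]         # H0 proportions from the table
expected = [p * sum(observed) for p in proportions]   # 6.6, 6.6, 13.2, 6.6

# f_obs and f_exp must have equal totals -- the check mentioned above.
result = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {result.statistic:.3f}")   # about 0.485
print(f"P-value = {result.pvalue:.3f}")         # about 0.92
```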
So to recap: large discrepancies between Observed and Expected frequencies will lead to a large $\chi^2$ statistic, which in turn will lead to rejection of $H_0$ (no relative age effect). So although "Goodness-of-fit" is such a lovely name, you may want to remember that we are usually going after the claim that "this is not a good fit".
Is there an explanation for the Relative Age Effect? Technically, this question is not something we can answer with a $\chi^2$ test. If you are interested in Malcolm Gladwell's explanation, you can certainly read the book and find out for yourself!