Relative Age Effect, Goodness-of-fit, and Chi-square Tests

Relative Age Effect and Goodness-of-Fit Test


One of the interesting books I read recently is "Outliers" by New Yorker magazine staff writer Malcolm Gladwell. The book opens with a story about Canadian hockey players and their birthdays. Why would anyone care about the birthdays of hockey players (unless you subscribe to astrology and the zodiac)? It turns out something strange is going on when you look at their birthdays by month. For example, the following graph shows the birthdays of professional hockey players from the two professional leagues in Canada:

[Figure: birth months of players in the two Canadian professional hockey leagues]

Here is a list of players who currently play in the NHL, which shows a similar pattern:

[Figure: birth months of current NHL players]

Guess what: if you are born near the beginning of the year, then you will have a better chance of playing professional hockey than someone who is born near the end of the year! In the Canadian league, the percentage of hockey players who are born in November is merely 2%. So you might as well just give up if you are born in November.

But wait a second: could this weird result be explained by more couples favoring "spring babies"? If you look at the data on birth rates, this is certainly not the case. So we are led to the curious conclusion that somehow, kids who are born near the beginning of the year have an "advantage" over those who are born near the end of the year. Sociologists who discovered this phenomenon called it the "Relative Age Effect", which has since been found in various other sports, such as the soccer leagues in France:

[Figure: birth months of players in French professional soccer leagues]

Since the Relative Age Effect was first noticed in sports, people quickly started to look at other domains, such as career success, suicide, and divorce. While this is a fascinating idea, and people have different theories about why such an effect exists, all of this research relies on the statistical analysis that we will study in this chapter: namely, the Goodness-of-fit test based on the χ² (also known as chi-square, pronounced "kai-square") statistic.

Why the name "Goodness-of-fit"? The name tells us how this particular test works: on one hand, we have some type of model that predicts the frequencies in categorical data. In the hockey example, if babies are equally likely to be born in each month, then we would expect the same number of players to be born in each month. So if the actual frequencies differ significantly from the all-equal expectation, then the data does NOT "fit" our intuitive model, and we can use that as good evidence that the Relative Age Effect exists.

Example: Birthday by Season


Let's consider a simpler scenario in which we divide the birthdays into just 4 seasons. Suppose we are interested in showing that the fractions of births in the four seasons are not all the same. This would be the alternative hypothesis H₁, since it contains the "unequal" statement. If we use p₁ for the proportion of hockey players born in spring, p₂ for summer, and so on, we can write the null and alternative hypotheses as follows:

·         H₀: The same fraction of players is born during each season: p₁ = p₂ = p₃ = p₄ = 1/4

·         H₁: The fractions of births during the four seasons are not all the same. (Claim)

There are many ways of stating these hypotheses. Another way would be to state that

·         H₀: Observed Frequencies = Expected Frequencies

·         H₁: Observed Frequencies ≠ Expected Frequencies (for at least one season, the observed frequency differs from the expected)

Don’t worry too much about stating the hypotheses; just make sure that the claim winds up in the right place.

Now suppose our data look like the following:

Season    Number of Players
------    -----------------
Spring    8
Summer    6
Fall      12
Winter    7

We will need to translate this data and H₀ into a single test statistic, the χ². To start with, we will differentiate the “Observed Frequency” (or simply "O") from the “Expected Frequency” (or simply “E”), because the former is the actual data, and the latter is the frequency we would expect to get if the null hypothesis (that all the fractions are 1/4) were true. We find these expected frequencies by multiplying the proportion for each category (1/4 in this case, but the proportions may have different values, as seen in the team homework) by the sample size n (33 in this case). Here’s the augmented table:

Season    O     E
------    --    ----
Spring    8     8.25
Summer    6     8.25
Fall      12    8.25
Winter    7     8.25

It’s very true that you can’t have 8.25 people born in a season, but don’t round the expected frequencies to whole numbers, because that would throw off the calculations. The expected frequencies have to add up to n (allowing for a little rounding error), which makes for a good check.
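If you like to verify steps like this with code, here is a minimal sketch in Python (my own illustration; the course itself works in GeoGebra or on a TI calculator):

```python
# Expected frequency for each category: E = n * p
n = 33                                   # sample size: 8 + 6 + 12 + 7
proportions = [0.25, 0.25, 0.25, 0.25]   # all-equal proportions under the null hypothesis

expected = [n * p for p in proportions]
print(expected)                          # [8.25, 8.25, 8.25, 8.25]

# The check from above: the expected frequencies must add back up to n
assert abs(sum(expected) - n) < 1e-9
```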

We’re working up to the χ² statistic here, and the next step is to calculate each category's contribution to the χ². It has a formula with which you’ll become very familiar:

χ² = Σ (O − E)² / E

Take the difference between the observed and the expected frequencies for a category, square that difference, and divide by the expected frequency. For Spring, it would be (8 − 8.25)² / 8.25 ≈ 0.008. You will do that for each season, and add up all four contributions. The following table shows all the calculation leading up to the χ² statistic:

Season    O     E       (O − E)²/E
------    --    ----    ----------
Spring    8     8.25    0.008
Summer    6     8.25    0.614
Fall      12    8.25    1.705
Winter    7     8.25    0.189
Sum                     χ² = 2.516

Notice that you can do this calculation in GeoGebra or on your TI calculator, but it's nice to have an intuitive sense of why the χ² is large or small.
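For those who like to see the arithmetic spelled out, here is the same calculation as a short Python sketch (again my own illustration, not part of the GeoGebra/TI workflow):

```python
# Observed and expected frequencies, in season order
observed = [8, 6, 12, 7]       # Spring, Summer, Fall, Winter
expected = [8.25] * 4          # 33 * (1/4) for each season

# Each category's contribution: (O - E)^2 / E
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
for season, c in zip(["Spring", "Summer", "Fall", "Winter"], contributions):
    print(f"{season}: {c:.3f}")          # 0.008, 0.614, 1.705, 0.189

chi_square = sum(contributions)
print(f"chi-square = {chi_square:.3f}")  # 2.515; the table shows 2.516 because it sums the rounded contributions
```

Notice that Fall contributes 1.705, by far the largest share of the χ²; that is the season whose observed frequency strays the most from the all-equal model.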

The sum, χ² = 2.516, is the test statistic for the goodness-of-fit test. We use it to find the likelihood that, if the null hypothesis were true, a group of 33 would produce frequencies as different from the expected as ours, or even more different. This likelihood has exactly the same interpretation as the P-values we used in other hypothesis tests. Goodness-of-fit tests are almost always right-tailed. This is because if, say, the observed frequencies were exactly the same as the expected, every O − E would be zero, as would every (O − E)²/E, and so would the χ² itself. The more different the observed frequencies are from the expected, the bigger the χ².


But how many degrees of freedom are there for this χ²? If you thought 32, then you made a smart mistake, because you concluded from previous work that the degrees of freedom are one less than the sample size, which was 33. However, in goodness-of-fit tests, the degrees of freedom are one less than the number of categories, which we label k. In this case, with four seasons, k = 4 and df = k − 1 = 3. So there are three degrees of freedom.

To find the P-value, we use the χ² probability calculator in GeoGebra, as shown in the screenshot below. Since the χ² statistic cannot take on any negative values (the formula only involves adding non-negative numbers), all the P-values under the χ² distribution will be right-tailed:

[Screenshot: GeoGebra χ² probability calculator showing the right-tailed P-value]

As you can tell from the graph, the P-value ≈ 0.472. The P-value has the following interpretation:

·         Assuming that people are born into the 4 seasons with the same frequency (that is, H₀ is true), the probability of getting a χ² statistic greater than 2.516 from samples of 33 people is about 47.2%.
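If you would rather compute that P-value in code than read it off the GeoGebra graph, here is a sketch that assumes the SciPy library is available:

```python
from scipy import stats

chi_square = 2.516   # the test statistic from the table above
df = 3               # k - 1 = 4 - 1 degrees of freedom

# Right-tailed P-value: the area under the chi-square curve to the right of 2.516
p_value = stats.chi2.sf(chi_square, df)
print(round(p_value, 3))   # about 0.472

# Or let SciPy run the entire goodness-of-fit test in one call
stat, p = stats.chisquare([8, 6, 12, 7], f_exp=[8.25, 8.25, 8.25, 8.25])
print(stat, p)             # same statistic and P-value, up to rounding
```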

Notice we cannot be as specific as we were with the P-value when we compared two means or two proportions. This is because the chi-square statistic compares two frequency distributions with each other (the Observed vs. the Expected). The chi-square statistic will be large when there is a large discrepancy between the two, but it’s hard to pinpoint which season was the culprit unless you break down the χ² using the table we showed above.

So what is our conclusion, then? The P-value is huge, i.e., there’s more than enough probability that seasons with equal birth fractions would produce a sample as unbalanced as ours, or even more unbalanced. We FAIL TO reject the null hypothesis. To summarize: there is not sufficient evidence to support the claim that the fractions of births during the four seasons are not all the same. We failed to make our case because the frequencies are not dramatically different, considering that our sample is relatively small. Let’s ask for more research money so we can collect a larger sample!

If you would like to go with the traditional method, then you can use the χ² distribution as well. Just use α as the area this time and solve for the critical value (remember it's always right-tailed, so no more worries about dividing by 2):

[Screenshot: GeoGebra χ² calculator showing the critical value for the right-tailed test]

In this case, the test statistic clearly lies outside the critical region, so we reach the same conclusion: we do not reject H₀.
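The same critical value can be found in code. The sketch below assumes SciPy, and it assumes the usual α = 0.05, since the section doesn't fix a significance level:

```python
from scipy import stats

alpha = 0.05   # assumed significance level (not stated in the section)
df = 3

# Critical value: the point whose right-tail area equals alpha
critical = stats.chi2.ppf(1 - alpha, df)
print(round(critical, 3))   # about 7.815

# Our statistic, 2.516, is below 7.815 (outside the critical region),
# so once again we fail to reject H0.
```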

Other Choices for Expected Frequencies


Let's go back to the hockey example that led us to the discussion of the χ² test. If you do the same type of analysis that we just did on the data presented at the beginning, you will be able to produce a definitive answer to the question "is there a Relative Age Effect in pro hockey?" (Hint: the answer is yes, but you should do it as an exercise.) There are ways we can improve our test as well: for example, instead of using the "all-equal proportions" null hypothesis, we could use the actual proportions of the general population as the pᵢ's to calculate expected frequencies. For example, suppose in a particular area, a lot of babies were born in the Fall, according to the following relative frequencies:

Season    Relative Frequency
------    ------------------
Spring    20%
Summer    20%
Fall      40%
Winter    20%

I know this is a bit ridiculous. But if this “Fall baby” town is used as the basis for the expected frequencies, we just need to change E accordingly:


Season    Expected Frequency (E)
------    ----------------------
Spring    0.2 × 33 = 6.6
Summer    0.2 × 33 = 6.6
Fall      0.4 × 33 = 13.2
Winter    0.2 × 33 = 6.6

This latter approach is fairly standard in research papers on the Relative Age Effect. Again, the important thing to keep in mind is that the O's and the E's must have equal sums (both add up to n); otherwise the chi-square approximation breaks down.
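As a sketch of how the "Fall baby" version of the test might look in Python with SciPy (using the made-up proportions above):

```python
from scipy import stats

observed = [8, 6, 12, 7]                  # Spring, Summer, Fall, Winter
proportions = [0.20, 0.20, 0.40, 0.20]    # hypothetical "Fall baby" town
n = sum(observed)                         # 33

# Expected frequencies E = n * p; note sum(expected) == sum(observed) == 33
expected = [n * p for p in proportions]   # [6.6, 6.6, 13.2, 6.6]

stat, p = stats.chisquare(observed, f_exp=expected)
print(round(stat, 3), round(p, 3))        # chi-square about 0.485, P-value about 0.922
```

Against this town's birth pattern, our 33 players are an even better "fit", so there is even less evidence of a Relative Age Effect.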

So to recap: large discrepancies between Observed and Expected frequencies will lead to a large χ² statistic, which in turn will lead to rejection of H₀ (no Relative Age Effect). So although "Goodness-of-fit" is such a lovely name, you may want to remember that we are usually going after the claim that "this is not a good fit".

Is there an explanation of the Relative Age Effect? Technically, this question is not something we can answer with a χ² test. If you are interested in Malcolm Gladwell's explanation, you can certainly read the book and find out for yourself!