15 Chi-squared test of independence
15.1 What is it?
The chi-square test of independence is a statistical hypothesis test used to determine whether there is a significant association between two categorical variables.
It assesses whether the observed frequencies of the variables in a contingency table differ significantly from the frequencies that would be expected if the variables were independent.
Let \(X\) and \(Y\) denote the variables in question, and let \(\left\{x_1,\cdots,x_k,\cdots,x_K\right\}\) and \(\left\{y_1,\cdots,y_l,\cdots,y_L\right\}\) be their respective categories.
15.2 Data: Observed frequencies
Observed frequency of \((x_k,y_l)\): \(n_{k,l}\)
Marginal frequency of category \(x_k\): \(n_{k,+}=\sum_{l=1}^Ln_{k,l}\)
Marginal frequency of category \(y_l\): \(n_{+,l}=\sum_{k=1}^Kn_{k,l}\)
Total frequency: \[ \begin{aligned} n &=\sum_{k=1}^K\sum_{l=1}^Ln_{k,l}\\ &=\sum_{k=1}^Kn_{k,+}\\ &=\sum_{l=1}^Ln_{+,l}\\ \end{aligned} \]
Contingency table

| | \(y_1\) | \(\cdots\) | \(y_l\) | \(\cdots\) | \(y_L\) | Total |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| \(x_1\) | \(n_{1,1}\) | \(\cdots\) | \(n_{1,l}\) | \(\cdots\) | \(n_{1,L}\) | \(n_{1,+}\) |
| \(\vdots\) | \(\vdots\) | | \(\ddots\) | | \(\vdots\) | \(\vdots\) |
| \(x_k\) | \(n_{k,1}\) | \(\cdots\) | \(n_{k,l}\) | \(\cdots\) | \(n_{k,L}\) | \(n_{k,+}\) |
| \(\vdots\) | \(\vdots\) | | \(\ddots\) | | \(\vdots\) | \(\vdots\) |
| \(x_K\) | \(n_{K,1}\) | \(\cdots\) | \(n_{K,l}\) | \(\cdots\) | \(n_{K,L}\) | \(n_{K,+}\) |
| Total | \(n_{+,1}\) | \(\cdots\) | \(n_{+,l}\) | \(\cdots\) | \(n_{+,L}\) | \(n\) |
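As a concrete illustration, the marginal and total frequencies can be computed from an observed table. The sketch below uses an invented \(2\times 3\) table of counts (the numbers are hypothetical, chosen only to illustrate the definitions):

```python
import numpy as np

# Hypothetical 2x3 contingency table (invented counts):
# rows are the categories of X, columns are the categories of Y.
table = np.array([[20, 30, 25],
                  [30, 20, 25]])

n_row = table.sum(axis=1)  # marginal frequencies n_{k,+}
n_col = table.sum(axis=0)  # marginal frequencies n_{+,l}
n = table.sum()            # total frequency n

print(n_row)  # [75 75]
print(n_col)  # [50 50 50]
print(n)      # 150
```

Note that summing either set of marginals recovers the same total \(n\), as in the identity above.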
15.3 Hypothesis
\(\mathcal{H}_0\): \(X\) and \(Y\) are independent
\(\mathcal{H}_1\): There is a significant association between \(X\) and \(Y\)
15.4 Test statistic
Expected frequencies under \(\mathcal{H}_0\) \[ \widehat{n}_{k,l}=n\times\dfrac{n_{k,+}}{n}\dfrac{n_{+,l}}{n}=\dfrac{n_{k,+}\,n_{+,l}}{n} \]
Test statistic \[ \mathbb{X}^2=\sum_{k=1}^K\sum_{l=1}^L\dfrac{\left(n_{k,l}-\widehat{n}_{k,l}\right)^2}{\widehat{n}_{k,l}}\ \overset{\mathcal{H}_0}{\rightarrow}\ \chi_{df}^2 \]
Degrees of freedom: \(df = (K-1)(L-1)\)
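These formulas translate directly into code. The self-contained sketch below computes the expected frequencies, the statistic \(\mathbb{X}^2\), and the degrees of freedom for an invented \(2\times 3\) table (the counts are hypothetical):

```python
import numpy as np

# Hypothetical 2x3 contingency table (invented counts)
table = np.array([[20, 30, 25],
                  [30, 20, 25]], dtype=float)
n = table.sum()

# Expected frequencies under H0: n_hat_{k,l} = n_{k,+} * n_{+,l} / n,
# built as the outer product of the two marginal vectors divided by n
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi2_stat = ((table - expected) ** 2 / expected).sum()

# Degrees of freedom: (K - 1)(L - 1)
K, L = table.shape
df = (K - 1) * (L - 1)

print(chi2_stat)  # 4.0
print(df)         # 2
```

Here every expected cell equals \(75\times 50/150=25\), so the statistic reduces to \((25+25+0+25+25+0)/25=4\).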
15.5 Critical region and P-value
15.5.1 Critical Region
- \(W=\left]q_{1-\alpha}\left(\chi_{df}^2\right), +\infty\right[\)
15.5.2 P-value
- \(pValue=\mathbb{P}\left(\chi_{df}^2>\mathbb{X}_{obs}^2\right)\)
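Both quantities come from the \(\chi_{df}^2\) distribution: the critical value is its \(1-\alpha\) quantile and the p-value is its survival function evaluated at \(\mathbb{X}_{obs}^2\). A minimal sketch using `scipy.stats.chi2`, with an assumed \(df=2\), \(\alpha=0.05\), and an observed statistic of \(4.0\) (illustrative values):

```python
from scipy.stats import chi2

df = 2            # assumed degrees of freedom (illustrative)
alpha = 0.05      # significance level
chi2_obs = 4.0    # assumed observed statistic (illustrative)

# Lower bound of the critical region W: the (1 - alpha) quantile of chi2_df
q = chi2.ppf(1 - alpha, df)

# p-value: P(chi2_df > chi2_obs), via the survival function
p_value = chi2.sf(chi2_obs, df)
```

For \(df=2\) the quantile \(q_{0.95}\approx 5.99\), so \(\mathbb{X}_{obs}^2=4.0\) falls outside \(W\) and the p-value \(\approx 0.135\) exceeds \(\alpha\).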
15.6 Decision
15.6.1 Decision based on Critical Region
- Reject \(\mathcal{H}_0\) if and only if \(\mathbb{X}_{obs}^2\in W\)
15.6.2 Decision based on P-value
- Reject \(\mathcal{H}_0\) if and only if \(pValue<\alpha\)
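In practice, the whole procedure (expected frequencies, statistic, degrees of freedom, p-value) is available as `scipy.stats.chi2_contingency`. The sketch below runs it on an invented \(2\times 3\) table and applies the p-value decision rule (`correction=False` disables the Yates continuity correction, which in any case only applies to \(2\times 2\) tables):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table (invented counts)
table = np.array([[20, 30, 25],
                  [30, 20, 25]])

stat, p_value, df, expected = chi2_contingency(table, correction=False)

# Decision at level alpha: reject H0 iff p-value < alpha
alpha = 0.05
reject_h0 = p_value < alpha

print(stat, df)  # 4.0 2
print(reject_h0) # False
```

Both decision rules are equivalent: \(\mathbb{X}_{obs}^2\in W\) holds exactly when \(pValue<\alpha\), so they always reach the same conclusion.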