Chi-squared test of independence
What is it?
The chi-square test of independence is a statistical hypothesis test used to determine whether there is a significant association between two categorical variables.
It assesses whether the observed frequencies of the variables in a contingency table differ significantly from the frequencies that would be expected if the variables were independent.
Let \(X\) and \(Y\) denote the variables in question, and let \(\left\{x_1,\cdots,x_k,\cdots,x_K\right\}\) and \(\left\{y_1,\cdots,y_l,\cdots,y_L\right\}\) be their respective categories.
Data: Observed frequencies
Observed frequency of \((x_k,y_l)\): \(n_{k,l}\)
Marginal frequency of category \(x_k\): \(n_{k,+}=\sum_{l=1}^Ln_{k,l}\)
Marginal frequency of category \(y_l\): \(n_{+,l}=\sum_{k=1}^Kn_{k,l}\)
Total frequency: \[ \begin{aligned} n &=\sum_{k=1}^K\sum_{l=1}^Ln_{k,l}\\ &=\sum_{k=1}^Kn_{k,+}\\ &=\sum_{l=1}^Ln_{+,l}\\ \end{aligned} \]
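These marginal and total frequencies can be computed directly from the matrix of observed counts. A minimal sketch with NumPy, using hypothetical counts for \(K=2\) rows and \(L=3\) columns:

```python
import numpy as np

# Hypothetical observed frequencies n_{k,l} (K=2 rows, L=3 columns)
n_kl = np.array([[20, 30, 25],
                 [30, 20, 25]])

n_k_plus = n_kl.sum(axis=1)  # row marginals n_{k,+}
n_plus_l = n_kl.sum(axis=0)  # column marginals n_{+,l}
n = n_kl.sum()               # total frequency n

print(n_k_plus)  # [75 75]
print(n_plus_l)  # [50 50 50]
print(n)         # 150
```

Summing either set of marginals recovers the same total \(n\), as in the identity above.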
Contingency table
| | \(y_1\) | \(\cdots\) | \(y_l\) | \(\cdots\) | \(y_L\) | Total |
|---|---|---|---|---|---|---|
| \(x_1\) | \(n_{1,1}\) | \(\cdots\) | \(n_{1,l}\) | \(\cdots\) | \(n_{1,L}\) | \(n_{1,+}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(x_k\) | \(n_{k,1}\) | \(\cdots\) | \(n_{k,l}\) | \(\cdots\) | \(n_{k,L}\) | \(n_{k,+}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(x_K\) | \(n_{K,1}\) | \(\cdots\) | \(n_{K,l}\) | \(\cdots\) | \(n_{K,L}\) | \(n_{K,+}\) |
| Total | \(n_{+,1}\) | \(\cdots\) | \(n_{+,l}\) | \(\cdots\) | \(n_{+,L}\) | \(n\) |
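In practice the contingency table is usually built from raw paired observations of \(X\) and \(Y\) rather than written by hand. A small sketch with pandas, using hypothetical category labels:

```python
import pandas as pd

# Hypothetical raw observations of two categorical variables X and Y
df = pd.DataFrame({
    "X": ["x1", "x1", "x2", "x2", "x1", "x2"],
    "Y": ["y1", "y2", "y1", "y2", "y2", "y1"],
})

# Contingency table including the marginal totals, as in the layout above
table = pd.crosstab(df["X"], df["Y"], margins=True, margins_name="Total")
print(table)
```

The `margins=True` option appends the row and column totals \(n_{k,+}\), \(n_{+,l}\), and the grand total \(n\).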
Hypothesis
\(\mathcal{H}_0\): \(X\) and \(Y\) are independent
\(\mathcal{H}_1\): \(X\) and \(Y\) are not independent (there is an association between them)
Test statistic
Expected frequencies under \(\mathcal{H}_0\) \[ \widehat{n}_{k,l}=n\,\dfrac{n_{k,+}}{n}\dfrac{n_{+,l}}{n}=\dfrac{n_{k,+}\,n_{+,l}}{n} \]
Test statistic \[ \mathbb{X}^2=\sum_{k=1}^K\sum_{l=1}^L\dfrac{\left(n_{k,l}-\widehat{n}_{k,l}\right)^2}{\widehat{n}_{k,l}}\ \overset{\mathcal{H}_0}{\rightarrow}\ \chi_{df}^2 \]
Degrees of freedom: \(df = (K-1)(L-1)\)
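The expected frequencies and the statistic can be computed by hand and cross-checked against SciPy. A sketch using the same hypothetical 2×3 counts as above (`correction=False` disables Yates' continuity correction so SciPy matches the plain formula):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed frequencies (K=2, L=3)
n_kl = np.array([[20, 30, 25],
                 [30, 20, 25]])
n = n_kl.sum()

# Expected frequencies under H0: n_hat_{k,l} = n_{k,+} n_{+,l} / n
n_hat = np.outer(n_kl.sum(axis=1), n_kl.sum(axis=0)) / n

# Chi-squared statistic and degrees of freedom df = (K-1)(L-1)
x2 = ((n_kl - n_hat) ** 2 / n_hat).sum()
df = (n_kl.shape[0] - 1) * (n_kl.shape[1] - 1)

# Cross-check against SciPy
stat, p, dof, expected = chi2_contingency(n_kl, correction=False)
```

Here every expected cell equals 25, the statistic works out to \(\mathbb{X}^2=4\), and \(df=(2-1)(3-1)=2\).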
Critical region and P-value
Critical Region
- \(W=\left]q_{1-\alpha}\left(\chi_{df}^2\right), +\infty\right[\)
P-value
- \(pValue=\mathbb{P}\left(\chi_{df}^2>\mathbb{X}_{obs}^2\right)\)
Decision
Decision based on Critical Region
- Reject \(\mathcal{H}_0\) if and only if \(\mathbb{X}_{obs}^2\in W\)
Decision based on P-value
- Reject \(\mathcal{H}_0\) if and only if \(pValue<\alpha\)
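Both decision rules can be checked numerically and always agree. A sketch with SciPy's \(\chi^2\) distribution, using the hypothetical statistic \(\mathbb{X}_{obs}^2=4\) with \(df=2\) from the example above and \(\alpha=0.05\):

```python
from scipy.stats import chi2

# Hypothetical observed statistic, degrees of freedom, and level
x2_obs, df, alpha = 4.0, 2, 0.05

# Critical region: reject H0 iff x2_obs > q_{1-alpha}(chi2_df)
q = chi2.ppf(1 - alpha, df)          # quantile of order 1 - alpha

# P-value: reject H0 iff p_value < alpha
p_value = chi2.sf(x2_obs, df)        # survival function P(chi2_df > x2_obs)

reject_by_region = x2_obs > q
reject_by_pvalue = p_value < alpha
```

With these numbers \(q_{0.95}\approx 5.99\) and \(pValue\approx 0.135\), so neither rule rejects \(\mathcal{H}_0\): the data are compatible with independence at the 5% level.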