Chi-squared test of independence

What is it?

  • The chi-square test of independence is a statistical hypothesis test used to determine whether there is a significant association between two categorical variables.

  • It assesses whether the observed frequencies of the variables in a contingency table differ significantly from the frequencies that would be expected if the variables were independent.

  • Let \(X\) and \(Y\) denote the variables in question, and let \(\left\{x_1,\cdots,x_k,\cdots,x_K\right\}\) and \(\left\{y_1,\cdots,y_l,\cdots,y_L\right\}\) be their respective categories.

Data: Observed frequencies

  • Observed frequency of \((x_k,y_l)\): \(n_{k,l}\)

  • Marginal frequency of category \(x_k\): \(n_{k,+}=\sum_{l=1}^Ln_{k,l}\)

  • Marginal frequency of category \(y_l\): \(n_{+,l}=\sum_{k=1}^Kn_{k,l}\)

  • Total frequency: \[ \begin{aligned} n &=\sum_{k=1}^K\sum_{l=1}^Ln_{k,l}\\ &=\sum_{k=1}^Kn_{k,+}\\ &=\sum_{l=1}^Ln_{+,l}\\ \end{aligned} \]
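The marginal and total frequencies above are just row sums, column sums, and the grand sum of the contingency table. A minimal sketch with a hypothetical \(2\times 3\) table (the counts are made up for illustration), using NumPy:

```python
import numpy as np

# Hypothetical 2x3 contingency table: rows = categories of X, columns = categories of Y
counts = np.array([[20, 30, 25],
                   [30, 20, 25]])

row_totals = counts.sum(axis=1)  # n_{k,+}: marginal frequencies of x_k
col_totals = counts.sum(axis=0)  # n_{+,l}: marginal frequencies of y_l
n = counts.sum()                 # total frequency n

# row_totals -> [75 75], col_totals -> [50 50 50], n -> 150
```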

Contingency table

|            | \(y_1\)     | \(\cdots\) | \(y_l\)     | \(\cdots\) | \(y_L\)     | Total       |
|------------|-------------|------------|-------------|------------|-------------|-------------|
| \(x_1\)    | \(n_{1,1}\) | \(\cdots\) | \(n_{1,l}\) | \(\cdots\) | \(n_{1,L}\) | \(n_{1,+}\) |
| \(\vdots\) | \(\vdots\)  | \(\ddots\) | \(\vdots\)  | \(\ddots\) | \(\vdots\)  | \(\vdots\)  |
| \(x_k\)    | \(n_{k,1}\) | \(\cdots\) | \(n_{k,l}\) | \(\cdots\) | \(n_{k,L}\) | \(n_{k,+}\) |
| \(\vdots\) | \(\vdots\)  | \(\ddots\) | \(\vdots\)  | \(\ddots\) | \(\vdots\)  | \(\vdots\)  |
| \(x_K\)    | \(n_{K,1}\) | \(\cdots\) | \(n_{K,l}\) | \(\cdots\) | \(n_{K,L}\) | \(n_{K,+}\) |
| Total      | \(n_{+,1}\) | \(\cdots\) | \(n_{+,l}\) | \(\cdots\) | \(n_{+,L}\) | \(n\)       |

Hypothesis

  • \(\mathcal{H}_0\): \(X\) and \(Y\) are independent

  • \(\mathcal{H}_1\): There is a significant association between \(X\) and \(Y\)

Test statistic

  • Expected frequencies under \(\mathcal{H}_0\) \[ \widehat{n}_{k,l}=n\,\dfrac{n_{k,+}}{n}\,\dfrac{n_{+,l}}{n}=\dfrac{n_{k,+}\,n_{+,l}}{n} \]

  • Test statistic \[ \mathbb{X}^2=\sum_{k=1}^K\sum_{l=1}^L\dfrac{\left(n_{k,l}-\widehat{n}_{k,l}\right)^2}{\widehat{n}_{k,l}}\ \overset{\mathcal{H}_0}{\rightarrow}\ \chi_{df}^2 \]

  • Degrees of freedom: \(df = (K-1)(L-1)\). The \(\chi^2\) approximation is reliable when all expected frequencies \(\widehat{n}_{k,l}\) are sufficiently large (a common rule of thumb: at least 5).
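The expected frequencies and the test statistic can be computed directly from the formulas above. A minimal sketch reusing the same hypothetical \(2\times 3\) table:

```python
import numpy as np

# Hypothetical 2x3 contingency table (illustrative counts)
counts = np.array([[20, 30, 25],
                   [30, 20, 25]])
n = counts.sum()

# Expected frequencies under H0: n_hat[k,l] = n_{k,+} * n_{+,l} / n
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n

# Chi-squared statistic: sum of (observed - expected)^2 / expected
x2 = ((counts - expected) ** 2 / expected).sum()

# Degrees of freedom: (K-1)(L-1)
K, L = counts.shape
df = (K - 1) * (L - 1)

# Here every expected cell equals 25, x2 -> 4.0, df -> 2
```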

Critical region and P-value

Critical Region

  • \(W=\left]q_{1-\alpha}\left(\chi_{df}^2\right), +\infty\right[\)

P-value

  • \(pValue=\mathbb{P}\left(\chi_{df}^2>\mathbb{X}_{obs}^2\right)\)

Decision

Decision based on Critical Region

  • Reject \(\mathcal{H}_0\) if and only if \(\mathbb{X}_{obs}^2\in W\)

Decision based on P-value

  • Reject \(\mathcal{H}_0\) if and only if \(pValue<\alpha\)
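Both decision rules can be applied in one pass, assuming SciPy is available; `scipy.stats.chi2_contingency` returns the statistic, p-value, degrees of freedom, and expected frequencies, and `scipy.stats.chi2.ppf` gives the quantile that bounds the critical region. A sketch on the same hypothetical table:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x3 contingency table (illustrative counts)
counts = np.array([[20, 30, 25],
                   [30, 20, 25]])
alpha = 0.05

# Statistic, p-value, degrees of freedom, and expected frequencies in one call
x2_obs, p_value, df, expected = chi2_contingency(counts, correction=False)

# Decision via the critical region: reject iff X2_obs > q_{1-alpha}(chi^2_df)
critical = chi2.ppf(1 - alpha, df)
reject_cr = x2_obs > critical

# Decision via the p-value: reject iff p-value < alpha
reject_p = p_value < alpha

# Both rules always agree; here X2_obs = 4.0 < critical, so H0 is not rejected
```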