15  Chi-squared test of independence

15.1 What is it?

  • The chi-square test of independence is a statistical hypothesis test used to determine whether there is a significant association between two categorical variables.

  • It assesses whether the observed frequencies of the variables in a contingency table differ significantly from the frequencies that would be expected if the variables were independent.

  • Let \(X\) and \(Y\) denote the variables in question, and let \(\left\{x_1,\cdots,x_k,\cdots,x_K\right\}\) and \(\left\{y_1,\cdots,y_l,\cdots,y_L\right\}\) be their respective categories.

15.2 Data: Observed frequencies

  • Observed frequency of \((x_k,y_l)\): \(n_{k,l}\)

  • Marginal frequency of category \(x_k\): \(n_{k,+}=\sum_{l=1}^Ln_{k,l}\)

  • Marginal frequency of category \(y_l\): \(n_{+,l}=\sum_{k=1}^Kn_{k,l}\)

  • Total frequency: \[ \begin{aligned} n &=\sum_{k=1}^K\sum_{l=1}^Ln_{k,l}\\ &=\sum_{k=1}^Kn_{k,+}\\ &=\sum_{l=1}^Ln_{+,l}\\ \end{aligned} \]

Contingency table

|            | \(y_1\)     | \(\cdots\) | \(y_l\)     | \(\cdots\) | \(y_L\)     | Total       |
| :--------: | :---------: | :--------: | :---------: | :--------: | :---------: | :---------: |
| \(x_1\)    | \(n_{1,1}\) | \(\cdots\) | \(n_{1,l}\) | \(\cdots\) | \(n_{1,L}\) | \(n_{1,+}\) |
| \(\vdots\) | \(\vdots\)  | \(\vdots\) | \(\ddots\)  | \(\vdots\) | \(\vdots\)  | \(\vdots\)  |
| \(x_k\)    | \(n_{k,1}\) | \(\cdots\) | \(n_{k,l}\) | \(\cdots\) | \(n_{k,L}\) | \(n_{k,+}\) |
| \(\vdots\) | \(\vdots\)  | \(\vdots\) | \(\ddots\)  | \(\vdots\) | \(\vdots\)  | \(\vdots\)  |
| \(x_K\)    | \(n_{K,1}\) | \(\cdots\) | \(n_{K,l}\) | \(\cdots\) | \(n_{K,L}\) | \(n_{K,+}\) |
| Total      | \(n_{+,1}\) | \(\cdots\) | \(n_{+,l}\) | \(\cdots\) | \(n_{+,L}\) | \(n\)       |
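The observed and marginal frequencies above are just row sums, column sums, and a grand total. As a minimal sketch, using a hypothetical 2×3 table of observed counts:

```python
import numpy as np

# Hypothetical observed frequencies n_{k,l} (K = 2 rows, L = 3 columns)
obs = np.array([[20, 30, 25],
                [30, 20, 25]])

row_totals = obs.sum(axis=1)  # marginal frequencies n_{k,+}
col_totals = obs.sum(axis=0)  # marginal frequencies n_{+,l}
n = obs.sum()                 # total frequency n

print(row_totals)  # [75 75]
print(col_totals)  # [50 50 50]
print(n)           # 150
```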

15.3 Hypothesis

  • \(\mathcal{H}_0\): \(X\) and \(Y\) are independent

  • \(\mathcal{H}_1\): There is a significant association between \(X\) and \(Y\)

15.4 Test statistic

  • Expected frequencies under \(\mathcal{H}_0\) \[ \widehat{n}_{k,l}=n\cdot\dfrac{n_{k,+}}{n}\dfrac{n_{+,l}}{n}=\dfrac{n_{k,+}\,n_{+,l}}{n} \]

  • Test statistic \[ \mathbb{X}^2=\sum_{k=1}^K\sum_{l=1}^L\dfrac{\left(n_{k,l}-\widehat{n}_{k,l}\right)^2}{\widehat{n}_{k,l}}\ \overset{\mathcal{H}_0}{\rightarrow}\ \chi_{df}^2 \]

  • Degrees of freedom: \(df = (K-1)(L-1)\)
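Putting the formulas above together, a short sketch (reusing the hypothetical 2×3 table from earlier) computes the expected frequencies \(\widehat{n}_{k,l}\), the statistic \(\mathbb{X}^2\), and the degrees of freedom:

```python
import numpy as np

obs = np.array([[20, 30, 25],   # hypothetical observed frequencies n_{k,l}
                [30, 20, 25]])

n = obs.sum()
# Expected frequencies under H0: n_{k,+} * n_{+,l} / n, as an outer product
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
# Test statistic: sum of (observed - expected)^2 / expected
chi2_stat = ((obs - expected) ** 2 / expected).sum()
# Degrees of freedom: (K - 1)(L - 1)
df = (obs.shape[0] - 1) * (obs.shape[1] - 1)

print(expected)   # all entries equal 25.0 for this table
print(chi2_stat)  # 4.0
print(df)         # 2
```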

15.5 Critical region and P-value

15.5.1 Critical Region

  • \(W=\left]q_{1-\alpha}\left(\chi_{df}^2\right), +\infty\right[\)
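The quantile \(q_{1-\alpha}\left(\chi_{df}^2\right)\) that bounds the critical region can be obtained numerically, for instance with `scipy.stats.chi2.ppf` (here at \(\alpha = 0.05\) with \(df = 2\), matching the running example):

```python
from scipy.stats import chi2

alpha = 0.05
df = 2
# q_{1-alpha} of the chi-squared distribution with df degrees of freedom;
# the critical region is W = ]critical_value, +inf[
critical_value = chi2.ppf(1 - alpha, df)

print(critical_value)  # about 5.99
```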

15.5.2 P-value

  • \(pValue=\mathbb{P}\left(\chi_{df}^2>\mathbb{X}_{obs}^2\right)\)
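The p-value is the upper-tail probability of the \(\chi_{df}^2\) distribution beyond the observed statistic, which `scipy.stats.chi2.sf` (the survival function) gives directly. Using \(\mathbb{X}_{obs}^2 = 4.0\) and \(df = 2\) from the hypothetical example:

```python
from scipy.stats import chi2

df = 2
chi2_obs = 4.0  # observed statistic from the hypothetical 2x3 table
# P(chi^2_df > chi2_obs): upper-tail probability
p_value = chi2.sf(chi2_obs, df)

print(p_value)  # about 0.135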

15.6 Decision

15.6.1 Decision based on Critical Region

  • Reject \(\mathcal{H}_0\) if and only if \(\mathbb{X}_{obs}^2\in W\)

15.6.2 Decision based on P-value

  • Reject \(\mathcal{H}_0\) if and only if \(pValue<\alpha\)