16 Simple linear regression
16.1 Data
- \(\left(x_i,y_i\right)\in\mathbb{R}^2,\ i=1,\cdots,n\)
16.2 Regression of \(Y\) on \(X\)
16.2.1 Model
We assume that each \(y_i\) is a realization of a random variable \(Y_i\).
Model: \(Y_i=a+bx_i+\epsilon_i\) with:
- \(\mathbb{E}[\epsilon_i]=0\),
- \(\mathbb{V}ar(\epsilon_i)=\sigma^2\),
- \(\mathbb{C}ov(\epsilon_i, \epsilon_j)=0\) if \(i\neq j\).
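A minimal Python sketch of this data-generating process; the values of \(a\), \(b\), \(\sigma\) and the design points are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example values, not from the notes
n, a, b, sigma = 50, 1.0, 2.0, 0.5

x = rng.uniform(0.0, 10.0, size=n)    # design points x_i (held fixed)
eps = rng.normal(0.0, sigma, size=n)  # E[eps_i]=0, Var(eps_i)=sigma^2, uncorrelated
y = a + b * x + eps                   # y_i = a + b x_i + eps_i
```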
16.2.2 What do we need?
- \(n,\sum_ix_i,\sum_iy_i, \sum_ix_i^2, \sum_iy_i^2, \sum_ix_iy_i\)
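Every formula below depends on the data only through these six quantities; a minimal sketch computing them (the toy data are arbitrary):

```python
import numpy as np

# Arbitrary toy data for the illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
sum_x, sum_y = x.sum(), y.sum()
sum_x2, sum_y2 = (x**2).sum(), (y**2).sum()
sum_xy = (x * y).sum()
```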
16.3 Statistics
16.3.1 Means
\(\overline{x} = \dfrac{1}{n}\sum_{i=1}^nx_i\)
\(\overline{y} = \dfrac{1}{n}\sum_{i=1}^ny_i\)
16.3.2 Variances
\(S_x^2=\dfrac{1}{n}\sum_{i=1}^n(x_i-\overline{x})^2=\dfrac{1}{n}\sum_{i=1}^nx_i^2-\overline{x}^2\)
\(S_y^2=\dfrac{1}{n}\sum_{i=1}^n(y_i-\overline{y})^2=\dfrac{1}{n}\sum_{i=1}^ny_i^2-\overline{y}^2\)
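Both shortcut forms follow by expanding the square; written out once for \(S_x^2\):
\(\dfrac{1}{n}\sum_{i=1}^n(x_i-\overline{x})^2=\dfrac{1}{n}\sum_{i=1}^nx_i^2-2\overline{x}\cdot\dfrac{1}{n}\sum_{i=1}^nx_i+\overline{x}^2=\dfrac{1}{n}\sum_{i=1}^nx_i^2-\overline{x}^2\).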
16.3.3 Covariance-Correlation
Covariance: \(S_{x,y}=\dfrac{1}{n}\sum_{i=1}^n(x_i-\overline{x})(y_i-\overline{y})=\dfrac{1}{n}\sum_{i=1}^nx_iy_i-\overline{x}\,\overline{y}\)
Correlation: \(R_{x,y}=\dfrac{S_{x,y}}{S_xS_y}\in[-1, 1]\)
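A sketch turning the six sums of 16.2.2 into the statistics above (same arbitrary toy data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # arbitrary toy data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

xbar, ybar = x.sum() / n, y.sum() / n
Sx2 = (x**2).sum() / n - xbar**2          # S_x^2, 1/n convention
Sy2 = (y**2).sum() / n - ybar**2          # S_y^2
Sxy = (x * y).sum() / n - xbar * ybar     # covariance S_{x,y}
Rxy = Sxy / np.sqrt(Sx2 * Sy2)            # correlation, in [-1, 1]
```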
16.4 Least Squares Estimation
16.4.1 Coefficients
Slope: \(\widehat{b}=\dfrac{S_{x,y}}{S_x^2}\)
Intercept: \(\widehat{a} = \overline{y}-\widehat{b}\overline{x}\)
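The coefficients in the same sketch style (variable names `bhat`, `ahat` are my own):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # arbitrary toy data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xbar, ybar = x.mean(), y.mean()
Sx2 = x.var()                             # np.var uses 1/n, matching S_x^2
Sxy = np.mean((x - xbar) * (y - ybar))    # S_{x,y}

bhat = Sxy / Sx2           # slope b-hat
ahat = ybar - bhat * xbar  # intercept a-hat
```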
16.4.2 Predictions
\(\widehat{y}_{i}=\widehat{a}+\widehat{b}x_{i}\)
\(\widehat{y}_{new}=\widehat{a}+\widehat{b}x_{new}\)
Note: \(\sum_{i=1}^n\widehat{y}_i=\sum_{i=1}^ny_i\) (since \(\widehat{y}_i=\overline{y}+\widehat{b}(x_i-\overline{x})\), the \(\widehat{b}\)-term sums to zero)
16.4.3 Residuals
\(r_i=y_i-\widehat{y}_i=y_i-\widehat{a}-\widehat{b}x_i\)
Note: \(\sum_{i=1}^nr_i=0\) (immediate from the previous note, since \(\sum_ir_i=\sum_iy_i-\sum_i\widehat{y}_i\))
16.4.4 Residual Variance
- \(S_r^2=\dfrac{1}{n}\sum_ir_i^2=\dfrac{1}{n}\sum_i\left(y_i-\widehat{y}_i\right)^2\)
16.4.5 Sum of Squares
Total: \(TSS = \sum_i\left(y_i-\overline{y}\right)^2\)
Explained: \(ESS = \sum_i\left(\widehat{y}_i-\overline{y}\right)^2\)
Residual: \(RSS = \sum_ir_i^2\)
Note: \(TSS = ESS + RSS\)
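Why the decomposition holds: insert \(\widehat{y}_i\) and expand; the cross term vanishes because \(\sum_ir_i=0\) and \(\sum_ir_ix_i=0\) (the normal equations):
\(TSS=\sum_i\left(r_i+(\widehat{y}_i-\overline{y})\right)^2=RSS+ESS+2\sum_ir_i\left(\widehat{y}_i-\overline{y}\right)\),
and \(\sum_ir_i\left(\widehat{y}_i-\overline{y}\right)=\left(\widehat{a}-\overline{y}\right)\sum_ir_i+\widehat{b}\sum_ir_ix_i=0\).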
16.4.6 Coefficient of Determination
- \(R^2=\dfrac{ESS}{TSS}=1-\dfrac{RSS}{TSS}=R_{x,y}^2\in[0,1]\)
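A sketch checking the identities of 16.4.2–16.4.6 numerically on the same arbitrary toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # arbitrary toy data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xbar, ybar = x.mean(), y.mean()
bhat = np.mean((x - xbar) * (y - ybar)) / x.var()  # S_{x,y} / S_x^2
ahat = ybar - bhat * xbar

yhat = ahat + bhat * x        # fitted values y-hat_i
r = y - yhat                  # residuals r_i

TSS = ((y - ybar)**2).sum()
ESS = ((yhat - ybar)**2).sum()
RSS = (r**2).sum()

assert np.isclose(r.sum(), 0.0)       # residuals sum to zero
assert np.isclose(TSS, ESS + RSS)     # TSS = ESS + RSS
R2 = ESS / TSS                        # coefficient of determination
assert np.isclose(R2, np.corrcoef(x, y)[0, 1]**2)  # R^2 = R_{x,y}^2
```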
16.5 Gaussian linear regression
16.5.1 Assumptions
We assume that the \(y_i\) are realizations of random variables \(Y_i\).
Model: the \(Y_i\) are independent, with \(Y_i\sim\mathcal{N}\left(\beta_0+\beta_1x_i,\sigma^2\right)\) (independent but not identically distributed: the means vary with \(x_i\)).
16.5.2 Maximum Likelihood Estimator
\(\widehat{\beta}_0=\widehat{a}\) and \(\widehat{\beta}_1=\widehat{b}\): the MLEs of the coefficients coincide with the least squares estimators.
\(\widehat{\sigma^2}=\dfrac{1}{n}\sum_i\left(Y_i-\widehat{Y}_i\right)^2=\dfrac{1}{n}RSS\)
Note: \(\widehat{\beta}:=\left(\widehat{\beta}_0, \widehat{\beta}_1\right)\) is unbiased, but \(\widehat{\sigma^2}\) is not!
\(\widehat{\sigma^2}_{ub}=\dfrac{1}{n-2}RSS\) is an unbiased estimator of \(\sigma^2\).
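In code, the two variance estimators differ only in the denominator (same arbitrary toy data; a sketch, not a full fit routine):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # arbitrary toy data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

xbar, ybar = x.mean(), y.mean()
bhat = np.mean((x - xbar) * (y - ybar)) / x.var()
ahat = ybar - bhat * xbar
RSS = ((y - ahat - bhat * x)**2).sum()

sigma2_mle = RSS / n        # maximum likelihood estimator (biased)
sigma2_ub = RSS / (n - 2)   # unbiased estimator
```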
16.5.3 Inference
Unbiased estimate of the variance of \(\widehat{\beta}_0\): \(S^2\left(\widehat{\beta}_0\right)=\dfrac{RSS}{n-2}\left(\dfrac{1}{n}+\dfrac{\overline{x}^2}{nS_x^2}\right)\).
Unbiased estimate of the variance of \(\widehat{\beta}_1\): \(S^2\left(\widehat{\beta}_1\right)=\dfrac{RSS}{n(n-2)S_x^2}\).
Distribution of \(\widehat{\beta}_j\): \(\dfrac{\widehat{\beta}_j-\beta_j}{S\left(\widehat{\beta}_j\right)}\sim \mathcal{T}_{n-2}\).
Confidence Interval: \(CI_{1-\alpha}\left(\beta_j\right)=\widehat{\beta}_j\pm t_{1-\alpha/2, n-2}S\left(\widehat{\beta}_j\right)\).
Test of \(\mathcal{H}_0:\ \beta_j=0\): \(pValue=2\mathbb{P}\left(\mathcal{T}_{n-2}>\dfrac{|\widehat{\beta}_j|}{S\left(\widehat{\beta}_j\right)}\right)\).
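A sketch of the full inference step for the slope \(\beta_1\) (\(j=1\)), using scipy.stats.t for the Student quantile and tail probability; the toy data and \(\alpha=0.05\) are arbitrary choices:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # arbitrary toy data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

xbar, ybar = x.mean(), y.mean()
Sx2 = x.var()                                   # 1/n convention
bhat = np.mean((x - xbar) * (y - ybar)) / Sx2   # beta_1 hat
ahat = ybar - bhat * xbar                       # beta_0 hat
RSS = ((y - ahat - bhat * x)**2).sum()

# Unbiased variance estimates of the coefficient estimators
S2_b0 = RSS / (n - 2) * (1.0 / n + xbar**2 / (n * Sx2))
S2_b1 = RSS / ((n - 2) * n * Sx2)

alpha = 0.05
tq = stats.t.ppf(1 - alpha / 2, df=n - 2)       # t_{1-alpha/2, n-2}
ci_b1 = (bhat - tq * np.sqrt(S2_b1), bhat + tq * np.sqrt(S2_b1))

# Two-sided p-value for H0: beta_1 = 0
pvalue = 2 * stats.t.sf(abs(bhat) / np.sqrt(S2_b1), df=n - 2)
```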