8. Fitting regression models¶

8.1 Least squares, maximum likelihood, and Bayesian inference¶

Inference: the steps of estimating the regression coefficients and assessing the uncertainty of the fit.

In the classical regression model $y_i = a + b x_i + \epsilon_i$, $a$ and $b$ are estimated to minimize the errors $\epsilon_i$. For $n > 2$ it is generally not possible to find an exact fit, i.e., one with zero error for all data points $i = 1, \ldots, n$. Instead we choose estimates $\hat{a}$ and $\hat{b}$ to minimize the sum of the squared residuals $r_i = y_i - \hat{a} - \hat{b} x_i$.

Distinguish between residuals $r_i = y_i - \hat{a} - \hat{b} x_i$ and errors $\epsilon_i = y_i - a - b x_i$. We can observe residuals but not errors: computing the errors would require knowing the true parameters $a$ and $b$. We can only estimate $a$ and $b$, and thus we can only approximate the errors $\epsilon_i$ by the residuals $r_i$.

The residual sum of squares is $RSS = \sum_{i=1}^n (y_i - \hat{a} - \hat{b} x_i)^2$. The $(\hat{a}, \hat{b})$ that minimize RSS are called the least squares or ordinary least squares (OLS) estimates and can be written in matrix form as $\hat{\beta} = (X^T X)^{-1} X^T y$, where $\beta = (a, b)$ is the vector of coefficients and $X = (1, x)$ is the matrix of predictors. In this notation $1$ is a column of 1s for the constant term, because we are fitting a model with an intercept.

The previous formula holds for any number of predictors. For a single predictor it reduces to $\hat{b} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$ and $\hat{a} = \bar{y} - \hat{b} \bar{x}$.

We can then write the least squares line as $y_i = \bar{y} + \hat{b} (x_i - \bar{x}) + r_i$. As this form shows, the least squares line goes through the mean of the data, $(\bar{x}, \bar{y})$.
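As a quick sketch on made-up data (the numbers below are purely illustrative), the matrix formula and the one-predictor formulas give the same estimates, and the fitted line passes through $(\bar{x}, \bar{y})$:

```python
import numpy as np

# Illustrative data (assumed, not from the text)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Matrix form: beta_hat = (X^T X)^{-1} X^T y, with a column of 1s for the intercept
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (a_hat, b_hat)

# One-predictor formulas
b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a_hat = y.mean() - b_hat * x.mean()

assert np.allclose(beta_hat, [a_hat, b_hat])
# The fitted line goes through the mean of the data (x_bar, y_bar)
assert np.isclose(a_hat + b_hat * x.mean(), y.mean())
```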

Estimation of the residual standard deviation $\sigma$¶

The errors $\epsilon_i$ come from a distribution with mean 0 and standard deviation $\sigma$. The mean is zero by definition, and the standard deviation is estimated from the data. We might think we can estimate $\sigma$ from the standard deviation of the residuals, but this would slightly underestimate it because of overfitting: $\hat{a}$ and $\hat{b}$ are chosen from the data precisely to minimise the sum of squared residuals. We can correct for this bias by dividing the sum of squared residuals by $n-2$ instead of $n$, where $n$ is the number of data points and 2 is the number of parameters estimated ($a$ and $b$). Thus we estimate $\sigma$ as $\hat{\sigma} = \sqrt{\frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{a} - \hat{b} x_i)^2}$. For $n \le 2$ this estimate is undefined: with two data points the line fits exactly and there is no information left about $\sigma$.

More generally, with $k$ predictors ($y = X\beta + \epsilon$, with an $n \times k$ matrix $X$), we divide by $n-k$, so $\sigma$ is estimated as $\hat{\sigma} = \sqrt{\frac{1}{n-k} \sum_{i=1}^n (y_i - X_i \hat{\beta})^2}$.
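A minimal sketch with simulated data (the true coefficients, noise level, and seed are assumptions for illustration) showing the $n-k$ correction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_sigma = 200, 2.0
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, true_sigma, n)  # simulated data, sigma = 2

X = np.column_stack([np.ones(n), x])  # k = 2 columns: intercept and slope
k = X.shape[1]
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - k))  # divide by n - k, not n
```

With a reasonably large $n$, `sigma_hat` should land near the true value of 2 used in the simulation.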

Maximum likelihood¶

Main idea: when errors are normally distributed, least squares estimates are same as maximum likelihood estimates.

If the errors $\epsilon_i$ are independent and normally distributed, so that $y_i \sim \text{normal}(a + b x_i, \sigma)$ for each $i$, then the least squares estimates $\hat{a}$ and $\hat{b}$ are also the maximum likelihood estimates.

The likelihood function is the probability density of the data given the parameters and predictors. In this example: $p(y | a, b, \sigma, X) = \prod_{i=1}^n \text{normal}(y_i | a + b x_i, \sigma)$, where $\text{normal}(\cdot|\cdot,\cdot)$ is the normal probability density function, $\text{normal}(y | m, \sigma) = \frac{1}{\sqrt{2 \pi}\, \sigma} \exp\!\left(-\frac{(y-m)^2}{2 \sigma^2}\right)$.

Maximising the likelihood requires minimising the sum of squared residuals, so the least squares estimates are also the maximum likelihood estimates. One small twist: the maximum likelihood estimate of $\sigma$ is $\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{a} - \hat{b} x_i)^2}$, which is slightly smaller than the least squares estimate because we divide by $n$ instead of $n-2$.
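The relationship between the two estimates of $\sigma$ can be checked directly on simulated data (assumed purely for illustration): their ratio is exactly $\sqrt{(n-2)/n}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 0.3 + 1.2 * x + rng.normal(0, 1.0, n)  # simulated data for illustration

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta_hat) ** 2)

sigma_ls = np.sqrt(rss / (n - 2))   # degrees-of-freedom-corrected estimate
sigma_mle = np.sqrt(rss / n)        # maximum likelihood estimate
assert sigma_mle < sigma_ls         # MLE is slightly smaller for any n > 2
assert np.isclose(sigma_mle / sigma_ls, np.sqrt((n - 2) / n))
```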

In Bayesian inference, the uncertainty for each parameter automatically accounts for the uncertainty in the other parameters. This is especially relevant for models with many parameters.

Where do the standard errors come from? Using the likelihood surface to assess uncertainty in the parameter estimates¶

Under maximum likelihood, the likelihood function can be viewed as a hill with its peak at the maximum likelihood estimates. With two parameters $a$ and $b$, we can plot the likelihood as a surface over the $(a, b)$ plane; the curvature of this surface near its peak determines the standard errors of the estimates.
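A sketch of evaluating that likelihood surface on a grid, with $\sigma$ fixed at an assumed value for simplicity (a fuller treatment would profile or average over $\sigma$); the simulated data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = rng.uniform(0, 1, n)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, n)  # simulated data for illustration

# Least squares estimates, used here to centre the grid
X = np.column_stack([np.ones(n), x])
a_hat, b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Evaluate the normal log-likelihood on a grid of (a, b), sigma held fixed
sigma = 0.5
a_grid = np.linspace(a_hat - 1, a_hat + 1, 201)
b_grid = np.linspace(b_hat - 1, b_hat + 1, 201)
A, B = np.meshgrid(a_grid, b_grid)
resid = y[None, None, :] - A[..., None] - B[..., None] * x[None, None, :]
loglik = (-0.5 * (resid / sigma) ** 2).sum(axis=-1) \
         - n * np.log(np.sqrt(2 * np.pi) * sigma)

# The peak of the surface sits at the least squares estimates
i, j = np.unravel_index(np.argmax(loglik), loglik.shape)
assert np.isclose(A[i, j], a_hat) and np.isclose(B[i, j], b_hat)
```

For fixed $\sigma$, maximising this surface is the same as minimising RSS, which is why the grid peak coincides with the least squares solution.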

Bayesian inference¶

Least squares and maximum likelihood find the parameters that best fit the data, without otherwise guiding the fit. Bayesian inference lets us incorporate prior beliefs about the parameters into the analysis: multiplying the likelihood by the prior gives the posterior distribution, a compromise between the data and the prior that probabilistically encodes external information about the parameters.

There are many ways to look at Bayesian inference. Here we view the posterior distribution as a modification of the likelihood function. A generalisation of maximum likelihood estimation is maximum penalized likelihood estimation, in which the prior distribution acts as a penalty function that downgrades the likelihood for less favoured values of the parameters, so the estimate lies between the prior and the data. This anchoring can make the maximum penalized likelihood estimate more stable than the raw maximum likelihood or least squares estimate.

In simple terms, the prior information penalises the likelihood for values of the parameters that are less likely a priori, and thus the maximum penalized likelihood estimate is a compromise between the prior and the data.

In addition to prior information, Bayesian inference is distinctive in that it expresses uncertainty using probability. When we fit a model, we get a set of simulation draws that represent the posterior distribution of the parameters. We can use these draws to compute summaries of the posterior distribution, such as the mean, median, or credible intervals.

Point estimate, mode-based approximation, and posterior simulations¶

Least squares gives a point estimate: the coefficients that provide the best fit to the data. For Bayesian models the corresponding point estimate is the mode of the posterior distribution. The least squares or maximum likelihood estimate is the posterior mode when the prior is flat; when the prior is not flat, the posterior mode is a compromise between the likelihood and the prior.

We don't want just an estimate; we also want to know how uncertain the estimate is. This can be represented as the estimate plus or minus a standard error.

It is convenient to summarise uncertainty using simulations, either from the mode-based approximation or, more generally, from the posterior distribution.
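A minimal sketch of the mode-based approximation on simulated data: draw from a multivariate normal centred at the least squares estimate with the classical covariance $\hat{\sigma}^2 (X^T X)^{-1}$ (the data and seed are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 10, n)
y = 1.0 + 0.4 * x + rng.normal(0, 1.5, n)  # simulated data for illustration

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma_hat = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - 2))
cov = sigma_hat ** 2 * np.linalg.inv(X.T @ X)  # classical covariance of beta_hat

# 4000 simulation draws from the normal approximation at the mode
draws = rng.multivariate_normal(beta_hat, cov, size=4000)
se = draws.std(axis=0)                             # simulation-based standard errors
lo, hi = np.percentile(draws[:, 1], [2.5, 97.5])   # 95% interval for the slope
```

The simulation-based standard errors should closely match the square roots of the diagonal of `cov`; the advantage of draws is that any summary (intervals, derived quantities) falls out the same way.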

8.2 Influence of individual points in a fitted regression¶

As shown previously, the least squares regression coefficients $\hat{a}$ and $\hat{b}$ are linear functions of the data $y$. We can use this to see how much each data point influences the fitted regression line, by looking at how much a change in each $y_i$ changes the estimates; here we focus on the influence on the slope $\hat{b}$.

From:

$\hat{b} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$

we can see that an increase of 1 in $y_{i}$ corresponds to a change in $\hat{b}$ that is proportional to $x_i - \bar{x}$.

  • If $x_{i} = \bar{x}$, influence of i on regression slope is zero, because $x_i - \bar{x} = 0$.
  • If $x_{i} > \bar{x}$, influence of i on regression slope is positive, because $x_i - \bar{x} > 0$.
  • If $x_{i} < \bar{x}$, influence of i on regression slope is negative, because $x_i - \bar{x} < 0$.
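This can be verified numerically on simulated data (assumed for illustration): bumping any single $y_i$ by 1 changes the slope by exactly $(x_i - \bar{x}) / \sum_j (x_j - \bar{x})^2$:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
x = rng.uniform(0, 10, n)
y = 2.0 + 0.7 * x + rng.normal(0, 1.0, n)  # simulated data for illustration

def slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

b_hat = slope(x, y)
Sxx = np.sum((x - x.mean()) ** 2)
for i in range(n):
    y_bumped = y.copy()
    y_bumped[i] += 1.0                       # increase y_i by 1
    delta_b = slope(x, y_bumped) - b_hat
    # The change in the slope is exactly (x_i - x_bar) / Sxx
    assert np.isclose(delta_b, (x[i] - x.mean()) / Sxx)
```

Points far from $\bar{x}$ produce large `delta_b`, points near $\bar{x}$ produce almost none, matching the bullet points above.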

Imagine the fitted line as a rod attached to the data points with rubber bands. The rod is attached at the mean of the data, and the rubber bands pull it towards the data points. The influence of each data point on the slope of the fitted line depends on how far it is from the mean of the x values: data points far from the mean have more influence on the slope than data points close to the mean.

Same mathematical principle applies for multiple regression, but the influence of each data point on each coefficient is more complex because it depends on the values of all the predictors. In general, data points that are far from the mean of the predictors have more influence on the coefficients than data points that are close to the mean of the predictors.

8.3 Least squares slope as a weighted average of slopes of pairs¶

In Section 7.3 we saw that when $y = a + b x + \epsilon$ with an indicator variable ($x = 0$ or $1$), the least squares slope is the difference in means between the two groups: $\hat{b} = \bar{y}_1 - \bar{y}_0$.
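A quick numerical check on simulated data (assumed for illustration) that the least squares slope with a binary predictor equals the difference in group means exactly:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.repeat([0, 1], 30)                   # indicator predictor, two groups of 30
y = np.where(x == 1, 5.0, 2.0) + rng.normal(0, 1.0, 60)  # simulated outcomes

b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
diff_means = y[x == 1].mean() - y[x == 0].mean()
assert np.isclose(b_hat, diff_means)        # identical, not just approximately equal
```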

Something similar holds for a continuous predictor: the least squares slope can be expressed as a weighted average of the slopes of pairs of data points.

With $n$ data points $(x_i, y_i)$ there are $n^2$ ordered pairs of data points. For each pair $(i, j)$ we can compute the slope of the line connecting the two points, $s_{ij} = \frac{y_j - y_i}{x_j - x_i}$. This breaks down when $x_j = x_i$, but we can ignore those pairs because they receive zero weight and so don't contribute to the slope of the fitted line.

We want the best-fit regression slope to be a weighted average of these pairwise slopes, so that two points further apart in $x$ have more influence on the slope of the fitted line than points closer together. Assuming a normal distribution for the errors, the variance of $s_{ij}$ is proportional to $1/(x_j - x_i)^2$, so inverse-variance weighting gives each pairwise slope a weight proportional to $(x_j - x_i)^2$.

Weighted average of slopes of pairs is then:

$\hat{b} = \frac{ \sum_{i, j} (x_{j} - x_i)^2 \frac{y_j - y_i}{x_j - x_i} }{ \sum_{i, j} (x_{j} - x_i)^2 }$

= $\frac{ \sum_{i, j} (x_{j} - x_i) (y_j - y_i) }{ \sum_{i, j} (x_{j} - x_i)^2 }$

$\hat{b}$ is the weighted average of the slopes of pairs.
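This identity can be checked numerically on simulated data (assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 15
x = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.8, n)  # simulated data for illustration

# Weighted average of pairwise slopes, with weights (x_j - x_i)^2
num = 0.0
den = 0.0
for i in range(n):
    for j in range(n):
        dx = x[j] - x[i]
        if dx != 0:                    # pairs with x_i = x_j carry zero weight
            num += dx * (y[j] - y[i])  # (x_j - x_i)^2 * s_ij simplifies to this
            den += dx ** 2

b_pairs = num / den
b_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
assert np.isclose(b_pairs, b_ols)      # weighted pairwise average = OLS slope
```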

8.4 Comparing two fitting functions: lm and stan_glm¶

In [1]:
import sys
print(sys.executable)
/opt/anaconda3/envs/pymc_env/bin/python
In [2]:
import pymc as pm  # Import the PyMC probabilistic programming library for Bayesian modeling
import numpy as np  # Import NumPy for numerical array operations

x = np.arange(1, 11)  # Create an array of integers from 1 to 10 as the independent variable
y = np.array([1, 1, 2, 3, 5, 8, 13, 21, 34, 55])  # Define observed data (Fibonacci-like sequence) as the dependent variable

with pm.Model() as model:  # Create a new PyMC model context to define the Bayesian model
    intercept = pm.Normal("intercept", mu=0, sigma=50)  # Define a Normal prior for the intercept with mean 0 and std 50
    slope = pm.Normal("slope", mu=0, sigma=50)  # Define a Normal prior for the slope with mean 0 and std 50
    sigma = pm.HalfNormal("sigma", sigma=50)  # Define a HalfNormal prior for the noise std (constrained positive)

    mu = intercept + slope * x  # Compute the expected value of y as a linear function of x
    likelihood = pm.Normal("y", mu=mu, sigma=sigma, observed=y)  # Define the likelihood: observed y ~ Normal(mu, sigma)

    trace = pm.sample(2000, tune=1000)  # Run MCMC sampling: 2000 posterior draws after 1000 tuning steps

print(pm.summary(trace, hdi_prob=0.95))  # Print posterior summary with 95% HDI (highest density interval)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [intercept, slope, sigma]

Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 1 seconds.
There were 5 divergences after tuning. Increase `target_accept` or reparameterize.
             mean     sd  hdi_2.5%  hdi_97.5%  mcse_mean  mcse_sd  ess_bulk  \
intercept -13.384  7.980   -29.990      2.621      0.184    0.187    1967.0   
slope       5.044  1.287     2.441      7.647      0.030    0.029    1976.0   
sigma      11.033  3.493     5.724     17.782      0.081    0.106    2140.0   

           ess_tail  r_hat  
intercept    1947.0    1.0  
slope        2103.0    1.0  
sigma        2533.0    1.0  

The confidence intervals above can also be called "uncertainty intervals" or "compatibility intervals".

8.5 Bibliographic note¶

8.6 Exercises¶