6. Background on regression modeling¶
The methods in this book serve two main purposes: prediction and comparison.
We use regression to predict an outcome (or the distribution of an outcome) from a set of predictors/inputs. We can then compare predictions for different inputs to compare groups or estimate causal effects.
6.1 Regression models¶
The simplest regression is linear with a single predictor:
$y = a + b x + \epsilon$
where $a$ is the intercept, $b$ is the slope (together, the coefficients or parameters), and $\epsilon$ is the error term.
This simple model can be extended in various ways:
- Multiple predictors: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + \epsilon$ in vector form: $y = X \beta + \epsilon$
- Nonlinear relationships: $y = a + b x^2 + \epsilon$ or $y = a + b \log(x) + \epsilon$
- Non-additive models: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon$ which has an interaction between $x_1$ and $x_2$.
- Generalized linear models, which extend linear regression to other types of outcomes (discrete outcomes, e.g., binary or count data).
- Nonparametric regression, which does not assume a specific functional form and allows for essentially arbitrary relationships between predictors and outcome.
- Multilevel/hierarchical models, in which coefficients vary by group or cluster. Useful for data with nested structure (e.g., students within schools, patients within hospitals).
- Measurement error models, where predictors and outcomes are measured with error.
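A sketch of how a few of these extensions look in statsmodels formula syntax; the data, variable names, and coefficient values here are purely illustrative, not from the book:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative fake data with two predictors and an interaction
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
df['y'] = 1 + 2 * df['x1'] - df['x2'] + 0.5 * df['x1'] * df['x2'] + rng.normal(0, 1, n)

# Multiple predictors: y = b0 + b1*x1 + b2*x2
m_additive = smf.ols('y ~ x1 + x2', data=df).fit()

# Non-additive model: 'x1 * x2' expands to x1 + x2 + x1:x2 (interaction)
m_interact = smf.ols('y ~ x1 * x2', data=df).fit()

# Nonlinear transformation of a predictor: y = a + b*x1^2
m_nonlin = smf.ols('y ~ I(x1 ** 2)', data=df).fit()

print(m_interact.params)
```

The interaction fit recovers coefficients near the simulated values (2, -1, 0.5), while the additive fit has no term for the product.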
6.2 Fitting a simple regression to fake data¶
Create fake data with relationship $y = 0.2 + 0.3 x + \epsilon$ where $\epsilon \sim N(0, 0.5)$. Errors are normally distributed with mean 0 and standard deviation 0.5.
Fitting a regression and displaying the results¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.robust.scale import mad

def fit_and_plot_lm(data, predictors, outcome, add_constant=True, show_plot=True, scatter_kws=None, line_kws=None):
    """
    Fit a linear model using statsmodels, print summary, plot, and show formula.
    Args:
        data: pandas DataFrame
        predictors: list of predictor column names (str)
        outcome: outcome column name (str)
        add_constant: whether to add intercept (default True)
        show_plot: whether to plot (default True)
        scatter_kws: dict, kwargs for scatterplot
        line_kws: dict, kwargs for regression line
    """
    X = data[predictors].copy()
    if add_constant:
        X = sm.add_constant(X, prepend=False)
    y = data[outcome]
    model = sm.OLS(y, X)
    results = model.fit()
    print(results.summary())
    # Print fitted formula with rounded coefficients
    params = results.params
    terms = " + ".join([f"{params[name]:.2f}*{name}" for name in predictors])
    if add_constant:
        formula = f"{outcome} = {params['const']:.2f} + " + terms
    else:
        formula = f"{outcome} = " + terms
    print("Formula:", formula)
    # Print residual standard deviation and its approximate standard error
    sigma = np.sqrt(results.mse_resid)
    sigma_se = sigma / np.sqrt(2 * results.df_resid)
    print(f"Residual std dev (σ): {sigma:.2f} ± {sigma_se:.2f}")
    # Print median absolute deviation of residuals
    print("MAD of residuals:", round(mad(results.resid), 2))
    # Plot data and fitted line if there is only one predictor
    if show_plot and len(predictors) == 1:
        x_name = predictors[0]
        ax = sns.scatterplot(data=data, x=x_name, y=outcome, **(scatter_kws or {}))
        x_vals = np.linspace(data[x_name].min(), data[x_name].max(), 100)
        y_vals = params[x_name] * x_vals
        if add_constant:
            y_vals = y_vals + params['const']
        ax.plot(x_vals, y_vals, color='red', **(line_kws or {}))
        ax.set_title('Linear Regression Fit')
        plt.show()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
# x: integers 1 through 20
x = np.arange(1, 21)
n = len(x)
a = 0.2
b = 0.3
sigma = 0.5
y = a + b * x + np.random.normal(0, sigma, n)
data = pd.DataFrame({'x': x, 'y': y})
display(data.head())
sns.scatterplot(x='x', y='y', data=data)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter plot of x and y')
plt.show()
fit_and_plot_lm(data, predictors=['x'], outcome='y')
|   | x | y |
|---|---|---|
| 0 | 1 | -0.110005 |
| 1 | 2 | 0.047969 |
| 2 | 3 | 1.255265 |
| 3 | 4 | 1.533453 |
| 4 | 5 | 2.216840 |
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.906
Model: OLS Adj. R-squared: 0.901
Method: Least Squares F-statistic: 173.1
Date: Wed, 11 Mar 2026 Prob (F-statistic): 1.13e-10
Time: 07:55:36 Log-Likelihood: -15.161
No. Observations: 20 AIC: 34.32
Df Residuals: 18 BIC: 36.31
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x 0.2777 0.021 13.156 0.000 0.233 0.322
const 0.3336 0.253 1.319 0.204 -0.198 0.865
==============================================================================
Omnibus: 1.195 Durbin-Watson: 1.706
Prob(Omnibus): 0.550 Jarque-Bera (JB): 0.850
Skew: -0.138 Prob(JB): 0.654
Kurtosis: 2.028 Cond. No. 25.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Formula: y = 0.33 + 0.28*x
Residual std dev (σ): 0.54 ± 0.09
MAD of residuals: 0.59
Comparing estimates to assumed parameter values¶
The true intercept $a$ was 0.2; the estimate was 0.33 with standard error 0.25, so the estimate fell well within one standard error of the true value. Roughly, we can expect the difference between estimate and true value to be within 1 standard error about 68% of the time, within 2 standard errors about 95% of the time, and within 3 standard errors about 99.7% of the time.
This is just a single simulation and this will vary across simulations. If we repeated this process many times, we would expect the estimates to be distributed around the true values.
6.3 Interpret coefficients as comparisons, not effects¶
Regression coefficients are commonly called "effects", but they are not necessarily causal effects.
earnings = pd.read_csv('../ros_data/earnings.csv')
earnings['earn_k'] = earnings['earn'] / 1000
display(earnings.head())
fit_and_plot_lm(earnings, predictors=['height', 'male'], outcome='earn_k')
|   | height | weight | male | earn | earnk | ethnicity | education | mother_education | father_education | walk | exercise | smokenow | tense | angry | age | earn_k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 74 | 210.0 | 1 | 50000.0 | 50.0 | White | 16.0 | 16.0 | 16.0 | 3 | 3 | 2.0 | 0.0 | 0.0 | 45 | 50.0 |
| 1 | 66 | 125.0 | 0 | 60000.0 | 60.0 | White | 16.0 | 16.0 | 16.0 | 6 | 5 | 1.0 | 0.0 | 0.0 | 58 | 60.0 |
| 2 | 64 | 126.0 | 0 | 30000.0 | 30.0 | White | 16.0 | 16.0 | 16.0 | 8 | 1 | 2.0 | 1.0 | 1.0 | 29 | 30.0 |
| 3 | 65 | 200.0 | 0 | 25000.0 | 25.0 | White | 17.0 | 17.0 | NaN | 8 | 1 | 2.0 | 0.0 | 0.0 | 57 | 25.0 |
| 4 | 63 | 110.0 | 0 | 50000.0 | 50.0 | Other | 16.0 | 16.0 | 16.0 | 5 | 6 | 2.0 | 0.0 | 0.0 | 91 | 50.0 |
OLS Regression Results
==============================================================================
Dep. Variable: earn_k R-squared: 0.100
Model: OLS Adj. R-squared: 0.099
Method: Least Squares F-statistic: 100.3
Date: Wed, 11 Mar 2026 Prob (F-statistic): 4.88e-42
Time: 07:55:36 Log-Likelihood: -8137.7
No. Observations: 1816 AIC: 1.628e+04
Df Residuals: 1813 BIC: 1.630e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
height 0.6470 0.185 3.493 0.000 0.284 1.010
male 10.6327 1.468 7.241 0.000 7.753 13.512
const -25.8722 11.962 -2.163 0.031 -49.332 -2.412
==============================================================================
Omnibus: 1902.421 Durbin-Watson: 1.895
Prob(Omnibus): 0.000 Jarque-Bera (JB): 249218.549
Skew: 4.821 Prob(JB): 0.00
Kurtosis: 59.575 Cond. No. 1.59e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.59e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Formula: earn_k = -25.87 + 0.65*height + 10.63*male
Residual std dev (σ): 21.39 ± 0.36
MAD of residuals: 14.27
The residual standard deviation indicates that earnings will be within about $21,400 of the predicted value for roughly 68% of people, within $42,800 for about 95%, and within $64,200 for about 99.7%.
$R^2$ is 0.1, meaning that 10% of the variance in earnings is explained by height and sex. This makes sense, as earnings vary a lot and many factors that influence earnings are not in the model.
It is generally inappropriate to label these coefficients as "effects". An effect is the change associated with a treatment or intervention: to speak of the effect of height on earnings implies that if we could intervene and change someone's height, their earnings would change. That is not necessarily the case. The coefficient for height is instead a between-person comparison, not a within-person change: it compares the earnings of different people who happen to have different heights, and does not imply that changing an individual's height would change their earnings.
The safest interpretation is that the coefficient for height is the difference in earnings associated with a one-unit difference in height: "Under the fitted model, comparing two people of the same sex who differ by one inch in height, we would expect the taller person to earn about $650 more on average." This is a comparison, not an effect. Similarly, "comparing two people of the same height but different sex, we would expect the man to earn about $10,600 more than the woman on average."
Under certain conditions such inferences can be causal, but this requires strong assumptions. Start by describing coefficients in predictive or descriptive terms.
Regression is a mathematical tool for making predictions. Coefficients can sometimes be interpreted as effects, but they can always be interpreted as average comparisons.
6.4 Historical origins of regression¶
Daughters’ heights “regressing” to the mean¶
In the model $\text{child} = 34.48 + 0.47 \times \text{parent}$, the intercept 34.48 is the predicted height of a child whose parent is 0 inches tall. This is not meaningful in any real-world context; it is a mathematical artifact of the linear model. The slope 0.47 means that for each additional inch of parent's height, the child's predicted height increases by 0.47 inches on average. Because the slope is less than 1, the child's height "regresses" toward the mean relative to the parent's: very tall parents have children who are shorter than them on average, and very short parents have children who are taller than them on average.
The intercept is not useful as presented; centering the predictor (as in the second fit below) makes it interpretable as the predicted height for a parent of average height.
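Plugging heights into the fitted line quoted above makes the pull toward the mean visible; the coefficients are the rounded values from the fit, used here purely for illustration:

```python
def predict_child(parent_height, a=34.48, b=0.47):
    """Predicted child height from the fitted line child = a + b * parent."""
    return a + b * parent_height

tall, short = 70.0, 58.0
# A very tall parent's child is predicted to be shorter than the parent
print(predict_child(tall))
# A very short parent's child is predicted to be taller than the parent
print(predict_child(short))
```

Both predictions fall between the parent's height and the overall mean, which is exactly what a slope below 1 produces.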
# import PearsonLee.txt; first row is the header
pearson_lee = pd.read_csv('../ros_data/PearsonLee.txt', sep=r'\s+')
display(pearson_lee.head())
# new df with all columns where par = Mother and chl = Daughter
mother_daughter = pearson_lee[(pearson_lee['par'] == 'Mother') & (pearson_lee['chl'] == 'Daughter')]
display(mother_daughter.head())
# sns.scatterplot(x='parent', y='child', data=mother_daughter)
fit_and_plot_lm(mother_daughter, predictors=['parent'], outcome='child')
mother_daughter_centered = mother_daughter.copy()
mother_daughter_centered['parent_c'] = mother_daughter_centered['parent'] - mother_daughter_centered['parent'].mean()
display(mother_daughter_centered.head())
fit_and_plot_lm(mother_daughter_centered, predictors=['parent_c'], outcome='child')
heights = pd.read_csv('../ros_data/heights.txt', sep=r'\s+')
display(heights.head())
fit_and_plot_lm(heights, predictors=['mother_height'], outcome='daughter_height')
|   | child | parent | frequency | gp | par | chl |
|---|---|---|---|---|---|---|
| 1 | 59.5 | 62.5 | 0.5 | fs | Father | Son |
| 2 | 59.5 | 63.5 | 0.5 | fs | Father | Son |
| 3 | 59.5 | 64.5 | 1.0 | fs | Father | Son |
| 4 | 60.5 | 62.5 | 0.5 | fs | Father | Son |
| 5 | 60.5 | 66.5 | 1.0 | fs | Father | Son |
|   | child | parent | frequency | gp | par | chl |
|---|---|---|---|---|---|---|
| 180 | 52.5 | 59.5 | 0.50 | md | Mother | Daughter |
| 181 | 53.5 | 59.5 | 0.50 | md | Mother | Daughter |
| 182 | 55.5 | 59.5 | 1.00 | md | Mother | Daughter |
| 183 | 56.5 | 58.5 | 1.00 | md | Mother | Daughter |
| 184 | 56.5 | 59.5 | 0.25 | md | Mother | Daughter |
OLS Regression Results
==============================================================================
Dep. Variable: child R-squared: 0.218
Model: OLS Adj. R-squared: 0.214
Method: Least Squares F-statistic: 51.16
Date: Wed, 11 Mar 2026 Prob (F-statistic): 1.97e-11
Time: 08:12:12 Log-Likelihood: -509.89
No. Observations: 185 AIC: 1024.
Df Residuals: 183 BIC: 1030.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
parent 0.4664 0.065 7.153 0.000 0.338 0.595
const 34.4827 4.094 8.422 0.000 26.404 42.561
==============================================================================
Omnibus: 9.513 Durbin-Watson: 0.254
Prob(Omnibus): 0.009 Jarque-Bera (JB): 4.926
Skew: -0.179 Prob(JB): 0.0852
Kurtosis: 2.285 Cond. No. 913.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Formula: child = 34.48 + 0.47*parent
Residual std dev (σ): 3.83 ± 0.20
MAD of residuals: 4.35
|   | child | parent | frequency | gp | par | chl | parent_c |
|---|---|---|---|---|---|---|---|
| 180 | 52.5 | 59.5 | 0.50 | md | Mother | Daughter | -3.145946 |
| 181 | 53.5 | 59.5 | 0.50 | md | Mother | Daughter | -3.145946 |
| 182 | 55.5 | 59.5 | 1.00 | md | Mother | Daughter | -3.145946 |
| 183 | 56.5 | 58.5 | 1.00 | md | Mother | Daughter | -4.145946 |
| 184 | 56.5 | 59.5 | 0.25 | md | Mother | Daughter | -3.145946 |
OLS Regression Results
==============================================================================
Dep. Variable: child R-squared: 0.218
Model: OLS Adj. R-squared: 0.214
Method: Least Squares F-statistic: 51.16
Date: Wed, 11 Mar 2026 Prob (F-statistic): 1.97e-11
Time: 08:12:12 Log-Likelihood: -509.89
No. Observations: 185 AIC: 1024.
Df Residuals: 183 BIC: 1030.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
parent_c 0.4664 0.065 7.153 0.000 0.338 0.595
const 63.7000 0.282 226.262 0.000 63.145 64.255
==============================================================================
Omnibus: 9.513 Durbin-Watson: 0.254
Prob(Omnibus): 0.009 Jarque-Bera (JB): 4.926
Skew: -0.179 Prob(JB): 0.0852
Kurtosis: 2.285 Cond. No. 4.32
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Formula: child = 63.70 + 0.47*parent_c
Residual std dev (σ): 3.83 ± 0.20
MAD of residuals: 4.35
|   | daughter_height | mother_height |
|---|---|---|
| 0 | 52.5 | 59.5 |
| 1 | 52.5 | 59.5 |
| 2 | 53.5 | 59.5 |
| 3 | 53.5 | 59.5 |
| 4 | 55.5 | 59.5 |
OLS Regression Results
==============================================================================
Dep. Variable: daughter_height R-squared: 0.252
Model: OLS Adj. R-squared: 0.252
Method: Least Squares F-statistic: 1860.
Date: Wed, 11 Mar 2026 Prob (F-statistic): 0.00
Time: 08:12:12 Log-Likelihood: -12346.
No. Observations: 5524 AIC: 2.470e+04
Df Residuals: 5522 BIC: 2.471e+04
Df Model: 1
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
mother_height 0.5449 0.013 43.125 0.000 0.520 0.570
const 29.7984 0.790 37.703 0.000 28.249 31.348
==============================================================================
Omnibus: 25.260 Durbin-Watson: 0.031
Prob(Omnibus): 0.000 Jarque-Bera (JB): 31.772
Skew: -0.074 Prob(JB): 1.26e-07
Kurtosis: 3.340 Cond. No. 1.62e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.62e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Formula: daughter_height = 29.80 + 0.54*mother_height
Residual std dev (σ): 2.26 ± 0.02
MAD of residuals: 2.16
6.5 The paradox of regression to the mean¶
The apparent paradox of regression to the mean is this: if it operated deterministically, everyone would eventually converge to the mean and there would be no variation in heights. That is not what happens.
The model has an error term that captures variation in heights not explained by the parent's height. This error term preserves variation in the population even as the predictions pull toward the mean.
The prediction is that daughters of very tall parents will on average be shorter than their parents, and daughters of very short parents taller. The prediction pulls toward the mean, but individual heights still vary around that prediction. This is why we see regression to the mean without everyone actually regressing to it.
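A quick simulation shows why the error term preserves variation: if each generation follows child = a + b·parent + ε with b < 1, the population variance settles at σ²/(1 − b²) rather than collapsing to zero. The values below (b = 0.47, a stationary sd of 2.6 inches) are illustrative assumptions, not estimates from the data:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 33.7, 0.47                         # illustrative intercept and slope
sd_target = 2.6                           # illustrative stationary population sd
sigma = sd_target * np.sqrt(1 - b ** 2)   # error sd that keeps the variance stable

heights = rng.normal(63.6, sd_target, 100_000)  # generation 0
for gen in range(1, 6):
    # Each child is pulled toward the mean (b < 1) but gets fresh noise
    heights = a + b * heights + rng.normal(0, sigma, heights.size)
    print(gen, round(heights.std(), 2))  # sd stays near 2.6, not shrinking
```

The deterministic pull of b < 1 shrinks variance each generation, and the fresh error term restores exactly that much, so the distribution is stable.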
How regression to the mean can confuse people about causal inference; demonstration using fake data¶
Regression to the mean can be confusing, and people often mistake it for a causal effect.
Fake data for 1000 students' scores on midterm and final exams. Each student has a true ability, normally distributed with mean 50 and standard deviation 10. The midterm score is true ability plus random noise (measurement error), normally distributed with mean 0 and standard deviation 10; the final score is true ability plus independent noise with the same distribution.
$\text{final score} = 23.39 + 0.52 \times \text{midterm score} + \epsilon$
The slope is 0.52: for every one-point increase in midterm score, the predicted final score increases by 0.52 points. Because this is less than 1, final scores regress toward the mean relative to midterm scores. It might seem natural to say that students who score high on the midterm get overconfident and do worse on the final, but that is not necessarily the case. This mistake is called the "regression fallacy".
The data were simulated and there is no causal relationship between midterm score and final score.
Summary: when you see things drifting back toward the average, don't rush to invent a causal explanation. First ask whether the "default" you're comparing against — that things should stay the same — even makes sense. If there's natural variation, regression to the mean will happen on its own.
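The summary can be made concrete with a sketch (using its own simulated data, on the same assumptions as above): the top midterm scorers do worse on the final purely because of noise, with no causal mechanism at all.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
ability = rng.normal(50, 10, n)
midterm = ability + rng.normal(0, 10, n)  # noisy measure of ability
final = ability + rng.normal(0, 10, n)    # independent noisy measure

# Students in the top 10% on the midterm...
top = midterm >= np.quantile(midterm, 0.9)
# ...score markedly lower on the final, on average, with no
# "overconfidence" involved: their midterm luck simply doesn't repeat
print(round(midterm[top].mean(), 1), round(final[top].mean(), 1))
```

The top group's final average sits between their midterm average and the overall mean of 50, exactly the drift the regression fallacy misreads as a causal story.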
n = 1000
true_ability = np.random.normal(50, 10, n)
midterm_score = true_ability + np.random.normal(0, 10, n)
final_score = true_ability + np.random.normal(0, 10, n)
scores = pd.DataFrame({'true_ability': true_ability, 'midterm_score': midterm_score, 'final_score': final_score})
display(scores.head())
fit_and_plot_lm(scores, ['midterm_score'], 'final_score', add_constant=True, show_plot=True, scatter_kws=None, line_kws=None)
|   | true_ability | midterm_score | final_score |
|---|---|---|---|
| 0 | 53.575017 | 56.324755 | 50.121899 |
| 1 | 59.765044 | 66.258999 | 59.124500 |
| 2 | 65.054658 | 68.848283 | 76.527669 |
| 3 | 41.159978 | 42.237073 | 33.467692 |
| 4 | 48.079297 | 43.221099 | 52.619541 |
OLS Regression Results
==============================================================================
Dep. Variable: final_score R-squared: 0.255
Model: OLS Adj. R-squared: 0.255
Method: Least Squares F-statistic: 342.1
Date: Wed, 11 Mar 2026 Prob (F-statistic): 6.57e-66
Time: 08:29:13 Log-Likelihood: -3938.9
No. Observations: 1000 AIC: 7882.
Df Residuals: 998 BIC: 7892.
Df Model: 1
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
midterm_score 0.5228 0.028 18.497 0.000 0.467 0.578
const 23.3882 1.485 15.749 0.000 20.474 26.302
==============================================================================
Omnibus: 1.672 Durbin-Watson: 2.049
Prob(Omnibus): 0.433 Jarque-Bera (JB): 1.537
Skew: 0.069 Prob(JB): 0.464
Kurtosis: 3.133 Cond. No. 198.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Formula: final_score = 23.39 + 0.52*midterm_score
Residual std dev (σ): 12.44 ± 0.28
MAD of residuals: 12.52