6. Background on regression modeling¶

The methods in this book serve two main purposes: prediction and comparison.

Use regression to predict an outcome (or the distribution of an outcome) from a set of predictors/inputs. We can then compare predictions for different inputs to compare groups or estimate causal effects.

6.1 Regression models¶

Simplest regression is linear with a single predictor:

$y = a + b x + \epsilon$

where $a$ is the intercept, $b$ is the slope (together, the coefficients or parameters), and $\epsilon$ is the error term.

The simple model can be extended in various ways (a few of these are sketched in code after the list):

  • Multiple predictors: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$, or in vector form $y = X \beta + \epsilon$
  • Nonlinear relationships: $y = a + b x^2 + \epsilon$ or $y = a + b \log(x) + \epsilon$
  • Non-additive models: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon$, which includes an interaction between $x_1$ and $x_2$
  • Generalized linear models, which extend linear regression to other types of outcomes (discrete outcomes, e.g., binary or count data)
  • Nonparametric regression, which does not assume a specific functional form and allows for arbitrary relationships between predictors and outcome
  • Multilevel/hierarchical models, in which coefficients vary by group or cluster; useful for data with nested structure (e.g., students within schools, patients within hospitals)
  • Measurement error models, where predictors and/or outcomes are measured with error
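
A minimal sketch (my own, not from the book) of how a few of these extensions look in statsmodels formula syntax. The data frame and column names here are made up purely for illustration:

In [ ]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data for illustration only
rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
df['y'] = 1 + 2 * df['x1'] - df['x2'] + 0.5 * df['x1'] * df['x2'] + rng.normal(size=200)
df['z'] = (df['y'] > df['y'].median()).astype(int)  # a binary outcome

multiple = smf.ols('y ~ x1 + x2', data=df).fit()             # multiple predictors
nonlinear = smf.ols('y ~ np.log(x1**2 + 1)', data=df).fit()  # nonlinear transform of a predictor
interact = smf.ols('y ~ x1 * x2', data=df).fit()             # expands to x1 + x2 + x1:x2
logit = smf.logit('z ~ x1 + x2', data=df).fit(disp=0)        # GLM for a binary outcome
print(interact.params)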

6.2 Fitting a simple regression to fake data¶

Create fake data with relationship $y = 0.2 + 0.3 x + \epsilon$ where $\epsilon \sim N(0, 0.5)$. Errors are normally distributed with mean 0 and standard deviation 0.5.

Fitting a regression and displaying the results¶
In [2]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.robust.scale import mad

def fit_and_plot_lm(data, predictors, outcome, add_constant=True, show_plot=True, scatter_kws=None, line_kws=None):
    """
    Fit a linear model using statsmodels, print summary, plot, and show formula.
    Args:
        data: pandas DataFrame
        predictors: list of predictor column names (str)
        outcome: outcome column name (str)
        add_constant: whether to add intercept (default True)
        show_plot: whether to plot (default True)
        scatter_kws: dict, kwargs for scatterplot
        line_kws: dict, kwargs for regression line
    """
    X = data[predictors].copy()
    if add_constant:
        X = sm.add_constant(X, prepend=False)
    y = data[outcome]
    model = sm.OLS(y, X)
    results = model.fit()
    print(results.summary())
    # Print fitted formula
    params = results.params
    terms = " + ".join(f"{params[name]:.2f}*{name}" for name in predictors)
    if add_constant:
        formula = f"{outcome} = {params['const']:.2f} + {terms}"
    else:
        formula = f"{outcome} = {terms}"
    print("Formula:", formula)
    # Print residual standard deviation and its uncertainty
    sigma = np.sqrt(results.mse_resid)
    sigma_se = sigma / np.sqrt(2 * results.df_resid)
    print(f"Residual std dev (σ): {sigma:.2f} ± {sigma_se:.2f}")
    # Print median absolute deviation of residuals
    print("MAD of residuals:", round(mad(results.resid), 2))
    # Plot if only one predictor
    if show_plot and len(predictors) == 1:
        x_name = predictors[0]
        ax = sns.scatterplot(data=data, x=x_name, y=outcome, **(scatter_kws or {}))
        x_vals = np.linspace(data[x_name].min(), data[x_name].max(), 100)
        y_vals = params['const'] + params[x_name] * x_vals if add_constant else params[x_name] * x_vals
        ax.plot(x_vals, y_vals, color='red', **(line_kws or {}))
        ax.set_title('Linear Regression Fit')
        plt.show()
In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

# x = integers 1 through 20
x = np.arange(1, 21)
n = len(x)
a = 0.2
b = 0.3
sigma = 0.5
y = a + b * x + np.random.normal(0, sigma, n)

data = pd.DataFrame({'x': x, 'y': y})
display(data.head())

sns.scatterplot(x='x', y='y', data=data)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter plot of x and y')
plt.show()

fit_and_plot_lm(data, predictors=['x'], outcome='y')
x y
0 1 -0.110005
1 2 0.047969
2 3 1.255265
3 4 1.533453
4 5 2.216840
[Figure: scatter plot of x and y]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.906
Model:                            OLS   Adj. R-squared:                  0.901
Method:                 Least Squares   F-statistic:                     173.1
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           1.13e-10
Time:                        07:55:36   Log-Likelihood:                -15.161
No. Observations:                  20   AIC:                             34.32
Df Residuals:                      18   BIC:                             36.31
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x              0.2777      0.021     13.156      0.000       0.233       0.322
const          0.3336      0.253      1.319      0.204      -0.198       0.865
==============================================================================
Omnibus:                        1.195   Durbin-Watson:                   1.706
Prob(Omnibus):                  0.550   Jarque-Bera (JB):                0.850
Skew:                          -0.138   Prob(JB):                        0.654
Kurtosis:                       2.028   Cond. No.                         25.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Formula: y = 0.33 + 0.28*x
Residual std dev (σ): 0.54 ± 0.09
MAD of residuals: 0.59
[Figure: scatter plot of x and y with fitted regression line]
Comparing estimates to assumed parameter values¶

The true intercept $a$ was 0.2; the fitted estimate was 0.33 with standard error 0.25, so the estimate is within one standard error of the true value. Roughly, we can expect the difference between an estimate and the true value to be within 1 standard error about 68% of the time, within 2 standard errors about 95% of the time, and within 3 standard errors about 99.7% of the time.

This is just a single simulation and this will vary across simulations. If we repeated this process many times, we would expect the estimates to be distributed around the true values.
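
A minimal sketch (my own, not from the book's code) of that repetition: simulate many datasets from the same model and check that the slope estimates center on the true $b = 0.3$ and land within 2 standard errors of it about 95% of the time:

In [ ]:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
a_true, b_true, sigma, n_sims = 0.2, 0.3, 0.5, 1000
x = np.arange(1, 21)
X = sm.add_constant(x.astype(float))  # column 0 is the intercept

b_hats, covered = [], 0
for _ in range(n_sims):
    y = a_true + b_true * x + rng.normal(0, sigma, len(x))
    res = sm.OLS(y, X).fit()
    b_hat, b_se = res.params[1], res.bse[1]
    b_hats.append(b_hat)
    covered += abs(b_hat - b_true) < 2 * b_se

print(f"mean of slope estimates: {np.mean(b_hats):.3f}")   # close to 0.3
print(f"within 2 SE of truth:    {covered / n_sims:.0%}")  # about 95%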

6.3 Interpret coefficients as comparisons, not effects¶

Regression coefficients are commonly called "effects," but they are not necessarily causal effects.

In [4]:
earnings = pd.read_csv('../ros_data/earnings.csv')
earnings['earn_k'] = earnings['earn'] / 1000

display(earnings.head())

fit_and_plot_lm(earnings, predictors=['height', 'male'], outcome='earn_k')
height weight male earn earnk ethnicity education mother_education father_education walk exercise smokenow tense angry age earn_k
0 74 210.0 1 50000.0 50.0 White 16.0 16.0 16.0 3 3 2.0 0.0 0.0 45 50.0
1 66 125.0 0 60000.0 60.0 White 16.0 16.0 16.0 6 5 1.0 0.0 0.0 58 60.0
2 64 126.0 0 30000.0 30.0 White 16.0 16.0 16.0 8 1 2.0 1.0 1.0 29 30.0
3 65 200.0 0 25000.0 25.0 White 17.0 17.0 NaN 8 1 2.0 0.0 0.0 57 25.0
4 63 110.0 0 50000.0 50.0 Other 16.0 16.0 16.0 5 6 2.0 0.0 0.0 91 50.0
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 earn_k   R-squared:                       0.100
Model:                            OLS   Adj. R-squared:                  0.099
Method:                 Least Squares   F-statistic:                     100.3
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           4.88e-42
Time:                        07:55:36   Log-Likelihood:                -8137.7
No. Observations:                1816   AIC:                         1.628e+04
Df Residuals:                    1813   BIC:                         1.630e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
height         0.6470      0.185      3.493      0.000       0.284       1.010
male          10.6327      1.468      7.241      0.000       7.753      13.512
const        -25.8722     11.962     -2.163      0.031     -49.332      -2.412
==============================================================================
Omnibus:                     1902.421   Durbin-Watson:                   1.895
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           249218.549
Skew:                           4.821   Prob(JB):                         0.00
Kurtosis:                      59.575   Cond. No.                     1.59e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.59e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Formula: earn_k = -25.87 + 0.65*height + 10.63*male
Residual std dev (σ): 21.39 ± 0.36
MAD of residuals: 14.27

The residual standard deviation of about 21.4 (thousand dollars) indicates that, assuming roughly normal residuals, earnings will be within about $21,400 of the predicted value for about 68% of people, within $42,800 for about 95%, and within $64,200 for about 99.7%.

R² is 0.10, meaning that 10% of the variance in earnings is explained by height and sex. This makes sense, as earnings vary a lot and many factors that influence earnings are not included in the model.
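
A minimal sketch (my own) checking the σ interpretation empirically on this data; statsmodels' formula interface drops rows with missing values by default:

In [ ]:
import numpy as np
import statsmodels.formula.api as smf

check_fit = smf.ols('earn_k ~ height + male', data=earnings).fit()
sigma_hat = np.sqrt(check_fit.mse_resid)
within_1 = (check_fit.resid.abs() < sigma_hat).mean()      # share within 1 sigma
within_2 = (check_fit.resid.abs() < 2 * sigma_hat).mean()  # share within 2 sigma
print(f"sigma: {sigma_hat:.1f} thousand dollars")
print(f"within 1 sigma: {within_1:.0%}, within 2 sigma: {within_2:.0%}")

Given how skewed the residuals are (skew 4.8 in the summary above), the empirical fractions will not match the normal-theory 68%/95% exactly.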

It is inappropriate to label coefficients as "effects." An effect is a change associated with some treatment or intervention. To speak of the "effect of height on earnings" implies that if we could intervene and change someone's height, we would see a change in their earnings. That is not what the model estimates. The coefficient for height is a between-person comparison, not a within-person comparison: we are comparing different people who happen to have different heights and different earnings, not tracking what would happen to one person if their height changed.

The safest interpretation is that the coefficient for height is the difference in earnings associated with a one-unit difference in height: "Under the fitted model, comparing two people of the same sex but one inch different in height, we would expect the taller person to earn about $650 more than the shorter person on average." This is a comparison, not an effect. Similarly, "comparing two people of the same height but different sex, we would expect the man to earn about $10,600 more than the woman on average."
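
To make the comparison interpretation concrete, a minimal sketch (assuming the `earnings` DataFrame loaded above) predicting for two hypothetical people of the same sex who differ by one inch:

In [ ]:
import pandas as pd
import statsmodels.formula.api as smf

cmp_fit = smf.ols('earn_k ~ height + male', data=earnings).fit()
two_people = pd.DataFrame({'height': [65, 66], 'male': [0, 0]})
preds = cmp_fit.predict(two_people)
# the predicted difference equals the height coefficient, about 0.65 thousand dollars
print(f"predicted difference: {preds[1] - preds[0]:.2f} thousand dollars")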

Under certain conditions inferences can be causal, but this requires strong assumptions. Start by describing coefficients in predictive or descriptive terms.

Regression is a mathematical tool for making predictions. Coefficients can sometimes be interpreted as effects, but they can always be interpreted as average comparisons.

6.4 Historical origins of regression¶

Daughters’ heights “regressing” to the mean¶

The intercept of the fitted model $child = 34.48 + 0.47 \times parent$ is 34.48, the predicted height of a child whose parent is 0 inches tall. This is not meaningful in a real-world context; it is a mathematical artifact of the linear model. The slope of 0.47 means that a one-inch difference in the parent's height corresponds to a 0.47-inch difference in the child's height on average. Because this is less than 1, the child's height "regresses" toward the mean relative to the parent's: children of very tall parents are shorter than their parents on average, and children of very short parents are taller than their parents on average.

The intercept is not useful as presented; centering the predictor (as in the code below) makes the intercept the predicted child's height at the average parent height.

In [14]:
# read PearsonLee.txt; the first row is the header
pearson_lee = pd.read_csv('../ros_data/PearsonLee.txt', sep=r'\s+')
display(pearson_lee.head())

# new df with all columns where par = Mother and chl = Daughter
mother_daughter = pearson_lee[(pearson_lee['par'] == 'Mother') & (pearson_lee['chl'] == 'Daughter')]
display(mother_daughter.head())

# sns.scatterplot(x='parent', y='child', data=mother_daughter)

fit_and_plot_lm(mother_daughter, predictors=['parent'], outcome='child')

mother_daughter_centered = mother_daughter.copy()
mother_daughter_centered['parent_c'] = mother_daughter_centered['parent'] - mother_daughter_centered['parent'].mean()
display(mother_daughter_centered.head())

fit_and_plot_lm(mother_daughter_centered, predictors=['parent_c'], outcome='child')

heights = pd.read_csv('../ros_data/heights.txt', sep=r'\s+')
display(heights.head())

fit_and_plot_lm(heights, predictors=['mother_height'], outcome='daughter_height')
child parent frequency gp par chl
1 59.5 62.5 0.5 fs Father Son
2 59.5 63.5 0.5 fs Father Son
3 59.5 64.5 1.0 fs Father Son
4 60.5 62.5 0.5 fs Father Son
5 60.5 66.5 1.0 fs Father Son
child parent frequency gp par chl
180 52.5 59.5 0.50 md Mother Daughter
181 53.5 59.5 0.50 md Mother Daughter
182 55.5 59.5 1.00 md Mother Daughter
183 56.5 58.5 1.00 md Mother Daughter
184 56.5 59.5 0.25 md Mother Daughter
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  child   R-squared:                       0.218
Model:                            OLS   Adj. R-squared:                  0.214
Method:                 Least Squares   F-statistic:                     51.16
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           1.97e-11
Time:                        08:12:12   Log-Likelihood:                -509.89
No. Observations:                 185   AIC:                             1024.
Df Residuals:                     183   BIC:                             1030.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
parent         0.4664      0.065      7.153      0.000       0.338       0.595
const         34.4827      4.094      8.422      0.000      26.404      42.561
==============================================================================
Omnibus:                        9.513   Durbin-Watson:                   0.254
Prob(Omnibus):                  0.009   Jarque-Bera (JB):                4.926
Skew:                          -0.179   Prob(JB):                       0.0852
Kurtosis:                       2.285   Cond. No.                         913.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Formula: child = 34.48 + 0.47*parent
Residual std dev (σ): 3.83 ± 0.20
MAD of residuals: 4.35
[Figure: mother vs. daughter heights with fitted regression line]
child parent frequency gp par chl parent_c
180 52.5 59.5 0.50 md Mother Daughter -3.145946
181 53.5 59.5 0.50 md Mother Daughter -3.145946
182 55.5 59.5 1.00 md Mother Daughter -3.145946
183 56.5 58.5 1.00 md Mother Daughter -4.145946
184 56.5 59.5 0.25 md Mother Daughter -3.145946
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  child   R-squared:                       0.218
Model:                            OLS   Adj. R-squared:                  0.214
Method:                 Least Squares   F-statistic:                     51.16
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           1.97e-11
Time:                        08:12:12   Log-Likelihood:                -509.89
No. Observations:                 185   AIC:                             1024.
Df Residuals:                     183   BIC:                             1030.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
parent_c       0.4664      0.065      7.153      0.000       0.338       0.595
const         63.7000      0.282    226.262      0.000      63.145      64.255
==============================================================================
Omnibus:                        9.513   Durbin-Watson:                   0.254
Prob(Omnibus):                  0.009   Jarque-Bera (JB):                4.926
Skew:                          -0.179   Prob(JB):                       0.0852
Kurtosis:                       2.285   Cond. No.                         4.32
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Formula: child = 63.70 + 0.47*parent_c
Residual std dev (σ): 3.83 ± 0.20
MAD of residuals: 4.35
[Figure: centered mother's height vs. daughter's height with fitted regression line]
daughter_height mother_height
0 52.5 59.5
1 52.5 59.5
2 53.5 59.5
3 53.5 59.5
4 55.5 59.5
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        daughter_height   R-squared:                       0.252
Model:                            OLS   Adj. R-squared:                  0.252
Method:                 Least Squares   F-statistic:                     1860.
Date:                Wed, 11 Mar 2026   Prob (F-statistic):               0.00
Time:                        08:12:12   Log-Likelihood:                -12346.
No. Observations:                5524   AIC:                         2.470e+04
Df Residuals:                    5522   BIC:                         2.471e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
mother_height     0.5449      0.013     43.125      0.000       0.520       0.570
const            29.7984      0.790     37.703      0.000      28.249      31.348
==============================================================================
Omnibus:                       25.260   Durbin-Watson:                   0.031
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.772
Skew:                          -0.074   Prob(JB):                     1.26e-07
Kurtosis:                       3.340   Cond. No.                     1.62e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.62e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Formula: daughter_height = 29.80 + 0.54*mother_height
Residual std dev (σ): 2.26 ± 0.02
MAD of residuals: 2.16
[Figure: mother vs. daughter heights (full heights data) with fitted regression line]

6.5 The paradox of regression to the mean¶

The apparent paradox of regression to the mean: if heights "regressed to the mean" in every generation, eventually everyone would be at the mean and there would be no variation in heights. That is not what happens.

The model's error term captures the variation in heights that is not explained by the parent's height. This error term maintains the spread of the distribution even as the predictions pull toward the mean.

The prediction is that daughters of very tall parents will be shorter than their parents on average, and daughters of very short parents will be taller than their parents on average. The predictions pull toward the mean, but the error term keeps the variation around those predictions, so the overall distribution of heights stays stable. We see regression to the mean in the predictions without everyone actually collapsing to the mean.
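
A minimal sketch (my own, using round numbers close to the heights fit above; the mother mean and sd here are assumed) showing that the fitted values alone shrink the spread, while the error term restores it:

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# assumed population values, roughly matching the heights fit above
mother = rng.normal(63.7, 2.6, n)
a, b, sigma = 29.8, 0.54, 2.26

fitted_only = a + b * mother                      # predictions pull toward the mean
daughter = fitted_only + rng.normal(0, sigma, n)  # error term restores the spread

print(f"sd(mother):      {mother.std():.2f}")       # ~2.6
print(f"sd(fitted only): {fitted_only.std():.2f}")  # ~1.4, shrunk by factor b
print(f"sd(daughter):    {daughter.std():.2f}")     # ~2.7, back near the original spread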

How regression to the mean can confuse people about causal inference; demonstration using fake data¶

Regression to the mean can be confusing, and it is easy to mistake it for a causal effect.

Fake data for 1000 students' scores on midterm and final exams. Each student has a true ability, normally distributed with mean 50 and standard deviation 10. The midterm score is the true ability plus random noise (measurement error), normally distributed with mean 0 and standard deviation 10. The final score is the true ability plus independent noise with the same distribution.

$\text{final score} = 23.39 + 0.52 \times \text{midterm score} + \epsilon$

The slope is 0.52: comparing two students who differ by one point on the midterm, we expect a 0.52-point difference on the final on average. Because this is less than 1, final scores regress toward the mean relative to midterm scores. It might seem natural to say that students who get high midterm scores become overconfident and do worse on the final, but that is not necessarily the case. This mistake is called the "regression fallacy."

The data were simulated and there is no causal relationship between midterm score and final score.

Summary: when you see things drifting back toward the average, don't rush to invent a causal explanation. First ask whether the "default" you're comparing against — that things should stay the same — even makes sense. If there's natural variation, regression to the mean will happen on its own.

In [16]:
n = 1000
true_ability = np.random.normal(50, 10, n)
midterm_score = true_ability + np.random.normal(0, 10, n)
final_score = true_ability + np.random.normal(0, 10, n)
scores = pd.DataFrame({'true_ability': true_ability, 'midterm_score': midterm_score, 'final_score': final_score})
display(scores.head())

fit_and_plot_lm(scores, ['midterm_score'], 'final_score', add_constant=True, show_plot=True, scatter_kws=None, line_kws=None)
true_ability midterm_score final_score
0 53.575017 56.324755 50.121899
1 59.765044 66.258999 59.124500
2 65.054658 68.848283 76.527669
3 41.159978 42.237073 33.467692
4 48.079297 43.221099 52.619541
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            final_score   R-squared:                       0.255
Model:                            OLS   Adj. R-squared:                  0.255
Method:                 Least Squares   F-statistic:                     342.1
Date:                Wed, 11 Mar 2026   Prob (F-statistic):           6.57e-66
Time:                        08:29:13   Log-Likelihood:                -3938.9
No. Observations:                1000   AIC:                             7882.
Df Residuals:                     998   BIC:                             7892.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
midterm_score     0.5228      0.028     18.497      0.000       0.467       0.578
const            23.3882      1.485     15.749      0.000      20.474      26.302
==============================================================================
Omnibus:                        1.672   Durbin-Watson:                   2.049
Prob(Omnibus):                  0.433   Jarque-Bera (JB):                1.537
Skew:                           0.069   Prob(JB):                        0.464
Kurtosis:                       3.133   Cond. No.                         198.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Formula: final_score = 23.39 + 0.52*midterm_score
Residual std dev (σ): 12.44 ± 0.28
MAD of residuals: 12.52
[Figure: midterm vs. final scores with fitted regression line]
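
One more check, a minimal sketch (my own, using the `scores` DataFrame from the cell above): the regression toward the mean runs in both directions, which no "overconfidence" story can explain:

In [ ]:
import statsmodels.formula.api as smf

fwd = smf.ols('final_score ~ midterm_score', data=scores).fit()
rev = smf.ols('midterm_score ~ final_score', data=scores).fit()
# both slopes are well below 1: each exam "regresses" toward the mean
# when predicted from the other
print(f"final ~ midterm slope: {fwd.params['midterm_score']:.2f}")
print(f"midterm ~ final slope: {rev.params['final_score']:.2f}")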

6.6 Bibliographic note¶

6.7 Exercises¶