1.1 The three challenges of statistics¶

Three challenges of statistical inference:

  1. Generalizing from sample to population - the classic problem of statistical inference;
  2. Generalizing from treatment to control group - the problem of causal inference;
  3. Generalizing from observed measurements to the underlying constructs of interest - often we cannot directly measure what we actually care about.

Inference: 'using mathematical models to make general claims from particular data'.

1.2 Why learn regression?¶

Regression can be used to 'summarise how predictions or average values of an outcome vary across individuals defined by a set of predictors'.

Below is an example of simple linear regression using the Hibbs dataset. To download the data, see: https://github.com/avehtari/ROS-Examples/blob/master/ElectionsEconomy/data/hibbs.dat. The dataset contains the incumbent party's vote share and economic growth in the period before the election for a series of U.S. presidential elections. Assuming that future elections are in some way like past ones, we can use these data to predict vote share from economic growth.

In [1]:
import pandas as pd
import seaborn as sns
import scipy.stats
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Import hibbs.dat from /ros_data
hibbs = pd.read_csv('../ros_data/hibbs.dat', 
                    sep=r'\s+')  # raw string avoids an invalid-escape warning

# Display the first few rows of the dataset
display(hibbs.head(10))

# Add constant term for intercept
hibbs = sm.add_constant(hibbs, prepend=False)

# Fit and summarize the regression model
model = sm.OLS(hibbs['vote'], hibbs[['growth', 'const']])
results = model.fit()

# Print the summary of regression results
print(results.summary())

# Extract slope and intercept from the results
slope = results.params['growth']
intercept = results.params['const']

# Print the formula for the regression line
print(f"y = {intercept:.1f} + {slope:.1f}x")

# Print the median absolute error (medianabs computes the median, not the mean)
print("Median Absolute Error:", 
      sm.tools.eval_measures.medianabs(hibbs['vote'], results.fittedvalues))

# Print the standard deviation of the error
print("Standard Deviation of Error:", 
      sm.tools.eval_measures.stde(hibbs['vote'], results.fittedvalues))

# Calculate sigma (standard deviation of residuals)
print("Sigma (std of residuals):", 
      results.scale ** 0.5)

# Plot growth vs vote
ax = sns.regplot(data=hibbs, 
                 x='growth', 
                 y='vote')

# Add formula to the plot
ax.text(0.05, 0.95,
        f"y = {intercept:.1f} + {slope:.1f}x",
        transform=ax.transAxes,
        verticalalignment='top')

plt.show()

# Simple formula:
print("Vote share = {:.1f} + {:.1f}(economic growth)".format(intercept, slope))
   year  growth   vote inc_party_candidate other_candidate
0  1952    2.40  44.60           Stevenson      Eisenhower
1  1956    2.89  57.76          Eisenhower       Stevenson
2  1960    0.85  49.91               Nixon         Kennedy
3  1964    4.21  61.34             Johnson       Goldwater
4  1968    3.02  49.60            Humphrey           Nixon
5  1972    3.62  61.79               Nixon        McGovern
6  1976    1.08  48.95                Ford          Carter
7  1980   -0.39  44.70              Carter          Reagan
8  1984    3.86  59.17              Reagan         Mondale
9  1988    2.27  53.94           Bush, Sr.         Dukakis
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   vote   R-squared:                       0.580
Model:                            OLS   Adj. R-squared:                  0.550
Method:                 Least Squares   F-statistic:                     19.32
Date:                Tue, 03 Feb 2026   Prob (F-statistic):           0.000610
Time:                        09:40:19   Log-Likelihood:                -42.839
No. Observations:                  16   AIC:                             89.68
Df Residuals:                      14   BIC:                             91.22
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
growth         3.0605      0.696      4.396      0.001       1.567       4.554
const         46.2476      1.622     28.514      0.000      42.769      49.726
==============================================================================
Omnibus:                        5.392   Durbin-Watson:                   2.379
Prob(Omnibus):                  0.067   Jarque-Bera (JB):                2.828
Skew:                          -0.961   Prob(JB):                        0.243
Kurtosis:                       3.738   Cond. No.                         4.54
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
y = 46.2 + 3.1x
Median Absolute Error: 1.658121283747942
Standard Deviation of Error: 3.520233251276052
Sigma (std of residuals): 3.7632876422297965
[Figure: scatter plot of economic growth vs incumbent party vote share with fitted regression line y = 46.2 + 3.1x]
Vote share = 46.2 + 3.1(economic growth)

Some key uses of regression:

  1. Prediction: modelling existing data or predicting new data;
  2. Exploring associations: summarising how well one variable, or a set of variables, predicts the outcome;
  3. Extrapolation: adjusting for known differences between the sample and a population of interest;
  4. Causal inference: estimating treatment effects.

1.3 Some examples of regression¶

A randomized experiment on the effect of an educational television program¶

The data come from an experiment on The Electric Company, a TV program that aimed to improve children's reading skills. Classes with poor reading scores were arranged in pairs, and within each pair one class was assigned to watch the program (treatment group) while the other was not (control group). After the program, all children took a reading test and their scores were recorded. We want to estimate the effect of watching the program on reading test scores.

In [2]:
# Import electric.csv from /ros_data
electric = pd.read_csv('../ros_data/electric.csv', index_col=0)

# Print head of electric dataset
display(electric.head(10))

# Plot summary statistics
print(electric.describe())

# Plot distributions of post_test scores for treatment and control groups
sns.displot(data=electric, x="post_test", hue="treatment", col="grade", kind="kde", fill=True)
plt.show()

# Plot violin plots of post_test scores by treatment and grade
sns.violinplot(data=electric, x='grade', y='post_test', hue='treatment', split=True)
plt.show()

# Print mean post_test scores by treatment and grade group
print(electric.groupby(['treatment', 'grade'])['post_test'].mean())
    post_test  pre_test  grade  treatment  supp  pair_id
1        48.9      13.8      1          1   1.0        1
2        70.5      16.5      1          1   0.0        2
3        89.7      18.5      1          1   1.0        3
4        44.2       8.8      1          1   0.0        4
5        77.5      15.3      1          1   1.0        5
6        84.7      15.0      1          1   0.0        6
7        78.9      19.4      1          1   0.0        7
8        86.8      15.0      1          1   1.0        8
9        60.8      11.8      1          1   1.0        9
10       75.7      16.4      1          1   1.0       10
        post_test    pre_test       grade   treatment       supp     pair_id
count  192.000000  192.000000  192.000000  192.000000  96.000000  192.000000
mean    97.149479   72.224479    2.427083    0.500000   0.635417   48.500000
std     17.758700   33.933253    1.060917    0.501307   0.483840   27.783757
min     44.200000    8.800000    1.000000    0.000000   0.000000    1.000000
25%     86.950000   52.500000    2.000000    0.000000   0.000000   24.750000
50%    102.300000   80.750000    2.000000    0.500000   1.000000   48.500000
75%    111.000000  100.625000    3.000000    1.000000   1.000000   72.250000
max    122.000000  119.800000    4.000000    1.000000   1.000000   96.000000
[Figure: kernel density plots of post-test scores by treatment, one panel per grade]
[Figure: split violin plots of post-test scores by grade and treatment]
treatment  grade
0          1         68.790476
           2         93.211765
           3        106.175000
           4        110.357143
1          1         77.090476
           2        101.570588
           3        106.510000
           4        114.066667
Name: post_test, dtype: float64

The results above show a positive impact of watching the program on reading test scores. This is especially notable for the children in grades 1 and 2.

Estimating the effects of United Nations peacekeeping, using pre-treatment variables to adjust for differences between treatment and control groups¶

The data come from a study of the effect of peacekeeping missions on the duration of peace after civil wars. Some countries received peacekeeping missions (treatment group) while others did not (control group). The outcomes are whether the country returned to war and how long peace lasted (in years) after the ceasefire.

In [3]:
# Simple peacekeeping analysis (following the book)
import numpy as np
import matplotlib.pyplot as plt

# Load the peacekeeping dataset
peace_df = pd.read_stata('../ros_data/pk&pkept_old.dta')

# The authors did not provide variable labels or explanations, so this is my best guess:
# 'pk_dum' indicates presence of UN peacekeepers (1 = present, 0 = absent)
# 'morewar' indicates whether war returned (0 = no return to war, 1 = war returned)
# 'hazard1' is the pre-treatment hazard rate of return to war
# 'cfdate' is the date of ceasefire
# 'faildate' is the date when peace failed (war returned)
# 'pcw' indicates whether the case is post-civil-war (1 = post-civil-war, 0 = not)

# Create a copy of the dataset with selected variables
peace_df = peace_df[['pk_dum', 'morewar', 'hazard1', 'cfdate', 'faildate', 'pcw']].copy()

# Rename columns to more descriptive names
peace_df.columns = ['peacekeepers_present', 'war_returned', 'hazard_rate', 'ceasefire_date', 'failure_date', 'post_civil_war']

# Display the first few rows of the peacekeeping dataset
# display(peace_df.head(10))

# Print value counts for key variables
# print(peace_df['peacekeepers_present'].value_counts(dropna=False))
# print(peace_df['war_returned'].value_counts(dropna=False))
# print(peace_df['hazard_rate'].value_counts(dropna=False))
# print(peace_df['ceasefire_date'].value_counts(dropna=False))
# print(peace_df['failure_date'].value_counts(dropna=False))
# print(peace_df['post_civil_war'].value_counts(dropna=False))

# Convert ceasefire and failure dates to datetime
peace_df['ceasefire_date'] = pd.to_datetime(peace_df['ceasefire_date'])
peace_df['failure_date'] = pd.to_datetime(peace_df['failure_date'])

# Where nans in failure_date, replace with a censoring date (end of 2004)
peace_df['failure_date'] = peace_df['failure_date'].fillna(pd.to_datetime('2004-12-31'))

# Calculate delay in years from ceasefire to end of peace (or censoring)
peace_df['delay_years'] = (peace_df['failure_date'] - peace_df['ceasefire_date']).dt.days / 365.24

# Only keep post-civil-war cases with valid delay
peace_df = peace_df[(peace_df['post_civil_war'] == 1) & (~peace_df['delay_years'].isna())].copy()

# Print the proportion of countries that stayed at peace (with and without UN peacekeepers)
print(peace_df.groupby('peacekeepers_present')['war_returned'].value_counts(normalize=True).unstack())

# Plot the distribution of delay years by presence of peacekeepers
sns.displot(data=peace_df, x='delay_years', col='peacekeepers_present')
plt.show()  # suppress the bare FacetGrid repr
war_returned               0.0       1.0
peacekeepers_present                    
0.0                   0.338710  0.661290
1.0                   0.558824  0.441176
[Figure: histograms of years of peace, one panel for each level of peacekeepers_present]

PROBLEM: This graph does not look like the graph in the book!

Selection bias can be a problem here, as peacekeepers are not deployed at random: missions may go to situations with systematically different risks of renewed war. The "treatment", peacekeeping, was not randomly assigned; this is an observational study rather than an experiment, so we need to adjust for differences between the treatment and control groups using pre-treatment variables.

Censoring is another issue here, as some countries did not return to war during the observation period. This means that for these countries, we only know that the duration of peace is at least as long as the observation period.
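
To see why censoring matters, here is a minimal fabricated simulation (not part of the book's analysis, and all values below are made up): true peace durations are generated, observation stops after a fixed window, and the naive mean of the observed durations understates the true mean.

import numpy as np

rng = np.random.default_rng(0)
true_duration = rng.exponential(scale=8, size=1000)  # true peace durations in years (fabricated)
window = 10                                          # observation period in years
observed = np.minimum(true_duration, window)         # durations censored at the window
censored = true_duration > window                    # still at peace when observation ended

print("True mean duration:", true_duration.mean().round(2))
print("Naive observed mean:", observed.mean().round(2))  # biased downward by censoring
print("Share censored:", censored.mean().round(2))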

1.4 Challenges in building, understanding, and interpreting regressions¶

Two ways regression can be used for causal inference:

  1. Estimating a relationship: estimating the effect of a treatment variable on an outcome variable;
  2. Adjusting for background variables: controlling for confounding variables that may affect the relationship between the treatment and outcome variables.

Regression to estimate a relationship of interest¶

Consider the simplest example: estimating the effect of a treatment variable (X) on an outcome variable (Y), where units are randomly assigned to treatment and control groups. We can use a regression model to estimate the effect of X on Y: Y = β0 + β1X + ε.

These models can be fitted with binary (see below) or continuous (see below) predictors, and with more complex data types. For continuous outcomes we can fit linear models; for binary outcomes we need non-linear models such as logistic regression (a minimal sketch follows). We can also transform the outcome variable (e.g., a log transformation) to better satisfy the model assumptions (e.g., linearity).
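
The notebook has no logistic example, so here is a minimal sketch of a non-linear model for a binary outcome, on fabricated data (the intercept -2 and slope 1 are arbitrary choices for illustration):

import numpy as np
import statsmodels.api as sm

np.random.seed(1151)
x = np.random.uniform(1, 5, 200)
p = 1 / (1 + np.exp(-(-2 + 1 * x)))   # true P(y = 1) under a logistic model
y_bin = np.random.binomial(1, p)

X = sm.add_constant(x)                # column of ones plus x
logit_results = sm.Logit(y_bin, X).fit(disp=0)
print(logit_results.params)           # estimates should be roughly (-2, 1)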

Interactions: treatment effects may vary across groups or across levels of another variable (a moderator). For example, radon exposure may have different effects on lung cancer risk for smokers and non-smokers. We can include interaction terms in the regression model to capture these varying effects, as sketched below.
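
A minimal sketch of an interaction on fabricated data (variable names and effect sizes are made up for illustration); the coefficient on the product term captures how the treatment effect differs between groups:

import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(1151)
n = 200
treat = np.random.binomial(1, 0.5, n)   # treatment indicator (fabricated)
group = np.random.binomial(1, 0.3, n)   # moderator, e.g. smoker vs non-smoker
y = np.random.normal(10 + 2 * treat + 1 * group + 3 * treat * group, 2)

X = pd.DataFrame({'treat': treat, 'group': group,
                  'treat_x_group': treat * group})
X = sm.add_constant(X)
print(sm.OLS(y, X).fit().params)        # interaction coefficient should be near 3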

Binary predictor¶
In [4]:
np.random.seed(1151) # Set seed for reproducibility
N = 50 # Number of samples
x = np.random.uniform(1, 5, N) # Generate uniform random x values
y = np.random.normal(10 + 3 * x, 3, N) # Generate y values based on linear model with noise
x_binary = np.where(x < 3, 0, 1) # Create binary variable based on x
data = pd.DataFrame({'x': x, 'y': y, 'x_binary': x_binary}) # Create DataFrame
# display(data.head(10)) # Display first 10 rows of the dataset

# Scatter plot of y vs x_binary with regression line
ax = sns.scatterplot(data=data, x='x_binary', y='y')
sns.regplot(data=data, x='x_binary', y='y', scatter=False, ax=ax, color='red')
ax.set_xlim(left=0)  # Set x-axis to start at 0
plt.show()

data = sm.add_constant(data, prepend=False) # Add constant
# print(data.head(10)) # Display first 10 rows of the dataset
model = sm.OLS(data['y'], data[['x_binary', 'const']]) # Fit model: y ~ x_binary + const
results = model.fit() # Summarise model
print(results.summary()) # Print the summary of regression results
slope = results.params['x_binary'] # Extract slope
intercept = results.params['const'] # Extract intercept
print(f"y = {intercept:.1f} + {slope:.1f}*x_binary") # Print the formula for the regression line
[Figure: scatter plot of y vs binary predictor x_binary with fitted regression line]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.401
Model:                            OLS   Adj. R-squared:                  0.388
Method:                 Least Squares   F-statistic:                     32.07
Date:                Tue, 03 Feb 2026   Prob (F-statistic):           8.16e-07
Time:                        09:40:19   Log-Likelihood:                -125.46
No. Observations:                  50   AIC:                             254.9
Df Residuals:                      48   BIC:                             258.7
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x_binary       4.8678      0.860      5.663      0.000       3.140       6.596
const         16.6639      0.620     26.886      0.000      15.418      17.910
==============================================================================
Omnibus:                        1.154   Durbin-Watson:                   1.800
Prob(Omnibus):                  0.562   Jarque-Bera (JB):                1.186
Skew:                           0.287   Prob(JB):                        0.553
Kurtosis:                       2.510   Cond. No.                         2.67
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
y = 16.7 + 4.9*x_binary
Continuous predictor¶
In [5]:
# Scatter plot of y vs x with regression line
ax = sns.scatterplot(data=data, x='x', y='y')
# truncate=False extends the fitted line beyond the observed x range
sns.regplot(data=data, x='x', y='y', scatter=False, ax=ax, color='red', truncate=False)
plt.show()

data = sm.add_constant(data, prepend=False) # Ensure constant column (a no-op here: already added above)
# print(data.head(10)) # Display first 10 rows of the dataset
model = sm.OLS(data['y'], data[['x', 'const']]) # Fit model: y ~ x + const
results = model.fit() # Summarise model
print(results.summary()) # Print the summary of regression results
slope = results.params['x'] # Extract slope
intercept = results.params['const'] # Extract intercept
print(f"y = {intercept:.1f} + {slope:.1f}*x") # Print the formula for the regression line
[Figure: scatter plot of y vs x with fitted regression line]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.485
Model:                            OLS   Adj. R-squared:                  0.474
Method:                 Least Squares   F-statistic:                     45.20
Date:                Tue, 03 Feb 2026   Prob (F-statistic):           1.96e-08
Time:                        09:40:19   Log-Likelihood:                -121.67
No. Observations:                  50   AIC:                             247.3
Df Residuals:                      48   BIC:                             251.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x              2.7857      0.414      6.723      0.000       1.953       3.619
const         10.4407      1.362      7.668      0.000       7.703      13.179
==============================================================================
Omnibus:                        3.603   Durbin-Watson:                   1.961
Prob(Omnibus):                  0.165   Jarque-Bera (JB):                2.686
Skew:                           0.546   Prob(JB):                        0.261
Kurtosis:                       3.312   Cond. No.                         12.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
y = 10.4 + 2.8*x
Nonlinear regression¶
In [6]:
# Non-linear model
np.random.seed(1151) # Set seed for reproducibility
N = 50 # Number of samples
x = np.random.uniform(1, 5, N) # Generate uniform random x values
y = np.random.normal(5 + 30 * np.exp(-x), 2, N) # Generate y values based on non-linear model with noise
data = pd.DataFrame({'x': x, 'y': y}) # Create DataFrame

# Fit lowess
ax = sns.scatterplot(data=data, x='x', y='y')
sns.regplot(data=data, x='x', y='y', scatter=False, ax=ax, color='red', lowess=True)
plt.show()

# Fit linear
ax = sns.scatterplot(data=data, x='x', y='y')
sns.regplot(data=data, x='x', y='y', scatter=False, ax=ax, color='red')
plt.show()



from scipy.optimize import curve_fit

# Define the exponential decay function
def exp_decay(x, a, b):
    return a + b * np.exp(-x)

# Fit the model to the data (reusing exp_decay instead of an equivalent lambda)
params, covariance = curve_fit(exp_decay, data['x'], data['y'], p0=[5, 30])

# Print estimated parameters
print("Estimated a:", params[0])
print("Estimated b:", params[1])

# Plot the fit
plt.scatter(data['x'], data['y'], label='Data')
plt.plot(np.sort(data['x']), params[0] + params[1] * np.exp(-np.sort(data['x'])), color='red', label='Fitted curve')
plt.legend()
plt.show()
[Figure: scatter plot of y vs x with lowess fit]
[Figure: scatter plot of y vs x with linear fit]
Estimated a: 4.6802800130372
Estimated b: 32.50996224944559
[Figure: scatter plot of y vs x with fitted exponential decay curve]
In [7]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns

def fit_and_plot_lm(data, predictors, outcome, add_constant=True, show_plot=True, scatter_kws=None, line_kws=None):
    """
    Fit a linear model using statsmodels, print summary, plot, and show formula.
    Args:
        data: pandas DataFrame
        predictors: list of predictor column names (str)
        outcome: outcome column name (str)
        add_constant: whether to add intercept (default True)
        show_plot: whether to plot (default True)
        scatter_kws: dict, kwargs for scatterplot
        line_kws: dict, kwargs for regression line
    """
    X = data[predictors].copy()
    if add_constant:
        X = sm.add_constant(X, prepend=False)
    y = data[outcome]
    model = sm.OLS(y, X)
    results = model.fit()
    print(results.summary())
    # Print formula
    params = results.params
    formula = f"{outcome} = " + " + ".join([f"{params[name]:.2f}*{name}" for name in predictors])
    if add_constant:
        formula = f"{outcome} = {params['const']:.2f} + " + " + ".join([f"{params[name]:.2f}*{name}" for name in predictors])
    print("Formula:", formula)
    # Plot if only one predictor
    if show_plot and len(predictors) == 1:
        x_name = predictors[0]
        ax = sns.scatterplot(data=data, x=x_name, y=outcome, **(scatter_kws or {}))
        x_vals = np.linspace(data[x_name].min(), data[x_name].max(), 100)
        y_vals = params['const'] + params[x_name] * x_vals if add_constant else params[x_name] * x_vals
        ax.plot(x_vals, y_vals, color='red', **(line_kws or {}))
        ax.set_title('Linear Regression Fit')
        plt.show()

Regression to adjust for differences between treatment and control groups¶

In real world research there are often systematic differences between treatment and control groups (imbalance in pre-treatment variables). For example, in an observational study on the effect of a new drug on blood pressure, patients who choose to take the drug may differ systematically from those who do not (e.g., age, gender, baseline health status).

To adjust for these differences, we can include pre-treatment variables (covariates) in the regression model. This helps to control for confounding factors that may bias the estimated treatment effect. The graphs below show an example for fabricated data.

This can be summarised as follows:

On average, treated units scored 3.5 units higher than control units on the outcome (31.6 - 28.2 = 3.5). But the two groups also differed on the pre-treatment variable x: the treated group averaged 0.5 while the control group averaged 1.7. The separate fits below give similar slopes for the two groups (about 4.4 for treated, 4.9 for control), so for every one-unit increase in x the outcome rises by roughly five units in either group. After adjusting for the pre-treatment difference using regression, the estimated treatment effect is 9.6 units (the difference in intercepts: 29.5 - 19.9 = 9.6).

In [8]:
# Fabricated data: treated and control groups differ on a pre-treatment variable
import numpy as np

np.random.seed(1151) # Set seed for reproducibility

N = 100
z = np.repeat([0, 1], N // 2)
# print(z)
xx = np.random.normal(0, np.where(z == 0, 1.2, 0.8), N) ** 2
yy = np.random.normal(20 + 5 * xx + 10 * z, 3)
data = pd.DataFrame({'xx': xx, 'z': z, 'yy': yy})

# Mean
print(data.groupby('z')['yy'].mean())
# Difference in means
print("Difference in means y", data[data['z'] == 1]['yy'].mean() - data[data['z'] == 0]['yy'].mean())

# Mean
print(data.groupby('z')['xx'].mean())
# Difference in means
print("Difference in means x", data[data['z'] == 1]['xx'].mean() - data[data['z'] == 0]['xx'].mean())


# display(data.head(10))

# Fit separate regression lines for the treated and control groups
ax = sns.scatterplot(data=data, x='xx', y='yy')
sns.regplot(data=data[data['z'] == 1], x='xx', y='yy', scatter=False, ax=ax, color='red', truncate=False)
sns.regplot(data=data[data['z'] == 0], x='xx', y='yy', scatter=False, ax=ax, color='green')
plt.show()

fit_and_plot_lm(data[data['z'] == 1], predictors=['xx'], outcome='yy', add_constant=True, show_plot=True)
fit_and_plot_lm(data[data['z'] == 0], predictors=['xx'], outcome='yy', add_constant=True, show_plot=True)
z
0    28.161774
1    31.649151
Name: yy, dtype: float64
Difference in means y 3.4873768848605664
z
0    1.688619
1    0.488257
Name: xx, dtype: float64
Difference in means x -1.2003618536469403
[Figure: scatter plot of yy vs xx with separate fitted lines for treated (red) and control (green) groups]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     yy   R-squared:                       0.483
Model:                            OLS   Adj. R-squared:                  0.473
Method:                 Least Squares   F-statistic:                     44.90
Date:                Tue, 03 Feb 2026   Prob (F-statistic):           2.12e-08
Time:                        09:40:20   Log-Likelihood:                -128.65
No. Observations:                  50   AIC:                             261.3
Df Residuals:                      48   BIC:                             265.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
xx             4.4098      0.658      6.701      0.000       3.087       5.733
const         29.4960      0.559     52.742      0.000      28.372      30.620
==============================================================================
Omnibus:                        1.724   Durbin-Watson:                   2.400
Prob(Omnibus):                  0.422   Jarque-Bera (JB):                0.956
Skew:                           0.289   Prob(JB):                        0.620
Kurtosis:                       3.352   Cond. No.                         1.97
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Formula: yy = 29.50 + 4.41*xx
[Figure: linear regression fit for the treated group (z = 1)]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     yy   R-squared:                       0.864
Model:                            OLS   Adj. R-squared:                  0.861
Method:                 Least Squares   F-statistic:                     304.3
Date:                Tue, 03 Feb 2026   Prob (F-statistic):           2.06e-22
Time:                        09:40:20   Log-Likelihood:                -133.81
No. Observations:                  50   AIC:                             271.6
Df Residuals:                      48   BIC:                             275.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
xx             4.9022      0.281     17.444      0.000       4.337       5.467
const         19.8838      0.695     28.620      0.000      18.487      21.281
==============================================================================
Omnibus:                        1.119   Durbin-Watson:                   2.010
Prob(Omnibus):                  0.572   Jarque-Bera (JB):                0.803
Skew:                          -0.310   Prob(JB):                        0.669
Kurtosis:                       2.981   Cond. No.                         3.67
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Formula: yy = 19.88 + 4.90*xx
[Figure: linear regression fit for the control group (z = 0)]
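
The separate fits above give the two intercepts; the adjusted treatment effect can also be read directly from a single regression that includes both the covariate and the treatment indicator. A sketch, assuming equal slopes and reusing data from the cell above:

# One regression with both the covariate and the treatment indicator;
# the coefficient on z is the covariate-adjusted treatment effect,
# which should be close to the true value of 10 used in the simulation.
X = sm.add_constant(data[['xx', 'z']])
print(sm.OLS(data['yy'], X).fit().params)
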
Interpreting coefficients in a predictive model¶

Consider earnings = 11 000 + 1500 * (height - 60) + error, where the error is normally distributed with mean 0 and standard deviation 22 000. Because the error standard deviation is so large relative to the predicted differences, the model is of little use for forecasting an individual's earnings, and it has no causal interpretation; it can, however, be used to explore associations: taller people tend to earn more on average.
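
A quick fabricated check of this point (the height distribution, normal with mean 66 and sd 4 inches, is an assumption for illustration only): the regression recovers the association, but R-squared is low, so individual forecasts are poor.

import numpy as np
import statsmodels.api as sm

np.random.seed(1151)
height = np.random.normal(66, 4, 1000)  # heights in inches (assumed distribution)
earnings = 11_000 + 1_500 * (height - 60) + np.random.normal(0, 22_000, 1000)

fit = sm.OLS(earnings, sm.add_constant(height)).fit()
print("Slope:", round(float(fit.params[1])), "R-squared:", round(fit.rsquared, 2))
# The slope is near 1500, but the error sd (22 000) swamps the systematic
# differences, so the model explains only a small share of the variance.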

Building, interpreting, and checking regression models¶

The four steps of statistical analysis:

  1. Model building: selecting predictors, interactions, and transformations;
  2. Model fitting: estimating model parameters using appropriate methods (e.g., maximum likelihood, Bayesian inference);
  3. Understanding: interpreting model coefficients and assessing their significance;
  4. Criticism: checking model assumptions, evaluating model fit, and validating predictions.

The challenge is to be critical without being paralysed: we can still generalise from sample to population even when the inference is imperfect.

A key step in criticising a claim is to trace the chain of steps that links the claim back to the data and models used to generate it.

1.5 Classical and Bayesian inference¶

Most effort in statistics goes into fitting models to data and using those models to make predictions and inferences. Three concerns are common:

  1. What information is being used to fit the model?
  2. What assumptions are being made?
  3. How are estimates and predictions interpreted?

The third concern can be addressed within a classical or a Bayesian framework; the difference lies in how uncertainty is quantified and interpreted.

Information¶

Regression starts with data on an outcome, y, and predictors, X. In addition to the data themselves, we often know how they were collected (the sampling design) and the context of the collection (the study design).

For example, a survey may use random or convenience sampling, and an experiment may assign treatments at random.

Finally, prior information about the parameters or the data-generating process may be available from previous studies or expert knowledge. This should be used with caution, as the published literature may be biased towards statistically significant results.

Assumptions¶

Three sorts of assumptions for outcome variable y, predictors X:

  1. Functional form: e.g., linearity (possibly after transformation), additivity, interactions. Choices of transformation and of which variables to include correspond to assumptions about the data-generating process;
  2. Where the data come from: which data are seen and which are not - e.g., who responded to the survey and who did not; random sampling; random treatment assignment;
  3. Real-world relevance of the data: accuracy, reliability, validity. Measurement error can bias estimates and predictions.

Classical inference¶

Classical inference traditionally summarises the information in the data without using prior information, aiming for estimates with well-understood properties such as low bias and low variance.

These properties are long-run expectations: estimates are correct on average (unbiasedness), and confidence intervals cover the true value 95% of the time (coverage). An important principle is conservatism: sometimes the data are weak and we cannot make strong claims. In classical statistics there should be a clear, checkable link from the data to the inferences; a small simulation of the coverage property follows.

Problems arise when studies are small and the data are indirect or highly variable: small samples lead to unreliable estimates and wide confidence intervals, making it difficult to draw meaningful conclusions.
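
A small simulation of the coverage property (assuming normally distributed data): nominal 95% intervals for a mean should contain the true value in about 95% of repeated samples, even when each individual sample is small.

import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
true_mean, n, covered, reps = 5.0, 20, 0, 2000
for _ in range(reps):
    sample = rng.normal(true_mean, 2, n)
    se = sample.std(ddof=1) / np.sqrt(n)         # standard error of the mean
    t_crit = scipy.stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value
    lo, hi = sample.mean() - t_crit * se, sample.mean() + t_crit * se
    covered += (lo <= true_mean <= hi)
print("Empirical coverage:", covered / reps)     # close to 0.95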

Bayesian inference¶

Bayesian inference combines information from data with prior information to make inferences about model parameters. This is done using Bayes' theorem, which updates prior beliefs based on observed data.

Pros: it can incorporate prior information and provides more intuitive statements of uncertainty (credible intervals), which is particularly valuable with small sample sizes or complex models. Cons: it requires specifying prior distributions, which can be subjective and may influence the results.

Inferences are probabilistic and can be represented by random simulations.
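
As a minimal sketch of that last point (a conjugate normal model with known data standard deviation; the prior values below are illustrative assumptions), the posterior can be computed in closed form and then represented by simulation:

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(3.0, 2.0, 10)                 # observed data; data sd (2.0) assumed known

prior_mean, prior_sd, data_sd = 0.0, 5.0, 2.0
n = len(y)
post_var = 1 / (1 / prior_sd**2 + n / data_sd**2)
post_mean = post_var * (prior_mean / prior_sd**2 + n * y.mean() / data_sd**2)

# Represent the posterior by random simulations, as described above
draws = rng.normal(post_mean, np.sqrt(post_var), 4000)
print("Posterior mean:", round(post_mean, 2))
print("95% credible interval:", np.percentile(draws, [2.5, 97.5]).round(2))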