3. Some basic methods in mathematics and probability¶
3.1 Weighted averages¶
Reweight data or inferences to adapt to a target population.
Example:
| Stratum | Label | Population | Average age |
|---|---|---|---|
| 1 | United States | 310 million | 36.8 |
| 2 | Mexico | 112 million | 26.7 |
| 3 | Canada | 34 million | 40.7 |
Weighted average age = (310 × 36.8 + 112 × 26.7 + 34 × 40.7) / (310 + 112 + 34) = 34.6 years

Weights = 310/456, 112/456, 34/456
Weighted average in summation notation: $$\frac{\sum_{j} N_j y_j}{\sum_{j} N_j}$$
where $j$ indexes the country and the sums run over all the strata.
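As a check, the weighted average from the table above can be computed directly. A minimal sketch with numpy:

```python
import numpy as np

# Strata populations (millions) and average ages from the table above
N = np.array([310, 112, 34])        # United States, Mexico, Canada
y = np.array([36.8, 26.7, 40.7])

weighted_avg = np.sum(N * y) / np.sum(N)
print(round(weighted_avg, 1))       # -> 34.6
```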
3.2 Vectors and matrices¶
List of numbers is a vector. A table (rectangular array) of numbers is a matrix.
Earlier example, predicted vote percentage = $$ 46.3 + 3 \times (growth rate of average personal income)$$ $$\hat{y} = 46.3 + 3x$$ $$\hat{y} = \hat{a} + \hat{b}x$$
The hats denote estimated parameters; $\hat{y}$ is the predicted value of $y$.
Applied to a vector of data, $x = (-1, 0, 3)$:
$$ \hat{y} = \hat{a} + \hat{b}x = 46.3 + 3x = \begin{bmatrix} 46.3 + 3 \times -1 \\ 46.3 + 3 \times 0 \\ 46.3 + 3 \times 3 \end{bmatrix} = \begin{bmatrix} 43.3 \\ 46.3 \\ 55.3 \end{bmatrix}$$
In matrix form:
$$\hat{y} = \hat{a} + \hat{b}x = \begin{bmatrix} 1 & -1 \\ 1 & 0 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} 46.3 \\ 3 \end{bmatrix} = \begin{bmatrix} 43.3 \\ 46.3 \\ 55.3 \end{bmatrix}$$
Abstractly, we can write this as: $$\hat{y} = X \hat{\beta}$$
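The same computation in numpy, forming the design matrix $X$ with a column of ones for the intercept:

```python
import numpy as np

# Design matrix: a column of ones (for the intercept) next to the predictor x
X = np.array([[1., -1.],
              [1.,  0.],
              [1.,  3.]])
beta_hat = np.array([46.3, 3.0])    # estimated [intercept, slope]

y_hat = X @ beta_hat                # matrix-vector product
print(y_hat)                        # -> [43.3 46.3 55.3]
```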
3.3 Graphing a line¶
$$y = a + bx$$
- Intercept = $a$: when $x = 0$, $y = a$.
- Slope = $b$: upward slope if $b > 0$, downward slope if $b < 0$, flat if $b = 0$. The larger the absolute value of $b$, the steeper the slope.
Below I fit a linear model of the world-record mile time against year. The slope is negative: as time progresses, the record time decreases. The slope is also fairly steep, indicating a rapid decline in the record time.
y = 1006.88 - 0.39*year
The tendency is to describe the slope as "the mile record decreases by 0.39 seconds per year", but it is better described as "comparing any two years, the mile record is on average 0.39 seconds per year lower in the later year".
The intercept is 1006.88, the predicted mile record at year 0. This is not meaningful in context; it is a mathematical artifact of the linear model. Even for very early years the predicted record time is unrealistically high. This is a common issue with linear models when extrapolating beyond the range of the data.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm

def fit_and_plot_lm(data, predictors, outcome, add_constant=True, show_plot=True, scatter_kws=None, line_kws=None):
    """
    Fit a linear model using statsmodels, print the summary, plot, and show the fitted formula.
    Args:
        data: pandas DataFrame
        predictors: list of predictor column names (str)
        outcome: outcome column name (str)
        add_constant: whether to add an intercept (default True)
        show_plot: whether to plot (default True)
        scatter_kws: dict, kwargs for the scatterplot
        line_kws: dict, kwargs for the regression line
    """
    X = data[predictors].copy()
    if add_constant:
        X = sm.add_constant(X, prepend=False)
    y = data[outcome]
    results = sm.OLS(y, X).fit()
    print(results.summary())
    # Print the fitted formula, constant term first if present
    params = results.params
    terms = [f"{params[name]:.2f}*{name}" for name in predictors]
    if add_constant:
        terms = [f"{params['const']:.2f}"] + terms
    print("Formula:", f"{outcome} = " + " + ".join(terms))
    # Plot only if there is a single predictor
    if show_plot and len(predictors) == 1:
        x_name = predictors[0]
        ax = sns.scatterplot(data=data, x=x_name, y=outcome, **(scatter_kws or {}))
        x_vals = np.linspace(data[x_name].min(), data[x_name].max(), 100)
        y_vals = params[x_name] * x_vals
        if add_constant:
            y_vals = y_vals + params['const']
        ax.plot(x_vals, y_vals, color='red', **(line_kws or {}))
        ax.set_title('Linear Regression Fit')
        plt.show()
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
mile = pd.read_csv('../ros_data/mile.csv')
display(mile)
fit_and_plot_lm(mile, ['year'], 'seconds', add_constant=True, show_plot=True, scatter_kws=None, line_kws=None)
| | yr | month | min | sec | year | seconds |
|---|---|---|---|---|---|---|
| 0 | 1913 | 5.0 | 4 | 14.40 | 1913.416667 | 254.40 |
| 1 | 1915 | 7.0 | 4 | 12.60 | 1915.583333 | 252.60 |
| 2 | 1923 | 8.0 | 4 | 10.40 | 1923.666667 | 250.40 |
| 3 | 1931 | 10.0 | 4 | 9.20 | 1931.833333 | 249.20 |
| 4 | 1933 | 7.0 | 4 | 7.60 | 1933.583333 | 247.60 |
| 5 | 1934 | 6.0 | 4 | 6.80 | 1934.500000 | 246.80 |
| 6 | 1937 | 8.0 | 4 | 6.40 | 1937.666667 | 246.40 |
| 7 | 1942 | 7.0 | 4 | 6.20 | 1942.583333 | 246.20 |
| 8 | 1942 | 7.0 | 4 | 6.20 | 1942.583333 | 246.20 |
| 9 | 1942 | 9.0 | 4 | 4.60 | 1942.750000 | 244.60 |
| 10 | 1943 | 7.0 | 4 | 2.60 | 1943.583333 | 242.60 |
| 11 | 1944 | 7.0 | 4 | 1.60 | 1944.583333 | 241.60 |
| 12 | 1945 | 7.0 | 4 | 1.40 | 1945.583333 | 241.40 |
| 13 | 1954 | 5.0 | 3 | 59.40 | 1954.416667 | 239.40 |
| 14 | 1954 | 6.0 | 3 | 58.00 | 1954.500000 | 238.00 |
| 15 | 1957 | 7.0 | 3 | 57.20 | 1957.583333 | 237.20 |
| 16 | 1958 | 8.0 | 3 | 54.50 | 1958.666667 | 234.50 |
| 17 | 1962 | 1.0 | 3 | 54.40 | 1962.083333 | 234.40 |
| 18 | 1964 | 11.0 | 3 | 54.10 | 1964.916667 | 234.10 |
| 19 | 1965 | 6.0 | 3 | 53.60 | 1965.500000 | 233.60 |
| 20 | 1966 | 7.0 | 3 | 51.30 | 1966.583333 | 231.30 |
| 21 | 1967 | 6.0 | 3 | 51.10 | 1967.500000 | 231.10 |
| 22 | 1975 | 5.0 | 3 | 51.00 | 1975.416667 | 231.00 |
| 23 | 1975 | 8.0 | 3 | 49.40 | 1975.666667 | 229.40 |
| 24 | 1979 | 7.0 | 3 | 49.00 | 1979.583333 | 229.00 |
| 25 | 1980 | 7.0 | 3 | 48.80 | 1980.583333 | 228.80 |
| 26 | 1981 | 8.0 | 3 | 48.53 | 1981.666667 | 228.53 |
| 27 | 1981 | 8.2 | 3 | 48.40 | 1981.683333 | 228.40 |
| 28 | 1981 | 8.3 | 3 | 47.33 | 1981.691667 | 227.33 |
| 29 | 1985 | 7.0 | 3 | 46.32 | 1985.583333 | 226.32 |
| 30 | 1993 | 9.0 | 3 | 44.39 | 1993.750000 | 224.39 |
| 31 | 1999 | 7.0 | 3 | 43.13 | 1999.583333 | 223.13 |
OLS Regression Results
==============================================================================
Dep. Variable: seconds R-squared: 0.977
Model: OLS Adj. R-squared: 0.976
Method: Least Squares F-statistic: 1277.
Date: Wed, 18 Feb 2026 Prob (F-statistic): 3.78e-26
Time: 09:56:41 Log-Likelihood: -54.745
No. Observations: 32 AIC: 113.5
Df Residuals: 30 BIC: 116.4
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
year -0.3930 0.011 -35.734 0.000 -0.416 -0.371
const 1006.8760 21.532 46.762 0.000 962.902 1050.850
==============================================================================
Omnibus: 0.318 Durbin-Watson: 0.840
Prob(Omnibus): 0.853 Jarque-Bera (JB): 0.115
Skew: 0.144 Prob(JB): 0.944
Kurtosis: 2.937 Cond. No. 1.72e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.72e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Formula: seconds = 1006.88 + -0.39*year
3.4 Exponential and power-law growth and decline; logarithmic and log-log relationships¶
Exponential growth and decline¶
$$y = a + bx$$
can be used to express exponential growth or decline by using the logarithm of $y$.
$$\log y = a + bx$$
This expresses exponential growth or decline. If $b > 0$, then $y$ grows exponentially with $x$. If $b < 0$, then $y$ declines exponentially with $x$. The larger the absolute value of $b$, the faster the growth or decline. Exponentiating both sides, the original equation can be rewritten as:
$$y = Ae^{bx}$$
where $A = e^a$.
Growth example: a population that starts at 1.5 billion in 1900 and doubles every 50 years. This can be expressed as:
$$y = (1.5 \times 10^9) \times 2^{(x-1900)/50}$$
Where $x$ is the year. This can be rewritten as:
$$y = (1.5 \times 10^9) \times e^{\ln(2) \times (x-1900)/50}$$
Exponential decay example: an asset worth \$1000 whose value declines by 20% per year. This can be expressed as: $$y = 1000 \times (1 - 0.2)^x$$
where $x$ is the number of years. This can be rewritten as: $$y = 1000 \times e^{\ln(0.8) \times x}$$
# Create a dataframe with some fabricated data
x = np.arange(1, 2000)
a = 1.5e9 # 1.5x10^9
b = np.log(2) / 50 # Doubling every 50 units
y = a * np.exp(b * (x - 1900)) # Shift x to start from 1900
data = pd.DataFrame({'x': x, 'y': y})
sns.scatterplot(data=data, x='x', y='y')
plt.title('Exponential growth y = a * exp(b * x), b > 0')
# x limit start 1900
plt.xlim(1900, 2000)
plt.ylim(1.5e9, 6e9)
plt.show()
# Exponential decline example
x = np.arange(1, 10)
a = 1000
b = np.log(0.8)
y = a * np.exp(b * x)
data = pd.DataFrame({'x': x, 'y': y})
sns.scatterplot(data=data, x='x', y='y')
plt.title('Exponential decline y = a * exp(b * x), b < 0')
plt.show()
Power-law growth and decline¶
$$\log y = a + b \log x$$
This expresses power-law growth or decline. If $b > 0$, then $y$ grows as a power of $x$. If $b < 0$, then $y$ declines as a power of $x$. The larger the absolute value of $b$, the faster the growth or decline. Exponentiating both sides, the original equation can be rewritten as:
$$y = Ax^b$$
where $A = e^a$.
# Create a dataframe with some fabricated data
x = np.arange(1, 4000)
a = 3.3
b = 0.74
y = a * x ** b
data = pd.DataFrame({'x': x, 'y': y})
sns.scatterplot(data=data, x='x', y='y')
plt.title('Power-law growth y = A * x^b, b > 0')
plt.show()
# Modify x and y to be log of the original values
data['x_log'] = np.log(data['x'])
data['y_log'] = np.log(data['y'])
sns.scatterplot(data=data, x='x_log', y='y_log')
plt.title('Power-law growth on log-transformed axes: log y vs log x is linear')
plt.show()
# Make x and y axis log scale
sns.scatterplot(data=data, x='x', y='y')
plt.xscale('log')
plt.yscale('log')
plt.title('Power-law growth y = A * x^b, b > 0 (log-log scale)')
plt.show()
3.5 Probability distributions¶
A straight line on its own is deterministic. We need probability distributions and random variables because data never fit a model exactly; probability distributions represent the unmodelled aspects of reality - the error.
$$ y = a + bx + \epsilon $$
Where $\epsilon$ is the error term, which is a random variable that represents the unmodelled aspects of reality. The probability distribution of $\epsilon$ captures the uncertainty in the model's predictions.
Distribution of data: $$ y_i, \quad i = 1, 2, \ldots, n $$
Distribution of errors: $$ \epsilon_i = y_i - (a + bx_i), \quad i = 1, 2, \ldots, n $$
Regression describes the typical range of values of the outcome given the predictors, in two steps: 1. predict the mean outcome given the predictors, 2. describe the variation that remains after predicting the mean. The first step is captured by the regression line, and the second step is captured by the distribution of the error term $\epsilon$.
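The two-step view can be illustrated by simulating data from the model and recovering the error distribution (the parameter values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, sigma = 46.3, 3.0, 2.0                       # illustrative parameters
x = rng.uniform(-2, 4, size=1000)
y = a + b * x + rng.normal(0, sigma, size=1000)    # y = a + b*x + error

# Step 1: the deterministic prediction; step 2: the leftover variation
y_pred = a + b * x
eps = y - y_pred
print(round(eps.mean(), 2), round(eps.std(), 2))   # mean near 0, sd near sigma
```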
Mean and standard deviations of a probability distribution¶
Probability distribution of random variable z.
The mean is the average of all values, or the value that would be obtained on average from a random sample from the distribution. The mean is also called the expected value, written $E(z)$ or $\mu_z$.
The variance of a distribution is the average squared deviation from the mean, $E[(z-\mu)^2]$, denoted $\sigma^2$. The standard deviation is the square root of the variance, denoted $\sigma$. It is in the same units as the original data and measures the spread of the distribution: a larger standard deviation indicates a wider spread, a smaller one a narrower spread.
Normal distribution, mean and standard deviation¶
The central limit theorem states that the sum of many independent random variables will be approximately normally distributed, regardless of the underlying distribution of the individual random variables. Summation of independent components:
$$z = \sum_{i=1}^n z_i$$
Mean of z:
$$\mu_z = \sum_{i=1}^n \mu_{z_i}$$
Standard deviation of z:
$$\sigma_z = \sqrt{\sum_{i=1}^n \sigma_{z_i}^2}$$
A normal distribution with mean $\mu_z$ and standard deviation $\sigma_z$ is written z ~ normal($\mu_z$, $\sigma_z$), also denoted N($\mu_z$, $\sigma_z$). The central limit theorem holds well in practice when the individual $\sigma_{z_i}$ are small compared to the standard deviation $\sigma_z$ of the sum. For a normal distribution:
- 50% of values lie within 0.67 standard deviations of the mean,
- 68% within 1 standard deviation,
- 95% within 2 standard deviations,
- 99.7% within 3 standard deviations.
No reason to expect random variables to be normally distributed, but many are approximately normally distributed.
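The formulas for the mean and standard deviation of a sum, and the central limit theorem itself, can be checked with a quick simulation (uniform components are an arbitrary choice here):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 20, 100_000
# Sum of 20 independent uniform(0, 1) variables; each has mean 0.5, variance 1/12
z = rng.uniform(0, 1, size=(reps, n)).sum(axis=1)

mu_z = n * 0.5                      # sum of the component means
sigma_z = np.sqrt(n / 12)           # sqrt of the sum of the component variances
print(round(z.mean(), 2), round(z.std(), 2))       # close to 10.0 and 1.29
```

A histogram of z looks close to normal even though each component is uniform.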
The example below shows the distribution of heights of women and men in the US. It is not a great example, as the authors did not share the full data, only the binned counts for the histogram (so the x-axis shows bin indices rather than heights).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Data from previous step
height_counts_women = np.array([
80, 107, 296, 695, 1612, 2680, 4645, 8201, 9948, 11733, 10270, 9942, 6181, 3990, 2131, 1154, 245, 257, 0, 0, 0, 0
]) * 10339 / 74167
weight_counts_women = np.array([
362, 1677, 4572, 9363, 11420, 12328, 9435, 7023, 5047, 3621, 2753, 2081, 1232, 887, 2366
]) * 10339 / 74167
height_counts_men = np.array([
0, 0, 0, 0, 0, 0, 0, 542, 668, 1221, 2175, 4213, 5535, 7980, 9566, 9578, 8867, 6716, 5019, 2745, 1464, 1263
]) * 9983 / 67552
# Round counts to nearest integer for sampling
height_counts_women_int = np.round(height_counts_women).astype(int)
weight_counts_women_int = np.round(weight_counts_women).astype(int)
height_counts_men_int = np.round(height_counts_men).astype(int)
# Create synthetic data arrays
height_data_women = np.concatenate([
np.full(count, bin_idx) for bin_idx, count in enumerate(height_counts_women_int)
])
weight_data_women = np.concatenate([
np.full(count, bin_idx) for bin_idx, count in enumerate(weight_counts_women_int)
])
height_data_men = np.concatenate([
np.full(count, bin_idx) for bin_idx, count in enumerate(height_counts_men_int)
])
# Create DataFrames with thousands of entries
height_df_women = pd.DataFrame({'height_bin': height_data_women})
weight_df_women = pd.DataFrame({'weight_bin': weight_data_women})
height_df_men = pd.DataFrame({'height_bin': height_data_men})
# Seaborn histograms men and women heights on single plot
plt.figure(figsize=(8, 4))
sns.histplot(data=height_df_women, x='height_bin', color='blue', label='Women', alpha=0.5)
sns.histplot(data=height_df_men, x='height_bin', color='orange', label='Men', alpha=0.5)
plt.xlabel("Height Bin")
plt.ylabel("Count")
plt.title("Distribution of Heights of Men and Women")
plt.legend()
plt.show()
Linear transformations¶
Linearly transforming a normal distribution results in another normal distribution.
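A quick simulated check (the particular numbers are arbitrary): if z ~ normal(10, 2), then w = 3 + 0.5z should be normal with mean 3 + 0.5×10 = 8 and standard deviation 0.5×2 = 1.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(10, 2, size=100_000)     # z ~ normal(10, 2)

# w = c + d*z is again normal, with mean c + d*mu and sd |d|*sigma
w = 3 + 0.5 * z
print(round(w.mean(), 1), round(w.std(), 2))   # near 8.0 and 1.0
```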
Mean and standard deviation of the sum of correlated random variables¶
If two random variables $u$ and $v$ have means $\mu_u$ and $\mu_v$ and standard deviations $\sigma_u$ and $\sigma_v$, then their correlation is defined as $$\rho_{uv} = \frac{E[(u-\mu_u)(v-\mu_v)]}{\sigma_u \sigma_v}$$
This must be between -1 and 1. If $\rho_{uv} = 0$, then u and v are uncorrelated. If $\rho_{uv} = 1$, then u and v are perfectly positively correlated. If $\rho_{uv} = -1$, then u and v are perfectly negatively correlated.
Knowing the correlation between u and v allows us to calculate the mean and standard deviation of their sum z = u + v: the sum has mean $\mu_z = \mu_u + \mu_v$ and standard deviation $\sigma_z = \sqrt{\sigma_u^2 + \sigma_v^2 + 2\rho_{uv}\sigma_u\sigma_v}$. More generally, the weighted sum z = au + bv has mean $\mu_z = a\mu_u + b\mu_v$ and standard deviation $\sigma_z = \sqrt{a^2\sigma_u^2 + b^2\sigma_v^2 + 2ab\rho_{uv}\sigma_u\sigma_v}$, and is itself normally distributed if u and v are jointly normal.
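The standard deviation formula for a correlated sum can be verified by simulation (the parameter values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu_u, mu_v = 1.0, 2.0
sigma_u, sigma_v, rho = 1.0, 2.0, 0.5

# Draw correlated (u, v) pairs from a bivariate normal distribution
cov = [[sigma_u**2, rho * sigma_u * sigma_v],
       [rho * sigma_u * sigma_v, sigma_v**2]]
u, v = rng.multivariate_normal([mu_u, mu_v], cov, size=200_000).T

z = u + v
sigma_z = np.sqrt(sigma_u**2 + sigma_v**2 + 2 * rho * sigma_u * sigma_v)
print(round(z.std(), 2), round(sigma_z, 2))    # both near sqrt(7), about 2.65
```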
Log normal distribution¶
The log transformation is useful because it does not allow negative values and can capture skewed distributions.
The exponentials of the mean and standard deviation of log z are the geometric mean and geometric standard deviation of z.
Logarithmic transformations are nonlinear.
Generally, logarithmic transformations are used to turn multiplicative relationships into additive relationships.
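To illustrate, a lognormal sample whose log has mean 1 and standard deviation 0.5 should have geometric mean exp(1) ≈ 2.72 and geometric standard deviation exp(0.5) ≈ 1.65 (values chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.lognormal(mean=1.0, sigma=0.5, size=100_000)   # log z ~ normal(1, 0.5)

# exp of the mean/sd of log z gives the geometric mean/sd of z
geo_mean = np.exp(np.log(z).mean())
geo_sd = np.exp(np.log(z).std())
print(round(geo_mean, 2), round(geo_sd, 2))    # near 2.72 and 1.65
```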
Binomial distribution¶
Basketball example: 20 shots, each with probability 0.3 of success, and independent (previous shots do not affect the success probability of future shots). The number of successful shots is said to have a binomial distribution with n=20, p=0.3.
For binary responses, the binomial distribution is more appropriate than the normal.
The binomial distribution with parameters n and p is denoted as binomial(n, p). The mean of a binomial distribution is np and the standard deviation is $\sqrt{np(1-p)}$.
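For the basketball example, the formulas give mean 20 × 0.3 = 6 and standard deviation sqrt(20 × 0.3 × 0.7) ≈ 2.05, which a simulation reproduces:

```python
import numpy as np

n, p = 20, 0.3                       # 20 shots, 0.3 success probability
mean = n * p                         # 6.0
sd = np.sqrt(n * p * (1 - p))        # sqrt(4.2), about 2.05

rng = np.random.default_rng(5)
shots = rng.binomial(n, p, size=100_000)       # simulated shooting sessions
print(round(shots.mean(), 2), round(shots.std(), 2))   # near 6.0 and 2.05
```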
Poisson distribution¶
Poisson distribution is used to model count data, such as the number of events that occur in a fixed interval of time or space.
It is almost always an idealisation, as it often ignores systematic differences between units and clustering.
It is generally recommended to expand the Poisson model to account for overdispersion.
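To see what overdispersion looks like, compare a Poisson sample (variance equals the mean) with a negative binomial sample of the same mean but larger variance. The negative binomial is one common overdispersed alternative; the parameters below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

# Poisson counts: variance equals the mean (both 5 here)
poisson = rng.poisson(lam=5.0, size=100_000)

# Negative binomial with mean n*(1-p)/p = 5 but variance n*(1-p)/p^2 = 10
negbin = rng.negative_binomial(n=5, p=0.5, size=100_000)

print(round(poisson.mean(), 1), round(poisson.var(), 1))   # both near 5
print(round(negbin.mean(), 1), round(negbin.var(), 1))     # near 5 and 10
```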
Unclassified distributions¶
Much data does not fit a named distribution, and that is fine. Named distributions are idealisations, useful for understanding the properties of data, but not necessary for analysis.
Probability distributions of error¶
In regression, we use a deterministic model to capture as much of the variation as possible, then use a probability distribution to model the remaining variation (the error): $y = \text{deterministic model} + \text{error}$.
For discrete data, error cannot be easily mathematically separated from the data, so we model the data directly with a probability distribution. For continuous data, we can separate the error from the data and model the error with a probability distribution.
Comparing distributions¶
We can look at the mean and other summary statistics. It can also make sense to consider shifts in quantiles. For example, after a treatment the 50th percentile might shift to the 54th percentile, which would indicate that the treatment has a positive effect on the outcome.
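The quantile-shift idea can be illustrated with simulated data; here a constant additive treatment effect of 1 point on a normal(50, 10) outcome is assumed:

```python
import numpy as np

rng = np.random.default_rng(7)
control = rng.normal(50, 10, size=100_000)
treated = control + 1.0             # hypothetical constant treatment effect

# The treated median sits at about the 54th percentile of the control
# distribution: the 50th percentile has shifted up
pct = (control < np.median(treated)).mean() * 100
print(round(pct, 1))                # close to 54
```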
3.6 Probability modeling¶
What is the probability that my vote is decisive in the election? That is, what is the probability that my vote will change the outcome of the election?
Consider an election with two candidates, A and B, and n voters. A tie is possible only if n is even, in which case my vote breaks it. Consider two ways to estimate the probability that my vote is decisive: empirical forecasting (recommended) and binomial modeling (not recommended).
Empirical forecasting¶
Let y be the vote share for candidate A and summarise our uncertainty in y with a normal distribution with mean 0.49 and standard deviation 0.04 - the candidate is predicted to lose, but there is uncertainty in the prediction.
The probability that the n votes are exactly split (possible only if n is even) can be approximated as 1/n times the forecast vote-share density at 0.5, i.e. 1/n times the normal density at 0.5 with mean 0.49 and standard deviation 0.04, which is approximately 9.7/n. For example, if n=200,000, norm.pdf(0.5, 0.49, 0.04) / 2e5 = 4.8e-05, or about 1 in 21,000 - very small. But if I convince 1000 people to change their vote, the probability of swinging the outcome is 1000*norm.pdf(0.5, 0.49, 0.04) / 2e5 = 0.048, or about 1 in 20 - much higher.
from scipy.stats import norm
print(norm.pdf(0.5, 0.49, 0.04) / 2e5)
print(1/norm.pdf(0.5, 0.49, 0.04) * 2e5)
4.8333514600356156e-05 20689.57757921108
Using a reasonable seeming but inappropriate model¶
Suppose n=200,000 voters each vote for a particular candidate independently with probability p. With p=0.5, the probability of a tie is 0.0018, about 1 in 556. But if we shift the probability to p=0.49, the probability of a tie drops to 7.5e-21 - effectively zero, which is not realistic.
Why?
Binomial assumption is not appropriate, as it assumes that each vote is independent and has the same probability of voting for a particular candidate. In reality, votes are not independent, as they are influenced by social networks, media, and other factors. Additionally, the probability of voting for a particular candidate is not the same for all voters, as it is influenced by demographic factors, political beliefs, and other factors. Therefore, the binomial model is not appropriate for modeling the probability of a tie in an election.
Key problem is that the binomial model does not account for the uncertainty in the vote share prediction. The empirical forecasting approach accounts for this uncertainty by using a normal distribution to model the vote share, which allows for a more realistic estimate of the probability of a tie.
Key lesson: be careful when using a model, always consider the assumptions of the model and whether they are reasonable in the context of the problem.
from scipy.stats import binom
print(binom.pmf(int(1e5), int(2e5), 0.5))
print(1/binom.pmf(int(1e5), int(2e5), 0.5))
print(binom.pmf(int(1e5), int(2e5), 0.49))
0.0017841218859990207 560.4998222641325 7.519171007276916e-21
General lessons for probability modeling¶
- Understand the assumptions of your model: Always consider what assumptions your model makes and whether they are reasonable in the context of the problem you're trying to solve.
- Use empirical data to inform your models: Whenever possible, use empirical data to inform your models and to validate their predictions. This can help ensure that your models are grounded in reality and are more likely to provide accurate predictions.