4. Statistical inference¶

Statistical inference: operations on data that yield estimates of, and uncertainty about, predictions and parameters of some process or population. These probabilistic uncertainty estimates are based on an assumed probability model for the observed data.

Key mistake: using hypothesis tests or statistical significance to claim certainty from noisy data.

4.1 Sampling distributions and generative models¶

Sampling, measurement error and model error¶

Three paradigms for the role of inference:

  1. Sampling model - learn about the characteristics of a population from a sample or subset.
  2. Measurement error model - learn about an underlying pattern or law (e.g., a and b in y = ax + b), where the data are measured with error (y = ax + b + e). Measurement error can be additive (y = ax + b + e) or multiplicative (y = (ax + b) * e).
  3. Model error model - the model itself is imperfect.

In practice, consider all three when constructing and working with models.

Example: predicting student grades from pre-test scores. Sampling model: the students in the class are a sample from the population of all students. Measurement error model: pre-test scores and grades are measured with error - they are imperfect measures of ability. Model error model: the relationship between pre-test scores and grades is not perfect.

Usual approach: $y_i = ax_i + b + e_i$, where $e_i$ is also interpretable as model error and is considered a random sample from a distribution (e.g., normal with mean 0 and standard deviation $\sigma$) that represents a hypothetical 'superpopulation' of all possible errors.

Sampling distribution¶

Sampling distribution: the set of possible datasets that could have been observed if the data collection were re-done, along with the probabilities of those possible values. The sampling distribution is determined by the data collection process and the model for the data. It is a theoretical construct that is not directly observable, and the term is misleading (better would be 'probabilistic data model'), but it can be approximated by simulation or resampling methods (e.g., the bootstrap).

Example: for a pure random sample of size n from a population of size N, the sampling distribution is the set of all samples of size n, all with equal probability.

Example of pure measurement error: if observations $y_i, i=1,\ldots,n$ are generated by $y_i = ax_i + b + e_i$, where a and b are fixed parameters and the $e_i$ are drawn from a specified distribution (e.g., normal with mean 0 and standard deviation $\sigma$), then the sampling distribution is the set of all possible datasets $y_i$ that could have been observed given the $x_i$ and the distribution of the $e_i$.

The sampling distribution is not known in practice, but it can be estimated. It is a generative model in that it represents a random process which, if known, would allow us to generate new datasets similar to the observed data.
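The generative-model idea can be sketched by simulation: repeatedly generate datasets from an assumed measurement-error model and re-estimate the parameters each time, so that the spread of the estimates approximates their sampling distribution. The parameter values here (a=2, b=1, sigma=0.5) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative parameters for y = a*x + b + e (not from the text)
a, b, sigma = 2.0, 1.0, 0.5
x = np.linspace(0, 1, 20)                     # fixed predictor values

# Draw many datasets from the generative model and re-fit each one
slopes = []
for _ in range(1000):
    y = a * x + b + rng.normal(0, sigma, size=len(x))  # one simulated dataset
    slope, intercept = np.polyfit(x, y, 1)             # least-squares fit
    slopes.append(slope)

# The spread of the re-estimated slopes approximates their sampling distribution
print(f"mean of slope estimates: {np.mean(slopes):.2f}")
print(f"sd of slope estimates (standard error): {np.std(slopes):.2f}")
```

The mean of the simulated estimates recovers the true slope, and their standard deviation is exactly the standard error discussed in the next section.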

4.2 Estimates, standard errors, and confidence intervals¶

Parameters, estimands, and estimates¶

Parameters: the unknown numbers that determine a model. E.g., in $y_i = ax_i + b + e_i$, the parameters are a and b (the coefficients) and $\sigma$ (the scale parameter). Parameters can be used to simulate data.

Estimand (or quantity of interest): a summary of parameters or data that we want to learn about. E.g., in $y_i = ax_i + b + e_i$, the estimand could be a, b, or $\sigma$, or some function of them (e.g., a/b).

Use data to estimate parameters and estimands. The sampling distribution of an estimate is a byproduct of the sampling distribution of the data and the estimation procedure.

Standard errors, inferential uncertainty, and confidence intervals¶

Standard error: the standard deviation of an estimate. It gives a sense of the uncertainty in the estimate.

We usually summarise uncertainty using simulation and give the term 'standard error' a looser meaning, covering any measure of uncertainty comparable to the posterior standard deviation.

The standard error is a measure of variation in an estimate and gets smaller as the sample size increases, converging to zero as the sample size goes to infinity. It is not a measure of the probability that an estimate is correct, but rather a measure of the variability of the estimate across hypothetical repeated samples.

Confidence interval: the range of values of a parameter or quantity of interest that is consistent with the observed data, given the assumed sampling distribution. If the model is correct, the confidence interval will contain the true value of the parameter or quantity of interest with the specified probability (e.g., 95%).

The usual 95% confidence interval for large samples, based on the assumption that the sampling distribution of the estimate is approximately normal, is the estimate plus or minus 1.96 times its standard error. This is because, for a normal distribution, approximately 95% of values lie within 1.96 standard deviations of the mean. The 50% interval is easy to understand: the true value is as likely to be inside the interval as outside it.

Assuming the model is correct, it should happen only about 5% of the time that the estimate falls more than 2 standard errors away from the true parameter value.
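This coverage property can be checked by simulation: repeatedly sample from a known population and count how often the estimate ± 1.96 SE interval contains the true mean. The population values here (mu=5, sigma=2, n=100) are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed illustrative population (not from the text)
mu, sigma, n = 5.0, 2.0, 100
n_sims = 10_000

covered = 0
for _ in range(n_sims):
    y = rng.normal(mu, sigma, n)              # one simulated sample
    est = y.mean()                            # point estimate
    se = y.std(ddof=1) / np.sqrt(n)           # estimated standard error
    if est - 1.96 * se <= mu <= est + 1.96 * se:
        covered += 1                          # interval contains the truth

print(f"coverage: {covered / n_sims:.3f}")    # should be close to 0.95
```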

Standard errors and confidence intervals for averages and proportions¶

The standard error of the sample mean, given a sample of size n from a large (effectively infinite) population, is $\sigma / \sqrt{n}$, where $\sigma$ is the standard deviation of the population.

A proportion is a special case of an average where each observation is 0 or 1, with y yeses and n-y noes. The estimate of the proportion is $\hat{p} = y/n$ and its standard error is $\sqrt{\hat{p}(1-\hat{p})/n}$.

Confidence intervals for proportions can be calculated using the standard error formula. Example: in a random sample of 1000 people, 700 support the death penalty and 300 oppose it, so $\hat{p} = 0.7$ and the standard error is $\sqrt{0.7 \times 0.3/1000} = 0.0145$. The 95% confidence interval is $0.7 \pm 1.96 \times 0.0145$, or (0.672, 0.728).
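The death-penalty calculation can be reproduced directly with the proportion formulas:

```python
import numpy as np

# Death-penalty example from the text: 700 of 1000 support
y, n = 700, 1000
p_hat = y / n                                  # estimated proportion
se = np.sqrt(p_hat * (1 - p_hat) / n)          # standard error of a proportion
ci_95 = (p_hat - 1.96 * se, p_hat + 1.96 * se) # normal-approximation 95% CI

print(f"estimate: {p_hat}, SE: {se:.4f}")
print(f"95% CI: ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
```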

Standard error and confidence interval for a proportion when y=0 or y=n¶

Conventionally, y and n-y should both be greater than 5 for the standard error formula to be valid. If y=0 or y=n, then the standard error is zero and the confidence interval has zero width. A quick correction is $\hat{p} = (y + 2)/(n + 4)$, which gives a non-zero standard error and confidence interval.
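A sketch of the quick correction for an extreme outcome. The sample size (0 successes in 75 trials) is made up, and using n+4 in the SE denominator follows the Agresti-Coull convention, which is an assumption beyond the text.

```python
import numpy as np

# Assumed extreme example: 0 successes in 75 trials
y, n = 0, 75
p_raw = y / n                                   # 0.0 -> SE of 0, zero-width CI
p_adj = (y + 2) / (n + 4)                       # quick correction toward 1/2
se_adj = np.sqrt(p_adj * (1 - p_adj) / (n + 4))  # n+4 here is the Agresti-Coull choice
ci_95 = (max(0.0, p_adj - 1.96 * se_adj), p_adj + 1.96 * se_adj)  # clip at 0

print(f"raw estimate: {p_raw}, adjusted: {p_adj:.4f}")
print(f"adjusted 95% CI: ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
```

The adjusted interval has positive width, whereas the raw formula would misleadingly report zero uncertainty.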

Standard error for a comparison¶

The standard error of the difference of two independent quantities is $\sqrt{SE_1^2 + SE_2^2}$, where $SE_1$ and $SE_2$ are the standard errors of the two quantities.

Example: survey of 1000 people, 400 men and 600 women; 57% of men and 45% of women plan to vote for candidate A. The standard error of the difference in proportions is $\sqrt{0.57 \times 0.43/400 + 0.45 \times 0.55/600} = 0.032$.
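The voting example worked through in code, combining the two independent standard errors in quadrature:

```python
import numpy as np

# Voting example from the text: 400 men (57% for A), 600 women (45% for A)
p_men, n_men = 0.57, 400
p_women, n_women = 0.45, 600

se_men = np.sqrt(p_men * (1 - p_men) / n_men)        # SE of men's proportion
se_women = np.sqrt(p_women * (1 - p_women) / n_women)  # SE of women's proportion

diff = p_men - p_women
se_diff = np.sqrt(se_men**2 + se_women**2)  # independent SEs add in quadrature

print(f"difference: {diff:.2f}, SE: {se_diff:.3f}")
```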

Sampling distribution of the sample mean and standard deviation: normal and $\chi^2$ distributions¶

Suppose we draw n data points $y_1, \ldots, y_n$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$, and compute the sample mean $\bar{y}=\sum y_i/n$ and sample standard deviation $s = \sqrt{\sum (y_i - \bar{y})^2/(n-1)}$. The sample mean is normally distributed with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. For the sample standard deviation, $(n-1)s^2/\sigma^2$ is distributed as $\chi^2$ with n-1 degrees of freedom.
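Both facts can be checked by simulation; the values mu=0, sigma=1, n=10 are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed illustrative population (not from the text)
mu, sigma, n = 0.0, 1.0, 10
sims = rng.normal(mu, sigma, size=(20_000, n))   # 20,000 simulated datasets

means = sims.mean(axis=1)                        # sample mean of each dataset
s2 = sims.var(axis=1, ddof=1)                    # sample variance of each dataset

# Fact 1: ybar ~ Normal(mu, sigma/sqrt(n))
print(f"sd of sample means: {means.std():.3f} (theory: {sigma/np.sqrt(n):.3f})")

# Fact 2: (n-1) s^2 / sigma^2 ~ chi^2 with n-1 df (which has mean n-1)
scaled = (n - 1) * s2 / sigma**2
print(f"mean of (n-1)s^2/sigma^2: {scaled.mean():.2f} (theory: {n-1})")
```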

Degrees of freedom¶

Degrees of freedom arise with the $\chi^2$ distribution and in a few other places. They relate to the need to correct for overfitting when estimating the error of future predictions from a fitted model. Calculating predictive error on the same data used to fit the model will underestimate the true error, because the model overfits to the noise in the data. Roughly, the data provide n degrees of freedom, but fitting a model with k parameters uses up k degrees of freedom, leaving n-k degrees of freedom to estimate the error of future predictions. Lower degrees of freedom means more overfitting and more underestimation of the true error.

Confidence intervals from the t distribution¶

The t distribution is a family of distributions similar to the normal distribution but with heavier tails. A t distribution is characterised by a center, a scale, and a degrees-of-freedom parameter (ranging from 1 to $\infty$). Distributions in the t family with low degrees of freedom have heavier tails than the normal distribution, meaning they are more likely to produce values far from their mean. As the degrees of freedom increase, the t distribution approaches the normal distribution.

When a standard error is estimated from n data points, we can account for the uncertainty in the standard error by using the t distribution with n-1 degrees of freedom (n-1 because the mean is estimated from the data and uses up one degree of freedom).

Inference for discrete data¶

For discrete data, use the continuous formula for the standard error.

Example: 1000 people; 600 have 0 dogs, 300 have 1 dog, 50 have 2 dogs, 30 have 3 dogs, and 20 have 4 dogs. What is the 95% confidence interval for the average number of dogs?

In [1]:
import numpy as np                                          # for numerical operations
from scipy import stats                                     # for statistical distributions

# Create the data
y = np.repeat([0, 1, 2, 3, 4], [600, 300, 50, 30, 20])    # repeat each value the specified number of times

n = len(y)                                                  # sample size (1000)
estimate = np.mean(y)                                       # sample mean (point estimate)
se = np.std(y, ddof=1) / np.sqrt(n)                         # standard error = sample SD / sqrt(n), ddof=1 for sample SD

int_50 = estimate + stats.t.ppf([0.25, 0.75], df=n-1) * se   # 50% confidence interval using t-distribution quantiles
int_95 = estimate + stats.t.ppf([0.025, 0.975], df=n-1) * se  # 95% confidence interval using t-distribution quantiles

print(f"Estimate: {estimate:.3f}")                          # print the sample mean
print(f"SE: {se:.4f}")                                      # print the standard error
print(f"50% CI: [{int_50[0]:.3f}, {int_50[1]:.3f}]")       # print the 50% confidence interval
print(f"95% CI: [{int_95[0]:.3f}, {int_95[1]:.3f}]")       # print the 95% confidence interval
Estimate: 0.570
SE: 0.0277
50% CI: [0.551, 0.589]
95% CI: [0.516, 0.624]
Linear transformations¶

In the example above, the 95% CI for the number of dogs per person is [0.516, 0.624]. What is the 95% CI for the number of dogs per 100 people? Just multiply the endpoints by 100: [51.6, 62.4].
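A one-liner confirms that a linear transformation of the estimand just transforms the interval endpoints:

```python
# CI endpoints for dogs per person, taken from the cell above
ci_per_person = (0.516, 0.624)

# A linearly transformed estimand (dogs per 100 people) has a linearly
# transformed confidence interval
ci_per_100 = tuple(100 * v for v in ci_per_person)
print(ci_per_100)
```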

Claude's layman's explanation¶

Based on our survey of 1,000 people, the average person owns about 0.57 dogs. We're 95% confident that if we could ask every adult in the country, the true average would be somewhere between 0.52 and 0.62 dogs per person.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats                               # for the normal quantiles used below

# Read all lines from the file
with open('../ros_data/death_polls.dat', 'r') as f:
    lines = [line.strip() for line in f.readlines()]  # strip whitespace from each line

# Group every 4 lines into one record
records = []
for i in range(0, len(lines), 4):                     # step through lines in groups of 4
    year_month = lines[i].split()                      # split "2002 10" into ["2002", "10"]
    year = int(year_month[0])                          # first value is the year
    month = float(year_month[1])                       # second value is the month (can be fractional)
    val1 = int(lines[i+1])                             # percent support
    val2 = int(lines[i+2])                             # percent oppose
    val3 = int(lines[i+3])                             # percent don't know
    records.append([year, month, val1, val2, val3])

# Create a DataFrame
df = pd.DataFrame(records, columns=['year', 'month', 'support', 'oppose', 'dont_know'])

# val1, val2, val3 are percentages (sum to 100), not raw counts
# percent support among those with an opinion
df['percent_support'] = df['support'] / (df['support'] + df['oppose'])

# SE for a proportion requires actual sample size; Gallup polls typically survey ~1000 people
n_respondents = 1000
df['se'] = np.sqrt(df['percent_support'] * (1 - df['percent_support']) / n_respondents)
df['ci_68_lower'] = df['percent_support'] - stats.norm.ppf(0.84) * df['se']
df['ci_68_upper'] = df['percent_support'] + stats.norm.ppf(0.84) * df['se']

display(df.head())

fig, ax = plt.subplots(figsize=(10, 5))

# Convert year + fractional month to a decimal year for x-axis positioning
x = df['year'] + (df['month'] - 1) / 12

# Error bars from ci_68_lower to ci_68_upper
yerr_lower = (df['percent_support'] - df['ci_68_lower']) * 100
yerr_upper = (df['ci_68_upper'] - df['percent_support']) * 100

ax.errorbar(x, df['percent_support'] * 100, yerr=[yerr_lower, yerr_upper],
            fmt='o', color='#1a1a2e', ecolor='#888', elinewidth=1.2,
            capsize=3, capthick=1.2, markersize=5)

ax.set_xlabel('Year')
ax.set_ylabel('Support for death penalty (%)')
ax.set_title('Public support for the death penalty')
sns.despine()
plt.tight_layout()
plt.show()
   year  month  support  oppose  dont_know  percent_support        se  ci_68_lower  ci_68_upper
0  2002   10.0       70      25          5         0.736842  0.013925     0.722994     0.750690
1  2002    5.0       72      25          3         0.742268  0.013831     0.728513     0.756023
2  2001   10.0       68      26          6         0.723404  0.014145     0.709337     0.737471
3  2001    5.0       65      27          8         0.706522  0.014400     0.692202     0.720842
4  2001    2.0       67      25          8         0.728261  0.014068     0.714271     0.742250
[Figure: public support for the death penalty over time, with 68% error bars]
Weighted averages¶

Confidence intervals can be determined by appropriately combining separate means and variances.

Example: surveys in France, Germany, and Italy yield estimates 0.55 ± 0.02, 0.61 ± 0.03, and 0.38 ± 0.03. The estimated proportion for all three countries is $\frac{N_1}{N_{\text{total}}} \cdot 0.55 + \frac{N_2}{N_{\text{total}}} \cdot 0.61 + \frac{N_3}{N_{\text{total}}} \cdot 0.38$, where $N_i$ is the population of country i and $N_{\text{total}}$ is the total population of the three countries. The standard error of this weighted average is $\sqrt{(\frac{N_1}{N_{\text{total}}} \cdot 0.02)^2 + (\frac{N_2}{N_{\text{total}}} \cdot 0.03)^2 + (\frac{N_3}{N_{\text{total}}} \cdot 0.03)^2}$. Given N, p, and the standard error for each country, we can calculate the weighted average and its standard error, and then construct a confidence interval for the overall estimate.
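A sketch of the weighted-average calculation. The country estimates and standard errors are from the text; the population sizes (in millions) are rough illustrative values, not from the text.

```python
import numpy as np

# Assumed approximate populations in millions: France, Germany, Italy
N = np.array([68.0, 84.0, 59.0])
p = np.array([0.55, 0.61, 0.38])                # country estimates (from the text)
se = np.array([0.02, 0.03, 0.03])               # country standard errors (from the text)

w = N / N.sum()                                  # population weights N_i / N_total
p_total = np.sum(w * p)                          # weighted average estimate
se_total = np.sqrt(np.sum((w * se) ** 2))        # weighted SEs combine in quadrature
ci_95 = (p_total - 1.96 * se_total, p_total + 1.96 * se_total)

print(f"weighted estimate: {p_total:.3f}, SE: {se_total:.4f}")
print(f"95% CI: ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
```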

4.3 Bias and unmodeled uncertainty¶

Inferences assume unbiased measurements, random samples, and randomised experiments; however, real data collection is imperfect.

Bias in estimation¶

An estimate is unbiased if it is correct on average, over its sampling distribution.

Non-response or unrepresentative response can result in a biased sample and a biased estimate.

Estimates in reality are biased in some way.

Bias depends on the sampling distribution, which is unknown.

Adjusting inferences to account for bias and unmodeled uncertainty¶

The standard error might not fully capture the real, practical uncertainty associated with an inference; there may be important sources of uncertainty not captured in the simple standard error.

For example: systematic differences between survey respondents and voters, variation in opinion over time, and inaccurate responses.

How to account for sources of error not in the model: improve data collection, expand the model, and increase the stated uncertainty.

Expand the model: divide the sample into subgroups and assume a simple random sample within each. Not perfect, but this allows us to reduce bias in estimation by adjusting for known differences between sample and population.

Increase uncertainty: typically assume the errors are independent, and so capture additional uncertainty by adding variances. The variance is the square of the standard deviation.

Total uncertainty = $\sqrt{S_{1}^2 + S_{2}^2}$, where $S_{1}$ is the standard error from the sampling distribution and $S_{2}$ is the standard error from the additional source of uncertainty. The mathematics implies it is most effective to reduce the largest source of uncertainty: if $S_{1}$ is much larger than $S_{2}$, reducing $S_{1}$ will have a bigger impact on total uncertainty than reducing $S_{2}$.
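A small sketch of this "add the variances" logic; the two error sizes (0.02 and 0.01) are made-up values for illustration.

```python
import numpy as np

# Assumed illustrative error sizes (not from the text)
se_sampling = 0.02                               # SE from the sampling distribution
se_extra = 0.01                                  # SE from an extra, unmodeled source

total = np.sqrt(se_sampling**2 + se_extra**2)    # variances add for independent errors
print(f"total uncertainty: {total:.4f}")

# Halving the larger source helps far more than halving the smaller one
halve_large = np.sqrt((se_sampling / 2)**2 + se_extra**2)
halve_small = np.sqrt(se_sampling**2 + (se_extra / 2)**2)
print(f"halve larger source:  {halve_large:.4f}")
print(f"halve smaller source: {halve_small:.4f}")
```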

4.4 Statistical significance, hypothesis testing, and statistical errors¶

These concern the possibility of mistakenly coming to strong conclusions that don't replicate or reflect reality. Statistical significance and hypothesis testing are common approaches to trying to avoid this, but they have limitations and can be misused.

Statistical significance¶

Don't misinterpret statistical significance as meaning an effect is stable or real.

Conventionally defined as a p-value less than 0.05, relative to some null hypothesis. A p-value is the probability of observing data as extreme as or more extreme than the observed data, given that the null hypothesis is true. A small p-value (e.g., less than 0.05) is often taken as evidence against the null hypothesis, suggesting that the observed data are unlikely to have occurred by chance alone under the null hypothesis.

An estimate is not statistically significant if it is less than 2 standard errors from the null value - that is, if the observed value could reasonably be explained by simple chance variation.

Example: 20 coin tosses yield 8 heads and 12 tails. The estimate of the probability of heads is 0.4 with a standard error of 0.11. This is not statistically significant because 0.4 is less than 2 standard errors from the null value of 0.5, so the observed value could reasonably be explained by simple chance variation.
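The coin-toss example worked through in code, checking how many standard errors the estimate sits from the null value of 0.5:

```python
import numpy as np

# Coin-toss example from the text: 8 heads in 20 tosses
y, n = 8, 20
p_hat = y / n                                    # estimated probability of heads
se = np.sqrt(p_hat * (1 - p_hat) / n)            # standard error of the proportion
z = (p_hat - 0.5) / se                           # distance from the null, in SEs

print(f"estimate: {p_hat}, SE: {se:.2f}, z: {z:.2f}")
print(f"statistically significant: {abs(z) > 2}")
```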

Hypothesis testing for simple comparisons¶

4.5 Problems with the concept of statistical significance¶

4.6 Example of hypothesis testing: 55,000 residents need your help!¶

4.7 Moving beyond hypothesis testing¶

4.8 Bibliographic note¶

4.9 Exercises¶