4. Statistical inference¶
Operations on data that yield estimates and uncertainty about predictions and parameters of some process or population. These probabilistic uncertainty estimates are based on an assumed probability model for the observed data.
Key point: it is a mistake to use hypothesis tests or statistical significance to claim certainty from noisy data.
4.1 Sampling distributions and generative models¶
Sampling, measurement error and model error¶
Three paradigms for the role of inference:
- Sampling model - learn characteristics of population from sample or subset.
- Measurement error model - learn about an underlying pattern or law (e.g., a and b in y = ax + b), but the data are measured with error (y = ax + b + e). Measurement error can be additive (y = ax + b + e) or multiplicative (y = (ax + b) * e).
- Model error model - the model itself is imperfect.
In practice, consider all three when constructing and working with models.
Example: predict student grades from pre-test scores. Sampling model: students in class are sample from population of all students. Measurement error model: pre-test scores and grades are measured with error - imperfect measure of ability. Model error model: relationship between pre-test scores and grades is not perfect.
Usual approach: $y_i = ax_i + b + e_i$, where $e_i$ is also interpretable as model error; $e_i$ is considered a random sample from a distribution (e.g., normal with mean 0 and standard deviation $\sigma$) that represents a hypothetical 'superpopulation' of all possible errors.
Sampling distribution¶
Sampling distribution: the set of possible datasets that could have been observed if data collection were re-done, along with the probabilities of those possible values. The sampling distribution is determined by the data collection process and the model for the data. It is a theoretical construct that is not directly observable, and the term is misleading (better would be 'probabilistic data model'), but it can be approximated by simulation or resampling methods (e.g., the bootstrap).
Example: for a pure random sample of size n from a population of size N, the sampling distribution is the set of all samples of size n, all with equal probability.
Example, pure measurement error: if observations $y_i, i=1,\ldots,n$ are generated by $y_i = ax_i + b + e_i$, where a and b are fixed parameters and the $e_i$ come from a specified distribution (e.g., normal with mean 0 and standard deviation $\sigma$), then the sampling distribution is the set of all possible datasets $y_i$ that could have been generated from the $x_i$ and the distribution of the $e_i$.
Sampling distribution is not known in practice, but can be estimated. Sampling distribution is a generative model in that it represents a random process which if known would allow us to generate new datasets that are similar to the observed data.
4.2 Estimates, standard errors, and confidence intervals¶
Parameters, estimands, and estimates¶
Parameters: unknown numbers that determine a model. E.g., in $y_i = ax_i + b + e_i$, the parameters are a and b (coefficients) and $\sigma$ (the scale, or 'variance parameter', of the error distribution). Parameters can be used to simulate data.
Estimand (or quantity of interest): a summary of parameters or data that we wish to learn about. E.g., in $y_i = ax_i + b + e_i$, the estimand could be a, b, or $\sigma$, or some function of them (e.g., a/b).
Use data to estimate parameters and estimands. The sampling distribution of an estimate is a byproduct of the sampling distribution of the data and the estimation procedure.
Standard errors, inferential uncertainty, and confidence intervals¶
Standard error: the standard deviation of an estimate. It gives a sense of the uncertainty in the estimate.
We usually summarise uncertainty using simulation, and give the term 'standard error' a looser meaning that covers any measure of uncertainty comparable to the posterior standard deviation.
Standard error is measure of variation in estimate and gets smaller as sample size increases, converging to zero as sample size goes to infinity. Standard error is not a measure of the probability that an estimate is correct, but rather a measure of the variability of the estimate across hypothetical repeated samples.
Confidence interval: range of values of parameter or quantity of interest that is consistent with the observed data, given the assumed sampling distribution. If the model is correct, then the confidence interval will contain the true value of the parameter or quantity of interest with a specified probability (e.g., 95%).
The usual 95% confidence interval for large samples, based on the assumption that the sampling distribution of the estimate is approximately normal, is the estimate plus or minus 1.96 times its standard error. This is because, for a normal distribution, approximately 95% of values lie within 1.96 standard deviations of the mean. The 50% interval is easy to understand: the true value is as likely to be within the interval as outside it.
Assuming the model is correct, it should happen only about 5% of the time that the estimate falls more than 2 standard errors away from the true parameter value.
Standard errors and confidence intervals for averages and proportions¶
Standard error of an average: given a sample of size n from a large (effectively infinite) population, the standard error of the sample mean is $\sigma / \sqrt{n}$, where $\sigma$ is the standard deviation of the population.
A proportion is a special case of an average where each observation is 0 or 1, with y yeses and n-y nos. The estimate of the proportion is $\hat{p} = y/n$ and its standard error is $\sqrt{\hat{p}(1-\hat{p})/n}$.
Confidence intervals for proportions can be calculated using the standard error formula. E.g., in a random sample of 1000 people, 700 support the death penalty and 300 oppose it, so $\hat{p} = 0.7$ and the standard error is $\sqrt{0.7 \times 0.3/1000} = 0.0145$. The 95% confidence interval is $0.7 \pm 1.96 \times 0.0145$, or (0.672, 0.728).
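A minimal check of this arithmetic in Python (the counts are from the example above):

```python
import numpy as np
from scipy import stats

# Death-penalty example: 700 support, 300 oppose, in a random sample of 1000
y, n = 700, 1000
p_hat = y / n                                        # estimated proportion
se = np.sqrt(p_hat * (1 - p_hat) / n)                # standard error of the proportion
ci_95 = p_hat + stats.norm.ppf([0.025, 0.975]) * se  # normal-approximation 95% CI
print(f"p_hat = {p_hat}, SE = {se:.4f}, 95% CI = [{ci_95[0]:.3f}, {ci_95[1]:.3f}]")
```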
Standard error and confidence interval for a proportion when y=0 or y=n¶
Conventionally, y and n-y should both be greater than 5 for the standard error formula to be valid. If y=0 or y=n, the standard error is zero and the confidence interval has zero width. A quick correction is $\hat{p} = (y + 2)/(n + 4)$, which gives a non-zero standard error and confidence interval.
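A sketch of the correction, using a hypothetical survey with n = 50 and zero yes responses (note the adjusted interval can extend slightly below zero; it is common to truncate it at the boundary):

```python
import numpy as np
from scipy import stats

# Hypothetical: none of n = 50 respondents answers yes, so the raw estimate
# is p_hat = 0 with SE = 0, giving a useless zero-width interval
y, n = 0, 50
p_adj = (y + 2) / (n + 4)                        # add 2 successes and 2 failures
se_adj = np.sqrt(p_adj * (1 - p_adj) / (n + 4))  # SE computed on the adjusted counts
ci_95 = p_adj + stats.norm.ppf([0.025, 0.975]) * se_adj
print(f"adjusted p = {p_adj:.3f}, SE = {se_adj:.4f}, 95% CI = [{ci_95[0]:.3f}, {ci_95[1]:.3f}]")
```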
Standard error for a comparison¶
Standard error of the difference of two independent quantities = $\sqrt{SE_1^2 + SE_2^2}$, where $SE_1$ and $SE_2$ are the standard errors of the two quantities.
Example: 1000 people, 400 men and 600 women, 57% men and 45% of women plan to vote for candidate A. Standard error of difference in proportions is $\sqrt{0.57*0.43/400 + 0.45*0.55/600} = 0.032$.
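The same calculation in Python (sample sizes and proportions from the example above):

```python
import numpy as np

# Voting example: 57% of 400 men and 45% of 600 women support candidate A
p_men, n_men = 0.57, 400
p_women, n_women = 0.45, 600

diff = p_men - p_women
se_men = np.sqrt(p_men * (1 - p_men) / n_men)        # SE of each proportion
se_women = np.sqrt(p_women * (1 - p_women) / n_women)
se_diff = np.sqrt(se_men**2 + se_women**2)           # independent SEs add in quadrature
print(f"difference = {diff:.2f}, SE = {se_diff:.3f}")
```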
Sampling distribution of the sample mean and standard deviation: normal and $\chi^2$ distributions¶
Suppose we draw n data points $y_1, \ldots, y_n$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$, and compute the sample mean $\bar{y}=\sum y_i/n$ and sample standard deviation $s = \sqrt{\sum (y_i - \bar{y})^2/(n-1)}$. The sample mean is normally distributed with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. For the sample standard deviation, $s^2 (n-1)/\sigma^2$ is distributed as $\chi^2$ with n-1 degrees of freedom.
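These two sampling distributions can be checked by simulation; the values of $\mu$, $\sigma$, and n below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, n_sims = 5.0, 2.0, 10, 100_000

# Simulate many datasets of size n; record the sample mean and SD of each
draws = rng.normal(mu, sigma, size=(n_sims, n))
means = draws.mean(axis=1)
sds = draws.std(axis=1, ddof=1)

# Sample mean: normal with mean mu and sd sigma/sqrt(n) ≈ 0.632 here
print(means.mean(), means.std())

# (n-1) s^2 / sigma^2: chi-squared with n-1 df, whose mean is n-1 = 9 here
chi2_stat = (n - 1) * sds**2 / sigma**2
print(chi2_stat.mean())
```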
Degrees of freedom¶
Degrees of freedom arise with the $\chi^2$ distribution and in a few other places. They relate to the need to correct for overfitting when estimating the error of future predictions from a fitted model. Calculating predictive error on the same data used to fit the model will underestimate the true error, because the model overfits the noise in the data. Roughly, the data provide n degrees of freedom, but fitting a model with k parameters uses up k of them, leaving n-k degrees of freedom to estimate the error of future predictions. Lower degrees of freedom means more overfitting and more underestimation of the true error.
Confidence intervals from the t distribution¶
The t distribution is a family of distributions that are similar to the normal distribution but have heavier tails. A t distribution is characterised by its center, scale, and degrees of freedom (from 1 to $\infty$). Distributions in the t family with low degrees of freedom have heavier tails than the normal distribution, meaning they are more likely to produce values far from their center. As the degrees of freedom increase, the t distribution approaches the normal distribution.
When a standard error is estimated from n data points, we can account for uncertainty in the standard error by using the t distribution with n-1 degrees of freedom; the degrees of freedom are n-1 because the mean is estimated from the data and uses up one degree of freedom.
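A quick look at how the t multiplier for a 95% interval shrinks toward the normal value of 1.96 as the degrees of freedom increase:

```python
from scipy import stats

# 97.5th percentile (the 95%-interval multiplier) at various degrees of freedom;
# roughly 4.30 at df=2, 2.57 at df=5, 2.23 at df=10, 2.04 at df=30
for df in [2, 5, 10, 30, 100]:
    print(df, round(stats.t.ppf(0.975, df), 3))

# normal limit: 1.960
print("normal", round(stats.norm.ppf(0.975), 3))
```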
Inference for discrete data¶
We can use the same continuous formula for the standard error (sample standard deviation divided by $\sqrt{n}$).
Example: 1000 people; 600 have 0 dogs, 300 have 1 dog, 50 have 2 dogs, 30 have 3 dogs, and 20 have 4 dogs. What is the 95% confidence interval for the average number of dogs?
import numpy as np # for numerical operations
from scipy import stats # for statistical distributions
# Create the data
y = np.repeat([0, 1, 2, 3, 4], [600, 300, 50, 30, 20]) # repeat each value the specified number of times
n = len(y) # sample size (1000)
estimate = np.mean(y) # sample mean (point estimate)
se = np.std(y, ddof=1) / np.sqrt(n) # standard error = sample SD / sqrt(n), ddof=1 for sample SD
int_50 = estimate + stats.t.ppf([0.25, 0.75], df=n-1) * se # 50% confidence interval using t-distribution quantiles
int_95 = estimate + stats.t.ppf([0.025, 0.975], df=n-1) * se # 95% confidence interval using t-distribution quantiles
print(f"Estimate: {estimate:.3f}") # print the sample mean
print(f"SE: {se:.4f}") # print the standard error
print(f"50% CI: [{int_50[0]:.3f}, {int_50[1]:.3f}]") # print the 50% confidence interval
print(f"95% CI: [{int_95[0]:.3f}, {int_95[1]:.3f}]") # print the 95% confidence interval
Estimate: 0.570 SE: 0.0277 50% CI: [0.551, 0.589] 95% CI: [0.516, 0.624]
Linear transformations¶
In the example above, the 95% CI for the number of dogs per person is [0.516, 0.624]. What is the 95% CI for the number of dogs per 100 people? Just multiply the endpoints by 100: [51.6, 62.4].
Claude's layman's explanation¶
Based on our survey of 1,000 people, the average person owns about 0.57 dogs. We're 95% confident that if we could ask every adult in the country, the true average would be somewhere between 0.52 and 0.62 dogs per person.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats  # for the normal quantiles used in the 68% intervals below
# Read all lines from the file
with open('../ros_data/death_polls.dat', 'r') as f:
lines = [line.strip() for line in f.readlines()] # strip whitespace from each line
# Group every 4 lines into one record
records = []
for i in range(0, len(lines), 4): # step through lines in groups of 4
year_month = lines[i].split() # split "2002 10" into ["2002", "10"]
year = int(year_month[0]) # first value is the year
month = float(year_month[1]) # second value is the month (can be fractional)
val1 = int(lines[i+1]) # percent support
val2 = int(lines[i+2]) # percent oppose
val3 = int(lines[i+3]) # percent don't know
records.append([year, month, val1, val2, val3])
# Create a DataFrame
df = pd.DataFrame(records, columns=['year', 'month', 'support', 'oppose', 'dont_know'])
# val1, val2, val3 are percentages (sum to 100), not raw counts
# proportion supporting, among those with an opinion
df['percent_support'] = df['support'] / (df['support'] + df['oppose'])
# SE for a proportion requires actual sample size; Gallup polls typically survey ~1000 people
n_respondents = 1000
df['se'] = np.sqrt(df['percent_support'] * (1 - df['percent_support']) / n_respondents)
df['ci_68_lower'] = df['percent_support'] - stats.norm.ppf(0.84) * df['se']
df['ci_68_upper'] = df['percent_support'] + stats.norm.ppf(0.84) * df['se']
display(df.head())
fig, ax = plt.subplots(figsize=(10, 5))
# Convert year + fractional month to a decimal year for x-axis positioning
x = df['year'] + (df['month'] - 1) / 12
# Error bars from ci_68_lower to ci_68_upper
yerr_lower = (df['percent_support'] - df['ci_68_lower']) * 100
yerr_upper = (df['ci_68_upper'] - df['percent_support']) * 100
ax.errorbar(x, df['percent_support'] * 100, yerr=[yerr_lower, yerr_upper],
fmt='o', color='#1a1a2e', ecolor='#888', elinewidth=1.2,
capsize=3, capthick=1.2, markersize=5)
ax.set_xlabel('Year')
ax.set_ylabel('Support for death penalty (%)')
ax.set_title('Public support for the death penalty')
sns.despine()
plt.tight_layout()
plt.show()
| | year | month | support | oppose | dont_know | percent_support | se | ci_68_lower | ci_68_upper |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2002 | 10.0 | 70 | 25 | 5 | 0.736842 | 0.013925 | 0.722994 | 0.750690 |
| 1 | 2002 | 5.0 | 72 | 25 | 3 | 0.742268 | 0.013831 | 0.728513 | 0.756023 |
| 2 | 2001 | 10.0 | 68 | 26 | 6 | 0.723404 | 0.014145 | 0.709337 | 0.737471 |
| 3 | 2001 | 5.0 | 65 | 27 | 8 | 0.706522 | 0.014400 | 0.692202 | 0.720842 |
| 4 | 2001 | 2.0 | 67 | 25 | 8 | 0.728261 | 0.014068 | 0.714271 | 0.742250 |
Weighted averages¶
Confidence intervals can be determined by appropriately combining separate means and variances.
Example: surveys in France, Germany, and Italy yield estimates 0.55 ± 0.02, 0.61 ± 0.03, and 0.38 ± 0.03. The estimated proportion for all three countries combined is $\frac{N_1}{N_{total}} \times 0.55 + \frac{N_2}{N_{total}} \times 0.61 + \frac{N_3}{N_{total}} \times 0.38$, where $N_i$ is the population of country i and $N_{total}$ is the total population of all three countries. The standard error of this weighted average is $\sqrt{(\frac{N_1}{N_{total}} \times 0.02)^2 + (\frac{N_2}{N_{total}} \times 0.03)^2 + (\frac{N_3}{N_{total}} \times 0.03)^2}$. Given $N_i$, the estimate, and the standard error for each country, we can calculate the weighted average and its standard error, and then construct a confidence interval for the overall estimate.
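A sketch of the calculation; the population sizes below are made-up stand-ins (roughly the three countries' populations in millions), while the estimates and standard errors are from the example above:

```python
import numpy as np

# Assumed populations (millions) for France, Germany, Italy - hypothetical values
N = np.array([65.0, 83.0, 59.0])
p = np.array([0.55, 0.61, 0.38])    # country-level estimates (from the text)
se = np.array([0.02, 0.03, 0.03])   # country-level standard errors (from the text)

w = N / N.sum()                      # population weights
p_avg = np.sum(w * p)                # weighted average estimate
se_avg = np.sqrt(np.sum((w * se)**2))  # SE of the weighted average
ci_95 = p_avg + np.array([-1.96, 1.96]) * se_avg
print(f"weighted avg = {p_avg:.3f}, SE = {se_avg:.4f}, 95% CI = [{ci_95[0]:.3f}, {ci_95[1]:.3f}]")
```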
4.3 Bias and unmodeled uncertainty¶
Inferences assume unbiased measurements, random samples, and randomised experiments; however, real data collection is imperfect.
Bias in estimation¶
An estimate is unbiased if it is correct on average, across its sampling distribution.
Non-response or unrepresentative response can result in a biased sample and a biased estimate.
In reality, all estimates are biased in some way.
Bias depends on the sampling distribution, which is unknown.
Adjusting inferences to account for bias and unmodeled uncertainty¶
The standard error might not fully capture the real, practical uncertainty associated with an inference; there can be important sources of uncertainty that are not captured in the simple standard error.
For example: systematic differences between survey respondents and voters, variation in opinion over time, and inaccurate responses.
How to account for sources of error not in the model: improve data collection, expand the model, and increase the stated uncertainty.
Expand model: divide the sample into subgroups and assume a simple random sample within each. Not perfect, but this allows us to reduce bias in estimation by adjusting for known differences between sample and population.
Increase uncertainty: typically assume errors are independent and so capture additional uncertainty by adding variances. Variance is the square of the standard deviation.
Total uncertainty = $\sqrt{S_{1}^2 + S_{2}^2}$ where $S_{1}$ is the standard error from the sampling distribution and $S_{2}$ is the standard error from the additional source of uncertainty. The mathematics says that it will be most effective to reduce the largest source of uncertainty, so if $S_{1}$ is much larger than $S_{2}$, then reducing $S_{1}$ will have a bigger impact on reducing total uncertainty than reducing $S_{2}$.
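A small numerical illustration, using assumed values $S_1 = 0.02$ and $S_2 = 0.03$, showing that shrinking the larger source of uncertainty reduces the total more:

```python
import numpy as np

s1, s2 = 0.02, 0.03                        # assumed sampling SE and extra SE
total = np.sqrt(s1**2 + s2**2)             # combined uncertainty
print(round(total, 4))                     # 0.0361

# Halving each source in turn: reducing the larger one helps more
print(round(np.sqrt((s1/2)**2 + s2**2), 4))  # halve the smaller source: 0.0316
print(round(np.sqrt(s1**2 + (s2/2)**2), 4))  # halve the larger source: 0.025
```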
4.4 Statistical significance, hypothesis testing, and statistical errors¶
Concerns the possibility of mistakenly coming to strong conclusions that don't replicate or reflect reality. Statistical significance and hypothesis testing are common approaches to try to avoid this, but they have limitations and can be misused.
Statistical significance¶
Don't misinterpret a statistically significant result as evidence that an effect is stable or real.
Conventionally defined as p-value less than 0.05, relative to some null hypothesis. A p-value is the probability of observing data as extreme or more extreme than the observed data, given that the null hypothesis is true. A small p-value (e.g., less than 0.05) is often taken as evidence against the null hypothesis, suggesting that the observed data is unlikely to have occurred by chance alone under the null hypothesis.
An estimate is not statistically significant if it falls less than 2 standard errors from the null value - that is, if the observed value could reasonably be explained by simple chance variation.
Example: 20 coin tosses, 8 heads and 12 tails. Estimate of probability of heads is 0.4 with standard error of 0.11. Not statistically significant because 0.4 is less than 2 standard errors from null value of 0.5, so observed value could reasonably be explained by simple chance variation.
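The coin-toss check in Python (counts from the example above):

```python
import numpy as np

# 8 heads in 20 tosses: is this evidence against a fair coin?
y, n = 8, 20
p_hat = y / n
se = np.sqrt(p_hat * (1 - p_hat) / n)   # ≈ 0.11
z = (p_hat - 0.5) / se                  # distance from the null value, in SEs
print(f"estimate = {p_hat}, SE = {se:.2f}, z = {z:.2f}")
# |z| < 2, so the result is not statistically significant
```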
Hypothesis testing for simple comparisons¶
Simple example: a randomized experiment to compare the effectiveness of two drugs for lowering cholesterol. The mean and standard deviation of post-treatment cholesterol are $\bar{y}_T$ and $s_T$ for the $n_T$ patients in the treatment group, and $\bar{y}_C$ and $s_C$ for the $n_C$ patients in the control group.
Estimate, standard error and degrees of freedom. The parameter of interest is $\phi = \phi_T - \phi_C$, the difference in post-treatment cholesterol between treatment and control groups. The estimate of $\phi$ is $\hat{\phi} = \bar{y}_T - \bar{y}_C$, with standard error $se(\hat{\phi}) = \sqrt{\frac{s_T^2}{n_T} + \frac{s_C^2}{n_C}}$. The approximate 95% confidence interval is $\hat{\phi} \pm t_{n_T+n_C-2}^{0.975} \, se(\hat{\phi})$, where $t_{n_T+n_C-2}^{0.975}$ is the 97.5th percentile of the t distribution with $n_T+n_C-2$ degrees of freedom; as the sample sizes grow, this quantile approaches 1.96, corresponding to the 95% confidence interval for a normal distribution.
Null and alternative hypotheses. The null hypothesis is $\phi = 0$ (no difference between treatment and control groups), that is, $\phi_{T} = \phi_{C}$. The alternative hypothesis is $\phi \neq 0$ (there is a difference), that is, $\phi_{T} \neq \phi_{C}$.
A hypothesis test is based on a test statistic that summarises the deviation of the data from what would be expected under the null hypothesis. The conventional test statistic here is the absolute value of the t-score, $t=|\hat{\phi} - 0|/se(\hat{\phi})$, with the absolute value representing a two-sided test - so called because we are interested in deviations in either direction (treatment better than control or control better than treatment).
p-value. The deviation from the null hypothesis is summarised by the p-value: the probability of observing a test statistic as extreme or more extreme than the observed one, given that the null hypothesis is true.
If the standard error of $\hat{\phi}$ is known rather than estimated from the data, we can use the normal distribution (z-test) instead of the t distribution.
In common practice, the null hypothesis is said to be rejected if the p-value is less than 0.05 - equivalently, if the 95% confidence interval for the parameter excludes zero. This means that if the null hypothesis were true, we would expect to see a test statistic this extreme only about 5% of the time due to random chance alone.
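A sketch of the whole comparison, using made-up summary statistics for the two groups (the formulas are the ones given above):

```python
import numpy as np
from scipy import stats

# Hypothetical summary statistics for the two-drug cholesterol comparison
ybar_T, s_T, n_T = 185.0, 20.0, 50   # treatment group (made-up numbers)
ybar_C, s_C, n_C = 200.0, 25.0, 50   # control group (made-up numbers)

phi_hat = ybar_T - ybar_C                       # estimated difference
se = np.sqrt(s_T**2 / n_T + s_C**2 / n_C)       # SE of the difference
df = n_T + n_C - 2                              # degrees of freedom
t_stat = abs(phi_hat) / se                      # two-sided test statistic
p_value = 2 * stats.t.sf(t_stat, df)            # two-sided p-value
ci_95 = phi_hat + stats.t.ppf([0.025, 0.975], df) * se
print(f"estimate = {phi_hat:.1f}, SE = {se:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"95% CI = [{ci_95[0]:.1f}, {ci_95[1]:.1f}]")
```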
Hypothesis testing: general formulation¶
Simplest form of hypothesis testing, null hypothesis $H_0$ represents a particular probability model, $p(y)$ with potential replication data $y^{rep}$. To perform a hypothesis test, we choose a test statistic $T$ that is a function of the data. For any given data y, the p-value is then $Pr(T(y^{rep}) \geq T(y) | H_0)$, which is the probability that the test statistic calculated from the replication data $y^{rep}$ is as extreme or more extreme than the test statistic calculated from the observed data y, given that the null hypothesis $H_0$ is true. If this p-value is small (e.g., less than 0.05), we reject the null hypothesis, suggesting that the observed data is unlikely to have occurred by chance alone under the null hypothesis.
In regression, testing is more complicated. The model can be written as $p(y|x,\phi)$, where $\phi$ is a set of parameters including coefficients, the residual standard deviation, and any other parameters of the model. A null hypothesis is that a particular parameter (e.g., a coefficient) equals zero, corresponding to the idea that there is no relationship between the predictor variable and the outcome variable.
Consider the model $y_i = a + b x_i + e_i$, where the errors are normally distributed with mean 0 and standard deviation $\sigma$. Then $\phi$ is the vector $(a, b, \sigma)$. The null hypothesis $b=0$ is composite: it corresponds to the regression model with parameters $(a, 0, \sigma)$ for any values of a and $\sigma$. The p-value $Pr(T(y^{rep}) \geq T(y) | H_0)$ then depends on a and $\sigma$, and what is typically done is to choose the maximum p-value over these (the most conservative choice). To put it another way, the hypothesis test is performed against the null distribution closest to the data.
Comparisons of parameters to fixed values and each other: interpreting confidence intervals as hypothesis tests¶
Hypothesis that parameter equals zero (or any other value) can be tested by fitting a model that includes the parameter and examining the 95% interval. If the interval excludes zero (or the value of interest), then the hypothesis is rejected at the 5% level.
Testing if two parameters are equal is the same as testing if the difference between the two parameters is equal to zero. This can be done by fitting a model that includes both parameters and examining the confidence interval for the difference between the two parameters. If the confidence interval for the difference excludes zero, then we reject the hypothesis that the two parameters are equal at the 5% level.
The hypothesis that one parameter is positive can be assessed by examining its confidence interval. Testing whether one parameter is greater than another can be assessed by examining the confidence interval for the difference between them. If the confidence interval for the difference is entirely above zero, then we reject the hypothesis that one parameter is less than or equal to the other at the 5% level.
Hypothesis test outcomes are "reject null hypothesis" or "fail to reject null hypothesis". We never say "accept null hypothesis" because failing to reject the null hypothesis does not mean that the null hypothesis is true, it just means that we do not have enough evidence to reject it.
Type 1 and type 2 errors and why we don't like talking about them¶
Type 1 error: rejecting the null hypothesis when it is actually true. Type 2 error: failing to reject the null hypothesis when it is actually false.
The fundamental problem is that in many problems we do not think the null hypothesis can be exactly true. For example, a political advert will change some opinions; we might frame the null hypothesis as the advert having no effect, but we are not really interested in type 1 error. We are more interested in type 2 error: failing to detect an effect when there is one.
Concern: When a study finds a "statistically significant" result (p < 0.05, say), people don't just say "there's an effect." They look at the estimated size of that effect and use it to make decisions. For example, "this drug lowers blood pressure by 10 mmHg" — that number matters for real-world choices. Because of this, we should not only care about whether an effect is statistically significant, but also about the accuracy of the effect size estimate and the uncertainty around that estimate. This is a problem, because statistically significant results tend to be overestimates of the true effect size ("winners curse" or "type M error" - magnitude error).
The type 1 / type 2 error framing is based on a deterministic view of effects, which might be appropriate for large, stable effects, but less so in modern sciences with small, highly variable effects.
Type M (magnitude) and type S (sign) errors¶
Type M and S errors can occur when a researcher makes a claim with confidence (traditionally, a p-value less than 0.05 or a confidence interval excluding zero). A type S error is when the sign of the estimated effect is opposite to the true effect. A type M error is when the magnitude of the estimated effect is much different from the true effect.
A statistical procedure can be characterised by its type S error rate - the probability of making a type S error - and its exaggeration factor, which is the expected magnitude of the estimated effect divided by the magnitude of the true effect, given that the estimate is statistically significant.
When a procedure is noisy, the type S error rate and exaggeration factor can be large, meaning that even when we get a statistically significant result, it can be in the wrong direction (type S error) or an overestimate of the true effect size (type M error). This is a problem because it can lead to false confidence in results that are not actually reliable.
In quantitative research, we are particularly concerned with type M errors or exaggeration factors, which can be understood in light of the "statistical significance filter" - the idea that when we only look at statistically significant results, we are more likely to see overestimates of the true effect size, because only the larger estimates will be statistically significant. Imagine the true effect of something is actually quite small. If your study is noisy (high standard error), your result won't reach that "twice the standard error" threshold unless you happen to get a lucky, inflated measurement by chance. So the only results from noisy studies that make it through to publication are the ones that got lucky and measured an effect that's bigger than the truth. Say the true effect is 3 units, but your standard error is 5. A result of 3 would never be published (not significant). Only results of 10+ might squeak through - and those are flukes. So readers only ever see the flukes. The noisier your study, the worse this problem gets. Noisy studies require bigger-looking results to get published, so they systematically overstate reality. This is why many published findings in social science, medicine, and psychology have failed to replicate - the real effects turn out to be much smaller than originally reported.
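The significance filter can be demonstrated by simulation, reusing the numbers above (true effect 3, standard error 5) and recording the type S error rate and exaggeration factor among significant estimates:

```python
import numpy as np

# Simulate the sampling distribution of an estimate with true effect 3, SE 5,
# and look only at the "statistically significant" replications
rng = np.random.default_rng(1)
true_effect, se = 3.0, 5.0
est = rng.normal(true_effect, se, size=1_000_000)
sig = np.abs(est) > 1.96 * se                     # significant at the 5% level

type_s_rate = np.mean(est[sig] < 0)               # wrong sign, given significance
exaggeration = np.mean(np.abs(est[sig])) / true_effect
print(f"P(significant)      = {sig.mean():.3f}")
print(f"type S error rate   = {type_s_rate:.3f}")
print(f"exaggeration factor = {exaggeration:.1f}")
```

With these numbers, significant estimates overstate the true effect by roughly a factor of four, and a few percent of them even have the wrong sign.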
Hypothesis testing and statistical practice¶
We generally don't use null hypothesis significance testing in practice, because we generally don't think the null hypothesis is true. The question is never really "is there an effect?" but rather "how big is the effect, and does it matter?". If you collect enough data, you can statistically "prove" almost anything is non-zero. So significance testing basically just tells you whether your sample was large enough - not whether the effect is meaningful or important. Instead we should focus on estimating effect sizes and their uncertainty, and on understanding the practical significance of our findings rather than just the statistical significance.
- When you fail to reject the null: This doesn't mean "there's no effect." It just means "our data isn't informative enough to say much beyond the baseline assumption." It's a signal about the limits of your data, not a claim about reality. Think of it like saying "we couldn't hear anything" rather than "there was no sound."
- When you do reject the null: The point isn't "aha, we proved the effect exists!" — they already assumed it existed before the study started. Instead, rejection means "our data contains enough of a signal that it's worth building a more detailed, complex model." It's a green light to dig deeper, not a finish line.
The key mental shift: In the standard view, rejecting the null is the goal. In their view, it's just a diagnostic tool — a way of asking whether your data has enough signal to be worth analyzing further. And failing to reject just means your data is too weak to say anything useful, not that the null is actually true.
Researchers often begin with a belief that there is an effect, their aim is to reject the null hypothesis of no effect, and then to claim that their theory is supported. But this is a flawed way of thinking about the scientific process for many reasons. E.g., the effect could be due to poor measurement, random variation, and other sources of error can lead to false positives (type 1 errors).
The issue is that a statistical hypothesis (e.g., B=0) is different from a scientific hypothesis.
4.5 Problems with the concept of statistical significance¶
The approach of summarising results by statistical significance and drawing a sharp distinction between "significant" and "not significant" is problematic for several reasons.
Statistical significance is not the same as practical significance¶
A result can be statistically significant but not practically significant. A result could also fail to be statistically significant but still be practically significant.
E.g., a treatment that increases annual earnings by $10 could be statistically significant (with a large enough sample) but not practically significant. Conversely, a treatment that increases annual earnings by $10,000 could fail to reach statistical significance (in a small or noisy study) yet be practically significant.
Non significant results are not evidence of no effect¶
Failing to reject the null hypothesis does not mean that the null hypothesis is true. It just means that we do not have enough evidence to reject it. There could still be an effect, but our data is not informative enough to detect it. So a non-significant result is not evidence of no effect, it's just evidence of insufficient data.
E.g., observed average difference in treadmill time was 16.6 seconds with standard error of 10 seconds, 95% confidence interval included zero and p-value of 0.2, so not statistically significant. Fair to say results are uncertain. But lack of significance does not mean that there is no effect of the treatment on treadmill time, it just means that we do not have enough evidence to reject the null hypothesis of no effect. The true effect could still be positive or negative, but our data is not informative enough to say which.
The difference between “significant” and “not significant” is not itself statistically significant¶
A common mistake is to compare two estimates and conclude that one is significantly different from zero while the other is not, and then to claim that the two estimates are significantly different from each other. However, this is not necessarily the case.
E.g., a shift in the p-value from 0.051 to 0.049 crosses the conventional significance threshold, yet could result from a very small, non-significant change in the data.
Researcher degrees of freedom, p-hacking, and forking paths¶
Researchers have many degrees of freedom in how they collect, analyze, and report data. This can lead to p-hacking, which is the practice of trying multiple analyses or data collection strategies until you get a statistically significant result.
The standard multiple comparisons problem is familiar: if you run 20 tests, one will likely appear significant by chance alone. Most researchers know they shouldn't do this obvious kind of fishing.
The sneakier version is that you can have the same problem without ever consciously running multiple tests. Here's how: At every stage of data analysis, researchers make small judgment calls — "researcher degrees of freedom":
Do I include or exclude these outliers? Do I control for this variable or not? Do I log-transform this data? Do I define the outcome this way or that way? Do I drop observations from this subgroup?
Each of these decisions creates a fork in the road. Taken together, the number of possible analyses you could have run is enormous — even if you only actually ran one.
The problem: If the path you chose was influenced at all by what the data seemed to be showing — even subconsciously — then you've effectively searched through many possible analyses and landed on one that looked good. You did the equivalent of running 50 tests without realizing it.
The key word is "contingent" — meaning your analytical choices depended on the data itself. When that happens, your single final analysis isn't truly independent of the data, which is what statistical tests assume.
A concrete example: You run your regression, notice the results look weak, and think "hmm, maybe I should drop those unusual observations" or "maybe I should add this control variable." You make the change, the result becomes significant, and you report it. You only ran one final analysis — but the road to it was shaped by peeking at results along the way.
In short: the multiple comparisons problem isn't just about how many tests you ran — it's about how many you implicitly considered while making decisions. And that number is almost always much larger than it appears.
Four ways a researcher might conduct a statistical test:
- One test, decided in advance, with no data peeking. Clean and unambiguous.
- Menu of analyses, decided in advance, with no data peeking. Still clean, but more flexible.
- Run only one analysis, but it was influenced by peeking at the data. This is the most common and most problematic scenario. Data shapes the analysis.
- Run many analyses and report only the significant one. This is the classic p-hacking scenario.
In short: the problem isn't just deliberate cheating. It's that letting data influence your analytical decisions — even innocently — corrupts your results in the same way that explicit fishing does.
Adjusting p-values for multiple comparisons can help, but it doesn't solve the underlying problem of data-contingent analysis.
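The arithmetic behind the 20-tests intuition is easy to check by simulation. A minimal sketch, assuming independent tests all conducted under the null (so p-values are uniform on [0, 1]), estimates how often at least one of 20 tests comes out "significant" at the 5% level:

```python
import numpy as np

rng = np.random.default_rng(42)
n_studies, n_tests = 10_000, 20
# Under the null hypothesis, p-values are uniformly distributed on [0, 1]
pvals = rng.uniform(size=(n_studies, n_tests))
# Fraction of simulated "studies" where at least one of the 20 tests is significant
any_sig = (pvals < 0.05).any(axis=1).mean()
print(f"P(at least one 'significant' test out of 20) ~= {any_sig:.2f}")
print(f"Theory: 1 - 0.95**20 = {1 - 0.95**20:.2f}")
```

Both numbers come out around 0.64, so even with no real effects anywhere, a researcher exploring 20 forks will usually find something that looks significant.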
Authors suggest:
- Test how robust your results are to different analytical choices. If your result only holds under one specific set of decisions, that's a red flag.
- Stop treating p-values as the finish line. "We found a modest, uncertain effect" is legitimate and honest.
The statistical significance filter¶
Statistically significant estimates tend to be overestimates of the true effect size, because only the larger estimates will be statistically significant.
E.g., in a study with high noise, the standard error is large, so only estimates that are much larger than the true effect will reach statistical significance.
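A quick simulation illustrates the filter. Assuming (hypothetically) a true effect of 2 measured with standard error 8, the estimates that clear the 5% significance threshold are far larger than the truth:

```python
import numpy as np

rng = np.random.default_rng(1)
true_effect, se = 2.0, 8.0                  # assumed: modest true effect, very noisy study
est = rng.normal(true_effect, se, size=100_000)  # replications of the noisy estimate
sig = np.abs(est) > 1.96 * se               # the statistical significance filter
filtered_mean = np.abs(est[sig]).mean()     # average size of the estimates that survive
print(f"share significant: {sig.mean():.3f}")
print(f"mean |estimate| after the filter: {filtered_mean:.1f} (truth: {true_effect})")
```

Only a few percent of replications are significant, and those that are overstate the true effect by nearly an order of magnitude.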
Example: A flawed study of ovulation and political attitudes¶
Example showing both the multiple-comparisons (forking paths) problem and the statistical significance filter.
Claims: "Ovulation led single women to become more liberal, less religious, and more likely to vote for Barack Obama. In contrast, ovulation led married women to become more conservative, more religious, and more likely to vote for Mitt Romney. In addition, ovulatory-induced changes in political orientation mediated women’s voting behavior."
But many other comparisons could have been made. Choices include: which days of the month are classed as peak fertility, how single vs. married is defined (unmarried but partnered women were counted as married), and so on. With enough comparisons, some will be statistically significant by chance alone, and those are the ones that get reported.
Effects seem implausibly large (a type M error), which is consistent with the statistical significance filter: only the larger estimates reach statistical significance, so the reported estimates tend to overestimate the true effect size.
4.6 Example of hypothesis testing: 55,000 residents need your help!¶
Suspicion that votes have been rigged. Tallies are reported after the first 600 votes, then the next 600, the next 1244, the next 1000, the next 1000, and the final 1109, totalling 5553.
- Null hypothesis: votes arrive in random order, so each batch is a random sample of all voters.
- Test statistic: standard deviation of proportion of votes for candidate i across the 6 batches of votes.
- Theoretical distribution of the data if the null hypothesis is true. Under the null, the 6 subsets are random samples of the voters. If $\pi_{i}$ is the overall proportion of voters for candidate i, then the proportion who vote for i during time t, $p_{i,t}$, follows a distribution with mean $\pi_{i}$ and variance $\pi_{i}(1-\pi_{i})/n_t$. Under the null hypothesis, the variance of the $p_{i,t}$ across time should on average equal the average of the six batch variances, so the variance of the $p_{i,t}$, whose square root is our test statistic, should be approximately $\mathrm{avg}_{t=1}^{6}\, \pi_{i}(1-\pi_{i})/n_t$. The true proportions are not known, so following standard practice we insert the empirical proportions $p_i$, giving an expected value of the test statistic for each candidate i of $T_{i}^{\text{theory}} = \sqrt{p_{i}(1-p_{i})\, \mathrm{avg}_{t=1}^{6}\, 1/n_t}$.
- Comparing the test statistic to its theoretical distribution.
- Summary comparison using a $\chi^2$ test. Under the null, the probability of a candidate receiving votes is independent of time, so we can compute the summary statistic $\chi^2 = \sum_{j=1}^{2} \sum_{t=1}^{6} (\mathrm{observed}_{j,t} - \mathrm{expected}_{j,t})^2/\mathrm{expected}_{j,t}$ and compare it to its theoretical distribution: under the null hypothesis this statistic has a $\chi^2$ distribution with 5 degrees of freedom. If the observed $\chi^2$ statistic is much larger than expected under the null (a small p-value), we reject the null hypothesis and conclude that there is evidence of rigging. If it is not much larger than expected, we fail to reject the null and conclude that there is not enough evidence to suggest rigging.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
votes = pd.read_csv('../ros_data/Riverbay.csv', skiprows=0, names=['1_tally', '2_tally', '3_tally', '4_tally', '5_tally', '6_tally', 'candidate'])
# print the unique candidate names and count of unique candidate names
print(votes['candidate'].unique())
print(len(votes['candidate'].unique()))
# drop rows where the candidate name is missing (NaN)
votes.dropna(subset=['candidate'], inplace=True)
#reset the index to be the candidate names
votes.set_index('candidate', inplace=True)
# filter for candidates of interest
candidates_of_interest = ['Hal Spitzer', 'Margie Best', 'Greg Stevens', 'Josh Walker', 'Clotelia Smith', 'Dave Barron', 'Alphonse Preston', 'Andy Willis']
votes = votes[votes.index.isin(candidates_of_interest)]
display(votes.head())
# print unique names (index)
print(votes.index.unique())
# for each tally calculate the difference from the previous tally
votes['diff_2'] = votes['2_tally'] - votes['1_tally']
votes['diff_3'] = votes['3_tally'] - votes['2_tally']
votes['diff_4'] = votes['4_tally'] - votes['3_tally']
votes['diff_5'] = votes['5_tally'] - votes['4_tally']
votes['diff_6'] = votes['6_tally'] - votes['5_tally']
# copy the differences into a new DataFrame for plotting
diffs = votes[['1_tally', 'diff_2', 'diff_3', 'diff_4', 'diff_5', 'diff_6']].copy()
# rename columns for better plotting
diffs.columns = ['1', '2', '3', '4', '5', '6']
# calculate the sum for each tally
print(diffs[['1', '2', '3', '4', '5', '6']].sum())
# calculate the total for each candidate across all tallies
diffs['total'] = diffs[['1', '2', '3', '4', '5', '6']].sum(axis=1)
# display(diffs.head())
# for each candidate, calculate the percentage of votes they received in each tally, out of the total votes in that tally
for col in ['1', '2', '3', '4', '5', '6']:
    diffs[col] = diffs[col] / diffs[col].sum() * 100
display(diffs.head())
cols = ['1', '2', '3', '4', '5', '6']
# plot the percentage of votes for each candidate across the tallies, include legend and labels
plt.figure(figsize=(10, 6))
for candidate in diffs.index:
    plt.plot(cols, diffs.loc[candidate, cols], marker='o', label=candidate)
plt.xlabel('Tally')
plt.ylabel('Percentage of Votes (%)')
plt.title('Percentage of Votes for Each Candidate Across Tallies')
plt.legend()
plt.grid()
plt.tight_layout()
plt.show()
# Calculate the standard deviation of votes for each candidate across the 6 tallies
diffs['std_dev'] = diffs[['1', '2', '3', '4', '5', '6']].std(axis=1)
display(diffs[['std_dev']])
# plot the standard deviation of votes for each candidate on y axis and the total votes for each candidate on the x axis
plt.figure(figsize=(10, 6))
plt.scatter(diffs['total'], diffs['std_dev'])
plt.xlabel('Total Votes')
plt.ylabel('Standard Deviation of Votes Across Tallies')
plt.title('Standard Deviation of Votes Across Tallies vs Total Votes')
plt.grid()
plt.tight_layout()
plt.show()
['Clotelia Smith' 'Earl Coppin' 'Clarissa Montes' nan 'Hal Spitzer' 'Margie Best' 'Josh Walker' 'Greg Stevens' 'Dave Barron' 'Andy Willis' 'Alphonse Preston']
11
| candidate | 1_tally | 2_tally | 3_tally | 4_tally | 5_tally | 6_tally |
|---|---|---|---|---|---|---|
| Clotelia Smith | 208 | 416 | 867 | 1259 | 1610 | 2020 |
| Hal Spitzer | 333 | 650 | 1326 | 1870 | 2418 | 3040 |
| Margie Best | 236 | 483 | 1017 | 1422 | 1821 | 2300 |
| Josh Walker | 229 | 450 | 922 | 1318 | 1688 | 2131 |
| Greg Stevens | 235 | 462 | 970 | 1342 | 1724 | 2176 |
Index(['Clotelia Smith', 'Hal Spitzer', 'Margie Best', 'Josh Walker',
'Greg Stevens', 'Dave Barron', 'Andy Willis', 'Alphonse Preston'],
dtype='object', name='candidate')
1 1803
2 1833
3 3869
4 3050
5 3020
6 3491
dtype: int64
| candidate | 1 | 2 | 3 | 4 | 5 | 6 | total |
|---|---|---|---|---|---|---|---|
| Clotelia Smith | 11.536328 | 11.347518 | 11.656759 | 12.852459 | 11.622517 | 11.744486 | 2020 |
| Hal Spitzer | 18.469218 | 17.294053 | 17.472215 | 17.836066 | 18.145695 | 17.817244 | 3040 |
| Margie Best | 13.089296 | 13.475177 | 13.802016 | 13.278689 | 13.211921 | 13.720997 | 2300 |
| Josh Walker | 12.701054 | 12.056738 | 12.199535 | 12.983607 | 12.251656 | 12.689774 | 2131 |
| Greg Stevens | 13.033833 | 12.384070 | 13.130008 | 12.196721 | 12.649007 | 12.947579 | 2176 |
| candidate | std_dev |
|---|---|
| Clotelia Smith | 0.536054 |
| Hal Spitzer | 0.429701 |
| Margie Best | 0.286932 |
| Josh Walker | 0.362337 |
| Greg Stevens | 0.376835 |
| Dave Barron | 0.203392 |
| Andy Willis | 0.884638 |
| Alphonse Preston | 0.461595 |
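The $\chi^2$ test described above can be computed directly from the per-batch figures. A sketch for one candidate, using Clotelia Smith's cumulative tallies from the table above (per-batch votes are the successive differences) and assuming the batch sizes from the text apply as the denominators; `scipy.stats.chi2` supplies the reference distribution:

```python
import numpy as np
from scipy import stats

# Batch sizes (voters per reporting period) and one candidate's cumulative tallies
n_t = np.array([600, 600, 1244, 1000, 1000, 1109])
cumulative = np.array([208, 416, 867, 1259, 1610, 2020])  # Clotelia Smith
votes_for = np.diff(cumulative, prepend=0)  # per-batch votes for the candidate
votes_against = n_t - votes_for             # per-batch voters not choosing her

p_i = votes_for.sum() / n_t.sum()           # empirical overall proportion
expected_for = p_i * n_t
expected_against = (1 - p_i) * n_t

# chi-squared statistic over the 2 x 6 table of (for, against) x batch
chi2_stat = (((votes_for - expected_for) ** 2 / expected_for)
             + ((votes_against - expected_against) ** 2 / expected_against)).sum()
p_value = stats.chi2.sf(chi2_stat, df=5)    # 6 batches - 1 = 5 degrees of freedom
print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.2f}")
```

For this candidate the statistic is unremarkable for a $\chi^2$ with 5 degrees of freedom (the p-value is far from 0), so the batch-to-batch variation is consistent with random sampling rather than rigging.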
4.7 Moving beyond hypothesis testing¶
Hypothesis testing has issues, but it allows conclusions to be drawn from noisy data and provides a check on over-interpretation of noise. How to avoid overconfidence and exaggeration:
- Analyse all data - don't discard data; present and analyse it all.
- Present all comparisons, not just the significant ones. This allows readers to see the full picture and avoid being misled by the statistical significance filter.
- Make data available.
Good analysis is no substitute for good data.
We must move beyond the idea that effects are simply there or not, and the idea that the goal of a study is to reject the null hypothesis.
Think about variation when generalising. Does p < 0.05 represent eternal or even local truth? No, for two reasons: (1) uncertainty: with small effects, a large proportion of significant results can be in the wrong direction (type S error) or can overestimate the true effect size (type M error); (2) variation: even with perfect data and perfect analysis, the true effect size can vary across contexts, populations, and time, so rejecting the null hypothesis in one study doesn't mean the effect will be the same in another.
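Both failure modes are easy to demonstrate by simulation. Assuming (hypothetically) a tiny true effect of 0.1 with standard error 1, the replications that reach significance misbehave in both ways:

```python
import numpy as np

rng = np.random.default_rng(7)
true_effect, se = 0.1, 1.0                       # assumed: tiny effect, sizeable noise
est = rng.normal(true_effect, se, size=200_000)  # replications of the study
sig = np.abs(est) > 1.96 * se                    # "statistically significant" at 5%

type_s = (est[sig] < 0).mean()                   # significant but wrong sign
type_m = np.abs(est[sig]).mean() / true_effect   # average exaggeration factor
print(f"type S rate: {type_s:.2f}, type M (exaggeration): {type_m:.1f}x")
```

Under these assumptions, roughly four in ten significant results point in the wrong direction, and the significant estimates overstate the true effect by more than an order of magnitude.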
In short: an estimated large effect is typically too good to be true, and a small effect can be lost in the noise.
We move forward accepting uncertainty and embracing variation through Bayesian methods.