5. Simulation¶
- We use probability models to mimic variation in the real world, and simulation helps us understand how that variation plays out: random swings in the short term average out in the long term.
- We can use simulation to approximate the sampling distribution of data and propagate this to the sampling distribution of statistical estimates and procedures.
- Regression models are not deterministic; they produce probabilistic predictions. Simulation is the most convenient and general way to represent uncertainty in these predictions.
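The first bullet can be illustrated with a minimal sketch (a hypothetical coin-flip example, not from the text): early proportions swing widely, but the running average settles toward the true probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Flip a fair coin 10,000 times and track the running proportion of heads
flips = rng.binomial(1, 0.5, size=10_000)
running_prop = np.cumsum(flips) / np.arange(1, flips.size + 1)

# Early values fluctuate heavily; later values settle near 0.5
print(running_prop[9], running_prop[99], running_prop[-1])
```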
5.1 Simulation of discrete probability models¶
How many girls in 400 births?¶
The probability that a baby is a girl is 48.8%.
import pandas as pd
import numpy as np
# simulate 400 births
births = pd.Series(np.random.binomial(1, 0.488, size=400))
print(births.value_counts())
1    202
0    198
Name: count, dtype: int64
import matplotlib.pyplot as plt
# Number of times to repeat the simulation
n_sims = 1000
# For each simulation, draw from a binomial distribution:
# 400 trials (births) with 0.488 probability of girl per trial.
# This returns an array of 1000 values, each representing
# the number of girls out of 400 births in that simulation.
n_girls = np.random.binomial(400, 0.488, size=n_sims)
# Plot a histogram of the simulated counts
plt.hist(n_girls)
plt.xlabel("Number of girls")
plt.ylabel("Frequency")
plt.show()
Accounting for twins¶
A birth has a 1/125 chance of being fraternal twins, in which each twin independently has a 49.5% chance of being a girl, and a 1/300 chance of being identical twins, in which the pair has a 49.5% chance of being two girls.
# -- Step 1: Define birth type probabilities --
prob_fraternal = 1 / 125
prob_identical = 1 / 300
prob_single = 1 - prob_fraternal - prob_identical
# -- Step 2: Randomly assign a birth type to each of 400 births --
birth_types = np.random.choice(
    ["single", "identical twin", "fraternal twin"],
    size=400,
    p=[prob_single, prob_identical, prob_fraternal]
)
# -- Step 3: Create boolean masks for each birth type --
is_single = birth_types == "single"
is_identical = birth_types == "identical twin"
is_fraternal = birth_types == "fraternal twin"
# -- Step 4: Simulate girls for all births at once (no loop!) --
girls = np.zeros(400, dtype=int)
# Single births: one coin flip each (48.8% chance of girl)
girls[is_single] = np.random.binomial(1, 0.488, size=is_single.sum())
# Identical twins: one coin flip * 2 (both same sex, 49.5% chance of girls)
girls[is_identical] = 2 * np.random.binomial(1, 0.495, size=is_identical.sum())
# Fraternal twins: two independent coin flips (49.5% each)
girls[is_fraternal] = np.random.binomial(2, 0.495, size=is_fraternal.sum())
# -- Step 5: Total girls --
n_girls = girls.sum()
print(f"Total girls: {n_girls}")
Total girls: 194
n_sims = 1000
n_girls = np.zeros(n_sims, dtype=int)
for s in range(n_sims):
    # Assign birth types for 400 births
    birth_types = np.random.choice(
        ["single", "identical twin", "fraternal twin"],
        size=400,
        p=[prob_single, prob_identical, prob_fraternal]
    )
    # Boolean masks
    is_single = birth_types == "single"
    is_identical = birth_types == "identical twin"
    is_fraternal = birth_types == "fraternal twin"
    # Simulate girls vectorized within each simulation
    girls = np.zeros(400, dtype=int)
    girls[is_single] = np.random.binomial(1, 0.488, size=is_single.sum())
    girls[is_identical] = 2 * np.random.binomial(1, 0.495, size=is_identical.sum())
    girls[is_fraternal] = np.random.binomial(2, 0.495, size=is_fraternal.sum())
    # Store total girls for this simulation
    n_girls[s] = girls.sum()
# Plot the distribution across all 1000 simulations
plt.hist(n_girls)
plt.xlabel("Number of girls")
plt.ylabel("Frequency")
plt.show()
5.2 Simulation of continuous and mixed discrete/continuous models¶
n_sims = 1000
# Normal distribution: mean=3, sd=0.5
y1 = np.random.normal(3, 0.5, size=n_sims)
# Log-normal: exponentiate the normal draws
y2 = np.exp(y1)
# Binomial: 20 trials, 60% success probability
y3 = np.random.binomial(20, 0.6, size=n_sims)
# Poisson: average rate of 5
y4 = np.random.poisson(5, size=n_sims)
# Plot all four distributions in a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].hist(y1)
axes[0, 0].set_title("Normal (mean=3, sd=0.5)")
axes[0, 1].hist(y2)
axes[0, 1].set_title("Log-normal (exp of y1)")
axes[1, 0].hist(y3)
axes[1, 0].set_title("Binomial (n=20, p=0.6)")
axes[1, 1].hist(y4)
axes[1, 1].set_title("Poisson (lambda=5)")
plt.tight_layout()
plt.show()
52% of adults in the US are female and 48% are male. Men's heights are normally distributed with mean 69.1 inches and standard deviation 2.9 inches; women's heights are normally distributed with mean 63.7 inches and standard deviation 2.7 inches.
n_sims = 1
man = np.random.binomial(1, 0.48, size=n_sims)
print(man)
height = np.where(
    man == 1,
    np.random.normal(69.1, 2.9, size=n_sims),
    np.random.normal(63.7, 2.7, size=n_sims)
)
print(height)
[1]
[69.49095273]
def calculate_average_height(n_adults, n_sims):
    # Step 1: Determine sex for each person in each simulation
    # Result is a (n_sims x n_adults) grid of 1s (man) and 0s (woman)
    is_man = np.random.binomial(1, 0.48, size=(n_sims, n_adults))
    # Step 2: Assign height based on sex
    # Men: mean=69.1, sd=2.9 | Women: mean=63.7, sd=2.7
    heights = np.where(
        is_man == 1,
        np.random.normal(69.1, 2.9, size=(n_sims, n_adults)),
        np.random.normal(63.7, 2.7, size=(n_sims, n_adults))
    )
    # Step 3: Calculate average height for each simulation
    average_heights = np.mean(heights, axis=1)
    print(average_heights)
    return average_heights
# Calculate average heights for 1000 simulations of 1000 adults each
average_heights = calculate_average_height(1000, 1000)
plt.hist(average_heights)
plt.xlabel("Average height")
plt.ylabel("Frequency")
plt.show()
[66.47174962 66.42723823 66.17262348 ... 66.27858225 66.24971339 66.1001593 ] (1000 simulated average heights, all close to 66.3 inches)
5.3 Summarizing a set of simulations using median and median absolute deviation¶
There are many cases where we use a set of simulation draws to summarize a distribution. The draws can represent:
- A simulation from a probability model;
- A prediction from a regression model;
- Uncertainty about parameters in a fitted model.
The location of a distribution can be summarized by the mean or median, and its spread by the standard deviation or the median absolute deviation (MAD).
If the median of a set of simulations $Z_1, Z_2, \ldots, Z_n$ is $M$, then the MAD is the median of the absolute deviations from the median: $\mathrm{MAD} = \operatorname{median}(|Z_i - M|)$. Because of familiarity with the standard deviation, the MAD is often multiplied by 1.4826 to make it comparable to the SD of a normal distribution. This is called the MAD SD.
Median-based summaries are preferred because they are computationally stable.
z = np.random.normal(5, 2, size=1000)
mean_z = np.mean(z)
std_z = np.std(z)
median_z = np.median(z)
mad_z = np.median(np.abs(z - median_z))
mad_sd_z = mad_z * 1.4826 # Convert MAD to an estimate of standard deviation
print(f"Mean: {mean_z:.2f}, SD: {std_z:.2f}, Median: {median_z:.2f}, MAD: {mad_z:.2f}, MAD-based SD: {mad_sd_z:.2f}")
# central 50% interval of z
lower_25 = np.percentile(z, 25)
upper_75 = np.percentile(z, 75)
print(f"Central 50% interval: [{lower_25:.2f}, {upper_75:.2f}]")
# central 95% interval of z
lower_2_5 = np.percentile(z, 2.5)
upper_97_5 = np.percentile(z, 97.5)
print(f"Central 95% interval: [{lower_2_5:.2f}, {upper_97_5:.2f}]")
Mean: 4.96, SD: 1.92, Median: 4.97, MAD: 1.19, MAD-based SD: 1.76
Central 50% interval: [3.74, 6.12]
Central 95% interval: [1.05, 8.97]
5.4 Bootstrapping to simulate a sampling distribution¶
Ideally we would repeat the data-collection process many times to understand the sampling distribution of an estimate, but this is rarely possible. Bootstrapping approximates the sampling distribution by resampling with replacement from the observed data: we repeatedly draw samples of the same size from the data set we have (with replacement, so the same data point can appear multiple times) and look at the distribution of the estimate across these resamples.
earnings = pd.read_csv('../ros_data/earnings.csv')
median_female = earnings[earnings['male'] == 0]['earn'].median()
print(f"Median earnings for females: ${median_female:.2f}")
median_male = earnings[earnings['male'] == 1]['earn'].median()
print(f"Median earnings for males: ${median_male:.2f}")
female_male_median_ratio = median_female / median_male
print(f"Female to male median earnings ratio: {female_male_median_ratio:.2f}")
# display(earnings.head())
Median earnings for females: $15000.00
Median earnings for males: $25000.00
Female to male median earnings ratio: 0.60
import numpy as np
print(len(earnings))
def boot_ratio(data):
    n = len(data)
    boot = np.random.choice(n, size=n, replace=True)
    earn_boot = data['earn'].iloc[boot]
    male_boot = data['male'].iloc[boot]
    return np.median(earn_boot[male_boot == 0]) / np.median(earn_boot[male_boot == 1])
n_sims = 10000
output = np.array([boot_ratio(data=earnings) for _ in range(n_sims)])
print(f"Bootstrap median ratio: {np.median(output):.2f}")
#print standard deviation of the bootstrap distribution
print(f"Bootstrap standard deviation: {np.std(output):.2f}")
print(f"Bootstrap 95% interval: [{np.percentile(output, 2.5):.2f}, {np.percentile(output, 97.5):.2f}]")
1816
Bootstrap median ratio: 0.60
Bootstrap standard deviation: 0.03
Bootstrap 95% interval: [0.52, 0.60]
Choices in defining the bootstrap distribution¶
In the example above we resampled from the observed data.
In a simple regression we can resample data $(x, y)_i$.
An alternative is to resample residuals from the fitted model. We fit a model $y = X\beta + \epsilon$, compute the residuals $r_i = y_i - X_i\hat{\beta}$, then resample $n$ values of $r_i$ with replacement to get $r^{\text{boot}}_i$, which gives a new bootstrap sample $y_i^{\text{boot}} = X_i\hat{\beta} + r^{\text{boot}}_i$. Doing this once gives a bootstrapped dataset; doing it 1000 times gives a simulated bootstrap sampling distribution of the estimate $\hat{\beta}$.
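The residual bootstrap can be sketched as follows. This is a minimal illustration on simulated data (not the earnings example), assuming a simple least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data for a simple regression: y = 2 + 3x + noise
n = 100
x = rng.uniform(0, 10, size=n)
y = 2 + 3 * x + rng.normal(0, 1, size=n)
X = np.column_stack([np.ones(n), x])

# Fit by least squares and compute residuals
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta_hat

# Residual bootstrap: resample residuals, rebuild y, refit
n_boot = 1000
beta_boot = np.zeros((n_boot, 2))
for b in range(n_boot):
    r_boot = rng.choice(residuals, size=n, replace=True)
    y_boot = X @ beta_hat + r_boot
    beta_boot[b] = np.linalg.lstsq(X, y_boot, rcond=None)[0]

# Spread of the bootstrapped slopes approximates the slope's standard error
print("Bootstrap SE of slope:", beta_boot[:, 1].std())
```

The spread of `beta_boot` across the 1000 refits is the simulated bootstrap sampling distribution of $\hat{\beta}$.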
Time series data: observations in a time series are not independent, so simple resampling of data points is not appropriate; we might end up with lots of data from one day and none from another. Instead we can resample blocks of data, e.g., 7-day blocks, to preserve the correlation structure within a week. Bootstrapping the residuals can also be problematic, as errors may accumulate over time. The bigger takeaway: the simple bootstrap assumes independent observations, so dependent data require resampling schemes that respect the dependence structure.
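The block-resampling idea can be sketched on a toy autocorrelated series (hypothetical AR(1) data, with 7-day blocks as suggested above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy autocorrelated series: 10 weeks of daily AR(1) data
n_days, block = 70, 7
y = np.zeros(n_days)
for t in range(1, n_days):
    y[t] = 0.7 * y[t - 1] + rng.normal()

# Resample whole 7-day blocks with replacement, preserving
# the within-week correlation structure
n_blocks = n_days // block
starts = rng.choice(n_blocks, size=n_blocks, replace=True) * block
y_boot = np.concatenate([y[s:s + block] for s in starts])

print(len(y_boot))  # same length as the original series
```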
Multilevel structure: if data have a multilevel structure, e.g., students nested within schools, the choice of how to resample (students, schools, or both) can yield different results.
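Two of those resampling choices can be contrasted on hypothetical multilevel data (20 schools of 30 students each; the numbers are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical multilevel data: 20 schools, 30 students each
df = pd.DataFrame({
    "school": np.repeat(np.arange(20), 30),
    "score": rng.normal(70, 10, size=600),
})

# Option 1: resample individual students, ignoring the school structure
students = df.sample(n=len(df), replace=True)

# Option 2: resample whole schools, keeping each school's students together
schools = rng.choice(df["school"].unique(), size=20, replace=True)
clustered = pd.concat([df[df["school"] == s] for s in schools])

print(students["score"].mean(), clustered["score"].mean())
```

The school-level resample typically produces wider bootstrap intervals, because it treats the school, not the student, as the sampled unit.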
Discrete data: for binomial data we can bootstrap on clusters of data, e.g., each city where 100 respondents were surveyed and 60 said yes; or we can first expand the data to 100 rows, 60 yes and 40 no, fit a logistic regression, and bootstrap the expanded dataset. The two options correspond to two different sampling models.
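The two options can be sketched on hypothetical survey counts (the cities and yes-counts are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical survey: 5 cities, 100 respondents each, with yes-counts
yes_counts = np.array([60, 55, 48, 62, 51])
n_per_city = 100

# Option 1: resample at the city level, keeping each city's count intact
cities = rng.choice(len(yes_counts), size=len(yes_counts), replace=True)
prop_city = yes_counts[cities].sum() / (len(cities) * n_per_city)

# Option 2: expand to individual yes/no responses, then resample respondents
responses = np.concatenate([
    np.array([1] * y + [0] * (n_per_city - y)) for y in yes_counts
])
prop_indiv = rng.choice(responses, size=len(responses), replace=True).mean()

print(prop_city, prop_indiv)
```

Option 1 treats the city as the sampled unit; option 2 treats the respondent as the sampled unit, which is why the two bootstrap distributions differ.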
Limitations of bootstrapping¶
The appeal of bootstrapping is that any estimate can be bootstrapped: all we need is an estimate computed from the data. But it can lead to answers with an inappropriately high level of certainty.
Bootstrapping is only as good as the data we have and the sampling protocol. If the data are biased, bootstrapping will not fix this. If the data are not independent, bootstrapping may give misleading results. If the data are not representative of the population we want to make inferences about, bootstrapping will not help.