Resampling
The resampling module collects a number of computational procedures for estimation and testing. Currently the module focuses on bootstrap procedures for confidence intervals.
Note: the bias of a resampling estimate is the average difference between the resampled estimates and the original estimate computed on the full, unresampled data.
[1]:
import pandas as pd
import numpy as np
import warnings
from stats_misc.resampling import (
jackknife_replicates,
BCaCI,
PercentileCI,
StudentisedCI,
BasicCI,
)
from stats_misc.machine_learning.validation import (
cstatistic,
)
# create a function which returns a single float
def cstat_boot(*data):
    '''
    Function to use for bootstrapping
    '''
    try:
        cstat = cstatistic(data[0], data[1])
        point = cstat.c_statistic
        if np.isnan(point):
            return np.nan
        return point
    except Exception:
        return np.nan
# example data
x = np.array([0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0])
y = np.array([0.1, 0.7, 0.3, 0.6, 0.9, 0.1, 0.2, 0.3, 0.4, 0.01, 0.9, 0.8, 0.15, 0.71, 0.8])
# set seed
np.random.seed(2)
Bootstrapped confidence intervals
Bootstrapping works by repeatedly resampling the original dataset with replacement to create B “pseudo-datasets” (bootstrap samples). The statistic of interest (e.g., mean, median, regression coefficient) is calculated for each sample, and the variability across these replicates is then used to estimate confidence intervals or standard errors. This approach relies on the idea that the observed sample is a reasonable proxy for the population, and essentially uses it to simulate multiple repeated studies/analyses of the same population.
Bootstrap-based confidence intervals offer a flexible, non-parametric way to estimate uncertainty around a statistic by resampling the observed data with replacement. These are particularly useful when the sampling distribution is unknown or when parametric assumptions (e.g., normality) may not hold.
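As a rough illustration of the mechanism (plain NumPy on synthetic data, not the stats_misc API): resample with replacement, recompute the statistic each time, and read the standard error off the spread of the replicates. For the mean this can be checked against the analytic standard error.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=50)  # synthetic sample

B = 2000  # number of bootstrap samples
# resample with replacement and recompute the statistic on each pseudo-dataset
replicates = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(B)
])

# the spread of the replicates estimates the standard error of the mean
boot_se = replicates.std(ddof=1)
analytic_se = data.std(ddof=1) / np.sqrt(data.size)
print(f"bootstrap SE of the mean: {boot_se:.3f} (analytic: {analytic_se:.3f})")
```

The two standard errors agree closely here; the value of the bootstrap is that the same loop works for statistics with no analytic formula.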
Basic confidence interval
This method reflects the percentile interval around the original estimate. It assumes the bootstrap distribution is symmetric. The interval is constructed by doubling the difference between the point estimate and each percentile. It is straightforward to compute and interpret. However, it performs poorly with skewed data or biased estimators.
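The “doubling” construction above can be sketched in plain NumPy (synthetic skewed data, hypothetical names, not the stats_misc internals): with bootstrap quantiles q_lo and q_hi, the basic interval is (2·θ̂ − q_hi, 2·θ̂ − q_lo).

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=40)  # synthetic skewed sample

theta_hat = data.mean()  # point estimate on the original sample
reps = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])

# basic interval: reflect the bootstrap quantiles around the point estimate,
# i.e. (2*theta_hat - q_hi, 2*theta_hat - q_lo)
q_lo, q_hi = np.percentile(reps, [2.5, 97.5])
basic = (2 * theta_hat - q_hi, 2 * theta_hat - q_lo)
print(f"basic 95% CI: ({basic[0]:.3f}, {basic[1]:.3f})")
```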
[2]:
boot_basic = BasicCI((x,y))
res_basic = boot_basic(n_estimates=1, statsfunction=cstat_boot)
low, up = res_basic.interval_values[0]
print(f"The Basic bootstrap confidence interval: {res_basic.coverage:.2f} confidence interval is {low:.3f}; {up:.3f}")
The Basic bootstrap confidence interval: 0.95 confidence interval is 0.043; 0.643
Percentile confidence interval
This method uses the empirical percentiles from the bootstrap distribution directly. It does not assume symmetry or normality. It is simple and commonly used for non-parametric inference. However, it can be biased if the estimator is not centred in the bootstrap distribution. It may underperform in small or skewed samples.
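A minimal sketch of the percentile construction (plain NumPy on synthetic data, not the stats_misc implementation): the interval endpoints are simply the empirical quantiles of the bootstrap replicates.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=40)  # synthetic skewed sample
reps = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])

# percentile interval: read the bounds straight off the bootstrap quantiles
lo, hi = np.percentile(reps, [2.5, 97.5])
print(f"percentile 95% CI: ({lo:.3f}, {hi:.3f})")
```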
[3]:
boot_perc = PercentileCI((x,y))
res_perc = boot_perc(n_estimates=1, statsfunction=cstat_boot)
low, up = res_perc.interval_values[0]
print(f"The Percentile bootstrap confidence interval: {res_perc.coverage:.2f} confidence interval is {low:.3f}; {up:.3f}")
The Percentile bootstrap confidence interval: 0.95 confidence interval is 0.080; 0.685
Studentised confidence interval
This method calculates a t-statistic using the bootstrap estimate and a nested estimate of its standard error. It adjusts for variability in standard errors across samples. It is more robust than basic or percentile intervals when variances are unequal. The method is computationally expensive due to the nested resampling. It may be unstable in small samples.
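A sketch of the nested scheme (plain NumPy, synthetic data; B and B_inner are illustrative choices, not library defaults): each outer replicate gets its own inner bootstrap to estimate its standard error, giving a t-statistic per replicate, and the interval is built from the empirical t quantiles.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(10.0, 3.0, size=30)  # synthetic sample
theta_hat = data.mean()
se_hat = data.std(ddof=1) / np.sqrt(data.size)

B, B_inner = 500, 100
t_stats = np.empty(B)
for b in range(B):
    sample = rng.choice(data, size=data.size, replace=True)
    # nested resampling estimates the standard error of this replicate
    inner = np.array([
        rng.choice(sample, size=sample.size, replace=True).mean()
        for _ in range(B_inner)
    ])
    t_stats[b] = (sample.mean() - theta_hat) / inner.std(ddof=1)

# interval uses the empirical t quantiles (note the reversed tails)
t_lo, t_hi = np.percentile(t_stats, [2.5, 97.5])
ci = (theta_hat - t_hi * se_hat, theta_hat - t_lo * se_hat)
print(f"studentised 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The B × B_inner nested loop is what makes this method expensive, and the division by a small inner standard error is the source of the instability (and divide-by-zero warnings) in small samples.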
[4]:
boot_student = StudentisedCI((x,y), verbose=False)
res_student = boot_student(n_estimates=1, statsfunction=cstat_boot)
low, up = res_student.interval_values[0]
print(f"The Studentised bootstrap confidence interval: {res_student.coverage:.2f} confidence interval is {low:.3f}; {up:.3f}")
The Studentised bootstrap confidence interval: 0.95 confidence interval is 0.033; 0.813

/home/amand/google_drive/Research/stats-misc/stats_misc/resampling.py:1043: RuntimeWarning: divide by zero encountered in divide
t_stats = (boot_point - orig_mean) / boot_se
Bias corrected and accelerated confidence interval
The BCa method adjusts for both bias and skewness in the bootstrap distribution. It uses both bootstrap and jackknife samples to estimate correction terms. It is generally more accurate, especially for small samples or skewed data. It is computationally more intensive than other methods. This method is often recommended when accurate coverage is essential.
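The two correction terms can be sketched in plain NumPy for the mean (synthetic data; the stats_misc internals may differ): z0 measures how far the bootstrap distribution is shifted relative to the original estimate, and the acceleration a is computed from jackknife leave-one-out values; together they adjust which bootstrap quantiles are read off.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(5)
data = rng.exponential(2.0, size=40)  # synthetic skewed sample
theta_hat = data.mean()

reps = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])

nd = NormalDist()
# bias correction z0: fraction of replicates below the original estimate
z0 = nd.inv_cdf(float(np.mean(reps < theta_hat)))

# acceleration a from jackknife leave-one-out estimates (skewness correction)
jack = np.array([np.delete(data, i).mean() for i in range(data.size)])
d = jack.mean() - jack
a = np.sum(d**3) / (6 * np.sum(d**2) ** 1.5)

def adj(alpha):
    # shift the nominal percentile level by the bias and acceleration terms
    z = nd.inv_cdf(alpha)
    return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

lo, hi = np.percentile(reps, [100 * adj(0.025), 100 * adj(0.975)])
print(f"BCa 95% CI: ({lo:.3f}, {hi:.3f})")
```

With z0 = 0 and a = 0 the adjusted levels reduce to 2.5% and 97.5%, recovering the plain percentile interval.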
[5]:
boot_bca = BCaCI((x,y))
res_bca = boot_bca(n_estimates=1, statsfunction=cstat_boot)
low, up = res_bca.interval_values[0]
print(f"The BCa {res_bca.coverage:.2f} confidence interval is {low:.3f}; {up:.3f}")
The BCa 0.95 confidence interval is 0.037; 0.607
The Jackknife
The jackknife is a resampling technique used to estimate the variability and influence of a statistic. It systematically removes one observation at a time from the dataset, recomputing the statistic of interest on each subsample. This produces a collection of “leave-one-out” estimates from which standard errors and influence measures can be derived.
The method is particularly useful when analytical standard errors are difficult to obtain, and is widely used to estimate the bias or variance of statistics such as the mean, median, or regression coefficients. Additionally, jackknife influence values allow identification of highly influential observations.
Because it only requires n model fits for n observations, the jackknife is computationally cheaper than bootstrap methods, though it may be less accurate for non-smooth statistics.
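The leave-one-out recipe can be sketched for the mean (plain NumPy on toy data, independent of stats_misc); for the mean the jackknife standard error coincides exactly with the analytic standard error, which makes it a handy sanity check.

```python
import numpy as np

data = np.array([4.0, 7.0, 1.0, 9.0, 5.0, 3.0, 8.0, 2.0])  # toy sample
n = data.size

# leave-one-out replicates: drop observation i, recompute the statistic
jack = np.array([np.delete(data, i).mean() for i in range(n)])

# jackknife standard error: sqrt((n-1)/n * sum((jk_i - jk_bar)**2))
se_jack = np.sqrt((n - 1) / n * np.sum((jack - jack.mean()) ** 2))
analytic = data.std(ddof=1) / np.sqrt(n)
print(f"jackknife SE: {se_jack:.4f} (analytic: {analytic:.4f})")
```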
[6]:
jacked = jackknife_replicates((x,y), statsfunction=cstat_boot, n_estimates=1)
# the jackknife standard error: sqrt [ (n-1)/n * sum((jk_i - jk_bar)**2) ]
se_jk = np.sqrt((len(x) - 1)/len(x) * np.sum((jacked.flatten() - np.mean(jacked)) ** 2))
# the most influential observation (full_estimate - jk_estimates)
influence = cstat_boot(*(x,y)) - jacked.flatten()
most_infl_idx = np.argmax(np.abs(influence))
print(f'The Jackknife standard error: {se_jk:.3f}, and the most influential observation: {most_infl_idx}, with a difference of: {influence[most_infl_idx]:.3f}')
The Jackknife standard error: 0.159, and the most influential observation: 0, with a difference of: 0.061