Resampling
The resampling module collects a number of computational procedures for estimation and testing. Currently the module focuses on bootstrap procedures for confidence intervals.
Note: the bias of a resampling estimate is the average difference between the resampled estimates and the original estimate computed on the full, unresampled data.
[1]:
import pandas as pd
import numpy as np
import warnings
from stats_misc.resampling import (
jackknife_replicates,
BCaCI,
PercentileCI,
StudentisedCI,
BasicCI,
)
from stats_misc.machine_learning.validation import (
cstatistic,
)
# create a function which returns a single float
def cstat_boot(*data):
    '''
    Function to use for bootstrapping
    '''
    try:
        cstat = cstatistic(data[0], data[1])
        point = cstat.c_statistic
        if np.isnan(point):
            return np.nan
        return point
    except Exception:
        return np.nan
# example data
x = np.array([0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0])
y = np.array([0.1, 0.7, 0.3, 0.6, 0.9, 0.1, 0.2, 0.3, 0.4, 0.01, 0.9, 0.8, 0.15, 0.71, 0.8])
# set seed
np.random.seed(2)
Bootstrapped confidence intervals
Bootstrapping works by repeatedly resampling the original dataset with replacement to create B “pseudo-datasets” (bootstrap samples). The statistic of interest (e.g., mean, median, regression coefficient) is calculated for each sample, and the variability across these replicates is then used to estimate confidence intervals or standard errors. This approach relies on the idea that the observed sample is a reasonable proxy for the population, and essentially uses it to simulate multiple repeated studies/analyses of the same population.
Bootstrap-based confidence intervals offer a flexible, non-parametric way to estimate uncertainty around a statistic by resampling the observed data with replacement. These are particularly useful when the sampling distribution is unknown or when parametric assumptions (e.g., normality) may not hold.
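As a rough illustration of the mechanism (plain NumPy on synthetic data, not the stats_misc API): resample with replacement, recompute the statistic each time, and read the standard error off the spread of the replicates. For the mean this can be checked against the analytic standard error.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=50)  # synthetic sample

B = 2000  # number of bootstrap samples
# resample with replacement and recompute the statistic on each pseudo-dataset
replicates = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(B)
])

# the spread of the replicates estimates the standard error of the mean
boot_se = replicates.std(ddof=1)
analytic_se = data.std(ddof=1) / np.sqrt(data.size)
print(f"bootstrap SE of the mean: {boot_se:.3f} (analytic: {analytic_se:.3f})")
```

The two standard errors agree closely here; the value of the bootstrap is that the same loop works for statistics with no analytic formula.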
Basic confidence interval
This method reflects the percentile interval around the original estimate. It assumes the bootstrap distribution is symmetric. The interval is constructed by doubling the difference between the point estimate and each percentile. It is straightforward to compute and interpret. However, it performs poorly with skewed data or biased estimators.
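The “doubling” construction above can be sketched in plain NumPy (synthetic skewed data, hypothetical names, not the stats_misc internals): with bootstrap quantiles q_lo and q_hi, the basic interval is (2·θ̂ − q_hi, 2·θ̂ − q_lo).

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=40)  # synthetic skewed sample

theta_hat = data.mean()  # point estimate on the original sample
reps = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])

# basic interval: reflect the bootstrap quantiles around the point estimate,
# i.e. (2*theta_hat - q_hi, 2*theta_hat - q_lo)
q_lo, q_hi = np.percentile(reps, [2.5, 97.5])
basic = (2 * theta_hat - q_hi, 2 * theta_hat - q_lo)
print(f"basic 95% CI: ({basic[0]:.3f}, {basic[1]:.3f})")
```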
[2]:
boot_basic = BasicCI((x,y))
res_basic = boot_basic(n_estimates=1, statsfunction=cstat_boot)
low, up = res_basic.interval_values[0]
print(f"The Basic bootstrap confidence interval: {res_basic.coverage:.2f} confidence interval is {low:.3f}; {up:.3f}")
The Basic bootstrap confidence interval: 0.95 confidence interval is 0.043; 0.643
Percentile confidence interval
This method uses the empirical percentiles from the bootstrap distribution directly. It does not assume symmetry or normality. It is simple and commonly used for non-parametric inference. However, it can be biased if the estimator is not centred in the bootstrap distribution. It may underperform in small or skewed samples.
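A minimal sketch of the percentile construction (plain NumPy on synthetic data, not the stats_misc implementation): the interval endpoints are simply the empirical quantiles of the bootstrap replicates.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=40)  # synthetic skewed sample
reps = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])

# percentile interval: read the bounds straight off the bootstrap quantiles
lo, hi = np.percentile(reps, [2.5, 97.5])
print(f"percentile 95% CI: ({lo:.3f}, {hi:.3f})")
```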
[3]:
boot_perc = PercentileCI((x,y))
res_perc = boot_perc(n_estimates=1, statsfunction=cstat_boot)
low, up = res_perc.interval_values[0]
print(f"The Percentile bootstrap confidence interval: {res_perc.coverage:.2f} confidence interval is {low:.3f}; {up:.3f}")
The Percentile bootstrap confidence interval: 0.95 confidence interval is 0.080; 0.685
Studentised confidence interval
This method calculates a t-statistic using the bootstrap estimate and a nested estimate of its standard error. It adjusts for variability in standard errors across samples. It is more robust than basic or percentile intervals when variances are unequal. The method is computationally expensive due to the nested resampling. It may be unstable in small samples.
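A sketch of the nested scheme (plain NumPy, synthetic data; B and B_inner are illustrative choices, not library defaults): each outer replicate gets its own inner bootstrap to estimate its standard error, giving a t-statistic per replicate, and the interval is built from the empirical t quantiles.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(10.0, 3.0, size=30)  # synthetic sample
theta_hat = data.mean()
se_hat = data.std(ddof=1) / np.sqrt(data.size)

B, B_inner = 500, 100
t_stats = np.empty(B)
for b in range(B):
    sample = rng.choice(data, size=data.size, replace=True)
    # nested resampling estimates the standard error of this replicate
    inner = np.array([
        rng.choice(sample, size=sample.size, replace=True).mean()
        for _ in range(B_inner)
    ])
    t_stats[b] = (sample.mean() - theta_hat) / inner.std(ddof=1)

# interval uses the empirical t quantiles (note the reversed tails)
t_lo, t_hi = np.percentile(t_stats, [2.5, 97.5])
ci = (theta_hat - t_hi * se_hat, theta_hat - t_lo * se_hat)
print(f"studentised 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The B × B_inner nested loop is what makes this method expensive, and the division by a small inner standard error is the source of the instability (and divide-by-zero warnings) in small samples.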
[4]:
boot_student = StudentisedCI((x,y), verbose=False)
res_student = boot_student(n_estimates=1, statsfunction=cstat_boot)
low, up = res_student.interval_values[0]
print(f"The Studentised bootstrap confidence interval: {res_student.coverage:.2f} confidence interval is {low:.3f}; {up:.3f}")
The Studentised bootstrap confidence interval: 0.95 confidence interval is 0.033; 0.813

/home/amand/google_drive/Research/stats-misc/stats_misc/resampling.py:1043: RuntimeWarning: divide by zero encountered in divide
t_stats = (boot_point - orig_mean) / boot_se
Bias corrected and accelerated confidence interval
The BCa method adjusts for both bias and skewness in the bootstrap distribution. It uses both bootstrap and jackknife samples to estimate correction terms. It is generally more accurate, especially for small samples or skewed data. It is computationally more intensive than other methods. This method is often recommended when accurate coverage is essential.
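The two correction terms can be sketched in plain NumPy for the mean (synthetic data; the stats_misc internals may differ): z0 measures how far the bootstrap distribution is shifted relative to the original estimate, and the acceleration a is computed from jackknife leave-one-out values; together they adjust which bootstrap quantiles are read off.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(5)
data = rng.exponential(2.0, size=40)  # synthetic skewed sample
theta_hat = data.mean()

reps = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])

nd = NormalDist()
# bias correction z0: fraction of replicates below the original estimate
z0 = nd.inv_cdf(float(np.mean(reps < theta_hat)))

# acceleration a from jackknife leave-one-out estimates (skewness correction)
jack = np.array([np.delete(data, i).mean() for i in range(data.size)])
d = jack.mean() - jack
a = np.sum(d**3) / (6 * np.sum(d**2) ** 1.5)

def adj(alpha):
    # shift the nominal percentile level by the bias and acceleration terms
    z = nd.inv_cdf(alpha)
    return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

lo, hi = np.percentile(reps, [100 * adj(0.025), 100 * adj(0.975)])
print(f"BCa 95% CI: ({lo:.3f}, {hi:.3f})")
```

With z0 = 0 and a = 0 the adjusted levels reduce to 2.5% and 97.5%, recovering the plain percentile interval.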
[5]:
boot_bca = BCaCI((x,y))
res_bca = boot_bca(n_estimates=1, statsfunction=cstat_boot)
low, up = res_bca.interval_values[0]
print(f"The BCa {res_bca.coverage:.2f} confidence interval is {low:.3f}; {up:.3f}")
The BCa 0.95 confidence interval is 0.037; 0.607
The Jackknife
The jackknife is a resampling technique used to estimate the variability and influence of a statistic. It systematically removes one observation at a time from the dataset, recomputing the statistic of interest on each subsample. This produces a collection of “leave-one-out” estimates from which standard errors and influence measures can be derived.
The method is particularly useful when analytical standard errors are difficult to obtain, and is widely used to estimate the bias or variance of statistics such as the mean, median, or regression coefficients. Additionally, jackknife influence values allow identification of highly influential observations.
Because it only requires n model fits for n observations, the jackknife is computationally cheaper than bootstrap methods, though it may be less accurate for non-smooth statistics.
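The leave-one-out recipe can be sketched for the mean (plain NumPy on toy data, independent of stats_misc); for the mean the jackknife standard error coincides exactly with the analytic standard error, which makes it a handy sanity check.

```python
import numpy as np

data = np.array([4.0, 7.0, 1.0, 9.0, 5.0, 3.0, 8.0, 2.0])  # toy sample
n = data.size

# leave-one-out replicates: drop observation i, recompute the statistic
jack = np.array([np.delete(data, i).mean() for i in range(n)])

# jackknife standard error: sqrt((n-1)/n * sum((jk_i - jk_bar)**2))
se_jack = np.sqrt((n - 1) / n * np.sum((jack - jack.mean()) ** 2))
analytic = data.std(ddof=1) / np.sqrt(n)
print(f"jackknife SE: {se_jack:.4f} (analytic: {analytic:.4f})")
```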
[6]:
jacked = jackknife_replicates((x,y), statsfunction=cstat_boot, n_estimates=1)
# the jackknife standard error: sqrt [ (n-1)/n * sum((jk_i - jk_bar)**2) ]
se_jk = np.sqrt((len(x) - 1)/len(x) * np.sum((jacked.flatten() - np.mean(jacked)) ** 2))
# the most influential observation (full_estimate - jk_estimates)
influence = cstat_boot(*(x,y)) - jacked.flatten()
most_infl_idx = np.argmax(np.abs(influence))
print(f'The Jackknife standard error: {se_jk:.3f}, and the most influential observation: {most_infl_idx}, with a difference of: {influence[most_infl_idx]:.3f}')
The Jackknife standard error: 0.159, and the most influential observation: 0, with a difference of: 0.061