Interval estimation Methods

The interval module can be used to estimate a mixture of intervals. Currently the focus is on confidence intervals, but this can be expanded to include prediction intervals or highest density intervals for example.

[1]:
import numpy as np
from stats_misc.intervals import (
    univariable_quantiles_exact,
    univariable_poisson_standard_normal,
    wald_confidence_interval,
    beta_confidence_interval,
    linear_contrast,
)

Confidence intervals for point estimates

The following confidence intervals are available:

  • Wald based, assuming the point estimate follows a normal distribution,

  • Count data, assuming the point estimate follows a normal distribution,

  • Beta distribution, for proportions.

[2]:
# wald confidence interval
res = wald_confidence_interval(0.2, 0.01)
print(f'point estimate: {res.point_estimate} with {res.coverage}% confidence interval: '
      f'{res.interval_values[0]:.2f}; {res.interval_values[1]:.2f}.')
point estimate: 0.2 with 0.95% confidence interval: 0.18; 0.22.
[3]:
# the class instance
res
[3]:
wald_confidence_interval
  point_estimate=0.200
  coverage=0.950
  interval_values=(0.180, 0.220)
  standard_error=0.010
[4]:
# confidence interval for count data
res = univariable_poisson_standard_normal([1, 2, 3, 4, 5, 10,11, 1.5, 2.5], alpha=0.001)
print(f'point estimate: {res.point_estimate:.2f} with {res.coverage}% confidence interval: '
      f'{res.interval_values[0]:.2f}; {res.interval_values[1]:.2f}.')
point estimate: 4.44 with 0.999% confidence interval: 2.13; 6.76.
[5]:
# confidence interval for proportions
res = beta_confidence_interval(0.2, 20, alpha=0.25)
print(f'point estimate: {res.point_estimate:.2f} with {res.coverage}% confidence interval: '
      f'{res.interval_values[0]:.2f}; {res.interval_values[1]:.2f}.')
point estimate: 0.20 with 0.75% confidence interval: 0.10; 0.35.

Confidence intervals for a quantile/percentage

The following example calculate a confidence interval for a quantile betweeen 0 and 100% of the data. The calculations are based on assuming the propbability of obtaining a value below or above the requested quantile follows a binomial distribution. The function will return confiden interval limits based on the observed data, selecting limits which are closest to the requested coverage (e.g., 95%). The obtained coverage is guaranteed to be equal or larger than the requested level. The procedure is essentially non-parametric and can be applied without making strong distributional assumptions about the sample or population.

[6]:
data = [1, 2, 3, 4, 5, 10,11, 1.5, 2.5, 2, 2.2, 3.5, 13.1, 0.5, 0.2, 0.15, 0.01, 7, 5, 8, 9.2, 4.8]
res1 = univariable_quantiles_exact(data, quantile=0.5, alpha=0.10)
res2 = univariable_quantiles_exact(data, quantile=0.15, alpha=0.25)
print(f'The {res1.coverage:.2f}% confidence interval for qunantile 0.50 is : {res1.interval_values[0]:.2f}; {res1.interval_values[1]:.2f}.\n'
      f'The {res2.coverage:.2f}% confidence interval for qunantile 0.15 is : {res2.interval_values[0]:.2f}; {res2.interval_values[1]:.2f} ')

res1
The 0.91% confidence interval for qunantile 0.50 is : 2.00; 5.00.
The 0.76% confidence interval for qunantile 0.15 is : 0.20; 2.00
[6]:
univariable_quantiles_exact
  point_estimate=3.250
  coverage=0.907
  interval_values=[2.000, 5.000]
  interval_indices=[7.000, 15.000]
  matrix_coverage=array([[0.848 0.73  0.925 0.965 0.983]
     [0.855 0.736 0.931 0.972 0.989]
     [0.831 0.712 0.907 0.948 0.965]
     [0.79  0.671 0.866 0.907 0.925]
     [0.714 0.595 0.79  0.831 0.848]], shape=(5, 5), dtype=float64)
  matrix_columns=list([14, 13, 15, 16, 17])
  matrix_rows=list([6, 5, 7, 8, 9])

The window parameter

window controls the size of the candidate search grid: the function evaluates all (lower, upper) order-statistic index pairs within ±window steps of the binom.ppf starting bounds, producing a (2w+1) × (2w+1) coverage matrix stored in matrix_coverage.

Because coverage is monotone with interval width, the selected pair is almost always identical regardless of window — binom.ppf already places the starting bounds near-optimally. What changes is only the size of the evaluated grid.

The main reason to increase window is defensive: if the function raises an error (no valid pair found), a larger window expands the search. This can occur with very small samples and extreme quantiles where the binomial distribution is highly discrete.

[7]:
for w in [1, 2, 3]:
    res = univariable_quantiles_exact(data, quantile=0.5, alpha=0.10, window=w)
    print(f'window={w}: matrix_coverage shape={res.matrix_coverage.shape}, '
          f'indices={res.interval_indices}, coverage={res.coverage:.3f}')

# Inspect the full 3x3 coverage matrix at window=1
res_w1 = univariable_quantiles_exact(data, quantile=0.5, alpha=0.10, window=1)
res_w1
window=1: matrix_coverage shape=(3, 3), indices=[7, 15], coverage=0.907
window=2: matrix_coverage shape=(5, 5), indices=[7, 15], coverage=0.907
window=3: matrix_coverage shape=(7, 7), indices=[7, 15], coverage=0.907
[7]:
univariable_quantiles_exact
  point_estimate=3.250
  coverage=0.907
  interval_values=[2.000, 5.000]
  interval_indices=[7.000, 15.000]
  matrix_coverage=array([[0.848 0.925 0.965]
     [0.831 0.907 0.948]
     [0.79  0.866 0.907]], shape=(3, 3), dtype=float64)
  matrix_columns=list([14, 15, 16])
  matrix_rows=list([6, 7, 8])

Linear contrasts of regression coefficients

linear_contrast propagates uncertainty through arbitrary linear combinations of regression coefficients:

Var(L) = D' * V * D

where D is the contrast matrix (one column per contrast, shape n_coef × n_contrasts), V is the variance–covariance (VCV) matrix of the coefficients, and c is the coefficient vector. The function returns a DataFrame with one row per contrast containing the estimate, standard error, and confidence bounds.

This works with any model that exposes a coefficient vector and VCV matrix: OLS, GLM, Cox, mixed-effects, and penalised regression. Estimates and intervals are on the link scale - back-transform externally if needed (e.g. np.exp() for a log link).

The example below uses a three-group regression (intercept + two indicator coefficients, with group A as the reference) to compute pairwise contrasts.

[8]:
c_vec = np.array([1.0, 0.3, 0.5])   # [intercept, coef_B, coef_C]; group A is the reference

v_mat = np.array([
    [0.04, 0.01, 0.01],
    [0.01, 0.02, 0.00],
    [0.01, 0.00, 0.03],
])

# d_mat shape: (n_coef=3, n_contrasts=3)
# Column 0: B vs A → [0,  1,  0]  => estimate = coef_B
# Column 1: C vs A → [0,  0,  1]  => estimate = coef_C
# Column 2: C vs B → [0, -1,  1]  => estimate = coef_C - coef_B
d_mat = np.array([
    [0,  0,  0],
    [1,  0, -1],
    [0,  1,  1],
])

res = linear_contrast(d_mat, v_mat, c_vec, contrast_names=['B_vs_A', 'C_vs_A', 'C_vs_B'])
res
[8]:
contrast estimate se lower upper
0 B_vs_A 0.3 0.141421 0.022819 0.577181
1 C_vs_A 0.5 0.173205 0.160524 0.839476
2 C_vs_B 0.2 0.223607 -0.238261 0.638261