Firth’s Penalised Logistic Regression
Standard maximum likelihood logistic regression fails when the outcome is perfectly (or nearly perfectly) separated by a predictor. A simple example is the situation in which every exposed subject develops the event of interest while all unexposed subjects remain event-free. Under (near) perfect separation, the coefficient estimates diverge to ±∞, and no valid confidence interval or p-value can be obtained.
Firth’s method addresses this by adding a penalty term based on the observed information matrix, producing finite and nearly unbiased estimates even in these settings. FirthLogisticRegression is a drop-in scikit-learn estimator that implements Firth’s penalised logistic regression, with profile-likelihood confidence intervals and penalised likelihood ratio p-values.
[1]:
import warnings
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import cross_val_score
from stats_misc.machine_learning.firthlogist import FirthLogisticRegression
from stats_misc.example_data.examples import load_endometrial, load_sex2
The separation problem
The endometrial cancer dataset (Heinze & Schemper, 2002) is a canonical example of complete separation: the predictor NV strongly discriminates between outcome classes. Standard logistic regression cannot produce finite estimates in this setting.
[2]:
X_endo, y_endo, feature_names_endo = load_endometrial()
nv = X_endo[:, 0].astype(int)
# 2x2 contingency table of NV against the outcome
tab = pd.crosstab(
pd.Series(nv, name='Exposure: NV'),
pd.Series(y_endo, name='Endometrial cancer'),
margins=True,
)
print(tab)
Endometrial cancer 0.0 1.0 All
Exposure: NV
0 49 17 66
1 0 13 13
All 49 30 79
The standard logisitic regression results
As a default comparison we first show the result from a standard logistici regression analysis. Note the fairly extreme and clearly unstable estimates obtained for the NV variable.
[3]:
X_endo_sm = sm.add_constant(X_endo)
with warnings.catch_warnings(record=True) as caught:
warnings.simplefilter('always')
logit_result = sm.Logit(y_endo, X_endo_sm).fit(disp=False)
for w in caught:
print(f'Warning — {w.category.__name__}: {w.message}')
print(logit_result.summary(xname=['const'] + feature_names_endo))
Warning — ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 79
Model: Logit Df Residuals: 75
Method: MLE Df Model: 3
Date: Tue, 26 May 2026 Pseudo R-squ.: 0.4720
Time: 12:09:01 Log-Likelihood: -27.697
converged: False LL-Null: -52.451
Covariance Type: nonrobust LLR p-value: 1.016e-10
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 4.3045 1.637 2.629 0.009 1.095 7.514
NV 19.5002 5458.415 0.004 0.997 -1.07e+04 1.07e+04
PI -0.0422 0.044 -0.952 0.341 -0.129 0.045
EH -2.9026 0.846 -3.433 0.001 -4.560 -1.245
==============================================================================
Possibly complete quasi-separation: A fraction 0.16 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
Comparison against the Firth logigistic regression.
Next we will run a Firth Logistic regression model. Note the marked difference in estimate log odds ratio, and less extreme confidence intervals. The fact that the p-values are smaller also follows logically from the observed near separation, which clearly arguess against an odds ratio of 1.0.
[4]:
firth_endo = FirthLogisticRegression()
firth_endo.fit(X_endo, y_endo)
firth_endo.summary(xname=feature_names_endo)
coef std err [0.025 0.975] p-value
--------- ---------- --------- --------- ---------- -----------
NV 2.92927 1.55076 0.609724 7.85463 0.00905243
PI -0.0347517 0.0395781 -0.124459 0.0404555 0.376022
EH -2.60416 0.776017 -4.36518 -1.23272 2.50418e-05
Intercept 3.77456 1.48869 1.08254 7.20928 0.00416242
Log-Likelihood: -24.0373
Newton-Raphson iterations: 8
Inference on a larger dataset
The sex2 dataset (239 subjects, 6 predictors) demonstrates the full inference workflow: fitting, coefficient summaries, and generating predictions.
[5]:
X_sex2, y_sex2, feature_names_sex2 = load_sex2()
print(f'Shape: X={X_sex2.shape}, y={y_sex2.shape}')
print(f'Class balance:\n{pd.Series(y_sex2).value_counts().to_string()}')
Shape: X=(239, 6), y=(239,)
Class balance:
1.0 130
0.0 109
[6]:
firth_sex2 = FirthLogisticRegression()
firth_sex2.fit(X_sex2, y_sex2)
firth_sex2.summary(xname=feature_names_sex2)
coef std err [0.025 0.975] p-value
--------- ---------- --------- --------- ---------- -----------
age -1.10598 0.42366 -1.97379 -0.307425 0.00611139
oc -0.0688167 0.443793 -0.941436 0.789202 0.826365
vic 2.26887 0.548416 1.27302 3.43543 1.67219e-06
vicl -2.11141 0.543082 -3.26086 -1.11773 1.23618e-05
vis -0.788317 0.417368 -1.60809 0.0151847 0.0534899
dia 3.09601 1.67501 0.774568 8.03029 0.00484687
Intercept 0.120254 0.485542 -0.818559 1.07315 0.766584
Log-Likelihood: -132.5394
Newton-Raphson iterations: 8
[7]:
predictions = firth_sex2.predict(X_sex2)
probabilities = firth_sex2.predict_proba(X_sex2)
print('Predicted classes (first 5): ', predictions[:5])
print('Predicted probabilities (first 5):')
print(np.round(probabilities[:5], 4))
Predicted classes (first 5): [0. 1. 1. 1. 1.]
Predicted probabilities (first 5):
[[0.6611 0.3389]
[0.084 0.916 ]
[0.084 0.916 ]
[0.084 0.916 ]
[0.084 0.916 ]]
Scikit-learn compatibility
FirthLogisticRegression implements the scikit-learn BaseEstimator and ClassifierMixin interfaces, so it works directly with scikit-learn’s model selection utilities such as cross_val_score.
Computing profile-likelihood CIs and penalised LRT p-values adds overhead per fold. Pass skip_ci=True and skip_pvals=True when only the coefficient estimates are needed (e.g., during cross-validation).
[8]:
scores = cross_val_score(
FirthLogisticRegression(skip_ci=True, skip_pvals=True),
X_sex2,
y_sex2,
cv=5,
scoring='roc_auc',
)
print(f'Cross-validated AUC per fold: {np.round(scores, 3)}')
print(f'Mean AUC: {scores.mean():.3f} (\u00b1{scores.std():.3f})')
Cross-validated AUC per fold: [0.827 0.439 0.5 0.798 0.904]
Mean AUC: 0.694 (±0.187)
[9]:
fast_model = FirthLogisticRegression(skip_ci=True, skip_pvals=True)
fast_model.fit(X_sex2, y_sex2)
fast_model.summary(xname=feature_names_sex2)
coef std err [0.025 0.975] p-value
--------- ---------- --------- -------- -------- ---------
age -1.10598 0.42366 nan nan nan
oc -0.0688167 0.443793 nan nan nan
vic 2.26887 0.548416 nan nan nan
vicl -2.11141 0.543082 nan nan nan
vis -0.788317 0.417368 nan nan nan
dia 3.09601 1.67501 nan nan nan
Intercept 0.120254 0.485542 nan nan nan
Log-Likelihood: -132.5394
Newton-Raphson iterations: 8