Firth’s Penalised Logistic Regression

Standard maximum likelihood logistic regression fails when the outcome is perfectly (or nearly perfectly) separated by a predictor. A simple example is the situation in which every exposed subject develops the event of interest while all unexposed subjects remain event-free. Under (near) perfect separation, the coefficient estimates diverge to ±∞, and no valid confidence interval or p-value can be obtained.

Firth’s method addresses this by adding a penalty term based on the observed information matrix, producing finite and nearly unbiased estimates even in these settings. FirthLogisticRegression is a drop-in scikit-learn estimator that implements Firth’s penalised logistic regression, with profile-likelihood confidence intervals and penalised likelihood ratio p-values.

[1]:
import warnings

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import cross_val_score

from stats_misc.machine_learning.firthlogist import FirthLogisticRegression
from stats_misc.example_data.examples import load_endometrial, load_sex2

The separation problem

The endometrial cancer dataset (Heinze & Schemper, 2002) is a canonical example of complete separation: the predictor NV strongly discriminates between outcome classes. Standard logistic regression cannot produce finite estimates in this setting.

[2]:
X_endo, y_endo, feature_names_endo = load_endometrial()
nv = X_endo[:, 0].astype(int)

# 2x2 contingency table of NV against the outcome
tab = pd.crosstab(
    pd.Series(nv, name='Exposure: NV'),
    pd.Series(y_endo, name='Endometrial cancer'),
    margins=True,
)
print(tab)

Endometrial cancer  0.0  1.0  All
Exposure: NV
0                    49   17   66
1                     0   13   13
All                  49   30   79

The standard logisitic regression results

As a default comparison we first show the result from a standard logistici regression analysis. Note the fairly extreme and clearly unstable estimates obtained for the NV variable.

[3]:
X_endo_sm = sm.add_constant(X_endo)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    logit_result = sm.Logit(y_endo, X_endo_sm).fit(disp=False)

for w in caught:
    print(f'Warning — {w.category.__name__}: {w.message}')

print(logit_result.summary(xname=['const'] + feature_names_endo))
Warning — ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
                           Logit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                   79
Model:                          Logit   Df Residuals:                       75
Method:                           MLE   Df Model:                            3
Date:                Tue, 26 May 2026   Pseudo R-squ.:                  0.4720
Time:                        12:09:01   Log-Likelihood:                -27.697
converged:                      False   LL-Null:                       -52.451
Covariance Type:            nonrobust   LLR p-value:                 1.016e-10
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.3045      1.637      2.629      0.009       1.095       7.514
NV            19.5002   5458.415      0.004      0.997   -1.07e+04    1.07e+04
PI            -0.0422      0.044     -0.952      0.341      -0.129       0.045
EH            -2.9026      0.846     -3.433      0.001      -4.560      -1.245
==============================================================================

Possibly complete quasi-separation: A fraction 0.16 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

Comparison against the Firth logigistic regression.

Next we will run a Firth Logistic regression model. Note the marked difference in estimate log odds ratio, and less extreme confidence intervals. The fact that the p-values are smaller also follows logically from the observed near separation, which clearly arguess against an odds ratio of 1.0.

[4]:
firth_endo = FirthLogisticRegression()
firth_endo.fit(X_endo, y_endo)
firth_endo.summary(xname=feature_names_endo)
                 coef    std err     [0.025      0.975]      p-value
---------  ----------  ---------  ---------  ----------  -----------
NV          2.92927    1.55076     0.609724   7.85463    0.00905243
PI         -0.0347517  0.0395781  -0.124459   0.0404555  0.376022
EH         -2.60416    0.776017   -4.36518   -1.23272    2.50418e-05
Intercept   3.77456    1.48869     1.08254    7.20928    0.00416242

Log-Likelihood: -24.0373
Newton-Raphson iterations: 8

Inference on a larger dataset

The sex2 dataset (239 subjects, 6 predictors) demonstrates the full inference workflow: fitting, coefficient summaries, and generating predictions.

[5]:
X_sex2, y_sex2, feature_names_sex2 = load_sex2()

print(f'Shape: X={X_sex2.shape}, y={y_sex2.shape}')
print(f'Class balance:\n{pd.Series(y_sex2).value_counts().to_string()}')
Shape: X=(239, 6), y=(239,)
Class balance:
1.0    130
0.0    109
[6]:
firth_sex2 = FirthLogisticRegression()
firth_sex2.fit(X_sex2, y_sex2)
firth_sex2.summary(xname=feature_names_sex2)
                 coef    std err     [0.025      0.975]      p-value
---------  ----------  ---------  ---------  ----------  -----------
age        -1.10598     0.42366   -1.97379   -0.307425   0.00611139
oc         -0.0688167   0.443793  -0.941436   0.789202   0.826365
vic         2.26887     0.548416   1.27302    3.43543    1.67219e-06
vicl       -2.11141     0.543082  -3.26086   -1.11773    1.23618e-05
vis        -0.788317    0.417368  -1.60809    0.0151847  0.0534899
dia         3.09601     1.67501    0.774568   8.03029    0.00484687
Intercept   0.120254    0.485542  -0.818559   1.07315    0.766584

Log-Likelihood: -132.5394
Newton-Raphson iterations: 8

[7]:
predictions = firth_sex2.predict(X_sex2)
probabilities = firth_sex2.predict_proba(X_sex2)

print('Predicted classes (first 5):    ', predictions[:5])
print('Predicted probabilities (first 5):')
print(np.round(probabilities[:5], 4))
Predicted classes (first 5):     [0. 1. 1. 1. 1.]
Predicted probabilities (first 5):
[[0.6611 0.3389]
 [0.084  0.916 ]
 [0.084  0.916 ]
 [0.084  0.916 ]
 [0.084  0.916 ]]

Scikit-learn compatibility

FirthLogisticRegression implements the scikit-learn BaseEstimator and ClassifierMixin interfaces, so it works directly with scikit-learn’s model selection utilities such as cross_val_score.

Computing profile-likelihood CIs and penalised LRT p-values adds overhead per fold. Pass skip_ci=True and skip_pvals=True when only the coefficient estimates are needed (e.g., during cross-validation).

[8]:
scores = cross_val_score(
    FirthLogisticRegression(skip_ci=True, skip_pvals=True),
    X_sex2,
    y_sex2,
    cv=5,
    scoring='roc_auc',
)
print(f'Cross-validated AUC per fold: {np.round(scores, 3)}')
print(f'Mean AUC: {scores.mean():.3f} (\u00b1{scores.std():.3f})')
Cross-validated AUC per fold: [0.827 0.439 0.5   0.798 0.904]
Mean AUC: 0.694 (±0.187)
[9]:
fast_model = FirthLogisticRegression(skip_ci=True, skip_pvals=True)
fast_model.fit(X_sex2, y_sex2)

fast_model.summary(xname=feature_names_sex2)
                 coef    std err    [0.025    0.975]    p-value
---------  ----------  ---------  --------  --------  ---------
age        -1.10598     0.42366        nan       nan        nan
oc         -0.0688167   0.443793       nan       nan        nan
vic         2.26887     0.548416       nan       nan        nan
vicl       -2.11141     0.543082       nan       nan        nan
vis        -0.788317    0.417368       nan       nan        nan
dia         3.09601     1.67501        nan       nan        nan
Intercept   0.120254    0.485542       nan       nan        nan

Log-Likelihood: -132.5394
Newton-Raphson iterations: 8