Prediction model validation

The validation module packages a number of frequently used methods for prediction model validation. The current module focusses on binary clasificaiton problems as well as calibration of models for quantative traits (i.e. regression problems). For survival model validation have a look at the sksurv_utils module.

[1]:

# Imports
import numpy as np
import pandas as pd
from stats_misc.machine_learning.validation import (
    ConcordanceCorrelationCoefficient,
    cstatistic,
    format_roc,
    get_calibration,
    recalibrate,
)
from stats_misc.utils.general import logit
from stats_misc.constants import (
    NamesValidation as NamesVal,
)
# Some toy data
data = pd.DataFrame({
    'observed': [0, 1, 0, 1, 0, 0, 0, 1],
    'predicted': [0.1, 0.7, 0.3, 0.6, 0.9, 0.2, 0.15, 0.8],
    })
data['logit_predicted'] = logit(data['predicted'].to_numpy())
# list data
data.head()

[1]:

	observed	predicted	logit_predicted
0	0	0.1	-2.197225
1	1	0.7	0.847298
2	0	0.3	-0.847298
3	1	0.6	0.405465
4	0	0.9	2.197225

Estimating calibration

Many prediction models provide a predicted risk for a certain outcome. Ideally these predictions agree well with the observed risk. To determine this we will estimate the calibration-in-the-large (i.e. the intercept) and calibration slope. These estimates are directly related to a Cartesian graph with the observed risk on the y-axis and predicted risk on the x-axis, where perfect calibration would imply an intercept of zero and a slope of one.

[2]:

results = get_calibration(data, 'observed', 'logit_predicted', bins=3)
print('The calibration intercept: {:.2f} and its standard error: {:.3f},\n\n'
      'The calibration slope: {:.2f} and its standard error: {:.3f},\n\n'
      'As well as the calibration table:\n'
      '---------------------------------\n{}'.format(
          results.calibration_in_the_large,
          results.calibration_in_the_large_se,
          results.calibration_slope,
          results.calibration_slope_se,
          results.observed_predict_table,
      ))

The calibration intercept: -0.59 and its standard error: 0.896,

The calibration slope: 0.91 and its standard error: 0.665,

As well as the calibration table:
---------------------------------
      Average predict risk  Average observed risk  Lower bound observed risk  \
bins
0                     0.15               0.000000                        NaN
1                     0.45               0.500000                   0.012579
2                     0.80               0.666667                   0.094299

      Upper bound observed risk  No. subjects
bins
0                      0.707598             3
1                      0.987421             2
2                      0.991596             3

Model recalibration

Clearly the above model is poorly calibrated to the observed data. We can try to recalibrate the model by estimating a new intercept and slope. Note that the estimated intercept and slope should be used to update the original score as follows: new_intercept \(+\) new_slope \(\times\) score.

The included table automatically performs this operation; nevertheless, in many cases it might be preferred to report the new intercept and slope estimates (e.g., when updating an existing prediction model that you might want to apply to additional (testing) data).

[3]:

results = recalibrate(data, observed='observed', score='logit_predicted', model='binomial')
print('The re-calibration intercept: {:.2f} and slope: {:.2f},\n\n'
      'As well as recalibrated predictions:\n'
      '-----------------------------------\n{}'.format(
          results.intercept,
          results.slope,
          results.table_recalibrated.rename(columns={'logit_predicted': 'Predicted risk'}),
      ))

The re-calibration intercept: -0.57 and slope: 0.91,

As well as recalibrated predictions:
-----------------------------------
   Predicted risk
0        0.071634
1        0.551010
2        0.208308
3        0.450971
4        0.807127
5        0.138841
6        0.105132
7        0.666977

Confirm the recalibration has worked

Next, we will confirm the recalibration has worked. Here we are using the training data again, which of course ensures perfect calibration. In practice, one would therefore want to use held-out data to get a fair understanding of the (hopefully) improved calibration.

[4]:

data['new_logit_predicted'] = logit(results.table_recalibrated.to_numpy())
recal_results = get_calibration(data, 'observed', 'new_logit_predicted', bins=3)

print('The calibration intercept: {:.2f} and its standard error: {:.3f},\n\n'
      'The calibration slope: {:.2f} and its standard error: {:.3f},\n\n'
      'As well as the calibration table:\n'
      '---------------------------------\n{}'.format(
          recal_results.calibration_in_the_large,
          recal_results.calibration_in_the_large_se,
          recal_results.calibration_slope,
          recal_results.calibration_slope_se,
          recal_results.observed_predict_table,
      ))

The calibration intercept: 0.00 and its standard error: 0.871,

The calibration slope: 1.00 and its standard error: 0.732,

As well as the calibration table:
---------------------------------
      Average predict risk  Average observed risk  Lower bound observed risk  \
bins
0                 0.105202               0.000000                        NaN
1                 0.329639               0.500000                   0.012579
2                 0.675038               0.666667                   0.094299

      Upper bound observed risk  No. subjects
bins
0                      0.707598             3
1                      0.987421             2
2                      0.991596             3

Discrimination

Discrimination can be assessed using the c-statistic (which for most binary classification problems is equivalent to the Area Under the Receiver Operating Characteristic Curve). Packages such as sklearn already include functionality to calculate the c-statistic, but typically do not provide matching confidence intervals. Here we implement a large sample size solution for the confidence interval (noting that large sample size in statistics is often relatively small, something like 30–40 cases and controls). Alternatively, in small sample size settings, a bootstrap-based confidence interval might perform better (using the current implementation or the sklearn solution) - please refer to the stats-misc resampling module.

[5]:

res=cstatistic(observed=data.observed, predicted=data.predicted, alpha=0.05)
print('c-statistic: `{:.2f}` and its {}% confidence interval: `{:.2f}; {:.2f}`.\n'.format(
    getattr(res, NamesVal.CSTAT),
    int(getattr(res, NamesVal.CSTAT_COVERAGE) * 100),
    getattr(res, NamesVal.CSTAT_LB),
    getattr(res, NamesVal.CSTAT_UB)
))

c-statistic: `0.80` and its 95% confidence interval: `0.44; 1.16`.

Exploring the influence of logit and recalibration

Both the logit transformation and the specific type of employed recalibration will not affect the c-statistic. This is because these mappings are rank-preserving and the c-statistic is a rank-based statistic.

[6]:

res1=cstatistic(observed=data.observed, predicted=data.logit_predicted)
res2=cstatistic(observed=data.observed, predicted=data.new_logit_predicted)

print('The c-statistic for the logit transformed predicted risk: `{:.2f}`,\n'
      'The c-statistic for the recalibrated predicted risk: `{:.2f}`.'.format(
          getattr(res1, NamesVal.CSTAT),
          getattr(res2, NamesVal.CSTAT)
      ))

The c-statistic for the logit transformed predicted risk: `0.80`,
The c-statistic for the recalibrated predicted risk: `0.80`.

Creating the ROC input data

Aside from calculating the c-statistic, people often want to plot the ROC curve showing the various possible cut-points of the prediction rule and its impact on sensitivity and false-positive rates. The following function neatly organises the input data for such a plot.

[7]:

format_roc(observed=data.observed, predicted=data.predicted)

[7]:

	false_positive	sensitivity	threshold
0	0.0	0.0	inf
1	0.2	0.0	0.9
2	0.2	1.0	0.6
3	1.0	1.0	0.1

Concordance Correlation Coefficient

While get_calibration can also be used to determine calibration for regression/quantitative traits problems, Lin’s concordance correlation coefficient (CCC) provides an often-used alternative method.

The CCC measures agreement between two continuous variables. It differs from Pearson’s correlation (which does not reflect concordance/calibration) in that it penalises deviations from the line of perfect agreement (the 45-degree line through the origin), not just departures from any linear relationship. Pearson’s \(r\) captures precision (how tightly points cluster around a line), while the CCC captures both precision and accuracy (whether that line is the line of perfect agreement). The CCC is therefore the appropriate measure when you need to confirm that two methods or raters produce equivalent measurements, not merely correlated ones.

The CCC decomposes as: \(\rho_c = \rho \cdot C_b\), where \(\rho\) is Pearson’s correlation (precision) and \(C_b \in [0, 1]\) is a bias correction factor (accuracy). Two confidence interval methods are available: the delta method (default) and Fisher’s z-transformation.

[8]:

# Toy data: two continuous paired measurements (e.g. two raters or two instruments)
x = np.array([1.2, 2.4, 3.1, 4.5, 5.0, 6.3, 7.1, 8.8])
y = np.array([1.0, 2.1, 3.4, 4.2, 5.3, 6.0, 7.5, 8.5])

ccc = ConcordanceCorrelationCoefficient(x, y)

# Delta method CI (default)
ccc(alpha=0.05, ci_method='delta')
print('CCC (delta method): {:.3f}, 95% CI: {:.3f}; {:.3f}'.format(
    ccc.ccc, ccc.ccc_ci[0], ccc.ccc_ci[1]
))


# Fisher z-transformation CI
ccc(alpha=0.05, ci_method='fisher')
print('\nCCC (Fisher z): {:.3f}, 95% CI: {:.3f}; {:.3f}'.format(
    ccc.ccc, ccc.ccc_ci[0], ccc.ccc_ci[1]
))

# The Pearson estimate and the related bias correction
print('\nPearson r: {:.3f}, bias correction (Cb): {:.3f}'.format(
    ccc.pearson_correlation, ccc.bias_correction
))

CCC (delta method): 0.992, 95% CI: 0.979; 1.005

CCC (Fisher z): 0.992, 95% CI: 0.961; 0.998

Pearson r: 0.992, bias correction (Cb): 1.000