Validating (machine learning) prediction models
Here we showcase a number of functions for validating machine learning (and other) prediction models.
[1]:
# Imports
import pandas as pd
from stats_misc.machine_learning.validation import (
    cstatistic,
    format_roc,
    get_calibration,
    recalibrate,
)
from stats_misc.utils.general import logit
from stats_misc.constants import (
    NamesValidation as NamesVal,
)
# Some toy data
data = pd.DataFrame({
    'observed': [0, 1, 0, 1, 0, 0, 0, 1],
    'predicted': [0.1, 0.7, 0.3, 0.6, 0.9, 0.2, 0.15, 0.8],
})
data['logit_predicted'] = logit(data['predicted'].to_numpy())
# list data
data.head()
[1]:
| | observed | predicted | logit_predicted |
|---|---|---|---|
| 0 | 0 | 0.1 | -2.197225 |
| 1 | 1 | 0.7 | 0.847298 |
| 2 | 0 | 0.3 | -0.847298 |
| 3 | 1 | 0.6 | 0.405465 |
| 4 | 0 | 0.9 | 2.197225 |
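For reference, the `logit` helper is assumed here to be the standard log-odds transform, \(\mathrm{logit}(p) = \log(p / (1 - p))\); a minimal stand-in sketch:

```python
import math

def logit(p):
    # Standard log-odds transform; assumes 0 < p < 1.
    return math.log(p / (1 - p))

# Matches the logit_predicted column above, e.g. for p = 0.1:
print(logit(0.1))  # approximately -2.197225
```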
Estimating calibration
Many prediction models provide a predicted risk for a certain outcome. Ideally these predictions agree well with the observed risk. To assess this we estimate the calibration-in-the-large (i.e. the intercept) and the calibration slope. These estimates relate directly to a Cartesian graph with the observed risk on the y-axis and the predicted risk on the x-axis, where perfect calibration implies an intercept of zero and a slope of one.
[2]:
results = get_calibration(data, 'observed', 'logit_predicted', bins=3)
print('The calibration intercept: {:.2f} and its standard error: {:.3f},\n\n'
      'The calibration slope: {:.2f} and its standard error: {:.3f},\n\n'
      'As well as the calibration table:\n'
      '---------------------------------\n{}'.format(
          results.calibration_in_the_large,
          results.calibration_in_the_large_se,
          results.calibration_slope,
          results.calibration_slope_se,
          results.observed_predict_table,
      ))
The calibration intercept: -0.59 and its standard error: 0.896,
The calibration slope: 0.91 and its standard error: 0.665,
As well as the calibration table:
---------------------------------
| bins | Average predict risk | Average observed risk | Lower bound observed risk | Upper bound observed risk | No. subjects |
|---|---|---|---|---|---|
| 0 | 0.15 | 0.000000 | NaN | 0.707598 | 3 |
| 1 | 0.45 | 0.500000 | 0.012579 | 0.987421 | 2 |
| 2 | 0.80 | 0.666667 | 0.094299 | 0.991596 | 3 |
Model recalibration
Clearly the above model is poorly calibrated to the observed data. We can try to recalibrate the model by estimating a new intercept and slope. Note that the estimated intercept and slope should be used to update the original score as follows: new_intercept \(+\) new_slope \(\times\) score.
The returned table performs this operation automatically; nevertheless, in many cases it may be preferable to report the new intercept and slope estimates (e.g., when updating an existing prediction model that you want to apply to additional (test) data).
[3]:
results = recalibrate(data, observed='observed', score='logit_predicted', model='binomial')
print('The re-calibration intercept: {:.2f} and slope: {:.2f},\n\n'
      'As well as recalibrated predictions:\n'
      '-----------------------------------\n{}'.format(
          results.intercept,
          results.slope,
          results.table_recalibrated.rename(columns={'logit_predicted': 'Predicted risk'}),
      ))
The re-calibration intercept: -0.57 and slope: 0.91,
As well as recalibrated predictions:
-----------------------------------
Predicted risk
0 0.071634
1 0.551010
2 0.208308
3 0.450971
4 0.807127
5 0.138841
6 0.105132
7 0.666977
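Applying the reported (rounded) intercept and slope by hand reproduces these recalibrated predictions up to rounding; a minimal sketch using only the standard library:

```python
import math

def logit(p):
    # Log-odds transform of a probability.
    return math.log(p / (1 - p))

def expit(x):
    # Inverse logit: maps a log-odds value back to a probability.
    return 1 / (1 + math.exp(-x))

intercept, slope = -0.57, 0.91  # reported (rounded) estimates

# First subject: original predicted risk 0.1.
recalibrated = expit(intercept + slope * logit(0.1))
print(recalibrated)  # close to the 0.071634 reported above
```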
Confirm the recalibration has worked
Next we will confirm that the recalibration has worked. Here we reuse the training data, which of course guarantees perfect calibration. In practice one would therefore want to use data not used during training to get a fair impression of the (hopefully) improved calibration.
[4]:
data['new_logit_predicted'] = logit(results.table_recalibrated.to_numpy())
recal_results = get_calibration(data, 'observed', 'new_logit_predicted', bins=3)
print('The calibration intercept: {:.2f} and its standard error: {:.3f},\n\n'
'The calibration slope: {:.2f} and its standard error: {:.3f},\n\n'
'As well as the calibration table:\n'
'---------------------------------\n{}'.format(
recal_results.calibration_in_the_large,
recal_results.calibration_in_the_large_se,
recal_results.calibration_slope,
recal_results.calibration_slope_se,
recal_results.observed_predict_table,
))
The calibration intercept: 0.00 and its standard error: 0.871,
The calibration slope: 1.00 and its standard error: 0.732,
As well as the calibration table:
---------------------------------
| bins | Average predict risk | Average observed risk | Lower bound observed risk | Upper bound observed risk | No. subjects |
|---|---|---|---|---|---|
| 0 | 0.105202 | 0.000000 | NaN | 0.707598 | 3 |
| 1 | 0.329639 | 0.500000 | 0.012579 | 0.987421 | 2 |
| 2 | 0.675038 | 0.666667 | 0.094299 | 0.991596 | 3 |
Discrimination
Discrimination can be assessed using the c-statistic (which for most analyses is equivalent to the Area Under the Receiver Operating Characteristic curve). Packages such as sklearn already include functionality to calculate the c-statistic, but typically do not provide matching confidence intervals. Here we implement a large-sample solution for the confidence interval (noting that "large sample" in statistics is often relatively small, roughly 30-40 cases and controls). Alternatively, in small-sample settings a bootstrap-based confidence interval might perform better (using either the current implementation or the sklearn solution).
[5]:
res = cstatistic(observed=data.observed, predicted=data.predicted, alpha=0.05)
print('c-statistic: `{:.2f}` and its {:.0%} confidence interval: `{:.2f}; {:.2f}`.\n'.format(
    getattr(res, NamesVal.CSTAT),
    getattr(res, NamesVal.CSTAT_COVERAGE),
    getattr(res, NamesVal.CSTAT_LB),
    getattr(res, NamesVal.CSTAT_UB)
))
c-statistic: `0.80` and its 95% confidence interval: `0.44; 1.16`.
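As a cross-check, the c-statistic is simply the proportion of case-control pairs in which the case received the higher predicted risk (ties counting one half); a minimal sketch, independent of `stats_misc`:

```python
def c_statistic(observed, predicted):
    # Proportion of (case, control) pairs where the case is ranked
    # higher; ties count as half a concordant pair.
    cases = [p for o, p in zip(observed, predicted) if o == 1]
    controls = [p for o, p in zip(observed, predicted) if o == 0]
    concordant = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases for k in controls
    )
    return concordant / (len(cases) * len(controls))

observed = [0, 1, 0, 1, 0, 0, 0, 1]
predicted = [0.1, 0.7, 0.3, 0.6, 0.9, 0.2, 0.15, 0.8]
print(c_statistic(observed, predicted))  # 0.8, matching the output above
```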
Exploring the influence of logit and recalibration
Neither the logit transformation nor the specific type of recalibration employed will affect the c-statistic. This is because these mappings are rank preserving and the c-statistic is a rank-based statistic.
[6]:
res1 = cstatistic(observed=data.observed, predicted=data.logit_predicted)
res2 = cstatistic(observed=data.observed, predicted=data.new_logit_predicted)
print('The c-statistic for the logit transformed predicted risk: `{:.2f}`,\n'
      'The c-statistic for the recalibrated predicted risk: `{:.2f}`.'.format(
          getattr(res1, NamesVal.CSTAT),
          getattr(res2, NamesVal.CSTAT)
      ))
The c-statistic for the logit transformed predicted risk: `0.80`,
The c-statistic for the recalibrated predicted risk: `0.80`.
Creating the ROC input data
Aside from calculating the c-statistic, people often want to plot the ROC curve, showing the various possible cut-points of the prediction rule and their impact on the sensitivity and false-positive rate. The following function neatly organises the input data for such a plot.
[7]:
format_roc(observed=data.observed, predicted=data.predicted)
[7]:
| | false_positive | sensitivity | threshold |
|---|---|---|---|
| 0 | 0.0 | 0.0 | inf |
| 1 | 0.2 | 0.0 | 0.9 |
| 2 | 0.2 | 1.0 | 0.6 |
| 3 | 1.0 | 1.0 | 0.1 |
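The points in this table can be reproduced by sweeping a decision threshold over the predicted risks (assuming the \(\geq\)-threshold convention; the table above drops redundant collinear points):

```python
def roc_point(observed, predicted, threshold):
    # Classify as positive when predicted risk >= threshold and
    # return (false-positive rate, sensitivity).
    tp = sum(1 for o, p in zip(observed, predicted) if o == 1 and p >= threshold)
    fp = sum(1 for o, p in zip(observed, predicted) if o == 0 and p >= threshold)
    n_pos = sum(observed)
    n_neg = len(observed) - n_pos
    return fp / n_neg, tp / n_pos

observed = [0, 1, 0, 1, 0, 0, 0, 1]
predicted = [0.1, 0.7, 0.3, 0.6, 0.9, 0.2, 0.15, 0.8]
print(roc_point(observed, predicted, 0.6))  # matches the (0.2, 1.0) row
print(roc_point(observed, predicted, 0.1))  # matches the (1.0, 1.0) row
```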