Deep ROC Analysis

for binary classifiers & diagnostic tests

pip install deeproc     |     PyPI webpage    |     GitHub webpage

The area under the ROC curve (AUC) measures performance over the whole ROC curve, across every possible decision threshold; this is too general, because it includes thresholds that would never be used in practice. Accuracy, the F1 score, sensitivity and specificity measure performance at a single decision threshold, i.e., a single point on the ROC curve; this is too specific, because it ignores the rest of the curve. Deep ROC analysis (paper) (presentation) (code) [1] permits in-depth analysis of classifier performance within groups of predicted risk or probability that together span the ROC curve. Previous attempts to represent parts of the AUC, such as the partial AUC, the standardized partial AUC and the two-way AUC, were flawed.
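The contrast between whole-curve and single-point measures can be seen directly with scikit-learn on synthetic data. This sketch is illustrative only (the data, the 0.5 threshold, and the variable names are our choices, not part of the deeproc package):

```python
# Illustrative contrast: AUC summarizes every threshold, while sensitivity
# and specificity describe one arbitrary operating point on the ROC curve.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic binary-classifier scores: positives score higher on average.
y_true = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])
y_score = np.concatenate([rng.normal(0.4, 0.15, 500),
                          rng.normal(0.6, 0.15, 500)])

auc = roc_auc_score(y_true, y_score)      # performance over all thresholds
y_pred = (y_score >= 0.5).astype(int)     # one arbitrary decision threshold
sens = (y_pred[y_true == 1] == 1).mean()  # sensitivity at that point
spec = (y_pred[y_true == 0] == 0).mean()  # specificity at that point
print(f"AUC={auc:.3f}  sens@0.5={sens:.3f}  spec@0.5={spec:.3f}")
```

Neither extreme tells us how the classifier behaves over the particular range of thresholds that matters; deep ROC analysis works in between.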

With deep ROC analysis we can validate that a classifier performs well in the group or groups that matter most--e.g., patients at greatest risk, patients at medium risk who are challenging to classify, or the range of plausible decision thresholds. We may select a different classifier based on group measures than we would based on whole-area or single-point measures.

More information

Measuring classifier or test performance in a group, i.e., over a range of thresholds, can account for the fact that each patient carries different costs and risks--and that priorities differ across clinical settings (family practice, emergency, disease clinics). In contrast, measures at a single threshold are optimal only for a prototypical or average patient. Group measures give us the opportunity to select and use classifiers better.

Also, in general, a classifier or test performs differently for patients at different levels of predicted risk.

Our group measures use familiar concepts: AUC, sensitivity, specificity, positive predictive value (PPV), which incorporates population prevalence [2], and negative predictive value (NPV). The group versions of the first three are the normalized concordant partial AUC (AUCni or cpAUCni) [3], the normalized partial AUC (avg Sensi) [4], and the normalized horizontal partial AUC (avg Speci) [3].
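These group measures can be sketched directly from an ROC curve with NumPy. The function below follows our reading of the definitions in [3] and [4]; the function name, the interpolation grid, and the normalization details are illustrative assumptions, not the deeproc package API:

```python
# Sketch of the group measures (our reading of [3] and [4]); not the
# deeproc package API.
import numpy as np

def _trapezoid(y, x):
    """Trapezoidal integral of y over x (NumPy-version-agnostic helper)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def group_measures(fpr, tpr, x1, x2):
    """Average sensitivity, average specificity, and normalized concordant
    partial AUC for the ROC segment with FPR in [x1, x2].
    Assumes fpr and tpr are increasing and the segment has positive height."""
    xs = np.linspace(x1, x2, 1001)
    ys = np.interp(xs, fpr, tpr)           # TPR over the group's FPR range
    dx = x2 - x1                           # horizontal extent of the group
    dy = ys[-1] - ys[0]                    # vertical extent of the group
    pAUC = _trapezoid(ys, xs)              # (vertical) partial AUC [4]
    avg_sens = pAUC / dx                   # average sensitivity
    yt = np.linspace(ys[0], ys[-1], 1001)
    xt = np.interp(yt, tpr, fpr)           # FPR over the group's TPR range
    pAUCx = _trapezoid(1.0 - xt, yt)       # horizontal partial AUC [3]
    avg_spec = pAUCx / dy                  # average specificity
    cpAUCn = (pAUC + pAUCx) / (dx + dy)    # normalized concordant partial AUC
    return avg_sens, avg_spec, cpAUCn

fpr = np.linspace(0.0, 1.0, 1001)
tpr = np.sqrt(fpr)                         # a simple concave ROC curve
print(group_measures(fpr, tpr, 0.0, 1.0 / 3.0))
```

Because cpAUCn is a weighted average of the other two measures, it always lies between the group's average sensitivity and average specificity.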

We also provide a new interpretation of AUC and AUCni as balanced average accuracy [1] for individuals rather than pairs of individuals. It is a weighted average that balances average sensitivity and average specificity [3] according to their proportional contributions over the range of interest: the vertical and horizontal ranges of a portion of an ROC curve may differ, and so they contribute different amounts to AUCni. This interpretation explains how AUC and AUCni exactly measure errors in decision-making, i.e., false positives and false negatives.
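The weighted-average reading above can be checked numerically. In this sketch (our reading of [1] and [3]; the curve, the group boundaries, and all names are illustrative assumptions, not the deeproc API), the normalized concordant partial AUC over a group equals the average sensitivity and average specificity weighted by the segment's horizontal and vertical extents:

```python
# Numeric check of the balanced-average-accuracy reading (our reading of
# [1] and [3]); not the deeproc package API.
import numpy as np

def _trapezoid(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

fpr = np.linspace(0.0, 1.0, 2001)
tpr = np.sqrt(fpr)                         # a simple concave ROC curve
x1, x2 = 0.1, 0.4                          # one group of thresholds

xs = np.linspace(x1, x2, 1001)
ys = np.interp(xs, fpr, tpr)
dx = x2 - x1                               # horizontal extent of the segment
dy = ys[-1] - ys[0]                        # vertical extent of the segment
avg_sens = _trapezoid(ys, xs) / dx
yt = np.linspace(ys[0], ys[-1], 1001)
xt = np.interp(yt, tpr, fpr)
avg_spec = _trapezoid(1.0 - xt, yt) / dy

# Normalized concordant partial AUC over the group...
aucni = (_trapezoid(ys, xs) + _trapezoid(1.0 - xt, yt)) / (dx + dy)
# ...equals the dx/dy-weighted average of avg sensitivity and specificity.
weighted = (dx * avg_sens + dy * avg_spec) / (dx + dy)
print(aucni, weighted)
```

When the segment's horizontal and vertical extents are equal (dx = dy), this reduces to the plain mean of average sensitivity and average specificity, i.e., a balanced average accuracy for the group.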


[1] Carrington AM, Manuel DG, Fieguth PW, Ramsay T, Osmani V, Wernly B, Bennett C, Hawken S, McInnes M, Magwood O, Sheikh Y, Holzinger A. Deep ROC analysis and AUC as balanced average accuracy, for improved classifier selection, audit and explanation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Early Access, January 25, 2022. doi:10.1109/TPAMI.2022.3145392

[2] Altman DG, Bland JM. Diagnostic tests 2: predictive values. BMJ, 309 (July 1994), 102.

[3] Carrington AM, Fieguth PW, Qazi H, Holzinger A, Chen HH, Mayr F, Manuel DG. A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms. BMC Medical Informatics and Decision Making, 20, 4 (2020). doi:10.1186/s12911-019-1014-6

[4] McClish DK. Analyzing a portion of the ROC curve. Medical Decision Making, 9(3), pp. 190–195, 1989.