Deep ROC Analysis

for binary classifiers & diagnostic tests

The area under the ROC curve (AUC) summarizes performance over the whole ROC curve, that is, over every possible decision threshold. This is too general, because it includes thresholds that would never be used. Accuracy, F1 score, sensitivity and specificity measure performance at a single decision threshold (one point on the ROC curve), which is too specific. Deep ROC analysis [1], available as a preprint with code, permits in-depth analysis of performance over ranges of decision thresholds, organized in groups that correspond to different ranges of predicted risk. We also offer a new interpretation of the AUC, in whole or in part (when normalized), as balanced average accuracy.
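As a minimal sketch of the contrast between whole-curve and single-point measures (this is not the code released with [1]), the Python below builds an empirical ROC curve from labels and scores, computes the whole-curve AUC with the trapezoidal rule, and computes sensitivity, specificity and accuracy at one threshold. The function names, the toy data and the threshold of 0.55 are assumptions made for illustration only.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points of the empirical ROC curve, from (0, 0) to (1, 1)."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort by descending score: lowering the threshold admits one score level at a time.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        level = ranked[i][0]
        # Instances tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == level:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under a piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def at_threshold(labels, scores, t):
    """Sensitivity, specificity and accuracy when predicting positive for score >= t."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, tn / neg, (tp + tn) / len(labels)


# Hypothetical toy data: label 1 = condition present, score = predicted risk.
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

whole_curve_auc = auc(roc_points(labels, scores))      # one number for all thresholds
sens, spec, acc = at_threshold(labels, scores, 0.55)   # one operating point only
```

The first result averages over every threshold, including ones no clinician would choose. The second describes a single operating point and says nothing about nearby thresholds.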

With deep ROC analysis we can validate that a classifier performs well in the group or groups that are most relevant: for example, patients at greatest risk, patients at medium risk who are challenging to classify, or the range of plausible decision thresholds. Group measures may lead us to select a different classifier than whole-curve or single-point measures would.

More information

Measuring classifier or test performance in a group, i.e., over a range of thresholds, accounts for the fact that each patient has different costs and risks, and that priorities differ between clinical settings (family practice, emergency, disease clinics). In contrast, measures at a single threshold are optimal only for a prototypical or average patient. Group measures therefore give us the opportunity to select and use classifiers better.

Also, in general, a classifier or test performs differently for patients at different levels of predicted risk.

Our group measures use familiar concepts: AUC, sensitivity, specificity, positive predictive value (PPV), which incorporates population prevalence [2] and is sometimes called post-test probability, and negative predictive value (NPV). The first three are the normalized versions of, respectively, our concordant partial AUC (denoted AUCi or cpAUCi for group i) [3], the partial AUC (avg Sensi) [4], and the partial area index [5] or horizontal partial AUC [3] (avg Speci).
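To illustrate how prevalence enters PPV and NPV, the sketch below applies Bayes' theorem at a single operating point. It shows only the standard point-wise relationship. How deep ROC analysis averages predictive values over a group is not reproduced here. The function name and the example values are hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with the given prevalence."""
    # Expected fractions of the population in each cell of the confusion matrix.
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    ppv = tp / (tp + fp)   # probability of disease given a positive result
    npv = tn / (tn + fn)   # probability of no disease given a negative result
    return ppv, npv


# Hypothetical test with 90% sensitivity and 90% specificity in two settings.
ppv_low, npv_low = predictive_values(0.90, 0.90, prevalence=0.01)    # e.g., screening
ppv_high, npv_high = predictive_values(0.90, 0.90, prevalence=0.30)  # e.g., a disease clinic
```

With the same sensitivity and specificity, PPV is far lower in the low-prevalence setting, which is one reason the same classifier can be judged differently in different clinical settings.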

We also provide a new interpretation of AUC and AUCi as balanced average accuracy [1], i.e., the average of average sensitivity and average specificity [3] over a range of decision thresholds. This is useful because it explains exactly how AUC and AUCi relate to errors in decision-making, i.e., false positives and false negatives.
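The relationship can be sketched numerically. The Python below is our own illustration, not the code released with [1]. It takes a group to be a contiguous run of ROC points and computes the vertical partial area (the partial AUC), the horizontal partial area, their average (the concordant partial AUC as defined in [3]), and normalized versions. The normalization used for AUCn, dividing by half the sum of the group's FPR and TPR widths, is an assumption that should be checked against [1]. The ROC points, the group boundaries and all names are hypothetical.

```python
def trapezoid(xs, ys):
    """Trapezoidal integral of ys with respect to xs."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


def group_measures(points, start, stop):
    """Measures for the part of the ROC curve from points[start] to points[stop]."""
    segment = points[start:stop + 1]
    fpr = [p[0] for p in segment]
    tpr = [p[1] for p in segment]
    spec = [1 - x for x in fpr]
    dx = fpr[-1] - fpr[0]          # width of the group in FPR
    dy = tpr[-1] - tpr[0]          # width of the group in TPR
    pauc = trapezoid(fpr, tpr)     # vertical partial area: TPR integrated over FPR
    paucx = trapezoid(tpr, spec)   # horizontal partial area: specificity integrated over TPR
    cpauc = (pauc + paucx) / 2     # concordant partial AUC
    return {
        "cpAUC": cpauc,
        "avg_sens": pauc / dx,     # normalized partial AUC
        "avg_spec": paucx / dy,    # normalized horizontal partial AUC
        "AUCn": cpauc / ((dx + dy) / 2),  # assumed normalization, see lead-in
    }


# Hypothetical ROC points (FPR, TPR), strictly increasing in both coordinates
# so that every group has non-zero width in each direction.
points = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]

whole = group_measures(points, 0, len(points) - 1)
groups = [group_measures(points, a, b) for a, b in [(0, 1), (1, 3), (3, 4)]]
```

Over the whole curve both widths equal 1, so average sensitivity, average specificity and their plain average all equal the AUC. Within a group, AUCn under this normalization is a width-weighted average of avg_sens and avg_spec, and the cpAUC values of the three groups sum to the whole-curve AUC.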


[1] Carrington AM, Manuel DG, Fieguth PW, Ramsay T, Osmani V, Wernly B, Bennett C, Hawken S, McInnes M, Magwood O, Sheikh Y, Holzinger A. Deep ROC Analysis and AUC as Balanced Average Accuracy to Improve Model Selection, Understanding and Interpretation. Manuscript submitted February 14, 2021.

[2] Altman DG, Bland JM. Diagnostic tests 2: predictive values. BMJ, 309 (July 1994), 16104.

[3] Carrington AM, Fieguth PW, Qazi H, Holzinger A, Chen HH, Mayr F and Manuel DG. A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms, BMC Medical Informatics and Decision Making 20, 4 (2020) doi:10.1186/s12911-019-1014-6.

[4] McClish DK. Analyzing a Portion of the ROC Curve. Medical Decision Making, pp. 190–195, 1989.

[5] Jiang Y, Metz CE, Nishikawa RM. A receiver operating characteristic partial area index for highly sensitive diagnostic tests. Radiology, vol. 201, no. 3, pp. 745–750, 1996.