By default, logistic regression models for binary classification label an observation as 1 when its predicted probability is 0.5 or higher, and as 0 when the probability falls below 0.5.

What if we could adjust this 0.5 threshold, raising or lowering it? How would that affect the model's ability to correctly identify positive outcomes? These are exactly the questions that the ROC curve aims to answer visually, in a highly interpretable manner.
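To make the idea of a movable threshold concrete, here is a minimal sketch in Python. The probability values are invented for illustration, and the commented-out `model.predict_proba(X)` line assumes a hypothetical fitted scikit-learn classifier `model` and feature matrix `X`:

```python
import numpy as np

def classify_with_threshold(probs, threshold=0.5):
    """Label an observation 1 when its predicted probability of the
    positive class meets or exceeds the threshold, otherwise 0."""
    return (probs >= threshold).astype(int)

# probs = model.predict_proba(X)[:, 1]  # hypothetical fitted model
probs = np.array([0.2, 0.45, 0.5, 0.8])
print(classify_with_threshold(probs))        # default 0.5 -> [0 0 1 1]
print(classify_with_threshold(probs, 0.3))   # lower threshold -> [0 1 1 1]
```

Lowering the threshold flags more observations as positive; raising it flags fewer.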
Logistic regression is a statistical method used to fit a regression model when the response variable is binary. To assess how well a logistic regression model fits a dataset, we can look at the following two metrics:
- Sensitivity: The probability that the model predicts a positive outcome for an observation when the outcome is indeed positive.
- Specificity: The probability that the model predicts a negative outcome for an observation when the outcome is indeed negative.
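Both metrics fall straight out of a confusion matrix. A minimal sketch, using toy labels invented for illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual outcomes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```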
An easy way to visualize these two metrics together is to create an ROC curve, a plot that displays the sensitivity and specificity of a logistic regression model across classification thresholds.
The ROC curve is plotted with the True Positive Rate (Sensitivity) on the Y-axis and the False Positive Rate (1 - Specificity) on the X-axis. The TPR and FPR are computed for many thresholds between 0 and 1, and the resulting points are plotted on the graph. A basic ROC graph that I created for a logistic regression problem can be seen below:
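As a sketch of how such a plot can be produced, the snippet below uses scikit-learn's `roc_curve`, which performs exactly this threshold sweep. The synthetic dataset and model settings are assumptions for illustration, not the data behind the graph above:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data and a fitted logistic regression model
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probabilities of the positive class on held-out data
probs = model.predict_proba(X_test)[:, 1]

# roc_curve sweeps the threshold and returns one (FPR, TPR) pair per threshold
fpr, tpr, thresholds = roc_curve(y_test, probs)

plt.plot(fpr, tpr, label="logistic regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.legend()
plt.show()
```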

For example, suppose we fit three different logistic regression models and plot the following ROC curves for each model:

One way to compare the curves quantitatively is the AUC, the Area Under the (ROC) Curve. Suppose we calculate the AUC for each model as follows:
- Model A: AUC = 0.923
- Model B: AUC = 0.794
- Model C: AUC = 0.588
Model A has the highest AUC, indicating that it is the best of the three at correctly classifying observations into the positive and negative classes.
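Each of these AUC values can be computed directly from a model's predicted probabilities. A minimal sketch, reusing the hypothetical `y_test` and `probs` variables from the plotting example above:

```python
from sklearn.metrics import roc_auc_score

# AUC summarizes the whole ROC curve in a single number between 0 and 1
auc = roc_auc_score(y_test, probs)
print(f"AUC = {auc:.3f}")
```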
Some important takeaways for the ROC AUC are listed below:
- An ROC AUC of 0.5 indicates that the positive and negative data classes completely overlap; the model is essentially useless, no better than random guessing.
- An ROC AUC of 1 indicates that the positive and negative data classes are perfectly separated; the model performs as well as a model possibly can.
- The closer your ROC AUC is to 1, the better.
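The sketch below illustrates the two extremes from this list, using toy arrays invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Perfectly separated classes: every positive scores above every negative
y_sep = np.array([0, 0, 0, 1, 1, 1])
scores_sep = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
print(roc_auc_score(y_sep, scores_sep))  # 1.0

# Scores unrelated to the labels: AUC hovers around 0.5
rng = np.random.default_rng(0)
y_rand = rng.integers(0, 2, size=10_000)
scores_rand = rng.random(10_000)
print(round(roc_auc_score(y_rand, scores_rand), 2))  # approximately 0.5
```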
Finally, to summarize: the ROC curve is generated for each model by varying the classification threshold from 0 to 1. It ultimately helps us visualize the tradeoff between sensitivity and specificity and understand how well separated our data classes are.
