Why Accuracy Is a Bad Metric for Imbalanced Datasets

Bingi Nagesh
2 min read · Sep 4, 2022


Source: Analytics Vidhya

Let us consider a two-class classification problem throughout this blog. With that in mind, what is an imbalanced dataset? It is a dataset in which one class has a large number of data points and the other class has only a few. The adjectives "large" and "few" are subjective and vary from problem to problem, so it is better to consult your peers, colleagues, seniors, and team leads on what class ratio makes a dataset imbalanced for your problem.

Example: Consider a loan default prediction problem with a total of 1000 data points, of which 100 are ‘default’ and the remaining 900 are ‘Not default’. The ratio of ‘default’ to ‘Not default’ is 1:9. This is an imbalanced dataset.

Confusion matrix for a two-class classification problem

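Let us consider a dumb model that predicts ‘Not default’ for all data points. Taking ‘default’ as the positive class, its confusion matrix has 0 true positives, 100 false negatives, 0 false positives, and 900 true negatives.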

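Accuracy is the number of correct predictions divided by the total number of predictions. The model is right only on the 900 ‘Not default’ points, so accuracy = 900 / 1000 = 0.90.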

So, the accuracy of our dumb model is 90%. On the face of it, 90% accuracy seems very good (which is still subjective), but no one would deploy this model in production: it never flags a single ‘default’, which is the very case the model exists to catch.
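To make the numbers concrete, here is a minimal Python sketch of the loan example. The 0/1 label encoding and the variable names are assumptions made for illustration, and treating ‘default’ as the positive class is a choice, not something the example dictates. Recall is computed alongside accuracy only to show what the 90% figure hides.

```python
# Hypothetical encoding of the example: 1 = 'default', 0 = 'Not default'.
y_true = [1] * 100 + [0] * 900

# The "dumb" model: predict 'Not default' for every data point.
y_pred = [0] * len(y_true)

# Count the four confusion-matrix cells, with 'default' as the positive class.
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # defaults caught
fn = sum(t == 1 and p == 0 for t, p in pairs)  # defaults missed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-defaults correctly passed

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)  # share of actual defaults the model identifies

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

The accuracy comes out at 0.90 while the recall on ‘default’ is 0.00, so the model looks strong on paper and is useless for the task.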

Classification metrics to use for imbalanced datasets can be found here.
