Is 99.83% accuracy good? Maybe not

In some cases, accuracy is not enough

Andre Kuniyoshi
Aug 8, 2021

In many cases, taking accuracy as the main metric is a reasonable approach for your analysis. Sometimes, however, it can lead us to misleading conclusions.

Yes, I’m talking about imbalanced datasets! They can be tricky and ruin our models and analysis if we don’t treat them properly.

What’s an imbalanced dataset?

Imbalanced datasets are those in which there is an unequal distribution of classes. They are common in many situations, such as credit card fraud, disease diagnosis, or spam detection.

Let’s take this dataset from Kaggle as an example. It was taken from the Credit Card Fraud Detection challenge.

So, first of all, let’s read the dataset:

import pandas as pd

df = pd.read_csv('../input/creditcardfraud/creditcard.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB

As you can see, the features of the dataset are not described. That's because the data is personal and confidentiality must be preserved (on Kaggle, V1 to V28 are described as the result of a PCA transformation of the original features). So we have features V1 to V28, plus Time and Amount. Class is our target: 0 if the transaction is not a fraud and 1 if it is.

Now, let’s take a look at the distribution of our Class.

total = df.Class.shape[0]
regular = df.query('Class == 0').shape[0]
frauds = df.query('Class == 1').shape[0]
print('Total transactions:', total)
print('Nº of regular transactions: {} - {:.2f}%'.format(regular, regular/total*100))
print('Nº of frauds: {} - {:.2f}%'.format(frauds, frauds/total*100))
Total transactions: 284807
Nº of regular transactions: 284315 - 99.83%
Nº of frauds: 492 - 0.17%

As you can see, and as we expected, regular credit card transactions are far more frequent than frauds: 99.83% against 0.17%.

This means that if we create a model that only ever predicts "regular transaction", it will have an accuracy of 99.83% on this dataset! But will this model be good or useful for us? Of course not. Since we want to identify fraud, high accuracy is meaningless if the model never flags a single suspicious transaction.
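To make that concrete, here is a minimal sketch of such a "majority class" baseline, using scikit-learn's DummyClassifier. This is not part of the original analysis, just an illustration:

import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

df = pd.read_csv('../input/creditcardfraud/creditcard.csv')
X = df.drop('Class', axis=1)
y = df.Class

# A baseline that always predicts the most frequent class (0 = regular)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X, y)

print('Accuracy:', dummy.score(X, y))                # ~0.9983
print('Recall:', recall_score(y, dummy.predict(X)))  # 0.0 - it never catches a fraud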

How to deal with imbalanced datasets?

As you may have already noticed, accuracy is not the ideal metric when the dataset is imbalanced. In these cases, it's preferable to calculate:

Precision: the fraction of instances predicted as positive that are actually positive, TP / (TP + FP)
Recall: the fraction of actual positives that the model manages to find, TP / (TP + FN)
f1: the harmonic mean of precision and recall

We can easily calculate all these metrics using the sklearn library. But before doing that, we need to preprocess our dataset, since many machine learning methods perform poorly when the classes are heavily imbalanced.
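As a quick illustration of the sklearn calls, here is a tiny sketch with made-up toy labels (not the fraud data):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 1, 1, 0, 1, 0]  # toy ground truth
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]  # toy predictions: 2 TP, 1 FP, 1 FN

print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.67
print(recall_score(y_true, y_pred))     # 2 / (2 + 1) = 0.67
print(f1_score(y_true, y_pred))         # 0.67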

Creating balanced datasets

To create a balanced dataset from an imbalanced one, we can use undersampling and/or oversampling methods.

There are several methods to undersample a dataset (see the sketch below). The idea is simple: we resize the majority class down to the size of the minority class. It's a good approach, but we have to keep in mind that the resulting dataset must still be big enough that we don't underfit the model.
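A minimal sketch of random undersampling with imbalanced-learn, on toy data (the toy dataset and variable names are mine, not from the fraud analysis):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Toy data: roughly 99% class 0, 1% class 1
X, y = make_classification(n_samples=10000, weights=[0.99], random_state=42)
print(Counter(y))        # e.g. Counter({0: 9895, 1: 105})

rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X, y)
print(Counter(y_under))  # both classes now the size of the minority class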

Oversampling means resizing the minority class up to the size of the majority class. The idea is also simple, and there are plenty of ways to do it, such as SMOTE, which creates synthetic minority-class samples by interpolating between existing positive instances.
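Continuing the toy example from the undersampling sketch above, oversampling with SMOTE looks like this:

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_over, y_over = smote.fit_resample(X, y)
print(Counter(y_over))   # the minority class is upsampled to match the majority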

In the case we’re treating here, I’m going to use the SMOTEENN method, which combines over and under-sampling using SMOTE and Edited Nearest Neighbours.

Before applying SMOTEENN to our dataset, we have to remember to split it into train and test sets and fit SMOTEENN only on the training data, so that we can evaluate our models on untouched data later. So, let's see how to code it.

from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTEENN

X = df.drop('Class', axis=1)
y = df.Class

# Split first, so that the test set stays untouched by the resampling
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=1128)

# Resample only the training data
sme = SMOTEENN(random_state=42)
X_sme, y_sme = sme.fit_resample(X_train, y_train)

Now that we have applied the SMOTEENN method, let's see how balanced our Class is:

round(y_sme.value_counts()/len(y_sme)*100,2)
1 50.96
0 49.04
Name: Class, dtype: float64

As you can see, now we have a way more balanced dataset than we had before!

Running Machine Learning Models

Now that we have a balanced training dataset, we can run our machine learning models and check the results.

First, let's look at accuracy, precision, recall, and f1 for three models: Logistic Regression, Random Forest Classifier, and Decision Tree Classifier.

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Fit each model on the resampled training data
# and predict on the untouched test set
lr_clf = LogisticRegression()
lr_clf.fit(X_sme, y_sme)
y_predicted_lr = lr_clf.predict(X_test)
y_proba_lr = lr_clf.predict_proba(X_test)[:, 1]

rfc_clf = RandomForestClassifier()
rfc_clf.fit(X_sme, y_sme)
y_predicted_rfc = rfc_clf.predict(X_test)
y_proba_rfc = rfc_clf.predict_proba(X_test)[:, 1]

dtc_clf = DecisionTreeClassifier()
dtc_clf.fit(X_sme, y_sme)
y_predicted_dtc = dtc_clf.predict(X_test)
y_proba_dtc = dtc_clf.predict_proba(X_test)[:, 1]

# Accuracy scores
print('Logistic Regression score of:',
      round(lr_clf.score(X_test, y_test)*100, 2))
print('Random Forest Classifier score of:',
      round(rfc_clf.score(X_test, y_test)*100, 2))
print('Decision Tree Classifier score of:',
      round(dtc_clf.score(X_test, y_test)*100, 2))
print('-'*90)

# Precision, recall, and f1 per class
print('Logistic Regression Report\n', classification_report(y_test, y_predicted_lr))
print('Random Forest Classifier Report\n', classification_report(y_test, y_predicted_rfc))
print('Decision Tree Classifier Report\n', classification_report(y_test, y_predicted_dtc))
print('-'*90)

# ROC curves and AUC
roc_auc_lr = roc_auc_score(y_test, y_proba_lr)
roc_auc_rfc = roc_auc_score(y_test, y_proba_rfc)
roc_auc_dtc = roc_auc_score(y_test, y_proba_dtc)

fpr_lr, tpr_lr, thresholds_lr = roc_curve(y_test, y_proba_lr)
fpr_rfc, tpr_rfc, thresholds_rfc = roc_curve(y_test, y_proba_rfc)
fpr_dtc, tpr_dtc, thresholds_dtc = roc_curve(y_test, y_proba_dtc)

plt.figure(figsize=(10, 10))
plt.plot(fpr_lr, tpr_lr, color='darkorange', lw=2,
         label='Logistic Regression\nROC curve (area = %0.2f)' % roc_auc_lr)
plt.plot(fpr_rfc, tpr_rfc, color='blue', lw=2,
         label='Random Forest Classifier\nROC curve (area = %0.2f)' % roc_auc_rfc)
plt.plot(fpr_dtc, tpr_dtc, color='red', lw=2,
         label='Decision Tree Classifier\nROC curve (area = %0.2f)' % roc_auc_dtc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')  # chance line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic\nFraud Detection')
plt.legend(loc="lower right")
plt.show()

And the results are:

Logistic Regression score of: 96.65
Random Forest Classifier score of: 99.95
Decision Tree Classifier score of: 99.77
--------------------------------------------------------------------
Logistic Regression Report
               precision    recall  f1-score   support

           0        1.00      0.97      0.98     56875
           1        0.04      0.86      0.07        87

    accuracy                            0.97     56962
   macro avg        0.52      0.91      0.53     56962
weighted avg        1.00      0.97      0.98     56962

Random Forest Classifier Report
               precision    recall  f1-score   support

           0        1.00      1.00      1.00     56875
           1        0.85      0.84      0.84        87

    accuracy                            1.00     56962
   macro avg        0.92      0.92      0.92     56962
weighted avg        1.00      1.00      1.00     56962

Decision Tree Classifier Report
               precision    recall  f1-score   support

           0        1.00      1.00      1.00     56875
           1        0.38      0.76      0.50        87

    accuracy                            1.00     56962
   macro avg        0.69      0.88      0.75     56962
weighted avg        1.00      1.00      1.00     56962

It's now clearer how important it is to analyze other metrics besides accuracy. From the results above, we can see that all models achieved high accuracy, but only the Random Forest Classifier also showed good precision and f1 scores on the fraud class.

Let’s see the confusion matrix for each model.

confusion_lr = confusion_matrix(y_test, y_predicted_lr)
confusion_rfc = confusion_matrix(y_test, y_predicted_rfc)
confusion_dtc = confusion_matrix(y_test, y_predicted_dtc)

fig, ax = plt.subplots(1, 3, sharex=True, figsize=(16, 4))

# fmt='d' shows the raw counts instead of scientific notation
sns.heatmap(ax=ax[0], data=confusion_lr, annot=True, fmt='d',
            annot_kws={"fontsize": 8}, cbar=False)
ax[0].set_yticklabels(['Non fraud', 'Fraud'], fontsize=10)
ax[0].set_xticklabels(['Non fraud PRED', 'Fraud PRED'], fontsize=10)
ax[0].set_title('Logistic Regression\nConfusion Matrix', fontsize=10)

sns.heatmap(ax=ax[1], data=confusion_rfc, annot=True, fmt='d',
            annot_kws={"fontsize": 8}, cbar=False)
ax[1].set_yticklabels(['Non fraud', 'Fraud'], fontsize=10)
ax[1].set_xticklabels(['Non fraud PRED', 'Fraud PRED'], fontsize=10)
ax[1].set_title('Random Forest Classifier\nConfusion Matrix', fontsize=10)

sns.heatmap(ax=ax[2], data=confusion_dtc, annot=True, fmt='d',
            annot_kws={"fontsize": 8}, cbar=True)
ax[2].set_yticklabels(['Non fraud', 'Fraud'], fontsize=10)
ax[2].set_xticklabels(['Non fraud PRED', 'Fraud PRED'], fontsize=10)
ax[2].set_title('Decision Tree Classifier\nConfusion Matrix', fontsize=10)

plt.show()

Conclusion

Dealing with imbalanced datasets is very common in our daily work, and it's important to analyze them carefully. A wrong analysis can lead to bad decisions, which in some cases can be very harmful.

This analysis was a simple one, just to demonstrate a way to handle an imbalanced dataset. It wasn’t a full tutorial, but I think it can be taken as a starting point.

Now that you’ve got the power of data analysis, please remember, “With great power comes great responsibility.”
