Is 99.83% accuracy good? Maybe not

In some cases, accuracy is not enough

Andre Kuniyoshi
Aug 8, 2021

In many cases, taking accuracy as the main metric is a reasonable approach for your analysis. Sometimes, however, it can lead us to misleading conclusions.

Yes, I’m talking about imbalanced datasets! They can be tricky and ruin our models and analysis if we don’t treat them properly.

What’s an imbalanced dataset?

Imbalanced datasets are those in which there is an unequal distribution of classes. They are common in many situations, such as credit card fraud, disease diagnosis, or spam detection.

Let’s take this dataset from Kaggle as an example. It was taken from the Credit Card Fraud Detection challenge.

So, first of all, let’s read the dataset:

import pandas as pd

df = pd.read_csv('../input/creditcardfraud/creditcard.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB

As you can see, the features of the dataset are not described. That's because the data is personal and confidentiality must be preserved (on Kaggle, V1 to V28 are described as the result of a PCA transformation of the original features). So we have features V1 to V28, plus Time and Amount. Class is our target: 0 if the transaction is not a fraud and 1 if it is.

Now, let’s take a look at the distribution of our Class.

total = df.Class.shape[0]
regular = df.query('Class == 0').shape[0]
frauds = df.query('Class == 1').shape[0]
print('Total transactions:', total)
print('Nº of regular transactions: {} - {:.2f}%'.format(regular, regular/total*100))
print('Nº of frauds: {} - {:.2f}%'.format(frauds, frauds/total*100))
Total transactions: 284807
Nº of regular transactions: 284315 - 99.83%
Nº of frauds: 492 - 0.17%

As you can see, and as we expected, regular credit card transactions are far more frequent than frauds: 99.83% against 0.17%.

This means that if we create a model that only ever predicts "regular transaction", it will have an accuracy of 99.83% on this dataset! But will this model be good or useful for us? Of course not. Since we want to identify fraud, high accuracy is meaningless if the model never flags a single suspicious transaction.
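To make that concrete, here is a minimal sketch of such a "majority class" baseline, using scikit-learn's DummyClassifier. This is not part of the original analysis, just an illustration:

import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

df = pd.read_csv('../input/creditcardfraud/creditcard.csv')
X = df.drop('Class', axis=1)
y = df.Class

# A baseline that always predicts the most frequent class (0 = regular)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X, y)

print('Accuracy:', dummy.score(X, y))                # ~0.9983
print('Recall:', recall_score(y, dummy.predict(X)))  # 0.0 - it never catches a fraud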

How to deal with imbalanced datasets?

As you may have already noticed, accuracy is not the ideal metric when the dataset is imbalanced. In these cases, it's preferable to calculate:

Precision: the fraction of instances predicted as positive that are actually positive, TP / (TP + FP)
Recall: the fraction of actual positives that the model manages to find, TP / (TP + FN)
f1: the harmonic mean of precision and recall

We can easily calculate all these metrics using the sklearn library. But before doing that, we need to preprocess our dataset, since many machine learning methods perform poorly when the classes are heavily imbalanced.
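As a quick illustration of the sklearn calls, here is a tiny sketch with made-up toy labels (not the fraud data):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 1, 1, 0, 1, 0]  # toy ground truth
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]  # toy predictions: 2 TP, 1 FP, 1 FN

print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.67
print(recall_score(y_true, y_pred))     # 2 / (2 + 1) = 0.67
print(f1_score(y_true, y_pred))         # 0.67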

Creating balanced datasets

To create a balanced dataset from an imbalanced one, we can use undersampling and/or oversampling methods.

There are several methods to undersample a dataset (see the sketch below). The idea is simple: we resize the majority class down to the size of the minority class. It's a good approach, but we have to keep in mind that the resulting dataset must still be big enough that we don't underfit the model.
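A minimal sketch of random undersampling with imbalanced-learn, on toy data (the toy dataset and variable names are mine, not from the fraud analysis):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Toy data: roughly 99% class 0, 1% class 1
X, y = make_classification(n_samples=10000, weights=[0.99], random_state=42)
print(Counter(y))        # e.g. Counter({0: 9895, 1: 105})

rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X, y)
print(Counter(y_under))  # both classes now the size of the minority class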

Oversampling means resizing the minority class up to the size of the majority class. The idea is also simple, and there are plenty of ways to do it, such as SMOTE, which creates synthetic minority-class samples by interpolating between existing positive instances.
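Continuing the toy example from the undersampling sketch above, oversampling with SMOTE looks like this:

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_over, y_over = smote.fit_resample(X, y)
print(Counter(y_over))   # the minority class is upsampled to match the majority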

In the case we’re treating here, I’m going to use the SMOTEENN method, which combines over and under-sampling using SMOTE and Edited Nearest Neighbours.

Before applying SMOTEENN to our dataset, we have to remember to split it into train and test sets and fit SMOTEENN only on the training data, so that we can evaluate our models on untouched data later. So, let's see how to code it.

from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTEENN

X = df.drop('Class', axis=1)
y = df.Class

# Split first, so that the test set stays untouched by the resampling
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=1128)

# Resample only the training data
sme = SMOTEENN(random_state=42)
X_sme, y_sme = sme.fit_resample(X_train, y_train)

Now that we have applied the SMOTEENN method, let's see how balanced our Class is:

round(y_sme.value_counts()/len(y_sme)*100,2)
1 50.96
0 49.04
Name: Class, dtype: float64

As you can see, now we have a way more balanced dataset than we had before!

Running Machine Learning Models

Now that we have a balanced training dataset, we can run our machine learning models and check the results.

First, let's look at accuracy, precision, recall, and f1 for three models: Logistic Regression, Random Forest Classifier, and Decision Tree Classifier.

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Fit each model on the resampled training data
# and predict on the untouched test set
lr_clf = LogisticRegression()
lr_clf.fit(X_sme, y_sme)
y_predicted_lr = lr_clf.predict(X_test)
y_proba_lr = lr_clf.predict_proba(X_test)[:, 1]

rfc_clf = RandomForestClassifier()
rfc_clf.fit(X_sme, y_sme)
y_predicted_rfc = rfc_clf.predict(X_test)
y_proba_rfc = rfc_clf.predict_proba(X_test)[:, 1]

dtc_clf = DecisionTreeClassifier()
dtc_clf.fit(X_sme, y_sme)
y_predicted_dtc = dtc_clf.predict(X_test)
y_proba_dtc = dtc_clf.predict_proba(X_test)[:, 1]

# Accuracy scores
print('Logistic Regression score of:',
      round(lr_clf.score(X_test, y_test)*100, 2))
print('Random Forest Classifier score of:',
      round(rfc_clf.score(X_test, y_test)*100, 2))
print('Decision Tree Classifier score of:',
      round(dtc_clf.score(X_test, y_test)*100, 2))
print('-'*90)

# Precision, recall, and f1 per class
print('Logistic Regression Report\n', classification_report(y_test, y_predicted_lr))
print('Random Forest Classifier Report\n', classification_report(y_test, y_predicted_rfc))
print('Decision Tree Classifier Report\n', classification_report(y_test, y_predicted_dtc))
print('-'*90)

# ROC curves and AUC
roc_auc_lr = roc_auc_score(y_test, y_proba_lr)
roc_auc_rfc = roc_auc_score(y_test, y_proba_rfc)
roc_auc_dtc = roc_auc_score(y_test, y_proba_dtc)

fpr_lr, tpr_lr, thresholds_lr = roc_curve(y_test, y_proba_lr)
fpr_rfc, tpr_rfc, thresholds_rfc = roc_curve(y_test, y_proba_rfc)
fpr_dtc, tpr_dtc, thresholds_dtc = roc_curve(y_test, y_proba_dtc)

plt.figure(figsize=(10, 10))
plt.plot(fpr_lr, tpr_lr, color='darkorange', lw=2,
         label='Logistic Regression\nROC curve (area = %0.2f)' % roc_auc_lr)
plt.plot(fpr_rfc, tpr_rfc, color='blue', lw=2,
         label='Random Forest Classifier\nROC curve (area = %0.2f)' % roc_auc_rfc)
plt.plot(fpr_dtc, tpr_dtc, color='red', lw=2,
         label='Decision Tree Classifier\nROC curve (area = %0.2f)' % roc_auc_dtc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')  # chance line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic\nFraud Detection')
plt.legend(loc="lower right")
plt.show()

And the results are:

Logistic Regression score of: 96.65
Random Forest Classifier score of: 99.95
Decision Tree Classifier score of: 99.77
--------------------------------------------------------------------
Logistic Regression Report
               precision    recall  f1-score   support

           0        1.00      0.97      0.98     56875
           1        0.04      0.86      0.07        87

    accuracy                            0.97     56962
   macro avg        0.52      0.91      0.53     56962
weighted avg        1.00      0.97      0.98     56962

Random Forest Classifier Report
               precision    recall  f1-score   support

           0        1.00      1.00      1.00     56875
           1        0.85      0.84      0.84        87

    accuracy                            1.00     56962
   macro avg        0.92      0.92      0.92     56962
weighted avg        1.00      1.00      1.00     56962

Decision Tree Classifier Report
               precision    recall  f1-score   support

           0        1.00      1.00      1.00     56875
           1        0.38      0.76      0.50        87

    accuracy                            1.00     56962
   macro avg        0.69      0.88      0.75     56962
weighted avg        1.00      1.00      1.00     56962

It's now clearer how important it is to analyze other metrics besides accuracy. From the results above, we can see that all models achieved high accuracy, but only the Random Forest Classifier also showed good precision and f1 scores on the fraud class.

Let’s see the confusion matrix for each model.

confusion_lr = confusion_matrix(y_test, y_predicted_lr)
confusion_rfc = confusion_matrix(y_test, y_predicted_rfc)
confusion_dtc = confusion_matrix(y_test, y_predicted_dtc)

fig, ax = plt.subplots(1, 3, sharex=True, figsize=(16, 4))

# fmt='d' shows the raw counts instead of scientific notation
sns.heatmap(ax=ax[0], data=confusion_lr, annot=True, fmt='d',
            annot_kws={"fontsize": 8}, cbar=False)
ax[0].set_yticklabels(['Non fraud', 'Fraud'], fontsize=10)
ax[0].set_xticklabels(['Non fraud PRED', 'Fraud PRED'], fontsize=10)
ax[0].set_title('Logistic Regression\nConfusion Matrix', fontsize=10)

sns.heatmap(ax=ax[1], data=confusion_rfc, annot=True, fmt='d',
            annot_kws={"fontsize": 8}, cbar=False)
ax[1].set_yticklabels(['Non fraud', 'Fraud'], fontsize=10)
ax[1].set_xticklabels(['Non fraud PRED', 'Fraud PRED'], fontsize=10)
ax[1].set_title('Random Forest Classifier\nConfusion Matrix', fontsize=10)

sns.heatmap(ax=ax[2], data=confusion_dtc, annot=True, fmt='d',
            annot_kws={"fontsize": 8}, cbar=True)
ax[2].set_yticklabels(['Non fraud', 'Fraud'], fontsize=10)
ax[2].set_xticklabels(['Non fraud PRED', 'Fraud PRED'], fontsize=10)
ax[2].set_title('Decision Tree Classifier\nConfusion Matrix', fontsize=10)

plt.show()

Conclusion

Dealing with imbalanced datasets is very common in our daily work, and it's important to analyze them carefully. A wrong analysis can lead to bad decisions, which in some cases can be very harmful.

This analysis was a simple one, just to demonstrate a way to handle an imbalanced dataset. It wasn’t a full tutorial, but I think it can be taken as a starting point.

Now that you’ve got the power of data analysis, please remember, “With great power comes great responsibility.”
