Credit Card Fraud Detection - Data Exploration
This is the first post in a series on credit card fraud detection. The goal of this project is to explore different classification models and evaluate their performance on an imbalanced dataset. Along with implementing classification models, I also wanted to explore some of the methods used to handle class imbalance. In this post, I mainly explore the dataset using visualization tools.
We hear and read about credit card fraud and identity theft every other day. Recently, I received a call from a fraudster. Unaware, I was almost duped. Thankfully, I realized something was awfully wrong with the voice and the tone of the caller. Plus, he asked me to pay my 'fines' using Walmart gift cards. Really?! I managed to escape, but not without divulging some information about myself. What if he uses that information to hack into my bank accounts? He could commit credit card fraud, perhaps by using my card information on online shopping websites.
Credit card fraud can be unnoticeable to the human eye. It is easy to pretend to be someone else while using their card. In my experience, only at shopping centres has my ID been checked against my credit card. Everywhere else I could be anyone but the cardholder. All of the online websites I have used require me to enter just my card information and zip code (how easy is that once you have the card information) instead of two-step verification (like a verification code sent through a text message). I might be missing something here in terms of online card transaction security, and any information on this would be great. My point being, it's not that difficult to get someone else's card information and use it for different purposes. So, how do banks and credit card companies keep us safe from credit card fraud? By using historical data of all the transactions! Fraudulent transactions may have a pattern: the card being used in different locations, huge withdrawals, and many transactions in small amounts to avoid suspicion are just some of the indications.
I came across Kaggle's dataset on Credit Card Fraud Detection and decided to dive into this problem. The dataset contains transactions made by European cardholders in September 2013. I want to explore some of the classification methods that could be used to solve this problem. The biggest challenge is the class imbalance: only 0.172% of all transactions in this dataset are fraudulent. This post explores the dataset using data visualization.
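As a quick sanity check, that imbalance figure can be reproduced from the class counts reported on the Kaggle dataset page (492 frauds out of 284,807 transactions):

```python
# Fraud rate from the class counts on the Kaggle dataset page
n_fraud = 492      # fraudulent transactions
n_total = 284807   # all transactions

fraud_rate = n_fraud / n_total
print("Fraud rate: {:.3%}".format(fraud_rate))  # the ~0.172% quoted above
```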
import os
os.chdir('D:/ML_Projects/CreditCardFraud/') #Set working directory
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import matplotlib.ticker as ticker
import itertools
import datetime
from collections import Counter
from imblearn.over_sampling import RandomOverSampler, SMOTE
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.patches as mpatches
sns.set_style('whitegrid')
%matplotlib inline
Data Description
- The dataset contains transactions made by credit cards in September 2013 by European cardholders. The transactions occurred over two days.
- Features V1, V2, ... V28 are the principal components obtained with PCA.
- Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset.
- The feature 'Amount' is the transaction amount.
- Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.
Read and Explore the Dataset
# Read credit card fraud data
cc_fraud = pd.read_csv('./Data/creditcard.csv')
print("Number of instances: %d" % cc_fraud.shape[0])
print("Number of features: %d" % cc_fraud.shape[1])
cc_fraud.head()
cc_fraud.dtypes
Any missing values?
# Percentage of missing values per column
na_perc = cc_fraud.isnull().mean() * 100
na_perc
What is the linear correlation between the features?
# Feature correlations - ideally should be uncorrelated since they are different components from PCA
corrmat = cc_fraud.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, square = True, cmap = 'YlGnBu')
plt.title('Correlation heatmap for the features', fontsize=20)
The linear correlation between the PCA features is close to 0, as expected for principal components.
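This matches what PCA guarantees: principal component scores are mutually uncorrelated by construction. A minimal sketch on synthetic data (using scikit-learn's PCA, with made-up dimensions rather than the real dataset) illustrates this:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data -- illustrative dimensions, not the real dataset
rng = np.random.RandomState(0)
X = rng.randn(1000, 5) @ rng.randn(5, 5)  # mix columns to induce correlation

# Project onto all principal components
scores = PCA(n_components=5).fit_transform(X)

# Off-diagonal correlations between component scores are ~0
corr = np.corrcoef(scores, rowvar=False)
off_diag = corr[~np.eye(5, dtype=bool)]
print(np.abs(off_diag).max())  # close to machine precision
```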
What is the distribution of the class variable?
You would expect the number of fraudulent transactions to be significantly lower than the number of legitimate ones. Let's find out whether the world is filled with good people or fraudsters.
# Class distribution
map_dict = {0: "Valid", 1: "Fraud"}
cc_fraud['Class'] = cc_fraud['Class'].map(map_dict)

ncount = len(cc_fraud)
plt.figure()
sns.set_context({"figure.figsize": (10, 8)})
g = sns.countplot(x='Class', data=cc_fraud)
plt.title("Class distribution", fontsize=22)
g.set_xlabel('Class', fontsize=15)
g.set_ylabel('Frequency [%]', fontsize=15, labelpad=25)
g.axes.grid(False)

# Twin axis: percentages on the left, raw counts on the right
g2 = g.twinx()
g2.yaxis.tick_left()
g.yaxis.tick_right()

# Annotate each bar with its percentage of all transactions
for p in g.patches:
    x = p.get_bbox().get_points()[:, 0]
    y = p.get_bbox().get_points()[1, 1]
    g.annotate('{:.2f}%'.format(100. * y / ncount), (x.mean(), y),
               ha='center', va='bottom')

g.yaxis.set_major_locator(ticker.LinearLocator(11))
# Fix the frequency range to 0-100 with a tick spacing of 10
g2.set_ylim(0, 100)
g.set_ylim(0, ncount)
g2.yaxis.set_major_locator(ticker.MultipleLocator(10))
g2.set_ylabel('Count', fontsize=15, labelpad=25)
g2.axes.grid(False)
plt.tight_layout()
Phew! The world is still good.
Distribution of transaction amount by transaction class
Do fraudsters make multiple transactions of small amounts, or are they foolish enough to spend a large amount in one transaction? Let's find out!
# Analysis of fraud by the total amount spent
plt.figure()
plt.yscale('log')
sns.set_context({"figure.figsize": (10, 8)})
g = sns.boxplot(data = cc_fraud, x = 'Class', y = 'Amount')
plt.title("Distribution of Transaction Amount", fontsize=25)
plt.xlabel('Class', fontsize=20); plt.xticks(fontsize=15)
plt.ylabel('Amount', fontsize=20); plt.yticks(fontsize=15)
# Transactions less than 2K
plt.figure()
sns.set_context({"figure.figsize": (10, 8)})
g = sns.boxplot(data=cc_fraud.loc[cc_fraud.Amount <= 2000,:], x = 'Class', y = 'Amount')
plt.title("Transactions amounting to <= 2K", fontsize=25)
plt.xlabel('Class', fontsize=20); plt.xticks(fontsize=15)
plt.ylabel('Amount', fontsize=20); plt.yticks(fontsize=15)
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(16, 6))
ax1.hist(cc_fraud.Amount[cc_fraud.Class == 'Fraud'], bins=30, color='red')
ax1.set_title('Fraudulent Transactions', fontsize=20)
ax1.set_yscale('log')
ax2.hist(cc_fraud.Amount[cc_fraud.Class == 'Valid'], bins=30, color='green')
ax2.set_title('Valid Transactions', fontsize=20)
ax2.set_yscale('log')
plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.show()
cc_fraud.loc[cc_fraud.Amount==0,:].Class.value_counts()*100/cc_fraud.shape[0]
Of all the transactions, 0.63% are valid transactions with a zero amount, whereas 0.009% are fraudulent transactions with a zero amount. Is it possible to have fraudulent transactions with no transaction amount? You may argue that a thief would actually take some money, so these rows could be data-entry errors. However, zero-amount transactions do occur in the real world, for example when a merchant verifies that a card is active.
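The percentage-by-class computation in the cell above can be sketched on a toy frame (hypothetical values, not the real data):

```python
import pandas as pd

# Toy frame mimicking the (Amount, Class) columns -- illustrative values only
df = pd.DataFrame({
    'Amount': [0.0, 12.5, 0.0, 99.0, 0.0],
    'Class':  ['Valid', 'Valid', 'Fraud', 'Valid', 'Valid'],
})

# Share of zero-amount rows per class, as a percentage of ALL rows
zero_share = df.loc[df.Amount == 0, 'Class'].value_counts() * 100 / len(df)
print(zero_share)  # Valid 40.0, Fraud 20.0
```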
How do transactions vary by the hour of the day?
# Convert elapsed seconds (from the first transaction) to a datetime
def convert_time(sec):
    return datetime.datetime.fromtimestamp(sec)

cc_fraud_time = cc_fraud[['Time', 'Amount', 'Class']].copy()
cc_fraud_time['time'] = cc_fraud_time.Time.apply(convert_time)

# Shift from local time back to UTC so the hour reflects elapsed time only
timeDelta = datetime.datetime.utcnow() - datetime.datetime.now()
cc_fraud_time['hour'] = cc_fraud_time.time + timeDelta
cc_fraud_time['hour'] = cc_fraud_time.hour.dt.hour
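Since 'Time' is just the seconds elapsed since the first transaction, the same hour-of-day bucket can also be obtained with plain integer arithmetic, avoiding the timezone round-trip (a sketch, assuming the first transaction defines hour 0):

```python
import numpy as np

# Elapsed seconds from the first transaction -- illustrative values
seconds = np.array([0, 3599, 3600, 86399, 86400, 90000])

# Hour-of-day bucket: whole hours elapsed, wrapped at 24
hour = (seconds // 3600) % 24
print(hour)  # [ 0  0  1 23  0  1]
```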
plt.figure()
sns.set_context({"figure.figsize": (10, 8)})
g = sns.countplot(data=cc_fraud_time[cc_fraud_time.Class == 'Valid'], x='hour',
                  color='green', saturation=0.6)
plt.title("Valid transactions by hour of the day", fontsize=25)
plt.xlabel('Hour of the day', fontsize=20); plt.xticks(fontsize=15)
plt.ylabel('Count', fontsize=20); plt.yticks(fontsize=15)
g.axes.grid(False)

plt.figure()
sns.set_context({"figure.figsize": (10, 8)})
g = sns.countplot(data=cc_fraud_time[cc_fraud_time.Class == 'Fraud'], x='hour',
                  color='red', saturation=0.6)
plt.title("Fraudulent transactions by hour of the day", fontsize=25)
plt.xlabel('Hour of the day', fontsize=20); plt.xticks(fontsize=15)
plt.ylabel('Count', fontsize=20); plt.yticks(fontsize=15)
g.axes.grid(False)
Distribution of the PCA features by transaction class
# Reference - https://www.kaggle.com/currie32/d/dalpozz/creditcardfraud/predicting-fraud-with-tensorflow
pca_features = cc_fraud.columns[1:29]
plt.figure(figsize=(16, 28 * 4))
gs = gridspec.GridSpec(28, 1)
for i, col in enumerate(cc_fraud[pca_features]):
    ax = plt.subplot(gs[i])
    sns.distplot(cc_fraud[col][cc_fraud.Class == 'Valid'], bins=50,
                 label='Valid Transaction', color='green')
    sns.distplot(cc_fraud[col][cc_fraud.Class == 'Fraud'], bins=50,
                 label='Fraudulent Transaction', color='red')
    ax.set_xlabel('')
    ax.set_title('Histogram of feature: ' + str(col), fontsize=15)
    plt.legend(loc='best', fontsize=12)
plt.show()
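Eyeballing which features separate the two classes can be complemented with a simple quantitative measure, e.g. the two-sample Kolmogorov-Smirnov statistic per feature. A sketch on synthetic stand-in data (`scipy.stats.ks_2samp` compares the two empirical distributions; the distributions and sizes below are made up for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.RandomState(0)

# Synthetic stand-ins for one PCA feature under each class -- illustrative only
valid = rng.normal(loc=0.0, scale=1.0, size=5000)  # bulk of transactions
fraud = rng.normal(loc=2.0, scale=1.5, size=100)   # shifted minority class

# KS statistic is in [0, 1]: larger means the distributions differ more
stat, p_value = ks_2samp(valid, fraud)
print(round(stat, 2))
```

Ranking the 28 PCA features by this statistic would point to the ones worth keeping for a classifier.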