{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This post continues the credit card fraud detection data visualization post. Here I build a machine learning model that uses Adaptive Synthetic Sampling to detect credit card fraud.\n", "\n", "I came across Kaggle's dataset on [Credit Card Fraud Detection](https://www.kaggle.com/dalpozz/creditcardfraud) and decided to dive into this problem. The dataset contains transactions made by European cardholders in September 2013. I want to explore some of the classification methods that could be used to solve this problem. The biggest challenge is the class imbalance: only 0.172% of all transactions in this dataset are fraudulent. The goal of this project is to start with a simple yet powerful model like [Logistic Regression](https://onlinecourses.science.psu.edu/stat504/node/149). Along with implementing logistic regression, I also wanted to explore some of the methods used to handle class imbalance. In this post, I use [Adaptive Synthetic Sampling (ADASYN)](http://ieeexplore.ieee.org/document/4633969/?part=1), which I discuss in more detail below.\n", "" ] }, { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.chdir('D:/ML_Projects/CreditCardFraud/')  # Set working directory\n", "import pandas as pd\n", "import numpy as np\n", "\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import itertools\n", "import datetime\n", "from collections import Counter\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, validation_curve, learning_curve, GridSearchCV\n", "from sklearn.preprocessing import StandardScaler\n", "import sklearn.metrics as mt\n", "from itertools import cycle\n", "\n", "from numpy import interp  # replaces the deprecated scipy.interp\n", "\n", "from imblearn.over_sampling import ADASYN\n", "\n", "from mpl_toolkits.mplot3d import Axes3D\n", "import matplotlib.patches as mpatches\n", "\n", "sns.set_style('whitegrid')\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of instances: 284807\n", "Number of features: 31\n" ] }, { "data": { "text/html": [ "\n", "
| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |

5 rows × 31 columns
\n", "\n", "
| | Accuracy | F1_Score | Recall | LogLoss | AUPRC |
|---|---|---|---|---|---|
| 0 | 0.775137 | 0.756745 | 0.699508 | 7.766552 | 0.878713 |
| 1 | 0.778038 | 0.760043 | 0.703024 | 7.666345 | 0.880371 |
| 2 | 0.777042 | 0.758798 | 0.701383 | 7.700760 | 0.880482 |
| 3 | 0.776104 | 0.758762 | 0.704196 | 7.733151 | 0.878395 |
| 4 | 0.777188 | 0.759361 | 0.703083 | 7.695699 | 0.880638 |
| 5 | 0.779826 | 0.762974 | 0.708709 | 7.604603 | 0.882200 |
| 6 | 0.776954 | 0.758174 | 0.699273 | 7.703795 | 0.879221 |
| 7 | 0.781438 | 0.763673 | 0.706248 | 7.548930 | 0.882277 |
| 8 | 0.781086 | 0.764710 | 0.711464 | 7.561079 | 0.884576 |
| 9 | 0.779445 | 0.761700 | 0.704958 | 7.617760 | 0.882295 |