k-means clustering
Supervised learning is the type of machine learning that draws inferences from datasets containing labeled training data. For example, when training a model to classify healthy and unhealthy patients, the dataset includes information about each patient along with a label marking them as healthy or unhealthy.
Unsupervised learning is the type of machine learning that draws inferences from datasets with no labeled information. Unsupervised learning algorithms try to find inherent patterns in the data. For example, suppose you want to classify photos into types such as cars, animals, and buildings without any labels available. The algorithm 'learns' about the images during training and, based on what it has learned, tries to sort new images into categories. Unsupervised learning is achieved using one of the following approaches, depending on the type of problem you are trying to solve:
- Clustering
- Neural Networks
- Anomaly detection techniques
- Expectation-maximization algorithms
- Blind signal separation algorithms
- Method of Moments
In this notebook, I discuss and implement the k-means clustering algorithm to cluster the Iris dataset (http://archive.ics.uci.edu/ml/datasets/Iris?ref=datanews.io).
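As a minimal sketch of what the notebook covers, here is k-means applied to Iris using scikit-learn (the notebook itself may implement the algorithm from scratch; this just illustrates the setup):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the Iris dataset: 150 samples, 4 features, 3 species
iris = load_iris()
X = iris.data

# Fit k-means with k=3, matching the number of species
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# One cluster assignment per sample, one 4-dimensional centroid per cluster
print(labels[:10])
print(kmeans.cluster_centers_.shape)
```

Since the species labels are never shown to the algorithm, the cluster indices are arbitrary; comparing them against `iris.target` requires matching clusters to species after the fact.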
Yelp Visualization
Our ability to make decisions depends largely on the opinions of others with similar experiences. In today's era of internet and information, it has become easier to find people with the experiences you are looking for, and services like Yelp play an important role in making such information readily available. The reviews shared by users are valuable to both business owners and prospective customers. Each review consists of a text description, a star rating, reviewer information, and a business description for the categories defined by Yelp. People can also vote on reviews they find useful, funny, or cool. The goal is to classify sentiment in this enormous body of review text and predict the success or failure of a business. Here, we plan to conduct a sentiment analysis of the text of reviews received by food businesses in Charlotte. The idea is to find the attributes that lead to high ratings and thus suggest improvements to certain services in order to attract more customers. Some of the questions we are interested in include: how well can we guess a review's rating from its text alone? What are the most common positive and negative words used in reviews? Can we extract tips from reviews? Is it possible to predict the closure of a business from the reviews it receives?
Here in the first phase, we focus on visualizations (graphs, plots, and maps) to explore the data in a way that is useful for further analysis. We look at how location, keywords, and attributes affect the success of a business. By focusing on keywords from the reviews, their sentiment, and other significant effects of a business's attributes, we can model its success or failure. This idea can be extended to analyze a business's reviews and attributes, predict its success or failure, and further suggest improvements that help businesses succeed.
We look at Charlotte food businesses from the following perspectives:
- How does location affect a business?
- What keywords define the location?
- Which attributes define a location?
We intend to measure the effectiveness of the model by its classification accuracy on Yelp's historical data. Based on the model, we can then identify the important features of a successful food business in Charlotte.
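To make the "guess a review's rating from its text alone" question concrete, here is a minimal sketch of a text classifier: TF-IDF features feeding a logistic regression model. The reviews and labels below are invented toy data standing in for the Yelp corpus, not actual Yelp reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reviews standing in for Yelp text; 1 = positive sentiment, 0 = negative
reviews = [
    "great food and friendly staff",
    "terrible service and cold food",
    "amazing atmosphere, will come back",
    "worst experience, rude waiter",
]
ratings = [1, 0, 1, 0]

# Pipeline: turn raw text into TF-IDF vectors, then classify
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, ratings)

pred = model.predict(["friendly staff and great atmosphere"])
print(pred)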
Credit Card Fraud Detection - Logistic Regression and ADASYN
This is a continuation of the credit card fraud detection - data visualization post. I now build a machine learning model using Adaptive Synthetic Sampling (ADASYN) to detect credit card fraud.
I came across Kaggle's dataset on Credit Card Fraud Detection and decided to dive into this problem. This dataset includes transactions completed by European cardholders in September 2013. I want to explore some of the classification methods that could be used to solve this problem. The biggest challenge of this problem is the class imbalance - only 0.172% of all transactions in this dataset are fraudulent. The goal of this project is to start with a simple yet powerful model like logistic regression. Along with implementing logistic regression, I also wanted to explore some of the methods used to handle class imbalance. In this post, I use Adaptive Synthetic Sampling (ADASYN), which is discussed further in this post.
Credit Card Fraud Detection - Data Exploration
This is the first post for Credit Card Fraud Detection. The goal of this project is to explore different classification models and evaluate their performance on an imbalanced dataset. Along with implementing classification models, I also wanted to explore some of the methods used to handle class imbalance. In this post, I mainly explore the dataset using visualization tools.
We hear and read about credit card fraud and identity theft every other day. Recently, I received a call from a fraudster. Unaware, I was almost duped. Thankfully, I realized something was awfully wrong with the voice and tone of the caller. Plus, he asked me to pay my 'fines' using Walmart gift cards. Really?! I managed to escape, but not without divulging some information about myself. What if he uses that information to hack into my bank accounts? He could commit credit card fraud, perhaps by using my card information on online shopping websites.
Credit card fraud can go unnoticed by the human eye. It is easy to impersonate someone while using their card. In my experience, only at shopping centres has my ID been checked against my credit card. Everywhere else, I could be anyone but the cardholder. All of the online websites I have used require me to enter just my card information and zip code (how easy is that once you have the card details?) instead of two-step verification, such as a code sent by text message. I might be missing something here about online card transaction security, and any information on this would be great. My point is that it's not that difficult to get someone else's card information and use it for different purposes. So how do banks and credit card companies keep us safe from credit card fraud? By using historical data on all the transactions! Fraudulent transactions may follow a pattern: a card used in scattered locations, huge withdrawals, and many small transactions made to avoid suspicion are just some of the indications.
I came across Kaggle's dataset on Credit Card Fraud Detection and decided to dive into this problem. This dataset includes transactions completed by European cardholders in September 2013. I want to explore some of the classification methods that could be used to solve this problem. The biggest challenge of this problem is the class imbalance - only 0.172% of all transactions in this dataset are fraudulent. This post explores the dataset using data visualization.
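As a first visualization of that imbalance, a bar chart on a log scale makes the tiny fraud class visible at all. The counts below (284,315 genuine vs. 492 fraudulent) are the published class counts for this Kaggle dataset, consistent with the 0.172% fraud rate:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Class counts for the Kaggle credit card fraud dataset
counts = {"genuine": 284315, "fraud": 492}

fig, ax = plt.subplots()
ax.bar(counts.keys(), counts.values(), color=["steelblue", "crimson"])
ax.set_yscale("log")  # on a linear scale, the fraud bar is invisible
ax.set_ylabel("number of transactions")
ax.set_title("Class imbalance in the credit card dataset")
fig.savefig("class_imbalance.png")
```

This single plot already explains why plain accuracy is a useless metric here: predicting "genuine" for everything scores 99.8%.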
plotly example
Plotly is one of my favourite tools for data visualization. When I started writing this blog, plotly graphs did not embed in the correct format. This is a simple post on how to embed plotly graphs in a blog or webpage using a Jupyter notebook.