Yelp Visualization
One's ability to make decisions depends largely on the opinions of others with similar experiences. In today's era of the internet and information, it has become easier to find people with the experiences you are looking for, and services like Yelp play an important role in making such information readily available. The reviews shared by service users are valuable for both business owners and prospective customers. A review consists of a text description, a star rating, reviewer information, a business description for various categories (as defined by Yelp), etc. People can also vote on user reviews if they find them useful, funny or cool. The goal is to classify sentiment using the enormous volume of review text and predict the success or failure of a business. Here, we plan to conduct a sentiment analysis of the text of reviews received by food businesses in Charlotte. The idea is to find attributes that result in high ratings and thus suggest improvements to certain services in order to attract more customers. Some of the questions we are interested in include: How well can we guess a review's rating from its text alone? What are the most common positive and negative words used in reviews? Can we extract tips from reviews? Is it possible to predict the closure of a business based on the reviews it received?
In this first phase, we focus on visualizations (graphs, plots and maps) to explore the data in a way that is useful for further analysis. We look at how location, keywords and attributes affect the success of a business. By focusing on keywords from the reviews, their sentiment and other significant effects of a business's attributes, we can model the success or failure of a business. This idea can be extended to analyse the reviews and attributes of a business, predict its success or failure, and suggest ways for businesses to improve.
We look at Charlotte food businesses from the following perspectives:
- How does location affect a business?
- What keywords define the location?
- Which attributes define a location?
We intend to measure the effectiveness of the model by its classification accuracy on Yelp's historical data. Based on the model, we can then identify the important features of a successful food business in Charlotte.
Data Description
The user review data has been publicly released by Yelp for its dataset challenge (https://www.yelp.com/dataset_challenge). The released data consists of 2.7M reviews and 649K tips by 687K users for 86K businesses. However, to simplify our approach, we only consider review data for food businesses in the city of Charlotte. We use three major categories of data, namely Business, Review and User. These broad categories consist of attributes of a similar class, the details of which are listed below.
To run this notebook, please set your data directory path in the following cell. To download the data, follow this Dropbox link: https://www.dropbox.com/sh/pmvxzxq430h5b12/AAAQqN6ykeqqJ00LyNvxzlCza?dl=0
The data as provided by Yelp is stored in JSON format. The following json_to_csv converter is used to convert the files to CSV.
Json-to-CSV converter: https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py
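For reference, the conversion amounts to reading each JSON-lines file and flattening its nested fields. The following is a minimal sketch of that idea for a single file (assuming the raw Yelp file names; the official converter script above remains the recommended route):
# Minimal sketch: flatten one Yelp JSON-lines file into a CSV.
# pandas >= 1.0 exposes json_normalize at the top level; older versions have it under pandas.io.json.
import json
import pandas as pd

with open('yelp_academic_dataset_business.json', encoding='utf-8') as f:
    records = [json.loads(line) for line in f]   # one JSON object per line
flat = pd.json_normalize(records, sep='.')       # nested keys become e.g. 'attributes.Delivery'
flat.to_csv('yelp_academic_dataset_business.csv', index=False)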
The description of the attributes used for the analysis is given in the following table -
import pandas as pd

# Set your data directory path here
data_dir = './Data/'
pd.read_excel('Data description.xlsx')
Data Quality - Comments
The review data is a complex dataset across multiple dimensions. The attributes we use are mostly those with at least 70% of their values present, so that statistics computed on them are meaningful. The only feature for which missing values are not allowed is the business ID. The decision to remove such rows was taken for the following reasons (a one-line sketch of this filter follows the list):
- They constitute less than 1% of the total restaurants and are statistically insignificant.
- These business IDs cannot be recovered by any other means, which makes merging the datasets practically impossible.
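A minimal sketch of that filter, assuming the businesses have been loaded into a frame named business_df as in the cells further below:
# Keep only rows whose business_id is present; rows with a missing ID cannot be
# joined to the review, user or check-in tables.
business_df = business_df.dropna(subset=['business_id'])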
Other data-quality treatments are noted as and when they are implemented below.
# Import required modules
from __future__ import division
import os
os.chdir("D:/ML_Projects/Yelp/")  # change this to your local project directory
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import numpy as np
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
import plotly
import plotly.plotly as py
plotly.tools.set_credentials_file(username='', api_key='')  # fill in your Plotly credentials to publish the interactive plots
import plotly.tools as tls
from plotly.graph_objs import *
%matplotlib inline
Analysis of Business Dataset
Read the Yelp business dataset and sub-sample it for Charlotte food businesses. We group the food businesses using the following categories defined by Yelp:
- Food
- Restaurants
- Nightlife
business_df = pd.read_csv("./Data/yelp_academic_dataset_business.csv")
charlotte_business = business_df.loc[business_df['city'] == 'Charlotte']
food_charlotte_business = pd.DataFrame()
food_categories = ['Food','Restaurants','Nightlife']
for index,row in charlotte_business.iterrows():
if any(category in row.categories for category in food_categories):
food_charlotte_business = food_charlotte_business.append(row)
del business_df,charlotte_business
food_charlotte_business.head()
print (food_charlotte_business.info())
print (food_charlotte_business.dtypes)
Map 1 and 0 to True and False for required attribute columns
# Convert selected columns to boolean type
col_dict_bool = {'attributes.Accepts Credit Cards':np.bool,'attributes.Accepts Insurance':np.bool,
'attributes.Ambience.casual':np.bool,'attributes.Ambience.classy':np.bool,'attributes.Ambience.divey':np.bool,
'attributes.Ambience.hipster':np.bool,'attributes.Ambience.intimate':np.bool,
'attributes.Ambience.romantic':np.bool,'attributes.Ambience.touristy':np.bool,
'attributes.Ambience.trendy':np.bool, 'attributes.Ambience.upscale':np.bool,'attributes.BYOB':np.bool,
'attributes.By Appointment Only':np.bool,
'attributes.Caters':np.bool,'attributes.Coat Check':np.bool,'attributes.Corkage':np.bool,
'attributes.Delivery':np.bool,'attributes.Dietary Restrictions.dairy-free':np.bool,
'attributes.Dietary Restrictions.gluten-free':np.bool,
'attributes.Dietary Restrictions.halal':np.bool,'attributes.Dietary Restrictions.kosher':np.bool,
'attributes.Dietary Restrictions.soy-free':np.bool,'attributes.Dietary Restrictions.vegan':np.bool,
'attributes.Dietary Restrictions.vegetarian':np.bool,'attributes.Dogs Allowed':np.bool,
'attributes.Drive-Thru':np.bool,'attributes.Good For Dancing':np.bool,'attributes.Good For Groups':np.bool,
'attributes.Good For.breakfast':np.bool,'attributes.Good For.brunch':np.bool,
'attributes.Good For.dessert':np.bool,'attributes.Good For.dinner':np.bool,
'attributes.Good For.latenight':np.bool,'attributes.Good For.lunch':np.bool,'attributes.Good for Kids':np.bool,
'attributes.Happy Hour':np.bool,'attributes.Has TV':np.bool,'attributes.Music.background_music':np.bool,
'attributes.Music.dj':np.bool,'attributes.Music.jukebox':np.bool,
'attributes.Music.karaoke':np.bool,'attributes.Music.live':np.bool,
'attributes.Music.video':np.bool,'attributes.Open 24 Hours':np.bool,'attributes.Order at Counter':np.bool,
'attributes.Outdoor Seating':np.bool,'attributes.Parking.garage':np.bool,'attributes.Parking.lot':np.bool,
'attributes.Parking.street':np.bool,'attributes.Parking.valet':np.bool,'attributes.Parking.validated':np.bool,
'attributes.Take-out':np.bool,'attributes.Takes Reservations':np.bool,'attributes.Waiter Service':np.bool,
'attributes.Wheelchair Accessible' :np.bool, 'open':np.bool
}
map_dict = {0:False, 1:True}
# Mapping
for col in food_charlotte_business.columns:
if col in col_dict_bool.keys():
food_charlotte_business[col] = food_charlotte_business[col].map(map_dict)
#Drop columns that have only NaN
food_charlotte_business = food_charlotte_business.dropna(axis = 1,how = 'all')
print (food_charlotte_business.shape)
food_charlotte_business.describe()
Let's see the number of missing values in the business dataset
na_perc = {}
for col in food_charlotte_business.columns:
na_perc[col] = food_charlotte_business[col].isnull().sum()*100/len(food_charlotte_business)
na_perc
As can be seen, quite a few columns have more than 70% missing values. We drop these columns for now (70% is an arbitrary cut-off; it would be difficult to interpret results from such sparse data, and imputation may not be the ideal way to handle it). Going forward, missing values (NaNs) are ignored in the current analysis.
for key,value in na_perc.items():
if value > 70:
food_charlotte_business.drop(key,axis = 1,inplace=True)
food_charlotte_business.shape
We will check for duplicates based on the business id.
print (len(food_charlotte_business.business_id.unique()))
food_charlotte_business.business_id.loc[food_charlotte_business.business_id.duplicated()]
The above business IDs are undefined (missing) in the original dataset, so we drop these rows.
food_charlotte_business.drop_duplicates(subset = ['business_id'],keep = False,inplace = True)
print (len(food_charlotte_business))
print (len(food_charlotte_business.business_id.unique()))
Frequency of open and closed businesses
sns.set(font_scale = 2)
sns.set_context({"figure.figsize": (10, 8)})
sns.set_style("whitegrid")
g = sns.countplot(x = 'open', data = food_charlotte_business)
plt.title("Count plot for Open Restaurants")
plt.xlabel('Open Restaurants')
plt.ylabel('Count')
g.axes.grid('off')
In the above plot, we see that almost 20% of the reviewed restaurants are listed as no longer operational. This may be because they shut down due to lack of business or because of changes in operating conditions (e.g., a new establishment that restricted the restaurant's operation). We will further analyse the reasons for such failures, working under the assumption that the major reason for shutting down is low revenue.
Star rating distribution by status of business
#violin plot for open and closed stars
sns.set(font_scale = 1)
sns.set_context({"figure.figsize": (10, 8)})
sns.set_style("whitegrid")
g = sns.violinplot(x='type',y = 'stars',hue='open', data = food_charlotte_business,
split=True,inner="quart")
plt.title("Distribution of Star Ratings vs the status of the business")
plt.xlabel('Restaurants')
plt.ylabel('Star Ratings')
g.axes.grid('off')
plt.ylim(0, 6)
Contrary to intuition, this plot shows that the median star rating is nearly identical for closed and open restaurants, but the distribution for open restaurants appears to contain more highly rated restaurants, as evidenced by the bimodal humps. Clearly, star rating is not the only factor that leads to the closure of a restaurant. One could also examine how the rating distribution evolved over time, which may reveal further patterns.
neighborhoods_1 = list()
neighborhoods_2 = list()
for item in food_charlotte_business.neighborhoods:
ngh = item.strip('[]')
ngh = ngh.strip().split(',')
if (len(ngh) == 1) and ngh[0] != "":
#print ("Only one neighborhood")
neighborhoods_1.append(ngh[0].strip("'"))
neighborhoods_2.append(np.nan)
elif (len(ngh) == 2):
#print ("Two neighboorhoods")
neighborhoods_1.append(ngh[0].strip("'"))
neighborhoods_2.append(ngh[1].strip("'"))
else:
#print ("No Neighborhood")
neighborhoods_1.append(np.nan)
neighborhoods_2.append(np.nan)
food_charlotte_business['neighborhoods_1'] = pd.Series(neighborhoods_1,index = food_charlotte_business.index)
food_charlotte_business['neighborhoods_2'] = pd.Series(neighborhoods_2,index = food_charlotte_business.index)
food_charlotte_business[['neighborhoods','neighborhoods_1','neighborhoods_2']].head()
We split the neighbourhoods into primary and secondary neighbourhoods.
# Neighborhood not defined
print (food_charlotte_business.neighborhoods_1.isnull().sum()/len(food_charlotte_business)*100)
print (food_charlotte_business.neighborhoods_1.unique())
print (food_charlotte_business.neighborhoods_2.unique())
As shown above, 27.80% of the businesses have no neighbourhood defined. We ignore these in the following analysis of the importance of location, since we cannot infer their neighbourhood yet. We assume that the available 73% of the data is representative.
What differentiates the locations in Charlotte?
Review count by neighbourhood
from IPython.display import HTML
group_neighborhood = food_charlotte_business.groupby('neighborhoods_1')
open_food=food_charlotte_business[food_charlotte_business['open']]
closed_food=food_charlotte_business[food_charlotte_business['open']==False]
group_open= open_food.groupby(by='neighborhoods_1')
group_close=closed_food.groupby(by='neighborhoods_1')
review_count_open = group_open.review_count.sum().sort_values(ascending = False)
review_count_close=group_close.review_count.sum().sort_values(ascending = False)
trace1 = Bar(
x=review_count_open.index,
y=review_count_open.values,
#marker = dict(color='rgb(158,202,225)'),
name="open"
)
trace2=Bar(
x=review_count_close.index,
y=review_count_close.values,
#marker = dict(color='rgb(158,202,225)'),
name="closed"
)
trace=[trace1,trace2]
layout = Layout(
title='Review Count by Neighborhood',
titlefont=dict(size=18),
showlegend=False,
width=1000,
height=750,
hovermode='closest',
xaxis=XAxis(showgrid=False, zeroline=False, showticklabels=True, title='Neighborhood'),
yaxis=YAxis(showgrid=False, zeroline=False, showticklabels=True,title='Review Count'),
barmode='stack'
)
fig = Figure(data = trace, layout = layout)
plot_url = py.plot(fig, filename = 'ReviewCount-Neighborhood', auto_open=False)
HTML(tls.get_embed(plot_url, height=1000))
This stacked bar chart shows the review counts by Charlotte neighborhood. Reviews for closed restaurants are a minority in every neighborhood, but they make up a substantial fraction of the total for smaller neighborhoods such as the Third and Fourth Wards. Review volume is a good proxy for visits to a restaurant, and thus gives useful information about the distribution of customer count over time. It is possible that the customer count dropped suddenly, indicating an immediate change in operating conditions. We examine this further by considering the star rating distribution for each neighborhood.
Star rating distribution by neighbourhood
#source = https://plot.ly/python/box-plots/
trace = []
for key,group in group_neighborhood:
trace.append(Box(y = group.stars,
name = key,
jitter=0.5,
whiskerwidth=0.2,
#fillcolor=colors,
fillcolor='Viridis',
marker=dict(
size=2,
),
line=dict(width=1),))
layout = Layout(
title='Star rating distribution by neighbourhood',
titlefont=dict(size=18),
width = 1000,
height = 750,
hovermode='closest',
yaxis=dict(
autorange=True,
showgrid=False,
zeroline=True
),
#paper_bgcolor='rgb(243, 243, 243)',
#plot_bgcolor='rgb(243, 243, 243)',
showlegend=False
)
fig = Figure(data=trace, layout=layout)
plot_url = py.plot(fig, filename = 'StarRatingDistribution_Neighborhood', auto_open=False)
HTML(tls.get_embed(plot_url, height=1000))
These box plots show the distribution of ratings by neighborhood and how they compare to each other. From them we can gauge the popularity and success of businesses within each neighborhood. NoDa and Elizabeth have a higher median and a higher overall rating distribution, marking them as popular places, whereas the food businesses in Derita and Paw Creek have received noticeably lower ratings.
Quail Hollow has only one restaurant, with 7 reviews.
food_charlotte_business.loc[food_charlotte_business['neighborhoods_1'] == 'Quail Hollow']
Attributes defining different neighbourhoods
import warnings
warnings.filterwarnings("ignore")
col_list = []
for name in food_charlotte_business.columns:
if 'attributes' in name and name in col_dict_bool.keys():
col_list.append(name)
col_list.append('stars')
col_list.append('neighborhoods_1')
print ("Total features defining the attributes of a business")
print (len(col_list))
food_charlotte_business_att=food_charlotte_business[col_list]
food_charlotte_business_att.fillna(0,inplace = True)
map_dict = {False:0, True:1}
# Mapping
for col in food_charlotte_business_att.columns:
if col in col_dict_bool.keys():
food_charlotte_business_att[col] = food_charlotte_business_att[col].map(map_dict)
group_att = food_charlotte_business_att.groupby(by='neighborhoods_1')
avg_by_neighborhood = (group_att.sum()/group_att.count())/((group_att.sum()/group_att.count()).max())
avg_by_neighborhood['overall']=avg_by_neighborhood.sum(axis=1)
avg_by_neighborhood['overall']=avg_by_neighborhood['overall']/avg_by_neighborhood['overall'].max()
data = [Heatmap(
x = avg_by_neighborhood.columns,
y= avg_by_neighborhood.index,
z = avg_by_neighborhood.values.tolist(),
colorscale = 'Viridis'
)
]
layout = Layout(
title='Neighborhood Popularity defined by the attributes',
titlefont=dict(size=18),
width = 950,
height = 800,
hovermode='closest',
yaxis=dict(
autorange=True,
showgrid=False,
zeroline=True
),
#paper_bgcolor='rgb(243, 243, 243)',
#plot_bgcolor='rgb(243, 243, 243)',
showlegend=False
)
fig = Figure(data = data,layout = layout)
plot_url = py.plot(fig, filename = 'NeighborhoodPopularity_Attributes', auto_open=False)
HTML(tls.get_embed(plot_url, height=800))
This heat map shows the density of each attribute within a neighborhood relative to the neighborhood with the maximum density for that attribute. Brighter hues indicate a higher density, and each column represents a distinct attribute. Rows with many bright hues indicate that the neighborhood has a relatively high density across multiple attributes. We also created an overall score that combines the densities across all attributes and the star rating into a composite density. This heat map could help somebody looking for a neighborhood with many different attributes, or they could focus on the attributes relevant to their own situation.
# Function to get attributes boxplot
def get_attr_boxplot(df,plot_title,xaxis_title, yaxis_title,plot_fig = True):
trace = []
for key,group in df:
trace.append(Box(y = group.stars,
name = key,
boxpoints = 'all',
whiskerwidth=0.2,
#fillcolor=colors,
fillcolor='Viridis',
marker=dict(
size=4,),
boxmean = 'sd',
line=dict(width=2),))
layout = Layout(
title=plot_title,
titlefont=dict(size=18),
width = 750,
height = 500,
hovermode='closest',
yaxis=dict(
autorange = False,
showgrid=False,
zeroline=True,
showline = True,
range = [0,6],
title = yaxis_title),
xaxis = dict(
title = xaxis_title,
zeroline = True,
showline = True),
#paper_bgcolor='rgb(243, 243, 243)',
#plot_bgcolor='rgb(243, 243, 243)',
showlegend=True)
fig = Figure(data=trace, layout=layout)
    if plot_fig:
        plot_url = py.plot(fig, filename = plot_title, auto_open=False)
        return tls.get_embed(plot_url, height=1000)
    # If plotting is disabled, return the figure object instead
    return fig
Credit Card Violin Plot
f, ax = plt.subplots(figsize=(9, 9))
sns.violinplot(x='attributes.Accepts Credit Cards',y='stars',hue='open',
data=food_charlotte_business,split=True,inner="quart")
plt.ylim(0,6)
food_charlotte_business.groupby('attributes.Accepts Credit Cards').describe()
This violin plot compares the distribution of ratings for restaurants based on whether or not they accept credit cards; the hue indicates whether they are still in operation. The distributions for restaurants that accept credit cards are very similar for open and closed restaurants, which could be due to the much larger sample size. Interestingly, the median star rating for closed restaurants that don't accept credit cards is higher than that of open restaurants that don't accept credit cards. In addition, the median rating is higher for restaurants that do not accept cards, which could reflect small, owner-run restaurants that are well-managed operations.
Star Rating By Ambience
# How does the ambience of a food business affect star rating?
trace = []
for col_name in food_charlotte_business.columns:
if 'attributes.Ambience' in col_name:
#food_charlotte_business_attr[col_name] = food_charlotte_business_attr[col_name].astype(bool)
df = food_charlotte_business.groupby(col_name)
for key,group in df:
if key == True:
trace.append(Box(
y = group.stars,
name = col_name,
marker=dict(size=4,),
boxmean = 'sd',
line=dict(width=2),
))
layout = Layout(
title='Star Rating by Ambience',
titlefont=dict(size=18),
width = 1000,
height = 750,
hovermode='closest',
yaxis=dict(
autorange=False,
range = [0,6],
showgrid=False,
zeroline=True,
title = 'Star Rating',
showline = True),
boxmode = 'group',
boxgroupgap=0.5,
xaxis = dict(
title = 'Attributes',
zeroline = True,
showline = True),
#paper_bgcolor='rgb(243, 243, 243)',
#plot_bgcolor='rgb(243, 243, 243)',
showlegend=True,
)
fig = Figure(data=trace, layout=layout)
plot_url = py.plot(fig, filename = 'RatingAmbience', auto_open=False)
HTML(tls.get_embed(plot_url,height=800))
Star Rating by Delivery
df = food_charlotte_business.groupby('attributes.Delivery')
from IPython.display import display
display(HTML(get_attr_boxplot(df, plot_title = "How does delivery service affect star rating?",
                              xaxis_title = 'Delivery', yaxis_title = 'Star Rating')))
df.stars.describe()
The effect of delivery on ratings appears minimal. This is probably because delivery services (such as Uber Eats) can turn nearly any restaurant into a delivery restaurant, making this attribute less useful than others for comparison.
Star Rating by Price Range
f, ax = plt.subplots(figsize=(9, 9))
sns.violinplot(x='attributes.Price Range',y='stars',hue='open',
data=food_charlotte_business,split=True,inner="quart")
plt.ylim(0,6)
food_charlotte_business.groupby('attributes.Price Range').describe()
The violin plot suggests that price does not seem to affect the star rating until the highest price bracket is reached, where the median star rating is higher for businesses that are still operational. The median for out-of-business restaurants in this bracket is lower, suggesting that overpriced restaurants have a real problem staying in business.
Analysis of Reviews Dataset
Read the Yelp reviews dataset and sub-sample it based on the business IDs in the dataframe 'food_charlotte_business'.
reviews = pd.read_csv('./Data/yelp_academic_dataset_review.csv')
food_charlotte_reviews = reviews.loc[reviews['business_id'].isin(food_charlotte_business['business_id'])]
del reviews
food_charlotte_reviews.head()
print (food_charlotte_reviews.info())
print (food_charlotte_reviews.dtypes)
print (food_charlotte_reviews.shape)
The reviews dataset has 114343 entries and 10 columns.
# Change date to date type
food_charlotte_reviews.date = pd.to_datetime(food_charlotte_reviews['date'])
food_charlotte_reviews.date.dtype
Merge the reviews dataset with the business dataset on business id's.
food_charlotte_reviews_merge = food_charlotte_reviews.merge(food_charlotte_business,how = 'inner',on= 'business_id')
food_charlotte_reviews_merge.date = pd.to_datetime(food_charlotte_reviews_merge['date'])
food_charlotte_reviews_merge.rename(index = str,columns = {'stars_y':'avg_rating','stars_x':'review_rating'},inplace=True)
food_charlotte_reviews_merge.head()
Unique Words Analysis
Get all the reviews for each of the neighbourhood
# Combine all the reviews for a given neighbourhood.
from collections import defaultdict
group_neighborhood = food_charlotte_reviews_merge.groupby('neighborhoods_1')
review_text_dict = defaultdict(list)
for key,group in group_neighborhood:
temp_text = [x for x in group.text.tolist()]
review_text_dict[key].append(temp_text)
review_text_dict[key][0][0:5]
#Reference
#http://www.nltk.org/book/
#http://stanford.edu/~rjweiss/public_html/IRiSS2013/text2/notebooks/cleaningtext.html
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
snowball = SnowballStemmer('english')
def clean_text(text_list):
    clean_review = []
    for text in text_list:
        # Keep only alphabetic characters (drop punctuation, digits, newlines)
        letters_only = re.sub("[^a-zA-Z]", " ", text)
        # Stem each word individually; stemming the whole review string at once
        # would only lowercase it and stem the final token
        stemmed = " ".join(snowball.stem(word) for word in letters_only.split())
        clean_review.append(stemmed)
    return clean_review
The above function cleans the review text by removing everything except alphabetic characters and stems each word using the Snowball stemmer from the NLTK module.
review_text = {}
for key,value in review_text_dict.items():
clean_review = clean_text(value[0])
review_text[key] = clean_review
#print (key)
review_text[key][0:5]
# Create bag of words model where the feature is made of a word, stop words are removed
from sklearn.feature_extraction.text import CountVectorizer
bag_of_words = defaultdict()
count_vect_dict = defaultdict()
for key,value in review_text.items():
count_vect = CountVectorizer(stop_words = 'english',
analyzer = 'word') # an object capable of counting words in a document!
# count_vect.fit(summary_text)
# count_vect.transform(summary_text)
#print (len(value))
bag_words = count_vect.fit_transform(value)
bag_of_words[key] = bag_words
count_vect_dict[key] = count_vect
The above code creates a bag-of-words model for each neighbourhood, with English stop words removed. The bag-of-words model gives us the frequency of every word in the vocabulary.
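As a toy illustration of what the vectorizer produces (a made-up two-review corpus, not drawn from the dataset):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy example: bag-of-words counts for two made-up reviews.
toy_reviews = ["the tacos were great great", "terrible service but great beer"]
toy_vect = CountVectorizer(stop_words='english')
toy_counts = toy_vect.fit_transform(toy_reviews)
# Rows = reviews, columns = vocabulary words, cells = raw counts
# (newer scikit-learn versions rename get_feature_names to get_feature_names_out)
print(pd.DataFrame(toy_counts.toarray(), columns=toy_vect.get_feature_names()))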
dist = {}
for key,value in bag_of_words.items():
feature_array = value.toarray()
print ("Neighbourhood")
print (key)
print ("Vocabulary Count")
print (len(count_vect_dict[key].vocabulary_))
# Sum up the counts of each vocabulary word
#dist = np.sum(feature_array, axis=0)
df = pd.DataFrame(data=feature_array,columns=count_vect_dict[key].get_feature_names())
dist[key] = df.sum().sort_values()[-300:] # Select the top 300 words by count
#print (max(dist))
print ('-----------------------------------')
Create a dataframe from the dictionary with index as words (from the corpus of top 300 words for each neighbourhood) and columns as the neighbourhood.
dist_df = pd.DataFrame.from_dict(dist)
print (dist_df.shape)
dist_df.head()
Find out unique words for each neighbourhood.
unique_words = defaultdict(list)
unique_words_df = pd.DataFrame()
for index,row in dist_df.iterrows():
count_not_null = row.notnull().sum()
if count_not_null == 1:
unique_words[index].append(row[row.notnull()].to_dict())
#d = row[row.notnull()].to_frame(name = idx)
#unique_words_df = unique_words_df.append(d,ignore_index=True)
len(unique_words)
There are a total of 523 unique words across all the neighbourhoods.
unique_words_df = []
word = []
for key,value in unique_words.items():
word.append(key)
unique_words_df.append(pd.DataFrame.from_dict(value[0],orient='index'))
unique_words_df = pd.concat(unique_words_df)
unique_words_df['unique_word'] = word
unique_words_df.head()
# Rearrange the dataframe
unique_words_df.reset_index(level=0, inplace=True)
unique_words_df.rename(index = str,columns={"index": "neighborhoods", 0: "word_count"},inplace=True)
unique_words_df.head()
unique_words_group = unique_words_df.groupby('neighborhoods')
data = []
buttons = list([
dict(
        args=['visible', list(np.repeat(True, len(unique_words_group.groups.keys())))],
label='All',
method='restyle'
)])
keys_sorted = sorted(unique_words_group.groups.keys())
for key,group in unique_words_group:
key_bool = []
for item in keys_sorted:
if key == item:
key_bool.append(True)
else:
key_bool.append(False)
trace = Bar(y = group.word_count,
x = group.unique_word,
name = key)
data.append(trace)
b = dict(args=['visible', key_bool],
label=key,
method='restyle')
buttons.append(b)
layout = Layout(
title='Unique words (from top 300 words) by Neighbourhood',
width = 750,
height = 1000,
updatemenus=list([
dict(
x=-0.05,
y=2,
yanchor='top',
buttons=buttons)]))
fig = Figure(data = data,layout = layout)
plot_url = py.plot(fig, filename = 'UniqueWords', auto_open=False)
HTML(tls.get_embed(plot_url, height=1000))
Please use the drop down menu to change/select the neighbourhood
The above plots show the unique words for each neighbourhood and their counts. These unique words help characterize a neighbourhood. For example:
- For 'NoDa', we can speculate that it is 'smelly'. We can also speculate that there are a lot of places for crepes and that reviewers are talking about ales and 'mahi' fish.
- Dilworth looks like a good neighbourhood for donuts. Reviewers also specifically mention the Chipotle and Starbucks in this area.
- South Park is famous for cowfish!
- Villa Heights has one of the famous French bakeries, 'Amelie', and reviews from that neighbourhood prove this point. It looks like a neighbourhood where you get excellent baked goods along with coffee, and it also seems to have places to study. This neighbourhood seems to be known for its cafés.
- Arboretum is famous for movie places and gelato!
- Eastland is famous for Korean, Thai and other eastern foods!
- Highland Creek is famous for biryanis!
In this way, based on people's reviews, we can find out what makes a neighbourhood famous. Future businesses can select their location based on the type of services they offer. They can also find out what is missing in a neighbourhood and create a business opportunity.
Votes for reviews
f,(ax1, ax2,ax3) = plt.subplots(3,figsize = (15,15))
sns.set(font_scale = 2)
f.suptitle("Frequency plots for review votes")
sns.set_style("whitegrid")
sns.countplot(x = 'votes.useful', data = food_charlotte_reviews,ax = ax1)
sns.countplot(x = 'votes.cool', data = food_charlotte_reviews,ax = ax2)
sns.countplot(x = 'votes.funny', data = food_charlotte_reviews,ax = ax3)
ax1.grid(False)
ax2.grid(False)
ax3.grid(False)
Reviews on Yelp can be voted 'useful', 'cool' or 'funny'. From the above plots, we can see that use of this feature is limited. However, we can still use it to aid in determining the sentiment of a review.
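One possible way to fold these votes into a later sentiment model (a sketch of an idea only, not something implemented in this notebook) is to treat 'useful' votes as sample weights, so that widely endorsed reviews count more when fitting a text-to-rating classifier:
# Sketch: weight each review by 1 + its 'useful' vote count.
sample_weight = 1 + food_charlotte_reviews['votes.useful'].fillna(0)
# A hypothetical classifier fit could then be:
#   model.fit(X_text_features, y_review_ratings, sample_weight=sample_weight)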
Monthly average star rating for 'Good', 'Medium' and 'Bad' rated restaurants
Compute the range of the days available in the dataset
food_charlotte_reviews_merge.date.describe()
ts_sub_reviews = food_charlotte_reviews_merge[['date','business_id','avg_rating','review_rating','open']]
Create bins to describe 'Good', 'Medium' and 'Bad' rating as follows:
- 1-2.33 : "Bad" rating
- 2.33-3.67 : "Medium" rating
- 3.67-5 : "Good" rating
bins = np.linspace(ts_sub_reviews.avg_rating.min(), ts_sub_reviews.avg_rating.max(), 4)
ts_sub_reviews['rating_bin'] = pd.cut(ts_sub_reviews.avg_rating,bins,right = True,
labels = ['Bad_Rating','Medium_Rating','Good_Rating'],include_lowest=True)
bins
ts_sub_reviews = ts_sub_reviews.sort_values('date')
ts_sub_reviews.head()
Create a series of dates from the start date and end date as defined in the dataset
date_range = pd.date_range('2004-12-19','2016-07-19', freq='D')
date_range
# Group data by their rating bin and whether they are open or closed
ts_group = ts_sub_reviews.groupby(['rating_bin','open'])
ts_group.groups.keys()
Compute the daily average of review ratings for each rating bin. If a given day is not present (i.e., no Yelp user activity on that day), we carry forward the previous day's numbers, i.e., the ratings stay constant if there is no update.
# Compute the average review rating for each bin created above
ts_temp = pd.DataFrame()
frames = []
data = pd.Series()
for key,value in ts_group:
for d in date_range:
x = value[value['date'] == d]
if x.empty == False:
data = pd.Series({'date':d,'rating_bin':key[0],
'open' : key[1],'avg_review_rating' : x.review_rating.mean()})
elif data.empty:
data = pd.Series({'date':d,'rating_bin':key[0],'open':key[1],'avg_review_rating':0})
else:
data = pd.Series({'date':d,'rating_bin':key[0],
'open' : key[1],'avg_review_rating' : data['avg_review_rating']})
ts_temp = ts_temp.append(data,ignore_index = True)
df = ts_temp.set_index('rating_bin','open')
frames.append(df)
ts_temp = pd.DataFrame()
data = pd.Series()
# Get the required format of the data for plotting
business_closed = pd.DataFrame()
business_open = pd.DataFrame()
for df in frames:
if df.open.iloc[0] == 0:
business_closed = business_closed.append(df)
else:
business_open = business_open.append(df)
business_closed['rating_bin'] = business_closed.index
business_closed = business_closed.set_index('date')
business_open['rating_bin'] = business_open.index
business_open = business_open.set_index('date')
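The same daily forward-fill could be written more compactly with pandas. The following is a sketch only (it is not used by the cells below), relying on the ts_sub_reviews frame and date_range defined above:
# Alternative sketch of the daily forward-fill using reindex + ffill.
# Average review rating per (rating_bin, open) group and day, reindex onto the
# full date range, and carry the previous day's value forward where no reviews exist.
daily = (ts_sub_reviews
         .groupby(['rating_bin', 'open', 'date'])
         .review_rating.mean())
filled = (daily.unstack(['rating_bin', 'open'])   # one column per (bin, open) pair
               .reindex(date_range)               # insert days with no activity
               .ffill()                           # ratings stay constant with no update
               .fillna(0))                        # days before the first review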
Compute the monthly average of the daily average review ratings.
df_group = business_closed.groupby('rating_bin').resample('M').mean()
df_group = df_group.reset_index()
df_group = df_group.groupby('rating_bin')
traces = []
for key,group in df_group:
trace = Scatter(
x = group.date,
y = group.avg_review_rating,
mode = 'lines+markers',
name = key,
marker = dict(size = 5))
traces.append(trace)
layout = Layout(
title='Star Rating Trend',
titlefont=dict(size=18),
showlegend=True,
width=1000,
height=750,
hovermode='closest',
xaxis=XAxis(showgrid=False, zeroline=False, showticklabels=True, title='Date'),
yaxis=YAxis(showgrid=False, zeroline=False, showticklabels=True,title='Monthly Average Rating',range = [0,6])
)
fig = Figure(data = traces, layout = layout)
plot_url = py.plot(fig, filename = 'RatingTrend', auto_open=False)
HTML(tls.get_embed(plot_url, height=800))
The above graph shows the monthly average star rating trend for "Good", "Medium" and "Bad" rated businesses that have closed down. It helps us understand whether ratings drop towards the closure of a business. As can be seen, for "Good" rated businesses the monthly average varies less as time progresses (as more reviews accumulate). Although we can see a drop in ratings towards the end, there is a sudden jump, perhaps implying that the reason for closure is something else. Several reasons could account for this difference; one is that highly rated restaurants have a different operating style than mid-rated ones and shut down abruptly after realizing major losses, whereas mid-rated restaurants tend to suffer losses for longer and leave the business after a sustained period of poor performance. The low-rated restaurants follow the expected trend and hardly ever achieve high ratings throughout their operation.
The assumption behind these observations is that all businesses (i.e., coffee houses, restaurants, bars, etc.) with a similar overall average rating behave similarly over time. Even though this does not hold strictly, it gives an overall picture of how the rating trend may, on average, accompany or force the closure of a business. Businesses can then focus on their Yelp ratings and perhaps also look into their social marketing strategy.
User Data Analysis
user = pd.read_csv('./Data/yelp_academic_dataset_user.csv')
food_charlotte_user = user.loc[user['user_id'].isin(food_charlotte_reviews['user_id'])]
del user
food_charlotte_user.head()
print (food_charlotte_user.info())
print (food_charlotte_user.dtypes)
print (food_charlotte_user.shape)
Correlation matrix for users and their attributes
yelp_user = food_charlotte_user[['review_count','average_stars','votes.useful','fans']]
# plot the correlation matrix using seaborn
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings
sns.set(style="darkgrid") # one of the many styles to plot using
f, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(yelp_user.corr(), cmap=cmap, annot=True)
f.tight_layout()
The correlation matrix shows the correlations between the review count of each user, the average star rating given by that user, the number of 'useful' votes their reviews received, and the number of fans following that user. As can be seen, there are strong correlations between review count, useful review votes (showing the encouragement a reviewer receives for writing reviews) and number of fans. This is expected, as more people will follow a reviewer whose reviews are useful. However, there is almost no correlation between a reviewer's fan following and the average star rating they give to businesses.
elite_bool = []
for item in food_charlotte_user.elite:
if item == '[]':
elite_bool.append(False)
else:
elite_bool.append(True)
food_charlotte_user['elite_bool'] = elite_bool
food_charlotte_user.elite_bool.value_counts()
As expected, about 91% of the users are non-elite and the rest are elite users.
Distribution of average star ratings for elite v/s non-elite users
#violin plot for star rating for elite vs non-elite users
sns.set(font_scale = 1)
sns.set_context({"figure.figsize": (12, 5)})
sns.set_style("whitegrid")
g = sns.violinplot(x='elite_bool',y = 'average_stars', data = food_charlotte_user,inner="quart")
plt.title("Distribution of Star Ratings vs the status of the business")
plt.xlabel('Elite User')
plt.ylabel('Star Ratings')
g.axes.grid('off')
plt.ylim(0, 6)
Total review count by elite status of the user
sns.set_context({"figure.figsize": (12, 5)})
df = food_charlotte_user.groupby('elite_bool')
df.review_count.sum().plot(kind = 'bar')
plt.title('Review Count by eliteness of the user')
As expected, elite users have written more reviews than the non-elite users
Check-in Data Analysis
checkin = pd.read_csv('./Data/yelp_academic_dataset_checkin.csv')
food_charlotte_checkin = checkin.loc[checkin['business_id'].isin(food_charlotte_business['business_id'])]
del checkin
food_charlotte_checkin.head()
Column name format: checkin_info.9-4 means the number of check-ins from 9am-10am on Thursday, where the trailing digit is the day code (a small parsing sketch follows the list):
- 0 - Sunday
- 1 - Monday
- 2 - Tuesday
- 3 - Wednesday
- 4 - Thursday
- 5 - Friday
- 6 - Saturday
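For reference, a small (hypothetical) helper that decodes a column name using the day codes above, mirroring the parsing done in the cell further below:
# Hypothetical helper: split a check-in column name into (hour, weekday).
DAY_CODES = {'0': 'Sunday', '1': 'Monday', '2': 'Tuesday', '3': 'Wednesday',
             '4': 'Thursday', '5': 'Friday', '6': 'Saturday'}

def parse_checkin_column(col_name):
    hour, day_code = col_name.split('info.')[1].split('-')
    return int(hour), DAY_CODES[day_code]

parse_checkin_column('checkin_info.9-4')   # -> (9, 'Thursday')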
print (food_charlotte_checkin.info())
food_charlotte_checkin_merge = food_charlotte_checkin.merge(food_charlotte_business,how = 'inner',on= 'business_id')
food_charlotte_checkin_merge.head()
map_days = {'0':'Sunday',
'1':'Monday',
'2':'Tuesday',
'3':'Wednesday',
'4':'Thursday',
'5':'Friday',
'6':'Saturday'}
# checkin_neighbourhood = defaultdict(list)
checkin_neighbourhood = []
for col_name in food_charlotte_checkin_merge.columns:
if 'checkin' in col_name:
temp = col_name.split('info.')[1].split('-')
hr = temp[0]
day = temp[1]
for row in food_charlotte_checkin_merge.iterrows():
row = row[1]
if 5<=int(hr)<11:
section_day = 'Morning'
elif 11<=int(hr)<16:
section_day = 'Afternoon'
else:
section_day = 'Evening'
checkin_neighbourhood.append([row.neighborhoods_1,map_days[day],section_day,row[col_name]])
checkin_neighbourhood = pd.DataFrame(checkin_neighbourhood,
columns=['Neighbourhood','Weekday','Section_Day','Checkin_count'])
checkin_neighbourhood.head()
checkin_grouped = checkin_neighbourhood.groupby(['Neighbourhood','Weekday','Section_Day'])
checkin_neighbourhood_agg = checkin_grouped.aggregate(np.mean).groupby(level = 0)
trace_data = []
buttons = list([
dict(
args=['visible', list(np.repeat(True,len(checkin_neighbourhood_agg.groups.keys())))],
label='All',
method='restyle'
)])
keys_sorted = sorted(checkin_neighbourhood_agg.groups.keys())
for ng, new_df in checkin_neighbourhood_agg:
key_bool = []
for item in keys_sorted:
if ng == item:
key_bool.append(True)
else:
key_bool.append(False)
new_df.reset_index(inplace = True)
for key,group in new_df.groupby('Section_Day'):
trace = Bar(x = group.Weekday,
y = group.Checkin_count)
trace_data.append(trace)
b = dict(args=['visible', key_bool],
label=ng,
method='restyle')
buttons.append(b)
layout = Layout(
title='Average Check-in for a neighborhood',
width = 750,
height = 1000,
barmode = 'group',
bargap=0.15,
showlegend = False,
updatemenus=list([
dict(
x=-0.05,
y=2,
yanchor='top',
buttons=buttons)]))
fig = Figure(data = trace_data,layout = layout)
plot_url = py.plot(fig, filename = 'CheckIn', auto_open=False)
HTML(tls.get_embed(plot_url,height=1000))
Further Analysis
To enhance the results of our prediction and classification models, it would be desirable to have data on the revenue generated by each restaurant (or a related proxy). The closing date of a restaurant is also significant, so similar data would be useful. Other features that could have been added to the dataset include tables turned per night, revenue generation by hour, capacity, etc. Features that can be extracted from the existing attributes for our analysis include the day of the week of a check-in (using the check-in code), the time of day (e.g., 6 pm to 10 pm can be termed dinner time and 10 pm to 2 am late night) and the core sentiment of the review text.
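As an illustration, a sketch of two of these derived features is given below; the meal-period boundaries follow the buckets mentioned above, and the sentiment scorer (NLTK's VADER) is only one possible choice rather than a committed design decision.
# Sketch of two proposed derived features: meal period from the check-in hour,
# and a simple sentiment score for review text.
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')

def meal_period(hour):
    # Map an hour of day (0-23) to a coarse meal period, per the buckets above.
    if 18 <= hour < 22:
        return 'dinner'
    if hour >= 22 or hour < 2:
        return 'late night'
    return 'other'

sia = SentimentIntensityAnalyzer()

def review_sentiment(text):
    # Compound score in [-1, 1]; higher means more positive sentiment.
    return sia.polarity_scores(text)['compound']

review_sentiment("The tacos were amazing and the staff was friendly!")  # close to +1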