Project Python Foundations: FoodHub Data Analysis¶

Context¶

The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.

The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.

Objective¶

The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company to improve the business.

Data Description¶

The data contains the different data related to a food order. The detailed data dictionary is given below.

Data Dictionary¶

  • order_id: Unique ID of the order
  • customer_id: ID of the customer who ordered the food
  • restaurant_name: Name of the restaurant
  • cuisine_type: Cuisine ordered by the customer
  • cost_of_the_order: Cost of the order
  • day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
  • rating: Rating given by the customer out of 5
  • food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant's order confirmation and the delivery person's pick-up confirmation.
  • delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person's pick-up confirmation and drop-off information

Let us start by importing the required libraries¶

In [2]:
# Installing the libraries with the specified version.
#!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 -q --user

!pip install numpy pandas matplotlib seaborn -q --user

"""
Ran into an error as shown below; hence, removed the versions and executed the command per instruction provided by the evaluator in the FoodHub Forum.

  AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
  [end of output]
"""
Out[2]:
"\nRan into an error as shown below; hence, removed the versions and executed the command per instruction provided by the evaluator in the FoodHub Forum.\n\n  AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?\n  [end of output]\n"

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [3]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

Understanding the structure of the data¶

In [76]:
# I used Anaconda/Jupyter Notebook and didn't use Google Colab for this project completion. So, the below statements are commented.
# from google.colab import drive
# drive.mount('/content/drive')
In [4]:
# Read the dataset from the csv file
foodHub_df = pd.read_csv('foodhub_order.csv')
In [5]:
# View the first 5 rows from the DataFrame to ensure the csv file was loaded properly into the DataFrame.
foodHub_df.head()
Out[5]:
order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time
0 1477147 337525 Hangawi Korean 30.75 Weekend Not given 25 20
1 1477685 358141 Blue Ribbon Sushi Izakaya Japanese 12.08 Weekend Not given 25 23
2 1477070 66393 Cafe Habana Mexican 12.23 Weekday 5 23 28
3 1477334 106968 Blue Ribbon Fried Chicken American 29.20 Weekend 3 25 15
4 1478249 76942 Dirty Bird to Go American 11.59 Weekday 4 25 24

Observation¶

  • The dataset (foodhub_order.csv) has been read and loaded into foodHub_df dataframe successfully.
In [6]:
# Check for whether duplicate records exist in the dataset.
foodHub_df.duplicated().sum()
Out[6]:
0

Observation¶

  • No duplicate records exist in the dataset.

Question 1: How many rows and columns are present in the data? [0.5 mark]¶

In [7]:
# Write your code here
foodHub_df.shape
Out[7]:
(1898, 9)

Observations:¶

  • There are 1,898 rows and 9 columns present in the FoodHub Order dataset.

Question 2: What are the datatypes of the different columns in the dataset? (The info() function can be used) [0.5 mark]¶

In [8]:
# Write your code here
foodHub_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   int64  
 1   customer_id            1898 non-null   int64  
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1898 non-null   object 
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 133.6+ KB

Observations:¶

There are 9 columns in the dataset with three different data types. The data types are:

  • object - 4 Columns (restaurant_name, cuisine_type, day_of_the_week, and rating)
  • float (64 bit) - 1 Column (cost_of_the_order)
  • integer (64 bit) - 4 Columns (order_id, customer_id, food_preparation_time, delivery_time)
  • It is also observed that all columns have valid values and there are NO missing values present in the dataset.

Question 3: Are there any missing values in the data? If yes, treat them using an appropriate method. [1 mark]¶

In [9]:
# Write your code here
foodHub_df.isnull().sum()
Out[9]:
order_id                 0
customer_id              0
restaurant_name          0
cuisine_type             0
cost_of_the_order        0
day_of_the_week          0
rating                   0
food_preparation_time    0
delivery_time            0
dtype: int64

Observation:¶

  • There are NO columns with null and/or missing values.
  • As we could see from the observation of "Question 2", there are no columns with null or missing values.

Question 4: Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed? [2 marks]¶

In [10]:
# Write your code here
foodHub_df.describe().T
Out[10]:
count mean std min 25% 50% 75% max
order_id 1898.0 1.477496e+06 548.049724 1476547.00 1477021.25 1477495.50 1.477970e+06 1478444.00
customer_id 1898.0 1.711685e+05 113698.139743 1311.00 77787.75 128600.00 2.705250e+05 405334.00
cost_of_the_order 1898.0 1.649885e+01 7.483812 4.47 12.08 14.14 2.229750e+01 35.41
food_preparation_time 1898.0 2.737197e+01 4.632481 20.00 23.00 27.00 3.100000e+01 35.00
delivery_time 1898.0 2.416175e+01 4.972637 15.00 20.00 25.00 2.800000e+01 33.00

Observations:¶

  • MINIMUM time it takes for food to be prepared once an order is placed: 20 Minutes
  • AVERAGE time it takes for food to be prepared once an order is placed: 27.37 Minutes
  • MAXIMUM time it takes for food to be prepared once an order is placed: 35 Minutes

Question 5: How many orders are not rated? [1 mark]¶

In [11]:
# To get the count of orders not rated.
foodHub_df['rating'].value_counts()['Not given']
Out[11]:
736
In [12]:
# To get a list of orders that weren't rated.
not_rated = foodHub_df[foodHub_df['rating'] == 'Not given'].reset_index(drop=True)
not_rated
Out[12]:
order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time
0 1477147 337525 Hangawi Korean 30.75 Weekend Not given 25 20
1 1477685 358141 Blue Ribbon Sushi Izakaya Japanese 12.08 Weekend Not given 25 23
2 1477894 157711 The Meatball Shop Italian 6.07 Weekend Not given 28 21
3 1477895 143926 Big Wong Restaurant Œ_¤¾Ñ¼ Chinese 5.92 Weekday Not given 34 28
4 1478198 62667 Lucky's Famous Burgers American 12.13 Weekday Not given 23 30
... ... ... ... ... ... ... ... ... ...
731 1476873 237616 Shake Shack American 5.82 Weekend Not given 26 30
732 1476981 138586 Shake Shack American 5.82 Weekend Not given 22 28
733 1477473 97838 Han Dynasty Chinese 29.15 Weekend Not given 29 21
734 1477819 35309 Blue Ribbon Sushi Japanese 25.22 Weekday Not given 31 24
735 1478056 120353 Blue Ribbon Sushi Japanese 19.45 Weekend Not given 28 24

736 rows × 9 columns

Observation:¶

  • There are 736 orders that are NOT rated by the customers.

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

Question 6: Explore all the variables and provide observations on their distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration.) [9 marks]¶

In [14]:
# To visualize distribution of orders among the different cuisine types.
plt.figure(figsize=(20,7))
sns.countplot(data=foodHub_df, x='cuisine_type')
plt.title('Cuisine Type Vs Number of Orders')
plt.xlabel('Cuisine Type')
plt.ylabel('No. of orders');
#plt.xticks(rotation=90)
No description has been provided for this image

Observations¶

  • The above chart shows the distrbution orders by cuisine type.
  • As observed from the countplot, the top 3 cuisine types with MOST orders placed by the customers are 1) American, 2) Japanese, and 3) Italian.
  • The bottom 3 cuisine types with LEAST orders placed by the customers are 1) Vietnamese, 2) Spanish, and 3) Korean.
In [15]:
plt.figure(figsize=(20,7))
sns.countplot(data=foodHub_df, x='cuisine_type', hue='day_of_the_week')
plt.title('Cuisine Type Vs Day of The Week')
plt.xlabel('Cuisine Type')
plt.ylabel('Day of The Week');
#plt.xticks(rotation=90)
No description has been provided for this image

Observations¶

  • For the cuisine types, the majority of the orders were placed over the weekends than weekdays.
  • Although there is no data to support my view, it's likely that the number of businessses operated in this area are minimal and these restaurants may be located close to residential areas. Need more data to support my view though.
In [16]:
#plt.figure(figsize=(20,7))
sns.boxplot(data=foodHub_df, x='food_preparation_time')
plt.xlabel('Time taken to prepare food in minutes');
#plt.ylabel('No. of orders');
#plt.xticks(rotation=90)
No description has been provided for this image

Observations¶

  • There are no outliers in terms of time taken to prepare food.
  • The minimum time taken to prepare food is about 20 minutes.
  • The median time taken to prepare food is about 27 minutes (50th percentile).
  • 75% of the orders were prepared under/about 31 minutes from receiving the order.
  • The maximum time taken to prepare food is about 35 minutes.

Additional information

Given that different types of cuisines take different time to prepare the food, the times noted under observation seem reasonable. Moreover, the dataset contains ONLY delivery orders and there is NO mention about whether the restaurant has dine-in facility in addition to door delivery and NO mention about the number of chefs in the kitchen, average time taken to prepare each dish/food item etc. If such data points are available, it would be beneficial to perform further analysis to generage more insights, which will help to make data-driven business decisions.

In [17]:
# Generate subplots to determine correlation between cuisine type and food preparation time.
g = sns.FacetGrid(foodHub_df, col="cuisine_type")
g.map(sns.histplot, "food_preparation_time")
plt.xlabel('Time taken to prepare food');
#plt.show()
No description has been provided for this image

Observations¶

  • As shown in the histogram subplots, the food prepration time for almost all restaurants take anywhere from 20 minutes to 35 minutes.
  • So, the mimimum, median and maximum times noted in the previous observation appear to be correct across all restaurants amd cuisine types.
In [19]:
# Visualize delivery time using box plot to determine minimum, 25th, 50th, 75th percentile and maximum time.
sns.boxplot(data=foodHub_df, x='delivery_time')
plt.title('Time to deliver food after preparation');
plt.xlabel('Delivery Time in minutes');
No description has been provided for this image

Observations¶

  • The minimum delivery time is about 15 minutes.
  • The median delivery time is about 25 minutes.
  • 75% of the orders are delivered under/about 28 minutes.
  • There are no outliers exist i.e. no delivery takes more than 33 minutes.

Note:

  • The problem statement/context does NOT state if there was any SLAs established for delivery times in the first place nor such SLAs are included in the dataset. If no benchmark/SLAs established, then it is hard to determine whether the delivery times fall within the defined timeframes or not.
  • The dataset does NOT contain distance between the restaurants and customer homes. So, a conclusion CANNOT be made with the given dataset whether the above mentioned delivery times are acceptable or not.
  • If the restaurants and customers are located in major cities like New York City, then the higher delivery times may be acceptable due to heavy traffic, population, visitor crowd etc.
  • Basically, there is no sufficient data to come to a conclusion from time taken for delivery.
In [20]:
# Visualize correlation between cuisine type and customer rating to detemine the highly rated cuisine types.
plt.figure(figsize=(20,7))
sns.countplot(data=foodHub_df, x='cuisine_type', hue='rating')
plt.title('Cuisine Type vs Number of Orders')
plt.xlabel('Cuisine Type')
plt.ylabel('No. of orders');
#plt.xticks(rotation=90)
No description has been provided for this image

Observations¶

  • For the top 3 cuisine types with most orders placed by customers, the majority of the customers did NOT write reviews as shown above. As shown, this is the scenario for most of the restaurants given in the dataset.
  • For the same 3 cuisine types, the majority of the customers wrote reviews have given a rating of 5 and followed by rating 4.

Additional Information

  • It is vital to understand why majority of the customers did NOT give rating for their delivery orders.
  • The restaurants can offer some incentives (like discount for future order, free stuff etc.) if the customers write a review, which would help to encourage customers to give rating.

Question 7: Which are the top 5 restaurants in terms of the number of orders received? [1 mark]¶

In [21]:
# To get a number of orders per restaurant. The default sort value is descending; so, no need to specify sort order explicitly.
orders_count = foodHub_df['restaurant_name'].value_counts()

#To get the top 5 restaurants in terms of number of orders received
orders_count.head()
Out[21]:
restaurant_name
Shake Shack                  219
The Meatball Shop            132
Blue Ribbon Sushi            119
Blue Ribbon Fried Chicken     96
Parm                          68
Name: count, dtype: int64

Observations:¶

  • Shake Shack restaurant received the most orders among the entire dataset population, and the total number of orders received was 219.
  • Parm restaurant received the least number of orders AMONG the top 5 restaurants in terms of the number of orders received, and the total number of orders received was 68.

Below is the list of top 5 restaurants and corresponding number of orders:

  1. Shake Shack - 219 orders
  2. The Meatball Shop - 132 orders
  3. Blue Ribbon Sushi - 119 orders
  4. Blue Ribbon Fried Chicken - 96 orders
  5. Parm - 68 orders

Question 8: Which is the most popular cuisine on weekends? [1 mark]¶

In [48]:
# First extract all rows of orders placed over the weekends
foodHub_df_weekend = foodHub_df[foodHub_df['day_of_the_week'].isin(['Weekend'])]
foodHub_df_weekend
Out[48]:
order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time total_time_required_to_deliver
0 1477147 337525 Hangawi Korean 30.75 Weekend Not given 25 20 45
1 1477685 358141 Blue Ribbon Sushi Izakaya Japanese 12.08 Weekend Not given 25 23 48
3 1477334 106968 Blue Ribbon Fried Chicken American 29.20 Weekend 3 25 15 40
6 1477894 157711 The Meatball Shop Italian 6.07 Weekend Not given 28 21 49
11 1478437 221206 Empanada Mama (closed) Mexican 8.10 Weekend 5 23 22 45
... ... ... ... ... ... ... ... ... ... ...
1891 1476981 138586 Shake Shack American 5.82 Weekend Not given 22 28 50
1892 1477473 97838 Han Dynasty Chinese 29.15 Weekend Not given 29 21 50
1893 1476701 292602 Chipotle Mexican Grill $1.99 Delivery Mexican 22.31 Weekend 5 31 17 48
1894 1477421 397537 The Smile American 12.18 Weekend 5 31 19 50
1897 1478056 120353 Blue Ribbon Sushi Japanese 19.45 Weekend Not given 28 24 52

1351 rows × 10 columns

In [24]:
# Second group the rows extracted in step 1 by cuisine type and sort the results in descending order of counts and fetch the first row.
foodHub_df_weekend.groupby(['cuisine_type'])['order_id'].count().sort_values(ascending=False).head(1)
Out[24]:
cuisine_type
American    415
Name: order_id, dtype: int64
In [25]:
""" 
Alternatively, the below code would fetch the output using one line statement. However, the one line statement could 
potentially add complexity in terms of reading, comprehending and maintaning the code as it depends on level of proficiency in Python.
So, a step-by-step process is recommeded to promote readability and maintanability.
"""
foodHub_df[foodHub_df['day_of_the_week'].isin(['Weekend'])].groupby(['cuisine_type'])['order_id'].count().sort_values(ascending=False).head(1)
Out[25]:
cuisine_type
American    415
Name: order_id, dtype: int64

Observation¶

  • "American" cuisine is the most popular cuisine on the weekends because the total number of orders placed with "American" restaurants on the weekends is 415 orders.

NOTE: The question states "most popular"; hence, made an assumption that the most popular means food "most consumed/bought" by the customer/order data provided in the given dataset.

Question 9: What percentage of the orders cost more than 20 dollars? [2 marks]¶

In [26]:
# get total number of order with order value of more than $20.00
foodHub_df_Over_20 = foodHub_df[foodHub_df['cost_of_the_order'] > 20.00].shape[0]

# Calculate % of orders cost more than 20 dollars by dividing count of orders with over 20 dollars by total number of orders.
foodHub_df_Over_pct = (foodHub_df_Over_20/foodHub_df.shape[0])* 100

# Limit the decimal places to 2 decimals.
foodHub_df_Over_pct = f"{foodHub_df_Over_pct:.2f}"

# Print the result
print('The percentage of orders cost more than 20 dollars is ',foodHub_df_Over_pct,'%.',sep='')
The percentage of orders cost more than 20 dollars is 29.24%.

Observation¶

  • The percentage of orders that costs more than 20 dollars is 29.24%.

Question 10: What is the mean order delivery time? [1 mark]¶

In [27]:
# Generate the statistical summary
foodHub_df.describe().T
Out[27]:
count mean std min 25% 50% 75% max
order_id 1898.0 1.477496e+06 548.049724 1476547.00 1477021.25 1477495.50 1.477970e+06 1478444.00
customer_id 1898.0 1.711685e+05 113698.139743 1311.00 77787.75 128600.00 2.705250e+05 405334.00
cost_of_the_order 1898.0 1.649885e+01 7.483812 4.47 12.08 14.14 2.229750e+01 35.41
food_preparation_time 1898.0 2.737197e+01 4.632481 20.00 23.00 27.00 3.100000e+01 35.00
delivery_time 1898.0 2.416175e+01 4.972637 15.00 20.00 25.00 2.800000e+01 33.00

Observation¶

  • The mean order delivery time is 24.16 minutes.

NOTE: Made an assumption here that the question is about "delivery_time"column statistics only and the question does NOT refer inclusion of "food_preparation_time".

Question 11: The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed. [1 mark]¶

In [28]:
# Get top 3 most frequent customers
foodHub_df['customer_id'].value_counts().head(3)
Out[28]:
customer_id
52832    13
47440    10
83287     9
Name: count, dtype: int64

Observation¶

  • The top 3 most frequent customers are 1) 52832 (13 orders), 2) 47440 (10 orders), and 3) 83287 (9 orders).

Multivariate Analysis¶

Question 12: Perform a multivariate analysis to explore relationships between the important variables in the dataset. (It is a good idea to explore relations between numerical variables as well as relations between numerical and categorical variables) [10 marks]¶

In [67]:
# Heatmap to visualize (postive or negative) correlation among cost of the order, food prepration time and delivery time.
sns.heatmap(foodHub_df[['cost_of_the_order','food_preparation_time','delivery_time']].corr(),annot=True,vmin=-1,vmax=1);
No description has been provided for this image

Observation¶

  • There is no statistically significant relationship/correlation (either positive or negative) among cost of the order, food prepration time and delivery time.
In [29]:
# Determine relationship between "Food Preparation Time and "Delivery Time" using line plot
plt.figure(figsize=(15,7))
sns.lineplot(data=foodHub_df, x='food_preparation_time', y ='delivery_time',errorbar=None)
plt.title('Food prepration Time Vs Delivery Time')
plt.xlabel('Food Preparation Time in minutes')
plt.ylabel('Delivery Time in minutes')
plt.show()
No description has been provided for this image

Observation¶

  • There is no statistically significant correlation between Food Prepration Time and Delivery Time, although the delivery time is relatively a bit higher when the Food Preparation Time is over 33 minutes; however, it could be due to the distance between the restaurant and customer home as well. Need more data to arrive at a conclusion.
In [31]:
# Determine relationship between "Delivery Time" and "Rating" using scatter plot
plt.figure(figsize=(15,7))
plt.title('Delivery Time Vs Customer Rating')
plt.xlabel('Delivery Time in minutes')
plt.ylabel('Customer Rating (0 - 5)')
sns.scatterplot(data=foodHub_df, x='delivery_time', y ='rating')
plt.show()
No description has been provided for this image

Observation¶

  • Based on the scatterplot between "Delivery Time" and "Rating", there is no siginificant relationship between these two variables i.e. statistically NOT significant.
In [32]:
# Determine relationship between "Delivery Time" and "Rating"
plt.figure(figsize=(15,7))
sns.scatterplot(data=foodHub_df, x='food_preparation_time', y ='cost_of_the_order')
plt.title('Food Preparation Time Vs Cost of The Order')
plt.xlabel('Food Preparation Time in minutes')
plt.ylabel('Cost of The Order')
plt.show()
No description has been provided for this image

Observations¶

  • There is NO statistically significant correlation between Cost of The Order and Food Prepration Time.
  • Need more data to analyze and arrive at a conclusion.
In [37]:
# Determine relationship/correlation between "Food Preparation Time" and "Cost of The Order". Using hex to visualize better.
plt.figure(figsize=(15,7))
sns.jointplot(data=foodHub_df, x='food_preparation_time', y ='cost_of_the_order', kind='hex')
plt.title('Food Preparation Time Vs Cost of The Order')
plt.xlabel('Food Preparation Time in minutes')
plt.ylabel('Cost of The Order in USD')
plt.show()
<Figure size 1500x700 with 0 Axes>
No description has been provided for this image

Observations¶

  • The cost of order for most of the orders are between USD 11 and USD 17 and the food is prepared under 27 minutes (area where concentration of hex with dark colors).
  • Other than the above observation, there is no statistically significant correlation between Cost of The Order and Food Prepration Time.
  • Need more data to analyze and arrive at a conclusion.
In [38]:
# Boxplot to visalize correlation between Delivery Time and Day of The Week
plt.title('Delivery Time Vs Day of The Week')
plt.xlabel('Delivery Time in minutes')
plt.ylabel('Day of The Week')
sns.boxplot(data=foodHub_df, x='delivery_time', y ='day_of_the_week');
No description has been provided for this image

Observations¶

  • As shown above, the time it takes to deliver food during the weekday is much higher than the weekend.
  • During weekdays, 75% of the orders are delivered under/about 31 minutes while 50% of the orders are delivered under/about 28 minutes.
  • During weekends, 75% of the orders are delivered under/about 27 minutes while 50% of the orders are delivered under/about 22 minutes.
  • There is no sufficient data to prove my view, but my view is that the road traffic might be higher during weekdays (i.e. business days) and could be less during weekends (non-business days). Moreover, the dataset contains data for orders placed for (early) dinner or breakfast, where the road traffic is more during peak hours.
In [39]:
# Determine relationship between "Delivery Time" and "Rating"
plt.figure(figsize=(15,7))
sns.scatterplot(data=foodHub_df, x='delivery_time', y ='rating', hue='cuisine_type')
plt.title('Delivery Time Vs Customer Rating')
plt.xlabel('Delivery Time in minutes')
plt.ylabel('Customer Rating (0 - 5)')
plt.show()
No description has been provided for this image

Observation¶

  • The scatter plot doesn't show any statistically significant correlation between Delivery Time and Customer Rating.

Question 13: The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer. [3 marks]¶

In [43]:
"""
The below code contains multiple steps to increase readability and also to ensure 
that anyone looking at the code can easily understand from maintenance perspective.
"""
# Applied rating > 4 condition to filter out rows with rating = 5 only. This would improve performance as well.
foodHub_df_rating_over4 = foodHub_df[~foodHub_df['rating'].isin(['Not given', '4', '3', '2', '1', '0'])]
#foodHub_df_rating_over4

# Then grouped the records by "restaurant name and rating" to determine count for each combination.
foodHub_df_rating_count = foodHub_df_rating_over4.groupby(['restaurant_name','rating'])['rating'].count().sort_values(ascending=False)
#foodHub_df_rating_count

# Then applied the filter to extract only the restaurants that have customer rating count over 50.
foodHub_df_rating_over4_over50 = foodHub_df_rating_count[foodHub_df_rating_count > 50]
foodHub_df_rating_over4_over50
Out[43]:
restaurant_name    rating
Shake Shack        5         60
The Meatball Shop  5         53
Name: rating, dtype: int64

Observation¶

  • The restaurants that meet the given criteria are 1) Shake Shack, and 2) The Meatball Shop. So, these two restaurants fulfill the criteria for the promotional offer.

Question 14: The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders. [3 marks]¶

In [44]:
# Create a DataFrame containing just the "cost_of_the_order"column from foodHub_df.
cost_of_order = foodHub_df['cost_of_the_order']

# Declare and initialize variables to store sum values.
revenue_over_20 = 0
revenue_over_5 = 0

# Loop through to calculate cumulative values of the sums
for i in cost_of_order:
    if i > 20.00:
        revenue_over_20 = revenue_over_20 + (i * 0.25)
    elif i > 5:
        revenue_over_5 = revenue_over_5 + (i * 0.15)

# Calculate the net_revenue
net_revenue = revenue_over_20 + revenue_over_5

# Print the calculated values.
print('Revenue generated from orders over dollar 20 is', revenue_over_20) 
print('Revenue generated from orders over dollar 5 but less than or equal to 20 is', revenue_over_5)
print('NET revenue generated in dollars',net_revenue)
Revenue generated from orders over dollar 20 is 3688.7275000000027
Revenue generated from orders over dollar 5 but less than or equal to 20 is 2477.5755000000036
NET revenue generated in dollars 6166.303000000006

Observations¶

  • Revenue generated from order value over dollar 20 is dollar 3,688.73
  • Revenue generated from order value over dollar 5 but less than or equal to 20 is dollar 2,477.58
  • NET revenue generated in dollars 6,166.31

Note: The values were rounded off manually while writing the observations.

Question 15: The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.) [2 marks]¶

In [45]:
# under the existing foodHub_df, create a new column named "total_time_required_to_deliver" by adding both food_preparation_time and delivery_time
foodHub_df['total_time_required_to_deliver'] = foodHub_df['food_preparation_time'] + foodHub_df['delivery_time']

# Create a new dataframe containing just the orders that took over 60 minutes to deliver the food.
order_with_delivery_time_over60 = foodHub_df[foodHub_df['total_time_required_to_deliver'] > 60]

# Calculate the percent of overall orders that took more than 60 minutes to deliver the food.
pct_of_order_over60_minutes_delivery_time = (order_with_delivery_time_over60.shape[0]/foodHub_df.shape[0])*100

# print the percentage.
print('The percent of total order that took over 60 minutes for delivery is', pct_of_order_over60_minutes_delivery_time)
The percent of total order that took over 60 minutes for delivery is 10.537407797681771

Observations¶

  • The percent of orders that takes more than 60 minutes of delivery time is 10.54% of overall/total orders.
  • ~90% of orders are delivered at or under 60 minutes from the time the order was placed by the customer.

Note: The percent value was rounded off manually while writing the observations.

Question 16: The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends? [2 marks]¶

In [46]:
# Generate statistics for orders received/processed during "Weekend"
foodHub_df[foodHub_df['day_of_the_week'].isin(['Weekend'])].describe().T
Out[46]:
count mean std min 25% 50% 75% max
order_id 1351.0 1.477500e+06 554.168089 1476547.00 1477015.50 1477505.00 1477980.500 1478444.00
customer_id 1351.0 1.704226e+05 113403.749068 1311.00 77665.00 127934.00 268958.000 405334.00
cost_of_the_order 1351.0 1.657483e+01 7.486979 4.47 12.08 14.41 22.285 33.37
food_preparation_time 1351.0 2.743671e+01 4.601476 20.00 23.00 27.00 31.000 35.00
delivery_time 1351.0 2.247002e+01 4.628938 15.00 18.50 22.00 27.000 30.00
total_time_required_to_deliver 1351.0 4.990674e+01 6.624883 35.00 45.00 50.00 55.000 65.00
In [47]:
# Generate statistics for orders received/processed during "Weekend"
foodHub_df[foodHub_df['day_of_the_week'].isin(['Weekday'])].describe().T
Out[47]:
count mean std min 25% 50% 75% max
order_id 547.0 1.477483e+06 532.945358 1476559.00 1477043.00 1477457.00 1477929.000 1478435.00
customer_id 547.0 1.730108e+05 114505.333032 5693.00 78913.00 130521.00 276192.000 404649.00
cost_of_the_order 547.0 1.631119e+01 7.479521 4.75 12.08 14.07 22.285 35.41
food_preparation_time 547.0 2.721207e+01 4.708616 20.00 23.00 27.00 31.000 35.00
delivery_time 547.0 2.834004e+01 2.891428 24.00 26.00 28.00 31.000 33.00
total_time_required_to_deliver 547.0 5.555210e+01 5.567972 44.00 51.00 55.00 59.500 68.00
In [169]:
# Determine number of orders received during "Weekday"
foodHub_df[foodHub_df['day_of_the_week'].isin(['Weekday'])].shape[0] # 547 orders
Out[169]:
547
In [170]:
# Determine number of orders received during "Weekend"
foodHub_df[foodHub_df['day_of_the_week'].isin(['Weekend'])].shape[0] # 1351 orders
Out[170]:
1351

Observations¶

Below are the mean values of "delivery time" (NOT the total time required to deliver, which includes the time take for food preparation as well.):

  • Mean delivery time during weekdays is 28.34 minutes (additional information: Total orders received during weekdays is 547).
  • Mean delivery time during weekends is 22.47 minutes (additional information: Total orders received during weekends is 1351).

Additional findings

  • The total number of orders received during weekdays is 547, but the mean time taken for delivery is higher than weekends.
  • The total number of orders received during weekends is 1351, but the mean time taken for delivery is lower than weekdays.
  • Although there is no data available to support my theory, I assume that the number of delivery people work on the weekends could be more than the number of delivery people work on the weekdays.

Conclusion and Recommendations¶

Question 17: What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? (You can use cuisine type and feedback ratings to drive your business recommendations.) [6 marks]¶

Conclusions:¶

Below are the key conclusions arrived at based on the analysis performed on the dataset provided:

  • Food Preparation Time: The minimum, median and maximum time taken to prepare food across all the cuisine types are 20, 27.37 and 35 minutes, respectively. 75% of the orders were prepared in under 31.5 minutes. This seems to be normal given that food from some cuisine types will be prepared fresh after the order is placed unlike the fast-food chains where the already prepared/frozen food will simply be heated.

  • Customer Ratings: About 39% of orders were not rated by the customers. Based on the observation, most of the orders from each cuisine type were not rated.

  • Popular Cuisine Types: The top 4 restaurants where majority of the orders are placed were American, Japanese, Italian and Chinese.

  • Order Timing Trends: Majority of the orders are placed by the customers during weekends than the weekdays. Perhaps the restaurants are possibly in residential areas and people go to work in a different area, which leads to reduced orders during weekdays. Need additional data to validate this factor.

  • Delivery Time: The minimum, median and maximum time taken to deliver food across all the cuisine types are 15, 25 and 33 minutes, respectively. 75% of the orders were delivered in under 28 minutes. The delivery time can be reduced but additional data points like distance from restaurants to customer homes and time of orders were placed are needed to further analyze and conclude.

  • Order Value: The cost of the order for about 70% of the orders was less than USD 20, which might not help to generate more revenue.

Recommendations:¶

Below are the top 4 recommendations based on the dataset provided:

  • Optimize Delivery Network: Remove low-demand cuisine types/restaurants from the delivery network to free up delivery drivers for high-demand ones, which will potentially reduce delivery times for restaurants where most orders are placed by customers.

  • Weekday Order Analysis: Collect more data to understand why fewer orders are placed on weekdays compared to weekends. Until such data is available, no conclusions can be made.

  • Enhance Delivery Insights: Capture additional data such as order timestamps, peak vs. off-peak periods, and historical drive times (e.g., from Google Maps) to analyze traffic impact and establish realistic delivery SLAs and performance metrics.

  • Revenue Growth Opportunity: Identify and onboard niche or specialty cuisine restaurants, which typically have higher price points and could boost overall revenue.

In [ ]: