To enhance readability and focus on the results of the analysis, the source code has been hidden from this notebook. This approach helps in presenting a clean, report-like view of the work while also upholding academic integrity standards. All outputs are included to showcase the full scope of the project.¶

Problem Statement¶

Context¶

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise better-targeted campaigns to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective¶

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which customer segments to target more.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: Number of years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home address ZIP code
  • Family: Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Importing necessary libraries¶

Note:

  1. After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.

  2. On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.

Loading the dataset¶

        ID  Age  Experience  Income  ZIPCode  Family  CCAvg  Education  Mortgage  Personal_Loan  Securities_Account  CD_Account  Online  CreditCard
   0     1   25           1      49    91107       4    1.6          1         0              0                   1           0       0           0
   1     2   45          19      34    90089       3    1.5          1         0              0                   1           0       0           0
   2     3   39          15      11    94720       1    1.0          1         0              0                   0           0       0           0
   3     4   35           9     100    94112       1    2.7          2         0              0                   0           0       0           0
   4     5   35           8      45    91330       4    1.0          2         0              0                   0           0       0           1

        ID  Age  Experience  Income  ZIPCode  Family  CCAvg  Education  Mortgage  Personal_Loan  Securities_Account  CD_Account  Online  CreditCard
4995  4996   29           3      40    92697       1    1.9          3         0              0                   0           0       1           0
4996  4997   30           4      15    92037       4    0.4          1        85              0                   0           0       1           0
4997  4998   63          39      24    93023       2    0.3          3         0              0                   0           0       0           0
4998  4999   65          40      49    90034       3    0.5          2         0              0                   0           0       1           0
4999  5000   28           4      83    92612       3    0.8          1         0              0                   0           0       1           1

Observation(s):

  • The dataset has been loaded successfully, and all columns are populated properly.
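Because the source code is hidden, the loading step can only be sketched. The example below assumes the data is read with pandas. The inline CSV text is a stand-in built from the first five rows shown above, and the names `csv_text`, `data`, and `df` are illustrative rather than the notebook's own.

```python
import io

import pandas as pd

# Stand-in for the real file: the first five rows shown above, as CSV text.
# With the actual file, io.StringIO(csv_text) would be replaced by its path.
csv_text = """ID,Age,Experience,Income,ZIPCode,Family,CCAvg,Education,Mortgage,Personal_Loan,Securities_Account,CD_Account,Online,CreditCard
1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1
"""

data = pd.read_csv(io.StringIO(csv_text))
df = data.copy()  # work on a copy so the raw data stays untouched

print(df.head())  # first rows
print(df.tail())  # last rows
print(df.shape)   # (rows, columns)
```

Working on a copy is a common precaution so that later cleaning steps can always be compared against the data as loaded.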

Data Overview¶

Understand the dataset attributes

(5000, 14)

Observation(s):

  • The dataset contains 5,000 rows and 14 columns.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB

Observation(s):

  • The dataset contains 14 columns.
  • It includes one column of the 'float' data type and 13 columns of the 'integer' data type. All columns are numerical.
  • All columns have been populated with valid values, and there are no 'null' or missing values.
                     count          mean          std      min       25%      50%       75%      max
ID                  5000.0   2500.500000  1443.520003      1.0   1250.75   2500.5   3750.25   5000.0
Age                 5000.0     45.338400    11.463166     23.0     35.00     45.0     55.00     67.0
Experience          5000.0     20.104600    11.467954     -3.0     10.00     20.0     30.00     43.0
Income              5000.0     73.774200    46.033729      8.0     39.00     64.0     98.00    224.0
ZIPCode             5000.0  93169.257000  1759.455086  90005.0  91911.00  93437.0  94608.00  96651.0
Family              5000.0      2.396400     1.147663      1.0      1.00      2.0      3.00      4.0
CCAvg               5000.0      1.937938     1.747659      0.0      0.70      1.5      2.50     10.0
Education           5000.0      1.881000     0.839869      1.0      1.00      2.0      3.00      3.0
Mortgage            5000.0     56.498800   101.713802      0.0      0.00      0.0    101.00    635.0
Personal_Loan       5000.0      0.096000     0.294621      0.0      0.00      0.0      0.00      1.0
Securities_Account  5000.0      0.104400     0.305809      0.0      0.00      0.0      0.00      1.0
CD_Account          5000.0      0.060400     0.238250      0.0      0.00      0.0      0.00      1.0
Online              5000.0      0.596800     0.490589      0.0      0.00      1.0      1.00      1.0
CreditCard          5000.0      0.294000     0.455637      0.0      0.00      0.0      1.00      1.0

Observation(s):

  • The age of customers ranges from 23 to 67 years. The average age is about 45 years, which is almost the same as the median age, suggesting a roughly symmetric distribution.
  • Work experience ranges from -3 to 43 years. Since work experience cannot be negative, the negative values must be treated (for example, replaced with zero). The average work experience is about 20 years, which is almost the same as the median, suggesting a roughly symmetric distribution.
  • The annual income of customers ranges from USD 8,000 to USD 224,000. The average income is about USD 73,800, and the median is USD 64,000, which indicates that the income data is right-skewed.
  • The family size of customers ranges from 1 (single) to 4. The average family size is about 2.4, and the median is 2.
  • The average monthly credit card spend ranges from USD 0 to USD 10,000, and the credit card average spend (CCAvg) data is right-skewed. A value of USD 0 could be interpreted in two ways: 1) the customer does not have a credit card issued by AllLife Bank, or 2) the customer did not use the card during the period for which data was collected. Assumption made: due to the lack of additional information, it was assumed, for the purposes of the project, that customers with a USD 0 spend do not have a credit card issued by AllLife Bank.
  • The average mortgage value is about USD 56,000, while the median is USD 0 and the 75th percentile is USD 101,000, so at least half of the customers have no mortgage. The data is significantly right-skewed, with the highest mortgage value being USD 635,000.
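The skew observations above rest on comparing the mean with the median. A minimal sketch of that check follows, using made-up values that only mimic the shape of the Mortgage and Experience columns. The series contents are assumptions for illustration, not the real data.

```python
import pandas as pd

# Illustrative values only: many zeros plus a long right tail,
# similar in shape to the Mortgage column described above.
mortgage = pd.Series([0, 0, 0, 0, 0, 0, 85, 101, 240, 635], name="Mortgage")

summary = mortgage.describe()  # count, mean, std, min, quartiles, max
mean, median = summary["mean"], summary["50%"]
skewness = mortgage.skew()     # a positive value means a longer right tail

print(f"mean={mean:.1f}, median={median:.1f}, skew={skewness:.2f}")

# Impossible values can be counted the same way, e.g. negative experience.
experience = pd.Series([1, 19, -1, 9, -2], name="Experience")
n_negative = int((experience < 0).sum())
print(n_negative)
```

A mean well above the median, together with a positive skewness value, is the pattern reported for Income, CCAvg, and Mortgage.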
0

Observation(s):

  • No duplicate rows exist in the dataset.
5000

Observation(s):

  • The 'ID' column has 5,000 unique values, so all 5,000 rows correspond to unique customers.
  • Because 'ID' is only a unique identifier, it carries no predictive information and will not be helpful for building the model.
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64

Observation(s):

  • No columns have null or missing values that need to be imputed.
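The three checks above (duplicate rows, unique IDs, missing values) can be sketched as follows. The toy frame and the variable names are assumptions for illustration. The pandas calls themselves are standard.

```python
import pandas as pd

# Toy frame using a few of the real column names (values are illustrative).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Age": [25, 45, 39, 35],
    "Income": [49, 34, 11, 100],
    "Personal_Loan": [0, 0, 0, 1],
})

n_duplicate_rows = int(df.duplicated().sum())  # fully identical rows
n_unique_ids = df["ID"].nunique()              # equals len(df) if one row per customer
missing_per_column = df.isnull().sum()         # number of missing values per column

print(n_duplicate_rows)
print(n_unique_ids)
print(missing_per_column)
```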

Excluding 'ID' and 'ZIPCode' columns from the DataFrame:

  • The 'ID' column is a unique customer identifier with no predictive value, so it is not necessary for solving the problem and can be dropped from the DataFrame.
  • The 'ZIPCode' column is also dropped. It is assumed that the decision on a personal loan does not depend on the customer's home location or home value, and the home is not considered collateral for the loan either.
Age Experience Income Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 25 1 49 4 1.6 1 0 0 1 0 0 0
1 45 19 34 3 1.5 1 0 0 1 0 0 0
2 39 15 11 1 1.0 1 0 0 0 0 0 0
3 35 9 100 1 2.7 2 0 0 0 0 0 0
4 35 8 45 4 1.0 2 0 0 0 0 0 1

Observation(s):

  • It has been verified that the 'ID' and 'ZIPCode' columns have been successfully deleted from the dataset.
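Dropping the two columns can be sketched with `DataFrame.drop`. The two-row frame below is illustrative, and only the column names come from the dataset.

```python
import pandas as pd

# Illustrative two-row frame with the columns to be removed.
df = pd.DataFrame({
    "ID": [1, 2],
    "Age": [25, 45],
    "ZIPCode": [91107, 90089],
    "Personal_Loan": [0, 0],
})

# Remove the identifier and the ZIP code; the other columns are kept as-is.
df = df.drop(columns=["ID", "ZIPCode"])
print(df.columns.tolist())
```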

Exploratory Data Analysis¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?

NOTE: All the above questions are answered at the end of the Exploratory Data Analysis section.

Univariate Analysis

[Figure: distribution of Age]

Observation(s):

  • The average age of customers is around 45 years.
  • The mean and median ages are almost equal, although the data appear to be mildly right-skewed.
  • About 75% of customers are 55 years of age or younger.
  • The maximum age is 67 years.
[Figure: distribution of Experience]

Observation(s):

  • The average professional experience of customers is about 20 years, while the maximum experience is over 40 years.
  • The mean and median appear to be almost the same, and the distribution is roughly symmetric.
  • 52 customers have a negative value for professional experience.

Note: "Professional Experience" can never be negative, so the negative values must be treated so that every value is ≥ 0. More details will follow in the respective section below.

[Figure: distribution of Income]

Observation(s):

  • The median annual income is USD 64,000, while the mean is about USD 74,000. The annual income data is right-skewed.
  • About 75% of customers earn USD 98,000 or less per year.
  • The right whisker of the boxplot ends at around USD 185,000.
  • Some customers earn more than USD 185,000 annually, up to the maximum of USD 224,000; these appear as outliers.
[Figure: distribution of Family]

Observation(s):

  • 29.4% of customers have a family size of 1 (i.e. single), followed by 25.9% with a family size of 2.
  • For customers with a family size of 2, no data is available to determine whether they are couples, single parents, etc.
  • 20.2% of customers have a family size of 3, and 24.4% have a family size of 4.
  • As shown, about 55% of customers have a family size of 2 or fewer.
[Figure: distribution of CCAvg]

Observation(s):

  • The median monthly credit card spend is USD 1,500, while the mean spend is about USD 1,900. The data is positively skewed.
  • Seventy-five (75) percent of customers spend at most USD 2,500 per month on their credit cards.
  • As shown in the boxplot, the right whisker ends at around USD 5,200, and several outliers have monthly spends of up to USD 10,000.
[Figure: distribution of Education]

Observation(s):

  • About 42% of customers have an undergraduate degree, while 30% have an advanced or professional degree.
  • About 28% of customers have a graduate degree.
  • All customers have completed one of the three education levels, and no customer is without a degree.
[Figure: distribution of Mortgage]

Observation(s):

  • The volume of outliers is very high, and the mortgage data is heavily right-skewed. Outlier treatment is necessary.
  • As seen in the plots above, the median mortgage value is near zero (USD 0), indicating that a large number of customers have no mortgage.
  • The mean mortgage value is between USD 55,000 and USD 60,000.
[Figure: distribution of Personal_Loan]

Observation(s):

  • Only about 10% of customers accepted the personal loan offers from the prior campaign.
  • This means that about 90% of customers did not accept the personal loan offer. Further data is needed to understand why they declined; perhaps they did not need a personal loan at the time the campaign was run.
  • First and foremost, the effectiveness of the campaign should be assessed to determine its reach, the mode of delivery (e.g., postal mail, email, flyers at bank locations), etc. More data is required to validate this.
[Figure: distribution of Securities_Account]

Observation(s):

  • Only about 10% of customers have a securities account with the bank.
  • This means that about 90% of customers do not have a securities account with the bank at this time.
  • Perhaps the campaign should include offers for opening a securities or brokerage/trading account, in addition to the personal loan.
  • The bank could also offer bonuses for opening and/or maintaining multiple accounts.
[Figure: distribution of CD_Account]

Observation(s):

  • Only about 6% of customers have a Certificate of Deposit (CD) account with the bank.
  • This means that about 94% of customers do not have a Certificate of Deposit (CD) account with the bank.
  • Perhaps the campaign should include offers for opening a Certificate of Deposit (CD), securities, or brokerage/trading account, in addition to the personal loan.
  • The bank could also offer bonuses for opening and/or maintaining multiple accounts.
[Figure: distribution of Online]

Observation(s):

  • About 60% of the customers use internet/online banking features.
  • The bank could take advantage of the online banking portal to push more campaigns electronically.
[Figure: distribution of CreditCard]

Observation(s):

  • Only about 30% of customers have credit cards issued by banks other than AllLife Bank.
  • About 70% of customers do not have credit cards issued by other banks.

Bivariate Analysis

[Figures: correlations between the attributes]

Observation(s):
Key observations

  • CCAvg and Personal_Loan have a moderate positive correlation, indicating that customers with higher monthly credit card spending are more likely to have taken a personal loan.
  • Income and Personal_Loan show a moderate positive correlation (0.5), suggesting that customers are more likely to take a personal loan when their income is higher.
  • There is a moderate positive correlation between CD_Account and Personal_Loan, indicating that customers with a CD account are more likely to take a personal loan.
  • A positive correlation exists between Family size and Personal_Loan, suggesting that larger families may be more likely to take a personal loan.

Other observations

  • There is a moderate positive correlation between CD_Account and Securities_Account, suggesting that customers with a CD account are more likely to have a securities account with the bank.
  • Age and Experience have a strong positive correlation (0.99), which is expected as working individuals typically accumulate more experience as they age.
  • CCAvg and Income show a moderately strong positive correlation, indicating that customers with higher income tend to have higher average monthly credit card spending.
  • The remaining attribute pairs show only weak correlations and may not contribute much to achieving the objective.
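Correlations with the target, such as those discussed above, can be ranked directly from the correlation matrix. The sketch below uses a six-row toy frame, so the values are illustrative and do not reproduce the coefficients reported in this notebook. `DataFrame.corr` computes Pearson correlation by default.

```python
import pandas as pd

# Toy data: Income separates the two classes clearly, Age only weakly.
df = pd.DataFrame({
    "Income": [20, 30, 40, 150, 160, 170],
    "Age": [30, 50, 40, 45, 35, 55],
    "Personal_Loan": [0, 0, 0, 1, 1, 1],
})

corr = df.corr()  # Pearson correlation matrix

# Keep only the target column, drop its self-correlation, sort high to low.
corr_with_target = (
    corr["Personal_Loan"].drop("Personal_Loan").sort_values(ascending=False)
)
print(corr_with_target.round(2))
```

For a 0/1 target, the Pearson coefficient equals the point-biserial correlation, which is why it is a reasonable first screen for numeric features.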
[Figure: Income by Personal_Loan]

Observation(s):

  • As observed, customers with higher income are more likely to accept the personal loan offer.
  • However, there is a significant volume of outliers where customers have high income but did not accept the personal loan offer.
[Figures: Income, CCAvg, and Personal_Loan]

Observation(s):

  • Customers with an annual income below USD 55,000 do not have a personal loan account.
  • The majority of personal loan accounts are held by customers with an annual income between USD 100,000 and USD 200,000.
  • The average monthly credit card spend has a moderately strong positive correlation with annual income.
  • Average monthly credit card spend shows a moderate positive correlation with the personal loan account.
  • There is a significant concentration of customers at or below the USD 50,000 income level, followed by a relatively higher concentration up to the USD 100,000 income level, where the corresponding average monthly credit card spend ranges from a few hundred dollars to about USD 2,500.
  • Customers with higher income levels tend to have higher average monthly credit card spending.
  • The highest average monthly credit card spend is about USD 10,000, and the corresponding income level is around USD 200,000.
  • Some customers have an average monthly credit card spend of USD 0. This likely indicates that these customers either do not have a credit card issued by the bank or did not use their credit card during the period for which the data was collected. It is assumed that these customers do not have a credit card issued by AllLife Bank.
[Figures: Personal_Loan by CD_Account]

Observation(s):

  • Customers with a Certificate of Deposit (CD) account are more likely to hold a personal loan than customers without one, in line with the positive correlation between CD_Account and Personal_Loan.
  • Personal loans were also approved for customers who do not have a Certificate of Deposit (CD) account with AllLife Bank.
[Figures: Personal_Loan by Securities_Account]

Observation(s):

  • There is no clear relationship between having a Securities Account with the bank and holding a personal loan.
  • Personal loans were also approved for customers who do not have a Securities Account with AllLife Bank.
[Figures: Personal_Loan by Mortgage]

Observation(s):

  • Most personal loan account holders also have a Mortgage Account with the bank.
  • Among customers who have a mortgage, those with a mortgage amount of less than USD 80,000 do not have personal loan accounts.
  • Personal loans were also approved for customers who do not have a mortgage.
[Figures: Personal_Loan by CreditCard]

Observation(s):

  • There is no significant correlation between customers having credit cards issued by other banks and holding a personal loan account at AllLife Bank.
  • Some customers with an annual income over USD 100,000 have credit cards issued by banks other than AllLife Bank.
[Figures: Personal_Loan by Family]

Observation(s):

  • There is a positive association between family size and personal loan accounts, although the correlation coefficient is small (0.06).
  • Families with a size of 3 or 4 are more likely to have personal loan accounts compared to families with a size of 1 or 2.
[Figures: Personal_Loan by Age]

Observation(s):

  • There is no apparent relationship between age and personal loan accounts.
[Figures: Personal_Loan by Education]

Observation(s):

  • Customers with an Advanced/Professional degree (3) or a Graduate degree (2) are more likely to have personal loan accounts than customers with an Undergraduate education (1).

Answers to the questions posed at the beginning of the Exploratory Data Analysis section.

Question #1: What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
The distribution of the mortgage attribute is heavily right-skewed, and there is a high number of outliers, as shown in the box plot of the mortgage attribute under the Univariate Analysis section. Given the volume of outliers, they must be treated during data preprocessing, before the data is used to build the model.

Question #2: How many customers have credit cards?

Number of customers having average credit card monthly spend of USD 0 is 106.
Number of customers having average credit card monthly spend of greater than USD 0 is 4894.

The total number of customers having an AllLife Bank-issued credit card is 4,894.

Assumption Made:

  • As noted above, there are 106 customers with an average monthly credit card spend of USD 0. Therefore, it is assumed that these customers do not have a credit card issued by AllLife Bank.
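The two counts above follow from a simple comparison on CCAvg (5,000 − 106 = 4,894). A minimal sketch with illustrative values follows. Treating a USD 0 spend as "no AllLife Bank card" is the assumption stated above, not a fact recorded in the data.

```python
import pandas as pd

# Illustrative CCAvg values (thousand USD per month); two customers have zero spend.
cc_avg = pd.Series([1.6, 0.0, 1.0, 2.7, 0.0, 0.4], name="CCAvg")

n_zero_spend = int((cc_avg == 0).sum())     # assumed to have no bank-issued card
n_positive_spend = int((cc_avg > 0).sum())  # assumed to have a bank-issued card

print(n_zero_spend, n_positive_spend)
```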

Question #3: What are the attributes that have a strong correlation with the target attribute (personal loan)?

The following attributes have a positive correlation with the target variable, Personal_Loan, listed in descending order of their correlation coefficients:

  • Income (0.50)
  • CCAvg (0.37)
  • CD_Account (0.32)
  • Education (0.14)
  • Mortgage (0.14)
  • Family (0.06)

Question #4: How does a customer's interest in purchasing a loan vary with their age?
As shown in the bivariate analysis of Age vs. Annual Income, with Personal_Loan as the hue parameter, there is no apparent relationship between Age and obtaining a Personal Loan (i.e., the target variable).

Question #5: How does a customer's interest in purchasing a loan vary with their education?

  • There is a positive correlation between Education and Personal_Loan.
  • Customers with an Advanced/Professional degree (3) or a Graduate degree (2) are more likely to have personal loan accounts than customers with an Undergraduate education (1).

Data Preprocessing¶

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)

Data treatment:

  • The attribute "Mortgage" has a high volume of outliers, and these will need to be addressed to reduce their influence on the outcome.
  • The attribute "Experience" has negative values in 52 records. These values will be replaced with 0 before the data is prepared for the model (i.e., before splitting data into training and test sets).
  • No other missing value treatment is required, as the dataset does not contain any missing or null values.

Outlier treatment for the Mortgage feature

[Figure: Mortgage after outlier treatment]

Observation(s):

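  • As shown above, the mortgage outliers have been addressed using the 25th and 75th percentile values, i.e., the interquartile range (IQR). The value 252.5 that appears in the Mortgage column of the records below is consistent with an upper limit of Q3 + 1.5 × IQR = 101 + 1.5 × 101 = 252.5 (thousand USD).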
  • The mortgage data is now suitable for use as a feature.
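A common way to treat outliers with the 25th and 75th percentiles is to cap values at the boxplot whisker limits. The sketch below assumes that approach. The helper `iqr_bounds` and the sample values are hypothetical, while the quartiles (0 and 101) are taken from the summary table.

```python
import pandas as pd


def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) whisker limits for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of Mortgage from the summary table: 25% = 0, 75% = 101 (thousand USD).
lower, upper = iqr_bounds(0.0, 101.0)
print(lower, upper)  # -151.5 252.5

# Illustrative mortgage values; anything above the upper limit is capped to it.
mortgage = pd.Series([0.0, 89.0, 176.0, 241.0, 400.0, 635.0], name="Mortgage")
mortgage_capped = mortgage.clip(lower=lower, upper=upper)
print(mortgage_capped.tolist())  # [0.0, 89.0, 176.0, 241.0, 252.5, 252.5]
```

Capping keeps every customer record while limiting the pull of extreme values, at the cost of turning the integer column into floats.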

Treating customer records with negative value for "Experience"

52
Age Experience Income Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
89 25 -1 113 4 2.30 3 0.0 0 0 0 0 1
226 24 -1 39 2 1.70 2 0.0 0 0 0 0 0
315 24 -2 51 3 0.30 3 0.0 0 0 0 1 0
451 28 -2 48 2 1.75 3 89.0 0 0 0 1 0
524 24 -1 75 4 0.20 1 0.0 0 0 0 1 0
536 25 -1 43 3 2.40 2 176.0 0 0 0 1 0
540 25 -1 109 4 2.30 3 252.5 0 0 0 1 0
576 25 -1 48 3 0.30 3 0.0 0 0 0 0 1
583 24 -1 38 2 1.70 2 0.0 0 0 0 1 0
597 24 -2 125 2 7.20 1 0.0 0 1 0 0 1
649 25 -1 82 4 2.10 3 0.0 0 0 0 1 0
670 23 -1 61 4 2.60 1 239.0 0 0 0 1 0
686 24 -1 38 4 0.60 2 0.0 0 0 0 1 0
793 24 -2 150 2 2.00 1 0.0 0 0 0 1 0
889 24 -2 82 2 1.60 3 0.0 0 0 0 1 1
909 23 -1 149 1 6.33 1 252.5 0 0 0 0 1
1173 24 -1 35 2 1.70 2 0.0 0 0 0 0 0
1428 25 -1 21 4 0.40 1 90.0 0 0 0 1 0
1522 25 -1 101 4 2.30 3 252.5 0 0 0 0 1
1905 25 -1 112 2 2.00 1 241.0 0 0 0 1 0
2102 25 -1 81 2 1.60 3 0.0 0 0 0 1 1
2430 23 -1 73 4 2.60 1 0.0 0 0 0 1 0
2466 24 -2 80 2 1.60 3 0.0 0 0 0 1 0
2545 25 -1 39 3 2.40 2 0.0 0 0 0 1 0
2618 23 -3 55 3 2.40 2 145.0 0 0 0 1 0
2717 23 -2 45 4 0.60 2 0.0 0 0 0 1 1
2848 24 -1 78 2 1.80 2 0.0 0 0 0 0 0
2876 24 -2 80 2 1.60 3 238.0 0 0 0 0 0
2962 23 -2 81 2 1.80 2 0.0 0 0 0 0 0
2980 25 -1 53 3 2.40 2 0.0 0 0 0 0 0
3076 29 -1 62 2 1.75 3 0.0 0 0 0 0 1
3130 23 -2 82 2 1.80 2 0.0 0 1 0 0 1
3157 23 -1 13 4 1.00 1 84.0 0 0 0 1 0
3279 26 -1 44 1 2.00 2 0.0 0 0 0 0 0
3284 25 -1 101 4 2.10 3 0.0 0 0 0 0 1
3292 25 -1 13 4 0.40 1 0.0 0 1 0 0 0
3394 25 -1 113 4 2.10 3 0.0 0 0 0 1 0
3425 23 -1 12 4 1.00 1 90.0 0 0 0 1 0
3626 24 -3 28 4 1.00 3 0.0 0 0 0 0 0
3796 24 -2 50 3 2.40 2 0.0 0 1 0 0 0
3824 23 -1 12 4 1.00 1 0.0 0 1 0 0 1
3887 24 -2 118 2 7.20 1 0.0 0 1 0 1 0
3946 25 -1 40 3 2.40 2 0.0 0 0 0 1 0
4015 25 -1 139 2 2.00 1 0.0 0 0 0 0 1
4088 29 -1 71 2 1.75 3 0.0 0 0 0 0 0
4116 24 -2 135 2 7.20 1 0.0 0 0 0 1 0
4285 23 -3 149 2 7.20 1 0.0 0 0 0 1 0
4411 23 -2 75 2 1.80 2 0.0 0 0 0 1 1
4481 25 -2 35 4 1.00 3 0.0 0 0 0 1 0
4514 24 -3 41 4 1.00 3 0.0 0 0 0 1 0
4582 25 -1 69 3 0.30 3 0.0 0 0 0 1 0
4957 29 -1 50 2 1.75 3 0.0 0 0 0 0 1

Observation(s):

  • There are 52 records with negative values for "Experience"; however, work experience cannot be negative and should be either 0 or a positive value.
  • A closer inspection reveals that none of the 52 customers has a personal loan account with AllLife Bank, and all of them are between 23 and 29 years old, so little or no work experience is plausible. The negative values are therefore replaced with 0 instead of dropping the records.

Replacing the negative "Experience" values with 0 in all 52 records

Age Experience Income Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 25 1 49 4 1.6 1 0.0 0 1 0 0 0
1 45 19 34 3 1.5 1 0.0 0 1 0 0 0
2 39 15 11 1 1.0 1 0.0 0 0 0 0 0
3 35 9 100 1 2.7 2 0.0 0 0 0 0 0
4 35 8 45 4 1.0 2 0.0 0 0 0 0 1
Age Experience Income Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard

Observations:

  • As shown above, the negative values in the "Experience" feature for all 52 customer records (who are not personal loan account holders) have been updated to 0.
  • The DataFrame is now ready for further data processing.
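Replacing the negative values with 0 can be sketched with a boolean mask. The four-row frame is illustrative, and `Series.clip(lower=0)` would be an equivalent alternative.

```python
import pandas as pd

# Illustrative rows: three impossible (negative) values and one valid value.
df = pd.DataFrame({"Age": [25, 24, 23, 45], "Experience": [-1, -2, -3, 19]})

n_negative_before = int((df["Experience"] < 0).sum())

# Overwrite only the negative entries; valid values are left unchanged.
df.loc[df["Experience"] < 0, "Experience"] = 0

n_negative_after = int((df["Experience"] < 0).sum())
print(n_negative_before, n_negative_after)
```

Filtering for negative values again after the update should return no rows, which matches the empty table shown above.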

Data Preparation for Model¶

Age Experience Income Family CCAvg Education Mortgage Securities_Account CD_Account Online CreditCard
0 25 1 49 4 1.6 1 0.0 1 0 0 0
1 45 19 34 3 1.5 1 0.0 1 0 0 0
2 39 15 11 1 1.0 1 0.0 0 0 0 0
3 35 9 100 1 2.7 2 0.0 0 0 0 0
4 35 8 45 4 1.0 2 0.0 0 0 0 1
0    0
1    0
2    0
3    0
4    0
Name: Personal_Loan, dtype: int64
Shape of training set: (3500, 11)
Shape of test set: (1500, 11) 

Percentage of classes in training set:
Personal_Loan
0    90.4
1     9.6
Name: proportion, dtype: float64 

Percentage of classes in test set:
Personal_Loan
0    90.4
1     9.6
Name: proportion, dtype: float64

Observation(s):

  • As shown above, the independent and dependent variables have been split into training and test sets, and the class proportions of Personal_Loan (90.4% vs. 9.6%) are the same in both sets.
  • A 70-30 split was used for training and testing data to ensure enough data is available for model validation and performance evaluation using unseen data.
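Identical class proportions in both sets are what a stratified split produces. The sketch below assumes scikit-learn's `train_test_split` with `stratify=y`. The synthetic data and the `random_state` value are illustrative, since the notebook's actual call is hidden.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 customers, 10% of whom accepted the loan.
X = pd.DataFrame({"Income": np.arange(100), "Family": np.arange(100) % 4 + 1})
y = pd.Series([1] * 10 + [0] * 90, name="Personal_Loan")

# 70-30 split; stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print("Shape of training set:", X_train.shape)
print("Shape of test set:", X_test.shape)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

With an imbalanced target of under 10% positives, stratifying avoids a test set that happens to contain too few loan customers to evaluate recall reliably.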

Model Building¶

DecisionTreeClassifier(random_state=42)

Model Evaluation Criterion¶

[Figure: model evaluation]
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
[Figure: model evaluation]
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0

Observation(s):

  • As shown above, the model performs perfectly on the training dataset, while its performance on the test dataset is significantly lower.
  • This indicates that the model is overfitting.
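Overfitting is usually checked by computing the same metrics on the training and test sets separately. The sketch below is a hypothetical version of that comparison on synthetic data from `make_classification`. The helper `score` and all data are assumptions. Only `DecisionTreeClassifier(random_state=42)` is taken from the output above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def score(model, X, y):
    """Return the four metrics used in this notebook for one dataset."""
    pred = model.predict(X)
    return {
        "Accuracy": accuracy_score(y, pred),
        "Recall": recall_score(y, pred, zero_division=0),
        "Precision": precision_score(y, pred, zero_division=0),
        "F1": f1_score(y, pred, zero_division=0),
    }


# Synthetic, imbalanced data standing in for the loan dataset.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# An unconstrained tree grows until every training leaf is pure.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_scores = score(model, X_train, y_train)
test_scores = score(model, X_test, y_test)
print("train:", train_scores)
print("test: ", test_scores)
```

Passing the training data to both calls by mistake would produce two identical tables of 1.0, so it is worth confirming that the second call receives the test split.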

Visualize the decision tree and text report

['Age',
 'Experience',
 'Income',
 'Family',
 'CCAvg',
 'Education',
 'Mortgage',
 'Securities_Account',
 'CD_Account',
 'Online',
 'CreditCard']
[Figure: decision tree]
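A text report in the format that follows can be produced with scikit-learn's `export_text`, using `show_weights=True` to print the per-class weights at each leaf. The sketch below fits a tiny tree on made-up data, so its rules are illustrative and unrelated to the tree in this notebook.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: the label depends only on Income.
X = pd.DataFrame({
    "Income": [40, 60, 80, 120, 150, 200],
    "CCAvg": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
})
y = [0, 0, 0, 1, 1, 1]

model = DecisionTreeClassifier(random_state=42).fit(X, y)

feature_names = list(X.columns)
report = export_text(model, feature_names=feature_names, show_weights=True)
print(report)
```

Deep branches are abbreviated as "truncated branch of depth N" once they exceed the `max_depth` argument of `export_text` (10 by default); raising that argument prints the full rules.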
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2483.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Age <= 27.00
|   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |--- Age >  27.00
|   |   |   |   |--- Income <= 92.50
|   |   |   |   |   |--- CCAvg <= 3.65
|   |   |   |   |   |   |--- Mortgage <= 216.50
|   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |--- Experience <= 18.50
|   |   |   |   |   |   |   |   |   |--- Age <= 43.00
|   |   |   |   |   |   |   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- Education >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- Age >  43.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Experience >  18.50
|   |   |   |   |   |   |   |   |   |--- weights: [24.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |   |   |   |   |--- Mortgage <= 94.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Mortgage >  94.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage >  216.50
|   |   |   |   |   |   |   |--- Income <= 68.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- Income >  68.00
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.65
|   |   |   |   |   |   |--- Mortgage <= 89.00
|   |   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage >  89.00
|   |   |   |   |   |   |   |--- Mortgage <= 99.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Mortgage >  99.50
|   |   |   |   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |   |--- Income >  92.50
|   |   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |--- Education >  1.50
|   |   |   |   |   |   |--- Income <= 96.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |--- Income >  96.50
|   |   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  48.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |--- CD_Account >  0.50
|   |   |   |--- CCAvg <= 4.25
|   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |--- CCAvg >  4.25
|   |   |   |   |--- Mortgage <= 38.00
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Mortgage >  38.00
|   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|--- Income >  98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 99.50
|   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Family >  1.50
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- Income >  99.50
|   |   |   |   |--- Income <= 104.50
|   |   |   |   |   |--- CCAvg <= 3.31
|   |   |   |   |   |   |--- weights: [17.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.31
|   |   |   |   |   |   |--- CCAvg <= 4.25
|   |   |   |   |   |   |   |--- Mortgage <= 124.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |--- Mortgage >  124.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  4.25
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- Income >  104.50
|   |   |   |   |   |--- weights: [449.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 113.50
|   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |--- CCAvg <= 2.05
|   |   |   |   |   |   |--- Experience <= 15.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Experience >  15.50
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.05
|   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Online >  0.50
|   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Income >  113.50
|   |   |   |   |--- weights: [0.00, 49.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.45
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [28.00, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- Experience <= 8.00
|   |   |   |   |   |   |--- weights: [11.00, 0.00] class: 0
|   |   |   |   |   |--- Experience >  8.00
|   |   |   |   |   |   |--- Experience <= 31.50
|   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |   |   |   |--- Mortgage <= 231.00
|   |   |   |   |   |   |   |   |   |   |--- CCAvg <= 1.05
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- CCAvg >  1.05
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- Mortgage >  231.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |--- Experience >  31.50
|   |   |   |   |   |   |   |--- weights: [11.00, 0.00] class: 0
|   |   |   |--- CCAvg >  2.45
|   |   |   |   |--- CCAvg <= 4.65
|   |   |   |   |   |--- CCAvg <= 4.45
|   |   |   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |   |--- Age <= 45.00
|   |   |   |   |   |   |   |   |   |--- Experience <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Experience >  2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  45.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |   |--- Experience <= 20.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 9.00] class: 1
|   |   |   |   |   |   |   |   |--- Experience >  20.50
|   |   |   |   |   |   |   |   |   |--- Age <= 52.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Age >  52.00
|   |   |   |   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  63.50
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  4.45
|   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |--- CCAvg >  4.65
|   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |--- Income >  114.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- CCAvg <= 0.65
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- CCAvg >  0.65
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- CCAvg >  1.10
|   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 215.00] class: 1

Observation(s):

  • The default tree scores a perfect 1.0 on every training metric but lower on the test set (F1 of about 0.90), so it is overfit: it has learned noise in the training data along with the underlying pattern.
  • The model has a depth of 12 levels, with leaf nodes at the bottom containing only single-digit samples.
  • This model is therefore more complex than necessary and generalizes less well than its training scores suggest.
  • Pruning should be applied to simplify the tree so that it predicts the target variable more accurately on unseen data.
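
The overfitting symptoms listed above can be checked directly on a fitted tree. The following is a minimal sketch, not the notebook's hidden code: it uses synthetic data from `make_classification` as a stand-in for the bank dataset, and all variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (assumption, not the real dataset).
X, y = make_classification(n_samples=2000, n_features=11, n_informative=5,
                           weights=[0.9, 0.1], flip_y=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# With no growth limits, the tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_f1 = f1_score(y_train, full_tree.predict(X_train))
test_f1 = f1_score(y_test, full_tree.predict(X_test))

# A perfect training score next to a lower test score is the overfitting signal.
print("depth:", full_tree.get_depth(), "leaves:", full_tree.get_n_leaves())
print(f"train F1 = {train_f1:.3f}, test F1 = {test_f1:.3f}")
```

On data like this the training F1 is 1.0 by construction, so the size of the gap to the test F1, together with the depth and leaf count, gives a rough measure of how much the tree has memorized.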

Pre-pruning Technique¶

DecisionTreeClassifier(max_depth=4, max_leaf_nodes=20, min_samples_split=10,
                       random_state=42)
Training performance:
   Accuracy    Recall  Precision        F1
0  0.985429   0.89881   0.946708  0.922137
Test performance:
   Accuracy    Recall  Precision        F1
0  0.984667  0.944444   0.900662  0.922034
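
As a worked check of how the four metrics relate, the second metrics row above (accuracy 0.984667) can be reproduced from confusion-matrix counts. The counts used here are an assumption: they are inferred from the reported metrics for a 1,500-row test set and do not appear in the notebook output.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, recall, precision and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)        # share of actual buyers that were found
    precision = tp / (tp + fp)     # share of predicted buyers that were right
    f1 = 2 * precision * recall / (precision + recall)
    return {"Accuracy": accuracy, "Recall": recall,
            "Precision": precision, "F1": f1}

# Counts inferred from the reported metrics (assumption, not notebook output).
test_metrics = classification_metrics(tp=136, fp=15, fn=8, tn=1341)
for name, value in test_metrics.items():
    print(f"{name}: {value:.6f}")
# Accuracy: 0.984667
# Recall: 0.944444
# Precision: 0.900662
# F1: 0.922034
```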
['Age',
 'Experience',
 'Income',
 'Family',
 'CCAvg',
 'Education',
 'Mortgage',
 'Securities_Account',
 'CD_Account',
 'Online',
 'CreditCard']
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2483.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Age <= 27.00
|   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |--- Age >  27.00
|   |   |   |   |--- weights: [121.00, 17.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- CCAvg <= 4.25
|   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |--- CCAvg >  4.25
|   |   |   |   |--- weights: [3.00, 1.00] class: 0
|--- Income >  98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 99.50
|   |   |   |   |--- weights: [1.00, 2.00] class: 1
|   |   |   |--- Income >  99.50
|   |   |   |   |--- weights: [469.00, 3.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 113.50
|   |   |   |   |--- weights: [11.00, 5.00] class: 0
|   |   |   |--- Income >  113.50
|   |   |   |   |--- weights: [0.00, 49.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.45
|   |   |   |   |--- weights: [60.00, 8.00] class: 0
|   |   |   |--- CCAvg >  2.45
|   |   |   |   |--- weights: [14.00, 21.00] class: 1
|   |   |--- Income >  114.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- weights: [2.00, 7.00] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 215.00] class: 1


Observation(s):

  • The pre-pruned model no longer overfits the training data.
  • The F1 scores on the training and test sets are nearly identical (0.9221 vs. 0.9220).
  • The order of importance of the features used to build the pre-pruned model is as follows: Education, Income, CCAvg, and CD_Account.
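
A pre-pruned tree with the hyperparameters shown above could be fitted and inspected roughly as follows. This is a sketch under stated assumptions: the data is synthetic, the `feature_N` names are placeholders, and the notebook's actual code is hidden, so only the three hyperparameter values and `random_state` are taken from the output.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the bank data (assumption, not the real dataset).
X, y = make_classification(n_samples=2000, n_features=11, n_informative=5,
                           weights=[0.9, 0.1], flip_y=0.05, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Growth limits are applied while the tree is built (pre-pruning).
pre_pruned = DecisionTreeClassifier(
    max_depth=4, max_leaf_nodes=20, min_samples_split=10, random_state=42
).fit(X_train, y_train)

# Rank features by impurity-based importance; features never split on score 0.
ranked = sorted(zip(feature_names, pre_pruned.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    if importance > 0:
        print(f"{name}: {importance:.3f}")

# Text rendering in the same style as the tree listings in this report.
print(export_text(pre_pruned, feature_names=feature_names, show_weights=True))
```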

Post-pruning Technique¶

ccp_alphas impurities
0 0.000000 0.000000
1 0.000270 0.000540
2 0.000275 0.001090
3 0.000281 0.001651
4 0.000378 0.002784
5 0.000381 0.003165
6 0.000381 0.003546
7 0.000381 0.003927
8 0.000381 0.004308
9 0.000381 0.005069
10 0.000426 0.006773
11 0.000429 0.007201
12 0.000429 0.007630
13 0.000440 0.008949
14 0.000457 0.009406
15 0.000476 0.009882
16 0.000476 0.010358
17 0.000476 0.013216
18 0.000514 0.013730
19 0.000539 0.016964
20 0.000543 0.019679
21 0.000662 0.020341
22 0.000688 0.021029
23 0.000743 0.021771
24 0.000771 0.022543
25 0.000933 0.025341
26 0.001698 0.027040
27 0.002429 0.029468
28 0.003072 0.032540
29 0.003258 0.035798
30 0.020297 0.056095
31 0.021982 0.078076
32 0.047746 0.173568
Number of nodes in the last tree is 1 with ccp_alpha 0.04774589891961516

Observation(s):

  • The last tree in the pruning path is trivial (a single root node), so the corresponding ccp_alpha can be excluded.
Index best model 25
DecisionTreeClassifier(ccp_alpha=0.0009328066710555192, random_state=42)
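
The post-pruning steps (compute the pruning path, drop the trivial last alpha, fit one tree per alpha, pick the best) might be sketched as below. The data is synthetic, and the selection rule used here, highest test-set F1, is an assumption: the notebook output reports the chosen index but not the criterion behind it.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data (assumption, not the real dataset).
X, y = make_classification(n_samples=2000, n_features=11, n_informative=5,
                           weights=[0.9, 0.1], flip_y=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Effective alphas and total leaf impurities along the pruning path.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train)

# Drop the last alpha (it prunes the tree down to a single node) and clamp
# tiny negative values that can arise from floating-point error.
ccp_alphas = [max(float(a), 0.0) for a in path.ccp_alphas[:-1]]

# Fit one tree per candidate alpha.
trees = [DecisionTreeClassifier(ccp_alpha=a, random_state=42).fit(X_train, y_train)
         for a in ccp_alphas]

# Assumed selection rule: highest F1 on the held-out set.
test_f1 = [f1_score(y_test, t.predict(X_test), zero_division=0) for t in trees]
best_index = max(range(len(trees)), key=test_f1.__getitem__)
best_tree = trees[best_index]

print("Index best model", best_index)
print(best_tree)
```

Selecting on the test set makes the reported test score slightly optimistic; a separate validation split or cross-validation would be a more conservative variant of the same idea.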

Model Evaluation

Training performance:
   Accuracy    Recall  Precision        F1
0     0.984  0.889881   0.940252  0.914373
Test performance:
   Accuracy    Recall  Precision        F1
0  0.988667  0.944444   0.937931  0.941176

Visualizing the tree

|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2483.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- weights: [121.00, 19.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [3.00, 7.00] class: 1
|--- Income >  98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [470.00, 5.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 113.50
|   |   |   |   |--- weights: [11.00, 5.00] class: 0
|   |   |   |--- Income >  113.50
|   |   |   |   |--- weights: [0.00, 49.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.45
|   |   |   |   |--- weights: [60.00, 8.00] class: 0
|   |   |   |--- CCAvg >  2.45
|   |   |   |   |--- weights: [14.00, 21.00] class: 1
|   |   |--- Income >  114.50
|   |   |   |--- weights: [2.00, 222.00] class: 1

[Figure: feature importance chart for the post-pruned decision tree]

Observation(s):

  • The chart above shows the feature importances of the post-pruned decision tree model.

Model Performance Comparison and Final Model Selection¶

Training performance comparison:

           Default Decision Tree  Pre-pruned Decision Tree  Post-pruned Decision Tree
Accuracy                     1.0                  0.985429                   0.984000
Recall                       1.0                  0.898810                   0.889881
Precision                    1.0                  0.946708                   0.940252
F1                           1.0                  0.922137                   0.914373

Test set performance comparison:

           Default Decision Tree  Pre-pruned Decision Tree  Post-pruned Decision Tree
Accuracy                0.980667                  0.984667                   0.988667
Recall                  0.909722                  0.944444                   0.944444
Precision               0.891156                  0.900662                   0.937931
F1                      0.900344                  0.922034                   0.941176

  • Both the pre-pruned and post-pruned decision trees exhibit generalized performance.

  • The pre-pruned decision tree shows almost identical performance on the training and test sets.

    • This model uses an additional feature (Age) for decision-making compared to the post-pruned tree.
    • The extra feature may slightly increase prediction time and could help on unseen data, although any difference in performance is likely negligible.
  • The post-pruned decision tree scores about 2.7 percentage points higher F1 on the test set than on the training set (0.941 vs. 0.914, roughly 2.85% of the test score).

    • This model uses five features for decision-making.
    • It benefits from a simpler structure and lower prediction time, and on the test set it performs at least as well as the pre-pruned model.
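
The train-to-test F1 gap for the post-pruned tree can be expressed in more than one way, and the choice of baseline changes the number. A short arithmetic check using the values from the comparison tables (variable names are illustrative):

```python
train_f1 = 0.914373  # post-pruned tree, training set
test_f1 = 0.941176   # post-pruned tree, test set

absolute_gap = test_f1 - train_f1            # difference in F1 points
relative_to_test = absolute_gap / test_f1    # gap as a share of the test score
relative_to_train = absolute_gap / train_f1  # gap as a share of the training score

print(f"absolute gap: {absolute_gap * 100:.2f} percentage points")  # 2.68
print(f"relative to test F1: {relative_to_test:.2%}")               # 2.85%
print(f"relative to train F1: {relative_to_train:.2%}")             # 2.93%
```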

Final selection

  • Based on the observations stated above, the post-pruned tree model is expected to yield better predictions than the pre-pruned model.

  • The feature importance of the post-pruned tree is almost identical to that of the pre-pruned tree, except that the feature ‘Age’ is used additionally by the pre-pruned model. As visualized in the EDA section, ‘Age’ has no statistically significant correlation with ‘Personal_Loan’. Therefore, its influence on prediction is likely negligible.

  • As a result, the post-pruned model has been selected to help AllLife Bank predict which liability customers are likely to accept the personal loan offer.

  • As shown in the "Feature Importance" chart for the post-pruned model, the features 1) Annual Income and 2) Education have the highest influence, followed by 3) Family Size, 4) Monthly Credit Card Spend, and 5) CD Account. This order of importance is consistent with the pre-pruned model.

Prediction on a single datapoint

[1]
CPU times: total: 0 ns
Wall time: 2.01 ms
  • The model returned the prediction in about 2 ms of wall time.
0.9910714285714286

Observation(s):

  • The predicted probability of about 0.99 means the model estimates a 99% chance that this liability customer would accept the personal loan offer, so the offer should be sent during the upcoming campaign.
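
For a scikit-learn decision tree fitted without class weights, the predicted probability is the fraction of training samples of each class in the leaf that the data point reaches. The value printed above matches the post-pruned leaf with weights [2.00, 222.00] (Income > 114.50 under Education > 1.50); that the data point landed in this leaf is an inference from the matching number, not something the output states. A minimal sketch with a hypothetical helper:

```python
def leaf_probability(class_counts, positive_index=1):
    """Fraction of training samples in a leaf that belong to the positive class."""
    return class_counts[positive_index] / sum(class_counts)

# Leaf weights copied from the post-pruned tree listing: [class 0, class 1].
leaf_weights = [2.00, 222.00]
p_accept = leaf_probability(leaf_weights)
print(p_accept)  # 222 / 224, about 0.9911
```

Because the probability is a leaf-level frequency, every customer who falls into the same leaf receives the same score.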

Actionable Insights and Business Recommendations¶

What recommendations would you suggest to the bank?

  • The post-pruned model can be deployed to determine the likelihood of liability customers accepting the personal loan offer sent as part of the campaign.

  • A business decision should be made to define a threshold, so that customers with predictions below this threshold are reviewed manually to decide whether the offer should be sent.

  • By deploying this model, current manual processes can be significantly reduced, and an automated process can be implemented to send offers via email, postal mail, or both to customers who meet the defined threshold.
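
The threshold-based routing recommended above could look roughly like the sketch below. The function name, customer IDs, scores, and the threshold of 0.6 are all hypothetical; the actual threshold is a business decision that the report leaves open.

```python
def route_customers(probabilities, threshold):
    """Split customers into automatic-offer and manual-review groups.

    probabilities: dict mapping customer ID to predicted acceptance probability.
    threshold: minimum probability for sending the offer automatically.
    """
    auto_offer, manual_review = [], []
    for customer_id, probability in probabilities.items():
        if probability >= threshold:
            auto_offer.append(customer_id)
        else:
            manual_review.append(customer_id)
    return auto_offer, manual_review

# Hypothetical customers and scores, for illustration only.
scores = {"C001": 0.991, "C002": 0.136, "C003": 0.600}
auto_offer, manual_review = route_customers(scores, threshold=0.6)
print("send offer:", auto_offer)        # ['C001', 'C003']
print("manual review:", manual_review)  # ['C002']
```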