HMDA. Mortgage Data Analysis and Modeling Predictions.¶
The Dataset: https://www.consumerfinance.gov/data-research/hmda/¶
The Home Mortgage Disclosure Act (HMDA) requires many financial institutions to maintain, report, and publicly disclose loan-level information about mortgages. These data help show whether lenders are serving the housing needs of their communities; they give public officials information that helps them make decisions and policies; and they shed light on lending patterns that could be discriminatory. The public data are modified to protect applicant and borrower privacy.
HMDA was originally enacted by Congress in 1975 and is implemented by Regulation C.
The Goal:¶
We will use the models Logistic Regression and LightGBM. to assess the outcomes, we will use ROC-AUC and Log Loss. The purpouse will be:
- Find out if the are any intersectional biases in model's predictions.
- Which features influenced the most on model's outcomes. We will use SHAP Values.
Steps:¶
Step 1: Setup of Working Environment and Libraries.
Step 2: Overview
A) Understanding features: Know what each column represents and whether it is useful.
B) Understanding the data, shape and structure:
- df.shape
- df.columns
- df.dtypes
- df.head()
- Step 3: Data Cleaning & Feature Engineering
A) Excluding irrelevant records from columns.
- Quality issues: in "edit_status_name" there are 190,210 records with quality issues. We will ommit them.
- Only single idividuals: we will focus on single individuals hence from 'co_applicant_ethnicity_name' we only keep 'No co-applicant'
B) Excluding irrelevant records from:
- applicant_ethnicity_name
- applicant_race_name_1
- applicant_sex_name
- action_taken_name
C) Addressing missing values:
- Filling relevant empty cells with "Unknown"
- Deleting irrelevant records from other columns.
D) Feature engineering;
- Feature creation:
- 'ethnicity_race_sex'
- "loan_to_income_ratio"
- Feature transformation. Mapping action_taken_name into "approved" & "denied"
- Feature creation:
- Step 4: Exploratory Data Analysis (EDA) & Feature Engineering
A) Initial questions
- A.1 What is the proportion of denials to approvals for each group in the column "ethnicity_race_sex"?
- A.2 What is the approval rate for each group?
- A.3 What is the denial rate for each group?
- A.4 What percentage of total applications does each ethnicity_race_sex group represent?
- A.5 What is the ratio of applications submitted by men compared to women?
- A.6 What is the average loan amount requested by men compared to women?
- A.7 What is the average loan amount requested by each group?
- A.8 What is the average income for each ethnicity_race_sex group?
- A.9 What is the average loan-to-income ratio for each group?
- A.10 What is the average rate_spread for each group?
- A.11 Which group has the fewest zero values in the column "rate_spread", suggesting they are more frequently overcharged?
B) Statistics & feature engineering; feature selection
C) Visualization of numerical and categorical data
- Categorical data; histograms
- Numercial Data:
- Looking for outliers with boxplot
- Filtering out outliers
- Checking Statistics
- Checking and addressing Skewness
E) Kernel Density Estimation (KDE)
Final Considerations: Please note that at the bottom of some coding blocks we acknowledge external original sources of the code used, if any source is mentioned credits go to myself or from official documentation.
Step 1: Setup of working environment and libraries.¶
Note:
!git init is for Google Colab environment, we should use "subprocess.run(["git", "init"])" which is more in line with PEP 8 doc. Style.
We only use once "git init", when starting a new project. The purpose of this code is to create things related to the project in a .git directory in the project folder.
subprocess.run(["git", "init"])
# Mount google drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Setting up GitHub
import os
import subprocess
from getpass import getpass #Secure token storage
# Importing working space
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
Username and email:¶
subprocess.run(["git", "config", "--global", "user.name", "AJLR888"], check=True) subprocess.run(["git", "config", "--global", "user.email", "roldan.analytics@gmail.com"], check=True)
Storing GitHub token and repository details¶
GITHUB_TOKEN = getpass("Enter GitHub Token:") REPO_OWNER = "AJLR888" REPO_NAME = "hmda-ny-2007-loan-default" BRANCH_NAME = "main"
Setting GitHub remot URL with authentcation¶
GIT_REMOTE_URL = f"https://%7BGITHUB_TOKEN%7D@github.com/%7BREPO_OWNER%7D/%7BREPO_NAME%7D.git" os.system(f"git remote set-url origin {GIT_REMOTE_URL}")
# Loading the data
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/hmda_2007_ny_all-records_labels.csv')
# Checking data size
df = pd.DataFrame(df)
<ipython-input-3-100f7bdf4485>:2: DtypeWarning: Columns (34,36,38,42,44,46,48) have mixed types. Specify dtype option on import or set low_memory=False. df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/hmda_2007_ny_all-records_labels.csv')
Step 2: General overview¶
A) Understanding of each feature. What does each feature describe?¶
as_of_year: The year the mortgage was reported.
respondent_id: The lender.
agency_name: The regulatory agency responsible for overseeing the financial institution that reported the mortgage.
agency_abbr: Agency abrebiation name.
agency_code: The regulator that supervises the regulated lender that report the data.
loan_type_name:The type of covered loan or application (if covered). By covered mean, there is a mechanism to protect the lender.
property_type_name:
- One to four-family (other than manufactured housing)
- Manufactured housing
- Multifamily
loan_purpose_name: What was the loan used for.
owner_occupancy_name: Whether the owner intend to live in the property or not.
loan_amount_000s
preapproval_name: Preapproval only applies to home purchase loans, not refinancing or home improvement loans.
action_taken_name: The final outcome of a loan application.
msamd_name: MSA stands for Metropolitan Statistical Area, is a region with a high population density at its core and close economic ties throught the area; MD stands for Metropolitan division, is a sub-region within a large MSA (used when the area is particularly populous).
- Considerations:
- If the property is outside of the MSA/MD area it may be blank/empty.
- Helps analyze lending patterns, approval rates, and borrower demographics in specific regions.
- msamd is broader, city or regional-level in comparison with census_tract_number. msamd is better use for housing trends or market segmentation for instance.
census_tract_number: is a small area. Designed to be socially and economically homogeneous. In other words, they are kind of neighbourhoods.
purchaser_type_name: Describes who purchased the loan in the secondary market after it was originated by the reporting institution.
denial_reason_name_1: Important to consider that if NA the loan was not denied.
rate_spread: The exra interest that a borrower pay compared to the "best-qualified" borrowers. Generally speaking it means higher risk although there are nuances to consider. e.g. someone can have a higher rate spread due to a bad credit score which doesn't reflect their ability and will to pay back.
hoepa_status_name: hoepa stands for Home Ownership and Equity Protection Act. is a law disgned to protect borrowers from predatory lending.
- Considerations:
- It can be used to track if high-cost loans are disproportionally targeted at vulnerable groups.
lien_status_name: Basically, it indicates whether the mortgage is secured or not.
edit_status_name: Are rules to assist filers in checking the accuracy of HMDA data prior to submission. Idicate if there are any accuracy(quality)issues.
- Note: In our dataset we have almost 200,000 records as "Quality edit failure only" which is indicative of some potentail data inconsistency.
minority_population: % of minority population to total tract population.
hud_median_family_income: Median family income for the MSA (metropolitan statistical area) or MD (metropolitan division)
- Note: hud(Housing & Urban development).
- List item
population: total population in tract (tract ~ neighbourhood).
tract_to_msamd_income: Indicates how wealthy or poor a tract is compared to its surroundings areas. compares the median family income of the census tract to the median family income of the corresponding metropolitan statistical area MSA OR metropolitan division MD. Let's remember that a tract is inside of a MSA/MD.
number_of_owner_occupied_units: Total number of housing units in the census tract that are occupied by their owners rather than rented.
number_of_1_to_4_family_units: Indicates the number of housing units that are classified as single-family homes. The higher the number the less population density.
Check the following sources for further undertanding:
Check filing instructions guide for further information: https://ffiec.cfpb.gov/documentation/fig/2024/overview
Original source: https://www.consumerfinance.gov/data-research/hmda/historic-data/?geo=nationwide&records=all-records&field_descriptions=labels
Check "lar_record_codes.pdf" from my GitHub for further Inf.
B) Understanding the data¶
# Checking content in the columns
pd.set_option('display.max_columns', None)
print(df.head(5))
as_of_year respondent_id agency_name \ 0 2007 20-2096530 Office of the Comptroller of the Currency 1 2007 2085300005 Department of Housing and Urban Development 2 2007 20-2096530 Office of the Comptroller of the Currency 3 2007 0000003970 Office of Thrift Supervision 4 2007 2085300005 Department of Housing and Urban Development agency_abbr agency_code loan_type_name loan_type \ 0 OCC 1 Conventional 1 1 HUD 7 Conventional 1 2 OCC 1 Conventional 1 3 OTS 4 Conventional 1 4 HUD 7 Conventional 1 property_type_name property_type \ 0 One-to-four family dwelling (other than manufa... 1 1 One-to-four family dwelling (other than manufa... 1 2 One-to-four family dwelling (other than manufa... 1 3 One-to-four family dwelling (other than manufa... 1 4 One-to-four family dwelling (other than manufa... 1 loan_purpose_name loan_purpose owner_occupancy_name \ 0 Home improvement 2 Owner-occupied as a principal dwelling 1 Refinancing 3 Owner-occupied as a principal dwelling 2 Home improvement 2 Owner-occupied as a principal dwelling 3 Refinancing 3 Owner-occupied as a principal dwelling 4 Refinancing 3 Owner-occupied as a principal dwelling owner_occupancy loan_amount_000s preapproval_name preapproval \ 0 1 54 Not applicable 3 1 1 60 Not applicable 3 2 1 56 Not applicable 3 3 1 374 Not applicable 3 4 1 158 Not applicable 3 action_taken_name action_taken \ 0 Application denied by financial institution 3 1 Application denied by financial institution 3 2 Application denied by financial institution 3 3 Loan purchased by the institution 6 4 Application denied by financial institution 3 msamd_name msamd state_name state_abbr state_code \ 0 Binghamton - NY 13780.0 New York NY 36 1 Binghamton - NY 13780.0 New York NY 36 2 Binghamton - NY 13780.0 New York NY 36 3 Nassau, Suffolk - NY 35004.0 New York NY 36 4 Binghamton - NY 13780.0 New York NY 36 county_name county_code census_tract_number applicant_ethnicity_name \ 0 Tioga County 107.0 206.0 Not Hispanic or Latino 1 Broome County 7.0 17.0 Hispanic or Latino 2 Tioga County 107.0 206.0 Not Hispanic or Latino 3 Nassau County 59.0 4086.0 Not applicable 4 Broome County 7.0 125.0 Not Hispanic or Latino applicant_ethnicity co_applicant_ethnicity_name co_applicant_ethnicity \ 0 2 No co-applicant 5 1 1 No co-applicant 5 2 2 No co-applicant 5 3 4 Not applicable 4 4 2 Not Hispanic or Latino 2 applicant_race_name_1 applicant_race_1 applicant_race_name_2 \ 0 White 5 NaN 1 White 5 NaN 2 White 5 NaN 3 Not applicable 7 NaN 4 White 5 NaN applicant_race_2 applicant_race_name_3 applicant_race_3 \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN applicant_race_name_4 applicant_race_4 applicant_race_name_5 \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN applicant_race_5 co_applicant_race_name_1 co_applicant_race_1 \ 0 NaN No co-applicant 8 1 NaN No co-applicant 8 2 NaN No co-applicant 8 3 NaN Not applicable 7 4 NaN White 5 co_applicant_race_name_2 co_applicant_race_2 co_applicant_race_name_3 \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN co_applicant_race_3 co_applicant_race_name_4 co_applicant_race_4 \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN co_applicant_race_name_5 co_applicant_race_5 \ 0 NaN NaN 1 NaN NaN 2 NaN NaN 3 NaN NaN 4 NaN NaN applicant_sex_name applicant_sex \ 0 Male 1 1 Information not provided by applicant in mail,... 3 2 Male 1 3 Not applicable 4 4 Information not provided by applicant in mail,... 3 co_applicant_sex_name co_applicant_sex \ 0 No co-applicant 5 1 No co-applicant 5 2 No co-applicant 5 3 Not applicable 4 4 Information not provided by applicant in mail,... 3 applicant_income_000s purchaser_type_name \ 0 16.0 Loan was not originated or was not sold in cal... 1 36.0 Loan was not originated or was not sold in cal... 2 66.0 Loan was not originated or was not sold in cal... 3 NaN Commercial bank, savings bank or savings assoc... 4 62.0 Loan was not originated or was not sold in cal... purchaser_type denial_reason_name_1 denial_reason_1 \ 0 0 Collateral 4.0 1 0 NaN NaN 2 0 Credit application incomplete 7.0 3 6 NaN NaN 4 0 NaN NaN denial_reason_name_2 denial_reason_2 denial_reason_name_3 denial_reason_3 \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN rate_spread hoepa_status_name hoepa_status lien_status_name \ 0 NaN Not a HOEPA loan 2 Secured by a first lien 1 NaN Not a HOEPA loan 2 Secured by a first lien 2 NaN Not a HOEPA loan 2 Secured by a subordinate lien 3 NaN Not a HOEPA loan 2 Not applicable 4 NaN Not a HOEPA loan 2 Secured by a first lien lien_status edit_status_name edit_status sequence_number population \ 0 1 NaN NaN 13711 7424.0 1 1 NaN NaN 10815 4302.0 2 2 NaN NaN 13712 7424.0 3 4 NaN NaN 174127 4497.0 4 1 NaN NaN 10823 6421.0 minority_population hud_median_family_income tract_to_msamd_income \ 0 2.25 52800.0 91.790001 1 14.16 52800.0 94.230003 2 2.25 52800.0 91.790001 3 11.12 93800.0 97.110001 4 2.59 52800.0 93.949997 number_of_owner_occupied_units number_of_1_to_4_family_units \ 0 2259.0 3030.0 1 1033.0 1823.0 2 2259.0 3030.0 3 1341.0 1459.0 4 1926.0 2669.0 application_date_indicator 0 0 1 0 2 0 3 2 4 0
# Force to dislplay all rows
pd.set_option('display.max_rows', None)
# Data types
print(df.dtypes)
as_of_year int64 respondent_id object agency_name object agency_abbr object agency_code int64 loan_type_name object loan_type int64 property_type_name object property_type int64 loan_purpose_name object loan_purpose int64 owner_occupancy_name object owner_occupancy int64 loan_amount_000s int64 preapproval_name object preapproval int64 action_taken_name object action_taken int64 msamd_name object msamd float64 state_name object state_abbr object state_code int64 county_name object county_code float64 census_tract_number float64 applicant_ethnicity_name object applicant_ethnicity int64 co_applicant_ethnicity_name object co_applicant_ethnicity int64 applicant_race_name_1 object applicant_race_1 int64 applicant_race_name_2 object applicant_race_2 float64 applicant_race_name_3 object applicant_race_3 float64 applicant_race_name_4 object applicant_race_4 float64 applicant_race_name_5 object applicant_race_5 float64 co_applicant_race_name_1 object co_applicant_race_1 int64 co_applicant_race_name_2 object co_applicant_race_2 float64 co_applicant_race_name_3 object co_applicant_race_3 float64 co_applicant_race_name_4 object co_applicant_race_4 float64 co_applicant_race_name_5 object co_applicant_race_5 float64 applicant_sex_name object applicant_sex int64 co_applicant_sex_name object co_applicant_sex int64 applicant_income_000s float64 purchaser_type_name object purchaser_type int64 denial_reason_name_1 object denial_reason_1 float64 denial_reason_name_2 object denial_reason_2 float64 denial_reason_name_3 object denial_reason_3 float64 rate_spread float64 hoepa_status_name object hoepa_status int64 lien_status_name object lien_status int64 edit_status_name object edit_status float64 sequence_number int64 population float64 minority_population float64 hud_median_family_income float64 tract_to_msamd_income float64 number_of_owner_occupied_units float64 number_of_1_to_4_family_units float64 application_date_indicator int64 dtype: object
# Null values
print(df.isnull().sum())
as_of_year 0 respondent_id 0 agency_name 0 agency_abbr 0 agency_code 0 loan_type_name 0 loan_type 0 property_type_name 0 property_type 0 loan_purpose_name 0 loan_purpose 0 owner_occupancy_name 0 owner_occupancy 0 loan_amount_000s 0 preapproval_name 0 preapproval 0 action_taken_name 0 action_taken 0 msamd_name 92920 msamd 92920 state_name 0 state_abbr 0 state_code 0 county_name 1007 county_code 1007 census_tract_number 1370 applicant_ethnicity_name 0 applicant_ethnicity 0 co_applicant_ethnicity_name 0 co_applicant_ethnicity 0 applicant_race_name_1 0 applicant_race_1 0 applicant_race_name_2 1006430 applicant_race_2 1006430 applicant_race_name_3 1009259 applicant_race_3 1009259 applicant_race_name_4 1009369 applicant_race_4 1009369 applicant_race_name_5 1009386 applicant_race_5 1009386 co_applicant_race_name_1 0 co_applicant_race_1 0 co_applicant_race_name_2 1008514 co_applicant_race_2 1008514 co_applicant_race_name_3 1009393 co_applicant_race_3 1009393 co_applicant_race_name_4 1009417 co_applicant_race_4 1009417 co_applicant_race_name_5 1009427 co_applicant_race_5 1009427 applicant_sex_name 0 applicant_sex 0 co_applicant_sex_name 0 co_applicant_sex 0 applicant_income_000s 88875 purchaser_type_name 0 purchaser_type 0 denial_reason_name_1 828980 denial_reason_1 828980 denial_reason_name_2 971378 denial_reason_2 971378 denial_reason_name_3 1002497 denial_reason_3 1002497 rate_spread 939632 hoepa_status_name 0 hoepa_status 0 lien_status_name 0 lien_status 0 edit_status_name 819241 edit_status 819241 sequence_number 0 population 1466 minority_population 1477 hud_median_family_income 1418 tract_to_msamd_income 1745 number_of_owner_occupied_units 3093 number_of_1_to_4_family_units 2965 application_date_indicator 0 dtype: int64
# Unique values for categorical variables
categorical_columns = df.select_dtypes(include=['object']).columns.drop("respondent_id")
for col in categorical_columns:
print(f"Value counts for: {col}")
print(df[col].value_counts(), "\n")
print("\n" + "=+"*50 + "\n")
# Explanation:
# - df.select_dtypes(include=['object']).columns: This part selects all columns from the dataframe that is 'object' (basically not numerical data)
# - .drop("respondent_id") is used to remove that columns as is not relevant
# - for col in categorical_columns: goes through each column name, and...
# - print(f"Value counts for: {col}"): prints the column name, and...
# - print(df[col].value_counts(), "\n"); prints the number of times each unique value appears in a column
Value counts for: agency_name agency_name Office of the Comptroller of the Currency 363756 Department of Housing and Urban Development 214451 Office of Thrift Supervision 199948 Federal Reserve System 173886 National Credit Union Administration 34779 Federal Deposit Insurance Corporation 22631 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: agency_abbr agency_abbr OCC 363756 HUD 214451 OTS 199948 FRS 173886 NCUA 34779 FDIC 22631 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: loan_type_name loan_type_name Conventional 966120 FHA-insured 39128 VA-guaranteed 3573 FSA/RHS-guaranteed 630 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: property_type_name property_type_name One-to-four family dwelling (other than manufactured housing) 990336 Manufactured housing 13427 Multifamily dwelling 5688 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: loan_purpose_name loan_purpose_name Refinancing 490921 Home purchase 393021 Home improvement 125509 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: owner_occupancy_name owner_occupancy_name Owner-occupied as a principal dwelling 930351 Not owner-occupied as a principal dwelling 73039 Not applicable 6061 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: preapproval_name preapproval_name Not applicable 841011 Preapproval was not requested 148534 Preapproval was requested 19906 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: action_taken_name action_taken_name Loan originated 398639 Application denied by financial institution 265860 Loan purchased by the institution 125325 Application withdrawn by applicant 98667 Application approved but not accepted 88070 File closed for incompleteness 32846 Preapproval request denied by financial institution 39 Preapproval request approved but not accepted 5 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: msamd_name msamd_name New York, White Plains, Wayne - NY, NJ 373622 Nassau, Suffolk - NY 205507 Rochester - NY 61609 Albany, Schenectady, Troy - NY 61400 Buffalo, Niagara Falls - NY 58655 Poughkeepsie, Newburgh, Middletown - NY 51484 Syracuse - NY 40105 Utica, Rome - NY 17609 Binghamton - NY 13557 Kingston - NY 12881 Glens Falls - NY 11108 Elmira - NY 5087 Ithaca - NY 3907 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: state_name state_name New York 1009451 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: state_abbr state_abbr NY 1009451 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: county_name county_name Suffolk County 119892 Queens County 106680 Nassau County 85631 Kings County 85527 Westchester County 49237 Erie County 47096 Monroe County 41702 New York County 41171 Bronx County 34588 Richmond County 32395 Orange County 29681 Onondaga County 28400 Dutchess County 21820 Albany County 18913 Saratoga County 16946 Rockland County 16734 Oneida County 13488 Ulster County 13010 Schenectady County 12071 Niagara County 11564 Rensselaer County 11077 Broome County 10491 Ontario County 7473 Putnam County 7351 Oswego County 7215 Jefferson County 6461 Sullivan County 6357 Chautauqua County 6328 Wayne County 6204 Warren County 5702 Washington County 5419 Chemung County 5088 Steuben County 5076 Cayuga County 4923 Clinton County 4829 Madison County 4494 Fulton County 4355 St. Lawrence County 4353 Greene County 4315 Columbia County 4201 Herkimer County 4124 Cattaraugus County 4105 Otsego County 4051 Tompkins County 3910 Livingston County 3628 Montgomery County 3538 Delaware County 3189 Tioga County 3069 Genesee County 3034 Chenango County 2935 Essex County 2904 Franklin County 2746 Orleans County 2637 Cortland County 2615 Schoharie County 2410 Allegany County 2285 Wyoming County 2169 Seneca County 2021 Yates County 1535 Lewis County 1518 Schuyler County 1285 Hamilton County 478 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: applicant_ethnicity_name applicant_ethnicity_name Not Hispanic or Latino 701712 Information not provided by applicant in mail, Internet, or telephone application 163345 Hispanic or Latino 83748 Not applicable 60646 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: co_applicant_ethnicity_name co_applicant_ethnicity_name No co-applicant 601274 Not Hispanic or Latino 269823 Information not provided by applicant in mail, Internet, or telephone application 66060 Not applicable 47780 Hispanic or Latino 24514 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: applicant_race_name_1 applicant_race_name_1 White 610172 Information not provided by applicant in mail, Internet, or telephone application 171080 Black or African American 110814 Not applicable 60199 Asian 46654 American Indian or Alaska Native 5342 Native Hawaiian or Other Pacific Islander 5190 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: applicant_race_name_2 applicant_race_name_2 White 2051 Black or African American 422 Native Hawaiian or Other Pacific Islander 261 Asian 173 American Indian or Alaska Native 114 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: applicant_race_name_3 applicant_race_name_3 White 82 Black or African American 76 Native Hawaiian or Other Pacific Islander 18 American Indian or Alaska Native 13 Asian 3 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: applicant_race_name_4 applicant_race_name_4 Native Hawaiian or Other Pacific Islander 58 White 17 Asian 7 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: applicant_race_name_5 applicant_race_name_5 White 56 Native Hawaiian or Other Pacific Islander 5 Asian 2 American Indian or Alaska Native 2 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: co_applicant_race_name_1 co_applicant_race_name_1 No co-applicant 601274 White 248732 Information not provided by applicant in mail, Internet, or telephone application 68896 Not applicable 46507 Black or African American 27264 Asian 13513 American Indian or Alaska Native 1652 Native Hawaiian or Other Pacific Islander 1613 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: co_applicant_race_name_2 co_applicant_race_name_2 White 665 Black or African American 101 Native Hawaiian or Other Pacific Islander 82 Asian 56 American Indian or Alaska Native 33 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: co_applicant_race_name_3 co_applicant_race_name_3 White 23 Black or African American 22 Native Hawaiian or Other Pacific Islander 10 American Indian or Alaska Native 3 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: co_applicant_race_name_4 co_applicant_race_name_4 Native Hawaiian or Other Pacific Islander 14 White 12 Asian 5 Black or African American 3 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: co_applicant_race_name_5 co_applicant_race_name_5 White 15 Asian 4 Native Hawaiian or Other Pacific Islander 3 American Indian or Alaska Native 2 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: applicant_sex_name applicant_sex_name Male 548444 Female 313848 Information not provided by applicant in mail, Internet, or telephone application 86910 Not applicable 60249 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: co_applicant_sex_name co_applicant_sex_name No co-applicant 601274 Female 235906 Male 86243 Not applicable 47369 Information not provided by applicant in mail, Internet, or telephone application 38659 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: purchaser_type_name purchaser_type_name Loan was not originated or was not sold in calendar year covered by register 690764 Affiliate institution 79082 Fannie Mae (FNMA) 75053 Freddie Mac (FHLMC) 47531 Other type of purchaser 34667 Life insurance company, credit union, mortgage bank, or finance company 24912 Private securitization 21975 Commercial bank, savings bank or savings association 20277 Ginnie Mae (GNMA) 15139 Farmer Mac (FAMC) 51 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: denial_reason_name_1 denial_reason_name_1 Credit history 42911 Debt-to-income ratio 31689 Credit application incomplete 30550 Collateral 29969 Other 29472 Unverifiable information 11298 Insufficient cash (downpayment, closing costs) 2672 Employment history 1748 Mortgage insurance denied 162 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: denial_reason_name_2 denial_reason_name_2 Other 10141 Credit history 8133 Debt-to-income ratio 6406 Collateral 4863 Unverifiable information 2918 Credit application incomplete 2445 Insufficient cash (downpayment, closing costs) 2314 Employment history 716 Mortgage insurance denied 137 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: denial_reason_name_3 denial_reason_name_3 Other 2508 Credit history 1007 Collateral 977 Unverifiable information 624 Debt-to-income ratio 579 Insufficient cash (downpayment, closing costs) 567 Credit application incomplete 535 Employment history 103 Mortgage insurance denied 54 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: hoepa_status_name hoepa_status_name Not a HOEPA loan 1009337 HOEPA loan 114 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: lien_status_name lien_status_name Secured by a first lien 694718 Secured by a subordinate lien 159095 Not applicable 125325 Not secured by a lien 30313 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: edit_status_name edit_status_name Quality edit failure only 190210 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
Step 3: Data cleaning¶
A) Excluding irrelevant records from columns¶
# Records/columns
df.shape
(1009451, 78)
# Quality issues: There are 190210 records with quality issues. We will ommit them.
df = df[~df["edit_status_name"].isin(["Quality edit failure only"])]
# Our focus is on individuals hence we will focus on data with "No co-applicant"
df = df[df['co_applicant_ethnicity_name'] == 'No co-applicant']
# Here we have "excluding" and "selecting", Explanation:
# - With df = df[~df["edit_status_name"].isin(["Quality edit failure only"])] We are saying, exclude "Quality edit failure only" from the column "edit_status_name".
# - With df = df[df["co_appliant_ethnicity_name"] == "No co-applicant"] We are saying, keep only records that in the column "co_applicant_ethnicity_name" is "No co-applicant"
# Records/shape
df.shape
(472701, 78)
Note that around half the applications are without co-applicant.
B) Excluding irrelevant records from the following columns:¶
- applicant_ethnicity_name
- applicant_race_name_1
- applicant_sex_name
- action_taken_name
df = df[
~df["action_taken_name"].isin([
"File closed for incompleteness",
"Preapproval request denied by financial institution",
"Preapproval request approved but not accepted",
"Application withdrawn by applicant"
]) &
~df["applicant_ethnicity_name"].isin([
"Information not provided by applicant in mail, Internet, or telephone application",
"Not applicable"
]) &
~df["applicant_race_name_1"].isin([
"Information not provided by applicant in mail, Internet, or telephone application",
"Not applicable"
]) &
~df["applicant_sex_name"].isin([
"Information not provided by applicant in mail, Internet, or telephone application",
"Not applicable"
])
]
# Data validation
print(df.shape)
(323044, 78)
C) Addressing missing values¶
# Calculating the missing valuesand their percentage
total_rows = len(df)
# Missing value counting and missing value %
missing_counts = df.isnull().sum()
missing_percentage = (missing_counts / total_rows) * 100
# We create the header
print(f"{'Column': <30}{'Missing Count':<15}{'Missing %'}")
print("-" * 60)
# For each column we print the missing values in absolute terms and in %
for col in df.columns:
if missing_counts[col]>0:
print(f"{col:<30}{missing_counts[col]:<15}{missing_percentage[col]:.2f}%")
# Explanation:
# - {'Column': <30}; The <30 align the text on the left and allocate up to 30 characters in the column
# - Other options are; >30 Same, but aligns text on the right as well as ^30 which aligns text on the center.
# - f"{}{}{}"; it is called f-string(formatted string literal), used to put together variables or expressions one after another like concatenation:
# "The value is " + str(value) + " and the total is " + str(total)
Column Missing Count Missing % ------------------------------------------------------------ msamd_name 34069 10.55% msamd 34069 10.55% county_name 357 0.11% county_code 357 0.11% census_tract_number 460 0.14% applicant_race_name_2 321711 99.59% applicant_race_2 321711 99.59% applicant_race_name_3 322955 99.97% applicant_race_3 322955 99.97% applicant_race_name_4 323019 99.99% applicant_race_4 323019 99.99% applicant_race_name_5 323027 99.99% applicant_race_5 323027 99.99% co_applicant_race_name_2 323044 100.00% co_applicant_race_2 323044 100.00% co_applicant_race_name_3 323044 100.00% co_applicant_race_3 323044 100.00% co_applicant_race_name_4 323044 100.00% co_applicant_race_4 323044 100.00% co_applicant_race_name_5 323044 100.00% co_applicant_race_5 323044 100.00% applicant_income_000s 2719 0.84% denial_reason_name_1 251827 77.95% denial_reason_1 251827 77.95% denial_reason_name_2 308294 95.43% denial_reason_2 308294 95.43% denial_reason_name_3 320397 99.18% denial_reason_3 320397 99.18% rate_spread 295313 91.42% edit_status_name 323044 100.00% edit_status 323044 100.00% population 507 0.16% minority_population 511 0.16% hud_median_family_income 484 0.15% tract_to_msamd_income 585 0.18% number_of_owner_occupied_units1092 0.34% number_of_1_to_4_family_units 967 0.30%
Filling and dropping¶
# Filling out empty cells:
df = df.assign(
msamd_name=df['msamd_name'].fillna("Unknown"),
denial_reason_name_1=df['denial_reason_name_1'].fillna("Unknown"),
rate_spread=df['rate_spread'].fillna(0)
)
# Dropping empty values from the following columns:
df = df.dropna(subset=['county_name',
'census_tract_number',
'applicant_income_000s',
'minority_population',
'hud_median_family_income',
'tract_to_msamd_income',
'number_of_owner_occupied_units',
'number_of_1_to_4_family_units'
]
)
# Data validation
print(df.isnull().sum())
as_of_year 0 respondent_id 0 agency_name 0 agency_abbr 0 agency_code 0 loan_type_name 0 loan_type 0 property_type_name 0 property_type 0 loan_purpose_name 0 loan_purpose 0 owner_occupancy_name 0 owner_occupancy 0 loan_amount_000s 0 preapproval_name 0 preapproval 0 action_taken_name 0 action_taken 0 msamd_name 0 msamd 33468 state_name 0 state_abbr 0 state_code 0 county_name 0 county_code 0 census_tract_number 0 applicant_ethnicity_name 0 applicant_ethnicity 0 co_applicant_ethnicity_name 0 co_applicant_ethnicity 0 applicant_race_name_1 0 applicant_race_1 0 applicant_race_name_2 317520 applicant_race_2 317520 applicant_race_name_3 318752 applicant_race_3 318752 applicant_race_name_4 318816 applicant_race_4 318816 applicant_race_name_5 318824 applicant_race_5 318824 co_applicant_race_name_1 0 co_applicant_race_1 0 co_applicant_race_name_2 318841 co_applicant_race_2 318841 co_applicant_race_name_3 318841 co_applicant_race_3 318841 co_applicant_race_name_4 318841 co_applicant_race_4 318841 co_applicant_race_name_5 318841 co_applicant_race_5 318841 applicant_sex_name 0 applicant_sex 0 co_applicant_sex_name 0 co_applicant_sex 0 applicant_income_000s 0 purchaser_type_name 0 purchaser_type 0 denial_reason_name_1 0 denial_reason_1 248033 denial_reason_name_2 304184 denial_reason_2 304184 denial_reason_name_3 316221 denial_reason_3 316221 rate_spread 0 hoepa_status_name 0 hoepa_status 0 lien_status_name 0 lien_status 0 edit_status_name 318841 edit_status 318841 sequence_number 0 population 0 minority_population 0 hud_median_family_income 0 tract_to_msamd_income 0 number_of_owner_occupied_units 0 number_of_1_to_4_family_units 0 application_date_indicator 0 dtype: int64
D) Feature Engineering¶
Feature creation¶
# Creation of ethnicity_race_sex column:
df['ethnicity_race_sex'] = df['applicant_ethnicity_name'].str.lower() + "_" + df['applicant_race_name_1'].str.lower() + "_" + df['applicant_sex_name'].str.lower()
# Checking column created
print(df[['ethnicity_race_sex']].value_counts())
ethnicity_race_sex not hispanic or latino_white_male 126540 not hispanic or latino_white_female 84557 not hispanic or latino_black or african american_female 27181 not hispanic or latino_black or african american_male 22053 hispanic or latino_white_male 17879 not hispanic or latino_asian_male 13608 hispanic or latino_white_female 11132 not hispanic or latino_asian_female 8975 not hispanic or latino_american indian or alaska native_male 1116 hispanic or latino_black or african american_male 937 not hispanic or latino_native hawaiian or other pacific islander_male 922 hispanic or latino_black or african american_female 873 not hispanic or latino_native hawaiian or other pacific islander_female 741 not hispanic or latino_american indian or alaska native_female 725 hispanic or latino_native hawaiian or other pacific islander_male 436 hispanic or latino_american indian or alaska native_male 385 hispanic or latino_native hawaiian or other pacific islander_female 242 hispanic or latino_american indian or alaska native_female 241 hispanic or latino_asian_male 175 hispanic or latino_asian_female 123 Name: count, dtype: int64
Note about above results: note how the first 4 columns are labeled as "not hispanic or latino" whereas the last 4 columns are "hispanic or latino" group linked to the smallest community.
# Creation of loan_to_income_ratio column:
df['loan_to_income_ratio'] = (df['loan_amount_000s'] / df['applicant_income_000s']).round(2)
# Checking
print(df['loan_to_income_ratio'].head())
# Note: Figures below represent the times of their annual income.
# E.g. 3.10, means that the loan amount is 3.10 times their annual income.
0 3.38 2 0.85 5 3.07 6 1.76 7 1.33 Name: loan_to_income_ratio, dtype: float64
Feature Transformation; Mapping¶
print(df["action_taken_name"].value_counts())
action_taken_name Loan originated 148863 Application denied by financial institution 108294 Application approved but not accepted 31152 Loan purchased by the institution 30532 Name: count, dtype: int64
# Mapping "action_taken_name" into a binary feature approved-denied
atn_map = {
"Loan originated" : 'approved',
"Application approved but not accepted" : 'approved',
"Loan purchased by the institution" : 'approved',
"Application denied by financial institution" : 'denied'
}
df["action_taken_name"] = df["action_taken_name"].map(atn_map)
print(df["action_taken_name"].value_counts())
action_taken_name approved 210547 denied 108294 Name: count, dtype: int64
# Mapping "ethnicity_race_sex" for better readitability:
ers_map = {
"not hispanic or latino_white_male": "NH_White_M",
"not hispanic or latino_white_female": "NH_White_F",
"not hispanic or latino_black or african american_female": "NH_Black_F",
"not hispanic or latino_black or african american_male": "NH_Black_M",
"hispanic or latino_white_male": "H_White_M",
"not hispanic or latino_asian_male": "NH_Asian_M",
"hispanic or latino_white_female": "H_White_F",
"not hispanic or latino_asian_female": "NH_Asian_F",
"not hispanic or latino_american indian or alaska native_male": "NH_Indigenous_M",
"hispanic or latino_black or african american_male": "H_Black_M",
"not hispanic or latino_native hawaiian or other pacific islander_male": "NH_PacificIslander_M",
"hispanic or latino_black or african american_female": "H_Black_F",
"not hispanic or latino_native hawaiian or other pacific islander_female": "NH_PacificIslander_F",
"not hispanic or latino_american indian or alaska native_female": "NH_Indigenous_F",
"hispanic or latino_native hawaiian or other pacific islander_male": "H_PacificIslander_M",
"hispanic or latino_american indian or alaska native_male": "H_Indigenous_M",
"hispanic or latino_native hawaiian or other pacific islander_female": "H_PacificIslander_F",
"hispanic or latino_american indian or alaska native_female": "H_Indigenous_F",
"hispanic or latino_asian_male": "H_Asian_M",
"hispanic or latino_asian_female": "H_Asian_F"
}
df["ethnicity_race_sex"] = df["ethnicity_race_sex"].map(ers_map)
print(df.shape, '\n', df["ethnicity_race_sex"].value_counts())
(318841, 80) ethnicity_race_sex NH_White_M 126540 NH_White_F 84557 NH_Black_F 27181 NH_Black_M 22053 H_White_M 17879 NH_Asian_M 13608 H_White_F 11132 NH_Asian_F 8975 NH_Indigenous_M 1116 H_Black_M 937 NH_PacificIslander_M 922 H_Black_F 873 NH_PacificIslander_F 741 NH_Indigenous_F 725 H_PacificIslander_M 436 H_Indigenous_M 385 H_PacificIslander_F 242 H_Indigenous_F 241 H_Asian_M 175 H_Asian_F 123 Name: count, dtype: int64
Step 4: Exploratory Data Analysis¶
A) Initial questions¶
A.1 What is the proportion of denials to approvals for each group in the column "ethnicity_race_sex"?¶
# Calculating and plotting the distribution of action_taken for each ethnicity_race_sex group
action_distribution = df.groupby(['ethnicity_race_sex', 'action_taken_name']).size().unstack(fill_value=0)
action_distribution_pct = action_distribution.div(action_distribution.sum(axis=1), axis=0)
# Sorting the data by 'approved' column
sorted_index = action_distribution_pct['approved'].sort_values(ascending=True).index
action_distribution_pct_sorted = action_distribution_pct.loc[sorted_index]
###########################################################
# Explanation: action_distribution = df.groupby(['ethnicity_race_sex', 'action_taken_name']).size().unstack(fill_value=0)
# - df.groupby(['ethnicity_race_sex', 'action_taken_name']); Groups data by unique combination of intersectional group and action taken,
# in other words, it groups the data like follows: NH_White_M-Approved; NH_White_M-Denied; .... and so on for each intersectional group.
# - .size(); counts the number of rows in each group (like count() but without being specific about the columns)
# - .unstack(); converts the second grouping level ("action_taken_name") into columns so rows will be given by each intersectional group, and the columns by each action_taken_name (approved, denied)
# (fill_value=0) is used to ensure each intersectional group has a value, because it can happen that an intersectional group doesn't have any approved loans, for instance.
# As result we gest a df that looks like follows:
# ethnicity_race_sex Approved Denied
# NH_White_M 1234 645
# ... rest of the data.
########################################################
# Explanation: action_distribution_pct = action_distribution.div(action_distribution.sum(axis=1), axis=0)
# - .div(argument1, argument2); argument1 is what to divide by, argument2 controls how the division is applied:
# axis=0 → divide each row by a value per row (sum of all values in that row for the group),
# axis=1 → divide each column by a value per column.
# - Therefore, argument1 is action_distribution.sum(axis=1), which represents the total count of actions for each group just like before:
# # ethnicity_race_sex Approved Denied
# # NH_White_M 1234 645
# # ... rest of the data.
# - argument2, axis=0; for each intersectional group, we sum Approved + Denied. That total is used to divide **each value in the row** by.
########################################################
# Explanation: sorted_index = action_distribution_pct['approved'].sort_values(ascending=True).index
# .sort_values() --> sorts the values in 'approved' column.
# .index --> gets the row labels in that sorted order.
#
# action_distribution_pct_sorted = action_distribution_pct.loc[sorted_index]:
# .loc[sorted_index]: Rearranges the rows of action_distribution_pct to match the sorted order from sorted_index.
# Displaying
plt.figure(figsize=(20, 10))
action_distribution_pct_sorted.plot(kind='barh', stacked=True) # Stacked is used for stacking the bars on top of each other.
plt.title('Applications approved - denied')
plt.xlabel('Proportion')
plt.ylabel('Ethnicity-Race-Sex')
plt.legend(title='Action Taken', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.subplots_adjust(right=0.8)
plt.show()
#https://stackoverflow.com/questions/25068384/bbox-to-anchor-and-loc-in-matplotlib
#https://www.geeksforgeeks.org/python-pandas-dataframe-div/
<Figure size 2000x1000 with 0 Axes>
A.2 What is the approval rate for each group?¶
# Same information as above but in percentage terms.
df['approved'] = (df['action_taken_name'] == 'approved').astype(int)
approval_rates = df.groupby('ethnicity_race_sex')['approved'].mean().sort_values(ascending=True)
plt.figure(figsize=(20, 10))
ax = approval_rates.plot(kind='barh')
plt.title('Approval Rates by Ethnicity-Race-Sex Intersections')
plt.xlabel('Approval Rate')
plt.ylabel('Ethnicity-Race-Sex')
plt.xlim(0, 1) # Set x-axis limits from 0 to 1 for percentage
# Adding percentage labels
for i, v in enumerate(approval_rates):
ax.text(v, i, f' {v:.1%}', va='center')
plt.tight_layout()
plt.show()
#https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.text.html
#https://github.com/DJ4seasons/Python_Basic4beginners
Chart above: It is kind of surprising that both male and female Asians have the highest approval rate. On the other hand, having the smallest communities with the lowest approved rate it somehow expected.
A.3 What is the denial rate for each group?¶
# Same information as above but in percentage terms.
df['denied'] = (df['action_taken_name'] == 'denied').astype(int)
denied_rates = df.groupby('ethnicity_race_sex')['denied'].mean().sort_values(ascending=True)
plt.figure(figsize=(20, 10))
ax = denied_rates.plot(kind='barh')
plt.title('Denied Rates by Ethnicity-Race-Sex Intersections')
plt.xlabel('Denied Rate')
plt.ylabel('Ethnicity-Race-Sex')
plt.xlim(0, 1) # Set x-axis limits from 0 to 1 for percentage
# Adding percentage labels
for i, v in enumerate(denied_rates):
ax.text(v, i, f' {v:.1%}', va='center')
plt.tight_layout()
plt.show()
#https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.text.html
#https://github.com/DJ4seasons/Python_Basic4beginners
A.4 What percentage of total applications does each ethnicity_race_sex group represent?¶
# Total number of applications (approved + denied) for each ethnicity_race_sex group
total_applications = df.groupby('ethnicity_race_sex')['action_taken_name'].count()
# % distribution of total applications
total_applications_pct = (total_applications / total_applications.sum()) * 100
# Sorting data
total_applications_pct_sorted = total_applications_pct.sort_values(ascending=True)
# Plotting results
plt.figure(figsize=(20, 10))
ax = total_applications_pct_sorted.plot(kind='barh', color='firebrick')
# Labels
plt.title('Distribution of Loan Applications by Ethnicity-Race-Sex')
plt.xlabel('Percentage of Total Applications')
plt.ylabel('Ethnicity-Race-Sex')
# % on bars
for i, v in enumerate(total_applications_pct_sorted):
ax.text(v + 0.5, i, f'{v:.1f}%', va='center')
plt.tight_layout()
plt.show()
From the above chart we already can appreciate a link between having little data and the rate of approved/denied rate. A closer look at the smaller communities is shown below.
Excluding the two largest groups¶
# Filtering out "not hispanic or latino_white_female" and "not hispanic or latino_white_male"
total_applications = df.groupby('ethnicity_race_sex')['action_taken_name'].count()
# Total number of applications (approved + denied) for each ethnicity_race_sex group
total_applications_pct = (total_applications / total_applications.sum()) * 100
# Filtering out "not hispanic or latino_white_female" and "not hispanic or latino_white_male" from visualization
visualization_pct = total_applications_pct.drop(['NH_White_M', 'NH_White_F'])
# Sortting data
visualization_pct_sorted = visualization_pct.sort_values(ascending=True)
# Plotting
plt.figure(figsize=(20, 10))
ax = visualization_pct_sorted.plot(kind='barh', color='indianred')
# Labels
plt.title('Distribution of Loan Applications by Ethnicity-Race-Sex (Excluding NH_White_M, NH_White_F')
plt.xlabel('Percentage of Total Applications')
plt.ylabel('Ethnicity-Race-Sex')
# % for each bar
for i, v in enumerate(visualization_pct_sorted):
ax.text(v + 0.1, i, f'{v:.1f}%', va='center')
plt.tight_layout()
plt.show()
A.5 What is the ratio of applications submitted by men compared to women?¶
# Counting the number of applications for men and women
male_count = df[df['applicant_sex_name'] == 'Male'].shape[0]
female_count = df[df['applicant_sex_name'] == 'Female'].shape[0]
# Calculating the ratio of men to women
if female_count == 0: # Avoid division by zero
ratio = None
else:
ratio = female_count / male_count
# Displaying the results
print(f"Number of applications by men: {male_count}")
print(f"Number of applications by women: {female_count}")
print(f"Ratio of applications (Men vs Women): {ratio:.2f}")
Number of applications by men: 184051 Number of applications by women: 134790 Ratio of applications (Men vs Women): 0.73
A.6 What is the average loan amount requested by men compared to women?¶
# Filtering the data for men and women
male_loans = df[df['applicant_sex_name'] == 'Male']['loan_amount_000s']
female_loans = df[df['applicant_sex_name'] == 'Female']['loan_amount_000s']
# Calculating the average loan amount for men and women
average_male_loan = male_loans.mean()
average_female_loan = female_loans.mean()
# Displaying the results
print(f"Average loan amount requested by men: ${average_male_loan:.2f} (in thousands)")
print(f"Average loan amount requested by women: ${average_female_loan:.2f} (in thousands)")
Average loan amount requested by men: $233.62 (in thousands) Average loan amount requested by women: $207.58 (in thousands)
A.7 What is the average loan amount requested by each group?¶
# Grouping by 'ethnicity_race_sex' and calculate average loan amount
average_loan_by_group = df.groupby('ethnicity_race_sex')['loan_amount_000s'].mean()
# Sorting by average loan amount if desired (optional)
average_loan_by_group_sorted = average_loan_by_group.sort_values(ascending=False)
# Displaying all results in a loop
print(f"{'Group':<25} {'Average Loan Amount (K)':>25}")
print("-" * 50)
for group, avg_loan in average_loan_by_group_sorted.items():
print(f"{group:<25} ${avg_loan:>20.2f}")
Group Average Loan Amount (K) -------------------------------------------------- NH_Asian_M $ 329.28 NH_Asian_F $ 315.21 NH_PacificIslander_M $ 288.41 H_White_M $ 286.87 NH_Black_M $ 279.67 H_Asian_M $ 271.50 NH_PacificIslander_F $ 268.19 H_White_F $ 267.04 NH_Black_F $ 254.55 H_Black_M $ 252.19 H_PacificIslander_M $ 249.71 H_Black_F $ 248.37 H_Asian_F $ 235.73 NH_Indigenous_M $ 223.69 NH_White_M $ 207.31 H_PacificIslander_F $ 206.81 H_Indigenous_M $ 204.21 NH_Indigenous_F $ 185.54 H_Indigenous_F $ 185.15 NH_White_F $ 172.50
A.8 What is the average income for each ethnicity_race_sex group?¶
# Groping the data by 'ethnicity_race_sex' and calculate the average income
average_income_by_ethnicity_race_sex = df.groupby('ethnicity_race_sex')['applicant_income_000s'].mean()
# Sorting values
average_income_sorted = average_income_by_ethnicity_race_sex.sort_values(ascending=False)
# Displaying the results
print("Average Applicant Income by Ethnicity, Race, and Sex:")
print(average_income_sorted)
Average Applicant Income by Ethnicity, Race, and Sex: ethnicity_race_sex NH_Asian_M 136.616035 H_Asian_M 135.480000 NH_PacificIslander_M 125.430586 NH_Asian_F 124.184847 NH_Indigenous_M 114.870072 H_White_M 114.076235 NH_White_M 113.654220 NH_Black_M 110.438897 NH_PacificIslander_F 106.105263 H_Black_M 104.784418 H_White_F 100.674003 H_PacificIslander_M 99.802752 H_Asian_F 97.642276 H_Indigenous_M 93.327273 H_Black_F 92.533792 NH_Black_F 89.674221 NH_Indigenous_F 83.120000 H_Indigenous_F 82.941909 NH_White_F 82.184455 H_PacificIslander_F 76.260331 Name: applicant_income_000s, dtype: float64
A.9 What is the average loan-to-income ratio for each group?¶
df['loan_to_income_ratio'] = df['loan_amount_000s'] / df['applicant_income_000s']
avg_ratio_by_group = df.groupby('ethnicity_race_sex')['loan_to_income_ratio'].mean()
avg_ratio_by_group_sorted = avg_ratio_by_group.sort_values(ascending=False)
print(f"{'Group':<25} {'Average Loan-to-Income Ratio'}")
print("-" * 50)
for group, avg_ratio in avg_ratio_by_group_sorted.items():
print(f"{group:<25} {avg_ratio:.2f}")
Group Average Loan-to-Income Ratio -------------------------------------------------- NH_PacificIslander_F 3.08 H_Black_F 3.05 NH_Black_F 3.04 H_White_F 2.90 NH_Black_M 2.89 NH_Asian_F 2.86 H_White_M 2.83 NH_Asian_M 2.79 H_Asian_F 2.79 NH_PacificIslander_M 2.72 H_PacificIslander_M 2.72 H_Black_M 2.70 H_Asian_M 2.59 H_PacificIslander_F 2.54 H_Indigenous_M 2.53 NH_Indigenous_F 2.40 H_Indigenous_F 2.31 NH_Indigenous_M 2.29 NH_White_F 2.25 NH_White_M 2.09
A.10 What is the average rate_spread for each group?¶
# Grouping the data by 'ethnicity_race_sex' and calculate the average rate spread
average_rate_spread_by_ethnicity_race_sex = df.groupby('ethnicity_race_sex')['rate_spread'].mean()
# Sorting results
average_rate_spread_sorted = average_rate_spread_by_ethnicity_race_sex.sort_values(ascending=False)
# Displaying the results
print("Average Rate Spread by Ethnicity, Race, and Sex:")
print(average_rate_spread_sorted)
Average Rate Spread by Ethnicity, Race, and Sex: ethnicity_race_sex H_Asian_M 0.695257 NH_Black_M 0.659511 NH_Black_F 0.644287 H_White_M 0.545123 NH_PacificIslander_M 0.535141 H_White_F 0.508944 H_Black_M 0.499509 NH_Indigenous_F 0.464952 NH_PacificIslander_F 0.440432 NH_White_M 0.421578 NH_White_F 0.415016 NH_Indigenous_M 0.391855 H_Asian_F 0.361545 H_Black_F 0.359129 H_PacificIslander_M 0.310642 H_Indigenous_M 0.291714 H_Indigenous_F 0.282241 NH_Asian_M 0.249730 NH_Asian_F 0.228814 H_PacificIslander_F 0.195248 Name: rate_spread, dtype: float64
A.11 Which group has the fewest zero values in the column "rate_spread", suggesting they are more frequently overcharged?¶
# Calculating the total record we have per group
total_records = df.groupby('ethnicity_race_sex').size()
# Counting records where rate_spread is NOT 0
non_zero_counts = df.groupby('ethnicity_race_sex')['rate_spread'].apply(lambda x: (x != 0).sum())
# Calculating percentage of non-zero records
non_zero_percentage = (non_zero_counts / total_records) * 100
# Sorting by non-zero percentage
non_zero_percentage_sorted = non_zero_percentage.sort_values(ascending=False)
# Displaying results
print(f"{'Group':<25} {'% Non-Zero'}")
print("-" * 40)
for group in non_zero_percentage_sorted.index:
print(f"{group:<25} {non_zero_percentage_sorted[group]:.2f}%")
Group % Non-Zero ---------------------------------------- H_Asian_M 12.57% NH_Black_M 12.14% NH_Black_F 12.08% H_White_M 10.43% NH_Indigenous_F 9.79% H_White_F 9.67% NH_PacificIslander_M 9.65% H_Black_M 9.18% NH_PacificIslander_F 8.50% NH_White_M 8.15% NH_White_F 8.06% NH_Indigenous_M 7.35% H_Black_F 6.99% H_Asian_F 6.50% H_PacificIslander_M 6.42% H_Indigenous_M 5.71% H_Indigenous_F 4.98% NH_Asian_M 4.78% NH_Asian_F 4.46% H_PacificIslander_F 3.72%
B) Statistics & Feature Engineering; Feature Selection¶
# Feature Engineering; feature selection
df = df[[
"loan_type_name",
"property_type_name",
"loan_purpose_name",
"loan_amount_000s",
"action_taken_name",
"msamd_name",
"census_tract_number",
"hoepa_status_name",
"applicant_income_000s",
"denial_reason_name_1",
"rate_spread",
"lien_status_name",
"hud_median_family_income",
"tract_to_msamd_income",
"minority_population",
"number_of_owner_occupied_units",
"ethnicity_race_sex",
'loan_to_income_ratio'
]]
# Statistics
df.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
loan_amount_000s | 318841.0 | 222.612211 | 211.328864 | 1.000000 | 69.000000 | 150.000000 | 342.000000 | 1998.000000 |
census_tract_number | 318841.0 | 1498.589303 | 2685.193183 | 1.000000 | 130.000000 | 354.000000 | 1237.020000 | 9929.000000 |
applicant_income_000s | 318841.0 | 103.666790 | 116.434086 | 10.000000 | 47.000000 | 76.000000 | 120.000000 | 1999.000000 |
rate_spread | 318841.0 | 0.452464 | 1.547799 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 13.810000 |
hud_median_family_income | 318841.0 | 65364.387579 | 13759.264157 | 50900.000000 | 59300.000000 | 59500.000000 | 66300.000000 | 93800.000000 |
tract_to_msamd_income | 318841.0 | 107.065156 | 45.470959 | 5.050000 | 82.199997 | 99.440002 | 121.019997 | 404.359985 |
minority_population | 318841.0 | 33.289470 | 34.088757 | 0.630000 | 6.100000 | 16.440001 | 59.490002 | 100.000000 |
number_of_owner_occupied_units | 318841.0 | 1079.190179 | 629.977193 | 2.000000 | 542.000000 | 1039.000000 | 1521.000000 | 4825.000000 |
loan_to_income_ratio | 318841.0 | 2.403550 | 1.852347 | 0.001515 | 1.034483 | 2.225000 | 3.381818 | 84.000000 |
C) Visualization of categorical and numerical data¶
C.1 Categorical data; histograms¶
categorical_columns = ['loan_type_name', 'loan_purpose_name', 'property_type_name', 'lien_status_name']
# Plotting categorical columns
for col in categorical_columns:
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x=col, hue='action_taken_name')
plt.title(f'Distribution of {col} against action_taken_name')
plt.xticks(rotation=45)
plt.show()
# horizontal barplot
def value_counts_bar(col, title):
vc = col.value_counts()
plt.figure(figsize=(6, len(vc.keys())/2))
plt.barh(vc.keys(), vc.values)
plt.title(title, fontsize=15)
plt.xticks(rotation='vertical', fontsize=10)
plt.show()
# https://q-shick.github.io/portfolio/proj_hmda.html by By Kyoosik Kim
value_counts_bar(df.action_taken_name, "Action Taken")
value_counts_bar(df.loan_purpose_name, "Loan Purpose")
C.2 Numercial Data:¶
Looking for outliers with boxplot¶
numerical_columns = ['loan_amount_000s', 'applicant_income_000s', 'minority_population',
'hud_median_family_income', 'tract_to_msamd_income']
# Plotting numerical columns
for col in numerical_columns:
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='action_taken_name', y=col, hue='action_taken_name')
plt.title(f'Distribution of {col} against action_taken_name')
plt.xticks(rotation=45)
plt.show()
#https://seaborn.pydata.org/generated/seaborn.countplot.html
#https://seaborn.pydata.org/generated/seaborn.boxplot.html
# Problems encountered: Boxplot didn't display different colours for denied/approved
## Solution: Forgot to add "hue" in the (boxplot loop) second loop.
### Sources: https://stackoverflow.com/questions/71459836/how-to-color-each-boxplot-differently#:~:text=artists%20.-,The%20recommended%20way%20to%20change%20the%20colors%20in%20Seaborn%20is,or%20a%20list%20of%20colors.)
Filtering out outliers¶
# Percentile bounds for applicant_income_000s
income_lower_bound = np.percentile(df['applicant_income_000s'], 1)
income_upper_bound = np.percentile(df['applicant_income_000s'], 99)
# Percentile bounds for loan_amount_000s
loan_lower_bound = np.percentile(df['loan_amount_000s'], 1)
loan_upper_bound = np.percentile(df['loan_amount_000s'], 99)
# Percentile bounds for hud_median_family_income
hud_income_lower_bound = np.percentile(df['hud_median_family_income'], 1)
hud_income_upper_bound = np.percentile(df['hud_median_family_income'], 99)
# Apply filtering to all three
df = df[
(df['applicant_income_000s'] >= income_lower_bound) &
(df['applicant_income_000s'] <= income_upper_bound) &
(df['loan_amount_000s'] >= loan_lower_bound) &
(df['loan_amount_000s'] <= loan_upper_bound) &
(df['hud_median_family_income'] >= hud_income_lower_bound) &
(df['hud_median_family_income'] <= hud_income_upper_bound)
]
Checking statistics again¶
# Statistics
df.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
loan_amount_000s | 308206.0 | 213.700256 | 181.900571 | 4.000000 | 70.000000 | 150.000000 | 336.000000 | 994.000000 |
census_tract_number | 308206.0 | 1494.728822 | 2671.093164 | 1.000000 | 131.000000 | 362.000000 | 1238.020000 | 9929.000000 |
applicant_income_000s | 308206.0 | 94.487473 | 71.337269 | 16.000000 | 48.000000 | 76.000000 | 118.000000 | 597.000000 |
rate_spread | 308206.0 | 0.461634 | 1.563350 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 13.810000 |
hud_median_family_income | 308206.0 | 65463.361518 | 13798.666392 | 50900.000000 | 59300.000000 | 59500.000000 | 66300.000000 | 93800.000000 |
tract_to_msamd_income | 308206.0 | 106.027853 | 43.125547 | 5.050000 | 82.070000 | 99.290001 | 120.610001 | 404.359985 |
minority_population | 308206.0 | 33.602562 | 34.252241 | 0.630000 | 6.140000 | 16.610001 | 60.650002 | 100.000000 |
number_of_owner_occupied_units | 308206.0 | 1079.279900 | 629.946044 | 2.000000 | 541.000000 | 1039.000000 | 1521.000000 | 4825.000000 |
loan_to_income_ratio | 308206.0 | 2.417657 | 1.726121 | 0.007937 | 1.085106 | 2.253704 | 3.397436 | 41.187500 |
Checking for Skewnewss with histograms¶
df['income_sqrt'] = np.sqrt(df['applicant_income_000s'])
df['loan_amount_sqrt'] = np.sqrt(df['loan_amount_000s'])
# Note about sqrt;
# - We are calculating the square root of each value in those columns.
# - creating two new columns with the transformed values
# - Why? To reduce the scale of the numbers for a better visualization.
# OTHER BENEFITS:
# - To make skewed data more normal,
# - Stabilize variance,
# - Reduce the influence of large numbers.
# ALTERNATIVELY;
# - We could use np.log() which is 'stronger', more agressive
# - BUT requieres all values to be >0.
# Code's source: https://q-shick.github.io/portfolio/proj_hmda.html
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for action in df['action_taken_name'].unique():
sns.kdeplot(
data=df[df['action_taken_name'] == action],
x='income_sqrt',
ax=axes[0],
label=action
)
sns.kdeplot(
data=df[df['action_taken_name'] == action],
x='loan_amount_sqrt',
ax=axes[1],
label=action
)
axes[0].set_title("Income (sqrt) by Loan Decision")
axes[1].set_title("Loan Amount (sqrt) by Loan Decision")
axes[1].legend(bbox_to_anchor=(1.05, 0.6))
plt.tight_layout()
plt.show()
# W
# Code's source: https://q-shick.github.io/portfolio/proj_hmda.html
# Split the data
dist_0 = df[df['action_taken_name'] == 'approved']['income_sqrt']
dist_1 = df[df['action_taken_name'] == 'denied']['income_sqrt']
# Run Welch’s t-test (assumes unequal variances)
stat, p_value = stats.ttest_ind(dist_0, dist_1, equal_var=False)
# Print results
print(f"Stat: {stat:.2f}, P-value: {p_value:}")
Stat: 25.60, P-value: 2.3018618003693452e-144
Step 5.1: Feature engineering¶
A) Feature creation