HMDA. Mortgage Data Analysis and Modeling Predictions.¶

The Dataset: https://www.consumerfinance.gov/data-research/hmda/¶

The Home Mortgage Disclosure Act (HMDA) requires many financial institutions to maintain, report, and publicly disclose loan-level information about mortgages. These data help show whether lenders are serving the housing needs of their communities; they give public officials information that helps them make decisions and policies; and they shed light on lending patterns that could be discriminatory. The public data are modified to protect applicant and borrower privacy.

HMDA was originally enacted by Congress in 1975 and is implemented by Regulation C.

The Goal:¶

We will use the models Logistic Regression and LightGBM. to assess the outcomes, we will use ROC-AUC and Log Loss. The purpouse will be:

  1. Find out if the are any intersectional biases in model's predictions.
  2. Which features influenced the most on model's outcomes. We will use SHAP Values.

Steps:¶

  1. Step 1: Setup of Working Environment and Libraries.

  2. Step 2: Overview

  • A) Understanding features: Know what each column represents and whether it is useful.

  • B) Understanding the data, shape and structure:

    • df.shape
    • df.columns
    • df.dtypes
    • df.head()
  1. Step 3: Data Cleaning & Feature Engineering
  • A) Excluding irrelevant records from columns.

    • Quality issues: in "edit_status_name" there are 190,210 records with quality issues. We will ommit them.
    • Only single idividuals: we will focus on single individuals hence from 'co_applicant_ethnicity_name' we only keep 'No co-applicant'
  • B) Excluding irrelevant records from:

    • applicant_ethnicity_name
    • applicant_race_name_1
    • applicant_sex_name
    • action_taken_name
  • C) Addressing missing values:

    • Filling relevant empty cells with "Unknown"
    • Deleting irrelevant records from other columns.
  • D) Feature engineering;

    • Feature creation:
      • 'ethnicity_race_sex'
      • "loan_to_income_ratio"
    • Feature transformation. Mapping action_taken_name into "approved" & "denied"
  1. Step 4: Exploratory Data Analysis (EDA) & Feature Engineering
  • A) Initial questions

    • A.1 What is the proportion of denials to approvals for each group in the column "ethnicity_race_sex"?
    • A.2 What is the approval rate for each group?
    • A.3 What is the denial rate for each group?
    • A.4 What percentage of total applications does each ethnicity_race_sex group represent?
    • A.5 What is the ratio of applications submitted by men compared to women?
    • A.6 What is the average loan amount requested by men compared to women?
    • A.7 What is the average loan amount requested by each group?
    • A.8 What is the average income for each ethnicity_race_sex group?
    • A.9 What is the average loan-to-income ratio for each group?
    • A.10 What is the average rate_spread for each group?
    • A.11 Which group has the fewest zero values in the column "rate_spread", suggesting they are more frequently overcharged?
  • B) Statistics & feature engineering; feature selection

  • C) Visualization of numerical and categorical data

    • Categorical data; histograms
    • Numercial Data:
      • Looking for outliers with boxplot
      • Filtering out outliers
      • Checking Statistics
      • Checking and addressing Skewness
  • E) Kernel Density Estimation (KDE)

Final Considerations: Please note that at the bottom of some coding blocks we acknowledge external original sources of the code used, if any source is mentioned credits go to myself or from official documentation.

Step 1: Setup of working environment and libraries.¶

Note:

  • !git init is for Google Colab environment, we should use "subprocess.run(["git", "init"])" which is more in line with PEP 8 doc. Style.

  • We only use once "git init", when starting a new project. The purpose of this code is to create things related to the project in a .git directory in the project folder.

  • subprocess.run(["git", "init"])

In [1]:
# Mount google drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [2]:
# Setting up GitHub
import os
import subprocess
from getpass import getpass #Secure token storage

# Importing working space
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

Username and email:¶

subprocess.run(["git", "config", "--global", "user.name", "AJLR888"], check=True) subprocess.run(["git", "config", "--global", "user.email", "roldan.analytics@gmail.com"], check=True)

Storing GitHub token and repository details¶

GITHUB_TOKEN = getpass("Enter GitHub Token:") REPO_OWNER = "AJLR888" REPO_NAME = "hmda-ny-2007-loan-default" BRANCH_NAME = "main"

Setting GitHub remot URL with authentcation¶

GIT_REMOTE_URL = f"https://%7BGITHUB_TOKEN%7D@github.com/%7BREPO_OWNER%7D/%7BREPO_NAME%7D.git" os.system(f"git remote set-url origin {GIT_REMOTE_URL}")

In [3]:
# Loading the data
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/hmda_2007_ny_all-records_labels.csv')

# Checking data size
df = pd.DataFrame(df)
<ipython-input-3-100f7bdf4485>:2: DtypeWarning: Columns (34,36,38,42,44,46,48) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/hmda_2007_ny_all-records_labels.csv')

Step 2: General overview¶

A) Understanding of each feature. What does each feature describe?¶

  1. as_of_year: The year the mortgage was reported.

  2. respondent_id: The lender.

  3. agency_name: The regulatory agency responsible for overseeing the financial institution that reported the mortgage.

  4. agency_abbr: Agency abrebiation name.

  5. agency_code: The regulator that supervises the regulated lender that report the data.

  6. loan_type_name:The type of covered loan or application (if covered). By covered mean, there is a mechanism to protect the lender.

  7. property_type_name:

  • One to four-family (other than manufactured housing)
  • Manufactured housing
  • Multifamily
  1. loan_purpose_name: What was the loan used for.

  2. owner_occupancy_name: Whether the owner intend to live in the property or not.

  3. loan_amount_000s

  4. preapproval_name: Preapproval only applies to home purchase loans, not refinancing or home improvement loans.

  5. action_taken_name: The final outcome of a loan application.

  6. msamd_name: MSA stands for Metropolitan Statistical Area, is a region with a high population density at its core and close economic ties throught the area; MD stands for Metropolitan division, is a sub-region within a large MSA (used when the area is particularly populous).

  • Considerations:
    • If the property is outside of the MSA/MD area it may be blank/empty.
    • Helps analyze lending patterns, approval rates, and borrower demographics in specific regions.
    • msamd is broader, city or regional-level in comparison with census_tract_number. msamd is better use for housing trends or market segmentation for instance.
  1. census_tract_number: is a small area. Designed to be socially and economically homogeneous. In other words, they are kind of neighbourhoods.

  2. purchaser_type_name: Describes who purchased the loan in the secondary market after it was originated by the reporting institution.

  3. denial_reason_name_1: Important to consider that if NA the loan was not denied.

  4. rate_spread: The exra interest that a borrower pay compared to the "best-qualified" borrowers. Generally speaking it means higher risk although there are nuances to consider. e.g. someone can have a higher rate spread due to a bad credit score which doesn't reflect their ability and will to pay back.

  5. hoepa_status_name: hoepa stands for Home Ownership and Equity Protection Act. is a law disgned to protect borrowers from predatory lending.

  • Considerations:
    • It can be used to track if high-cost loans are disproportionally targeted at vulnerable groups.
  1. lien_status_name: Basically, it indicates whether the mortgage is secured or not.

  2. edit_status_name: Are rules to assist filers in checking the accuracy of HMDA data prior to submission. Idicate if there are any accuracy(quality)issues.

    • Note: In our dataset we have almost 200,000 records as "Quality edit failure only" which is indicative of some potentail data inconsistency.
  3. minority_population: % of minority population to total tract population.

  4. hud_median_family_income: Median family income for the MSA (metropolitan statistical area) or MD (metropolitan division)

    • Note: hud(Housing & Urban development).
  • List item
  1. population: total population in tract (tract ~ neighbourhood).

  2. tract_to_msamd_income: Indicates how wealthy or poor a tract is compared to its surroundings areas. compares the median family income of the census tract to the median family income of the corresponding metropolitan statistical area MSA OR metropolitan division MD. Let's remember that a tract is inside of a MSA/MD.

  3. number_of_owner_occupied_units: Total number of housing units in the census tract that are occupied by their owners rather than rented.

  4. number_of_1_to_4_family_units: Indicates the number of housing units that are classified as single-family homes. The higher the number the less population density.


Check the following sources for further undertanding:

  • Check filing instructions guide for further information: https://ffiec.cfpb.gov/documentation/fig/2024/overview

  • Glosary: https://www.ffiec.gov/hmda/glossary.htm#top

  • Original source: https://www.consumerfinance.gov/data-research/hmda/historic-data/?geo=nationwide&records=all-records&field_descriptions=labels

  • Check "lar_record_codes.pdf" from my GitHub for further Inf.

B) Understanding the data¶

In [4]:
# Checking content in the columns
pd.set_option('display.max_columns', None)
print(df.head(5))
   as_of_year respondent_id                                  agency_name  \
0        2007    20-2096530    Office of the Comptroller of the Currency   
1        2007    2085300005  Department of Housing and Urban Development   
2        2007    20-2096530    Office of the Comptroller of the Currency   
3        2007    0000003970                 Office of Thrift Supervision   
4        2007    2085300005  Department of Housing and Urban Development   

  agency_abbr  agency_code loan_type_name  loan_type  \
0         OCC            1   Conventional          1   
1         HUD            7   Conventional          1   
2         OCC            1   Conventional          1   
3         OTS            4   Conventional          1   
4         HUD            7   Conventional          1   

                                  property_type_name  property_type  \
0  One-to-four family dwelling (other than manufa...              1   
1  One-to-four family dwelling (other than manufa...              1   
2  One-to-four family dwelling (other than manufa...              1   
3  One-to-four family dwelling (other than manufa...              1   
4  One-to-four family dwelling (other than manufa...              1   

  loan_purpose_name  loan_purpose                    owner_occupancy_name  \
0  Home improvement             2  Owner-occupied as a principal dwelling   
1       Refinancing             3  Owner-occupied as a principal dwelling   
2  Home improvement             2  Owner-occupied as a principal dwelling   
3       Refinancing             3  Owner-occupied as a principal dwelling   
4       Refinancing             3  Owner-occupied as a principal dwelling   

   owner_occupancy  loan_amount_000s preapproval_name  preapproval  \
0                1                54   Not applicable            3   
1                1                60   Not applicable            3   
2                1                56   Not applicable            3   
3                1               374   Not applicable            3   
4                1               158   Not applicable            3   

                             action_taken_name  action_taken  \
0  Application denied by financial institution             3   
1  Application denied by financial institution             3   
2  Application denied by financial institution             3   
3            Loan purchased by the institution             6   
4  Application denied by financial institution             3   

             msamd_name    msamd state_name state_abbr  state_code  \
0       Binghamton - NY  13780.0   New York         NY          36   
1       Binghamton - NY  13780.0   New York         NY          36   
2       Binghamton - NY  13780.0   New York         NY          36   
3  Nassau, Suffolk - NY  35004.0   New York         NY          36   
4       Binghamton - NY  13780.0   New York         NY          36   

     county_name  county_code  census_tract_number applicant_ethnicity_name  \
0   Tioga County        107.0                206.0   Not Hispanic or Latino   
1  Broome County          7.0                 17.0       Hispanic or Latino   
2   Tioga County        107.0                206.0   Not Hispanic or Latino   
3  Nassau County         59.0               4086.0           Not applicable   
4  Broome County          7.0                125.0   Not Hispanic or Latino   

   applicant_ethnicity co_applicant_ethnicity_name  co_applicant_ethnicity  \
0                    2             No co-applicant                       5   
1                    1             No co-applicant                       5   
2                    2             No co-applicant                       5   
3                    4              Not applicable                       4   
4                    2      Not Hispanic or Latino                       2   

  applicant_race_name_1  applicant_race_1 applicant_race_name_2  \
0                 White                 5                   NaN   
1                 White                 5                   NaN   
2                 White                 5                   NaN   
3        Not applicable                 7                   NaN   
4                 White                 5                   NaN   

   applicant_race_2 applicant_race_name_3  applicant_race_3  \
0               NaN                   NaN               NaN   
1               NaN                   NaN               NaN   
2               NaN                   NaN               NaN   
3               NaN                   NaN               NaN   
4               NaN                   NaN               NaN   

  applicant_race_name_4  applicant_race_4 applicant_race_name_5  \
0                   NaN               NaN                   NaN   
1                   NaN               NaN                   NaN   
2                   NaN               NaN                   NaN   
3                   NaN               NaN                   NaN   
4                   NaN               NaN                   NaN   

   applicant_race_5 co_applicant_race_name_1  co_applicant_race_1  \
0               NaN          No co-applicant                    8   
1               NaN          No co-applicant                    8   
2               NaN          No co-applicant                    8   
3               NaN           Not applicable                    7   
4               NaN                    White                    5   

  co_applicant_race_name_2  co_applicant_race_2 co_applicant_race_name_3  \
0                      NaN                  NaN                      NaN   
1                      NaN                  NaN                      NaN   
2                      NaN                  NaN                      NaN   
3                      NaN                  NaN                      NaN   
4                      NaN                  NaN                      NaN   

   co_applicant_race_3 co_applicant_race_name_4  co_applicant_race_4  \
0                  NaN                      NaN                  NaN   
1                  NaN                      NaN                  NaN   
2                  NaN                      NaN                  NaN   
3                  NaN                      NaN                  NaN   
4                  NaN                      NaN                  NaN   

  co_applicant_race_name_5  co_applicant_race_5  \
0                      NaN                  NaN   
1                      NaN                  NaN   
2                      NaN                  NaN   
3                      NaN                  NaN   
4                      NaN                  NaN   

                                  applicant_sex_name  applicant_sex  \
0                                               Male              1   
1  Information not provided by applicant in mail,...              3   
2                                               Male              1   
3                                     Not applicable              4   
4  Information not provided by applicant in mail,...              3   

                               co_applicant_sex_name  co_applicant_sex  \
0                                    No co-applicant                 5   
1                                    No co-applicant                 5   
2                                    No co-applicant                 5   
3                                     Not applicable                 4   
4  Information not provided by applicant in mail,...                 3   

   applicant_income_000s                                purchaser_type_name  \
0                   16.0  Loan was not originated or was not sold in cal...   
1                   36.0  Loan was not originated or was not sold in cal...   
2                   66.0  Loan was not originated or was not sold in cal...   
3                    NaN  Commercial bank, savings bank or savings assoc...   
4                   62.0  Loan was not originated or was not sold in cal...   

   purchaser_type           denial_reason_name_1  denial_reason_1  \
0               0                     Collateral              4.0   
1               0                            NaN              NaN   
2               0  Credit application incomplete              7.0   
3               6                            NaN              NaN   
4               0                            NaN              NaN   

  denial_reason_name_2  denial_reason_2 denial_reason_name_3  denial_reason_3  \
0                  NaN              NaN                  NaN              NaN   
1                  NaN              NaN                  NaN              NaN   
2                  NaN              NaN                  NaN              NaN   
3                  NaN              NaN                  NaN              NaN   
4                  NaN              NaN                  NaN              NaN   

   rate_spread hoepa_status_name  hoepa_status               lien_status_name  \
0          NaN  Not a HOEPA loan             2        Secured by a first lien   
1          NaN  Not a HOEPA loan             2        Secured by a first lien   
2          NaN  Not a HOEPA loan             2  Secured by a subordinate lien   
3          NaN  Not a HOEPA loan             2                 Not applicable   
4          NaN  Not a HOEPA loan             2        Secured by a first lien   

   lien_status edit_status_name  edit_status  sequence_number  population  \
0            1              NaN          NaN            13711      7424.0   
1            1              NaN          NaN            10815      4302.0   
2            2              NaN          NaN            13712      7424.0   
3            4              NaN          NaN           174127      4497.0   
4            1              NaN          NaN            10823      6421.0   

   minority_population  hud_median_family_income  tract_to_msamd_income  \
0                 2.25                   52800.0              91.790001   
1                14.16                   52800.0              94.230003   
2                 2.25                   52800.0              91.790001   
3                11.12                   93800.0              97.110001   
4                 2.59                   52800.0              93.949997   

   number_of_owner_occupied_units  number_of_1_to_4_family_units  \
0                          2259.0                         3030.0   
1                          1033.0                         1823.0   
2                          2259.0                         3030.0   
3                          1341.0                         1459.0   
4                          1926.0                         2669.0   

   application_date_indicator  
0                           0  
1                           0  
2                           0  
3                           2  
4                           0  
In [5]:
# Force to dislplay all rows
pd.set_option('display.max_rows', None)

# Data types
print(df.dtypes)
as_of_year                          int64
respondent_id                      object
agency_name                        object
agency_abbr                        object
agency_code                         int64
loan_type_name                     object
loan_type                           int64
property_type_name                 object
property_type                       int64
loan_purpose_name                  object
loan_purpose                        int64
owner_occupancy_name               object
owner_occupancy                     int64
loan_amount_000s                    int64
preapproval_name                   object
preapproval                         int64
action_taken_name                  object
action_taken                        int64
msamd_name                         object
msamd                             float64
state_name                         object
state_abbr                         object
state_code                          int64
county_name                        object
county_code                       float64
census_tract_number               float64
applicant_ethnicity_name           object
applicant_ethnicity                 int64
co_applicant_ethnicity_name        object
co_applicant_ethnicity              int64
applicant_race_name_1              object
applicant_race_1                    int64
applicant_race_name_2              object
applicant_race_2                  float64
applicant_race_name_3              object
applicant_race_3                  float64
applicant_race_name_4              object
applicant_race_4                  float64
applicant_race_name_5              object
applicant_race_5                  float64
co_applicant_race_name_1           object
co_applicant_race_1                 int64
co_applicant_race_name_2           object
co_applicant_race_2               float64
co_applicant_race_name_3           object
co_applicant_race_3               float64
co_applicant_race_name_4           object
co_applicant_race_4               float64
co_applicant_race_name_5           object
co_applicant_race_5               float64
applicant_sex_name                 object
applicant_sex                       int64
co_applicant_sex_name              object
co_applicant_sex                    int64
applicant_income_000s             float64
purchaser_type_name                object
purchaser_type                      int64
denial_reason_name_1               object
denial_reason_1                   float64
denial_reason_name_2               object
denial_reason_2                   float64
denial_reason_name_3               object
denial_reason_3                   float64
rate_spread                       float64
hoepa_status_name                  object
hoepa_status                        int64
lien_status_name                   object
lien_status                         int64
edit_status_name                   object
edit_status                       float64
sequence_number                     int64
population                        float64
minority_population               float64
hud_median_family_income          float64
tract_to_msamd_income             float64
number_of_owner_occupied_units    float64
number_of_1_to_4_family_units     float64
application_date_indicator          int64
dtype: object
In [6]:
# Null values
print(df.isnull().sum())
as_of_year                              0
respondent_id                           0
agency_name                             0
agency_abbr                             0
agency_code                             0
loan_type_name                          0
loan_type                               0
property_type_name                      0
property_type                           0
loan_purpose_name                       0
loan_purpose                            0
owner_occupancy_name                    0
owner_occupancy                         0
loan_amount_000s                        0
preapproval_name                        0
preapproval                             0
action_taken_name                       0
action_taken                            0
msamd_name                          92920
msamd                               92920
state_name                              0
state_abbr                              0
state_code                              0
county_name                          1007
county_code                          1007
census_tract_number                  1370
applicant_ethnicity_name                0
applicant_ethnicity                     0
co_applicant_ethnicity_name             0
co_applicant_ethnicity                  0
applicant_race_name_1                   0
applicant_race_1                        0
applicant_race_name_2             1006430
applicant_race_2                  1006430
applicant_race_name_3             1009259
applicant_race_3                  1009259
applicant_race_name_4             1009369
applicant_race_4                  1009369
applicant_race_name_5             1009386
applicant_race_5                  1009386
co_applicant_race_name_1                0
co_applicant_race_1                     0
co_applicant_race_name_2          1008514
co_applicant_race_2               1008514
co_applicant_race_name_3          1009393
co_applicant_race_3               1009393
co_applicant_race_name_4          1009417
co_applicant_race_4               1009417
co_applicant_race_name_5          1009427
co_applicant_race_5               1009427
applicant_sex_name                      0
applicant_sex                           0
co_applicant_sex_name                   0
co_applicant_sex                        0
applicant_income_000s               88875
purchaser_type_name                     0
purchaser_type                          0
denial_reason_name_1               828980
denial_reason_1                    828980
denial_reason_name_2               971378
denial_reason_2                    971378
denial_reason_name_3              1002497
denial_reason_3                   1002497
rate_spread                        939632
hoepa_status_name                       0
hoepa_status                            0
lien_status_name                        0
lien_status                             0
edit_status_name                   819241
edit_status                        819241
sequence_number                         0
population                           1466
minority_population                  1477
hud_median_family_income             1418
tract_to_msamd_income                1745
number_of_owner_occupied_units       3093
number_of_1_to_4_family_units        2965
application_date_indicator              0
dtype: int64
In [7]:
# Unique values for categorical variables
categorical_columns = df.select_dtypes(include=['object']).columns.drop("respondent_id")
for col in categorical_columns:
  print(f"Value counts for: {col}")
  print(df[col].value_counts(), "\n")
  print("\n" + "=+"*50 + "\n")

# Explanation:
  # - df.select_dtypes(include=['object']).columns: This part selects all columns from the dataframe that is 'object' (basically not numerical data)
  # - .drop("respondent_id") is used to remove that columns as is not relevant
  # - for col in categorical_columns: goes through each column name, and...
  # - print(f"Value counts for: {col}"): prints the column name, and...
  # - print(df[col].value_counts(), "\n"); prints the number of times each unique value appears in a column
Value counts for: agency_name
agency_name
Office of the Comptroller of the Currency      363756
Department of Housing and Urban Development    214451
Office of Thrift Supervision                   199948
Federal Reserve System                         173886
National Credit Union Administration            34779
Federal Deposit Insurance Corporation           22631
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: agency_abbr
agency_abbr
OCC     363756
HUD     214451
OTS     199948
FRS     173886
NCUA     34779
FDIC     22631
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: loan_type_name
loan_type_name
Conventional          966120
FHA-insured            39128
VA-guaranteed           3573
FSA/RHS-guaranteed       630
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: property_type_name
property_type_name
One-to-four family dwelling (other than manufactured housing)    990336
Manufactured housing                                              13427
Multifamily dwelling                                               5688
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: loan_purpose_name
loan_purpose_name
Refinancing         490921
Home purchase       393021
Home improvement    125509
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: owner_occupancy_name
owner_occupancy_name
Owner-occupied as a principal dwelling        930351
Not owner-occupied as a principal dwelling     73039
Not applicable                                  6061
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: preapproval_name
preapproval_name
Not applicable                   841011
Preapproval was not requested    148534
Preapproval was requested         19906
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: action_taken_name
action_taken_name
Loan originated                                        398639
Application denied by financial institution            265860
Loan purchased by the institution                      125325
Application withdrawn by applicant                      98667
Application approved but not accepted                   88070
File closed for incompleteness                          32846
Preapproval request denied by financial institution        39
Preapproval request approved but not accepted               5
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: msamd_name
msamd_name
New York, White Plains, Wayne - NY, NJ     373622
Nassau, Suffolk - NY                       205507
Rochester - NY                              61609
Albany, Schenectady, Troy - NY              61400
Buffalo, Niagara Falls - NY                 58655
Poughkeepsie, Newburgh, Middletown - NY     51484
Syracuse - NY                               40105
Utica, Rome - NY                            17609
Binghamton - NY                             13557
Kingston - NY                               12881
Glens Falls - NY                            11108
Elmira - NY                                  5087
Ithaca - NY                                  3907
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: state_name
state_name
New York    1009451
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: state_abbr
state_abbr
NY    1009451
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: county_name
county_name
Suffolk County         119892
Queens County          106680
Nassau County           85631
Kings County            85527
Westchester County      49237
Erie County             47096
Monroe County           41702
New York County         41171
Bronx County            34588
Richmond County         32395
Orange County           29681
Onondaga County         28400
Dutchess County         21820
Albany County           18913
Saratoga County         16946
Rockland County         16734
Oneida County           13488
Ulster County           13010
Schenectady County      12071
Niagara County          11564
Rensselaer County       11077
Broome County           10491
Ontario County           7473
Putnam County            7351
Oswego County            7215
Jefferson County         6461
Sullivan County          6357
Chautauqua County        6328
Wayne County             6204
Warren County            5702
Washington County        5419
Chemung County           5088
Steuben County           5076
Cayuga County            4923
Clinton County           4829
Madison County           4494
Fulton County            4355
St. Lawrence County      4353
Greene County            4315
Columbia County          4201
Herkimer County          4124
Cattaraugus County       4105
Otsego County            4051
Tompkins County          3910
Livingston County        3628
Montgomery County        3538
Delaware County          3189
Tioga County             3069
Genesee County           3034
Chenango County          2935
Essex County             2904
Franklin County          2746
Orleans County           2637
Cortland County          2615
Schoharie County         2410
Allegany County          2285
Wyoming County           2169
Seneca County            2021
Yates County             1535
Lewis County             1518
Schuyler County          1285
Hamilton County           478
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: applicant_ethnicity_name
applicant_ethnicity_name
Not Hispanic or Latino                                                               701712
Information not provided by applicant in mail, Internet, or telephone application    163345
Hispanic or Latino                                                                    83748
Not applicable                                                                        60646
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: co_applicant_ethnicity_name
co_applicant_ethnicity_name
No co-applicant                                                                      601274
Not Hispanic or Latino                                                               269823
Information not provided by applicant in mail, Internet, or telephone application     66060
Not applicable                                                                        47780
Hispanic or Latino                                                                    24514
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: applicant_race_name_1
applicant_race_name_1
White                                                                                610172
Information not provided by applicant in mail, Internet, or telephone application    171080
Black or African American                                                            110814
Not applicable                                                                        60199
Asian                                                                                 46654
American Indian or Alaska Native                                                       5342
Native Hawaiian or Other Pacific Islander                                              5190
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: applicant_race_name_2
applicant_race_name_2
White                                        2051
Black or African American                     422
Native Hawaiian or Other Pacific Islander     261
Asian                                         173
American Indian or Alaska Native              114
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: applicant_race_name_3
applicant_race_name_3
White                                        82
Black or African American                    76
Native Hawaiian or Other Pacific Islander    18
American Indian or Alaska Native             13
Asian                                         3
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: applicant_race_name_4
applicant_race_name_4
Native Hawaiian or Other Pacific Islander    58
White                                        17
Asian                                         7
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: applicant_race_name_5
applicant_race_name_5
White                                        56
Native Hawaiian or Other Pacific Islander     5
Asian                                         2
American Indian or Alaska Native              2
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: co_applicant_race_name_1
co_applicant_race_name_1
No co-applicant                                                                      601274
White                                                                                248732
Information not provided by applicant in mail, Internet, or telephone application     68896
Not applicable                                                                        46507
Black or African American                                                             27264
Asian                                                                                 13513
American Indian or Alaska Native                                                       1652
Native Hawaiian or Other Pacific Islander                                              1613
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: co_applicant_race_name_2
co_applicant_race_name_2
White                                        665
Black or African American                    101
Native Hawaiian or Other Pacific Islander     82
Asian                                         56
American Indian or Alaska Native              33
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: co_applicant_race_name_3
co_applicant_race_name_3
White                                        23
Black or African American                    22
Native Hawaiian or Other Pacific Islander    10
American Indian or Alaska Native              3
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: co_applicant_race_name_4
co_applicant_race_name_4
Native Hawaiian or Other Pacific Islander    14
White                                        12
Asian                                         5
Black or African American                     3
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: co_applicant_race_name_5
co_applicant_race_name_5
White                                        15
Asian                                         4
Native Hawaiian or Other Pacific Islander     3
American Indian or Alaska Native              2
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: applicant_sex_name
applicant_sex_name
Male                                                                                 548444
Female                                                                               313848
Information not provided by applicant in mail, Internet, or telephone application     86910
Not applicable                                                                        60249
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: co_applicant_sex_name
co_applicant_sex_name
No co-applicant                                                                      601274
Female                                                                               235906
Male                                                                                  86243
Not applicable                                                                        47369
Information not provided by applicant in mail, Internet, or telephone application     38659
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: purchaser_type_name
purchaser_type_name
Loan was not originated or was not sold in calendar year covered by register    690764
Affiliate institution                                                            79082
Fannie Mae (FNMA)                                                                75053
Freddie Mac (FHLMC)                                                              47531
Other type of purchaser                                                          34667
Life insurance company, credit union, mortgage bank, or finance company          24912
Private securitization                                                           21975
Commercial bank, savings bank or savings association                             20277
Ginnie Mae (GNMA)                                                                15139
Farmer Mac (FAMC)                                                                   51
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: denial_reason_name_1
denial_reason_name_1
Credit history                                    42911
Debt-to-income ratio                              31689
Credit application incomplete                     30550
Collateral                                        29969
Other                                             29472
Unverifiable information                          11298
Insufficient cash (downpayment, closing costs)     2672
Employment history                                 1748
Mortgage insurance denied                           162
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: denial_reason_name_2
denial_reason_name_2
Other                                             10141
Credit history                                     8133
Debt-to-income ratio                               6406
Collateral                                         4863
Unverifiable information                           2918
Credit application incomplete                      2445
Insufficient cash (downpayment, closing costs)     2314
Employment history                                  716
Mortgage insurance denied                           137
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: denial_reason_name_3
denial_reason_name_3
Other                                             2508
Credit history                                    1007
Collateral                                         977
Unverifiable information                           624
Debt-to-income ratio                               579
Insufficient cash (downpayment, closing costs)     567
Credit application incomplete                      535
Employment history                                 103
Mortgage insurance denied                           54
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: hoepa_status_name
hoepa_status_name
Not a HOEPA loan    1009337
HOEPA loan              114
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: lien_status_name
lien_status_name
Secured by a first lien          694718
Secured by a subordinate lien    159095
Not applicable                   125325
Not secured by a lien             30313
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Value counts for: edit_status_name
edit_status_name
Quality edit failure only    190210
Name: count, dtype: int64 


=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Step 3: Data cleaning¶

A) Excluding irrelevant records from columns¶

In [8]:
# Records/columns
df.shape
Out[8]:
(1009451, 78)
In [9]:
# Quality issues: There are 190210 records with quality issues. We will ommit them.
df = df[~df["edit_status_name"].isin(["Quality edit failure only"])]

# Our focus is on individuals hence we will focus on data with "No co-applicant"
df = df[df['co_applicant_ethnicity_name'] == 'No co-applicant']

# Here we have "excluding" and "selecting", Explanation:
# - With df = df[~df["edit_status_name"].isin(["Quality edit failure only"])] We are saying, exclude "Quality edit failure only" from the column "edit_status_name".
# - With df = df[df["co_appliant_ethnicity_name"] == "No co-applicant"] We are saying, keep only records that in the column "co_applicant_ethnicity_name" is "No co-applicant"
In [10]:
# Records/shape
df.shape
Out[10]:
(472701, 78)

Note that around half the applications are without co-applicant.

B) Excluding irrelevant records from the following columns:¶

  • applicant_ethnicity_name
  • applicant_race_name_1
  • applicant_sex_name
  • action_taken_name
In [11]:
df = df[
    ~df["action_taken_name"].isin([
        "File closed for incompleteness",
        "Preapproval request denied by financial institution",
        "Preapproval request approved but not accepted",
        "Application withdrawn by applicant"
    ]) &
    ~df["applicant_ethnicity_name"].isin([
        "Information not provided by applicant in mail, Internet, or telephone application",
        "Not applicable"
    ]) &
    ~df["applicant_race_name_1"].isin([
        "Information not provided by applicant in mail, Internet, or telephone application",
        "Not applicable"
    ]) &
    ~df["applicant_sex_name"].isin([
        "Information not provided by applicant in mail, Internet, or telephone application",
        "Not applicable"
    ])
]
In [12]:
# Data validation
print(df.shape)
(323044, 78)

C) Addressing missing values¶

In [13]:
# Calculating the missing valuesand their percentage
total_rows = len(df)

# Missing value counting and missing value %
missing_counts = df.isnull().sum()
missing_percentage = (missing_counts / total_rows) * 100

# We create the header
print(f"{'Column': <30}{'Missing Count':<15}{'Missing %'}")
print("-" * 60)

# For each column we print the missing values in absolute terms and in %
for col in df.columns:
  if missing_counts[col]>0:
    print(f"{col:<30}{missing_counts[col]:<15}{missing_percentage[col]:.2f}%")

# Explanation:
  # - {'Column': <30}; The <30 align the text on the left and allocate up to 30 characters in the column
  # - Other options are; >30 Same, but aligns text on the right as well as ^30 which aligns text on the center.
  # - f"{}{}{}"; it is called f-string(formatted string literal), used to put together variables or expressions one after another like concatenation:
      # "The value is " + str(value) + " and the total is " + str(total)
Column                        Missing Count  Missing %
------------------------------------------------------------
msamd_name                    34069          10.55%
msamd                         34069          10.55%
county_name                   357            0.11%
county_code                   357            0.11%
census_tract_number           460            0.14%
applicant_race_name_2         321711         99.59%
applicant_race_2              321711         99.59%
applicant_race_name_3         322955         99.97%
applicant_race_3              322955         99.97%
applicant_race_name_4         323019         99.99%
applicant_race_4              323019         99.99%
applicant_race_name_5         323027         99.99%
applicant_race_5              323027         99.99%
co_applicant_race_name_2      323044         100.00%
co_applicant_race_2           323044         100.00%
co_applicant_race_name_3      323044         100.00%
co_applicant_race_3           323044         100.00%
co_applicant_race_name_4      323044         100.00%
co_applicant_race_4           323044         100.00%
co_applicant_race_name_5      323044         100.00%
co_applicant_race_5           323044         100.00%
applicant_income_000s         2719           0.84%
denial_reason_name_1          251827         77.95%
denial_reason_1               251827         77.95%
denial_reason_name_2          308294         95.43%
denial_reason_2               308294         95.43%
denial_reason_name_3          320397         99.18%
denial_reason_3               320397         99.18%
rate_spread                   295313         91.42%
edit_status_name              323044         100.00%
edit_status                   323044         100.00%
population                    507            0.16%
minority_population           511            0.16%
hud_median_family_income      484            0.15%
tract_to_msamd_income         585            0.18%
number_of_owner_occupied_units1092           0.34%
number_of_1_to_4_family_units 967            0.30%

Filling and dropping¶

In [14]:
# Filling out empty cells:
df = df.assign(
    msamd_name=df['msamd_name'].fillna("Unknown"),
    denial_reason_name_1=df['denial_reason_name_1'].fillna("Unknown"),
    rate_spread=df['rate_spread'].fillna(0)
)

# Dropping empty values from the following columns:
df = df.dropna(subset=['county_name',
                       'census_tract_number',
                       'applicant_income_000s',
                       'minority_population',
                       'hud_median_family_income',
                       'tract_to_msamd_income',
                       'number_of_owner_occupied_units',
                       'number_of_1_to_4_family_units'
                       ]
               )
In [15]:
# Data validation
print(df.isnull().sum())
as_of_year                             0
respondent_id                          0
agency_name                            0
agency_abbr                            0
agency_code                            0
loan_type_name                         0
loan_type                              0
property_type_name                     0
property_type                          0
loan_purpose_name                      0
loan_purpose                           0
owner_occupancy_name                   0
owner_occupancy                        0
loan_amount_000s                       0
preapproval_name                       0
preapproval                            0
action_taken_name                      0
action_taken                           0
msamd_name                             0
msamd                              33468
state_name                             0
state_abbr                             0
state_code                             0
county_name                            0
county_code                            0
census_tract_number                    0
applicant_ethnicity_name               0
applicant_ethnicity                    0
co_applicant_ethnicity_name            0
co_applicant_ethnicity                 0
applicant_race_name_1                  0
applicant_race_1                       0
applicant_race_name_2             317520
applicant_race_2                  317520
applicant_race_name_3             318752
applicant_race_3                  318752
applicant_race_name_4             318816
applicant_race_4                  318816
applicant_race_name_5             318824
applicant_race_5                  318824
co_applicant_race_name_1               0
co_applicant_race_1                    0
co_applicant_race_name_2          318841
co_applicant_race_2               318841
co_applicant_race_name_3          318841
co_applicant_race_3               318841
co_applicant_race_name_4          318841
co_applicant_race_4               318841
co_applicant_race_name_5          318841
co_applicant_race_5               318841
applicant_sex_name                     0
applicant_sex                          0
co_applicant_sex_name                  0
co_applicant_sex                       0
applicant_income_000s                  0
purchaser_type_name                    0
purchaser_type                         0
denial_reason_name_1                   0
denial_reason_1                   248033
denial_reason_name_2              304184
denial_reason_2                   304184
denial_reason_name_3              316221
denial_reason_3                   316221
rate_spread                            0
hoepa_status_name                      0
hoepa_status                           0
lien_status_name                       0
lien_status                            0
edit_status_name                  318841
edit_status                       318841
sequence_number                        0
population                             0
minority_population                    0
hud_median_family_income               0
tract_to_msamd_income                  0
number_of_owner_occupied_units         0
number_of_1_to_4_family_units          0
application_date_indicator             0
dtype: int64

D) Feature Engineering¶

Feature creation¶

In [16]:
# Creation of ethnicity_race_sex column:
df['ethnicity_race_sex'] = df['applicant_ethnicity_name'].str.lower() + "_" + df['applicant_race_name_1'].str.lower() + "_" + df['applicant_sex_name'].str.lower()

# Checking column created
print(df[['ethnicity_race_sex']].value_counts())
ethnicity_race_sex                                                     
not hispanic or latino_white_male                                          126540
not hispanic or latino_white_female                                         84557
not hispanic or latino_black or african american_female                     27181
not hispanic or latino_black or african american_male                       22053
hispanic or latino_white_male                                               17879
not hispanic or latino_asian_male                                           13608
hispanic or latino_white_female                                             11132
not hispanic or latino_asian_female                                          8975
not hispanic or latino_american indian or alaska native_male                 1116
hispanic or latino_black or african american_male                             937
not hispanic or latino_native hawaiian or other pacific islander_male         922
hispanic or latino_black or african american_female                           873
not hispanic or latino_native hawaiian or other pacific islander_female       741
not hispanic or latino_american indian or alaska native_female                725
hispanic or latino_native hawaiian or other pacific islander_male             436
hispanic or latino_american indian or alaska native_male                      385
hispanic or latino_native hawaiian or other pacific islander_female           242
hispanic or latino_american indian or alaska native_female                    241
hispanic or latino_asian_male                                                 175
hispanic or latino_asian_female                                               123
Name: count, dtype: int64

Note about above results: note how the first 4 columns are labeled as "not hispanic or latino" whereas the last 4 columns are "hispanic or latino" group linked to the smallest community.

In [17]:
# Creation of loan_to_income_ratio column:
df['loan_to_income_ratio'] = (df['loan_amount_000s'] / df['applicant_income_000s']).round(2)

# Checking
print(df['loan_to_income_ratio'].head())

# Note: Figures below represent the times of their annual income.
# E.g. 3.10, means that the loan amount is 3.10 times their annual income.
0    3.38
2    0.85
5    3.07
6    1.76
7    1.33
Name: loan_to_income_ratio, dtype: float64

Feature Transformation; Mapping¶

In [18]:
print(df["action_taken_name"].value_counts())
action_taken_name
Loan originated                                148863
Application denied by financial institution    108294
Application approved but not accepted           31152
Loan purchased by the institution               30532
Name: count, dtype: int64
In [19]:
# Mapping "action_taken_name" into a binary feature approved-denied
atn_map = {
    "Loan originated" : 'approved',
    "Application approved but not accepted" : 'approved',
    "Loan purchased by the institution" : 'approved',
    "Application denied by financial institution" : 'denied'
    }


df["action_taken_name"] = df["action_taken_name"].map(atn_map)

print(df["action_taken_name"].value_counts())
action_taken_name
approved    210547
denied      108294
Name: count, dtype: int64
In [20]:
# Mapping "ethnicity_race_sex" for better readitability:
ers_map = {
    "not hispanic or latino_white_male": "NH_White_M",
    "not hispanic or latino_white_female": "NH_White_F",
    "not hispanic or latino_black or african american_female": "NH_Black_F",
    "not hispanic or latino_black or african american_male": "NH_Black_M",
    "hispanic or latino_white_male": "H_White_M",
    "not hispanic or latino_asian_male": "NH_Asian_M",
    "hispanic or latino_white_female": "H_White_F",
    "not hispanic or latino_asian_female": "NH_Asian_F",
    "not hispanic or latino_american indian or alaska native_male": "NH_Indigenous_M",
    "hispanic or latino_black or african american_male": "H_Black_M",
    "not hispanic or latino_native hawaiian or other pacific islander_male": "NH_PacificIslander_M",
    "hispanic or latino_black or african american_female": "H_Black_F",
    "not hispanic or latino_native hawaiian or other pacific islander_female": "NH_PacificIslander_F",
    "not hispanic or latino_american indian or alaska native_female": "NH_Indigenous_F",
    "hispanic or latino_native hawaiian or other pacific islander_male": "H_PacificIslander_M",
    "hispanic or latino_american indian or alaska native_male": "H_Indigenous_M",
    "hispanic or latino_native hawaiian or other pacific islander_female": "H_PacificIslander_F",
    "hispanic or latino_american indian or alaska native_female": "H_Indigenous_F",
    "hispanic or latino_asian_male": "H_Asian_M",
    "hispanic or latino_asian_female": "H_Asian_F"
}

df["ethnicity_race_sex"] = df["ethnicity_race_sex"].map(ers_map)

print(df.shape, '\n', df["ethnicity_race_sex"].value_counts())
(318841, 80) 
 ethnicity_race_sex
NH_White_M              126540
NH_White_F               84557
NH_Black_F               27181
NH_Black_M               22053
H_White_M                17879
NH_Asian_M               13608
H_White_F                11132
NH_Asian_F                8975
NH_Indigenous_M           1116
H_Black_M                  937
NH_PacificIslander_M       922
H_Black_F                  873
NH_PacificIslander_F       741
NH_Indigenous_F            725
H_PacificIslander_M        436
H_Indigenous_M             385
H_PacificIslander_F        242
H_Indigenous_F             241
H_Asian_M                  175
H_Asian_F                  123
Name: count, dtype: int64

Step 4: Exploratory Data Analysis¶

A) Initial questions¶

A.1 What is the proportion of denials to approvals for each group in the column "ethnicity_race_sex"?¶

In [21]:
# Calculating and plotting the distribution of action_taken for each ethnicity_race_sex group
action_distribution = df.groupby(['ethnicity_race_sex', 'action_taken_name']).size().unstack(fill_value=0)
action_distribution_pct = action_distribution.div(action_distribution.sum(axis=1), axis=0)

# Sorting the data by 'approved' column
sorted_index = action_distribution_pct['approved'].sort_values(ascending=True).index
action_distribution_pct_sorted = action_distribution_pct.loc[sorted_index]

###########################################################
# Explanation: action_distribution = df.groupby(['ethnicity_race_sex', 'action_taken_name']).size().unstack(fill_value=0)
# -  df.groupby(['ethnicity_race_sex', 'action_taken_name']); Groups data by unique combination of intersectional group and action taken,
#    in other words, it groups the data like follows: NH_White_M-Approved; NH_White_M-Denied; .... and so on for each intersectional group.

# - .size(); counts the number of rows in each group (like count() but without being specific about the columns)
# - .unstack(); converts the second grouping level ("action_taken_name") into columns so rows will be given by each intersectional group, and the columns by each action_taken_name (approved, denied)
#   (fill_value=0) is used to ensure each intersectional group has a value, because it can happen that an intersectional group doesn't have any approved loans, for instance.

# As result we gest a df that looks like follows:

#     ethnicity_race_sex	  Approved	  Denied
#     NH_White_M	          1234	      645
#     ... rest of the data.
########################################################
# Explanation: action_distribution_pct = action_distribution.div(action_distribution.sum(axis=1), axis=0)
# - .div(argument1, argument2); argument1 is what to divide by, argument2 controls how the division is applied:
#   axis=0 → divide each row by a value per row (sum of all values in that row for the group),
#   axis=1 → divide each column by a value per column.
# - Therefore, argument1 is action_distribution.sum(axis=1), which represents the total count of actions for each group just like before:
#                                                                                        #     ethnicity_race_sex	  Approved	  Denied
#                                                                                        #     NH_White_M	          1234	      645
#                                                                                        #     ... rest of the data.
# - argument2, axis=0; for each intersectional group, we sum Approved + Denied. That total is used to divide **each value in the row** by.
########################################################
# Explanation: sorted_index = action_distribution_pct['approved'].sort_values(ascending=True).index
#   .sort_values() --> sorts the values in 'approved' column.
#   .index --> gets the row labels in that sorted order.
#
#  action_distribution_pct_sorted = action_distribution_pct.loc[sorted_index]:
    # .loc[sorted_index]: Rearranges the rows of action_distribution_pct to match the sorted order from sorted_index.
In [22]:
# Displaying
plt.figure(figsize=(20, 10))
action_distribution_pct_sorted.plot(kind='barh', stacked=True) # Stacked is used for stacking the bars on top of each other.

plt.title('Applications approved - denied')
plt.xlabel('Proportion')
plt.ylabel('Ethnicity-Race-Sex')

plt.legend(title='Action Taken', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.subplots_adjust(right=0.8)

plt.show()

#https://stackoverflow.com/questions/25068384/bbox-to-anchor-and-loc-in-matplotlib
#https://www.geeksforgeeks.org/python-pandas-dataframe-div/
<Figure size 2000x1000 with 0 Axes>
No description has been provided for this image

A.2 What is the approval rate for each group?¶

In [23]:
# Same information as above but in percentage terms.
df['approved'] = (df['action_taken_name'] == 'approved').astype(int)
approval_rates = df.groupby('ethnicity_race_sex')['approved'].mean().sort_values(ascending=True)


plt.figure(figsize=(20, 10))
ax = approval_rates.plot(kind='barh')

plt.title('Approval Rates by Ethnicity-Race-Sex Intersections')
plt.xlabel('Approval Rate')
plt.ylabel('Ethnicity-Race-Sex')
plt.xlim(0, 1)  # Set x-axis limits from 0 to 1 for percentage

# Adding percentage labels
for i, v in enumerate(approval_rates):
    ax.text(v, i, f' {v:.1%}', va='center')

plt.tight_layout()
plt.show()

#https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.text.html
#https://github.com/DJ4seasons/Python_Basic4beginners
No description has been provided for this image

Chart above: It is kind of surprising that both male and female Asians have the highest approval rate. On the other hand, having the smallest communities with the lowest approved rate it somehow expected.

A.3 What is the denial rate for each group?¶

In [24]:
# Same information as above but in percentage terms.
df['denied'] = (df['action_taken_name'] == 'denied').astype(int)
denied_rates = df.groupby('ethnicity_race_sex')['denied'].mean().sort_values(ascending=True)


plt.figure(figsize=(20, 10))
ax = denied_rates.plot(kind='barh')

plt.title('Denied Rates by Ethnicity-Race-Sex Intersections')
plt.xlabel('Denied Rate')
plt.ylabel('Ethnicity-Race-Sex')
plt.xlim(0, 1)  # Set x-axis limits from 0 to 1 for percentage

# Adding percentage labels
for i, v in enumerate(denied_rates):
    ax.text(v, i, f' {v:.1%}', va='center')

plt.tight_layout()
plt.show()

#https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.text.html
#https://github.com/DJ4seasons/Python_Basic4beginners
No description has been provided for this image

A.4 What percentage of total applications does each ethnicity_race_sex group represent?¶

In [25]:
# Total number of applications (approved + denied) for each ethnicity_race_sex group
total_applications = df.groupby('ethnicity_race_sex')['action_taken_name'].count()

# % distribution of total applications
total_applications_pct = (total_applications / total_applications.sum()) * 100

# Sorting data
total_applications_pct_sorted = total_applications_pct.sort_values(ascending=True)

# Plotting results
plt.figure(figsize=(20, 10))
ax = total_applications_pct_sorted.plot(kind='barh', color='firebrick')

# Labels
plt.title('Distribution of Loan Applications by Ethnicity-Race-Sex')
plt.xlabel('Percentage of Total Applications')
plt.ylabel('Ethnicity-Race-Sex')

# % on bars
for i, v in enumerate(total_applications_pct_sorted):
    ax.text(v + 0.5, i, f'{v:.1f}%', va='center')


plt.tight_layout()
plt.show()
No description has been provided for this image

From the above chart we already can appreciate a link between having little data and the rate of approved/denied rate. A closer look at the smaller communities is shown below.

Excluding the two largest groups¶

In [26]:
# Filtering out "not hispanic or latino_white_female" and "not hispanic or latino_white_male"
total_applications = df.groupby('ethnicity_race_sex')['action_taken_name'].count()

# Total number of applications (approved + denied) for each ethnicity_race_sex group
total_applications_pct = (total_applications / total_applications.sum()) * 100

# Filtering out "not hispanic or latino_white_female" and "not hispanic or latino_white_male" from visualization
visualization_pct = total_applications_pct.drop(['NH_White_M', 'NH_White_F'])

# Sortting data
visualization_pct_sorted = visualization_pct.sort_values(ascending=True)

# Plotting
plt.figure(figsize=(20, 10))
ax = visualization_pct_sorted.plot(kind='barh', color='indianred')

# Labels
plt.title('Distribution of Loan Applications by Ethnicity-Race-Sex (Excluding NH_White_M, NH_White_F')
plt.xlabel('Percentage of Total Applications')
plt.ylabel('Ethnicity-Race-Sex')

# % for each bar
for i, v in enumerate(visualization_pct_sorted):
    ax.text(v + 0.1, i, f'{v:.1f}%', va='center')

plt.tight_layout()
plt.show()
No description has been provided for this image

A.5 What is the ratio of applications submitted by men compared to women?¶

In [27]:
# Counting the number of applications for men and women
male_count = df[df['applicant_sex_name'] == 'Male'].shape[0]
female_count = df[df['applicant_sex_name'] == 'Female'].shape[0]

# Calculating the ratio of men to women
if female_count == 0:  # Avoid division by zero
    ratio = None
else:
    ratio = female_count / male_count

# Displaying the results
print(f"Number of applications by men: {male_count}")
print(f"Number of applications by women: {female_count}")
print(f"Ratio of applications (Men vs Women): {ratio:.2f}")
Number of applications by men: 184051
Number of applications by women: 134790
Ratio of applications (Men vs Women): 0.73

A.6 What is the average loan amount requested by men compared to women?¶

In [28]:
# Filtering the data for men and women
male_loans = df[df['applicant_sex_name'] == 'Male']['loan_amount_000s']
female_loans = df[df['applicant_sex_name'] == 'Female']['loan_amount_000s']

# Calculating the average loan amount for men and women
average_male_loan = male_loans.mean()
average_female_loan = female_loans.mean()

# Displaying the results
print(f"Average loan amount requested by men: ${average_male_loan:.2f} (in thousands)")
print(f"Average loan amount requested by women: ${average_female_loan:.2f} (in thousands)")
Average loan amount requested by men: $233.62 (in thousands)
Average loan amount requested by women: $207.58 (in thousands)

A.7 What is the average loan amount requested by each group?¶

In [29]:
# Grouping by 'ethnicity_race_sex' and calculate average loan amount
average_loan_by_group = df.groupby('ethnicity_race_sex')['loan_amount_000s'].mean()

# Sorting by average loan amount if desired (optional)
average_loan_by_group_sorted = average_loan_by_group.sort_values(ascending=False)

# Displaying all results in a loop
print(f"{'Group':<25} {'Average Loan Amount (K)':>25}")
print("-" * 50)
for group, avg_loan in average_loan_by_group_sorted.items():
    print(f"{group:<25} ${avg_loan:>20.2f}")
Group                       Average Loan Amount (K)
--------------------------------------------------
NH_Asian_M                $              329.28
NH_Asian_F                $              315.21
NH_PacificIslander_M      $              288.41
H_White_M                 $              286.87
NH_Black_M                $              279.67
H_Asian_M                 $              271.50
NH_PacificIslander_F      $              268.19
H_White_F                 $              267.04
NH_Black_F                $              254.55
H_Black_M                 $              252.19
H_PacificIslander_M       $              249.71
H_Black_F                 $              248.37
H_Asian_F                 $              235.73
NH_Indigenous_M           $              223.69
NH_White_M                $              207.31
H_PacificIslander_F       $              206.81
H_Indigenous_M            $              204.21
NH_Indigenous_F           $              185.54
H_Indigenous_F            $              185.15
NH_White_F                $              172.50

A.8 What is the average income for each ethnicity_race_sex group?¶

In [30]:
# Groping the data by 'ethnicity_race_sex' and calculate the average income
average_income_by_ethnicity_race_sex = df.groupby('ethnicity_race_sex')['applicant_income_000s'].mean()

# Sorting values
average_income_sorted = average_income_by_ethnicity_race_sex.sort_values(ascending=False)

# Displaying the results
print("Average Applicant Income by Ethnicity, Race, and Sex:")
print(average_income_sorted)
Average Applicant Income by Ethnicity, Race, and Sex:
ethnicity_race_sex
NH_Asian_M              136.616035
H_Asian_M               135.480000
NH_PacificIslander_M    125.430586
NH_Asian_F              124.184847
NH_Indigenous_M         114.870072
H_White_M               114.076235
NH_White_M              113.654220
NH_Black_M              110.438897
NH_PacificIslander_F    106.105263
H_Black_M               104.784418
H_White_F               100.674003
H_PacificIslander_M      99.802752
H_Asian_F                97.642276
H_Indigenous_M           93.327273
H_Black_F                92.533792
NH_Black_F               89.674221
NH_Indigenous_F          83.120000
H_Indigenous_F           82.941909
NH_White_F               82.184455
H_PacificIslander_F      76.260331
Name: applicant_income_000s, dtype: float64

A.9 What is the average loan-to-income ratio for each group?¶

In [31]:
df['loan_to_income_ratio'] = df['loan_amount_000s'] / df['applicant_income_000s']
avg_ratio_by_group = df.groupby('ethnicity_race_sex')['loan_to_income_ratio'].mean()

avg_ratio_by_group_sorted = avg_ratio_by_group.sort_values(ascending=False)
print(f"{'Group':<25} {'Average Loan-to-Income Ratio'}")
print("-" * 50)
for group, avg_ratio in avg_ratio_by_group_sorted.items():
    print(f"{group:<25} {avg_ratio:.2f}")
Group                     Average Loan-to-Income Ratio
--------------------------------------------------
NH_PacificIslander_F      3.08
H_Black_F                 3.05
NH_Black_F                3.04
H_White_F                 2.90
NH_Black_M                2.89
NH_Asian_F                2.86
H_White_M                 2.83
NH_Asian_M                2.79
H_Asian_F                 2.79
NH_PacificIslander_M      2.72
H_PacificIslander_M       2.72
H_Black_M                 2.70
H_Asian_M                 2.59
H_PacificIslander_F       2.54
H_Indigenous_M            2.53
NH_Indigenous_F           2.40
H_Indigenous_F            2.31
NH_Indigenous_M           2.29
NH_White_F                2.25
NH_White_M                2.09

A.10 What is the average rate_spread for each group?¶

In [32]:
# Grouping the data by 'ethnicity_race_sex' and calculate the average rate spread
average_rate_spread_by_ethnicity_race_sex = df.groupby('ethnicity_race_sex')['rate_spread'].mean()

# Sorting results
average_rate_spread_sorted = average_rate_spread_by_ethnicity_race_sex.sort_values(ascending=False)
# Displaying the results
print("Average Rate Spread by Ethnicity, Race, and Sex:")
print(average_rate_spread_sorted)
Average Rate Spread by Ethnicity, Race, and Sex:
ethnicity_race_sex
H_Asian_M               0.695257
NH_Black_M              0.659511
NH_Black_F              0.644287
H_White_M               0.545123
NH_PacificIslander_M    0.535141
H_White_F               0.508944
H_Black_M               0.499509
NH_Indigenous_F         0.464952
NH_PacificIslander_F    0.440432
NH_White_M              0.421578
NH_White_F              0.415016
NH_Indigenous_M         0.391855
H_Asian_F               0.361545
H_Black_F               0.359129
H_PacificIslander_M     0.310642
H_Indigenous_M          0.291714
H_Indigenous_F          0.282241
NH_Asian_M              0.249730
NH_Asian_F              0.228814
H_PacificIslander_F     0.195248
Name: rate_spread, dtype: float64

A.11 Which group has the fewest zero values in the column "rate_spread", suggesting they are more frequently overcharged?¶

In [33]:
# Calculating the total record we have per group
total_records = df.groupby('ethnicity_race_sex').size()

# Counting records where rate_spread is NOT 0
non_zero_counts = df.groupby('ethnicity_race_sex')['rate_spread'].apply(lambda x: (x != 0).sum())

# Calculating percentage of non-zero records
non_zero_percentage = (non_zero_counts / total_records) * 100

# Sorting by non-zero percentage
non_zero_percentage_sorted = non_zero_percentage.sort_values(ascending=False)

# Displaying results
print(f"{'Group':<25} {'% Non-Zero'}")
print("-" * 40)
for group in non_zero_percentage_sorted.index:
    print(f"{group:<25} {non_zero_percentage_sorted[group]:.2f}%")
Group                     % Non-Zero
----------------------------------------
H_Asian_M                 12.57%
NH_Black_M                12.14%
NH_Black_F                12.08%
H_White_M                 10.43%
NH_Indigenous_F           9.79%
H_White_F                 9.67%
NH_PacificIslander_M      9.65%
H_Black_M                 9.18%
NH_PacificIslander_F      8.50%
NH_White_M                8.15%
NH_White_F                8.06%
NH_Indigenous_M           7.35%
H_Black_F                 6.99%
H_Asian_F                 6.50%
H_PacificIslander_M       6.42%
H_Indigenous_M            5.71%
H_Indigenous_F            4.98%
NH_Asian_M                4.78%
NH_Asian_F                4.46%
H_PacificIslander_F       3.72%

B) Statistics & Feature Engineering; Feature Selection¶

In [34]:
# Feature Engineering; feature selection
df = df[[
    "loan_type_name",
    "property_type_name",
    "loan_purpose_name",
    "loan_amount_000s",
    "action_taken_name",
    "msamd_name",
    "census_tract_number",
    "hoepa_status_name",
    "applicant_income_000s",
    "denial_reason_name_1",
    "rate_spread",
    "lien_status_name",
    "hud_median_family_income",
    "tract_to_msamd_income",
    "minority_population",
    "number_of_owner_occupied_units",
    "ethnicity_race_sex",
    'loan_to_income_ratio'
]]
In [35]:
# Statistics
df.describe().T
Out[35]:
count mean std min 25% 50% 75% max
loan_amount_000s 318841.0 222.612211 211.328864 1.000000 69.000000 150.000000 342.000000 1998.000000
census_tract_number 318841.0 1498.589303 2685.193183 1.000000 130.000000 354.000000 1237.020000 9929.000000
applicant_income_000s 318841.0 103.666790 116.434086 10.000000 47.000000 76.000000 120.000000 1999.000000
rate_spread 318841.0 0.452464 1.547799 0.000000 0.000000 0.000000 0.000000 13.810000
hud_median_family_income 318841.0 65364.387579 13759.264157 50900.000000 59300.000000 59500.000000 66300.000000 93800.000000
tract_to_msamd_income 318841.0 107.065156 45.470959 5.050000 82.199997 99.440002 121.019997 404.359985
minority_population 318841.0 33.289470 34.088757 0.630000 6.100000 16.440001 59.490002 100.000000
number_of_owner_occupied_units 318841.0 1079.190179 629.977193 2.000000 542.000000 1039.000000 1521.000000 4825.000000
loan_to_income_ratio 318841.0 2.403550 1.852347 0.001515 1.034483 2.225000 3.381818 84.000000

C) Visualization of categorical and numerical data¶

C.1 Categorical data; histograms¶

In [36]:
categorical_columns = ['loan_type_name', 'loan_purpose_name', 'property_type_name', 'lien_status_name']

# Plotting categorical columns
for col in categorical_columns:
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x=col, hue='action_taken_name')

    plt.title(f'Distribution of {col} against action_taken_name')
    plt.xticks(rotation=45)

    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [37]:
# horizontal barplot
def value_counts_bar(col, title):
    vc = col.value_counts()
    plt.figure(figsize=(6, len(vc.keys())/2))
    plt.barh(vc.keys(), vc.values)
    plt.title(title, fontsize=15)
    plt.xticks(rotation='vertical', fontsize=10)
    plt.show()

# https://q-shick.github.io/portfolio/proj_hmda.html by By Kyoosik Kim

value_counts_bar(df.action_taken_name, "Action Taken")
value_counts_bar(df.loan_purpose_name, "Loan Purpose")
No description has been provided for this image
No description has been provided for this image

C.2 Numercial Data:¶

Looking for outliers with boxplot¶

In [38]:
numerical_columns = ['loan_amount_000s', 'applicant_income_000s', 'minority_population',
                     'hud_median_family_income', 'tract_to_msamd_income']

# Plotting numerical columns
for col in numerical_columns:
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=df, x='action_taken_name', y=col, hue='action_taken_name')

    plt.title(f'Distribution of {col} against action_taken_name')
    plt.xticks(rotation=45)
    plt.show()

#https://seaborn.pydata.org/generated/seaborn.countplot.html
#https://seaborn.pydata.org/generated/seaborn.boxplot.html

# Problems encountered: Boxplot didn't display different colours for denied/approved
  ## Solution: Forgot to add "hue" in the (boxplot loop) second loop.
    ### Sources: https://stackoverflow.com/questions/71459836/how-to-color-each-boxplot-differently#:~:text=artists%20.-,The%20recommended%20way%20to%20change%20the%20colors%20in%20Seaborn%20is,or%20a%20list%20of%20colors.)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Filtering out outliers¶

In [39]:
# Percentile bounds for applicant_income_000s
income_lower_bound = np.percentile(df['applicant_income_000s'], 1)
income_upper_bound = np.percentile(df['applicant_income_000s'], 99)

# Percentile bounds for loan_amount_000s
loan_lower_bound = np.percentile(df['loan_amount_000s'], 1)
loan_upper_bound = np.percentile(df['loan_amount_000s'], 99)

# Percentile bounds for hud_median_family_income
hud_income_lower_bound = np.percentile(df['hud_median_family_income'], 1)
hud_income_upper_bound = np.percentile(df['hud_median_family_income'], 99)

# Apply filtering to all three
df = df[
    (df['applicant_income_000s'] >= income_lower_bound) &
    (df['applicant_income_000s'] <= income_upper_bound) &
    (df['loan_amount_000s'] >= loan_lower_bound) &
    (df['loan_amount_000s'] <= loan_upper_bound) &
    (df['hud_median_family_income'] >= hud_income_lower_bound) &
    (df['hud_median_family_income'] <= hud_income_upper_bound)
]

Checking statistics again¶

In [40]:
# Statistics
df.describe().T
Out[40]:
count mean std min 25% 50% 75% max
loan_amount_000s 308206.0 213.700256 181.900571 4.000000 70.000000 150.000000 336.000000 994.000000
census_tract_number 308206.0 1494.728822 2671.093164 1.000000 131.000000 362.000000 1238.020000 9929.000000
applicant_income_000s 308206.0 94.487473 71.337269 16.000000 48.000000 76.000000 118.000000 597.000000
rate_spread 308206.0 0.461634 1.563350 0.000000 0.000000 0.000000 0.000000 13.810000
hud_median_family_income 308206.0 65463.361518 13798.666392 50900.000000 59300.000000 59500.000000 66300.000000 93800.000000
tract_to_msamd_income 308206.0 106.027853 43.125547 5.050000 82.070000 99.290001 120.610001 404.359985
minority_population 308206.0 33.602562 34.252241 0.630000 6.140000 16.610001 60.650002 100.000000
number_of_owner_occupied_units 308206.0 1079.279900 629.946044 2.000000 541.000000 1039.000000 1521.000000 4825.000000
loan_to_income_ratio 308206.0 2.417657 1.726121 0.007937 1.085106 2.253704 3.397436 41.187500

Checking for Skewnewss with histograms¶

In [41]:
df['income_sqrt'] = np.sqrt(df['applicant_income_000s'])
df['loan_amount_sqrt'] = np.sqrt(df['loan_amount_000s'])

# Note about sqrt;
  # - We are calculating the square root of each value in those columns.
  # - creating two new columns with the transformed values
  # - Why? To reduce the scale of the numbers for a better visualization.

  # OTHER BENEFITS:
  # - To make skewed data more normal,
  # - Stabilize variance,
  # - Reduce the influence of large numbers.

  # ALTERNATIVELY;
  # - We could use np.log() which is 'stronger', more agressive
  # - BUT requieres all values to be >0.

# Code's source: https://q-shick.github.io/portfolio/proj_hmda.html
In [42]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for action in df['action_taken_name'].unique():
    sns.kdeplot(
        data=df[df['action_taken_name'] == action],
        x='income_sqrt',
        ax=axes[0],
        label=action
    )
    sns.kdeplot(
        data=df[df['action_taken_name'] == action],
        x='loan_amount_sqrt',
        ax=axes[1],
        label=action
    )

axes[0].set_title("Income (sqrt) by Loan Decision")
axes[1].set_title("Loan Amount (sqrt) by Loan Decision")
axes[1].legend(bbox_to_anchor=(1.05, 0.6))
plt.tight_layout()
plt.show()

# W


# Code's source: https://q-shick.github.io/portfolio/proj_hmda.html
No description has been provided for this image
In [43]:
# Split the data
dist_0 = df[df['action_taken_name'] == 'approved']['income_sqrt']
dist_1 = df[df['action_taken_name'] == 'denied']['income_sqrt']

# Run Welch’s t-test (assumes unequal variances)
stat, p_value = stats.ttest_ind(dist_0, dist_1, equal_var=False)

# Print results
print(f"Stat: {stat:.2f}, P-value: {p_value:}")
Stat: 25.60, P-value: 2.3018618003693452e-144

Step 5.1: Feature engineering¶

A) Feature creation

B) Feature selection¶