Original dataset: https://www.consumerfinance.gov/data-research/hmda/historic-data/?geo=ny&records=all-records&field_descriptions=labels

Preprocessing applied so far to the original dataset:

  1. Selected "no co-applicant" records from the column "co_applicant_ethnicity_name"
  2. Filtered on the columns:
    • action_taken
    • denial_reason_name_1
  3. Created the column "action_taken":
    • df['action_taken'] = df['action_taken_name'].replace({'Loan originated': 'approved', 'Application approved but not accepted': 'approved', 'Application denied by financial institution': 'denied'})
  4. First feature selection:
    • df = df[

    ['loan_type_name',
    'property_type_name',
    'loan_purpose_name',
    'loan_amount_000s',
    'action_taken',
    'msamd_name',
    'applicant_ethnicity_name',
    'applicant_race_name_1',
    'applicant_sex_name',
    'applicant_income_000s',
    'denial_reason_name_1',
    'denial_reason_name_2',
    'denial_reason_name_3',
    'rate_spread',
    'lien_status_name',
    'minority_population',
    'hud_median_family_income',
    'tract_to_msamd_income']

    ]
  5. Excluded "Credit application incomplete" records from the column "denial_reason_name_1"
  6. Created the new column "ethnicity_race_sex"
  7. Miscellaneous preprocessing:
    • Dropped 'msamd_name'
    • Removed the 20 records where property_type_name is 'One-to-four family dwelling (other than manufactured housing)' and tract_to_msamd_income is null
    • Created missing-value indicator ("mirror") columns for 'tract_to_msamd_income', 'minority_population', and 'hud_median_family_income'
    • Filled the nulls in those columns with the median of each group in the column "ethnicity_race_sex" (see the sketch after this list)
    • Created the new column "loan_to_income_ratio"
    • Filled remaining missing values with:
      • 0 for rate_spread
      • "unknown" for denial_reason_name_1, denial_reason_name_2, denial_reason_name_3
  8. Addressed outliers in applicant_income_000s and loan_amount_000s
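For reference, a minimal sketch of the indicator-column ("mirror") plus group-median imputation described above, assuming df is the HMDA DataFrame at that stage; it illustrates the described steps and is not necessarily the exact code used earlier.

# Sketch: add a missing indicator per column, then impute with the group median
cols_to_impute = ['tract_to_msamd_income', 'minority_population', 'hud_median_family_income']
for col in cols_to_impute:
    # "mirror" column: 1 where the value was originally missing, 0 otherwise
    df[f'{col}_missing'] = df[col].isnull().astype(int)
    # fill nulls with the median of the record's ethnicity_race_sex group
    df[col] = df[col].fillna(df.groupby('ethnicity_race_sex')[col].transform('median'))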
In [1]:
!pip install shap
Collecting shap
  Downloading shap-0.46.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (24 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from shap) (1.26.4)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from shap) (1.13.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from shap) (1.3.2)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from shap) (2.1.4)
Requirement already satisfied: tqdm>=4.27.0 in /usr/local/lib/python3.10/dist-packages (from shap) (4.66.5)
Requirement already satisfied: packaging>20.9 in /usr/local/lib/python3.10/dist-packages (from shap) (24.1)
Collecting slicer==0.0.8 (from shap)
  Downloading slicer-0.0.8-py3-none-any.whl.metadata (4.0 kB)
Requirement already satisfied: numba in /usr/local/lib/python3.10/dist-packages (from shap) (0.60.0)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.10/dist-packages (from shap) (2.2.1)
Requirement already satisfied: llvmlite<0.44,>=0.43.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba->shap) (0.43.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->shap) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->shap) (2024.2)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->shap) (2024.1)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->shap) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->shap) (3.5.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->shap) (1.16.0)
Downloading shap-0.46.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (540 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 540.1/540.1 kB 11.2 MB/s eta 0:00:00
Downloading slicer-0.0.8-py3-none-any.whl (15 kB)
Installing collected packages: slicer, shap
Successfully installed shap-0.46.0 slicer-0.0.8
In [2]:
!pip install aif360
Collecting aif360
  Downloading aif360-0.6.1-py3-none-any.whl.metadata (5.0 kB)
Requirement already satisfied: numpy>=1.16 in /usr/local/lib/python3.10/dist-packages (from aif360) (1.26.4)
Requirement already satisfied: scipy>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from aif360) (1.13.1)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.10/dist-packages (from aif360) (2.1.4)
Requirement already satisfied: scikit-learn>=1.0 in /usr/local/lib/python3.10/dist-packages (from aif360) (1.3.2)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from aif360) (3.7.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24.0->aif360) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24.0->aif360) (2024.2)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24.0->aif360) (2024.1)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.0->aif360) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.0->aif360) (3.5.0)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360) (1.3.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360) (4.53.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360) (1.4.7)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360) (24.1)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360) (3.1.4)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=0.24.0->aif360) (1.16.0)
Downloading aif360-0.6.1-py3-none-any.whl (259 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 259.7/259.7 kB 8.3 MB/s eta 0:00:00
Installing collected packages: aif360
Successfully installed aif360-0.6.1
In [3]:
pip install 'aif360[Reductions]'
Requirement already satisfied: aif360[Reductions] in /usr/local/lib/python3.10/dist-packages (0.6.1)
Requirement already satisfied: numpy>=1.16 in /usr/local/lib/python3.10/dist-packages (from aif360[Reductions]) (1.26.4)
Requirement already satisfied: scipy>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from aif360[Reductions]) (1.13.1)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.10/dist-packages (from aif360[Reductions]) (2.1.4)
Requirement already satisfied: scikit-learn>=1.0 in /usr/local/lib/python3.10/dist-packages (from aif360[Reductions]) (1.3.2)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from aif360[Reductions]) (3.7.1)
Collecting fairlearn~=0.7 (from aif360[Reductions])
  Downloading fairlearn-0.10.0-py3-none-any.whl.metadata (7.0 kB)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24.0->aif360[Reductions]) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24.0->aif360[Reductions]) (2024.2)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24.0->aif360[Reductions]) (2024.1)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.0->aif360[Reductions]) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.0->aif360[Reductions]) (3.5.0)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360[Reductions]) (1.3.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360[Reductions]) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360[Reductions]) (4.53.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360[Reductions]) (1.4.7)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360[Reductions]) (24.1)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360[Reductions]) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360[Reductions]) (3.1.4)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=0.24.0->aif360[Reductions]) (1.16.0)
Downloading fairlearn-0.10.0-py3-none-any.whl (234 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 234.1/234.1 kB 6.8 MB/s eta 0:00:00
Installing collected packages: fairlearn
Successfully installed fairlearn-0.10.0
In [4]:
pip install 'aif360[inFairness]'
Requirement already satisfied: aif360[inFairness] in /usr/local/lib/python3.10/dist-packages (0.6.1)
Requirement already satisfied: numpy>=1.16 in /usr/local/lib/python3.10/dist-packages (from aif360[inFairness]) (1.26.4)
Requirement already satisfied: scipy>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from aif360[inFairness]) (1.13.1)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.10/dist-packages (from aif360[inFairness]) (2.1.4)
Requirement already satisfied: scikit-learn>=1.0 in /usr/local/lib/python3.10/dist-packages (from aif360[inFairness]) (1.3.2)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from aif360[inFairness]) (3.7.1)
Collecting skorch (from aif360[inFairness])
  Downloading skorch-1.0.0-py3-none-any.whl.metadata (11 kB)
Collecting inFairness>=0.2.2 (from aif360[inFairness])
  Downloading inFairness-0.2.3-py3-none-any.whl.metadata (8.1 kB)
Collecting POT>=0.8.0 (from inFairness>=0.2.2->aif360[inFairness])
  Downloading POT-0.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (32 kB)
Requirement already satisfied: torch>=1.13.0 in /usr/local/lib/python3.10/dist-packages (from inFairness>=0.2.2->aif360[inFairness]) (2.4.0+cu121)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24.0->aif360[inFairness]) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24.0->aif360[inFairness]) (2024.2)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24.0->aif360[inFairness]) (2024.1)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.0->aif360[inFairness]) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.0->aif360[inFairness]) (3.5.0)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360[inFairness]) (1.3.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360[inFairness]) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360[inFairness]) (4.53.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360[inFairness]) (1.4.7)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360[inFairness]) (24.1)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360[inFairness]) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->aif360[inFairness]) (3.1.4)
Requirement already satisfied: tabulate>=0.7.7 in /usr/local/lib/python3.10/dist-packages (from skorch->aif360[inFairness]) (0.9.0)
Requirement already satisfied: tqdm>=4.14.0 in /usr/local/lib/python3.10/dist-packages (from skorch->aif360[inFairness]) (4.66.5)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=0.24.0->aif360[inFairness]) (1.16.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->inFairness>=0.2.2->aif360[inFairness]) (3.16.0)
Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->inFairness>=0.2.2->aif360[inFairness]) (4.12.2)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->inFairness>=0.2.2->aif360[inFairness]) (1.13.2)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->inFairness>=0.2.2->aif360[inFairness]) (3.3)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->inFairness>=0.2.2->aif360[inFairness]) (3.1.4)
Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->inFairness>=0.2.2->aif360[inFairness]) (2024.6.1)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.13.0->inFairness>=0.2.2->aif360[inFairness]) (2.1.5)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.13.0->inFairness>=0.2.2->aif360[inFairness]) (1.3.0)
Downloading inFairness-0.2.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.8/45.8 kB 1.6 MB/s eta 0:00:00
Downloading skorch-1.0.0-py3-none-any.whl (239 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 239.4/239.4 kB 11.6 MB/s eta 0:00:00
Downloading POT-0.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (835 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 835.4/835.4 kB 29.5 MB/s eta 0:00:00
Installing collected packages: POT, skorch, inFairness
Successfully installed POT-0.9.4 inFairness-0.2.3 skorch-1.0.0
In [5]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import lightgbm as lgbm

from lightgbm import LGBMClassifier

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score, log_loss

from scipy.stats import skew

from imblearn.over_sampling import SMOTE

from aif360.metrics import BinaryLabelDatasetMetric
from aif360.datasets import BinaryLabelDataset
from aif360.algorithms.preprocessing import Reweighing

import joblib
/usr/local/lib/python3.10/dist-packages/dask/dataframe/__init__.py:42: FutureWarning: 
Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.

  warnings.warn(msg, FutureWarning)
/usr/local/lib/python3.10/dist-packages/inFairness/utils/ndcg.py:37: FutureWarning: We've integrated functorch into PyTorch. As the final step of the integration, `functorch.vmap` is deprecated as of PyTorch 2.0 and will be deleted in a future version of PyTorch >= 2.3. Please use `torch.vmap` instead; see the PyTorch 2.0 release notes and/or the `torch.func` migration guide for more details https://pytorch.org/docs/main/func.migrating.html
  vect_normalized_discounted_cumulative_gain = vmap(
/usr/local/lib/python3.10/dist-packages/inFairness/utils/ndcg.py:48: FutureWarning: We've integrated functorch into PyTorch. As the final step of the integration, `functorch.vmap` is deprecated as of PyTorch 2.0 and will be deleted in a future version of PyTorch >= 2.3. Please use `torch.vmap` instead; see the PyTorch 2.0 release notes and/or the `torch.func` migration guide for more details https://pytorch.org/docs/main/func.migrating.html
  monte_carlo_vect_ndcg = vmap(vect_normalized_discounted_cumulative_gain, in_dims=(0,))
In [6]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

%ls "/content/drive/My Drive/Colab Notebooks/hmdaNY_02092024_1603_Ready_2.csv"
Mounted at /content/drive
'/content/drive/My Drive/Colab Notebooks/hmdaNY_02092024_1603_Ready_2.csv'
In [7]:
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/hmdaNY_02092024_1603_Ready_2.csv')

print(df.shape,'\n')
print(df.dtypes)
(146718, 15) 

action_taken                         object
applicant_income_000s               float64
ethnicity_race_sex                   object
hud_median_family_income            float64
hud_median_family_income_missing      int64
lien_status_name                     object
loan_amount_000s                      int64
loan_purpose_name                    object
loan_type_name                       object
loan_to_income_ratio                float64
minority_population                 float64
minority_population_missing           int64
property_type_name                   object
tract_to_msamd_income               float64
tract_to_msamd_income_missing         int64
dtype: object

A) Creation of binary target

In [8]:
# Create a binary column for approvals and denials
df['action_taken_binary'] = df['action_taken'].map({'denied': 1,'approved': 0})

# Drop the original 'action_taken' column
df = df.drop(columns=['action_taken'])

print(df['action_taken_binary'].value_counts())
action_taken_binary
0    116408
1     30310
Name: count, dtype: int64

B) Splitting data

  1. We use a 70/30 train/test split.
In [9]:
# Split the data
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)

train_df.to_csv('/content/drive/My Drive/Colab Notebooks/hmdaNY_03092024_1603_train_df.csv', index=False)
test_df.to_csv('/content/drive/My Drive/Colab Notebooks/hmdaNY_03092024_1603_test_df.csv', index=False)
In [10]:
print(train_df.shape)
print(test_df.shape)
(102702, 15)
(44016, 15)

C) Feature Engineering

C.1) Checking for Skewness

Checking skewness to decide whether to apply a log transformation, scaling, or both
  1. We apply the log transformation first (see the helper sketch after this list)
  2. Then scaling
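For context, a small helper sketch (not from the original notebook) that flags strongly right-skewed features for a log1p transform; the 1.5 cutoff is an assumption inferred from the decisions in the cells below, not a hard rule.

from scipy.stats import skew

def suggest_log1p(series, threshold=1.5):
    # flag a feature for log1p when its skewness exceeds the (assumed) cutoff
    s = skew(series.dropna())
    return s > threshold, s

for col in ['loan_to_income_ratio', 'applicant_income_000s', 'hud_median_family_income',
            'loan_amount_000s', 'minority_population', 'tract_to_msamd_income']:
    apply_log, s = suggest_log1p(train_df[col])
    print(f"{col}: skew={s:.2f} -> log1p: {apply_log}")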
In [11]:
# Log transformation
# Scaling

# 'loan_to_income_ratio'
plt.hist(train_df['loan_to_income_ratio'], bins=100, edgecolor='black')

plt.title('Distribution of loan_to_income_ratio')
plt.xlabel('loan_to_income_ratio')
plt.ylabel('Frequency')

plt.show()


# Calculating the skewness of loan_to_income_ratio
original_skewness = skew(train_df['loan_to_income_ratio'])
print(f"Skewness: {original_skewness}")
[Figure: histogram of loan_to_income_ratio]
Skewness: 4.0683291922765275
In [12]:
# Log transformation: yes
# Scaling: yes


# applicant_income_000s
plt.hist(train_df['applicant_income_000s'], bins=30, edgecolor='black')

plt.title('Distribution of Applicant Income (in 000s)')
plt.xlabel('Applicant Income (000s)')
plt.ylabel('Frequency')

plt.show()


# Calculating the skewness of applicant_income_000s
original_skewness = skew(train_df['applicant_income_000s'])
print(f"Skewness: {original_skewness}")
[Figure: histogram of applicant_income_000s]
Skewness: 3.39818358160144
In [13]:
# Log transformation: No
# Scaling: yes

# hud_median_family_income
plt.hist(train_df['hud_median_family_income'], bins=30, edgecolor='black')

plt.title('Distribution of hud_median_family_income')
plt.xlabel('hud_median_family_income')
plt.ylabel('Frequency')

plt.show()


# Calculating the skewness of hud_median_family_income
original_skewness = skew(train_df['hud_median_family_income'])
print(f"Skewness: {original_skewness}")
[Figure: histogram of hud_median_family_income]
Skewness: 1.2479818056447094
In [14]:
# Log transformation: yes
# scaling: yes

# loan_amount_000s
plt.hist(train_df['loan_amount_000s'], bins=30, edgecolor='black')

plt.title('Distribution of loan_amount_000s')
plt.xlabel('loan_amount_000s')
plt.ylabel('Frequency')

plt.show()


# Calculating the skewness of loan_amount_000s
original_skewness = skew(train_df['loan_amount_000s'])
print(f"Skewness: {original_skewness}")
[Figure: histogram of loan_amount_000s]
Skewness: 1.824575061311811
In [15]:
# ONLY scaling.

# minority_population
plt.hist(train_df['minority_population'], bins=30, edgecolor='black')

plt.title('Distribution of minority_population')
plt.xlabel('minority_population')
plt.ylabel('Frequency')

plt.show()


# Calculating the skewness of minority_population
original_skewness = skew(train_df['minority_population'])
print(f"Skewness: {original_skewness}")
[Figure: histogram of minority_population]
Skewness: 1.1907000593514665
In [16]:
# Log transformation optional
# Scaling yes

# tract_to_msamd_income
plt.hist(train_df['tract_to_msamd_income'], bins=30, edgecolor='black')

plt.title('Distribution of tract_to_msamd_income')
plt.xlabel('tract_to_msamd_income')
plt.ylabel('Frequency')

plt.show()


# Calculating the skewness of tract_to_msamd_income
original_skewness = skew(train_df['tract_to_msamd_income'])
print(f"Skewness: {original_skewness}")
[Figure: histogram of tract_to_msamd_income]
Skewness: 1.9468038744343557

C.2) Log Transformation

'loan_to_income_ratio',
'tract_to_msamd_income',
'loan_amount_000s',
'applicant_income_000s'
In [17]:
columns_to_log = ['loan_to_income_ratio', 'tract_to_msamd_income', 'loan_amount_000s', 'applicant_income_000s']

train_df[columns_to_log] = train_df[columns_to_log].apply(np.log1p)
test_df[columns_to_log] = test_df[columns_to_log].apply(np.log1p)

C.3) Scaling

Standardization (Z-score normalization)

  1. We want to avoid any leakage from the test data into the training process, hence we apply the scaler and one-hot encoding separately to train_df and test_df.
  2. We first fit the preprocessing steps on the training data only.
  3. We then apply the preprocessing to both sets: we use the parameters learned from the training data to transform both the training and test sets.

This approach ensures that:

  1. The test set remains truly unseen during the training process.
  2. Both training and test data are in the same format for model training and evaluation.
  3. No information from the test set leaks into the preprocessing steps.
In [18]:
# Features to Scale
columns_to_scaling = ['loan_to_income_ratio',
                    'applicant_income_000s',
                    'hud_median_family_income',
                    'loan_amount_000s',
                    'minority_population',
                    'tract_to_msamd_income'
                   ]

# Fitting preprocessing on training data
scaler = StandardScaler()
scaler.fit(train_df[columns_to_scaling])

## Applying preprocessing to both sets
train_df[columns_to_scaling] = scaler.transform(train_df[columns_to_scaling])
test_df[columns_to_scaling] = scaler.transform(test_df[columns_to_scaling])
In [19]:
# Check the mean and standard deviation
print("Means of scaled features:")
print(train_df[columns_to_scaling].mean())
print("\nStandard deviations of scaled features:")
print(train_df[columns_to_scaling].std())
Means of scaled features:
loan_to_income_ratio        2.184859e-16
applicant_income_000s      -2.292788e-16
hud_median_family_income   -2.480970e-16
loan_amount_000s           -1.053202e-15
minority_population        -7.131233e-17
tract_to_msamd_income      -2.276737e-15
dtype: float64

Standard deviations of scaled features:
loan_to_income_ratio        1.000005
applicant_income_000s       1.000005
hud_median_family_income    1.000005
loan_amount_000s            1.000005
minority_population         1.000005
tract_to_msamd_income       1.000005
dtype: float64
In [20]:
# Checking the range
print("\nMin values:")
print(train_df[columns_to_scaling].min())
print("\nMax values:")
print(train_df[columns_to_scaling].max())
Min values:
loan_to_income_ratio        -2.212567
applicant_income_000s       -2.513814
hud_median_family_income    -1.002217
loan_amount_000s            -3.041207
minority_population         -1.023697
tract_to_msamd_income      -11.541474
dtype: float64

Max values:
loan_to_income_ratio        5.389856
applicant_income_000s       3.714564
hud_median_family_income    2.033801
loan_amount_000s            2.021953
minority_population         2.410389
tract_to_msamd_income       3.007542
dtype: float64

C.4) One-hot encoding

We follow the same approach as above,

  1. We want to avoid any leakage from the test data into the training process, hence we apply the scaler and one-hot encoding separately to train_df and test_df.
  2. We first fit the preprocessing steps on the training data only.
  3. We then apply the preprocessing to both sets: we use the parameters learned from the training data to transform both the training and test sets.

This approach ensures that:

  1. The test set remains truly unseen during the training process.
  2. Both training and test data are in the same format for model training and evaluation.
  3. No information from the test set leaks into the preprocessing steps.
OneHotEncoder vs get_dummies:
OneHotEncoder and get_dummies are both used for one-hot encoding, but they have some differences:
  1. OneHotEncoder (from scikit-learn):
    • Can be fit on training data and applied to test data separately
    • Handles unseen categories in test data (via the 'handle_unknown' parameter; see the toy example after this list)
    • Part of scikit-learn's preprocessing module, integrates well with pipelines
    • Can handle non-string categorical data easily
  2. get_dummies (from pandas):
    • Simpler to use for quick encoding of pandas DataFrames
    • Applies encoding immediately to the entire dataset
    • Doesn't handle unseen categories in new data by default
    • Works directly on pandas DataFrames and returns a DataFrame
  3. Main difference (and why we are using OneHotEncoder):
    • OneHotEncoder is better for preventing data leakage and handling unseen categories in test data
    • get_dummies is more convenient for quick, one-time encoding of all data
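To illustrate the "unseen categories" point, a toy standalone example (made-up values, not this dataset's pipeline): with handle_unknown='ignore', a category absent from the training data is encoded as an all-zero row instead of raising an error.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
enc.fit(pd.DataFrame({'loan_type_name': ['Conventional', 'FHA-insured']}))

# 'VA-guaranteed' was never seen during fit, so it becomes an all-zero row
print(enc.transform(pd.DataFrame({'loan_type_name': ['Conventional', 'VA-guaranteed']})))
# [[1. 0.]
#  [0. 0.]]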
In [21]:
# Features to encode
categorical_columns = ['loan_type_name',
                       'loan_purpose_name',
                       'property_type_name',
                       'lien_status_name',
                       'ethnicity_race_sex']

# Fitting preprocessing on training data only
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_df[categorical_columns])

# Applying preprocessing on both sets
train_df_LogEnco = encoder.transform(train_df[categorical_columns])
test_df_LogEnco = encoder.transform(test_df[categorical_columns])

# Creating new column names for the encoded features
new_column_names = encoder.get_feature_names_out(categorical_columns)

# Converting the encoded arrays to DataFrames
train_df_LogEnco = pd.DataFrame(train_df_LogEnco, columns=new_column_names, index=train_df.index)
test_df_LogEnco = pd.DataFrame(test_df_LogEnco, columns=new_column_names, index=test_df.index)

# Drop the original categorical columns and add the encoded columns
train_df = train_df.drop(columns=categorical_columns).join(train_df_LogEnco)
test_df = test_df.drop(columns=categorical_columns).join(test_df_LogEnco)

#https://medium.com/@vinodkumargr/11-column-transformer-in-ml-sklearn-column-transformer-in-machine-learning-48479f8cb48f#:~:text=Scikit%2DLearn's%20Column%20Transformer%20is,transformer%20should%20be%20applied%20to.
#https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
In [22]:
print(train_df)
        applicant_income_000s  hud_median_family_income  \
93790               -0.560030                 -0.621935   
3731                 0.380770                 -0.279059   
58545                2.404752                 -0.279059   
142437              -0.189761                  2.033801   
21819                0.030613                 -0.279059   
...                       ...                       ...   
110268              -1.733987                 -0.603233   
119879              -1.167283                 -0.603233   
103694              -0.716095                 -0.621935   
131932               0.091232                  2.033801   
121958               0.009887                  0.020179   

        hud_median_family_income_missing  loan_amount_000s  \
93790                                  0         -0.235946   
3731                                   0          0.774233   
58545                                  0          1.897857   
142437                                 0          0.573865   
21819                                  0          0.066945   
...                                  ...               ...   
110268                                 0         -2.694126   
119879                                 0         -0.268873   
103694                                 0         -0.311898   
131932                                 0          0.679899   
121958                                 0         -1.136912   

        loan_to_income_ratio  minority_population  \
93790              -0.117323             1.438200   
3731                0.707355            -0.335849   
58545               0.781324            -0.119502   
142437              0.898603             1.477005   
21819              -0.136738             0.080362   
...                      ...                  ...   
110268             -1.862718             1.527486   
119879              0.354152            -0.740041   
103694             -0.104487             2.104069   
131932              0.817267            -0.266481   
121958             -1.473303            -0.368130   

        minority_population_missing  tract_to_msamd_income  \
93790                             0              -1.883378   
3731                              0               0.883041   
58545                             0               0.680940   
142437                            0              -1.704510   
21819                             0               0.811845   
...                             ...                    ...   
110268                            0              -1.092325   
119879                            0              -0.024931   
103694                            0              -1.151373   
131932                            0              -0.389044   
121958                            0               0.302802   

        tract_to_msamd_income_missing  action_taken_binary  ...  \
93790                               0                    0  ...   
3731                                0                    0  ...   
58545                               0                    0  ...   
142437                              0                    0  ...   
21819                               0                    0  ...   
...                               ...                  ...  ...   
110268                              0                    1  ...   
119879                              0                    0  ...   
103694                              0                    0  ...   
131932                              0                    1  ...   
121958                              0                    0  ...   

        ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female  \
93790                                                 0.0                                   
3731                                                  0.0                                   
58545                                                 0.0                                   
142437                                                0.0                                   
21819                                                 0.0                                   
...                                                   ...                                   
110268                                                0.0                                   
119879                                                0.0                                   
103694                                                0.0                                   
131932                                                0.0                                   
121958                                                0.0                                   

        ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male  \
93790                                                 0.0                                 
3731                                                  0.0                                 
58545                                                 0.0                                 
142437                                                0.0                                 
21819                                                 0.0                                 
...                                                   ...                                 
110268                                                0.0                                 
119879                                                0.0                                 
103694                                                0.0                                 
131932                                                0.0                                 
121958                                                0.0                                 

        ethnicity_race_sex_not hispanic or latino_asian_female  \
93790                                                 0.0        
3731                                                  1.0        
58545                                                 0.0        
142437                                                0.0        
21819                                                 0.0        
...                                                   ...        
110268                                                0.0        
119879                                                0.0        
103694                                                0.0        
131932                                                0.0        
121958                                                0.0        

        ethnicity_race_sex_not hispanic or latino_asian_male  \
93790                                                 0.0      
3731                                                  0.0      
58545                                                 0.0      
142437                                                0.0      
21819                                                 1.0      
...                                                   ...      
110268                                                0.0      
119879                                                0.0      
103694                                                0.0      
131932                                                0.0      
121958                                                0.0      

        ethnicity_race_sex_not hispanic or latino_black or african american_female  \
93790                                                 0.0                            
3731                                                  0.0                            
58545                                                 0.0                            
142437                                                0.0                            
21819                                                 0.0                            
...                                                   ...                            
110268                                                0.0                            
119879                                                0.0                            
103694                                                1.0                            
131932                                                0.0                            
121958                                                0.0                            

        ethnicity_race_sex_not hispanic or latino_black or african american_male  \
93790                                                 0.0                          
3731                                                  0.0                          
58545                                                 0.0                          
142437                                                0.0                          
21819                                                 0.0                          
...                                                   ...                          
110268                                                1.0                          
119879                                                0.0                          
103694                                                0.0                          
131932                                                0.0                          
121958                                                0.0                          

        ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female  \
93790                                                 0.0                                            
3731                                                  0.0                                            
58545                                                 0.0                                            
142437                                                0.0                                            
21819                                                 0.0                                            
...                                                   ...                                            
110268                                                0.0                                            
119879                                                0.0                                            
103694                                                0.0                                            
131932                                                0.0                                            
121958                                                0.0                                            

        ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male  \
93790                                                 0.0                                          
3731                                                  0.0                                          
58545                                                 0.0                                          
142437                                                0.0                                          
21819                                                 0.0                                          
...                                                   ...                                          
110268                                                0.0                                          
119879                                                0.0                                          
103694                                                0.0                                          
131932                                                0.0                                          
121958                                                0.0                                          

        ethnicity_race_sex_not hispanic or latino_white_female  \
93790                                                 0.0        
3731                                                  0.0        
58545                                                 0.0        
142437                                                1.0        
21819                                                 0.0        
...                                                   ...        
110268                                                0.0        
119879                                                0.0        
103694                                                0.0        
131932                                                0.0        
121958                                                0.0        

        ethnicity_race_sex_not hispanic or latino_white_male  
93790                                                 1.0     
3731                                                  0.0     
58545                                                 1.0     
142437                                                0.0     
21819                                                 0.0     
...                                                   ...     
110268                                                0.0     
119879                                                1.0     
103694                                                0.0     
131932                                                0.0     
121958                                                1.0     

[102702 rows x 43 columns]
In [23]:
print(train_df.isnull().sum())
applicant_income_000s                                                                         0
hud_median_family_income                                                                      0
hud_median_family_income_missing                                                              0
loan_amount_000s                                                                              0
loan_to_income_ratio                                                                          0
minority_population                                                                           0
minority_population_missing                                                                   0
tract_to_msamd_income                                                                         0
tract_to_msamd_income_missing                                                                 0
action_taken_binary                                                                           0
loan_type_name_Conventional                                                                   0
loan_type_name_FHA-insured                                                                    0
loan_type_name_FSA/RHS-guaranteed                                                             0
loan_type_name_VA-guaranteed                                                                  0
loan_purpose_name_Home improvement                                                            0
loan_purpose_name_Home purchase                                                               0
loan_purpose_name_Refinancing                                                                 0
property_type_name_Manufactured housing                                                       0
property_type_name_Multifamily dwelling                                                       0
property_type_name_One-to-four family dwelling (other than manufactured housing)              0
lien_status_name_Not secured by a lien                                                        0
lien_status_name_Secured by a first lien                                                      0
lien_status_name_Secured by a subordinate lien                                                0
ethnicity_race_sex_hispanic or latino_american indian or alaska native_female                 0
ethnicity_race_sex_hispanic or latino_american indian or alaska native_male                   0
ethnicity_race_sex_hispanic or latino_asian_female                                            0
ethnicity_race_sex_hispanic or latino_asian_male                                              0
ethnicity_race_sex_hispanic or latino_black or african american_female                        0
ethnicity_race_sex_hispanic or latino_black or african american_male                          0
ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female        0
ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male          0
ethnicity_race_sex_hispanic or latino_white_female                                            0
ethnicity_race_sex_hispanic or latino_white_male                                              0
ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female             0
ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male               0
ethnicity_race_sex_not hispanic or latino_asian_female                                        0
ethnicity_race_sex_not hispanic or latino_asian_male                                          0
ethnicity_race_sex_not hispanic or latino_black or african american_female                    0
ethnicity_race_sex_not hispanic or latino_black or african american_male                      0
ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female    0
ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male      0
ethnicity_race_sex_not hispanic or latino_white_female                                        0
ethnicity_race_sex_not hispanic or latino_white_male                                          0
dtype: int64

D) Define features and target variable:

We separate the features (X) and target variable (y) for both train and test sets.
In [24]:
features =[    'action_taken_binary',
 #               'applicant_income_000s',
                'hud_median_family_income',
                'hud_median_family_income_missing',
#                'loan_amount_000s',
                'loan_to_income_ratio',
                'minority_population',
                'minority_population_missing',
                'tract_to_msamd_income',
                'tract_to_msamd_income_missing',
                'loan_type_name_Conventional',
                'loan_type_name_FHA-insured',
                'loan_type_name_FSA/RHS-guaranteed',
                'loan_type_name_VA-guaranteed',
                'loan_purpose_name_Home improvement',
                'loan_purpose_name_Home purchase',
                'loan_purpose_name_Refinancing',
                'property_type_name_Manufactured housing',
                'property_type_name_Multifamily dwelling',
                'property_type_name_One-to-four family dwelling (other than manufactured housing)',
                'lien_status_name_Not secured by a lien',
                'lien_status_name_Secured by a first lien',
                'lien_status_name_Secured by a subordinate lien',
                "ethnicity_race_sex_hispanic or latino_american indian or alaska native_female",
                "ethnicity_race_sex_hispanic or latino_american indian or alaska native_male",
                "ethnicity_race_sex_hispanic or latino_asian_female",
                "ethnicity_race_sex_hispanic or latino_asian_male",
                "ethnicity_race_sex_hispanic or latino_black or african american_female",
                "ethnicity_race_sex_hispanic or latino_black or african american_male",
                "ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female",
                "ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male",
                "ethnicity_race_sex_hispanic or latino_white_female",
                "ethnicity_race_sex_hispanic or latino_white_male",
                "ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female",
                "ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male",
                "ethnicity_race_sex_not hispanic or latino_asian_female",
                "ethnicity_race_sex_not hispanic or latino_asian_male",
                "ethnicity_race_sex_not hispanic or latino_black or african american_female",
                "ethnicity_race_sex_not hispanic or latino_black or african american_male",
                "ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female",
                "ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male",
                "ethnicity_race_sex_not hispanic or latino_white_female",
                'ethnicity_race_sex_not hispanic or latino_white_male'
           ]


train_df = train_df[features]
test_df = test_df[features]
In [25]:
# Train set
X_train = train_df.drop('action_taken_binary', axis=1)
y_train = train_df['action_taken_binary']

# Test Set
X_test = test_df.drop('action_taken_binary', axis=1)
y_test = test_df['action_taken_binary']
In [26]:
# Checking the shapes of the resulting splits
print(f'X_train shape: {X_train.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_test shape: {y_test.shape}')
X_train shape: (102702, 40)
y_train shape: (102702,)
X_test shape: (44016, 40)
y_test shape: (44016,)

E) Bias measurement:

We measure bias BEFORE SMOTE with AIF360, using disparate impact as the main metric.
  1. First, we'll use the 20 one-hot encoded ethnicity-race-sex columns as our protected attributes.
  2. We'll build a BinaryLabelDataset from AIF360, specifying 'action_taken_binary' as the target and these columns as the protected attributes.
  3. We'll use BinaryLabelDatasetMetric to compute metrics for each group, focusing on disparate impact and statistical parity difference.
  4. We'll examine the metrics for each ethnicity-race-sex group and identify any groups facing significantly higher rejection rates.

Privileged class:

Since we are interested in disparities within our dataset, we use the most populous class as the privileged (benchmark) group.
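For reference, disparate impact here is the ratio of approval rates, P(approved | unprivileged group) / P(approved | privileged group), and statistical parity difference is the corresponding difference. The sketch below (not part of the original pipeline) computes both directly from train_df for a single, arbitrarily chosen group as a sanity check on the AIF360 output further down; approval corresponds to label 0.

priv_col = 'ethnicity_race_sex_not hispanic or latino_white_male'
grp_col = 'ethnicity_race_sex_hispanic or latino_white_female'  # example unprivileged group

# approval rate within each group (favorable outcome = action_taken_binary == 0)
p_priv = (train_df.loc[train_df[priv_col] == 1, 'action_taken_binary'] == 0).mean()
p_grp = (train_df.loc[train_df[grp_col] == 1, 'action_taken_binary'] == 0).mean()

print(f"Disparate impact:              {p_grp / p_priv:.4f}")
print(f"Statistical parity difference: {p_grp - p_priv:.4f}")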

In [27]:
print(train_df['ethnicity_race_sex_not hispanic or latino_white_male'])
93790     1.0
3731      0.0
58545     1.0
142437    0.0
21819     0.0
         ... 
110268    0.0
119879    1.0
103694    0.0
131932    0.0
121958    1.0
Name: ethnicity_race_sex_not hispanic or latino_white_male, Length: 102702, dtype: float64

E.1) Dataset creation¶

In [28]:
# Define variables:
protected_attribute_names=[
                "ethnicity_race_sex_hispanic or latino_american indian or alaska native_female",
                "ethnicity_race_sex_hispanic or latino_american indian or alaska native_male",
                "ethnicity_race_sex_hispanic or latino_asian_female",
                "ethnicity_race_sex_hispanic or latino_asian_male",
                "ethnicity_race_sex_hispanic or latino_black or african american_female",
                "ethnicity_race_sex_hispanic or latino_black or african american_male",
                "ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female",
                "ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male",
                "ethnicity_race_sex_hispanic or latino_white_female",
                "ethnicity_race_sex_hispanic or latino_white_male",
                "ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female",
                "ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male",
                "ethnicity_race_sex_not hispanic or latino_asian_female",
                "ethnicity_race_sex_not hispanic or latino_asian_male",
                "ethnicity_race_sex_not hispanic or latino_black or african american_female",
                "ethnicity_race_sex_not hispanic or latino_black or african american_male",
                "ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female",
                "ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male",
                "ethnicity_race_sex_not hispanic or latino_white_female",
                'ethnicity_race_sex_not hispanic or latino_white_male'  # note that we include the privileged group
                            ]

favorable_label = 0  # loan approved
unfavorable_label = 1  # loan denied

# First, we create the dataset
aif_dataset = BinaryLabelDataset(
    df=train_df,
    label_names=['action_taken_binary'],
    protected_attribute_names=protected_attribute_names,
    favorable_label=favorable_label,        # loan approved
    unfavorable_label=unfavorable_label     # loan denied
)

#https://www.rdocumentation.org/packages/aif360/versions/0.1.0/topics/aif_dataset
#https://aif360.readthedocs.io/en/latest/
#

E.2) Defining groups¶

In [29]:
# Defining privileged group directly
privileged_groups = [{'ethnicity_race_sex_not hispanic or latino_white_male': 1}]

# Defining unprivileged groups using a loop
unprivileged_groups = []
for attribute in protected_attribute_names:
    if attribute != 'ethnicity_race_sex_not hispanic or latino_white_male':
        unprivileged_groups.append({attribute: 1})

# Checking the groups
print("Privileged group:", privileged_groups)
print("Number of unprivileged groups:", len(unprivileged_groups))
print("First few unprivileged groups:", unprivileged_groups[:3])
Privileged group: [{'ethnicity_race_sex_not hispanic or latino_white_male': 1}]
Number of unprivileged groups: 19
First few unprivileged groups: [{'ethnicity_race_sex_hispanic or latino_american indian or alaska native_female': 1}, {'ethnicity_race_sex_hispanic or latino_american indian or alaska native_male': 1}, {'ethnicity_race_sex_hispanic or latino_asian_female': 1}]

E.3) Metrics¶

In [30]:
# Calculating metrics
metric = BinaryLabelDatasetMetric(aif_dataset,
                                  unprivileged_groups=unprivileged_groups,
                                  privileged_groups=privileged_groups)
In [31]:
# Printing metrics
print(f"Disparate Impact: {metric.disparate_impact():.4f}")
print(f"Statistical Parity Difference: {metric.statistical_parity_difference():.4f}")

# We calculate and print the mean difference in label predictions
print(f"Mean difference in label predictions: {metric.mean_difference():.4f}")

# Calculate group-specific metrics
for group in unprivileged_groups:
    group_metric = BinaryLabelDatasetMetric(aif_dataset,
                                            unprivileged_groups=[group],
                                            privileged_groups=privileged_groups)

    group_name = list(group.keys())[0]
    print(f"\nGroup: {group_name}")
    print(f"Disparate Impact: {group_metric.disparate_impact():.4f}")
    print(f"Statistical Parity Difference: {group_metric.statistical_parity_difference():.4f}")
Disparate Impact: 0.9534
Statistical Parity Difference: -0.0380
Mean difference in label predictions: -0.0380

Group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_female
Disparate Impact: 0.6647
Statistical Parity Difference: -0.2732

Group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_male
Disparate Impact: 0.5467
Statistical Parity Difference: -0.3694

Group: ethnicity_race_sex_hispanic or latino_asian_female
Disparate Impact: 0.8863
Statistical Parity Difference: -0.0927

Group: ethnicity_race_sex_hispanic or latino_asian_male
Disparate Impact: 0.7809
Statistical Parity Difference: -0.1785

Group: ethnicity_race_sex_hispanic or latino_black or african american_female
Disparate Impact: 0.7363
Statistical Parity Difference: -0.2149

Group: ethnicity_race_sex_hispanic or latino_black or african american_male
Disparate Impact: 0.7181
Statistical Parity Difference: -0.2297

Group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female
Disparate Impact: 0.5137
Statistical Parity Difference: -0.3963

Group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male
Disparate Impact: 0.5764
Statistical Parity Difference: -0.3452

Group: ethnicity_race_sex_hispanic or latino_white_female
Disparate Impact: 0.9077
Statistical Parity Difference: -0.0753

Group: ethnicity_race_sex_hispanic or latino_white_male
Disparate Impact: 0.9181
Statistical Parity Difference: -0.0667

Group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female
Disparate Impact: 0.8057
Statistical Parity Difference: -0.1584

Group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male
Disparate Impact: 0.7504
Statistical Parity Difference: -0.2034

Group: ethnicity_race_sex_not hispanic or latino_asian_female
Disparate Impact: 1.0054
Statistical Parity Difference: 0.0044

Group: ethnicity_race_sex_not hispanic or latino_asian_male
Disparate Impact: 0.9994
Statistical Parity Difference: -0.0005

Group: ethnicity_race_sex_not hispanic or latino_black or african american_female
Disparate Impact: 0.8262
Statistical Parity Difference: -0.1416

Group: ethnicity_race_sex_not hispanic or latino_black or african american_male
Disparate Impact: 0.8220
Statistical Parity Difference: -0.1451

Group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female
Disparate Impact: 0.8331
Statistical Parity Difference: -0.1360

Group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male
Disparate Impact: 0.8433
Statistical Parity Difference: -0.1277

Group: ethnicity_race_sex_not hispanic or latino_white_female
Disparate Impact: 0.9999
Statistical Parity Difference: -0.0001
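
A common rule of thumb (the four-fifths rule) flags a group when its disparate impact falls below 0.8. The loop above only prints the per-group values; a small extension that collects them and flags the affected groups (sketch):

# Collecting per-group disparate impact and flagging values below the 0.8 threshold (sketch)
group_di = {}
for group in unprivileged_groups:
    gm = BinaryLabelDatasetMetric(aif_dataset,
                                  unprivileged_groups=[group],
                                  privileged_groups=privileged_groups)
    group_di[list(group.keys())[0]] = gm.disparate_impact()

flagged = {name: di for name, di in group_di.items() if di < 0.8}
print(f"{len(flagged)} of {len(group_di)} groups fall below the 0.8 threshold:")
for name, di in sorted(flagged.items(), key=lambda kv: kv[1]):
    print(f"  {name}: {di:.4f}")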

F) SMOTE¶

We apply SMOTE to balance the classes in our training data.

In [32]:
# Initializing SMOTE
smote = SMOTE(random_state=42)

# Application of SMOTE only on the training set to balance the classes
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Checking the distribution of the target variable after SMOTE
print("Before SMOTE:", y_train.value_counts())
print("\n After SMOTE:", y_train_smote.value_counts())
Before SMOTE: action_taken_binary
0    81489
1    21213
Name: count, dtype: int64

 After SMOTE: action_taken_binary
0    81489
1    81489
Name: count, dtype: int64
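
Applying SMOTE only to the training split, as above, keeps synthetic rows out of the test set. When hyperparameters are later tuned with cross-validation, an alternative design is to resample inside each fold with an imblearn Pipeline, so the validation folds also stay free of synthetic samples. A sketch of that alternative (not the approach used in this notebook; the LogisticRegression estimator is only a placeholder):

# Alternative design (sketch): resampling inside each CV fold via an imblearn Pipeline
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),        # applied to the training part of each fold only
    ('clf', LogisticRegression(max_iter=1000))
])
# e.g. cross_val_score(pipe, X_train, y_train, cv=5, scoring='roc_auc')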

F.1) Bias AFTER SMOTE¶

We measure bias again after applying SMOTE.

F.1.1) Rebuilding the training DataFrame¶

We reassemble the resampled arrays into a single training DataFrame.

In [33]:
# Converting the resampled X_train and y_train into a DataFrame
X_train_s = pd.DataFrame(X_train_smote, columns=X_train.columns)  # We retained original column names
y_train_s = pd.DataFrame(y_train_smote, columns=['action_taken_binary'])

# Combining X and y into a single DataFrame
train_df_smote = pd.concat([X_train_s, y_train_s], axis=1)
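
One caveat: SMOTE creates synthetic rows by interpolating between neighbours, so the one-hot ethnicity_race_sex columns can end up with fractional values, and such a row would then match none of the {column: 1} group definitions used below. A quick check (sketch):

# Checking that the one-hot protected-attribute columns are still strictly 0/1 after SMOTE (sketch)
onehot_cols = [c for c in train_df_smote.columns if c.startswith('ethnicity_race_sex_')]
non_binary = (~train_df_smote[onehot_cols].isin([0, 1])).sum().sum()
print("Non-binary values in protected one-hot columns:", non_binary)

# If any appear, one option is to snap them back to the nearest class:
# train_df_smote[onehot_cols] = (train_df_smote[onehot_cols] > 0.5).astype(int)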

F.1.2) Recomputing the fairness metrics¶

We apply the same procedure used earlier to the resampled dataset and assess any differences.

In [34]:
# Defining variables:
protected_attribute_names=[
                "ethnicity_race_sex_hispanic or latino_american indian or alaska native_female",
                "ethnicity_race_sex_hispanic or latino_american indian or alaska native_male",
                "ethnicity_race_sex_hispanic or latino_asian_female",
                "ethnicity_race_sex_hispanic or latino_asian_male",
                "ethnicity_race_sex_hispanic or latino_black or african american_female",
                "ethnicity_race_sex_hispanic or latino_black or african american_male",
                "ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female",
                "ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male",
                "ethnicity_race_sex_hispanic or latino_white_female",
                "ethnicity_race_sex_hispanic or latino_white_male",
                "ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female",
                "ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male",
                "ethnicity_race_sex_not hispanic or latino_asian_female",
                "ethnicity_race_sex_not hispanic or latino_asian_male",
                "ethnicity_race_sex_not hispanic or latino_black or african american_female",
                "ethnicity_race_sex_not hispanic or latino_black or african american_male",
                "ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female",
                "ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male",
                "ethnicity_race_sex_not hispanic or latino_white_female",
                "ethnicity_race_sex_not hispanic or latino_white_male"  # notice we include the privileged group
                            ]

favorable_label = 0  # loan approved
unfavorable_label = 1  # loan denied

# Creating the dataset
aif_dataset = BinaryLabelDataset(
    df=train_df_smote,
    label_names=['action_taken_binary'],
    protected_attribute_names=protected_attribute_names,
    favorable_label=favorable_label,      # loan approved
    unfavorable_label=unfavorable_label   # loan denied
)
In [35]:
# Defining the privileged group directly
privileged_groups = [{'ethnicity_race_sex_not hispanic or latino_white_male': 1}]

# Defining unprivileged groups using a loop
unprivileged_groups = []
for attribute in protected_attribute_names:
    if attribute != 'ethnicity_race_sex_not hispanic or latino_white_male':
        unprivileged_groups.append({attribute: 1})

# Checking the groups for verification
print("Privileged group:", privileged_groups)
print("Number of unprivileged groups:", len(unprivileged_groups))
print("First few unprivileged groups:", unprivileged_groups[:3])
Privileged group: [{'ethnicity_race_sex_not hispanic or latino_white_male': 1}]
Number of unprivileged groups: 19
First few unprivileged groups: [{'ethnicity_race_sex_hispanic or latino_american indian or alaska native_female': 1}, {'ethnicity_race_sex_hispanic or latino_american indian or alaska native_male': 1}, {'ethnicity_race_sex_hispanic or latino_asian_female': 1}]
In [36]:
# Calculating metrics
metric = BinaryLabelDatasetMetric(aif_dataset,
                                  unprivileged_groups=unprivileged_groups,
                                  privileged_groups=privileged_groups)
In [37]:
# Printing metrics
print(f"Disparate Impact: {metric.disparate_impact():.4f}")
print(f"Statistical Parity Difference: {metric.statistical_parity_difference():.4f}")

# mean_difference() is computed on the observed labels and is equivalent to the statistical parity difference
print(f"Mean difference in label predictions: {metric.mean_difference():.4f}")

# Calculating group-specific metrics
for group in unprivileged_groups:
    group_metric = BinaryLabelDatasetMetric(aif_dataset,
                                            unprivileged_groups=[group],
                                            privileged_groups=privileged_groups)

    group_name = list(group.keys())[0]
    print(f"\nGroup: {group_name}")
    print(f"Disparate Impact: {group_metric.disparate_impact():.4f}")
    print(f"Statistical Parity Difference: {group_metric.statistical_parity_difference():.4f}")
Disparate Impact: 0.9054
Statistical Parity Difference: -0.0506
Mean difference in label predictions: -0.0506

Group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_female
Disparate Impact: 0.9000
Statistical Parity Difference: -0.0535

Group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_male
Disparate Impact: 0.4300
Statistical Parity Difference: -0.3049

Group: ethnicity_race_sex_hispanic or latino_asian_female
Disparate Impact: 1.3501
Statistical Parity Difference: 0.1873

Group: ethnicity_race_sex_hispanic or latino_asian_male
Disparate Impact: 1.1136
Statistical Parity Difference: 0.0608

Group: ethnicity_race_sex_hispanic or latino_black or african american_female
Disparate Impact: 0.6992
Statistical Parity Difference: -0.1609

Group: ethnicity_race_sex_hispanic or latino_black or african american_male
Disparate Impact: 0.6863
Statistical Parity Difference: -0.1678

Group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female
Disparate Impact: 0.4948
Statistical Parity Difference: -0.2702

Group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male
Disparate Impact: 0.6231
Statistical Parity Difference: -0.2016

Group: ethnicity_race_sex_hispanic or latino_white_female
Disparate Impact: 0.8179
Statistical Parity Difference: -0.0974

Group: ethnicity_race_sex_hispanic or latino_white_male
Disparate Impact: 0.8335
Statistical Parity Difference: -0.0891

Group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female
Disparate Impact: 0.8476
Statistical Parity Difference: -0.0815

Group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male
Disparate Impact: 0.7295
Statistical Parity Difference: -0.1447

Group: ethnicity_race_sex_not hispanic or latino_asian_female
Disparate Impact: 1.0254
Statistical Parity Difference: 0.0136

Group: ethnicity_race_sex_not hispanic or latino_asian_male
Disparate Impact: 1.0085
Statistical Parity Difference: 0.0045

Group: ethnicity_race_sex_not hispanic or latino_black or african american_female
Disparate Impact: 0.6568
Statistical Parity Difference: -0.1836

Group: ethnicity_race_sex_not hispanic or latino_black or african american_male
Disparate Impact: 0.6555
Statistical Parity Difference: -0.1843

Group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female
Disparate Impact: 1.0247
Statistical Parity Difference: 0.0132

Group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male
Disparate Impact: 0.9997
Statistical Parity Difference: -0.0002

Group: ethnicity_race_sex_not hispanic or latino_white_female
Disparate Impact: 0.9981
Statistical Parity Difference: -0.0010

G) Reweighting¶

In [38]:
# AIF360 dataset
aif_dataset = BinaryLabelDataset(
    df=train_df_smote,  # Using the SMOTE Dataset
    label_names=['action_taken_binary'],
    protected_attribute_names=protected_attribute_names,
    favorable_label=favorable_label,  # loan approved
    unfavorable_label=unfavorable_label  # loan denied
)

# Defining the privileged and unprivileged groups
privileged_groups = [{'ethnicity_race_sex_not hispanic or latino_white_male': 1}]
unprivileged_groups = []
for attribute in protected_attribute_names:
    if attribute != 'ethnicity_race_sex_not hispanic or latino_white_male':
        unprivileged_groups.append({attribute: 1})

# Applying the Reweighing algorithm
RW = Reweighing(unprivileged_groups=unprivileged_groups,
                privileged_groups=privileged_groups)

# Fitting the "reweighing model". Transforming the dataset
reweighted_dataset = RW.fit_transform(aif_dataset)

# Converting the reweighted data back to a pandas DataFrame for model training
reweighted_df = reweighted_dataset.convert_to_dataframe()[0]

# Re-defining variables
X_train_reweighted = reweighted_df.drop(columns=['action_taken_binary'])
y_train_reweighted = reweighted_df['action_taken_binary']
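
Note that convert_to_dataframe() returns only the features and labels; the per-row weights produced by Reweighing are stored in reweighted_dataset.instance_weights. If the weights are meant to influence training (and not just the AIF360 dataset object), they can be passed to the estimator as sample weights. A sketch, assuming the RandomizedSearchCV set up in the next section:

# Extracting the instance weights computed by Reweighing (sketch)
sample_weights = reweighted_dataset.instance_weights

# They could then be forwarded to the model fit in section H, e.g.:
# random_search.fit(X_train_reweighted, y_train_reweighted, sample_weight=sample_weights)
# (recent scikit-learn versions slice array-like fit parameters per CV fold)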

H) LightGBM¶

In [39]:
# Defining the parameter grid for tuning
param_grid = {
    'n_estimators': [100, 200, 300, 500],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [31, 62, 127],
    'max_depth': [-1, 5, 10, 15],
    'min_child_samples': [20, 50, 100],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'reg_alpha': [0, 0.1, 0.5],
    'reg_lambda': [0, 0.1, 0.5]
}

# Setting up the LightGBM model and the randomized search
lgbm_model = lgbm.LGBMClassifier(random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# We use joblib's threading backend since our previous run got stuck with the default backend
with joblib.parallel_backend('threading'):  # added to avoid JAX threading issues
    random_search = RandomizedSearchCV(
        estimator=lgbm_model,
        param_distributions=param_grid,
        n_iter=50,
        scoring='roc_auc',
        cv=skf,
        verbose=2,
        random_state=42,
        n_jobs=-1
    )

    # Fitting the randomized search
    random_search.fit(X_train_reweighted, y_train_reweighted)

# Getting the best lgbm_model and parameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_
print("Best parameters:", best_params)

# Making predictions
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]
Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=  11.6s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=  11.8s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=  11.4s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=  11.6s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=   9.1s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=100, n_estimators=300, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=1.0; total time=  11.4s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=100, n_estimators=300, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=1.0; total time=  11.9s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=100, n_estimators=300, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=1.0; total time=  11.2s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=100, n_estimators=300, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=1.0; total time=  13.4s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=100, n_estimators=300, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=1.0; total time=  14.4s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=  14.6s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=  13.1s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=  13.3s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=  13.1s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=10, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.8; total time=   6.0s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=  11.8s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=10, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.8; total time=   8.4s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=10, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.8; total time=   7.4s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=10, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.8; total time=   6.1s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   2.3s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=10, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.8; total time=   6.1s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   3.3s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   4.2s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   3.6s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   2.9s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=1.0; total time=   8.4s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=1.0; total time=   8.2s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=1.0; total time=  10.7s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=1.0; total time=  10.4s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.1, subsample=0.6; total time=   7.7s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=1.0; total time=   9.8s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.1, subsample=0.6; total time=   8.7s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.1, subsample=0.6; total time=   7.9s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.1, subsample=0.6; total time=   7.9s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.1, subsample=0.6; total time=   8.4s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0, subsample=1.0; total time=   4.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0, subsample=1.0; total time=   3.9s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0, subsample=1.0; total time=   3.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0, subsample=1.0; total time=   2.9s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0, subsample=1.0; total time=   3.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=0.8; total time=   7.8s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=0.8; total time=   9.4s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=0.8; total time=   8.3s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=0.8; total time=   6.8s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=0.8; total time=   8.2s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=300, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time=  10.8s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=300, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time=   9.5s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=300, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time=   9.2s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=300, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time=  10.5s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=300, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time=   9.8s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=   4.7s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=   5.7s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=   5.7s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=   6.5s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=   5.9s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.6; total time=   3.8s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.6; total time=   3.8s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.6; total time=   3.8s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.6; total time=   3.7s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=1.0; total time=   3.8s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.6; total time=   6.2s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=1.0; total time=   4.3s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=1.0; total time=   4.3s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=1.0; total time=   4.2s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=1.0; total time=   3.3s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=300, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=   6.7s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=300, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=   7.7s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=300, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=   6.5s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=300, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=   5.6s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=300, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=   5.5s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=10, min_child_samples=50, n_estimators=300, num_leaves=127, reg_alpha=0.5, reg_lambda=0.5, subsample=0.6; total time=  12.5s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=10, min_child_samples=50, n_estimators=300, num_leaves=127, reg_alpha=0.5, reg_lambda=0.5, subsample=0.6; total time=  12.2s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=10, min_child_samples=50, n_estimators=300, num_leaves=127, reg_alpha=0.5, reg_lambda=0.5, subsample=0.6; total time=  11.0s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=10, min_child_samples=50, n_estimators=300, num_leaves=127, reg_alpha=0.5, reg_lambda=0.5, subsample=0.6; total time=  12.6s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=10, min_child_samples=50, n_estimators=300, num_leaves=127, reg_alpha=0.5, reg_lambda=0.5, subsample=0.6; total time=  10.9s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=500, num_leaves=31, reg_alpha=0.5, reg_lambda=0.1, subsample=0.6; total time=   9.1s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=500, num_leaves=31, reg_alpha=0.5, reg_lambda=0.1, subsample=0.6; total time=  11.1s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=500, num_leaves=31, reg_alpha=0.5, reg_lambda=0.1, subsample=0.6; total time=  10.8s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=500, num_leaves=31, reg_alpha=0.5, reg_lambda=0.1, subsample=0.6; total time=   9.6s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=500, num_leaves=31, reg_alpha=0.5, reg_lambda=0.1, subsample=0.6; total time=  11.3s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0.5, subsample=1.0; total time=   5.5s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0.5, subsample=1.0; total time=   4.0s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0.5, subsample=1.0; total time=   4.1s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0.5, subsample=1.0; total time=   4.5s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0.5, subsample=1.0; total time=   4.2s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   9.4s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   9.5s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   7.1s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   7.7s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   9.4s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  13.4s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  14.3s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  14.4s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  14.5s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  14.3s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=   8.2s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=  10.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=   9.7s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=   8.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=   9.2s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=300, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=  12.7s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=300, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=  11.3s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=300, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=  12.5s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=300, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=  12.6s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   2.8s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   3.2s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=300, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=  11.8s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   4.6s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   3.4s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   2.8s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time=  14.3s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time=  14.4s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time=  13.7s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time=  14.3s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time=  13.3s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=  13.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=  14.4s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=  13.7s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=  14.9s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=  14.5s
[CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   6.8s
[CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   7.2s
[CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   4.6s
[CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   5.1s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=   4.0s
[CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   4.9s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=   6.5s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=   6.0s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=   4.0s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=   4.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=  26.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=  26.1s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=  26.9s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=  27.5s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   5.0s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   3.2s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   3.2s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   3.2s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=   5.4s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=  25.9s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time=  11.0s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time=  12.9s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time=  12.9s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time=  12.1s
[CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time=  13.1s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time=  14.6s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time=  14.2s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time=  13.3s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time=  14.0s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   3.4s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   3.4s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time=  14.2s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   4.2s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   4.8s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   5.0s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time=  12.5s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time=  11.9s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time=  14.0s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time=  12.3s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=   4.1s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=   2.9s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=   2.5s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time=  13.5s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=   2.5s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=   2.6s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=-1, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time=   3.1s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=-1, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time=   4.3s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=-1, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time=   5.4s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=-1, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time=   3.9s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=-1, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time=   3.1s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=   3.0s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=   2.9s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=   3.0s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=   2.9s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time=   3.1s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=  14.0s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=  13.7s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=  14.3s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=  14.3s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=   3.9s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=   4.6s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=   3.2s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time=  14.5s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=   3.2s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=   3.1s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=  11.0s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=  11.0s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=   8.6s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=   9.2s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time=  10.4s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time=  12.0s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time=  11.9s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time=  11.9s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time=   9.7s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time=  10.6s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=  17.6s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=  18.0s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=  17.5s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=  19.3s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=  18.8s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=15, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  20.1s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=15, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  18.5s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=15, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  17.9s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=15, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  18.5s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=15, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  19.6s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=  14.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=  13.9s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=  14.8s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=  13.6s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.8; total time=   6.4s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.8; total time=   4.2s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time=  14.2s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.8; total time=   4.3s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.8; total time=   4.9s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.8; total time=   6.4s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  19.1s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  18.4s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  18.1s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  17.7s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=   2.6s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=   2.6s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=   3.9s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=   3.3s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=   2.6s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time=  17.9s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=0.8; total time=   3.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=0.8; total time=   3.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=0.8; total time=   2.8s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=0.8; total time=   3.1s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=0.8; total time=   4.4s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time=  23.6s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time=  21.5s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time=  21.3s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time=  21.0s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=   6.7s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=   4.7s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=   4.7s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time=  24.0s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=   7.0s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time=   4.5s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=1.0; total time=  10.7s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=1.0; total time=  12.1s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=1.0; total time=  12.4s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=1.0; total time=  10.7s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=15, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   6.0s
[CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=1.0; total time=  12.3s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=15, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   4.3s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=15, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   4.3s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=15, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   4.4s
[CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=15, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time=   4.4s
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Number of positive: 81489, number of negative: 81489
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.064936 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3976
[LightGBM] [Info] Number of data points in the train set: 162978, number of used features: 38
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
Best parameters: {'subsample': 0.8, 'reg_lambda': 0.5, 'reg_alpha': 0.1, 'num_leaves': 127, 'n_estimators': 500, 'min_child_samples': 100, 'max_depth': 15, 'learning_rate': 0.1, 'colsample_bytree': 0.6}
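The tuned estimator used as best_model in the cells below is taken from the fitted search object above; a minimal sketch, assuming that object is named random_search (a hypothetical name):

# Hypothetical name: `random_search` stands in for the fitted search object from the tuning cell above
best_model = random_search.best_estimator_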
In [40]:
# Model performance evaluation:

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

# Precision
precision = precision_score(y_test, y_pred)

# Recall
recall = recall_score(y_test, y_pred)

# F1 Score
f1 = f1_score(y_test, y_pred)

# Calculate ROC-AUC
roc_auc = roc_auc_score(y_test, y_pred_proba)

# Negative log loss
neg_log_loss = log_loss(y_test, y_pred_proba)

# Results
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"ROC-AUC: {roc_auc:.4f}")
print(f"Negative Log Loss: {neg_log_loss:.4f}")
Accuracy: 0.7421
Precision: 0.3914
Recall: 0.4465
F1 Score: 0.4172
ROC-AUC: 0.7008
Negative Log Loss: 0.5332
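To put the precision and recall figures above in context, a confusion matrix and per-class report can be printed. A minimal sketch using the same y_test / y_pred, assuming the 0 = approved / 1 = denied encoding used in the bias section below:

from sklearn.metrics import confusion_matrix, classification_report

# Rows = actual class, columns = predicted class (assumed encoding: 0 = approved, 1 = denied)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['approved', 'denied']))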
In [41]:
# For each intersectional group we print the different performance metrics

# Adding the true and predicted values to a copy of the test set for evaluation
X_test_with_pred = X_test.copy()
X_test_with_pred['y_test'] = y_test  # columns for the true labels and the predictions
X_test_with_pred['y_pred'] = y_pred

# Predicted probabilities for class 1 (denied)
X_test_with_pred['y_pred_proba'] = best_model.predict_proba(X_test)[:, 1]

# List of intersectional group columns
ethnicity_race_sex_cols = [col for col in X_test_with_pred.columns if col.startswith('ethnicity_race_sex')]

# Creation of a column where each row represents an intersectional group
X_test_with_pred['intersectional_group'] = X_test_with_pred[ethnicity_race_sex_cols].idxmax(axis=1)

# Grouping by intersectional group to evaluate the model performance for each one
grouped = X_test_with_pred.groupby('intersectional_group')

# Evaluation of the metrics for each group
for group_name, group_data in grouped:
    accuracy = accuracy_score(group_data['y_test'], group_data['y_pred'])
    precision = precision_score(group_data['y_test'], group_data['y_pred'], zero_division=0)
    recall = recall_score(group_data['y_test'], group_data['y_pred'], zero_division=0)
    f1 = f1_score(group_data['y_test'], group_data['y_pred'])
    roc_auc = roc_auc_score(group_data['y_test'], group_data['y_pred_proba'])
    neg_log_loss = log_loss(group_data['y_test'], group_data['y_pred_proba'])

    # Showing results
    print(f"# Group: {group_name}")
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall: {recall:.4f}")
    print(f"  F1 Score: {f1:.4f}")
    print(f"  ROC-AUC: {roc_auc:.4f}")
    print(f"  Negative Log Loss: {neg_log_loss:.4f}\n")
# Group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_female
  Accuracy: 0.6897
  Precision: 0.6875
  Recall: 0.7333
  F1 Score: 0.7097
  ROC-AUC: 0.8190
  Negative Log Loss: 0.5684

# Group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_male
  Accuracy: 0.6585
  Precision: 0.5926
  Recall: 0.8421
  F1 Score: 0.6957
  ROC-AUC: 0.8038
  Negative Log Loss: 0.6024

# Group: ethnicity_race_sex_hispanic or latino_asian_female
  Accuracy: 0.8235
  Precision: 0.6667
  Recall: 0.8000
  F1 Score: 0.7273
  ROC-AUC: 0.8000
  Negative Log Loss: 0.5010

# Group: ethnicity_race_sex_hispanic or latino_asian_male
  Accuracy: 0.4815
  Precision: 0.2222
  Recall: 0.2222
  F1 Score: 0.2222
  ROC-AUC: 0.4444
  Negative Log Loss: 0.7215

# Group: ethnicity_race_sex_hispanic or latino_black or african american_female
  Accuracy: 0.6167
  Precision: 0.4821
  Recall: 0.6136
  F1 Score: 0.5400
  ROC-AUC: 0.6932
  Negative Log Loss: 0.7674

# Group: ethnicity_race_sex_hispanic or latino_black or african american_male
  Accuracy: 0.7228
  Precision: 0.6271
  Recall: 0.8605
  F1 Score: 0.7255
  ROC-AUC: 0.8360
  Negative Log Loss: 0.5922

# Group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female
  Accuracy: 0.8667
  Precision: 0.7143
  Recall: 1.0000
  F1 Score: 0.8333
  ROC-AUC: 0.8600
  Negative Log Loss: 0.5357

# Group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male
  Accuracy: 0.5625
  Precision: 0.6000
  Recall: 0.7895
  F1 Score: 0.6818
  ROC-AUC: 0.6518
  Negative Log Loss: 0.7241

# Group: ethnicity_race_sex_hispanic or latino_white_female
  Accuracy: 0.6888
  Precision: 0.4000
  Recall: 0.5641
  F1 Score: 0.4681
  ROC-AUC: 0.6906
  Negative Log Loss: 0.6104

# Group: ethnicity_race_sex_hispanic or latino_white_male
  Accuracy: 0.6948
  Precision: 0.3974
  Recall: 0.5578
  F1 Score: 0.4642
  ROC-AUC: 0.7020
  Negative Log Loss: 0.6046

# Group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female
  Accuracy: 0.7315
  Precision: 0.6889
  Recall: 0.6739
  F1 Score: 0.6813
  ROC-AUC: 0.7763
  Negative Log Loss: 0.5611

# Group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male
  Accuracy: 0.6525
  Precision: 0.4571
  Recall: 0.4211
  F1 Score: 0.4384
  ROC-AUC: 0.6322
  Negative Log Loss: 0.6571

# Group: ethnicity_race_sex_not hispanic or latino_asian_female
  Accuracy: 0.7576
  Precision: 0.3316
  Recall: 0.4235
  F1 Score: 0.3720
  ROC-AUC: 0.6509
  Negative Log Loss: 0.5311

# Group: ethnicity_race_sex_not hispanic or latino_asian_male
  Accuracy: 0.7572
  Precision: 0.3837
  Recall: 0.4219
  F1 Score: 0.4019
  ROC-AUC: 0.6914
  Negative Log Loss: 0.5279

# Group: ethnicity_race_sex_not hispanic or latino_black or african american_female
  Accuracy: 0.6140
  Precision: 0.4294
  Recall: 0.6771
  F1 Score: 0.5255
  ROC-AUC: 0.6788
  Negative Log Loss: 0.6956

# Group: ethnicity_race_sex_not hispanic or latino_black or african american_male
  Accuracy: 0.6205
  Precision: 0.4610
  Recall: 0.6549
  F1 Score: 0.5411
  ROC-AUC: 0.6841
  Negative Log Loss: 0.6689

# Group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female
  Accuracy: 0.6739
  Precision: 0.3077
  Recall: 0.4000
  F1 Score: 0.3478
  ROC-AUC: 0.6028
  Negative Log Loss: 0.5880

# Group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male
  Accuracy: 0.6667
  Precision: 0.6522
  Recall: 0.5769
  F1 Score: 0.6122
  ROC-AUC: 0.6663
  Negative Log Loss: 0.7676

# Group: ethnicity_race_sex_not hispanic or latino_white_female
  Accuracy: 0.7672
  Precision: 0.3787
  Recall: 0.3790
  F1 Score: 0.3789
  ROC-AUC: 0.6992
  Negative Log Loss: 0.4978

# Group: ethnicity_race_sex_not hispanic or latino_white_male
  Accuracy: 0.7597
  Precision: 0.3607
  Recall: 0.3736
  F1 Score: 0.3670
  ROC-AUC: 0.6889
  Negative Log Loss: 0.5089
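Note that roc_auc_score and log_loss raise a ValueError for any group whose test rows contain only one class; that did not happen here, but a small guard keeps the per-group loop robust. A sketch (the helper name is hypothetical):

from sklearn.metrics import roc_auc_score

def safe_group_auc(y_true, y_proba):
    # Return ROC-AUC, or None when y_true contains a single class (which would raise).
    if len(set(y_true)) < 2:
        return None
    return roc_auc_score(y_true, y_proba)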

I) Bias measurement:¶

  1. We use Disparate Impact to measure bias (a hand-computation sketch follows below)

  2. We re-join X_test and y_pred
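
As a reference for the AIF360 numbers below: disparate impact is the ratio of favorable-outcome (approval) rates between an unprivileged group and the privileged group, and statistical parity difference is the corresponding difference; the common "four-fifths rule" flags a disparate impact below 0.8. A minimal hand-computation sketch, assuming the lgbm_trained_df frame, its label column and the one-hot group columns defined in the next cells:

import pandas as pd

def group_fairness(df: pd.DataFrame, unpriv_col: str, priv_col: str,
                   label_col: str = 'action_taken_binary', favorable: int = 0):
    """Return (disparate impact, statistical parity difference) for one group pair."""
    p_unpriv = (df.loc[df[unpriv_col] == 1, label_col] == favorable).mean()
    p_priv = (df.loc[df[priv_col] == 1, label_col] == favorable).mean()
    return p_unpriv / p_priv, p_unpriv - p_priv

# Hypothetical usage once lgbm_trained_df exists (column names as defined below):
# di, spd = group_fairness(lgbm_trained_df,
#                          'ethnicity_race_sex_hispanic or latino_white_female',
#                          'ethnicity_race_sex_not hispanic or latino_white_male')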

In [42]:
# We got many empty rows due to index misalignment, so we reset the index of both DataFrames:
# X_test and y_pred
X_test_reset = X_test.reset_index(drop=True)
y_pred_reset = pd.DataFrame(y_pred, columns=['action_taken_binary']).reset_index(drop=True)

# Concatenating the two DataFrames
lgbm_trained_df = pd.concat([X_test_reset, y_pred_reset], axis=1)
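
# Quick sanity check on the re-join (a sketch): row counts should match and the
# predicted-label column should contain no NaNs introduced by misalignment.
assert len(lgbm_trained_df) == len(X_test_reset) == len(y_pred_reset)
assert lgbm_trained_df['action_taken_binary'].isna().sum() == 0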
In [43]:
# Defining variables:
protected_attribute_names=[
                "ethnicity_race_sex_hispanic or latino_american indian or alaska native_female",
                "ethnicity_race_sex_hispanic or latino_american indian or alaska native_male",
                "ethnicity_race_sex_hispanic or latino_asian_female",
                "ethnicity_race_sex_hispanic or latino_asian_male",
                "ethnicity_race_sex_hispanic or latino_black or african american_female",
                "ethnicity_race_sex_hispanic or latino_black or african american_male",
                "ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female",
                "ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male",
                "ethnicity_race_sex_hispanic or latino_white_female",
                "ethnicity_race_sex_hispanic or latino_white_male",
                "ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female",
                "ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male",
                "ethnicity_race_sex_not hispanic or latino_asian_female",
                "ethnicity_race_sex_not hispanic or latino_asian_male",
                "ethnicity_race_sex_not hispanic or latino_black or african american_female",
                "ethnicity_race_sex_not hispanic or latino_black or african american_male",
                "ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female",
                "ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male",
                "ethnicity_race_sex_not hispanic or latino_white_female",
                'ethnicity_race_sex_not hispanic or latino_white_male'  # notice we include the privileged group
                            ]
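# (Alternative sketch: protected_attribute_names could also be built programmatically,
#  mirroring the startswith filter used earlier in the notebook:
#  [c for c in lgbm_trained_df.columns if c.startswith('ethnicity_race_sex')])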

favorable_label = 0  # loan approved
unfavorable_label = 1  # loan denied

# Creating the dataset
aif_dataset = BinaryLabelDataset(
    df=lgbm_trained_df,
    label_names=['action_taken_binary'],
    protected_attribute_names=protected_attribute_names,
    favorable_label=favorable_label,      # loan approved
    unfavorable_label=unfavorable_label   # loan denied
)
In [44]:
# Defining the privileged group directly
privileged_groups = [{'ethnicity_race_sex_not hispanic or latino_white_male': 1}]

# Defining unprivileged groups using a loop
unprivileged_groups = []
for attribute in protected_attribute_names:
    if attribute != 'ethnicity_race_sex_not hispanic or latino_white_male':
        unprivileged_groups.append({attribute: 1})

# Checking groups
print("Privileged group:", privileged_groups)
print("Number of unprivileged groups:", len(unprivileged_groups))
print("First few unprivileged groups:", unprivileged_groups[:3])
Privileged group: [{'ethnicity_race_sex_not hispanic or latino_white_male': 1}]
Number of unprivileged groups: 19
First few unprivileged groups: [{'ethnicity_race_sex_hispanic or latino_american indian or alaska native_female': 1}, {'ethnicity_race_sex_hispanic or latino_american indian or alaska native_male': 1}, {'ethnicity_race_sex_hispanic or latino_asian_female': 1}]
In [45]:
# Calculating metrics
metric = BinaryLabelDatasetMetric(aif_dataset,
                                  unprivileged_groups=unprivileged_groups,
                                  privileged_groups=privileged_groups)
In [46]:
# Printing metrics
print(f"Disparate Impact: {metric.disparate_impact():.6f}")
print(f"Statistical Parity Difference: {metric.statistical_parity_difference():.6f}")

# Mean difference in label predictions (in AIF360, mean_difference() is an alias of statistical_parity_difference())
print(f"Mean difference in label predictions: {metric.mean_difference():.6f}")

# Calculating group-specific metrics
for group in unprivileged_groups:
    group_metric = BinaryLabelDatasetMetric(aif_dataset,
                                            unprivileged_groups=[group],
                                            privileged_groups=privileged_groups)

    group_name = list(group.keys())[0]
    print(f"\nGroup: {group_name}")
    print(f"Disparate Impact: {group_metric.disparate_impact():.4f}")
    print(f"Statistical Parity Difference: {group_metric.statistical_parity_difference():.4f}")
Disparate Impact: 0.905503
Statistical Parity Difference: -0.076246
Mean difference in label predictions: -0.076246

Group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_female
Disparate Impact: 0.5556
Statistical Parity Difference: -0.3586

Group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_male
Disparate Impact: 0.4232
Statistical Parity Difference: -0.4654

Group: ethnicity_race_sex_hispanic or latino_asian_female
Disparate Impact: 0.8020
Statistical Parity Difference: -0.1598

Group: ethnicity_race_sex_hispanic or latino_asian_male
Disparate Impact: 0.8263
Statistical Parity Difference: -0.1402

Group: ethnicity_race_sex_hispanic or latino_black or african american_female
Disparate Impact: 0.6610
Statistical Parity Difference: -0.2735

Group: ethnicity_race_sex_hispanic or latino_black or african american_male
Disparate Impact: 0.5154
Statistical Parity Difference: -0.3910

Group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female
Disparate Impact: 0.6610
Statistical Parity Difference: -0.2735

Group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male
Disparate Impact: 0.2711
Statistical Parity Difference: -0.5881

Group: ethnicity_race_sex_hispanic or latino_white_female
Disparate Impact: 0.8151
Statistical Parity Difference: -0.1492

Group: ethnicity_race_sex_hispanic or latino_white_male
Disparate Impact: 0.8271
Statistical Parity Difference: -0.1395

Group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female
Disparate Impact: 0.7230
Statistical Parity Difference: -0.2235

Group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male
Disparate Impact: 0.8718
Statistical Parity Difference: -0.1035

Group: ethnicity_race_sex_not hispanic or latino_asian_female
Disparate Impact: 0.9711
Statistical Parity Difference: -0.0233

Group: ethnicity_race_sex_not hispanic or latino_asian_male
Disparate Impact: 0.9759
Statistical Parity Difference: -0.0195

Group: ethnicity_race_sex_not hispanic or latino_black or african american_female
Disparate Impact: 0.6223
Statistical Parity Difference: -0.3047

Group: ethnicity_race_sex_not hispanic or latino_black or african american_male
Disparate Impact: 0.6380
Statistical Parity Difference: -0.2921

Group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female
Disparate Impact: 0.8891
Statistical Parity Difference: -0.0895

Group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male
Disparate Impact: 0.7393
Statistical Parity Difference: -0.2104

Group: ethnicity_race_sex_not hispanic or latino_white_female
Disparate Impact: 1.0071
Statistical Parity Difference: 0.0057

J) SHAP¶

J.1) Overall model assessment¶

In [47]:
# lgbm_model.fit(X_train_smote, y_train_smote)
# lgbm_model = lgb.LGBMClassifier(random_state=42)

# Creating SHAP explainer for LightGBM
explainer = shap.TreeExplainer(best_model, X_train_reweighted)

# Calculating SHAP values for the test set
shap_values = explainer.shap_values(X_test)

# Calculating mean absolute SHAP values for each feature
mean_abs_shap = np.abs(shap_values).mean(axis=0)

# Creating a DataFrame with feature names and mean absolute SHAP values
shap_importance = pd.DataFrame({
    'feature': X_test.columns,
    'importance': mean_abs_shap
})

# Sorting by importance
shap_importance_sorted = shap_importance.sort_values(by='importance', ascending=False)
100%|===================| 43998/44016 [26:42<00:00]       
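The long pass above comes from supplying X_train_reweighted as background data, which puts TreeExplainer in the slower interventional perturbation mode. If exact interventional attributions are not required, a faster sketch simply omits the background data so the explainer falls back to the tree_path_dependent mode (depending on the installed shap version this may print a warning about the perturbation setting):

# Faster alternative (a sketch): no background data -> tree_path_dependent perturbation
fast_explainer = shap.TreeExplainer(best_model)
fast_shap_values = fast_explainer.shap_values(X_test)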
In [48]:
# Bar Plot for Top 20 Feature Importances
plt.figure(figsize=(12, 8))
sns.barplot(x='importance', y='feature', data=shap_importance_sorted.head(20))
plt.title('Top 20 Feature Importances (based on absolute SHAP values)')
plt.tight_layout()
plt.show()

# Summary Plot
plt.figure(figsize=(12, 8))
shap.summary_plot(shap_values, X_test, plot_type="bar")
plt.show()

# Detailed Summary Plot
plt.figure(figsize=(12, 8))
shap.summary_plot(shap_values, X_test)
plt.show()
/usr/local/lib/python3.10/dist-packages/shap/plots/_beeswarm.py:950: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations.
  pl.tight_layout()

J.2) Waterfall plot per group¶

In [49]:
# List of the intersectional group columns
ethnicity_race_sex_cols = [col for col in X_test_with_pred.columns if col.startswith('ethnicity_race_sex')]

# Creating a column that represents each intersectional group
X_test_with_pred['intersectional_group'] = X_test_with_pred[ethnicity_race_sex_cols].idxmax(axis=1)

# Resetting the index of X_test_with_pred and X_test to ensure they match shap_values (we had alignment issues earlier without resetting)
X_test_with_pred = X_test_with_pred.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

# Grouping by intersectional group to evaluate SHAP values for each group
grouped = X_test_with_pred.groupby('intersectional_group')

# Loop through each intersectional group
for group_name, group_data in grouped:
    print(f"Generating SHAP Waterfall plot for group: {group_name}")

    # Getting the subset of the data for this intersectional group
    X_group = X_test.loc[group_data.index]

    # Creating a subset of the SHAP values for this group
    shap_values_group = shap_values[group_data.index]  # this ensures the SHAP values match X_group

    # Picking a specific row to explain for the waterfall plot
    row_to_explain = group_data.index[0]

    # We generate the SHAP waterfall plot for a single prediction in this group
    shap.waterfall_plot(
        shap.Explanation(
            values=shap_values[row_to_explain],  # SHAP values for the specific row
            base_values=explainer.expected_value,  # Base value for the SHAP model
            data=X_test.iloc[row_to_explain, :],  # Input data for the specific row
            feature_names=X_test.columns  # Feature names
        )
    )
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_asian_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_asian_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_black or african american_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_black or african american_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_white_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_white_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_asian_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_asian_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_white_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_white_male

J.3) Summary per group¶

In [50]:
for group_name, group_data in grouped:
    print(f"Generating SHAP Summary plot for group: {group_name}")

    # Getting the subset of the data for this intersectional group
    X_group = X_test.loc[group_data.index]

    # Subsetting the SHAP values for this group
    shap_values_group = shap_values[group_data.index]

    # Generating the SHAP summary plot for the entire group
    shap.summary_plot(shap_values_group, X_group, feature_names=X_test.columns)
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_female
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_male
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_asian_female
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_asian_male
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_black or african american_female
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_black or african american_male
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_white_female
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_white_male
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_asian_female
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_asian_male
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_female
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_male
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_white_female
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_white_male
In [51]:
# Alternative to the per-group summary plots above, since some of those charts rendered with a distorted display.
for group_name, group_data in grouped:
    print(f"Generating SHAP Custom Summary plot for group: {group_name}")

    # Getting the subset of the data for this intersectional group
    X_group = X_test.loc[group_data.index]

    # Subsetting the SHAP values for this group
    shap_values_group = shap_values[group_data.index]

    # Calculating mean absolute SHAP values for the group
    mean_abs_shap = np.abs(shap_values_group).mean(axis=0)

    # Creating a DataFrame with feature names and mean absolute SHAP values
    shap_importance = pd.DataFrame({
        'feature': X_test.columns,
        'importance': mean_abs_shap
    })

    # Sorting df by importance
    shap_importance_sorted = shap_importance.sort_values(by='importance', ascending=False)

    # Plotting the top 20 features
    plt.figure(figsize=(12, 8))
    sns.barplot(x='importance', y='feature', data=shap_importance_sorted.head(20))
    plt.title(f'Top 20 Feature Importances for Group: {group_name}')
    plt.tight_layout()
    plt.show()
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_asian_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_asian_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_black or african american_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_black or african american_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_white_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_white_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_asian_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_asian_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_white_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_white_male

J.4) Summary per group (average)¶

In [52]:
for group_name, group_data in grouped:
    print(f"Generating SHAP Waterfall plot for group: {group_name}")

    # Getting the subset of the data for this intersectional group
    X_group = X_test.loc[group_data.index]

    # Subsetting the SHAP values for this group
    shap_values_group = shap_values[group_data.index]  # this ensures the SHAP values match X_group

    # Calculating the average SHAP values for the group
    avg_shap_values = shap_values_group.mean(axis=0)

    # Generating the SHAP waterfall plot for the group's AVERAGE prediction (not a single record)
    shap.waterfall_plot(
        shap.Explanation(
            values=avg_shap_values,  # Average SHAP values for the group
            base_values=explainer.expected_value,  # Base value for the SHAP model
            data=X_group.mean(axis=0),  # Average input data for the group
            feature_names=X_test.columns  # Feature names
        )
    )
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_asian_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_asian_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_black or african american_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_black or african american_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_white_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_white_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_asian_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_asian_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_white_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_white_male
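
A small additivity sanity check for the averaged waterfalls above (a sketch; it reuses the loop variables left over from the last group processed and relies on the raw log-odds output that TreeExplainer explains for LightGBM):

import numpy as np

# Mean raw prediction of the group vs. base value plus the summed average SHAP values
raw_scores = best_model.predict(X_group, raw_score=True)
reconstructed = explainer.expected_value + avg_shap_values.sum()
print(f"Group mean raw prediction:            {np.mean(raw_scores):.4f}")
print(f"Base value + summed mean SHAP values: {reconstructed:.4f}")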