This project presents a full-cycle Credit Risk Modeling solution to predict the likelihood of a borrower defaulting on a loan. It covers data cleaning, feature engineering, model training, business-aligned metric optimization, and deployment using Streamlit. Designed with real-world financial services impact in mind, the model prioritizes recall to minimize false negatives (i.e., risky borrowers that the model fails to flag).
- Goal: Predict whether a borrower will default on a loan.
- Dataset: Provided by a financial institution with borrower-level and loan-level details.
- Target Variable: `default` (1 = default, 0 = not default)
- Business Objective: High recall for defaulters to minimize risk exposure.
- Deployment: Web app hosted using Streamlit Cloud.
The dataset was highly imbalanced:
- Techniques used: SMOTE-Tomek, oversampling, and threshold tuning.
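Threshold tuning, one of the techniques listed above, can be illustrated with a minimal sketch. It assumes the model already outputs a default probability per borrower and picks the highest cutoff that still meets a target recall, since a higher cutoff flags fewer non-defaulters. The function names, the sample labels, and the sample probabilities are illustrative assumptions, not taken from the project code.

```python
def recall_at(threshold, y_true, y_prob):
    """Recall for the positive class (1 = default) at a given cutoff."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    return tp / (tp + fn) if (tp + fn) else 0.0


def highest_threshold_for_recall(y_true, y_prob, target_recall):
    """Return the largest cutoff whose recall meets the target, or None."""
    # Candidate cutoffs are the distinct predicted probabilities, highest first.
    for t in sorted(set(y_prob), reverse=True):
        if recall_at(t, y_true, y_prob) >= target_recall:
            return t
    return None


# Made-up labels and scores, for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.20, 0.35, 0.40, 0.60, 0.80, 0.10, 0.15]

threshold = highest_threshold_for_recall(y_true, y_prob, target_recall=0.75)
print(threshold, recall_at(threshold, y_true, y_prob))
```

Lowering the cutoff below the default 0.5 trades accuracy for recall, which matches the business objective stated earlier.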
Data leakage was handled by eliminating leak-prone features such as `disbursal_date`, `installment_start_dt`, and derived leakage indicators.
Boxplots revealed `processing_fee` > `loan_amount`, which is invalid. These anomalies were cleaned or capped appropriately.
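The validity check described above could look roughly like the following sketch. It uses plain Python dictionaries to stay self-contained; the project itself may well use pandas. The helper names and the `max_ratio` cap are hypothetical and only illustrate the two options mentioned (cleaning versus capping).

```python
def split_invalid_fees(loans):
    """Separate rows whose processing fee exceeds the loan amount."""
    valid, invalid = [], []
    for row in loans:
        if row["processing_fee"] > row["loan_amount"]:
            invalid.append(row)
        else:
            valid.append(row)
    return valid, invalid


def cap_fee(row, max_ratio=0.03):
    """Return a copy of the row with the fee capped at max_ratio * loan_amount.

    max_ratio is a hypothetical business limit, not a value from the project.
    """
    capped = dict(row)
    capped["processing_fee"] = min(
        row["processing_fee"], max_ratio * row["loan_amount"]
    )
    return capped


# Made-up records, for illustration only.
loans = [
    {"loan_amount": 100_000, "processing_fee": 2_000},
    {"loan_amount": 50_000, "processing_fee": 75_000},  # fee > amount: invalid
]

valid, invalid = split_invalid_fees(loans)
capped = [cap_fee(row) for row in invalid]
print(len(valid), len(invalid), capped)
```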
- `loan_purpose` cleaned and grouped into standard categories.
- One-hot encoding and WoE/IV analysis used for feature transformation and selection.
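For readers unfamiliar with WoE/IV, here is a minimal sketch of how Weight of Evidence and Information Value can be computed for one categorical feature. It assumes the common convention WoE = ln(share of non-defaulters / share of defaulters); the smoothing constant, the function name, and the sample categories are assumptions for illustration and may differ from what the notebooks do.

```python
import math
from collections import Counter


def woe_iv(categories, targets, smoothing=0.5):
    """Return ({category: WoE}, IV) for one categorical feature.

    WoE = ln(share of non-defaulters / share of defaulters) per category.
    `smoothing` is added to each count so empty cells do not yield log(0).
    """
    good = Counter(c for c, t in zip(categories, targets) if t == 0)
    bad = Counter(c for c, t in zip(categories, targets) if t == 1)
    levels = sorted(set(categories))
    total_good = sum(good.values()) + smoothing * len(levels)
    total_bad = sum(bad.values()) + smoothing * len(levels)

    woe, iv = {}, 0.0
    for level in levels:
        dist_good = (good[level] + smoothing) / total_good
        dist_bad = (bad[level] + smoothing) / total_bad
        woe[level] = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe[level]
    return woe, iv


# Made-up categories and outcomes, for illustration only.
purposes = ["home"] * 4 + ["personal"] * 4
defaults = [0, 0, 0, 1, 0, 1, 1, 1]

woe, iv = woe_iv(purposes, defaults)
print(woe, iv)
```

Under this convention a positive WoE marks a category with relatively fewer defaulters, and a higher IV suggests a more predictive feature.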
- Loan-to-Income Ratio (LTI): `loan_amount / income`
- Delinquency Ratio
- Average DPD per Delinquency
- High LTI, delinquency_ratio, and avg_dpd_per_delinquency were strong predictors of default.
- Defaulted customers had younger age, longer loan tenure, and higher credit utilization.
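The three engineered ratios named above can be sketched as follows. Only the LTI formula (`loan_amount / income`) is given in this README; the inputs `delinquent_months`, `total_loan_months`, and `total_dpd` are assumed column names, and the definitions of the other two ratios are a plausible reading of their names rather than the confirmed implementation.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def engineer_features(row):
    """Compute the three ratio features for one borrower record."""
    return {
        # Given in the README: loan_amount / income
        "loan_to_income": safe_ratio(row["loan_amount"], row["income"]),
        # Assumed: share of loan months with a delinquency
        "delinquency_ratio": safe_ratio(
            row["delinquent_months"], row["total_loan_months"]
        ),
        # Assumed: total days past due averaged over delinquent months
        "avg_dpd_per_delinquency": safe_ratio(
            row["total_dpd"], row["delinquent_months"]
        ),
    }


# Made-up borrower record, for illustration only.
borrower = {
    "loan_amount": 600_000,
    "income": 1_200_000,
    "delinquent_months": 4,
    "total_loan_months": 40,
    "total_dpd": 60,
}

features = engineer_features(borrower)
print(features)
```

The zero-denominator guard matters for borrowers with no delinquencies, who would otherwise trigger a division by zero in the average DPD feature.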
Dropped correlated features: `sanction_amount`, `processing_fee`, `gst`, `net_disbursement`, `principal_outstanding`.
Top features: `credit_utilization_ratio`, `avg_dpd_per_delinquency`, `loan_to_income`, `loan_purpose`, `residence_type`, `loan_tenure_months`, `loan_type`, `age`, etc.
Model | Accuracy | Recall (Defaulters) |
---|---|---|
Logistic Regression (Basic) | 96% | 0.70 |
Random Forest | 96% | 0.69 |
XGBoost | 96% | 0.75 |
- Logistic Regression
- SMOTE-Tomek
- Optuna for Hyperparameter Tuning
- Business chose Logistic Regression for explainability
- Accuracy: 93%
- Recall (Defaulters): 0.95
- AUC: 98.3%
- Gini Coefficient: 0.967
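The AUC and Gini figures above are linked by Gini = 2 × AUC − 1, so a rounded AUC of 0.983 gives about 0.966, which agrees with the reported 0.967 up to rounding. The sketch below shows that relationship alongside a simple pairwise AUC calculation. The function names and sample scores are illustrative assumptions; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
def auc_score(y_true, y_prob):
    """AUC as the probability that a random defaulter is scored above a
    random non-defaulter (ties count as half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """Gini coefficient derived from AUC."""
    return 2 * auc - 1


# Made-up labels and scores, for illustration only.
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]

auc = auc_score(y_true, y_prob)
print(auc, gini_from_auc(auc))
print(gini_from_auc(0.983))
```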
- App Framework: Streamlit
- Main Files: `main.py`, `prediction_helper.py`
- Hosting: Streamlit Cloud
- Enables better credit risk filtering.
- High recall helps reduce bad debt.
- Easy model interpretability aids compliance and auditing.
Advance_Credit_Risk_Model_Loan_prediction/
├── data/
├── notebooks/
├── main.py
├── prediction_helper.py
├── README.md
├── requirements.txt
├── images/
│   ├── ks_statistic.png
│   ├── roc_curve.png
│   ├── confusion_matrix.png
│   ├── streamlit_app_screenshot.png
│   ├── metrics.png
│   └── feature_importance.png
├── artifacts/
│   └── modeldata.joblib
- Mehul Ligade
- GitHub: @mehulcode12
- CodeBasics
- GitHub: @mehulcode12
This project was completed as part of the Codebasics Data Science Bootcamp. Special thanks to mentors and the open-source community for libraries and frameworks.
You are welcome to use this project as a reference. Please give credit to CodeBasics if you find it helpful.