
This project delivers a production-grade Credit Risk Prediction System that accurately identifies potential loan defaults using real-world financial data. I engineered domain-specific features like Loan-to-Income Ratio, Delinquency Ratio, and Avg DPD per Delinquency, which significantly enhanced the model's predictive power. I addressed severe clas


mehulcode12/Advance_Credit_Risk_Model_Loan_prediction


Advance Credit Risk Modeling - Loan Default Prediction

App Screenshot

This project presents a full-cycle Credit Risk Modeling solution that predicts the likelihood of a borrower defaulting on a loan. It covers data cleaning, feature engineering, model training, business-aligned metric optimization, and deployment with Streamlit. Designed with real-world financial services impact in mind, the model prioritizes recall to minimize false negatives (risky borrowers that the model fails to flag).


🚀 Project Overview

  • Goal: Predict whether a borrower will default on a loan.
  • Dataset: Provided by a financial institution with borrower-level and loan-level details.
  • Target Variable: default (1 = default, 0 = not default)
  • Business Objective: High recall for defaulters to minimize risk exposure.
  • Deployment: Web app hosted using Streamlit Cloud.

📊 Exploratory Data Analysis (EDA) & Preprocessing

✅ Class Imbalance

The dataset was highly imbalanced, with defaulters as the minority class.

  • Techniques used: SMOTE-Tomek, oversampling, and threshold tuning.
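Threshold tuning, one of the techniques listed above, can be sketched in plain Python. This is a minimal illustration rather than the project's actual code: the function names `recall_at` and `pick_threshold`, the sample scores, and the 0.90 target recall are all assumptions made for the example.

```python
def recall_at(y_true, scores, threshold):
    """Recall for the positive class (1 = default) at a given cut-off."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    return tp / (tp + fn) if (tp + fn) else 0.0


def pick_threshold(y_true, scores, target_recall):
    """Return the highest candidate threshold whose recall meets the target.

    Candidates are the distinct predicted scores. Returns None when no
    candidate reaches the target (for example, when there are no positives).
    """
    for t in sorted(set(scores), reverse=True):
        if recall_at(y_true, scores, t) >= target_recall:
            return t
    return None


# Hypothetical labels and predicted default probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]

threshold = pick_threshold(y_true, scores, target_recall=0.90)
```

Choosing the highest threshold that still meets the recall target keeps false positives as low as the target allows, which is one reasonable way to trade precision for recall.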

🛑 Data Leakage

Data leakage was addressed by removing leak-prone features such as disbursal_date and installment_start_dt, along with derived leakage indicators.

📉 Processing Fee Anomaly

Boxplots revealed records where processing_fee exceeded loan_amount, which is invalid. These anomalies were cleaned or capped.
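The validity rule above (a processing fee cannot exceed the loan amount) can be sketched as a simple filter. The column names `processing_fee` and `loan_amount` come from this README; the `loan_id` field, the function name, and the sample records are hypothetical, and the sketch only separates the invalid rows without claiming how the project cleaned or capped them.

```python
def split_fee_anomalies(records):
    """Separate rows whose processing_fee exceeds loan_amount."""
    valid, invalid = [], []
    for row in records:
        if row["processing_fee"] > row["loan_amount"]:
            invalid.append(row)
        else:
            valid.append(row)
    return valid, invalid


# Hypothetical loan records for illustration only
loans = [
    {"loan_id": "A1", "loan_amount": 500_000, "processing_fee": 10_000},
    {"loan_id": "A2", "loan_amount": 200_000, "processing_fee": 5_000_000},  # invalid
    {"loan_id": "A3", "loan_amount": 750_000, "processing_fee": 15_000},
]

valid, invalid = split_fee_anomalies(loans)
```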

🧼 Categorical Feature Cleaning

  • loan_purpose cleaned and grouped into standard categories.
  • One-hot encoding and WoE/IV analysis used for feature transformation and selection.

🔍 Feature Engineering

Key New Features:

  • Loan-to-Income Ratio (LTI): loan_amount / income
  • Delinquency Ratio
  • Average DPD per Delinquency
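A minimal sketch of how these three features might be computed for one borrower record. Only the LTI formula is stated in this README; the definitions used here for the delinquency ratio and average DPD, and the input names `delinquent_months`, `total_loan_months`, and `total_dpd`, are assumptions made for illustration.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def engineer_features(row):
    """Add the three ratio features to a copy of one borrower record."""
    out = dict(row)
    # Stated in the README: loan_amount / income
    out["loan_to_income"] = safe_divide(row["loan_amount"], row["income"])
    # Assumed definition: share of loan months that were delinquent
    out["delinquency_ratio"] = safe_divide(
        row["delinquent_months"], row["total_loan_months"]
    )
    # Assumed definition: days past due averaged over delinquent months
    out["avg_dpd_per_delinquency"] = safe_divide(
        row["total_dpd"], row["delinquent_months"]
    )
    return out


# Hypothetical borrower record
borrower = {
    "loan_amount": 600_000,
    "income": 1_200_000,
    "delinquent_months": 4,
    "total_loan_months": 40,
    "total_dpd": 20,
}
features = engineer_features(borrower)
```

Guarding the divisions matters because a borrower with no delinquent months would otherwise cause a division by zero in the average DPD feature.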

Insights:

  • High LTI, delinquency_ratio, and avg_dpd_per_delinquency were strong predictors of default.
  • Customers who defaulted tended to be younger, with longer loan tenures and higher credit utilization.

Feature Selection

Multicollinearity Check (VIF)

Dropped correlated features: sanction_amount, processing_fee, gst, net_disbursement, principal_outstanding.

WoE & IV-Based Categorical Feature Selection:

Top features:

  • credit_utilization_ratio
  • avg_dpd_per_delinquency
  • loan_to_income
  • loan_purpose
  • residence_type
  • loan_tenure_months
  • loan_type
  • age, etc.
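The WoE and IV calculation behind this ranking can be sketched as follows. The sketch assumes the convention WoE = ln(share of non-defaulters / share of defaulters); some references invert the ratio, and this README does not say which one the project used. The function name and the `Owned`/`Rented` sample values are hypothetical.

```python
import math
from collections import Counter


def woe_iv(categories, targets):
    """Weight of Evidence per category and total Information Value.

    Raises ValueError if a category lacks defaulters or non-defaulters,
    since the logarithm is undefined there without some form of smoothing.
    """
    good = Counter(c for c, y in zip(categories, targets) if y == 0)
    bad = Counter(c for c, y in zip(categories, targets) if y == 1)
    total_good, total_bad = sum(good.values()), sum(bad.values())

    woe, iv = {}, 0.0
    for cat in sorted(set(categories)):
        if good[cat] == 0 or bad[cat] == 0:
            raise ValueError(f"category {cat!r} needs both classes present")
        pct_good = good[cat] / total_good
        pct_bad = bad[cat] / total_bad
        woe[cat] = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * woe[cat]
    return woe, iv


# Hypothetical residence_type values with 0 = not default, 1 = default
categories = ["Owned"] * 7 + ["Rented"] * 5
targets = [0] * 6 + [1] + [0] * 2 + [1] * 3

woe, iv = woe_iv(categories, targets)
```

Under this convention a positive WoE marks a category with relatively fewer defaulters, and a larger IV indicates a feature that separates the two classes more strongly.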

🤖 Model Training & Optimization

Model Attempt 1 (default settings):

| Model | Accuracy | Recall (Defaulters) |
|---|---|---|
| Logistic Regression (Basic) | 96% | 0.70 |
| Random Forest | 96% | 0.69 |
| XGBoost | 96% | 0.75 |

Final Model:

  • Logistic Regression
  • SMOTE-Tomek
  • Optuna for Hyperparameter Tuning
  • Logistic Regression was chosen by the business for its explainability

Final Metrics:

  • Accuracy: 93%
  • Recall (Defaulters): 0.95
  • AUC: 98.3%
  • Gini Coefficient: 0.967
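The Gini coefficient and AUC are linked by the standard relationship Gini = 2 × AUC − 1. Applied to the rounded AUC above, this gives about 0.966, which agrees with the reported 0.967 up to rounding. A minimal sketch of both quantities, with hypothetical function names and sample data:

```python
def auc_score(y_true, scores):
    """AUC as the chance that a random defaulter outscores a random
    non-defaulter, counting ties as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """Gini coefficient derived from AUC."""
    return 2.0 * auc - 1.0


# Gini implied by the rounded AUC reported above
reported_gini = gini_from_auc(0.983)

# Hypothetical labels and scores
sample_auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
sample_gini = gini_from_auc(sample_auc)
```

The pairwise comparison is quadratic in the number of rows, so it suits small checks like this one rather than full datasets.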

📈 Model Evaluation

Metrics per class

ROC Curve

KS Statistic

  • KS Value: 85.98% at Decile 8

  • Indicates strong rank-ordering capability.
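A KS statistic of this kind is the largest gap between the cumulative share of defaulters and the cumulative share of non-defaulters when borrowers are ranked by predicted risk. The sketch below is an assumption-laden illustration, not the project's code: it numbers bins from 1 (highest scores) upward, which may differ from the decile labelling behind the figure above, and the sample data is hypothetical.

```python
def ks_by_decile(y_true, scores, n_bins=10):
    """Return (ks, bin_number): the maximum gap between the cumulative share
    of defaulters and non-defaulters, scanning bins from highest score down."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_bad = sum(y for _, y in ranked)
    total_good = len(ranked) - total_bad
    if total_bad == 0 or total_good == 0:
        raise ValueError("need at least one example of each class")

    best_ks, best_bin = 0.0, 0
    cum_bad = cum_good = 0
    start = 0
    for b in range(1, n_bins + 1):
        end = round(b * len(ranked) / n_bins)  # right edge of this bin
        for _, y in ranked[start:end]:
            cum_bad += y
            cum_good += 1 - y
        start = end
        gap = cum_bad / total_bad - cum_good / total_good
        if gap > best_ks:
            best_ks, best_bin = gap, b
    return best_ks, best_bin


# Hypothetical data: 20 borrowers, 4 defaulters concentrated at high scores
scores = [i / 20 for i in range(20)]
y_true = [1 if i in (14, 17, 18, 19) else 0 for i in range(20)]

ks, ks_bin = ks_by_decile(y_true, scores)
```

A KS value close to 1 means defaulters are concentrated in the highest-scoring bins, which is what strong rank ordering looks like.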

KS Plot

Feature Importance

Feature_importance


📦 Deployment

  • App Framework: Streamlit
  • Main Files: main.py, prediction_helper.py
  • Hosting: Streamlit Cloud

Streamlit Screenshot


🧠 Business Impact

  • Enables better credit risk filtering.
  • High recall helps reduce bad debt.
  • Easy model interpretability aids compliance and auditing.

📁 Folder Structure

Advance_Credit_Risk_Model_Loan_prediction/
├── data/
├── notebooks/
├── main.py
├── prediction_helper.py
├── README.md
├── requirements.txt
├── images/
│   ├── ks_statistic.png
│   ├── roc_curve.png
│   ├── confusion_matrix.png
│   ├── streamlit_app_screenshot.png
│   ├── metrics.png
│   └── feature_importance.png
└── artifacts/
    └── modeldata.joblib

✍️ Author


🙌 Acknowledgements

This project was completed as part of the Codebasics Data Science Bootcamp. Special thanks to mentors and the open-source community for libraries and frameworks.

📌 Note

You are welcome to use this project as a reference. Please give credit to Codebasics if you find it helpful.
