ML in Credit Risk Modeling
This is not a typical ML class.
Use the credit risk model as a lab.
How to apply the tools more wisely.
The problem: credit risk modeling is just one example.
How to Model Credit Risk
Dataset: credit card performance.
Two types of outcomes: credit card delinquency/default (failure to meet payments).
Your goal: increase the chance that you are lending money to the blue-circled people (of course, you want to lend to the people who can repay!).
How to do this?
Ex-ante information:
Ex-ante: the information available at the beginning, when we start the analysis: balance / income / other characteristics.
Ex-post: given the ex-ante characteristics, we observe the ex-post outcome.
Setting the Stage:
ML:
Optimize a performance criterion using example data or past experience, together with domain knowledge.
Role of statistics: Inference from a sample.
Jargon: supervised learning techniques.
Role of CS: find efficient algorithms to
- solve the optimization problem
- represent and evaluate the model for (online) inference
What is different about this class?
We will emphasize what is unique about applying ML to finance.
In this setting, $x_i$ and $y_i$ are:
$y_i=\text{default}$
$x_i=\begin{pmatrix}\text{balance}\\ \text{income}\end{pmatrix}$
Predicting Defaults
- $Y_i$ is a categorical outcome: default/solvent, coded as 1 and 0; multiclass labels would be 0, 1, 2, 3, 4, …. Do not take the 0/1 labels too literally; they are just category codes, which is why the same scikit-learn toolkit also works for, say, mushroom classification.
- Linear regression: $y_i=\beta x_i+\epsilon_i$, a linear function of the two features.
- $E[y_i|x_i]$, the expected value of $y$, is the same thing as the probability of default (because $y_i$ is 0 or 1).
The linear model is not a good fit here, and we want to build a non-linear one.
$F(x_i,\theta)=Pr(y_i=1|x_i)$
Look at the left figure: modeling a linear relationship between balance and the probability of default does not work well, because the fitted blue line takes on negative values.
Clearly, the fit in the right figure is much better: the fitted relationship between balance and the probability of default behaves like an S-shape.
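A minimal sketch of the contrast in the two panels, using made-up data and the S-shaped model (logistic regression) introduced below; the point is only that a linear fit can produce "probabilities" outside [0, 1], while the S-shaped fit cannot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up data: balance (single feature) and a 0/1 default indicator.
rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=500)
p_true = 1 / (1 + np.exp(-(balance - 1800) / 150))   # an assumed S-shaped true relation
default = (rng.uniform(size=500) < p_true).astype(int)
X = balance.reshape(-1, 1)

# Linear probability model: fitted "probabilities" can fall below 0 (or above 1).
lin = LinearRegression().fit(X, default)
print("linear fit   min/max:", lin.predict(X).min(), lin.predict(X).max())

# Logistic (S-shaped) fit: predicted probabilities always stay inside (0, 1).
logit = LogisticRegression(max_iter=1000).fit(X, default)
proba = logit.predict_proba(X)[:, 1]
print("logistic fit min/max:", proba.min(), proba.max())
```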
How to build that curve?
Several functions can be used to build such a curve.
Logistic regression
Why does the sigmoid function work so well here?
Note: in machine learning, vectors are column vectors by default.
$F(\theta;\vec x)=\operatorname{sigmoid}(\vec\theta^{T}\vec x)$, a value in $(0,1)$
The sigmoid maps the feature vector $\vec x$ (through the scalar $\vec\theta^{T}\vec x$) to a number in $(0,1)$, which is exactly the kind of output we want for a probability.
Logistic regression is a parametric model: our job is to find the parameters.
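A minimal numerical sketch of this mapping; the parameter values and features below are made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Column-vector convention: theta and x are both column vectors.
theta = np.array([[-10.0], [0.005], [0.0001]])   # assumed parameters: intercept, balance, income
x = np.array([[1.0], [1500.0], [40000.0]])       # assumed features: constant 1, balance, income

z = (theta.T @ x).item()                         # theta^T x, a scalar
print(sigmoid(z))                                # a probability in (0, 1), here about 0.82
```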
Non-parametric approach
K-NN/DNN/Robust-DNN
The Toolbox
Logistic Regression
Log odds ratio:
LO = $\log\frac{Pr(y=1|x)}{Pr(y=0|x)}$, where $Pr(y=1|x)+Pr(y=0|x)=1$.
LO is continuous and ranges from $-\infty$ to $+\infty$.
Directly model LO as a linear function of $x$:
LO = $\theta^T\vec x$
Using $F(x,\theta)$ to denote $Pr(y=1|x)$ and solving, we get:
$Pr(y=1|x)=\frac{e^{\theta^Tx}}{1+e^{\theta^Tx}}$
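For completeness, the algebra behind this step:
$$\log\frac{Pr(y=1|x)}{1-Pr(y=1|x)}=\theta^Tx \;\Rightarrow\; \frac{Pr(y=1|x)}{1-Pr(y=1|x)}=e^{\theta^Tx} \;\Rightarrow\; Pr(y=1|x)=\frac{e^{\theta^Tx}}{1+e^{\theta^Tx}}$$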
In the FinTech domain, interpretability is extremely important; three consecutive lectures have confirmed this.
Logistic regression estimates its parameters by MLE.
MLE and cross-entropy loss
The cross-entropy loss is just the negative log-likelihood, so minimizing it is the same as maximizing the likelihood.
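A small numerical check of this equivalence, with made-up labels and predicted probabilities: the negative Bernoulli log-likelihood is exactly the (summed) cross-entropy loss.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])              # observed defaults (made up)
p = np.array([0.9, 0.2, 0.6, 0.8, 0.1])    # model's predicted Pr(y=1|x) (made up)

# Log-likelihood of the Bernoulli model.
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Cross-entropy loss, summed over observations.
cross_entropy = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(log_lik, cross_entropy)   # same magnitude, opposite sign:
                                # maximizing likelihood == minimizing cross-entropy
```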
Multi-class logit
Model the log odds ratio for each class: select one class as the baseline (benchmark).
That is, compute $\frac{Pr(y=1)}{Pr(y=0)}$, $\frac{Pr(y=2)}{Pr(y=0)}$, $\frac{Pr(y=3)}{Pr(y=0)}$, …
This is exactly the softmax.
Logically: this extends logistic regression from 2 classes to more classes.
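A sketch of this, with made-up linear scores: setting the baseline class's score to 0, then exponentiating each class's log-odds against the baseline and normalizing, is precisely the softmax.

```python
import numpy as np

# Hypothetical linear scores theta_k^T x for classes 1..3, with class 0 as the baseline (score 0).
log_odds_vs_baseline = np.array([0.0, 1.2, -0.3, 0.5])   # class 0 (baseline), 1, 2, 3

def softmax(z):
    z = z - z.max()            # numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

probs = softmax(log_odds_vs_baseline)
print(probs, probs.sum())      # class probabilities summing to 1
```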
Prediction and Confusion Matrix
Prediction: whether the borrower is a good borrower.
How to decide whether a model is a good model?
Confusion Matrix
$N+P$ (actual counts) $= N^{*}+P^{*}$ (predicted-label counts)
FN and FP are both places where my model makes mistakes.
FN and FP errors have different economic consequences.
With a very small threshold you behave very conservatively: fewer mistakes of lending to people who turn out to be bad borrowers.
The economic consequences really count.
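A sketch of how moving the threshold trades FP against FN; the probabilities and the per-error costs below are made up, but they illustrate why the economically costlier error should drive the choice of threshold.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true    = np.array([0, 0, 1, 0, 1, 1, 0, 1])                    # 1 = default (made-up labels)
p_default = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.6, 0.55, 0.3])  # made-up model probabilities

# Assumed economic costs: lending to a defaulter (a missed default, FN) is far more
# costly than turning away a good borrower (FP).
COST_FN, COST_FP = 10.0, 1.0

for threshold in (0.3, 0.5, 0.7):
    y_pred = (p_default >= threshold).astype(int)        # flag "default" above the threshold
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"threshold={threshold}: FP={fp}, FN={fn}, cost={COST_FN * fn + COST_FP * fp}")
```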
ROC Curve
Skip
K-nearest neighbours
Why KNN and LR are polar opposites.
Why is classification with KNN so different from classification with LR?
How do we classify with KNN?
Look at the left figure: inside the blue circle on the left there is a cross ("x") that we want to classify. How?
Draw a circle around it and find its 3 nearest neighbors. Two of the cross's neighbors are blue circles and one is yellow, so the cross has a 66.7% probability of belonging to the blue class and a 33.3% probability of belonging to the yellow class.
A very intuitive argument.
Right figure: this is what it looks like once the decision boundary is drawn.
KNN procedure: a data-driven approach.
- $k$ is a parameter we have to choose.
- How do we measure the distance?
cf. KNN vs. K-means clustering.
How to define distance between categorical values?
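A minimal sketch of the 3-NN vote described above, on made-up 2-D points; KNeighborsClassifier reports the neighbor vote fractions as predicted probabilities.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 2-D points labelled "blue" (0) and "yellow" (1).
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [3.0, 3.0], [3.2, 2.8], [2.9, 3.1]])
y = np.array([0, 0, 1, 1, 1, 0])

# k is a choice we make; the distance is Euclidean by default.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

query = np.array([[1.1, 1.0]])          # the "x" we want to classify
print(knn.predict_proba(query))         # [[0.667, 0.333]]: the 3-neighbour vote fractions
```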
Key Point of the Lecture
To summarize, the key point of training this model is how to balance these two types of errors in light of the economic consequences each type of error can bring about.
It is an economic problem rather than an ML problem.