ML in Credit Risk Modeling
This is not a typical ML class.
Use the credit risk model as a lab.
How to apply the tools more wisely.
The problem: credit risk modeling is just one example.
How to Model Credit Risk
Dataset: credit card performance.
Two types of outcomes: credit card delinquency/default (failure to meet payments).
Your goal: increase the chance that you are lending money to the blue-circled people (of course, you want to lend to the people who can repay!).
How to do this?
Ex-ante information:
Ex-ante: the information available at the beginning, when we start the analysis: balance / income / other characteristics.
Ex-post: given the ex-ante characteristics, we observe the ex-post outcome.
Setting the Stage:
ML:
Optimize a performance criterion using example data or past experience, together with domain knowledge.
Role of statistics: Inference from a sample.
Jargon: supervised learning techniques.
Role of CS: find efficient algorithms to
- solve the optimization problem
- represent and evaluate the model for (online) inference
What is different about this class?
We will emphasize what is unique about applying ML to finance.
In this setting, $x_i$ and $y_i$ are:
$y_i=\text{default}$
$x_i=\begin{pmatrix}\text{balance}\\ \text{income}\end{pmatrix}$
Predicting Defaults
- $Y_i$ is a categorical outcome: default/solvent, coded as 1 and 0; multiclass labels would be 0, 1, 2, 3, 4, …. Do not take the 0/1 labels too literally; they are just category codes, which is why the same scikit-learn toolkit also works for, say, mushroom classification.
- Linear regression: $y_i=\beta x_i+\epsilon_i$, a linear function of the two features.
- $E[y_i|x_i]$, the expected value of $y$, is the same thing as the probability of default (because $y_i$ is 0 or 1).
The linear model is not a good fit here, and we want to build a non-linear one.
$F(x_i,\theta)=Pr(y_i=1|x_i)$
Look at the left figure: modeling a linear relationship between balance and the probability of default does not work well, because the fitted blue line takes on negative values.
Clearly, the fit in the right figure is much better: the fitted relationship between balance and the probability of default behaves like an S-shape.
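A minimal sketch of the contrast in the two panels, using made-up data and the S-shaped model (logistic regression) introduced below; the point is only that a linear fit can produce "probabilities" outside [0, 1], while the S-shaped fit cannot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up data: balance (single feature) and a 0/1 default indicator.
rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=500)
p_true = 1 / (1 + np.exp(-(balance - 1800) / 150))   # an assumed S-shaped true relation
default = (rng.uniform(size=500) < p_true).astype(int)
X = balance.reshape(-1, 1)

# Linear probability model: fitted "probabilities" can fall below 0 (or above 1).
lin = LinearRegression().fit(X, default)
print("linear fit   min/max:", lin.predict(X).min(), lin.predict(X).max())

# Logistic (S-shaped) fit: predicted probabilities always stay inside (0, 1).
logit = LogisticRegression(max_iter=1000).fit(X, default)
proba = logit.predict_proba(X)[:, 1]
print("logistic fit min/max:", proba.min(), proba.max())
```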
How to build that curve?
Several functions can be used to build such a curve.
Logistic regression
Why does the sigmoid function work so well here?
Note: in machine learning, vectors are column vectors by default.
$F(\theta;\vec x)=\operatorname{sigmoid}(\vec\theta^{T}\vec x)$, a value in $(0,1)$
The sigmoid maps the feature vector $\vec x$ (through the scalar $\vec\theta^{T}\vec x$) to a number in $(0,1)$, which is exactly the kind of output we want for a probability.
Logistic regression is a parametric model: our job is to find the parameters.
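A minimal numerical sketch of this mapping; the parameter values and features below are made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Column-vector convention: theta and x are both column vectors.
theta = np.array([[-10.0], [0.005], [0.0001]])   # assumed parameters: intercept, balance, income
x = np.array([[1.0], [1500.0], [40000.0]])       # assumed features: constant 1, balance, income

z = (theta.T @ x).item()                         # theta^T x, a scalar
print(sigmoid(z))                                # a probability in (0, 1), here about 0.82
```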
Non-parametric approach
K-NN/DNN/Robust-DNN
The Toolbox
Logistic Regression
Log odds ratio:
LO = $\log\frac{Pr(y=1|x)}{Pr(y=0|x)}$, where $Pr(y=1|x)+Pr(y=0|x)=1$.
LO is continuous and ranges from $-\infty$ to $+\infty$.
Directly model LO as a linear function of $x$:
LO = $\theta^T\vec x$
Using $F(x,\theta)$ to denote $Pr(y=1|x)$ and solving, we get:
$Pr(y=1|x)=\frac{e^{\theta^Tx}}{1+e^{\theta^Tx}}$
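For completeness, the algebra behind this step:
$$\log\frac{Pr(y=1|x)}{1-Pr(y=1|x)}=\theta^Tx \;\Rightarrow\; \frac{Pr(y=1|x)}{1-Pr(y=1|x)}=e^{\theta^Tx} \;\Rightarrow\; Pr(y=1|x)=\frac{e^{\theta^Tx}}{1+e^{\theta^Tx}}$$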
In the FinTech domain, interpretability is extremely important; three consecutive lectures have confirmed this.
Logistic regression estimates its parameters by MLE.
MLE and cross-entropy loss
The cross-entropy loss is just the negative log-likelihood, so minimizing it is the same as maximizing the likelihood.
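A small numerical check of this equivalence, with made-up labels and predicted probabilities: the negative Bernoulli log-likelihood is exactly the (summed) cross-entropy loss.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])              # observed defaults (made up)
p = np.array([0.9, 0.2, 0.6, 0.8, 0.1])    # model's predicted Pr(y=1|x) (made up)

# Log-likelihood of the Bernoulli model.
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Cross-entropy loss, summed over observations.
cross_entropy = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(log_lik, cross_entropy)   # same magnitude, opposite sign:
                                # maximizing likelihood == minimizing cross-entropy
```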
Multi-class logit
Model the log odds ratio for each class: select one class as the baseline (benchmark).
That is, compute $\frac{Pr(y=1)}{Pr(y=0)}$, $\frac{Pr(y=2)}{Pr(y=0)}$, $\frac{Pr(y=3)}{Pr(y=0)}$, …
This is exactly the softmax.
Logically: this extends logistic regression from 2 classes to more classes.
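A sketch of this, with made-up linear scores: setting the baseline class's score to 0, then exponentiating each class's log-odds against the baseline and normalizing, is precisely the softmax.

```python
import numpy as np

# Hypothetical linear scores theta_k^T x for classes 1..3, with class 0 as the baseline (score 0).
log_odds_vs_baseline = np.array([0.0, 1.2, -0.3, 0.5])   # class 0 (baseline), 1, 2, 3

def softmax(z):
    z = z - z.max()            # numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

probs = softmax(log_odds_vs_baseline)
print(probs, probs.sum())      # class probabilities summing to 1
```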
Prediction and Confusion Matrix
Prediction: whether the borrower is a good borrower.
How to decide whether a model is a good model?
Confusion Matrix
$N+P$ (actual counts) $= N^{*}+P^{*}$ (predicted-label counts)
FN and FP are both places where my model makes mistakes.
FN and FP errors have different economic consequences.
With a very small threshold you behave very conservatively: fewer mistakes of lending to people who turn out to be bad borrowers.
The economic consequences really count.
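A sketch of how moving the threshold trades FP against FN; the probabilities and the per-error costs below are made up, but they illustrate why the economically costlier error should drive the choice of threshold.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true    = np.array([0, 0, 1, 0, 1, 1, 0, 1])                    # 1 = default (made-up labels)
p_default = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.6, 0.55, 0.3])  # made-up model probabilities

# Assumed economic costs: lending to a defaulter (a missed default, FN) is far more
# costly than turning away a good borrower (FP).
COST_FN, COST_FP = 10.0, 1.0

for threshold in (0.3, 0.5, 0.7):
    y_pred = (p_default >= threshold).astype(int)        # flag "default" above the threshold
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"threshold={threshold}: FP={fp}, FN={fn}, cost={COST_FN * fn + COST_FP * fp}")
```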
ROC Curve
Skip
K-nearest neighbours
Why KNN and LR are polar opposites.
Why is classification with KNN so different from classification with LR?
How do we classify with KNN?
Look at the left figure: inside the blue circle on the left there is a cross ("x") that we want to classify. How?
Draw a circle around it and find its 3 nearest neighbors. Two of the cross's neighbors are blue circles and one is yellow, so the cross has a 66.7% probability of belonging to the blue class and a 33.3% probability of belonging to the yellow class.
A very intuitive argument.
Right figure: this is what it looks like once the decision boundary is drawn.
KNN procedure: a data-driven approach.
- $k$ is a parameter we have to choose.
- How do we measure the distance?
cf. KNN vs. K-means clustering.
How to define distance between categorical values?
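A minimal sketch of the 3-NN vote described above, on made-up 2-D points; KNeighborsClassifier reports the neighbor vote fractions as predicted probabilities.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 2-D points labelled "blue" (0) and "yellow" (1).
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [3.0, 3.0], [3.2, 2.8], [2.9, 3.1]])
y = np.array([0, 0, 1, 1, 1, 0])

# k is a choice we make; the distance is Euclidean by default.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

query = np.array([[1.1, 1.0]])          # the "x" we want to classify
print(knn.predict_proba(query))         # [[0.667, 0.333]]: the 3-neighbour vote fractions
```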
Key Point of the Lecture
To summarize, the key point of training this model is how to balance these two types of errors in light of the economic consequences each type of error can bring about.
It is an economic problem rather than an ML problem.