ML in Credit Risk Modeling
This is not a typical ML class.
Use the credit risk model as a lab.
How to apply the tools more wisely.
The problem: credit risk modeling is just one example.
How to Model Credit Risk
Dataset: credit card performance.
Two types of outcome: credit card delinquency/default (failure to meet the payments).
Your goal: increase the chance that you are lending money to the blue-circled people (naturally, you want to lend to borrowers who can repay!).
How do we do this?
Ex-ante information:
Ex-ante: the information available at the beginning, when we start the analysis: balance/income/other characteristics.
Ex-post: given the ex-ante characteristics, we observe the ex-post outcome.
Setting the Stage:
ML:
Optimize a performance criterion using example data or past experience, together with domain knowledge.
Role of statistics: Inference from a sample.
Jargon: supervised learning techniques.
Role of CS: find efficient algorithms to
- solve the optimization problem
- represent and evaluate the model for (online) inference
What is different about this class?
We will emphasize what is unique about applying ML to finance.
In this setting, x_i and y_i are:
y_i = default
x_i = (balance, income)
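A minimal sketch of this setup in code; the column names and numbers below are hypothetical, just to fix notation:

```python
import pandas as pd

# Hypothetical credit-card data: one row per borrower,
# ex-ante features (balance, income) plus the ex-post outcome (default).
df = pd.DataFrame({
    "balance": [729.5, 817.2, 1073.5, 529.3, 785.7],
    "income":  [44361.6, 12106.1, 31767.1, 35704.5, 38463.5],
    "default": [0, 0, 0, 0, 1],           # y_i: 1 = default, 0 = solvent
})

X = df[["balance", "income"]].values       # x_i = (balance, income)
y = df["default"].values                   # y_i = default indicator
```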
Predicting Defaults
- y_i is a categorical outcome: default/solvent, coded as 1 and 0; multiclass outcomes are coded 0, 1, 2, 3, 4, …. Do not take the 0 and 1 too literally: they are just labels, which is why the same scikit-learn toolkit works equally well for, say, classifying mushrooms.
- Linear regression: y_i = βx_i + ε_i, a linear function of the two features.
- E[y_i|x_i], the expected value of y_i, is the same thing as the probability of default (because y_i is either 0 or 1).
The linear model is not good enough; we want to build a non-linear one:
$F(x_i, \theta) = \Pr(y_i = 1 \mid x_i)$
Left figure: fitting a linear relationship between balance and the probability of default does not work well, because the fitted blue line takes negative values.
Right figure: the fitted curve between balance and the probability of default clearly works much better; it behaves like an S-shape.
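A quick sketch of that comparison on synthetic data (the data-generating process is assumed for illustration): the linear probability model produces fitted values outside [0, 1], while logistic regression stays inside (0, 1).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: the true default probability rises with balance in an S-shape.
balance = rng.uniform(0, 3000, size=2000)
p_true = 1 / (1 + np.exp(-(balance - 1800) / 250))
default = rng.binomial(1, p_true)
X = balance.reshape(-1, 1)

# Linear probability model: fitted values escape [0, 1].
lin = LinearRegression().fit(X, default)
print("linear fit range:", lin.predict(X).min(), lin.predict(X).max())

# Logistic regression: predicted probabilities stay inside (0, 1).
logit = LogisticRegression(max_iter=1000).fit(X, default)
p_hat = logit.predict_proba(X)[:, 1]
print("logistic fit range:", p_hat.min(), p_hat.max())
```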
How to build that curve?
Several functions can be used to build such a curve.
Logistic regression
Why is the sigmoid function so useful?
Note: in machine learning, vectors are column vectors by default.
$F(\theta; \vec{x}) = \mathrm{sigmoid}(\vec{\theta}^{\top}\vec{x})$
The sigmoid maps the vector x to a number in (0, 1), which is exactly the mapping we want.
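A minimal sketch of that mapping; the weights below are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Map any real number (here theta^T x) into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.003, -0.00001, -4.0])   # illustrative weights for (balance, income, intercept)
x = np.array([1500.0, 40000.0, 1.0])        # feature column vector with an intercept term

z = theta @ x                               # theta^T x, a scalar in (-inf, +inf)
print(sigmoid(z))                           # a number in (0, 1): Pr(y = 1 | x)
```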
Logistic regression is a parametric model: our job is to find the parameters.
Non-parametric approach
K-NN/DNN/Robust-DNN
The Toolbox
Logistic Regression
Log odds ratio:
$LO = \log \dfrac{\Pr(y=1\mid x)}{\Pr(y=0\mid x)}$, where $\Pr(y=1\mid x) + \Pr(y=0\mid x) = 1$.
LO is continuous and ranges from $-\infty$ to $+\infty$.
Directly model LO as a linear function of x:
$LO = \theta^{\top}\vec{x}$
Writing $F(x, \theta)$ for $\Pr(y=1\mid x)$ and solving, we get:
$\Pr(y=1\mid x) = \dfrac{e^{\theta^{\top}x}}{1 + e^{\theta^{\top}x}}$
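The algebra behind that step, using $\Pr(y=0\mid x) = 1 - \Pr(y=1\mid x)$:

$$
\begin{aligned}
\log\frac{\Pr(y=1\mid x)}{1-\Pr(y=1\mid x)} &= \theta^{\top}x \\
\frac{\Pr(y=1\mid x)}{1-\Pr(y=1\mid x)} &= e^{\theta^{\top}x} \\
\Pr(y=1\mid x) &= e^{\theta^{\top}x}\bigl(1-\Pr(y=1\mid x)\bigr) \\
\Pr(y=1\mid x)\bigl(1+e^{\theta^{\top}x}\bigr) &= e^{\theta^{\top}x} \\
\Pr(y=1\mid x) &= \frac{e^{\theta^{\top}x}}{1+e^{\theta^{\top}x}}
\end{aligned}
$$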
In the fintech field, interpretability is extremely important; the last three lectures have all confirmed this.
Logistic regression estimates its parameters by maximum likelihood (MLE).
MLE and the Cross-Entropy Loss
The cross-entropy loss is just the negative log-likelihood, so minimizing it is the same as maximizing the likelihood.
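A small numerical check of that equivalence, on made-up data and an arbitrary θ: the hand-computed negative log-likelihood matches scikit-learn's summed cross-entropy (log_loss).

```python
import numpy as np
from sklearn.metrics import log_loss

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data, made up for illustration: a column of 1s (intercept) plus one feature.
X = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, -0.7], [1.0, 2.3]])
y = np.array([0, 1, 0, 1])
theta = np.array([-0.5, 1.2])                  # an arbitrary candidate parameter vector

p = sigmoid(X @ theta)                         # Pr(y_i = 1 | x_i) under theta

# Negative log-likelihood of the Bernoulli model for this theta ...
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# ... equals the (summed) cross-entropy loss that ML libraries minimize.
print(np.isclose(nll, log_loss(y, p, normalize=False)))   # True
```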
Multi-class logit
Model the log odds ratio of each class against one class chosen as the baseline (benchmark).
That is, compute $\Pr(y=1)/\Pr(y=0)$, $\Pr(y=2)/\Pr(y=0)$, $\Pr(y=3)/\Pr(y=0)$, … with class 0 as the baseline.
This is exactly the softmax.
Logically, this extends logistic regression from two classes to any number of classes.
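A minimal sketch, with made-up log-odds values, of how softmax turns the per-class log odds relative to the baseline into class probabilities:

```python
import numpy as np

# Log odds of each class k relative to the baseline class 0, for one borrower.
# (Values are made up; in practice each entry is theta_k^T x, with theta_0 fixed at 0.)
log_odds_vs_baseline = np.array([0.0, 1.2, -0.5, 0.3])   # class 0 is the benchmark

# Softmax turns these K log-odds into K probabilities that sum to 1.
exp_lo = np.exp(log_odds_vs_baseline)
probs = exp_lo / exp_lo.sum()

print(probs)          # Pr(y=0), Pr(y=1), Pr(y=2), Pr(y=3)
print(probs.sum())    # 1.0
```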
Prediction and Confusion Matrix
Predict whether the borrower is a good borrower.
How do we decide whether a model is a good model?
Confusion Matrix
$N + P$ (actual class counts) $= N^{*} + P^{*}$ (predicted label counts): both sum to the total number of borrowers.
FN and FP are both places where my model makes mistakes.
FN and FP errors have different economic consequences.
With a very small threshold, you behave very conservatively: you make fewer mistakes of lending to people who are bad borrowers.
The economic consequences really count.
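A sketch of how the threshold trades off FP against FN, using hypothetical predicted probabilities and assumed per-error costs (the cost numbers are not from the lecture):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical predicted default probabilities and true outcomes (made up).
p_default = np.array([0.05, 0.20, 0.45, 0.70, 0.90, 0.10, 0.60, 0.30])
y_true    = np.array([0,    0,    1,    1,    1,    0,    0,    1   ])

# Assumed economic costs: a missed default (FN) costs far more than
# turning away a good borrower (FP).
COST_FN, COST_FP = 10.0, 1.0

for threshold in (0.2, 0.5, 0.8):
    y_pred = (p_default >= threshold).astype(int)      # flag as "default" above the threshold
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    cost = COST_FN * fn + COST_FP * fp
    print(f"threshold={threshold}: FP={fp}, FN={fn}, expected cost={cost}")
```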
ROC Curve
Skip
K-nearest neighbours
Why KNN and logistic regression are polar opposites.
Why are KNN classification and logistic-regression classification so different?
How do we classify with KNN?
Look at the left figure: inside the blue circle on the left there is a cross ("x") that I want to classify. How do I do that?
Draw a circle around it and find its 3 nearest neighbours. Among the cross's neighbours, two are blue circles and one is yellow, so the cross has a 66.7% chance of belonging to the blue class and a 33.3% chance of belonging to the yellow class.
A very intuitive argument.
Right figure: this is what it looks like once the decision boundary has been drawn.
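A minimal sketch of the same 3-NN vote using scikit-learn; the toy points are made up so that the nearest neighbours are two blue and one yellow:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points (made up): class 0 = "blue circle", class 1 = "yellow circle".
X_train = np.array([[1.0, 1.0], [1.5, 1.2], [0.5, 3.5], [2.0, 2.2], [3.5, 3.5]])
y_train = np.array([0, 0, 0, 1, 1])

# 3-nearest-neighbour classifier with Euclidean distance (both are choices we must make).
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)

x_new = np.array([[1.6, 1.8]])     # the "x" we want to classify
print(knn.predict_proba(x_new))    # vote shares among the 3 neighbours: [0.667, 0.333]
print(knn.predict(x_new))          # majority class: 0 (blue)
```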
KNN procedure: a data-driven approach.
- k is a parameter we have to choose.
- How do we measure the distance?
cf. KNN vs. k-means clustering.
Distance between categorical values: how should it be defined? (See the sketch below.)
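One common choice for categorical features (an assumption, not prescribed in the lecture) is the Hamming/overlap distance:

```python
import numpy as np

# Hamming / overlap distance: the fraction of categorical features
# on which two borrowers disagree.
def hamming_distance(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return np.mean(a != b)

print(hamming_distance(["renter", "employed", "single"],
                       ["owner",  "employed", "married"]))   # 2 of 3 differ -> 0.667
```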
Key Point of the Lecture
To summarize, the key point of training this model is how to balance these two types of errors, taking into account the economic consequences that each type of error brings about.
It is an economic problem rather than an ML problem.