Data mining 2

2020-05-27

Components of Data Mining Algorithms

Representation: Determining the nature and structure of the representation to be used.
Score function: Quantifying and comparing how well different representations fit the data.
Search/Optimization method: Choosing an algorithmic process to optimize the score function.
Data management: Deciding what principles of data management are required to implement the algorithms efficiently.

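Data types in data analysis

EDA (exploratory data analysis and visualization)

Explore the statistical characteristics of the data
- Mean/Mode/Median/Quartile/Variance/Skewness (the standardized third moment)
- Number of distinct values for a variable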
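The summary statistics listed above can be computed with Python's standard library. The following is a minimal sketch: the helper name `summarize` and the sample values are made up for illustration, and skewness is taken here as the standardized third central moment.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population variance (divide by n); statistics.variance divides by n - 1.
    variance = statistics.pvariance(values)
    std = variance ** 0.5
    # Skewness as the standardized third central moment (0 for constant data).
    third = sum((v - mean) ** 3 for v in values) / n
    skewness = third / std ** 3 if std > 0 else 0.0
    q1, q2, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "quartiles": (q1, q2, q3),
        "variance": variance,
        "skewness": skewness,
        "n_distinct": len(set(values)),
    }

data = [1, 2, 2, 3, 4, 7, 9]  # hypothetical sample
summary = summarize(data)
for name, value in summary.items():
    print(name, value)
```

A positive skewness, as this right-tailed sample gives, indicates that the long tail lies above the mean.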
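Visualization shows these characteristics more intuitively
- Box plot vs. histogram (a histogram focuses on how many observations take a given value of y)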
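To make the contrast concrete, the sketch below computes the numbers each plot would draw: a five-number summary for a box plot and per-bin counts for a histogram. The function names, the bin width, and the `ages` data are assumptions chosen for illustration, and no plotting library is used.

```python
import statistics
from collections import Counter

def five_number_summary(values):
    """Min, Q1, median, Q3, max: the numbers a box plot draws."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def histogram_counts(values, bin_width):
    """Count how many values fall into each bin [k*w, (k+1)*w)."""
    counts = Counter(int(v // bin_width) for v in values)
    # Key each bin by its left edge so the result is easy to read.
    return {k * bin_width: counts[k] for k in sorted(counts)}

ages = [21, 22, 22, 25, 31, 33, 34, 48]  # hypothetical data
box = five_number_summary(ages)
hist = histogram_counts(ages, bin_width=10)
print("box plot numbers:", box)
print("histogram counts:", hist)
```

The box plot compresses the sample into five positions, while the histogram keeps the count in every bin, which is why the histogram answers "how many observations are near this value".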

Guessing the underlying rule from experiments

Model $\rightleftharpoons$ Data

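Thought experiments versus experiments that can actually be carried out

Statistical inference: inferring properties of an unknown distribution from observed data, by guessing a family of distributions and estimating which member generated the data.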

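What we care about is the MSE, the mean squared error; the smaller the better.

MSE accounts for both $Bias^2$ and Variance: $MSE = Bias^2 + Variance$.

Low $Bias$ means the estimate is, on average, close to the true value, while low Variance means the estimates are tightly concentrated around their own mean.

Very often, these two goals conflict.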
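A small simulation can illustrate the trade-off. The sketch below, under assumptions chosen only for illustration (standard normal data, samples of size 5, a fixed random seed), estimates a variance in two ways, dividing by n or by n - 1, and measures the empirical $Bias^2$, Variance, and MSE of each estimator.

```python
import random
import statistics

def bias_variance_mse(estimates, true_value):
    """Empirical bias^2, variance and MSE of repeated estimates."""
    mean_est = statistics.fmean(estimates)
    bias_sq = (mean_est - true_value) ** 2
    variance = statistics.fmean([(e - mean_est) ** 2 for e in estimates])
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    return bias_sq, variance, mse

rng = random.Random(0)   # fixed seed so the run is repeatable
n, true_var = 5, 1.0     # assumed sample size and true variance
by_n, by_n_minus_1 = [], []
for _ in range(20000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    by_n.append(ss / n)                # biased, but less spread out
    by_n_minus_1.append(ss / (n - 1))  # unbiased, but more spread out

results = {
    "divide by n": bias_variance_mse(by_n, true_var),
    "divide by n - 1": bias_variance_mse(by_n_minus_1, true_var),
}
for name, (b2, var, mse) in results.items():
    print(f"{name}: bias^2={b2:.4f} variance={var:.4f} mse={mse:.4f}")
```

In each row, MSE equals $Bias^2$ plus Variance up to rounding. For Gaussian data the divide-by-n estimator is biased yet has the smaller MSE, which is one way the two goals pull against each other.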

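What does the likelihood represent?

It is the probability that a model with parameter $\theta$ produces the data I observed.

Maximum likelihood estimation, put simply, uses the known sample results to work backward to the model parameter values most likely (with the highest probability) to have produced those results.

In other words, maximum likelihood estimation provides a way to estimate model parameters from observed data, in the setting "the model is fixed, the parameters are unknown".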
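As a worked illustration, consider a coin whose probability of heads $\theta$ is unknown. The sketch below fixes the model (independent flips), evaluates the log-likelihood on a grid of candidate values, and picks the maximizer. The observations and the grid resolution are hypothetical, and a grid search is used only for transparency; for this model the maximizer is known in closed form to be the fraction of heads.

```python
import math

def log_likelihood(theta, flips):
    """Log of P(flips | theta) for independent coin flips (1 = heads)."""
    heads = sum(flips)
    tails = len(flips) - heads
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

flips = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]     # hypothetical data: 7 heads, 3 tails
grid = [i / 1000 for i in range(1, 1000)]  # candidate theta values in (0, 1)

# "Model fixed, parameter unknown": choose the theta that makes the data most probable.
theta_hat = max(grid, key=lambda t: log_likelihood(t, flips))
print("maximum likelihood estimate:", theta_hat)
print("fraction of heads:", sum(flips) / len(flips))
```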

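There are many small kernels; each kernel is shifted left or right, and the kernels are then added together.

Think of it as small hills piling up into one big mountain.

A weighted sum of these functions forms the overall estimate.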
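The "small hills" picture corresponds to kernel density estimation. The following is a minimal sketch assuming a Gaussian kernel, equal weights of 1/n for every kernel, and a hand-picked bandwidth; the sample values are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density: one small 'hill' centered at 0."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, samples, bandwidth):
    """Density estimate at x: the average of kernels centered on each sample."""
    total = sum(gaussian_kernel((x - s) / bandwidth) for s in samples)
    return total / (len(samples) * bandwidth)

samples = [1.0, 1.2, 1.9, 4.8, 5.1]  # hypothetical observations
density_near_cluster = kde(1.3, samples, bandwidth=0.5)
density_in_gap = kde(3.4, samples, bandwidth=0.5)
print("density near the cluster:", density_near_cluster)
print("density in the gap:", density_in_gap)
```

Where several samples sit close together their hills overlap and the estimate is high; in the gap between clusters only the tails contribute. A larger bandwidth widens each hill and smooths the result.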

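How do we evaluate how well a model fits?

Of these three plots, which one shows the best model fit?

What tools can help us detect overfitting?

Cross-validation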
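A minimal sketch of k-fold cross-validation follows. The model (k-nearest-neighbor regression), the synthetic data (a noisy sine curve), and the fold count are all assumptions picked for illustration; the point is only to compare the error on the training data with the error on held-out folds.

```python
import math
import random

def knn_predict(x, train, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, test, k):
    """Mean squared error on `test` of a k-NN model built from `train`."""
    return sum((knn_predict(x, train, k) - y) ** 2 for x, y in test) / len(test)

def cross_val_mse(data, k, n_folds=5):
    """Average held-out MSE over n_folds disjoint folds."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(mse(train, held_out, k))
    return sum(scores) / n_folds

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.uniform(0.0, 6.0) for _ in range(300)]
data = [(x, math.sin(x) + rng.gauss(0.0, 0.5)) for x in xs]  # synthetic noisy data

# Map each k to (training MSE, cross-validated MSE).
results = {k: (mse(data, data, k), cross_val_mse(data, k)) for k in (1, 9)}
for k, (train_err, cv_err) in results.items():
    print(f"k={k}: training MSE={train_err:.3f}  cross-validated MSE={cv_err:.3f}")
```

With k = 1 the model reproduces every training point, so its training error is zero, yet its held-out error is clearly worse than that of the smoother k = 9 model. A large gap between training error and cross-validated error is the signal of overfitting.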

Occam's razor: entities should not be multiplied beyond necessity.

MDL: the Minimum Description Length principle

Stochastic processes

Mixture of parametric models

Model complexity tends to grow exponentially with the number of dimensions.
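One simple way to see this growth, taking a grid histogram as the model (an assumption made for illustration), is to count its cells: with the same number of bins on every axis, the number of cells, and hence of parameters to estimate, is the bin count raised to the number of dimensions.

```python
def histogram_cells(bins_per_axis, n_dims):
    """Number of cells in a grid histogram with the same bin count on every axis."""
    return bins_per_axis ** n_dims

for d in (1, 2, 3, 10):
    print(f"{d} dimension(s): {histogram_cells(10, d)} cells")
```

With a fixed amount of data, most cells become empty as the dimension grows, so the estimate in each cell becomes unreliable.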