
Vlad Miller

Vlad is a versatile software engineer with experience in many fields. He is currently honing his Scala and machine learning skills.

The main goal of this reading is to understand enough statistical methodology to be able to leverage the machine learning algorithms in Python's scikit-learn library and then apply this knowledge to solve a classic machine learning problem.

The first stop of our journey will take us through a brief history of machine learning. Then we will dive into different algorithms. On our final stop, we will use what we learned to solve the Titanic survival rate prediction problem.

A few disclaimers:

  • I am a full-stack software engineer, not a machine learning algorithm expert.
  • I assume you know some basic Python.
  • This is exploratory, so not every detail is explained like it would be in a tutorial.

With that noted, let's dive in!

A Quick Introduction to Machine Learning Algorithms

Once you step into this field, you realize that machine learning is less romantic than you may think. Initially, I was full of hopes that after I learned more I would be able to construct my own Jarvis AI, which would spend all day coding software and making money for me, so I could spend whole days outdoors reading books, driving a motorcycle, and enjoying a reckless lifestyle while my personal Jarvis makes my pockets deeper. However, I soon realized that the foundation of machine learning algorithms is statistics, which I personally find dull and uninteresting. Fortunately, it did turn out that "dull" statistics have some very fascinating applications.

You will soon discover that to get to those fascinating applications, you need to understand statistics very well. One of the goals of machine learning algorithms is to find statistical dependencies in supplied data.

The supplied data could be anything from checking blood pressure against age to finding handwritten text based on the color of various pixels.

That said, I was curious to see whether I could use machine learning algorithms to find dependencies in cryptographic hash functions (SHA, MD5, etc.), but you can't really do that because proper crypto primitives are constructed in such a way that they eliminate dependencies and produce significantly hard-to-predict output. I believe that, given an infinite amount of time, machine learning algorithms could crack any crypto model.

Unfortunately, we don't have that much time, so we need to find another way to efficiently mine cryptocurrency. How far have we gotten up until now?

A Brief History of Machine Learning Algorithms

The roots of machine learning algorithms come from Thomas Bayes, an English statistician who lived in the 18th century. His paper An Essay Towards Solving a Problem in the Doctrine of Chances underpins Bayes' theorem, which is widely applied in the field of statistics.

In the 19th century, Pierre-Simon Laplace published Théorie analytique des probabilités, expanding on the work of Bayes and defining what we know today as Bayes' theorem. Shortly before that, Adrien-Marie Legendre had described the "least squares" method, also widely used today in supervised learning.

The 20th century is the period when the majority of publicly known discoveries were made in this field. Andrey Markov invented Markov chains, which he used to analyze poems. Alan Turing proposed a learning machine that could become artificially intelligent, essentially foreshadowing genetic algorithms. Frank Rosenblatt invented the perceptron, sparking huge excitement and great coverage in the media.

But then the 1970s saw a lot of pessimism around the idea of AI, and thus reduced funding, so this period is called an AI winter. The rediscovery of backpropagation in the 1980s caused a resurgence in machine learning research. And today, it's a hot topic once again.

The late Leo Breiman distinguished between two statistical modeling paradigms: data modeling and algorithmic modeling. "Algorithmic modeling" means, more or less, machine learning algorithms like the random forest.

Machine learning and statistics are closely related fields. According to Michael I. Jordan, the ideas of machine learning, from methodological principles to theoretical tools, have had a long prehistory in statistics. He also suggested data science as a placeholder term for the overall problem that machine learning specialists and statisticians are both implicitly working on.

Categories of Machine Learning Algorithms

The field of machine learning stands on two main pillars called supervised learning and unsupervised learning. Some people also consider a newer field of study, deep learning, to be separate from the question of supervised vs. unsupervised learning.

Supervised learning is when a computer is presented with examples of inputs and their desired outputs. The goal of the computer is to learn a general formula which maps inputs to outputs. This can be further broken down into:

  • Semi-supervised learning, which is when the computer is given an incomplete training set with some outputs missing
  • Active learning, which is when the computer can only obtain training labels for a very limited set of instances. When used interactively, their training sets can be presented to the user for labeling.
  • Reinforcement learning, which is when the training data is only given as feedback to the program's actions in a dynamic environment, such as driving a vehicle or playing a game against an opponent

In contrast, unsupervised learning is when no labels are given at all and it's up to the algorithm to find the structure in its input. Unsupervised learning can be a goal in itself when we only need to discover hidden patterns.

Deep learning is a newer field of study which is inspired by the structure and function of the human brain and based on artificial neural networks rather than just statistical concepts. Deep learning can be used in both supervised and unsupervised approaches.

In this article, we will only go through some of the simpler supervised machine learning algorithms and use them to calculate the survival chances of an individual in the tragic sinking of the Titanic. But in general, if you're not sure which algorithm to use, a nice place to start is scikit-learn's machine learning algorithm cheat sheet.

Basic Supervised Machine Learning Models

Probably the simplest possible algorithm is linear regression. Sometimes it can be graphically represented as a straight line, but despite its name, if there's a polynomial hypothesis, this line could instead be a curve. Either way, it models the relationships between a scalar dependent variable $y$ and one or more explanatory values denoted by $x$.

In layperson's terms, this means that linear regression is the algorithm which learns the dependency between each known $x$ and $y$, such that later we can use it to predict $y$ for an unknown sample of $x$.

In our first supervised learning example, we will use a basic linear regression model to predict a person's blood pressure given their age. This is a very simple dataset with two meaningful features: age and blood pressure.

As already mentioned above, most machine learning algorithms work by finding a statistical dependency in the data provided to them. This dependency is called a hypothesis and is usually denoted by $h(\theta)$.

To figure out the hypothesis, let's start by loading and exploring the data.

import matplotlib.pyplot as plt
from pandas import read_csv
import os

# Load data
data_path = os.path.join(os.getcwd(), "data/blood-pressure.txt")
dataset = read_csv(data_path, delim_whitespace=True)

# We have 30 entries in our dataset and four features. The first feature is the ID of the entry.
# The second feature is always 1. The third feature is the age and the last feature is the blood pressure.
# We will now drop the ID and One features, as they are not important.
dataset = dataset.drop(['ID', 'One'], axis=1)

# And we will display this graph
%matplotlib inline
dataset.plot.scatter(x='Age', y='Pressure')

# Now, we will assume that we already know the hypothesis and it looks like a straight line
h = lambda x: 84 + 1.24 * x

# Let's add this line on the chart now
ages = range(18, 85)
estimated = []

for i in ages:
    estimated.append(h(i))

plt.plot(ages, estimated, 'b')


The linear hypothesis shown on a chart of age vs. blood pressure.

On the chart above, every blue dot represents our data sample and the blue line is the hypothesis which our algorithm needs to learn. So what exactly is this hypothesis anyway?

In order to solve this problem, we need to learn the dependency between $x$ and $y$, which is denoted by $y = f(x)$. Therefore $f(x)$ is the ideal target function. The machine learning algorithm will try to guess the hypothesis function $h(x)$ that is the closest approximation of the unknown $f(x)$.

The simplest possible form of hypothesis for the linear regression problem looks like this: $h_\theta(x) = \theta_0 + \theta_1 * x$. We have a single input scalar variable $x$ which outputs a single scalar variable $y$, where $\theta_0$ and $\theta_1$ are parameters which we need to learn. The process of fitting this blue line to the data is called linear regression. It is important to understand that we have only one input parameter $x_1$; however, a lot of hypothesis functions will also include the bias unit ($x_0$). So our resulting hypothesis has the form $h_\theta(x) = \theta_0 * x_0 + \theta_1 * x_1$. But we can avoid writing $x_0$ because it's almost always equal to 1.

Getting back to the blue line, our hypothesis looks like $h(x) = 84 + 1.24x$, which means that $\theta_0 = 84$ and $\theta_1 = 1.24$. How can we automatically derive those $\theta$ values?

We need to define a cost function. Essentially, what the cost function does is calculate the mean squared error between the model prediction and the actual output.

\[J(\theta) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2\]

For example, our hypothesis predicts that for a 48-year-old person, their blood pressure should be $h(48) = 84 + 1.24 * 48 = 143$ mmHg; however, in our training sample, we have the value of $130$ mmHg. Therefore the error is $(143 - 130)^2 = 169$. Now we need to calculate this error for every single entry in our training dataset, then sum it together ($\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2$) and take the mean value out of that.

This gives us a single scalar number which represents the cost of the function. Our goal is to find $\theta$ values such that the cost function is the lowest; in other words, we want to minimize the cost function. This will hopefully seem intuitive: If we have a small cost function value, this means the error of prediction is small as well.

import numpy as np
# Let's calculate the cost for the hypothesis above

h = lambda x, theta_0, theta_1: theta_0 + theta_1 * x

def cost(X, y, t0, t1):
    m = len(X)  # the number of training samples
    c = np.power(np.subtract(h(X, t0, t1), y), 2)
    return (1 / (2 * m)) * sum(c)

X = dataset.values[:, 0]
y = dataset.values[:, 1]
print('J(Theta) = %2.2f' % cost(X, y, 84, 1.24))

J(Theta) = 1901.95

Now, we need to find such values of $\theta$ that our cost function value is minimal. But how do we do that?

\[\min_\theta J(\theta) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2\]

There are several possible algorithms, but the most popular is gradient descent. In order to understand the intuition behind the gradient descent method, let's first plot it on a graph. For the sake of simplicity, we will assume a simpler hypothesis $h(\theta) = \theta_1 * x$. Next, we will plot a simple 2D chart where $x$ is the value of $\theta$ and $y$ is the cost function at this point.

import matplotlib.pyplot as plt

fig = plt.figure()

# Generate the data
theta_1 = np.arange(-10, 14, 0.1)

J_cost = []
for t1 in theta_1:
    J_cost += [ cost(X, y, 0, t1) ]

plt.plot(theta_1, J_cost)

plt.xlabel(r'$\theta_1$')
plt.ylabel(r'$J(\theta)$')

plt.show()

A convex cost function.

The cost function is convex, which means that on the interval $[a, b]$ there is only one minimum. This in turn means that the best $\theta$ parameters are at the point where the cost function is minimal.

Basically, gradient descent is an algorithm that tries to find the set of parameters which minimizes the function. It starts with an initial set of parameters and iteratively takes steps in the negative direction of the function gradient.

Finding the minimum of the cost function.

If we calculate the derivative of a hypothesis function at a specific point, this will give us the slope of the tangent line to the curve at that point. This means that we can calculate the slope at every single point on the graph.

The way the algorithm works is like this:

  1. We choose a random starting point (a random $\theta$).
  2. Calculate the derivative of the cost function at this point.
  3. Take a small step down the slope: $\theta_j := \theta_j - \lambda * \frac{\partial}{\partial \theta_j} J(\theta)$.
  4. Repeat steps 2-3 until we converge.

Now, the convergence condition depends on the implementation of the algorithm. We may stop after 50 steps, after some threshold, or anything else.

import math
# Example of the simple gradient descent algorithm taken from Wikipedia

cur_x = 2.5  # The algorithm starts at point x
gamma = 0.005  # Step size multiplier
precision = 0.00001
previous_step_size = cur_x

df = lambda x: 2 * x * math.cos(x)

# Remembers the learning curve and plot it

while previous_step_size > precision:
    prev_x = cur_x
    cur_x += -gamma * df(prev_x)
    previous_step_size = abs(cur_x - prev_x)

print("The local minimum occurs at %f" % cur_x)

The local minimum occurs at 4.712194

We won't be implementing those algorithms in this article. Instead, we will utilize the widely adopted scikit-learn, an open source Python machine learning library. It provides a lot of very useful APIs for different data mining and machine learning problems.

from sklearn.linear_model import LinearRegression
# LinearRegression finds the optimal parameters for us
# (under the hood it uses a least-squares solver rather than gradient descent)

# Our data
X = dataset[['Age']]
y = dataset[['Pressure']]

regr = LinearRegression()
regr.fit(X, y)

# Plot outputs
plt.xlabel('Age')
plt.ylabel('Blood pressure')

plt.scatter(X, y,  color='black')
plt.plot(X, regr.predict(X), color='blue')

plt.show()
plt.gcf().clear()

A chart of blood pressure vs. age.

print('Predicted blood pressure at 25 y.o.   = ', regr.predict(25))
print('Predicted blood pressure at 45 y.o.   = ', regr.predict(45))
print('Predicted blood pressure at 27 y.o.   = ', regr.predict(27))
print('Predicted blood pressure at 34.5 y.o. = ', regr.predict(34.5))
print('Predicted blood pressure at 78 y.o.   = ', regr.predict(78))

Predicted blood pressure at 25 y.o.   =  [[ 122.98647692]]
Predicted blood pressure at 45 y.o.   =  [[ 142.40388395]]
Predicted blood pressure at 27 y.o.   =  [[ 124.92821763]]
Predicted blood pressure at 34.5 y.o. =  [[ 132.20974526]]
Predicted blood pressure at 78 y.o.   =  [[ 174.44260555]]

Types of Statistical Data

When working with data for machine learning problems, it is important to recognize the different types of data. We may have numerical (continuous or discrete), categorical, or ordinal data.

Numerical data has meaning as a measurement. For example: a person's age, weight, the number of bitcoins they own, or how many articles they can write per month. Numerical data can be further broken down into discrete and continuous types.

  • Discrete data represent data that can be counted with whole numbers, e.g., the number of rooms in an apartment or the number of coin flips.
  • Continuous data can't necessarily be represented with whole numbers. For example, if you were measuring the distance you can jump, it may be 2 meters, or 1.5 meters, or 1.652245 meters.

Categorical data represent values such as a person's gender, marital status, country, etc. These data can take on numerical values, but those numbers have no mathematical meaning. You cannot add them together.

Ordinal data can be a mix of the other two types, in that categories may be numbered in a mathematically meaningful way. A common example is ratings: Often we are asked to rate things on a scale of one to ten, and only whole numbers are allowed. While we can use the numbers numerically, e.g., to find an average rating for something, we often treat the data as if it were categorical when it comes to applying machine learning methods to it.
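
To make the distinction concrete, here is a minimal sketch of how these three types might be represented in pandas. The toy DataFrame and its column names are made up for illustration and are not part of the article's datasets:

import pandas as pd

# Hypothetical toy data illustrating the three kinds of statistical data
df = pd.DataFrame({
    'age': [22, 38, 26],                      # numerical (discrete)
    'jump_distance_m': [2.0, 1.5, 1.652245],  # numerical (continuous)
    'country': ['UK', 'US', 'FI'],            # categorical: numbers would carry no meaning
    'rating': [7, 9, 4],                      # ordinal: ordered, but often treated as categorical
})

# Categorical columns can be declared explicitly so models don't treat them as plain numbers
df['country'] = df['country'].astype('category')
df['rating'] = pd.Categorical(df['rating'], categories=list(range(1, 11)), ordered=True)

print(df.dtypes)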

Logistic Regression

Linear regression is an awesome algorithm which helps us to predict numerical values, e.g., the price of a house with a specific size and number of rooms. However, sometimes, we may also want to predict categorical data, to get answers to questions like:

  • Is this a dog or a cat?
  • Is this tumor malignant or benign?
  • Is this wine good or bad?
  • Is this email spam or not?

Or even:

  • Which digit is in this picture?
  • Which category does this email belong to?

All these questions are specific to the classification problem. The simplest classification algorithm is called logistic regression, which is ultimately the same as linear regression except that it has a different hypothesis.

First of all, we can reuse the same linear hypothesis $h_\theta(x) = \theta^T X$ (this is in vectorized form). While linear regression may output any number in the interval $[a, b]$, logistic regression can only output values in $[0, 1]$, which is the probability of the object falling into a given category or not.

Using a sigmoid function, we can convert any numerical value to represent a value on the interval $[0, 1]$.

\[f(x) = \frac{1}{1 + e^{-x}}\]

Now, instead of $x$, we need to pass an existing hypothesis, and therefore we will get:

\[f(x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 * x_1 + ... + \theta_n * x_n)}}\]

After that, we can apply a simple threshold which says that if the hypothesis is greater than zero, this is a true value, otherwise false.

\[h_\theta(x) = \begin{cases} 1 & \mbox{if } \theta^T X > 0 \\ 0 & \mbox{else} \end{cases}\]

This means that we can use the same cost function and the same gradient descent algorithm to learn the hypothesis for logistic regression.
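
To illustrate the idea (this is only a sketch of the math above, not the implementation scikit-learn uses internally), here is the sigmoid and the thresholded hypothesis in NumPy. The parameter vector theta and the sample matrix X below are made-up values:

import numpy as np

def sigmoid(z):
    # Maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta):
    # X includes a bias column of ones; theta is the learned parameter vector
    return sigmoid(X @ theta)

def predict(X, theta, threshold=0.5):
    # Thresholding the probability at 0.5 is equivalent to checking theta^T X > 0
    return (predict_proba(X, theta) >= threshold).astype(int)

# Hypothetical example: bias column plus two features, made-up parameters
X = np.array([[1.0, 2.0, 3.0],
              [1.0, -1.0, 0.5]])
theta = np.array([0.1, 0.8, -0.4])
print(predict_proba(X, theta), predict(X, theta))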

In our next machine learning algorithm example, we will advise the pilots of the space shuttle whether or not they should use automatic or manual landing control. We have a very small dataset (15 samples) which consists of six features and the ground truth.

In machine learning algorithms, the term "ground truth" refers to the accuracy of the training set's classification for supervised learning techniques.

Our dataset is complete, meaning that there are no missing features; however, some of the features have a "*" instead of the category, which means that this feature does not matter. We will replace all such asterisks with zeros.

from sklearn.linear_model import LogisticRegression

# Data
data_path = os.path.join(os.getcwd(), "data/shuttle-landing-control.csv")
dataset = read_csv(data_path, header=None,
                   names=['Auto', 'Stability', 'Error', 'Sign', 'Wind', 'Magnitude', 'Visibility'],
                   na_values='*').fillna(0)

# Prepare features
X = dataset[['Stability', 'Error', 'Sign', 'Wind', 'Magnitude', 'Visibility']]
y = dataset[['Auto']].values.reshape(1, -1)[0]

model = LogisticRegression()
model.fit(X, y)

# For now, we are missing one important concept. We don't know how well our model
# performs, and because of that, we cannot really improve the performance of our hypothesis.
# There are a lot of useful metrics, but for now, we will validate how well
# our algorithm performs on the dataset it learned from.
"Score of our model is %2.2f%%" % (model.score(X, y) * 100)

Score of our model is 73.33%

Validation?

In the previous example, we validated the performance of our model using the learning data. However, is this a good option, given that our algorithm can either underfit or overfit the data? Let's take a look at a simpler example where we have one feature which represents the size of a house and another which represents its price.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Ground truth function
ground_truth = lambda X: np.cos(15 + np.pi * X)

# Generate random observations around the ground truth function
n_samples = 15
degrees = [1, 4, 30]

X = np.linspace(-1, 1, n_samples)
y = ground_truth(X) + np.random.randn(n_samples) * 0.1

plt.figure(figsize=(14, 5))

models = {}

# Plot all machine learning algorithm models
for idx, degree in enumerate(degrees):
    ax = plt.subplot(1, len(degrees), idx + 1)
    plt.setp(ax, xticks=(), yticks=())

    # Define the model
    polynomial_features = PolynomialFeatures(degree=degree)
    model = make_pipeline(polynomial_features, LinearRegression())

    models[degree] = model

    # Train the model
    model.fit(X[:, np.newaxis], y)

    # Evaluate the model using cross-validation
    scores = cross_val_score(model, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error")

    X_test = X
    plt.plot(X_test, model.predict(X_test[:, np.newaxis]), label="Model")
    plt.scatter(X, y, edgecolor='b', s=20, label="Observations")

    plt.xlabel("x")
    plt.ylabel("y")
    plt.ylim((-2, 2))

    plt.title("Degree {}\nMSE = {:.2e}".format(
        degree, -scores.mean()))

plt.show()

The same data modeled by first-, fourth-, and 30th-degree polynomials, demonstrating overfitting and underfitting.

A machine learning algorithm model is underfitting if it can generalize neither the training data nor new observations. In the example above, we use a simple linear hypothesis which does not really represent the actual training dataset and will have very poor performance. Usually, underfitting is not discussed much, as it can be easily detected given a good metric.

If our algorithm remembers every single observation it was shown, then it will have poor performance on new observations outside of the training dataset. This is called overfitting. For example, a 30th-degree polynomial model passes through most of the points and has a very good score on the training set, but anything outside of that would perform badly.

Our dataset consists of one feature and is simple to plot in 2D space; however, in real life, we may have datasets with hundreds of features, which makes them impossible to plot visually in Euclidean space. What other options do we have in order to see whether the model is underfitting or overfitting?

It's time to introduce you to the concept of the learning curve. This is a simple graph that plots the mean squared error over the number of training samples.

In learning materials, you will usually see graphs similar to these:

Theoretical learning curve variations based on polynomial degree.

However, in real life, you may not get such a perfect picture. Let's plot the learning curve for each of our models.

from sklearn.model_selection import learning_curve, ShuffleSplit

# Plot learning curves
plt.figure(figsize=(20, 5))

for idx, degree in enumerate(models):
    ax = plt.subplot(1, len(degrees), idx + 1)

    plt.title("Degree {}".format(degree))
    plt.grid()

    plt.xlabel("Training examples")
    plt.ylabel("Score")

    train_sizes = np.linspace(.6, 1.0, 6)

    # Cross-validation with 100 iterations to get smoother mean test and training
    # score curves, each time with 20% of the data randomly selected as a validation set.
    cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

    model = models[degree]
    train_sizes, train_scores, test_scores = learning_curve(
        model, X[:, np.newaxis], y, cv=cv, train_sizes=train_sizes, n_jobs=4)

    train_scores_mean = np.mean(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)

    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Test score")

    plt.legend(loc="best")

plt.show()

Training score vs. test score for three graphs with data modeled by first-, fourth-, and 30th-degree polynomials.

In our simulated scenario, the blue line, which represents the training score, seems like a straight line. In reality, it still decreases slightly; you can actually see this in the first-degree polynomial graph, but in the others it's too subtle to tell at this resolution. We at least clearly see that there is a huge gap between the learning curves for training and test observations in a "high bias" scenario.

On the "normal" learning rate graph in the middle, you can see how the training score and test score lines come together.

And on the "high variance" graph, you can see that with a low number of samples, the test and training scores are very similar; however, when you increase the number of samples, the training score remains almost perfect while the test score grows away from it.


We can fix underfitting models (also called models with high bias) if we use a nonlinear hypothesis, e.g., a hypothesis with more polynomial features.

Our overfitting model (high variance) passes through every single example it is shown; however, when we introduce test data, the gap between the learning curves widens. We can use regularization, cross-validation, and more data samples to fix overfitting models.

Cross-validation

One of the common practices used to avoid overfitting is to hold onto part of the available data and use it as a test set. However, when evaluating different model settings, such as the number of polynomial features, we are still at risk of overfitting the test set because parameters can be tweaked to achieve the optimal estimator performance and, because of that, our knowledge about the test set can leak into the model. To solve this problem, we need to hold onto one more part of the dataset, which is called the "validation set." Training proceeds on the training set and, when we believe we've achieved the optimal model performance, we can make a final evaluation utilizing the validation set.

However, by partitioning the available data into three sets, we dramatically reduce the number of samples which can be used for training the models, and the results can depend on a particular random choice for the training-validation pair of sets.

One solution to this problem is a procedure called cross-validation. In standard $k$-fold cross-validation, we partition the data into $k$ subsets, called folds. Then, we iteratively train the algorithm on $k-1$ folds while using the remaining fold as the test set (called the "holdout fold").

A grid demonstrating the position of holdout folds in k-fold cross-validation.

Cross-validation allows you to tune parameters with only your original training set. This allows you to keep your test set as a truly unseen dataset for selecting your final model.

There are a lot more cross-validation techniques, like leave P out, stratified $k$-fold, shuffle and split, etc., but they're beyond the scope of this article.
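
As a quick illustration of the standard $k$-fold procedure with scikit-learn, here is a minimal sketch reusing a polynomial pipeline similar to the one above. The toy data, the degree, and the fold count are arbitrary choices made for this example:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np

# Toy data similar to the earlier house-price-style example
X = np.linspace(-1, 1, 15)[:, np.newaxis]
y = np.cos(15 + np.pi * X).ravel() + np.random.randn(15) * 0.1

model = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())

# 5-fold cross-validation: each fold takes a turn as the holdout fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("Mean CV score: %.3f" % scores.mean())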

Regularization

This is another technique that can help solve the issue of model overfitting. Most datasets have a pattern and some noise. The goal of regularization is to reduce the influence of the noise on the model.

A graph juxtaposing an original function and its regularized counterpart.

There are three main regularization techniques: Lasso, Tikhonov, and elastic net.

L1 regularization (or Lasso regularization) will select some features to be shrunk to zero, such that they will not play any role in the final model. L1 can be seen as a method to select important features.

L2 regularization (or Tikhonov regularization) will force all features to be relatively small, such that they will have less influence on the model.

Elastic net is the combination of L1 and L2.
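
As a brief illustration (not part of this article's Titanic pipeline), scikit-learn exposes all three techniques as drop-in replacements for plain linear regression. The toy data and the alpha values below are arbitrary:

from sklearn.linear_model import Lasso, Ridge, ElasticNet
import numpy as np

# Hypothetical data: five features, only the first two actually matter
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.randn(100) * 0.1

lasso = Lasso(alpha=0.1).fit(X, y)                      # L1: shrinks irrelevant coefficients to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)                      # L2 (Tikhonov): keeps all coefficients small
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

print("Lasso coefficients:      ", lasso.coef_)
print("Ridge coefficients:      ", ridge.coef_)
print("Elastic net coefficients:", enet.coef_)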

Normalization (Feature Scaling)

Feature scaling is also an important step while preprocessing the data. Our dataset may have some features with values in $[-\infty, \infty]$ and other features on a different scale. This is a method of standardizing the ranges of independent values.

Feature scaling is also an important process for improving the performance of the learning model. First of all, gradient descent will converge much faster if all of the features are scaled to the same norm. Also, a lot of algorithms, for example, support vector machines (SVM), work by calculating the distance between two points, and if one of the features has broad values, then the distance will be highly influenced by this feature.
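
A minimal sketch of feature scaling with scikit-learn's StandardScaler; the toy matrix below is made up purely for illustration:

from sklearn.preprocessing import StandardScaler
import numpy as np

# Hypothetical features on very different scales: age in years, income in dollars
X = np.array([[22, 25000],
              [38, 87000],
              [59, 43000]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has zero mean and unit variance
print(X_scaled)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))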

Support Vector Machines

SVM is yet another broadly popular machine learning algorithm which can be used for classification and regression problems. In SVM, we plot each observation as a point in $n$-dimensional space, where $n$ is the number of features we have. The value of each feature is the value of a particular coordinate. Then, we try to find a hyperplane that separates the two classes well enough.

A graph showing a hyperplane separating two classes of data points, along with some of their support vectors.

After we identify the best hyperplane, we want to add margins, which would further separate the two classes.

A graph showing a hyperplane with margins.

SVM is very effective when the number of features is very high or if the number of features is larger than the number of data samples. However, since SVM operates on a vector basis, it is crucial to normalize the data prior to usage.
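
Putting those two points together, one common pattern is to chain the scaler and the SVM into a single pipeline so that normalization always happens before training and prediction. The toy two-class data below is made up for the sketch and is not one of the article's datasets:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import numpy as np

# Hypothetical two-class toy data
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [0] * 20)

# Scaling is applied automatically before the SVM at both fit and predict time
clf = make_pipeline(StandardScaler(), SVC(kernel='linear', C=0.025))
clf.fit(X, y)
print("Training accuracy: %.2f" % clf.score(X, y))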

Neural Networks

Neural network algorithms are probably the most exciting field of machine learning research. Neural networks try to mimic how the brain's neurons are connected together.

An illustration of a neural network, showing various inputs mapped to intermediate values, which in turn are mapped to a single output.

This is what a neural network looks like. We combine a lot of nodes together, where each node takes a set of inputs, applies some calculations on them, and outputs a value.

There is a huge variety of neural network algorithms for both supervised and unsupervised learning. Neural networks can be used to drive autonomous cars, play games, land airplanes, classify images, and more.

The Infamous Titanic

The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean on April 15th, 1912, after colliding with an iceberg. There were about 2,224 crew and passengers on board, and more than 1,500 died, making it one of the deadliest commercial maritime disasters of all time.

Now, since we understand the intuition behind the most basic machine learning algorithms used for classification problems, we can apply our knowledge to predict the survival outcome for those on board the Titanic.

Our dataset will be borrowed from the Kaggle data science competitions platform.

import os
from pandas import read_csv, concat

# Load data
data_path = os.path.join(os.getcwd(), "data/titanic.csv")
dataset = read_csv(data_path, skipinitialspace=True)

dataset.head(5)
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

Our first step is to load and explore the data. We have 891 test records; each record has the following structure:

  • passengerId – ID of the passenger on board
  • survived – Whether or not the person survived the crash
  • pclass – Ticket class, e.g., 1st, 2nd, 3rd
  • sex – Gender of the passenger: male or female
  • name – Title included
  • age – Age in years
  • sibsp – Number of siblings/spouses aboard the Titanic
  • parch – Number of parents/children aboard the Titanic
  • ticket – Ticket number
  • fare – Passenger fare
  • cabin – Cabin number
  • embarked – Port of embarkation

This dataset contains both numerical and categorical data. Usually, it is a good idea to dive deeper into the data and, based on that, come up with assumptions. However, in this case, we will skip this step and go straight to predictions.

import pandas as pd

# We need to drop some insignificant features and map the others.
# Ticket number and fare should not contribute much to the performance of our models.
# The Name feature has titles (e.g., Mr., Miss, Doctor) included.
# Gender is definitely important.
# The port of embarkation may contribute some value.
# Using the port of embarkation may sound counter-intuitive; however, there may
# be a higher survival rate for passengers who boarded in the same port.

dataset['Title'] = dataset.Name.str.extract(' ([A-Za-z]+)\.', expand=False)
dataset = dataset.drop(['PassengerId', 'Ticket', 'Cabin', 'Name'], axis=1)

pd.crosstab(dataset['Title'], dataset['Sex'])
Sex       female  male
Title
Capt           0     1
Col            0     2
Countess       1     0
Don            0     1
Dr             1     6
Jonkheer       0     1
Lady           1     0
Major          0     2
Master         0    40
Miss         182     0
Mlle           2     0
Mme            1     0
Mr             0   517
Mrs          125     0
Ms             1     0
Rev            0     6
Sir            0     1
# We will replace many titles with a more common name, English equivalent,
# or reclassification
dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', \
    'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Other')

dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

dataset[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
    Title  Survived
0  Master  0.575000
1    Miss  0.702703
2      Mr  0.156673
3     Mrs  0.793651
4   Other  0.347826
# Now we will map alphanumerical categories to numbers
title_mapping = { 'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Other': 5 }
gender_mapping = { 'female': 1, 'male': 0 }
port_mapping = { 'S': 0, 'C': 1, 'Q': 2 }

# Map title
dataset['Title'] = dataset['Title'].map(title_mapping).astype(int)

# Map gender
dataset['Sex'] = dataset['Sex'].map(gender_mapping).astype(int)

# Map port
freq_port = dataset.Embarked.dropna().mode()[0]
dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
dataset['Embarked'] = dataset['Embarked'].map(port_mapping).astype(int)

# Fix missing age values
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].dropna().median())

dataset.head()
   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked  Title
0         0       3    0  22.0      1      0   7.2500         0      1
1         1       1    1  38.0      1      0  71.2833         1      3
2         1       3    1  26.0      0      0   7.9250         0      2
3         1       1    1  35.0      1      0  53.1000         0      3
4         0       3    0  35.0      0      0   8.0500         0      1

At this point, we will rank different types of machine learning algorithms in Python by using scikit-learn to create a set of different models. It will then be easy to see which one performs the best:

  • Logistic regression with varying numbers of polynomials
  • Support vector machine with a linear kernel
  • Support vector machine with a polynomial kernel
  • Neural network

For every single one of our models, we will use $k$-fold validation.

from sklearn.model_selection import KFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Prepare the data
X = dataset.drop(['Survived'], axis=1).values
y = dataset[['Survived']].values

X = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=None)

# Prepare cross-validation (cv)
cv = KFold(n_splits=5, random_state=None)

# Performance
p_score = lambda model, score: print('Performance of the %s model is %0.2f%%' % (model, score * 100))

# Classifiers
names = [
    "Logistic Regression", "Logistic Regression with Polynomial Hypotheses",
    "Linear SVM", "RBF SVM", "Neural Net",
]

classifiers = [
    LogisticRegression(),
    make_pipeline(PolynomialFeatures(3), LogisticRegression()),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    MLPClassifier(alpha=1),
]
# iterate over classifiers
models = []
trained_classifiers = []
for name, clf in zip(names, classifiers):
    scores = []
    for train_indices, test_indices in cv.split(X):
        clf.fit(X[train_indices], y[train_indices].ravel())
        scores.append(clf.score(X_test, y_test.ravel()))

    min_score = min(scores)
    max_score = max(scores)
    avg_score = sum(scores) / len(scores)

    trained_classifiers.append(clf)
    models.append((name, min_score, max_score, avg_score))

fin_models = pd.DataFrame(models, columns=['Name', 'Min Score', 'Max Score', 'Mean Score'])
fin_models.sort_values(['Mean Score']).head()
fin_模型.sort_值(['平均评分']).头()
                                              Name  Min Score  Max Score  Mean Score
2                                       Linear SVM   0.793296   0.821229    0.803352
0                              Logistic Regression   0.826816   0.860335    0.846927
4                                       Neural Net   0.826816   0.860335    0.849162
1  Logistic Regression with Polynomial Hypotheses   0.854749   0.882682    0.869274
3                                          RBF SVM   0.843575   0.888268    0.869274

OK, so our experimental research says that the SVM classifier with a radial basis function (RBF) kernel performs the best. Now, we can serialize our model and reuse it in production applications.

import pickle

svm_model = trained_classifiers[3]

data_path = os.path.join(os.getcwd(), "best-titanic-model.pkl")
pickle.dump(svm_model, open(data_path, 'wb'))
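
Later, for example in a web service, the serialized model can be loaded back and used for new predictions. Here is a minimal sketch; the sample feature vector is made up, and in practice it would need to go through the same mapping and scaling as the training data before being passed to the model:

import pickle
import numpy as np

# Load the serialized model back from disk
with open(data_path, 'rb') as f:
    loaded_model = pickle.load(f)

# Hypothetical passenger in the same column order as the training features:
# Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Title
sample = np.array([[3, 0, 22.0, 1, 0, 7.25, 0, 1]])
print(loaded_model.predict(sample))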

Machine learning is not complicated, but it's a very broad field of study, and it requires knowledge of math and statistics in order to grasp all of its concepts.

Right now, machine learning and deep learning are among the hottest topics of discussion in Silicon Valley, and they are the bread and butter of almost every data science company, mostly because they can automate many repetitive tasks, including speech recognition, driving vehicles, financial trading, caring for patients, cooking, marketing, and so on.

Now you can take this knowledge and solve challenges on Kaggle.

This was a very brief introduction to supervised machine learning algorithms. Luckily, there are a lot of online courses and other information about machine learning algorithms. I personally would recommend starting with Andrew Ng's course on Coursera.

Resources

Understanding the basics

  • How does machine learning work?

    Machine learning algorithms use statistical analysis to form models automatically, in contrast to traditional, hard-coded algorithms. This allows them to evolve over time as they look for patterns in data and make predictions as to its classification.

  • What can machine learning be used for?

    The applications of machine learning are nearly limitless. It can be used for everything from simple weather prediction and data clustering to complex feature learning; autonomous driving and flying; image, speech, and video recognition; search and recommendation engines; patient diagnosis; and much more.

  • What's the difference between supervised and unsupervised classification?

    Supervised classification needs labels for the training data: One picture is a cat, the other is a dog. Unsupervised classification is where the algorithm finds common traits and separates the data itself. It will not explicitly tell us that an image is a cat, but it will be able to separate cats from dogs.

  • What is supervised learning vs. unsupervised learning?

    Supervised learning is where you explicitly tell the algorithm what the right answer is, so the algorithm can learn and predict the answer for previously unseen data. In unsupervised learning, the algorithm has to figure out the answer by itself.

  • Where can I learn machine learning techniques?

    The best place to start learning about machine learning is to watch Andrew Ng's course on Coursera, linked in the resources at the end of this article. From there, start taking on challenges on Kaggle to develop better intuition about different frameworks and approaches.

  • How do I know which machine learning algorithm to use?

    There are a lot of factors to consider when choosing the right algorithm: the size of the dataset, the nature of the data, speed vs. accuracy, etc. Until you develop your own intuition, you can use existing cheat sheets like the one scikit-learn provides.
