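Day 3 of the 100 Days of ML challenge. Translating this material showed me how shockingly small my vocabulary is; I finally understand that you never know how bad you are until you try. The translation quality may fall short, but I did my best. Has the translation, or the project itself, actually helped me? I don't know, but doing it certainly beats staring into space.

I really did type every one of these words by hand, so please credit the source if you repost.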

Day 3: Multiple Linear Regression

Below is the knowledge map provided by the original author.

Multiple linear regression attempts to model the relationship between two or more features and a response by fitting a linear equation to observed data.
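To make "a linear equation" concrete, the fitted model can be written as below. The symbols are my own notation, not taken from the original material: y is the response, x_1 to x_n are the features, b_0 to b_n are the coefficients being estimated, and the last term is the error.

```latex
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + \varepsilon
```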

The steps to perform multiple linear regression are almost the same as those for simple linear regression.

The difference lies in the evaluation.

You can use it to find out which factor has the highest impact on the predicted output and how the different variables relate to each other.

Assumptions

For a successful regression analysis, it is essential to validate these assumptions:

1. Linearity: the relationship between the dependent and independent variables should be linear.

2. Homoscedasticity: the variance of the errors should stay constant.

3. Lack of multicollinearity: it is assumed that there is little or no multicollinearity in the data. Multicollinearity occurs when the features (or independent variables) are not independent of each other.
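One common way to check the third assumption is the variance inflation factor (VIF): regress each feature on all the others and see how well it can be predicted. The sketch below is my own illustration using only NumPy, not code from the original author. The function name variance_inflation_factors and the synthetic data are made up for the example, and the usual rule of thumb that a VIF above about 10 signals trouble is a convention, not a hard rule.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows = samples, columns = features)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress feature j on the remaining features plus an intercept.
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = residual @ residual
        ss_tot = ((target - target.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data (hypothetical): c is almost a copy of a, b is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

In this toy data the first and third columns should get very large values, while the unrelated second column stays close to 1.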

Note

Having too many variables could make our model less accurate, especially if certain variables have no effect on the outcome.

There are various methods to select the appropriate variables, such as:

1. Forward selection

2. Backward elimination

3. Bi-directional comparison
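To give a feel for the second method, here is a rough sketch of backward elimination. Note the assumption: the textbook version usually drops the feature with the highest p-value, while this sketch uses adjusted R² as the criterion so that it only needs NumPy. The function names adjusted_r2 and backward_elimination and the synthetic data are my own, not the original author's.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit least squares with an intercept and return the adjusted R^2."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    ss_res = residual @ residual
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def backward_elimination(X, y):
    """Start with all features; drop one at a time while adjusted R^2 improves."""
    kept = list(range(X.shape[1]))
    best = adjusted_r2(X[:, kept], y)
    while len(kept) > 1:
        # Score every candidate model that has one fewer feature.
        scores = [(adjusted_r2(X[:, [c for c in kept if c != j]], y), j) for j in kept]
        score, dropped = max(scores)
        if score <= best:
            break  # no single removal helps any more
        best = score
        kept.remove(dropped)
    return kept

# Synthetic data (hypothetical): only columns 0 and 1 influence y.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)
kept = backward_elimination(X_demo, y_demo)
print(kept)
```

Forward selection works the other way round: start with no features and add the one that helps most at each step.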

Dummy variables

Using categorical data in multiple regression models is a powerful way to include non-numeric data that takes a fixed, unordered set of values, for instance gender (male/female).

In a regression model, these values can be represented by dummy variables: variables containing values such as 1 or 0 that represent the presence or absence of the categorical value.
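As a small illustration of what dummy variables look like, pandas can build them with get_dummies. The tiny gender column here is made-up example data, not part of the dataset used later.

```python
import pandas as pd

# Hypothetical categorical column with two unordered values.
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# One 0/1 column per category; a 1 marks the presence of that category.
dummies = pd.get_dummies(df["gender"], dtype=int)
print(dummies)
```

Each row has exactly one 1, in the column of its own category.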

Dummy variable trap

The dummy variable trap is a scenario in which two or more variables are highly correlated; in simple terms, one variable can be predicted from the others.

Intuitively, there is a duplicate category: if we drop the male category, it is still inherently defined by the female category (a female value of zero indicates male, and vice versa).

The solution to the dummy variable trap is to drop one of the categorical variables: if there are m categories, use m-1 in the model. The value left out can be thought of as the reference value.
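The trap can be seen numerically. In the sketch below (my own illustration with made-up values), keeping both dummy columns next to an intercept column gives a matrix with three columns but only rank 2, because female + male always equals the intercept. Dropping one dummy leaves two columns that are genuinely independent.

```python
import numpy as np

# Hypothetical dummy columns for four samples.
female = np.array([0, 1, 1, 0])
male = 1 - female
intercept = np.ones(4)

# All m = 2 dummies plus the intercept: one column is redundant.
full = np.column_stack([intercept, female, male])
# Only m - 1 = 1 dummy plus the intercept: no redundancy.
reduced = np.column_stack([intercept, female])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)
print(reduced.shape[1], rank_reduced)
```

With pandas, passing drop_first=True to get_dummies drops the first category in the same spirit.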

1. Preprocess the data

Import the libraries

Import the dataset

Check for missing data

Encode categorical data

Make dummy variables if necessary, and avoid the dummy variable trap

Feature scaling will be taken care of by the same library we use for the simple linear regression model.

2. Fit the model to the training set

This step is exactly the same as for simple linear regression.

To fit the model to the dataset, we use the LinearRegression class from the sklearn.linear_model module.

We then create an object, regressor, of the LinearRegression class.

Next, we fit the regressor object to our training data using the fit() method of the LinearRegression class.

To predict the results, we call the predict() method on the regressor we trained in the previous step.

The code is as follows.

Step 1: Data Preprocessing

Importing the libraries

import pandas as pd
import numpy as np

Importing the dataset

dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[ : , :-1].values
Y = dataset.iloc[ : ,  4 ].values

Encoding Categorical data

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 3; the encoded columns come first, the other columns pass through unchanged.
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X).astype(float)

Avoiding the Dummy Variable Trap

X = X[: , 1:]

Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

Step 2: Fitting Multiple Linear Regression to the Training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

Step 3: Predicting the Test set results

y_pred = regressor.predict(X_test)

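Ramblings

Without a background in machine learning, you may not follow what this is about. As I said before, this series is suitable for beginners, not for people starting from zero. Once again, I recommend watching Professor Hung-yi Lee's (李宏毅) introductory machine learning course before studying this; it is very well taught.

The original author's code does not display anything when run, so I added two lines here to show the true values and the predicted values.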

print(Y_test)
print(y_pred)
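Printing two arrays side by side is hard to judge by eye, so one option is to boil the comparison down to a single number. The sketch below computes R² (the coefficient of determination) by hand with NumPy. It is my own addition, not part of the original author's code, and the two short lists are made-up values standing in for Y_test and y_pred.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Hypothetical values standing in for Y_test and y_pred.
example_true = [10.0, 20.0, 30.0, 40.0]
example_pred = [12.0, 18.0, 33.0, 39.0]
score = r_squared(example_true, example_pred)
print(score)
```

A value close to 1 means the predictions track the true values well. With the real data you would call r_squared(Y_test, y_pred) instead.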

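I suddenly realized that this post is not very substantial. Next time I may add a walkthrough of the code, or I may not; after all, I am a bit lazy.

100 days of ML --- the 100 Days of Machine Learning challenge (3)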