Day 1 | Data PreProcessing

Get the dataset from here.

Step 1 : Importing the required Libraries

These three are essential libraries which we will often import.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2 : Importing the Dataset

We use the read_csv method of the pandas library to read a local CSV file as a dataframe.

dataset = pd.read_csv("F://Data.csv")
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,3].values
print(dataset)

'''   
        Country  Age	Salary	Purchased
0	France	44.0	72000.0	No
1	Spain	27.0	48000.0	Yes
2	Germany	30.0	54000.0	No
3	Spain	38.0	61000.0	No
4	Germany	40.0	NaN	Yes
5	France	35.0	58000.0	Yes
6	Spain	NaN	52000.0	No
7	France	48.0	79000.0	Yes
8	Germany	50.0	83000.0	No
9	France	37.0	67000.0	Yes
'''
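To see what the two iloc slices select, here is a small self-contained sketch. It reads the first three rows of the table above from an in-memory string instead of F://Data.csv; the names csv_text and df are assumptions made for illustration only.

```python
import io
import pandas as pd

# First three rows of the table above, inlined so the sketch runs without a file.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""

df = pd.read_csv(io.StringIO(csv_text))

X = df.iloc[:, :-1].values  # every column except the last: the features
Y = df.iloc[:, -1].values   # the last column only: the target

print(X.shape)  # (3, 3)
print(list(Y))  # ['No', 'Yes', 'No']
```

Using -1 for the target column, rather than a hard-coded 3, keeps the slice valid if feature columns are added later.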


Step 3 : Handling the Missing Data

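The data we get is rarely complete. We can replace missing values with the mean or median of the column they belong to. We use the SimpleImputer class of sklearn.impute for this task (it replaces the Imputer class of sklearn.preprocessing, which was removed in scikit-learn 0.22).

from sklearn.impute import SimpleImputer
# Replace every NaN in the Age and Salary columns with the mean of its column
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
imputer = imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])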
print(X)

'''
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
'''
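What mean imputation computes can be reproduced by hand. The sketch below uses the Age and Salary columns from the table in Step 2; the helper name fill_with_mean is hypothetical and not part of scikit-learn.

```python
import numpy as np

# Age and Salary columns from the table above; np.nan marks the missing cells.
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

def fill_with_mean(column):
    """Return a copy of column with NaN entries replaced by the mean of the other entries."""
    filled = column.copy()
    filled[np.isnan(filled)] = np.nanmean(column)  # nanmean ignores the NaN cells
    return filled

age_filled = fill_with_mean(age)
salary_filled = fill_with_mean(salary)

print(age_filled[6])     # about 38.78
print(salary_filled[4])  # about 63777.78
```

These are the same two values that appear in rows 6 and 4 of the printed array.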

Step 4 : Encoding Categorical Data

Values such as "Yes" and "No" cannot be used in the mathematical equations of the model, so we need to encode these variables as numbers. To achieve this, we import the LabelEncoder class from sklearn.preprocessing.

The usage of LabelEncoder and OneHotEncoder.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:,0] = labelencoder.fit_transform(X[:,0])
print(X)

'''
[[0 44.0 72000.0]
 [2 27.0 48000.0]
 [1 30.0 54000.0]
 [2 38.0 61000.0]
 [1 40.0 63777.77777777778]
 [0 35.0 58000.0]
 [2 38.77777777777778 52000.0]
 [0 48.0 79000.0]
 [1 50.0 83000.0]
 [0 37.0 67000.0]]
'''

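from sklearn.compose import ColumnTransformer
# One-hot encode column 0 (Country) and pass the remaining columns through unchanged.
# OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22.
ct = ColumnTransformer([("country", OneHotEncoder(), [0])], remainder = "passthrough", sparse_threshold = 0)
X = ct.fit_transform(X).astype(float)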
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
print(X,"\n=======\n",Y)

'''
[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01 7.20000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01 4.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01 5.40000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01 6.10000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01 6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01 5.80000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01 5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01 7.90000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01 8.30000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01 6.70000000e+04]] 
=======
 [0 1 0 0 1 1 0 1 0 1]
'''
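The two encoding steps can be sketched with NumPy alone, which makes the mapping from label to integer code to one-hot row explicit. The four-element countries array is a shortened, illustrative sample.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain"])

# np.unique sorts the distinct labels; return_inverse=True also gives the
# integer code of every entry, which is what a label encoder produces.
labels, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for code i.
one_hot = np.eye(len(labels))[codes]

print(labels)  # ['France' 'Germany' 'Spain']
print(codes)   # [0 2 1 2]
print(one_hot)
```

One-hot encoding matters for Country because the integer codes 0, 1, 2 would otherwise suggest an ordering between countries that does not exist.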

Step 5 : Splitting the dataset into test set and train set

We import the train_test_split function from sklearn.model_selection (the older sklearn.cross_validation module was removed in scikit-learn 0.20). The split is generally 80/20.

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state = 0)
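The idea behind an 80/20 split is to shuffle the row indices and cut them in two. The sketch below is a simplified stand-in, not scikit-learn's implementation, so it will not select the same rows as train_test_split; the function name and the demo arrays are hypothetical.

```python
import numpy as np

def simple_train_test_split(X, Y, test_size=0.2, seed=0):
    """Shuffle the row indices, then use the first test_size fraction as the test set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

# Ten demo rows; row i holds the values (2*i, 2*i + 1) and has target i.
X_demo = np.arange(20).reshape(10, 2)
Y_demo = np.arange(10)

X_tr, X_te, Y_tr, Y_te = simple_train_test_split(X_demo, Y_demo)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Fixing the seed plays the same role as random_state = 0: the split is reproducible from run to run.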

Step 6 : Features Scaling

Scaling is done by feature standardization, also called Z-score normalization: each feature is shifted to zero mean and scaled to unit variance. We import StandardScaler from sklearn.preprocessing.

The usage of StandardScaler.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
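# Reuse the mean and standard deviation learned from the training set;
# calling fit_transform here would rescale the test set with its own statistics.
X_test = sc.transform(X_test)
print(X_train,'\n=====\n',X_test)

The two test rows are scaled with the training statistics. The X_test values below are rounded to four decimals.

'''
[[-1.          2.64575131 -0.77459667  0.26306757  0.12381479]
 [ 1.         -0.37796447 -0.77459667 -0.25350148  0.46175632]
 [-1.         -0.37796447  1.29099445 -1.97539832 -1.53093341]
 [-1.         -0.37796447  1.29099445  0.05261351 -1.11141978]
 [ 1.         -0.37796447 -0.77459667  1.64058505  1.7202972 ]
 [-1.         -0.37796447  1.29099445 -0.0813118  -0.16751412]
 [ 1.         -0.37796447 -0.77459667  0.95182631  0.98614835]
 [ 1.         -0.37796447 -0.77459667 -0.59788085 -0.48214934]] 
=====
 [[-1.      2.6458 -0.7746 -1.4588 -0.9017]
 [-1.      2.6458 -0.7746  1.9850  2.1398]]
'''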
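Standardization itself is a short computation: subtract the column mean and divide by the column standard deviation. The sketch below uses made-up Age and Salary rows and applies the training statistics to a separate test row, which is the usual way to keep both sets on the same scale.

```python
import numpy as np

# Hypothetical (Age, Salary) rows, used for illustration only.
train = np.array([[40.0, 63000.0],
                  [27.0, 48000.0],
                  [48.0, 79000.0],
                  [35.0, 58000.0]])
test = np.array([[30.0, 54000.0]])

mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # same mean and std as the training set

print(train_scaled.mean(axis=0))  # close to [0, 0]
print(train_scaled.std(axis=0))   # close to [1, 1]
print(test_scaled)
```

The scaled test row does not need to have zero mean or unit variance; it only needs to be measured on the scale that was learned from the training data.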

Day 2 | Gradient Descent

Cost Function

The cost function is half the mean squared error between the model's predictions and the real values. Our goal is to minimize the cost.

import numpy as np

def compute_cost(X, y, theta):
    # Initialize some useful values
    m = y.size
    cost = 0

    # ===================== Your Code Here =====================
    # Instructions : Compute the cost of a particular choice of theta.
    #                You should set the variable "cost" to the correct value.

    # Half the mean squared error between the predictions X . theta and the targets y
    cost = np.sum((np.dot(X, theta) - y) ** 2) / (2 * m)

    return cost

The usage of np.dot() is here.
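The section title refers to gradient descent, while the code above covers only the cost. A minimal sketch of batch gradient descent for linear regression follows, assuming the usual cost J(theta) = sum((X . theta - y)^2) / (2m). The function gradient_descent, the parameters alpha and num_iters, and the toy data are illustrative assumptions rather than part of the original exercise.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Half the mean squared error of the linear model X @ theta."""
    m = y.size
    residual = X @ theta - y
    return np.sum(residual ** 2) / (2 * m)

def gradient_descent(X, y, theta, alpha, num_iters):
    """Repeatedly step against the gradient of the cost, with learning rate alpha."""
    m = y.size
    theta = theta.copy()
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m  # derivative of the cost w.r.t. theta
        theta -= alpha * gradient
    return theta

# Toy data generated from y = 1 + 2x; the first column of ones models the intercept.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = gradient_descent(X, y, np.zeros(2), alpha=0.1, num_iters=5000)
print(theta)                      # close to [1, 2]
print(compute_cost(X, y, theta))  # close to 0
```

A learning rate that is too large makes the updates diverge, so alpha usually has to be tuned for each dataset.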

Day 3 | Simple Linear Regression

Step 1 : Data preprocessing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("F://studentscores.csv")
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values

# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Step 2 : Fitting Simple Linear Regression Model to the training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor = regressor.fit(X_train, y_train)

Step 3 : Predicting the result

y_p = regressor.predict(X_test)
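For a single feature, the least-squares line has a closed form, so the fitted result can be checked by hand. The sketch below uses five made-up points (not the studentscores.csv data), and the helper name fit_line is hypothetical.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares slope and intercept for a single feature."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([12.0, 19.0, 32.0, 41.0, 46.0])

slope, intercept = fit_line(x, y)
print(slope, intercept)        # 9.0 3.0
print(slope * 6.0 + intercept) # prediction for x = 6: 57.0
```

The slope and intercept correspond to what LinearRegression exposes as coef_ and intercept_ after fitting.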

Step 4 : Visualization

Visualizing the Training result

plt.scatter(X_train, y_train, color = "red")
plt.plot(X_train, regressor.predict(X_train), color = "blue")
plt.show()

Visualizing the Test result

plt.scatter(X_test, y_test, color = "red")
plt.plot(X_test, regressor.predict(X_test), color = "blue")
plt.show()

Day 4 | Multiple Linear Regression

Step 1 : Data preprocessing

import numpy as np
import pandas as pd

dataset = pd.read_csv("F://50_startups.csv")
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,-1].values

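#One-hot encode the categorical column at index 3 and pass the numeric columns through
#OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22, so ColumnTransformer is used instead
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("category", OneHotEncoder(), [3])], remainder = "passthrough", sparse_threshold = 0)
X = ct.fit_transform(X).astype(float)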

#Avoiding Dummy Variable Trap
X = X[:,1:]

#sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X, Y, test_size = 0.2,random_state = 0)
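The reason for dropping one dummy column can be checked numerically: together with an intercept column, a full set of one-hot columns is linearly dependent, because the dummies in each row always sum to 1. The small matrix below is a hypothetical example with three categories.

```python
import numpy as np

# One-hot columns for three categories (hypothetical rows).
dummies = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [1, 0, 0],
                    [0, 1, 0]], dtype=float)
intercept = np.ones((len(dummies), 1))

full = np.hstack([intercept, dummies])            # intercept plus all three dummies
reduced = np.hstack([intercept, dummies[:, 1:]])  # first dummy dropped

print(np.linalg.matrix_rank(full), full.shape[1])        # 3 4: one column is redundant
print(np.linalg.matrix_rank(reduced), reduced.shape[1])  # 3 3: full column rank
```

With the redundant column removed, the dropped category becomes the baseline that the other dummy coefficients are measured against.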

Step 2 : Fitting the Multiple Linear Regression Model to the Training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,Y_train)

Step 3 : Predicting the Test result

Y_p = regressor.predict(X_test)
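With several features, the same least-squares fit can be sketched with np.linalg.lstsq. The data below is synthetic and generated from known coefficients, so the recovered values can be checked; it is not the 50_startups.csv data, and the variable names are illustrative.

```python
import numpy as np

# Synthetic data generated from y = 2 + 3*x1 - 1*x2, so the true coefficients are known.
features = np.array([[1.0, 2.0],
                     [2.0, 0.0],
                     [3.0, 1.0],
                     [4.0, 3.0],
                     [5.0, 5.0]])
target = 2.0 + 3.0 * features[:, 0] - 1.0 * features[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept.
design = np.hstack([np.ones((len(features), 1)), features])
coef, residuals, rank, singular_values = np.linalg.lstsq(design, target, rcond=None)

predictions = design @ coef
print(coef)  # close to [2, 3, -1]
```

On real data the targets contain noise, so the recovered coefficients only approximate the underlying relationship and the predictions should be compared against a held-out test set.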

100-Days-Of-ML-Code