Table of Contents
Day 3 | Simple Linear Regression
Day 4 | Multiple Linear Regression
Day 1 | Data PreProcessing
Get the dataset from here.
Step 1 : Importing the required Libraries
These three are essential libraries which we will often import.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Step 2 : Importing the Dataset
We use the read_csv method of the pandas library to read a local CSV file as a dataframe.
dataset = pd.read_csv("F://Data.csv")
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,3].values
print(dataset)
'''
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes
'''
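The slicing in Step 2 can be tried without the CSV file. The sketch below rebuilds the first four rows of the table in memory (the inline CSV text is only a stand-in for Data.csv, and the variable names are illustrative) and shows what `iloc[:, :-1]` and `iloc[:, 3]` select.

```python
import io
import pandas as pd

# Stand-in for Data.csv: the first four rows of the table above.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
"""
dataset = pd.read_csv(io.StringIO(csv_text))

X = dataset.iloc[:, :-1].values  # every column except the last: the features
Y = dataset.iloc[:, 3].values    # the column at index 3: the label

print(X.shape, Y.shape)  # (4, 3) (4,)
```

Because `Purchased` is both the last column and the column at index 3, `iloc[:, -1]` would select the same label here.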
Step 3 : Handling the Missing Data
The data we get is rarely complete: some entries are missing. We can replace a missing value with the mean or median of its column. We use the SimpleImputer class of sklearn.impute for this task; it replaces the Imputer class of sklearn.preprocessing, which was removed in scikit-learn 0.22.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
imputer = imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])
print(X)
'''
[['France' 44.0 72000.0]
['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 63777.77777777778]
['France' 35.0 58000.0]
['Spain' 38.77777777777778 52000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]
'''
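The two filled-in values in the output above (38.78 for Age and 63777.78 for Salary) are simply the column means over the observed entries. A minimal sketch with plain NumPy, standing in for the scikit-learn imputer, reproduces them; `fill_with_mean` is a hypothetical helper written for this illustration.

```python
import numpy as np

# Age and Salary columns from the table above, with the two missing entries as NaN.
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

def fill_with_mean(column):
    """Return a copy of column with NaN entries replaced by the mean of the observed values."""
    filled = column.copy()
    filled[np.isnan(filled)] = np.nanmean(column)
    return filled

print(round(fill_with_mean(age)[6], 2))     # 38.78
print(round(fill_with_mean(salary)[4], 2))  # 63777.78
```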
Step 4 : Encoding Categorical Data
Values such as "Yes" and "No" cannot be used in the mathematical equations of the model, so we need to encode these variables as numbers. To achieve this we import the LabelEncoder class from the sklearn.preprocessing library.
The usage of LabelEncoder and OneHotEncoder.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:,0] = labelencoder.fit_transform(X[:,0])
print(X)
'''
[[0 44.0 72000.0]
[2 27.0 48000.0]
[1 30.0 54000.0]
[2 38.0 61000.0]
[1 40.0 63777.77777777778]
[0 35.0 58000.0]
[2 38.77777777777778 52000.0]
[0 48.0 79000.0]
[1 50.0 83000.0]
[0 37.0 67000.0]]
'''
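from sklearn.compose import ColumnTransformer
# OneHotEncoder no longer accepts categorical_features (scikit-learn 0.22+); ColumnTransformer picks column 0 instead
ct = ColumnTransformer([("onehot", OneHotEncoder(), [0])], remainder="passthrough", sparse_threshold=0)
X = ct.fit_transform(X).astype(float)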
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
print(X,"\n=======\n",Y)
'''
[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01 7.20000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01 4.80000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01 5.40000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01 6.10000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01 6.37777778e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01 5.80000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01 5.20000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01 7.90000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01 8.30000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01 6.70000000e+04]]
=======
[0 1 0 0 1 1 0 1 0 1]
'''
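To see what the two encoding stages compute, here is a small sketch in plain NumPy rather than the scikit-learn encoders (a swapped-in illustration, not the library's implementation). Label encoding maps each distinct value to its index in sorted order, and one-hot encoding turns that index into an indicator vector.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain", "Germany"])

# Label encoding: np.unique sorts the distinct values and returns each row's index into them.
classes, codes = np.unique(countries, return_inverse=True)
print(classes)  # ['France' 'Germany' 'Spain']
print(codes)    # [0 2 1 2 1]

# One-hot encoding: row i of the identity matrix is the indicator vector for class i.
one_hot = np.eye(len(classes))[codes]
print(one_hot[1])  # [0. 0. 1.]
```

The integer codes alone would suggest an ordering (Spain > Germany > France) that does not exist in the data, which is why the Country column is one-hot encoded while the two-valued label Y is only label encoded.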
Step 5 : Splitting the dataset into test set and train set
We import the train_test_split function from sklearn.model_selection (the older sklearn.cross_validation module was removed in scikit-learn 0.20). The split is generally 80/20.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state = 0)
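As a quick check of what an 80/20 split produces, the sketch below runs train_test_split on placeholder arrays (`X_demo` and `Y_demo` are made up for this example) and assumes a scikit-learn release that provides the function in sklearn.model_selection.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 placeholder samples with 2 features each
Y_demo = np.arange(10)                 # 10 placeholder labels

X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Fixing random_state makes the shuffle repeatable across runs.
_, X_te_again, _, _ = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=0)
print(np.array_equal(X_te, X_te_again))  # True
```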
Step 6 : Feature Scaling
Done by feature standardization (Z-score normalization): each column is rescaled to zero mean and unit variance. StandardScaler of sklearn.preprocessing is imported.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # reuse the mean and variance fitted on X_train; refitting on the test set would leak test statistics
print(X_train)
'''
[[-1. 2.64575131 -0.77459667 0.26306757 0.12381479]
[ 1. -0.37796447 -0.77459667 -0.25350148 0.46175632]
[-1. -0.37796447 1.29099445 -1.97539832 -1.53093341]
[-1. -0.37796447 1.29099445 0.05261351 -1.11141978]
[ 1. -0.37796447 -0.77459667 1.64058505 1.7202972 ]
[-1. -0.37796447 1.29099445 -0.0813118 -0.16751412]
[ 1. -0.37796447 -0.77459667 0.95182631 0.98614835]
[ 1. -0.37796447 -0.77459667 -0.59788085 -0.48214934]]
'''
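A small sketch of what Z-score normalization computes, on made-up one-column data. It assumes the usual convention that the scaler is fitted on the training data only and then reused for the test data, and it checks the result against the formula (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy training column
test = np.array([[2.5], [5.0]])                 # toy test column

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and standard deviation from train
test_scaled = scaler.transform(test)        # applies the same mean and standard deviation

# The same z-scores by hand, using the population standard deviation (ddof=0).
mean, std = train.mean(axis=0), train.std(axis=0)
print(np.allclose(train_scaled, (train - mean) / std))  # True
print(np.allclose(test_scaled, (test - mean) / std))    # True
```

The test value 2.5 equals the training mean, so it maps to 0; a test set scaled with its own statistics would lose that relationship to the training data.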
Day 2 | Gradient Descent
Cost Function
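The cost function is half the mean squared error between the model's predictions X·theta and the real values y: J(theta) = (1 / (2m)) * sum((X·theta - y)^2), where m is the number of training examples. Our goal is to find the theta that minimizes this cost.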
import numpy as np
def compute_cost(X, y, theta):
    # Initialize some useful values
    m = y.size  # number of training examples
    cost = 0
    # ===================== Your Code Here =====================
    # Instructions : Compute the cost of a particular choice of theta.
    # You should set the variable "cost" to the correct value.
    cost = np.sum((np.dot(X, theta) - y) ** 2) / (2 * m)  # half the mean squared error
    return cost
The usage of np.dot() is here.
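To connect the cost function to gradient descent itself, here is a minimal sketch of batch gradient descent. It assumes the half mean squared error cost J(theta) = (1 / (2m)) * sum((X·theta - y)^2); the `gradient_descent` helper, the toy data, the learning rate and the iteration count are all illustrative choices, not part of the original notes.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Half the mean squared error between predictions X @ theta and targets y."""
    m = y.size
    residuals = X @ theta - y
    return np.sum(residuals ** 2) / (2 * m)

def gradient_descent(X, y, theta, alpha, num_iters):
    """Repeatedly step theta against the gradient of the cost."""
    m = y.size
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m  # dJ/dtheta for the cost above
        theta = theta - alpha * gradient
    return theta

# Toy data generated from y = 1 + 2x; the column of ones lets theta[0] act as the intercept.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x

theta0 = np.zeros(2)
print(compute_cost(X, y, theta0))  # 10.5
theta = gradient_descent(X, y, theta0, alpha=0.1, num_iters=2000)
print(theta)                       # close to [1. 2.]
```

With theta at zero the predictions are all 0, so the cost is (1 + 9 + 25 + 49) / 8 = 10.5; after the updates the cost is essentially zero because the toy data lie exactly on a line.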
Day 3 | Simple Linear Regression
Step 1 : Data preprocessing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("F://studentscores.csv")
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Step 2 : Fitting Simple Linear Regression Model to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor = regressor.fit(X_train, y_train)
Step 3 : Predicting the result
y_p = regressor.predict(X_test)
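What fit and predict do can be checked on data where the answer is known. The sketch below uses made-up numbers (not the studentscores data) that lie exactly on the line 10 * hours + 5, so the fitted slope and intercept should come out as 10 and 5.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # made-up feature, shape (n_samples, 1)
scores = 10 * hours.ravel() + 5                         # made-up target on an exact line

model = LinearRegression().fit(hours, scores)
slope, intercept = model.coef_[0], model.intercept_
print(round(slope, 6), round(intercept, 6))  # 10.0 5.0

prediction = model.predict(np.array([[6.0]]))  # predict expects a 2-D array, one row per sample
print(np.round(prediction, 6))                 # [65.]
```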
Step 4 : Visualization
Visualizing the Training result
plt.scatter(X_train, y_train, color = "red")
plt.plot(X_train, regressor.predict(X_train), color = "blue")
plt.show()  # display this figure before drawing the next one
Visualizing the Test result
plt.scatter(X_test, y_test, color = "red")
plt.plot(X_test, regressor.predict(X_test), color = "blue")
plt.show()
Day 4 | Multiple Linear Regression
Step 1 : Data preprocessing
import numpy as np
import pandas as pd
dataset = pd.read_csv("F://50_startups.csv")
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,-1].values
#Transform the str label to numeric label
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:,3] = labelencoder.fit_transform(X[:,3])
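from sklearn.compose import ColumnTransformer
# OneHotEncoder no longer accepts categorical_features (scikit-learn 0.22+); ColumnTransformer picks column 3 instead
ct = ColumnTransformer([("onehot", OneHotEncoder(), [3])], remainder="passthrough", sparse_threshold=0)
X = ct.fit_transform(X).astype(float)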
#Avoiding Dummy Variable Trap
X = X[:,1:]
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train,X_test,Y_train,Y_test = train_test_split(X, Y, test_size = 0.2,random_state = 0)
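The `X = X[:,1:]` line in Step 1 drops one dummy column. A short NumPy sketch, using a made-up three-level category, shows why: together with an intercept, a full set of dummy columns is linearly dependent (they always sum to 1), and dropping one column removes the redundancy.

```python
import numpy as np

codes = np.array([0, 2, 1, 2, 1, 0])  # label-encoded category with three levels
dummies = np.eye(3)[codes]            # one-hot (dummy) columns
intercept = np.ones((len(codes), 1))

full = np.hstack([intercept, dummies])            # intercept + all three dummy columns
reduced = np.hstack([intercept, dummies[:, 1:]])  # first dummy column dropped

print(np.linalg.matrix_rank(full), full.shape[1])        # 3 4
print(np.linalg.matrix_rank(reduced), reduced.shape[1])  # 3 3
```

The full matrix has four columns but rank 3, so its coefficients are not uniquely determined; the reduced matrix has full column rank.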
Step 2 : Fitting the Multiple Linear Regression Model to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,Y_train)
Step 3 : Predicting the Test result
Y_p = regressor.predict(X_test)