Rookie Diary: 《Python数据分析与挖掘实战》 (Python Data Analysis and Mining in Practice), Lab 6-1: Lagrange Interpolation

Lab 6-1: Using Lagrange interpolation

Problem: use Lagrange interpolation to fill in the null values in the table stored in missing_data.xls.
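For reference, the interpolating polynomial this lab relies on can be written as follows for sample points (x_0, y_0), ..., (x_m, y_m) with distinct x_j. The symbols are my own notation, not necessarily the book's.

```latex
L(x) = \sum_{j=0}^{m} y_j \,\ell_j(x), \qquad
\ell_j(x) = \prod_{\substack{i=0 \\ i \neq j}}^{m} \frac{x - x_i}{x_j - x_i}
```

Each \ell_j equals 1 at x_j and 0 at every other sample point, so L passes through all the samples. A missing value at position n is then estimated as L(n).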

# p1, lab6
# Fill all of the null values with Lagrange's interpolation
# Data file name is "missing_data.xls"


import pandas as pd
from scipy.interpolate import lagrange


dir = 'F:/Data Mining/codes/ch6/lab6_1'     # note: this name shadows the built-in dir() for the rest of the script
data = pd.read_excel(dir + '/data/missing_data.xls', header=None)   # header=None: the table has no header row


def lagrange_interpolate(s, n, k=5):
    # take up to k neighbours on each side of position n (assumes the default integer index)
    # reindex() yields NaN for labels outside the index instead of raising KeyError
    y = s.reindex(list(range(n-k, n)) + list(range(n+1, n+1+k)))
    y = y[y.notnull()]      # y.notnull() returns a boolean Series; this drops nulls and out-of-range labels
    return lagrange(y.index, list(y))(n)
    # function lagrange(x, w) in module scipy.interpolate:
    # x is array-like, the x-coordinates of a set of points
    # w is array-like, the y-coordinates of the same points
    # returns a numpy.poly1d object representing the Lagrange interpolating polynomial
    # WARNING: this implementation is numerically unstable; do not expect to be able to use more than about 20 points
    # calling the poly1d with n evaluates the polynomial at x=n


for col in data.columns:
    for i in range(len(data)):
        if data[col].isnull()[i]:   # Series.isnull() returns a boolean Series
            data.loc[i, col] = lagrange_interpolate(data[col], i)   # .loc[row, column] writes to data itself; data[col][i] = ... is chained assignment and may write to a copy
# mistake I made earlier: omitting [col] in the condition, so isnull() returned a DataFrame rather than a Series


data.to_excel(dir + '/data/result.xls', header=None, index=False)   # header=None and index=False write the table without a header row or index column

Figures: missing_data.xls (input) and result.xls (output)

What did I learn?

pd.read_excel(header=None)  tells pandas that the table has no header row; otherwise the first row of missing_data.xls would be treated as the header

df.to_excel(header=None, index=False)  writes the table without a header row or index; otherwise result.xls would contain a header row and show the index in the leftmost column
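To see the effect of these parameters without needing an Excel engine, here is a small sketch that uses pd.read_csv and to_csv on an in-memory string as a stand-in. The sample text is made up, and I am assuming the header and index arguments behave the same way as in read_excel and to_excel.

```python
import io

import pandas as pd

# hypothetical 3x2 table with no header row and two missing cells
raw = "1.5,2.0\n,3.0\n4.5,\n"

with_header = pd.read_csv(io.StringIO(raw))             # first row is consumed as column names
no_header = pd.read_csv(io.StringIO(raw), header=None)  # first row is kept as data

print(with_header.shape)  # (2, 2)
print(no_header.shape)    # (3, 2)

# write back without a header row or an index column
out = no_header.to_csv(header=False, index=False)
print(len(out.splitlines()))  # 3
```

With the default header handling, one data row is silently lost to the column names, which is easy to miss in a table made only of numbers.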

  • What isnull() and notnull() return

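Both are methods of DataFrame and Series that test for null values, and each returns a boolean DataFrame or Series matching the caller. isnull() marks null positions as True and all others as False; notnull() marks null positions as False and all others as True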
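A quick check of the return types, using a made-up Series and DataFrame:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])   # hypothetical data with one null
print(s.isnull().tolist())          # [False, True, False]
print(s.notnull().tolist())         # [True, False, True]

# a boolean Series can be used as a mask to drop the nulls
print(s[s.notnull()].tolist())      # [1.0, 3.0]

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 2.0]})
mask = df.isnull()                  # same shape as df, all booleans
print(type(mask).__name__)          # DataFrame
```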

  • Locating elements in a DataFrame

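data[column][index] locates the element in the column labeled column at the row labeled index. For assignment, data.loc[index, column] is the safer form, because chained indexing may write to a temporary copy instead of the DataFrame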
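A minimal sketch of both ways to reach a cell, on a made-up 2x2 table with integer column labels like the ones header=None produces:

```python
import pandas as pd

# hypothetical table; columns are labeled 0 and 1, rows 0 and 1
data = pd.DataFrame([[10.0, 20.0], [30.0, 40.0]])

print(data[1][0])       # 20.0  (column 1 first, then row 0)
print(data.loc[0, 1])   # 20.0  (row 0, column 1 in a single lookup)

# assignment through .loc modifies the DataFrame itself
data.loc[0, 1] = 99.0
print(data.loc[0, 1])   # 99.0
```

Note that the order of the labels is reversed between the two forms: column then row for chained indexing, row then column for .loc.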

  • Going out of bounds when extracting data

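In the lagrange_interpolate() function above, the first line extracts the sample points, and clearly both (n-k) and (n+k) can fall outside the index. While debugging, though, I saw that each out-of-bounds label comes back as a null value, and the next statement, which drops null values, then removes those entries. This depends on the pandas version: pandas 1.0 and later raise a KeyError when a list of labels contains missing ones, whereas s.reindex(labels) returns the nulls in any version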
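The window extraction can be tried on its own with a short made-up Series. This is only a sketch: the data follows y = (x + 1) ** 2 so the expected fill value is easy to check, and it uses reindex() for the out-of-range labels.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import lagrange

# hypothetical data: y = (x + 1) ** 2 with the value at position 2 missing
s = pd.Series([1.0, 4.0, np.nan, 16.0, 25.0])
n, k = 2, 5

# labels -3..1 and 3..7; most of them lie outside the index 0..4
labels = list(range(n - k, n)) + list(range(n + 1, n + 1 + k))
window = s.reindex(labels)           # labels outside the index become NaN
window = window[window.notnull()]    # only the real neighbours survive

print(window.index.tolist())         # [0, 1, 3, 4]

poly = lagrange(window.index.to_numpy(), window.to_numpy())
filled = float(poly(n))              # evaluate the polynomial at the gap
print(round(filled, 6))              # 9.0
```

Four points on a parabola determine it exactly, so the interpolated value matches (2 + 1) ** 2 = 9 up to rounding error.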

Reposted from blog.csdn.net/Wyatt__Liu/article/details/83111321