Learning Python: A Deep Dive into How the X, y and X_display, y_display Outputs Differ in the shap.datasets.adult() Source Code

Contents

A deep dive into X, y and X_display, y_display in the shap.datasets.adult() source code

Reading the source code

Understanding the source code

Comparing data and raw_data

X.shape

X_display.shape


A deep dive into X, y and X_display, y_display in the shap.datasets.adult() source code

import shap
X, y = shap.datasets.adult()  # category columns encoded as integers
X_display, y_display = shap.datasets.adult(display=True)  # category columns kept as labels

Reading the source code

def adult(display=False):
    """ Return the Adult census data in a nice package. """
    dtypes = [
        ("Age", "float32"), ("Workclass", "category"), ("fnlwgt", "float32"),
        ("Education", "category"), ("Education-Num", "float32"), ("Marital Status", "category"),
        ("Occupation", "category"), ("Relationship", "category"), ("Race", "category"),
        ("Sex", "category"), ("Capital Gain", "float32"), ("Capital Loss", "float32"),
        ("Hours per week", "float32"), ("Country", "category"), ("Target", "category")
    ]
    raw_data = pd.read_csv(
        cache(github_data_url + "adult.data"),
        names=[d[0] for d in dtypes],
        na_values="?",
        dtype=dict(dtypes)
    )
    data = raw_data.drop(["Education"], axis=1)  # redundant with Education-Num
    filt_dtypes = list(filter(lambda x: not (x[0] in ["Target", "Education"]), dtypes))
    data["Target"] = data["Target"] == " >50K"
    rcode = {
        "Not-in-family": 0,
        "Unmarried": 1,
        "Other-relative": 2,
        "Own-child": 3,
        "Husband": 4,
        "Wife": 5
    }
    for k, dtype in filt_dtypes:
        if dtype == "category":
            if k == "Relationship":
                data[k] = np.array([rcode[v.strip()] for v in data[k]])
            else:
                data[k] = data[k].cat.codes

    if display:
        return raw_data.drop(["Education", "Target", "fnlwgt"], axis=1), data["Target"].values
    return data.drop(["Target", "fnlwgt"], axis=1), data["Target"].values
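The encoding step in the loop above can be tried on a small hand-made frame without downloading anything. This is a minimal sketch, not the library's code: the toy rows and the `encode` helper are assumptions made for illustration, while the relationship mapping, `.cat.codes`, and the `" >50K"` comparison are taken from the source above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that imitate the leading-space strings in adult.data.
raw = pd.DataFrame({
    "Age": np.array([39, 50, 28], dtype="float32"),
    "Workclass": pd.Categorical([" State-gov", " Private", " Private"]),
    "Relationship": pd.Categorical([" Not-in-family", " Husband", " Wife"]),
    "Target": pd.Categorical([" <=50K", " <=50K", " >50K"]),
})

# Same mapping as rcode in the source above.
RCODE = {"Not-in-family": 0, "Unmarried": 1, "Other-relative": 2,
         "Own-child": 3, "Husband": 4, "Wife": 5}


def encode(frame):
    """Illustrative helper: apply the same per-column rules as the loop above."""
    out = frame.copy()
    # The label becomes a boolean: True when income is above 50K.
    out["Target"] = out["Target"] == " >50K"
    for col in out.columns:
        if col == "Target" or not isinstance(out[col].dtype, pd.CategoricalDtype):
            continue  # numeric columns and the label are left alone
        if col == "Relationship":
            # Fixed, hand-written codes; strip() removes the leading space.
            out[col] = np.array([RCODE[v.strip()] for v in out[col]])
        else:
            # pandas assigns codes by the sorted order of the categories.
            out[col] = out[col].cat.codes
    return out


encoded = encode(raw)
print(encoded)
```

The original `raw` frame is not modified, which mirrors how raw_data keeps its string labels while data receives the integer codes.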

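Understanding the source code

Comparing data and raw_data

Conclusion: data is a new frame derived from raw_data (the CSV as read). Three columns are dropped from it in total (Education, Target, fnlwgt), and the remaining category columns are converted to integer codes. raw_data is the original data read from the CSV; with display=True it is returned with the same three columns dropped and is otherwise unchanged. In both cases the second return value is data["Target"].values, so y and y_display are the same boolean array.
Significance: X is the numeric form that a model can consume, while X_display keeps the original labels for human-readable output.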
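One practical consequence of the comparison above is that X and X_display share the same rows and columns, so an integer code in X can be translated back to its label through X_display. The sketch below uses hypothetical two-column stand-ins for the two frames; `code_to_label` is an illustrative name, not part of shap.

```python
import pandas as pd

# Hypothetical stand-in for X_display: category columns keep their string labels.
X_display_toy = pd.DataFrame({
    "Sex": pd.Categorical([" Male", " Female", " Male"]),
    "Race": pd.Categorical([" White", " Black", " White"]),
})

# Hypothetical stand-in for X: same shape, labels replaced by integer codes.
X_toy = pd.DataFrame(
    {col: X_display_toy[col].cat.codes for col in X_display_toy.columns}
)


def code_to_label(display_frame, column):
    """Illustrative helper: build {code: label} from a categorical column."""
    return dict(enumerate(display_frame[column].cat.categories))


sex_labels = code_to_label(X_display_toy, "Sex")
print(X_toy.shape == X_display_toy.shape)
print(sex_labels)
```

This lookup only applies to columns encoded with `.cat.codes`. In the source above, Relationship is encoded with the hand-written rcode dictionary instead, so its codes have to be read from that dictionary.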

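X.shape

X.shape: (32561, 12)
Note: the printout below shows string labels and lowercase column names. That matches the display form rather than the encoded X returned by the source above, where category columns hold integer codes.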
        age         workclass  ...  hours-per-week native-country
0       39         State-gov  ...              40  United-States
1       50  Self-emp-not-inc  ...              13  United-States
2       38           Private  ...              40  United-States
3       53           Private  ...              40  United-States
4       28           Private  ...              40           Cuba
...    ...               ...  ...             ...            ...
32556   27           Private  ...              38  United-States
32557   40           Private  ...              40  United-States
32558   58           Private  ...              40  United-States
32559   22           Private  ...              20  United-States
32560   52      Self-emp-inc  ...              40  United-States

[32561 rows x 12 columns]
First 10 rows, all columns:
   age  workclass         education-num  marital-status         occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country
0  39   State-gov         13             Never-married          Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States
1  50   Self-emp-not-inc  13             Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             13              United-States
2  38   Private           9              Divorced               Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States
3  53   Private           7              Married-civ-spouse     Handlers-cleaners  Husband        Black  Male    0             0             40              United-States
4  28   Private           13             Married-civ-spouse     Prof-specialty     Wife           Black  Female  0             0             40              Cuba
5  37   Private           14             Married-civ-spouse     Exec-managerial    Wife           White  Female  0             0             40              United-States
6  49   Private           5              Married-spouse-absent  Other-service      Not-in-family  Black  Female  0             0             16              Jamaica
7  52   Self-emp-not-inc  9              Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             45              United-States
8  31   Private           14             Never-married          Prof-specialty     Not-in-family  White  Female  14084         0             50              United-States
9  42   Private           13             Married-civ-spouse     Exec-managerial    Husband        White  Male    5178          0             40              United-States

X_display.shape

X_display.shape: (32561, 12)
        age         workclass  ...  hours-per-week native-country
0       39         State-gov  ...              40  United-States
1       50  Self-emp-not-inc  ...              13  United-States
2       38           Private  ...              40  United-States
3       53           Private  ...              40  United-States
4       28           Private  ...              40           Cuba
...    ...               ...  ...             ...            ...
32556   27           Private  ...              38  United-States
32557   40           Private  ...              40  United-States
32558   58           Private  ...              40  United-States
32559   22           Private  ...              20  United-States
32560   52      Self-emp-inc  ...              40  United-States

[32561 rows x 12 columns]
First 10 rows, all columns:
   age  workclass         education-num  marital-status         occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country
0  39   State-gov         13             Never-married          Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States
1  50   Self-emp-not-inc  13             Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             13              United-States
2  38   Private           9              Divorced               Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States
3  53   Private           7              Married-civ-spouse     Handlers-cleaners  Husband        Black  Male    0             0             40              United-States
4  28   Private           13             Married-civ-spouse     Prof-specialty     Wife           Black  Female  0             0             40              Cuba
5  37   Private           14             Married-civ-spouse     Exec-managerial    Wife           White  Female  0             0             40              United-States
6  49   Private           5              Married-spouse-absent  Other-service      Not-in-family  Black  Female  0             0             16              Jamaica
7  52   Self-emp-not-inc  9              Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             45              United-States
8  31   Private           14             Never-married          Prof-specialty     Not-in-family  White  Female  14084         0             50              United-States
9  42   Private           13             Married-civ-spouse     Exec-managerial    Husband        White  Male    5178          0             40              United-States

Reposted from blog.csdn.net/qq_41185868/article/details/125611687