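Kaggle Data Science Practitioners Analysis Report

Analyzing the Data Science Practitioner Survey with Python

Data Description

On August 26, 2017, Kaggle, the world's largest data science community, released the dataset from its industry-wide survey on the state of data science and machine learning. The responses were collected from August 7 to August 25, 2017. The respondents include 16,716+ practitioners from more than 50 countries. Working from this survey dataset, we dig out some useful findings.

Variable Selection

The dataset has 228 variables in total. We analyze a subset of the questions we find interesting. The variable description table (variable name, type, and number of missing values) is built in the preprocessing step below.

Data Preprocessing

Responses from the Taiwan region are counted under China, and the variable description table is generated.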

import pandas as pd

# Load the multiple-choice responses
data = pd.read_csv('multipleChoiceResponses.csv', encoding="unicode_escape")
print(data.shape)
data.info()
col_name = list(data.columns)

# Merge both spellings into a single 'China' label; other values are left as-is
data['Country'] = data['Country'].replace(
    {"People 's Republic of China": 'China', 'Republic of China': 'China'})

# Select the variables of interest (copy so later assignments do not touch data)
df = data[[
            'Age', 'Country', 'CurrentJobTitleSelect','MLToolNextYearSelect',
            'MLMethodNextYearSelect', 'EmploymentStatus', 'FormalEducation',
            'CoursePlatformSelect', 'FirstTrainingSelect', 'Tenure','JobSatisfaction',
            'LanguageRecommendationSelect', 'JobSkillImportancePython',
            'JobSkillImportanceR', 'JobSkillImportanceSQL', 'JobSkillImportanceBigData'
            ]].copy()

# Variable description table: name, Python type of a sample value, missing count
v_illustration = pd.DataFrame({
    'Variable': df.columns,
    'Type': [type(df.iat[1, i]) for i in range(df.shape[1])],
    'Missing': df.isnull().sum().values})
v_illustration.to_csv('v_illustration.csv', encoding='utf-8-sig')
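
To make the two preprocessing steps easier to check, here is a minimal sketch on a toy DataFrame. The rows are made up for illustration and are not taken from the survey; only the two country spellings come from the code above. It recodes the spellings with `Series.replace` and builds a small summary of dtype and missing count per column.

```python
import pandas as pd

# Toy stand-in for the survey file; the values below are made up for illustration.
toy = pd.DataFrame({
    "Country": ["People 's Republic of China", "Republic of China", "India", None],
    "Age": [25.0, 31.0, None, 40.0],
})

# Recode both spellings to a single label; other values (and missing ones) pass through.
toy["Country"] = toy["Country"].replace(
    {"People 's Republic of China": "China", "Republic of China": "China"}
)

# One row per variable: dtype and number of missing values.
summary = pd.DataFrame({
    "variable": toy.columns,
    "dtype": toy.dtypes.astype(str).values,
    "missing": toy.isnull().sum().values,
})
print(summary)
```

Using `dtypes` instead of the type of a single cell is a design choice of this sketch: it describes the whole column rather than one sampled row.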

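Univariate Analysis

Distribution of Categorical Variables

We wrap the drawing of a horizontal bar chart in a function so that it can be reused.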

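import matplotlib.pyplot as plt

# color_list is not defined in the original post; matplotlib's tab10 palette is assumed here
color_list = list(plt.cm.tab10.colors)

def type_to_bar(type_obj):
    '''Take a pd.Series and draw a horizontal bar chart of its top-10 value counts.'''
    count = type_obj.value_counts()
    if len(count) > 10:
        count = count[:10]  # keep only the top ten
    count = count[::-1]  # reverse so the most frequent value is drawn on top
    position = range(len(count))
    color = color_list[:len(count)]
    plt.barh(y=position, width=count, color=color[::-1])
    # plt.title(type_obj.name)
    plt.yticks(ticks=position, labels=count.index)
    # label each bar with its count, right-aligned at the end of the bar
    for i, j in zip(count.values, position):
        plt.text(i, j, s=str(int(i)), ha='right', va='center', color='white')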
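
The ordering logic inside the function (count, keep the top entries, then reverse) can be checked without drawing anything. The sketch below isolates it in a hypothetical helper, `top_categories`, which is not part of the original post; the sample answers are made up.

```python
import pandas as pd

def top_categories(series, n=10):
    """Return the n most frequent values, least frequent first (the order plt.barh needs)."""
    counts = series.value_counts().head(n)  # missing values are dropped, sorted descending
    return counts.iloc[::-1]                # reversed so the largest bar ends up on top

answers = pd.Series(["Python", "R", "Python", "SQL", "Python", "R", None])
ordered = top_categories(answers, n=2)
print(ordered)
```

Because `barh` draws position 0 at the bottom, the reversed order is what places the most frequent category at the top of the chart.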

Respondent Distribution by Country

fig = plt.figure(dpi=140)
type_obj= df['Country']
type_to_bar(type_obj)

Respondent Distribution by Job Title

# Distribution of respondents' job titles
fig = plt.figure(dpi=140)
type_obj= df['CurrentJobTitleSelect']
type_to_bar(type_obj)

Machine Learning Tools Respondents Plan to Learn Next Year

# Machine learning tools respondents plan to learn next year
fig = plt.figure(dpi=140)
type_obj= df['MLToolNextYearSelect']
type_to_bar(type_obj)

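Remaining Variables

The remaining categorical variables are handled the same way and can all be plotted in one loop. The figures are not shown here.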

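# Age is numeric, so loop over the text (object-dtype) columns only
for i in df.select_dtypes(include='object').columns:
    fig = plt.figure(dpi=140)
    type_obj = df[i]
    type_to_bar(type_obj)
    plt.title(i)  # name each figure after its variable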

Distribution of Numeric Variables

Respondent Age Distribution

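import seaborn as sns

# Age distribution for a single variable,
# with a vertical line marking the mean
fig = plt.figure(figsize=(10,6), dpi=200)
sns.kdeplot(df['Age'], fill=True, alpha=.7)  # fill replaces the deprecated shade argument
plt.suptitle('Age distribution')
x = df['Age'].mean()
plt.axvline(x, color='red')
plt.text(x, 0.04, f'Mean age {x:.0f}', size=15)  # the mean is about 32 in this dataset
plt.show()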
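
For readers wondering what the density curve represents, the following is a rough NumPy sketch of a Gaussian kernel density estimate. It is not seaborn's implementation: the helper `gaussian_kde_1d`, the hand-picked bandwidth, and the sample ages are all assumptions made for illustration, whereas seaborn chooses the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    samples = samples[~np.isnan(samples)]               # ignore missing ages
    z = (grid[:, None] - samples[None, :]) / bandwidth  # standardised distances
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

ages = np.array([22, 25, 25, 31, 34, 40, np.nan, 52])   # made-up ages
grid = np.linspace(0, 80, 801)
density = gaussian_kde_1d(ages, grid, bandwidth=4.0)

# Riemann-sum approximation of the area under the curve; a density should give about 1.
area = density.sum() * (grid[1] - grid[0])
mean_age = np.nanmean(ages)                              # where the vertical line would go
print(round(area, 3), round(mean_age, 1))
```

A larger bandwidth gives a smoother curve and a smaller one a spikier curve, which is worth keeping in mind when reading peaks in the age plot.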

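Bivariate Analysis

Effect of a Categorical Variable on a Numeric Variable

Taking age differences across countries as an example, we use a horizontal bar chart to show the mean age in each country.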

# Bivariate analysis: effect of country on age (categorical vs. numeric)
# Horizontal bar chart
def num_to_bar(num, label):
    '''Take a numeric pd.Series and a label pd.Series,
    and draw a horizontal bar chart of the per-category means.'''
    # mean of num within each category, largest first
    count = num.groupby(label).mean().sort_values(ascending=False)
    if len(count) > 10:
        count = count[:10]  # keep only the top ten
    count = count[::-1]  # reverse so the largest mean is drawn on top
    position = range(len(count))
    color = color_list[:len(count)]
    plt.barh(y=position, width=count, color=color[::-1])
    plt.title(num.name)
    plt.yticks(ticks=position, labels=count.index)
    # label each bar with its (truncated) mean
    for i, j in zip(count.values, position):
        plt.text(i, j, s=str(int(i)),
                 ha='right', va='center', color='white')

# Compare respondents' mean age across countries
fig = plt.figure(dpi=140)
num = df['Age']; label = df['Country']
num_to_bar(num, label)
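
Ranking categories by their mean can be dominated by groups with very few respondents. As a hedged sketch of one way to guard against that, the hypothetical helper `group_means` below adds a `min_count` threshold; the helper, the parameter, and the toy data are assumptions and do not appear in the original analysis.

```python
import pandas as pd

def group_means(num, label, min_count=1, top=10):
    """Mean of num per label, keeping only labels with at least min_count values."""
    stats = num.groupby(label).agg(["mean", "count"])
    stats = stats[stats["count"] >= min_count]
    return stats["mean"].sort_values(ascending=False).head(top)

ages = pd.Series([30, 40, 20, 22, 24, 60], name="Age")
country = pd.Series(["A", "A", "B", "B", "B", "C"], name="Country")

print(group_means(ages, country))               # C ranks first on a single respondent
print(group_means(ages, country, min_count=2))  # C is filtered out
```

Whether such a threshold is appropriate depends on the question being asked; the charts in this report use the plain per-category mean.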

Job Satisfaction of Data Practitioners by Country

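import numpy as np

# Compare respondents' job satisfaction across countries
def trans_jbsatisfy(x):
    '''Map a JobSatisfaction answer to a number from 1 to 10 (NaN if unavailable).'''
    if isinstance(x, float):  # already numeric, including NaN for missing answers
        return x
    if x == 'I prefer not to share':
        return np.nan
    # answers look like '10 - Highly Satisfied', '1 - Highly Dissatisfied', or '7'
    try:
        return int(str(x).split(' - ')[0])
    except ValueError:
        return np.nan  # unexpected answer: treat as missing

df = df.assign(JobSatisfaction=df['JobSatisfaction'].apply(trans_jbsatisfy))
fig = plt.figure(dpi=140)
num = df['JobSatisfaction']; label = df['Country']
num_to_bar(num, label)
plt.title("Switzerland has the highest job satisfaction")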
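
The same conversion can also be written without a row-by-row function. The sketch below is an alternative, not the method used above: it assumes every usable answer starts with its numeric score, and the sample answers are only a small illustrative subset.

```python
import pandas as pd

raw = pd.Series([
    "10 - Highly Satisfied", "1 - Highly Dissatisfied",
    "7", "I prefer not to share", None,
])

# Grab the leading digits; rows without any (or missing rows) become NaN.
score = pd.to_numeric(raw.str.extract(r"^(\d+)", expand=False), errors="coerce")
print(score.tolist())
```

The vectorized form is shorter, but the explicit function makes the handling of each answer type easier to read.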

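Effect of a Categorical Variable on a Categorical Variable

Taking the effect of country on the recommended programming language as an example: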

fig = plt.figure(figsize=(10,6),dpi=140)
plt.subplot(1,2,1)
type_to_bar(df.loc[df['Country']=='China','LanguageRecommendationSelect'])
plt.title('China')
plt.subplot(1,2,2)
type_obj=df.loc[df['Country']=='United States','LanguageRecommendationSelect']
type_to_bar(type_obj)
plt.title('America')
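
Side-by-side bar charts compare raw counts, which differ a lot between countries with different numbers of respondents. As a complementary sketch, a row-normalized cross-tabulation compares shares instead. The toy rows below are made up; only the column names come from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["China", "China", "China", "United States", "United States"],
    "LanguageRecommendationSelect": ["Python", "Python", "R", "Python", "R"],
})

# Row-normalised cross-tabulation: each row sums to 1.
share = pd.crosstab(toy["Country"], toy["LanguageRecommendationSelect"],
                    normalize="index")
print(share.round(2))
```

On the real data, the same call with `df['Country']` and `df['LanguageRecommendationSelect']` would give one row per country, which could then be filtered to the countries of interest.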
