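ch02: US Baby Names, 1880-2010

The US Social Security Administration (SSA) publishes data on the frequency of baby names from 1880 to 2010. Hadley Wickham (author of many popular R packages) often uses this dataset to demonstrate data manipulation in R.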

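Contents

Understanding the data
Loading the data
Changes in the number of births
Analyzing naming trends
Measuring the increase in naming diversity
The "last letter" revolution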

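Understanding the data

Probably because the dataset is fairly large, the data used in this post is split across many files named "yob***.txt". Each file has a .txt extension, but its content is actually comma-separated (CSV).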


The files are likewise available from Git, at the same path as in the previous two posts.

Loading the data

Let's first try opening a single file.

import pandas as pd
import numpy as np
import os

path = 'C:/.../pydata-book-1st-edition/ch02/names/yob1880.txt'
os.chdir(os.path.dirname(path))
columns = ['names', 'sex', 'births']
names1880 = pd.read_csv(os.path.basename(path), names = columns)

This gives the following result:

names1880[:5]
Out[105]: 
       names sex  births
0       Mary   F    7065
1       Anna   F    2604
2       Emma   F    2003
3  Elizabeth   F    1939
4     Minnie   F    1746

Since the files need to be combined into one table, I wrote a small loop:

years = range(1880, 2011)
pieces = []

for year in years:
    path = 'C:/.../pydata-book-1st-edition/ch02/names/yob%d.txt'%year
    os.chdir(os.path.dirname(path))
    frame = pd.read_csv(os.path.basename(path), names = columns)
    frame['year'] = year  # tag each row with the year of its file
    pieces.append(frame)
names = pd.concat(pieces, ignore_index=True)
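The loop above can also be written as a small helper plus a list comprehension. The sketch below is only an illustration under stated assumptions: `load_year`, `data_dir`, and `names_demo` are hypothetical names, and the rows are made-up stand-in values written to a temporary directory, not SSA figures.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in rows (not SSA figures) written to a temporary directory,
# mimicking the yobYYYY.txt layout: name,sex,births with no header row.
sample = {
    1880: "Mary,F,10\nJohn,M,8\n",
    1881: "Mary,F,12\nJohn,M,9\n",
}
data_dir = tempfile.mkdtemp()
for year, text in sample.items():
    with open(os.path.join(data_dir, "yob%d.txt" % year), "w") as f:
        f.write(text)

columns = ["names", "sex", "births"]


def load_year(year):
    """Read one year's file and tag every row with that year."""
    frame = pd.read_csv(os.path.join(data_dir, "yob%d.txt" % year), names=columns)
    frame["year"] = year
    return frame


# Concatenate all years into one frame with a fresh 0..n-1 index.
names_demo = pd.concat([load_year(y) for y in sorted(sample)], ignore_index=True)
print(names_demo)
```

Building the path with os.path.join keeps the directory in one place, so there is no need to change the working directory on every iteration.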

Changes in the number of births

With the combined data, either groupby or pivot_table can aggregate the births at the year and sex level.

total_births = names.pivot_table(values = 'births', index = ['year'], 
                                     columns = ['sex'], aggfunc = sum)
total_births.plot(title='Total births by sex and year')

[Figure: Total births by sex and year]
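Since the text mentions groupby as an alternative to pivot_table, here is a minimal sketch showing that the two give the same table. The `toy` frame holds made-up numbers and only imitates the column layout of names.

```python
import pandas as pd

# Made-up rows with the same columns as the names frame.
toy = pd.DataFrame({
    "names": ["Mary", "Anna", "John", "Mary", "John"],
    "sex": ["F", "F", "M", "F", "M"],
    "births": [10, 5, 8, 12, 9],
    "year": [1880, 1880, 1880, 1881, 1881],
})

# Route 1: pivot_table with year as rows and sex as columns.
by_pivot = toy.pivot_table(values="births", index="year", columns="sex", aggfunc="sum")

# Route 2: group by both keys, sum, then move sex from the index to the columns.
by_groupby = toy.groupby(["year", "sex"])["births"].sum().unstack("sex")

print(by_groupby)
```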

Analyzing naming trends

To study how individual names trend over time, we first need each name's share of all births within its year and sex group.

def add_prop(group):
    births = group.births.astype(float)
    group['prop'] = births/births.sum()
    return group
names = names.groupby(['year', 'sex']).apply(add_prop)

This gives:

names[:5]
Out[112]: 
       names sex  births  year      prop
0       Mary   F    7065  1880  0.077643
1       Anna   F    2604  1880  0.028618
2       Emma   F    2003  1880  0.022013
3  Elizabeth   F    1939  1880  0.021309
4     Minnie   F    1746  1880  0.019188

After normalizing, it is generally a good idea to verify that the proportions in each group sum to 1:

In[113]: np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
Out[113]: True
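The same normalization can be done without a helper function by using groupby with transform, which returns the group total aligned to every row. This is a sketch on made-up numbers, not the code used in this post.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same columns as the names frame.
toy = pd.DataFrame({
    "names": ["Mary", "Anna", "John", "Mary", "John"],
    "sex": ["F", "F", "M", "F", "M"],
    "births": [10, 5, 8, 12, 9],
    "year": [1880, 1880, 1880, 1881, 1881],
})

# transform("sum") repeats each (year, sex) total on every row of that group,
# so a plain division yields each name's share within its group.
group_total = toy.groupby(["year", "sex"])["births"].transform("sum")
toy["prop"] = toy["births"] / group_total

# Sanity check: the shares in every group should add up to 1.
check = toy.groupby(["year", "sex"])["prop"].sum()
print(np.allclose(check, 1))
```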

To reduce the amount of data we work with, we next keep only the 1000 most frequent names for each year and sex combination.

def get_top1000(group):
    return group.sort_values(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)

The dataset is now much smaller.
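A top-n-per-group selection can also be sketched without apply: sort the whole frame once, then take the first n rows of each group. The example uses made-up numbers and n = 2 so the effect is easy to see.

```python
import pandas as pd

# Made-up rows: three girl names and two boy names in a single year.
toy = pd.DataFrame({
    "names": ["Mary", "Anna", "Emma", "John", "William"],
    "sex": ["F", "F", "F", "M", "M"],
    "births": [10, 5, 3, 8, 7],
    "year": [1880] * 5,
})

n = 2
# After sorting by births, head(n) keeps the n largest rows of each (year, sex) group.
top_n = (
    toy.sort_values("births", ascending=False)
       .groupby(["year", "sex"])
       .head(n)
)
print(top_n)
```

Unlike the apply version, the result keeps the original flat index rather than gaining year and sex as index levels.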

boys = top1000[top1000.sex =='M']
girls = top1000[top1000.sex == 'F']

total_births = top1000.pivot_table(values = 'births', index=['year'], 
                                     columns = ['names'], aggfunc=sum)

subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
subset.plot(subplots=True, figsize=(12,10), grid=False, title='Number of births per year')

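Here pivot_table first gives the number of births for each year and each name (the row and column dimensions, respectively). We then select four names, "John" and three others, from the name columns and look at how their counts change over the years.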

[Figure: Number of births per year, one panel each for John, Harry, Mary, and Marilyn]

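Measuring the increase in naming diversity

Subjective measures are often the biggest headache, but for this problem we can use simple metrics instead: the proportion of births covered by the 1000 most popular names, or the number of distinct names that make up the top 50% of births. (Interestingly, computing a Gini coefficient also starts by sorting the items.)
The first metric is surprisingly easy to implement: we only need to call pivot_table in a different way, because the top-1000 problem is already solved.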

table = top1000.pivot_table(values = 'prop', index = ['year'], columns='sex', aggfunc = sum)
table.plot(title='Sum of table.prop by year and sex', yticks = np.linspace(0,1.2,13))  # the xticks argument is cut off in the source, so it is omitted here

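[Figure: Sum of table.prop by year and sex]
The second metric is not particularly complicated either.

Let's first illustrate the principle with an example.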

df = boys[boys.year == 2010]
prop_cumsum = df.sort_values(by='prop', ascending=False).prop.cumsum()

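First we build the cumulative share of the top n names, sorted from most to least popular (the same kind of ordered list used when computing a Gini coefficient). This gives: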

prop_cumsum[:5]
Out[117]: 
year  sex         
2010  M    1676644    0.011523
           1676645    0.020934
           1676646    0.029959
           1676647    0.038930
           1676648    0.047817
Name: prop, dtype: float64

Next, use the searchsorted function (see the appendix for details and examples).

prop_cumsum.searchsorted(0.5)

def get_quantile_count(group, q=0.5):
    group = group.sort_values(by='prop', ascending=False)
    return group.prop.cumsum().searchsorted(q)+1

diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')
diversity.head()
diversity = diversity.astype('int')

diversity.plot(title='Number of popular names in top 50%')

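Note that the halfway point is found by calling searchsorted, which returns the position at which 0.5 would have to be inserted into the sorted cumulative sum. Because positions are zero-based, 1 is added to turn the position into a count of names.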
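To make the searchsorted step concrete, here is a tiny worked example with four made-up shares. It calls np.searchsorted on the underlying array, which sidesteps differences in what Series.searchsorted returns across pandas versions.

```python
import numpy as np
import pandas as pd

# Made-up shares for four names; they sum to 1.
props = pd.Series([0.3, 0.1, 0.4, 0.2], index=["B", "D", "A", "C"])

# Sort from most to least popular and accumulate: roughly 0.4, 0.7, 0.9, 1.0.
cum = props.sort_values(ascending=False).cumsum()

# Zero-based position where 0.5 would be inserted to keep the array sorted.
position = int(np.searchsorted(cum.values, 0.5))

# Number of names needed to reach at least half of all births.
count = position + 1
print(position, count)
```

Here the most popular name alone covers 0.4, which is below 0.5, so two names are needed.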

[Figure: Number of popular names in top 50%]

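The "last letter" revolution

In 2007, baby name researcher Laura Wattenberg pointed out on her website that the distribution of the last letters of boys' names has changed significantly over the last 100 years. To see this in detail, "I" first aggregate all of the birth data by year, sex, and final letter.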

get_last_letter = lambda x:x[-1]
last_letters = names.names.map(get_last_letter)
last_letters.name = 'last_letter'
table = names.pivot_table(values = 'births', index =last_letters, 
                           columns = ['sex', 'year'], aggfunc=sum)

This is again an aggregation done with pivot_table.
Next, select three representative years:

subtable = table.reindex(columns=[1910, 1960, 2010], level='year')

This gives the following result:

subtable.head()
Out[122]: 
sex                 F                            M                    
year             1910      1960      2010     1910      1960      2010
last_letter                                                           
a            108376.0  691247.0  670605.0    977.0    5204.0   28438.0
b                 NaN     694.0     450.0    411.0    3912.0   38859.0
c                 5.0      49.0     946.0    482.0   15476.0   23125.0
d              6750.0    3729.0    2607.0  22111.0  262112.0   44398.0
e            133569.0  435013.0  313833.0  28655.0  178823.0  129012.0
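As a variation on the aggregation above, the last letter can be extracted with the vectorized .str accessor and stored as a regular column before pivoting. This is a sketch on made-up rows; `with_letter` and `letter_table` are names chosen for the illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same columns as the names frame.
toy = pd.DataFrame({
    "names": ["Mary", "Anna", "John", "Brandon", "Emily"],
    "sex": ["F", "F", "M", "M", "F"],
    "births": [10, 5, 8, 4, 6],
    "year": [1910, 1910, 1910, 2010, 2010],
})

# .str[-1] takes the final character of every name in one vectorized step.
with_letter = toy.assign(last_letter=toy["names"].str[-1])

# Rows: last letter. Columns: (sex, year). Cells: summed births.
letter_table = with_letter.pivot_table(
    values="births", index="last_letter", columns=["sex", "year"], aggfunc="sum"
)
print(letter_table)
```

Combinations that never occur, such as boys' names ending in "a" here, show up as NaN, just like the NaN for "b" in the 1910 girls column above.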

To study proportions rather than raw counts, normalize each column by its total:

subtable.sum()
letter_prop = subtable / subtable.sum().astype(float)
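Dividing a DataFrame by the Series returned from sum() aligns on the column labels, so every column is divided by its own total. A minimal sketch with made-up counts:

```python
import pandas as pd

# Made-up counts: rows are last letters, columns are years.
counts = pd.DataFrame({1910: [30.0, 10.0], 2010: [20.0, 60.0]}, index=["a", "n"])

# sum() adds down each column, giving one total per year.
col_totals = counts.sum()

# The division matches each column with the total that carries the same label.
letter_share = counts / col_totals
print(letter_share)
```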

Then generate the plot:

import matplotlib.pyplot as plt
fig, axes = plt.subplots(2,1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)

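[Figure: proportion of names ending in each letter for 1910, 1960, and 2010, with Male and Female panels]

This view is organized by letter, so the trend across years does not show up well. To show the yearly trend more clearly, we can restrict the plot to a few letters.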

letter_prop = table/table.sum().astype(float)
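dny_ts = letter_prop.loc[['d', 'n', 'y'], 'M'].T
dny_ts.plot()

The main job of this code is to use label-based indexing (loc, which replaces the removed ix indexer) to extract the letters we want to study.
[Figure: proportion of boys' names ending in d, n, and y over time]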

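Boy names that became girl names (and vice versa)

Here we do not show how to find such names. Instead, we check how one known family of names shifted between boys and girls.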

all_names = top1000.names.unique()
mask = np.array(['lesl' in x.lower() for x in all_names])
lesley_like = all_names[mask]

This gives:

In[127]: lesley_like
Out[127]: array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)

Use this result to filter out all other names, then group by name and sum the births to see the relative frequencies.

filtered = top1000[top1000.names.isin(lesley_like)]
filtered.groupby('names').births.sum()
table = filtered.pivot_table(values = 'births', index='year', columns='sex', aggfunc='sum')
table = table.div(table.sum(1), axis=0)

This gives:

table.tail()
Out[128]: 
sex     F   M
year         
2006  1.0 NaN
2007  1.0 NaN
2008  1.0 NaN
2009  1.0 NaN
2010  1.0 NaN
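The NaN values in the M column come from years in which no matching boys' name appears in the filtered data: the row sum skips the missing value, so F becomes 1.0 while M stays NaN. The sketch below reproduces that behavior with made-up counts.

```python
import numpy as np
import pandas as pd

# Made-up yearly counts by sex; the second year has no boys at all.
counts = pd.DataFrame(
    {"F": [8.0, 100.0], "M": [92.0, np.nan]},
    index=[1900, 2010],
)

# sum(axis=1) adds across each row and ignores NaN by default.
row_totals = counts.sum(axis=1)

# div(..., axis=0) matches the totals to rows, so each row is divided by its own total.
share = counts.div(row_totals, axis=0)
print(share)
```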

Finally, plot the result:

table.plot(style = {'M': 'k-', 'F': 'k--'})

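[Figure: proportion of Lesley-like names given to girls (F) and boys (M) over time]

All done!

Appendix 1

Official documentation for searchsorted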

searchsorted(a, v, side='left', sorter=None)
    Find indices where elements should be inserted to maintain order.
Examples
    --------
    np.searchsorted([1,2,3,4,5], 3)
    2
    np.searchsorted([1,2,3,4,5], 3, side='right')
    3
    np.searchsorted([1,2,3,4,5], [-10, 10, 2, 3])
    array([0, 5, 1, 2])

Loading the data was troublesome at first. For example, reading a file directly by its full path, as below, raised OSError: Initializing from file failed

names1880 = pd.read_csv(path, names = columns)

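Following a fellow blogger's suggestion, I switched to the approach used at the start of this post: changing the working directory with the os library and reading the file by its base name.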
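Another option, which I have not checked against the exact error above, is to open the file in Python and pass the open file object to read_csv, so that pandas never has to resolve the path itself. The sketch assumes the error comes from how the path is handled, and that is only a guess. It uses a temporary file with two rows copied from the yob1880.txt output shown earlier.

```python
import os
import tempfile

import pandas as pd

# Stand-in file in a temporary directory; the two rows match the output shown earlier.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "yob1880.txt")
with open(demo_path, "w") as f:
    f.write("Mary,F,7065\nAnna,F,2604\n")

columns = ["names", "sex", "births"]

# read_csv accepts an open file object as well as a path string.
with open(demo_path, "r") as f:
    names1880_demo = pd.read_csv(f, names=columns)

print(names1880_demo)
```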

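The trade-off mentioned earlier, between the number of letters shown and how clearly the change over time can be read, is real. Plotting every letter gives an image like this:

[Figure: time series of every last letter plotted together]

1. The "last letter" revolution section is my favorite part of this post. Nothing is more enjoyable than extending what you have learned to something new.
2. Also, isn't that "baby name researcher" basically a fortune teller who picks names, hahaha. Just joking.
3. Finally, it occurs to me that many of the people born in this data are now over 60 or even older, so calling them boys and girls is not quite appropriate.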