hadoop (4): A small project combining Python code with Hadoop

The contents of mapper.py and reducer.py are adapted from the following blog post: https://blog.csdn.net/marywang56/article/details/80395519

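Hadoop itself is written in Java and runs on the JVM, but the hadoop-streaming utility that ships with it lets us use Python scripts as the mapper and reducer. Hadoop passes data to the scripts on stdin and reads their results from stdout.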
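
To make the stdin/stdout contract concrete, here is a minimal sketch of the line format Hadoop Streaming uses by default: each line is one record, the text before the first tab is the key, and the rest is the value. The helper names format_record and parse_record are made up for illustration and are not part of any Hadoop API.

```python
# Sketch of the default Hadoop Streaming record format: "key<TAB>value".
# format_record and parse_record are hypothetical helper names.

def format_record(key, value):
    """Build one output line the way a streaming mapper would print it."""
    return '%s\t%s' % (key, value)


def parse_record(line):
    """Split one input line into (key, value) at the first tab.

    A line without a tab is treated here as a key with an empty value.
    """
    key, _, value = line.rstrip('\n').partition('\t')
    return key, value


if __name__ == '__main__':
    record = format_record('hadoop', 1)
    print(repr(record))
    print(parse_record(record + '\n'))
```

Values arrive as text, so a reducer has to convert the count back to an integer before adding it up.
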
(1) Start YARN
Run the jps command to check that the daemons are running.
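
The check can also be done programmatically. The sketch below parses text in the "<pid> <name>" form that jps prints and reports which YARN daemons are missing. It assumes a single-node setup where ResourceManager and NodeManager run on the same machine; the sample output and the function names are invented for illustration.

```python
# Sketch: check that the YARN daemons appear in the text printed by `jps`.
# SAMPLE_JPS_OUTPUT is made-up data; real process IDs will differ.

YARN_DAEMONS = ('ResourceManager', 'NodeManager')

SAMPLE_JPS_OUTPUT = """\
2305 NameNode
2437 DataNode
2810 ResourceManager
2921 NodeManager
3200 Jps
"""


def running_processes(jps_output):
    """Return the set of process names listed in jps output ("<pid> <name>")."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            names.add(parts[1].strip())
    return names


def missing_daemons(jps_output, required=YARN_DAEMONS):
    """Return the required daemons that are not listed, in the given order."""
    running = running_processes(jps_output)
    return [name for name in required if name not in running]


if __name__ == '__main__':
    print(missing_daemons(SAMPLE_JPS_OUTPUT))
    print(missing_daemons("3200 Jps\n"))
```

An empty list means both daemons were found. On a real machine the text would come from running jps itself rather than from a hard-coded string.
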

(2) Look at mapper.py and reducer.py

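# mapper.py: emit "word<TAB>1" for every word read from standard input
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # drop surrounding whitespace, then split the line into words
    line = line.strip()
    words = line.split()
    for word in words:
        # print() with a single argument runs under both Python 2 and 3
        print('%s\t%s' % (word, 1))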


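# reducer.py: sum the counts for each word; the input must be sorted by word
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    # each input line is "word<TAB>count", as emitted by mapper.py
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # skip malformed lines (no tab, or a count that is not a number)
        continue
    if current_word == word:
        current_count += count
    else:
        # the word changed: emit the total for the previous word
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# emit the total for the last word, if any valid input was read
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))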

(3) Test commands
<1>
First, look at the contents of hadoop.txt.

<2>
Feed hadoop.txt to mapper.py. The mapper splits each line into words and emits one word<TAB>1 pair per word.
<3>

Sorting the mapper output puts identical words on adjacent lines, which corresponds to the shuffle phase in Hadoop.
<4>

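Passing the sorted output to reducer.py reproduces the final result: the counts of identical words are merged into a single output line per word.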
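
The same map, sort, reduce test can be imitated inside one Python process, which also shows why the sort step matters. This is only a sketch: it uses itertools.groupby in place of the manual loop in reducer.py, the function names are invented, and the sample text is made up rather than taken from hadoop.txt.

```python
# Sketch: imitate the map -> sort -> reduce test in one Python process.
# The sample text is made up; it is not the content of hadoop.txt.
from itertools import groupby


def map_words(lines):
    """Mapper step: yield a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs):
    """Reducer step: sum the counts of adjacent pairs that share a word."""
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run the mapper, sort (standing in for the shuffle), then the reducer."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == '__main__':
    sample = ['hello hadoop', 'hello python hadoop', 'hello']
    print(word_count(sample))
    # Without the sort, equal words are not adjacent and are never merged.
    print(list(reduce_counts(map_words(sample))))
```

Like reducer.py, reduce_counts only merges neighbouring records, so skipping the sort leaves the same word scattered across several output lines.
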
(4) Run it on Hadoop
Now prepare a script that submits the job through hadoop-streaming.
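
The script itself is not reproduced in the text, so the following is only a sketch of what a hadoop-streaming submission typically looks like, assembled as an argument list in Python. The options -files, -mapper, -reducer, -input and -output are standard Hadoop Streaming options; the jar location and the HDFS paths are hypothetical placeholders that must be replaced with the values of the actual installation, and build_streaming_command is an invented helper name.

```python
# Sketch: assemble a hadoop-streaming command line as a list of arguments.
# Every path below is a hypothetical placeholder, not taken from the post.
import shlex


def build_streaming_command(streaming_jar, input_path, output_path,
                            mapper='mapper.py', reducer='reducer.py'):
    """Return the argument list for a Hadoop Streaming word-count job."""
    return [
        'hadoop', 'jar', streaming_jar,
        # -files ships the local scripts to the nodes that run the tasks;
        # as a generic option it has to come before the streaming options.
        '-files', '%s,%s' % (mapper, reducer),
        '-mapper', 'python %s' % mapper,
        '-reducer', 'python %s' % reducer,
        '-input', input_path,      # HDFS path of the input data
        '-output', output_path,    # HDFS directory that must not exist yet
    ]


if __name__ == '__main__':
    command = build_streaming_command(
        streaming_jar='/path/to/hadoop-streaming.jar',  # hypothetical
        input_path='/input/hadoop.txt',                 # hypothetical
        output_path='/output/wordcount',                # hypothetical
    )
    # Print a shell-safe version of the command instead of running it.
    print(' '.join(shlex.quote(arg) for arg in command))
```

The sketch only prints the command. To submit the job from Python, the same list could be handed to subprocess.run on a machine where the hadoop command is available.
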

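(5) Run the script

The job runs to completion.
(6) View the output

(7) Check the result in the web interface

The web interface shows that the job completed successfully.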
