MapReduce实战 - 根据文章记录获取时段内发帖频率

MapReduce简介

MapReduce是一种分布式计算模型，是Google提出的，主要用于搜索领域，解决海量数据的计算问题。
MR有两个阶段组成：Map和Reduce，用户只需实现map()和reduce()两个函数，即可实现分布式计算。

例子

数据源结构

首先查看数据源结构:

CREATE TABLE `article` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `allowed_add_tag` int(2) DEFAULT NULL,
  `attitudes` varchar(255) DEFAULT NULL,
  `attitudes_id` int(11) DEFAULT NULL,
  `banana_count` int(11) DEFAULT NULL,
  `big_cover_image` varchar(255) DEFAULT NULL,
  `channel_id` int(11) DEFAULT NULL,
  `channel_name` varchar(255) DEFAULT NULL,
  `channel_path` varchar(255) DEFAULT NULL,
  `comment_count` int(11) DEFAULT NULL,
  `contribute_time` datetime DEFAULT NULL,
  `cover_image` varchar(255) DEFAULT NULL,
  `description` varchar(255) DEFAULT NULL,
  `essense` int(2) DEFAULT NULL,
  `favorite_count` int(11) DEFAULT NULL,
  `latest_active_time` datetime DEFAULT NULL,
  `latest_comment_time` datetime DEFAULT NULL,
  `like_count` int(11) DEFAULT NULL,
  `link` varchar(255) DEFAULT NULL,
  `parent_channel_id` int(11) DEFAULT NULL,
  `parent_channel_name` varchar(255) DEFAULT NULL,
  `parent_realm_id` int(11) DEFAULT NULL,
  `realm_id` int(11) DEFAULT NULL,
  `realm_name` varchar(255) DEFAULT NULL,
  `recommended` int(2) DEFAULT NULL,
  `status` int(11) DEFAULT NULL,
  `tag_list` varchar(255) DEFAULT NULL,
  `title` varchar(255) DEFAULT NULL,
  `top_level` int(2) DEFAULT NULL,
  `tudou_domain` int(2) DEFAULT NULL,
  `type_id` int(11) DEFAULT NULL,
  `user_avatar` varchar(255) DEFAULT NULL,
  `user_id` int(11) DEFAULT NULL,
  `username` varchar(255) DEFAULT NULL,
  `view_count` int(11) DEFAULT NULL,
  `view_only` int(2) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=13103 DEFAULT CHARSET=utf8mb4;
复制代码

这里我将其中的数据导出为csv文件。

思路

在这个例子中，我要做的是根据帖子发布时间，统计全天中每隔30分钟的发帖个数。

由于当前我没有重写InputFormat接口，因此采用的是hadoop默认的按行读取文件方法。所以传入参数为<0, [一行数据]>.

InputFormat 接口 - 该接口指定输入文件的内容格式。

其中getSplits函数将所有输入数据分成numSplits个split，每个split交给一个map task处理。

getRecordReader函数提供一个用户解析split的迭代器对象，它将split中的每个record解析成key/value对。

获取数据中的发帖时间
计算发帖时间在全天时间中的时间段并传递个reduce() - <时间段, 1>
reduce对时间段出现次数进行统计

util

首先先编写工具类Times.java - period(str:String, format:String)方法，该方法的作用为：

根据传入的字符串和时间格式获取一天中改时间的时间区间，如：

输入："2018-10-18 22:05:11", "yyyy-MM-dd HH:mm:ss"

输出: "201810182200-201810182230"

方法如下：

public static String period(String time, String format) {
    Objects.requireNonNull(time);
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern(format);
    LocalDateTime dateTime = LocalDateTime.parse(time, formatter);
    int m = dateTime.getMinute();
    LocalDateTime start = dateTime.withMinute(m < 30 ? 0 : 30);
    LocalDateTime end = null;
    if (m < 30) {
        end = dateTime.withMinute(30);
    } else {
       end = dateTime.plusHours(1);
       end = end.withMinute(0);
    }

    DateTimeFormatter dateTimeFormatter = DateTimeFormatter.ofPattern("yyyyMMddHHmm");
    return start.format(dateTimeFormatter) + "-" + end.format(dateTimeFormatter);
}
复制代码

测试输入:

period("2018-11-11 23:34", "yyyy-MM-dd HH:mm");

返回结果:

201811112330-201811120000

Map

TimeMapper.java代码为：

public class TimeMapper extends Mapper<LongWritable, Text, Text, LongWritable> {


    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String time = Matchers.stringCutBymatchers(value.toString(), "[0-9]{4}[/][0-9]{1,2}[/][0-9]{1,2}[ ][0-9]{1,2}[:][0-9]{1,2}[:][0-9]{1,2}");
        Objects.requireNonNull(time);
        String result = Times.period(time, "yyyy/MM/dd HH:mm:ss");
        context.write(new Text(result), new LongWritable(1));
    }
}
复制代码

由于按行读取.csv文件并且一行中的时间格式为yyyy/MM/dd HH:mm:ss,因此直接用正则表达式截取时间。然后获取时间区段，然后将<时间区段, 1>传递给reduce().

Matchers.stringCutBymatchers():

public static String stringCutBymatchers(String str, String mstr) {
    Pattern patternn = Pattern.compile(mstr);
    Matcher matcher = patternn.matcher(str);
    String result = null;
    if (matcher.find()) {
        result = matcher.group(0);
    }
    return result;
}
复制代码

Reduce

reduce()阶段的操作就很简单了，只要统计时间区段出现的次数就好了

TimeReduce.java:

public class TimeReduce extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long counts = 0L;
        for (LongWritable val : values) {
            counts += val.get();
        }
        context.write(key, new LongWritable(counts));
    }
}
复制代码

main

main方法如下:

TimeApp.java:

public class TimeApp {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = (new GenericOptionsParser(conf, args)).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.out.println("请输入input目录和output目录");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "acfun-time");
        job.setJarByClass(CSVApp.class);
        job.setMapperClass(TimeMapper.class);
        job.setReducerClass(TimeReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }

        Path path = new Path(otherArgs[otherArgs.length - 1]);// 取第1个表示输出目录参数（第0个参数是输入目录）
        FileSystem fileSystem = path.getFileSystem(conf);// 根据path找到这个文件
        if (fileSystem.exists(path)) {
            fileSystem.delete(path, true);// true的意思是，就算output有东西，也一带删除
        }

        FileOutputFormat.setOutputPath(job, path);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
    
}
复制代码

运行

最终文件目录如下：

其他package是为了之后继承其他类准备，目前没用。

这里我采用和hadoop-example一样的启动方法，设置一个总Main.java

public class Main {

    public static void main(String[] args) {
        int exitCode = -1;
        ProgramDriver pgd = new ProgramDriver();
        try {
            pgd.addClass("citycount", CSVApp.class, "统计文章中出现的城市个数");
            pgd.addClass("timecount", TimeApp.class, "统计文章时段发帖数目");
            exitCode = pgd.run(args);
        } catch (Throwable e) {
            e.printStackTrace();
        }
        System.exit(exitCode);
    }

}
复制代码

根据命令参数来选择需要执行的job。

打包并上传后执行。

执行

yarn jar com.dust-1.0-SNAPSHOT.jar timecount /acfun/input/dust_acfun_article.csv /acfun/output
复制代码

等待job执行完成:

执行完成之后通过

hdfs dfs -cat /acfun/output/part-r-00000
复制代码

查看结果

之后只要将该文件的数据提取出来画成图表就能直观地查看发帖时段了。