Study Notes: "Hadoop: The Definitive Guide - Storage and Analysis of Big Data"

Chapter 1: Meet Hadoop

1.2 Data Storage and Analysis

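Reading and writing data in parallel across multiple disks raises two important problems:

Hardware failure. Solution: replication, where the system keeps redundant copies (replicas) of the data.
Most analysis tasks need to combine data from many disks in some way. MapReduce provides a programming model that abstracts away these disk read/write problems and turns them into a computation over a dataset made up of key-value pairs.
In short, Hadoop gives us a platform for storage and analysis.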

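1.5 Differences Between Relational Databases and Hadoop

One difference is how structured the datasets they operate on are. Hadoop works well on unstructured and semi-structured data. Web server logs are a typical example of records that are not normalized, which is why Hadoop is well suited to analyzing all kinds of logs.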

Chapter 2: MapReduce

2.3 Analyzing the Data with Hadoop

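A MapReduce job is split into two processing phases, map and reduce. Each phase takes key-value pairs as input and produces key-value pairs as output. The programmer writes two functions, the map function and the reduce function, plus a MapReduce job that runs them.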
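The two phases can be sketched in plain Java without any Hadoop classes. This is only an illustrative simulation: the class name `MapReduceSketch`, the `"year,temperature"` input format, and the in-memory `shuffle` step are assumptions made for this sketch, not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the map -> shuffle -> reduce flow (hypothetical names, no Hadoop).
class MapReduceSketch {

    // "map" phase: turn one input record ("year,temperature") into a (year, temperature) pair.
    static Map.Entry<String, Integer> map(String record) {
        String[] parts = record.split(",");
        return Map.entry(parts[0], Integer.parseInt(parts[1]));
    }

    // "shuffle": group all values that share a key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "reduce" phase: collapse the values of one key to their maximum.
    static int reduce(List<Integer> values) {
        int max = Integer.MIN_VALUE;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Runs the three steps in sequence over an in-memory list of records.
    static Map<String, Integer> run(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            pairs.add(map(record));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {1949=111, 1950=22}
        System.out.println(run(List.of("1950,0", "1950,22", "1950,-11", "1949,111", "1949,78")));
    }
}
```

In real Hadoop the grouping step is done by the framework between the two phases, so the programmer only supplies the map and reduce logic.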
Java MapReduce:
Mapper function:

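import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: (byte offset, line of text). Output: (year, air temperature).
public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Sentinel value that marks a missing temperature reading.
    private static final int MISSING = 9999;

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') {
            // Skip the leading plus sign before parsing.
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        // Emit only readings that are present and carry an accepted quality code.
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}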
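The fixed-width parsing inside the mapper can be tried out on its own, without a Hadoop installation. The following is a minimal sketch that reuses the same column offsets; the class `NcdcRecordSketch` and the helper `sampleLine`, which pads a synthetic record with zeros, are hypothetical and exist only for this illustration.

```java
// Stand-alone sketch of the field extraction done inside map(); no Hadoop classes are used.
class NcdcRecordSketch {
    // Sentinel value that marks a missing temperature reading.
    static final int MISSING = 9999;

    // The year occupies character positions 15 to 18 of the record.
    static String year(String line) {
        return line.substring(15, 19);
    }

    // The signed temperature occupies positions 87 to 91; a leading '+' is skipped before parsing.
    static int airTemperature(String line) {
        int start = line.charAt(87) == '+' ? 88 : 87;
        return Integer.parseInt(line.substring(start, 92));
    }

    // A reading counts only if it is present and its quality code (position 92) is accepted.
    static boolean isValid(String line) {
        String quality = line.substring(92, 93);
        return airTemperature(line) != MISSING && quality.matches("[01459]");
    }

    // Hypothetical helper: builds a 93-character synthetic line with only the three fields filled in.
    static String sampleLine(String year, String signedTemp, char quality) {
        StringBuilder sb = new StringBuilder("0".repeat(93));
        sb.replace(15, 19, year);        // 4-character year
        sb.replace(87, 92, signedTemp);  // 5-character signed temperature, e.g. "+0022"
        sb.setCharAt(92, quality);
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = sampleLine("1950", "+0022", '1');
        System.out.println(year(line) + "\t" + airTemperature(line) + "\t" + isValid(line));
    }
}
```

Real records are longer and carry many more fields; the synthetic line only needs to be long enough for the offsets the mapper reads.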

Reduce function:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (year, all temperatures for that year). Output: (year, maximum temperature).
public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}


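import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures the job and submits it. args[0] is the input path, args[1] the output path.
public class MaxTemperatureDemo {
    public static void main(String[] args) throws Exception {
        // Job.getInstance() replaces the deprecated new Job() constructor.
        Job job = Job.getInstance();
        job.setJarByClass(MaxTemperatureDemo.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Submit the job and wait for it to finish; without this call the job never runs.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}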

Type correspondence: LongWritable corresponds to Java Long, Text to String, and IntWritable to Integer.
Log keyword for the job: job_local26392882
Log keyword for a map task: attempt_local26392882_001_m_0000_0
Log keyword for a reduce task: attempt_local26392882_001_r_0000_0
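When scanning a local run's log, the two attempt keywords differ only in the `m` or `r` marker. The sketch below classifies a keyword accordingly. The regular expression is inferred solely from the two sample keywords in these notes, so it is an assumption and may not cover every ID format Hadoop produces; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Classifies task attempt keywords as they appear in these notes (hypothetical helper).
class AttemptIdSketch {
    // Assumed shape: attempt_<identifier>_<number>_<m|r>_<number>_<number>
    private static final Pattern ATTEMPT =
            Pattern.compile("attempt_([A-Za-z0-9]+)_(\\d+)_([mr])_(\\d+)_(\\d+)");

    // Returns "map", "reduce", or "unknown" when the keyword does not match the assumed shape.
    static String taskType(String attemptId) {
        Matcher m = ATTEMPT.matcher(attemptId);
        if (!m.matches()) {
            return "unknown";
        }
        return m.group(3).equals("m") ? "map" : "reduce";
    }

    // Rebuilds the job keyword in the form used in these notes, or null if there is no match.
    static String jobKeyword(String attemptId) {
        Matcher m = ATTEMPT.matcher(attemptId);
        return m.matches() ? "job_" + m.group(1) : null;
    }

    public static void main(String[] args) {
        String id = "attempt_local26392882_001_m_0000_0";
        System.out.println(taskType(id) + " task of " + jobKeyword(id));
    }
}
```

Because both attempt keywords embed the same identifier as the job keyword, searching the log for that identifier finds all lines belonging to one run.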

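2.4 Scaling Out

Job: a unit of work that the client wants performed. It consists of the input data, the MapReduce program, and configuration information.
Hadoop divides a job into tasks, of which there are two types: map tasks and reduce tasks.
Tasks run on nodes in the cluster and are scheduled by YARN.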
