With the Eclipse development environment set up, the next step is "Hello World". Hadoop's Hello World is the classic WordCount example: counting how many times each word appears across a set of documents.
I prepared three documents here (stored in the HDFS file system):
    [root@bigdata2 hadoop-1.0.1]# ./bin/hadoop fs -cat /user/root/in/helloword.txt
    Warning: $HADOOP_HOME is deprecated.
    Hello,Word!
    [root@bigdata2 hadoop-1.0.1]# ./bin/hadoop fs -cat /user/root/in/input1.txt
    Warning: $HADOOP_HOME is deprecated.
    hello,word !
    what's your name ?
    haow are you ?
    are you ok ?
    are you ok ?
    [root@bigdata2 hadoop-1.0.1]# ./bin/hadoop fs -cat /user/root/in/input2.txt
    Warning: $HADOOP_HOME is deprecated.
    hello,mobile.
    hello,word !
    what's your name ?
    haow are you ?
    are you ok ?
    are you ok ?
WordCount.java
    package wordcount;

    import java.io.IOException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCount {

        public static class MapClass extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer itr = new StringTokenizer(line);
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    output.collect(word, one);
                }
            }
        }

        /**
         * A reducer class that just emits the sum of the input values.
         */
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {

            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(MapClass.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            // Suffix the output directory with a timestamp so reruns do not collide.
            String outFileExt = "_" + new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
            FileInputFormat.setInputPaths(conf,
                    new Path("hdfs://192.168.1.2:9000/user/root/in/"));
            FileOutputFormat.setOutputPath(conf,
                    new Path("hdfs://192.168.1.2:9000/user/root/out/" + outFileExt));

            JobClient.runJob(conf);
        }
    }
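Notice in the result below that punctuation sticks to words, so "Hello,Word!" and "hello,word" count as different keys. That is a consequence of StringTokenizer, which by default splits on whitespace only. A minimal plain-Java sketch of the mapper's tokenization (no Hadoop needed; the class and method names here are my own, for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {

    // Same splitting logic as the mapper above: StringTokenizer breaks a
    // line on whitespace only, so commas and "!" stay glued to the word.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("hello,word ! what's your name ?"));
        // "Hello,Word!" has no internal whitespace, so it is one token:
        System.out.println(tokens("Hello,Word!"));
    }
}
```

This also explains why "?" shows up as its own word in the output: in the input files it is separated from the preceding word by a space.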
Run it directly from Eclipse.
The result:
! 2
? 8
Hello,Word! 1
are 6
haow 2
hello,mobile. 1
hello,word 2
name 2
ok 4
what's 2
you 6
your 2
Code walkthrough:
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
JobConf is responsible for reading the configuration files (mainly core-site.xml, hdfs-site.xml, mapred-site.xml, etc.).
conf.setJobName("wordcount") sets the job name, which makes the job easier to identify in the monitoring web UI.
The InputFormat is responsible for calling getRecordReader() to create a RecordReader; the RecordReader in turn uses its createKey() and createValue() methods to produce the <key, value> pairs that are fed to the map function.
InputFormat has many implementations supporting different data sources, for example FileInputFormat (and its subclasses such as TextInputFormat) for files and DBInputFormat for databases.
The OutputFormat is responsible for the output format. With TextOutputFormat, the key and value objects are converted to strings internally (via toString()) and written out as text.
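A minimal sketch (no Hadoop required; class and method names are my own) of what TextOutputFormat does per record: call toString() on key and value and join them with a tab, its default separator, which is the "word<TAB>count" layout seen in the result above.

```java
public class TextOutputDemo {

    // Mimics TextOutputFormat's per-record formatting: stringify the key
    // and value and separate them with a tab character.
    static String formatRecord(Object key, Object value) {
        return key.toString() + "\t" + value.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatRecord("hello,word", 2));
        System.out.println(formatRecord("are", 6));
    }
}
```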
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter)
The map function emits <key, value> pairs; the framework then groups them by key into <key, value-list> form and hands each group to the reduce function.
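That grouping step (the shuffle) can be sketched in plain Java, without Hadoop; the class name and use of a TreeMap for sorted keys are my own choices for illustration:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleDemo {

    // Groups the mapper's individual <word, 1> pairs by key, so each word
    // maps to the list of values the reducer will receive for it. A TreeMap
    // keeps the keys sorted, as in the job's output.
    static Map<String, List<Integer>> group(List<? extends Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                new AbstractMap.SimpleEntry<>("are", 1),
                new AbstractMap.SimpleEntry<>("you", 1),
                new AbstractMap.SimpleEntry<>("are", 1));
        System.out.println(group(pairs)); // "are" gets [1, 1], "you" gets [1]
    }
}
```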
    int sum = 0;
    while (values.hasNext()) {
        sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
The reduce function sums the values for each key to compute the count, then writes the result to the output.
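Stripped of the Hadoop types, the counting logic is just a sum over the iterator of grouped values (a plain-Java sketch; the class name is my own):

```java
import java.util.Arrays;
import java.util.Iterator;

public class SumDemo {

    // Same loop as the reducer above: consume the iterator of values for
    // one key and add them up.
    static int sum(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        // "are" appeared 6 times, so the shuffle delivers six 1s for it.
        System.out.println(sum(Arrays.asList(1, 1, 1, 1, 1, 1).iterator())); // 6
    }
}
```

Since the same class is registered as the combiner, this sum may also run on partial results on the map side, which is safe because addition is associative.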