Hadoop's OutputFormat and InputFormat

        For data input and output, a Hadoop job must specify an InputFormat and an OutputFormat. These two classes describe everything related to reading and writing the data, including its format.

InputFormat:

 public abstract 
    List<InputSplit> getSplits(JobContext context
                               ) throws IOException, InterruptedException;
  public abstract 
    RecordReader<K,V> createRecordReader(InputSplit split,
                                         TaskAttemptContext context
                                        ) throws IOException, 
                                                 InterruptedException;

createRecordReader: defines how individual records are actually read from a split

getSplits: computes the chunks (splits) of input data to be read
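The split computation can be sketched without Hadoop on the classpath. The class below is a hypothetical, simplified imitation of what `FileInputFormat.getSplits` does for a single file: chop a byte range into block-sized pieces. The name `SimpleSplitter` and its method signature are illustrative only, not part of Hadoop's API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplification of FileInputFormat.getSplits:
// divide a file of totalLen bytes into splits of at most blockSize bytes.
public class SimpleSplitter {
    // Each long[] holds {start offset, length} of one split.
    static List<long[]> getSplits(long totalLen, long blockSize) {
        List<long[]> splits = new ArrayList<>();
        long offset = 0;
        while (offset < totalLen) {
            long len = Math.min(blockSize, totalLen - offset);
            splits.add(new long[]{offset, len});
            offset += len;
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 300 MB file with a 128 MB block size yields 3 splits:
        // two full 128 MB splits and a final 44 MB remainder.
        List<long[]> s = getSplits(300L << 20, 128L << 20);
        System.out.println(s.size()); // prints 3
    }
}
```

Each map task is then fed exactly one such split, so the number of splits determines the number of map tasks.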

Looking at the InputSplit class, we see:

public abstract long getLength() throws IOException, InterruptedException;

public abstract 
    String[] getLocations() throws IOException, InterruptedException;

That is, the length of the data and the locations (hostnames) where it resides.
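A split is just a lightweight description of where a chunk of data lives. A minimal value class with the same two accessors (an illustrative stand-in, not a subclass of Hadoop's InputSplit) might look like:

```java
import java.util.Arrays;

// Illustrative stand-in for an InputSplit-like object: it carries only
// the byte length of the chunk and the hostnames holding replicas of it.
public class ToySplit {
    private final long length;
    private final String[] locations;

    public ToySplit(long length, String[] locations) {
        this.length = length;
        this.locations = locations.clone();
    }

    // Mirrors InputSplit.getLength(): size of the chunk in bytes.
    public long getLength() { return length; }

    // Mirrors InputSplit.getLocations(): hostnames where the data is
    // stored, so the framework can schedule the map task close to it.
    public String[] getLocations() { return locations.clone(); }

    public static void main(String[] args) {
        ToySplit s = new ToySplit(128L << 20, new String[]{"node1", "node2"});
        System.out.println(s.getLength());                    // 134217728
        System.out.println(Arrays.toString(s.getLocations())); // [node1, node2]
    }
}
```

The scheduler uses getLocations to place each map task on (or near) a node that already holds the data, which is how Hadoop achieves data locality.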

OutputFormat:

  public abstract RecordWriter<K, V> 
    getRecordWriter(TaskAttemptContext context
                    ) throws IOException, InterruptedException;
  public abstract void checkOutputSpecs(JobContext context
                                        ) throws IOException, 
                                                 InterruptedException;
  public abstract 
  OutputCommitter getOutputCommitter(TaskAttemptContext context
                                     ) throws IOException, InterruptedException;

getRecordWriter: defines how individual records are written to the output

checkOutputSpecs: validates the output specification (for example, that the output directory does not already exist)

getOutputCommitter: returns the committer that flushes and commits the written output when the job finishes
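The write path can also be sketched in plain Java. The class below imitates the behavior of the RecordWriter that TextOutputFormat returns (one "key TAB value NEWLINE" line per record); it is a standalone illustration, not Hadoop's actual class, and the name `ToyLineRecordWriter` is made up for this sketch.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Standalone imitation of the line-oriented RecordWriter used by
// TextOutputFormat: each write(key, value) emits "key<TAB>value<NEWLINE>".
public class ToyLineRecordWriter {
    private final Writer out;

    public ToyLineRecordWriter(Writer out) { this.out = out; }

    // Mirrors RecordWriter.write(K key, V value).
    public void write(Object key, Object value) throws IOException {
        out.write(key + "\t" + value + "\n");
    }

    // Mirrors RecordWriter.close(): flush and release the sink.
    public void close() throws IOException { out.close(); }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        ToyLineRecordWriter w = new ToyLineRecordWriter(sink);
        w.write("apple", 3);
        w.write("pear", 1);
        w.close();
        System.out.print(sink); // apple\t3  then  pear\t1, one per line
    }
}
```

In a real job the writer targets a per-task temporary file in HDFS, and the OutputCommitter is what promotes those temporary files to the final output directory on successful completion.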


Reprinted from snv.iteye.com/blog/1845063