For job input and output, Hadoop requires you to specify an OutputFormat and an InputFormat. These two classes describe how data is read and written, including its format and related details.
InputFormat:
public abstract List<InputSplit> getSplits(JobContext context)
        throws IOException, InterruptedException;

public abstract RecordReader<K,V> createRecordReader(InputSplit split,
        TaskAttemptContext context) throws IOException, InterruptedException;
getSplits: obtains the list of data chunks (InputSplits) to be read.
createRecordReader: specifies the concrete read operation, i.e., how a split's data is turned into key/value records.
Looking at the InputSplit class, we can see:
public abstract long getLength() throws IOException, InterruptedException;

public abstract String[] getLocations() throws IOException, InterruptedException;
i.e., the concrete length of the split and the locations (hosts) where its data resides.
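To make the division of labor between getSplits and createRecordReader concrete, here is a minimal, self-contained sketch. SimpleSplit, SimpleReader, and SimpleFormat are simplified stand-ins modeled on the abstract methods above, not the real Hadoop classes (a real implementation would extend org.apache.hadoop.mapreduce.InputFormat and friends); the input is an in-memory array standing in for a file.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for Hadoop's InputSplit / RecordReader / InputFormat,
// modeled on the abstract methods shown above (not the real Hadoop classes).
public class SimpleFormat {
    // Like InputSplit: describes one chunk of the input (length + locations).
    static class SimpleSplit {
        final int start, length;
        SimpleSplit(int start, int length) { this.start = start; this.length = length; }
        long getLength() { return length; }
        String[] getLocations() { return new String[] { "localhost" }; }
    }

    // Like RecordReader: iterates over the records inside one split.
    static class SimpleReader {
        private final String[] records;
        private int pos = -1;
        SimpleReader(String[] data, SimpleSplit split) {
            records = new String[split.length];
            System.arraycopy(data, split.start, records, 0, split.length);
        }
        boolean nextKeyValue() { return ++pos < records.length; }
        Integer getCurrentKey() { return pos; }           // record offset within the split
        String getCurrentValue() { return records[pos]; } // the record itself
    }

    // Like InputFormat.getSplits: carve the input into fixed-size chunks.
    static List<SimpleSplit> getSplits(String[] data, int splitSize) {
        List<SimpleSplit> splits = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            splits.add(new SimpleSplit(start, Math.min(splitSize, data.length - start)));
        }
        return splits;
    }

    // Like InputFormat.createRecordReader: build a reader for one split.
    static SimpleReader createRecordReader(String[] data, SimpleSplit split) {
        return new SimpleReader(data, split);
    }

    public static void main(String[] args) {
        String[] data = { "a", "b", "c", "d", "e" };
        List<SimpleSplit> splits = getSplits(data, 2); // 3 splits: [a,b], [c,d], [e]
        for (SimpleSplit split : splits) {
            SimpleReader reader = createRecordReader(data, split);
            while (reader.nextKeyValue()) {
                System.out.println(reader.getCurrentKey() + " -> " + reader.getCurrentValue());
            }
        }
    }
}
```

In real Hadoop, each split is handed to one map task, which drives the RecordReader through the same nextKeyValue/getCurrentKey/getCurrentValue loop.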
OutputFormat:
public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
        throws IOException, InterruptedException;

public abstract void checkOutputSpecs(JobContext context)
        throws IOException, InterruptedException;

public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
        throws IOException, InterruptedException;
getRecordWriter: specifies how individual records are written.
checkOutputSpecs: validates the output specification (for example, FileOutputFormat fails here if the output directory already exists).
getOutputCommitter: returns the OutputCommitter, which commits the written output, e.g., promoting a task's temporary files to the final output location once the task succeeds.
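The three methods above can be sketched together with a minimal, self-contained example. SimpleOutputFormat, SimpleWriter, and the in-memory "file system" map are simplified stand-ins, not Hadoop's real classes; the write-to-temp-then-commit flow mirrors the pattern Hadoop's FileOutputCommitter follows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for Hadoop's OutputFormat, modeled on the three
// abstract methods above (not the real Hadoop classes). The "file system"
// is just an in-memory map from path to lines.
public class SimpleOutputFormat {
    static final Map<String, List<String>> FS = new HashMap<>();

    // Like RecordWriter: writes key/value records to a task's temporary path.
    static class SimpleWriter {
        final String tempPath;
        SimpleWriter(String tempPath) {
            this.tempPath = tempPath;
            FS.put(tempPath, new ArrayList<>());
        }
        void write(String key, String value) { FS.get(tempPath).add(key + "\t" + value); }
    }

    // Like checkOutputSpecs: fail early if the output already exists.
    static void checkOutputSpecs(String outputPath) {
        if (FS.containsKey(outputPath))
            throw new IllegalStateException("Output " + outputPath + " already exists");
    }

    // Like OutputCommitter.commitTask: promote the temporary output to its
    // final path only after the task has succeeded.
    static void commitTask(SimpleWriter writer, String outputPath) {
        FS.put(outputPath, FS.remove(writer.tempPath));
    }

    public static void main(String[] args) {
        String out = "/result/part-r-00000";
        checkOutputSpecs(out);                           // would throw if out already existed
        SimpleWriter writer = new SimpleWriter("/tmp/attempt_0");
        writer.write("hello", "1");
        writer.write("world", "2");
        commitTask(writer, out);                         // temp data becomes visible at out
        System.out.println(FS.get(out));
    }
}
```

Writing to a temporary location and committing atomically at the end is what lets Hadoop discard the output of failed or speculative task attempts without corrupting the final result.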