For job input and output, Hadoop requires you to specify an OutputFormat and an InputFormat. These two classes describe how data is read and written, including its format and related details.
InputFormat:
public abstract List<InputSplit> getSplits(JobContext context)
        throws IOException, InterruptedException;

public abstract RecordReader<K,V> createRecordReader(InputSplit split,
        TaskAttemptContext context) throws IOException, InterruptedException;
getSplits: obtains the list of data chunks (InputSplits) to be read.
createRecordReader: specifies the concrete read operation, i.e., how a split's data is turned into key/value records.
Looking at the InputSplit class, we can see:
public abstract long getLength() throws IOException, InterruptedException;

public abstract String[] getLocations() throws IOException, InterruptedException;
i.e., the concrete length of the split and the locations (hosts) where its data resides.
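To make the division of labor between getSplits and createRecordReader concrete, here is a minimal, self-contained sketch. SimpleSplit, SimpleReader, and SimpleFormat are simplified stand-ins modeled on the abstract methods above, not the real Hadoop classes (a real implementation would extend org.apache.hadoop.mapreduce.InputFormat and friends); the input is an in-memory array standing in for a file.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for Hadoop's InputSplit / RecordReader / InputFormat,
// modeled on the abstract methods shown above (not the real Hadoop classes).
public class SimpleFormat {
    // Like InputSplit: describes one chunk of the input (length + locations).
    static class SimpleSplit {
        final int start, length;
        SimpleSplit(int start, int length) { this.start = start; this.length = length; }
        long getLength() { return length; }
        String[] getLocations() { return new String[] { "localhost" }; }
    }

    // Like RecordReader: iterates over the records inside one split.
    static class SimpleReader {
        private final String[] records;
        private int pos = -1;
        SimpleReader(String[] data, SimpleSplit split) {
            records = new String[split.length];
            System.arraycopy(data, split.start, records, 0, split.length);
        }
        boolean nextKeyValue() { return ++pos < records.length; }
        Integer getCurrentKey() { return pos; }           // record offset within the split
        String getCurrentValue() { return records[pos]; } // the record itself
    }

    // Like InputFormat.getSplits: carve the input into fixed-size chunks.
    static List<SimpleSplit> getSplits(String[] data, int splitSize) {
        List<SimpleSplit> splits = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            splits.add(new SimpleSplit(start, Math.min(splitSize, data.length - start)));
        }
        return splits;
    }

    // Like InputFormat.createRecordReader: build a reader for one split.
    static SimpleReader createRecordReader(String[] data, SimpleSplit split) {
        return new SimpleReader(data, split);
    }

    public static void main(String[] args) {
        String[] data = { "a", "b", "c", "d", "e" };
        List<SimpleSplit> splits = getSplits(data, 2); // 3 splits: [a,b], [c,d], [e]
        for (SimpleSplit split : splits) {
            SimpleReader reader = createRecordReader(data, split);
            while (reader.nextKeyValue()) {
                System.out.println(reader.getCurrentKey() + " -> " + reader.getCurrentValue());
            }
        }
    }
}
```

In real Hadoop, each split is handed to one map task, which drives the RecordReader through the same nextKeyValue/getCurrentKey/getCurrentValue loop.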
OutputFormat:
public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
        throws IOException, InterruptedException;

public abstract void checkOutputSpecs(JobContext context)
        throws IOException, InterruptedException;

public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
        throws IOException, InterruptedException;
getRecordWriter: specifies how individual records are written.
checkOutputSpecs: validates the output specification (for example, FileOutputFormat fails here if the output directory already exists).
getOutputCommitter: returns the OutputCommitter, which commits the written output, e.g., promoting a task's temporary files to the final output location once the task succeeds.
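The three methods above can be sketched together with a minimal, self-contained example. SimpleOutputFormat, SimpleWriter, and the in-memory "file system" map are simplified stand-ins, not Hadoop's real classes; the write-to-temp-then-commit flow mirrors the pattern Hadoop's FileOutputCommitter follows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for Hadoop's OutputFormat, modeled on the three
// abstract methods above (not the real Hadoop classes). The "file system"
// is just an in-memory map from path to lines.
public class SimpleOutputFormat {
    static final Map<String, List<String>> FS = new HashMap<>();

    // Like RecordWriter: writes key/value records to a task's temporary path.
    static class SimpleWriter {
        final String tempPath;
        SimpleWriter(String tempPath) {
            this.tempPath = tempPath;
            FS.put(tempPath, new ArrayList<>());
        }
        void write(String key, String value) { FS.get(tempPath).add(key + "\t" + value); }
    }

    // Like checkOutputSpecs: fail early if the output already exists.
    static void checkOutputSpecs(String outputPath) {
        if (FS.containsKey(outputPath))
            throw new IllegalStateException("Output " + outputPath + " already exists");
    }

    // Like OutputCommitter.commitTask: promote the temporary output to its
    // final path only after the task has succeeded.
    static void commitTask(SimpleWriter writer, String outputPath) {
        FS.put(outputPath, FS.remove(writer.tempPath));
    }

    public static void main(String[] args) {
        String out = "/result/part-r-00000";
        checkOutputSpecs(out);                           // would throw if out already existed
        SimpleWriter writer = new SimpleWriter("/tmp/attempt_0");
        writer.write("hello", "1");
        writer.write("world", "2");
        commitTask(writer, out);                         // temp data becomes visible at out
        System.out.println(FS.get(out));
    }
}
```

Writing to a temporary location and committing atomically at the end is what lets Hadoop discard the output of failed or speculative task attempts without corrupting the final result.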