compress files in a directory into another directory
use ‘cut -f 2’
hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-Dmapred.reduce.tasks=0 \
-input /home/houzhizhen/defaultfs/test/input \
-output /home/houzhizhen/defaultfs/test/outputcut \
-mapper "cut -f 2"
This produces one file in the output directory for each file in the input directory (one per map task). After decompressing a file with ‘gunzip’, its length does not equal the length of the source file: it is shorter by 1 byte for every line in the file, probably because each ‘\r\n’ line ending is replaced with ‘\n’.
[houzhizhen@localhost outputcut]$ ll
total 12
-rw-r--r--. 1 houzhizhen root 2938 May 16 10:07 part-00000.gz
-rw-r--r--. 1 houzhizhen root 325 May 16 10:07 part-00001.gz
-rw-r--r--. 1 houzhizhen root 128 May 16 10:07 part-00002.gz
-rw-r--r--. 1 houzhizhen root 0 May 16 10:07 _SUCCESS
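One way to check the line-ending guess for the size difference is to count the carriage returns in the source file and compare that count with the number of bytes lost. The sketch below does not touch Hadoop at all. It builds a small hypothetical CRLF file (the file names and contents are made up for illustration), strips the carriage returns with tr, and compares the sizes. If the guess is right, the same arithmetic should hold for a real source file and its gunzipped output.

```shell
# Hypothetical sample: three lines with Windows-style CRLF endings.
tmp=$(mktemp -d)
printf 'alpha\r\nbeta\r\ngamma\r\n' > "$tmp/source.txt"

# Drop every carriage return, leaving plain LF line endings.
tr -d '\r' < "$tmp/source.txt" > "$tmp/stripped.txt"

# Byte sizes, line count, and number of carriage returns in the source.
src_bytes=$(wc -c < "$tmp/source.txt" | tr -d ' ')
out_bytes=$(wc -c < "$tmp/stripped.txt" | tr -d ' ')
lines=$(wc -l < "$tmp/source.txt" | tr -d ' ')
cr_count=$(tr -cd '\r' < "$tmp/source.txt" | wc -c | tr -d ' ')

# The size difference should equal the number of lines (one CR per line).
echo "source=$src_bytes stripped=$out_bytes lines=$lines carriage_returns=$cr_count"

rm -rf "$tmp"
```

If the carriage-return count of a real input file is 0, the missing bytes have some other cause and the guess above does not apply.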
use ‘/bin/cat’
The output is identical to that of the previous test.
hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
-Dmapred.reduce.tasks=0 \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input /home/houzhizhen/defaultfs/test/input \
-output /home/houzhizhen/defaultfs/test/output-gz \
-mapper /bin/cat \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-outputformat org.apache.hadoop.mapred.TextOutputFormat
reduce into one compressed file directly
Note: this sends all the data to a single reduce task, so it runs very slowly if the input is large.
hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
-Dmapred.reduce.tasks=1 \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
-input /home/houzhizhen/defaultfs/test/input \
-output /home/houzhizhen/defaultfs/test/archive \
-mapper /bin/cat \
-reducer /bin/cat \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-outputformat org.apache.hadoop.mapred.TextOutputFormat
- decompress
cd /home/houzhizhen/defaultfs/test/archive
bunzip2 part-00000.bz2
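Because the single reduce task is slow for large inputs, a possible alternative is to keep the map-only gzip job and concatenate its part files afterwards. This relies on the fact that a gzip file made of several concatenated members is still a valid gzip file. The sketch below only demonstrates that property on two small hypothetical files created locally. The part-0000N.gz names mimic the job output, but nothing here reads a real Hadoop output directory.

```shell
# Hypothetical stand-ins for two map outputs.
tmp=$(mktemp -d)
printf 'line one\n' > "$tmp/a.txt"
printf 'line two\n' > "$tmp/b.txt"

# Compress each file separately, as the map-only job would.
gzip -c "$tmp/a.txt" > "$tmp/part-00000.gz"
gzip -c "$tmp/b.txt" > "$tmp/part-00001.gz"

# Concatenate the compressed parts into a single archive.
cat "$tmp"/part-*.gz > "$tmp/merged.gz"

# Decompressing the merged archive should yield both inputs in order.
merged=$(gunzip -c "$tmp/merged.gz")
expected=$(cat "$tmp/a.txt" "$tmp/b.txt")
[ "$merged" = "$expected" ] && echo "merged archive matches"

rm -rf "$tmp"
```

The lines keep the order of the part files, whereas a job with a reduce task sorts them by key during the shuffle, so the two approaches may not produce byte-identical results.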