Hadoop Benchmarking


This note uses several widely used benchmark programs: TestDFSIO, mrbench, nnbench, TeraSort, and sort.




$ hadoop jar hadoop-mapreduce-client-jobclient-2.9.2-tests.jar
An example program must be given as the first argument.
Valid program names are:
DFSCIOTest: Distributed i/o benchmark of libhdfs.
DistributedFSCheck: Distributed checkup of the file system consistency.
JHLogAnalyzer: Job History Log analyzer.
MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
NNdataGenerator: Generate the data to be used by NNloadGenerator
NNloadGenerator: Generate load on Namenode using NN loadgenerator run WITHOUT MR
NNloadGeneratorMR: Generate load on Namenode using NN loadgenerator run as MR job
NNstructureGenerator: Generate the structure to be used by NNdataGenerator
SliveTest: HDFS Stress Test and Live Data Verification.
TestDFSIO: Distributed i/o benchmark.
fail: a job that always fails
filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
gsleep: A sleep job whose mappers create 1MB buffer for every record.
largesorter: Large-Sort tester
loadgen: Generic map/reduce load generator
mapredtest: A map/reduce test check.
minicluster: Single process HDFS and MR cluster.
mrbench: A map/reduce benchmark that can create many small jobs
nnbench: A benchmark that stresses the namenode w/ MR.
nnbenchWithoutMR: A benchmark that stresses the namenode w/o MR.
sleep: A job that sleeps at each map and reduce task.
testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
testfilesystem: A test for FileSystem read/write.
testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
testsequencefile: A test for flat files of binary key value pairs.
testsequencefileinputformat: A test for sequence file input format.
testtextinputformat: A test for text input format.
threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill
timelineperformance: A job that launches mappers to test timline service performance.




$ hadoop jar hadoop-mapreduce-client-jobclient-2.9.2-tests.jar  TestDFSIO
20/05/27 14:11:42 INFO fs.TestDFSIO: TestDFSIO.1.8
Missing arguments.
Usage: TestDFSIO [genericOptions] -read [-random | -backward | -skip [-skipSize Size]] | -write | -append | -truncate | -clean [-compression codecClassName] [-nrFiles N] [-size Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-rootDir]



$ hadoop jar hadoop-mapreduce-client-jobclient-2.9.2-tests.jar TestDFSIO -write -nrFiles 10 -size 128MB -resFile /home/hduser/TestDFSIO_result.txt




Throughput: 67.88 MB/sec

Average IO rate: 94.41 MB/sec

IO rate std deviation: 43.3 MB/sec
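
The gap between throughput and average IO rate is easier to see with a small calculation. The sketch below assumes that TestDFSIO computes throughput as the total data written divided by the summed per-file I/O time, and the average IO rate as the mean of the per-file rates. The function name dfsio_stats and the three sample measurements are made up for illustration; they are not taken from the run above.

```shell
#!/bin/sh
# Hypothetical helper: read "<size in MB> <io time in seconds>" lines on stdin
# and print the three summary figures under the assumptions stated above.
dfsio_stats() {
  awk '
    {
      mb += $1; secs += $2
      r = $1 / $2              # per-file rate in MB/sec
      sum += r; sumsq += r * r
      n++
    }
    END {
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = -var  # guard against rounding noise
      printf "throughput=%.2f avg_io_rate=%.2f stddev=%.2f\n", mb / secs, mean, sqrt(var)
    }'
}

# Three illustrative files of 128 MB each, taking 1, 2, and 4 seconds.
printf '128 1\n128 2\n128 4\n' | dfsio_stats
```

Slow files carry more weight in the throughput figure than in the mean of the per-file rates, so a few slow tasks pull throughput below the average IO rate and raise the standard deviation.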




$ hadoop jar hadoop-mapreduce-client-jobclient-2.9.2-tests.jar TestDFSIO -read -nrFiles 10 -size 128MB -resFile /home/hduser/TestDFSIO_read_result.txt



$ hadoop jar hadoop-mapreduce-client-jobclient-2.9.2-tests.jar TestDFSIO -clean





$ hadoop jar hadoop-mapreduce-client-jobclient-2.9.2-tests.jar nnbench --help
Usage: nnbench <options>
    -operation <Available operations are create_write open_read rename delete. This option is mandatory>
     * NOTE: The open_read, rename and delete operations assume that the files they operate on, are already available. The create_write operation must be run before running the other operations.
    -maps <number of maps. default is 1. This is not mandatory>
    -reduces <number of reduces. default is 1. This is not mandatory>
    -startTime <time to start, given in seconds from the epoch. Make sure this is far enough into the future, so all maps (operations) will start at the same time. default is launch time + 2 mins. This is not mandatory>
    -blockSize <Block size in bytes. default is 1. This is not mandatory>
    -bytesToWrite <Bytes to write. default is 0. This is not mandatory>
    -bytesPerChecksum <Bytes per checksum for the files. default is 1. This is not mandatory>
    -numberOfFiles <number of files to create. default is 1. This is not mandatory>
    -replicationFactorPerFile <Replication factor for the files. default is 1. This is not mandatory>
    -baseDir <base DFS path. default is /benchmarks/NNBench. This is not mandatory>
    -readFileAfterOpen <true or false. if true, it reads the file and reports the average time to read. This is valid with the open_read operation. default is false. This is not mandatory>
    -help: Display the help statement


$ hadoop jar hadoop-mapreduce-client-jobclient-2.9.2-tests.jar nnbench -operation create_write -maps 10 -reduces 5 -numberOfFiles 1000 -readFileAfterOpen true
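
According to the help text, -startTime is given in seconds since the epoch and should be far enough in the future for all maps to start together. A minimal sketch of computing such a value follows. The helper name nnbench_start_time and the 180-second delay are illustrative assumptions, not values used in this test.

```shell
#!/bin/sh
# Hypothetical helper: return "now + delay" in seconds since the epoch.
# Arguments: $1 = current epoch seconds, $2 = delay in seconds.
nnbench_start_time() {
  echo $(( $1 + $2 ))
}

start=$(nnbench_start_time "$(date +%s)" 180)

# Print the command instead of running it, so it can be reviewed first.
echo "hadoop jar hadoop-mapreduce-client-jobclient-2.9.2-tests.jar nnbench" \
     "-operation create_write -maps 10 -reduces 5 -numberOfFiles 1000" \
     "-startTime $start"
```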





$ hadoop jar hadoop-mapreduce-client-jobclient-2.9.2-tests.jar mrbench --help
Usage: mrbench [-baseDir <base DFS path for output/input, default is /benchmarks/MRBench>] [-jar <local path to job jar file containing Mapper and Reducer implementations, default is current jar file>] [-numRuns <number of times to run the job, default is 1>] [-maps <number of maps for each run, default is 2>] [-reduces <number of reduces for each run, default is 1>] [-inputLines <number of input lines to generate, default is 1>] [-inputType <type of input to generate, one of ascending (default), descending, random>] [-verbose]


$ hadoop jar hadoop-mapreduce-client-jobclient-2.9.2-tests.jar mrbench -numRuns 10 -maps 10 -reduces 5 -inputLines 10 -inputType descending



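TeraSort is an effective sorting benchmark for Hadoop. Using the TeraSort program that ships with Hadoop, you can test how different numbers of map and reduce tasks affect Hadoop performance. The test data is generated by the bundled teragen program, at sizes of 1 GB and 10 GB.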

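1. TeraGen generates random data
2. TeraSort sorts the data
3. TeraValidate verifies that the TeraSort output is sorted; if it detects a problem, it writes the out-of-order keys to its output directory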

$ hadoop jar hadoop-mapreduce-examples-2.9.2.jar --help
Unknown program '--help' chosen.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files


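1. Use TeraGen to generate 1 GB of random data. The output is stored in /user/hduser/test_data

$ hadoop jar hadoop-mapreduce-examples-2.9.2.jar  teragen 10000000 test_data   # Note: do not write the data size as 1g, 1t, or similar. When I tried that format in testing, the generated data had size 0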
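
TeraGen's first argument is a number of rows rather than a byte size, and each row is 100 bytes, which is why 10000000 rows gives about 1 GB. A small sketch for deriving the row count from a target size follows. The function name teragen_rows is a hypothetical helper, and it assumes decimal units (1 GB = 10^9 bytes).

```shell
#!/bin/sh
# Hypothetical helper: convert a target size in decimal GB to a TeraGen row count.
# Each TeraGen row is 100 bytes, so rows = bytes / 100.
teragen_rows() {
  gb=$1
  echo $(( gb * 1000000000 / 100 ))
}

teragen_rows 1    # row count for the 1 GB data set
teragen_rows 10   # row count for the 10 GB data set
```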


2. Use TeraSort to sort the generated data

$ hadoop jar hadoop-mapreduce-examples-2.9.2.jar  terasort test_data terasort-output

3. Use TeraValidate to verify the sorted output

$ hadoop jar hadoop-mapreduce-examples-2.9.2.jar  teravalidate  terasort-output terasort-validate


$ hadoop fs -cat terasort-validate/part-r-00000
checksum    4c49607ac53602
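
The output above contains only a checksum line, which is what a clean validation run looks like. A hedged sketch for checking this automatically follows. It assumes that TeraValidate reports problems as records whose key is `error`; the function name validate_report is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read TeraValidate output on stdin and summarize it.
# Assumption: problem records start with the key "error"; a clean run
# contains only the "checksum" record.
validate_report() {
  awk '
    $1 == "error"    { errors++ }
    $1 == "checksum" { sum = $2 }
    END {
      if (errors > 0) { printf "FAILED: %d error record(s)\n", errors; exit 1 }
      printf "OK: checksum %s\n", sum
    }'
}

# Usage against the cluster (not run here):
#   hadoop fs -cat terasort-validate/part-r-00000 | validate_report
printf 'checksum\t4c49607ac53602\n' | validate_report
```

The function exits with a non-zero status when an error record is present, so it can gate later steps in a benchmark script.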




