Hadoop Benchmarking Tools - 51Testing Software Testing Forum

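Testing is essential for verifying a system's correctness and analyzing its performance, yet it is easy to neglect. To understand the system more fully, locate its bottlenecks, and improve its performance, I plan to start with testing and learn the main ways of testing Hadoop. This post has two parts: the first records how to run the benchmarks that ship with Hadoop; the second records how to install and use HiBench, the Hadoop benchmark suite open-sourced by Intel.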

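Hadoop ships with several benchmarks, packaged in a few jar files such as hadoop-test.jar and hadoop-examples.jar, and they are easy to run in a Hadoop environment. The Hadoop version used for the tests in this post is Cloudera's hadoop-0.20.2-cdh3u3.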

$ export HADOOP_HOME=/home/hadoop/hadoop
$ export PATH=$PATH:$HADOOP_HOME/bin
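Before running any of the jars, it can help to confirm that the variable is actually set. The following is a minimal sketch; `check_hadoop_home` is a hypothetical helper name that is not part of Hadoop, and it only checks that `HADOOP_HOME` names an existing directory.

```shell
# Hypothetical helper (not part of Hadoop): verifies that HADOOP_HOME
# is set and points to an existing directory.
check_hadoop_home() {
  if [ -z "${HADOOP_HOME:-}" ]; then
    echo "HADOOP_HOME is not set" >&2
    return 1
  fi
  if [ ! -d "$HADOOP_HOME" ]; then
    echo "HADOOP_HOME is not a directory: $HADOOP_HOME" >&2
    return 1
  fi
  echo "HADOOP_HOME=$HADOOP_HOME"
}
```

Calling `check_hadoop_home` after the two `export` lines prints the resolved path, or an error on stderr if the variable is missing.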

$ hadoop jar $HADOOP_HOME/xxx.jar

When hadoop-test-0.20.2-cdh3u3.jar is invoked without arguments, it lists all of the test programs:

$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar
An example program must be given as the first argument.
Valid program names are:
DFSCIOTest: Distributed i/o benchmark of libhdfs.
DistributedFSCheck: Distributed checkup of the file system consistency.
MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
TestDFSIO: Distributed i/o benchmark.
dfsthroughput: measure hdfs throughput
filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
loadgen: Generic map/reduce load generator
mapredtest: A map/reduce test check.
minicluster: Single process HDFS and MR cluster.
mrbench: A map/reduce benchmark that can create many small jobs
nnbench: A benchmark that stresses the namenode.
testarrayfile: A test for flat files of binary key/value pairs.
testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
testfilesystem: A test for FileSystem read/write.
testipc: A test for ipc.
testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
testrpc: A test for rpc.
testsequencefile: A test for flat files of binary key value pairs.
testsequencefileinputformat: A test for sequence file input format.
testsetfile: A test for flat files of binary key/value pairs.
testtextinputformat: A test for text input format.
threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill

These programs test Hadoop from several angles; TestDFSIO, mrbench, and nnbench are three of the most widely used.

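TestDFSIO measures the I/O performance of HDFS. It uses a MapReduce job to perform reads and writes concurrently: each map task reads or writes one file, the map output carries the statistics for the file it processed, and the reduce step accumulates those statistics and produces a summary.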
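As a rough sketch of how a TestDFSIO run is usually put together, the helper below only prints the command lines rather than executing them. `dfsio_cmd` is a hypothetical helper, and the values 10 files and 1000 MB per file are illustrative assumptions, not figures from this post. The options `-write`, `-read`, `-clean`, `-nrFiles`, and `-fileSize` are TestDFSIO options, but it is worth checking the usage message printed by your own jar version.

```shell
# Jar name as used elsewhere in this post; falls back to the
# HADOOP_HOME path set earlier if the variable is not exported.
TEST_JAR="${HADOOP_HOME:-/home/hadoop/hadoop}/hadoop-test-0.20.2-cdh3u3.jar"

# Hypothetical helper: prints a TestDFSIO command line instead of running it.
# $1 = mode (write | read | clean), $2 = number of files, $3 = file size in MB
dfsio_cmd() {
  case "$1" in
    write|read) echo "hadoop jar $TEST_JAR TestDFSIO -$1 -nrFiles $2 -fileSize $3" ;;
    clean)      echo "hadoop jar $TEST_JAR TestDFSIO -clean" ;;
    *)          echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

dfsio_cmd write 10 1000   # write test: 10 files of 1000 MB each (assumed values)
dfsio_cmd read 10 1000    # read test over the files written above
dfsio_cmd clean           # remove the benchmark data afterwards
```

The write test has to run before the read test, because the read test reads back the files that the write test created.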

$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar mrbench -numRuns 50

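Hadoop also ships with some examples, such as WordCount and TeraSort, which are packaged in hadoop-examples-0.20.2-cdh3u3.jar. Running this jar without arguments lists all of the example programs, just as with the test jar.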

The input data does not have to be regenerated for every test: once it has been generated, later test runs can skip the first step.

$ hadoop jar hadoop-*examples*.jar teragen <number of 100-byte rows> <output dir>

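The following command runs TeraGen to generate 1 GB of input data (10,000,000 rows of 100 bytes each) and writes it to the directory /examples/terasort-input: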

$ hadoop jar $HADOOP_HOME/hadoop-examples-0.20.2-cdh3u3.jar teragen \
10000000 /examples/terasort-input
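The row count passed to teragen follows directly from the 100-byte row size. A small sketch of that arithmetic, assuming decimal gigabytes (1 GB = 10^9 bytes); `teragen_rows` is a hypothetical helper name, not part of Hadoop:

```shell
# Each TeraGen row is 100 bytes, so rows = bytes / 100.
ROW_BYTES=100

# Hypothetical helper: number of rows needed for a size given in
# decimal gigabytes (1 GB = 10^9 bytes).
teragen_rows() {
  echo $(( $1 * 1000000000 / ROW_BYTES ))
}

teragen_rows 1    # prints 10000000
teragen_rows 10   # prints 100000000
```

The result can be substituted for the row count in the teragen command, for example `teragen $(teragen_rows 10) <output dir>` for roughly 10 GB of input.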

<10 bytes key><10 bytes rowid><78 bytes filler>\r\n
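The fields in this layout add up to 100 bytes: 10 + 10 + 78, plus 2 for the trailing `\r\n`. The sketch below builds a mock row with the same field widths and counts its bytes. The field contents are placeholders rather than real TeraGen output, and `mock_row_bytes` is a hypothetical helper name.

```shell
# Build a mock row: 10-char key, 10-char row id, 78-char filler, then CR LF.
# Each field is padded with spaces to its fixed width; the contents are
# placeholders, not real TeraGen data.
mock_row_bytes() {
  printf '%-10.10s%-10.10s%-78.78s\r\n' "KEY" "ROWID" "FILLER" | wc -c | tr -d ' '
}

mock_row_bytes   # prints 100
```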

HiBench is a Hadoop benchmark suite open-sourced by Intel. It contains nine typical Hadoop workloads in five categories (micro benchmarks, HDFS benchmarks, web search benchmarks, machine learning benchmarks, and data analytics benchmarks). Its home page is https://github.com/intel-hadoop/hibench .

For most workloads, HiBench provides an option to enable or disable compression; the default compression codec is zlib.