Hadoop Benchmarking Tools

巴黎的灯光下 · Posted 2019-6-18 10:59:43

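Testing is essential for verifying a system's correctness and analyzing its performance, yet it is easy to overlook. To understand the system more fully, locate its bottlenecks, and improve its performance, I decided to start with testing and learn the main ways to test Hadoop. This article has two parts: the first records how to run the test tools that ship with Hadoop; the second covers installing and using HiBench, the Hadoop benchmark suite open-sourced by Intel.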

1. Hadoop Benchmarks

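Hadoop ships with several benchmarks, packaged in a few jar files such as hadoop-test.jar and hadoop-examples.jar, which are easy to run in a Hadoop environment. The Hadoop version used for the tests in this article is Cloudera's hadoop-0.20.2-cdh3u3.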

Before testing, set the environment variables:

$ export HADOOP_HOME=/home/hadoop/hadoop
$ export PATH=$PATH:$HADOOP_HOME/bin


The classes in a jar can then be invoked with the following command:

$ hadoop jar $HADOOP_HOME/xxx.jar


(1). Hadoop Test

When hadoop-test-0.20.2-cdh3u3.jar is invoked without arguments, it lists all the test programs:

$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar
An example program must be given as the first argument.
Valid program names are:
DFSCIOTest: Distributed i/o benchmark of libhdfs.
DistributedFSCheck: Distributed checkup of the file system consistency.
MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
TestDFSIO: Distributed i/o benchmark.
dfsthroughput: measure hdfs throughput
filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
loadgen: Generic map/reduce load generator
mapredtest: A map/reduce test check.
minicluster: Single process HDFS and MR cluster.
mrbench: A map/reduce benchmark that can create many small jobs
nnbench: A benchmark that stresses the namenode.
testarrayfile: A test for flat files of binary key/value pairs.
testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
testfilesystem: A test for FileSystem read/write.
testipc: A test for ipc.
testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
testrpc: A test for rpc.
testsequencefile: A test for flat files of binary key value pairs.
testsequencefileinputformat: A test for sequence file input format.
testsetfile: A test for flat files of binary key/value pairs.
testtextinputformat: A test for text input format.
threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill


These programs test Hadoop from several angles; TestDFSIO, mrbench and nnbench are three of the most widely used.

TestDFSIO

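TestDFSIO measures the I/O performance of HDFS. It uses a MapReduce job to perform reads and writes concurrently: each map task reads or writes one file, the map output collects statistics about the file it processed, and the reduce step aggregates those statistics and produces a summary.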
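
As a rough sketch of what a TestDFSIO run can look like, the script below writes ten files of 1000 MB each, reads them back, and then cleans up. The -write, -read, -clean, -nrFiles and -fileSize options belong to TestDFSIO itself; the file count and size are arbitrary values picked for illustration, and the DRY_RUN switch and the run helper are hypothetical conveniences added here so the commands can be previewed without a cluster.

```shell
#!/usr/bin/env bash
# Sketch: a TestDFSIO write/read/clean cycle.
# Assumptions: HADOOP_HOME as set earlier; 10 files x 1000 MB is an arbitrary size.
HADOOP_HOME="${HADOOP_HOME:-/home/hadoop/hadoop}"
TEST_JAR="$HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = only print the commands

# Hypothetical helper: print the command, execute it only when DRY_RUN=0.
run() {
  echo "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

dfsio_cycle() {
  # Write phase: each map task writes one file (-fileSize is in MB).
  run hadoop jar "$TEST_JAR" TestDFSIO -write -nrFiles 10 -fileSize 1000 || return 1
  # Read phase: read back the files written above.
  run hadoop jar "$TEST_JAR" TestDFSIO -read -nrFiles 10 -fileSize 1000 || return 1
  # Remove the benchmark data from HDFS.
  run hadoop jar "$TEST_JAR" TestDFSIO -clean
}

dfsio_cycle
```

Running the script with DRY_RUN=0 on a machine with a working Hadoop client executes the three commands instead of only printing them.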

mrbench, by contrast, runs a small job repeatedly. The following example runs a small job 50 times:

$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar mrbench -numRuns 50


Hadoop Examples

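Hadoop also ships with a number of examples, such as WordCount and TeraSort, in hadoop-examples-0.20.2-cdh3u3.jar. As with the test jar, invoking this jar without arguments lists all the example programs.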

TeraSort

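A complete TeraSort test consists of three steps:

Generate random data with TeraGen
Run TeraSort on the input data
Verify the sorted output with TeraValidate

The input data does not have to be regenerated for every test: once it has been generated, later runs can skip the first step.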
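
The three steps can be chained roughly as in the sketch below. The program names teragen, terasort and teravalidate come from the examples jar; the output and report paths, the SKIP_TERAGEN and DRY_RUN switches, and the run helper are assumptions made up for this illustration.

```shell
#!/usr/bin/env bash
# Sketch: TeraGen -> TeraSort -> TeraValidate.
# Assumptions: OUTPUT/REPORT paths, SKIP_TERAGEN, DRY_RUN and run() are illustrative.
HADOOP_HOME="${HADOOP_HOME:-/home/hadoop/hadoop}"
EXAMPLES_JAR="$HADOOP_HOME/hadoop-examples-0.20.2-cdh3u3.jar"
INPUT=/examples/terasort-input
OUTPUT=/examples/terasort-output       # hypothetical path
REPORT=/examples/terasort-validate     # hypothetical path
ROWS=10000000                          # 10,000,000 rows x 100 bytes = 1 GB
DRY_RUN="${DRY_RUN:-1}"                # 1 = only print the commands
SKIP_TERAGEN="${SKIP_TERAGEN:-0}"      # 1 = reuse previously generated input

# Hypothetical helper: print the command, execute it only when DRY_RUN=0.
run() {
  echo "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

terasort_pipeline() {
  # Step 1: generate the input (skipped when the data already exists).
  if [ "$SKIP_TERAGEN" != "1" ]; then
    run hadoop jar "$EXAMPLES_JAR" teragen "$ROWS" "$INPUT" || return 1
  fi
  # Step 2: sort the input.
  run hadoop jar "$EXAMPLES_JAR" terasort "$INPUT" "$OUTPUT" || return 1
  # Step 3: check that the output is sorted; the report goes to $REPORT.
  run hadoop jar "$EXAMPLES_JAR" teravalidate "$OUTPUT" "$REPORT"
}

terasort_pipeline
```

Setting SKIP_TERAGEN=1 reflects the note above: once the input exists, only the sort and validate steps need to be repeated.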

TeraGen is used as follows:

$ hadoop jar hadoop-*examples*.jar teragen <number of 100-byte rows> <output dir>


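The following command runs TeraGen to generate 1 GB of input data (10,000,000 rows of 100 bytes each) and writes it to the directory /examples/terasort-input: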

$ hadoop jar $HADOOP_HOME/hadoop-examples-0.20.2-cdh3u3.jar teragen \
10000000 /examples/terasort-input

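
Because every row is exactly 100 bytes, the row count for a target data size is just the size in bytes divided by 100. The helper below makes that arithmetic explicit; the function name teragen_rows is made up for this sketch, and it takes 1 GB to mean 10^9 bytes, which is what the 1 GB example above implies.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of 100-byte TeraGen rows needed for N gigabytes,
# taking 1 GB = 1,000,000,000 bytes.
teragen_rows() {
  local gb="$1"
  echo $(( gb * 1000000000 / 100 ))
}

teragen_rows 1     # prints 10000000
teragen_rows 100   # prints 1000000000
```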

Each row produced by TeraGen has the following format:

<10 bytes key><10 bytes rowid><78 bytes filler>\r\n


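where:

key is a run of random characters, each with an ASCII code in the range [32, 126]
rowid is an integer, right-aligned
filler consists of 8 groups of characters, 7 groups of 10 followed by a final group of 8 (78 bytes in total), with the characters taking values from 'A' to 'Z' in turn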
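
To make the layout concrete, the sketch below builds one row with the same field widths and checks that it comes to 100 bytes. It only imitates the format and is not TeraGen's own generator: the key comes from awk's rand(), the filler is built as seven 10-character runs plus one 8-character run that always start at 'A', and the function name make_row is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build one 100-byte row in the layout described above.
# Not TeraGen's real generator; key randomness and the filler starting at 'A'
# are simplifications made for illustration.
make_row() {
  awk -v rowid="$1" 'BEGIN {
    srand()                       # seeded from the clock, so keys repeat within a second
    key = ""
    for (i = 0; i < 10; i++)      # 10 key characters with ASCII codes 32..126
      key = key sprintf("%c", 32 + int(rand() * 95))
    filler = ""
    for (g = 0; g < 8; g++) {     # 7 runs of 10 characters, then 1 run of 8
      n = (g < 7) ? 10 : 8
      letter = sprintf("%c", 65 + g)
      for (j = 0; j < n; j++) filler = filler letter
    }
    # key (10) + right-aligned rowid (10) + filler (78) + \r\n (2) = 100 bytes
    printf "%s%10d%s\r\n", key, rowid, filler
  }'
}

make_row 42 | wc -c   # 100 bytes, including the trailing \r\n
```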

HiBench

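HiBench is a Hadoop benchmark suite open-sourced by Intel. It contains 9 typical Hadoop workloads in five categories (micro benchmarks, HDFS benchmarks, web search benchmarks, machine learning benchmarks and data analytics benchmarks). Its home page is https://github.com/intel-hadoop/hibench .

HiBench offers an option to enable or disable compression for most workloads; the default compression codec is zlib.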

Micro Benchmarks:

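Sort (sort): generates data with Hadoop RandomTextWriter and sorts it
WordCount (wordcount): counts the occurrences of each word in the input data, which is generated with Hadoop RandomTextWriter
TeraSort (terasort): the standard benchmark created by Jim Gray, the Microsoft database expert who went missing in 2007; the input data is produced by Hadoop TeraGen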

HDFS Benchmarks:

Enhanced DFSIO (dfsioe): tests the HDFS throughput of a Hadoop cluster by generating a large number of tasks that perform reads and writes simultaneously

Web Search Benchmarks:

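Nutch indexing (nutchindexing): large-scale search engine indexing is an important application of MapReduce. This workload tests the indexing subsystem of Nutch (an open-source Apache search engine) on automatically generated Web data whose links and words follow a Zipfian distribution
PageRank (pagerank): this workload contains an implementation of the PageRank algorithm on Hadoop and uses automatically generated Web data whose links follow a Zipfian distribution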

Machine Learning Benchmarks:

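Mahout Bayesian classification (bayes): large-scale machine learning is another important application of MapReduce. This workload tests the Naive Bayesian trainer in Mahout 0.7 (an open-source Apache machine learning library); the input is automatically generated documents whose words follow a Zipfian distribution
Mahout K-means clustering (kmeans): this workload tests the K-means clustering algorithm in Mahout 0.7; the input data set is produced by GenKMeansDataset based on uniform and Gaussian distributions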

Data Analytics Benchmarks:

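Hive Query Benchmarks (hivebench): this workload was developed from the SIGMOD 09 paper “A Comparison of Approaches to Large-Scale Data Analysis” and HIVE-396. It contains Hive queries that perform typical OLAP queries (Aggregation and Join) on automatically generated Web data whose links follow a Zipfian distribution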

