Performance Testing vs. Load Testing vs. Stress Testing [Chinese translation]

AlexanderIII · posted 2007-1-15 13:20:30

Source: http://agiletesting.blogspot.com ... stress-testing.html

Author: Grig Gheorghiu

Author contact: http://agiletesting.blogspot.com/

Translator: AlexanderIII

Translator contact: cnalexanderiii@gmail.com

http://blog.51testing.com/?61747

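[Translator's note] You are welcome to repost this, but please respect the author and the translator by including both names and the contact details. Thank you! If you have suggestions or find anything that seems wrong, please write to me so that we can all improve our testing skills together.
This is my first attempt at translating a fairly long testing article, so some passages may not be rendered accurately. I have marked those places with an asterisk (*); corrections are welcome.

* I have obtained the Chinese translation rights from the author, so please contact me before using the content of this post for any commercial purpose. Thank you for your cooperation.

[Original text]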
Monday, February 28, 2005
Performance vs. load vs. stress testing
Here's a good interview question for a tester: how do you define performance/load/stress testing? Many times people use these terms interchangeably, but they have in fact quite different meanings. This post is a quick review of these concepts, based on my own experience, but also using definitions from testing literature -- in particular: "Testing computer software" by Kaner et al, "Software testing techniques" by Loveland et al, and "Testing applications on the Web" by Nguyen et al.

Update July 7th, 2005

From the referrer logs I see that this post comes up fairly often in Google searches. I'm updating it with a link to a later post I wrote called 'More on performance vs. load testing'.

Performance testing

The goal of performance testing is not to find bugs, but to eliminate bottlenecks and establish a baseline for future regression testing. To conduct performance testing is to engage in a carefully controlled process of measurement and analysis. Ideally, the software under test is already stable enough so that this process can proceed smoothly.

A clearly defined set of expectations is essential for meaningful performance testing. If you don't know where you want to go in terms of the performance of the system, then it matters little which direction you take (remember Alice and the Cheshire Cat?). For example, for a Web application, you need to know at least two things:
• expected load in terms of concurrent users or HTTP connections
• acceptable response time
Once you know where you want to be, you can start on your way there by constantly increasing the load on the system while looking for bottlenecks. To take again the example of a Web application, these bottlenecks can exist at multiple levels, and to pinpoint them you can use a variety of tools:
• at the application level, developers can use profilers to spot inefficiencies in their code (for example poor search algorithms)
• at the database level, developers and DBAs can use database-specific profilers and query optimizers
• at the operating system level, system engineers can use utilities such as top, vmstat, iostat (on Unix-type systems) and PerfMon (on Windows) to monitor hardware resources such as CPU, memory, swap, disk I/O; specialized kernel monitoring software can also be used
• at the network level, network engineers can use packet sniffers such as tcpdump, network protocol analyzers such as ethereal, and various utilities such as netstat, MRTG, ntop, mii-tool
From a testing point of view, the activities described above all take a white-box approach, where the system is inspected and monitored "from the inside out" and from a variety of angles. Measurements are taken and analyzed, and as a result, tuning is done.

However, testers also take a black-box approach in running the load tests against the system under test. For a Web application, testers will use tools that simulate concurrent users/HTTP connections and measure response times. Some lightweight open source tools I've used in the past for this purpose are ab, siege, httperf. A more heavyweight tool I haven't used yet is OpenSTA. I also haven't used The Grinder yet, but it is high on my TODO list.
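
As a rough illustration of what such black-box tools do, here is a minimal Python sketch that simulates concurrent users with a thread pool and summarizes response times. It is not how ab, siege, or httperf are implemented. The names `fetch` and `run_load`, and the URL in the usage note, are hypothetical and only serve the example.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10.0):
    """Issue one GET request and return the elapsed time in seconds."""
    start = time.perf_counter()
    with urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def run_load(request_fn, concurrent_users, requests_per_user):
    """Call request_fn from several threads and summarize the timings.

    request_fn takes no arguments and returns one response time in seconds.
    """
    def one_user(_):
        # Each simulated user issues its requests one after another.
        return [request_fn() for _ in range(requests_per_user)]

    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        per_user = list(pool.map(one_user, range(concurrent_users)))

    timings = sorted(t for user in per_user for t in user)
    return {
        "requests": len(timings),
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "max": timings[-1],
    }
```

A call such as `run_load(lambda: fetch("http://localhost:8080/"), 20, 50)` would simulate 20 users issuing 50 requests each against a placeholder address. Passing the request as a callable keeps the measurement loop independent of the protocol being exercised.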

When the results of the load test indicate that performance of the system does not meet its expected goals, it is time for tuning, starting with the application and the database. You want to make sure your code runs as efficiently as possible and your database is optimized for a given OS/hardware configuration. TDD practitioners will find a framework such as Mike Clark's jUnitPerf very useful in this context: it enhances existing unit test code with load test and timed test functionality. Once a particular function or method has been profiled and tuned, developers can then wrap its unit tests in jUnitPerf and ensure that it meets performance requirements of load and timing. Mike Clark calls this "continuous performance testing". I should also mention that I've done an initial port of jUnitPerf to Python -- I called it pyUnitPerf.

If, after tuning the application and the database, the system still doesn't meet its expected goals in terms of performance, a wide array of tuning procedures is available at all the levels discussed before. Here are some examples of things you can do to enhance the performance of a Web application outside of the application code per se:
• Use Web cache mechanisms, such as the one provided by Squid
• Publish highly-requested Web pages statically, so that they don't hit the database
• Scale the Web server farm horizontally via load balancing
• Scale the database servers horizontally and split them into read/write servers and read-only servers, then load balance the read-only servers
• Scale the Web and database servers vertically, by adding more hardware resources (CPU, RAM, disks)
• Increase the available network bandwidth
Performance tuning can sometimes be more art than science, due to the sheer complexity of the systems involved in a modern Web application. Care must be taken to modify one variable at a time and redo the measurements, otherwise multiple changes can have subtle interactions that are hard to qualify and repeat.

In a standard test environment such as a test lab, it will not always be possible to replicate the production server configuration. In such cases, a staging environment is used which is a subset of the production environment. The expected performance of the system needs to be scaled down accordingly.

The cycle "run load test->measure performance->tune system" is repeated until the system under test achieves the expected levels of performance. At this point, testers have a baseline for how the system behaves under normal conditions. This baseline can then be used in regression tests to gauge how well a new version of the software performs.
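
Using the baseline in a regression test can be as simple as comparing stored response times with those from a new run. The following is a minimal sketch under stated assumptions: the metric names, the sample numbers, and the 10% tolerance are invented for illustration and are not taken from the post.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Return metrics whose response time grew by more than `tolerance`.

    baseline and current map a metric name to a response time in seconds.
    The result maps each regressed metric to (baseline_value, new_value).
    """
    regressions = {}
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is None:
            continue  # metric was not measured in this run
        if new_value > base_value * (1 + tolerance):
            regressions[name] = (base_value, new_value)
    return regressions


# Hypothetical numbers: mean response times in seconds per page.
baseline = {"login": 0.20, "search": 0.50}
new_build = {"login": 0.21, "search": 0.70}
slower = find_regressions(baseline, new_build)
```

Here `slower` contains only the `search` entry, because `login` stays within the tolerance. The tolerance absorbs run-to-run noise, and its right value depends on how repeatable the measurements are.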

Another common goal of performance testing is to establish benchmark numbers for the system under test. There are many industry-standard benchmarks such as the ones published by TPC, and many hardware/software vendors will fine-tune their systems in such ways as to obtain a high ranking in the TPC top-tens. It is common knowledge that one needs to be wary of any performance claims that do not include a detailed specification of all the hardware and software configurations that were used in that particular test.

Load testing

We have already seen load testing as part of the process of performance testing and tuning. In that context, it meant constantly increasing the load on the system via automated tools. For a Web application, the load is defined in terms of concurrent users or HTTP connections.

In the testing literature, the term "load testing" is usually defined as the process of exercising the system under test by feeding it the largest tasks it can operate with. Load testing is sometimes called volume testing, or longevity/endurance testing.

Examples of volume testing:
• testing a word processor by editing a very large document
• testing a printer by sending it a very large job
• testing a mail server with thousands of user mailboxes
• a specific case of volume testing is zero-volume testing, where the system is fed empty tasks
Examples of longevity/endurance testing:
• testing a client-server application by running the client in a loop against the server over an extended period of time
Goals of load testing:
• expose bugs that do not surface in cursory testing, such as memory management bugs, memory leaks, buffer overflows, etc.
• ensure that the application meets the performance baseline established during performance testing. This is done by running regression tests against the application at a specified maximum load.
Although performance testing and load testing can seem similar, their goals are different. On one hand, performance testing uses load testing techniques and tools for measurement and benchmarking purposes and uses various load levels. On the other hand, load testing operates at a predefined load level, usually the highest load that the system can accept while still functioning properly. Note that load testing does not aim to break the system by overwhelming it, but instead tries to keep the system constantly humming like a well-oiled machine.

In the context of load testing, I want to emphasize the extreme importance of having large datasets available for testing. In my experience, many important bugs simply do not surface unless you deal with very large entities such as thousands of users in repositories such as LDAP/NIS/Active Directory, thousands of mail server mailboxes, multi-gigabyte tables in databases, deep file/directory hierarchies on file systems, etc. Testers obviously need automated tools to generate these large data sets, but fortunately any good scripting language worth its salt will do the job.
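
As one example of such a script, the Python sketch below generates LDIF entries for a large number of test users. The base DN, the attribute values, and the function names are placeholders chosen for illustration. A real directory will have its own schema and naming rules.

```python
def ldif_entries(count, base_dn="ou=people,dc=example,dc=com"):
    """Yield `count` LDIF entries for synthetic users under base_dn."""
    for i in range(1, count + 1):
        uid = f"user{i:06d}"
        yield "\n".join([
            f"dn: uid={uid},{base_dn}",
            "objectClass: inetOrgPerson",
            f"uid: {uid}",
            f"cn: Test User {i}",
            "sn: User",
            f"mail: {uid}@example.com",
        ]) + "\n"


def write_ldif(path, count):
    """Write `count` entries to path, separated by blank lines."""
    with open(path, "w", encoding="utf-8") as out:
        for entry in ldif_entries(count):
            out.write(entry + "\n")
```

Because `ldif_entries` is a generator, `write_ldif("users.ldif", 100_000)` streams the entries to disk without holding them all in memory.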

Stress testing

Stress testing tries to break the system under test by overwhelming its resources or by taking resources away from it (in which case it is sometimes called negative testing). The main purpose behind this madness is to make sure that the system fails and recovers gracefully -- this quality is known as recoverability.

Where performance testing demands a controlled environment and repeatable measurements, stress testing joyfully induces chaos and unpredictability. To take again the example of a Web application, here are some ways in which stress can be applied to the system:
• double the baseline number for concurrent users/HTTP connections
• randomly shut down and restart ports on the network switches/routers that connect the servers (via SNMP commands for example)
• take the database offline, then restart it
• rebuild a RAID array while the system is running
• run processes that consume resources (CPU, memory, disk, network) on the Web and database servers
I'm sure devious testers can enhance this list with their favorite ways of breaking systems. However, stress testing does not break the system purely for the pleasure of breaking it, but instead it allows testers to observe how the system reacts to failure. Does it save its state or does it crash suddenly? Does it just hang and freeze or does it fail gracefully? On restart, is it able to recover from the last good state? Does it print out meaningful error messages to the user, or does it merely display incomprehensible hex codes? Is the security of the system compromised because of unexpected failures? And the list goes on.

Conclusion

I am aware that I only scratched the surface in terms of issues, tools and techniques that deserve to be mentioned in the context of performance, load and stress testing. I personally find the topic of performance testing and tuning particularly rich and interesting, and I intend to post more articles on this subject in the future.
posted by Grig Gheorghiu at 7:33 AM

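[Translation]
Here is a good interview question for a tester: how do you define performance/load/stress testing? People often use these terms interchangeably, but in fact their meanings are quite different. This post is a quick review of the three concepts, based on my own experience but also drawing on definitions from the testing literature, in particular: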
"Testing computer software" by Kaner et al
"Software testing techniques" by Loveland et al
"Testing applications on the Web" by Nguyen et al

Update July 7th, 2005
From the site's referrer logs I can see that this post comes up fairly often in Google searches, so I am adding a link to a later post I wrote, 'More on performance vs. load testing'.

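Performance testing

The goal of performance testing is not to find bugs, but to eliminate the system's bottlenecks and to establish a baseline for future regression testing. Conducting a performance test is in effect a carefully controlled process of measurement and analysis. Ideally, the software under test is already stable enough at this point for the process to proceed smoothly.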

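A clearly defined set of expectations is essential for a meaningful performance test. If you do not know where you want the system's performance to be, it matters little which direction you take*. For a Web application, for example, you need to know at least two things:
• the expected load, in terms of concurrent users or HTTP connections*
• the acceptable response time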

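Once you know your target, you can start steadily increasing the load on the system while watching for bottlenecks. To take the Web application example again, these bottlenecks can exist at several levels, and you can use a variety of tools to pinpoint them:
• at the application level, developers can use profilers to spot inefficient code, for example poor search algorithms
• at the database level, developers and database administrators (DBAs) can use database-specific profilers and query optimizers*
• at the operating system level, system engineers can use utilities such as top, vmstat, and iostat on Unix-type systems and PerfMon on Windows to monitor hardware resources such as CPU, memory, swap, and disk I/O; specialized kernel monitoring software can also be used at this level
• at the network level, network engineers can use packet sniffers (such as tcpdump), network protocol analyzers (such as ethereal), and other utilities (such as netstat, MRTG, ntop, mii-tool)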

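From a testing point of view, all of the activities described above take a white-box approach: the system is inspected and monitored from the inside out and from many angles. Once the measurements* have been taken and analyzed, tuning the system is the natural next step.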

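However, (in addition to the above) testers also take a black-box approach when they run load trials* against the system under test (translator's note: "load test" is rendered here as 负载试验, "load trial", so that it is not confused with the concept of load testing as we understand it). For a Web application, testers can use tools that simulate concurrent users or HTTP connections and measure response times. Lightweight open source load-testing tools I have used in the past are ab, siege, and httperf. A more heavyweight tool is OpenSTA, which I have not used. I have not used The Grinder yet either, but it is high on my to-do list.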

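When the results of the load trials* show that the system's performance does not meet its expected goals, it is time to tune the application and the database. You want to make sure your code runs as efficiently as possible and that the database is optimized for the given operating system and hardware configuration. Practitioners of test-driven development (TDD) will find a framework such as Mike Clark's jUnitPerf* very useful in this context*: it enhances existing unit test code with load test* and timed test functionality*. Once a particular function or method has been profiled* and tuned, developers can wrap its unit tests* in jUnitPerf to ensure that it meets the performance requirements for load and timing. Mike Clark calls this "continuous performance testing". I should also mention that I have done an initial port of jUnitPerf to Python, which I called pyUnitPerf.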

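If the system still does not meet its performance goals after the application and the database have been tuned, a range of other tuning procedures* is available at each of the levels discussed above. Here are some examples of what you can do to improve the performance of a Web application outside of the application code* itself:
• Use Web cache mechanisms, such as the one provided by Squid
• Publish highly requested Web pages statically, so that they do not hit the database
• Scale the Web server farm horizontally via load balancing*
• Scale the database servers horizontally, split them into read/write servers and read-only servers, then load balance the read-only servers*
• Scale the Web and database servers vertically by adding more hardware resources (CPU, RAM, disks)
• Increase the available network bandwidth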

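Because modern Web applications are such complex systems, performance tuning can sometimes be more art than science. Take care to change only one variable at a time and then redo the measurements; otherwise multiple changes can interact in subtle ways that are hard to pin down and reproduce*.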

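In a standard test environment such as a test lab, it is not always possible to replicate the production server configuration. In such cases a staging environment, which is a subset of the production environment, is used instead, and the expected performance of the system needs to be scaled down accordingly.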

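The cycle "run load trial*->measure performance->tune system" is repeated until the system under test reaches the expected level of performance. At that point testers know how the system behaves under normal conditions, and this baseline can then be used in regression tests to judge how well a new version of the software performs.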

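Another goal of performance testing is to establish benchmark numbers for the system under test. There are many industry-standard benchmarks, such as those published by TPC, and many hardware and software vendors carefully tune their systems to rank high in the TPC lists. One should therefore be wary of any performance claim that does not come with a detailed specification of all the hardware and software configurations used in the test*.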

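Load testing
We have already met load testing as part of the process of performance testing and tuning. In that context it meant steadily increasing the load on the system by means of automated tools. For a Web application, the load is the number of concurrent users or HTTP connections.
In the testing literature, the term "load testing" is usually defined as the process of feeding the system under test the largest tasks it can handle. Load testing is sometimes also called "volume testing" or "longevity/endurance testing"*.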

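Examples of volume testing:
• testing a word processor by editing a very large document
• testing a printer by sending it a very large job
• testing a mail server with thousands of user mailboxes
• a special case of volume testing is "zero-volume testing", where the system is fed empty tasks
Examples of longevity/endurance testing:
• running the client in a loop against the server over an extended period of time*
Goals of load testing:
• find bugs that did not surface in the cursory testing done earlier in the test process, such as memory management bugs, memory leaks, buffer overflows, etc.
• ensure that the application meets the performance baseline established during performance testing. This is done by running regression tests against the application at a specified maximum load.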

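Although performance testing and load testing can seem very similar, their goals are different. On the one hand, performance testing uses load testing techniques and tools, at various load levels, to measure and benchmark the system. On the other hand, load testing operates at a predefined load level, usually the highest load under which the system still functions properly. Note that load testing does not aim to break the system by overwhelming it, but to keep it humming like a well-oiled machine*.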

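In the context of load testing, I want to stress how important it is to have large datasets available for testing. In my experience, many serious bugs simply do not surface unless you test with very large data*: thousands of users in repositories such as LDAP/NIS/Active Directory, thousands of mailboxes on a mail server, multi-gigabyte tables in databases, deep file and directory hierarchies on file systems, and so on. Testers obviously need automated tools to generate these large data sets, but fortunately any good scripting language is up to the job.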

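Stress testing

Stress testing tries to break the system under test by overwhelming its resources or by taking away resources it needs to run normally (in which case it is sometimes called negative testing). The main purpose behind this madness is to make sure that the system fails and recovers gracefully; this quality is known as recoverability.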

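Where performance testing needs a controlled environment and repeatable measurements, stress testing joyfully induces chaos and unpredictability (translator's note: this shows that the author is an excellent tester). To take the Web application example again, here are some ways of applying stress to the system:
• double the baseline number of concurrent users or HTTP connections
• randomly shut down and restart ports on the network switches/routers that connect the servers (via SNMP commands, for example)
• take the database offline, then restart it
• rebuild a RAID array while the system is running
• run processes that consume resources (CPU, memory, disk, network) on the Web and database servers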

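I am sure that devious testers can extend this list with their own favorite ways of breaking systems. However, stress testing does not break the system purely for the pleasure of breaking it; it lets testers observe how the system reacts to failure. Does it save its state, or does it crash suddenly? Does it just hang and freeze, or does it fail gracefully*? On restart, can it recover to the last good state? Does it show the user meaningful error messages, or only incomprehensible hex codes? Is the security of the system compromised by unexpected failures? And the list goes on.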

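Conclusion
I am aware that I have only touched on the more basic issues, tools, and techniques that deserve mention in the context of performance, load, and stress testing. I personally find performance testing and tuning a very interesting topic, and I intend to post more on this subject in the future.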

[ Last edited by AlexanderIII on 2007-1-17 20:26 ]