Chapter 14 – Test Execution

 

patterns & practices Developer Center

Performance Testing Guidance for Web Applications

J.D. Meier, Carlos Farre, Prashant Bansode, Scott Barber, and Dennis Rea
Microsoft Corporation

September 2007

Objectives

  • Understand common principles and considerations of performance test execution.
  • Understand the common activities of performance test execution.

Overview

Performance test execution is the activity that occurs between developing test scripts and reporting and analyzing test results. Much of the performance testing–related training available today treats this activity as little more than starting a test and monitoring it to ensure that the test appears to be running as expected. In reality, this activity is significantly more complex than just clicking a button and monitoring machines. This chapter addresses these complexities based on numerous real-world project experiences.

How to Use this Chapter

Use this chapter to understand the key principles and considerations underlying performance test execution and the various activities that it entails. To get the most from this chapter:

  • Use the “Approach for Test Execution” section to get an overview of the approach for performance test execution and as a quick reference guide for you and your team.
  • Use the various activity sections to understand the details of each activity involved in performance test execution.

Approach for Test Execution

The following activities are involved in performance test execution:

  • Validate the test environment
  • Validate tests
  • Run tests
  • Baseline and benchmark
  • Archive tests

The following sections discuss each of these activities in detail.

Validate the Test Environment

The goal is for the test environment to mirror your production environment as closely as possible. Typically, any differences between the test and production environments are noted and accounted for while designing tests. Before running your tests, it is important to validate that the test environment matches the configuration that you were expecting and/or designed your test for. If the test environment is even slightly different from the environment you designed your tests to be run against, there is a high probability that your tests might not work at all, or worse, that they will work but will provide misleading data.

The following activities frequently prove valuable when validating a test environment:

  • Ensure that the test environment is correctly configured for metrics collection.
  • Turn off any active virus-scanning on load-generating machines during testing, to minimize the likelihood of unintentionally skewing results data as a side-effect of resource consumption by the antivirus/anti-spyware software.
  • Consider simulating background activity, when necessary. For example, many servers run batch processing during predetermined time periods, while servicing users’ requests. Not accounting for such activities in those periods may result in overly optimistic performance results.
  • Run simple usage scenarios to validate the Web server layer first if possible, separately from other layers. Run your scripts without think times. Try to run a scenario that does not include database activity. Inability to utilize 100 percent of the Web server’s processor can indicate a network problem or that the load generator clients have reached their maximum output capacity.
  • Run simple usage scenarios that are limited to reading data to validate database scenarios. Run your script without think times. Use test data feeds to simulate randomness. For example, query for a set of products. Inability to utilize 100 percent of the database server’s processor can indicate a network problem or that the load-generator clients have reached their maximum output capacity.
  • Validate the test environment by running more complex usage scenarios with updates and writes to the database, using a mix of test scripts that simulate business actions.
  • In Web farm environments, check to see if your load tests are implementing Internet Protocol (IP) switching. Not doing so may cause IP affinity, a situation where all of the requests from the load-generation client machine are routed to the same server rather than being balanced across all of the servers in the farm. IP affinity leads to inaccurate load test results because other servers participating in the load balancing will not be utilized.
  • Work with key performance indicators (KPIs) on all the servers to assess your test environment (processor, network, disk, and memory). Include all servers in the cluster to ensure correct evaluation of your environment.
  • Consider spending time creating data feeds for the test application. For example, populate database tables with production-like data (such as number of users, products, and orders shipped) so that you can create conditions similar to production and replicate problems in critical usage scenarios. Many scenarios involve running queries against tables containing several thousand entries, to simulate lock timeouts or deadlocks.
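
The IP affinity check mentioned in the list above can be sketched as a small script. This is a minimal illustration, not a feature of any particular load-generation tool: it assumes you can extract, for each completed request, the name of the farm server that handled it (the `served_by` input is hypothetical), and the `tolerance` threshold is an arbitrary starting point.

```python
from collections import Counter


def affinity_report(served_by, farm_servers, tolerance=0.5):
    """Return each server's share of requests and the servers that look underused.

    served_by: iterable of server names, one per completed request
               (hypothetical field taken from your own logs).
    farm_servers: every server that is expected to receive traffic.
    tolerance: a server is flagged when its share falls below
               tolerance * (1 / number of servers).
    """
    if not farm_servers:
        raise ValueError("farm_servers must not be empty")
    counts = Counter(served_by)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no requests recorded")
    even_share = 1.0 / len(farm_servers)
    shares = {s: counts.get(s, 0) / total for s in farm_servers}
    underused = [s for s in farm_servers if shares[s] < tolerance * even_share]
    return shares, underused
```

If one server receives nearly all of the requests and the others are flagged as underused, the load generator is probably not switching IP addresses.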

Additional Considerations

Consider the following key points when troubleshooting performance-testing environments:

  • Look for problems in the load-generation clients from which load is simulated. Client machines often produce inaccurate performance-testing results due to insufficient processor or memory resources. Consider adding more client computers to compensate for fast transactions that may cause higher processor utilization; also consider using more memory when this becomes the bottleneck. Memory can be consumed when test data feeds are cached in load generators, or by more complex scripting in load tests.
  • Some network interface cards (NICs), when set to auto mode, fail to negotiate proper full-duplex mode with switches. The result is that the NICs operate in half-duplex mode, which causes inaccurate performance-testing results. A typical perimeter network with a Web server and database server in different layers will be deployed with the Web server having two NICs, one facing your clients and another using a different route to communicate with the database layer. Be aware that having a single NIC in the Web server facing both the clients and the database tier may cause a network bottleneck.
  • The database server in the production environment may be using separate hard drives for log files and data files associated with the database as a matter of policy. Replicate such deployment configurations to avoid inaccurate performance-testing results. Consider that if DNS is not properly configured, it might cause broadcast messages to be sent when opening database connections by using the database server name. Name-resolution issues may cause connections to open slowly.
  • Improper data feeds consumed by your scripts will frequently cause you to overlook problems with the environment. For example, low processor activity may be caused by artificial locking due to scripts querying the same record from the database. Consider creating test data feeds that simulate the correct business actions, accounting for variability of data sent from the post request. Load-generation tools may use a central repository such as a database or files in a directory structure to collect performance test data. Make sure that the data repository is located on a machine that will not cause traffic in the route used by your load-generation tools; for example, put the data repository in the same virtual local-area network (VLAN) as the machine used to manage data collection.
  • Load-generation tools may require the use of special accounts between load-generator machines and the computers that collect performance data. Make sure that you set such configurations correctly. Verify that data collection is occurring in the test environment, taking into consideration that the traffic may be required to pass through a firewall.

Validate Tests

Poor load simulations can render all previous work useless. To understand the data collected from a test run, the load simulation must accurately reflect the test design. When the simulation does not reflect the test design, the results are prone to misinterpretation. Even if your tests accurately reflect the test design, there are still many ways that the test can yield invalid or misleading results. Although it may be tempting to simply trust your tests, it is almost always worth the time and effort to validate the accuracy of your tests before you need to depend on them to provide results intended to assist in making the “go-live” decision. It may be useful to think about test validation in terms of the following four categories:

  • Test design implementation.  To validate that you have implemented your test design accurately (using whatever method you have chosen), you will need to run the test and examine exactly what the test does.
  • Concurrency.  After you have validated that your test conforms to the test design when run with a single user, run the test with several users. Ensure that each user is seeded with unique data, and that users begin their activity within a few seconds of one another, but not all at the same second, as this is likely to create an unrealistically stressful situation that would add complexity to validating the accuracy of your test design implementation. One method of validating that tests run as expected with multiple users is to use three test runs: one with 3 users, one with 5 users, and one with 11 users. These three tests have a tendency to expose many common issues with both the configuration of the test environment (such as a limited license being installed on an application component) and the test itself (such as parameterized data not varying as intended).
  • Combinations of tests.  Having validated that a test runs as intended with a single user and with multiple users, the next logical step is to validate that the test runs accurately in combination with other tests. Generally, when testing performance, tests get mixed and matched to represent various combinations and distributions of users, activities, and scenarios. If you do not validate that your tests have been both designed and implemented to handle this degree of complexity prior to running critical test projects, you can end up wasting a lot of time debugging your tests or test scripts when you could have been collecting valuable performance information.
  • Test data validation.  Once you are satisfied that your tests are running properly, the last critical validation step is to validate your test data. Performance testing can utilize and/or consume large volumes of test data, thereby increasing the likelihood of errors in your dataset. In addition to the data used by your tests, it is important to validate that your tests share that data as intended, and that the application under test is seeded with the correct data to enable your tests.
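
The concurrency step above (unique data per user, staggered starts, and runs with 3, 5, and 11 users) might be planned with a helper like the following. This is only a sketch under stated assumptions: `plan_users`, its parameters, and the five-second stagger window are illustrative names and values, not part of any load-generation tool.

```python
import random


def plan_users(user_count, data_rows, max_stagger_s=5.0, seed=0):
    """Give each virtual user its own data row and a start offset in seconds.

    Raises ValueError when there are not enough rows for every user to be
    seeded with unique data.
    """
    if user_count > len(data_rows):
        raise ValueError("not enough unique data rows for every user")
    rng = random.Random(seed)  # fixed seed so a validation run can be repeated
    return [
        {
            "user": i,
            "data": data_rows[i],
            "start_offset_s": round(rng.uniform(0.0, max_stagger_s), 2),
        }
        for i in range(user_count)
    ]


if __name__ == "__main__":
    rows = [f"account_{n}" for n in range(20)]  # hypothetical test data
    for users in (3, 5, 11):
        print(users, plan_users(users, rows))
```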

Dynamic Data

The following are technical reasons for using dynamic data correctly in load test scripts:

  • Using the same data value causes artificial usage of caching because the system will retrieve data from copies in memory. This can happen throughout different layers and components of the system, including databases, file caches of the operating systems, hard drives, storage controllers, and buffer managers. Reusing data from the cache during performance testing might account for faster testing results than would occur in the real world.
  • Some business scenarios require a relatively small range of data selection. In such a case, the more frequent cache reuse is realistic, and the narrow data range will also surface other performance-related problems, such as database deadlocks and slower response times due to timeouts caused by queries to the same items. This type of scenario is typical of marketing campaigns and seasonal sales events.
  • Some business scenarios require using unique data during load testing; for example, if the server returns session-specific identifiers during a session after login to the site with a specific set of credentials. Reusing the same login data would cause the server to return a bad session identifier error. Another frequent scenario is when the user enters a unique set of data, or the system fails to accept the selection; for example, registering new users that would require entering a unique user ID on the registration page.
  • In some business scenarios, you need to control the number of parameterized items; for example, a caching component that needs to be tested for its memory footprint to evaluate server capacity, with a varying number of products in the cache.
  • In some business scenarios, you need to reduce the script size or the number of scripts; for example, several instances of an application will live in one server, reproducing a scenario where an independent software vendor (ISV) will host them. In this scenario, the Uniform Resource Locators (URLs) need to be parameterized during load test execution for the same business scenarios.
  • Using dynamic test data in a load test tends to reproduce more complicated and time-sensitive bugs; for example, a deadlock encountered as a result of performing different actions using different user accounts.
  • Using dynamic test data in a load test allows you to use error values if they suit your test plan. For example, if an ID is always a positive number in normal use, it may be beneficial to supply zero or negative values to simulate hacker behavior or to replicate application errors, such as scanning the database table when an invalid value is supplied.
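
Two of the points above, unique values for registration-style scenarios and a controlled range of parameterized items, can be illustrated with a small data feed. This is a minimal sketch; the function names, the `loadtest_user` prefix, and the pool sizes are assumptions chosen for illustration.

```python
import itertools
import random


def unique_user_ids(prefix="loadtest_user", start=1):
    """Yield user IDs that never repeat within a run (for example, new-user registration)."""
    for n in itertools.count(start):
        yield f"{prefix}_{n:06d}"


def product_picker(product_ids, pool_size, seed=0):
    """Return a function that picks product IDs from the first pool_size items only.

    A small pool models a campaign that concentrates traffic on a few items;
    a large pool reduces artificial cache hits from repeated values.
    """
    if not 1 <= pool_size <= len(product_ids):
        raise ValueError("pool_size must be between 1 and the number of products")
    pool = list(product_ids[:pool_size])
    rng = random.Random(seed)
    return lambda: rng.choice(pool)
```

Varying `pool_size` between runs is one way to control the number of parameterized items when evaluating a cache's memory footprint.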

Test Validation

The following are some commonly employed methods of test validation, which are frequently used in combination with one another:

  • Run the test first with a single user only. This makes initial validation much less complex.
  • Observe your test while it is running and pay close attention to any behavior you feel is unusual. Your instincts are usually right, or at least valuable.
  • Use the system manually during test execution so that you can compare your observations with the results data at a later time.
  • Make sure that the test results and collected metrics represent what you intended them to represent.
  • Check to see if any of the parent requests or dependent requests failed. 
  • Check the content of the returned pages, as load-generation tools sometimes report summary results that appear to “pass” even though the correct page or data was not returned.
  • Run a test that loops through all of your data to check for unexpected errors.
  • If appropriate, validate that you can reset test and/or application data following a test run.
  • At the conclusion of your test run, check the application database to ensure that it has been updated (or not) according to your test design. Consider that many transactions in which the Web server returns a success status with a “200” code might be failing internally; for example, errors due to a previously used user name in a new user registration scenario, or an order number that is already in use.
  • Consider cleaning the database entries between error trials to eliminate data that might be causing test failures; for example, order entries that you cannot reuse in subsequent test execution.
  • Run tests in a variety of combinations and sequences to ensure that one test does not corrupt data needed by another test in order to run properly.
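
Checking page content, rather than trusting a “200” status alone, can be sketched as follows. The function and the marker strings are hypothetical; in practice the expected and error texts come from your own application.

```python
def validate_response(status, body, must_contain, must_not_contain=()):
    """Return a list of problems found in a response; an empty list means it looks valid."""
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    for marker in must_contain:
        if marker not in body:
            problems.append(f"missing expected text: {marker!r}")
    for marker in must_not_contain:
        # A page can return 200 and still carry an application error message.
        if marker in body:
            problems.append(f"found error text: {marker!r}")
    return problems
```

For example, a registration response with status 200 whose body contains “user name already in use” would be reported as a failure even though the Web server returned a success code.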

Additional Considerations

Consider the following additional points when validating your tests:

  • Do not use performance results data from your validation test runs as part of your final report.
  • Report performance issues uncovered during your validation test runs.
  • Use appropriate load-generation tools to create a load that has the characteristics specified in your test design.
  • Ensure that the intended performance counters for identified metrics and resource utilization are being measured and recorded, and that they are not interfering with the accuracy of the simulation.
  • Run other tests during your performance test to ensure that the simulation is not impacting other parts of the system. These other tests may be either automated or manual.
  • Repeat your test, adjusting variables such as user names and think times to see if the test continues to behave as anticipated.
  • Remember to simulate ramp-up and cool-down periods appropriately.
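
As an illustration of ramp-up and cool-down, the following sketch computes a simple trapezoidal load profile: users increase linearly, hold at a peak, and then decrease linearly. The function name, parameters, and linear shape are assumptions; your test design may call for a different shape.

```python
def load_profile(peak_users, ramp_up_s, steady_s, cool_down_s, step_s=10):
    """Return (elapsed_seconds, target_users) pairs for a trapezoidal load profile."""
    total = ramp_up_s + steady_s + cool_down_s
    points = []
    for t in range(0, total + 1, step_s):
        if t < ramp_up_s:
            users = peak_users * t / ramp_up_s            # ramp-up
        elif t <= ramp_up_s + steady_s:
            users = peak_users                            # steady state
        else:
            users = peak_users * (total - t) / cool_down_s  # cool-down
        points.append((t, round(users)))
    return points
```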

Questions to Ask

  • What additional team members should be involved in evaluating the accuracy of this test?
  • Do the preliminary results make sense?
  • Is the test providing the data we expected?

Run Tests

Although the process and flow of running tests are extremely dependent on your tools, environment, and project context, there are some fairly universal tasks and considerations to keep in mind when running tests.

Once it has been determined that the application under test is in an appropriate state to have performance tests run against it, the testing generally begins with the highest-priority performance test that can reasonably be completed based on the current state of the project and application. After each test run, compile a brief summary of what happened during the test and add these comments to the test log for future reference. These comments may address machine failures, application exceptions and errors, network problems, or exhausted disk space or logs. After completing the final test run, ensure that you have saved all of the test results and performance logs before you dismantle the test environment.

Whenever possible, limit tasks to one to two days each to ensure that no time will be lost if the results from a particular test or battery of tests turn out to be inconclusive, or if the initial test design needs modification to produce the intended results. One of the most important tasks when running tests is to remember to modify the tests, test designs, and subsequent strategies as results analysis leads to new priorities.

A widely recommended guiding principle is: Run test tasks in one- to two-day batches. See the tasks through to completion, but be willing to take important detours along the way if an opportunity to add additional value presents itself.

Keys to Efficiently and Effectively Running Tests

In general, the keys to efficiently and effectively running tests include:

  • Revisit performance-testing priorities after no more than two days.
  • Remember to capture and use a performance baseline.
  • Plan to spend some time fixing application errors, or debugging the test.
  • Analyze results immediately so that you can modify your test plan accordingly.
  • Communicate test results frequently and openly across the team.
  • Record results and significant findings.
  • Record other data needed to repeat the test later.
  • At appropriate points during test execution, stress the application to its maximum capacity or user load, as this can provide extremely valuable information.
  • Remember to validate application tuning or optimizations.
  • Consider evaluating the effect of application failover and recovery.
  • Consider measuring the effects of different system configurations.

Additional Considerations

Consider the following additional points when running your tests:

  • Performance testing is frequently conducted on an isolated network segment to prevent disruption of other business operations. If this is not the case for your test project, ensure that you obtain permission to generate loads during certain hours on the available network.
  • Before running the real test, consider executing a quick “smoke test” to make sure that the test script and remote performance counters are working correctly.
  • If you choose to execute a smoke test, do not report the results as official or formal parts of your testing.
  • Reset the system (unless your scenario is to do otherwise) before running a formal test.
  • If at all possible, execute every test twice. If the results produced are not very similar, execute the test again. Try to determine what factors account for the difference.
  • No matter how far in advance a test is scheduled, give the team 30-minute and 5-minute warnings before launching the test (or starting the day’s testing). Inform the team whenever you are not going to be executing for more than one hour in succession.
  • Do not process data, write reports, or draw diagrams on your load-generating machine while generating a load because this can corrupt the data.
  • Do not throw away the first iteration because of script compilation or other reasons. Instead, measure this iteration separately so you will know what the first user after a system-wide reboot can expect.
  • Test execution is never really finished, but eventually you will reach a point of diminishing returns on a particular test. When you stop obtaining valuable information, change your test.
  • If neither you nor your development team can figure out the cause of an issue in twice as much time as it took the test to execute, it may be more efficient to eliminate one or more variables/potential causes and try again.
  • If your intent is to measure performance related to a particular load, it is important to allow time for the system to stabilize between increases in load to ensure the accuracy of measurements.
  • Make sure that the client computers (also known as load-generation client machines) that you use to generate load are not overly stressed. Utilization of resources such as processor and memory should remain low enough to ensure that the load-generation environment is not itself a bottleneck.
  • Analyze results immediately and modify your test plan accordingly.
  • Work closely with the team or team sub-set that is most relevant to the test.
  • Communicate test results frequently and openly across the team.
  • If you will be repeating the test, consider establishing a test data restore point before you begin testing.
  • In most cases, maintaining a test execution log that captures notes and observations for each run is invaluable.
  • Treat workload characterization as a moving target. Adjust new settings for think times and number of users to model the new total number of users for normal and peak loads.
  • Observe your test during execution and pay close attention to any behavior you feel is unusual. Your instincts are usually right, or at least valuable.
  • Ensure that performance counters relevant for identified metrics and resource utilization are being measured and are not interfering with the accuracy of the simulation.
  • Use the system manually during test execution so that you can compare your observations with the results data at a later time.
  • Remember to simulate ramp-up and cool-down periods appropriately.
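
The advice above to execute every test twice and compare the results can be expressed as a simple check. This sketch compares only mean response times, and the 10 percent threshold is an assumed value; what counts as “very similar” depends on your project.

```python
from statistics import mean


def runs_are_similar(run_a, run_b, max_relative_diff=0.10):
    """Compare the mean response times of two runs of the same test.

    Returns (similar, relative_diff), where relative_diff is the difference
    between the two means divided by the larger mean.
    """
    mean_a, mean_b = mean(run_a), mean(run_b)
    largest = max(mean_a, mean_b)
    if largest == 0:
        return True, 0.0
    relative_diff = abs(mean_a - mean_b) / largest
    return relative_diff <= max_relative_diff, relative_diff
```

If the check fails, run the test a third time and look for the factors that account for the difference before reporting any of the results.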

Questions to Ask

  • Have recent test results or project updates made this task more or less valuable compared to other tests we could be conducting right now?
  • What additional team members should be involved with this task?
  • Do the preliminary results make sense?

Baseline and Benchmark

When baselines and benchmarks are used, they are generally the first and last tests you will execute, respectively. Of all the tests that may be executed during the course of a project, it is most important that baselines and benchmarks be well understood and controlled, making the validations discussed above even more important.

Baselines

Creating a baseline is the process of running a set of tests to capture performance metric data for the purpose of evaluating the effectiveness of subsequent performance-improving changes to the system or application.

With respect to Web applications, you can use a baseline to determine whether performance is improving or declining and to find deviations across builds and versions. For example, you could measure load time, number of transactions processed per unit of time, number of Web pages served per unit of time, and resource utilization such as memory and processor usage. Some considerations about using baselines include:

  • A baseline can be created for a system, component, or application.
  • A baseline can be created at different layers: database, Web services, etc.
  • A baseline can be used as a standard for comparison to track future optimizations or regressions. When using a baseline for this purpose, it is important to validate that the baseline tests and results are well understood and repeatable.
  • Baselines can help product teams articulate variances that represent degradation or optimization during the course of the development life cycle by providing a known starting point for trend analysis. Baselines are most valuable if created using a set of reusable test assets; it is important that such tests are representative of workload characteristics that are both repeatable and provide an appropriately accurate simulation.
  • Baseline results can be articulated by using combinations of a broad set of key performance indicators such as response time, processor, memory, disk, and network.
  • Sharing baseline results across the team establishes a common foundation of information about performance characteristics to enable future communication about performance changes in an application or component.
  • A baseline is specific to an application and is most useful for comparing performance across different builds, versions, or releases.
  • Establishing a baseline before making configuration changes almost always saves time because it enables you to quickly determine what effect the changes had on the application’s performance.
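
Using a baseline to track optimizations and regressions across builds can be sketched as a metric-by-metric comparison. The metric names, the 5 percent threshold, and the `higher_is_better` set are assumptions for illustration; the baseline values are assumed to be nonzero.

```python
def compare_to_baseline(baseline, current, higher_is_better, threshold=0.05):
    """Classify each metric as 'improved', 'regressed', or 'unchanged'.

    baseline, current: dicts mapping metric name to a measured value.
    higher_is_better: names of metrics where a larger value is better
                      (for example, throughput); all others are treated
                      as lower-is-better (for example, response time).
    """
    report = {}
    for name, base_value in baseline.items():
        change = (current[name] - base_value) / base_value
        if abs(change) <= threshold:
            verdict = "unchanged"
        elif (change > 0) == (name in higher_is_better):
            verdict = "improved"
        else:
            verdict = "regressed"
        report[name] = (round(change, 3), verdict)
    return report
```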

Benchmarking

Benchmarking is the process of comparing your system performance against an industry standard that is endorsed by some other organization.

From the perspective of Web application development, benchmarking involves running a set of tests that comply with the specifications of an industry benchmark to capture the performance metrics for your application necessary to determine its benchmark score. You can then compare your application against other systems or applications that have also calculated their score for the same benchmark. You may choose to tune your application performance to achieve or surpass a certain benchmark score. Some considerations about benchmarking include:

  • A benchmark score is achieved by working within industry specifications or by porting an existing implementation to comply with those specifications.
  • Benchmarking generally requires identifying all of the necessary components that will run together, the market where the product exists, and the specific metrics to measure.
  • Benchmark scores can be published publicly and may result in comparisons being made by competitors. Performance metrics that may be included along with benchmark scores include response time, transactions processed per unit of time, Web pages accessed per unit of time, processor usage, memory usage, and search times.

Archive Tests

Some degree of change control or version control can be extremely valuable for managing scripts, scenarios, and/or data changes between each test execution, and for communicating these differences to the rest of the team. Some teams prefer to check their test scripts, results, and reports into the same version-control system as the build of the application to which they apply. Other teams simply save copies into dated folders on a periodic basis, or have their own version-control software dedicated to the performance team. It is up to you and your team to decide what method is going to work best for you, but in most cases archiving tests, test data, and test results saves much more time than it takes over the course of a performance-testing project.

Additional Considerations

Consider the following additional points when archiving your tests:

  • You can use archived test scripts, data, and results to create the baseline for the next version of your product. Archiving this information together with the build of the software that was tested satisfies many auditability standards.
  • In most cases, performance test scripts are improved or modified with each new build. If you do not save a copy of the script and identify the build it was used against, you can end up doing a lot of extra work to get your scripts running again in the case of a build rollback.
  • With the overwhelming majority of load-generation tools, implementing the test is a minor software-development effort in itself. While this effort generally does not need to follow all of the team’s standards and procedures for software development, it is a good idea to adopt a sound and appropriately “weighted” development process for performance scripts that complements or parallels the process your development team employs.
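
For teams that save copies into dated folders, the following sketch shows one way to keep scripts, data, and results together with the build they were run against. The folder naming scheme, the `manifest.json` file, and the `build_id` value are assumptions, not a prescribed layout.

```python
import json
import shutil
from datetime import date
from pathlib import Path


def archive_run(files, archive_root, build_id, run_date=None):
    """Copy test assets into <archive_root>/<date>_<build_id>/ and write a manifest."""
    run_date = run_date or date.today()
    target = Path(archive_root) / f"{run_date.isoformat()}_{build_id}"
    target.mkdir(parents=True, exist_ok=False)  # refuse to overwrite an earlier archive
    copied = []
    for src in map(Path, files):
        shutil.copy2(src, target / src.name)
        copied.append(src.name)
    manifest = {"build_id": build_id, "date": run_date.isoformat(), "files": copied}
    (target / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return target
```

Recording the build identifier alongside the scripts makes it easier to find the matching script version after a build rollback.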

Summary

Performance test execution involves activities such as validating test environments/scripts, running the test, and generating the test results. It can also include creating baselines and/or benchmarks of the performance characteristics.

It is important to validate the test environment to ensure that the environment truly represents the production environment.

Validate test scripts to check whether the correct metrics are being collected and whether the test script design correctly simulates workload characteristics.
