29 October 2008

Test Automation for Complex Systems

A while back the software development team I was working in was struggling with late-stage severity 1 bugs. There were about 3M lines of C/C++ server code, 10 client apps and 300 developers. The code had evolved over 10-15 years; it was rich in functionality and much of it was timing critical. It therefore came as no surprise that the code had some problems and that the problems were hard to fix.

First some questions had to be answered.

Q: What was the problem to be solved? 
A: To find and fix most severity 1 bugs in early stages of development.

Q: Why was this not being done now? 
A: Complexity. The combinatorics of the system were daunting. It had the benefit of a comprehensive middleware layer that cleanly separated clients from the server. However, the middleware layer was large: approx 1,000 APIs, all with multiple arguments. The server processed complex files specified by multiple complex languages, and 100+ meta-data attributes with an average of 4 values each could be applied to each file. The server was multi-threaded to handle multiple users and to pipeline processing of jobs, and the threads shared data and state so interactions were important. The system also had a large real-time device control component.

Q: What was there to work with?
A: Working code. The system had bugs but it worked. Many parts of the code were well-written. It was possible that the code could be bug fixed to an acceptable level of stability without a major re-write.

Q: What was the best starting point? 
A: The middleware layer. All the company's client apps communicated with the server through the middleware layer, so it was theoretically possible to test any customer-visible scenario by scripting the middleware layer.

Those questions and answers appeared to narrow the scope of the problem as much as it was ever going to be narrowed by theoretical analysis. Therefore we put together a prototype test framework to see what was achievable in practice. The late stage severity 1 bugs were very costly so time was of the essence.

The Design Goals

  • Test as close as possible to code submission time.
  • Get started as early as possible. The goal was to fix bugs. The earlier we started, the more we could fix.
  • Test as much user-exposed functionality as possible as quickly as possible.
    • Test automatically.
    • Get as many test systems as possible. Either re-deploy existing systems or purchase new ones.
    • Cover functionality as efficiently as possible.
The First Implementation
  • Write test cases as programs that use the middleware layers.
  • Try to save development time by using an existing automated test framework. We quickly found that there were no frameworks that helped much with our 3 main requirements:
    • Integrate with our source code management (SCM) system and our build systems.
    • Install our code on our test machines
    • Invoke our middle-ware based test programs

  • Get started early by writing test programs that exercise well-known functionality
  • Create a simple test framework to learn about writing automated test frameworks. This framework would
    • Trigger from SCM check-ins.
    • Invoke the build system.
    • Install built code on test system.
    • Run tests.
    • Save results.
    • Display results.

  • Run reliably on 10 test machines running 24x7.
  • Run fully enqueued.
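The steps listed above can be sketched as a small queue-driven loop. This is only an illustration in Python, not the framework described in this post: the stage functions (`build`, `install`, `run_tests`) and the check-in names are hypothetical stand-ins, and a real system would call the SCM, the build system and the test machines at those points.

```python
import queue
from dataclasses import dataclass


@dataclass
class Result:
    checkin: str
    last_stage: str  # the stage that failed, or the final stage on success
    passed: bool


# Hypothetical stand-ins for the real build, install and test steps.
def build(checkin):
    return "broken-build" not in checkin


def install(checkin):
    return True


def run_tests(checkin):
    return "bad-test" not in checkin


STAGES = [("build", build), ("install", install), ("test", run_tests)]


def process(checkin):
    """Run every stage in order and stop at the first one that fails."""
    for name, stage in STAGES:
        if not stage(checkin):
            return Result(checkin, name, False)
    return Result(checkin, STAGES[-1][0], True)


def drain(pending):
    """Process queued check-ins until the queue is empty, with no human in the loop."""
    results = []
    while True:
        try:
            checkin = pending.get_nowait()
        except queue.Empty:
            return results
        results.append(process(checkin))  # "save results" step


pending = queue.Queue()
for name in ["r101", "r102-broken-build", "r103-bad-test"]:
    pending.put(name)  # "trigger from SCM check-in" step

results = drain(pending)
for result in results:  # "display results" step
    print(result.checkin, result.last_stage, "PASS" if result.passed else "FAIL")
```

The point of the queue is that nothing between check-in and saved result waits for a person, which is what "fully enqueued" means here.
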
One developer wrote a prototype framework while other developers wrote tests or adapted existing tests to the framework. Within a few months we were running on 20 test machines with a duty cycle of around 50%. It was not perfect: we made heavy use of TCP/IP controlled power switches and other tricks and we had to re-start and re-image test machines regularly. But it worked and we had learned a lot.

Lessons Learned From the First Implementation
  • Getting a prototype out fast was invaluable.
  • Many low-hanging server bugs were found and fixed.
  • We had learned how to build the final automated test system.
  • The company's middleware layer was the key to the success of our system level automated testing. All our success was based on the wisdom and hard work of the team who designed and implemented this layer, made it work for all servers and all clients in the company's product line, and rigorously maintained backwards compatibility over its history.
  • To be effective, the automated test system had to work reliably in a fully enqueued mode. That is, a code check-in had to trigger a build, installation, test, saving of results, capture of failed states, notification to developers and reservation of the failed test system, all without human intervention. Doing so gave 24x7=168 hours of testing per test machine per week. Waiting for human intervention could drop throughput by a factor of 10. As the number of test machines increased, the wait for each human intervention grew. As the tests became better at killing test machines, the number of human interventions increased.
  • Tests needed to be more effective. Even though our tests had found as many bugs as we had time to fix, it was still easy to find bugs in manual testing. Moreover the code being tested was still under development and it was clear that it was not stabilizing fast enough to meet target schedules.
Critical Next Steps
By this time work on the final automated build and test system was well underway. It was much more sophisticated than our prototype: it had a scalable design, was easy to use, robust and its code was maintainable over a long period. But we still had to address the issues of test machine reliability, full enqueuing and testing effectiveness.

The problems of test machine reliability and full enqueuing turned out to have already been solved by VMware Lab Manager. We did not get to Lab Manager directly: we tried building something similar ourselves before we found out that Lab Manager solved all the problems we had been working through plus some we had yet to encounter. The key benefits of Lab Manager were
  • Test machines were virtualized so they were as reliable as the code being tested. We no longer needed TCP/IP controlled switches and human intervention to deal with hung machines. Hung virtual test machines could be killed and recycled with a single remote function call.
  • The full test cycle could be enqueued. Virtual machines could be created, software installed on them, tests run, and failed virtual systems saved to disk for later examination. The duty cycle went up to close to 100%.
  • The full test cycle was implemented extremely efficiently with VMware technology. The saved VM states were implemented as deltas from the common test images. Efficiency (= number of tests per physical PC per week) went up by over a factor of 10.
  • It scaled very well. Combined with the scalability of the new auto-build+test system, the number of (virtual) machines under test increased to over a hundred quickly then kept growing.
We found no similar existing solution to the testing efficiency problem. To restate the problem, there were many ways the product could be used and testing all these cases was taking too long. The number of possible ways the product could be used was the product of
  • Number of different input files. This was high but we did not know what the effective number of input files was.
  • Meta-data applied to each file. 140 attributes with an average of 4 values/attribute gives 4^140 ≈ 10^84 combinations.
  • 1000 APIs in middle-ware, which controlled approx 10 main computational threads which shared state data.
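To get a feel for the meta-data term alone, the arithmetic can be checked in a few lines of Python. The 140 attributes and 4 values each are the figures above, and the 3 seconds per test run is the minimum run time quoted in this post; running everything on a single machine is an assumption made only to keep the sum simple.

```python
import math

ATTRIBUTES = 140
VALUES_PER_ATTRIBUTE = 4
SECONDS_PER_RUN = 3  # minimum test run time quoted in the post

# Work in log10 so the numbers stay readable.
log10_combinations = ATTRIBUTES * math.log10(VALUES_PER_ATTRIBUTE)

seconds_per_year = 365 * 24 * 3600
log10_years = (
    log10_combinations
    + math.log10(SECONDS_PER_RUN)
    - math.log10(seconds_per_year)
)

print(f"meta-data combinations: about 10^{log10_combinations:.1f}")
print(f"exhaustive run time on one machine: about 10^{log10_years:.1f} years")
```

Adding more test machines only subtracts a few from that exponent, so exhaustive testing is not an option and some form of experimental design is needed.
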
The field of Experimental Design explains how to test systems with several input variables efficiently. The key technique is to change several input variables on each test run. The challenges are knowing which variables to change on each run and interpreting the results when several variables are changed at once. An example will illustrate:
  • The above meta-data turned out to have a big effect on the outcome of tests. Given that a test run took at least 3 seconds, the only way to test all 140 meta-data attributes was to change most attributes on all test runs. After a set of test runs with, say, 100 different meta-data attributes set on each run, one test run fails. How do you find out which of the attributes caused the failure? (One answer is given at the end of this post).
The following is a rough outline of how we designed our software experiment.

Design of Software Experiment 
The number of input variables to test was far too high for traditional experimental designs so we used Optimal Designs. First we narrowed the input variable list to the ones that were important for the outcomes we cared about. While we were software engineers who understood how each individual function in the code behaved, we could not construct any useful model to describe how the entire system would behave. Therefore we did this empirically and used Dimensional Reduction. What we did was too complicated to describe in a blog post but a simplified version could be summarized as follows:
  • Collected as many input files as possible. We wrote a crawler to search all the company's test files and installed snoopers on all the company's test machines to create a superset of the company's test files.
  • Synthesized the D-Optimal set of meta-data attributes.
  • Manually selected the key middle-ware APIs.
  • Executed many test runs.
  • Found the Principal Components of the input variables that described the bulk of the variation in the outcomes we cared about.
For some of the input variables the dimensional reduction was high. In other cases little reduction was achieved. We ended up with a dramatically simplified but still complex set of tests: essentially one thousand test files, one hundred meta-data attributes, a smallish number of middleware calls, and all the timing-dependent interactions between those. This turned out not to be so bad. Since the remaining input variables were uncorrelated to within the limits of our testing, we could test them all at once without them interfering significantly with each other. That meant we could do something similar to Fuzz testing (e.g. Java Fuzz Testing, Fuzzers - The ultimate list).
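As a rough illustration of the fuzz-style approach, the sketch below picks a random value for every meta-data attribute on each test run. It is a minimal Python example under stated assumptions: the attribute count and value count are the reduced figures above, attributes are represented simply as integers 0..3, and seeding the generator with the run number is a design choice suggested here (so that a failing run can be replayed exactly), not something the post says was done.

```python
import random

NUM_ATTRIBUTES = 100
VALUES_PER_ATTRIBUTE = 4


def random_settings(run_number):
    """Return one value per attribute, chosen uniformly and independently.

    Seeding with the run number makes every run reproducible.
    """
    rng = random.Random(run_number)
    return [rng.randrange(VALUES_PER_ATTRIBUTE) for _ in range(NUM_ATTRIBUTES)]


settings = random_settings(run_number=42)
print(settings[:10])

# Replaying the same run number gives the same settings.
assert settings == random_settings(run_number=42)
```
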

The meta-data testing illustrates how this worked. There were 100 attributes with 4 values each. Say the random number generator selected every value of an attribute with 99.99% probability within 10 test runs (that is, rand()%4 returned 0,1,2 and 3 within 10 test runs with 99.99% probability) and the values it selected for each attribute were uncorrelated. Then
  • in 10 test runs, all outcomes that depended on 1 attribute would have been tested
  • in 100 test runs, all outcomes that depended on 2 attributes would have been tested
  • in 1000 test runs, all outcomes that depended on 3 attributes would have been tested
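The 99.99% figure above is taken as a premise. As a sanity check, the sketch below computes the chance that every combination of values is seen at least once, assuming each run draws values uniformly and independently (an assumption, since the real generator and its settings are not described). It uses the standard inclusion-exclusion formula for covering all of a set of equally likely outcomes.

```python
from math import comb


def coverage_probability(combinations, runs):
    """Probability that each of `combinations` equally likely outcomes
    appears at least once in `runs` independent draws (inclusion-exclusion)."""
    return sum(
        (-1) ** k * comb(combinations, k) * (1 - k / combinations) ** runs
        for k in range(combinations + 1)
    )


# 1, 2 and 3 attributes with 4 values each give 4, 16 and 64 combinations.
for attributes, runs in [(1, 10), (2, 100), (3, 1000)]:
    p = coverage_probability(4 ** attributes, runs)
    print(f"{attributes} attribute(s), {runs} runs: {p:.5f}")
```

Under these assumptions a given single attribute is fully covered in 10 runs only about 78% of the time, a given pair in 100 runs about 97.5% of the time, and a given triple in 1000 runs better than 99.99% of the time. So the run counts in the list are best read as orders of magnitude, with the larger counts being the safer ones.
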
This was good coverage, especially since real users tended not to set a lot of attributes at once and the code was modular enough that the attributes tended not to have a lot of inter-dependencies. Debugging failures took some time because each test run had 100 meta-data attributes set. The vast majority of bugs were caused by only one or two of those attributes but the culprit attributes still had to be found. The solution we used was a binary search, re-running the failed test with selected attributes disabled until we found the minimal set of attributes that triggered the failure.
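The search for culprit attributes can be sketched as follows. This is a simplified Python illustration, not the code used on the project: `fails` is a hypothetical callback that re-runs the failed test with only the given attributes enabled, and the sketch assumes the failure depends only on the culprit attributes being present, regardless of which others are set.

```python
def minimize(attributes, fails):
    """Shrink `attributes` to a smaller list for which fails() is still True.

    Tries disabling chunks of attributes, halving the chunk size each pass,
    and keeps any reduction that still reproduces the failure.
    """
    current = list(attributes)
    chunk = max(1, len(current) // 2)
    while True:
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if fails(candidate):
                current = candidate  # the chunk was not needed, drop it
            else:
                i += chunk  # the chunk holds a culprit, keep it and move on
        if chunk == 1:
            return current
        chunk = max(1, chunk // 2)


# Hypothetical failure that needs attributes 17 and 63 set together.
culprits = {17, 63}
calls = 0


def fails(enabled):
    global calls
    calls += 1
    return culprits <= set(enabled)


print(minimize(range(100), fails), "found in", calls, "re-runs")
```

Because whole chunks are discarded early, the number of re-runs stays far below the 100 that testing each attribute one by one would need when only one or two attributes are involved.
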

TO COME:
Pipelining tests 

7 comments:

Srihari Palangala said...

Hi,
You touch upon a core value proposition of virtual lab automation solutions such as VMware Lab Manager and VMLogix LabManager - i.e., the ability for users to self-serve.

You may find the following post on a customer experience with VMLogix LabManager particularly interesting.

VMLogix LabManager provides powerful guest VM automation capabilities that allows users to truly automate provisioning as well as guest VM customizations (and multi-machine synchronizations).

- Srihari Palangala
http://blog.vmlogix.com

Peter Williams said...

Srihari

The key benefit VMware Lab Manager brought to our project was that it allowed us to fully enqueue the build+test+fix cycle: provision VMs, install software on them, run tests, check results, and save failed VMs to disk for later examination. The build+test+fix duty cycle was close to 100%, INDEPENDENT OF THE NUMBER OF DEVELOPERS, COMPLEXITY OF THE CODE UNDER TEST AND THE NUMBER OF RELEASES PER YEAR.

This was a vast improvement for our development organization of 300 developers working on 3M lines of C/C++ code and making 50-100 product releases/year. Anyone who has worked in a development organization of this size knows that the amount of time spent on communicating between teams and the hand-offs of code between teams dominates schedules. High quality releases require thorough testing at each hand-off which in turn requires synchronizing many people: the developers submitting code, the build team building it, the testers testing it and reporting bugs, and the developers fixing the bugs. This high level of synchronization restricts the flow of work through the system, and the fraction of time spent on synchronization increases with the size of the development organization.

Lab Manager allowed us to fully enqueue the hand-offs between teams so the synchronization overhead almost disappeared. Code check-ins triggered the build+install+test+check process and after that the code change was either accepted or rejected with a reason and a failed VM for the developer to debug when he/she had time.

Lab Manager is also a great application in its own right, and we introduced it to our QA organization to use in that way. However this blog post is about how we first used it to implement a fully automated development hand-off pipeline. IMO this was more important than the QA semi-automated usage because it caught bugs before they got into the code branches that QA tested.

-Peter

Peter Williams said...

People are now patenting virtualized testing. http://www.patents.com/Test-Automation-Using-Virtual-Machines/US20080244525/en-US/ is an example from Microsoft.

Gary said...

ATS have also done virtualized testing but using Java and a seti@home approach. See http://automatedtestsystems.com.au/node/29

David Smorth said...

http://www.youtube.com/watch?v=IyNPeTn8fpo In this Scrum overview, Ken Schwaber describes how development organizations get hung up on legacy core code.