
01 November 2008

Test Automation For Complex Systems Continued

In a previous post I described some innovative methods that had been developed to test a software product line comprising 3M lines of C/C++ server code and 10 client apps, built by 300 developers. In that post I discussed:
  • A middleware layer against which the tests were run
  • The benefits of getting started early on a prototype solution
  • A CruiseControl-like build+test framework that implemented continuous builds and optionally tested each code check-in
  • Using VMware Lab Manager to dramatically improve test reliability, test hardware usage efficiency and flow of tests through the automated build+test system
  • Using methods from Experimental Design to dramatically improve test efficiency

Some issues remained

  • Testing needed to become yet more efficient. At that time the entire server still needed to be tested comprehensively on any code check-in that could affect the whole system; the full system had not yet been broken down into sub-systems that could each be tested comprehensively with small subsets of the full system test suite.
  • We needed to improve detection of concurrency defects such as race conditions and deadlocks. This was (and is) a well-known problem in software development. Faults found in the field were often produced under stress and timing conditions that were difficult to reproduce in testing.
  • The test system needed to find bugs without continual re-writing or re-tuning of tests.

Concurrent Testing
As developers of the server code being tested, we knew the fastest ways of flowing jobs through the system, and we had a good idea of where concurrency defects were likely to occur. We addressed both needs, throughput and concurrency coverage, with a single change: refactoring our test programs into a set of threads that exercised different parts of the server. In their standard modes the new multi-threaded test programs maximized throughput by keeping all stages of all server processing pipelines full. This by itself unmasked many concurrency bugs, because it exercised all major threads in the system while efficiently exploring many combinations of input parameters. We then added code that triggered all the user-switchable state transitions (abort, cancel, etc.) in the server's processing pipelines and control loops, rapidly and with varying timings. This uncovered many more bugs.
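The sketch below shows the shape of that design: worker threads keep the server's pipelines full while a disruptor thread fires abort/cancel transitions at random moments. It is a minimal illustration under stated assumptions, not our actual test code; mw_submit_job and mw_abort_job are hypothetical stand-ins for calls into the real middleware layer.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the middleware layer (the real layer had
// approx. 1,000 APIs and is not shown here).
static int mw_submit_job(int stage) {           // submit a job to one pipeline stage
    std::printf("job submitted to stage %d\n", stage);
    return stage;                               // pretend this is a job id
}
static void mw_abort_job(int job_id) {          // a user-switchable state transition
    std::printf("abort sent to job %d\n", job_id);
}

std::atomic<bool> done{false};
std::atomic<int>  last_job{-1};

// Worker thread: keep one pipeline stage full by submitting jobs back to back.
void worker(int stage) {
    while (!done) last_job = mw_submit_job(stage);
}

// Disruptor thread: fire abort/cancel style transitions at random moments
// to provoke race conditions and deadlocks in the server's control loops.
void disruptor() {
    while (!done) {
        std::this_thread::sleep_for(std::chrono::milliseconds(std::rand() % 50));
        int job = last_job;
        if (job >= 0) mw_abort_job(job);
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int stage = 0; stage < 10; ++stage)    // approx. 10 main server pipelines
        threads.emplace_back(worker, stage);
    threads.emplace_back(disruptor);
    std::this_thread::sleep_for(std::chrono::seconds(10));  // test duration
    done = true;
    for (auto& t : threads) t.join();
}
```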

Finding New Bugs without Changing Tests 
Two well-known behaviors of software development organizations are

  • Defects found by tests tend to get fixed.
  • Code that is tested often tends to have fewer defects.

The inevitable result of this is that static tests will tend to find fewer bugs over time as the defects they find get fixed and the code they exercise gets debugged. The consensus among the testers in our organization was that they found at least 80% of their bugs through exploratory testing and less than 20% through running their standard test matrices. Their exploratory testing included such things as testing recently changed features and testing functionality they had observed to be fragile in the past.

Effectiveness starting at 20% and then tapering off was not what we had in mind for our test system. We had partially addressed this in our initial design by:

  • Allowing tests to be based on server configuration. E.g. meta-data attributes were a key test parameter, so the test programs had an option to read all the meta-data keys and their allowed values from a server and generate test cases from them (see the sketch after this list). This distributed test coverage evenly between previously and newly exposed code paths. While this was less directed toward new code paths than the QA department's exploratory strategy of targeting recent changes, it was much more effective than continually re-testing the same well-tested code paths, as a purely static test would have done. It also had the nice property that two servers with identical code and configuration would always be tested in exactly the same way, while any change to the configuration or middleware layer would result in a different test with good coverage of the changes.
  • Having a pure exploratory mode, in which a crawler found new test files and a time-seeded pseudo-random number generator created unpredictable test parameters.
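To make the configuration-driven mode concrete, here is a minimal sketch of the idea. The mw_read_attributes call and the example attribute names are hypothetical stand-ins; the point is that test cases are generated from the server's own configuration, and deterministically, so identical code plus configuration always produces the identical test run.

```cpp
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

using AttributeTable = std::map<std::string, std::vector<std::string>>;

// Hypothetical stand-in for the middleware call that read every meta-data
// key and its allowed values from a running server.
AttributeTable mw_read_attributes() {
    return {{"duplex", {"off", "long-edge", "short-edge"}},
            {"copies", {"1", "2", "10", "100"}}};
}

// Build one test case by choosing a value for every attribute the server
// exposes. Seeding the generator from the configuration itself (rather than
// the clock) makes the tests deterministic: identical code + configuration
// is always tested the same way, while any configuration change yields a
// different, but still evenly spread, set of test cases.
std::map<std::string, std::string> make_test_case(const AttributeTable& attrs,
                                                  unsigned run_number) {
    unsigned seed = run_number;
    for (const auto& [key, values] : attrs)
        seed = seed * 31 + key.size() + values.size();  // fold config into seed
    std::srand(seed);
    std::map<std::string, std::string> test_case;
    for (const auto& [key, values] : attrs)
        test_case[key] = values[std::rand() % values.size()];
    return test_case;
}

int main() {
    AttributeTable attrs = mw_read_attributes();
    auto test_case = make_test_case(attrs, 1);  // run 1 of this configuration
}
```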

The configuration-based testing mode was widely used. It continued to find a lot of bugs for as long as code or configuration kept changing.
The pure exploratory mode was seldom used other than for creating data for the statistical analysis module described in the previous post.

Conclusions
Some of the things we learned from this work were:
  • It was important to fit the solution to the problem. IMO the one thing that most characterized our approach was that we did not start with a solution in mind. We continually re-analyzed the problem and the efficacy of our partial solutions to it and ended up with solutions that few people had expected.
  • We created a design verification system rather than automating the quality assurance department's manual testing.
  • We emphasized bug find+fix volume over bug prioritization.
  • We automated at the middleware layer rather than with client app button pushers.
  • Virtualization was a key component of our solution.
  • Statistical analysis played a key role in the design of the solution.
  • Automated tests continued to find many bugs even after code stopped changing. We investigated this and found that new code paths were being exposed to our tests by changes in configuration made after code freeze. As a result the company started treating configuration changes the same as code changes, requiring them to asymptote to zero before shipping.  
  • It can be difficult to fit solutions to a large problem. Big problems often require big solutions, and building big solutions can be expensive. For example, we discovered that CruiseControl brought us little value, so we had to invest a lot of money in developing our own CruiseControl-like automated build+test system.
  • Integration is a major cost in big solutions. Few of the off-the-shelf tools we looked at worked well together, and a large fraction of the cost of this work went into integration. The major exception was Lab Manager and the VMware tools supporting it, which integrated well with all parts of our system.
  • The value of an effective IT department. These people understood integrating heterogeneous systems, rolling out hardware and software, and meeting service-level requirements in a way that we developers did not. The VMware products, which were designed for IT departments, had the same qualities.

29 October 2008

Test Automation for Complex Systems

A while back the software development team I was working in was struggling with late stage severity 1 bugs. There were about 3M lines of C/C++ server code, 10 client apps and 300 developers. The code had evolved over 10-15 years, it was rich in functionality and much of it was timing critical. Therefore it came as no surprise that the code had some problems and that the problems were hard to fix.

First some questions had to be answered.

Q: What was the problem to be solved? 
A: To find and fix most severity 1 bugs in early stages of development.

Q: Why was this not being done now? 
A: Complexity. The combinatorics of the system were daunting. It had the benefit of a comprehensive middleware layer that cleanly separated clients from the server, but that layer was large: approx. 1,000 APIs, all with multiple arguments. The server processed complex files specified by multiple complex languages, and 100+ meta-data attributes, with an average of 4 values each, could be applied to each file. The server was multi-threaded to handle multiple users and to pipeline the processing of jobs, and the threads shared data and state, so interactions were important. The system also had a large real-time device control component.

Q: What was there to work with?
A: Working code. The system had bugs but it worked. Many parts of the code were well-written. It was possible that the code could be bug fixed to an acceptable level of stability without a major re-write.

Q: What was the best starting point? 
A: The middleware layer. All the company's client apps communicated with the server through the middleware layer, so it was theoretically possible to test any customer-visible scenario by scripting against the middleware layer.

Those questions and answers appeared to narrow the scope of the problem as much as it was ever going to be narrowed by theoretical analysis. Therefore we put together a prototype test framework to see what was achievable in practice. The late stage severity 1 bugs were very costly so time was of the essence.

The Design Goals

  • Test as close as possible to code submission time.
  • Get started as early as possible. The goal was to fix bugs. The earlier we started, the more we could fix.
  • Test as much user-exposed functionality as possible as quickly as possible.
    • Test automatically.
    • Get as many test systems as possible. Either re-deploy existing systems or purchase new ones.
    • Cover functionality as efficiently as possible.
The First Implementation
  • Write test cases as programs that use the middleware layers.
  • Try to save development time by using an existing automated test framework. We quickly found that there were no frameworks that helped much with our 3 main requirements:
    • Integrate with our source code management (SCM) system and our build systems.
    • Install our code on our test machines
    • Invoke our middle-ware based test programs

  • Get started early by writing test programs that exercise well-known functionality
  • Create a simple test framework to learn about writing automated test frameworks. This framework would
    • Trigger from SCM check-ins.
    • Invoke the build system.
    • Install built code on test system.
    • Run tests.
    • Save results.
    • Display results.

  • Run reliably on 10 test machines running 24x7.
  • Run fully enqueued.
One developer wrote a prototype framework while other developers wrote tests or adapted existing tests to the framework. Within a few months we were running on 20 test machines with a duty cycle of around 50%. It was not perfect: we made heavy use of TCP/IP-controlled power switches and other tricks, and we had to re-start and re-image test machines regularly. But it worked, and we had learned a lot.
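For illustration, the prototype's enqueued loop looked roughly like the sketch below. The command strings are hypothetical placeholders for our SCM, build and deployment tools, not real programs; the shape of the loop, with no human anywhere in it, is the point.

```cpp
#include <chrono>
#include <cstdlib>
#include <thread>

// Run a shell command; the commands named below are hypothetical stand-ins.
static bool run(const char* cmd) { return std::system(cmd) == 0; }

int main() {
    for (;;) {
        if (!run("scm-poll --new-checkins")) {              // trigger from SCM check-ins
            std::this_thread::sleep_for(std::chrono::seconds(60));
            continue;
        }
        if (!run("build-all")) continue;                    // invoke the build system
        if (!run("install-build --host test01")) continue;  // install on a test machine
        run("run-tests --save-results");                    // run tests, save results
        run("publish-results");                             // display results
    }
}
```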

Lessons Learned From the First Implementation
  • Getting a prototype out fast was invaluable.
  • Many low-hanging server bugs were found and fixed.
  • We had learned how to build the final automated test system.
  • The company's middleware layer was the key to the success of our system-level automated testing. All our success was built on the wisdom and hard work of the team who designed and implemented this layer, made it work for all servers and all clients in the company's product line, and rigorously maintained backwards compatibility over its history.
  • To be effective, the automated test system had to work reliably in a fully enqueued mode. That is, a code check-in had to trigger a build, installation, test, saving of results, capture of failed states, notification to developers and reservation of the failed test system, all without human intervention. Doing so gave 24x7=168 hours of testing per test machine per week. Waiting for human intervention could drop throughput by a factor of 10. As the number of test machines increased, the wait for each human intervention grew. As the tests became better at killing test machines, the number of human interventions increased.
  • Tests needed to be more effective. Even though our tests had found as many bugs as we had time to fix, it was still easy to find bugs in manual testing. Moreover the code being tested was still under development and it was clear that it was not stabilizing fast enough to meet target schedules.
Critical Next Steps
By this time, work on the final automated build+test system was well underway. It was much more sophisticated than our prototype: it had a scalable design, was easy to use and robust, and its code was maintainable over the long term. But we still had to address the issues of test machine reliability, full enqueuing and testing effectiveness.

The problems of test machine reliability and full enqueuing turned out to have already been solved by VMware Lab Manager. We did not get to Lab Manager directly: we first tried building something similar ourselves, before finding out that Lab Manager solved all the problems we had been working through, plus some we had yet to encounter. The key benefits of Lab Manager were:
  • Test machines were virtualized so they were as reliable as the code being tested. We no longer needed TCP/IP controlled switches and human intervention to deal with hung machines. Hung virtual test machines could be killed and recycled with a single remote function call.
  • The full test cycle could be enqueued. Virtual machines could be created, software installed on them, tests run, and failed virtual systems saved to disk for later examination. The duty cycle went up to close to 100%.
  • The full test cycle was implemented extremely efficiently with VMware technology. The saved VM states were implemented as deltas from the common test images. Efficiency (= number of tests per physical PC per week) went up by over a factor of 10.
  • It scaled very well. Combined with the scalability of the new auto-build+test system, the number of (virtual) machines under test quickly increased to over a hundred and then kept growing.
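To illustrate the enqueued cycle, here is a sketch of its shape. The function names are hypothetical wrappers, not Lab Manager's actual API; the point is that deploy, install, test, capture and recycle all happen without human intervention, and that a hung VM takes the same path as a failed one.

```cpp
#include <cstdio>
#include <string>

// Hypothetical wrappers, NOT the actual Lab Manager API: they only show
// the shape of a test cycle in which no step waits for a human.
std::string vm_deploy(const std::string& image) {        // clone a VM from a base image
    std::printf("deploying from %s\n", image.c_str());
    return "vm-001";
}
bool vm_install_and_test(const std::string& vm, const std::string& build) {
    std::printf("testing %s on %s\n", build.c_str(), vm.c_str());
    return true;                                         // false = failed or hung
}
void vm_save_state(const std::string& vm) {              // stored as a delta, so cheap
    std::printf("saving failed state of %s\n", vm.c_str());
}
void vm_destroy(const std::string& vm) {                 // recycle with one remote call
    std::printf("recycling %s\n", vm.c_str());
}

// One fully enqueued cycle: deploy, install + run, capture on failure, recycle.
void test_cycle(const std::string& image, const std::string& build) {
    std::string vm = vm_deploy(image);
    if (!vm_install_and_test(vm, build))
        vm_save_state(vm);   // reserve the failed state for the developer
    vm_destroy(vm);
}

int main() { test_cycle("base-image", "build-1234"); }
```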
We found no similar existing solution to the testing efficiency problem. To restate the problem, there were many ways the product could be used and testing all these cases was taking too long. The number of possible ways the product could be used was the product of
  • Number of different input files. This was high but we did not know what the effective number of input files was.
  • Meta-data applied to each file. 140 attributes with an average of 4 values per attribute gives 4^140 ≈ 10^84 combinations.
  • 1,000 middleware APIs, which controlled approx. 10 main computational threads that shared state data.
The field of Experimental Design explains how to test systems with several input variables efficiently. The key technique is to change several input variables on each test run. The challenges are knowing which variables to change on each run and interpreting the results when several variables are changed at once. An example will illustrate:
  • The above meta-data turned out to have a big effect on the outcome of tests. Given that a test run took at least 3 seconds, the only way to test all 140 meta-data attributes was to change most attributes on all test runs. After a set of test runs with, say, 100 different meta-data attributes set on each run, one test run fails. How do you find out which of the attributes caused the failure? (One answer is given at the end of this post).
The following is a rough outline of how we designed our software experiment.

Design of Software Experiment 
The number of input variables to test was far too high for traditional experimental designs, so we used Optimal Designs. First we narrowed the input variable list to the ones that were important for the outcomes we cared about. While we were software engineers who understood how each individual function in the code behaved, we could not construct any useful model of how the entire system would behave. Therefore we did this empirically and used Dimensional Reduction. What we did is too complicated to describe in full in a blog post, but a simplified version can be summarized as follows:
  • Collected as many input files as possible. We wrote a crawler to search all the company's test files and installed snoopers on all the company's test machines to create a superset of the company's test files.
  • Synthesized the D-Optimal set of meta-data attributes.
  • Manually selected the key middle-ware APIs.
  • Executed many test runs.
  • Found the Principal Components of the input variables that described the bulk of the variation in the outcomes we cared about.
For some of the input variables the dimensional reduction was large; in other cases little reduction was achieved. We ended up with a dramatically simplified but still complex set of tests: essentially one thousand test files, one hundred meta-data attributes, a smallish number of middleware calls, and all the timing-dependent interactions between them. This turned out not to be so bad. Since the remaining input variables were uncorrelated, to within the limits of our testing, we could vary them all at once without them interfering significantly with each other. That meant we could do something similar to Fuzz testing (e.g. Java Fuzz Testing, Fuzzers - The ultimate list).

The meta-data testing illustrates how this worked. There were 100 attributes with 4 values each. Say the random number generator selected every value of an attribute with 99.99% probability within 10 test runs (that is, rand()%4 returned 0,1,2 and 3 within 10 test runs with 99.99% probability) and the values it selected for each attribute were uncorrelated. Then
  • in 10 test runs, all outcomes that depended on 1 attribute would have been tested
  • in 100 test runs, all outcomes that depended on 2 attributes would have been tested
  • in 1000 test runs, all outcomes that depended on 3 attributes would have been tested
This was good coverage, especially since real users tended not to set a lot of attributes at once and the code was modular enough that the attributes tended not to have a lot of inter-dependencies. Debugging failures took some time because each test run had 100 meta-data attributes set. The vast majority of bugs were caused by only one or two of those attributes, but the culprit attributes still had to be found. The solution we used was a binary search: re-running the failed test with selected attributes disabled until we found the minimal set of attributes that triggered the failure, as sketched below.
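A minimal sketch of that search follows. The run_test stub and the attribute names are hypothetical; the real procedure re-ran the actual failed test with only the given attributes enabled. This simple halving assumes a single culprit attribute; when a failure needed attributes from both halves, a finer search over the remaining set took over.

```cpp
#include <string>
#include <vector>

// Hypothetical stub: re-run the failed test with only the given attributes
// enabled, returning true if it passes. Here it pretends the attribute
// "collate" is the culprit; the real version re-ran the actual test.
bool run_test(const std::vector<std::string>& attrs) {
    for (const auto& a : attrs)
        if (a == "collate") return false;
    return true;
}

// Halve the enabled attribute set, keeping whichever half still fails.
std::vector<std::string> isolate(std::vector<std::string> attrs) {
    while (attrs.size() > 1) {
        auto mid = attrs.begin() + attrs.size() / 2;
        std::vector<std::string> first(attrs.begin(), mid);
        std::vector<std::string> second(mid, attrs.end());
        if (!run_test(first))       attrs = first;   // culprit in the first half
        else if (!run_test(second)) attrs = second;  // culprit in the second half
        else break;  // failure needs attributes from both halves; stop halving
    }
    return attrs;  // a (near-)minimal failing set
}

int main() {
    std::vector<std::string> attrs = {"duplex", "copies", "collate", "staple"};
    auto culprit = isolate(attrs);  // -> {"collate"}
    (void)culprit;
}
```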

TO COME:
Pipelining tests 

05 March 2008

Open Virtualization Interfaces

I attended VMware CTO Steve Herrod's talk on VMware's roadmap today. Some of the more interesting information presented was already published on VMware's web site.

Open Virtualization Interfaces

The most interesting claim was that VMware will be guaranteeing 100% forwards compatibility with the open interfaces. While this is an obvious design goal for such interfaces, it is very difficult to achieve in practice. The VMware Management Interface is summarized below. It is very rich in useful functionality.
If VMware and/or other vendors supporting the open standards can deliver on this, then the deployment of complex software packages becomes a lot easier. Packaging, deployment, upgrades, tracking and other software management chores make up a substantial fraction of the total cost of software. Third-party vendors who can do this in a consistent way for all software packages will be able to achieve economies of scale that will allow them to provide current levels of service at much lower costs than are currently possible.
Management Interface

Monitoring
  • Discovery and inventory of virtual machines and host operating systems
  • Associations between entities
  • Synchronous and asynchronous performance monitoring

Virtual Machine Lifecycle
  • Create/delete
  • Configure (assign virtual devices, set up networks, assign resources, etc.)
  • Clone (create a copy of an existing virtual machine or virtual machine template)
  • Migrate (move a virtual machine from one host to another)
  • Snapshot/revert (checkpoint a virtual machine/revert to a checkpoint)
  • Power operations (on/off/suspend)

Multi-host Virtualization
Virtual infrastructure is the ideal foundation on which to build 'utility' or 'adaptive' solutions. Thus, models for virtual infrastructure must address questions such as how to represent a collection of physical servers acting as a single compute resource and how to understand resource allocation in the form of flexible and hierarchical resource pools.
Storage and Networking
It is important that management solutions not break into incompatible silos of storage, network, and server. To this end, emerging standards for managing virtualized servers need to leverage and interface with existing work on network and storage management.
Host Management
As new hardware devices are developed and introduced to virtual infrastructure, the relation of those devices to other components - both virtual and physical - must be clarified and the models for virtual infrastructure extended accordingly.
Microsoft's Approach to Virtualization
Microsoft's Virtualization Consolidation Announcement says Microsoft Application Virtualization, formerly known as SoftGrid Application Virtualization, is a more fine-grained virtualization solution that complements Windows Server 2008 Hyper-V. Instead of virtualizing an entire operating system, Microsoft Application Virtualization virtualizes only the applications. Microsoft Application Virtualization allows any application to run alongside any other—even applications that normally conflict, multiple versions of the same application, and many applications that previously could not run in parallel.
MS Open Virtual Hard Disk Spec
Further Reading