A while back, the software development team I was working on was struggling with late-stage severity 1 bugs. There were about 3M lines of C/C++ server code, 10 client apps, and 300 developers. The code had evolved over 10-15 years; it was rich in functionality and much of it was timing-critical. Therefore it came as no surprise that the code had some problems and that the problems were hard to fix.
First some questions had to be answered.
Q: What was the problem to be solved?
A: To find and fix most severity 1 bugs in early stages of development.
Q: Why was this not being done now?
A: Complexity. The combinatorics of the system were daunting. It had the benefit of a comprehensive middleware layer that cleanly separated clients from the server. However, the middleware layer was large: approximately 1,000 APIs, all with multiple arguments. The server processed complex files specified by multiple complex languages, and 100+ meta-data attributes (with an average of 4 values each) could be applied to each file. The server was multi-threaded to handle multiple users and to pipeline the processing of jobs, and the threads shared data and state, so interactions were important. The system also had a large real-time device control component.
Q: What was there to work with?
A: Working code. The system had bugs but it worked. Many parts of the code were well written. It was possible that the code could be bug-fixed to an acceptable level of stability without a major rewrite.
Q: What was the best starting point?
A: The middleware layer. All the company's client apps communicated with the server through the middleware layer, so it was theoretically possible to test any customer-visible scenario by scripting the middleware layer.
Those questions and answers appeared to narrow the scope of the problem as much as it was ever going to be narrowed by theoretical analysis. Therefore we put together a prototype test framework to see what was achievable in practice. The late stage severity 1 bugs were very costly so time was of the essence.
The Design Goals
- Test as close as possible to code submission time.
- Get started as early as possible. The goal was to fix bugs. The earlier we started, the more we could fix.
- Test as much user-exposed functionality as possible as quickly as possible.
- Test automatically.
- Get as many test systems as possible. Either re-deploy existing systems or purchase new ones.
- Cover functionality as efficiently as possible.
- Write test cases as programs that use the middleware layers.
- Try to save development time by using an existing automated test framework. We quickly found that there were no frameworks that helped much with our 3 main requirements:
  - Integrate with our source code management (SCM) system and our build systems.
  - Install our code on our test machines.
  - Invoke our middleware-based test programs.
- Get started early by writing test programs that exercise well-known functionality.
- Create a simple test framework to learn about writing automated test frameworks. This framework would:
  - Trigger from SCM check-ins.
  - Invoke the build system.
  - Install built code on test systems.
  - Run tests.
  - Save results.
  - Display results.
  - Run reliably on 10 test machines running 24x7.
  - Run fully enqueued.
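The framework loop described above can be sketched in a few lines. This is an illustrative simulation, not the actual framework: `poll_scm`, `build`, `install`, and `run_tests` are hypothetical stand-ins for the real SCM, build, and install hooks.

```python
# Minimal sketch of the enqueued check-in -> build -> install -> test loop.
# All stages are simulated stand-ins for the real SCM/build/install hooks.

def poll_scm(queue):
    """Return the next unprocessed check-in, or None. (Simulated SCM.)"""
    return queue.pop(0) if queue else None

def build(checkin):
    """Invoke the build system. (Simulated.)"""
    return {"checkin": checkin, "binaries": f"build-{checkin}"}

def install(build_result, machine):
    """Install the built code on a test machine. (Simulated.)"""
    return {"machine": machine, "installed": build_result["binaries"]}

def run_tests(installation):
    # A real run would drive the middleware APIs; here every test "passes".
    return {"machine": installation["machine"], "passed": True}

def pipeline(checkin_queue, machines=("test-01",)):
    """Drive every queued check-in through the full cycle, saving results."""
    results = []
    while True:
        checkin = poll_scm(checkin_queue)
        if checkin is None:
            break
        b = build(checkin)
        for m in machines:
            inst = install(b, m)
            outcome = run_tests(inst)
            results.append({"checkin": checkin, **outcome})  # save results
    return results  # a real framework would also display these

results = pipeline(["r101", "r102"], machines=("test-01", "test-02"))
```

The point of the structure is that no stage waits for a human: every check-in flows through the whole cycle unattended.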
Lessons Learned From the First Implementation
- Getting a prototype out fast was invaluable.
- Many low-hanging server bugs were found and fixed.
- We had learned how to build the final automated test system.
- The company's middleware layer was the key to the success of our system-level automated testing. All our success was built on the wisdom and hard work of the team who designed and implemented this layer, made it work for all servers and all clients in the company's product line, and rigorously maintained backwards compatibility over its history.
- To be effective, the automated test system had to work reliably in a fully enqueued mode. That is, a code check-in had to trigger a build, installation, test, saving of results, capture of failed states, notification to developers and reservation of the failed test system, all without human intervention. Doing so gave 24x7=168 hours of testing per test machine per week. Waiting for human intervention could drop throughput by a factor of 10. As the number of test machines increased, the wait for each human intervention grew. As the tests became better at killing test machines, the number of human interventions increased.
- Tests needed to be more effective. Even though our tests had found as many bugs as we had time to fix, it was still easy to find bugs in manual testing. Moreover the code being tested was still under development and it was clear that it was not stabilizing fast enough to meet target schedules.
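The throughput arithmetic in the enqueuing lesson above can be made concrete. The cycle times below are hypothetical, chosen only to illustrate how a factor-of-10 loss arises once a human is in the loop:

```python
# Fully enqueued: a test machine tests around the clock.
enqueued_hours = 24 * 7   # 168 hours of testing per machine per week

# With a human in the loop (hypothetical numbers): suppose a machine gets
# through ~2 hours of testing before a test hangs it, then idles ~18 hours
# waiting for someone to reset it.
test_hours_per_cycle = 2
wait_hours_per_cycle = 18
utilization = test_hours_per_cycle / (test_hours_per_cycle + wait_hours_per_cycle)

manual_hours = enqueued_hours * utilization   # effective testing hours/week
slowdown = enqueued_hours / manual_hours      # the factor-of-10 drop
```

And the effect compounds: the better the tests get at killing machines, the shorter `test_hours_per_cycle` becomes and the worse the ratio gets.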
By this time, work on the final automated build and test system was well underway. It was much more sophisticated than our prototype: it had a scalable design, was easy to use and robust, and its code was maintainable over a long period. But we still had to address the issues of test machine reliability, full enqueuing, and test effectiveness.
Test machine reliability and full enqueuing turned out to have already been solved by VMware Lab Manager. We did not get to Lab Manager directly: we first tried building something similar ourselves, before finding that Lab Manager solved all the problems we had been working through, plus some we had yet to encounter. The key benefits of Lab Manager were:
- Test machines were virtualized so they were as reliable as the code being tested. We no longer needed TCP/IP controlled switches and human intervention to deal with hung machines. Hung virtual test machines could be killed and recycled with a single remote function call.
- The full test cycle could be enqueued. Virtual machines could be created, software installed on them, tests run, and failed virtual systems saved to disk for later examination. The duty cycle went up to close to 100%.
- The full test cycle was implemented extremely efficiently with VMware technology. The saved VM states were implemented as deltas from the common test images. Efficiency (= number of tests per physical PC per week) went up by over a factor of 10.
- It scaled very well. Combined with the scalability of the new auto-build+test system, the number of (virtual) machines under test quickly increased to over a hundred and then kept growing.
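The kill-and-recycle behaviour can be sketched as follows. This is a simulation of the idea only; the `VM` class and its calls are hypothetical stand-ins, not Lab Manager's actual API (which exposed the equivalent operations as remote calls).

```python
# Simulated sketch of killing and recycling hung virtual test machines,
# saving each failed state for later examination. All classes/calls here
# are hypothetical illustrations.

class VM:
    def __init__(self, base_image):
        self.base_image = base_image
        self.state = "running"

    def run_test(self, test):
        # A real call would drive the installed code; here the outcome is
        # supplied by the caller for illustration.
        self.state = "hung" if test == "hangs" else "passed"
        return self.state

def snapshot(vm):
    """Save the failed VM's state (stored as a delta from the base image)."""
    return {"base": vm.base_image, "delta": vm.state}

def recycle(vm):
    """Kill a hung VM and clone a fresh one from the common base image."""
    return VM(vm.base_image)

def run_queue(tests, base_image="server-test-image"):
    vm, saved = VM(base_image), []
    for t in tests:
        if vm.run_test(t) == "hung":
            saved.append(snapshot(vm))   # keep failed state for examination
            vm = recycle(vm)             # no human intervention needed
    return saved

saved_states = run_queue(["ok", "hangs", "ok", "hangs"])
```

Storing snapshots as deltas from a shared base image is what made saving every failed state cheap enough to do on every failure.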
That left test effectiveness. The difficulty was the size of the input space:
- Number of different input files. This was high, but we did not know what the effective number of input files was.
- Meta-data applied to each file: 140 attributes with an average of 4 values per attribute, giving 4^140 ≈ 10^84 combinations.
- 1,000 middleware APIs, which controlled approximately 10 main computational threads that shared state data.
The above meta-data turned out to have a big effect on the outcome of tests. Given that a test run took at least 3 seconds, the only way to test all 140 meta-data attributes was to change most attributes on every test run. After a set of test runs with, say, 100 different meta-data attributes set on each run, one test run fails. How do you find out which of the attributes caused the failure? (One answer is given at the end of this post.)
Design of Software Experiment
The number of input variables to test was far too high for traditional experimental designs, so we used Optimal Designs. First we narrowed the input variable list to the ones that were important for the outcomes we cared about. While we were software engineers who understood how each individual function in the code behaved, we could not construct any useful model of how the entire system would behave. Therefore we did this empirically, using Dimensional Reduction. What we did was too complicated to describe in a blog post, but a simplified version can be summarized as follows:
- Collected as many input files as possible. We wrote a crawler to search all the company's test files and installed snoopers on all the company's test machines to create a superset of the company's test files.
- Synthesized the D-Optimal set of meta-data attributes.
- Manually selected the key middle-ware APIs.
- Executed many test runs.
- Found the Principal Components of the input variables that described the bulk of the variation in the outcomes we cared about.
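A heavily simplified sketch of the design step above: generate a randomized design matrix over the attributes and measure how much pairwise (2-way) coverage it buys. A real D-Optimal design is synthesized against a model of the outcomes; the plain uniform sampling shown here (with the attribute count cut down for speed) only illustrates how quickly randomized designs cover attribute-pair interactions. The Principal Components step over the outcomes is not shown.

```python
import itertools
import random

random.seed(1)

N_ATTRS, N_VALUES, N_RUNS = 30, 4, 300   # cut down from 140 attrs for speed

# Each run assigns a random value to every attribute: one row of the design.
design = [[random.randrange(N_VALUES) for _ in range(N_ATTRS)]
          for _ in range(N_RUNS)]

# Pairwise coverage: what fraction of all (attribute pair, value pair)
# combinations appear in at least one run?
covered = set()
for run in design:
    for a, b in itertools.combinations(range(N_ATTRS), 2):
        covered.add((a, b, run[a], run[b]))

total = (N_ATTRS * (N_ATTRS - 1) // 2) * N_VALUES * N_VALUES
pair_coverage = len(covered) / total
```

Each run exercises every attribute pair at once, which is why a few hundred runs cover essentially all pairwise interactions even though exhaustive enumeration is hopeless.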
The meta-data testing illustrates how this worked. There were 140 attributes with 4 values each. Say the random number generator selected every value of an attribute with 99.99% probability within 10 test runs (that is, rand()%4 returned 0, 1, 2, and 3 within 10 test runs with 99.99% probability), and the values it selected for each attribute were uncorrelated. Then
- in 10 test runs, all outcomes that depended on 1 attribute would have been tested
- in 100 test runs, all outcomes that depended on 2 attributes would have been tested
- in 1000 test runs, all outcomes that depended on 3 attributes would have been tested
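The per-attribute coverage assumption can be checked exactly with inclusion-exclusion. A minimal sketch, assuming plain uniform selection (rand()%4):

```python
from math import comb

def p_all_values_seen(n, v=4):
    """P(all v values of one attribute appear in n uniform random runs),
    by inclusion-exclusion over the subsets of values that are missed."""
    return sum((-1) ** k * comb(v, k) * ((v - k) / v) ** n
               for k in range(v + 1))

p10 = p_all_values_seen(10)        # ~0.78 for plain uniform selection

# Smallest n at which per-attribute coverage reaches 99.99%:
n = 1
while p_all_values_seen(n) < 0.9999:
    n += 1
```

So plain uniform selection needs about 37 runs, not 10, to hit 99.99%; a generator steered to avoid repeating values gets there in fewer runs. Either way the scaling argument above is unchanged: each additional interacting attribute multiplies the required runs by roughly the per-attribute run count.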