16 July 2009

When Words Fail

A while back I worked at a company that made software+hardware products in a maturing market. The company found it needed to deliver higher-quality products with more features and was struggling to do so from an old codebase. It had become clear to the management team that late-stage serious defects were the major cause of schedule/quality issues, but it had not been able to fix the problem.

The management team had a lot of ideas about what the causes were and how to fix them. They had discussed "technical debt", "silo-ing" and other causes. However, in the end they settled on two key priorities: taking extreme care with code changes and sticking with established QA processes to minimise the number of bugs introduced.

Eventually the project was given to me to manage. One of the (many) things the development team had done well was to document each bug and cross-reference bug fixes against the source code. I analysed about 100 recently fixed serious software bugs, looked up their fixes in the SCM, and then looked up the date at which the code changes that caused each bug were checked in. This showed that most of the bugs being found had been introduced months before they were discovered. It was clear that the late-stage defects were dominated by latent bugs being unmasked by changes, not by bugs introduced by changes.
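The analysis above can be sketched in a few lines. This is a minimal illustration, not the actual tooling used: the record fields and dates are hypothetical, standing in for data pulled from the bug tracker and the SCM history.

```python
from datetime import date

# Hypothetical records: for each fixed serious bug, the date it was
# discovered and the check-in date of the change that introduced it
# (found by tracing the fix back through the SCM history).
bugs = [
    {"id": "BUG-101", "introduced": date(2008, 9, 3),  "discovered": date(2009, 1, 20)},
    {"id": "BUG-102", "introduced": date(2009, 1, 5),  "discovered": date(2009, 1, 22)},
    {"id": "BUG-103", "introduced": date(2008, 6, 11), "discovered": date(2009, 2, 2)},
]

# Latency between introduction and discovery, in days.
latencies = [(b["discovered"] - b["introduced"]).days for b in bugs]

# If most latencies span months rather than days, late-stage defects are
# dominated by latent bugs being unmasked, not by bugs in recent changes.
recent = sum(1 for d in latencies if d <= 30)
latent = len(latencies) - recent
print(f"latent: {latent}, recent: {recent}")
```

The 30-day threshold here is arbitrary; the real signal was simply that the bulk of the distribution sat months back from the discovery dates.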

Some changes to the development process were needed. The development group was responsible for creating code without introducing bugs, and the QA group was responsible for finding the bugs the development team missed. However, the QA process was unsuited to discovering latent bugs fast because it had a long cycle based on testing user scenarios. Therefore I got small teams of developers and QA engineers to work closely together to find, fix and verify bugs, and I took some developers away from other work to develop a system to find and fix (and eventually prevent the introduction of more) latent bugs. This work is described here. With these changes in place, code stability improved rapidly and late-stage serious bugs essentially ceased to be found.

That was a fairly straightforward technical solution to a fairly straightforward technical problem. So why had the very capable management team, which had identified the underlying causes (technical debt and silo-ing), not been able to fix the problem for so long?

Change is known to be difficult in organisations, and there is an industry built around dealing with this. However, our immediate problem was not an inability to persuade people to change. In fact, consultation and review had been distracting people from doing the experimentation required to find what the underlying causes of the problem were and how to fix them. The more people talked about the problem, the further they got from the solution (hence this post's title).

The situation reminded me of Uncle Bob Martin's Agile Smagile:

As I said before, going meta is a good thing. However, going meta requires experimental evidence. Unfortunately the industry has latched on to the word "Agile" and has begun to use it as a prefix that means "good". This is very unfortunate, and discerning software professionals should be very wary of any new concept that bears the "agile" prefix. The concept has been taken meta, but there is no experimental evidence that demonstrates that "agile", by itself, is good.

The deeply ingrained practices in the organisation I worked in had grown out of ideas that had worked well in the past. They had been good enough to cover a wide range of development scenarios for a long while and were clearly based on experimental evidence from past development. However, somewhere along the way people had stopped experimenting and modifying the rules, and started just following the rules. This is what Uncle Bob called "going meta". The problem for our organisation was that the set of rules it had arrived at when it stopped experimenting was not universally true; it was only true for the type of development the organisation was doing when it stopped changing the rules.

The changes I made to detect and fix latent bugs (high-coverage automated system testing, static analysis with Klocwork, and refactoring with unit tests) were adopted across the development organisation and became part of the standard development process, at least for the time I was there. That was good, but I wondered whether those practices would become a fixed part of the new development process simply because they had worked at some time in the past. And I wondered whether they would prevent the company from addressing problems that arose in the future, just as the practices that had worked well in the past had come to do.

1 comment:

David said...

Here is a hot-off-the-press example of how talking about something for long enough can lead to a bad outcome.

This group of "Agile" experts has decided that fixing bugs as they are found may not be such a good idea!