
04 January 2010

C++ Continues to Surprise

Someone was asking questions about const_cast<>() a few days ago. I was not quite sure how it would work because I try to use as little of the C++ language as possible, and it is possible to get by in C++ without const_cast<>(). To find out exactly how it worked I tried it with a test case. The following code gave the same output with g++ on Vista and on OS X.


#include <iostream>

int main() {
    int i = 3;
    const int* ptr = &i;              // pointer-to-const to a non-const int
    *const_cast<int*>(ptr) = 11;      // cast away constness and write through it
    if (&i == ptr && i != *ptr) {
        std::cout << "Cannot happen: &i=" << &i << " == ptr=" << ptr
                  << " but i=" << i << " != *ptr=" << *ptr << std::endl;
    }
}
The output in both cases was:
Cannot happen: &i=0x22fe6c == ptr=0x22fe6c but i=3 != *ptr=22

How can a single memory address hold two different values?

The disassembly was

push  %ebp
mov   %esp,%ebp
sub   $0x18,%esp
int i = 3;
movl  $0x3,0xfffffffc(%ebp)            (i in bp-4)
const int* ptr = &i;
lea   0xfffffffc(%ebp),%eax            (&i in eax)
mov   %eax,0xfffffff8(%ebp)            (ptr in bp-8)
*const_cast<int*>(ptr) = 11;
mov   0xfffffff8(%ebp),%eax            (ptr in eax)
movl  $0xb,(%eax)                      (*ptr set to 11)
if (&i == ptr && i != *ptr)
lea   0xfffffffc(%ebp),%eax            
cmp   0xfffffff8(%ebp),%eax
jne   0x403214 
mov   0xfffffff8(%ebp),%eax 
mov   (%eax),%eax
cmp   0xfffffffc(%ebp),%eax
je    0x403214

The disassembly matches the C++ code. i is stored at bp-4 and ptr is stored at bp-8 so the C++ code should work. The observed behaviour does not match the disassembly.

This cannot be right. I guess I found a bug in g++.

10 September 2009

Electronic Medical Records Bonanza?

Big Bucks in Health IT!, quoting from http://www.healthcareitnews.com/news/global-market-hospital-it-systems-pegged-35b-2015 , says

SAN JOSE, CA – The global hospital information systems market will climb past $35 billion by 2015, according to a new forecast by Global Industry Analysts. The United States represents the largest market in the world. The U.S. hospital information system market is experiencing an increase in acceptance of customized technology such as laboratory information systems and radiology information systems, the report notes. The market is also a promising ground for electronic medical record systems.
The Asia-Pacific region (excluding Japan) represents the fastest growing hospital information systems market, exhibiting a compounded annual growth rate of 11.5 percent over the next few years, according to analysts. Despite being a smaller market in terms of revenue, the Asia-Pacific promises excellent growth opportunities for hospital information systems, they said.
The global vendors profiled in the report include McKesson , Cerner , Allscripts-Misys Healthcare Solutions, Eclipsys, Computer Programs and Systems, Siemens Medical Solutions USA, QuadraMed, Medical Information Technology, Healthland, GE Healthcare, iSOFT Group, Agfa-Gevaert, Brunie-Software, IBA Health and Integrated Medical Systems. 
The full release, Global Hospital Information Systems Market to Cross $35 Billion by 2015, According to New Report by Global Industry Analysts, Inc., is at http://www.prweb.com/releases/2009/02/prweb2021984.htm. It says:
Increasing awareness among medical service patrons on the benefits of using Information Technology in the healthcare sector, coupled with growing demand for affordable-yet quality healthcare services is forcing hospitals and other medical centers to adopt IT in their daily operations. Subsequently, Healthcare IT systems such as the Hospital Information Systems witnessed a great demand in the healthcare services sector. Adoption of HIS in hospitals is increasingly being encouraged and promoted by the Governments world over.

16 July 2009

When Words Fail

A while back I worked at a company that made software+hardware products in a maturing market. The company found it needed to deliver higher quality products with more features and was struggling to do so from an old codebase. It had become clear to the management team that late-stage serious defects were the major cause of schedule/quality issues, but they had not been able to fix this problem.

The management team had a lot of ideas about what the causes were and how to fix them. They had discussed "technical debt", "silo-ing" and other causes. However in the end they settled on two key priorities: taking extreme care with code changes and sticking with established QA processes to minimise the number of introduced bugs.

Eventually the project was given to me to manage. One of the (many) things the development team had done well was to document each bug and cross-reference bug fixes against the source code. I analysed about 100 recently fixed serious software bugs, looked up their fixes in the SCM, and then looked up the dates at which the code changes that caused each bug were checked in. This showed that most of the bugs being found had been introduced months before they were discovered. It was clear that the late-stage defects were dominated by latent bugs being unmasked by changes, not by bugs introduced by the changes.

Some changes to the development process were needed. The development group was responsible for creating code without introducing bugs and the QA group was responsible for finding the bugs the development team missed. However the QA process was unsuited to discovering latent bugs fast because it had a long cycle based on testing user scenarios. Therefore I got small teams of developers and QA engineers to work closely together to find, fix and verify bugs, and I took some developers away from other work to develop a system to find and fix (and eventually prevent the introduction of more) latent bugs. This work is described here. With these changes in place, code stability improved rapidly and late-stage serious bugs essentially ceased to be found.

That was a fairly straightforward technical solution to a fairly straightforward technical problem. So why had the very capable management team, who knew the underlying causes (technical debt and silo-ing), not been able to fix the problem for so long?

Change is known to be difficult in organisations and there is an industry built around dealing with this. However our immediate problem was not an inability to persuade people to change. In fact, consultation and review had been distracting people from doing the experimentation required to find what the underlying causes of the problem were and how to fix them. The more people talked about the problem, the further they got from the solution (hence this post's title).

The situation reminded me of Uncle Bob Martin's Agile Smagile

As I said before, going meta is a good thing. However, going meta requires experimental evidence. Unfortunately the industry has latched on to the word "Agile" and has begun to use it as a prefix that means "good". This is very unfortunate, and discerning software professionals should be very wary of any new concept that bears the "agile" prefix. The concept has been taken meta, but there is no experimental evidence that demonstrates that "agile", by itself, is good.
The deeply ingrained practices in the organisation I worked in had grown out of ideas that had worked well in the past. They had been good enough to cover a wide range of development scenarios for a long while and were clearly based on experimental evidence from past development. However somewhere along the way people had stopped experimenting and modifying the rules, and started just following the rules. This is the failure mode Uncle Bob describes: going meta without the experimental evidence to back it. The problem for our organisation was that the set of rules it had arrived at when it stopped experimenting was not universally true; it was only true for the type of development the organisation was doing when it stopped changing the rules.

The changes I made to detect and fix latent bugs (high-coverage automated system testing, static analysis with Klocwork and refactoring with unit tests) were adopted across the development organisation and became part of the standard development process, at least for the time I was there. That was good, but I wondered if those practices would become a fixed part of the new development process simply because they had worked at some time in the past. And I wondered whether they would prevent the company from addressing problems that arose in the future, just as the practices that had worked well in the past had come to do.

22 June 2009

Minimal Non-C++ Programmer Bamboozling C++ Question

I recently read a stream of blog posts about why developers don't like C++ for general purpose programming. This post typifies much of the criticism of C++'s complexity. It includes an interview question about creating a C++ class that behaves like a class in a high level language such as Java. The author says that he uses this to weed out job applicants who haven't used C++ for real work.

It strikes me that tripping up developers with C++'s many oddities is much easier than that. Here is a simple question that I believe will confuse many non-C++ programmers:

What is the output of this program?


#include <string>
#include <iostream>
using namespace std;

class Parent {
public:
    string func()          { return "parent"; }
    virtual string vfunc() { return "parent+virtual"; }
};

class Child : public Parent {
    string func()          { return "child"; }
    virtual string vfunc() { return "child+virtual"; }
};

string test1(Parent parent) {
    return parent.func() + " - " + parent.vfunc();
}

string test2(Parent& parent) {
    return parent.func() + " - " + parent.vfunc();
}

int main() {
    Child child;
    cout << "test1: " << test1(child) << endl;
    cout << "test2: " << test2(child) << endl;
}

I have seen C++ interviewers ask questions like this, but only showing test1 and then asking what is on the stack when test1 is invoked.

26 April 2009

The Future of ICT

It is worth thinking about the future from time to time. It helps us craft investment strategies and career paths that match the major trends in the world. So what is the future of ICT?

My guesses are
  1. Simplification.
  2. Movement to the cloud.
  3. Fixed/mobile convergence.
  4. Integration of simple cloud services. (1+2)
Simplification
(Photo: threesixtyfive | day 244)
Modern ICT systems are insanely complex, while the most productive computer users I know all use simple tools.
Most of the things we do with computers are much simpler than what the popular packages are capable of. e.g. Editing some text does not require a full-blown desktop publishing program like MS Word, yet MS Word is the most popular text editor in the world. Likewise keeping track of some customers and inventory does not require a gigantic package like SAP, yet SAP is the biggest selling ERP software package in the world.
(Photo: Modern Times)
The costs of learning these immensely complex packages are considerable in terms of time lost. There is probably a much higher cost in working as slaves to these packages, which distracts from finding the best solutions to an enterprise's problems. A current trend in corporate ICT is to use "best of breed" packages with the minimum possible customization because the payback from customizing is much less than the cost. (BTW, this does not seem to be true for ERP.) This means that enterprises are paying the cost of not solving their ICT problems as well as they could. This cost has to be a significant fraction of their ICT budgets.
(Photo: What is ERP anyway? - MS&T ERP Center, 01/29/2009)
This article explains why "best of breed" software packages sell well. It boils down to the promise of lower total cost of ownership (TCO) through using a single vendor for all services and a mega-brand that makes buyers feel safe. 
SAP ERP systems effectively implemented can have huge cost benefits. Integration is the key in this process. "Generally, a company's level of data integration is highest when the company uses one vendor to supply all of its modules." An out-of-box software package has some level of integration but it depends on the expertise of the company to install the system and how the package allows the users to integrate the different modules.
Movement to the Cloud
Centralized computing is much more efficient than desktop-centric computing. TCO decreases dramatically with centralization because maintaining and upgrading software running in one physical location is far easier than on many people's personal computers. 
Most client software, even computationally intensive software like high quality graphics, has very low duty cycles. It does nothing most of the time. When you buy expensive PC hardware to support it, you are paying to support peak usage, the few minutes per day when it does the tricky computations and you want the user interface to be responsive. The average computer resource usage of this software is very low, much less than 10%. Therefore running the software on a central server is much more than 10x more efficient.
Expensive infrastructure such as high-reliability disk storage does not need to be replicated throughout an organization. Virtualization, SaaS, etc. only became effective in the last few years, so many software and hardware vendors built their (then efficient) businesses around powerful client PCs running software locally.

Fixed Mobile Convergence 
(Photo: Skype Crashing on iPhone Fix)
When simple applications and cloud computing become dominant, the requirements for terminals become much lower. Smart phones and 3G netbooks are already very capable and are becoming more so. They also use little power and are portable.

The next level of usability is to have one device for fixed and mobile work. That device should be able to work with WiFi and 3G networks and move seamlessly between them. The technology for this is maturing.

From Wikipedia
A clear trend is emerging in the form of fixed and mobile telephony convergence (FMC). The aim is to provide both services with a single phone, which could switch between networks ad hoc. Several industry standardisation activities have been completed in this area such as the Voice call continuity (VCC) specifications defined by the 3GPP. Typically, these services rely on Dual Mode Handsets, where the customers' mobile terminal can support both the wide-area (cellular) access and the local-area technology (for VoIP). However, an alternative approach achieves FMC over 3G mobile networks - eliminating the requirement for Dual Mode. This approach, broadly termed cellular FMC, is in trials by telecoms operators including BT.
An alternative approach to achieve similar benefits is that of femtocells.
Integration of simple cloud services
When cloud computing is widespread and simple cloud services are widely available, integration companies will be able to assemble tools to meet the needs of businesses. This should be a vast business since it competes with the mega-apps (Microsoft, Oracle, SAP, Siebel, etc.) and the mega-glue (Tibco, etc.).

If the recent history of software development is a guide, nimble companies will start to build effective suites and grow rapidly to  form a foundation for this industry, then they will be followed by specialist companies who will take care to make their software inter-operable. This will evolve into a software ecosystem and sales channels will emerge. With fixed mobile convergence in the mix, application stores may be used for sales, removing the need for sales and marketing teams in the startup companies that start this new business category.

At this time the setups of hyper-productive software users will be easy to replicate in the cloud. User applications will be available as simple services on simple devices like 3G netbooks. Enterprise applications will run in the cloud with simple interfaces. Business outsourcing will be simple because the software will run in the cloud with well-defined APIs.

The Consequences
  1. These changes will result in a dramatic increase in productivity that will boost economies world-wide.
  2. Software will be simple so ICT staff will not be slaves to the machines of gigantic software packages.
  3. This will free up ICT staff's time to add business value which will increase productivity even more.
Photo Credits
threesixtyfive | day 244 by Sybren A. Stüvel.
What is ERP anyway? (MS&T ERP Center, 01/29/2009) by MS&T Center for ERP.
Skype Crashing on iPhone Fix by theleetgeeks.
Modern Times by jampa.

30 March 2009

Software Project Convergence

People who manage software development projects become sensitive to code convergence, the changes that software goes through as its defects are found and removed to get it ready to ship. This takes time because

  • testing all the code takes time
  • fixing defects takes time
  • removing defects requires changing code which introduces more defects. 
Software project managers need to estimate when code will converge so they can predict ship dates. Some of the methods they use are:
  1. using the product to get a feel for the overall level of maturity
  2. plotting number of open (found but not yet fixed) defects vs. date
  3. plotting code churn vs. date
If a metric (or method) can be used to predict the ship date of a real software project then that metric should do a great job of predicting the ship date of a simple mathematical model of a software project. I am going to test the 3 methods above on a simple model of a software project. If these methods give reasonable predictions then I will test them against some more realistic models.

Software Project Model
This model starts when the code in the hypothetical project is functioning then it runs for one year of one week development cycles. In each week the development organization fixes as many bugs as it can and a separate QA organization tests the development organization's code from the previous week. The model's input parameters are
  • Initial bugs: The number of bugs in the code when the model starts.
  • Average number of lines of code (LOC) changed to fix a bug.
  • Average number of bugs introduced per LOC changed.
  • Maximum number of bugs that the development organization can fix each week.
  • QA test cycle.
  • The fastest and slowest bug find rates in the QA test cycle.
The model outputs
  • Remaining bugs in code. This is the model's actual maturity. It cannot be observed directly so it needs to be estimated from the other output parameters.
  • Open bugs (bugs found but not yet fixed)
  • Code churn.
  • Fraction of bugs that were introduced in the previous week.
This is a very simple model and does not take into account many of the things that happen in real-world software development. In particular it does not take into account the lags between introducing a bug, finding it and fixing it, which often have a major impact on project convergence. The spreadsheet that implements the model is at the end of this post.
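To make the model concrete, here is a minimal C++ sketch of the weekly simulation loop. It was written for this article rather than taken from the original spreadsheet; the sinusoidal shape of the bug find rate over the QA test cycle is an assumption for illustration, and only the parameter names and values come from the first parameter set below.

#include <algorithm>
#include <cmath>
#include <cstdio>

int main() {
    // Input parameters (first parameter set below).
    double remainingBugs = 1000.0;    // latent bugs in the code at the start
    const double bugsPerLoc = 0.007;  // bugs introduced per LOC changed
    const double locPerFix  = 100.0;  // LOC changed to fix one bug
    const double maxFixed   = 120.0;  // max bugs the dev team can fix per week
    const double maxFound   = 200.0;  // fastest bug find rate
    const double minFound   = 50.0;   // slowest bug find rate
    const int    qaCycle    = 8;      // QA test cycle in weeks
    const double kPi = 3.14159265358979323846;

    double openBugs = 0.0;            // found but not yet fixed

    for (int week = 1; week <= 52; ++week) {
        // QA's find rate varies over the test cycle (assumed sinusoidal here).
        double phase = 2.0 * kPi * (week % qaCycle) / qaCycle;
        double findRate = minFound + 0.5 * (maxFound - minFound) * (1.0 + std::sin(phase));

        // QA finds bugs from the pool of remaining latent bugs.
        double found = std::min(findRate, remainingBugs);
        remainingBugs -= found;
        openBugs += found;

        // Development fixes as many open bugs as it can.
        double fixed = std::min(maxFixed, openBugs);
        openBugs -= fixed;

        // Each fix changes code, and changed code introduces new latent bugs.
        double churn = fixed * locPerFix;
        remainingBugs += churn * bugsPerLoc;

        std::printf("week %2d: remaining %7.1f  open %6.1f  churn %8.0f\n",
                    week, remainingBugs, openBugs, churn);
    }
}

Plotting the remaining, open and churn columns against week gives curves of the kind discussed below.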
The following graphs show the model outcomes for 4 sets of input parameters.

initial bugs: 1000
bugs introduced per loc:  0.007
loc to fix a bug: 100
max bugs fixed per week: 120
max bugs found per week: 200
min bugs found per week: 50
QA test cycle(weeks): 8

(graphs omitted)

initial bugs: 500
bugs introduced per loc:  0.009
loc to fix a bug: 100
max bugs fixed per week: 120
max bugs found per week: 200
min bugs found per week: 50
QA test cycle(weeks): 8

(graphs omitted)

initial bugs: 500
bugs introduced per loc: 0.009
loc to fix a bug: 100
max bugs fixed per week: 200
max bugs found per week: 200
min bugs found per week: 200
QA test cycle(weeks): 8

(graphs omitted)

initial bugs: 500
bugs introduced per loc: 0.009
loc to fix a bug: 100
max bugs fixed per week: 100
max bugs found per week: 200
min bugs found per week: 100
QA test cycle(weeks): 8

(graphs omitted)

What the Metrics Say About the Models
  1. Using the product to get a feel for the overall level of maturity. This was necessary in most cases because the long term trends were different to the short term trends. The gradients in the number of open (found but not yet fixed) bugs and code churn graphs did not predict code stabilization directly.  The exception to this was when there was no pattern in the bug find rate. Bug find rates cannot be guaranteed to not have patterns, so using the product is necessary, as common sense would have suggested.
  2. Plotting number of open defects vs. date. This graph trended in the same direction as the number of remaining underlying defects in code so it was a good metric. Some work is required to filter out the effects of the bug find rate.
  3. Plotting code churn vs. date.  This graph trended in the same direction as the number of remaining underlying defects in code so it was a good metric. Some work is required to filter out the effects of the bug find rate.
Conclusion
The metrics work for the simple model. Schedule predictions require that code churn and the number of open defects be corrected for the bug find rate, or that patterns be removed from the bug find rate. One way to remove bug find patterns is to measure bugs with an automated test system where tests are selected by a random number generator. See test automation for complex systems.

The Model
(The spreadsheet implementing the model was embedded here in the original post.)

18 March 2009

You Think Google Has Hard Interview Questions? Try EFI.

Google is famous for its hard interview questions. When I lived in the USA I used to listen to Car Talk on NPR, a radio show that had its own hard weekly questions in its "puzzler" segment. When I was browsing the Car Talk web site I saw their hardest "puzzler".
A hundred prisoners are each locked in a room with three pirates, one of whom will walk the plank in the morning. Each prisoner has 10 bottles of wine, one of which has been poisoned; and each pirate has 12 coins, one of which is counterfeit and weighs either more or less than a genuine coin. In the room is a single switch, which the prisoner may either leave as it is, or flip. Before being led into the rooms, the prisoners are all made to wear either a red hat or a blue hat; they can see all the other prisoners' hats, but not their own. Meanwhile, a six-digit prime number of monkeys multiply until their digits reverse, then all have to get across a river using a canoe that can hold at most two monkeys at a time. But half the monkeys always lie and the other half always tell the truth. Given that the Nth prisoner knows that one of the monkeys doesn't know that a pirate doesn't know the product of two numbers between 1 and 100 without knowing that the N+1th prisoner has flipped the switch in his room or not after having determined which bottle of wine was poisoned and what colour his hat is, what is the solution to this puzzle?
The problem sounded difficult but familiar, and sure enough ....
it was written by a man named Alan B. at EFI, Inc
I had worked at EFI when I was in the USA, and when I interviewed software engineering candidates I asked challenging interview questions, as was common practice there. We interviewers had had to answer difficult questions to get hired there in the first place, so we had seen it from both sides.

Alan B's question brings back some memories. I am glad I didn't get it at my interview.

A web search found my favorite EFI interview question on an interview questions web site.
How hard can Google interview questions be?

11 March 2009

Agile + Ecosystems = Tom Peters?


Tom Peters' book In Search of Excellence listed eight themes that drove success in organizations:
  1. A bias for action, active decision making - 'getting on with it'. 
  2. Close to the customer - learning from the people served by the business. 
  3. Autonomy and entrepreneurship - fostering innovation and nurturing 'champions'. 
  4. Productivity through people- treating rank and file employees as a source of quality. 
  5. Hands-on, value-driven - management philosophy that guides everyday practice - management showing its commitment. 
  6. Stick to the knitting - stay with the business that you know. 
  7. Simple form, lean staff - some of the best companies have minimal HQ staff. 
  8. Simultaneous loose-tight properties - autonomy in shop-floor activities plus centralized values.

Two of the biggest trends in software development in 2009 are Agile  and Ecosystems.

Peters' eight themes seem to cover a lot of what makes Agile and Ecosystems work.
  1. A bias for action, active decision making - 'getting on with it'. Agile 
  2. Close to the customer - learning from the people served by the business. Agile + Ecosystems
  3. Autonomy and entrepreneurship - fostering innovation and nurturing 'champions'.
  4. Productivity through people- treating rank and file employees as a source of quality. Agile
  5. Hands-on, value-driven - management philosophy that guides everyday practice - management showing its commitment. Agile
  6. Stick to the knitting - stay with the business that you know. Ecosystems
  7. Simple form, lean staff - some of the best companies have minimal HQ staff. Ecosystems.
  8. Simultaneous loose-tight properties - autonomy in shop-floor activities plus centralized values. Agile + Ecosystems
This is not surprising since modern development practices, modern business structures and modern studies of business share many common roots. Still, it seems that Peters' book is more useful than the average business book.

Having a small number of guidelines helps with getting started on actual work.

There is a good summary of In Search of Excellence in this Fast Company article. Here is an excerpt:
... Love thy people. Love thy customers. Keep it simple. Lean staff, simple organization. Get the bureaucrats out of the bloody way. Pay attention to the "real" people with dirty fingernails. That was the Oakland Raiders. They were the guys flying the Jolly Roger. They were the pirates, the underdogs. Al Davis, their renegade owner, always preached, "Just win, baby," and his avowed message was . . . "Commitment to excellence."
We got it right when we said that we were in search of excellence. Not competitive advantage. Not economic growth. Not market dominance or strategic differentiation. Not maximized shareholder value. Excellence. It's just as true today. Business isn't some disembodied bloodless enterprise. Profit is fine -- a sign that the customer honors the value of what we do. But "enterprise" (a lovely word) is about heart. About beauty. It's about art. About people throwing themselves on the line. It's about passion and the selfless pursuit of an ideal. ...

08 February 2009

Face Recognition for Android?

A lot of smart people end up at Google. Even Hartmut Neven is there. Does that mean that Android will have face recognition, general object recognition and gaze tracking from the phone's camera? It sounds straightforward: upload the picture to the Google computing cloud, analyze it and download the results. No need to run the image recognition locally on the phone the way Neven used to.

What is the face recognition API called? All I can find is the face detector API.

What Other People Are Saying
Did Google Pull a Neven with Enkin? speculates that Google is going to use image recognition as a way for mobile camera phones to interact with the world, in particular via machine-readable codes on printed pages. I surveyed machine-readable codes for printed pages in a previous post.

This speculation is interesting because I always thought the killer app for mobiles would be image recognition + machine-readable codes on printed pages + OCR + speech recognition + location awareness, along with an engine to deduce useful information from the data. Maybe it's a more obvious idea than I thought. Don't throw out your copy of Duda and Hart!

23 January 2009

Real-time Web Based Power Charting

From Jason Winters' Pico-Projects .


Very cool.

16 January 2009

Online Source Control Service

An online SVN repository is a great idea for very small development teams and individual developers.

$4/month for 500 MB.

See also CVSDude. This press release says:

CVSDude(TM), Inc. -- a pioneer in Subversion(TM) hosting and source code control services -- today announced a wide-ranging series of strategic initiatives and financial milestones that further solidify its worldwide leadership position in these markets. Serving nearly 3,000 customers worldwide with tens of thousands of dedicated users -- 2/3 of these located in North America -- and all supported via http://cvsdude.com.
In addition, despite the worldwide recession, CVSDude also experienced its best revenue month ever in 6 years of business in January of 2009. This is a testament to the strength and robustness of the company's Software-as-a-Service (SaaS) business model, as customers seek to maximize the value of their worldwide programming teams and minimize their upfront capital costs in difficult times.
CVSDude remains a primarily founder and employee-owned company, and took an Angel round of investment in late 2008 to help finance the move to the US. To further grow the company and its customer base, CVSDude has made the strategic acquisition of a competitor, found at http://www.sharpforge.com , sharpforge.org, and projxpert.com. As of March 2009, the CVSDude group of companies will look after clients and users of these sites.

06 January 2009

Risk Management for Software Projects

Large software projects are risky because

  1. They involve people.
  2. They can be complex.
  3. They depend on people and companies outside your direct control, such as OS makers and 3rd party library vendors.
  4. There is little flexibility to make big changes at the end of big software projects. This was described elegantly in The Mythical Man-Month.
Risk Management Strategies
  1. Understand your project's final deliverable very well, including your customer's needs and quality requirements. 
  2. Manage for an appropriate level of risk, not zero risk, which is unachievable in any case. Learn which risks give the best returns and which lead to the worst problems. 

    1. Take the costs of managing risks into account; don't spend more time measuring, avoiding and mitigating risk than you get back from doing so. 
    2. The appropriate level of risk depends on the benefit of your project succeeding, the cost of it failing and the probability of it failing.
  3. Base schedule estimates on reasonable expectation of things going wrong from time to time. An honest schedule allows you to eliminate non-essential features at the start of a project and improve its chance of success. It also wins you the respect of seasoned developers, sets a tone of realism, and provides the foundation for a successful project.

    1. If in doubt then pad your estimated schedule based on the accuracy of your previous schedule estimates. e.g. If your last few projects took 1.5 times as long as you expected them to take then multiply your current schedule estimate by 1.5. This is a way of admitting the limitations of your ability to estimate schedules. As you get better at estimating schedules you will be able to pad less. Meanwhile it is critical that you understand and compensate for your limitations. Padding is necessary.
    2. Take a minimax  approach and minimize your project's reasonable worst case schedule. The reasonable worst case depends on your project's circumstances but it will be something like the 90th percentile, 95th percentile or other high percentile in the distribution of possible schedules.
  4. When your project gets started, keep track of your real schedule. Be good at discovering reality. 

    1. Work with today's real schedule and expectations. This is similar to the popular Agile  methodology. Even if your company does not use Agile methodology, once development starts you need to deal with the current reality of your project. If the reality differs from the plan then you need to deal with the reality. If you can regain your original schedule that's great but you must deal with the reality while the original plan differs from the current reality.
    2. Devise good metrics and use them. Use the metrics to compute a real schedule. E.g. the number of open bugs may be an accurate metric that reflects the true state of the project, but it probably doesn't predict the ship date accurately since bug fixes (like any code change) introduce bugs. Code churn will almost certainly be a better predictor of ship date, but code churn is meaningful only as long as the bug fixes are being driven to the final (shipping) bug fix number (see the churn-extrapolation sketch after this list).
    3. Remember that projects need to converge. Doing extra work at the end of the project invariably leads to slippage. 
    4. Don't shortchange any critical upstream development activities (Steve McConnell  #20). If something is going to have to be done then do it at the most efficient time. Design is much more effective if it is done before coding. Testing is more effective if code is written with testing in mind. Etc. All incomplete critical activities are risks. 
  5. Minimize the number of risks in a project. 

    1. Try to avoid introducing multiple new technologies in a single project.
    2. If you cannot avoid one or more major risks (e.g. new technology, new supplier, new market) in a project then avoid non-essential risks. e.g. If you have to introduce a new technology for the project but you can delay the new supplier until the next project then do so. If your company is large enough to support multiple simultaneous projects then spread the risks between projects.
    3. People are part of the risk. If you have to take a risk then don't add to the risk with inexpert or untried developers or developers who are subject to external pressures. Likewise this is a bad time to use newly formed teams, teams whose members don't back each other up or teams going through major issues.
    4. Infrastructure and organization are part of the risk. If you have to take a risk then support the risk-taking team with your organization's best infrastructure, including IT, HR, facilities, etc. Shield the risk-taking team from distractions such as re-organizations, moving to new premises, learning a new email system, heavy personnel review processes, non-critical training etc.
  6. Break down major risks. Break down complex tasks into several smaller tasks so that one failure won't bring the whole large task down. 

    1. In a series of tasks with measurable completion milestones, failures to achieve the milestones will become apparent early while there is still time to recover.
    2. Breaking a big task down into a set of concurrent tasks can either increase risk by adding coordination risks, or decrease risk if some of the sub-tasks can fail without causing the whole task to fail. If the coordination risk can be minimized and sub-task failures are tolerable then this is a good strategy.
  7. Move risky items to the start of the project. If something goes wrong with the risky items this gives you time to address the problem while code and designs can be changed without high risk. Performing this step rigorously will distinguish well risk-managed projects from other projects. 
  8. Keep a top N (say N=10) risks list. Risks are everywhere and eternal vigilance is required. However not all people have this mindset. A top N risks list is an easy-to-grasp way of communicating risks to a group.
  9. Follow good software development and management practices. Avoid the classic mistakes because they can be easily avoided by reading the list (which I wish I had read before I made most of them). Many standard software engineering principles minimize risk, so don't unlearn them when you work on your first commercial product. In particular don't stop using good development practices when your project is under pressure.
  10. Expect the unexpected.

    1. Never ever forget Murphy's Law. Expect to make mistakes. 
    2. Effective risk management requires sensitivity to risk. If you are not emotionally wired in this way then you will need to learn how to think this way.
    3. Keep in mind that while the past at best provides an imprecise guide to the future (see the item on schedule padding), at worst it provides no indication of the future at all. This is exemplified by Taleb's Turkey: "Imagine that you're a turkey. You've eaten well and lived in safety every day of your life. Everything in your experience tells you that tomorrow will be no different. Then Thanksgiving arrives."
    4. As advised above, adapt to the current reality, even if it differs from your plan.
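As a concrete (and purely illustrative) example of computing a "real schedule" from a metric, the sketch below fits a straight line to recent weekly code churn and extrapolates to the week where churn reaches zero. The churn figures are hypothetical; the point is the mechanism, not the numbers.

#include <cstdio>
#include <vector>

int main() {
    // Hypothetical LOC changed per week over the last 8 weeks.
    std::vector<double> churn = {9000, 8200, 7600, 6900, 6100, 5600, 4800, 4300};

    // Least-squares fit: churn = a + b * week.
    double n = static_cast<double>(churn.size());
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < static_cast<int>(churn.size()); ++i) {
        double x = i;
        sx += x; sy += churn[i]; sxx += x * x; sxy += x * churn[i];
    }
    double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double a = (sy - b * sx) / n;

    if (b >= 0) {
        std::printf("Churn is not trending down; no convergence predicted.\n");
    } else {
        // Week (counted from the start of the window) where the fitted line hits zero.
        std::printf("Churn extrapolates to zero at week %.1f\n", -a / b);
    }
}

This is only a first approximation: as discussed above, the extrapolation is meaningful only while bug fixes dominate the churn, and the prediction should be re-checked every week against the current reality.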


Summary of Risk-Based Schedule Prediction

  1. Knowns. Plan these rigorously.
  2. Known unknowns. Pad schedule for these. e.g. predicted schedule + 2 std devs
  3. Unknown unknowns. Requires eternal vigilance and adapting to the current reality.

What Makes Risk Sensitive Managers Different

Risk management is not just one part of software development project management like design, scheduling and presentation skills. A truly risk-sensitive approach to project management requires explicitly managing by risk. This means that if your scheduling and prioritization are not based on realistic risk estimates then you are not managing risk well. For example:
  1. If, in a project involving 100 engineers, a group of 3 engineers is working on a completely new piece of code that implements difficult algorithms for a critical deliverable and every other engineer is making incremental changes to the new code, then a risk sensitive manager would focus on this group of 3 engineers.
  2. If a project introduced a new software technology, brought in a new hardware supplier and entered a new untested market then a risk sensitive manager would try to break it into 3 projects, each with only one risk.
  3. MOST IMPORTANT EXAMPLE. A risk sensitive manager assumes that a) many of the things that could go wrong will go wrong and b) there is limited scope to correct mistakes at the end of a project. Therefore a risk sensitive manager will trim features at the start of a project to give the project a reasonable chance of success.
Risk sensitive managers manage like this even if doing so is at odds with other management techniques.
  1. In example 1, the risk sensitive manager would be focusing on the risky group of 3, even if it meant PRDs coming in late and the low-risk 97 engineers getting less than optimal help. The risk sensitive manager would work hard to move work from the group of 3 to other people in the 100, and negotiate with other clients of the group of 3 to lighten their load.
  2. In example 2, the risk sensitive manager would lobby high and low through his/her company to avoid the dangerous confluence of risks.
  3. Example 3 illustrates what separates risk sensitive managers from average software managers: the absence of wishful thinking in decision making. In #13 in the previous link Steve McConnell says "Wishful thinking isn't just optimism. It's closing your eyes and hoping something works when you have no reasonable basis for thinking it will. Wishful thinking at the beginning of a project leads to big blowups at the end of a project. It undermines meaningful planning and may be at the root of more software problems than all other causes combined."
And good risk managers always attack uncertainty.

Further Reading

Risk Management bared down to one Question
Why MS Project Sucks for Software Development

19 November 2008

The Tyranny of Complexity

The tyranny of complexity is a problem that has long been known to computer engineers. As long ago as the 1950s, the complexity of software systems was exceeding the limitations of humans to understand it. In the same decade psychologists were measuring the human capacity for dealing with complexity. George A. Miller's classic The Magical Number Seven, Plus or Minus Two: Some Limits on our Capacity for Processing Information famously showed that people could remember only 5 to 9 items of unstructured information. He gave the example: "A man just beginning to learn radio-telegraphic code hears each dit and dah as a separate chunk. Soon he is able to organize these sounds into letters and then he can deal with the letters as chunks. Then the letters organize themselves as words, which are still larger chunks, and he begins to hear whole phrases." In other words, much of humans' effective short-term memory for low-information-content items is achieved by mentally re-coding those items into a smaller number of high-information-content items, a process known as chunking.

This technique will sound familiar to software engineers who use abstraction and modular design to simplify the design of complex systems. Abstraction is not exactly the same thing as chunking but it addresses the same limits of human mental capacity. Software developers who work with high level languages like Java can work effectively without having to understand the assembly code that Java translates to or the hardware actions that the assembly triggers. A high level language by itself does not bring the level of abstraction that developers require, so they also use software libraries and frameworks.

However frameworks and libraries of ever increasing size don't completely solve the problem of complexity. They take ever increasing time to learn, so

  • the IT industry is filled with specialists who have learned one framework or another, and
  • companies buy products and hope they work together 
This means that it will either
  • take time to put together a group who can address a significant issue or construct a significant system, or
  • if an existing group is used, then the solution will have a lot in common with previous solutions.
In 2008 software development parlance, the companies using these large frameworks are not as agile as they could be. Developing a new process might take so long that the process no longer fits the business needs it was meant to address by the time it is complete.

The software development industry has addressed this in various ways, some of which are
  • Emphasizing workmanship over ephemeral results: ".. managing by results is, in effect, exactly the same as ... while driving your automobile, keeping your eye on the rear view mirror"
  • Agile development breaks development down into short cycles with deliverables at each stage to allow efficient iteration to a working solution that meets the customer's needs.
  • Software patterns acknowledge that there are effective methods that transcend computer languages and frameworks and attempt to harness them.
  • The use of business analysts who define the problem to be solved and systems analysts who are responsible for researching, planning, coordinating and recommending software and system choices to meet an organization's business requirements.
  • Outsourcing to companies with specialist expertise, e.g. IBM, HP, Fujitsu.
  • Virtualization, SaaS, cloud computing, etc
  • SOA, or Service-Oriented Architecture, is a yet higher level of abstraction which treats applications as loosely coupled inter-operable services whose functionality is defined along business processes. This includes orchestration, which is the automated arrangement and management of complex computer systems, middleware, and services.
  • IT Governance frameworks 
All of these methods work to some degree. Experienced software developers and managers have used these techniques or techniques like these long before the names were coined. The fact that these methods have recently been formalized and named means they are becoming available to a much broader audience such as developers working on end-customer IT installations.

General Principles for Dealing with Complexity
People have been aware of the importance of pattern recognition in perception of the world since the time of Skinner's pigeons. In this recent article, Judith Polgar says "One of the biggest misconceptions about chess is it requires a lot of memorization. In reality, while some memorization is required, pattern recognition plays a crucial part in chess mastery." Most attempts at dealing with complexity that I am aware of come down to finding structure in the data being analyzed. The challenge is to come up with methods that find this structure quickly and reliably, like Skinner's pigeons and Polgar's chess players.

Effective developers fit the solution to the problem which requires
  • understanding the problem being solved
  • knowing how to solve similar problems
  • an understanding of the underlying principles and technologies
  • the ability to compare different methods
This sounds a lot like what a seasoned developer would do, and is the opposite of mechanically applying canned methods. It values understanding and flexibility over knowledge. However the best computer systems are complex, so any practical approach has to combine sound technique with specific knowledge.

The following questions need to be answered
  • what does my organization need to do over the next N years?
  • how well can I predict that now?
  • what is the most effective way I can spend my IT budget based on the information I have?
An example will illustrate how this affects the technique vs. knowledge decisions.
  • If a major computer installation will take 3 years to complete for an organization whose needs for such a computer system change by 80% in 3 years then agility will be a key factor. However if the organization's needs do not change in that time or can be reliably predicted before the project starts then agility will probably not be critical.
Many studies including this one indicate that businesses are much less agile than they need to be. Here is an excerpt
  • Another survey by Thomke and Reinerstein [10] showed that only 5% of product developing firms have complete product specifications before starting a design, and on the average only 58% of specifications are available before the design process begins. Barclay and Benson [11] cite that as many as 80% of new products fail, whereas about 25% of new industrial products and about 30-35% of consumer products fail to meet expectations.
Intentional SOA for Real-World SOA Builders describes how agility can be difficult to achieve in heterogeneous environments of complex frameworks
  • Platform middleware provides a number of facilities for declarative programming and middleware. ... ESBs also have a very similar form of deployment descriptor which enables loose coupling and metadata management on a per-service basis. ..business agility is impacted, because there is no business user interface for managing this metadata. ...
In other words, a common low-level language (meta-data expressed in XML in this case) by itself is not enough to make two systems with different high-level semantics work together. It is revealing that the author thought he/she needed to spell this out. This implies that many or most businesses don't understand the need for common high-level semantics across systems that need to work together.

Products that Help Discover Structure in Information
These are all products that help people find patterns in their work. They are all complementary to the above well-known methods, so I expect one or more of them to be widely adopted by IT organizations or framework builders.
  • Mind Manager is an intuitive visual tool that allows users to create structure from unstructured information
  • reQall  generates reminder lists and has a cool iPhone interface
  • Jott is like reQall and has voice recognition
  • Evernote  has some cool character recognition for converting scanned notes to data
  • Leximancer drills into textual data ... and extracts the main concepts, themes and causal relationships to provide the information needed to make critical decisions. Some examples:
    • Human Error in Maritime Operation  says a large proportion of human error problems in the maritime domain can be grouped into a single category that can be labeled “loss of SA ”. ...  Almost identical percentages were obtained between the manual coding and with the Leximancer analyses. ..Additionally, analysis of accident report forms can be successfully undertaken by using a tool such as the Leximancer.
    • Technical description and evaluation
  • Austhink  Rationale
  • NICTA managing complexity 
I will attempt to quantify the ideas discussed here in a future post.

01 November 2008

Test Automation For Complex Systems Continued

In a previous post I discussed some innovative methods that had been developed to test a software product line comprising 3M lines of C/C++ server code, 10 client apps and 300 developers. In that post I discussed: 
  • A middleware layer against which the tests were run
  • The benefits of getting started early on a prototype solution
  • A CruiseControl-like build+test framework that implemented continuous builds and optionally tested each code check-in
  • Using VMware Lab Manager to dramatically improve test reliability, test hardware usage efficiency and flow of tests through the automated build+test system
  • Using methods from Experimental Design to dramatically improve test efficiency

Some issues remained

  • Testing needed to become yet more efficient. At this time the entire server still needed to be tested comprehensively on any code check-in that could affect the entire system. The full system had not been broken down into sub-systems that could each be tested comprehensively with small subsets of the full system test suite.
  • We needed to improve detection of concurrent defects such as race conditions and deadlocks. This was (and is) a well-known problem in software development. Faults found in the field were often produced under stress and timing conditions that were difficult to reproduce in testing.
  • The test system needed to find bugs without continual re-writing or re-tuning of tests.

Concurrent Testing
As developers of the server code being tested, we knew the fastest way of flowing jobs through the system. We also had a good idea of where concurrent defects were likely to occur. We developed a single solution to these two problems by refactoring our test programs into a set of threads that exercised different parts of the server. In their standard modes the new multi-threaded test programs maximized throughput by keeping all stages of all server processing pipelines full. This by itself unmasked many concurrent bugs by exercising all major threads in the system while efficiently exploring many combinations of input parameters. We then added code to trigger all the user-switchable state transitions (abort, cancel, etc) in the server's processing pipelines and control loops to call these rapidly and with different timings. This in turn uncovered many more bugs.
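The shape of those test programs was roughly as follows. This is a minimal sketch written for this post, not the original code, and submitJob/abortJob/cancelJob are hypothetical stand-ins for the real middleware calls: worker threads keep the processing pipelines full while a separate thread fires abort/cancel transitions with randomized timing.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <random>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the middleware layer under test.
void submitJob(int jobId) { std::printf("submit %d\n", jobId); }
void abortJob(int jobId)  { std::printf("abort %d\n", jobId); }
void cancelJob(int jobId) { std::printf("cancel %d\n", jobId); }

int main() {
    std::atomic<bool> running{true};
    std::atomic<int> nextJob{0};

    // Worker threads keep all stages of the processing pipelines busy.
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        workers.emplace_back([&] {
            while (running) {
                submitJob(nextJob++);
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
            }
        });
    }

    // One thread triggers user-switchable state transitions with randomized
    // timing, to vary when they land relative to normal processing.
    std::thread chaos([&] {
        std::mt19937 rng(std::random_device{}());
        std::uniform_int_distribution<int> delay(1, 50);
        while (running) {
            int victim = nextJob.load();
            (delay(rng) % 2 == 0) ? abortJob(victim) : cancelJob(victim);
            std::this_thread::sleep_for(std::chrono::milliseconds(delay(rng)));
        }
    });

    std::this_thread::sleep_for(std::chrono::seconds(2));  // run briefly for the demo
    running = false;
    for (auto& w : workers) w.join();
    chaos.join();
}

Randomizing the delays matters because it sweeps the timing of the state transitions relative to normal processing, which is how this kind of test unmasks race conditions and deadlocks that fixed-timing tests tend to miss.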

Finding New Bugs without Changing Tests 
Two well-known behaviors of software development organizations are

  • Defects found by tests tend to get fixed.
  • Code that is tested often tends to have fewer defects.

The inevitable result of this is that static tests will tend to find fewer bugs over time as the defects they find get fixed and the code they exercise gets debugged. The consensus among the testers in our organization was that they found at least 80% of their bugs through exploratory testing and less than 20% through running their standard test matrices. Their exploratory testing included such things as testing recently changed features and testing functionality they had observed to be fragile in the past.

Effectiveness starting at 20% then tapering off was not what we had in mind for our test system. We had partially addressed this in the initial design by:

  • Allowing tests to be based on server configuration. E.g. meta-data attributes were a key test parameter, so the test programs had an option to read all the meta-data keys and their allowed values from a server and generate test cases from them (a sketch of this idea follows the list). This distributed test coverage evenly between previously and newly exposed code paths. While this was less directed toward testing new code paths than the QA department's exploratory strategy of targeting recent changes, it was much more effective than continually re-testing the same well tested code paths as a purely static test would have done. It also had the nice behavior that two servers with identical code and configuration would always be tested the exact same way, while any change to the configuration or middleware layer would result in a different test with good coverage of the changes.
  • Having a pure exploratory mode where a crawler found new test files and a time seeded pseudo random number generator created truly random test parameters.
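Here is a minimal sketch of the configuration-driven idea (the key names and the hashing scheme are illustrative, not the original implementation): test parameters are enumerated from the server's meta-data, and the random generator is seeded from a hash of that configuration, so identical code and configuration always produce identical tests while any configuration change produces a different but still well-spread set.

#include <cstddef>
#include <cstdio>
#include <functional>
#include <map>
#include <random>
#include <string>
#include <vector>

using Config = std::map<std::string, std::vector<std::string>>;

int main() {
    // Hypothetical meta-data read back from the server under test.
    Config config = {
        {"colorModel", {"gray", "rgb", "cmyk"}},
        {"duplex",     {"off", "long-edge", "short-edge"}},
        {"staple",     {"none", "corner"}},
    };

    // Seed deterministically from the configuration itself.
    std::size_t seed = 0;
    for (const auto& kv : config)
        for (const auto& v : kv.second)
            seed ^= std::hash<std::string>{}(kv.first + "=" + v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);

    std::mt19937 rng(static_cast<unsigned>(seed));

    // Generate a fixed number of test cases by picking one value per key.
    for (int i = 0; i < 5; ++i) {
        std::printf("test %d:", i);
        for (const auto& kv : config) {
            std::uniform_int_distribution<std::size_t> pick(0, kv.second.size() - 1);
            std::printf(" %s=%s", kv.first.c_str(), kv.second[pick(rng)].c_str());
        }
        std::printf("\n");
    }
}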

The configuration based testing mode was widely used. It continued to find a lot of bugs as long as code or configuration changed.
The pure exploratory mode was seldom used other than for creating data for the statistical analysis module described in the previous post.

Conclusions
Some of the things we learned from this work were:
  • It was important to fit the solution to the problem. IMO the one thing that most characterized our approach was that we did not start with a solution in mind. We continually re-analyzed the problem and the efficacy of our partial solutions to it and ended up with solutions that few people had expected.
  • We created a design verification system rather than automating the quality assurance department's manual testing
  • We emphasized bug find+fix volume over bug prioritization
  • We automated at the middleware layer rather than with client app button pushers
  • Virtualization was a key component of our solution
  • Statistical analysis played a key role in the design of the solution
  • Automated tests continued to find many bugs even after code stopped changing. We investigated this and found that new code paths were being exposed to our tests by changes in configuration made after code freeze. As a result the company started treating configuration changes the same as code changes, requiring them to asymptote to zero before shipping.  
  • It can be difficult to fit solutions to a large problem. Big problems often require big solutions and building big solutions can be expensive. For example, we discovered that CruiseControl brought us little value so we had to invest a lot of money in developing our own CruiseControl-like automated build+test system.
  • Integration is a major cost in big solutions. Few of the off-the-shelf tools we looked at worked well together. A large fraction of the cost of this work went on integration. The major exception to this was Lab Manager and the VMware tools supporting it, which integrated well with all parts of our system.
  • The value of an effective IT department. These people understood integration of heterogeneous systems, rolling out hardware and software and meeting service level requirements in a way that we developers did not. The VMware products, which were designed for IT departments, had the same qualities.

29 October 2008

Test Automation for Complex Systems

A while back the software development team I was working in was struggling with late stage severity 1 bugs. There were about 3M lines of C/C++ server code, 10 client apps and 300 developers. The code had evolved over 10-15 years, it was rich in functionality and much of it was timing critical. Therefore it came as no surprise that the code had some problems and that the problems were hard to fix.

First some questions had to be answered.

Q: What was the problem to be solved? 
A: To find and fix most severity 1 bugs in early stages of development.

Q: Why was this not being done now? 
A: Complexity. The combinatorics of the system were daunting. It had the benefit of a comprehensive middleware layer that cleanly separated clients from the server. However the middleware layer was large (approx 1,000 APIs, all with multiple arguments), the server processed complex files specified by multiple complex languages, and 100+ meta-data attributes with an average of 4 values each could be applied to each file. The server was multi-threaded to handle multiple users and to pipeline the processing of jobs, and the threads shared data and state, so interactions were important. The system also had a large real-time device control component.
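Back-of-the-envelope arithmetic (illustrative numbers only, not measurements from the real system) shows why exhaustive testing of the meta-data attributes alone was out of the question, and why the Experimental Design methods mentioned in the follow-up post were attractive:

#include <cmath>
#include <cstdio>

int main() {
    const double attributes = 100.0;        // meta-data attributes
    const double valuesPerAttribute = 4.0;  // average values per attribute

    // Exhaustive combinations of attribute values: 4^100.
    double log10Exhaustive = attributes * std::log10(valuesPerAttribute);

    // Distinct pairs of attribute values to cover: C(100,2) * 4 * 4.
    double pairs = (attributes * (attributes - 1.0) / 2.0)
                   * valuesPerAttribute * valuesPerAttribute;

    std::printf("Exhaustive combinations: about 10^%.0f\n", log10Exhaustive);
    std::printf("Attribute-value pairs to cover: %.0f\n", pairs);
}

Exhaustive coverage is around 10^60 combinations, but there are only about 80,000 attribute-value pairs to cover, and the number of test cases needed to cover all pairs grows only logarithmically with the number of attributes, which is why designed experiments can cover the input space so much more efficiently than brute force.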

Q: What was there to work with?
A: Working code. The system had bugs but it worked. Many parts of the code were well-written. It was possible that the code could be bug fixed to an acceptable level of stability without a major re-write.

Q: What was the best starting point? 
A: The middleware layer. All the company's client apps communicated with the server through the middle-ware layer so it was theoretically possible to test any customer-visible scenario by scripting the middleware layer.

Those questions and answers appeared to narrow the scope of the problem as much as it was ever going to be narrowed by theoretical analysis. Therefore we put together a prototype test framework to see what was achievable in practice. The late stage severity 1 bugs were very costly so time was of the essence.

The Design Goals

  • Test as close as possible to code submission time.
  • Get started as early as possible. The goal was to fix bugs. The earlier we started, the more we could fix.
  • Test as much user-exposed functionality as possible as quickly as possible.
    • Test automatically.
    • Get as many test systems as possible. Either re-deploy existing systems or purchase new ones.
    • Cover functionality as efficiently as possible.
The First Implementation
  • Write test cases as programs that use the middleware layer.
  • Try to save development time by using an existing automated test framework. We quickly found that there were no frameworks that helped much with our 3 main requirements:
    • Integrate with our source code management (SCM) system and our build systems.
    • Install our code on our test machines.
    • Invoke our middleware-based test programs.

  • Get started early by writing test programs that exercise well-known functionality.
  • Create a simple test framework to learn about writing automated test frameworks (a rough sketch of its main loop follows this list). This framework would:
    • Trigger from SCM check-ins.
    • Invoke the build system.
    • Install built code on test system.
    • Run tests.
    • Save results.
    • Display results.

  • Run reliably on 10 test machines running 24x7.
  • Run fully enqueued.
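
As promised above, here is a rough sketch of the simple framework's main loop. The scm_poll.sh, build.sh, install.sh, run_tests.sh, save_results.sh and publish_results.sh commands are placeholder names rather than our real tools, and the real framework tracked far more state; this is only meant to show the shape of the cycle.

// Shape of the prototype framework's polling loop. The shell commands are
// placeholders for the real SCM, build, install and test tooling.
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <thread>

// Run one step of the cycle and report whether it succeeded (exit status 0).
static bool runStep(const char* cmd) {
    std::cout << "running: " << cmd << std::endl;
    return std::system(cmd) == 0;
}

int main() {
    for (;;) {
        // 1. Trigger: has a new check-in appeared since the last cycle?
        if (!runStep("./scm_poll.sh")) {
            std::this_thread::sleep_for(std::chrono::minutes(5));
            continue;
        }
        // 2-4. Build the new code, install it on a test machine, run the tests.
        bool ok = runStep("./build.sh") &&
                  runStep("./install.sh test-machine-01") &&
                  runStep("./run_tests.sh test-machine-01");
        // 5-6. Save the results and publish them so developers can see failures.
        runStep(ok ? "./save_results.sh pass" : "./save_results.sh fail");
        runStep("./publish_results.sh");
    }
}
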
One developer wrote a prototype framework while other developers wrote tests or adapted existing tests to the framework. Within a few months we were running on 20 test machines with a duty cycle of around 50%. It was not perfect: we made heavy use of TCP/IP-controlled power switches and other tricks, and we had to restart and re-image test machines regularly. But it worked, and we had learned a lot.

Lessons Learned From the First Implementation
  • Getting a prototype out fast was invaluable.
  • Many low-hanging server bugs were found and fixed.
  • We had learned how to build the final automated test system.
  • The company's middleware layer was the key to the success of our system-level automated testing. All our success was built on the wisdom and hard work of the team who designed and implemented this layer, made it work for all servers and all clients in the company's product line, and rigorously maintained backwards compatibility over its history.
  • To be effective, the automated test system had to work reliably in a fully enqueued mode. That is, a code check-in had to trigger a build, installation, test, saving of results, capture of failed states, notification to developers and reservation of the failed test system, all without human intervention. Doing so gave 24x7=168 hours of testing per test machine per week. Waiting for human intervention could drop throughput by a factor of 10. As the number of test machines increased, the wait for each human intervention grew. As the tests became better at killing test machines, the number of human interventions increased.
  • Tests needed to be more effective. Even though our tests had found as many bugs as we had time to fix, it was still easy to find bugs in manual testing. Moreover the code being tested was still under development and it was clear that it was not stabilizing fast enough to meet target schedules.
Critical Next Steps
By this time, work on the final automated build and test system was well underway. It was much more sophisticated than our prototype: it had a scalable design, it was easy to use and robust, and its code was maintainable over the long term. But we still had to address the issues of test machine reliability, full enqueuing and testing effectiveness.

The problems of test machine reliability and full enqueuing turned out to have already been solved by VMware Lab Manager. We did not get to Lab Manager directly: we tried building something similar ourselves before we found out that Lab Manager solved all the problems we had been working through, plus some we had yet to encounter. The key benefits of Lab Manager were
  • Test machines were virtualized, so they were as reliable as the code being tested. We no longer needed TCP/IP-controlled power switches and human intervention to deal with hung machines. Hung virtual test machines could be killed and recycled with a single remote function call.
  • The full test cycle could be enqueued. Virtual machines could be created, software installed on them, tests run, and failed virtual systems saved to disk for later examination. The duty cycle went up to close to 100%.
  • The full test cycle was implemented extremely efficiently with VMware technology. The saved VM states were stored as deltas from the common test images. Efficiency (the number of tests per physical PC per week) went up by over a factor of 10.
  • It scaled very well. Combined with the scalability of the new auto-build+test system, this let the number of (virtual) machines under test quickly grow to over a hundred and keep growing.
We found no similar existing solution to the testing efficiency problem. To restate the problem, there were many ways the product could be used and testing all these cases was taking too long. The number of possible ways the product could be used was the product of
  • The number of different input files. This was high, but we did not know what the effective number of input files was.
  • The meta-data applied to each file: 140 attributes with an average of 4 values per attribute, i.e. 4^140 ≈ 10^84 combinations.
  • The 1,000 APIs in the middleware layer, which controlled approximately 10 main computational threads that shared state data.
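
As a quick check of the order of magnitude quoted above for the meta-data combinations (under the stated 140 attributes with 4 values each):

// Order-of-magnitude check for the meta-data combinations quoted above.
#include <cmath>
#include <iostream>

int main() {
    // 4^140 = 10^(140 * log10(4)) ~= 10^84.3
    std::cout << "4^140 ~ 10^" << 140 * std::log10(4.0) << std::endl;
    return 0;
}
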
The field of Experimental Design explains how to test systems with several input variables efficiently. The key technique is to change several input variables on each test run. The challenges are knowing which variables to change on each run and interpreting the results when several variables are changed at once. An example will illustrate:
  • The above meta-data turned out to have a big effect on the outcome of tests. Given that a test run took at least 3 seconds, the only way to test all 140 meta-data attributes was to change most attributes on all test runs. After a set of test runs with, say, 100 different meta-data attributes set on each run, one test run fails. How do you find out which of the attributes caused the failure? (One answer is given at the end of this post).
The following is a rough outline of how we designed our software experiment.

Design of Software Experiment 
The number of input variables to test was far too high for traditional experimental designs, so we used Optimal Designs. First we narrowed the input variable list to the ones that were important for the outcomes we cared about. While we were software engineers who understood how each individual function in the code behaved, we could not construct any useful model of how the entire system would behave. Therefore we did this empirically and used Dimensional Reduction. What we did was too complicated to describe in a blog post, but a simplified version can be summarized as follows:
  • Collected as many input files as possible. We wrote a crawler to search for all the company's test files and installed snoopers on all the company's test machines to create a superset of the company's test files.
  • Synthesized the D-Optimal set of meta-data attributes.
  • Manually selected the key middle-ware APIs.
  • Executed many test runs.
  • Found the Principal Components of the input variables that described the bulk of the variation in the outcomes we cared about.
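
To give a flavour of that last step, the sketch below extracts the top principal component of a small input matrix by power iteration on its covariance matrix. It is a toy illustration only: the 4x3 data is invented, and the real analysis ran over far more runs, variables and components than this.

// Toy dimensional-reduction step: find the direction of greatest variance
// (top principal component) of a small matrix by power iteration.
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    // Rows = test runs, columns = input variables (invented toy values).
    std::vector<std::vector<double>> x = {
        {1.0, 2.0, 0.5}, {2.0, 4.1, 0.4}, {3.0, 6.2, 0.6}, {4.0, 7.9, 0.5}};
    const size_t n = x.size(), d = x[0].size();

    // Centre each column so the covariance is taken about the mean.
    for (size_t j = 0; j < d; ++j) {
        double mean = 0;
        for (size_t i = 0; i < n; ++i) mean += x[i][j];
        mean /= n;
        for (size_t i = 0; i < n; ++i) x[i][j] -= mean;
    }

    // Covariance matrix c = x^T x / (n - 1).
    std::vector<std::vector<double>> c(d, std::vector<double>(d, 0.0));
    for (size_t j = 0; j < d; ++j)
        for (size_t k = 0; k < d; ++k)
            for (size_t i = 0; i < n; ++i) c[j][k] += x[i][j] * x[i][k] / (n - 1);

    // Power iteration: repeatedly apply c to a vector and normalise it.
    std::vector<double> v(d, 1.0);
    for (int iter = 0; iter < 100; ++iter) {
        std::vector<double> w(d, 0.0);
        for (size_t j = 0; j < d; ++j)
            for (size_t k = 0; k < d; ++k) w[j] += c[j][k] * v[k];
        double norm = 0;
        for (double wj : w) norm += wj * wj;
        norm = std::sqrt(norm);
        for (size_t j = 0; j < d; ++j) v[j] = w[j] / norm;
    }

    // The first two toy columns move together, so the top component should
    // load heavily on both: evidence they could be merged into one factor.
    std::cout << "top principal component:";
    for (double vj : v) std::cout << " " << vj;
    std::cout << std::endl;
    return 0;
}
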
For some of the input variables the dimensional reduction was high; in other cases little reduction was achieved. We ended up with a dramatically simplified but still complex set of tests: essentially one thousand test files, one hundred meta-data attributes, a smallish number of middleware calls, and all the timing-dependent interactions between those. This turned out not to be so bad. Since the remaining input variables were uncorrelated to within the limits of our testing, we could test them all at once without them interfering significantly with each other. That meant we could do something similar to Fuzz testing (e.g. Java Fuzz Testing, Fuzzers - The ultimate list).
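
The fuzz-like part can be pictured as below: every test run gets one randomly chosen value for every attribute, so a single run exercises many attributes at once. The attribute names, the fixed counts and the name=value output format are illustrative assumptions; the fixed seed is there so a failing run can be reproduced exactly.

// Sketch of fuzz-like input selection: one random value per attribute per run.
// The counts and attribute names are illustrative, not the real ones.
#include <iostream>
#include <random>
#include <vector>

int main() {
    const int kAttributes = 100;  // attributes set on each run (illustrative)
    const int kValues = 4;        // values per attribute (illustrative)
    std::mt19937 rng(12345);      // fixed seed so a failing run can be replayed
    std::uniform_int_distribution<int> pick(0, kValues - 1);

    // One test run: choose a value for every attribute and emit the settings
    // that would be applied to the job under test.
    std::vector<int> run(kAttributes);
    for (int a = 0; a < kAttributes; ++a) run[a] = pick(rng);
    for (int a = 0; a < kAttributes; ++a)
        std::cout << "attr" << a << "=" << run[a] << (a + 1 < kAttributes ? ";" : "\n");
    return 0;
}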

The meta-data testing illustrates how this worked. There were 100 attributes with 4 values each. Say that, as an idealization, the random number generator selected every value of every attribute within 10 test runs (that is, rand() % 4 returned 0, 1, 2 and 3 within 10 test runs; in practice near-certainty takes somewhat more runs) and that the values it selected for each attribute were uncorrelated. Then, roughly:
  • in 10 test runs, all outcomes that depended on 1 attribute would have been tested
  • in 100 test runs, all outcomes that depended on 2 attributes would have been tested
  • in 1000 test runs, all outcomes that depended on 3 attributes would have been tested
This was good coverage, especially since real users tended not to set a lot of attributes at once and the code was modular enough that the attributes tended not to have a lot of inter-dependencies. Debugging failures took some time because each test run had 100 meta-data attributes set. The vast majority of bugs were caused by only one or two of those attributes, but the culprit attributes still had to be found. The solution we used was a binary search: re-running the failed test with selected attributes disabled until we found the minimal set of attributes that triggered the failure.
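
A sketch of that narrowing step is below. The failsWith() predicate is a stand-in that pretends the failure needs attributes 17 and 42 to be set together; in reality each call was a re-run of the failed test through the framework. The halving-plus-drop-one strategy is a simplified version of what we did.

// Narrow a failing run's 100 attributes down to a minimal failing set.
#include <algorithm>
#include <iostream>
#include <vector>

// Stand-in for re-running the failed test with only these attributes enabled.
// Here it pretends the bug needs attributes 17 and 42 set together.
static bool failsWith(const std::vector<int>& attrs) {
    bool has17 = std::find(attrs.begin(), attrs.end(), 17) != attrs.end();
    bool has42 = std::find(attrs.begin(), attrs.end(), 42) != attrs.end();
    return has17 && has42;
}

int main() {
    // Start from the 100 attributes that were set on the failed run.
    std::vector<int> suspects(100);
    for (int i = 0; i < 100; ++i) suspects[i] = i;

    // Narrow by halves first: if either half reproduces the failure on its
    // own, the other half can be discarded.
    bool shrunk = true;
    while (shrunk && suspects.size() > 1) {
        shrunk = false;
        std::vector<int> lo(suspects.begin(), suspects.begin() + suspects.size() / 2);
        std::vector<int> hi(suspects.begin() + suspects.size() / 2, suspects.end());
        if (failsWith(lo)) { suspects = lo; shrunk = true; }
        else if (failsWith(hi)) { suspects = hi; shrunk = true; }
    }

    // When the culprits straddle both halves (as 17 and 42 do here), fall back
    // to dropping one attribute at a time, keeping the drop if it still fails.
    for (size_t i = 0; i < suspects.size(); ) {
        std::vector<int> without = suspects;
        without.erase(without.begin() + i);
        if (failsWith(without)) suspects = without;  // the dropped attribute was innocent
        else ++i;                                    // the attribute is needed for the failure
    }

    std::cout << "minimal failing attribute set:";
    for (int a : suspects) std::cout << " " << a;
    std::cout << std::endl;  // prints: 17 42
    return 0;
}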

TO COME:
Pipelining tests