blogit ergo sum

27 December 2013

Digital Signal Processing MOOC

I just finished Paolo Prandoni and Martin Vetterli's Digital Signal Processing course on Coursera.

Unlike the MOOCs I had studied before, I did not spend much time on this course. I have worked on various types of signal processing in my career and as I progressed through this course I found most of the material was familiar to me.

Digital Signal Processing is a diverse collection of subjects defined in Wikipedia as being the mathematical manipulation of an information signal to modify or improve it in some way, including the following laundry list: audio and speech signal processing, sonar and radar signal processing, sensor array processing, spectral estimation, statistical signal processing, digital image processing, signal processing for communications, control of systems, biomedical signal processing, seismic data processing, etc!

The course syllabus was

Discrete time signals
Vector spaces
Fourier Analysis
Linear Filters
Interpolation and Sampling
Stochastic Signal Processing and Quantization
Image Processing
Digital Communication Systems

The lectures were based on a very thorough online book that was so easy to read, I did not watch many of the lectures. The lectures seemed good for an introduction to the field.

Assessment was by quiz. I would have preferred programming assignments.

My quiz results suggested that I understood all the material in the course except the design of ADSL.

26 August 2013

Discrete Optimization Course Finished

I recently finished Pascal Van Hentenryck's Discrete Optimization online course.

The course differed from other tertiary courses and MOOCs I had taken in the past in several ways.

It started with the video on the right instead of the dry introductions I was used to in math and computing courses.
The format of the course was to solve several instances of one NP-hard problem per week.
Little guidance was given. Students were left to choose from the methods given in lectures.
There were no quizzes!

Discrete Optimization, as the name suggests, is about optimizing things that take on discrete (whole number) and usually positive values. That covers much of the real world, e.g. items in a knapsack, cities on a route or vehicles in a delivery schedule, so it is widely applicable. For example, the introductory video told us that Australia spends 10-20% of GDP on logistics, and logistics is made up of discrete things like numbers of packages and vehicles.

The Course

The basic ideas of the course were to:

Learn the main techniques of discrete optimization.

Basics: e.g. randomization, greedy search, dynamic programming, branch and bound.
Constraint programming
Local search
Mixed integer programming

Solve a series of problems using any of the techniques learned in combination or alternative methods.

Knapsack
Graph Coloring
Traveling Salesman
Warehouse Location
Vehicle Routing
Puzzle Challenge (optional)

In the words of the Course Syllabus

The course has an open format. At the start of the course all of the assignments and lectures are available and each student is free to design their own plan of study and proceed at their own pace. The assessments in the course consist of five programming assignments and one extra credit assignment. In the programming assignments, students experience the challenges of real world optimization problems such as selecting the most profitable locations of retail stores (warehouse location) and the design of package delivery routes (vehicle routing).

This sounded practical on one hand; after all, I usually want to learn how to solve real world problems in computer courses. On the other hand it did not make it clear how to start.

Lacking a plan, I watched all the lectures then started solving the problems. This was probably about the worst way to do the course. Later on I noticed there was a study guide that gave a good way of doing the course. Oh well.

I watched about 40 lectures that contained some deep insights and lots of practical advice about discrete optimization.

Introduction, branch and bound, dynamic programming.
Constraint programming
Local search
Linear programming and mixed integer programming.
Advanced topics, column generation, disjunctive global constraints, limited discrepancy search and large neighborhood search.

Then I started on the problem sets.

The problem sets were in increasing order of difficulty and each set contained instances that were in increasing size and therefore difficulty. The size was important as all the problems were NP-hard, which means roughly that no known exact algorithm solves them in fewer than a^n steps, where a > 1.0 and n is the size of the problem (e.g. the number of cities in the traveling salesman problem).

Solving the Problems

My problem solving went like this.

Knapsack

Used branch and bound. Spent a lot of time getting tight bounds using a technique that was something like a genetic algorithm and developed a special data structure for it (a fixed size sorted deque). Spent way too much time on this.
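
For illustration only, here is a minimal sketch of branch and bound on the 0/1 knapsack problem. It is not the solution described above: the bound is the standard fractional (linear relaxation) bound rather than the genetic-style bounding, and the function name and the (value, weight) item format are made up for the example. It assumes all weights are positive.

```python
def knapsack_branch_and_bound(items, capacity):
    """items: list of (value, weight), weights > 0. Returns (best_value, chosen_indices)."""
    # Visit items in decreasing value density so the greedy bound is tight.
    order = sorted(range(len(items)),
                   key=lambda i: items[i][0] / items[i][1], reverse=True)

    def bound(k, value, room):
        # Optimistic estimate: fill the remaining room greedily,
        # allowing a fraction of the first item that does not fit.
        for i in order[k:]:
            v, w = items[i]
            if w <= room:
                value += v
                room -= w
            else:
                return value + v * room / w
        return value

    best_value, best_set = 0, []
    # Depth-first search with an explicit stack of
    # (position in order, value so far, room left, chosen items).
    stack = [(0, 0, capacity, [])]
    while stack:
        k, value, room, chosen = stack.pop()
        if value > best_value:
            best_value, best_set = value, chosen
        if k == len(order) or bound(k, value, room) <= best_value:
            continue  # leaf, or this subtree cannot beat the incumbent
        i = order[k]
        v, w = items[i]
        stack.append((k + 1, value, room, chosen))  # branch: skip item i
        if w <= room:
            stack.append((k + 1, value + v, room - w, chosen + [i]))  # branch: take item i
    return best_value, sorted(best_set)
```

The tighter the bound, the more subtrees are pruned, which is why the quality of the bound matters so much for this method.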

Graph Coloring

Used local search. Spent time developing efficient moves and more time developing metaheuristics. Learnt why my original metaheuristic that I took so long to develop is not in wide use. Spent too much time on this.

Traveling Salesman

Used local search. Spent time developing a visualization and moves. Spent most of my time tuning well-known metaheuristics and experimenting with search strategies. Learnt why my original search strategies are not in wide use.
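
As an example of what a local search move for TSP looks like, here is a minimal sketch of 2-opt, one of the well-known moves. It is only an illustration and not necessarily one of the moves described above; the function names and the list-of-(x, y) point format are assumptions.

```python
import math

def tour_length(points, tour):
    """Length of the closed tour visiting points in the given order."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))

def two_opt(points, tour):
    """Repeatedly reverse segments of the tour while doing so shortens it."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a city
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Change in length from replacing edges a-b and c-d with a-c and b-d.
                delta = (math.dist(a, c) + math.dist(b, d)
                         - math.dist(a, b) - math.dist(c, d))
                if delta < -1e-12:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour
```

Each pass over all pairs of edges costs O(n^2), which hints at why a plain implementation like this does not scale to large N.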

Warehouse Location

Used a MIP solution. Spent most of my time getting scip working and formulating the MIP problems.

Vehicle Routing

Had run out of time when I started this. Sorted vehicle trips by angle with warehouse, jiggled them back to feasibility and ran my TSP solution on each vehicle trip.
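
The first step, sorting by angle around the warehouse, can be sketched roughly as below. This is a simplified illustration with made-up names: it assumes every customer's demand fits in one vehicle, ignores any limit on the number of vehicles, and leaves out the "jiggling" and the per-trip TSP step.

```python
import math

def sweep_assign(depot, customers, capacity):
    """customers: list of (x, y, demand). Returns a list of trips (lists of customer indices)."""
    def angle(i):
        x, y, _ = customers[i]
        return math.atan2(y - depot[1], x - depot[0])

    trips, current, load = [], [], 0
    # Sweep around the depot in order of polar angle, filling one vehicle at a time.
    for i in sorted(range(len(customers)), key=angle):
        demand = customers[i][2]
        if current and load + demand > capacity:
            trips.append(current)  # close this trip and start a new vehicle
            current, load = [], 0
        current.append(i)
        load += demand
    if current:
        trips.append(current)
    return trips
```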

Puzzle Challenge

Had hours left when I started this so I used whatever method came into my head first.

e.g. For the N queens problem, I could remember MIP, local search and constraint programming solutions. Constraint propagation seemed the most natural way to do it so I wrote a simple Python script for a constraint based solution. This script was too slow to get a high score (valid solution for high N) so I made a few improvements.

Randomized the order of searching rows for new queens as searching every column in increasing row order is very slow. I didn't have time to figure out what a good order was but random seemed unlikely to be bad, and it wasn't.
Coded the inner loop propagate() in Numba, which is one of the quickest ways to make Python code run at close to C speeds.

With those improvements my script could solve for about 900 queens but no more because it was recursive and had reached a Python call depth limit. I didn't have time to convert it from recursive to iterative form so I stopped there.
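
For what it is worth, here is a rough sketch of what an iterative version could look like. It is not the script described above: it is plain backtracking with a randomized column order and an explicit stack instead of recursion, with no Numba and no real constraint propagation, and the function name is made up.

```python
import random

def solve_n_queens(n, seed=0):
    """Return cols where cols[r] is the column of the queen in row r, or None if unsolvable."""
    rng = random.Random(seed)
    cols, used_cols, used_d1, used_d2 = [], set(), set(), set()

    def candidates(r):
        # Columns in row r not attacked by the queens placed so far, in random order.
        free = [c for c in range(n)
                if c not in used_cols
                and (r + c) not in used_d1
                and (r - c) not in used_d2]
        rng.shuffle(free)
        return free

    stack = [candidates(0)]  # stack[r] holds the untried columns for row r
    while stack:
        r = len(stack) - 1
        if stack[r]:
            c = stack[r].pop()
            cols.append(c)
            used_cols.add(c)
            used_d1.add(r + c)
            used_d2.add(r - c)
            if len(cols) == n:
                return cols
            stack.append(candidates(r + 1))
        else:
            stack.pop()  # dead end: backtrack to the previous row
            if cols:
                c = cols.pop()
                used_cols.discard(c)
                used_d1.discard(r - 1 + c)
                used_d2.discard(r - 1 - c)
    return None
```

Because the search state lives in a list rather than on the call stack, the depth is limited by memory and not by the interpreter's recursion limit.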

Conclusion

This seemed like a good way to learn. I went about it in an inefficient way but I learned first hand why the best methods don't work like the ones I invented myself. I also got to see

what was important for each method. e.g. tight bounds for branch and bound, efficient moves for local search
what scales to large N. e.g. not my TSP local search
how much time is involved in learning the packages that can be used for discrete optimization, e.g. scip

You can read other students' comments on the course here.

12 August 2012

Twitter Bot

I recently wrote a Twitter bot to offer sympathy to people who have suffered paper cuts. You can see it here.

The Twitter bot link in the last paragraph has all the code. There was nothing particularly complex in it. The hardest part was classifying tweets as being from people who suffered paper cuts or not. This was done with a very simple naive Bayes classifier.
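
To give an idea of how little is involved, a naive Bayes classifier over word n-grams can be sketched in a few lines. This is not the bot's actual code: the class name, the add-one smoothing and the choice of unigrams plus bigrams are all assumptions made for the sketch, and it needs at least one training example of each class.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """All word n-grams of length 1..n_max in text."""
    words = text.lower().split()
    return [' '.join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]

class NaiveBayes:
    def fit(self, texts, labels):
        """labels are booleans: True if the tweet is from someone with a paper cut."""
        self.counts = {True: Counter(), False: Counter()}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(ngrams(text))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def log_odds(self, text):
        """log P(True | text) - log P(False | text), treating n-grams as independent."""
        score = math.log(self.docs[True]) - math.log(self.docs[False])
        totals = {k: sum(c.values()) + len(self.vocab) for k, c in self.counts.items()}
        for g in ngrams(text):
            if g in self.vocab:
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((self.counts[True][g] + 1) / totals[True])
                score -= math.log((self.counts[False][g] + 1) / totals[False])
        return score

    def predict(self, text):
        return self.log_odds(text) > 0
```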

The code from the previous link was trained on 3,225 tweets to give the following 10-fold cross-validation results. The columns are the numbers of tweets containing "paper cut" that the classifier predicts are from people with paper cuts. The rows are the number of tweets that a human (me) says are from people with paper cuts.

===== = ===== = =====
      | False |  True
----- + ----- + -----
False |  1978 |    59
----- + ----- + -----
 True |   308 |   880
===== = ===== = =====
Total = 3225

===== = ===== = =====
      | False |  True
----- + ----- + -----
False |   61% |    2%
----- + ----- + -----
 True |   10% |   27%
===== = ===== = =====
Precision = 0.937, Recall = 0.741, F1 = 0.827

This precision value of 0.937 means that 0.063, or about 1/16, of the tweets that the Twitter bot classifies as being about paper cuts and replies to are not about paper cuts.
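
The precision, recall and F1 figures follow directly from the counts in the first table (rows are my labels, columns are the classifier's predictions). A quick check, with variable names of my choosing:

```python
# Counts from the confusion matrix above.
tn, fp = 1978, 59   # human says False: predicted False, predicted True
fn, tp = 308, 880   # human says True:  predicted False, predicted True

precision = tp / (tp + fp)   # fraction of predicted paper cuts that are real
recall = tp / (tp + fn)      # fraction of real paper cuts that are found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.937 0.741 0.827
```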

There is some additional filtering to reduce the number of inappropriate replies. Filtering increases the precision to 0.95, so about 1 in 20 of the final replies is a false positive. The false positives after filtering are here. Examining them gives a feel for the cases where the simple Bayesian n-gram classifier breaks down.

The most influential n-grams are given here where you can also find a link to the full list of n-grams.

You can get some idea of how well the classification works in the real world from the Twitter responses to it.

UPDATE 17 September 2012. Twitter warned the OwwPapercut account for sending automated replies which seemed odd given @StealthMountain. OwwwPapercut has run in non-replying mode since then.

03 June 2012

NLP Class Completed

The Stanford NLP class finished about a week ago and I got my certificate of accomplishment over the weekend.

NLP Class "Attrition"

Professors Dan Jurafsky and Chris Manning also sent out some course statistics that I saved here and graphed on the right using the Python code in the previous link.

From about the third week of the course onwards I felt like I did not have time to complete watching the lectures and doing the problem sets and programming exercises. Therefore I was wondering what it was like for other people like me with jobs and families.

The graph on the right shows the number of students completing problem sets and programming exercises each week of the course. I have called this "attrition" which cannot be completely right as the number of students completing programming exercises increased from week 6 to week 7.

I have not studied university course attrition rates but I suspect they would differ from NLP-class because of incentives like costs and grade penalties that would lead to students leaving at the start of the course. The somewhat smooth and mostly monotonic curves in the above graph seem more natural for a free class like NLP-class where the main benefits are learning.

So why were people leaving the course all the time? I can only answer for me. In order to learn the skills I wanted, I felt I had to complete the assignments and the assignments were long and tended to get longer through the course. Programming assignment 6 took me the longest and cut into some family time. That seems to be reflected in the graph.

25 March 2012

Stanford NLP Class continued

In one of the early classes one of the lecturers showed some Unix tools for performing low-level tasks such as tokenizing and word counting. You can see these in the course notes.

I prefer not to use such tools because I already use more tools than I want to and it is possible to do all the things taught in lectures with one of the tools I use a lot, Python.

Here is some Python code that does what I remember the lecturer demonstrating.

Tokenizing on non-alphabetic boundaries
import re

_RE_TOKEN = re.compile('[^a-zA-Z]+')

def tokenize(text):
    return [x for x in _RE_TOKEN.split(text) if x]
Finding unique words
The lecturers call unique words word types.
tokens = tokenize(text)
types = set(tokens)
Finding unique words, ignoring case variants
normalized_types = set(tokenize(text.lower()))
Finding word types that differ only in case
# variants[x] = set of all case variants of x in types
variants = {}
for w in types:
    variants[w.lower()] = variants.get(w.lower(), set()) | {w}
# types that occur in multiple case variants
multi_case = {k: v for k, v in variants.items() if len(v) > 1}
# number of types that are case variants of other types
num_case_only = sum(len(v) - 1 for v in multi_case.values())
# this had better be true
assert num_case_only == len(types) - len(normalized_types)

That's it. Basic text manipulations done without having to leave one of the small number of programming environments I spend a lot of time in.

10 March 2012

Stanford NLP Class

These are my notes on Stanford's online NLP class.

The first few lectures said that a lot of the hard work in NLP, notably in tokenizers, is done with regular expressions. This was not entirely surprising as a good fraction of the string processing I have done in my professional career has been done with regular expressions.

Programming exercises can be done in Python or Java. I chose Python as I have found it well suited to simple string manipulation programs in the past.

The first programming exercise is to extract phone numbers and email addresses from web pages. A training set of Stanford computer science faculty home pages was supplied along with some starter code to show the required formatting. The starter code helpfully computed lists of true positives, false positives and false negatives.

My experience with problems like these is to

1. get the test samples to pass, by
    loosening matches and adding more detection to detect all the addresses and phone numbers
    tightening matches to avoid false positives
2. while taking care to make decisions that are likely to generalize well to as yet unseen samples

Step 2 takes some judgement as it is not clear what will generalize well. e.g. in the samples " DOT " was used to mask "." in email addresses. It seemed wise to match on all cases of " DOT ", but then I found that one faculty member used " DOM " as the alias for ".". The question was then whether to generalize from " DOT " and " DOM " to " DO<any character> " or to treat " DOM " as a one-off as it had been observed only once.
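
To make the trade-off concrete, here is a small sketch of a loosened email matcher. The pattern is hypothetical, not the one from my submission or the starter code: it accepts "@" or the word "at", and "." or the word "dot", in any case, and only knows about three top-level domains.

```python
import re

# Hypothetical pattern: "@" or " at ", and "." or " dot ", case-insensitive.
_AT = r'(?:\s*@\s*|\s+at\s+)'
_DOT = r'(?:\.|\s+dot\s+)'
_RE_EMAIL = re.compile(
    r'\b([\w.]+)' + _AT + r'(\w+(?:' + _DOT + r'\w+)*)' + _DOT + r'(edu|com|org)\b',
    re.IGNORECASE)

def find_emails(text):
    """Return normalized addresses such as 'user@cs.stanford.edu' found in text."""
    emails = []
    for user, domain, tld in _RE_EMAIL.findall(text):
        # Turn any " dot " inside the domain back into a real dot.
        domain = re.sub(r'\s+dot\s+', '.', domain, flags=re.IGNORECASE)
        emails.append('%s@%s.%s' % (user, domain, tld.lower()))
    return emails
```

Loosening has a cost: this pattern finds "manning AT cs DOT stanford DOT edu", but it also turns the ordinary phrase "look at stanford.edu" into an address, which is exactly the kind of false positive that then needs tightening.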

15 January 2011

My Web Browsing Turned into a Newspaper

My life is an open book.
What the machine learning people I follow on Twitter are reading.
And a custom paper.

18 October 2010

The Effect of Test Set Selection on Classification Accuracy

I was looking at some prediction results for the UCI Michalski and Chilausky soybean data set and wondered how they depended on test set selection. Some had classification accuracy as high as 93.1% on a 25% training set and 97.1% on 290 training and 340 test instances.

A few weeks ago I had been asked to find the best classifier for the soybean data set based on prediction accuracy on a test set of 20% of the data. The remaining 80% could be used for training. That gave 306!/(245!x61!) = 1.3 x 10^65 possible splits of the 306 data points into training and test sets. Could some of these splits lead to better results than others for the classifiers I was about to use?
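
The number of splits is a binomial coefficient and is easy to check with Python's standard library (the variable names here are only for illustration):

```python
import math

n_total, n_test = 306, 61
n_train = n_total - n_test              # 245 training instances

# 306! / (245! * 61!) = number of ways to choose the test set
n_splits = math.comb(n_total, n_test)

print('%.2e' % n_splits)
print(len(str(n_splits)), 'digits')
```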

The WEKA data mining package was used for classification. WEKA has many classifiers that can be run on a data set and have their performance compared.

WEKA also has a programming interface so I used it to write some Jython tools to explore the performance of a range of classifiers.

One of these tools was run on the soybean data to find the training/test splits with best and worst classification accuracy. The results were

Classifier	Best Accuracy	Worst Accuracy
Naive Bayes	100%	70.5%
Bayes Net	100%	75.4%
J48 (C4.5)	95%	69%
JRip (RIPPER)	98.4%	70.5%
KStar	96.7%	65.6%
Random Forest	95%	62.3%
SMO (support vector machine)	96.7%	82%
MLP (neural network)	100%	77%

Fig 1. Best and worst accuracies for selected WEKA classifiers run on different training/test splits

That was quite a range of test set accuracies for different training/test splits. My simple genetic algorithm may not have found the extremes of the distributions so the actual range may have been wider.
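
The idea of searching over splits can be sketched in plain Python. This is a simplified stand-in, not the Jython/WEKA tool: it uses random search instead of a genetic algorithm, and the function names and the toy majority-class "classifier" are assumptions made so the example is self-contained.

```python
import random

def split_accuracy_range(data, evaluate, test_fraction=0.2, n_trials=200, seed=0):
    """Random search over train/test splits.

    evaluate(train, test) must return an accuracy in [0, 1].
    Returns (best_accuracy, worst_accuracy) over the splits tried.
    """
    rng = random.Random(seed)
    n_test = round(len(data) * test_fraction)
    best, worst = 0.0, 1.0
    for _ in range(n_trials):
        shuffled = rng.sample(data, len(data))
        test, train = shuffled[:n_test], shuffled[n_test:]
        acc = evaluate(train, test)
        best, worst = max(best, acc), min(worst, acc)
    return best, worst

def majority_class_accuracy(train, test):
    """Toy stand-in for a real classifier: always predict the most common training label."""
    labels = [label for _, label in train]
    guess = max(set(labels), key=labels.count)
    return sum(label == guess for _, label in test) / len(test)

# Example: 30 instances, one third of them in the positive class.
data = [(i, i % 3 == 0) for i in range(30)]
best, worst = split_accuracy_range(data, majority_class_accuracy)
```

Even with a classifier this crude the best and worst splits can differ a lot, simply because a small test set can happen to contain mostly easy or mostly hard instances.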

When I ran the test set selection script a second time (Fig 2) it also found a 100% SMO accuracy. The second test was set up to find a single training/test set split that gave best results for all classifiers at once. It also had a slightly different pre-processing. The 4 duplicate instances were removed and the troublesome single 2-4-5-t sample was left in. Therefore I expected it to give worse results than the pre-processing used for the results in Fig 1.

Classifier	Correct (out of 60)	Percent Correct
Naive Bayes	57	95 %
Bayes Net	59	98.3 %
J48	58	96.7 %
JRip	60	100 %
KStar	60	100 %
Random Forest	59	98.3 %
SMO	60	100 %
MLP	60	100 %

Fig 2. Best accuracies for selected WEKA classifiers all run on the same training/test split

Both the above results were for the default settings of each of the WEKA classifiers. The WEKA classifiers all have parameters that can be tuned and it is possible to select subsets of attributes so they can give better and much worse results than the defaults. However the default parameters are usually close to the best so they may be good indicators of the best possible accuracies.

It appears that the training/test split of a data set can change classification accuracy by more than 30 percentage points. This was observed on a well-known and widely used classification data set.

07 September 2010

Watching Percipo

13 August 2010

Wrote a Blog Post

I have not posted here recently but I wrote a blog post for PaperCut last week.

20 March 2010

Blogger supports logical symbols

e.g. ¬(A ∨ B) ⇒ ¬A ∧ ¬B

All symbols: ¬, ∧, ∨, ⇒, ⇔

04 January 2010

C++ Continues to Surprise

Someone was asking questions about const_cast<>() a few days ago. I was not quite sure how it would work because I try to use as little of the C++ language as possible and it is possible to get by in C++ without const_cast<>(). To find out exactly how it worked I tried it out with a test case. The following code gave the same output on g++ on Vista and OS X.

#include <iostream>

int main() {
    int i = 3;
    const int* ptr = &i;
    *const_cast<int*>(ptr) = 11;  // write through the const pointer
    if (&i == ptr && i != *ptr) {
        std::cout << "Cannot happen: &i=" << &i << " == ptr=" << ptr
                  << " but i=" << i << " != *ptr=" << *ptr << std::endl;
    }
    return 0;
}

The output in both cases was Cannot happen: &i=0x22fe6c == ptr=0x22fe6c but i=3 != *ptr=22

How can a single memory address hold two different values?

The disassembly was

    push %ebp
    mov %esp,%ebp
    sub $0x18,%esp

int i = 3;
    movl $0x3,0xfffffffc(%ebp)    (i in bp-4)

const int* ptr = &i;
    lea 0xfffffffc(%ebp),%eax     (&i in eax)
    mov %eax,0xfffffff8(%ebp)     (ptr in bp-8)

*const_cast<int*>(ptr) = 11;
    mov 0xfffffff8(%ebp),%eax     (ptr in eax)
    movl $0xb,(%eax)              (*ptr set to 11)

if (&i == ptr && i != *ptr)
    lea 0xfffffffc(%ebp),%eax
    cmp 0xfffffff8(%ebp),%eax
    jne 0x403214
    mov 0xfffffff8(%ebp),%eax
    mov (%eax),%eax
    cmp 0xfffffffc(%ebp),%eax
    je 0x403214

The disassembly matches the C++ code. i is stored at bp-4 and ptr is stored at bp-8 so the C++ code should work. The observed behaviour does not match the disassembly.

This cannot be right. I guess I found a bug in g++.

27 October 2009

My Time in Sweden

I lived in Sweden from 1988 to 1991. Here is a map which shows where I lived in Stortorget in Gamla Stan in Stockholm.


I lived in the red building in the two left photos below which are taken from Stortorget. The photo on the right is of the same building taken from Kåkbrinken, the alley to the left of the red building.


The photo on the left below is the main street in Gamla Stan, the photo in the centre is of the Grand Hotel as seen from the shore of Gamla Stan, and the photo on the right is Karolinska Hospital where I worked.


After I left Stockholm I moved to Umeå which is shown on the left below. When I lived there I used to visit Vaasa in Finland shown on the right.

When I lived in Sweden I took vacations in Norway including Lofoten on the left and Tromsø in the centre of the row of photos below. I also took the Hurtigruten.

Photo Credits

13 October 2009

Machine Learning While I Work

I am setting up Postfix so I have spare time as I try things out. This post is about the things I am reading or watching in the background.

Taskforce on Context-Aware Computing
I went to a lecture called Open Mobile Miner (OMM): A System for Real Time Mobile Data Analysis. There is a video here, a description of OMM here and lecture slides here (pdf).

Shonali Krishnaswamy's group are making software that does some analysis of data on a smart phone before uploading it, thereby reducing the phone's power consumption by reducing communications. Their examples include ECG output, traffic congestion metrics and taxi location data. The data in their examples is scalar and sampled at 0.5 Hz or less so it is hard to see why a simple store-and-forward scheme would not achieve much the same thing. I guess I need to read their publications more deeply.

Statistical Learning as the Ultimate Agile Development Tool by Peter Norvig is an overview of modern practical machine learning. The summary is: focus on the data, not the code.

Learning Theory by Mark Reid was an introduction to some theoretical aspects of machine learning presented in a summer school in Canberra in January 2009.

Now some videos of how machine learning can be applied to models of the face.

Changes of facial features on the dominance, trustworthiness and competence dimensions in a computer model developed by Oosterhof & Todorov (2008).

Now it is time to start watching a video on distributed computing.

Swarm: Distributed Computation in the Cloud from Ian Clarke on Vimeo.

18 September 2009

My First Upside Down Post

˙ʇǝsdıɥɔ pǝsɐq ɯɹɐ s’ǝןɐɔsǝǝɹɟ 'uoƃɐɹpdɐus ɯɯoɔןɐnb 'ɯoʇɐ ןǝʇuı ˙sɹɐʍ ɹossǝɔoɹd ˙sʞooqʇǝu puɐ sǝuoɥd ʇɹɐɯs ɟo ǝɔuǝƃɹǝʌuoɔ ǝןqıssod ˙ǝƃuɐɥɔ ǝʌıɹp ןןıʍ ǝɔuǝɹǝɟɟıp ǝɔıɹd 000'1$ ˙sʌ 052$ ǝɥʇ ˙pǝʇıns-ןןǝʍ os ʇou sǝop pɹoʍ ʇɟosoɹɔıɯ ǝןıɥʍ pnoןɔ ǝɥʇ ɯoɹɟ ןןǝʍ sʞɹoʍ ʎpɐǝɹןɐ ǝɹɐʍʇɟos ɹǝɥʇo puɐ uozɐɯɐ 'ɯoɔ˙ǝɔɹoɟsǝןɐs 'ǝןƃooƃ ˙ǝɹɐʍʇɟos ʇɟosoɹɔıɯ ǝɥʇ doʇdɐן ǝɔɐןdǝɹ ʎɐɯ pnoןɔ + ʞooqʇǝu os ƃuıʇndɯoɔ pnoןɔ oʇ pǝʇıns-ןןǝʍ ǝɹɐ sʞooqʇǝu ˙ʇɟosoɹɔıɯ puɐ ןǝʇuı oʇ ʇsoɔ ʇɐǝɹƃ ʇɐ sǝןɐs doʇdɐן 0001$sn ǝzıןɐqıuuɐɔ ʎɐɯ sʞooqʇǝu 052$sn ɯɹǝʇ ɹǝƃuoן ǝɥʇ uı ʇnq ǝıd ɔd ǝɥʇ ƃuıʍoɹƃ ǝɹɐ sʞooqʇǝu ʎןʇuǝɹɹnɔ ¿uıɐɥɔ ǝnןɐʌ doʇdɐן puɐ ɔd ǝɥʇ oʇ op sʞooqʇǝu ןןıʍ ʇɐɥʍ ˙sǝןɐs ƃuıʇsıxǝ ɟo uoıʇɐzıןɐqıuuɐɔ pnoןɔ ǝɥʇ oʇ ʇuǝɯǝʌoɯ ɟo sǝɔuǝnbǝsuoɔ ǝɯos sɹoʇıʇǝdɯoɔ ɹıǝɥʇ ɹǝʌo ǝƃɐʇuɐʌpɐ ƃuıɔıɹd ɐ ǝʌǝıɥɔɐ ןןıʍ sıɥʇ ǝpıʌoɹd uɐɔ oɥʍ sɹǝıɹɹɐɔ uoıʇɐɔıunɯɯoɔǝןǝʇ ǝɥʇ ˙suoıʇɔǝuuoɔ ʞɹoʍʇǝu ǝןqɐʇɹod puɐ ǝןqɐıןǝɹ 'ʇsɐɟ sǝɹınbǝɹ ƃuıʇndɯoɔ pnoןɔ ʎɥdɹnɯ ˙ɹɯ sʎɐs „'ɹǝʇʇǝq ʇnq ƃuıɥʇou uǝǝq s,ʇı 'dɯnɥ ƃuıuɹɐǝן ǝɥʇ ɹǝʌo ʇoƃ ǝʍ ǝɔuo ˙sǝƃɐʇuɐʌpɐsıp ןɐǝɹ ʎuɐ ɥʇıʍ dn ƃuıɯoɔ pǝssǝɹd-pɹɐɥ ǝq p,ı ˙ǝɹoɯ ʎuɐ suoıʇɐןןɐʇsuı ǝɹɐʍpɹɐɥ ןɐɔısʎɥd ɹoɟ ƃuıʇıɐʍ sʎɐp puɐ sɹnoɥ puǝds ʇ,uop ǝʍ„ ˙sɹǝʇndɯoɔ uʍo sʇı ɟo ǝuou ɥʇıʍ ʎuɐdɯoɔ ʇuǝɯdoןǝʌǝp qǝʍ ɐ ǝɯoɔǝq sɐɥ ʇı - ǝɹɐʍpɹɐɥ ןɐɔısʎɥd ɹoɟ ʇno ƃuıʞɹoɟ pǝddoʇs sɐɥ ʇı suɐǝɯ ɥɔıɥʍ 'ǝɔıʌɹǝs ןɐnʇɹıʌ ǝɥʇ ɹoɟ ɹnoɥ ɹǝd sʇuǝɔ 08 oʇ 01 sʎɐd ǝƃuɐɹo ʎɔınɾ ˙ǝɯoɔǝq sɐɥ ƃuıʇndɯoɔ pnoןɔ ɯɐǝɹʇsuıɐɯ ʍoɥ sǝʇɐɹʇsnןןı (9002 qǝɟ 02 ǝƃɐ ǝɥʇ) ɯɐǝɹʇsuıɐɯ ǝɥʇ spuǝɔsɐ ƃuıʇndɯoɔ pnoןɔ ”˙ƃuıʍoɹƃ ןןıʇs sı ןɐɹǝuǝƃ uı ƃ3“ ˙sʇdǝɔuoɔ pɹɐʍɹoɟ ʇɐ ןɐdıɔuıɹd 'ssnɐɹʇs ןןıʍ pıɐs ”'sʇods ʇɥƃıɹq ǝɹɐ ǝsǝɥʇ“˙˙˙ ˙pǝʌɹǝsqo ǝʌɐɥ sʇsʎןɐuɐ 'ǝɔuɐɥɔ ƃuıʇɥƃıɟ ɐ ǝʌɐɥ sɯǝpoɯ ƃ3 puɐ sdıɥɔ ɥʇooʇǝnןq puɐ ıɟ-ıʍ 'sdƃ ǝʞɐɯ oɥʍ sɹopuǝʌ puɐ — ɥʇʍoɹƃ ʇɐɥʇ ɟo ɥʇƃuǝɹʇs ǝɥʇ uo ʎɹɐʌ sʇsʎןɐuɐ — ɹɐǝʎ sıɥʇ ʍoɹƃ oʇ pǝʇɔǝɾoɹd ǝɹɐ sǝןɐs ǝuoɥdʇɹɐɯs ˙pɐǝɥɐ ɹɐǝʎ ǝɥʇ uı sʇods ʇɥƃıɹq ƃuıʞǝǝs puɐ sɥʇƃuǝɹʇs ǝɹoɔ oʇ ƃuıʞooן ǝɹɐ sɹopuǝʌ dıɥɔ ʇsoɯ 'ǝןıɥʍuɐǝɯ˙˙˙ ˙sʇǝʞɹɐɯ ʍǝu uǝdo oʇ ʎɐןd ǝuoɥd ʇɹɐɯs ɐ ƃuıɹɐdǝɹd ǝq oʇ pǝɹoɯnɹ sı ˙ɔuı ןןǝp ɹǝʞɐɯ ɔd ˙ǝuoɥdı s’˙ɔuı ǝןddɐ ʎq pǝʌɹǝs ʇǝʞɹɐɯ ǝɥʇ 
ɟo ǝɯos qɐɹƃ oʇ ƃuıdoɥ 'ʍoןs sǝʇɐɹ sǝןɐs ɔd sɐ spıɯ pǝʇǝƃɹɐʇ sɐɥ — ɹǝʞɐɯ dıɥɔ ʇsǝƃɹɐן s’pןɹoʍ ǝɥʇ — ˙dɹoɔ ןǝʇuı sɐ 'uoıʇıʇǝdɯoɔ ǝʌɐɥ ןןıʍ ɯɯoɔןɐnb˙˙˙spןǝɥpuɐɥ puɐ sdoʇdɐן uǝǝʍʇǝq dɐƃ ǝɔıɹd ǝɥʇ ǝƃpıɹq sʞooqʇǝu sɐ ǝƃɹns oʇ pǝʇɔǝɾoɹd ʎɹoƃǝʇɐɔ ɹǝʇʇɐן ǝɥʇ ɥʇıʍ 'sǝɔıʌǝp ʇǝuɹǝʇuı ǝןıqoɯ puɐ sʞooqʇǝu 'sʞooqǝʇou ɹoɟ ʇǝsdıɥɔ uoƃɐɹpdɐus sʇı uo sısɐɥdɯǝ pǝɔɐןd sɐɥ ɯɯoɔןɐnb 'ǝןıɥʍuɐǝɯ sdıɥɔ :ʇsɐɔǝɹoɟ ssǝןǝɹıʍ 9002 ssǝןǝɹıʍ ɹɔɹ sǝuıɥɔɐɯ dx sʍopuıʍ ɹo nʇunqn ʎןןɐnsn ǝɹɐ ʇɐɥʍ ƃuoɯɐ ǝɥɔıu ɐ puıɟ ןןıʍ ɯɹoɟʇɐןd ǝןıqoɯ ǝɔɹnos-uǝdo s’ǝןƃooƃ ʇɐɥʇ ƃuıʇʇǝq sı - ʎɐpoʇ ʇǝʞɹɐɯ ǝɥʇ uo sʞooqʇǝu ɟo ʎʇıɹoɾɐɯ ʇsɐʌ ǝɥʇ uı punoɟ ɹossǝɔoɹd 072u ɯoʇɐ ןǝʇuı ǝɥʇ ɹoɟ ǝןqısuodsǝɹ - ɹǝʞɐɯdıɥɔ ǝɥʇ ʇɐɥʇ sʇsǝƃƃns oɥʍ '”ǝɔɹnos ǝןqɐıןǝɹ“ s’ʇɐǝqǝɹnʇuǝʌ oʇ ƃuıpɹoɔɔɐ s’ʇɐɥʇ ˙ǝɹɐʍpɹɐɥ ʇǝsdıɥɔ ǝןqɐʇıns ɥʇıʍ sɹǝɹnʇɔɐɟnuɐɯ ʇɹoddns oʇ ƃuıɹɐdǝɹd sı puɐ '0102 ʇnoɥƃnoɹɥʇ puɐ 9002 ǝʇɐן uı sʞooqʇǝu pǝsɐq-pıoɹpuɐ ɟo ʎɹɹnןɟ ɐ ƃuıʇɔǝdxǝ sı ןǝʇuı sʞooqʇǝu pǝsɐq-pıoɹpuɐ ɟo ǝsıɹ ɹoɟ ƃuıʎpɐǝɹ ןǝʇuı ʞooqʇǝu pǝsɐq-pıoɹpuɐ ɟo ǝsıɹ ɹoɟ ƃuıʎpɐǝɹ ןǝʇuı ʎoʎ %6˙12– puɐ bob %7˙12– pǝuıןɔǝp sʇuǝɯdıɥs ʇıun ɹossǝɔoɹd ɔd ǝpıʍpןɹoʍ 'ɯoʇɐ ʇnoɥʇıʍ ˙ǝuıןɔǝp ɔıʇɐɯɐɹp pıoʌɐ ʇǝʞɹɐɯ ǝɥʇ dןǝɥ oʇ ɥƃnouǝ ʇou ʇnq ǝɔuɐɯɹoɟɹǝd ʇǝʞɹɐɯ ןןɐɹǝʌo ǝɥʇ uı ǝɔuǝɹǝɟɟıp ǝןqɐʇou ɐ ǝʞɐɯ oʇ pǝnuıʇuoɔ (,,sʞooqʇǝu,, sןןɐɔ ןǝʇuı ɥɔıɥʍ) sɔd ʞooqǝʇou-ıuıɯ ɹoɟ ɹossǝɔoɹd ɯoʇɐ s,ןǝʇuı ˙˙˙ ؛(ʎoʎ) ɹɐǝʎ ɹǝʌo ɹɐǝʎ %4˙11– puɐ (bob) ɹǝʇɹɐnb ɹǝʌo ɹǝʇɹɐnb %0˙71– pǝuıןɔǝp sʇuǝɯdıɥs ʇıun ɹossǝɔoɹd ɔd ǝpıʍpןɹoʍ '80b4 uı 80b4 uı sʞooqʇǝu oʇ sdoʇdɐן sʍopuıʍ ɯoɹɟ ǝʌoɯ sıɥʇ ʇɹoddns oʇ sǝıɹoʇs ǝɯos ǝɹɐ ǝɹǝɥ ˙ʎʇıןıqoɯ puɐ ʎɹʇsǝɔuɐ ǝuoɥd ɹıǝɥʇ ɯoɹɟ sǝɯoɔ ʇɐɥʇ uoıʇdɯnsuoɔ ɹǝʍod ʍoן ɟo sǝƃɐʇuɐʌpɐ ןɐuoıʇıppɐ ǝɥʇ ǝʌɐɥ ʎǝɥʇ ˙sƃuıɹǝɟɟo pnoןɔ ɹǝɥʇo puɐ uoıʇɐzıןɐnʇɹıʌ 'sɐɐs 'sǝɔıʌɹǝs qǝʍ ɹoɟ sʇuǝıןɔ ǝʇɐnbǝpɐ ǝʞɐɯ ǝuoɥd ʇɹɐɯs ǝןqɐdɐɔ ʎɹǝʌ puɐ sʞooqʇǝu ƃ3 ˙sǝıƃǝʇɐɹʇs ƃuıʇndɯoɔ doʇʞsǝp pǝnsɹnd ʇou ǝʌɐɥ puɐ sǝɔıʌɹǝs ɟo sǝdʎʇ ǝsǝɥʇ uo ʎןǝɹıʇuǝ sǝssǝuısnq ɹıǝɥʇ pǝsɐq ǝʌɐɥ oɥʍ ɯoɔ˙ǝɔɹoɟsǝןɐs puɐ ǝɹɐʍɯʌ 'ǝןƃooƃ sɐ ɥɔns sǝıuɐdɯoɔ ʎq uǝʌıɹp uǝǝq sɐɥ ʇuǝɯdoןǝʌǝp ɹıǝɥʇ ˙sɹɐǝʎ 01 ʇsɐן ǝɥʇ uı ǝʌıʇɔǝɟɟǝ ʎןɥƃıɥ ǝɯoɔǝq 
ǝʌɐɥ ǝsǝɥʇ ɟo ʇsoɯ ˙ suoıʇɐɔıןddɐ qǝʍ puɐ sɐɐs 'uoıʇɐzıןɐnʇɹıʌ ƃuıpnןɔuı sǝɔıʌɹǝs pnoןɔ ɟo sǝdʎʇ ʎuɐɯ ǝɹɐ ǝɹǝɥʇ ˙uoıʇɐzıuɐƃɹo uɐ ɥƃnoɹɥʇ pǝʇɐɔıןdǝɹ ǝq oʇ pǝǝu ʇou sǝop ǝƃɐɹoʇs ʞsıp ʎʇıןıqɐıןǝɹ-ɥƃıɥ sɐ ɥɔns ǝɹnʇɔnɹʇsɐɹɟuı ǝʌısuǝdxǝ ʇɐɥʇ ʇıɟǝuǝq ןɐuoıʇıppɐ ǝɥʇ sı ǝɹǝɥʇ ˙ʇuǝıɔıɟɟǝ ǝɹoɯ ɥɔnɯ sı ɹǝʌɹǝs ןɐɹʇuǝɔ ɐ uo ǝɹɐʍʇɟos ǝɥʇ ƃuıuunɹ ǝɹoɟǝɹǝɥʇ ˙%01 uɐɥʇ ssǝן ɥɔnɯ 'ʍoן ʎɹǝʌ sı ǝɹɐʍʇɟos sıɥʇ ɟo ǝƃɐsn ǝɔɹnosǝɹ ɹǝʇndɯoɔ ǝƃɐɹǝʌɐ ǝɥʇ ˙ǝʌısuodsǝɹ ǝq oʇ ǝɔɐɟɹǝʇuı ɹǝsn ǝɥʇ ʇuɐʍ noʎ puɐ suoıʇɐʇndɯoɔ ǝsuǝʇuı ǝɥʇ sǝop ʇı uǝɥʍ ʎɐp ɹǝd sǝʇnuıɯ ʍǝɟ ǝɥʇ 'ǝƃɐsn ʞɐǝd ʇɹoddns oʇ ƃuıʎɐd ǝɹɐ noʎ 'ʇı ʇɹoddns oʇ ǝɹɐʍpɹɐɥ ɔd ǝʌısuǝdxǝ ʎnq noʎ uǝɥʍ ˙ǝɯıʇ ǝɥʇ ɟo ʇsoɯ ƃuıɥʇou sǝop ʇı ˙sǝןɔʎɔ ʎʇnp ʍoן ʎɹǝʌ sɐɥ 'sɔıɥdɐɹƃ ʎʇıןɐnb ɥƃıɥ ǝʞıן ǝɹɐʍʇɟos ǝʌısuǝʇuı ʎןןɐuoıʇɐʇndɯoɔ uǝʌǝ 'ǝɹɐʍʇɟos ʇuǝıןɔ ʇsoɯ ˙sɹǝʇndɯoɔ ןɐuosɹǝd s,ǝןdoǝd ʎuɐɯ uo ʇı ƃuıop uɐɥʇ ɹǝısɐǝ sı uoıʇɐɔoן ןɐɔısʎɥd ǝuo uı ƃuıuunɹ ǝɹɐʍʇɟos ƃuıpɐɹƃdn puɐ ƃuıuıɐʇuıɐɯ ǝsnɐɔǝq uoıʇɐzıןɐɹʇuǝɔ ɥʇıʍ ʎןןɐɔıʇɐɯɐɹp sǝsɐǝɹɔǝp (oɔʇ) dıɥsɹǝuʍo ɟo ʇsoɔ ןɐʇoʇ ˙sʇuǝıןɔ ɹǝןןɐɯs ǝʌɐɥ puɐ ƃuıʇndɯoɔ ǝzıןɐɹʇuǝɔ-ǝɹ oʇ sı ʇı ǝsıɹdɹǝʇuǝ uı puǝɹʇ ɹoɾɐɯ ʇuǝɹɹnɔ ɐ

14 September 2009

What I need from a 3G Netbook

I could use one now for

working on the train
working at cafes while waiting for the kids
working in the country while visiting friends and family

My PC and laptop seem like overkill for researching on the web, emailing, writing reports and building a few models in a spreadsheet. They also use a lot of electricity and take up space.

To be an effective replacement for the PC and laptop, a netbook would need to

Be reasonably priced. $200 would be nice.
Have a reasonably priced connection plan.
Have access to cheap or free software

Web browser
Word processor
Drawing tools
Spreadsheet

Reliable connection with good coverage.
Good battery life. 6 hours would be nice.

Basic set of applications

Gmail with tasks
Google Calendar
Google Docs
Git for source code management
YUML
Ubuntu or Windows XP with cygwin
GNU tools
VNC or WRD
ssh

That would get me started. It would be nice to have Eclipse and a local word processor but running these over a remote shell would be more than adequate. I worked that way with all my heavy tools on VMware instances for years and it worked well. The VMware instances were hosted in a data center and backed up regularly.

10 September 2009

Electronic Medical Records Bonanza?

Big Bucks in Health IT!, quoting from http://www.healthcareitnews.com/news/global-market-hospital-it-systems-pegged-35b-2015 , says

SAN JOSE, CA – The global hospital information systems market will climb past $35 billion by 2015, according to a new forecast by Global Industry Analysts. The United States represents the largest market in the world. The U.S. hospital information system market is experiencing an increase in acceptance of customized technology such as laboratory information systems and radiology information systems, the report notes. The market is also a promising ground for electronic medical record systems.

The Asia-Pacific region (excluding Japan) represents the fastest growing hospital information systems market, exhibiting a compounded annual growth rate of 11.5 percent over the next few years, according to analysts. Despite being a smaller market in terms of revenue, the Asia-Pacific promises excellent growth opportunities for hospital information systems, they said.

The global vendors profiled in the report include McKesson , Cerner , Allscripts-Misys Healthcare Solutions, Eclipsys, Computer Programs and Systems, Siemens Medical Solutions USA, QuadraMed, Medical Information Technology, Healthland, GE Healthcare, iSOFT Group, Agfa-Gevaert, Brunie-Software, IBA Health and Integrated Medical Systems.

The full release is here: Global Hospital Information Systems Market to Cross $35 Billion by 2015, According to New Report by Global Industry Analysts, Inc. Increasing awareness among medical service patrons on the benefits of using Information Technology in the healthcare sector, coupled with growing demand for affordable-yet quality healthcare services is forcing hospitals and other medical centers to adopt IT in their daily operations. Subsequently, Healthcare IT systems such as the Hospital Information Systems witnessed a great demand in the healthcare services sector. Adoption of HIS in hospitals is increasingly being encouraged and promoted by the Governments world over. http://www.prweb.com/releases/2009/02/prweb2021984.htm

09 September 2009

bit.ly custom URLs

Where does http://bit.ly/peterwilliams direct to?

Is it the same web page as http://linkd.in/PeterWilliams ?

16 August 2009

Open Government Made Simple

There has been a lot of talk about Open Government recently, including this from Peter Williams

"Australian governments should adopt international standards of open publishing as far as possible. Material released for public information by Australian governments should be released under a creative commons licence." or in simple terms "make public data open and free".

That is useful, clear and straightforward.

So how to do it?

My experience in organising company data over the last 10 years is that the teams I have worked in have tried many content management systems (CMSs) and none of them were satisfying to use (though some were interesting to implement.) Inevitably the document taxonomies that made sense to the site administrators did not work for most of the users and the users soon gave up trying to find things through the CMS.

Then one day someone in the company I was working at purchased a Google Search Appliance (GSA) and indexed most of our intranet with it. After that everybody could find all the documents they knew existed on the intranet and discovered useful ones they did not know existed.

To be fair, things were not quite that simple. Most companies need reliable storage, decent version tracking, access control and many other things that CMSs provide. However people need to be able to find documents much more than they need these other things. Very few people need version tracked, access controlled documents that they cannot find in the first place.

So why don't governments just make their data visible to internet search engines and store it somewhere secure with some simple versioning system now, and then do the fancy stuff later? Why are they investing in CMSs like Sharepoint?

The reason we did not do this in the companies I worked in was that many of the features in the CMSs we used were useful and the people who implemented the systems decided they needed all these features. Finding documents was just one of several check-boxes on their requirements documents. They were acting as implementers and experts, not users. The systems they ended up with made perfect sense to everyone except the users.

The interesting thing about this was that the implementers were users in most cases. They were aware of the limitations of CMSs but they had to follow either the direction of their users, who had not used CMSs enough to understand how badly they would work in practice, or the direction of their managers and key stakeholders, who had heard that CMSs were good. The person who got the GSA was an IT guy who just went out and tried it without surveying users or bringing in CMS vendors to talk to his key stakeholders.

For a different perspective on Open Government, read some Tim O'Reilly.

16 July 2009

When Words Fail

A while back I worked at a company that made software+hardware products in a maturing market. The company found it needed to deliver higher quality products with more features and was struggling to do so from an old codebase. It had become clear to the management team that late-stage serious defects were the major cause of schedule/quality issues but they had not been able to fix this problem.

The codebase management team had a lot of ideas about what the causes were and how to fix them. They had discussed "technical debt", "silo-ing" and other causes. However in the end they settled on two key priorities: taking extreme care with code changes and sticking with established QA processes to minimise the number of introduced bugs.

Eventually the project was given to me to manage. One of the (many) things the development team had done well was to document each bug and cross-reference bug fixes against the source code. I analysed about 100 recently fixed serious software bugs, looked up their fixes in the SCM and then looked up the date at which the code changes causing the bug were checked in. This showed that most of the bugs being found had been introduced months before they were discovered. It was clear that the late-stage defects were dominated by latent bugs being unmasked by changes, not by bugs introduced by changes.
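The latency analysis described above can be sketched in a few lines of Python. This is only an illustration, not the tooling that was actually used: the bug records, the 30-day cut-off for calling a bug "latent" and all of the names are hypothetical, and in practice the two dates would come from the bug tracker and the SCM history.

```python
from datetime import date
from statistics import median

# Hypothetical records: (bug id, date the faulty change was checked in,
# date the bug was discovered). Real data would come from the bug
# tracker cross-referenced against the SCM history.
BUGS = [
    ("BUG-1", date(2008, 1, 10), date(2008, 6, 2)),
    ("BUG-2", date(2008, 5, 28), date(2008, 6, 3)),
    ("BUG-3", date(2007, 11, 5), date(2008, 6, 9)),
    ("BUG-4", date(2008, 2, 20), date(2008, 6, 12)),
]

# Assumed cut-off: a bug found more than this many days after the
# change that introduced it is counted as latent.
LATENT_THRESHOLD_DAYS = 30


def latency_days(introduced, discovered):
    """Days between the faulty check-in and the bug's discovery."""
    return (discovered - introduced).days


def summarise(bugs, threshold=LATENT_THRESHOLD_DAYS):
    """Count how many bugs were latent and report the median latency."""
    latencies = [latency_days(intro, found) for _, intro, found in bugs]
    latent = sum(1 for days in latencies if days > threshold)
    return {
        "count": len(latencies),
        "latent": latent,
        "latent_fraction": latent / len(latencies),
        "median_days": median(latencies),
    }


if __name__ == "__main__":
    print(summarise(BUGS))
```

If most of the bugs fall above the threshold, as they did in the analysis described here, the defects are dominated by old bugs being unmasked rather than by recently introduced ones.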

Some changes to the development process were needed. The development group was responsible for creating code without introducing bugs and the QA group was responsible for finding the bugs the development team missed. However the QA process was unsuited to discovering latent bugs fast because it had a long cycle based on testing user scenarios. Therefore I got small teams of developers and QAs to work closely together to find, fix and verify bugs, and I took some developers away from other work to develop a system to find and fix (and eventually prevent the introduction of more) latent bugs. This work is described here. With these changes in place, code stability improved rapidly and late-stage serious bugs essentially ceased to be found.

That was a fairly straightforward technical solution to a fairly straightforward technical problem. So why had the very capable management team who had known the underlying causes (technical debt and silo-ing) not been able to fix the problem for so long?

Change is known to be difficult in organisations and there is an industry built around dealing with this. However our immediate problem was not an inability to persuade people to change. In fact consultation and review had been distracting people from doing the experimentation required to find what the underlying causes of the problem were and how to fix them. The more people talked about the problem the further they got from the solution (hence this post's title).

The situation reminded me of Uncle Bob Martin's Agile Smagile

As I said before, going meta is a good thing. However, going meta requires experimental evidence. Unfortunately the industry has latched on to the word "Agile" and has begun to use it as a prefix that means "good". This is very unfortunate, and discerning software professionals should be very wary of any new concept that bears the "agile" prefix. The concept has been taken meta, but there is no experimental evidence that demonstrates that "agile", by itself, is good.

The deeply ingrained practices in the organisation I worked in had grown out of ideas that had worked well in the past. They had been good enough to cover a wide range of development scenarios for a long while and were clearly based on experimental evidence from past development. However somewhere along the way people had stopped experimenting and modifying the rules, and started just following the rules. This is what Uncle Bob called "going meta". The problem for our organisation was that the set of rules it had arrived at when it stopped experimenting was not universally true. The rules were only true for the type of development the organisation was doing when it stopped changing them.

The changes I made to detect and fix latent bugs (high-coverage automated system testing, static analysis with Klocwork and refactoring with unit tests) were adopted across the development organisation and became part of the standard development process, at least for the time I was there. That was good but I wondered if those practices would become a fixed part of the new development process because they had worked some time in the past. And I wondered whether they would prevent the company from addressing problems that arose in the future, just as the practices that had worked well in the past had come to do.