I was looking at some published prediction results for the UCI Michalski and Chilausky soybean data set and wondered how much they depended on test set selection. Some reported classification accuracy as high as 93.1% with a 25% training set, and 97.1% with 290 training and 340 test instances.
A few weeks ago I had been asked to find the best classifier for the soybean data set based on prediction accuracy on a test set of 20% of the data (61 of the 306 instances), with the remaining 245 instances available for training. That gives 306!/(245!×61!) ≈ 1.3 × 10^65 possible splits of the 306 data points into training and test sets. Could some of these splits lead to better results than others for the classifiers I was about to use?
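The number of splits is just the binomial coefficient, which is easy to sanity-check:

```python
import math

# Number of ways to choose a 61-instance test set (20%) from 306 instances,
# leaving 245 for training: C(306, 61) = 306! / (245! * 61!)
n_splits = math.comb(306, 61)
print(f"{n_splits:.2e}")  # about 1.3 x 10^65
```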
I used the WEKA data mining package for classification. WEKA provides many classifiers that can be run on a data set and their performance compared.
WEKA also has a programming interface, so I used it to write some Jython tools to explore the performance of a range of classifiers.
One of these tools was run on the soybean data to find the training/test splits with the best and worst classification accuracy. The results (Fig 1) were:
| Classifier | Best Accuracy | Worst Accuracy |
|---|---|---|
| SMO (support vector machine) | 96.7% | 82% |
| MLP (neural network) | 100% | 77% |
That is quite a range of test set accuracies for different training/test splits. The simple genetic algorithm I used to search for extreme splits may not have found the true extremes of the distributions, so the actual range could be even wider.
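The Jython/WEKA tools themselves aren't shown here, but the kind of genetic search over splits described above can be sketched in a few lines of plain Python. The fitness function below is a deterministic dummy standing in for a WEKA classifier's test-set accuracy, and the split size, mutation scheme, and population settings are all illustrative assumptions:

```python
import random

N_TOTAL, N_TEST = 306, 61  # 306 instances, ~20% held out for testing

def accuracy(test_indices):
    # Stand-in fitness: the real tools score a WEKA classifier on the
    # split; a deterministic dummy value lets the sketch run on its own.
    rng = random.Random(hash(frozenset(test_indices)))
    return rng.uniform(0.77, 1.0)

def random_split():
    return frozenset(random.sample(range(N_TOTAL), N_TEST))

def mutate(split):
    # Swap one test instance for one training instance.
    out = random.choice(sorted(split))
    inn = random.choice([i for i in range(N_TOTAL) if i not in split])
    return (split - {out}) | {inn}

def search_splits(generations=30, pop_size=20, maximize=True):
    sign = 1 if maximize else -1
    pop = [random_split() for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half, refill by mutating random survivors.
        pop.sort(key=lambda s: sign * accuracy(s), reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size // 2)]
    return max(pop, key=lambda s: sign * accuracy(s))
```

Calling `search_splits(maximize=False)` hunts for the worst split instead of the best; with real classifier accuracy as the fitness, a search of this shape could produce best/worst figures like those in the table above.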
When I ran the test set selection script a second time (Fig 2) it also found a split giving 100% SMO accuracy. This second run searched for a single training/test split that gave the best results for all classifiers at once, and it used slightly different pre-processing: the 4 duplicate instances were removed and the troublesome single 2-4-5-t sample was left in. I therefore expected it to give worse results than the pre-processing used for the results in Fig 1.
| Classifier | Correct (out of 60) | Percent Correct |
|---|---|---|
| Naive Bayes | 57 | 95% |
| Bayes Net | 59 | 98.3% |
| Random Forest | 59 | 98.3% |
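The percentages above follow directly from the counts on the 60-instance test set:

```python
# Percent correct is just correct/total on the 60-instance test set.
for name, correct in [("Naive Bayes", 57), ("Bayes Net", 59), ("Random Forest", 59)]:
    print(f"{name}: {correct}/60 = {100 * correct / 60:.1f}%")
```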
Both sets of results above were for the default settings of each WEKA classifier. The WEKA classifiers all have parameters that can be tuned, and it is possible to select subsets of attributes, so they can give both better and much worse results than the defaults. However, the default parameters are usually close to the best, so they may be good indicators of the best achievable accuracies.
It appears that the choice of training/test split can change classification accuracy by more than 20 percentage points (77% to 100% for the MLP above). This was observed on a well-known and widely used classification data set.