I recently wrote a Twitter bot to offer sympathy to people who have suffered paper cuts. You can see it here.
The link in the previous paragraph has all the code. There was nothing particularly complex in it. The hardest part was classifying tweets as being from people who had actually suffered paper cuts. This was done with a very simple naive Bayes classifier.
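The real classifier is in the linked code; as a rough sketch of the idea (this is a generic multinomial naive Bayes over word unigrams and bigrams, not the bot's actual implementation), it might look like:

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal multinomial naive Bayes over word unigrams and bigrams.

    A sketch of the general technique only; the bot's real classifier
    lives in the code linked above."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    @staticmethod
    def ngrams(text):
        words = text.lower().split()
        # Unigrams plus adjacent-word bigrams.
        return words + [" ".join(pair) for pair in zip(words, words[1:])]

    def train(self, text, label):
        self.counts[label].update(self.ngrams(text))
        self.docs[label] += 1

    def classify(self, text):
        total_docs = self.docs[True] + self.docs[False]
        vocab = len(set(self.counts[True]) | set(self.counts[False]))
        scores = {}
        for label in (True, False):
            n = sum(self.counts[label].values())
            score = math.log(self.docs[label] / total_docs)
            for g in self.ngrams(text):
                # Laplace smoothing avoids zero probability for unseen n-grams.
                score += math.log((self.counts[label][g] + 1) / (n + vocab))
            scores[label] = score
        return scores[True] > scores[False]
```

Training it on a handful of labelled tweets and calling `classify` on a new one returns `True` when the "suffered a paper cut" class has the higher smoothed log-probability.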
The code from the previous link was trained on 3,225 tweets to give the following 10-fold cross-validation results. The columns are the classifier's predictions of whether tweets containing "paper cut" are from people with paper cuts; the rows are a human's (my) judgement of the same tweets.
The precision value of 0.937 means that 0.063, or about 1 in 16, of the tweets the Twitter bot classifies as being about paper cuts (and replies to) are not actually about paper cuts.
Counts (rows: human, columns: classifier)

======= = ======= = =======
        |  False  |  True
------- + ------- + -------
 False  |   1978  |    59
------- + ------- + -------
 True   |    308  |   880
======= = ======= = =======
Total = 3225

Proportions of all 3,225 tweets

======= = ======= = =======
        |  False  |  True
------- + ------- + -------
 False  |    61%  |    2%
------- + ------- + -------
 True   |    10%  |   27%
======= = ======= = =======

Precision = 0.937, Recall = 0.741, F1 = 0.827
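The precision, recall, and F1 figures follow directly from the four counts in the table. As a quick check:

```python
# Confusion matrix counts from the cross-validation table
# (rows: human label, columns: classifier prediction).
tn, fp = 1978, 59   # human says not a paper cut
fn, tp = 308, 880   # human says paper cut

precision = tp / (tp + fp)  # fraction of the bot's replies that are warranted
recall = tp / (tp + fn)     # fraction of real paper-cut tweets the bot catches
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision = {precision:.3f}, Recall = {recall:.3f}, F1 = {f1:.3f}")
# → Precision = 0.937, Recall = 0.741, F1 = 0.827
```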
There is some additional filtering to reduce the number of inappropriate replies. Filtering increases the precision to 0.95, so the final false positive rate is about 1 in 20. The false positives that survive filtering are here. Examining them gives a feel for the cases where the simple Bayesian n-gram classifier breaks down.
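The actual filtering rules are in the linked code; purely as a hypothetical sketch of the shape such a post-classification filter can take (none of these patterns are the bot's real rules), it might suppress replies to tweets matching known false-positive patterns:

```python
import re

# Hypothetical skip patterns, for illustration only -- not the bot's real filter.
SKIP_PATTERNS = [
    re.compile(r"^RT\b"),                                  # retweets, not first-hand reports
    re.compile(r"paper cut ?out", re.I),                   # "paper cutout" crafts
    re.compile(r"death by a thousand paper cuts", re.I),   # the idiom, not an injury
]

def should_reply(tweet, classified_positive):
    """Reply only if the classifier said yes and no skip pattern matches."""
    return classified_positive and not any(p.search(tweet) for p in SKIP_PATTERNS)
```

Each pattern trades a little recall for precision: every suppressed true positive is a missed sympathy reply, but every suppressed false positive is an inappropriate one avoided.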
The most influential n-grams are given here where you can also find a link to the full list of n-grams.
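One standard way to rank n-grams by influence in a two-class naive Bayes model (a sketch of the general idea, not necessarily how the linked list was produced) is the smoothed log-likelihood ratio between the classes:

```python
import math
from collections import Counter

def influential_ngrams(pos_counts, neg_counts, top=5):
    """Rank n-grams by smoothed log-likelihood ratio between the two classes.

    pos_counts / neg_counts are Counters of n-gram frequencies per class;
    the counts used in the test below are made up for illustration."""
    vocab = set(pos_counts) | set(neg_counts)
    n_pos = sum(pos_counts.values())
    n_neg = sum(neg_counts.values())

    def llr(gram):
        # Laplace-smoothed probability of the n-gram in each class.
        p = (pos_counts[gram] + 1) / (n_pos + len(vocab))
        q = (neg_counts[gram] + 1) / (n_neg + len(vocab))
        return math.log(p / q)

    # Highest ratio first: n-grams most indicative of the positive class.
    return sorted(vocab, key=llr, reverse=True)[:top]
```

N-grams near the top of such a ranking push a tweet hardest toward the "paper cut sufferer" class; those near the bottom push hardest away from it.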
You can get some idea of how well the classification works in the real world from the Twitter responses to it.
UPDATE 17 September 2012: Twitter warned the OwwwPapercut account for sending automated replies, which seemed odd given @StealthMountain. OwwwPapercut has run in non-replying mode since then.