These are my notes on Stanford's online NLP class.
The first few lectures pointed out that much of the hard work in NLP, notably tokenization, is done with regular expressions. This was not entirely surprising, as a good fraction of the string processing in my professional career has relied on them too.
Programming exercises can be done in Python or Java. I chose Python as I have found it well suited to simple string manipulation programs in the past.
The first programming exercise is to extract phone numbers and email addresses from web pages. A training set of Stanford computer science faculty home pages was supplied along with some starter code to show the required formatting. The starter code helpfully computed lists of true positives, false positives and false negatives.
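The starter code itself is not reproduced here, but the bookkeeping it did can be sketched with Python sets. The `score` function, the (page, type, value) tuples and the made-up names and addresses below are assumptions for illustration, not the course's actual format.

```python
def score(guesses, gold):
    """Split guesses into true positives, false positives and false negatives.

    Both arguments are iterables of hashable items; here each item is a
    hypothetical (page, type, value) tuple, with type "e" for email and
    "p" for phone.
    """
    guesses, gold = set(guesses), set(gold)
    true_pos = guesses & gold    # found and correct
    false_pos = guesses - gold   # found but wrong
    false_neg = gold - guesses   # missed
    return true_pos, false_pos, false_neg


# Made-up data: two expected items, one of which is missed, plus one bad guess.
gold = {
    ("jane", "e", "jane@cs.example.edu"),
    ("jane", "p", "650-555-0100"),
}
guesses = {
    ("jane", "e", "jane@cs.example.edu"),
    ("jane", "e", "webmaster@cs.example.edu"),
}

true_pos, false_pos, false_neg = score(guesses, gold)
print(len(true_pos), len(false_pos), len(false_neg))
```

Printing the false positives and false negatives themselves, rather than just the counts, is what makes it easy to see which pattern to loosen or tighten next.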
In my experience, the way to approach problems like these is to
- get the test samples to pass, by
  - loosening matches and adding patterns until all the addresses and phone numbers are detected
  - tightening matches to avoid false positives
- make decisions along the way that are likely to generalize well to as-yet-unseen samples
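As a rough illustration of that loosen-then-tighten loop, here is a minimal sketch using Python's `re` module. It is not the starter code or a complete solution: the patterns, the `extract` helper, the sample text and the choice to accept only `.edu` domains are all assumptions made for this example.

```python
import re

# Loosened: accept "@" or the obfuscation " at ", and "." or " dot ".
# Tightened: only domains ending in "edu" (an assumption for this sketch).
EMAIL_RE = re.compile(
    r"\b([\w.-]+)"                   # user name
    r"(?:\s*@\s*|\s+at\s+)"          # "@" or " at "
    r"((?:[\w-]+(?:\.|\s+dot\s+))+"  # domain labels, each followed by "." or " dot "
    r"edu)\b",
    re.IGNORECASE,
)

# Loosened: optional parentheses around the area code, several separators.
# Tightened: no digits directly before or after, and a separator is
# required before the last four digits, so long digit runs are rejected.
PHONE_RE = re.compile(
    r"(?<!\d)\(?(\d{3})\)?[\s.-]{0,2}(\d{3})[\s.-](\d{4})(?!\d)"
)


def extract(text):
    """Return (emails, phones) found in text, in a normalized form."""
    emails = []
    for user, domain in EMAIL_RE.findall(text):
        # Undo the " dot " obfuscation so every address has the same shape.
        domain = re.sub(r"\s+dot\s+", ".", domain, flags=re.IGNORECASE)
        emails.append(f"{user}@{domain}".lower())
    phones = ["-".join(parts) for parts in PHONE_RE.findall(text)]
    return emails, phones


# Made-up sample text, not taken from the training set.
sample = (
    "Contact: jane@cs.example.edu or john at cs dot example dot edu. "
    "Phone: (650) 555-0100, fax 650.555.0101. Order 6505550100123."
)
emails, phones = extract(sample)
print(emails)
print(phones)
```

The trade-off shows up quickly: accepting " at " catches obfuscated addresses, but it would also turn a phrase like "meet me at example dot edu" into a false positive, which is the kind of case the tightening step has to deal with.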