blogit ergo sum: Stanford NLP Class continued

25 March 2012

Stanford NLP Class continued

In one of the early classes, one of the lecturers showed some Unix tools for performing low-level tasks such as tokenizing and word counting. You can see these in the course notes.

I prefer not to use such tools. I already use more tools than I want to, and everything taught in the lectures can be done with one I use a lot: Python.

Here is some Python code that does what I remember the lecturer demonstrating.

Tokenizing on non-alphabetic boundaries
    import re

    _RE_TOKEN = re.compile('[^a-zA-Z]+')

    def tokenize(text):
        return [x for x in _RE_TOKEN.split(text) if x]
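To see what splitting on non-alphabetic boundaries means in practice, here is a small sketch that runs the tokenizer on a made-up sentence (the sample text is my own, not from the course). Note that digits are dropped and an apostrophe splits a word in two.

```python
import re

_RE_TOKEN = re.compile('[^a-zA-Z]+')

def tokenize(text):
    # Split on runs of non-letters and drop the empty strings
    # that appear at the start or end of the text.
    return [x for x in _RE_TOKEN.split(text) if x]

# Hypothetical sample sentence, not from the course material.
sample = "Don't panic: the 2 cats sat on the mat."
print(tokenize(sample))
# ['Don', 't', 'panic', 'the', 'cats', 'sat', 'on', 'the', 'mat']
```

Whether "Don" and "t" are acceptable tokens depends on the task. For rough counts this is usually fine.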
Finding unique words
The lecturers call unique words word types.
    tokens = tokenize(text)
    types = set(tokens)
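Once the text is split into tokens, counting how many tokens belong to each type is a small step further. This is a minimal sketch of word counting using `collections.Counter` from the standard library. The `word_counts` helper and the sample text are my own names and data for illustration, and I am assuming the output should list the most frequent words first, as a `sort | uniq -c | sort -rn` style pipeline would.

```python
import re
from collections import Counter

_RE_TOKEN = re.compile('[^a-zA-Z]+')

def tokenize(text):
    return [x for x in _RE_TOKEN.split(text) if x]

def word_counts(text):
    """Map each word type to the number of tokens of that type."""
    return Counter(tokenize(text))

# Hypothetical sample text, not from the course material.
sample = "the cat and the hat and the bat"
counts = word_counts(sample)

# Print the two most frequent types, highest count first.
for word, n in counts.most_common(2):
    print(n, word)
# 3 the
# 2 and
```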
Finding unique words, ignoring case variants
    normalized_types = set(tokenize(text.lower()))
Finding word types that differ only in case
    # variants[x] = set of all case variants of x in types
    variants = {}
    for w in types:
        variants.setdefault(w.lower(), set()).add(w)

    # lowercased forms that occur as more than one case variant
    multi_case = {k: v for k, v in variants.items() if len(v) > 1}

    # number of types that are case variants of other types
    num_case_only = sum(len(v) - 1 for v in multi_case.values())

    # this had better be true
    assert num_case_only == len(types) - len(normalized_types)
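Here is the whole thing run end to end on a tiny made-up text, as a sanity check on the bookkeeping above. The `case_variants` helper and the sample text are my own, introduced only for this illustration. I am also assuming plain ASCII input.

```python
import re

_RE_TOKEN = re.compile('[^a-zA-Z]+')

def tokenize(text):
    return [x for x in _RE_TOKEN.split(text) if x]

def case_variants(types):
    """Group types by lowercased form, keeping groups with more than one variant."""
    variants = {}
    for w in types:
        variants.setdefault(w.lower(), set()).add(w)
    return {k: v for k, v in variants.items() if len(v) > 1}

# Hypothetical sample text, not from the course material.
text = "The cat saw the Cat. THE END, the end."
types = set(tokenize(text))
normalized_types = set(tokenize(text.lower()))

multi_case = case_variants(types)
num_case_only = sum(len(v) - 1 for v in multi_case.values())

print(len(types), len(normalized_types), num_case_only)
# 8 4 4
for word in sorted(multi_case):
    print(word, sorted(multi_case[word]))
# cat ['Cat', 'cat']
# end ['END', 'end']
# the ['THE', 'The', 'the']
```

Eight types collapse to four once case is ignored, and the four that disappear are exactly the extra case variants counted by `num_case_only`.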

That's it. Basic text manipulations done without having to leave one of the small number of programming environments I spend a lot of time in.