25 March 2012

Stanford NLP Class continued

In one of the early classes a lecturer showed some Unix tools for performing low-level tasks such as tokenizing and word counting. You can see these in the course notes.

I prefer not to use such tools because I already use more tools than I want to, and everything taught in the lectures can be done with a tool I use constantly: Python.

Here is some Python code that does what I remember the lecturer demonstrating.

Tokenizing on non-alphabetic boundaries
import re
_RE_TOKEN = re.compile('[^a-zA-Z]+')
def tokenize(text):
  return [x for x in _RE_TOKEN.split(text) if x]
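As a quick sanity check (the sample sentence is my own, not from the lecture), note that splitting on non-alphabetic characters also splits contractions and drops numbers entirely:

```python
import re

_RE_TOKEN = re.compile('[^a-zA-Z]+')

def tokenize(text):
    return [x for x in _RE_TOKEN.split(text) if x]

# "Isn't" splits into "Isn" and "t"; "100%" contributes no tokens at all.
print(tokenize("Isn't tokenizing fun? It's 100% fun."))
# ['Isn', 't', 'tokenizing', 'fun', 'It', 's', 'fun']
```

Whether that behavior is acceptable depends on the task; a real tokenizer would treat apostrophes and digits more carefully.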

Finding unique words
The lecturers call unique words word types.
tokens = tokenize(text)
types = set(tokens)

Finding unique words, ignoring case variants
normalized_types = set(tokenize(text.lower()))

Finding word types that differ only in case
# variants[x] = set of all case variants of x in types
variants = {}
for w in types:
  variants[w.lower()] = variants.get(w.lower(), set([])) | set([w])

# lowercased types that have more than one case variant
multi_case = dict([(k,v) for (k,v) in variants.items() if len(v) > 1])

# number of types that are case variants of other types
num_case_only = sum([len(v) - 1 for v in multi_case.values()])

# this had better be true
assert num_case_only == len(types) - len(normalized_types)
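Putting the pieces together, here is a worked run on a toy sentence (my own example, not the lecturer's data) showing that the identity checked by the assert holds:

```python
import re

_RE_TOKEN = re.compile('[^a-zA-Z]+')

def tokenize(text):
    return [x for x in _RE_TOKEN.split(text) if x]

text = "The cat saw the Cat. THE cat ran."
tokens = tokenize(text)
types = set(tokens)                             # 7 types: The, the, THE, cat, Cat, saw, ran
normalized_types = set(tokenize(text.lower()))  # 4 types: the, cat, saw, ran

# variants[x] = set of all case variants of x in types
variants = {}
for w in types:
    variants[w.lower()] = variants.get(w.lower(), set()) | {w}

multi_case = {k: v for k, v in variants.items() if len(v) > 1}
num_case_only = sum(len(v) - 1 for v in multi_case.values())

# "the" has 3 variants and "cat" has 2, so 3 types are case-only duplicates
print(num_case_only)  # 3
assert num_case_only == len(types) - len(normalized_types)
```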

That's it. Basic text manipulations done without having to leave one of the small number of programming environments I spend a lot of time in.
