25 March 2012

Stanford NLP Class continued

In one of the early classes, one of the lecturers showed some Unix tools for performing low-level tasks such as tokenizing and word counting. You can see these in the course notes.

I prefer not to use such tools: I already use more tools than I would like, and everything taught in the lectures can be done with one tool I use a lot, Python.

Here is some Python code that does what I remember the lecturer demonstrating.

Tokenizing on non-alphabetic boundaries
import re
_RE_TOKEN = re.compile(r'[^a-zA-Z]+')
def tokenize(text):
  return [x for x in _RE_TOKEN.split(text) if x]
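As a quick check of the tokenizer, here is a small sketch run on a made-up sentence (the sample text is my own assumption, not something from the lectures). Note that splitting on non-alphabetic characters breaks contractions such as "Don't" into two tokens and drops digits entirely.

```python
import re

# Same tokenizer as above: split on runs of non-alphabetic characters.
_RE_TOKEN = re.compile(r'[^a-zA-Z]+')

def tokenize(text):
    # Drop the empty strings that split() yields at the edges of the text.
    return [x for x in _RE_TOKEN.split(text) if x]

# Hypothetical sample sentence, chosen to show apostrophes and digits.
sample = "Don't panic: it's 2012, and NLP is fun!"
tokens = tokenize(sample)
print(tokens)
# ['Don', 't', 'panic', 'it', 's', 'and', 'NLP', 'is', 'fun']
```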

Finding unique words
The lecturers call unique words word types.
tokens = tokenize(text)
types = set(tokens)

Finding unique words, ignoring case variants
normalized_types = set(tokenize(text.lower()))
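Word counting, the other task mentioned at the top, is not shown above. A minimal sketch, assuming collections.Counter from the standard library is an acceptable way to do it and using a made-up sample text:

```python
import re
from collections import Counter

_RE_TOKEN = re.compile(r'[^a-zA-Z]+')

def tokenize(text):
    return [x for x in _RE_TOKEN.split(text) if x]

# Hypothetical sample text; substitute the contents of any file.
text = "the cat sat on the mat. The cat slept."

# Count case-normalized tokens.
counts = Counter(tokenize(text.lower()))

# Print the most frequent words first.
for word, n in counts.most_common(3):
    print(n, word)
```

Dropping the call to lower() would count "the" and "The" separately, which is the distinction the later sections look at.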

Finding word types that differ only in case
# variants[x] = set of all case variants of x in types
variants = {}
for w in types:
  variants.setdefault(w.lower(), set()).add(w)

# normalized types that occur in more than one case variant
multi_case = {k: v for (k, v) in variants.items() if len(v) > 1}

# number of types that are case variants of other types
num_case_only = sum(len(v) - 1 for v in multi_case.values())

# this had better be true
assert num_case_only == len(types) - len(normalized_types)
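To see why the assertion holds, here is the whole sequence run end to end on a small made-up text (the sample sentence is my own assumption). Each normalized type accounts for one type, and every extra case variant adds one more, so the extras sum to the difference between the two counts.

```python
import re

_RE_TOKEN = re.compile(r'[^a-zA-Z]+')

def tokenize(text):
    return [x for x in _RE_TOKEN.split(text) if x]

# Hypothetical sample text with several case variants.
text = "The cat saw the Cat and THE dog. A dog saw a cat."

types = set(tokenize(text))                     # 10 distinct word types
normalized_types = set(tokenize(text.lower()))  # 6 after lower-casing

# Group the types by their lower-cased form.
variants = {}
for w in types:
    variants.setdefault(w.lower(), set()).add(w)

# Keep only the groups with more than one case variant.
multi_case = {k: v for (k, v) in variants.items() if len(v) > 1}

# Every variant beyond the first in a group is a "case only" type.
num_case_only = sum(len(v) - 1 for v in multi_case.values())

print(len(types), len(normalized_types), num_case_only)  # 10 6 4
assert num_case_only == len(types) - len(normalized_types)
```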


That's it. Basic text manipulations done without having to leave one of the small number of programming environments I spend a lot of time in.
