
A Brief Introduction to Part-of-Speech Tagging


A field of computer science that has captured my attention lately is computational linguistics: the inexact science of getting a computer to understand what you mean.  This could be something as futuristic as Matthew Broderick’s battle with the WOPR, or something more practical, like Siri.  Whether the input is text typed at a keyboard or something closer to the very unstructured format of human speech, understanding the meaning behind parsed words is incredibly complex, and to someone like me, fascinating!

My particular interest as of late is parsing, which from a linguistic perspective means breaking a string of characters into words, determining their meanings, and assembling them into a parse tree, where the meanings of individual words as well as the relationships between them are composed into a logical construct that can support higher-order functions, such as a personal assistant.  Having taken several foreign language classes, and having later sat on the other side of the table as an ESL teacher, I can appreciate the enormous ambiguity and complexity of any language (English more than most Germanic languages), and thus the difficulty of creating an automated process to parse input into meaningful logical representations.  Just discerning the meaning of individual words, given the multitude of meanings that can be ascribed to any one sequence of characters, is quite a challenge.

Parsing Models

Consider this:  My security beat wore me out tonight.

In this sentence, what is the function of the word beat?  Beat can function as either a noun or a verb, but in this context, it is a noun.  There are two general schools of thought around assigning a tag for the part of speech (POS) each word in a sentence performs: iterative rules-based methods and stochastic methods.  In rules-based methods, like Eric Brill’s POS tagger, a priority-ordered set of rules encodes language-specific axioms, such as “when a word appears to be a preposition, it is actually a noun if the preceding word is while”.  A complex set of these meticulously constructed conditions is used to refine a coarser dictionary-style assignment of POS tags.
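To make the rules-based idea concrete, here is a toy sketch of that two-stage process, a coarse dictionary lookup refined by an ordered transformation rule, run over the example sentence above. The lexicon and the rule are invented for illustration; they are not Brill's actual rule set.

```python
from typing import List, Tuple

# Coarse first-pass lexicon: each word's single most likely tag.
# (Illustrative only; a real lexicon is built from a tagged corpus.)
LEXICON = {"my": "PRP$", "security": "NN", "beat": "VB", "wore": "VBD",
           "me": "PRP", "out": "RP", "tonight": "NN"}

# Ordered transformation rules: (from_tag, to_tag, condition on previous tag).
RULES = [
    ("VB", "NN", lambda prev: prev == "NN"),  # a "verb" right after a noun is re-tagged as a noun
]

def tag(words: List[str]) -> List[Tuple[str, str]]:
    tags = [LEXICON.get(w, "NN") for w in words]      # dictionary-style first pass
    for frm, to, cond in RULES:                       # rule-based refinement pass
        for i in range(1, len(tags)):
            if tags[i] == frm and cond(tags[i - 1]):
                tags[i] = to
    return list(zip(words, tags))

print(tag("my security beat wore me out tonight".split()))
# "beat" starts as VB from the lexicon, then the rule corrects it to NN
```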

Stochastic methods, however, are “fuzzier”: instead of a procedural, manual analysis of edge cases and their mitigations, they build statistical models of how words should be tagged by training over pre-tagged corpora, in a manner hearkening to the training sets applied to neural networks.  The trained model then serves as the baseline for assigning tags to incoming text, but there is no notable way to correct a specific error or edge case short of retraining the entire model.  One very interesting approach treats part-of-speech tagging as a Hidden Markov Model: a probabilistic model of a process whose internal states cannot be observed directly and must instead be inferred from sparse characteristics of the model and from the inputs and outputs that pass through it.
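As a sketch of the HMM idea: the hidden states are the tags, the observed outputs are the words, and the Viterbi algorithm recovers the most probable tag sequence. Every probability below is invented for illustration; a real tagger would estimate the transition and emission tables from a tagged corpus.

```python
import math

TAGS = ["NN", "VB"]
START = {"NN": 0.7, "VB": 0.3}                    # P(tag | sentence start)
TRANS = {"NN": {"NN": 0.6, "VB": 0.4},            # P(next tag | current tag)
         "VB": {"NN": 0.8, "VB": 0.2}}
EMIT = {"NN": {"security": 0.6, "beat": 0.4},     # P(word | tag)
        "VB": {"security": 0.05, "beat": 0.5}}

def viterbi(words):
    # best[t] = (log-probability of the best path ending in tag t, that path)
    best = {t: (math.log(START[t]) + math.log(EMIT[t].get(words[0], 1e-6)), [t])
            for t in TAGS}
    for w in words[1:]:
        best = {t: max(
            (p + math.log(TRANS[prev][t]) + math.log(EMIT[t].get(w, 1e-6)),
             path + [t])
            for prev, (p, path) in best.items())
            for t in TAGS}
    return max(best.values())[1]

print(viterbi(["security", "beat"]))
# With these toy numbers, "beat" after "security" decodes as a noun: ['NN', 'NN']
```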

This continues to be a good candidate for doctoral theses in computer science disciplines… papers that have caused me to lose too much sleep as of late.

Parsing Syntax

Even describing parts of speech can be as mundane as your elementary school grammar book, or as rich as the C7 tagset, which provides 146 unique ways to describe a word’s potential function.  While that is exceptionally expressive and specific, I have become rather fond of the Penn Treebank II tagset, which defines 45 tags that seem to provide enough semantic context for the key elements of local pronoun resolution and larger-scale object-entity context mapping.  Finding an extensively tagged Penn Treebank corpus proves difficult, however: it is copyrighted by the University of Pennsylvania, distributed through a public-private partnership for several thousand dollars, and the tagged corpus covers an almost exclusively narrow variety of topics and sentence structures, namely Wall Street Journal articles.  Obtaining it is critical as a reference check when writing a new Penn Treebank II part-of-speech tagger, and its cost prevents the construction of a more comprehensive Penn-tagged wordlist, which would be a boon for any tagger implementation.  However, the folks at NLTK have provided a 10% free sample under Fair Use, which has proven somewhat useful both for checking outputs in a limited fashion and for generating some more useful relative statistics about relationships between parts of speech within a sentence.

To produce some rudimentary probabilistic models to guide ambiguous POS mappings for individual words, I wrote a five-minute proof of concept that scanned the NLTK-provided excerpt of the WSJ Penn Treebank corpus to produce probabilities of what the next word’s part of speech would be, given the previous word’s tag. The full results are available in this gist.
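That pass over the corpus can be sketched as follows. The actual run used the NLTK treebank sample (via nltk.corpus.treebank.tagged_sents()); the three hand-tagged sentences below are a stand-in so the example is self-contained.

```python
from collections import Counter

# Stand-in for nltk.corpus.treebank.tagged_sents(): lists of (word, tag) pairs.
tagged_sents = [
    [("The", "DT"), ("dog", "NN"), ("barked", "VBD")],
    [("A", "DT"), ("cat", "NN"), ("slept", "VBD")],
    [("The", "DT"), ("big", "JJ"), ("rocket", "NN"), ("rose", "VBD")],
]

pairs = Counter()   # counts of (previous tag, next tag) bigrams
totals = Counter()  # counts of each tag appearing as the "previous" tag
for sent in tagged_sents:
    tags = [t for _, t in sent]
    for prev, nxt in zip(tags, tags[1:]):
        pairs[(prev, nxt)] += 1
        totals[prev] += 1

# P(next tag | previous tag), e.g. P(NN | DT)
probs = {(p, n): c / totals[p] for (p, n), c in pairs.items()}
print(probs[("DT", "NN")])  # 2 of the 3 determiners are followed directly by a noun
```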

Future Musings

My immediate interest, whenever I get some free time on a weekend (which is pretty rare these days due to the exceptional pace of progress at our start-up), is pronoun resolution, which is the object of this generation’s Turing Test, the Winograd Schema Challenge.  An example of such a challenge is to get a machine to answer this kind of question: Joe’s uncle can still beat him at tennis, even though he is 30 years older. Who is older? This kind of question is easy for a human to answer, but very, very hard for a machine to infer, because (a) it can’t cheat by Googling a suitable answer, which some of the less impressive Turing Test contestant programs now do, and (b) it requires not only the ability to successfully parse a sentence into its respective parts of speech, phrases, and clauses, but also the ability for a computer to resolve the meaning of a pronoun.  That’s an insanely tough feat!  Imagine this:

“Annabelle is a mean-spirited person.  She shot my dog out of spite.”

A program could infer that “my dog” is a dog belonging to the person providing the text.  Being able to do this has obvious applications in the real world, and it has been done before.  But imagine the leap in context, exponentially harder to overcome, required to resolve “She”.  This demands not only an intra-sentence relationship of noun phrases, possessive pronouns, direct objects, and adverbial clauses, but also the ability to carry context forward from one sentence to the next, building a growing “mental map” of people, places, and things, and building a profile of each as more information or context is provided.  And if you think that’s not hard enough, imagine two additional words appended to the sentence:

, she said.

To a human, that would indicate dialog, which requires a wholly separate, Inception-style frame of reference between contextual frames.  The parser is reading text about things that is actually being conveyed by other things, and both sets of frames have their own unique, but not necessarily separate, domains and attributes.  I’m a very long way off from ever getting this “free time” diversion anywhere close to functioning as advertised… but then again, that’s what exercises on a weekend are for: not doing, but learning. 🙂


Posted on August 22, 2013 in Programming

 

2 responses to “A Brief Introduction to Part-of-Speech Tagging”

  1. Cole Brand (@jcolebrand)

    April 7, 2014 at 8:25 pm

    What about constructing a sort of context version of a piece table? Context is fluid (especially in English, which is a far harder semantic language than something like German) so you shouldn’t try to have more than a few blocks of context at any one point. Sometimes we purposefully destroy context with our words and yet a seasoned English speaker can find which pieces belong together. I already don’t agree language parsing can be done with a simple tree model (as depicted in the “The giant rocket rose” example) and that we probably need to use multiple accountancy models that all have to tie together.

    There’s probably even something that needs to be said for deliberate spacing by an author and how we can spot mechanical spacing as being foreign, but that’s all OCR and not really the mechanics of how to match up words to intent and meaning.

     
    • Sean McElroy

      April 8, 2014 at 11:36 am

      This is essentially what a Brill tagger does. It tags words with their most probable part of speech, but then it recursively refines that based on localized context, with rule sets that typically track backward or forward up to four words away. Brill-like approaches don’t work very well in English, though: while there is some subject-verb agreement for number, subjects aren’t inflected along with verbs for gender (as in Spanish) or for case (as in Latin), which, were it the case, would make parsing sentences with multiple complex clauses much more reliable. For fun one week, I wrote a tree-based, Brill-like tagger to explore those problems first-hand and got a very functional model… so I thought. I challenged a friend to come up with something it couldn’t handle, and his first attempt broke it badly. His sentence was, “One does not simply walk into Mordor”. It’s very difficult to procedurally recognize that the only noun in the sentence isn’t its subject, particularly because English lacks hints that could make tagging more algorithmic.

      Probabilistic training models generated from statistical modeling are definitely the way to go, but context is still key there. After using Stanford NLP as well as NLTK, there’s quite a difference in POS tagging output on literary work when using a training set like the Penn Treebank WSJ archive. We really need a shifting context probability model, where there is context for the context: not necessarily at the individual-speaker level, but there is clearly a statistical difference in word-use frequencies between vastly different use cases for bodies of work.

       
