A field of computer science that has captured my attention lately is computational linguistics: the inexact science of getting a computer to understand what you mean. This could be something as futuristic as Matthew Broderick’s battle with the WOPR, or something more practical, like Siri. Whether the input is text typed by a human on a keyboard or something closer to the very unstructured format of human speech, understanding the meaning behind parsed words is incredibly complex and, to someone like me, fascinating!
My particular interest as of late is parsing, which from a linguistic perspective means breaking a string of characters down into words and their meanings, then stringing them together in a parse tree, where the meanings of individual words as well as the relationships between them are composed into a logical construct that enables higher-order functions, such as a personal assistant. Having taken several foreign language classes, and then having sat on the other side of the table as an ESL teacher, I can appreciate the enormous ambiguity and complexity of any language, and of English in particular among the Germanic languages, when it comes to creating an automated process to parse input into meaningful logical representations. Just discerning the meaning of individual words is quite a challenge, given the multitude of meanings that can be ascribed to any one sequence of characters.
Consider this: My security beat wore me out tonight.
In this sentence, what is the function of the word beat? Beat can function as either a noun or a verb, but in this context it is a noun. There are two general schools of thought on assigning a part-of-speech (POS) tag to each word in a sentence: iterative rules-based methods and stochastic methods. Rules-based methods, like Eric Brill’s POS tagger, use a prioritized set of rules that set forth language-specific axioms, such as “when a word appears to be a preposition, it is actually a noun if the preceding word is while”. A complex set of these meticulously constructed conditions is used to refine a coarser dictionary-style assignment of POS tags.
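As a rough sketch of that two-step idea (a coarse dictionary pass, then contextual rules that overwrite it), here is a toy version in Python. This is not Brill's actual rule set: the mini-lexicon, the single rule, and the function names are all made up for illustration.

```python
# Toy rules-based tagger: coarse lexicon lookup, then ordered contextual fixes.
# LEXICON and RULES are hypothetical; a real tagger has far larger versions of both.

# Most likely tag for each word, ignoring context (Penn-style tag names).
LEXICON = {
    "my": "PRP$", "security": "NN", "beat": "VB", "wore": "VBD",
    "me": "PRP", "out": "RP", "tonight": "NN",
}

# Each rule: (from_tag, to_tag, required previous tag, required next tag).
RULES = [
    ("VB", "NN", "NN", "VBD"),  # a "verb" between a noun and a past-tense verb is a noun
]

def initial_tags(words):
    """Coarse pass: look each word up, defaulting unknown words to NN."""
    return [LEXICON.get(word.lower(), "NN") for word in words]

def apply_rules(tags, rules):
    """Refinement pass: apply each rule, in priority order, across the sentence."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag, next_tag in rules:
        for i in range(1, len(tags) - 1):
            if (tags[i] == from_tag and tags[i - 1] == prev_tag
                    and tags[i + 1] == next_tag):
                tags[i] = to_tag
    return tags

def tag(sentence):
    """Tag a whitespace-separated sentence, ignoring a trailing period."""
    words = sentence.rstrip(".").split()
    return list(zip(words, apply_rules(initial_tags(words), RULES)))

print(tag("My security beat wore me out tonight."))
```

The lexicon alone tags beat as a verb; the contextual rule is what flips it to a noun for this sentence.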
Stochastic methods, however, are “fuzzier”: they build statistical models of how words should be tagged, based not on a procedural, manual analysis of edge cases and their mitigations, but on training over pre-tagged corpora, in a manner reminiscent of the training sets applied to neural networks. These trained models are then used as a baseline for assigning tags to incoming text, but there is no notable option for correcting a specific error or edge case other than retraining the entire model. One very interesting approach treats part-of-speech tagging as a Hidden Markov Model, a probabilistic model of a process whose internal states (here, the tags) cannot be observed directly and must be inferred from the outputs it produces (here, the words).
This continues to be a good candidate for doctoral theses in computer science disciplines, papers that have caused me to lose too much sleep as of late.
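To make the Hidden Markov Model idea a little more concrete, here is a minimal sketch of Viterbi decoding, the standard way to recover the most likely hidden tag sequence from an HMM. Everything numeric here is an assumption: the two-tag set (NOUN and VERB) and all of the probabilities are invented for illustration rather than estimated from a corpus.

```python
# Toy HMM tagger decoded with the Viterbi algorithm.
# All probabilities are invented for illustration only.

TAGS = ["NOUN", "VERB"]

START = {"NOUN": 0.6, "VERB": 0.4}                  # P(first tag)
TRANS = {"NOUN": {"NOUN": 0.5, "VERB": 0.5},        # P(next tag | current tag)
         "VERB": {"NOUN": 0.9, "VERB": 0.1}}
EMIT = {"NOUN": {"security": 0.5, "beat": 0.2, "wore": 0.01},   # P(word | tag)
        "VERB": {"security": 0.01, "beat": 0.4, "wore": 0.5}}
UNSEEN = 1e-6                                       # fallback for unknown words

def viterbi(words):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (START[t] * EMIT[t].get(words[0], UNSEEN), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for t in TAGS:
            column[t] = max(
                (best[-1][p][0] * TRANS[p][t] * EMIT[t].get(word, UNSEEN), p)
                for p in TAGS
            )
        best.append(column)
    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(TAGS, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))

print(viterbi(["security", "beat", "wore"]))
```

With these made-up numbers, beat on its own is more likely a verb, but the best path over the whole sequence tags it as a noun, because a verb rarely follows another verb in this toy model.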
Even describing parts of speech can be as mundane as your elementary school grammar book, or as rich as the C7 tagset, which provides 146 unique ways to describe a word’s potential function. While C7 is exceptionally expressive and specific, I have become rather fond of the Penn Treebank II tagset, which defines 45 tags that seem to provide enough semantic context for the key elements of local pronoun resolution and larger-scale object-entity context mapping. Finding an extensively tagged Penn Treebank corpus proves difficult, however: it is copyrighted by the University of Pennsylvania, distributed through a public-private partnership for several thousand dollars, and drawn almost exclusively from a narrow variety of topics and sentence structures, namely Wall Street Journal articles. Obtaining it is critical as a reference check when writing a new Penn Treebank II part-of-speech tagger, and the barrier to access prevents the construction of a more comprehensive Penn-tagged wordlist, which would be a boon for any tagger implementation. However, the folks at NLTK have provided a 10% free sample under Fair Use, which has proved somewhat useful both for checking outputs in a limited fashion and for generating some more useful relative statistics about relationships between parts of speech within a sentence.
To produce some rudimentary probabilistic models to guide ambiguous POS mappings for individual words, I wrote a five-minute proof of concept that scanned the NLTK-provided excerpt of the WSJ Penn Treebank corpus to compute the probability of each part of speech for the next word, given the previous word’s tag. The full results are available in this gist.
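The kind of calculation involved can be sketched in a few lines. This is not the code from the gist; it is a minimal reconstruction of the idea, and it runs on a tiny hand-tagged sample (hypothetical data) so that it works without downloading any corpus.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sentences standing in for the treebank excerpt.
TAGGED_SENTS = [
    [("The", "DT"), ("dog", "NN"), ("barked", "VBD")],
    [("The", "DT"), ("old", "JJ"), ("dog", "NN"), ("slept", "VBD")],
    [("A", "DT"), ("cat", "NN"), ("ran", "VBD")],
]

def transition_probabilities(tagged_sents):
    """Estimate P(next tag | previous tag) from adjacent tag pairs."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        tags = [tag for _, tag in sent]
        for prev_tag, next_tag in zip(tags, tags[1:]):
            counts[prev_tag][next_tag] += 1
    # Normalize each row of counts into a probability distribution.
    return {
        prev_tag: {t: n / sum(following.values()) for t, n in following.items()}
        for prev_tag, following in counts.items()
    }

probs = transition_probabilities(TAGGED_SENTS)
for prev_tag, following in probs.items():
    print(prev_tag, following)
```

If NLTK and its treebank sample are installed, `nltk.corpus.treebank.tagged_sents()` should yield sentences in the same list-of-(word, tag) shape, so it could be passed in place of `TAGGED_SENTS`.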
My immediate interest, whenever I get some free time on a weekend (which is pretty rare these days due to the exceptional pace of progress at our start-up), is pronoun resolution, which is the object of this generation’s Turing Test: the Winograd Schemas. An example of such a challenge is to get a machine to answer this kind of question: Joe’s uncle can still beat him at tennis, even though he is 30 years older. Who is older? This kind of question is easy for a human to answer, but very, very hard for a machine to infer because (a) it can’t cheat by Googling a suitable answer, which some of the less impressive Turing Test contestant programs now do, and (b) it requires not only the ability to parse a sentence into its parts of speech, phrases, and clauses, but also the ability to resolve the meaning of a pronoun. That’s an insanely tough feat! Imagine this:
“Annabelle is a mean-spirited person. She shot my dog out of spite.”
A program could infer that “my dog” is a dog belonging to the person providing the text. This has obvious applications in the real world, and it has been done before. But imagine the leap in context, exponentially harder to overcome, when resolving “She”. This requires not only an intra-sentence relationship of noun phrases, possessive pronouns, direct objects, and adverbial clauses, but also the ability to carry context forward from one sentence to the next, building a running “mental map” of people, places, and things, and building a profile of them as more information or context is provided. And if you think that’s not hard enough to define, imagine two additional words appended to this sentence:
, she said.
To a human, that would indicate dialog, which requires a wholly separate, Inception-style frame of reference between contextual frames. The parser is reading text about things, and that text is itself being conveyed by other things; both sets of frames have their own unique, but not necessarily separate, domains and attributes. I’m a very long way off from ever getting this “free time” diversion anywhere close to functioning as advertised… but, then again, that’s what exercises on a weekend are for: not doing, but learning. 🙂