Since we are working with lexicalized TAGs, each word in the sentence
selects at least one tree. The advantage of a lexicalized formalism
such as LTAG is that, rather than parsing with all the trees in the
grammar, we can parse with only the trees selected by the words in the
input sentence.
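The saving can be illustrated with a minimal sketch. The toy lexicon, the tree names and the `select_trees` helper below are hypothetical and are not part of XTAG; the only point is that the parser's candidate set is built from the trees anchored by the input words, not from the whole grammar.

```python
# Hypothetical toy lexicon: each word anchors one or more elementary trees.
# The tree names are illustrative placeholders, not actual XTAG tree names.
TOY_LEXICON = {
    "dogs": {"NP_noun"},
    "bark": {"S_intransitive", "NP_noun"},
    "cats": {"NP_noun"},
    "chase": {"S_transitive"},
    "quickly": {"VP_adverb"},
}


def select_trees(sentence):
    """Return, for each word of the sentence, the trees that the word anchors."""
    return {word: TOY_LEXICON.get(word, set()) for word in sentence.split()}


def grammar_size():
    """Count the (word, tree) pairs in the whole toy grammar."""
    return sum(len(trees) for trees in TOY_LEXICON.values())


selected = select_trees("dogs bark")
candidates = sum(len(trees) for trees in selected.values())
print(selected)
print(f"{candidates} candidate trees instead of {grammar_size()}")
```

In this toy setting the two-word input is parsed with three candidate trees, where the full lexicon would contribute six.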
In the XTAG system, the selection of trees by the words is done in
several steps. Each step attempts to reduce ambiguity, i.e., to reduce the
number of trees selected by the words in the sentence.
- Morphological Analysis and POS Tagging
- The input sentence is
first submitted to the Morphological Analyzer and the Tagger. The morphological analyzer ([#!karp92!#]) consists of a
disk-based database (a compiled version of the derivational rules)
which is used to map an inflected word into its stem, part of speech
and feature equations corresponding to inflectional information.
These features are inserted at the anchor node of the tree
eventually selected by the stem. The POS Tagger can be disabled, in
which case only information from the morphological analyzer is used.
The morphology data was originally extracted from the Collins
English Dictionary ([#!ced79!#]) and Oxford Advanced Learner's
Dictionary ([#!oald74!#]) available through ACL-DCI
([#!liberman89!#]), and then cleaned up and augmented by hand
([#!karp92!#]).
- POS Blender
- The outputs of the morphological analyzer and the
POS tagger go into the POS Blender, which uses the output of
the POS tagger as a filter on the output of the morphological
analyzer. Any words that are not found in the morphological database
are assigned the POS given by the tagger.
- Syntactic Database
- The syntactic database contains the mapping
between particular stem(s) and the tree templates or tree-families
stored in the Tree Database (see Table 3.1). The
syntactic database also contains a list of feature equations that
capture lexical idiosyncrasies. The output of the POS Blender is
used to search the Syntactic Database to produce a set of
lexicalized trees with the feature equations associated with the
word(s) in the syntactic database unified with the feature equations
associated with the trees. Note that the features in the syntactic
database can be assigned to any node in the tree and not just to the
anchor node. The syntactic database entries were originally
extracted from the Oxford Advanced Learner's Dictionary
([#!oald74!#]) and Oxford Dictionary for Contemporary Idiomatic
English ([#!cie75!#]) available through ACL-DCI
([#!liberman89!#]), and then modified and augmented by hand
([#!EgediMartin94!#]). There are more than 31,000 syntactic
database entries.
Selected entries from this database are shown in
Table 3.2.
- Default Assignment
- For words that are not found in the
syntactic database, default trees and tree-families are assigned
based on their POS tag.
- Filters
- Some of the lexicalized trees chosen in previous stages
can be eliminated in order to reduce ambiguity. Two methods are
currently used: structural filters, which eliminate trees that have
impossible spans over the input sentence, and a statistical filter
based on unigram probabilities of non-lexicalized trees (from a
hand-corrected set of approximately 6000 parsed sentences). These methods
speed up the runtime by approximately 87%.
- Supertagging
- Before parsing, the sentence can optionally be
supertagged. This step uses statistical
disambiguation to assign a unique elementary tree (or supertag) to each word in the sentence. These assignments can
then be hand-corrected. These supertags are used as a filter on the
tree assignments made so far. More information on supertagging can
be found in ([#!srini97diss!#,#!srini97iwpt!#]).
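The selection steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the XTAG implementation: the dictionaries `MORPH_DB`, `SYN_DB` and `DEFAULT_TREES`, their entries, the tree names and all function names are invented for the example, and feature unification is reduced to carrying the inflectional features along with each selected tree.

```python
# All data below is invented for illustration; the real XTAG databases differ.
MORPH_DB = {
    # inflected form -> list of (stem, POS, inflectional features)
    "saw": [("see", "V", {"tense": "past"}), ("saw", "N", {"num": "sing"})],
    "dogs": [("dog", "N", {"num": "plur"})],
}
SYN_DB = {
    # (stem, POS) -> tree templates or tree families (hypothetical names)
    ("see", "V"): ["family_transitive"],
    ("saw", "N"): ["tree_NP"],
    ("dog", "N"): ["tree_NP"],
}
DEFAULT_TREES = {"N": ["tree_NP"], "V": ["family_transitive"]}


def pos_blend(word, tagger_pos=None):
    """Filter the morphological analyses of a word by the tagger's POS.

    tagger_pos is None when the tagger is disabled. A word missing from the
    morphological database is assigned the POS given by the tagger.
    """
    analyses = MORPH_DB.get(word)
    if analyses is None:
        return [(word, tagger_pos, {})]
    if tagger_pos is None:
        return list(analyses)
    # Assumption: if the filter removes every analysis, nothing is returned;
    # the text does not say how that case is handled.
    return [a for a in analyses if a[1] == tagger_pos]


def select_trees(word, tagger_pos=None):
    """Look up each blended analysis, falling back to POS-based defaults."""
    selected = []
    for stem, pos, feats in pos_blend(word, tagger_pos):
        trees = SYN_DB.get((stem, pos))
        if trees is None:
            # Default assignment for stems absent from the syntactic database.
            trees = DEFAULT_TREES.get(pos, [])
        selected.extend((tree, stem, feats) for tree in trees)
    return selected


def supertag_filter(selected, supertag):
    """Keep only the selections whose tree matches the assigned supertag."""
    return [entry for entry in selected if entry[0] == supertag]


print(select_trees("saw", "V"))    # tagger output filters out the noun reading
print(select_trees("saw"))         # tagger disabled: both readings survive
print(select_trees("blick", "N"))  # unknown word: default trees for its POS
```

Each stage only narrows or fills in the set produced by the previous one, so any stage (the tagger, the supertag filter) can be switched off without changing the others.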
XTAG Project
http://www.cis.upenn.edu/~xtag