Punctuation Marks
Many parsers require that punctuation be stripped out of the
input. Since punctuation is often optional, this sometimes has no
effect. However, there are a number of constructions which
obligatorily contain punctuation, and adding analyses of these to the
grammar without the punctuation would lead to severe
overgeneration. An especially common example is noun
appositives. Without access to punctuation, one would have to allow
every combinatorial possibility of NPs in noun sequences, which is
clearly undesirable (especially since there is already unavoidable
noun-noun compounding ambiguity). Aside from coverage issues, it is
also preferable to take input ``as is'' and do as little editing as
possible. With the addition of punctuation to the XTAG grammar, we
need only perform (or assume) the conversion of certain sequences of
punctuation into the ``British'' order (this is discussed in more
detail below in Section 24.2).
The XTAG POS tagger currently tags every punctuation mark as
itself. These tags are all converted to the POS tag Punct before
parsing. This allows us to treat the punctuation marks as a single POS
class. They then have features which distinguish amongst them.
Wherever possible, the punctuation marks serve as anchors, to
facilitate early filtering.
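
The tag conversion described above can be illustrated with a minimal Python sketch. It is not the XTAG implementation: the function name `normalize_punct_tags`, the (token, tag) pair representation and the Penn-style word tags in the example are assumptions made only for illustration. The one detail taken from the text is that a punctuation mark arrives tagged as itself and leaves with the single POS tag Punct.

```python
import string

# The single POS class named in the text.
PUNCT_TAG = "Punct"


def normalize_punct_tags(tagged):
    """Replace the tag of every self-tagged punctuation token with Punct.

    `tagged` is a list of (token, tag) pairs (an assumed representation).
    A token counts as punctuation here if it is non-empty, is tagged as
    itself, and consists only of ASCII punctuation characters.
    """
    result = []
    for token, tag in tagged:
        is_punct = (
            bool(token)
            and tag == token
            and all(ch in string.punctuation for ch in token)
        )
        result.append((token, PUNCT_TAG if is_punct else tag))
    return result


# Hypothetical tagger output; the word tags are illustrative only.
sentence = [
    ("John", "NNP"), (",", ","), ("my", "PRP$"), ("friend", "NN"),
    (",", ","), ("left", "VBD"), (".", "."),
]
print(normalize_punct_tags(sentence))
```

Collapsing the tags this way leaves the distinction between individual marks to be recovered from features, as the text goes on to describe.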
The full set of punctuation marks is separated into three classes:
balanced, separating and terminal. The balanced punctuation marks are
quotes and parentheses, separating are commas, dashes, semi-colons and
colons, and terminal are periods, exclamation points and question
marks. Thus, the <punct> feature is complex (like the <agr> feature), yielding feature equations like <Punct bal = paren> or <Punct term = excl>. Separating and terminal
punctuation marks do not occur adjacent to other members of the same
class, but may occasionally occur adjacent to members of the other
class, e.g. a question mark on a clause which is separated by a dash
from a second clause. Balanced punctuation marks are sometimes adjacent
to one another, e.g. quotes immediately inside of parentheses. The <punct> feature allows us to control these local
interactions.
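
As a rough illustration of the three classes and the local adjacency restriction, the following Python sketch maps each mark to a class and value and checks whether two adjacent marks are ruled out by class alone. Only the class names `bal` and `term` and the values `paren`, `excl`, `dquote` and `colon` appear in this section; the class name `sep`, the remaining value names and all function names are hypothetical.

```python
# Class -> {mark: value}. Names other than bal, term, paren, excl,
# dquote and colon are assumptions made for this sketch.
PUNCT_CLASSES = {
    "bal": {'"': "dquote", "'": "squote", "(": "paren", ")": "paren"},
    "sep": {",": "comma", "--": "dash", ";": "scolon", ":": "colon"},
    "term": {".": "per", "!": "excl", "?": "qmark"},
}


def punct_features(mark):
    """Return the (class, value) pair for a punctuation mark."""
    for cls, members in PUNCT_CLASSES.items():
        if mark in members:
            return cls, members[mark]
    raise KeyError(f"not a known punctuation mark: {mark!r}")


def feature_equation(mark):
    """Render a feature equation in the style used in the text."""
    cls, value = punct_features(mark)
    return f"<Punct {cls} = {value}>"


def may_be_adjacent(left, right):
    """Return False only when the class restriction rules the pair out.

    Separating and terminal marks do not occur next to a member of
    their own class; every other combination is left open.
    """
    left_cls, _ = punct_features(left)
    right_cls, _ = punct_features(right)
    if left_cls == right_cls and left_cls in ("sep", "term"):
        return False
    return True


print(feature_equation("!"))       # <Punct term = excl>
print(may_be_adjacent(",", ";"))   # False: two separating marks
print(may_be_adjacent("?", "--"))  # True: terminal next to separating
```

A `True` result means only that the class restriction does not exclude the pair; the grammar may still reject it on other grounds.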
We also need to control non-local interaction of punctuation
marks. Two cases of this are so-called quote alternation, wherein
embedded quotation marks must alternate between single and double, and
the impossibility of embedding an item containing a colon inside of
another item containing a colon. Thus, we have a fourth value for <punct>, <contains colon/dquote/etc. +/->, which
indicates whether or not a constituent contains a particular
punctuation mark. This feature is percolated through all auxiliary
trees. The prohibited embeddings are: colons under colons,
semi-colons, dashes or commas; and semi-colons under semi-colons or commas.
Although it is rare, parentheses may appear inside of parentheses, say
with a bibliographic reference inside a parenthesized sentence.
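
The non-local restriction can be sketched in Python as a set of "contains" flags that percolate upward each time constituents are joined. This is one reading of the embedding list above, not the grammar's actual feature equations: the `Constituent` class, the `combine` function, the table `BLOCKED_UNDER` and the value names `scolon`, `dash` and `comma` are all hypothetical. Quote alternation is not modeled.

```python
from dataclasses import dataclass

# inner mark -> outer marks under which a constituent containing the
# inner mark may not be embedded (one reading of the list in the text).
BLOCKED_UNDER = {
    "colon": {"colon", "scolon", "dash", "comma"},
    "scolon": {"scolon", "comma"},
}


@dataclass
class Constituent:
    label: str
    contains: frozenset = frozenset()  # marks found anywhere inside


def combine(outer_mark, parts, label="X"):
    """Join `parts` with the separating mark `outer_mark`.

    Raises ValueError if any part already contains a mark that may not
    be embedded under `outer_mark`; otherwise returns a new constituent
    whose `contains` set is percolated up from all parts.
    """
    for part in parts:
        for inner in part.contains:
            if outer_mark in BLOCKED_UNDER.get(inner, ()):
                raise ValueError(
                    f"{inner} may not be embedded under {outer_mark}"
                )
    contains = frozenset({outer_mark}).union(*(p.contains for p in parts))
    return Constituent(label, contains)


clause = Constituent("S")
with_colon = combine("colon", [clause, clause])
print(sorted(with_colon.contains))  # ['colon']

# A comma-joined item under a colon is allowed ...
ok = combine("colon", [combine("comma", [clause, clause]), clause])
print(sorted(ok.contains))          # ['colon', 'comma']

# ... but a colon-containing item under a comma is rejected.
try:
    combine("comma", [with_colon, clause])
except ValueError as err:
    print("rejected:", err)
```

Because the flags are carried on every constituent, the check stays local to each combination step even though the restriction it enforces is non-local.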
XTAG Project
http://www.cis.upenn.edu/~xtag