Next: Appositives, parentheticals and vocatives Up: Other Constructions Previous: Future Work

   
Punctuation Marks

Many parsers require that punctuation be stripped out of the input. Since punctuation is often optional, this sometimes has no effect. However, a number of constructions obligatorily contain punctuation, and adding analyses of these to the grammar without the punctuation would lead to severe overgeneration. An especially common example is noun appositives. Without access to punctuation, one would have to allow every combinatorial possibility of NPs in noun sequences, which is clearly undesirable (especially since there is already unavoidable noun-noun compounding ambiguity). Aside from coverage issues, it is also preferable to take input "as is" and do as little editing as possible. With the addition of punctuation to the XTAG grammar, we need only perform (or assume) the conversion of certain sequences of punctuation into the "British" order (this is discussed in more detail below in Section 24.2).

The XTAG POS tagger currently tags every punctuation mark as itself. These tags are all converted to the POS tag Punct before parsing, which allows us to treat the punctuation marks as a single POS class. Features then distinguish among the individual marks. Wherever possible, the punctuation marks serve as anchors, to facilitate early filtering.

The full set of punctuation marks is separated into three classes: balanced, separating and terminal. The balanced punctuation marks are quotes and parentheses; the separating marks are commas, dashes, semi-colons and colons; and the terminal marks are periods, exclamation points and question marks. Thus, the <punct> feature is complex (like the <agr> feature), yielding feature equations like <Punct bal = paren> or <Punct term = excl>. Separating and terminal punctuation marks do not occur adjacent to other members of the same class, but may occasionally occur adjacent to members of the other class, e.g. a question mark on a clause which is separated by a dash from a second clause.
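The three-way classification and the same-class adjacency restriction can be illustrated with a minimal Python sketch. This is not XTAG code: the names PUNCT_CLASSES, tag_punct and may_be_adjacent are hypothetical, and only the values paren, excl, colon and dquote appear in the text; the remaining value names, and the use of "--" for a dash, are assumptions. The adjacency check is deliberately coarse and restricts only what the text states.

```python
# Hypothetical sketch; none of these names come from the XTAG system.
# Class keys follow the feature equations in the text (bal, term); "sep" and
# most value names (squote, comma, dash, scolon, per, qmark) are assumed.
PUNCT_CLASSES = {
    "bal": {'"': "dquote", "'": "squote", "(": "paren", ")": "paren"},
    "sep": {",": "comma", "--": "dash", ";": "scolon", ":": "colon"},
    "term": {".": "per", "!": "excl", "?": "qmark"},
}


def tag_punct(mark):
    """Map a mark to the single POS tag Punct plus a class/value feature."""
    for cls, members in PUNCT_CLASSES.items():
        if mark in members:
            return ("Punct", {cls: members[mark]})
    raise ValueError(f"not a known punctuation mark: {mark!r}")


def may_be_adjacent(first, second):
    """Reject two separating marks or two terminal marks in a row.

    Every other combination is allowed here, which is an assumption:
    the text only rules out same-class separating/terminal sequences.
    """
    (cls1,) = tag_punct(first)[1]
    (cls2,) = tag_punct(second)[1]
    return not (cls1 == cls2 and cls1 in ("sep", "term"))


print(tag_punct("!"))
print(may_be_adjacent("?", "--"))  # terminal followed by separating
print(may_be_adjacent(",", ";"))   # two separating marks
```

Keeping one POS tag and pushing the distinctions into a feature structure mirrors the design described above: trees can select for Punct in general, or for a specific class or mark, through the same mechanism.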
Balanced punctuation marks are sometimes adjacent to one another, e.g. quotes immediately inside parentheses. The <punct> feature allows us to control these local interactions.

We also need to control non-local interactions between punctuation marks. Two cases of this are so-called quote alternation, wherein embedded quotation marks must alternate between single and double, and the impossibility of embedding an item containing a colon inside another item containing a colon. Thus, we have a fourth value for <punct>, <contains colon/dquote/etc. +/->, which indicates whether or not a constituent contains a particular punctuation mark. This feature is percolated through all auxiliary trees. The combinations which may not embed are: colons under colons, semi-colons, dashes or commas; and semi-colons under semi-colons or commas. Although it is rare, parentheses may appear inside parentheses, say with a bibliographic reference inside a parenthesized sentence.
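As a rough illustration of the contains mechanism, the following Python sketch percolates the set of contained marks up a toy constituent tree and checks the embedding restrictions listed above. It is a sketch under stated assumptions, not the XTAG implementation: Constituent, MAY_NOT_EMBED_UNDER and can_embed are hypothetical names, a plain set stands in for the +/- feature values, and quote alternation is left out.

```python
from dataclasses import dataclass, field

# Restrictions as listed in the text: key = mark contained in the embedded
# item, value = marks under which that item may not be embedded.
# The table name and the mark names scolon/dash/comma are assumptions.
MAY_NOT_EMBED_UNDER = {
    "colon": {"colon", "scolon", "dash", "comma"},
    "scolon": {"scolon", "comma"},
}


@dataclass
class Constituent:
    """Toy stand-in for a derived constituent (hypothetical class)."""
    label: str
    marks: frozenset = frozenset()            # marks introduced at this node
    children: list = field(default_factory=list)

    def contains(self):
        """Percolate upward: every mark found anywhere in this constituent."""
        found = set(self.marks)
        for child in self.children:
            found |= child.contains()
        return found


def can_embed(inner, outer_mark):
    """True if `inner` may sit under an item built with `outer_mark`."""
    return all(
        outer_mark not in MAY_NOT_EMBED_UNDER.get(mark, set())
        for mark in inner.contains()
    )


# A clause whose NP introduces a colon: the colon is visible at the clause.
clause = Constituent("S", children=[Constituent("NP", frozenset({"colon"}))])
print(clause.contains())
print(can_embed(clause, "comma"))
print(can_embed(Constituent("S", frozenset({"paren"})), "paren"))
```

Because the check consults the percolated set rather than the immediate daughters, a colon buried several levels down still blocks embedding under a comma, which is the non-local behaviour the contains value is meant to capture.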

 
XTAG Project
http://www.cis.upenn.edu/~xtag