Next: Comparison with Alvey
Up: Evaluation and Results
Previous: Chunking and Dependencies in
The evaluation in this section was done with the earlier 1995 release
of the grammar. This section describes an experiment to measure the
crossing-bracket accuracy of XTAG parses of IBM-manual sentences. In
this experiment, the XTAG parses of 1100 IBM-manual sentences were
ranked using certain heuristics, and the ranked parses were compared
against the bracketing given in the Lancaster Treebank of IBM-manual
sentences.
Table G.5 shows the results obtained by XTAG in this experiment. It
also shows the results of the latest IBM statistical grammar
(Jelinek et al., 1994) on the same genre of sentences. Only the
highest-ranked parse from each system was used for this evaluation.
Crossing Bracket Accuracy is the percentage of sentences in which no
pair of brackets crosses the Treebank bracketing (e.g. ( ( a b ) c )
has a crossing bracket measure of one when compared to
( a ( b c ) ) ). Recall is the ratio of the number of correct
constituents in the XTAG parse to the number of constituents in the
corresponding Treebank sentence. Precision is the ratio of the number
of correct constituents to the total number of constituents in the
XTAG parse.
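The three measures can be made concrete with a small sketch. It is not
the scoring code used in the experiment. It assumes that trees are
given as nested lists of words, that every bracket counts as one
constituent identified only by its word span (labels are ignored and
brackets with the same span are merged), and that a constituent is
"correct" when the same span occurs in the Treebank bracketing. The
names `spans`, `crosses`, `crossing_count` and `precision_recall` are
hypothetical.

```python
from typing import Set, Tuple, Union

Tree = Union[str, list]   # a word, or a bracket containing subtrees
Span = Tuple[int, int]    # half-open word interval [start, end)


def spans(tree: Tree) -> Set[Span]:
    """Collect the word span of every bracket (list node) in the tree."""
    found: Set[Span] = set()

    def walk(node: Tree, start: int) -> int:
        if isinstance(node, str):      # a bare word occupies one position
            return start + 1
        end = start
        for child in node:
            end = walk(child, end)
        found.add((start, end))
        return end

    walk(tree, 0)
    return found


def crosses(a: Span, b: Span) -> bool:
    """True if the spans overlap and neither contains the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def crossing_count(parse: Tree, gold: Tree) -> int:
    """Number of parse brackets that cross at least one gold bracket."""
    gold_spans = spans(gold)
    return sum(any(crosses(p, g) for g in gold_spans) for p in spans(parse))


def precision_recall(parse: Tree, gold: Tree) -> Tuple[float, float]:
    """Precision and recall of parse brackets against gold brackets.

    Both trees are assumed to contain at least one bracket.
    """
    p, g = spans(parse), spans(gold)
    correct = len(p & g)
    return correct / len(p), correct / len(g)


# The example from the text: ( ( a b ) c ) against ( a ( b c ) ).
parse = [["a", "b"], "c"]
gold = ["a", ["b", "c"]]
print(crossing_count(parse, gold))    # 1
print(precision_recall(parse, gold))  # (0.5, 0.5)
```

Under these assumptions a sentence contributes to crossing bracket
accuracy only when `crossing_count` returns zero for it.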
System                  | # of sentences | Crossing Bracket Accuracy | Recall | Precision
XTAG                    | 1100           | 81.29%                    | 82.34% | 55.37%
IBM Statistical grammar | 1100           | 86.20%                    | 86.00% | 85.00%
Table G.5: Performance of XTAG on IBM-manual sentences
As can be seen from Table G.5, the precision figure for
the XTAG system is considerably lower than that for IBM. For the
purposes of comparative evaluation against other systems, we had to
use the same crossing-brackets metric, though we believe that the
crossing-brackets measure is inadequate for evaluating a grammar like
XTAG. There are two reasons for the inadequacy. First, the parse
generated by XTAG is much richer in its representation of the internal
structure of certain phrases than the parses in manually created
treebanks (e.g. IBM: [N your personal computer], XTAG: [NP
[G your] [N [N personal] [N computer]]]). This is
reflected in the number of constituents per sentence, shown in the
last column of Table G.6.
System            | Sent. Length | # of sent | Av. # of words/sent | Av. # of Constituents/sent
XTAG              | 1-10         | 654       | 7.45                | 22.03
XTAG              | 1-15         | 978       | 9.13                | 30.56
IBM Stat. Grammar | 1-10         | 447       | 7.50                | 4.60
IBM Stat. Grammar | 1-15         | 883       | 10.30               | 6.40
Table G.6: Constituents in XTAG parse and IBM parse
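The effect of the richer structure on precision can be worked through
on the phrase quoted above. This is only an illustrative sketch: it
assumes unlabeled span matching and that every bracket, including a
single-word one, counts as a constituent. The helper `bracket_spans`
is hypothetical and is not part of the XTAG tools.

```python
def bracket_spans(tree):
    """Return the set of (start, end) word spans, one per bracket."""
    found = set()

    def walk(node, start):
        if isinstance(node, str):      # a bare word occupies one position
            return start + 1
        end = start
        for child in node:
            end = walk(child, end)
        found.add((start, end))
        return end

    walk(tree, 0)
    return found


# IBM:  [N your personal computer]
ibm = ["your", "personal", "computer"]
# XTAG: [NP [G your] [N [N personal] [N computer]]]
xtag = [["your"], [["personal"], ["computer"]]]

ibm_spans = bracket_spans(ibm)
xtag_spans = bracket_spans(xtag)
correct = xtag_spans & ibm_spans

precision = len(correct) / len(xtag_spans)
recall = len(correct) / len(ibm_spans)

print(len(ibm_spans), len(xtag_spans))  # 1 5
print(precision, recall)                # 0.2 1.0
```

In this sketch the XTAG bracketing contains the single Treebank
bracket and crosses nothing, yet four of its five brackets have no
counterpart in the Treebank, so precision drops while recall is
unaffected.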
A second reason for considering the crossing bracket measure
inadequate for evaluating XTAG is that the primary structure in XTAG
is the derivation tree from which the bracketed tree is derived. Two
identical bracketings for a sentence can have completely different
derivation trees (e.g. kick the bucket as an idiom vs. a
compositional use). A more direct measure of the performance of XTAG
would evaluate the derivation structure, which captures the
dependencies between words.
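The contrast between bracketing and derivation structure can be
sketched as follows. The representation is a deliberate
simplification and an assumption of this example: a derivation tree is
reduced to a set of (dependent anchor, head anchor) edges, the idiom
is modeled as a single elementary tree anchored by all three words,
and `shared_edges` is a hypothetical helper rather than an XTAG
evaluation measure.

```python
# Bracketed trees as nested lists: the same for both readings of
# "John kicked the bucket".
idiom_brackets = [["John"], [["kicked"], [["the"], ["bucket"]]]]
literal_brackets = [["John"], [["kicked"], [["the"], ["bucket"]]]]

# Derivation structure as (dependent anchor, head anchor) edges.
# Idiomatic reading: one tree anchored by "kicked the bucket",
# with "John" attached to it.
idiom_derivation = {("John", "kicked the bucket")}
# Compositional reading: "John" and "bucket" attach to "kicked",
# and "the" attaches to "bucket".
literal_derivation = {
    ("John", "kicked"),
    ("bucket", "kicked"),
    ("the", "bucket"),
}


def shared_edges(a, b):
    """Edges present in both derivations."""
    return a & b


print(idiom_brackets == literal_brackets)                 # True
print(shared_edges(idiom_derivation, literal_derivation))  # set()
```

A bracket-based score cannot distinguish the two readings here, while
a comparison over derivation edges separates them completely.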
XTAG Project
http://www.cis.upenn.edu/~xtag