Korean is one of the major languages in
multilingual NLP research at the University
of Pennsylvania. This site introduces three main projects on Korean
NLP currently being conducted at Penn: Korean XTAG, Korean Treebank, and
Korean/English machine translation. These projects are partially funded by
the Army Research Lab via a subcontract from CoGenTex, Inc., and by NSF
Grant SBR 8920230.
Korean XTAG
Korean XTAG is an ongoing project to develop a wide-coverage grammar
for Korean using the Feature-Based Lexicalized Tree Adjoining Grammar (LTAG)
formalism. As its grammar development system, it uses the XTAG system, which
we have customized for Korean TAG development. The XTAG system was originally
developed for English TAG and consists of a parser, an X-windows grammar
development interface, and a POS tagger. We have modified the
original XTAG system and incorporated a Korean morphological analyzer to
handle the rich inflectional morphology of Korean and to facilitate lexicon
development and parsing. A fuller description of the Korean XTAG system can be
found in our TAG+5 workshop paper.
LTAG is based on the Tree Adjoining Grammar (TAG) formalism developed by
Joshi, Levy and Takahashi (1975). The TAG formalism in general, and
lexicalized TAG in particular, are well-suited for linguistic applications.
An LTAG consists of a finite set of elementary trees anchoring lexical
items and composition operations of substitution and adjunction. The
elementary trees represent extended projections of lexical items and
encapsulate syntactic/semantic arguments of the lexical anchor. In the
last decade, the LTAG approach has been applied to various NLP tasks such
as parsing, machine translation, information retrieval, generation, and
summarization. An introduction to LTAG and the
current status of our Korean LTAG grammar are documented in our technical report.
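The two composition operations can be illustrated with a minimal sketch. It is a toy, not the XTAG implementation: the Node class, the function names, and the three English elementary trees are all assumptions made for illustration, and features are left out entirely.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                        # a category (S, NP, VP, ...) or a word
    children: List["Node"] = field(default_factory=list)
    subst: bool = False               # frontier node marked for substitution
    foot: bool = False                # foot node of an auxiliary tree


def substitute(tree: Node, initial: Node) -> bool:
    """Replace the first matching substitution site below `tree` with `initial`."""
    for i, child in enumerate(tree.children):
        if child.subst and child.label == initial.label:
            tree.children[i] = initial
            return True
        if substitute(child, initial):
            return True
    return False


def find_foot(tree: Node) -> Optional[Node]:
    """Return the foot node of an auxiliary tree, if it has one."""
    if tree.foot:
        return tree
    for child in tree.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None


def adjoin(tree: Node, aux: Node) -> bool:
    """Adjoin `aux` at the first internal node below `tree` whose label
    matches the root of `aux`. Note that `aux` is modified in place, and
    adjunction at the root of `tree` itself is not handled in this sketch."""
    for i, child in enumerate(tree.children):
        if child.label == aux.label and child.children and not child.subst:
            foot = find_foot(aux)
            if foot is None:
                raise ValueError("auxiliary tree has no foot node")
            foot.children = child.children  # excised subtree hangs under the foot
            foot.foot = False
            tree.children[i] = aux
            return True
        if adjoin(child, aux):
            return True
    return False


def leaves(tree: Node) -> List[str]:
    """Read the frontier of a tree from left to right."""
    if not tree.children:
        return [tree.label]
    return [word for child in tree.children for word in leaves(child)]


# Hypothetical elementary trees (English words used only for readability).
sleeps = Node("S", [Node("NP", subst=True),
                    Node("VP", [Node("V", [Node("sleeps")])])])
mary = Node("NP", [Node("Mary")])
often = Node("VP", [Node("Adv", [Node("often")]), Node("VP", foot=True)])

substitute(sleeps, mary)   # fills the NP substitution site
adjoin(sleeps, often)      # inserts the auxiliary tree at the VP node
sentence = leaves(sleeps)
```

Substitution fills an argument slot on the frontier of the anchor's tree, while adjunction splices a modifier tree into the middle of it; this is why a verb's elementary tree can state all of its arguments locally.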
Korean Treebank
A Treebank is a corpus annotated with morphological and syntactic
information. Each word in the corpus is annotated with morpho-syntactic
tags and each sentence is bracketed to represent its structural analysis.
This kind of corpus has served as an extremely valuable resource for
computational linguistics applications, and has also proved useful in
theoretical linguistics research.
Annotation Format
For syntactic bracketing, we use a phrase structure annotation. Similar
phrase structure annotation schemes were also used by the Penn English Treebank, the
Penn Middle English Treebank, and the Penn Chinese Treebank.
This annotation is preferable to a pure dependency
annotation because with a phrase structure annotation we can encode richer
structural information than with dependency annotation, as illustrated
below:
- Phrase structure annotation has phrasal level node labels such as VP
and NP, whereas dependency annotation does not have any node labels.
- Phrase structure annotation can explicitly represent empty arguments,
but dependency annotation cannot.
- Phrase structure annotation can distinguish between
complementation and adjunction, but dependency annotation cannot.
- Phrase structure annotation can make use of traces for displaced
constituents, whereas dependency annotation cannot.
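As a rough illustration of the points above, the sketch below reads one bracketed tree and lists its empty elements together with the phrasal node that dominates them. The sample sentence, the node labels, and the `*pro*` notation are assumptions in the general style of Penn treebanks; they are not taken from the Korean Treebank guidelines, and English glosses stand in for Korean words.

```python
import re


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a word or an empty element
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return parse_node()


def empty_elements(tree):
    """Collect (dominating label, empty element) pairs, e.g. dropped arguments."""
    label, children = tree
    found = []
    for child in children:
        if isinstance(child, tuple):
            found.extend(empty_elements(child))
        elif child.startswith("*"):
            found.append((label, child))
    return found


def overt_words(tree):
    """Collect the pronounced words, skipping empty elements."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(overt_words(child))
        elif not child.startswith("*"):
            words.append(child)
    return words


# Hypothetical annotation of a subject-less, verb-final sentence.
sample = "(S (NP-SBJ *pro*) (VP (NP-OBJ (N book)) (V read)))"
tree = parse_bracketed(sample)
empties = empty_elements(tree)
words = overt_words(tree)
```

Here the dropped subject is recoverable from the annotation as an NP-SBJ node dominating `*pro*`, even though it contributes no word to the sentence.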
Corpus
The corpus for the Korean Treebank project consists of texts from military
language training manuals. These texts contain information about various
aspects of the military, such as troop movement, intelligence gathering,
and equipment supplies, among others. The texts in the manuals were
originally in printed form, and in order to use them for our Treebank, we
converted the manuals into a machine-readable form. This corpus contains
54,366 words and 5,078 sentences.
Guidelines and a Sample File
Applications
- The linguistic information in the Korean Treebank will provide a standard
framework in which to train and evaluate tools such as POS taggers and
stochastic parsers.
- The Treebank will also be used to extract lexicalized grammars,
e.g. a Korean Tree Adjoining Grammar, which can be used for other
applications, such as natural language generation. There are already tools
developed at Penn that train parsers and extract Tree Adjoining Grammars
from a phrase-structure based Treebank (Xia 1999), which will be equally
applicable to the Korean Treebank.
- Having an on-line corpus of parsed texts will be extremely useful for
research in corpus linguistics and will lead to many interesting theoretical
results.
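To make the first application concrete, here is a minimal sketch of training and evaluating a tagger on tagged sentences. It is a most-frequent-tag baseline, not any tagger built in this project, and the word forms (w1, w2, ...) and tags (N, V) are placeholders rather than Korean Treebank data.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Learn the most frequent tag per word, plus a fallback tag for unseen words."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            overall[tag] += 1
    model = {word: tags.most_common(1)[0][0] for word, tags in per_word.items()}
    default_tag = overall.most_common(1)[0][0]
    return model, default_tag


def tag_words(model, default_tag, words):
    """Tag each word with its learned tag, or the fallback if it was never seen."""
    return [model.get(word, default_tag) for word in words]


def accuracy(model, default_tag, gold_sentences):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for sentence in gold_sentences:
        words = [word for word, _ in sentence]
        predicted = tag_words(model, default_tag, words)
        for guess, (_, gold) in zip(predicted, sentence):
            correct += guess == gold
            total += 1
    return correct / total


# Placeholder data standing in for a training and a held-out portion of a treebank.
train = [
    [("w1", "N"), ("w2", "N"), ("w3", "V")],
    [("w1", "N"), ("w3", "V")],
]
held_out = [
    [("w1", "N"), ("w9", "N"), ("w3", "V")],
    [("w9", "V"), ("w3", "V")],
]

model, default_tag = train_unigram_tagger(train)
score = accuracy(model, default_tag, held_out)
```

Because the gold tags come from the same annotated corpus, different taggers can be compared on the same held-out portion.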
Korean Morphological Analysis and Tagging
... about Korean morphological analysis and tagging ...
Korean Syntactic Parsing
... about Korean syntactic parsing ...
Korean/English Machine Translation
This is a joint project with CoGenTex and Systran.
Basic Elements of our Approach
Given that Korean and English are very different from each other in
structure and morphology, many challenging problems arise, demanding
sophisticated linguistic analysis. The basic elements of our approach
include:
- Following the model described in Palmer, Rambow and Nasr (1998) for
English/French translation, our system has a plug-and-play architecture
that is composed of state-of-the-art off-the-shelf components in parsing
(and morphological analysis) and generation. These components communicate
with each other via a common predicate-argument structure representation.
- Our system is a hybrid system that profits from a stochastic parser
that was independently trained on domain-general corpora and a hand-crafted
linguistic knowledge base in the form of a predicate-argument lexicon and
linguistically sophisticated transfer rules. The linguistic knowledge base
plays an important role in handling structural divergences and recovering
dropped arguments.
- For defining transfer rules, we use the 'lexico-structural transfer'
framework, which is based on a lexicalized predicate-argument structure.
In this framework, the transfer lexicon does not simply relate words (or
context-free rewrite rules) from one language to words (or context-free
rewrite rules) from another language. Instead, lexemes and their relevant
syntactic structures (essentially, their syntactic projection along with
syntactic/semantic features) are mapped. This framework was applied
previously in English/French and English/Arabic MT
(Nasr et al. 1997; Palmer, Rambow and Nasr 1998).
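The idea of mapping lexemes together with their argument structure, rather than words alone, can be sketched as follows. This is an illustrative toy and not the system's actual representation: the PredArg and TransferRule classes, the role names, the placeholder source lexemes K_GIVE and K_BOOK, and the use of a PRO filler for a dropped argument are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class PredArg:
    """A node of a predicate-argument structure."""
    lexeme: str
    features: Dict[str, str] = field(default_factory=dict)
    args: Dict[str, "PredArg"] = field(default_factory=dict)  # role -> argument


@dataclass
class TransferRule:
    """Maps a source lexeme and its argument roles to a target lexeme and roles."""
    target_lexeme: str
    role_map: Dict[str, str] = field(default_factory=dict)
    required_roles: Tuple[str, ...] = ()  # roles the target predicate must realize


def transfer(node: PredArg, lexicon: Dict[str, TransferRule]) -> PredArg:
    """Recursively map a source structure to a target structure."""
    rule = lexicon.get(node.lexeme)
    if rule is None:
        # No entry in the transfer lexicon: pass the lexeme through unchanged.
        rule = TransferRule(node.lexeme)
    args = {
        rule.role_map.get(role, role): transfer(arg, lexicon)
        for role, arg in node.args.items()
    }
    # Fill any argument the source dropped but the target predicate requires.
    for role in rule.required_roles:
        args.setdefault(role, PredArg("PRO", {"recovered": "yes"}))
    return PredArg(rule.target_lexeme, dict(node.features), args)


# Hypothetical transfer lexicon with placeholder source lexemes.
lexicon = {
    "K_GIVE": TransferRule("give", {"subj": "subj", "obj": "obj"}, ("subj", "obj")),
    "K_BOOK": TransferRule("book"),
}

# Source structure with the subject dropped.
source = PredArg("K_GIVE", {"tense": "past"}, {"obj": PredArg("K_BOOK")})
target = transfer(source, lexicon)
```

Because the rule is attached to the predicate, it can say which arguments the target verb needs, which is where a hand-crafted lexicon can help with dropped arguments; in a plug-and-play design the parser would produce `source` and a generator would consume `target`.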
Corpus
The corpus for this project is a set of Korean/English parallel texts that
consist of battle scenario message traffic and military language training
manuals which contain information on typical military events such as troop
movement, intelligence gathering, and equipment supplies, among others.
Each half has roughly 50,000 word tokens and 5,000 sentences.
Presentations
Papers
- Chung-hye Han, Na-Rae Han and Eon-Suk Ko
Development and Evaluation of a Korean Treebank and its Application to NLP,
Proceedings of the 3rd International Conference on Language Resources and
Evaluation (LREC-2002)
- Chung-hye Han, Na-Rae Han, Eon-Suk Ko, Heejong Yi and Martha Palmer
Penn Korean Treebank: Development and Evaluation,
Proceedings of the 16th Pacific Asia Conference on Language, Information and
Computation. The Korean Society for Language and Information. (2002)
- Chung-hye Han, Na-Rae Han, Eon-Suk Ko
Bracketing Guidelines for Penn Korean TreeBank,
Technical Report, IRCS-01-10 (2001)
- Chung-hye Han, Na-Rae Han
Part of Speech Tagging Guidelines for Penn Korean Treebank,
Technical Report, IRCS-01-09 (2001)
- Chung-hye Han, Juntae Yoon, Nari Kim and Martha Palmer
A Feature-Based Lexicalized Tree Adjoining Grammar for Korean,
Technical Report, IRCS-00-04 (2000)
- Chung-hye Han, Benoit Lavoie, Martha Palmer, Owen Rambow, Richard
Kittredge, Tanya Korelsky, Nari Kim and Myunghee Kim
Handling Structural Divergences and Recovering
Dropped Arguments in a Korean/English Machine Translation System
Proceedings of the Association for Machine Translation in the
Americas '2000.
Published in Lecture Notes in AI series of Springer-Verlag,
© Springer-Verlag (2000).
- Juntae Yoon, Chung-hye Han, Nari Kim and Mee-sook Kim
Customizing the XTAG system for efficient grammar development
for Korean
Proceedings of the Fifth International Workshop on Tree Adjoining
Grammars and Related Formalisms, TAG+ 5 (2000).
- Chung-hye Han and Owen Rambow
The Sino-Korean light verb construction and lexical argument structure
Proceedings of the Fifth International Workshop on Tree Adjoining
Grammars and Related Formalisms, TAG+ 5 (2000)
- Martha Palmer, Dania Egedi, Chunghye Han, Fei Xia, and Joseph Rosenzweig.
Constraining Lexical Selection Across Languages Using TAGs.
Tree Adjoining Grammars: Formal, Computational and Linguistic
Aspects (TAG+ 3 Workshop Proceedings)
Eds. Anne Abeille and Owen Rambow, CSLI,
Stanford (2000).
- Chung-hye Han, Fei Xia, Martha Palmer, Joseph Rosenzweig.
Capturing Language Specific Constraints on Lexical Selection with Feature-Based Lexicalized Tree-Adjoining Grammars
Proceedings of International Conference on Chinese Computing '96 (ICCC '96).
People
Faculty
Graduate Students
Staff
Visitors
- Seunghun Lee (Rutgers University)
- Sung-Dong Kim (Hansung University, Korea)
- Sinwon Yoon (Paris 7 University, France)
Thanks to
- Mee-sook Kim (participated from Nov. 1999 to July 2000)
- Nari Kim (participated from Mar. 1998 to Dec. 1999, now at Konan Technology, Inc.)
- Juntae Yoon (participated from Mar. 1999 to Mar. 2000, now at Daum Communications)
- Jong-Cheol Park (participated at the very early phase of the project,
now at KAIST)
- Heejong Yi (participated in 1998)
- Eon-Suk Ko (participated from Spring 1998 to Spring 2000)
- Seungyun Yang (participated from Spring 1999 to Spring 2000)
- Myuncheol Kim (participated from Spring 1999 to Spring 2000)
- Chung-hye Han (participated from Spring 1998 to August 2001)
- Chulwoo Park (participated from Spring 1999 to February 2002)
Some Links to NLP in Korea
This web page is maintained by
Chung-hye Han
Last changed: $Date: 2004/08/18 20:31:03 $