My research is concerned with how best to interconnect, query, and
update heterogeneous data-producing components in a networked world.
Recent decades have produced not only a plethora of new data, but also a
proliferation of different data representations and data versions. I seek
to build fast and robust algorithms and systems for
helping to inter-relate, find, update, and synchronize ("reconcile") data
in this world.
CopyCat
In collaboration with the USC Information Sciences Institute (led by Craig
Knoblock) and Fetch Technologies (led by Steve Minton), the CopyCat project
considers the problem of how to make it easy for users to
author, use, and debug mappings for one-time integration tasks. The system
presents a spreadsheet-like workspace, into which the user may paste columns
and rows of data from source applications. The system attempts
to learn what data is being extracted and what queries are being
asked, and it makes auto-complete suggestions that generalize the
user's work. The user provides feedback (either explicitly or by
pasting more data) and the system refines its suggestions accordingly.
Provenance information is used to explain and debug results, and it is also a
foundation for the learning process. An overview paper describes the
system in more detail.
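To make the learn-from-examples loop concrete, here is a minimal sketch of
the idea, not CopyCat's actual implementation: a pattern is generalized
from a few pasted (row, value) examples and then used to auto-complete the
remaining rows. All function names and the example data are hypothetical.

```python
# Sketch only: learn an extraction pattern from pasted examples, then
# suggest completions for unlabeled rows. Names here are invented.
import re

def infer_pattern(examples):
    """Generalize (raw_row, extracted_value) pairs into a regex built
    from the common text surrounding each extracted value."""
    prefixes, suffixes = [], []
    for raw, value in examples:
        i = raw.index(value)
        prefixes.append(raw[:i])
        suffixes.append(raw[i + len(value):])
    prefix = common_suffix(prefixes)   # shared text just before the value
    suffix = common_prefix(suffixes)   # shared text just after the value
    return re.compile(re.escape(prefix) + r"(.+?)" + re.escape(suffix))

def common_prefix(strings):
    s1, s2 = min(strings), max(strings)
    i = 0
    while i < min(len(s1), len(s2)) and s1[i] == s2[i]:
        i += 1
    return s1[:i]

def common_suffix(strings):
    return common_prefix([s[::-1] for s in strings])[::-1]

def suggest(pattern, unlabeled_rows):
    """Auto-complete: apply the learned pattern to rows the user has not
    labeled yet; the user accepts or corrects each guess."""
    return [(row, m.group(1) if (m := pattern.search(row)) else None)
            for row in unlabeled_rows]

# The user pastes two examples; the system generalizes and fills in the rest.
examples = [("Price: $12.50 (USD)", "12.50"), ("Price: $7.99 (USD)", "7.99")]
pattern = infer_pattern(examples)
print(suggest(pattern, ["Price: $3.10 (USD)", "Price: $88.00 (USD)"]))
```

CopyCat's actual hypotheses are far richer, covering extraction and
queries, and each suggestion carries provenance so that results can be
explained and debugged.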
CopyCat has been funded in part by a DARPA IPTO seedling in the area of
"best-effort data integration," and in part by DARPA DSO through the CSSG
program.
Secondary Projects
Querying over distributed, heterogeneous data
Traditional data integration allows many different structured (or
semi-structured) data sources to be mapped to a single umbrella mediated
schema, which can be queried by users. The data integration or
mediator system masks all of the variations in schemas and interfaces, and
presents a uniform interface. The huge challenge in data integration is
gaining consensus about what the mediated schema should be -- with
secondary challenges in extending, maintaining, and modifying the schema as
needs change. Worse, the mediated schema, as the product of global
standardization, may be very different from the way certain users want to
think about their data.
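As a toy illustration of the mediator architecture, the sketch below, with
invented schemas and attribute names, shows how a single query over a
mediated schema can be rewritten into each source's vocabulary; real
systems express such mappings declaratively (e.g., as GAV or LAV rules)
rather than as hard-coded tables.

```python
# Sketch only: a mediator rewrites a query over the mediated schema into
# per-source queries. Schemas and mappings here are invented.
MEDIATED_SCHEMA = ("title", "author", "year")

# How each source's column names line up with the mediated attributes.
SOURCE_MAPPINGS = {
    "library_db":  {"title": "book_name", "author": "writer", "year": "pub_year"},
    "vendor_feed": {"title": "item_title", "author": "creator", "year": "released"},
}

def rewrite(selection, source):
    """Translate {mediated_attr: value} into the source's own vocabulary."""
    mapping = SOURCE_MAPPINGS[source]
    return {mapping[attr]: value for attr, value in selection.items()}

# One query over the mediated schema fans out to every source; the user
# never sees the per-source schemas.
query = {"author": "Codd", "year": 1970}
for source in SOURCE_MAPPINGS:
    print(source, "->", rewrite(query, source))
```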
In the Piazza peer data management system, we have
proposed to make data integration more flexible and decentralized by
eliminating the need for a single central schema: instead, participants or
peers can each provide their own schema, and different peers will be
interrelated via schema mappings. Queries over any schema will be answered
using the transitive closure and merge of all mappings in the system. We
are currently developing techniques for building a corresponding system
implementation in a peer-to-peer fashion, to take advantage of replication
for reliability and performance.
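The following sketch, using invented peers and mappings, illustrates the
transitive reformulation idea: by composing mappings along paths in the
peer graph, a query posed at one peer can be translated for peers it has
no direct mapping to. Piazza's actual mapping language and reformulation
algorithm are considerably more expressive; this toy version only renames
attributes.

```python
# Sketch only: Piazza-style reformulation as BFS over a mapping graph.
from collections import deque

# Directed edges: (from_peer, to_peer) -> attribute renaming.
MAPPINGS = {
    ("peerA", "peerB"): {"name": "full_name"},
    ("peerB", "peerC"): {"full_name": "fn"},
}

def reformulate(query_attrs, start_peer):
    """Compose renamings along each path, yielding every peer's version
    of the query (the 'transitive closure and merge' of mappings,
    ignoring cycles and mapping conflicts for brevity)."""
    results = {start_peer: query_attrs}
    frontier = deque([start_peer])
    while frontier:
        peer = frontier.popleft()
        for (src, dst), renaming in MAPPINGS.items():
            if src == peer and dst not in results:
                results[dst] = [renaming.get(a, a) for a in results[peer]]
                frontier.append(dst)
    return results

# A query over peerA's schema is answered at peerB and peerC as well.
print(reformulate(["name", "age"], "peerA"))
# {'peerA': ['name', 'age'], 'peerB': ['full_name', 'age'], 'peerC': ['fn', 'age']}
```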
The Tukwila query engine is a component of Piazza responsible for
providing high-performance query answering. Within Tukwila, we focus on
adaptive query processing as the primary means to that end. Adaptive query
processing allows the
query engine to "discover" properties of the data as it is executing a
query, and to exploit those characteristics to produce a more efficient
query plan.
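As a simple illustration of the adaptive idea, the sketch below, which is
not Tukwila's implementation, monitors the observed selectivity of each
filter predicate while tuples stream through and periodically reorders the
predicates so that the most selective ones run first.

```python
# Sketch only: a filter operator that learns predicate selectivities at
# run time and reorders itself, one classic adaptive technique.
class AdaptiveFilter:
    def __init__(self, predicates):
        # Per-predicate stats: [predicate, tuples_seen, tuples_passed].
        self.preds = [[p, 0, 0] for p in predicates]

    def process(self, tuples, reorder_every=100):
        for i, t in enumerate(tuples):
            if i % reorder_every == 0:
                # Run the predicate that drops the most tuples first.
                self.preds.sort(key=lambda e: e[2] / e[1] if e[1] else 1.0)
            if all(self._apply(e, t) for e in self.preds):  # short-circuits
                yield t

    @staticmethod
    def _apply(entry, t):
        entry[1] += 1
        passed = entry[0](t)
        entry[2] += int(passed)
        return passed

rows = [{"price": p, "qty": q} for p in range(50) for q in range(50)]
f = AdaptiveFilter([lambda r: r["price"] < 5,      # selective: drops 90%
                    lambda r: r["qty"] % 2 == 0])  # unselective: drops 50%
print(sum(1 for _ in f.process(rows)))  # 125 rows survive both predicates
```

Because evaluation short-circuits, placing the selective predicate first
means the unselective one runs on far fewer tuples, which is the kind of
run-time saving an adaptive plan can discover that a static plan may miss.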
Our work on adaptive query processing focuses on four problems: (1)
extending our query processing techniques to increasingly complex types of
queries; (2) investigating whether adaptive techniques provide significant
benefits in more traditional database applications; (3) extending these
techniques to a distributed, peer-to-peer environment; and (4)
understanding the principles and effectiveness of adaptive query processing
techniques.
Collaborators: Nick Taylor, Sudipto Guha, and Mohammad Daud. Former
collaborators: Alon Halevy, Daniel Weld, Dan Suciu, and Igor Tatarinov
(University of Washington); Aneesh Kapur, Mike Wittie, and Ivan Terziev.
In the data integration research community, there is only a limited
understanding of the needs of real integration applications. I propose
to build a suite of mappings, data sources, and workloads for benchmarking
and evaluating data integration techniques.
One of the first domains of interest is bioinformatics, which has a rich
set of complex data types, as well as a variety of publicly available data
sources.