If you are interested in doing an independent study that may lead to
something longer term, please send me email. If you are interested in doing a
senior project with me, here are some examples of the types of short-term
projects I'd be interested in. You are also encouraged to propose your
own ideas.
Past Senior Projects
- An XML-based web infrastructure for the Penn Database Group that makes
use of data integration tools developed within our group, XSL tools, and
various other standard web components. This project will teach skills
relating to XML, XQuery, XSL, and various open source and locally developed
tools. This would be a great way to gain experience and skills relevant to
both research and post-academic careers, to learn new technologies, and to
be able to showcase the results. BSE senior project by Eric McGlinchey.
- The Universal Message Center -- a "Google for personal and public messages"
that includes security. Develop an application for indexing, sharing, and
searching e-mail, newsgroup, scheduling, and other information (e.g., RSS
feeds), in a way that preserves both privacy and security. This project will
teach information retrieval, XML, and database techniques. It will involve
building a significant system in Java. BSE senior project by Nina Quek and
Mike Daglian.
- Meta-search as a way of combatting the hacks people use to increase
their page rankings. BSE senior project by Nate Hashem.
- A scalable infrastructure for Penn Course Review. BSE senior
project by Howie Vegter and Steve MacCrory.
- Building a better tech support help web site. BSE senior project by Dan
Margolis.
- ILMUNC, a database-backed web site for
the Ivy League Model United Nations conference. BAS capstone project by Amit
Vazirani and Raghav Bajaj.
Open Projects
- (MS/BSE level) The Universal Message Center is a peer-to-peer indexing and
search system for public and personal data: web pages, XML documents, blog
posts, newsgroups, email, etc. The UMC (developed as a Senior Project by Mike
Daglian and Nina Quek) allows for keyword search, as well as keyword-within-tag
search, over data, and it supports a P2P architecture with certain privacy
restrictions. There are a number of further development steps that can make
for interesting projects:
  - Views or Collections. Can we come up with a way of "saving"
    and naming collections of data based on keyword queries, where we can
    selectively add related documents, and remove "bad" ones? Can we define this
    in a hierarchical way, as with Yahoo categories?
  - Ranking. Currently we rely on a combination of Information
    Retrieval-style ranking and several heuristics. Can concepts like Google's
    PageRank and user feedback be applied here?
  - Adding structure to unstructured data. Can we define a series of
    "templates" to progressively add XML tags to plain-text data, so we can take
    emails and web pages and query them for semantic information? Can we account
    for the fact that such templates are only correct with a certain
    probability?
  - Sharing. Given a distributed engine with public, semi-private, and
    private data, how should we define models of sharing between users?
- (MS/BSE level) Peer-to-peer synchronization. We are building a
peer-to-peer system, Orchestra, that, among other tasks, allows nodes to
"check out" data, modify it, and synchronize with one another. There are many
aspects of this system that need to be investigated.
- (MS/BSE level) XML path matching for query processing. Our Tukwila system,
used for integrating or querying XML data, can read and operate on data that is
still being read across the network (a form of "streaming" or "pipelining").
Currently, the implementation of this capability is limited in the information
it collects about the XML (it does not collect information like the location of
the data in the document), and it has not been optimized. We would like to
extend this to be more general so more complex operations can be performed, and
we would like to add further optimized behaviors.
- (MS level) Experimenting with state-of-the-art adaptive query processing
techniques in a local database. This project will involve extending the
Tukwila adaptive data integration system, which queries data across a network,
so it has local storage. That will be accomplished by coupling the existing
Tukwila codebase with a data storage system, BerkeleyDB. This project requires
skills in reading and writing C++ code.
- (MS level) Implementing new datatypes and the XQuery function library for
an XML query processor. This project involves adding the XQuery standard
function library and some new data types to the Tukwila codebase. It would
teach the internals of a database query processor, the fundamentals of XQuery,
and software engineering skills. This project requires C++ skills and
familiarity with XML and the XQuery language.
- (MS level) Developing an XQuery rewrite-based optimizer. Query processing
is a challenging and interesting area of work, with some overlap with compiler
techniques but a number of unique characteristics. XQuery is a highly
declarative language, so it is possible to specify a query in many forms (some
with many nested expressions, some with many expressions that can be
simplified, and some that depend on previously defined queries, also known as
views).
This project would involve understanding how XQueries relate and simplifying
the queries so they are more compact and more efficient.
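
To give a feel for the XML path matching project above, here is a minimal sketch of matching a path against XML while it is still being read, and recording where in the document each match begins. It is written in Python with the standard `xml.sax` module purely for illustration; it is not Tukwila's implementation (Tukwila is written in C++). The names `PathMatcher`, `match_path`, and `SAMPLE` are hypothetical. The sketch assumes only simple absolute child paths such as `/feed/entry/title`, with no wildcards, predicates, or descendant steps.

```python
import io
import xml.sax


class PathMatcher(xml.sax.ContentHandler):
    """Collect (line, text) for every element matching an absolute path.

    Illustrative only: supports plain child steps such as /feed/entry/title.
    """

    def __init__(self, path):
        super().__init__()
        self.target = [step for step in path.split("/") if step]
        self.stack = []      # names of the elements currently open
        self.matches = []    # (line of the start tag, text content)
        self._open = None    # (line, text parts) while inside a match

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.stack == self.target:
            # The base ContentHandler keeps the parser's locator in
            # self._locator once parsing has started.
            self._open = (self._locator.getLineNumber(), [])

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._open is not None:
            self._open[1].append(content)

    def endElement(self, name):
        if self._open is not None and self.stack == self.target:
            line, parts = self._open
            self.matches.append((line, "".join(parts)))
            self._open = None
        self.stack.pop()


def match_path(stream, path):
    """Parse a file-like object incrementally and return the matches."""
    handler = PathMatcher(path)
    xml.sax.parse(stream, handler)
    return handler.matches


SAMPLE = b"""<feed>
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>
"""

if __name__ == "__main__":
    for line, text in match_path(io.BytesIO(SAMPLE), "/feed/entry/title"):
        print(line, text)
```

The handler holds only the stack of open element names and the text of the match in progress, so memory use does not grow with document size. A more general matcher would need to track several candidate paths at once, which is the kind of extension the project has in mind.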