The Data that Archiving Fails to Capture
Peter Buneman
University of Pennsylvania
peter@cis.upenn.edu

To someone interested in the preservation and archiving of data, this is good news. The information is being kept alive, and there is no need to bring any new technology to bear on the problem of ensuring the longevity of data. Indeed, one can argue that historically, duplication has always been a better guarantee of data preservation than archival media or institutions (consider the losses caused by the destruction of the libraries of Alexandria, Cotton and Louvain!). Therefore, the right way to preserve data is the natural way -- to facilitate the copying and construction of new databases. One could argue further that keeping data alive by this process is natural in that it is context sensitive: the form of the data adjusts to the context in which it is used. One can think of countless examples -- music, literature, etc. -- in which the original raw data is much less useful than its modern representation. The original rendering is often unintelligible, and a work survives because it is constantly adjusted (re-interpreted, translated, etc.) to suit the context.
So why should the issue of derived databases be of interest to a conference on data archiving when it appears that these information sources are being naturally archived? The problem is that duplication of databases per se does not archive all the relevant data. The copying of databases typically involves extracting a subset of the data from some data source, manually cleaning it, and transforming it into a form suitable for some other data source. However, the process by which a piece of data arrives in a database -- its provenance -- is frequently lost. A user of a derived database may have no idea how the data got there; worse still, the maintainers of the database may not keep this information. Knowing the provenance of a data element is crucial to one's assessment of its reliability.
The following diagram shows the interrelationships between a very small subset of the biological databases whose primary concern is genetics. The arrows between databases describe how these databases are derived from each other. It is important to remember that this derivation involves the selection and transformation of certain data elements (a database query) as well as extensive manual "cleaning" of the data. Some of the databases are general purpose -- Genbank, for example, is a general-purpose sequence database -- while others, such as EpoDB (a database of genes connected with red blood cells), are specific to a research project. A genetics researcher will use the appropriate database as the most reliable source. Swissprot, for example, is regarded as the most reliable source of protein sequence data because it is heavily curated. In the figure, a * indicates databases that are curated, SUB indicates that the database has some form of automatic submission process, and LIT indicates that the curators of the database may go to external sources (the literature) to augment or correct data.
Each of these databases is constantly evolving as new experimental evidence is obtained. This explains the cycles in the diagram. Data may appear first in one database, be corrected as it moves into another, and that correction is moved back into the original database. In general, the individual database curators do an excellent job of keeping old versions of their databases; the databases are sometimes available in more than one format; and it is likely that XML versions of most of these databases will shortly be available.
On the face of it, biological databases are being naturally and effectively archived. Yet despite the effort that is expended in this domain on information preservation, we are losing crucial information!
What we are losing is the linkage between the databases. How one database depends on another is a complex process involving query languages, data mining techniques, data cleaning and various forms of data translation. Taking a "data-oriented" view of the problem, when you see some data element in one of these databases, you may have no idea how it got there. Almost certainly it was extracted from some other database, which in turn extracted it from another database, and at each step some correction or transformation may have been applied. Also, the relevant data may have been available in two or more databases, and some judgment was exercised in choosing which source to use. The provenance of the data -- the process by which the data moved through the complex system of databases -- is often lost. When it is maintained, it is kept in uninterpreted comment fields and is typically partial. This information is crucial to anyone trying to assess the reliability of the databases, not least to the people (the curators) who are maintaining the databases. The tools for recording data provenance are, at best, minimal. At a recent NSF database workshop, a discussion group coined the term self-aware data to describe data that carries its own history. Perhaps "self-describing" would be a better term, but that term has already been adopted, somewhat inaccurately, by people working on data formats and semistructured data. What "self-aware" means is that whenever you extract data from a database you will get not only the "face value" data -- the data you wanted -- but also some latent metadata: metadata that describes the history of the data: where it came from, how it was transformed, who corrected it, etc. Moreover, when you pass it on to someone else, this latent data will (perhaps automatically) be augmented with the further details of that transaction. This has consequences for database construction, the most obvious being the sheer volume of the latent metadata.
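To make the idea of self-aware data concrete, here is a minimal sketch of a value that carries its own history and is augmented at each derivation step. The class, the database names, and the record fields are all invented for illustration; no real system is assumed to work this way.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class SelfAwareValue:
    """A data element bundled with its provenance history (latent metadata)."""
    value: object
    history: list = field(default_factory=list)

    def derive(self, new_value, source, operation, agent):
        """Return a new value whose history extends this one's with the
        details of this derivation step (source, transformation, actor)."""
        step = {"source": source, "operation": operation,
                "agent": agent, "date": date.today().isoformat()}
        return SelfAwareValue(new_value, self.history + [step])

# A (hypothetical) sequence entry moving through a chain of databases:
raw = SelfAwareValue("ATGGCC...", [{"source": "lab submission",
                                    "operation": "deposit",
                                    "agent": "submitter",
                                    "date": "1999-01-15"}])
curated = raw.derive("ATGGCA...", source="GenBank",
                     operation="manual correction", agent="curator")
exported = curated.derive(curated.value, source="Swissprot import",
                          operation="format translation", agent="load script")

# The latent metadata now records all three steps of the journey.
print(len(exported.history))
```

The point of the sketch is that the history travels with the value: each extraction or correction appends a step rather than overwriting one, so the full path through the system of databases remains recoverable.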
Sheer size is not the only problem. Databases have structure and, in conventional database systems, that structure is predefined and restricts what one can put into the database. Adding arbitrary annotations, which is one of the desiderata of our latent metadata, is difficult, if not impossible. Even if one could expand the database schema with fields that account for the "core" annotations, one still has the problem of the unanticipated annotations and the problem of annotating the annotations.
Fortunately there is recent work in both the database and document communities that may offer some hope of solving the problem of implementing such annotations. This is work on semistructured data, which converges with XML in that the database work offers methods for storing and querying large XML documents. Semistructured data models allow us to accommodate unanticipated structure, and there is now considerable interest in techniques for the efficient storage and retrieval of mixtures of structured and semistructured data. This offers at least the beginning of a solution to the problem of what data model and storage mechanisms might be appropriate for storing annotations.
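The appeal of the semistructured approach can be sketched with nested records, in the style of XML. The entry below is purely illustrative (the identifiers and field names are invented): because no schema fixes the structure, an unanticipated annotation can be attached anywhere, and an annotation can itself be annotated.

```python
# A semistructured record: annotations need no predefined schema,
# and an annotation can itself carry annotations. All names are invented.
entry = {
    "id": "EPO_HUMAN",
    "sequence": "MGVHECPAWL...",
    "annotation": [
        {"note": "signal peptide, residues 1-27",
         # an annotation on the annotation itself:
         "annotation": {"corrected-by": "curator A",
                        "evidence": "literature (LIT)"}},
        {"note": "imported from GenBank"},
    ],
}

# Because the structure is not fixed in advance, a new kind of
# annotation can be added without changing any schema:
entry["annotation"].append({"provenance": {"source": "Swissprot",
                                           "curated": True}})

print(len(entry["annotation"]))
```

In a conventional relational schema each of these additions would require a schema change; in a semistructured model they are ordinary edits.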
The size issue still needs to be addressed. Again, some existing ideas in databases may help us. In an image database we would expect that the latent data associated with most pixels in an image would be the same. Therefore one should be able to use the same annotation for most of the pixels. Only the "deviant" pixels -- for example those that have been corrected -- need special treatment. By and large the latent data for each pixel will be inherited from the latent data for the image. Again, in genetic databases we have some idea of how much exceptional annotation is needed. One typically sees a small number -- two or three at a rough guess -- of annotations on a sequence of several hundred base pairs. Thus, while the overhead for transmitting a single base pair may be 1000%, the overhead on larger units of information may be relatively low.
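The inheritance idea above can be sketched in a few lines: one default provenance record covers the whole image (or sequence), and only deviant elements carry their own. The names here are illustrative, not drawn from any real system.

```python
# Latent metadata shared by inheritance: a single default record covers
# the whole unit, and only "deviant" elements carry their own annotation.
# All names and values are invented for illustration.

default_provenance = {"source": "scanner X", "date": "1999-06-01"}

# Pixel 42 was manually corrected, so it gets its own record:
exceptions = {42: {"source": "manual correction", "curator": "B"}}

def provenance_of(index):
    """Return the element's own record if it has one; otherwise
    inherit the default record for the whole unit."""
    return exceptions.get(index, default_provenance)

assert provenance_of(7) is default_provenance   # inherited, no extra storage
assert provenance_of(42)["curator"] == "B"      # the deviant element
```

Storage then grows with the number of exceptions rather than with the number of elements, which is why the amortized overhead on larger units of information stays low even when per-element overhead would be enormous.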
The use of semistructured data as the substrate for data annotation is
only a partial solution to the problem. There are many more issues
concerning data models, languages and storage techniques that are
involved in building an environment in which the recording of data
provenance is a simple and natural process. Especially important is
the development of tools for helping annotators/curators to record
and repeat the corrections they make.
Acknowledgements and References
Many of the ideas in this note are the result of discussions with my colleagues in the database group at the University of Pennsylvania and with John Ockerbloom. I am also grateful for discussions with David Maier and Paul Kantor at a recent meeting, the 1999 NSF Information and Data Management Workshop, at which we coined the term "self-aware data". Some of these issues also came up at the NSF Invitational Workshop on Distributed Information, Computation, and Process Management for Scientific and Engineering Environments. My extremely limited knowledge of data preservation issues was taken from the position papers of the NSF Workshop on Data Archival and Information Preservation.