Peter Buneman -- position paper

Functional Database Languages and the Functional Data Model

A position paper for the FDM workshop

Peter Buneman
University of Pennsylvania

The functional data model is now almost twenty years old. The original idea was to view the database as a collection of extensionally defined functions and to use a functional language for querying the database. The idea of the functional data model and the idea of functional database languages arose simultaneously and reinforced each other. However, these are separate ideas. One can use functional languages with non-functional data models, and one can use non-functional languages with a functional data model. In this position paper I want to assert that functional languages have had some influence on the development of practical database languages; in fact they have had most success as query languages for ``non-database'' data that is difficult to represent in the functional data model! On the other hand, the functional data model has had little, if any, impact. I want to argue that the functional data model gives us both too little and too much and to suggest some criteria that are important for the next generation of data models, which will, I hope, be influenced by the functional model.

This paper is written in almost complete ignorance of recent developments in the functional data model. I apologise in advance for this, and I would be happy to learn of work that refutes some of the more negative claims in this paper. I would also like to express my indebtedness to the database group at Penn, especially to Susan Davidson and to Val Tannen, for providing constructive responses to my frequent ``flames'' about data models. Many of the ideas in this paper are theirs, though the responsibility for representing or mis-representing them is mine.

1. Functional Database Query Languages

Before starting this discussion we should resolve a confusion surrounding the definition of functional languages. The term is used to sometimes to denote languages in which functions are first-class values, and sometimes used also used to described applicative languages -- languages in which there is no notion of state. SML, for example, has the first property but lacks the second. When used for databases, the issue is more complicated. Database query languages are designed around some simple primitives with a well-understood equational theory that is used for optimisation, and it is not clear what role general higher-order functions play in this, nor is it obvious that one needs a general fix-point operation, which is normally associated with functional languages. As for whether the notion of state is important, obviously it is in databases. However database updates are normally specified as a transformation from one state of a database to the next. This transformation may be specified applicatively, so there may not be a contradiction between database updates and the use of applicative languages.

Apart from the highly desirable property of being compositional (a property that is unfortunately not possessed by the most widely used database query languages,) why are functional query languages appropriate as database query languages?

1.1 Type Systems

Much has been learned from the polymorphic type systems of functional languages. Especially useful has been the relatively recent work on record polymorphism. A database query typically only uses a small portion of the database. Record polymorphism allows one to describe -- and possibly infer -- what part of the database is needed to support a given query. This is especially important if one wants some protection against database restructuring. It is also important for providing some verification for the correctness of a query and for providing some optimisation.

1.2 Functional Algebras

First-order logic has traditionally been the basis of database query languages. The relational algebra has proved itself as the right medium in which to express and optimise first-order logic. The addition of a fixpoint operation (datalog) has provided increased expressive power. However first-order logic, by its nature, is based upon ``flat'' (1NF) relations which, in a programming language, we take to be sets of tuples constructed from atomic types. Extending first-order logic to work on ``non-flat'' structures, or to generalise it to other collection types such as multisets or lists has proven problematic.

An approach based upon simple functional algebras, which generalises properly to free combinations of collection types, records and variants, has provided an alternative to first-order logic The inspiration for this approach was provided by a basic construct in category theory -- the monad, which is not so surprising when one considers that the purpose of category theory is to generalise mathematical structures, and that sets, multisets and lists are such structures.

There are a number of satisfactory outcomes to this approach. First, it has provided a tractable language for nested relations -- something that did not readily fall out of relational algebra. Second, the equational theory of monads generalises many of the well-known optimisations of relational algebra. Third, there are a number of ``conservativity'' results that show that when the appropriate monad algebras are restricted to flat relations, they coincide in expressive power with well-known ``logic-based'' languages: relational algebra, conjunctive queries, and datalog.

1.3 The use of functional query languages

Functional query languages have proved enormously useful for querying ``non-database'' data -- the kinds of data that exist in data exchange formats, scientific data formats, and other data interchange protocols. The design of recent query languages has recognised this: OQL, for example, is described as a functional language, though this description is based on the simple fact that OQL is compositional.

Perhaps the most important observation is that functional query languages have proved themselves most useful for those data models and formats (relational, nested relational, etc.) that one would not think of as functional models!

2. The Functional Data Model

The functional data model is the ``lowest common denominator'' of data models. Apart from the complications introduced by collection types other than sets, almost any other model (relational, E-R, OO) can be regarded as a model based upon sets and functions. Let us briefly examine two case studies to see what they tell us about the functional data model.

2.1 The Entity-Relationship model

At its core, the E-R model is a class of graphs with three node types: attributes (A), entities (E) and relationships (R). Edges in this graph may run from an E node to an A node and from an R node to an E or A node; edges may not connect two A nodes, two E nodes or two R nodes. These restrictions ensure that the graph is acyclic. Because one often wants to have one entity as an ``attribute'' of another, and because one cannot have edges connecting E nodes, one is led to clumsy complications of the model that introduce constructs such weak entities. Something similar is done for relationships.

An E-R formalism would almost certainly describe an instance of an E-R schema by associating a set with each node and a function with each arrow (there might be some disagreement about whether functions can be multi-valued.) There is no doubt that the E-R model is a kind of functional model. A functional modeler would simplify the problem of weak entities by collapsing the A, E and R nodes to a common type and allowing arbitrary edges, thereby eliminating the need for weak entities and the like. However, this collapsing of nodes may have destroyed useful information:

The A,E,R stratification is intuitive. Having different node shapes is a real aid in trying to understand large and complex schemas.
The set that instantiates an R node is (isomorphic to) a subset of the product of the sets that instantiate the A and the E nodes to which it is connected. In other words, R nodes have keys and do not have independent ``identity''. Keys are important, and these are described in the E-R model by edge annotations. (I do not know if E nodes are supposed to have ``identity''. Unlike R nodes, E nodes do not usually have decorated edged, and one is led to assume that E nodes have ``identity'' rather than keys; but I understand opinions differ on this topic.)

A ``bare'' functional model fails to represent these properties, and if we were to add an appropriate annotation to the functional model to correct for these failings, how different would it look from an E-R model?

2.2 The Object-Oriented Data Model

In contrast to the E-R model, the OO data model is based upon some (generic?) object-oriented language and provides us with a rich type system. We are at liberty to build recursive types as well as exploiting the collection types that are invariably added to an OODBMS. With methods, we also have the ability to embed procedures in our data. Isn't this too general to be the basis for a data model? With this generality, it is not surprising that there are several ways of representing the same data:

class Student { Name : string; Age : int; Courses : set( Course);}
class Course { Cname : string; Teacher : string; }
class Student { Name : string; Age : int; }
class Course { Cname : string; Teacher : string; } Students : set( Student); }
class Student { Name : string; Age : int; }
class Course { Cname : string; Teacher : string; }
class Enroll { Students : set( Student); Courses : set( Course); }

Under certain assumptions about containment in extents, these three definitions are equivalent. In addition, some OODBMSs and the ODMG proposals add, through ``inverse relationships'' a fourth possibility, which is a combination of (1) and (2) above.

Again, provided multi-valued functions are allowed, we can represent each of the three possibilities above in the functional data model, and this model will then inherit from the object-oriented model the same potential for multiple representations. It is arguable, of course, that multiple representations are a good thing. My opinion is that multiple views are a good thing, but that a canonical underlying representation,such as that provided by the E-R model, is desirable for any kind of data integration and translation.

2.3 The Future of the Functional Data Model

From the previous examples I have indicated that the functional data model has inherited (or is in danger of inheriting) some of the bad aspects of the OO data model, and has failed to inherit some of the good aspects of E-R modeling. What is its future? Before answering this question it is worth remarking that data models, as well as the semantic networks and knowledge representation systems of Artificial Intelligence, all appear to have a similar life cycle: they start with some relatively simple graphical representation of structure, but these graphs become increasingly complicated as people attempt to add ``semantics'' to the model. New shapes of nodes and new kinds of edges are introduced with the likely consequence that several equivalent representations of the same data are possible, and while the semantics will be enriched, it will also be more confusing. I do not know what attempts have been made to enrich the functional data model beyond the addition of an inheritance arrow.

Let me now try to enumerate the desiderata for any data model in the hope that some extension or variant of the functional model will ultimately satisfy them.

It must have a clear mathematical semantics that should embrace all details of the model. This is essential if we are going to translate between the model and some other model. Even though the E-R model has a close correspondence with the relational model, the lack of attention to the details of the model can make relational representations difficult to construct.
One should be able to derive a conventional type system from the model. There is reasonable agreement on the types that one would like to be able to represent in a database.
A consequence of the previous point is that there should be a well defined set of ``language bindings'' for the model. Preferably there should be a query language that works directly from the model.
Data models are more than type systems; they also represent constraints on the data. These constraints should be explicit in the semantics and should, if possible, be easy to reason about.
It should be ``implementable'' in the sense that it it should not require grossly inefficient data representations or programming constructs.

Having been rather negative about data models in general, I want to end by stressing their importance. The plethora of data models and knowledge representation systems indicates a clear need for ``conceptual models''. In many cases an E-R diagram only has transient use in the process of database design; nevertheless, it is seen as useful. Most database work in the future will be ``data massaging'' -- moving between representations of data. Only with a good data model that expresses a simple constraint system and an adequate type system will this be possible.

Peter Buneman
June 1997