Functional Database Languages and the Functional Data Model
A position paper for the FDM workshop
Peter Buneman
University of Pennsylvania
The functional data model is now almost twenty years old. The
original idea was to view the database as a collection of
extensionally defined functions and to use a functional language for
querying the database. The idea of the functional data model and the
idea of functional database languages arose simultaneously and
reinforced each other. However, these are separate ideas. One can
use functional languages with non-functional data models, and one can
use non-functional languages with a functional data model. In this
position paper I want to assert that functional languages have had
some influence on the development of practical database languages; in fact
they have had most success as query languages for ``non-database''
data that is difficult to represent in the functional data model! On
the other hand, the functional data model has had little, if any,
impact. I want to argue that the functional data model gives us both
too little and too much and to suggest some criteria that are
important for the next generation of data models, which will, I hope,
be influenced by the functional model.
This paper is written in almost complete ignorance of recent
developments in the functional data model. I apologise in advance for
this, and I would be happy to learn of work that refutes some of the
more negative claims in this paper. I would also like to express my
indebtedness to the database group at Penn, especially to Susan
Davidson and to Val Tannen, for providing constructive responses to my
frequent ``flames'' about data models. Many of the ideas in this paper
are theirs, though the responsibility for representing
or mis-representing them is mine.
1. Functional Database Query Languages
Before starting this discussion we should resolve a
confusion surrounding the definition of functional languages. The
term is used to sometimes to denote languages in which functions are
first-class values, and sometimes used also used to described applicative
languages -- languages in which there is no notion of state. SML, for
example, has the first property but lacks the second. When used for
databases, the issue is more complicated. Database query languages
are designed around some simple primitives with a well-understood
equational theory that is used for optimisation, and it is not clear
what role general higher-order functions play in this, nor is it
obvious that one needs a general fix-point operation, which is
normally associated with functional languages. As for whether the
notion of state is important, obviously it is in databases. However
database updates are normally specified as a transformation from one
state of a database to the next. This transformation may be specified
applicatively, so there may not be a contradiction between database
updates and the use of applicative languages.
Apart from the highly desirable property of being compositional (a
property that is unfortunately not possessed by the most widely used
database query languages,) why are functional query languages
appropriate as database query languages?
1.1 Type Systems
Much has been learned from the polymorphic type systems of functional
languages. Especially useful has been the relatively recent work on
record polymorphism. A database query typically only uses a small
portion of the database. Record polymorphism allows one to describe
-- and possibly infer -- what part of the database is needed to
support a given query. This is especially important if one wants
some protection against database restructuring. It is also important
for providing some verification for the correctness of a query and for
providing some optimisation.
1.2 Functional Algebras
First-order logic has traditionally been the basis of database query
languages. The relational algebra has proved itself as the right
medium in which to express and optimise first-order logic. The
addition of a fixpoint operation (datalog) has provided increased
expressive power. However first-order logic, by its nature, is based
upon ``flat'' (1NF) relations which, in a programming language, we
take to be sets of tuples constructed from atomic types. Extending
first-order logic to work on ``non-flat'' structures, or to
generalise it to other collection types such as multisets or lists has
proven problematic.
An approach based upon simple functional algebras, which generalises
properly to free combinations of collection types, records and
variants, has provided an alternative to first-order logic The
inspiration for this approach was provided by a basic construct in
category theory -- the monad, which is not so surprising when
one considers that the purpose of category theory is to generalise
mathematical structures, and that sets, multisets and lists are such
structures.
There are a number of satisfactory outcomes to this approach. First,
it has provided a tractable language for nested relations -- something
that did not readily fall out of relational algebra. Second, the
equational theory of monads generalises many of the well-known
optimisations of relational algebra. Third, there are a number of
``conservativity'' results that show that when the appropriate monad
algebras are restricted to flat relations, they coincide in expressive
power with well-known ``logic-based'' languages: relational algebra,
conjunctive queries, and datalog.
1.3 The use of functional query languages
Functional query languages have proved enormously useful for querying
``non-database'' data -- the kinds of data that exist in data exchange
formats, scientific data formats, and other data interchange
protocols. The design of recent query languages has recognised this:
OQL, for example, is described as a functional language, though this
description is based on the simple fact that OQL is compositional.
Perhaps the most important observation is that
functional query languages have proved themselves most useful for
those data models and formats (relational, nested relational, etc.)
that one would not think of as functional models!
2. The Functional Data Model
The functional data model is the ``lowest common denominator'' of data
models. Apart from the complications introduced by collection types
other than sets, almost any other model (relational, E-R, OO) can be
regarded as a model based upon sets and functions. Let us briefly
examine two case studies to see what they tell us about the functional
data model.
2.1 The Entity-Relationship model
At its core, the E-R model is a class of graphs with three node types:
attributes (A), entities (E) and relationships (R). Edges in
this graph may run from an E node to an A node and from an R
node to an E or A node; edges
may not connect two A nodes, two E nodes or two R nodes. These
restrictions ensure that the graph is acyclic. Because one often
wants to have one entity as an ``attribute'' of another, and because
one cannot have edges connecting E nodes, one is led to
clumsy
complications of the model that introduce constructs such weak entities. Something
similar is done for relationships.
An E-R formalism would almost certainly describe an instance of an E-R
schema by associating a set with each node and a function with each
arrow (there might be some disagreement about whether functions
can be multi-valued.) There is no doubt that the E-R model is a
kind of functional model.
A functional modeler would simplify the problem of weak entities by collapsing the A,
E and R nodes to a common type and allowing arbitrary edges,
thereby eliminating the need for weak entities and the like. However,
this collapsing of nodes may have destroyed useful information:
- The A,E,R stratification is intuitive. Having different
node shapes is a real
aid in trying to understand large and complex schemas.
- The set that instantiates an R node is (isomorphic to)
a subset of the product of the sets that instantiate the A and
the E nodes to which it is connected. In other words, R
nodes have keys and do not have independent ``identity''.
Keys are important, and these are described in the E-R model by edge
annotations. (I do not know if E nodes are supposed to have
``identity''. Unlike R nodes, E nodes do not usually
have decorated edged, and one is led to assume that E nodes
have ``identity'' rather than keys; but I understand opinions differ on this
topic.)
A ``bare'' functional model fails to represent these properties,
and if we were to add an appropriate annotation to the
functional model to correct for these failings, how different would it
look from an E-R model?
2.2 The Object-Oriented Data Model
In contrast to the E-R model, the OO data model is based upon
some (generic?) object-oriented language and provides us with a rich
type system. We are at liberty to build recursive types as well as
exploiting the collection types that are invariably added to an
OODBMS. With methods, we also have the ability to embed procedures in
our data. Isn't this too general to be the basis for a data model?
With this generality, it is not surprising that there are several ways
of representing the same data:
-
class Student { Name : string;
Age : int;
Courses : set( Course);}
class Course { Cname : string;
Teacher : string; }
-
class Student { Name : string;
Age : int; }
class Course { Cname : string;
Teacher : string; }
Students : set( Student); }
-
class Student { Name : string;
Age : int; }
class Course { Cname : string;
Teacher : string; }
class Enroll { Students : set( Student);
Courses : set(
Course); }
Under certain assumptions about containment in extents, these three
definitions are equivalent. In addition, some OODBMSs and the ODMG proposals add,
through ``inverse relationships'' a
fourth possibility, which is a combination of (1) and (2) above.
Again, provided multi-valued functions are allowed,
we can represent each of the three possibilities above in the
functional data model, and this model will then
inherit from the object-oriented model the same potential for
multiple representations. It is arguable, of course, that multiple
representations are a good thing. My opinion is that multiple views are a good thing, but that a canonical underlying
representation,such as that provided by the E-R model, is desirable for any kind of data integration and
translation.
2.3 The Future of the Functional Data Model
From the previous examples I have indicated that the functional data
model has inherited (or is in danger of inheriting) some of the bad
aspects of the OO data model, and has failed to inherit some of the
good aspects of E-R modeling. What is its future? Before answering
this question it is worth remarking that
data models, as well as the semantic networks and knowledge
representation systems of Artificial Intelligence, all appear to have a
similar life cycle: they start with some relatively simple graphical
representation of structure, but these graphs become increasingly
complicated as people attempt to add ``semantics'' to the model. New
shapes of nodes and new kinds of edges are introduced with the likely
consequence that several equivalent representations of the same data
are possible, and while the semantics will be enriched, it will also
be more confusing. I do not know what attempts have been made to
enrich the functional data model beyond the addition of an inheritance
arrow.
Let me now try to enumerate the desiderata for any data model in
the hope that some extension or variant of the functional model will
ultimately satisfy them.
- It must have a clear mathematical semantics that should embrace
all details of the model. This is essential if we are going to
translate between the model and some other model. Even though the E-R
model has a close correspondence with the relational model, the lack
of attention to the details of the model can make relational
representations difficult to construct.
- One should be able to derive a conventional type system from the
model. There is reasonable agreement on the types that one would like to be able to
represent in a database.
- A consequence of the previous point is that there should be a
well defined set of ``language bindings'' for the model. Preferably
there should be a query language that works directly from the model.
- Data models are more than type systems; they also represent
constraints on the data. These constraints should be explicit in the
semantics and should, if possible, be easy to reason about.
- It should be ``implementable'' in the sense that it it should
not require grossly inefficient data representations or programming
constructs.
Having been rather negative about data models in general, I want to
end by stressing their importance. The plethora of data models and knowledge
representation systems indicates a clear need for ``conceptual
models''. In many cases an E-R diagram only has transient use in the
process of database design; nevertheless, it is seen as useful.
Most database work in the future will be ``data massaging'' -- moving
between representations of data. Only with a good data model that
expresses a simple constraint system and an adequate type system will
this be possible.
Peter Buneman
June 1997