1 Filesystems and Transactions
Main readings:
- Hennessy and Patterson. Computer Architecture: A
Quantitative Approach. Pages 485--493 (section on disk storage).
- M. K. McKusick, W. N. Joy, S. J. Leffler, and
R. S. Fabry.
A Fast File System for UNIX. ACM Transactions on
Computer Systems, Vol. 2, No. 3, August 1984, pp. 181--197.
Summary:
The Berkeley FFS is a reimplementation of the original Bell Labs Unix
filesystem, retaining the original API but providing much higher
throughput. The main sources of speedup are:
- The old FS used 1024-byte blocks; the new one uses (typically) 8K
blocks so that data is transferred to/from the disk in larger chunks. To
avoid excessive space wastage for small files, blocks can be subdivided
into sector-sized (e.g. 512-byte) fragments, which can be allocated
to separate files (subject to some alignment constraints).
- The old FS kept all inode information in a single area of (each
partition of) the disk. The FFS structures the disk into a collection of
cylinder groups, each with its own space for storing inodes (up to
2048 per cylinder group).
- The old FS used a free list for keeping track of allocatable
disk blocks. The FFS uses a bitmap for each cylinder group, allowing
allocation of contiguous (actually, rotationally optimal) blocks to
large files.
Inodes for files in the same directory are allocated (if possible) in
the same cylinder group. However, different directories are placed in
different cylinder groups, to try to ensure that the amount of free space
in the cylinder groups remains approximately balanced, allowing good
allocation decisions for future writes.
The new allocation policies are effective only when the disk is run less
than completely full (e.g., below 90% of capacity). Taking this reserved
space into account, as well as the extra space required for cylinder group
bitmaps, etc., and offsetting it by the benefits of sub-block-level
allocation of small files, the space usage of FFS is similar to that of
the old Unix FS.
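A rough illustration of the directory-placement heuristic described above, as
a Python sketch (the CylinderGroup record and the exact selection rule are
invented for illustration; they are not FFS's actual data structures):

    # Sketch of an FFS-style placement policy for a new directory: pick a
    # cylinder group with at least the average amount of free space and
    # relatively few directories already in it, so free space stays balanced.
    from dataclasses import dataclass

    @dataclass
    class CylinderGroup:
        free_blocks: int   # free data blocks in this group
        free_inodes: int   # unused inodes in this group
        ndirs: int         # directories already placed here

    def pick_group_for_new_directory(groups):
        avg_free = sum(g.free_blocks for g in groups) / len(groups)
        candidates = [g for g in groups
                      if g.free_blocks >= avg_free and g.free_inodes > 0]
        if not candidates:
            candidates = [g for g in groups if g.free_inodes > 0]
        # Prefer the group with the fewest existing directories.
        return min(candidates, key=lambda g: g.ndirs)

Files created inside the new directory would then have their inodes (and,
when possible, their first data blocks) allocated in the same group.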
The FFS can use up to half the maximum disk transfer rate (i.e., the max
rate at which data can be streamed onto or off of the disk by the raw
hardware); the old FS, by comparison, could only get 2-3% utilization.
The paper also discusses some functional enhancements to the original FS,
including quotas, advisory locks, symlinks, and an atomic rename
primitive.
Some questions for discussion:
- This paper was published in 1984; since that time, many performance
parameters of disks have changed enormously. How many of the
calculations remain valid (mutatis mutandis), and which require serious
rethinking? Have any of the conclusions drawn from these calculations
become misleading or downright bogus?
- Disk transfer performance is regarded here as a distinct, isolable
value. But in many systems, the same hardware resources (disk, memory,
and processor) are shared between a variety of tasks (in particular, file
I/O and virtual memory). How does the story need to change when we take
these other parts of the picture into account?
- Everybody talks about the ``Unix Filesystem Semantics''; what is
it, exactly? (For example, is the expectation that writes to directory
nodes will not be reordered part of the semantics of the filesystem?
What about writes to blocks within files? What are 'reasonable
expectations' about the permanence of writes after crashes?)
- M. Rosenblum and J. K. Ousterhout.
The Design and Implementation of a
Log-Structured File System. ACM Transactions on Computer Systems,
Vol. 10, No. 1, February 1992, pp. 26--52.
Summary: Traditional Unix filesystems like FFS can use large disk caches
to increase the efficiency of reads, but still perform writes
synchronously. Especially for metadata, this can involve several seeks
per block of data transferred; this limits effective disk utilization
to a small percentage of the theoretical maximum. As memory becomes
cheaper and disk caches get larger, disk traffic will be dominated by
writes; thus, it is imperative to speed them up somehow.
A log-structured filesystem writes (almost) all information to the
disk---including both data blocks and indexing data structures---in
large, sequential, asynchronous bulk transfers (e.g., 1 MB at a time).
Write performance can therefore approach the raw maximum of the disk.
By caching indexing structures in memory (e.g., the inode map), read
performance can be kept similar to traditional Unix filesystems. A
side benefit is that the state of the world can be reconstructed very
rapidly (by reading just the tail of the log) following a crash. This
paper describes the design and implementation of the Sprite LFS,
a component of the Sprite network operating system.
The main design issue in implementing a log-structured filesystem is
deciding on an algorithm for cleaning (i.e., compacting) segments
of the disk to obtain fresh space that can be re-used for new writes.
The paper considers (and measures) a number of algorithms; the one
eventually chosen is a form of generational collection, where segments
are chosen to be cleaned based on a combination of their percentage of
free-able space (which can easily be determined by examining in-memory
structures) and the age of their most recently modified block.
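To make the selection rule concrete, here is a Python sketch of the
cost-benefit policy (the Segment record is invented; the ratio follows the
paper's benefit-to-cost formula, where u is the fraction of the segment that
is still live):

    # Clean the segments with the highest (1 - u) * age / (1 + u): lots of
    # reclaimable space, weighted by how long the data has been stable,
    # divided by the cost of reading the segment (1) and rewriting its
    # live data (u).
    from dataclasses import dataclass

    @dataclass
    class Segment:
        live_fraction: float       # u: fraction of the segment still live
        youngest_block_age: float  # age of the most recently modified block

    def cleaning_priority(seg):
        u = seg.live_fraction
        return (1.0 - u) * seg.youngest_block_age / (1.0 + u)

    def choose_segments_to_clean(segments, how_many):
        return sorted(segments, key=cleaning_priority, reverse=True)[:how_many]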
The paper is cogently written and contains a lot of nice observations,
measurements, and comparisons. Recommended.
Some questions for discussion:
- Segment cleaning clearly has some similarity with garbage
collection, and in particular the hot/cold segment cleaner can be
viewed as a kind of generational collector. But they are not exactly
the same: for the cleaning problem, we know where the live data
is, and the problem is just deciding which bits of it to tidy at any
given moment; in garbage collection, the location of the live data must
be discovered as part of the process. Even so, could segment cleaning
benefit from insights of the GC community?
- At the top of p. 12, the authors note that ``a newer version of
SunOS groups writes [McVoy and Kleiman] and should therefore have
performance equivalent to Sprite LFS.'' So what, then, are the
remaining benefits of the log-structured approach? Just fast crash
recovery?
- S. R. Kleiman. Vnodes: An Architecture for Multiple File System Types in Sun UNIX. In Proceedings of the 1986 Usenix Conference, June 1986, pp. 238--247.
- D. K. Gifford, P. Jouvelot, M. A. Sheldon, and J. W. O'Toole, Jr.
Semantic File Systems. In Proceedings of
the 13th ACM Symposium on Operating Systems Principles, October 1991.
- P. A. Bernstein, V. Hadzilacos, and N. Goodman.
Concurrency Control and Recovery in Database Systems. Pages 25--56 (serializability and 2PL).
- N. S. Barghouti and G. E. Kaiser. Sections 5 and 6 from
Concurrency Control in Advanced Database Applications. ACM Computing Surveys, 23(3), 269--317, 1991.
Supplemental readings:
- K. Thompson. UNIX Implementation.
Bell System Technical Journal, 57(6), 1978.
- L. McVoy and S. Kleiman.
Extent-like Performance from a Unix File System. In Proceedings of the
1991 Winter Usenix Conference, January 1991, pp. 33--44.
Summary: Sun's Unix File System (UFS) was closely based on the
Berkeley FFS and shares its performance characteristics. This paper
describes performance enhancements aimed at jobs performing large
sequential I/O. The main idea is to cluster both reads and
writes: when the higher-level FS routines notice that a file is being
read or written sequentially, they start reading ahead in large
blocks or delaying writes until large blocks have accumulated. The
new UFS can use about twice as much of the maximum disk bandwidth as
the original in the best case.
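A minimal Python sketch of the sequential-access detection this clustering
relies on (the per-file state and the thresholds are invented for
illustration; the real code lives in the buffer cache):

    # If each read starts exactly where the previous one ended, assume a
    # sequential scan and prefetch a whole cluster instead of one block.
    CLUSTER_SIZE = 56 * 1024   # illustrative cluster size

    class FileReadState:
        def __init__(self):
            self.next_expected = 0
            self.sequential_run = 0

        def note_read(self, offset, length):
            if offset == self.next_expected:
                self.sequential_run += 1
            else:
                self.sequential_run = 0
            self.next_expected = offset + length
            # True when read-ahead of the next cluster is worthwhile.
            return self.sequential_run >= 2

        def readahead_range(self):
            return (self.next_expected, self.next_expected + CLUSTER_SIZE)

Delayed-write clustering is the mirror image: dirty blocks of a sequentially
written file are held back until a cluster's worth has accumulated, then
issued as one large transfer.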
Some important caveats:
- Changing the order of writes is dangerous (this is one reason why
only sequential writes are clustered in the new UFS): many programs,
especially those manipulating directories, depend for consistency on
the fact that writes reach the disk in order.
- Clustering writes ``degrades'' the Unix file semantics, making it
difficult to know when a particular write has definitely reached stable
storage.
- The file system and VM in SunOS are tightly integrated. This is
good because it allows all of memory to be used for I/O buffering when
appropriate, but it means that, unless care is taken, a job doing a
huge write can lock down all pages in memory (by getting them into the
kernel disk write queue); this is prevented in UFS by making the
writing process do free-behind of the pages that it has been
using.
The writing in this paper is not overly clear. There are probably
better places to get information about ``post-FFS performance tuning.''
- M. Seltzer, K. Bostic, M. K. McKusick, and C. Staelin.
An Implementation of a Log-Structured File System for UNIX.
In Proceedings of the 1993 Winter Usenix Conference, January 1993.
- J. N. Matthews, D. Roselli, A. M. Costello, R. Y. Wang, and
T. E. Anderson. Improving the
Performance of Log-Structured File Systems with Adaptive Methods. In
Proceedings of the 16th ACM Symposium on Operating Systems Principles, October
1997.
- R. Y. Wang, T. E. Anderson, and D. A. Patterson.
Virtual Log Based File Systems for a
Programmable Disk. In Proceedings of the 3rd Symposium on Operating Systems
Design and Implementation, February 1999.
- D. S. H. Rosenthal. Evolving the Vnode Interface. Proceedings of the Summer USENIX Conference, pages 107--117, June 1990.
- J. S. Heidemann and G. J. Popek.
File System Development with Stackable
Layers. ACM Transactions on Computer Systems, 12(1):58--89, February 1994.
- B. Welch. A Comparison of Three Distributed
File System Architectures: Vnode, Sprite, and Plan 9. Computing Systems,
7(2):175--199, Spring 1994.
- G. Weikum and G. Vossen. Transactional Information Systems. Academic Press, 2002. Chapter 3 (pages 61--123).
Additional Resources:
2 Pessimistic Replication
Main readings:
- S. B. Davidson, H. Garcia-Molina, and D. Skeen.
Consistency in Partitioned
Networks, ACM Computing Surveys, 17(3), 1985.
- M. Satyanarayanan, ``Distributed File Systems.'' In Distributed Systems, Sape Mullender, ed. Addison Wesley, 1993.
- Russel Sandberg, David Goldberg, Steve Kleiman, Dan
Walsh and Bob
Lyon, "Design and Implementation of the Sun Network Filesystem," in Proc.
Summer 1985 USENIX Conf., pp. 119-130.
- B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira and M.
Williams, "Replication in the Harp
File System," Proceedings of the Thirteenth Symposium on Operating
Systems Principles, ACM, October 1991, pp. 198-212.
- M. N. Nelson, B. B. Welch and J. K. Ousterhout,
Caching in the Sprite Network File
System, ACM Transactions on Computer Systems, Vol. 6, No. 1, February
1988, pp. 134-154.
Supplemental readings:
- M. Satyanarayanan, A Survey of
Distributed File Systems. Technical Report, Department of Computer
Science, Carnegie-Mellon University, Number CMU-CS-89-116, February 1989.
Also appeared in Annu. Rev. Comput. Sci., volume 4, 1990, pages 73--104.
Additional resources:
- A. Muthitacharoen, B. Chen, and D. Mazières,
"A Low-Bandwidth Network File System", Proceedings
of the 18th ACM Symposium on Operating Systems Principles (SOSP'01),
October 21-24, 2001.
Summary: Users rarely consider running network file systems over slow or
wide-area networks, as the performance would be unacceptable and the
bandwidth consumption too high. Nonetheless, efficient remote file
access would often be desirable over such networks, particularly when
high latency makes remote login sessions unresponsive. Rather than run
interactive programs such as editors remotely, users could run the
programs locally and manipulate remote files through the file system.
To do so, however, would require a network file system that consumes
less bandwidth than most current file systems.
This paper presents LBFS, a network file system designed for
low-bandwidth networks. LBFS exploits similarities between files or
versions of the same file to save bandwidth. It avoids sending data
over the network when the same data can already be found in the
server's file system or the client's cache. Using this technique in
conjunction with conventional compression and caching, LBFS consumes
over an order of magnitude less bandwidth than traditional network file
systems on common workloads.
To [avoid most] transfers, LBFS relies on the collision-resistant
properties of the SHA-1 [6] hash function. The probability of two
inputs to SHA-1 producing the same output is far lower than the
probability of hardware bit errors.
In order to use chunks from multiple files on the recipient, LBFS takes
a different approach from that of rsync. It considers only
non-overlapping chunks of files and avoids sensitivity to shifting
file offsets by setting chunk boundaries based on file contents, rather
than on position within a file. Insertions and deletions therefore only
affect the surrounding chunks. Similar techniques have been used
successfully in the past to segment files for the purpose of detecting
unauthorized copying [3]. To divide a file into chunks, LBFS examines
every (overlapping) 48-byte region of the file and with probability
2^-13 over each region's contents considers it to be the end of a
data chunk. LBFS selects these boundary regions, called breakpoints,
using Rabin fingerprints [19]. A Rabin fingerprint is the polynomial
representation of the data modulo a predetermined irreducible
polynomial. We chose fingerprints because they are efficient to compute
on a sliding window in a file. When the low-order 13 bits of a region's
fingerprint equal a chosen value, the region constitutes a
breakpoint. Assuming random data, the expected chunk size is 2^13 =
8192 = 8 KBytes (plus the size of the 48-byte breakpoint window).
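The chunking scheme is easy to sketch in Python. Below, a simple polynomial
rolling hash stands in for the Rabin fingerprint (the 48-byte window and the
13-bit mask come from the description above; the hash, modulus, and
breakpoint value are placeholders):

    # Content-defined chunking, LBFS style: slide a 48-byte window over the
    # data and end a chunk whenever the low-order 13 bits of the window's
    # fingerprint equal a chosen value, so boundaries depend on content
    # rather than on byte offsets.
    WINDOW = 48
    MASK = (1 << 13) - 1        # expected chunk size ~ 2^13 = 8 KB
    MAGIC = 0                   # the "chosen value" for a breakpoint
    BASE, MOD = 257, (1 << 61) - 1

    def chunk_boundaries(data):
        """Yield end offsets of content-defined chunks in data (bytes)."""
        if len(data) < WINDOW:
            yield len(data)
            return
        top = pow(BASE, WINDOW - 1, MOD)
        h = 0
        for b in data[:WINDOW]:
            h = (h * BASE + b) % MOD
        last = 0
        for i in range(WINDOW, len(data) + 1):
            if (h & MASK) == MAGIC:       # breakpoint: end the chunk at i
                yield i
                last = i
            if i == len(data):
                break
            # Roll the window: drop data[i - WINDOW], add data[i].
            h = ((h - data[i - WINDOW] * top) * BASE + data[i]) % MOD
        if last < len(data):
            yield len(data)

Each chunk would then be named by its SHA-1 hash; a chunk whose hash the
server (or client) already holds never needs to cross the network.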
3 Optimistic Replication Schemes
Main readings:
- J. J. Kistler and M. Satyanarayanan,
Disconnected Operation in the Coda
File System, ACM Transactions on Computer Systems, Vol. 10, No. 1,
February 1992, pp. 3-25.
- Bruce Walker, Gerald Popek, Robert English, Charles
Kline, Greg Thiel, The LOCUS Distributed
Operating System, 9th Symposium on Operating Systems Principles
(SOSP), Bretton Woods, New Hampshire, November 1983, pp. 49-70.
- Alan J. Demers, Daniel H. Greene, Carl Hauser, Wes
Irish, John Larson, et al. Epidemic algorithms
for replicated database maintenance. Proceedings of the Sixth Annual
ACM Symposium on Principles of Distributed Computing (PODC), August 10--12,
1987, Vancouver, Canada. Pages 1--12.
- Michael J. Fischer and Alan Michael.
Sacrificing serializability to
attain high availability of data in an unreliable network.
Proceedings of the ACM Symposium on Principles of Database Systems, March
29-31, 1982, Los Angeles, California, pp. 70-75.
- R. G. Guy, G. J. Popek, and T. W. Page, Jr.,
Consistency Algorithms for Optimistic Replication. Proceedings of the First International Conference on Network Protocols.
IEEE, October 1993.
- Maurice Herlihy,
Apologizing vs. asking
permission: optimistic concurrency control for abstract data types.
TODS 15(1), 96--124, 1990.
Supplemental readings:
- E. T. Mueller, J. D. Moore, and G. J. Popek,
A Nested Transaction Mechanism
for LOCUS, 9th Symposium on Operating Systems Principles (SOSP),
Bretton Woods, New Hampshire, November 1983, pp. 71-89.
- R. G. Guy, J. S. Heidemann, W.-K. Mak, T. W. Page Jr.,
G. J. Popek, D. Rothmeier, Implementation
of the Ficus Replicated File System, USENIX Summer 1990, 63-72.
- T. W. Page Jr., R. G. Guy, J. S. Heidemann, D. Ratner,
P. L. Reiher, A. Goel, G. H. Kuenning, G. J. Popek,
Perspectives on Optimistically Replicated,
Peer-to-Peer Filing. Software - Practice and Experience 28(2): 155-180
(1998).
Additional Resources:
4 Authentication, privacy, and resource control
Main readings:
- Mariposa: A Wide-Area
Distributed Database System, Michael Stonebraker, Paul M. Aoki, Avi
Pfeffer, Adam Sah, Jeff Sidell, Carl Staelin, and Andrew Yu. VLDB
Journal 5, 1 (Jan. 1996), pp. 48--63.
- I. Clarke, O. Sandberg, B. Wiley, and T.W. Hong,
Freenet:
A Distributed Anonymous Information Storage and Retrieval System, in
Workshop on Design Issues in Anonymity and Unobservability, Berkeley,
California. Springer: New York (2001).
(See also the
Freenet homepage.)
- J. Ioannidis, S. Ioannidis, A. Keromytis, V. Prevelakis,
``Fileteller: Paying and getting paid for file
storage.''
Summary:
Fileteller is a credential-based network file storage system with
provisions for paying for file storage and getting paid when others
access files. Users get access to arbitrary amounts of storage anywhere
in the network, and use a micropayments system to pay for both the
initial creation of the file and any subsequent accesses. Wide-scale
information sharing requires that a number of issues be addressed; these
include distributed access, access control, payment, accounting, and
delegation (so that information owners may allow others to access their
stored content). In this paper we demonstrate how all these issues are
addressed using a micropayment architecture based on a trust-management
system. Utilizing the same mechanism for both access control and
payment results in an elegant and scalable architecture.
Some questions:
- As described here, Fileteller provides storage for individual
files, with no special provisions for handling directory structures.
What extra issues arise (e.g., with access permissions)?
- A key advantage of Fileteller is the capability to delegate access
permissions from one user to another. But some details of how this works
were not clear from the example---seeing this in a bit more depth would
be interesting.
- S. Miltchev, V. Prevelakis, S. Ioannidis, J. Smith, A. Keromytis,
``Secure and Flexible Global File Sharing.''
Supplemental readings:
5 Synchronization Technologies
Main readings:
- Sundar Balasubramaniam and Benjamin C. Pierce,
What is a
file synchronizer? Fourth Annual ACM/IEEE International Conference on
Mobile Computing and Networking (MobiCom '98).
- W. Keith Edwards, Elizabeth D. Mynatt, Karin Petersen, Mike J.
Spreitzer, Douglas B. Terry, and Marvin Theimer.
Designing and Implementing
Asynchronous Collaborative Applications with Bayou. ACM Symposium on
User Interface Software and Technology, 1997.
- Anne-Marie Kermarrec, Anthony Rowstron, Marc
Shapiro, and Peter Druschel. The
IceCube approach to the reconciliation of divergent replicas, PODC,
2001.
Summary:
We describe a novel approach to log-based reconciliation called
IceCube. It is general and is parameterised by application and object
semantics. IceCube considers more flexible orderings and is designed to
ease the burden of reconciliation on the application
programmers. IceCube captures the static and dynamic reconciliation
constraints between all pairs of actions, proposes schedules that satisfy
the static constraints, and validates them against the dynamic
constraints.
Preliminary experience indicates that strong static constraints
successfully contain the potential combinatorial explosion of the
simulation stage. With weaker static constraints, the system still finds
good solutions in a reasonable time. ....
IceCube is log-based: the input to the reconciler is a common initial
state and logs of actions that were performed on each replica. The
reconciler merges the logs in some fashion, and replays the operations in
the merged log against the initial state, yielding a reconciled, common
final state. Logs provide the reconciler with the history of user
actions; thus it can infer information about the users' intents and is
therefore potentially more powerful.
In previous systems [4, 5, 10] the logs are merged according to some
predetermined order, such as temporal order. These systems cannot exploit
cases where a reordering of operations would avoid a conflict. IceCube
attempts to find an ordering that minimizes conflicts, while observing
object and application semantics and user intent. However, naively
exploring all possible orderings suffers from a combinatorial explosion.
Effectively pruning the space of orderings is one of the key design
issues in IceCube.
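A greatly simplified Python sketch of this style of reconciliation (the
Action interface, the greedy strategy, and the names are invented; the real
system searches the space of schedules far more cleverly):

    # Merge replicas' logs into one schedule: respect static "a must precede
    # b" constraints, replay against the common initial state, and drop any
    # action whose precondition (dynamic constraint) fails at its turn.
    def reconcile(initial_state, actions, must_precede):
        state = initial_state
        schedule, dropped = [], []
        remaining = list(actions)
        while remaining:
            ready = [a for a in remaining
                     if not any((b, a) in must_precede
                                for b in remaining if b is not a)]
            if not ready:                  # constraint cycle: give up
                dropped.extend(remaining)
                break
            a = ready[0]                   # a real reconciler would search here
            remaining.remove(a)
            if a.precondition(state):
                state = a.apply(state)
                schedule.append(a)
            else:
                dropped.append(a)          # conflicting action
        return schedule, dropped, state

A smarter reconciler would try different choices among the ready actions (and
different relative orders of independent ones) to minimize the number of
dropped actions.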
Some questions:
- Why is it ``obvious that state-based reconciliation will not work''
with the example in Section 2? As long as we're taking application
semantics into account in determining possible orderings of actions, why
not go a bit further and use application semantics to suggest ways of
repairing conflicts? (Cf. Bayou.)
- Why have both preconditions and postconditions on operations? They
seem redundant, from the point of view of the reconciler.
- Actions on independent target objects are always considered safe to
commute. But isn't it easy to think of examples where this would not be
true? (E.g., note that the example at the end of section 2 depends on an
ordering constraint between writing a file and deleting the file's parent
directory!)
- In 3.1, why would we expect that the independence relation I might
be reflexive or transitive? (I.e., why is it worth remarking that it is
not?)
- What does it mean to accept a cutset (3.2)? (Ah: this is answered
later on. IceCube proposes a number of cutsets to the application, which
responds with a subset that it considers ``acceptable.'')
- Jonathan P. Munson and Prasun Dewan.
A flexible object merging framework.
Proceedings of the conference on Computer supported cooperative work,
Chapel Hill, North Carolina, 1994.
Supplemental readings:
- Norman Ramsey and Elod Csirmaz,
An Algebraic Approach to File
Synchronization. Foundations of Software Engineering, 175--185,
September 2001.
- Trevor Jim, Benjamin Pierce, and Jerome Vouillon,
How to Build a File Synchronizer.
- Douglas B. Terry, Marvin Theimer, Karin Petersen, Alan J. Demers,
Mike Spreitzer, and Carl Hauser.
Managing Update Conflicts in
Bayou, a Weakly Connected Replicated Storage System. SOSP 1995, pp.
172-183.
- D. B. Terry, A. J. Demers, K. Petersen, M. J. Spreitzer, M. M.
Theimer, and B. B. Welch. Session
Guarantees for Weakly Consistent Replicated Data. Proceedings
International Conference on Parallel and Distributed Information Systems
(PDIS), Austin, Texas, September 1994, pages 140-149.
- Shirish H. Phatak and B.R. Badrinath,
Conflict Resolution and Reconciliation in
Disconnected Databases, Proc. of Mobility in Databases and Distributed
Systems (MDDS), Florence, Italy, September 1999.
- Peter Reiher, John Heidemann, David Ratner, Greg Skinner, and
Gerald Popek.
Resolving File
Conflicts in the Ficus File System. Usenix summer conference, 1994.
- Puneet Kumar and M. Satyanarayanan.
Flexible and Safe Resolution of File
Conflicts. Usenix winter conference, 1995.
Additional Resources:
6 Algorithmic Underpinnings
Main readings:
- Eugene W. Myers. An O(ND) Difference Algorithm and Its Variations.
Algorithmica, 1(2):251--266, 1986.
- Tancred Lindholm. A 3-way Merging
Algorithm for Synchronizing Ordered Trees: The 3DM merging and
differencing tool for XML. Master's thesis, Helsinki University of
Technology, 2001.
Summary:
Abstract: Keeping data synchronized across a variety of devices and
environments is becoming more and more important as the number of
computing devices per user increases. Of particular interest are
situations when the sets of data that need to be synchronized have
structure, but not the exact same structure. In the thesis we approach
these situations through a number of use cases, which are set in a
future, more ubiquitous, computing environment. The use cases are
subsequently analyzed, and in combination with the characteristics of a
ubiquitous computing environment used to derive requirements for a
synchronization tool. Although the focus is on future computing
environments, we find that a synchronization tool fulfilling the
requirements is well suited for present-day synchronization tasks as
well. We find that the requirements call for a tool capable of
performing a 3-way merge of general ordered trees without any additional
tree metadata, such as edit histories or unique node identifiers, that
describe how the trees participating in the merge are related. The main
research problem of the thesis is to design such an algorithm, given that
no suitable algorithms exist. The design of the algorithm is preceded by
stating a definition of desired merging behaviour derived from the use
cases as well as a relatively large number of small hand-written merging
examples. Further on, the design task is divided into the design of a
tree matching algorithm, and an algorithm for merging matched trees. The
matching algorithm also gives us the ability to easily find the
difference between ordered trees. The designed merging, matching, and
differencing algorithms for ordered trees are implemented as a merging
and differencing tool for XML, and their functionality is verified
against the use cases as well as the merging examples. The evaluation of
the algorithms shows promising results regarding the applicability of the
tool to real-world situations.
Comments:
- The use cases here are unusually... useful. Rather than one or two
suggestive examples, Lindholm (with colleagues) has actually developed
five or six realistic cases in considerable detail, as well as
cataloguing 40 or so smaller examples.
- Very useful review of the existing literature on tree differencing
and merging algorithms.
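For reference, the core decision rule of any 3-way merge is tiny once a
matching is in hand; the hard part the thesis tackles is matching the nodes
of two ordered trees against a common base without edit histories or node
IDs. A generic Python sketch of the per-value rule (this is not Lindholm's
algorithm, just the standard rule it must generalize):

    # Three-way merge of a single matched value: keep a side's change when
    # only that side diverged from the base; conflict when both diverged
    # differently.
    def merge_value(base, left, right):
        if left == right:
            return left, False      # both agree (possibly both changed)
        if left == base:
            return right, False     # only the right replica changed it
        if right == base:
            return left, False      # only the left replica changed it
        return None, True           # both changed it differently: conflict

    # merge_value("a", "a", "b") -> ("b", False)
    # merge_value("a", "b", "c") -> (None, True)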
Supplemental readings:
- J. W. Hunt and M. D. McIlroy. An Algorithm for Differential File
Comparison. Bell Laboratories, N.J., Computing Science Technical Report
No. 41, 1975.
- W. Miller and E. W. Myers. A file comparison
program. Software: Practice and Experience 15, 11 (1985), 1025--1040.
- D. S. Hirschberg. Algorithms for the longest common subsequence
problem. Journal of the ACM, 24:664-675, 1977.
- Neuwirth, C. M., Chandhok, R., Kaufer, D. S., Erion, P., Morris,
J., and Miller, D. (1992). Flexible
diff-ing in a collaborative writing system. In Proceedings of the
Fourth Conference on Computer-Supported Cooperative Work (CSCW '92) (pp.
147-154). Baltimore, MD: Association for Computing Machinery. Reprinted
in R. Rada (Ed.) (1996). Groupware and authoring (pp. 189-204).
- Sudarshan S. Chawathe,
Comparing Hierarchical Data in
External Memory. VLDB, 1999.
Summary: Generalizes the longest common
subsequence difference model to tree structures.
- Yuan Wang, David J. DeWitt, and Jin-Yi Cai.
X-Diff: A Fast Change Detection Algorithm for
XML Documents.
Summary: Over the next several years XML is
likely to replace HTML as the standard web publishing language and data
transportation format. Since online information changes frequently,
being able to quickly detect changes in XML documents is important to
Internet query systems, search engines, and continuous query
systems. Previous work in change detection on XML or other
hierarchically structured documents used an ordered tree model, in
which left-to-right order among siblings is important and it affects
the change result. In this paper, we argue that an unordered model
(only ancestor relationships are significant) is more suitable for most
database applications. Using an unordered model, change detection is
substantially harder than using an ordered model, but the change result
that it generates is more accurate. We propose X-Diff, a fast algorithm
that integrates key XML structure characteristics with standard
tree-to-tree correction techniques. We also analyze the algorithm and
study its performance.
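One building block such an unordered comparison can use is a subtree
signature that ignores sibling order, so identical subtrees can be spotted
cheaply; a minimal Python sketch (illustrative only, not X-Diff's actual
XHash):

    # Hash each node's label together with the *sorted* multiset of its
    # children's signatures, so reordering siblings does not change the
    # result; trees are (label, [children]) tuples here.
    import hashlib

    def sign(node):
        label, children = node
        h = hashlib.sha1()
        h.update(label.encode("utf-8"))
        for sig in sorted(sign(c) for c in children):  # ignore sibling order
            h.update(sig)
        return h.digest()

    # Two subtrees that differ only in child order get the same signature:
    assert sign(("book", [("title", []), ("author", [])])) == \
           sign(("book", [("author", []), ("title", [])]))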
Additional Resources: