1 Filesystems and Transactions
Main readings:
- Hennessy and Patterson. Computer Architecture: A
Quantitative Approach. Pages 485--493 (section on disk storage).
- M. K. McKusick, W. N. Joy, S. J. Leffler, and
R. S. Fabry.
A Fast File System for UNIX. ACM Transactions on
Computer Systems, Vol. 2, No. 3, August 1984, pp. 181--197.
Summary:
The Berkeley FFS is a reimplementation of the original Bell Labs Unix
filesystem, retaining the original API but providing much higher
throughput. The main sources of speedup are:
- The old FS used 1024-byte blocks; the new one uses (typically) 8K
blocks so that data is transferred to/from the disk in larger chunks. To
avoid excessive space wastage for small files, blocks can be subdivided
into sector-sized (e.g. 512-byte) fragments, which can be allocated
to separate files (subject to some alignment constraints).
- The old FS kept all inode information in a single area of (each
partition of) the disk. The FFS structures the disk into a collection of
cylinder groups, each with its own space for storing inodes (up to
2048 per cylinder group).
- The old FS used a free list for keeping track of allocatable
disk blocks. The FFS uses a bitmap for each cylinder group, allowing
allocation of contiguous (actually, rotationally optimal) blocks to
large files.
Inodes for files in the same directory are allocated (if possible) in
the same cylinder group. However, different directories are placed in
different cylinder groups, to try to ensure that the amount of free space
in the cylinder groups remains approximately balanced, allowing good
allocation decisions for future writes.
The new allocation policies are effective only when the disk is run less
than completely full (e.g., below 90% of capacity). Taking this reserved
space into account, as well as the extra space required for cylinder group
bitmaps, etc., and offsetting it by the benefits of sub-block-level
allocation of small files, the space usage of FFS is similar to that of
the old Unix FS.
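A rough illustration of the directory-placement heuristic described above, as
a Python sketch (the CylinderGroup record and the exact selection rule are
invented for illustration; they are not FFS's actual data structures):

    # Sketch of an FFS-style placement policy for a new directory: pick a
    # cylinder group with at least the average amount of free space and
    # relatively few directories already in it, so free space stays balanced.
    from dataclasses import dataclass

    @dataclass
    class CylinderGroup:
        free_blocks: int   # free data blocks in this group
        free_inodes: int   # unused inodes in this group
        ndirs: int         # directories already placed here

    def pick_group_for_new_directory(groups):
        avg_free = sum(g.free_blocks for g in groups) / len(groups)
        candidates = [g for g in groups
                      if g.free_blocks >= avg_free and g.free_inodes > 0]
        if not candidates:
            candidates = [g for g in groups if g.free_inodes > 0]
        # Prefer the group with the fewest existing directories.
        return min(candidates, key=lambda g: g.ndirs)

Files created inside the new directory would then have their inodes (and,
when possible, their first data blocks) allocated in the same group.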
The FFS can use up to half the maximum disk transfer rate (i.e., the max
rate at which data can be streamed onto or off of the disk by the raw
hardware); the old FS, by comparison, could only get 2-3% utilization.
The paper also discusses some functional enhancements to the original FS,
including quotas, advisory locks, symlinks, and an atomic rename
primitive.
Some questions for discussion:
- This paper was published in 1984; since that time, many performance
parameters of disks have changed enormously. How many of the
calculations remain valid (mutatis mutandis), and which require serious
rethinking? Have any of the conclusions drawn from these calculations
become misleading or downright bogus?
- Disk transfer performance is regarded here as a distinct, isolable
value. But in many systems, the same hardware resources (disk, memory,
and processor) are shared between a variety of tasks (in particular, file
I/O and virtual memory). How does the story need to change when we take
these other parts of the picture into account?
- Everybody talks about the ``Unix Filesystem Semantics''; what is
it, exactly? (For example, is the expectation that writes to directory
nodes will not be reordered part of the semantics of the filesystem?
What about writes to blocks within files? What are 'reasonable
expectations' about the permanence of writes after crashes?)
- M. Rosenblum and J. K. Ousterhout.
The Design and Implementation of a
Log-Structured File System. ACM Transactions on Computer Systems,
Vol. 10, No. 1, February 1992, pp. 26--52.
Summary: Traditional Unix filesystems like FFS can use large disk caches
to increase the efficiency of reads, but still perform writes
synchronously. Especially for metadata, this can involve several seeks
per block of data transferred; this limits effective disk utilization
to a small percentage of the theoretical maximum. As memory becomes
cheaper and disk caches get larger, disk traffic will be dominated by
writes; thus, it is imperative to speed them up somehow.
A log-structured filesystem writes (almost) all information to the
disk---including both data blocks and indexing data structures---in
large, sequential, asynchronous bulk transfers (e.g., 1 MB at a time).
Write performance can therefore approach the raw maximum of the disk.
By caching indexing structures in memory (e.g., the inode map), read
performance can be kept similar to traditional Unix filesystems. A
side benefit is that the state of the world can be reconstructed very
rapidly (by reading just the tail of the log) following a crash. This
paper describes the design and implementation of the Sprite LFS,
a component of the Sprite network operating system.
The main design issue in implementing a log-structured filesystem is
deciding on an algorithm for cleaning (i.e., compacting) segments
of the disk to obtain fresh space that can be re-used for new writes.
The paper considers (and measures) a number of algorithms; the one
eventually chosen is a form of generational collection, where segments
are chosen to be cleaned based on a combination of their percentage of
free-able space (which can easily be determined by examining in-memory
structures) and the age of their most recently modified block.
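To make the selection rule concrete, here is a Python sketch of the
cost-benefit policy (the Segment record is invented; the ratio follows the
paper's benefit-to-cost formula, where u is the fraction of the segment that
is still live):

    # Clean the segments with the highest (1 - u) * age / (1 + u): lots of
    # reclaimable space, weighted by how long the data has been stable,
    # divided by the cost of reading the segment (1) and rewriting its
    # live data (u).
    from dataclasses import dataclass

    @dataclass
    class Segment:
        live_fraction: float       # u: fraction of the segment still live
        youngest_block_age: float  # age of the most recently modified block

    def cleaning_priority(seg):
        u = seg.live_fraction
        return (1.0 - u) * seg.youngest_block_age / (1.0 + u)

    def choose_segments_to_clean(segments, how_many):
        return sorted(segments, key=cleaning_priority, reverse=True)[:how_many]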
The paper is cogently written and contains a lot of nice observations,
measurements, and comparisons. Recommended.
Some questions for discussion:
- Segment cleaning clearly has some similarity with garbage
collection, and in particular the hot/cold segment cleaner can be
viewed as a kind of generational collector. But they are not exactly
the same: for the cleaning problem, we know where the live data
is, and the problem is just deciding which bits of it to tidy at any
given moment; in garbage collection, the location of the live data must
be discovered as part of the process. Even so, could segment cleaning
benefit from insights of the GC community?
- At the top of p. 12, the authors note that ``a newer version of
SunOS groups writes [McVoy and Kleiman] and should therefore have
performance equivalent to Sprite LFS.'' So what, then, are the
remaining benefits of the log-structured approach? Just fast crash
recovery?
- S. R. Kleiman. Vnodes: An Architecture for Multiple File System Types in Sun UNIX. In Proceedings of the 1986 Usenix Conference, June 1986, pp. 238--247.
- D. K. Gifford, P. Jouvelot, M. A. Sheldon, and J. W. O'Toole, Jr.
Semantic File Systems. In Proceedings of
the 13th ACM Symposium on Operating Systems Principles, October 1991.
- P. A. Bernstein, V. Hadzilacos, and N. Goodman.
Concurrency Control and Recovery in Database Systems. Pages 25--56 (serializability and 2PL).
- N. S. Barghouti and G. E. Kaiser. Sections 5 and 6 from
Concurrency Control in Advanced Database Applications. ACM Computing Surveys, 23(3), 269--317, 1991.
Supplemental readings:
- K. Thompson. UNIX Implementation.
Bell System Technical Journal, 57(6), 1978.
- L. McVoy and S. Kleiman.
Extent-like Performance from a Unix File System. In Proceedings of the
1991 Winter Usenix Conference, January 1991, pp. 33--44.
Summary: Sun's Unix File System (UFS) was closely based on the
Berkeley FFS and shares its performance characteristics. This paper
describes performance enhancements aimed at jobs performing large
sequential I/O. The main idea is to cluster both reads and
writes: when the higher-level FS routines notice that a file is being
read or written sequentially, they start reading ahead in large
blocks or delaying writes until large blocks have accumulated. The
new UFS can use about twice as much of the maximum disk bandwidth as
the original in the best case.
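A minimal Python sketch of the sequential-access detection this clustering
relies on (the per-file state and the thresholds are invented for
illustration; the real code lives in the buffer cache):

    # If each read starts exactly where the previous one ended, assume a
    # sequential scan and prefetch a whole cluster instead of one block.
    CLUSTER_SIZE = 56 * 1024   # illustrative cluster size

    class FileReadState:
        def __init__(self):
            self.next_expected = 0
            self.sequential_run = 0

        def note_read(self, offset, length):
            if offset == self.next_expected:
                self.sequential_run += 1
            else:
                self.sequential_run = 0
            self.next_expected = offset + length
            # True when read-ahead of the next cluster is worthwhile.
            return self.sequential_run >= 2

        def readahead_range(self):
            return (self.next_expected, self.next_expected + CLUSTER_SIZE)

Delayed-write clustering is the mirror image: dirty blocks of a sequentially
written file are held back until a cluster's worth has accumulated, then
issued as one large transfer.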
Some important caveats:
- Changing the order of writes is dangerous (this is one reason why
only sequential writes are clustered in the new UFS): many programs,
especially those manipulating directories, depend for consistency on
the fact that writes reach the disk in order.
- Clustering writes ``degrades'' the Unix file semantics, making it
difficult to know when a particular write has definitely reached stable
storage.
- The file system and VM in SunOS are tightly integrated. This is
good because it allows all of memory to be used for I/O buffering when
appropriate, but it means that, unless care is taken, a job doing a
huge write can lock down all pages in memory (by getting them into the
kernel disk write queue); this is prevented in UFS by making the
writing process do free-behind of the pages that it has been
using.
The writing in this paper is not overly clear. There are probably
better places to get information about ``post-FFS performance tuning.''
- M. Seltzer, K. Bostic, M. K. McKusick, and C. Staelin.
An Implementation of a Log-Structured File System for UNIX.
In Proceedings of the 1993 Winter Usenix Conference, January 1993.
- J. N. Matthews, D. Roselli, A. M. Costello, R. Y. Wang, and
T. E. Anderson. Improving the
Performance of Log-Structured File Systems with Adaptive Methods. In
Proceedings of the 16th ACM Symposium on Operating Systems Principles, October
1997.
- R. Y. Wang, T. E. Anderson, and D. A. Patterson.
Virtual Log Based File Systems for a
Programmable Disk. In Proceedings of the 3rd Symposium on Operating Systems
Design and Implementation, February 1999.
- D. S. H. Rosenthal. Evolving the Vnode Interface. Proceedings of the Summer USENIX Conference, pages 107--117, June 1990.
- J. S. Heidemann and G. J. Popek.
File System Development with Stackable
Layers. ACM Transactions on Computer Systems, 12(1):58--89, February 1994.
- B. Welch. A Comparison of Three Distributed
File System Architectures: Vnode, Sprite, and Plan 9. Computing Systems,
7(2):175--199, Spring 1994.
- G. Weikum and G. Vossen. Transactional Information Systems. Academic Press, 2002. Chapter 3 (pages 61--123).
Additional Resources:
2 Pessimistic Replication
Main readings:
- S. B. Davidson, H. Garcia-Molina, and D. Skeen.
Consistency in Partitioned
Networks, ACM Computing Surveys, 17(3), 1985.
- M. Satyanarayanan, ``Distributed File Systems.'' In Distributed Systems, Sape Mullender, ed. Addison Wesley, 1993.
- Russel Sandberg, David Goldberg, Steve Kleiman, Dan
Walsh and Bob
Lyon, "Design and Implementation of the Sun Network Filesystem," in Proc.
Summer 1985 USENIX Conf., pp. 119-130.
- B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira and M.
Williams, "Replication in the Harp
File System," Proceedings of the Thirteenth Symposium on Operating
Systems Principles, ACM, October 1991, pp. 198-212.
- M. N. Nelson, B. B. Welch and J. K. Ousterhout,
Caching in the Sprite Network File
System, ACM Transactions on Computer Systems, Vol. 6, No. 1, February
1988, pp. 134-154.
Supplemental readings:
- M. Satyanarayanan, A Survey of
Distributed File Systems. Technical Report, Department of Computer
Science, Carnegie-Mellon University, Number CMU-CS-89-116, February 1989.
Also appeared in Annu. Rev. Comput. Sci., volume 4, 1990, pages 73--104.
Additional resources:
- A. Muthitacharoen, B. Chen, and D. Mazières,
"A Low-Bandwidth Network File System", Proceedings
of the 18th ACM Symposium on Operating Systems Principles (SOSP'01),
October 21-24, 2001.
Summary: Users rarely consider running network file systems over slow or
wide-area networks, as the performance would be unacceptable and the
bandwidth consumption too high. Nonetheless, efficient remote file
access would often be desirable over such networks, particularly when
high latency makes remote login sessions unresponsive. Rather than run
interactive programs such as editors remotely, users could run the
programs locally and manipulate remote files through the file system.
To do so, however, would require a network file system that consumes
less bandwidth than most current file systems.
This paper presents LBFS, a network file system designed for
low-bandwidth networks. LBFS exploits similarities between files or
versions of the same file to save bandwidth. It avoids sending data
over the network when the same data can already be found in the
server's file system or the client's cache. Using this technique in
conjunction with conventional compression and caching, LBFS consumes
over an order of magnitude less bandwidth than traditional network file
systems on common workloads.
To [avoid most] transfers, LBFS relies on the collision-resistant
properties of the SHA-1 [6] hash function. The probability of two
inputs to SHA-1 producing the same output is far lower than the
probability of hardware bit errors.
In order to use chunks from multiple files on the recipient, LBFS takes
a different approach from that of rsync. It considers only
non-overlapping chunks of files and avoids sensitivity to shifting
file offsets by setting chunk boundaries based on file contents, rather
than on position within a file. Insertions and deletions therefore only
affect the surrounding chunks. Similar techniques have been used
successfully in the past to segment files for the purpose of detecting
unauthorized copying [3]. To divide a file into chunks, LBFS examines
every (overlapping) 48-byte region of the file and with probability
2^-13 over each region's contents considers it to be the end of a
data chunk. LBFS selects these boundary regions, called breakpoints,
using Rabin fingerprints [19]. A Rabin fingerprint is the polynomial
representation of the data modulo a predetermined irreducible
polynomial. We chose fingerprints because they are efficient to compute
on a sliding window in a file. When the low-order 13 bits of a region's
fingerprint equal a chosen value, the region constitutes a
breakpoint. Assuming random data, the expected chunk size is 2^13 =
8192 = 8 KBytes (plus the size of the 48-byte breakpoint window).
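The chunking scheme is easy to sketch in Python. Below, a simple polynomial
rolling hash stands in for the Rabin fingerprint (the 48-byte window and the
13-bit mask come from the description above; the hash, modulus, and
breakpoint value are placeholders):

    # Content-defined chunking, LBFS style: slide a 48-byte window over the
    # data and end a chunk whenever the low-order 13 bits of the window's
    # fingerprint equal a chosen value, so boundaries depend on content
    # rather than on byte offsets.
    WINDOW = 48
    MASK = (1 << 13) - 1        # expected chunk size ~ 2^13 = 8 KB
    MAGIC = 0                   # the "chosen value" for a breakpoint
    BASE, MOD = 257, (1 << 61) - 1

    def chunk_boundaries(data):
        """Yield end offsets of content-defined chunks in data (bytes)."""
        if len(data) < WINDOW:
            yield len(data)
            return
        top = pow(BASE, WINDOW - 1, MOD)
        h = 0
        for b in data[:WINDOW]:
            h = (h * BASE + b) % MOD
        last = 0
        for i in range(WINDOW, len(data) + 1):
            if (h & MASK) == MAGIC:       # breakpoint: end the chunk at i
                yield i
                last = i
            if i == len(data):
                break
            # Roll the window: drop data[i - WINDOW], add data[i].
            h = ((h - data[i - WINDOW] * top) * BASE + data[i]) % MOD
        if last < len(data):
            yield len(data)

Each chunk would then be named by its SHA-1 hash; a chunk whose hash the
server (or client) already holds never needs to cross the network.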
3 Optimistic Replication Schemes
Main readings:
- J. J. Kistler and M. Satyanarayanan,
Disconnected Operation in the Coda
File System, ACM Transactions on Computer Systems, Vol. 10, No. 1,
February 1992, pp. 3-25.
- Bruce Walker, Gerald Popek, Robert English, Charles
Kline, Greg Thiel, The LOCUS Distributed
Operating System, 9th Symposium on Operating Systems Principles
(SOSP), Bretton Woods, New Hampshire, November 1983, pp. 49-70.
- Alan J. Demers, Daniel H. Greene, Carl Hauser, Wes
Irish, John Larson, et al. Epidemic algorithms
for replicated database maintenance. Proceedings of the Sixth Annual
ACM Symposium on Principles of Distributed Computing (PODC), August 10--12,
1987, Vancouver, Canada. Pages 1--12.
- Michael J. Fischer and Alan Michael.
Sacrificing serializability to
attain high availability of data in an unreliable network.
Proceedings of the ACM Symposium on Principles of Database Systems, March
29-31, 1982, Los Angeles, California, pp. 70-75.
- R. G. Guy, G. J. Popek, and T. W. Page, Jr.,
Consistency Algorithms for Optimistic Replication. Proceedings of the First International Conference on Network Protocols.
IEEE, October 1993.
- Maurice Herlihy,
Apologizing vs. asking
permission: optimistic concurrency control for abstract data types.
TODS 15(1), 96--124, 1990.
Supplemental readings:
- E. T. Mueller, J. D. Moore, and G. J. Popek,
A Nested Transaction Mechanism
for LOCUS, 9th Symposium on Operating Systems Principles (SOSP),
Bretton Woods, New Hampshire, November 1983, pp. 71-89.
- R. G. Guy, J. S. Heidemann, W.-K. Mak, T. W. Page Jr.,
G. J. Popek, D. Rothmeier, Implementation
of the Ficus Replicated File System, USENIX Summer 1990, 63-72.
- T. W. Page Jr., R. G. Guy, J. S. Heidemann, D. Ratner,
P. L. Reiher, A. Goel, G. H. Kuenning, G. J. Popek,
Perspectives on Optimistically Replicated,
Peer-to-Peer Filing. Software - Practice and Experience 28(2): 155-180
(1998).
Additional Resources:
4 Authentication, privacy, and resource control
Main readings:
- Mariposa: A Wide-Area
Distributed Database System, Michael Stonebraker, Paul M. Aoki, Avi
Pfeffer, Adam Sah, Jeff Sidell, Carl Staelin, and Andrew Yu. VLDB
Journal 5, 1 (Jan. 1996), pp. 48--63.
- I. Clarke, O. Sandberg, B. Wiley, and T.W. Hong,
Freenet:
A Distributed Anonymous Information Storage and Retrieval System, in
Workshop on Design Issues in Anonymity and Unobservability, Berkeley,
California. Springer: New York (2001).
(See also the
Freenet homepage.)
- J. Ioannidis, S. Ioannidis, A. Keromytis, V. Prevelakis,
``Fileteller: Paying and getting paid for file
storage.''
Summary:
Fileteller is a credential-based network file storage system with
provisions for paying for file storage and getting paid when others
access files. Users get access to arbitrary amounts of storage anywhere
in the network, and use a micropayments system to pay for both the
initial creation of the file and any subsequent accesses. Wide-scale
information sharing requires that a number of issues be addressed; these
include distributed access, access control, payment, accounting, and
delegation (so that information owners may allow others to access their
stored content). In this paper we demonstrate how all these issues are
addressed using a micropayment architecture based on a trust-management
system. Utilizing the same mechanism for both access control and
payment results in an elegant and scalable architecture.
Some questions:
- As described here, Fileteller provides storage for individual
files, with no special provisions for handling directory structures.
What extra issues arise (e.g., with access permissions)?
- A key advantage of Fileteller is the capability to delegate access
permissions from one user to another. But some details of how this works
were not clear from the example---seeing this in a bit more depth would
be interesting.
- S. Miltchev, V. Prevelakis, S. Ioannidis, J. Smith, A. Keromytis,
``Secure and Flexible Global File Sharing.''
Supplemental readings:
5 Synchronization Technologies
Main readings:
- Sundar Balasubramaniam and Benjamin C. Pierce,
What is a
file synchronizer? Fourth Annual ACM/IEEE International Conference on
Mobile Computing and Networking (MobiCom '98).
- W. Keith Edwards, Elizabeth D. Mynatt, Karin Petersen, Mike J.
Spreitzer, Douglas B. Terry, and Marvin Theimer.
Designing and Implementing
Asynchronous Collaborative Applications with Bayou. ACM Symposium on
User Interface Software and Technology, 1997.
- Anne-Marie Kermarrec, Anthony Rowstron, Marc
Shapiro, and Peter Druschel. The
IceCube approach to the reconciliation of divergent replicas, PODC,
2001.
Summary:
We describe a novel approach to log-based reconciliation called
IceCube. It is general and is parameterised by application and object
semantics. IceCube considers more flexible orderings and is designed to
ease the burden of reconciliation on the application
programmers. IceCube captures the static and dynamic reconciliation
constraints between all pairs of actions, proposes schedules that satisfy
the static constraints, and validates them against the dynamic
constraints.
Preliminary experience indicates that strong static constraints
successfully contain the potential combinatorial explosion of the
simulation stage. With weaker static constraints, the system still finds
good solutions in a reasonable time. ....
IceCube is log-based: the input to the reconciler is a common initial
state and logs of actions that were performed on each replica. The
reconciler merges the logs in some fashion, and replays the operations in
the merged log against the initial state, yielding a reconciled, common
final state. Logs provide the reconciler with the history of user
actions; thus it can infer information about the users' intents and is
therefore potentially more powerful.
In previous systems [4, 5, 10] the logs are merged according to some
predetermined order, such as temporal order. These systems cannot exploit
cases where a reordering of operations would avoid a conflict. IceCube
attempts to find an ordering that minimizes conflicts, while observing
object and application semantics and user intent. However, naively
exploring all possible orderings suffers from a combinatorial explosion.
Effectively pruning the space of orderings is one of the key design
issues in IceCube.
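A greatly simplified Python sketch of this style of reconciliation (the
Action interface, the greedy strategy, and the names are invented; the real
system searches the space of schedules far more cleverly):

    # Merge replicas' logs into one schedule: respect static "a must precede
    # b" constraints, replay against the common initial state, and drop any
    # action whose precondition (dynamic constraint) fails at its turn.
    def reconcile(initial_state, actions, must_precede):
        state = initial_state
        schedule, dropped = [], []
        remaining = list(actions)
        while remaining:
            ready = [a for a in remaining
                     if not any((b, a) in must_precede
                                for b in remaining if b is not a)]
            if not ready:                  # constraint cycle: give up
                dropped.extend(remaining)
                break
            a = ready[0]                   # a real reconciler would search here
            remaining.remove(a)
            if a.precondition(state):
                state = a.apply(state)
                schedule.append(a)
            else:
                dropped.append(a)          # conflicting action
        return schedule, dropped, state

A smarter reconciler would try different choices among the ready actions (and
different relative orders of independent ones) to minimize the number of
dropped actions.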
Some questions:
- Why is it ``obvious that state-based reconciliation will not work''
with the example in Section 2? As long as we're taking application
semantics into account in determining possible orderings of actions, why
not go a bit further and use application semantics to suggest ways of
repairing conflicts? (Cf. Bayou.)
- Why have both preconditions and postconditions on operations? They
seem redundant, from the point of view of the reconciler.
- Actions on independent target objects are always considered safe to
commute. But isn't it easy to think of examples where this would not be
true? (E.g., note that the example at the end of section 2 depends on an
ordering constraint between writing a file and deleting the file's parent
directory!)
- In 3.1, why would we expect that the independence relation I might
be reflexive or transitive? (I.e., why is it worth remarking that it is
not?)
- What does it mean to accept a cutset (3.2)? (Ah: this is answered
later on. IceCube proposes a number of cutsets to the application, which
responds with a subset that it considers ``acceptable.'')
- Jonathan P. Munson and Prasun Dewan.
A flexible object merging framework.
Proceedings of the conference on Computer supported cooperative work,
Chapel Hill, North Carolina, 1994.
Supplemental readings:
- Norman Ramsey and Elod Csirmaz,
An Algebraic Approach to File
Synchronization. Foundations of Software Engineering, 175--185,
September 2001.
- Trevor Jim, Benjamin Pierce, and Jerome Vouillon,
How to Build a File Synchronizer.
- Douglas B. Terry, Marvin Theimer, Karin Petersen, Alan J. Demers,
Mike Spreitzer, and Carl Hauser.
Managing Update Conflicts in
Bayou, a Weakly Connected Replicated Storage System. SOSP 1995, pp.
172-183.
- D. B. Terry, A. J. Demers, K. Petersen, M. J. Spreitzer, M. M.
Theimer, and B. B. Welch. Session
Guarantees for Weakly Consistent Replicated Data. Proceedings
International Conference on Parallel and Distributed Information Systems
(PDIS), Austin, Texas, September 1994, pages 140-149.
- Shirish H. Phatak and B.R. Badrinath,
Conflict Resolution and Reconciliation in
Disconnected Databases, Proc. of Mobility in Databases and Distributed
Systems (MDDS), Florence, Italy, September 1999.
- Peter Reiher, John Heidemann, David Ratner, Greg Skinner, and
Gerald Popek.
Resolving File
Conflicts in the Ficus File System. Usenix summer conference, 1994.
- Puneet Kumar and M. Satyanarayanan.
Flexible and Safe Resolution of File
Conflicts. Usenix winter conference, 1995.
Additional Resources:
6 Algorithmic Underpinnings
Main readings:
- Eugene W. Myers. An O(ND) Difference Algorithm and Its Variations.
Algorithmica, 1(2):251--266, 1986.
- Tancred Lindholm. A 3-way Merging
Algorithm for Synchronizing Ordered Trees: The 3DM merging and
differencing tool for XML. Master's thesis, Helsinki University of
Technology, 2001.
Summary:
Abstract: Keeping data synchronized across a variety of devices and
environments is becoming more and more important as the number of
computing devices per user increases. Of particular interest are
situations when the sets of data that need to be synchronized have
structure, but not the exact same structure. In the thesis we approach
these situations through a number of use cases, which are set in a
future, more ubiquitous, computing environment. The use cases are
subsequently analyzed, and in combination with the characteristics of a
ubiquitous computing environment used to derive requirements for a
synchronization tool. Although the focus is on future computing
environments, we find that a synchronization tool fulfilling the
requirements is well suited for present-day synchronization tasks as
well. We find that the requirements call for a tool capable of
performing a 3-way merge of general ordered trees without any additional
tree metadata, such as edit histories or unique node identifiers, that
describe how the trees participating in the merge are related. The main
research problem of the thesis is to design such an algorithm, given that
no suitable algorithms exist. The design of the algorithm is preceded by
stating a definition of desired merging behaviour derived from the use
cases as well as a relatively large number of small hand-written merging
examples. Further on, the design task is divided into the design of a
tree matching algorithm, and an algorithm for merging matched trees. The
matching algorithm also gives us the ability to easily find the
difference between ordered trees. The designed merging, matching, and
differencing algorithms for ordered trees are implemented as a merging
and differencing tool for XML, and their functionality is verified
against the use cases as well as the merging examples. The evaluation of
the algorithms shows promising results regarding the applicability of the
tool to real-world situations.
Comments:
- The use cases here are unusually... useful. Rather than one or two
suggestive examples, Lindholm (with colleagues) has actually developed
five or six realistic cases in considerable detail, as well as
cataloguing 40 or so smaller examples.
- Very useful review of the existing literature on tree differencing
and merging algorithms.
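For reference, the core decision rule of any 3-way merge is tiny once a
matching is in hand; the hard part the thesis tackles is matching the nodes
of two ordered trees against a common base without edit histories or node
IDs. A generic Python sketch of the per-value rule (this is not Lindholm's
algorithm, just the standard rule it must generalize):

    # Three-way merge of a single matched value: keep a side's change when
    # only that side diverged from the base; conflict when both diverged
    # differently.
    def merge_value(base, left, right):
        if left == right:
            return left, False      # both agree (possibly both changed)
        if left == base:
            return right, False     # only the right replica changed it
        if right == base:
            return left, False      # only the left replica changed it
        return None, True           # both changed it differently: conflict

    # merge_value("a", "a", "b") -> ("b", False)
    # merge_value("a", "b", "c") -> (None, True)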
Supplemental readings:
- J. W. Hunt and M. D. McIlroy. An Algorithm for Differential File
Comparison. Bell Laboratories, N.J., Computing Science Technical Report
No. 41, 1975.
- W. Miller and E. W. Myers. A file comparison
program. Software: Practice and Experience 15, 11 (1985), 1025--1040.
- D. S. Hirschberg. Algorithms for the longest common subsequence
problem. Journal of the ACM, 24:664-675, 1977.
- Neuwirth, C. M., Chandhok, R., Kaufer, D. S., Erion, P., Morris,
J., and Miller, D. (1992). Flexible
diff-ing in a collaborative writing system. In Proceedings of the
Fourth Conference on Computer-Supported Cooperative Work (CSCW '92) (pp.
147-154). Baltimore, MD: Association for Computing Machinery. Reprinted
in R. Rada (Ed.) (1996). Groupware and authoring (pp. 189-204).
- Sudarshan S. Chawathe,
Comparing Hierarchical Data in
External Memory. VLDB, 1999.
Summary: Generalizes the longest common
subsequence difference model to tree structures.
- Yuan Wang, David J. DeWitt, and Jin-Yi Cai.
X-Diff: A Fast Change Detection Algorithm for
XML Documents.
Summary: Over the next several years XML is
likely to replace HTML as the standard web publishing language and data
transportation format. Since online information changes frequently,
being able to quickly detect changes in XML documents is important to
Internet query systems, search engines, and continuous query
systems. Previous work in change detection on XML or other
hierarchically structured documents used an ordered tree model, in
which left-to-right order among siblings is important and it affects
the change result. In this paper, we argue that an unordered model
(only ancestor relationships are significant) is more suitable for most
database applications. Using an unordered model, change detection is
substantially harder than using an ordered model, but the change result
that it generates is more accurate. We propose X-Diff, a fast algorithm
that integrates key XML structure characteristics with standard
tree-to-tree correction techniques. We also analyze the algorithm and
study its performance.
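One building block such an unordered comparison can use is a subtree
signature that ignores sibling order, so identical subtrees can be spotted
cheaply; a minimal Python sketch (illustrative only, not X-Diff's actual
XHash):

    # Hash each node's label together with the *sorted* multiset of its
    # children's signatures, so reordering siblings does not change the
    # result; trees are (label, [children]) tuples here.
    import hashlib

    def sign(node):
        label, children = node
        h = hashlib.sha1()
        h.update(label.encode("utf-8"))
        for sig in sorted(sign(c) for c in children):  # ignore sibling order
            h.update(sig)
        return h.digest()

    # Two subtrees that differ only in child order get the same signature:
    assert sign(("book", [("title", []), ("author", [])])) == \
           sign(("book", [("author", []), ("title", [])]))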
Additional Resources: