NAME

babarchive - Manage babarchives, checksummed directory trees that can be validated.

SYNOPSIS

babarchive_check_all, babarchive_prep_one, babarchive.cron

DESCRIPTION

Babarchive is a system to manage babarchives, checksummed directory trees that can be validated.

It is designed to preserve digital archives of static content that are intended to last for decades or more. Its goal is to detect random corruption of data at rest and errors introduced when data is copied.

PARTS OF BABARCHIVE

A babarchive is a directory tree with a .shasum.fsdb and ls-lR in the root, as well as .shasum files in each subdirectory.
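
As an illustration of the layout just described, the following is a minimal Python sketch that reports which of the expected files are absent from a tree. It is not part of babarchive: the function name missing_babarchive_files is hypothetical, and the sketch assumes that "each subdirectory" means every directory below the root (so the root itself is not required to hold a .shasum file). It checks only that the files exist, not what they contain.

```python
import os

# Files the text above says live in the root of a babarchive.
ROOT_FILES = (".shasum.fsdb", "ls-lR")
# Checksum file the text above says lives in each subdirectory.
SUBDIR_FILE = ".shasum"


def missing_babarchive_files(root):
    """Return a sorted list of expected files that are absent under root.

    Hypothetical helper for illustration; it checks presence only.
    """
    missing = []
    for name in ROOT_FILES:
        path = os.path.join(root, name)
        if not os.path.isfile(path):
            missing.append(path)
    for dirpath, _dirnames, _filenames in os.walk(root):
        if dirpath == root:
            continue  # assumption: the root itself needs no .shasum
        path = os.path.join(dirpath, SUBDIR_FILE)
        if not os.path.isfile(path):
            missing.append(path)
    return sorted(missing)
```

An empty list means the tree has the files named above; it says nothing about whether the checksums are correct, which is the job of babarchive_check_one(1).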

Babarchives are managed with two tools:

babarchive_prep_one(1) creates (prepares) a new babarchive. babarchive_check_all(1) tracks all archives on the local system and incrementally re-validates them.

In addition, babarchive_check_one(1) validates a specific archive.

The script babarchive.cron(1) runs babarchive_check_all(1) and is intended to be run daily from cron (perhaps with anacron).

ARCHIVE PHILOSOPHY

The overall idea: when saving data, you don’t get what you deserve, you get what you verify.

Babarchive provides end-to-end checks on the validity of a directory tree.

We assume:

  1. Data will live on many platforms over its life.
  2. Data should stay on-line.
  3. We will check it regularly to make sure it hasn’t changed.
  4. You have separate backups to recover from errors. (Babarchive only detects problems.)
  5. You will run babarchive_check_all regularly on your file server.

It detects the following problems or events:

  1. Bit-flips inside files
  2. Removal of files
  3. Removal of babarchives from a system (not necessarily a problem)
  4. Addition of babarchives to a system (probably not a problem)

It saves two copies of the checksums so that partial corruption can be partially recovered (by hand), and so that standard tools can assist in partial recovery or verification.
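
To make the detection idea concrete, here is a minimal Python sketch of how a per-directory checksum file can reveal both a flipped bit and a removed file. It is not babarchive's implementation. The helper names (file_digest, write_manifest, check_manifest) are hypothetical, and the sketch assumes the .shasum file holds lines in the "digest, two spaces, file name" style that shasum(1) prints, with SHA-1 as the digest. The real on-disk format and algorithm may differ.

```python
import hashlib
import os


def file_digest(path, algorithm="sha1"):
    """Return the hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(directory, manifest_name=".shasum"):
    """Record a digest for every regular file directly in directory.

    Assumed format: "<digest>  <name>" per line, as shasum(1) prints.
    """
    lines = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name == manifest_name or not os.path.isfile(path):
            continue
        lines.append(f"{file_digest(path)}  {name}\n")
    with open(os.path.join(directory, manifest_name), "w") as f:
        f.writelines(lines)


def check_manifest(directory, manifest_name=".shasum"):
    """Return (changed, removed): names whose content differs or is gone."""
    changed, removed = [], []
    with open(os.path.join(directory, manifest_name)) as f:
        for line in f:
            digest, name = line.rstrip("\n").split("  ", 1)
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                removed.append(name)
            elif file_digest(path) != digest:
                changed.append(name)
    return changed, removed
```

A single flipped bit changes the digest, so the file shows up in the first list; a deleted file shows up in the second. As with babarchive itself, the sketch only detects the problem; recovery still depends on a separate backup.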

Babarchive has been in use since 1998 for archives of media files and academic datasets. It has detected two silent bit-flips on disks over that period. As of 2016 it protects more than 100 TB of data at USC/ISI.

ALTERNATIVES

There are of course many alternatives. Many are good. Our thoughts on some of them.

Finally, if you care about reading your data decades from now, we strongly encourage you to think about the data formats you choose.

COPYRIGHT

Copyright (C) 2001-2016 by John Heidemann. License GPLv2 (only).

This program is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

SEE ALSO

babarchive (8), babarchive_check_all (1), babarchive_check_one (1), babarchive_prep_one (1), shasum (1).

J. H. Saltzer, D. P. Reed, and D. D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems, Vol. 2 (No. 4), pp. 277-288, November, 1984. [http://dx.doi.org/10.1145/357401.357402]