This is a copy of a FamilySearch blog post on record preservation that I believe you will find interesting:
The LDS Church has been a pioneer for many decades in preserving
important family history records, keeping them safe from the dangers of
both man and nature. It took many years to build the Granite Mountain
Records Vault, where microfilm records are safely kept today. But what
about all the digital information that FamilySearch is generating to
assist researchers on the Internet? How does that get preserved from
generation to generation? As you might imagine, digital content is a bit
more complex and fragile to preserve long term than microfilm. Digital
preservation is a lot more than just tape backup. Let’s explore some of
the nuances and complexities of long-term, digital preservation.
Volume of data: the FamilySearch digital pipeline generates roughly
15 terabytes of images, or one million to three million digitized
pages, every business day of the year. The software to handle this
volume did not exist when we
started digital preservation. We push many of our vendors to come up
with new technology to meet our needs as we stretch their capabilities
and often break their products. We are also writing our own magnetic
tape storage software because no products exist on the market that can
handle preservation storage volume of this magnitude.
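
To put those figures in perspective, here is a rough back-of-the-envelope calculation in Python. It uses only the numbers quoted above plus two assumptions that are not from the post: decimal terabytes (10^12 bytes) and about 260 business days per year.

```python
# Figures from the post: about 15 TB of images and 1-3 million pages per business day.
# Assumptions (not from the post): decimal units and roughly 260 business days a year.
TERABYTE = 10**12
MEGABYTE = 10**6
DAILY_BYTES = 15 * TERABYTE
BUSINESS_DAYS_PER_YEAR = 260  # assumed


def megabytes_per_page(pages_per_day):
    """Average image size implied by the daily volume and page count."""
    return DAILY_BYTES / pages_per_day / MEGABYTE


def terabytes_per_year():
    """Yearly volume implied by the daily volume and the assumed day count."""
    return 15 * BUSINESS_DAYS_PER_YEAR


print(megabytes_per_page(1_000_000))  # average MB per page at one million pages a day
print(megabytes_per_page(3_000_000))  # average MB per page at three million pages a day
print(terabytes_per_year())           # TB per year under the 260-day assumption
```

Under those assumptions the average page image works out to somewhere between 5 and 15 megabytes, and the yearly total to a few thousand terabytes.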
Data validation: once a year, the preservation system must
automatically check every bit on every tape to make sure that there is
no corruption. We store checksum values at multiple levels so the
software can read the stored checksum, read the data, recompute the
checksum, and compare the two values to ensure integrity. It is very
resource intensive to deploy tape drives for writing new data, while
also using drives for the annual validation of every tape. It takes
complex scheduling to balance the work between the two and assure that
we don’t go too long without touching each tape for validation.
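
As a minimal sketch of the read-and-compare step described above, the following Python computes a digest over a stream in chunks and compares it with a value stored at write time. The choice of SHA-256, the function names, and the in-memory stream standing in for tape are all assumptions for illustration; the post does not say which checksum algorithm or storage interface is actually used.

```python
import hashlib
import io


def compute_checksum(stream, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


def verify(stream, stored_checksum):
    """Recompute the checksum and compare it with the value stored at write time."""
    return compute_checksum(stream) == stored_checksum


# Hypothetical example: in-memory bytes stand in for data read back from tape.
original = b"scanned page image bytes"
stored = compute_checksum(io.BytesIO(original))

print(verify(io.BytesIO(original), stored))                     # intact copy
print(verify(io.BytesIO(b"scanned page image bytez"), stored))  # one byte changed
```

Reading in fixed-size chunks keeps memory use flat no matter how large the file is, which matters when the same check has to run across an entire tape library.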
Media refresh: as the tape media ages, the system
needs the ability to make copies onto new media before unrecoverable
errors begin to appear. There is no way to tell exactly when tapes will
begin to fail, so the software has to keep a database of errors for
every tape and every tape drive and look for trends that indicate a
coming problem before it actually occurs. If we rotate media too often,
however, the system becomes too costly to maintain.
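
One simple way to picture the trend-watching described above is sketched below in Python. The `ErrorLog` class, its method names, and the `window` and `ceiling` thresholds are all hypothetical; the post does not describe the actual database or the rules used to decide when a tape should be refreshed.

```python
from collections import defaultdict


class ErrorLog:
    """Hypothetical in-memory stand-in for the per-tape error database."""

    def __init__(self):
        # tape_id -> error counts from successive validation passes, oldest first
        self._history = defaultdict(list)

    def record(self, tape_id, error_count):
        """Store the number of errors seen on one validation pass."""
        self._history[tape_id].append(error_count)

    def needs_refresh(self, tape_id, window=3, ceiling=50):
        """Flag a tape whose error counts keep rising or have hit a ceiling.

        `window` and `ceiling` are illustrative values, not real thresholds.
        """
        counts = self._history[tape_id]
        if counts and counts[-1] >= ceiling:
            return True
        recent = counts[-window:]
        if len(recent) < window:
            return False  # not enough history to call it a trend
        return all(a < b for a, b in zip(recent, recent[1:]))


log = ErrorLog()
for count in (2, 5, 9):   # steadily rising error counts
    log.record("TAPE-A", count)
for count in (4, 3, 4):   # noisy but flat
    log.record("TAPE-B", count)

print(log.needs_refresh("TAPE-A"))
print(log.needs_refresh("TAPE-B"))
```

Raising `window` or `ceiling` refreshes tapes less often and lowers cost, at the price of reacting later to a failing tape, which is the trade-off the paragraph above describes.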
File format migration: do you have any WordPerfect
4.2 files lying around? How about a Lotus 1-2-3 spreadsheet or even
something more obscure, where the software vendor is long gone, along
with your installation disks? As years pass, the risk of not being able
to accurately read a data file increases. Our preservation system has to
account for this and be able to convert files from one format or
version to a newer format. If files are not migrated in a timely
fashion, massive amounts of data can become inaccessible or difficult to
render accurately. Some file formats may be viable for a decade or
more, while others could become obsolete within just a few years.
Is a PDF a viable rendering of an Excel spreadsheet? What about
the underlying formulas, fonts, supporting data, and links to data
sources? There is a significant risk of losing content whenever a file
format is converted to a new format.
Metadata and descriptive data: suppose you have a file
from five years ago. Who created it? What software version is required to
read it? Where was the image originally digitized? Who is the owner of
the original? Are there any restrictions on the use of the file in the
future? Is this copy the highest resolution version we own, or is there a
better image somewhere? What is the subject matter of the file? Are
there people in the photograph? Is there important genealogical data
contained in the image? The list of important questions goes on and on.
Keeping track of the many types of metadata, indexes, and associated
descriptive data is critical for our preservation system.
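
To make the questions above concrete, here is a minimal sketch of a metadata record in Python. The `PreservationRecord` class, its field names, and the sample values are all made up for illustration; they mirror the questions in the paragraph and are not FamilySearch's actual data model.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class PreservationRecord:
    """Hypothetical metadata kept alongside one preserved image file."""

    file_name: str
    created_by: str               # who created it?
    software_version: str         # what is required to read it?
    digitized_at: str             # where was it originally digitized?
    original_owner: str           # who owns the original?
    use_restrictions: list = field(default_factory=list)
    is_highest_resolution: bool = True
    subject: str = ""
    people_depicted: list = field(default_factory=list)


# Sample values are invented for the example.
record = PreservationRecord(
    file_name="parish_register_0001.tif",
    created_by="camera operator 12",
    software_version="capture-tool 3.1",
    digitized_at="example county archive",
    original_owner="example parish",
    use_restrictions=["no commercial use"],
    subject="baptism register",
)

# Store the metadata as plain text so it can outlive any one program.
serialized = json.dumps(asdict(record), indent=2, sort_keys=True)
restored = PreservationRecord(**json.loads(serialized))
print(restored == record)
```

Writing the record out in a plain, self-describing text format is one way to keep the metadata readable even if the software that produced it disappears.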
Documentation: a preservation system serves both
people living today and future generations. We often pull
images from preservation to avoid having to rescan originals or
microfilm in the digital pipeline. A professional genealogist may need
to see our highest-resolution copy of an image to make out the
handwriting. A future generation may have to open up our protected
vaults and try to recover as much information as they can from our tape
libraries and try to rebuild the family history information we have
attempted to preserve. Documentation is a critical component of digital
preservation. It is imperative that we document our data models, file
formats, technology standards, software code, hardware specifications,
and many, many other aspects of the digital preservation system. A
future archeologist will not be able to simply put a magnifying glass up
to microfilm to view our digital artifacts.
There are many additional complexities associated with operating a
trusted digital repository. We hope this article gives you some
insight into a few of them and helps you appreciate the efforts
FamilySearch is taking to ensure that future generations are handed a
pristine copy of their family records. We have not yet solved all of the
challenges associated with building our preservation system, a task
that will take many more years and possibly decades to prove out. We
take our work very seriously and have a dedicated team of professionals
looking after the world’s records. With contributions from many, we hope
to enable future generations to learn of their heritage and make the
same precious bond with their ancestors as we have.