Wednesday, July 4, 2012

Records preservation

This is a copy of a blog post from FamilySearch on record preservation that I believe you will find interesting:

The LDS Church has been a pioneer for many decades in preserving important family history records, keeping them safe from the dangers of both man and nature. It took many years to build the Granite Mountain Records Vault, where microfilm records are safely kept today. But what about all the digital information that FamilySearch is generating to assist researchers on the Internet? How does that get preserved from generation to generation? As you might imagine, digital content is more complex and fragile to preserve over the long term than microfilm. Digital preservation is a lot more than just tape backup. Let’s explore some of the nuances and complexities of long-term digital preservation.

Volume of data: the digital pipeline in FamilySearch is generating somewhere in the range of 15 terabytes of images, or one million to three million digitized pages, every business day of the year. The software to handle this volume did not exist when we started digital preservation. We push many of our vendors to come up with new technology to meet our needs as we stretch their capabilities and often break their products. We are also writing our own magnetic tape storage software because no product on the market can handle preservation storage at this volume.
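
To put those figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: the 260 business days per year (52 weeks of 5 days, holidays ignored) and the use of decimal units (1 TB = 1,000,000 MB) are my assumptions, not numbers from FamilySearch.

```python
# Figures quoted in the post.
TB_PER_DAY = 15
PAGES_PER_DAY = (1_000_000, 3_000_000)

# Assumption (not from the post): 52 weeks x 5 business days, holidays ignored.
BUSINESS_DAYS_PER_YEAR = 260

# Yearly growth of the archive, in terabytes and petabytes (decimal units).
tb_per_year = TB_PER_DAY * BUSINESS_DAYS_PER_YEAR
pb_per_year = tb_per_year / 1000

# Average size of one page image in megabytes, for the low and high page counts.
mb_per_page = [TB_PER_DAY * 1_000_000 / pages for pages in PAGES_PER_DAY]

print(f"about {pb_per_year} PB per year")
print(f"between {min(mb_per_page)} and {max(mb_per_page)} MB per page on average")
```

Under those assumptions the archive grows by a few petabytes a year, and each page image averages several megabytes.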

Data validation: once a year, the preservation system has to automatically check all the bits on every tape and make sure that there is no corruption. We store checksum values at multiple levels so the software can read the stored checksum, read the data, recompute the checksum, and compare the two values to ensure integrity. It is very resource intensive to deploy tape drives for writing new data while also using drives for the annual validation of every tape. It takes complex scheduling to balance the work between the two and ensure that we don’t go too long without touching each tape for validation.
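
The read, recompute, and compare step described above can be sketched in a few lines. This is a minimal illustration and not FamilySearch's software: the post does not say which checksum algorithm is used, so SHA-256 is an assumption, and the function names are hypothetical.

```python
import hashlib
import io

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large files never sit fully in memory


def sha256_of_stream(stream):
    """Return the hex SHA-256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(CHUNK_SIZE), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify(stream, stored_checksum):
    """Recompute the checksum from the data and compare it with the stored value."""
    return sha256_of_stream(stream) == stored_checksum


# Checksum recorded when the data was first written (hypothetical sample data).
original = b"scanned page image bytes"
stored = sha256_of_stream(io.BytesIO(original))

# Later validation pass: one clean read, one read with a single corrupted byte.
intact = verify(io.BytesIO(original), stored)
corrupted = verify(io.BytesIO(b"scanned page image bytez"), stored)
```

Here `intact` is true and `corrupted` is false: changing even one byte produces a different digest, which is what lets an automated pass detect silent corruption.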

Media refresh: as the tape media ages, the system needs the ability to make copies onto new media before unrecoverable errors begin to appear. There is no way to tell exactly when tapes will begin to fail, so the software has to keep a database of errors for every tape and every tape drive and look for trends that indicate a coming problem before it actually occurs. If we rotate media too often, however, the system becomes too costly to maintain.
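
One simple way to picture that kind of trend watching is shown below. It is a hypothetical sketch: the class name, the three-pass window, and the threshold of five corrected errors are all invented for illustration, and a real system would track drives as well as tapes and use far richer statistics.

```python
from collections import defaultdict


class TapeErrorLog:
    """Keeps corrected-error counts per tape, one entry per validation pass."""

    def __init__(self, window=3, threshold=5):
        self.window = window        # how many recent passes to examine (assumed value)
        self.threshold = threshold  # latest count that triggers concern (assumed value)
        self.history = defaultdict(list)

    def record(self, tape_id, corrected_errors):
        self.history[tape_id].append(corrected_errors)

    def needs_refresh(self, tape_id):
        """Flag a tape whose error count has risen on every recent pass
        and has now reached the threshold."""
        recent = self.history[tape_id][-self.window:]
        if len(recent) < self.window:
            return False  # not enough history to call it a trend
        rising = all(a < b for a, b in zip(recent, recent[1:]))
        return rising and recent[-1] >= self.threshold


log = TapeErrorLog()
for count in (1, 3, 8):   # steadily worsening tape
    log.record("TAPE-0001", count)
for count in (2, 2, 1):   # stable tape
    log.record("TAPE-0002", count)

flagged = [tape for tape in sorted(log.history) if log.needs_refresh(tape)]
```

Only `TAPE-0001` is flagged. Requiring both a rising trend and a minimum error count is one way to avoid copying tapes too early, which is the cost concern raised above.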

File format migration: do you have any WordPerfect 4.2 files lying around? How about a Lotus 1-2-3 spreadsheet or even something more obscure, where the software vendor is long gone, along with your installation disks? As years pass, the risk of not being able to accurately read a data file increases. Our preservation system has to account for this and be able to convert files from one format or version to a newer one. If files are not migrated in a timely fashion, massive amounts of data can become inaccessible or difficult to render accurately. Some file formats may be viable for a decade or more, while others could become obsolete within just a few years. Is a PDF a viable rendering of an Excel spreadsheet? What about the underlying formulas, fonts, supporting data, and links to data sources? There is a significant risk of losing content whenever a file is converted to a new format.

Metadata and descriptive data: so you have a file from 5 years ago…who created it? What software version is required to read it? Where was the image originally digitized? Who is the owner of the original? Are there any restrictions on the use of the file in the future? Is this copy the highest resolution version we own, or is there a better image somewhere? What is the subject matter of the file? Are there people in the photograph? Is there important genealogical data contained in the image? The list of important questions goes on and on. Keeping track of the many types of metadata, indexes, and associated descriptive data is critical for our preservation system.
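
As an illustration, the answers to questions like these could travel with each file as a small structured record. The sketch below is hypothetical: the field names and sample values are invented for this example and do not describe FamilySearch's actual data model.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PreservationRecord:
    """Descriptive metadata kept alongside one preserved file (hypothetical fields)."""
    file_name: str
    created_by: str            # who created it?
    software_version: str      # what software version is required to read it?
    digitized_at: str          # where was the image originally digitized?
    original_owner: str        # who is the owner of the original?
    use_restrictions: str = "none recorded"
    is_highest_resolution: bool = True
    subject: str = ""
    people_pictured: list = field(default_factory=list)


record = PreservationRecord(
    file_name="example_page_0001.tif",
    created_by="example scanning operator",
    software_version="example-scanner 1.0",
    digitized_at="example archive reading room",
    original_owner="example county archive",
    subject="parish register page",
)

# Store the record as plain JSON text so it stays readable without special software.
sidecar = json.dumps(asdict(record), indent=2, sort_keys=True)

# Reading it back yields an identical record.
restored = PreservationRecord(**json.loads(sidecar))
```

Plain-text JSON is used here only because it is easy to read without the program that wrote it; any well-documented format would serve the same purpose.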

Documentation: a preservation system serves both people living today and future generations. We often pull images from preservation to avoid having to rescan originals or microfilm in the digital pipeline. A professional genealogist may need to see our highest resolution copy of an image to make out the handwriting clearly. A future generation may have to open up our protected vaults, recover as much information as they can from our tape libraries, and try to rebuild the family history information we have attempted to preserve. Documentation is a critical component of digital preservation. It is imperative that we document our data models, file formats, technology standards, software code, hardware specifications, and many, many other aspects of the digital preservation system. A future archeologist will not be able to simply put a magnifying glass up to microfilm to view our digital artifacts.

There are many additional complexities associated with operating a trusted digital repository. Hopefully, this article gives you some insight into a few of them and helps you appreciate the efforts FamilySearch is taking to ensure that future generations are handed a pristine copy of their family records. We have not yet solved all of the challenges associated with building our preservation system, a task that will take many more years and possibly decades to prove out. We take our work very seriously and have a dedicated team of professionals looking after the world’s records. With contributions from many, we hope to enable future generations to learn of their heritage and make the same precious bond with their ancestors as we have.
