Thursday, September 2, 2010

Safeguarding cloud ephemera Part I: the big picture

Artist, poet, and longtime friend Leah Korican commented on a recent post with this suggestion:
"Here's something I wondered about that you might write about...the longevity of these blog posts and other internet publishing. In other words is it important that they are preserved? Do you print them out and save them? What is their lifespan? Will they still be around in 10 years or 50? I have printed email and saved it occasionally but wonder if all the digital stuff will vanish."

Leah's not the only one wondering, and I haven't written about technology for a while. So today I'll start with thoughts about the much bigger and more intractable problem of preserving stuff published on the internet, in a general sense. Next week I'll offer some advice about preserving blog posts.

The Big Problem

Stuff on the internet can go away for multiple reasons. It could go away, for example, because the physical hardware that stores the data becomes corrupted and cannot be restored. It could go away because the company that stores the data goes out of business, taking data it hosts down with the ship. It could go away because the Internet as we know it goes away.

The physical hardware problem happened with a bang about a year ago, in October 2009. T-Mobile customers who used their Sidekick phones to store things like personal contact information and calendar entries "in the cloud" (on a remote server) received a communication that began,
"Dear valued T-Mobile Sidekick customers..."

The remote data was stored -- can't make this stuff up -- by a Microsoft subsidiary called "Danger." Wheeee!

As reported by Mashable on 10 October, the press release informed customers that,
"Regrettably, based on Microsoft/Danger’s latest recovery assessment of their systems, we must now inform you that personal information stored on your device – such as contacts, calendar entries, to-do lists or photos – that is no longer on your Sidekick almost certainly has been lost as a result of a server failure at Microsoft/Danger. That said, our teams continue to work around-the-clock in hopes of discovering some way to recover this information."

The good news, sort of, is that some days later those bleary-eyed teams managed a partial save. Again, via Mashable:
"We are pleased to report that we have recovered most, if not all, customer data for those Sidekick customers whose data was affected by the recent outage. We plan to begin restoring users’ personal data as soon as possible..."

A similar tale involved the loss of some 45% of user data by a backup (yes!) service, The Linkup, in 2008. The Linkup subsequently went out of business. Wouldn't you?

And that third, apocalyptic option? About the Internet as we know it going away? Well ... when was the last time you tried to play a Betamax video tape? Anybody out there keeping important data on eight-inch floppy disks formatted for use with computers running CP/M? Are you still holding unused Instamatic film cartridges in a box stored in your attic?

Technologies die.

Some Big Solutions

There are efforts underway to archive the internet.

One of the best known is the Internet Archive, brainchild of Brewster Kahle, and described by Stewart Brand of The Long Now Foundation as
"the beginning of a cure - the beginning of complete, detailed, accessible, searchable memory for society, and not just scholars this time, but everyone."

The U.S. Library of Congress has a project -- called the National Digital Library Program -- that is
"assembling a digital library of reproductions of primary source materials to support the study of the history and culture of the United States."

The LoC's Digital Preservation project has a mission
"to develop a national strategy to collect, preserve and make available significant digital content, especially information that is created in digital form only, for current and future generations."

In other parts of the world, similar efforts are underway. In Europe, for example, Europeana is set to launch later this year "with links to over 10 million digital objects."

But. Let's get some perspective. In 2005, Google CEO Eric Schmidt cited a study guesstimating that the world's data can be quantified as about 5 million terabytes (a terabyte is 1,000 gigabytes; a gigabyte is 1,000 megabytes; a megabyte is 1,000,000 bytes -- and in ASCII encoding it takes one byte to represent a single letter or digit in computer storage). Schmidt estimated in the same talk that about 170 terabytes were indexable and searchable on-line. If he had his numbers right, that's .... wait for it .... about 0.003% of extant data at the time. A separate 2005 study estimated that the public, indexable web (the part Google can conceivably index) is 11.5 billion pages, and that large scale search engines cover no more than 40-70% of those pages.
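For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The two figures are the ones Schmidt cited, as quoted above; the variable and function names are my own, and the units are the decimal ones given in the parenthetical (1 terabyte = 10^12 bytes). On those assumptions the indexable share works out to roughly 0.0034%.

```python
# Decimal units, as described in the post: 1 TB = 1,000 GB = 10**12 bytes.
BYTES_PER_TB = 10**12

# Figures attributed to Eric Schmidt's 2005 talk (taken from the text above).
world_data_tb = 5_000_000   # estimated total data in the world, in terabytes
indexable_tb = 170          # estimated indexable, searchable data, in terabytes


def share_percent(part, whole):
    """Return part as a percentage of whole."""
    return part / whole * 100


share = share_percent(indexable_tb, world_data_tb)
world_data_bytes = world_data_tb * BYTES_PER_TB

print(f"World data: {world_data_bytes:.1e} bytes")
print(f"Indexable share: {share:.4f}%")
```

Put another way, on these numbers the unindexed remainder outweighs the searchable part by a factor of nearly thirty thousand.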

My vote? Nobody is going to archive the whole sprawling, morphing internet, ever. And if they do? It'll only be a fraction of human culture.

But that's not exactly what Leah asked. Leah "wonder[ed] if all the digital stuff will vanish." I believe that most of it will.

All of it? Well, sure. Eventually.

But sticking to the less absolute, consider Babylon, Athens, Alexandria, Rome. Does anybody really think that the wealth of cultural material preserved from ancient Greece represents more than a small fraction of what that great civilization produced?

How best to preserve human knowledge? In Rock, Paper, Digital Preservation I suggested that humans have the longest demonstrated success with cave painting and clay tablets. A colleague with more experience working with cuneiform scholars than I have pointed out that there are a lot of clay tablets that have been found and cataloged, but that nobody knows how to read.

Human knowledge goes away, and there's little reason to believe that the internet changes longstanding rules. Yes, it's a heck of a lot easier and more economical to store digitized words and images and sound than it was thirty or fifty years ago. But how much more are we producing? There are lots of anecdotal estimates, but I'm not sure anybody knows.

That same colleague -- the one who pointed out that much cuneiform remains inscrutable to us of the 21st century -- wisecracked the other day that twenty percent of knowledge production these days is tweets about Lindsay Lohan. Okay, he was making that number up. Still ... maybe the decay of some data is a good thing?

Worth considering.

(This post is the first in a two-part series. The second, Safeguarding cloud ephemera Part II: keeping your blog alive, appears as a post of 9 September 2010.)


1 comment:

  1. Ironically the comment I wrote using Firefox disappeared again! I love this answer. I want one of those cuneiform tablets as a memento mori. I'm also reminded of Yeats,

    His long lamp-chimney shaped like the stem
    Of a slender palm, stood but a day;
    All things fall and are built again,
    And those that build them again are gay.