Presentation: The Internet Archive and Social Historians: Challenge and Potential Amidst the WebARChive Files

Here is the rough text of what I’ll be presenting at the CHA. I tend to ad-lib a bit, but this should give you a sense of what I’m presenting on. It’s a twenty minute timeslot to a general audience. As I noted earlier, there is a full-text paper that drives the presentation. If you want it, drop me a line.

CHA 2013.001

Hi everybody and thank you for coming to my talk, “The Internet Archive and Social Historians: Challenge and Potential Amidst the WebARChive files.” I want to begin with something that I think we’ll all find familiar.
CHA 2013.002
Archival boxes come in different shapes and sizes, but they are a familiar sight to historians. Generations of thought and design have gone into these boxes: they are specifically designed to protect documents over long periods, reduce acidity, and withstand considerable physical wear and tear. No fewer than six International Organization for Standardization (ISO) specifications govern the creation and maintenance of physical archives. Historians played a large role in the establishment of the archival profession, though their voice has in recent years been supplanted by the rise of library and information schools.

CHA 2013.003

Historians need to understand the implications of the arrival of new archival boxes: web archives. These necessitate a rethinking of how we approach our professional standards and training, with particular implications for historians studying topics from the 1980s onwards. In this presentation, I have two main objectives: first, to introduce historians to the main issues of web archives, with an eye to incorporating them into our professional and pedagogical practices. Second, I argue that we need to look at workflows that can help us open our own next-generation archival boxes. There is a literature on this material, but it has largely been an internal conversation among archivists and information scientists. I want to make sure historians are at the table and part of the scholarly conversation.

CHA 2013.004

Historians are already able to reap the fruits of forward-thinking digital preservationists and archivists. To see why, we need to go back to the mid-1990s, when we first began to worry about our digital heritage: digital records were more numerous, images and text were beginning to be lumped together into single files, physical storage was evolving as floppies gave way to CD-ROMs, and each medium had differing standards of longevity and accessibility.

CHA 2013.005

Enter Brewster Kahle, an Internet entrepreneur and visionary. He had a simple goal: to download and preserve the entire publicly accessible World Wide Web. He founded the Internet Archive in 1996. It programmed web crawlers, automated software programs that go out and download content. The crawler visits a site, downloads it, and then follows each link on the page; at each page it then visits, it downloads the content, follows those links, and so forth. As the World Wide Web continually grows and changes, a crawler could potentially archive forever.

CHA 2013.006

In October 2001, the WaybackMachine launched. The first website demonstrated was a Clinton-era press release about airline security, and a new era of historical research was upon us.

CHA 2013.007

But what does this mean for historians? Web archives present considerable opportunity and challenge to historians due to their sheer size, the particular technical challenges posed by the dynamic and interconnected nature of the World Wide Web, and the ethical dimensions that may arise.

CHA 2013.008

Kryder’s Law has made this possible. Building off the more popularly understood Moore’s Law, which foresaw a doubling of transistor density on microchips every two years, Kryder’s Law holds that storage density will double approximately every eleven months. In 2011 alone, we created 1.8 zettabytes of information, or 1.8 trillion gigabytes.

CHA 2013.009

While much of this data is being produced in the form of private sector databases, automated logs, and proprietary and walled-off parts of the web such as Facebook, if even a fraction of this data store is made available to historians in the future it will represent an unparalleled resource. Consider just two examples: YouTube sees 72 hours of video uploaded every single minute, and Twitter sees roughly 200 million tweets per day. This is an archive that we can never fully grasp, as it continues to infinitely grow. It is different every single day.

Size comes with problems, though, because it can be a bit illusory. Websites are archived at differing intervals, depending on a site’s significance and traffic. Consider the following example, which illustrates the frequency of crawls by the Internet Archive.

CHA 2013.010

There are two major spikes. The first follows the events of 9/11. Demonstrating foresight, the Internet Archive preserved a considerable digital footprint of the events of the day: we can now trace how a web visitor might have experienced some of the events and the tumultuous weeks that followed. As the web was then becoming the primary delivery mechanism for news, this is now an unparalleled way to explore how stories developed, how misinformation arose and was subsequently corrected, and overall provides insight into how people experienced 9/11.

CHA 2013.011

But it is critical to note, through this example, that we are not preserving a complete, unfettered record of the Internet’s past. There is a lot of loss. This will pose methodological problems for future historians, just as contemporary academics struggle with accessing early television broadcasts. News media is increasingly consumed online, breaking news is experienced there, and stories change throughout the day. Historians will still have to largely rely on microfilmed or digitized archived print versions of newspapers. As historians, we should be pressing for better institutional archiving. Moreover, we will have to rethink how we approach histories of the period when we cannot approach news media as it was consumed.

 CHA 2013.012

It is worth pausing to briefly consider some of the unique technical challenges at play. Consider the website, a popular Canadian historical group blog. It is a simple webpage, approximately four years old. If you were to download the webpage to host on your own computer, you would download almost two gigabytes of information spread across 18,793 files. Imagine the difficulties in archiving this material: from storage, to preservation, and ultimately to making it usable for historians.

CHA 2013.013

To use an archival metaphor, there is no simple sheet entitled “”. A webpage is instead constructed from pieces all over the World Wide Web: a complex, interconnected, ever-living and changing document. If one archives a single page that relies upon external content or images, those must be archived as well, or a long-term, accurate representation has not been preserved.

CHA 2013.014

Finally, web archives bring with them new ethical dimensions that have not yet been fully explored. It is largely uncharted territory. With traditional archives, donor agreements ideally lay out restrictions, if any, on the use of material. Oral historians have a whole host of literature and regulations to draw on.

CHA 2013.015

How should we approach websites? We do not have a similar body of practice. Roy Rosenzweig laid out an interesting example of the transitory nature of the World Wide Web in a 2003 American Historical Review article. He uses the “Bert is Evil” website, an early Internet meme that posed the Sesame Street character Bert alongside nefarious individuals such as Adolf Hitler and Osama Bin Laden, as a key example of the issues facing historians. After 9/11 and the beginning of the global War on Terror, a print shop manager in Dhaka, Bangladesh used images of Osama Bin Laden from a website.

CHA 2013.016

Bert was tucked away in the corner. After he showed up in American broadcasts connected to Bin Laden, legal threats followed, and Bert is Evil’s owner decided to pull the site down on 11 October 2001. As Rosenzweig explains, this is scary: “If Ignacio had published his satire in a book or magazine, it would sit on thousands of library shelves rather than having a more fugitive existence as magnetic impulses on a web server.” People publish things in newspapers all the time that they might later regret.

CHA 2013.018

It also raises ethical questions. Despite Ignacio’s statement above, imploring his fans and mirrors (people who copied the site to keep it available) to stop sharing Bert is Evil, the Internet Archive preserved the site. Do we, as historians, have an ethical obligation to respect the wishes of those who upload websites? Or does a website constitute published material, in which case the author has few rights against the fair use or fair dealing rights of a researcher? This is relatively uncharted historical territory.

The Internet Archive grappled with these questions at its inception. One of its initial proposals was to handle web information “like census data — aggregate information is made public, but specific information about individuals is kept confidential,” but by 2001 all information was made available. While you can opt out of the Internet Archive (both currently and retroactively) by modifying the robots.txt file on a website’s server, by default websites are included.
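As an illustration, the opt-out convention works through the standard robots exclusion file. A minimal sketch, assuming the usual placement at the root of a site (the Internet Archive's crawler has historically identified itself as ia_archiver):

```
# robots.txt, served from the root of a website's server.
# Historically, the Internet Archive's crawler identified itself
# as "ia_archiver"; these two lines ask it to skip the whole site.
User-agent: ia_archiver
Disallow: /
```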

CHA 2013.019

There are no easy answers. Should Internet comments or discussions be fair game, akin to published material or letters to the editor? Are submissions to online discussion boards to be accessed and viewed unfettered? What about a high school student’s website? Analog equivalents to these digital examples are not always straightforward: letters to government, or submissions, are occasionally restricted when it comes to archival access. A key dimension is the scale of study: aggregate analysis (textual analysis of hundreds or thousands of blogs or tweets, for example) is almost certainly permissible, whereas closer study of individuals may require a risk analysis.

I just want to make sure that historians get in on the ground floor here.

CHA 2013.020

So, some brief technical notes in the short amount of time we have left here.

The Internet Archive’s WaybackMachine is, and most likely will continue to be, the most common way for historians to access web archives. As the WaybackMachine does not offer full-text search, the user needs to know the right Uniform Resource Locator (URL). This can be tricky, as historical URLs are not always readily apparent today.

Luckily, we have a few options as historians. First, as I have done in several other research projects, we can combine traditional archival research with the WaybackMachine: find a URL in an archival document or historical newspaper, plug it into the WaybackMachine, and the content is found. Second, before the advent of dynamic search engines like Google, users relied on web directories, which have largely been preserved. A user can visit such a directory as it existed in 1997 and use its listing of websites to find relevant sources.
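For readers who want the mechanics: the WaybackMachine addresses snapshots with a predictable URL pattern, so once you know a historical URL you can construct a replay address directly. A minimal Python sketch (the example site and timestamp are illustrative, not from my research):

```python
def wayback_url(original_url, timestamp):
    """Build a WaybackMachine replay URL.

    Snapshots are addressed as
    https://web.archive.org/web/<timestamp>/<original URL>,
    where the timestamp is YYYYMMDDhhmmss; a shorter prefix
    resolves to the nearest available capture.
    """
    return "https://web.archive.org/web/{}/{}".format(timestamp, original_url)

# Ask for the capture of a 1997-era directory closest to 1 Jan 1997.
print(wayback_url("http://www.yahoo.com/", "19970101"))
```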

Using the system can require more complicated workflows, and in the next few paragraphs I want to provide an example of how it can be used. Early Internet forums are invaluable resources, representing large amounts of non-commercialized public speech. In 1998, the Canadian Radio-television and Telecommunications Commission (CRTC), the regulatory body encompassing radio, telephone, and television, was considering whether New Media would fall under its purview. It accordingly held a series of public consultations across the country. The CRTC needed people to know where the website was, and thus placed advertisements in newspapers containing links to the websites.

With the forum pages found, we can use a workflow that helps harness the best of born-digital resources within the WaybackMachine without the attendant drawbacks. Consider the website below:

CHA 2013.021

Drawing on free, open-source tools introduced by resources such as the Programming Historian 2, I was able to download every single discussion page and save each as plain text. Automated downloading is an increasingly important part of the historian’s toolkit in the web age, as it facilitates this sort of work. In short, an automated script goes to each forum page, gathers the links to each individual post, and downloads them. It is important to incorporate pauses and limits on how quickly you download information, to conserve Internet Archive resources.
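The script itself is not reproduced here, but the pattern is straightforward. A minimal Python sketch of the two key pieces, a link harvester and a throttled download loop (`polite_download` and its `index_pages` argument are hypothetical names for illustration, not my actual script):

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all hyperlinks in an HTML page as absolute URLs."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def polite_download(index_pages, delay=2.0):
    """Visit each forum index page, gather post links, and pause
    between requests to conserve Internet Archive resources."""
    posts = []
    for url in index_pages:
        html = urlopen(url).read().decode("utf-8", errors="replace")
        posts.extend(extract_links(html, url))
        time.sleep(delay)  # throttle: be gentle with the archive's servers
    return posts
```

The `time.sleep` call is the "pauses and limits" mentioned above; a couple of seconds between requests is a reasonable default.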

I then had several hundred files, each consisting of an individual discussion thread. I was interested in exploring the degree to which the CRTC hearings contributed to a Canadian moral panic around cyberporn and the abuse of children online. Plain-text files give researchers options: they can be read in a word processor, akin to traditional archival research; they can be scanned by search programs, from built-in operating system tools to specialized information management systems like DEVONthink; or they can be fed into specialized humanities textual analysis tools. Electing to take the last course of action, I combined all of the individual files into one large plain-text file and loaded it into a web-based analysis site.
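The file-combining step is simple enough to sketch. A minimal Python version, assuming the threads sit in one folder as .txt files (`combine_plain_text` is an illustrative name, not the script I actually used):

```python
import glob


def combine_plain_text(pattern, out_path):
    """Concatenate every file matching `pattern` into one corpus file,
    separating documents with a blank line. Returns the file count."""
    paths = sorted(glob.glob(pattern))
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                out.write(f.read().strip() + "\n\n")
    return len(paths)
```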

CHA 2013.022

But let’s return to the first example I used: the original archival box.

CHA 2013.023

Digging into the WARC files themselves offers advantages beyond just using the publicly accessible WaybackMachine. First, they can be created by anybody. Returning to our example, if we wanted to create a comprehensive web archive of that entire website, we could use the Programming Historian 2 lesson to install wget and execute one command: wget --mirror --warc-file="ah".

CHA 2013.024

Yet we do not have to create our own. At the Internet Archive, there are many WARC files available, such as a just-in-time grab of the now-defunct Montreal Mirror community paper (with content between 1997 and 2010). Making this work all the more pressing, the Internet Archive released an entire crawl of the 2011 World Wide Web in WARC format in 2012, to celebrate reaching the ten-petabyte mark. Eighty terabytes are available for researchers: a copy of everything they could find. This will be a tremendous future resource for historians, and we need to begin planning for its use now.
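To give a sense of what a WARC file looks like on disk: it is a sequence of records, each opening with a version line such as WARC/1.0. A crude Python sketch that counts records in an uncompressed WARC; real work should rely on a proper parser such as WARC Tools, since a payload could in principle contain a line that starts the same way:

```python
def count_warc_records(path):
    """Roughly count records in an uncompressed WARC file by
    counting record headers, each of which begins with a
    version line like b"WARC/1.0"."""
    count = 0
    with open(path, "rb") as f:
        for line in f:
            if line.startswith(b"WARC/"):
                count += 1
    return count
```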

Let me walk through one program I have created. It takes a WARC file and, with one command, generates an index, a finding aid, a searchable index, and lets us see what it contains. <Here I may ad-lib a bit>

One command executes the script. It then proceeds to generate a full-text index (using WARC Tools), topic models the content, generates Keyword in Context concordances (both dynamic and static), and produces simple visualizations like sparklines and word clouds. Finally, it provides reports in PDF format, one per WARC file. For a historian, this is handy when there aren’t any finding aids.
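As one concrete illustration of what such a report contains, Keyword in Context simply shows each occurrence of a term with a window of surrounding characters. A minimal Python sketch, not the code from my script:

```python
import re


def kwic(text, keyword, width=30):
    """Return keyword-in-context lines: each occurrence of `keyword`
    (case-insensitive) with up to `width` characters of context on
    either side, newlines flattened to spaces."""
    results = []
    for match in re.finditer(re.escape(keyword), text, re.IGNORECASE):
        start = max(0, match.start() - width)
        end = match.end() + width
        results.append(text[start:end].replace("\n", " "))
    return results
```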
CHA 2013.028
It’s all up on my GitHub.

But since I may have seemed utopian, let me put on my activist/doomsday hat.

The dangers of digital loss, expressed in the early 1990s, are still with us.

Geocities, a popular web service that allowed users with little technical expertise to create their own websites, opened in 1994. Over the subsequent fifteen years, over 38 million webpages were created there: an astounding and irreplaceable collection of early Internet history, and most likely an until-then unparalleled collection of individual, non-commercialized expression. Yahoo!, the Internet search giant, acquired Geocities in 1999, when it was the third-largest website in the world.

On 26 October 2009, Yahoo! “succeeded in destroying the most amount of history in the shortest amount of time, certainly on purpose, in known memory. Millions of files, user accounts, all gone” (Archive Team). Its closure was quick and sudden: announced in April, the site was shuttered in October.

Historians need to play a leading role in arguing for the preservation of our past, irrespective of whether it is in digital or analog format.

There is work to be done. New web archives will necessitate a rethinking of the historian’s craft, and it is my belief that we need to move sooner rather than later on this front. The 1980s and 1990s, if past practice holds true, will become the target of historical inquiry soon. There are challenges on the road ahead, but opportunities too. We should look forward with optimism.
