Last week, I had the pleasure of co-hosting our “Archives Unleashed 2.0 Hackathon” at the Library of Congress, along with Matthew Weber (Rutgers), Jimmy Lin (Waterloo), Nathalie Casemajor (Université du Québec en Outaouais), and Nicholas Worby (Toronto). While a lot of our time was taken up by facilitating the smooth running of the event – providing virtual machines, ensuring people had great test datasets, making sure that people knew when fresh coffee arrived – we also had time to participate and hack within some of the teams.
Why did this datathon matter?
I was asked to give a short presentation about the datathon to the Saving the Web Symposium, organized by Dame Wendy Hall and the Kluge Center immediately following our hackathon.
I presented several slides and gave our overarching rationale for what we did.
I began by making note that the Archives Unleashed team had been hacking on topics germane to Saving the Web since the previous Monday evening, when we gathered at George Washington University, and then on Tuesday and Wednesday within the magisterial grounds of the Library of Congress.
As I spoke to a room full of computer scientists, librarians, social scientists, and other eminent people, I wanted to address the elephant in the room.
And that is: why was a historian helping to organize a datathon around web archives? Datathons are the province of computer scientists and technologists, aren’t they, not humanists?
Well, I had to be there because, in short, we have a very real and serious problem facing our collective cultural heritage.
And that can be summed up in one slide: the existence of content like these GeoCities.com pages. GeoCities.com, which ran from its founding in 1994 until its closure by Yahoo! in 2009, was a place where anybody could go and add content to the Web. You just had to go there, enter your e-mail address, and receive your free megabyte (!) of space. While this later expanded to two megabytes, and even ten megabytes, the core concept remained the same.
Everyday users could write their thoughts on any variety of topics:
- Their love of Buffy the Vampire Slayer;
- Their lamentations around the Toronto Maple Leafs;
- Their explorations into their genealogical family tree;
- And for children, their passions around Winnie the Pooh or other favourite cartoon characters.
And people took to it with aplomb. I like this visualization because it shows some of the early milestones: the first 10,000 users in October 1995, the first 100,000 in August 1996, the first 1,000,000 by October 1997. In total, by 2009, some seven million users would create sites on GeoCities.com.
Today, when we go in to try to calculate just how much content was there, we get numbers like 186 million distinct URLs.
This matters to me as a historian because traditionally we have had very little information from everyday people. It’s an unfair comparison, but I thrive on unfairness, so I like to point out that if you wanted to study everyday people in the 18th or 19th centuries, you were largely limited, by necessity, to the moments where people came into contact with the state: when they were born, married, or died.
The tiny glimmers of everyday life come from things like the Old Bailey, the central criminal court of London, where tidbits of social relations were recorded in trial transcripts.
So the Old Bailey Online can rightfully describe itself as having “the largest body of texts detailing the lives of non-elite people ever published, containing 197,745 criminal trials held at London’s central criminal court.”
For 239 years, 197,745 documents was the gold standard.
For 15 years of GeoCities, we have seven million users, and over 186 million documents (however you define documents).
As the late great American historian Roy Rosenzweig pointed out in a 2003 article, there is a shift from scarcity to abundance happening.
I like to put it as:
The problem with the Old Bailey is that you wish you had more information.
The problem with GeoCities is that you wish you had less.
This is why historians need to be involved.
Our animating statement for the hackathon was thus: “Web archives are great, but access and usage are a considerable problem.” We know they matter. But where do we even start?
It isn’t going to be historians working all on their lonesome who are going to be able to do this, of course. But nor is it going to be computer scientists, or librarians, or communications scholars.
We all had to work together. So our team reflected the multiple necessary perspectives.
We were going to be together to really accomplish these four points:
- Building a community that could cohere around Web archives, finding scholarly voices (similar to what I wrote about last week);
- Articulating a common vision of web archiving development and tool development, to give the field some additional shape;
- Avoiding the black boxes of search engines we don’t understand – if you run a search on billions of documents, you need to know why the search engine has given you the first ten or twenty results, or else it’ll really be writing your histories for you;
- And most importantly, equipping us as a collective – here I really mean society more generally – to work with born-digital cultural resources.
This means that we need to bring different perspectives together. We need hackers, or those who can work with data and code, to make our vision a reality and generate new, accessible open-source code that can let us rise to this challenge. But we also need yackers – or humanists – who have the technical chops to have meaningful conversations, and who can bring the professional wisdom they have developed over years of theoretical and historical engagement.
We were very happy with what we were able to bring together, as a case study of how we used warcbase reveals…
Stay tuned for Part Two tomorrow morning (EST)!