DH 2014 Slides and Talk: “Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource”

I gave a short paper at Digital Humanities 2014 last week. Held on the joint campus of the Université de Lausanne and the École polytechnique fédérale de Lausanne, it was a very good time. I might put a quick post up about some of my other observations, notably my visit to CERN, later on.

Here are the slides and text of my short talk. It’s a variation of the paper I also gave at the International Internet Preservation Consortium annual meeting in May. It concludes my conferencing for this summer – I have a workshop on computer vision and history during the Fall semester, and then the American Historical Association’s annual meeting in January. But until then: writing, research, and not flying in a plane.

Note: You should be able to reconstruct much of this approach from various blog posts on this site. See, for example: my WARC-Tools section, my clustering section, my NER posts, regular expression tinkering, and beyond (all my web archiving posts). Happy to chat, too. A downside of blogging so much is that many of you will have seen much of this before.

Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource

DH Abstract available here


Hi everybody. My name is Ian Milligan and I’m an assistant professor of Canadian and digital history at the University of Waterloo. In today’s short talk, I want to provide you with the approach I’ve been using to navigate large amounts of web information that uses open-source tools, provides an ad-hoc finding aid to me, and helps me make sense of the data deluge that historians are soon going to have to grapple with.

Using a case study of data that the Internet Archive released in October 2012, which is an entire scrape of the World Wide Web, I try to figure out what historians need to know to deal with this material.


Herein lies the crux of why I think this matters. We can’t do a large-scale social, political, or economic history (to name only a few subgenres) of periods after the mid-to-late 1990s without at least considering Web data. That doesn’t mean we’re all doing histories of the Web, just that I think the Web is going to be one of those essential sources that backdrops historical inquiry during this period. If you’re doing military history, you’ll want the forums of everyday soldiers; for political history, you’ll want e-mails, campaign websites, and Twitter campaigns; for social history, you’ll want to know what people cared about, blogged about, and felt.

So there’s the opportunity that underlies exploring web archives. But there’s a problem…


The problem is that we just have too much information, or TMI. As a thought experiment, consider that the Internet Archive (and this uses an outdated figure) dwarfs the Library of Congress. I know there are tons of things wrong with comparing apples and oranges here, but bear with me: it’s the sort of wake-up call that I think helps shock historians.

So opportunity is balanced against challenge.


So this is the environment I’m operating in. As a case study, I’m using the March – December 2011 Wide Web Scrape. The Internet Archive released it in October 2012 to celebrate the 10PB milestone, and they’re working on releasing another one soon. In total, it’s 85,000 files spread across 80TB.
Out of this, then, I’ve built a case study.

The reason for using this method instead of the standard Wayback Machine is twofold. First, the Internet Archive does not have a full-text search function: it would be ungainly given the amount of information, and it would also open them up to legal liability. Second, and just as importantly, the Wayback Machine is fantastic if you know the URL of what you’re looking for and have only a small dataset, but it is less good for simple, exploratory research.

As I said, I couldn’t work with the whole 80TB. So using the CDX files, which are sort of like a finding aid, I built a sample: I extracted 622,365 URLs out of the overall 8,512,275, which meant I was working with a case study of 7.31%. I actually did the same for a bunch of other top-level domains for comparative purposes, but let’s keep our focus on .ca for now.
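A minimal sketch of that kind of CDX filtering, in Python. It assumes a space-delimited CDX file whose header line lists one-letter field codes (the common " CDX N b a m s k r M S V g" layout, where "a" is the original URL). The function name and the sample records are hypothetical, and this is not the actual script used for the talk.

```python
from urllib.parse import urlsplit

def ca_urls(cdx_lines):
    """Yield original URLs whose host ends in .ca from CDX index lines."""
    url_col = None
    for line in cdx_lines:
        if line.lstrip().startswith("CDX"):
            # Header line: "CDX" followed by one-letter field codes.
            # The code "a" marks the column holding the original URL.
            url_col = line.split()[1:].index("a")
            continue
        if url_col is None:
            continue  # no header seen yet, so the column layout is unknown
        parts = line.split()
        if len(parts) <= url_col:
            continue  # malformed or truncated record
        host = urlsplit(parts[url_col]).hostname or ""
        if host.endswith(".ca"):
            yield parts[url_col]

# Hypothetical records, for illustration only.
sample = [
    " CDX N b a m s k r M S V g",
    "ca,example)/ 20110401000000 http://www.example.ca/ text/html 200 AAAA - - 1234 100 file-0001.warc.gz",
    "com,example)/ 20110401000001 http://www.example.com/ text/html 200 BBBB - - 2345 200 file-0001.warc.gz",
]
print(list(ca_urls(sample)))

# The sample share quoted above: 622,365 of 8,512,275 URLs.
print(round(622365 / 8512275 * 100, 2))  # 7.31
```

Reading the column position from the header, rather than hard-coding it, keeps the sketch usable on CDX files that order their fields differently.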


The trick was how to actually turn a WARC file into data that I could work with. I ended up using a now-deprecated version of WARC Tools, which is an open-source toolkit for working with these large archival behemoths. It’s command line based and largely relies upon Python.

By this point you might be wondering what a WARC file is. It’s an ISO-standardized file format that takes everything that makes up a web site (a WordPress site, for example, might be 35,000 files, including dependencies and external calls) and bundles it all together into one big file. They can be of any size, although the Internet Archive generally releases them in 1GB tranches.


So here we see what I did. The workflow starts with the WARC files, which I isolated using the CDX files, and first creates an index of the URLs within each one (again useful, because I did further filtering to make separate top-level domain archives). I did this using three main elements of WARC-Tools: the filter, which gets rid of most non-textual elements; warchtmlindex, which creates a giant index as well as a way to navigate the hierarchy; and, most importantly, filesdump, which runs each website through Lynx and generates the output. Lynx is a text-based web browser from the earliest days of the web, still used in accessibility contexts today.

The Lynx step is worth pausing on because it involved source transformation. Unless an image has the ‘alt-text’ field filled out, we lose it. This means that we’re losing context: animated GIFs, images, and backgrounds, which on a close reading might make it apparent that something was being said in jest, sarcasm, etc. We need to keep this in mind throughout.
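To make that loss concrete, here is a small sketch that uses Python's standard html.parser as a stand-in for the Lynx dump. It is a simplification and not Lynx's actual rendering: it only models the point above, that an image survives solely through its alt text. The class name, function name, and sample page are hypothetical.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect visible text, keeping an image only through its alt text."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(f"[{alt}]")
            # An image with no alt text leaves no trace in the output.

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def to_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical page: the eye-rolling GIF that signals sarcasm has no alt text.
page = '<p>Great idea!</p><img src="eyeroll.gif"><img src="cat.jpg" alt="A cat">'
print(to_text(page))  # Great idea! [A cat]
```

The plain-text version reads as sincere praise, because the only cue that it was sarcastic was an image without alt text.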

However, historians have long been flexible in approaching source materials: we make decisions all the time about what box to open, what to take notes about or not, what to transcribe, etc. It’s not a game-breaker, but we need to keep it in mind.
Finally, as I’ll discuss, I put it all into the Apache Solr NoSQL search engine to extract meaning from.


This is where clustering comes in. It’s all in Solr, but simple keyword searches don’t help. So I then make it speak to a program called Carrot2, which clusters search results from Google, Bing, and yes, custom databases set up in Solr.


We’ve all encountered clustering before, like in Google News or on desktops with DevonThink.

For digital historians, clustering represents a useful way to navigate a large amount of information. If, in a web collection, say we had websites about cats, dogs, and pigs, the clustering algorithm would read them, find that the cat-related websites had similar characteristics, and subsequently cluster them together. Conversely, it might discover as well that some of the websites – irrespective of the animal being studied – shared characteristics such as some being written by children, or others by adults. Through play and tinkering, a historian can tweak clustering algorithms to meet her or his requirements.
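As a toy illustration of the idea, the sketch below groups documents by the words they share. It uses a deliberately simple technique (word-count vectors, cosine similarity, and a greedy single pass) and is not one of Carrot2's own clustering algorithms. The function names, the threshold, and the sample documents are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn a document into a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.5):
    """Put each document in the first cluster whose seed it resembles."""
    vectors = [vectorize(d) for d in docs]
    clusters = []  # lists of document indices; the first index is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # nothing similar enough: start a new cluster
    return clusters

docs = [
    "cat food and cat toys for your kitten",
    "dog training and dog walking tips",
    "kitten care: cat litter and cat food",
    "dog food and walking your dog daily",
]
print(cluster(docs))  # [[0, 2], [1, 3]]
```

Lowering the threshold merges the animals together on shared words like "food", which is a small version of the tweaking described above.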


In my other work, I am a historian of youth cultures; how could this methodology help somebody with my research interests? Here, a query for ‘children’.

At a glance then, the visualization presents itself as a finding aid: we learn what these WARC files tell us about ‘Children,’ and whether this is worth investigating further. In this case study, we see that we have files relating to children’s health (ranging from Health Canada to Tylenol), research into children at various universities, health and wellness services, as well as related topics such as Youth Services, Family Services, and even mental health. Thus, we have both an ad-hoc finding aid equivalent as well as a way to move beyond distant and close reading levels.


Clusters often contain more than one file, and, much like a finding aid and archival box, the relationship between clusters can help shed light on the overall structure of a document collection. This sits about halfway between the idea of a finding aid and exploratory data visualization. Consider the following image generated via the Aduna visualization built into Carrot2:

In the above visualization, the labels represent clusters. If a document spans two or more clusters, it is represented by a dot connected to each of those labels. For example, “Christian Education” appears in the middle left of the chart. There is one document to the left of it (partially covered by the label), a document that belongs only to that label. Yet there is one to the right of it that is connected with “Early Learning,” representing a website that falls into both categories.
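The same relationship can be read straight off the cluster memberships. A minimal sketch, assuming the clusters are available as a mapping from label to a set of document identifiers (the identifiers here are hypothetical, and this is not how the Aduna visualization is implemented):

```python
from collections import defaultdict

def shared_documents(clusters):
    """Map each document that sits in two or more clusters to its labels."""
    labels_by_doc = defaultdict(list)
    for label, docs in clusters.items():
        for doc in docs:
            labels_by_doc[doc].append(label)
    return {
        doc: sorted(labels)
        for doc, labels in labels_by_doc.items()
        if len(labels) > 1
    }

# Hypothetical memberships mirroring the example above.
clusters = {
    "Christian Education": {"doc-a", "doc-b"},
    "Early Learning": {"doc-b", "doc-c"},
}
print(shared_documents(clusters))
# {'doc-b': ['Christian Education', 'Early Learning']}
```

Here doc-a is the dot that belongs only to “Christian Education”, while doc-b is the dot joined to both labels.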

From this, we can learn quite a bit about the files that we can find in the Wide Web Scrape as well as suggest which might be most fruitful for exploration. In this chart, at the bottom we see the websites relating to children’s health, which connect to breastfeeding, which connect to timeframes, which actually then connect to employment (which often contains quite a bit of date and time information). We also see that this connects to early childhood workers, which in turn connects to early learning more generally. The structure of the web archive relating to children reveals itself. At a quick look, instead of going through the sheer amount of text, we are beginning to see what information we can learn from this web archive.


These are all plain text, but of course because we get the URL for free and we know they’re from the Internet Archive, they’re in the Wayback Machine. I use a simple OS X Automator script to prefix the Wayback Machine URL to any archived ones that I find across my system.
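The Automator script itself isn't reproduced here, but the prefixing idea is simple enough to sketch in Python. The function name and the example timestamp are assumptions; the address pattern is the Wayback Machine's public one, where a 14-digit timestamp resolves to the closest available capture.

```python
WAYBACK_PREFIX = "https://web.archive.org/web/"

def wayback_url(original_url, timestamp=""):
    """Build a Wayback Machine address for an archived URL.

    timestamp is YYYYMMDDhhmmss (or a shorter prefix of it); if it is
    left empty, the address carries no timestamp at all.
    """
    if timestamp:
        return f"{WAYBACK_PREFIX}{timestamp}/{original_url}"
    return f"{WAYBACK_PREFIX}{original_url}"

# Hypothetical URL, with a timestamp inside the 2011 crawl window.
print(wayback_url("http://www.example.ca/", "20110601000000"))
# https://web.archive.org/web/20110601000000/http://www.example.ca/
```

Because the CDX records already carry both the URL and the capture timestamp, every plain-text file can be traced back to a rendered page this way.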

So you can click on any of those nodes, and be brought back to the original page. It’s a pretty low-overhead implementation of a really useful research tool (to this historian, anyways).


So here again is my workflow. I think the big outcome of this is that we’re working with plain text from WARC files, which you can do some fun stuff with. However, so far, there is one problem…


The problem is that, in this approach, we need to know what we are looking for.


So I’ve complemented these finding aids with named-entity extraction, especially for locations, as well as regex searches for things like postal codes. All of this, of course, is possible because of our text extraction, which is probably the biggest contribution this research makes.
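As an example of the regex side, here is a sketch of a Canadian postal code search over extracted text. The pattern is an assumption written for this illustration rather than the expression used in the research: it follows the letter-digit-letter, digit-letter-digit layout and leaves out the letters Canada Post does not use (D, F, I, O, Q, U anywhere, plus W and Z in first position).

```python
import re

# Letter-digit-letter, optional space or hyphen, digit-letter-digit.
POSTAL_CODE = re.compile(
    r"\b[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z][ -]?\d[ABCEGHJ-NPRSTV-Z]\d\b"
)

def postal_codes(text):
    """Return every postal-code-shaped string found in the text."""
    # Uppercase first so that codes typed in lowercase are caught too.
    return POSTAL_CODE.findall(text.upper())

text = "Write to us at 200 University Ave W, Waterloo, ON N2L 3G1, or call 519-555-0100."
print(postal_codes(text))  # ['N2L 3G1']
```

A match only shows that a string has the right shape, not that the code exists, so results like these are best treated as leads to check rather than confirmed locations.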


I’ve also been experimenting with images: looking at facial recognition, at what we can learn from tens of thousands of images, and at colour hues and how they change.


But many of those are subjects for another talk. So far, clustering has been the best approach. It creates ad-hoc finding aids, lets me actually move between distant and close reading, and quite frankly is easy and uses open-source software.

The next stage is to apply these methods on a longitudinal basis, looking at changes in clusters over time. I want to thank you very much for your time and look forward to any questions that you might have.

