From Dataverse to Gephi: Network Analysis on our Data, A Step-by-Step Walkthrough

Screen Shot 2015-12-10 at 4.20.20 PM
Do you want to make this link graph yourself from our data? Read on.

As part of our commitment to open data – in keeping with the spirit and principles of our funding agency, as well as our core scholarly principles – our research team is beginning to release derivative data. The first dataset that we are releasing is the Canadian Political Parties and Interest Groups link graph data, available in our Scholars Portal Dataverse space.

The first file is all-links-cpp-link.graphml, a file consisting of all of the links between websites found in the our collection. It was generated using warcbase’s function that extracts links and writes them to a graph network file, documented here. The exact script used can be found here.

However, releasing data is only useful if we show people how they can use it. So! Here you go.

Video Walkthrough

This video walkthrough is best viewed in conjunction with the step-by-step walkthrough below.

Step-by-Step Walkthrough

Once you’ve downloaded the file, open up Gephi.

On the opening screen, you want to select “Open a Graph File…” and select the all-links-cpp-link.graphml file that you downloaded from our Dataverse page. Continue reading “From Dataverse to Gephi: Network Analysis on our Data, A Step-by-Step Walkthrough”

Herrenhausen Big Data Podcast: Coding History on GeoCities

Last post (three day conference deserves three posts, right?) for my trip to Hannover, Germany for the “Big Data in a Transdisciplinary Perspective” conference. I had the opportunity to sit down with Daniel Meßner, who hosts a podcast called Coding History. I really enjoyed our conversation, and wanted to link to it here.

You can listen to the podcast here. Hopefully I am cogent enough!

It grew out of my lightning talk and poster, also available on my blog.

My thanks again to the VolkswagenStiftung for the generous travel grant to make my attendance possible. It was a wonderful conference.

Herrenhausen Big Data Lightning Talk: Finding Community in the Ruins of GeoCities

I was fortunate to receive a travel grant to present my research in a short, three-minute slot plus poster at the Herrenhäuser Konferenz: Big Data in a Transdisciplinary Perspective in Hanover, Germany. Here’s what I’ll be saying (pretty strictly) in my slot this afternoon. Some of it is designed to respond to the time format (if you scroll down you will see that there is an actual bell). 

If you want to see the poster, please click here.


Big Data is coming to history. The advent of web archived material from 1996 onwards presents a challenge. In my work, I explore what tools, methods, and approaches historians need to adopt to study web archives.

GeoCities lets us test this. It will be one of the largest records of the lives of non-elite people ever. The Old Bailey Online can rightfully describe their 197,000 trials as the “largest body of texts detailing the lives of non-elite people ever published” between 1674 and 1913. But GeoCities, drawing on the material we have between 1996 and 2009, has over thirty-eight million pages.

These are the records of everyday people who published on the Web, reaching audiences far bigger than previously imaginable. Continue reading “Herrenhausen Big Data Lightning Talk: Finding Community in the Ruins of GeoCities”

Adapting the Web Archive Analysis Workshop to Longitudinal Gephi:

The Problem

This script helps us get longitudinal link analysis working in Gephi
This script helps us get longitudinal link analysis working in Gephi

When playing with WAT files, we’ve run into the issue of getting Gephi to play accurately with the shifting node IDs generated by the Web Archive Analysis Workshop. It’s certainly possible – this post demonstrated the outcome when we could get it working – but it’s extremely persnickety. That’s because node IDs change for each time period you’re running the analysis: i.e. could be node ID 187 in 2006, but then remapped to node ID 117 in 2009. It’s best if we just turn it all into textual data for graphing purposes.

Let me explain. If you’re trying to generate a graph of the host-ids-to-host-ids, you get two files. They come out as hadoop output part-m-00000, etc. files, but I’ll rename them to match the commands used to generate them. Let’s say you have:

[note: this functionality was already in the GitHub repository, but not part of the workshop. There is lots of great stuff in there. As Vinay notes below, there’s a script that does this – found here.]

Continue reading “Adapting the Web Archive Analysis Workshop to Longitudinal Gephi:”

Exploding ARC Files with Warcbase

224,977 URLs with 254,702 hyperlinks between them (from a random sample of 2008 CommonCrawl data).
224,977 URLs with 254,702 hyperlinks between them (from a random sample of 2008 CommonCrawl data).

With the help of RA extraordinaire Jeremy Wiebe, I’ve been playing with Jimmy Lin’s warcbase tool. It’s:

… an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing.

Jeremy’s been working on documenting it for OS X, which I’m hoping to share soon, but it’s a neat suite of tools to play with web archived files. While right now everything works with the older, now-depreciated ARC file format (they note the irony given the name), Jeremy has been able to add support for WARC ingestion as well.

What can we do with it? Continue reading “Exploding ARC Files with Warcbase”

Short Interview on CBC’s The Current on Historians and Big Data

IMG_20141001_095116I did a short interview on CBC Radio’s The Current with Anna Maria Tremonti, which aired this morning. I was responding to some of the utopian arguments made by Christian Rudder’s book Dataclysm, noting that while the historical record is going to be enriched by digital sources, we’ve got to consider issues of access, preservation, and funding. I was nervous, but I think I got my main points across pretty decently.

The talk is available here: “Historians want Canada to give them access to Big Data.”

It was a fantastic experience, and it really did get me thinking about how it would be rewarding to build the capital to get website legal deposit in Canada — or at the very least, to get the preservation of digital resources a little bit more on the table. Maybe I’ll try my hand at some popular writing.

Using ImagePlot to Explore Web Archived Images

A low-resolution version of the EnchantedForest visualization. Read on for higher-resolution downloads.
A low-resolution version of the EnchantedForest visualization. Read on for higher-resolution downloads.

ImagePlot, developed by Lev Manovich’s Software Studies Initiative, promises to help you “explore patterns in large image collections.” It doesn’t disappoint. In this short post, I want to demonstrate what we can learn by visualizing the 243,520 images of all formats that make up the child-focused EnchantedForest neighbourhood of the GeoCities web archive.

Setting it Up

Loading web archived images into ImagePlot (macros which work with the open-source program ImageJ) requires an extra step, which works for both Wide Web Scrape as well as GeoCities data. Images need to be 24-bit RGB to work. My experience was that weird file formats broke the macros (i.e. an ico file, or other junk that you do get in a web archive), so I used ImageMagick to convert the files. Continue reading “Using ImagePlot to Explore Web Archived Images”

Colour Analysis of Web Archives

These popular colours from the children-focused area of GeoCities would look great on a paint swath, right?
These popular colours from the children-focused area of GeoCities would look great on a paint swath, right?
I have been tinkering with images in web archives – an idea that I had was to pull out the colours and see how frequent they were in a given web archive. With some longitudinal data, we could see how colour evolves. In any attempt, by running enough analysis on various test corpuses and web archives, we could begin to get a sense of what colours characterize what kind of image collections: another method to ‘distantly’ read images.

This is easier said than done, of course! I did a few things. First, I began by getting average RGB values from each image using ImageMagick as discussed in my previous post: this was potentially interesting, but I had difficulty really extracting meaningful information from it (if you average enough RGB values you begin to be comparing different shades of grey). This is the least computationally intensive route. I might return to this, by binning the various colour values and then tallying them.

The second was to turn to an interesting blog post from 2009, “Flag Analysis with Mathematica.” In the post, Jon McLoone took all the flags of the world, calculated the most frequent colours, and used it to play with fun questions like ‘if I ever set up my own country, I know what are fashionable choices for flag colors.’ It’s actually functionality that’s been incorporated to a limited degree in Mathematica 10, but the blog post has some additional functionality that’s not in the new fancy integrated code.

As I’ll note below, I’m not thrilled with what I’ve come up with. But we can’t always just blog about the incredible discoveries we make, right?  Continue reading “Colour Analysis of Web Archives”