WAT Files to Gephi Graphs

I’ve been playing with a large collection of WAT files. WAT files, specified here, are collections of web archive metadata records. They’re much easier to move around and work with than full WARC files: you can fit a few on a thumb drive, or years upon years of collections on a standard portable hard drive, rather than needing the storage arrays that oodles of WARCs demand.

There was a bit of a learning curve for me, however, so I wanted to share the steps that I took to take a WAT file and generate a link visualization.

I’m not really going to dwell on results, because I literally generated these this morning. I plan to get longitudinal data between 2005 and 2014. I’m not convinced the networks themselves are going to be key, but the reports I can generate in Mathematica and Gephi should be useful.

And in any case, maybe this’ll help you.

Step One: Using Web Archive Analysis Workshop to Extract Data from WAT Files

Much of this followed the examples set out in the Web Archive Analysis Workshop, specifically:

Extracting links

$PIG_HOME/bin/pig -x local -p I_WATS_DIR=$DATA_DIR/derived-data/wats/*.wat.gz -p O_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-wats.gz pig/wats/extract-surt-canon-links-with-anchor-from-warc-wats.pig

Creating a Host Graph

$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/host.graph/ pig/graph/convert-link-data-to-host-graph-with-counts.pig

And then creating link graph data:

$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_NO_TS_DIR=$DATA_DIR/derived-data/graph/host.graph/ -p O_ID_MAP_DIR=$DATA_DIR/derived-data/graph/host-id.map/ -p O_ID_SORTEDINT_GRAPH_NO_TS_DIR=$DATA_DIR/derived-data/graph/host-id-sortedint.graph/ pig/graph/convert-link-data-no-ts-to-sorted-id-data-no-ts.pig

Once we have this data, here’s what I did.

Step Two: Taking Data into Mathematica (you could adapt) to generate source/target information

In Mathematica, I ran the following code:

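(* Point this at the host-id-sortedint graph output from Step One; the file name is a placeholder. *)
data = Import["graph.tsv", "Data"];

(* For each row, Sow one source -> target rule per target ID; Reap collects them all. *)
results = Reap[
   Do[
     line = data[[x]];
     Sow[Thread[ToExpression[line[[1]]] -> ToExpression[line[[2]]]]];
     , {x, Range[1, Length[data], 1]}]];

(* The collected rules, one sub-list per source host. *)
graphdata = results[[2]][[1]];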

This code imports our host-id-sortedint data and breaks it apart into a long list of HOST -> DESTINATION pairs. Once it’s in Mathematica, we can use its own graphing/networking suite, or we can throw it into Gephi.

The results are akin to:
Screen Shot 2015-01-09 at 5.55.27 PM
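If Mathematica isn’t an option, the same reshaping can probably be done in a few lines of Python. This is only a minimal sketch: it assumes each line of the graph file holds a source ID, a tab, and then a bracketed group of target IDs. I haven’t checked that layout against the Pig output, so inspect your own part file first. The function names and the sample lines are made up for illustration.

```python
import re


def parse_graph_line(line):
    """Turn one 'source<TAB>targets' line into (source, target) pairs."""
    source_field, _, targets_field = line.rstrip("\n").partition("\t")
    source = int(source_field)
    # Pull out every integer in the targets field, whatever brackets surround it.
    return [(source, int(t)) for t in re.findall(r"\d+", targets_field)]


def parse_graph(lines):
    """Flatten all non-empty lines into one long edge list."""
    edges = []
    for line in lines:
        if line.strip():
            edges.extend(parse_graph_line(line))
    return edges


# Hypothetical sample lines; the real file layout is an assumption.
sample = ["1\t{(2),(3)}\n", "2\t{(3)}\n"]
edges = parse_graph(sample)
print(edges)  # [(1, 2), (1, 3), (2, 3)]
```

Pulling the targets out with a regular expression avoids depending on the exact bracket style, which is the part I’m least sure about.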

Step Three: Wrangle into Gephi

I prefer using Gephi, so I export it using:

Export["graph.net", Flatten[graphdata], "Pajek"]; (* the output file name is a placeholder *)
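For anyone not using Mathematica, a Pajek file is simple enough to write by hand. The sketch below is a hedged illustration rather than what Mathematica does internally: it takes a list of (source, target) ID pairs, renumbers the vertices from 1 (which Pajek expects), and keeps each original ID as the quoted vertex label so it can still be matched against the ID map later. The function name and the demo file are hypothetical.

```python
import os
import tempfile


def write_pajek(edges, path):
    """Write directed (source, target) pairs to a Pajek .net file."""
    nodes = sorted({node for edge in edges for node in edge})
    # Pajek vertices are numbered 1..n, so remap the original IDs.
    index = {node: i for i, node in enumerate(nodes, start=1)}
    with open(path, "w", encoding="utf-8") as out:
        out.write(f"*Vertices {len(nodes)}\n")
        for node in nodes:
            # Keep the original ID as the vertex label.
            out.write(f'{index[node]} "{node}"\n')
        out.write("*Arcs\n")
        for source, target in edges:
            out.write(f"{index[source]} {index[target]}\n")


# Demo with made-up IDs, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.net")
write_pajek([(10, 20), (10, 30)], demo_path)
with open(demo_path, encoding="utf-8") as handle:
    demo_lines = handle.read().splitlines()
print(demo_lines)
```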

I open the file in Gephi. The problem is that the labels aren’t there by default (as seen above, we’re dealing with numerical IDs). To get labels, I do the following:

  1. I export the node and edge spreadsheets using the export function in the Data Laboratory;
  2. I open the following file: ../derived-data/graph/id.map/part-m-00000, which lists what each number represents;
  3. I paste the list of URLs into the ‘label’ column of the node spreadsheet;
  4. I then re-import both spreadsheets using the Data Laboratory, and begin to visualize. It’ll look like this for NODES:

Screen Shot 2015-01-09 at 6.03.57 PM

And this for EDGES:

Screen Shot 2015-01-09 at 6.04.30 PM
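The copy-and-paste labelling step above could also be scripted, which avoids relying on the two lists being in the same order. This is a rough sketch under assumptions I haven’t verified: that the ID map is tab-separated with the ID first and the host second, and that the exported node spreadsheet is a CSV with ‘Id’ and ‘Label’ columns. The key_column argument names whichever column holds the numeric host ID; it is assumed to be ‘Id’ here, but depending on how the Pajek file was written it may be ‘Label’. The hosts in the sample are placeholders.

```python
import csv
import io


def load_id_map(lines):
    """Read 'id<TAB>host' lines into a dict (column order is an assumption)."""
    id_to_host = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            id_to_host[parts[0]] = parts[1]
    return id_to_host


def label_nodes(rows, id_to_host, key_column="Id"):
    """Fill the Label column of exported node rows from the ID map."""
    labelled = []
    for row in rows:
        row = dict(row)
        # Fall back to the existing label when the ID is not in the map.
        row["Label"] = id_to_host.get(row[key_column], row.get("Label", ""))
        labelled.append(row)
    return labelled


# Made-up stand-ins for the node export and the id.map part file.
nodes_csv = "Id,Label\n1,\n2,\n3,\n"
id_map = load_id_map(["1\texample.org\n", "2\texample.com\n"])
rows = list(csv.DictReader(io.StringIO(nodes_csv)))
labelled = label_nodes(rows, id_map)

# Write the labelled rows back out as CSV, ready for re-import.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Id", "Label"])
writer.writeheader()
writer.writerows(labelled)
print(out.getvalue())
```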

In Gephi, I do the following. I’m not an expert, by any means, but turn to our trusty pre-release copy of the Historian’s Macroscope for help. In the lower right, under ‘layout,’ I select ‘Force Atlas 2’ and let it run. It takes a while, so I sometimes raise the scaling up to 200 just to get things going.

Screen Shot 2015-01-09 at 6.03.30 PM

Next, we need to find the data we care about. In this case, I want the websites with the most inbound links, i.e. the highest in-degree. I set up a ranking in the upper left like so:

Screen Shot 2015-01-09 at 6.05.57 PM
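As a sanity check on what this ranking measures: in-degree is just the number of edges pointing at a node. A small sketch, using a made-up edge list of (source, target) ID pairs rather than real data, shows the idea.

```python
from collections import Counter


def in_degree(edges):
    """Count how many edges point at each target node."""
    return Counter(target for _, target in edges)


# Made-up edges: node 3 receives three links, node 1 receives one.
edges = [(1, 3), (2, 3), (3, 1), (4, 3)]
top = in_degree(edges).most_common(2)
print(top)  # [(3, 3), (1, 1)]
```

Each pair counts once here, so repeated links between the same two hosts would need a weight column to be reflected.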

And then decide to filter using the filters in the right panel, like so:

Screen Shot 2015-01-09 at 6.05.09 PM

With a filter, I ‘select’ it, and then click on the visualization pane at the bottom of the window (there’s a little triangle that will bring it up), and select “Hide node/edge labels if not filtered” on the label tab.

Finally, in the ‘preview’ tab, I select ‘show labels,’ make it size 12, proportional to the links, and begin to export. Here are some results for Canadian labour movement WATs in 2009 [download full PDF here]:

Screen Shot 2015-01-09 at 6.08.40 PM

Here we see that in December 2009 the most central inbound links were to things like CanadianLabour.ca, the Canadian Labour Congress, the Canadian Union of Public Employees, and PolicyAlternatives.ca, as well as Facebook, YouTube, the Flash viewer, and Adobe Acrobat readers.

By September 2014, the story is different [download full PDF here]:

Screen Shot 2015-01-09 at 6.11.01 PM


Screen Shot 2015-01-09 at 6.13.15 PM

We instead see other new hubs: YouTube, Facebook, Twitter, LinkedIn, etc., but also hubs like parliamentary webpages and David Suzuki (environmental concern?). But alas, hunger called, and workshops needed to be prepped. I’ll look at this more next week.

