From Dataverse to Gephi: Network Analysis on our Data, A Step-by-Step Walkthrough

Screen Shot 2015-12-10 at 4.20.20 PM
Do you want to make this link graph yourself from our data? Read on.

As part of our commitment to open data – in keeping with the spirit and principles of our funding agency, as well as our core scholarly principles – our research team is beginning to release derivative data. The first dataset that we are releasing is the Canadian Political Parties and Interest Groups link graph data, available in our Scholars Portal Dataverse space.

The first file is all-links-cpp-link.graphml, which contains all of the links between websites found in our collection. It was generated using warcbase’s function that extracts links and writes them to a graph network file, documented here. The exact script used can be found here.
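If you would like to peek inside the file before opening Gephi, here is a minimal sketch using only Python's standard library. It assumes the file follows the standard GraphML layout (node and edge elements in the GraphML XML namespace); the function name and the tiny sample graph with its placeholder domains are made up for illustration and are not taken from our data.

```python
import io
import xml.etree.ElementTree as ET

# Standard GraphML namespace, in the {uri}tag form ElementTree expects.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"

# Hypothetical three-node sample standing in for the real file.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <node id="a.example"/>
    <node id="b.example"/>
    <node id="c.example"/>
    <edge source="a.example" target="b.example"/>
    <edge source="c.example" target="b.example"/>
  </graph>
</graphml>"""


def summarize_graphml(source, sample_size=5):
    """Return (node count, edge count, first few (source, target) pairs).

    `source` may be a file path or a binary file-like object.
    """
    root = ET.parse(source).getroot()
    nodes = root.findall(f".//{GRAPHML_NS}node")
    edges = root.findall(f".//{GRAPHML_NS}edge")
    sample = [(e.get("source"), e.get("target")) for e in edges[:sample_size]]
    return len(nodes), len(edges), sample


node_count, edge_count, first_edges = summarize_graphml(
    io.BytesIO(SAMPLE.encode("utf-8"))
)
print(node_count, "nodes,", edge_count, "edges")
for source_id, target_id in first_edges:
    print(source_id, "->", target_id)
```

To try it on the real download, pass the path instead of the sample, for example summarize_graphml("all-links-cpp-link.graphml"), assuming the file is in your working directory.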

However, releasing data is only useful if we show people how they can use it. So! Here you go.

Video Walkthrough

This video walkthrough is best viewed in conjunction with the step-by-step walkthrough below.

Step-by-Step Walkthrough

Once you’ve downloaded the file, open up Gephi.

On the opening screen, you want to select “Open a Graph File…” and select the all-links-cpp-link.graphml file that you downloaded from our Dataverse page.

On the next screen, choose to create a ‘new graph,’ then click ‘OK.’

Screen Shot 2015-12-10 at 5.27.54 PM

You should now see what I (nerdily) call a Borg cube. That’s good, because it means that the data is in there. We need to make it usable, however.

Click on the “Data Laboratory” tab at the top.

Click on “Nodes” above. When the button is shaded, it is selected.

Click on “Copy Data to another Column,” select ID, and then select “label” on the drop down menu that appears in the following box.

Now click on “Edges” above.

Click on “Merge Columns,” select “timeint — STRING,” and move it into the right-hand column by clicking the arrow. Then under “Merge Strategy” select “Create Time Interval.” Click OK.

You now have to tell it how to read the data. Click on “Parse Dates” and enter “yyyymmdd.” That corresponds to how the data is laid out in the time intervals column! Press OK.

Screen Shot 2015-12-10 at 5.30.45 PM
Our data is now ready to visualize!
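If you want to sanity-check the dates outside Gephi, here is a small sketch of what the pattern above describes: four digits of year, two of month, two of day. It assumes each value is a plain eight-digit string; the function name and the sample value are hypothetical, and Python spells the same pattern as %Y%m%d.

```python
from datetime import datetime


def parse_crawl_date(value):
    """Parse an eight-digit year-month-day string into a date.

    The explicit length check matters: strptime on its own accepts
    one-digit months and days, so a six-digit string could be misread.
    """
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"expected eight digits, got {value!r}")
    return datetime.strptime(value, "%Y%m%d").date()


# Hypothetical sample value, not taken from the dataset.
print(parse_crawl_date("20151210").isoformat())
```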

Now click on “overview” above, and let’s get our diagram going.

First, we need to filter it down. Right now there are too many nodes for most of our computers to make sense of. In the upper right, you’ll see a “Filters” tab. Let’s use it to get rid of nodes (or domains) that have fewer than four inbound links.

To do so, under “Topology” find “In Degree” range and drag it down to the filter list below (Queries). Like so:

Screen Shot 2015-12-10 at 5.33.18 PM

In the slider below, move the lower bound to 4. Then click ‘Apply.’

Screen Shot 2015-12-10 at 5.34.46 PM
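For anyone curious about what this filter is doing, here is a rough Python sketch of the same idea. It is not Gephi's code: it assumes the graph is a simple list of (source, target) pairs, counts each distinct link once, and keeps only the edges whose two ends both survive. The function name and the toy edge list are made up for illustration.

```python
from collections import Counter


def filter_by_in_degree(edges, minimum=4):
    """Keep nodes with at least `minimum` inbound links.

    Returns the set of kept nodes and a sorted list of the edges
    that run between two kept nodes.
    """
    unique_edges = set(edges)  # count each distinct link once
    in_degree = Counter(target for _source, target in unique_edges)
    keep = {node for node, degree in in_degree.items() if degree >= minimum}
    kept_edges = sorted(
        (s, t) for s, t in unique_edges if s in keep and t in keep
    )
    return keep, kept_edges


# Toy example: four sites link to "hub", and "hub" links back to "a".
toy_edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("d", "hub"), ("hub", "a")]
nodes, remaining = filter_by_in_degree(toy_edges, minimum=4)
print(nodes, remaining)
```

With the threshold at four, only "hub" survives in the toy example, and no edges remain because every one of its neighbours was filtered out. That is why a filtered graph can look much sparser than you might expect.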

Then, we need to lay it out. Select the layout tab at left, and fill it out as follows.

Screen Shot 2015-12-10 at 5.36.07 PM

Note: these are just example values that will produce a basic graph quickly. I often spend half an hour or even more tweaking these settings. In this case, I ran it for a few seconds and then hit stop. You will have to tweak these depending on the speed of your computer. If you screw up, just run a “Random Layout” to bring it back to the starting point.

Click on the following button to zoom your map out to the extent of the graph:

Screen Shot 2015-12-10 at 5.37.20 PM

Now we’ll want to do some quick tweaks to size the nodes relative to their number of inbound links. We can do that by setting up some “Rankings.” In the upper left, click “Rankings.” Select “Nodes.” Then let’s adjust the size of the node. In Gephi, size is noted by a diamond. Do the following:

Screen Shot 2015-12-10 at 5.38.32 PM

Depending on how big your graph has shaken out, you will have to adjust the min and max size accordingly. Luckily, it’s pretty quick: just change the values, click ‘apply,’ and see if you like it.
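The min and max values are easier to reason about once you see the arithmetic. Here is a sketch that assumes a straight linear interpolation between the two sizes, which is my reading of the default behaviour rather than something taken from Gephi's source. The function name, default sizes, and sample in-degrees are all hypothetical.

```python
def rank_sizes(in_degree, min_size=10.0, max_size=50.0):
    """Map each node's in-degree onto a size between min_size and max_size.

    The lowest in-degree gets min_size, the highest gets max_size,
    and everything else falls proportionally in between.
    """
    if not in_degree:
        return {}
    low = min(in_degree.values())
    high = max(in_degree.values())
    span = high - low
    sizes = {}
    for node, degree in in_degree.items():
        # If every node has the same in-degree, give them all min_size.
        fraction = (degree - low) / span if span else 0.0
        sizes[node] = min_size + fraction * (max_size - min_size)
    return sizes


# Hypothetical in-degrees for three domains.
print(rank_sizes({"a.example": 4, "b.example": 8, "c.example": 12}))
```

Note that one very heavily linked domain stretches the scale for everyone else, which is part of why adjusting the two bounds takes some trial and error.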

Now let’s do the same with labels. First, turn them on by clicking this button in the bottom of your window.

Screen Shot 2015-12-10 at 5.39.58 PM

Now you’ll see either no text, or if you’re zoomed in enough, you’ll see too much!

Screen Shot 2015-12-10 at 5.40.30 PM

Let’s do the same for the labels as we did for size. Go back to your “Ranking” and do the following.

Screen Shot 2015-12-10 at 5.41.37 PM

Note that this is similar to above, but you’re now adjusting the “label size” – a letter with a diamond in the upper left corner.

Finally, let’s move our labels around like so in our “Layout” tab again:

Screen Shot 2015-12-10 at 5.41.45 PM

Your graph should look like:

Screen Shot 2015-12-10 at 5.41.55 PM

Now you can do some final steps. You can turn it to a black background with this button:

Screen Shot 2015-12-10 at 5.43.47 PM

And to run some rudimentary community detection algorithms, click on “Statistics” in the upper right-hand column, click the “Run” button next to “Modularity,” and click through the next few windows by clicking OK.

In the upper left, click on “Partition” (it should be next to “Ranking”). Click the refresh button next to the drop-down menu, and then click “Modularity.” Click “Apply.”

Screen Shot 2015-12-10 at 5.44.55 PM
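To give a feel for what “modularity” measures, here is a small sketch that scores a grouping you hand it. It does not search for communities the way Gephi does; it only computes the standard modularity score for a given assignment, and it treats links as undirected for simplicity. The function name and the toy graph are made up for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Score a partition of an undirected graph.

    `edges` is a list of (u, v) pairs and `community` maps each node
    to a group label. For each group, the score adds the share of
    edges falling inside it and subtracts the share expected if the
    same degrees were wired up at random.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both ends in the group
    degree_sum = defaultdict(int)  # total degree of the group's nodes
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Toy graph: two triangles joined by a single link between "c" and "d".
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(round(modularity(toy_edges, two_groups), 4))
```

Putting every node in a single group scores zero on this measure, while splitting the toy graph into its two triangles scores higher. That difference is the signal a community detection routine tries to maximize.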

Voila! Here you are:

Screen Shot 2015-12-10 at 5.46.10 PM

This is necessarily limited – I didn’t want to go into the specifics of making a usable Gephi visualization in this post. If you want more, check out our network analysis chapter in Exploring Big Historical Data. To see what’s possible, take a look at the blog post and video I put together a few months ago, which we’ve used for some substantive research outcomes.

2 thoughts on “From Dataverse to Gephi: Network Analysis on our Data, A Step-by-Step Walkthrough”

  1. Reblogged this on Web Archives for Historians and commented:

    We thought that this post from December 2015 was still relevant today. In short, it shows how you can take web archive network files generated by our research team and analyze them yourselves using the open-source Gephi package.

    Even more excitingly, there are many more Gephi files available today for your analysis. To find them, visit our network data page here: https://dataverse.scholarsportal.info/dataset.xhtml?persistentId=hdl:10864/12040. It grows on a regular basis!
