As part of our commitment to open data – in keeping with the spirit and principles of our funding agency, as well as our core scholarly principles – our research team is beginning to release derivative data. The first dataset that we are releasing is the Canadian Political Parties and Interest Groups link graph data, available in our Scholars Portal Dataverse space.
The first file is all-links-cpp-link.graphml, a file consisting of all of the links between websites found in our collection. It was generated using warcbase’s function that extracts links and writes them to a graph network file, documented here. The exact script used can be found here.
However, releasing data is only useful if we show people how they can use it. So! Here you go.
This video walkthrough is best viewed in conjunction with the step-by-step walkthrough below.
Once you’ve downloaded the file, open up Gephi.
On the opening screen, you want to select “Open a Graph File…” and select the all-links-cpp-link.graphml file that you downloaded from our Dataverse page.
You then want to click ‘ok’ on the next page. Create a ‘new graph.’
You should now see what I (nerdily) call a borg cube. That’s good, because it means that the data is in there. We need to make it usable, however.
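If you would like a second opinion that the data really is in there, you can count the nodes and edges outside of Gephi. This is only a rough sketch using Python's standard library: the helper names are my own, the tiny sample graph is made up, and I am assuming the file follows the usual GraphML layout of `node` and `edge` elements.

```python
import xml.etree.ElementTree as ET

# A made-up, three-line GraphML sample so the sketch runs without the download.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <node id="a.example"/>
    <node id="b.example"/>
    <edge source="a.example" target="b.example"/>
  </graph>
</graphml>"""


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree adds to a tag name."""
    return tag.rsplit("}", 1)[-1]


def count_elements(root):
    """Return (node_count, edge_count) for a parsed GraphML root element."""
    nodes = sum(1 for el in root.iter() if local_name(el.tag) == "node")
    edges = sum(1 for el in root.iter() if local_name(el.tag) == "edge")
    return nodes, edges


def count_graphml_file(path):
    """Same count, but reading a GraphML file from disk."""
    return count_elements(ET.parse(path).getroot())


if __name__ == "__main__":
    print(count_elements(ET.fromstring(SAMPLE)))
    # For the real download (assuming it sits in the current folder):
    # print(count_graphml_file("all-links-cpp-link.graphml"))
```

If the two numbers roughly match what Gephi reported in its import dialog, you know the file came through intact.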
Click on the “Data Laboratory” tab at the top.
Click on “Nodes” above. When the button is shaded, that means that it is selected.
Click on “Copy Data to another Column,” select ID, and then select “label” on the drop down menu that appears in the following box.
Now click on “Edges” above.
Click on “Merge Columns,” select “timeint — STRING,” and move it into the right-hand column by clicking the arrow. Then under “Merge Strategy” select “Create Time Interval.” Click OK.
You now have to tell it how to read the data. Click on “Parse Dates” and enter “yyyymmdd.” That corresponds to how the data is laid out in the time intervals column! Press OK.
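To make the date pattern a little more concrete, here is a small sketch of the same year-month-day reading in Python. The value `20151201` is a made-up example rather than one copied from the file, and `parse_crawl_date` is a hypothetical helper, not part of Gephi or warcbase.

```python
from datetime import datetime


def parse_crawl_date(value):
    """Parse an eight-digit year-month-day string such as '20151201'."""
    # Check the shape first: strptime alone will accept shorter strings
    # and silently read them as a different date.
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"expected eight digits, got {value!r}")
    return datetime.strptime(value, "%Y%m%d").date()


if __name__ == "__main__":
    print(parse_crawl_date("20151201").isoformat())
```

The explicit length check is a design choice: it rejects a six-digit year-and-month value outright instead of guessing at a day.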
Now click on “overview” above, and let’s get our diagram going.
First, we need to filter it down. Right now there are too many nodes for most of our computers to make sense of. In the upper right, you’ll see a “filter tab.” Let’s make it so that we get rid of nodes (or domains) that have fewer than four inbound links.
To do so, under “Topology” find “In Degree” range and drag it down to the filter list below (Queries). Like so:
In the slider below, move the lower bound to 4. Then click ‘Apply.’
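For the curious, the idea behind this filter can be sketched in a few lines of Python. This illustrates the logic only, not Gephi's own code: the edge list is invented, `filter_by_in_degree` is a hypothetical helper, and I am assuming every edge counts once towards its target's in-degree.

```python
from collections import Counter


def filter_by_in_degree(edges, minimum=4):
    """Keep nodes with at least `minimum` inbound links, plus the edges between them."""
    in_degree = Counter(target for _, target in edges)
    keep = {node for node, degree in in_degree.items() if degree >= minimum}
    # An edge survives only if both of its endpoints survive.
    kept_edges = [(s, t) for s, t in edges if s in keep and t in keep]
    return keep, kept_edges


if __name__ == "__main__":
    # Made-up (source, target) pairs standing in for links between domains.
    edges = [
        ("a", "hub1"), ("b", "hub1"), ("c", "hub1"), ("d", "hub1"),
        ("a", "hub2"), ("b", "hub2"), ("c", "hub2"), ("d", "hub2"),
        ("hub1", "hub2"), ("hub2", "hub1"), ("a", "b"),
    ]
    keep, kept_edges = filter_by_in_degree(edges, minimum=4)
    print(sorted(keep), kept_edges)
```

Only the two heavily linked domains survive here, along with the links that run between them, which is why the graph becomes so much lighter after you click ‘Apply.’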
Then, we need to lay it out. Select the layout tab at left, and fill it out as follows.
Note: these are just random example values, that will produce a basic graph quickly. I often spend half an hour or even more tweaking these settings. In this case, I ran it for a few seconds and then hit stop. You will have to tweak these depending on the speed of your computer. If you screw up, just run a “Random Layout” and bring it back to the starting point.
Click on the following button to zoom your map out to the extent of the graph:
Now we’ll want to do some quick tweaks to size the nodes relative to the number of inbound links. We can do that by setting up some “Rankings.” In the upper left, click “Rankings.” Select “Nodes.” Then let’s adjust the size of the node. In Gephi, size is noted by a diamond. Do the following:
Depending on how big your graph has shaken out, you will have to adjust the min and max size accordingly. Luckily, it’s pretty quick: just change the values, click ‘apply,’ and see if you like it.
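If it helps to see what the min and max size are doing, here is a rough sketch of a ranking as a straight-line mapping from in-degree to node size. I am assuming a plain linear interpolation, the sizes 10 and 50 are placeholder values rather than recommended settings, and `scale_sizes` is a hypothetical helper.

```python
def scale_sizes(in_degree, min_size=10.0, max_size=50.0):
    """Map each node's in-degree linearly onto the range [min_size, max_size]."""
    if not in_degree:
        return {}
    low = min(in_degree.values())
    high = max(in_degree.values())
    span = high - low
    sizes = {}
    for node, degree in in_degree.items():
        # If every node has the same in-degree, give them all the minimum size.
        fraction = (degree - low) / span if span else 0.0
        sizes[node] = min_size + fraction * (max_size - min_size)
    return sizes


if __name__ == "__main__":
    # Made-up in-degrees for three domains.
    print(scale_sizes({"a.example": 4, "b.example": 8, "c.example": 12}))
```

The least-linked node gets the minimum size, the most-linked node gets the maximum, and everything else falls in between, so widening or narrowing the range changes how dramatic the contrast looks.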
Now let’s do the same with labels. First, turn them on by clicking this button in the bottom of your window.
Now you’ll see either no text, or if you’re zoomed in enough, you’ll see too much!
Let’s do the same for the labels as we did for size. Go back to your “Ranking” and do the following.
Note that this is similar to above, but you’re now adjusting the “label size” – a letter with a diamond in the upper left corner.
Finally, let’s move our labels around like so in our “Layout” tab again:
Your graph should look like:
Now you can do some final steps. You can turn it to a black background with this button:
And to run some rudimentary community detection algorithms, click on “Statistics” in the upper right hand column, click the “Run” button next to modularity, and click through the next few windows by clicking OK.
In the upper left, click on “Partition” (it should be next to “Ranking”). Click the refresh button next to the drop-down menu, and then click “Modularity.” Click “Apply.”
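Gephi's modularity statistic builds on the Louvain method, which hunts for a grouping of nodes with a high modularity score. The sketch below does not reimplement that search. It only computes the score for a grouping you hand it, treating links as undirected, so you can see what is being optimized. The toy graph and the `modularity` helper are my own illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition: sum over communities of
    (internal edges / m) - (total degree / 2m) ** 2, for an undirected graph."""
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # sum of node degrees in the community
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


if __name__ == "__main__":
    # Two triangles joined by a single link between c and d.
    edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
    two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
    one_group = {node: 0 for node in "abcdef"}
    print(modularity(edges, two_groups), modularity(edges, one_group))
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the same intuition behind the coloured clusters you get after clicking “Apply.”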
Voila! Here you are:
This is necessarily limited – I didn’t want to go into the specifics of making a usable Gephi visualization in this post. If you want more, check out our network analysis chapter in Exploring Big Historical Data. To see what’s possible, I wrote a blog post and video a few months ago that we’ve used for some substantive research outcomes.
2 thoughts on “From Dataverse to Gephi: Network Analysis on our Data, A Step-by-Step Walkthrough”
Reblogged this on Web Archives for Historians and commented:
We thought that this post from December 2015 was still relevant today. In short, it shows how you can take web archive network files generated by our research team and analyze them yourselves using the open-source Gephi package.
Even more excitingly, there are many more Gephi files available today for your analysis. To find them, visit our network data page here: https://dataverse.scholarsportal.info/dataset.xhtml?persistentId=hdl:10864/12040. It grows on a regular basis!