Ethics and the Archived Web Presentation: “The Ethics of Studying GeoCities”

I had the great pleasure to be a speaker at the Ethics and Archiving the Web conference at the New Museum in New York City. My own contribution to the conference was a piece on the “Ethics of Studying GeoCities.”

The livestream of the whole conference is available here.

Hi everybody and thanks so much for coming to my talk today. What I want to do is discuss the “ethics of studying GeoCities,” which to me gets at both the potential and the risks of doing a lot of this web archival research.

New Article: “Ten Simple Rules for Collaborative Lesson Development”

I’m part of a great team that’s just published a new article: “Ten Simple Rules for Collaborative Lesson Development.” It’s part of the “Ten Simple Rules” series at PLOS Computational Biology.

The first paragraph of our introduction sets the stage:

Lessons take significant effort to build and even more to maintain. Most academics do this work on their own, but leveraging a community approach can make educational resource development more sustainable, robust, and responsive. Treating lessons as a community resource to be updated, adapted, and improved incrementally can free up valuable time while increasing quality.

If you’re curious, read on! The article can be found here. You can find a nicely laid-out PDF here as well.

Web Archive Analysis Workshop

You can follow along through the links in this presentation

I was recently out at Simon Fraser University with Nick Ruest, where we ran a “Twitter and Web Analysis at Scale” workshop. We had a great and hardy band of students (including librarians, graduate students, and faculty) who braved the uncharacteristic snow atop Burnaby Mountain to learn about all things web archives and social media. My sincerest thanks again to SFU for being such amazing hosts, and for their fantastic “Data Love-In” programming.


My role in the workshop primarily focused on how to use web archives: I introduced students to the Wayback Machine (from doing searches in it to learning about temporal violations and provenance) and, of course, the Archives Unleashed Toolkit (AUT). We ended up taking data and running analysis on it in AUT, which worked for the most part. The workshop then concluded with work in Gephi.

As part of this, I made an interactive presentation: feel free to explore it and click on the many, many hyperlinks that are part of it, and you can learn a bit about web archives. I hope I get the opportunity to run this workshop a few more times: it’s always nice to get some dividends from the amount of work that goes into putting these things together.

We’ve Been Busy! The Archives Unleashed Project in 2017

Happy 2018!

Last year (2017) was a busy year on many fronts (from my own personal parental leave to launching our Archives Unleashed project!). Our project manager, Samantha Fritz, has a great write-up on the project’s activities to date. Please check it out!

We have many more exciting things planned, from our Toronto datathon in April 2018 to the Archives Unleashed Cloud, so please stay tuned and subscribe to our Medium blog or e-mail list if you’re curious!

The Death of Storify, Difficult Alternatives, and the Need to Steward our Data Responsibly

Storify is dead. The service, which let you take social media content like Twitter and Facebook posts and aggregate them together into stories, announced that they’ll be shutting down and deleting all content as of March 16th, 2018. It’s not as bad as some platform shutdowns – there is notice and at least you can export your own content (one story at a time) – but it’s still a reminder of how vulnerable user-generated content can be online.

This hits all users hard. Within academia, Storify seems to be the go-to for documenting controversy or, more commonly, conferences (say, the proceedings of an online conference or the hard work that went into documenting a presidential address). And why not? It’s an intuitive platform, far better than grabbing screenshots, and the other standard method – embedding Tweets in, say, a blog post – is equally vulnerable to an external service, that of Twitter itself, changing its access or its model, or failing altogether.

So what should we do?

The Archives Unleashed Project: Warcbase is dead, long live the Toolkit

(x-posted from our Archives Unleashed Medium blog)

by Ian Milligan, Jimmy Lin, and Nick Ruest

We were delighted to be able to announce a few months ago that our project team at the University of Waterloo and York University was awarded a grant from the Andrew W. Mellon Foundation to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past.

New Article: “Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives”

Our new article!

We have a new article out! Jimmy Lin, Jeremy Wiebe, Alice Zhou, and I have a new piece in the ACM Journal on Computing and Cultural Heritage. This piece, a collaboration between two computer scientists (Jimmy and Alice) and two historians (Jeremy and myself), introduces both Warcbase and our Filter-Analyze-Aggregate-Visualize (FAAV) cycle for working with large-scale web archives.

You can check out the article here in the ACM Digital Library.

Topic Shifts Between Two US Presidential Administrations

By Ziquan Wang, Borui Lin, Ian Milligan, and Jimmy Lin

While Americans are busy enjoying their Fourth of July, we Canadians are digging into data… and indeed, we wanted to share some research recently presented at the Web Archives and Digital Libraries workshop.

Shortly after Donald Trump’s inauguration as President of the United States, eagle-eyed observers noted a crucial difference in his webpage as compared to that of his predecessor, President Obama. Whereas Obama’s information page had listed the three branches of the US government (executive, judicial, and legislative), Trump’s page listed only two.

Examples like this made our research team at the University of Waterloo wonder: could we systematically begin to track the changes in discourses, priorities, topics, and beyond between two US Presidential elections, and moreover, could we do so on a budget? As I’ve argued elsewhere, web archives are of crucial importance for historians seeking to understand any period after 1996. Yet the scale requires us to turn to digital methods. We cannot go page by page through websites; rather, we need tools to extract the information that we need. Could we “distantly read” websites to notice shifts like observers did in the early days of the Trump administration?

Luckily for us, students had just finished taking Jimmy Lin’s (awesome) Big Data Infrastructure course and wanted to exercise their skills. The amazing Ziquan Wang and Borui Lin joined us and set out to explore shifts between two American presidential administrations.

But first, we needed the data…

Grant news: Multidisciplinary project will help historians unlock billions of archived web pages

Some exciting news! Nick Ruest, Jimmy Lin, and I will be leading a three-year project into web archiving analysis and community building. From the story:

The University of Waterloo and York University have been awarded a grant from the Andrew W. Mellon Foundation to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past.

The grant, valued at $610,625, supports Archives Unleashed, a project that will develop web archive search and data analysis tools to enable scholars and librarians to access, share, and investigate recent history since the early days of the World Wide Web. It is additionally supported by generous in-kind and financial contributions from Start Smart Labs, Compute Canada, York University Libraries and the University of Waterloo’s Faculty of Arts.   

You can read more at the full story here.

Warcbase Install Guide for OS X and Linux

Warcbase is great, so we often get lots of questions about how to install it – never straightforward when using a piece of software with multiple dependencies. While we’re hoping to make some significant changes to the process of installing and using Warcbase in the near future, in the short to medium term I wanted to make a slightly simpler guide.

If you’re interested in using Warcbase to analyze your web archives, you might find this PDF helpful. I will have some tutorials available soon to run some scripts on your own data – and generate a standard set of derivatives.

Here are some walkthroughs for a workshop I’m running, in nice, easy-to-use, paste-able HTML: Warcbase Installation and Penn State Warcbase Workshop.

Download the PDF here (4MB download).