Quickly Deploying Warcbase with Amazon Web Services

All the fun of warcbase... in the cloud!

Part of the problem with warcbase is that you need a decently powerful laptop or desktop to run meaningful analysis on small-to-medium-sized collections (although it does run on a Raspberry Pi). While our team has access to a cluster, that’s not the case for everybody. What if you had a few WARC files and wanted to analyze them with warcbase without going through our documented yet sometimes challenging installation process? Or what if you wanted to deploy it on a cloud provider such as AWS? Maybe you have collections in an Amazon S3 bucket?

Note that this is all part of our exploration of ways to eventually bring warcbase to a wider audience. It’s still command-line based, unfortunately, and requires knowledge of the AWS stack. But for technically inclined people or developers with web archival collections, we’re moving closer to helping you work with your collections.

Accordingly: enter the Warcbase Workshop repository!

For IIPC 2016 in Reykjavik, Nick Ruest and I ran a warcbase workshop. Nick set up a Vagrant workflow to help build the VM on people’s own machines, which was incredibly useful. The downside was that it required downloading lots of data locally, which is especially flaky if folks forget to do so until they’re sitting in a conference room with hundreds of friends. It also required that they have full-fledged laptops.

What if they could just spin it up on Amazon Web Services? For our example data, even an m3.medium would suffice, which in us-west-2 is $0.067/hour. I was also inspired by the DocNow team’s workflow, which allows for quick spin-up of an instance on AWS (which I’ll hopefully use in Denver this January for an AHA workshop I’m running).
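To put that hourly rate in workshop terms, here is a rough back-of-the-envelope sketch. The rate is the m3.medium figure quoted above; the participant count and session length are made-up numbers for illustration, and the estimate ignores storage and data transfer charges.

```shell
#!/bin/sh
# Rough cost estimate for running workshop instances on EC2.
# The hourly rate is the m3.medium us-west-2 figure quoted in the post;
# check current AWS pricing before relying on it.

# estimate_cost RATE_PER_HOUR HOURS INSTANCES -> prints the cost in dollars
estimate_cost() {
  awk -v rate="$1" -v hours="$2" -v count="$3" \
    'BEGIN { printf "%.2f\n", rate * hours * count }'
}

# Hypothetical workshop: 20 participants, one instance each, 3 hours.
estimate_cost 0.067 3 20
```

Under those assumptions the whole session comes to about four dollars of instance time.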

How do you get this set up?

The documentation is ever-evolving and can be found in our repository, but after installing the dependencies (Vagrant, the vagrant-aws plugin, and git), you can edit the Vagrantfile with your Amazon credentials. The following command will then, by default, spin up an m3.medium:

vagrant up --provider aws
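Before running that command, it can help to confirm the dependencies are actually in place. The following is a small preflight sketch, not part of the workshop repository: the `missing_tools` helper is a name made up here, and the check assumes `vagrant plugin list` prints one installed plugin per line starting with its name.

```shell
#!/bin/sh
# Preflight check before `vagrant up --provider aws`.

# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools vagrant git)
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names print on a single line.
  echo "Install these first:" $missing
elif vagrant plugin list | grep -q '^vagrant-aws '; then
  echo "Ready: run 'vagrant up --provider aws' from the repository directory."
else
  echo "Plugin missing. Run: vagrant plugin install vagrant-aws"
fi
```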

You may have to manually edit the security group in your EC2 dashboard to allow inbound traffic on ports 22 (SSH) and 9000 (custom TCP rule, if you want to use the Spark notebook). After the lengthy install process, vagrant ssh will connect you to a machine with warcbase running.
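If you would rather not click through the dashboard, the same inbound rules can probably be added with the AWS CLI. This is a sketch under assumptions: the security group ID and the CIDR range below are placeholders, not values from the workshop setup, and the script defaults to a dry run that only prints the commands it would execute.

```shell
#!/bin/sh
# Open the SSH and Spark notebook ports on a security group.
# Both values below are placeholders; substitute your own group ID and IP range.
SG_ID="sg-0123456789abcdef0"
MY_CIDR="203.0.113.10/32"

# Dry run by default: print each command instead of calling AWS.
DRY_RUN="${DRY_RUN:-1}"

# open_port PORT -> adds (or prints) one inbound TCP rule for that port
open_port() {
  set -- aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" --protocol tcp --port "$1" --cidr "$MY_CIDR"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

open_port 22     # SSH
open_port 9000   # Spark notebook
```

Running it with `DRY_RUN=0` would apply the rules for real. Restricting the CIDR to your own address is a safer default than opening the ports to everyone.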

You can then use the shell or notebook to your heart’s content, and attach EBS volumes so you can work with web archives that you might have lying around. I think it’s a potentially useful addition to web archives!

You made it this far, so here's my research assistant Auden playing with web archives.
