Generating List of Domain-Specific WARC Files to Download

In my last post, I took all of the CDX files for the 80TB Wide Crawl and counted how often the .ca domain appeared in each one. Once I had that data, the next step was to sort it (code) so I could pick out the most relevant files.

Each entry in this database was:

{FILE NUMBER, CDX, FREQUENCY}

Sorting by the last field, the frequency, generated a list of the crawl files that had the most .ca data. There was a discrepancy with the data provided by the Internet Archive: I identified 8,885,682 .ca domains and the Internet Archive identified 8,512,275, a gap of 373,407, or about 4.4%. I double-checked for duplicates that may have crept in and didn’t find any. Still, those numbers are close enough for comfort at this stage.
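As a rough illustration of that sorting step, here is a minimal Python sketch. It is not the code linked above; the tuple layout simply mirrors the {FILE NUMBER, CDX, FREQUENCY} entries, the three rows are taken from the list later in this post, and the function name `sort_by_frequency` is my own invention.

```python
# Each row mirrors the {FILE NUMBER, CDX, FREQUENCY} layout from the post.
entries = [
    (4453, "WIDE-20110714054527-crawl416/WIDE-20110714054527-crawl416.cdx.gz", 39513),
    (4456, "WIDE-20110714062831-crawl416/WIDE-20110714062831-crawl416.cdx.gz", 73821),
    (4458, "WIDE-20110714072823-crawl416/WIDE-20110714072823-crawl416.cdx.gz", 58581),
]


def sort_by_frequency(rows):
    """Return rows ordered from the most .ca entries to the fewest."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


ranked = sort_by_frequency(entries)
for file_number, cdx_path, frequency in ranked:
    print(file_number, frequency)
```

Sorting on the third field in descending order puts the files with the most .ca data at the top, which is all that is needed to pick case-study candidates.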

It was an interesting window into how the .ca domains were distributed throughout the Wide Crawl. The mean file had 1012.5 .ca sites, and the median was 732. There isn’t a magic bullet to just get the “.ca Internet” out of these files, of course, but we can pick some case-study files that should have the largest amount of Canadian content.
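A mean well above the median usually suggests a skewed distribution, where a handful of files hold far more .ca data than the typical one. The sketch below shows how the two figures can be computed with Python's standard library. The counts are invented purely for illustration; they are not the real per-file numbers.

```python
from statistics import mean, median

# Hypothetical per-file .ca counts, made up for this example only.
counts = [100, 300, 700, 900, 3000]

average = mean(counts)
middle = median(counts)

print("mean:", average)
print("median:", middle)

# One very large file pulls the mean above the median.
print("mean exceeds median:", average > middle)
```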

Beginning with roughly ten gigabytes for a case study, here’s what I’ll be using:

{{4456, WIDE-20110714062831-crawl416/WIDE-20110714062831-crawl416.cdx.gz,73821},
{4458, WIDE-20110714072823-crawl416/WIDE-20110714072823-crawl416.cdx.gz,58581},
{4459, WIDE-20110714080057-crawl416/WIDE-20110714080057-crawl416.cdx.gz,45163},
{4453, WIDE-20110714054527-crawl416/WIDE-20110714054527-crawl416.cdx.gz,39513},
{4464, WIDE-20110714092713-crawl416/WIDE-20110714092713-crawl416.cdx.gz,35986},
{4466, WIDE-20110714100706-crawl416/WIDE-20110714100706-crawl416.cdx.gz,33982},
{4461, WIDE-20110714084349-crawl416/WIDE-20110714084349-crawl416.cdx.gz,33892},
{4467, WIDE-20110714104120-crawl416/WIDE-20110714104120-crawl416.cdx.gz,29423},
{4470, WIDE-20110714112924-crawl416/WIDE-20110714112924-crawl416.cdx.gz,27637},
{4473, WIDE-20110714120922-crawl416/WIDE-20110714120922-crawl416.cdx.gz,19223}}

These have a lot of the .ca Internet in them, representing about 4.5% of it (397,221 of the 8,885,682 .ca entries I counted). That’s a decent beginning, and as my project gets off the ground with more storage and resources, we can probably push that up to at least 10%.
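For anyone who wants to check that share, here is a small Python sketch. The ten frequencies and both totals come straight from this post; the variable names are just mine.

```python
# Frequencies of the ten case-study files listed above.
frequencies = [73821, 58581, 45163, 39513, 35986,
               33982, 33892, 29423, 27637, 19223]

MY_TOTAL = 8_885_682   # .ca count from my own tally
IA_TOTAL = 8_512_275   # .ca count reported by the Internet Archive

case_study_total = sum(frequencies)
share_of_my_total = case_study_total / MY_TOTAL
share_of_ia_total = case_study_total / IA_TOTAL

print(case_study_total)
print(f"{share_of_my_total:.1%} of my count")
print(f"{share_of_ia_total:.1%} of the Internet Archive's count")
```

Whichever total is used as the denominator, the ten files come in a little under 5%.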

The next step is to set up a download this afternoon to bring all of those files down. I’ll then have the case study for my historical web archive experiment.
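One way to prepare that download is to turn the CDX paths into a list of item names and URLs first. The Python sketch below is only a sketch under assumptions: it assumes the first component of each CDX path is an Internet Archive item identifier and that items are served under `https://archive.org/download/`. The helper names are hypothetical, and since the WARC file names inside each item are not listed in this post, the sketch stops at the item and CDX level.

```python
# Assumption: Internet Archive items are reachable under this base URL.
BASE_URL = "https://archive.org/download"

# Two of the CDX paths from the case-study list above.
cdx_paths = [
    "WIDE-20110714062831-crawl416/WIDE-20110714062831-crawl416.cdx.gz",
    "WIDE-20110714072823-crawl416/WIDE-20110714072823-crawl416.cdx.gz",
]


def item_identifier(cdx_path):
    """Return the part of the path before the first slash (assumed item name)."""
    return cdx_path.split("/", 1)[0]


def cdx_url(cdx_path):
    """Build a download URL for the CDX file itself (assumed URL layout)."""
    return f"{BASE_URL}/{cdx_path}"


for path in cdx_paths:
    print(item_identifier(path), cdx_url(path))
```

The printed lines could be redirected to a text file and handed to whatever download tool is at hand.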

And as an aside, I decided to open up file #4456 to see what I’d find inside – and it was a bunch of files from York University, where I did my PhD. Neat coincidence.

