Generating List of Domain-Specific WARC Files to Download

In my last post, I took all of the CDX files for the 80TB Wide Crawl and counted how often the .ca domain appeared in each file. Once I had all that data, the next step was to sort it (code) so I could pick out the most relevant files.

Each entry in this database was:

{FILE NUMBER, CDX, FREQUENCY}

Sorting by the last field, frequency, generated a list of the files that had the most .ca data. There was a discrepancy with the data provided by the Internet Archive: I identified 8,885,682 .ca domains, while the Internet Archive identified 8,512,275, a difference of about 4%. I double-checked for duplicates that may have crept in and found none. Still, those numbers are close enough for comfort at this stage.
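
To make the sorting step concrete, here is a minimal sketch in Python. It is not the code linked above: the CdxCount type, the sort_by_frequency helper, and the sample file names and counts are all made up for illustration, and only the three fields come from the entry format described earlier.

```python
from typing import NamedTuple


class CdxCount(NamedTuple):
    """One entry: {FILE NUMBER, CDX, FREQUENCY}. Hypothetical type name."""
    file_number: int
    cdx: str
    frequency: int


def sort_by_frequency(entries):
    """Return the entries ordered from most to fewest .ca hits."""
    return sorted(entries, key=lambda e: e.frequency, reverse=True)


# Made-up sample data, not real Wide Crawl counts.
sample = [
    CdxCount(1, "example-00001.cdx", 732),
    CdxCount(2, "example-00002.cdx", 4210),
    CdxCount(3, "example-00003.cdx", 15),
]

ranked = sort_by_frequency(sample)

# Keep the two files with the most .ca entries as download candidates.
top_files = [entry.cdx for entry in ranked[:2]]
print(top_files)
```

Sorting in descending order puts the richest files first, so picking candidates to download is just a matter of slicing the top of the list.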

It was an interesting window into how the .ca domains were distributed throughout the Wide Crawl. On average, each file had 1012.5 .ca sites, with a median of 732. There isn’t a magic bullet to just get the “.ca Internet” out of these files, of course, but we can pick out some case study files that we hope have the largest amount of Canadian content.
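
As a rough illustration of why the mean and median can differ like this, the sketch below computes both for a handful of per-file counts. The numbers are invented for the example and are not the real Wide Crawl figures; the point is only that a few very rich files pull the mean above the median.

```python
import statistics

# Hypothetical per-file .ca counts; one file is much richer than the rest.
counts = [15, 300, 732, 900, 4210]

mean_count = statistics.mean(counts)
median_count = statistics.median(counts)

# A mean well above the median suggests a skewed distribution,
# where a small number of files hold a large share of the .ca entries.
print(f"mean={mean_count:.1f} median={median_count}")
print("skewed toward a few large files:", mean_count > median_count)
```

If the real distribution behaves the same way, the files at the top of the sorted list would be the natural case studies.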