Using Images to Gain Insight into Web Archives?

Do you like animated GIFs? Curious about what this is? Then read on... 🙂

Full confession: I rely too heavily on text. It was a common proviso of the talks I gave this summer on my work, which focused on the workflow that I used for taking WARC files, implementing full-text search, and extracting meaning from it all. What could we do if we decided to extract images from WARC files (or other forms of web archives) and began to distantly read them? I think we could learn a few things.

A montage of images from the GeoCities EnchantedForest neighbourhood. Click for a higher-resolution version.

This continues from the work discussed in my “Image File Extensions in the Wide Web Scrape” post, which covers the basics of what I did to begin playing with images. It also touches on work I’ve done creating montages of images in both GeoCities and the Wide Web Scrape.

While creating montages was fun, it didn’t necessarily scale: beyond a certain level, you find yourself clicking and searching around. I like creating them, think they make wonderful images and are useful on several levels, but it’s hardly harnessing the power of the computer. So I’ve been increasingly playing with various image analysis tools to distantly read them.

Extracting Images and Analyzing Them (Generating Some Data)

I’ve been focusing on GeoCities in the last few days. Basically, I take the GeoCities archive, extract the images (using my array of extensions from the Wide Web Scrape), and put them into a directory of numbered files. The ‘neighborhoods’ variable holds a list of the neighbourhoods, which are directories; for each one I find the image filenames, assign each file an ever-increasing number, and copy it into a neighbourhood-specific directory.

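for hood in $neighborhoods; do
	echo "beginning $hood"
	n=0  # running counter, restarted per neighbourhood (an assumption; the original counter lines did not survive)
	find -E . -iregex ".+\/${hood}\/.+(\.gif|\.jpg|\.tif|\.jpeg|\.tiff|\.png|\.jp2|\.j2k|\.bmp|\.pict|\.wmf|\.emf|\.ico|\.xbm)" | while IFS= read -r filename; do
		n=$((n + 1)); newname="${n}.${filename##*.}"  # assumed naming: running number plus the original extension
		cp "$filename" /volumes/lacie-1/geocities-images/"$hood"/"$newname"
	done
done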

I then have big directories of files. My first few attempts involved ingesting them all into Mathematica, which was a super quick way to run out of memory even with 64GB. To give you a sense of the sizes at work here, a relatively small neighbourhood like the ‘Enchanted Forest’ for kids has 243,520 images; a mid-sized one like ‘Athens’ for philosophy and big thinking has 579,805; and the largest, the family-focused ‘Heartland’ section, has 1,468,762!

I then turned to ImageMagick, which has some cool image analysis tools. Drawing on a StackOverflow question by Thomas Padilla (Digital Humanities Librarian at MSU), I ran the following command in each directory (a snippet of a broader script that does this for every neighbourhood). The fullsize prefix on a few of the outputs differentiates them from some other work I did on ‘thumbnail’ versions to compare results and performance.

identify -verbose "*" > ./analysis/fullsize-image-output-enchantedforest.log
for f in *.*; do echo "$f => `identify -verbose "$f" | grep mean | tail -n1 | cut -d':' -f2 | xargs`"; done > ./analysis/fullsize-image-output-RGB-Means-enchantedforest.log
for f in *.jpg; do echo "$f => `identify -verbose "$f" | grep mean | head -n3 | cut -d':' -f2 | xargs | awk '{print "R = "$1" "$2", G = "$3" "$4", B = "$5" "$6}'`"; done > ./analysis/fullsize-image-output-RGB-values-enchantedforest.log
for f in *.*; do echo "$f => `identify -verbose "$f" | grep signature | xargs`"; done > /volumes/lacie-1/geocities-images/enchantedforest/analysis/enchantedforest-image-signatures.log

Or, if I don’t care about the colour values, just:

identify -verbose "*" > ./analysis/fullsize-image-output-heartland.log
for f in *.*; do echo "$f => `identify -verbose "$f" | grep signature | xargs`"; done > /volumes/lacie-1/geocities-images/heartland/analysis/heartland-image-signatures.log

It’s messy, but basically, at the end of it I have four output files:
1. fullsize-image-output-enchantedforest.log: a verbose output for each file (more on this);
2. fullsize-image-output-RGB-Means-enchantedforest.log: the overall RGB mean for each file;
3. fullsize-image-output-RGB-values-enchantedforest.log: the mean R, G, and B values for each JPG file;
4. enchantedforest-image-signatures.log: the ImageMagick hash strings for each image.

We can do some fun stuff with this that I think might be handy as finding aids or distant reading accessories for lots of images in web archives.

So What Can We Do with This?

I should put up an “under construction” JPG here, as that’s what I’ll hopefully be figuring out this week. 🙂

First, we can do quick greps to pull basic details out of ImageMagick’s verbose output (the fields can all be found here). We can find out the percentages of colour vs. grayscale images, resolutions, file formats, pixel counts, colorspaces, hues, RGB means, etc. I often dump these using a grep command and then read them into Mathematica. So we can see that in the Athens neighbourhood we have 962,402 RGB pictures versus 103,083 Grayscale images, and that white is overwhelmingly the most common background colour.
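For what it’s worth, a grep like that can be wrapped in a small helper. This is a minimal sketch rather than the exact commands used here: tally_field is a hypothetical name, and it assumes the verbose log contains lines of the form "Colorspace: sRGB", which is how identify -verbose reports most of its fields.

```shell
#!/usr/bin/env bash
# Tally the values of one field from an `identify -verbose` log.
# Usage: tally_field LOGFILE FIELD   (FIELD is e.g. Colorspace or Type)
# tally_field is a hypothetical helper name, not part of ImageMagick.
tally_field() {
    local log="$1" field="$2"
    # Keep only lines such as "  Colorspace: sRGB", strip the indentation
    # and the label, then count how often each remaining value occurs.
    grep -E "^[[:space:]]*${field}:" "$log" \
        | sed -E "s/^[[:space:]]*${field}:[[:space:]]*//" \
        | sort | uniq -c | sort -rn
}
```

Calling tally_field ./analysis/fullsize-image-output-athens.log Colorspace would print each colourspace with its count, most common first (the Athens log name just follows the naming pattern of the other logs and is an assumption). As far as I know, identify reports every frame of an animated GIF separately, so these tallies count frames rather than files.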

Some faces from the FashionAvenue neighbourhood of GeoCities.

Second, we have all the images themselves, with which we can do other things. I’ve been using FaceDetect to see what percentage of files have faces in them, and feeding them into Mathematica as well. I’ll be focusing on this during the week, as I’ve got some more robust Mathematica code to work with (if you care, some images break the Import function, but I can implement a delay to skip them and that seems to work well).

Finally, I’ve been using Mathematica’s rapid prototyping function to play with recurring images. A research question that I have been engaging with is how consistent these neighbourhoods were: did they borrow from each other? Did they have similar topics? Were they engaged with each other? In short, was there an engaged virtual community?

This is where those ImageMagick signatures come in. I wrote a quick program that counted the most frequent hashes, and then created a Mathematica list with the following information:


So for example, the most popular image – an animated GIF – from the Athens neighbourhood:

Moving the slider or changing the value lets us explore the data deeper.

From this, we see that this banner advertisement for GeoCities itself was frequent (appearing 465 times – not on all pages, but quite a few) and was distributed throughout the entire neighbourhood. If the occurrences were tightly clustered, we might learn that it was a single site repeating an image over and over and over again (it’s GeoCities, so you never know, right?). It’s a good test that something is working.
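The counting step can be sketched in a few lines of shell. This is not the program mentioned above, just a minimal sketch assuming the signature log has one line per file in the form "123.gif => signature: <hash>", which is what the echo in the earlier command produces; top_signatures and files_with_signature are hypothetical names.

```shell
#!/usr/bin/env bash
# Both helpers read a signature log with lines like
#   123.gif => signature: <hash>
# The function names are hypothetical, chosen for this sketch.

# Print the N most frequent signatures with their counts (default N = 10).
top_signatures() {
    local log="$1" n="${2:-10}"
    # Drop everything up to "=> " so only the signature part is compared.
    sed -E 's/^.* => //' "$log" | sort | uniq -c | sort -rn | head -n "$n"
}

# Print the names of the files whose signature part equals SIG exactly.
files_with_signature() {
    local log="$1" sig="$2"
    awk -F ' => ' -v sig="$sig" '$2 == sig { print $1 }' "$log"
}
```

Because the files were numbered as they were copied, the list from files_with_signature doubles as rough position data: numbers spread across the whole range suggest an image used throughout a neighbourhood, while a tight run of neighbouring numbers suggests one site repeating it.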

If we pop over to EnchantedForest, we find it as well. But we begin to see images like:

T-I-double"guh"-errrrr, that spells Tigger!

This was far more common for EnchantedForest: prominent, widely distributed images include neighbourhood-specific awards, SafeSurf logos, and clip art of cartoon characters that was shared around. Less so for Athens.

It’s a fun way to begin exploring images. I think the distribution chart helps out, and it’s a new way to distantly read them.

Next Steps

This week I’ll be working on faces, and I should also – drawing on some suggestions from McGill’s Daniel Simeone – get some more fuzziness from images. Right now, if you took an image and resized it slightly, it wouldn’t show up in the duplicates list. Most users who created basic GeoCities pages would have used standard templates and lifted images, but more advanced ones may have done some basic edits (tinkering in MS Paint, for example). So this “Detect Duplicate Images with Python” script looks great.

And, of course, the semester looms so I should probably start thinking about teaching too…
