Working with DPLA Rights Statements

The initial download and unzipping of the DPLA corpus took 2.5 hours over my home network. Unzipped, you're working with one gigantic 40+ GB text file containing every record in the DPLA with every one of its associated metadata fields. The first issue is how you're going to view the file. Almost any standard text editor is going to have major issues with it. Even the old standby vim couldn't deal with it. If you just need to view the file, the Unix less utility will do the trick, but I never did solve the problem of editing the file directly in an editor. I had hopes that OpenRefine might be able to deal with the data, but it too needs to hold the entire file in memory.

The solution I eventually went with was to loop through the file a small chunk at a time, pulling out the rights statements and dumping them into a database table that ended up with north of 3.2 million records in it. I then ran the following query from the MySQL command line: SELECT COUNT(*) AS totals, license FROM license GROUP BY license ORDER BY totals DESC INTO OUTFILE '/tmp/licenses.txt'; The query took about two and a half minutes to run, but it leaves you with a nice file of about 26,000 records grouped by license counts. The text file needed a bit of massaging in Excel to get the columns just right, but the result is a file that can be manipulated in Excel or a standard text editor.
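A minimal sketch of that loop in Python might look like the following. It is not the original script: it assumes the dump holds one JSON record per line with a top-level "rights" field (a hypothetical layout), and it swaps in SQLite from the standard library for MySQL so the example runs without a server. The function names and the batch size are illustrative.

```python
import json
import sqlite3
from typing import Iterable, Iterator, List, Tuple


def iter_rights(lines: Iterable[str]) -> Iterator[str]:
    """Yield rights statements one record at a time, never loading the whole file."""
    for line in lines:
        line = line.strip().rstrip(",")
        if not line.startswith("{"):
            continue  # skip blank lines and any enclosing array brackets
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort a long run
        # Assumption: "rights" is either a string or a list of strings.
        rights = record.get("rights")
        if isinstance(rights, str):
            yield rights
        elif isinstance(rights, list):
            for item in rights:
                if isinstance(item, str):
                    yield item


def load_licenses(conn: sqlite3.Connection, lines: Iterable[str],
                  batch_size: int = 10_000) -> int:
    """Insert rights statements in batches and return the number of rows written."""
    conn.execute("CREATE TABLE IF NOT EXISTS license (license TEXT)")
    batch: List[Tuple[str]] = []
    total = 0
    for rights in iter_rights(lines):
        batch.append((rights,))
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO license (license) VALUES (?)", batch)
            total += len(batch)
            batch.clear()
    if batch:  # flush whatever is left over
        conn.executemany("INSERT INTO license (license) VALUES (?)", batch)
        total += len(batch)
    conn.commit()
    return total


def license_counts(conn: sqlite3.Connection) -> List[Tuple[int, str]]:
    """Same grouping as the MySQL query, minus the INTO OUTFILE clause."""
    return conn.execute(
        "SELECT COUNT(*) AS totals, license FROM license "
        "GROUP BY license ORDER BY totals DESC, license ASC"
    ).fetchall()
```

To run it against a real dump, open the file in text mode and pass the file object straight to load_licenses. Python iterates over a file one line at a time, so memory use stays flat no matter how large the file is.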

From there I ran the license file through a series of rather naive regular expressions to group the licenses by copyright type: No Known Copyright, In Copyright, Creative Commons and Unknown, saving the results as JSON. I erred on the side of being too expansive. If licenses had the same wording but different copyright years, they counted as separate licenses.
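The grouping step can be sketched roughly as below. The patterns are hypothetical stand-ins, not the expressions actually used, and filing "public domain" under No Known Copyright is an assumption made for the example. Rules are checked in order, so the No Known Copyright rule has to come before the broader In Copyright rule, which would otherwise catch any statement containing the word "copyright".

```python
import json
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical patterns, checked top to bottom; the first match wins.
RULES = [
    ("Creative Commons",
     re.compile(r"creative\s*commons|\bcc[\s-]?(by|0|zero)\b", re.I)),
    # Assumption: public domain statements are filed under No Known Copyright.
    ("No Known Copyright",
     re.compile(r"no known (copyright|restrictions)|public domain", re.I)),
    ("In Copyright",
     re.compile(r"copyright|©|all rights reserved", re.I)),
]


def classify(license_text: str) -> str:
    """Return the first matching copyright type, or Unknown if nothing matches."""
    for category, pattern in RULES:
        if pattern.search(license_text):
            return category
    return "Unknown"


def group_by_type(rows: Iterable[Tuple[int, str]]) -> Dict[str, List[dict]]:
    """Group (total, license) rows by copyright type.

    Statements that differ only by year stay separate entries,
    since nothing here normalizes the text.
    """
    grouped = defaultdict(list)
    for total, text in rows:
        grouped[classify(text)].append({"license": text, "total": total})
    return dict(grouped)


def to_json(rows: Iterable[Tuple[int, str]]) -> str:
    """Serialize the grouped licenses as JSON."""
    return json.dumps(group_by_type(rows), indent=2)
```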

Treemaps require that the data be in a very specific format. d3.nest() can get you most of the way there. I highly recommend Mr Nester by Shan Carter for playing around with getting your data into the correct format. From there it was just a matter of playing around with how many records you want to show. 26,000 is far too many for anyone to take in, even if your browser could handle it. I settled on 575 through trial and error, as it seemed to strike a nice balance between inclusiveness and browser responsiveness.
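If you would rather shape the data before it reaches the browser, the trimming and nesting can be sketched in Python as well. This assumes a name/children/value hierarchy, which is a common input shape for treemap layouts but not necessarily the one used here (d3.nest() itself produces key/values pairs). The function name and the root label are made up for the example.

```python
import json
from collections import defaultdict
from typing import Iterable, Tuple


def build_tree(rows: Iterable[Tuple[str, str, int]], limit: int = 575) -> dict:
    """Keep the `limit` most common licenses and nest them under their copyright type.

    rows holds (category, license, total) tuples.
    """
    # Largest counts first, then cut off the long tail.
    top = sorted(rows, key=lambda row: row[2], reverse=True)[:limit]

    groups = defaultdict(list)
    for category, license_text, total in top:
        groups[category].append({"name": license_text, "value": total})

    return {
        "name": "licenses",  # hypothetical root label
        "children": [
            {"name": category, "children": leaves}
            for category, leaves in groups.items()
        ],
    }


def tree_json(rows: Iterable[Tuple[str, str, int]], limit: int = 575) -> str:
    """Serialize the trimmed hierarchy for loading in the browser."""
    return json.dumps(build_tree(rows, limit))
```

Raising or lowering limit is the knob for trading inclusiveness against browser responsiveness.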

Thanks for taking a look, and happy browsing.