I had an idea that a playful way to investigate the distribution of human knowledge might be to look at the volume of books in each of the categories of the Dewey Decimal Classification system. Why not compare the proportion of books about Science (500s) with Religion (200s)? Is Literature (800s) bigger than Arts & Recreation (700s)? ... and so on. I soon realised this was silly.
But first, the results. This zoomable treemap shows the relative proportion of books in the British Library according to their Dewey Decimal number. Click to zoom in! Click the header to zoom out!
The treemap is a lot of fun to play with, but not very useful if you want to know which specific digit categories are the largest. You aren't going to zoom in and out of every division.
Here's an alternative view - a radial chart, on which I've labelled all digit categories that make up more than 1% of books.
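For anyone who wants to reproduce the 1% cut-off, here is a minimal sketch in JavaScript. It is not the code behind the chart: the input shape (an object of book counts keyed by three-digit DDC number) and the function name `sectionsAboveShare` are assumptions made for illustration.

```javascript
// Hypothetical input: { "823": 120, "530": 5, ... } mapping a three-digit
// DDC number to the number of books catalogued under it.
function sectionsAboveShare(counts, minShare = 0.01) {
  const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
  if (total === 0) return []; // nothing to label

  return Object.entries(counts)
    .map(([section, n]) => ({ section, share: n / total }))
    .filter((d) => d.share > minShare)     // keep only categories above the cut-off
    .sort((a, b) => b.share - a.share);    // largest share first
}

// Made-up counts, purely to show the call.
const labelled = sectionsAboveShare({ "823": 120, "530": 5, "200": 875 });
console.log(labelled.map((d) => d.section));
```

The share is computed against books that have a DDC number at all, which matches the plots only if uncategorised books have already been dropped from `counts`.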
Details and discussion
Let's talk briefly about what we can learn from this chart and why my initial hypothesis about sizing up human knowledge was silly.
1. Library bias
The British Library laudably makes their catalogue available as a data dump. The 'basic' (i.e. minimal) dataset is 27 GB.
You need only play with the treemap for a moment to discover that the British Library collection has a very high proportion of English Literature (>11%). It's inconceivable that English Literature could make up 11% of all human knowledge, so the British Library has a bias. That's because the British Library is a legal deposit library, meaning that it receives copies of all books published in the UK.
I expect that any library would have a 'bias' of some sort, but the legal deposit function which a great many of the world's largest libraries maintain will significantly exacerbate that bias. No doubt the Library of Congress is similarly biased towards American books. The British Library was the only library I could find which dumped its catalogue online, and so this plot reflects its bias.
One more note about the British Library: a large proportion of the books in their open catalogue do not have Dewey Decimal numbers (and are therefore omitted from plots). I'm not sure why, but it's possibly because the DDC is increasingly anachronistic in the age of computational information classification (ok, let's just call it "the age of Google"). I came across a few libraries that have completely abandoned the DDC.
Which brings us to:
2. Shortcomings of the DDC
The Dewey Decimal Classification first appeared in 1876 when Melvil Dewey published his A Classification and Subject Index for Cataloguing and Arranging the Books and Pamphlets of a Library. In this (and subsequent volumes), he mapped human knowledge to the famous "10 main classes, 100 divisions, and 1,000 sections." You know: the 500s are for Mathematics & Science, which contain the 530s for Physics, which contain 531 for Mechanics. Dewey left it to librarians to create decimal subcategories for finer discrimination (e.g. 531 Mechanics might contain 531.3 Kinetics, which might contain 531.34 Rotation). This means that the full DDC numbers you find on books will differ from library to library. In the plots above, you'll see that we've truncated each book's DDC number to its integer component.
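The truncation to the integer component, and the nesting of sections inside divisions inside main classes, can be sketched roughly as follows. This is an illustration under assumptions, not the code used for the plots: the function names `parseDdc` and `buildHierarchy` are hypothetical, and the input is assumed to be a plain list of DDC strings, one per book.

```javascript
// Split a DDC number into its three integer levels, dropping the decimals.
function parseDdc(raw) {
  const match = /^\s*(\d{3})(?:\.(\d+))?/.exec(String(raw));
  if (!match) return null; // missing or malformed number
  const section = match[1];
  return {
    mainClass: section[0] + "00",        // e.g. "500"
    division: section.slice(0, 2) + "0", // e.g. "530"
    section,                             // e.g. "531"
    decimals: match[2] ?? null,          // e.g. "34", ignored below
  };
}

// Count books per section, nested as main class > division > section.
function buildHierarchy(rawNumbers) {
  const root = { name: "DDC", children: [] };
  const findOrAdd = (parent, name) => {
    let node = parent.children.find((c) => c.name === name);
    if (!node) {
      node = { name, children: [] };
      parent.children.push(node);
    }
    return node;
  };
  for (const raw of rawNumbers) {
    const ddc = parseDdc(raw);
    if (!ddc) continue; // books without a usable number are skipped
    const cls = findOrAdd(root, ddc.mainClass);
    const div = findOrAdd(cls, ddc.division);
    const sec = findOrAdd(div, ddc.section);
    sec.value = (sec.value || 0) + 1;
  }
  return root;
}

console.log(parseDdc("531.34"));
```

The nested `{ name, children, value }` objects are meant to resemble the kind of input that D3's hierarchy layouts work from, though the real plots may have prepared their data quite differently.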
Then there's the copyright status of the scheme (and let's be clear: by scheme we are referring to the text labels attached to each number). In the same year that it was published, Dewey was granted copyright for his scheme. Although that original edition has long since passed out of copyright, the scheme continues to be updated and modernised (e.g. Computer Science & Programming has been added to the 000s). So although Dewey's original labels are in the public domain, the current DDC is not: it is owned and copyrighted by the Online Computer Library Center, Inc. (OCLC). (Interestingly, they make a big deal about the expanded 200 Religion class.)
DDC23 is the latest scheme and now has over 38k official divisions. The OCLC make the integer-division labels available in Excel format here. I am obliged to note that: All copyright rights in the Dewey Decimal Classification system are owned by OCLC. Dewey, Dewey Decimal Classification, DDC and WebDewey are registered trademarks of OCLC.
Mr. Melvil Dui
If you didn't know, he was a lifelong advocate of spelling reform, which is why he changed his name from Melville Dewey to Melvil Dui.
The two plots above were created with D3.js. Once again, thanks to Mike Bostock, both for D3.js and for his generous library of examples, which are so helpful. In this case, see especially Zoomable Treemaps and Sunburst Partition.
For any students of D3, the modifications which were most time-consuming for me were:
- Getting the label text to wrap by putting them in foreignObjects attached to the rectangles, and transitioning them from display:block. (This had a downside: I was unable to get them to fade-in or out using
- Correctly rotating the labels on the sunburst chart
- Making the sunburst chart a half-circle
- Coming up with a nice colour scheme :-)
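For the label rotation and half-circle items in the list above, the underlying geometry can be sketched in plain JavaScript. This is one possible approach under stated assumptions, not necessarily how the chart here does it: the function names are hypothetical, the partition coordinates are assumed to run from 0 to 1, and the half-circle is assumed to be the top half.

```javascript
// Angle in degrees for a partition coordinate x in [0, 1], laid out on the
// top half-circle. d3.arc measures angles clockwise from 12 o'clock, so the
// half-circle runs from -90 (9 o'clock) to +90 (3 o'clock).
function halfCircleDegrees(x) {
  return (x - 0.5) * 180;
}

// The same angle in radians, the unit d3.arc uses for startAngle/endAngle.
function halfCircleRadians(x) {
  return (halfCircleDegrees(x) * Math.PI) / 180;
}

// SVG transform for a label centred on the arc spanning x0..x1 at `radius`.
// Labels on the left half are turned a further 180 degrees so that they do
// not read upside down.
function labelTransform(x0, x1, radius) {
  const degrees = halfCircleDegrees((x0 + x1) / 2);
  const flip = degrees < 0 ? 180 : 0;
  return `rotate(${degrees - 90}) translate(${radius},0) rotate(${flip})`;
}

console.log(labelTransform(0.5, 1, 100)); // rotate(-45) translate(100,0) rotate(0)
console.log(labelTransform(0, 0.5, 100)); // rotate(-135) translate(100,0) rotate(180)
```

Subtracting 90 converts from the arc convention (0 at 12 o'clock) to SVG's rotation convention (0 along the positive x axis). A flipped label will probably also need its text-anchor switched so that it still extends outward from the centre.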
The code can be viewed as GitHub gists: