Data, Time and Trust

What does it take to build a digital archive?

I’m a graphic design geek and recently I’ve been spending time with the Letterform Archive in San Francisco.

The archive consists of over 30,000 items spanning 2,000 years of lettering, typography, calligraphy and design. It’s an incredible resource, and the people behind it are currently embarking on the digitization and cataloging of these items. Although we mostly think of Qlik Sense as a data visualization tool, I love using its associative engine to search, filter and explore qualitative information. I’ve used it to look at the 1 million images from the British Library, the Tate’s online collection and the Internet Archive’s 65,000 books with over 24 million pages (although I haven’t quite finished reading them just yet).

For me, the Letterform Archive is as much a fascinating data problem as it is an opportunity to get my hands on some amazing design artifacts. However, as I said, they are at the beginning of the archival task, and it’s pretty daunting. It takes deep domain knowledge and library science skills to set out the metadata structure. It also requires many other experts to add the really exciting supplementary data for an item, such as who designed it or what fonts are used. Of course, what supplementary data is needed or useful may not be obvious immediately, or even known right now. The standards will need to evolve.

As this is a visual archive, the digital images are incredibly important. A single ‘raw’ image data file at the maximum available resolution weighs in at a whopping 500MB or so, and there can be hundreds of pages in an individual item. The drive space and backup issues alone are staggering, and on top of that there is the time taken to carefully photograph each item. Then there is the descriptive data: the information about each item, such as its ID, title, creator and format. The standard for most libraries and archives is the MARC record, which has been around since the Library of Congress started using computers in the 1960s. The format is effective, but not easy to work with from a data exploration standpoint. High-quality cataloging and data collection is slow and time-consuming.
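To get a feel for the scale, here is a rough back-of-envelope sketch. Only the 500MB per raw image and the 30,000 item count come from the figures above; the number of images per item, the number of backup copies and the function name `raw_storage_tb` are assumptions made purely for illustration.

```python
# Back-of-envelope storage estimate. Only the 500 MB per raw image and the
# 30,000 item count come from the article; everything else is an assumption.

MB_PER_RAW_IMAGE = 500      # from the article: roughly 500 MB per raw image
ITEMS_IN_ARCHIVE = 30_000   # from the article: over 30,000 items


def raw_storage_tb(items, images_per_item, copies=1):
    """Return the storage needed in terabytes (decimal: 1 TB = 1,000,000 MB)."""
    total_mb = items * images_per_item * MB_PER_RAW_IMAGE * copies
    return total_mb / 1_000_000


if __name__ == "__main__":
    # Hypothetical scenarios: the real average number of images per item
    # is not stated in the article.
    for images_per_item in (1, 10, 100):
        primary = raw_storage_tb(ITEMS_IN_ARCHIVE, images_per_item)
        backed_up = raw_storage_tb(ITEMS_IN_ARCHIVE, images_per_item, copies=3)
        print(f"{images_per_item:>3} images/item: {primary:,.0f} TB raw, "
              f"{backed_up:,.0f} TB with two extra backup copies")
```

Even the most conservative scenario, a single image per item, already comes to 15 TB before any backups are made.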

Building a digital data archive is more than just creating a collection! Here's why:

Libraries and archives have to play the long game. The key to the usefulness of an archive is the longevity, stability and authority of the information systems it supplies as well as the items it houses. But ultimately, to continue existing, it needs to be used: it needs to be accessed and to supply value. The same is true of all data.

So how do you get on with the long-term task of building the authoritative archive and, in the meantime, deliver interesting data and value?

First up, you need to be willing to break up and diversify the data. We can think of the base archival system as our governed ‘one version of the truth’. It’s stable, slow moving, carefully maintained, and governed by standards and policies. The trick is that, rather than attempting to load that system with every other piece of knowledge, we simply layer new information on top incrementally, as and when it’s created. This opens up the possibility of having various degrees of credibility and authority in the data, which is fine as long as that is explained and the core data is kept safe. Of course, this requires that each item is uniquely identified and that the same key is used across all the data collections. But once in place, this system opens up some great possibilities, like drawing on the domain expertise of specialist groups or simply taking a ‘many eyes’ approach, such as how the New York Public Library is using website visitors to help improve the data around their NYC historical maps.
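The layering idea can be sketched in a few lines of Python. This is a minimal illustration, not how the Letterform Archive or Qlik actually implements it: the class names, the `crowd` and `expert` authority levels, and the item ID `ITEM-0001` are all hypothetical. The point is that the governed core records are never modified, every extra fact is joined on the shared item key, and each fact carries its own source and level of authority.

```python
from dataclasses import dataclass

# Hypothetical authority levels, from lowest to highest trust.
AUTHORITY_RANK = {"crowd": 0, "expert": 1}


@dataclass(frozen=True)
class CoreRecord:
    """Governed catalog entry: the slow-moving 'one version of the truth'."""
    item_id: str
    title: str
    creator: str


@dataclass(frozen=True)
class Annotation:
    """One supplementary fact, stored outside the core system."""
    item_id: str
    field_name: str
    value: str
    source: str
    authority: str  # must be a key of AUTHORITY_RANK


class LayeredArchive:
    def __init__(self, core_records):
        # The shared key: every layer refers to items by item_id.
        self._core = {record.item_id: record for record in core_records}
        self._annotations = []

    def add_annotation(self, annotation):
        """Append a supplementary fact; the core record is never modified."""
        if annotation.item_id not in self._core:
            raise KeyError(f"unknown item_id: {annotation.item_id}")
        if annotation.authority not in AUTHORITY_RANK:
            raise ValueError(f"unknown authority: {annotation.authority}")
        self._annotations.append(annotation)

    def view(self, item_id, min_authority="crowd"):
        """Return the core record plus annotations at or above a trust level."""
        threshold = AUTHORITY_RANK[min_authority]
        layers = [
            a for a in self._annotations
            if a.item_id == item_id and AUTHORITY_RANK[a.authority] >= threshold
        ]
        return {"core": self._core[item_id], "annotations": layers}


if __name__ == "__main__":
    # Placeholder data for illustration only.
    archive = LayeredArchive(
        [CoreRecord("ITEM-0001", "Example chart artwork", "Example creator")]
    )
    archive.add_annotation(
        Annotation("ITEM-0001", "typeface", "Example Serif", "type specialists", "expert")
    )
    archive.add_annotation(
        Annotation("ITEM-0001", "tag", "chart", "website visitor", "crowd")
    )
    for a in archive.view("ITEM-0001", min_authority="expert")["annotations"]:
        print(a.field_name, a.value, a.authority)
```

Keeping the authority level on each annotation, rather than merging everything into the core record, lets users choose how much they trust what they see while the core data stays untouched.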

Building a visual archive is a long way from most BI and analytics projects, but many of the problems it faces apply to any data project. When you start your next data initiative, ask yourself:

  • How can we deliver value before we are ‘finished’ collecting and cleaning all the data?
  • How can we add new layers of information to the core data?
  • How can we balance the governance, manageability and accuracy of different data sources with the information needs of the users?

To finish with, here’s something from the Letterform Archive: it’s the artwork for a hand-crafted chart by William Addison Dwiggins, who is often credited with coining the term ‘graphic designer’.


 
