The sheer amount of social, cultural, and political information generated and, crucially, preserved every day presents exciting new opportunities to historians. Much of this information is held in web archives, which now contain billions of web pages. Scholars broaching topics dating back to the mid-1990s will find their projects enhanced by web data: military historians can use forum posts by soldiers, social historians can track aspects of everyday life through blogs and comments, political historians can study changing sentiment, tropes, and link structures, and economic historians can explore the rise and fall of business webpages. Yet this tremendous opportunity is tempered by the sheer challenge of dealing with all that data: we have more information than ever before, but the scale is overwhelming.
Beyond the basic problems of having enough storage and computational power to deal with all of this information, historians face several common tensions; I will focus on two. The first is that while historians largely want to work with content, technological limitations push us towards rich metadata instead. The second is that without a basic understanding of the conceptual structure of a web archive, from crawl structure to its biases, we can generate wildly misleading results: a problem historians face with most digitized sources.
In this talk, I explore these tensions as they have played out across three case studies: a compiled collection of mirrored websites, and the 2005-present Archive-It collections of Canadian political parties, unions, and organizations (WAT files, which contain derivative data). For each archive, I briefly discuss the usage, technical, and ethical challenges such collections present for historians: too much data, long processing times, and the difficulty of applying cutting-edge natural language processing.