Tuesday 24 March – Text Mining the History of Medicine

Please join us for our next seminar.

Presenters:  Sophia Ananiadou (Manchester University)

Title:  Text Mining the History of Medicine

Date:  24 March 2015

Time:  5:15 PM (GMT)

Venue:  John S Cohen Room 203, 2nd floor, IHR, North block, Senate House or live online via the Digital History Seminar blog.

Live Stream

Slide Show 

Abstract: I will present the results of a collaborative and interdisciplinary project between the National Centre for Text Mining (NaCTeM) and the Centre for the History of Science, Technology and Medicine (CHSTM) at the University of Manchester, demonstrating the capabilities of innovative text mining tools to allow the automatic extraction of information from two historical archives: the British Medical Journal (BMJ) (1840 – present) and the London-area Medical Officer of Health (MOH) reports (1848-1972). NaCTeM’s text mining tools have automatically enriched these historical archives with semantic metadata by extracting terms, named entities and events. The development of a semantic search system focused on understanding historical changes in lung diseases since 1840.

Seminars are normally streamed live online on this blog and on YouTube. To keep in touch, follow us on Twitter (@IHRDigHist) or at the hashtag #dhist.


Will historians of the future be able to study Twitter?

Over the last year or so, our seminar has become increasingly web-focussed. Last week we had an excellent paper from Jack Grieve of Aston University on the tracking of newly emerging words as they appeared in large corpora of tweets from the UK and the US. By amassing very large tweet datasets, he and his colleagues are able to observe the early traces of newly emerging words, and also (when those tweets were submitted from devices which attach geo-references) to see where those new words first appear, and how they spread. Jack and his colleagues are finding that words quite often emerge first (in the US) in the east and south-east (or California) and then spread towards the centre of the continent. They don’t necessarily spread in even waves across space, or even spring between urban centres and then to rural areas (as would have been my uneducated guess). Read more at the project site, treets.net, or watch the paper.

This kind of approach is quite impossible without the very large-scale natural language data that social media afford. This is particularly so as most words are (perhaps counter-intuitively) rather rare. In the corpus in question, the majority of the 67,000 most common words appear only once in 25 million words. Given this, datasets of billions of tweets are the minimum size necessary to be able to see the patterns.
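To make the point about rarity concrete, here is a minimal Python sketch (not the project's own code) that counts how many distinct words in a corpus occur exactly once. The function name `rarity_profile`, the crude tokeniser and the two toy sentences are all illustrative assumptions.

```python
from collections import Counter
import re


def rarity_profile(texts):
    """Return (total tokens, distinct words, words seen exactly once)."""
    counts = Counter()
    for text in texts:
        # Crude tokeniser (an assumption): lower-cased runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    seen_once = sum(1 for n in counts.values() if n == 1)
    return sum(counts.values()), len(counts), seen_once


# Toy corpus standing in for a collection of tweets.
sample = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
total, distinct, once = rarity_profile(sample)
print(f"{once} of {distinct} distinct words occur only once in {total} tokens")
```

Even in this tiny sample more than half of the distinct words occur only once, which is why a rare new word needs a very large corpus before it appears often enough to be tracked.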

It was interesting to me as a convenor to see the rather different spread of people who came to this paper, as opposed to the more usual digital history work the seminar showcases. Jack focussed on tweets posted since 2013, a time span that even the most contemporary historian would struggle to call their own, and so not so many historians came along (though we had perhaps our first mathematician instead). This was a shame, as Jack’s paper was a fascinating glimpse into the way that historical linguistics, and indeed other types of historical enquiry, might look in a couple of decades’ time.

But there is a caveat to this, which was beyond the scope of Jack’s paper, to do with the means by which this data will be accessible to scholars of 2014 working in (say) 2044. Jack and his colleagues work directly from the so-called Twitter “firehose”; they harvest every tweet coming from the Twitter API, and (on their own hardware) process each tweet and discard those that are not geo-coded to within the study area. This kind of work involves considerable local computing firepower, and (more importantly) is concerned with the now. It creates data in real time to answer questions of the very recent past.
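The filtering step described above can be sketched roughly as follows. This is an illustration only, not Jack's pipeline: the `Tweet` class, the bounding box `STUDY_AREA` and the sample records are hypothetical, and no real Twitter API call is made.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class Tweet:
    text: str
    # (latitude, longitude), or None when the device attached no geo-reference.
    coords: Optional[Tuple[float, float]]


# Hypothetical study area: a rough bounding box (south, west, north, east).
STUDY_AREA = (49.9, -8.2, 60.9, 1.8)


def in_study_area(tweet: Tweet, box=STUDY_AREA) -> bool:
    """True only for geo-coded tweets whose coordinates fall inside the box."""
    if tweet.coords is None:
        return False
    lat, lon = tweet.coords
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


def keep_geocoded(stream: Iterable[Tweet], box=STUDY_AREA) -> Iterator[Tweet]:
    """Lazily discard tweets that are not geo-coded to within the study area."""
    return (t for t in stream if in_study_area(t, box))


stream = [
    Tweet("hello from London", (51.5, -0.1)),    # inside the box: kept
    Tweet("no location attached", None),         # not geo-coded: discarded
    Tweet("hello from New York", (40.7, -74.0)), # outside the box: discarded
]
kept = [t.text for t in keep_geocoded(stream)]
```

Because the filter is a generator, it can sit on a live stream and keep only what the study needs, which is also why the resulting dataset is bespoke: everything discarded at this point is gone unless someone else archives it.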

Researchers working in 2044 and interested in 2014 may well be able to re-use this particular bespoke dataset (assuming it is preserved – a different matter of research data management, for another post sometime). However, they may equally well want to ask completely different questions, and so need data prepared in a quite different way. Right now, the future of the vast ocean of past tweets is not certain; and so it is not clear whether the scholar of 2044 will be able to create their own bespoke subset of data from the archive. The Library of Congress, to be sure, is receiving an archive of data from Twitter; but the access arrangements for this data are not clear, and (at present) there is no access at all. So, in the same way that historians need to take some ownership of the future of the archived web, we need to become much more concerned about the future of social media: the primary sources that our graduate students, and their graduate students in turn, will need to work with two generations down the line.

Certainly, historians have always been used to working around and across the gaps in the historical record; it’s part of the basic skillset, to deal with the fragmentary survival of the record. But there is right now a moment in which major strategic decisions are to be made about that survival, and historians need to make themselves heard.

This post was written by Peter Webster, who can also be found on his own blog, Webstory.


Tuesday 10 March – Lost Visions: retrieving the visual element of printed books

The IHR Seminar in Digital History would like to welcome you to its second seminar of 2015.

Presenters:  Julia Thomas, Nicky Lloyd and Ian Harvey (Cardiff)

Title:  Lost Visions: retrieving the visual element of printed books

Date:  10 March 2015

Time:  5:15 PM (GMT)

Venue:  John S Cohen Room 203, 2nd floor, IHR, North block, Senate House or live online via the Digital History Seminar blog.


Live Stream

The live stream for this session did not work properly. Please check back for the edited version of the video in the postscript section of the blog. Thank you.


Slide Show

Abstract: Despite the mass digitization of books, illustrations have remained more or less invisible. As an aesthetic form, illustration is conventionally positioned at the bottom of a hierarchy that places painting and sculpture at the top. The hybridity or bimediality of illustration is also problematic, the genre having fallen between the cracks of literary studies and art history. In a digital context, illustration has fared no better: new technologies can aid the editing of a literary text far more successfully than they can deal with the images that accompany it.

This paper focuses on the challenges and the implications of an AHRC-funded Big Data project that will make searchable online over a million book illustrations from the British Library’s collections. The images span the late eighteenth to the early twentieth century, cover a variety of reproductive techniques (including etching, wood engraving, lithography and photography), and are taken from around 68,000 works of literature, history, geography and philosophy.

The paper identifies issues relating to the improvement of bibliographic metadata and the analysis of the iconographic features of the images, which impact on our understanding of ‘the image’ in Digital Humanities and the negotiation of Big Data more generally. The work undertaken as part of the Lost Visions project allows for the further development of Illustration Studies, repositioning visual culture in the largely text-based process of digitisation and problematising modes of textual production.


Tuesday 24 February – Tracking the Emergence of New Words across Time and Space

The IHR Seminar in Digital History would like to welcome you to its first seminar of 2015.

Presenters:  Jack Grieve (Aston)

Title:  Tracking the Emergence of New Words across Time and Space

Date:  24 February 2015

Time:  5:15 PM (GMT)

Venue:  John S Cohen Room 203, 2nd floor, IHR, North block, Senate House or live online via the Digital History Seminar blog.

Live Stream

Download Slide Show here

Abstract: Very little is known about how new words spread in language. New words are regularly identified by lexicographers, linguists, and the news media, but until recently we have not had access to sufficiently large geo-coded and time-stamped datasets that would allow for the detailed analysis of the geographical diffusion of lexical items in real time. However, with the rise of social media and smart phones, it is now possible to compile very large corpora that meet these requirements, allowing for new words to be identified and mapped across time and space for the first time. In this presentation, I identify numerous newly emerging words based on a multi-billion word corpus of American tweets from 2013-2014 and map their geographical spread across the United States.


Citizen history and its discontents: Postscript

By Matt Phillpott

There is an increasing number of crowdsourcing projects making claims about being ‘citizen history’. Old Weather, one of the more successful crowdsourcing projects of recent years, has started to use the term, and Zooniverse (the company behind it) has taken the same infrastructure this year for a World War One project called Operation War Diary. Then there is the project Children of the Lodz Ghetto, in which volunteers undertake actual research tasks, helping to track down the names and lives of school children who fell victim to the Holocaust. By its nature this research is often complex, as names vary and change, and sources come in a variety of languages.

Citizen history is the current ‘buzz-word’, and its use is a claim to be moving beyond crowdsourcing and offering as well an opportunity to learn and master, collaboratively and co-operatively, the skills of an historian.

In this third talk of this year’s Digital History seminar, Mia Ridge from the Open University shared her research into crowdsourcing and citizen history projects and asked whether they are really helping people to become historians or if they are, in actuality, overstating their contribution. As Mia herself put it, ‘can citizen history projects succeed without communities of experts and peers to nurture sparks of historical curiosity and support novice historians in learning the skills of the discipline?’

The role of the ‘expert’?

Mia was very careful to stress that emphasising the importance of involving ‘expert’ historians at the beginning of, and throughout, a project is not to suggest that the grassroots communities these projects hope to build cannot, and do not, manage to deal with complex historical data and interpretation on their own.

When citizen history projects work well, the forums, wikis and other online spaces become a hive of co-operative discussion and collaborative learning and training. However, these communities are built upon learning about sources and their interpretation in a collaborative environment, and there are times when professional historians can offer advice where the sources are difficult or no other answer is forthcoming, or can pick up and highlight details uncovered that are of wider historical significance. Generally, people who take to citizen history projects are there to discover the past, and learn how to use the sources, and the input of professional historians is valued as part of that process.

Often, however, the role of the professional or ‘expert’ historian is largely hidden away. Mia noted that professional historians often take an active role in the forums near the beginning of a project to help to get things started, but later on, whilst they continue to check the forums, their input reduces as teaching, research, and funding applications, by necessity, take precedence. Ideally this shouldn’t happen, but there are very real obstacles that limit the time and effort professional historians can give to citizen history projects. How we overcome this difficulty is not an easy question to answer.

What makes citizen history a success?

For a citizen history project to become successful not just in developing a resource of research materials through crowdsourcing, but also in enabling the development of historians, it is essential to build a critical mass of discussion and usage, and to expose people to historical materials that are potentially interesting. It is also important to include expert input, as this can transform the process.

Essentially some citizen history projects are really crowdsourcing and are perhaps misusing the term, whilst others fail to reach their goals for one reason or another. Others are highly successful. Yet there is a risk in these projects that citizen historians will become seen as faux historians, with limited skills and abilities, where in reality there are a variety of levels of citizen historians ranging from those just beginning the process to those who have built up the skills and knowledge required of any other historian.

Mia ended her talk with a call for crowdsourcing and citizen history project organisers to be more careful with the terminology they use. Signing up to a project and doing a bit of transcription work does not make that person a historian, but this can become the end result. Projects need to be clear about what it is they are offering and asking, and what exactly is required to become a citizen historian rather than, perhaps, a citizen transcriber.


Digital Humanities Project, ‘Mapping Eighteenth-Century Tourism in the English Lakes’

On Wednesday 26 November 2014, the Digital History seminar is co-hosting a seminar with the British History in the Long-Eighteenth Century seminar. Here are the details:

Title: Mapping Eighteenth-Century Tourism in the English Lakes

Speakers: Ian Gregory and Chris Donaldson (Lancaster)

Location: Wolfson Room NB01, Basement IHR, North Block, Senate House

Time: Wednesday 26 November 2014, 5.15pm


Tuesday 18 November – Citizen History and its discontents

The IHR Seminar in Digital History would like to welcome you to its third seminar of the 2014 autumn term.

Presenters:  Mia Ridge (Open University)

Title:  Citizen History and its discontents

Date:  18 November, 2014

Time:  5:15 PM (GMT)

Venue:  John S Cohen Room 203, 2nd floor, IHR, North block, Senate House or live online via the Digital History Seminar blog.

Live Stream

Slide Show

Abstract: An increasing number of crowdsourcing projects are making claims about ‘citizen history’ – but are they really helping people become historians, or are they overstating their contribution? Can citizen history projects succeed without communities of experts and peers to nurture sparks of historical curiosity and support novice historians in learning the skills of the discipline? Through a series of case studies this paper offers a critical examination of claims around citizen history.


Interrogating the Archived UK Web – postscript

By Adam Crymble

The second talk of our 2014 Autumn programme took on the challenge of a new type of source for historians: the Internet. Not online sources and databases, but the Internet itself. The first archived copies of the UK web have started to find their way into scholarly hands. Historians now have the ability to look at webpages as sources in themselves, just as we have previously read manuscripts as a window into the past. The web is a corpus rich in details about what we were like and what we thought was important, not that long ago. For a cultural or social historian, it’s a dream.

Peter Webster introduced the UK Web Archive, which is hosted by the British Library, and contains snapshots of the UK web (.uk sites) dating back to the 1990s. A team of historians has been given access to see what they can make of this new (and huge) resource. I want to emphasise the experimental aspect of this project, because in many respects I think we learned more about what these scholars couldn’t achieve than what they did achieve.


That’s not a failing in the quality of the scholars themselves. They managed to do exactly what we could hope from them: to test the limits of the historian’s method on a large, messy, digital archive. They’ve done us a great service in finding some of those limits. The question now ahead of us is what we’re going to do about it.


Two of the scholars were on hand to share their experiences. The first was Gareth Millward, whose project explored hyperlinking behaviour towards the website of the Royal National Institute of the Blind (RNIB) in those early days of the web, and tried to uncover why people were creating those hyperlinks.

The second was Richard Deswarte, who used the archive to explore manifestations of Europhobia online, looking particularly for indicators that people in Britain were using the web to express dissatisfaction with the country’s continued role in the EU.


The projects themselves took on interesting questions, which were appropriate, given the type of source. Most interesting for me – and a significant part of both presentations – was the discussion of where they had problems using the corpus. Both scholars complained of noise that made it difficult to identify unique or meaningful mentions. In Millward’s case the noise came in the form of an advertisement in the Guardian for a talking watch that was endorsed by the RNIB. The ad appeared on hundreds of pages, though it really only represented a single match for Millward’s purposes. Deswarte too had trouble with a rotating banner on a newspaper website that dramatically overemphasized the number of meaningful links to an article about Europhobia.
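One common way to tame this kind of noise, offered here as a sketch and not as what either scholar did, is to group hits whose matching text is identical after normalisation, so that an advertisement repeated across hundreds of pages counts as one match. The helper names, URLs and snippets below are invented for illustration.

```python
import hashlib
import re


def fingerprint(snippet):
    """Hash a snippet after collapsing whitespace and lower-casing it."""
    normalised = re.sub(r"\s+", " ", snippet).strip().lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def collapse_repeats(hits):
    """Group (url, snippet) pairs that share the same normalised snippet."""
    groups = {}
    for url, snippet in hits:
        key = fingerprint(snippet)
        group = groups.setdefault(key, {"snippet": snippet, "urls": []})
        group["urls"].append(url)
    return list(groups.values())


# Invented toy hits: the first two are the same advert on different pages.
hits = [
    ("http://example.org/news/1", "Talking watch, endorsed by the RNIB"),
    ("http://example.org/news/2", "Talking watch,  endorsed by the RNIB "),
    ("http://example.org/about", "The RNIB publishes accessibility guidance"),
]
unique = collapse_repeats(hits)
print(len(hits), "hits collapse to", len(unique), "distinct matches")
```

Exact-match grouping only catches verbatim repeats; a rotating banner whose wording changes slightly would need a fuzzier comparison, which this sketch does not attempt.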

Both also noted the sheer number of hits they were getting, and Millward in particular emphasized his attempts to get the list down to a size where he could conduct a close reading. He failed to do so, and is still left with a collection of 39,000 hits. However, both he and Deswarte reflected on that failure, and invoked the language of social scientists and their ideas about representative sampling, which they felt would be appropriate if they were given the opportunity to tackle this challenge again. That reflection is significant, because it shows both Millward and Deswarte recognized the limits of the historian’s skillset for a project such as this.
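As an illustration of the sampling idea (not a method either speaker described in detail), a simple random sample of a hit list can be drawn reproducibly in a few lines of Python. The sample size of 400 and the seed are arbitrary assumptions, and the hit identifiers are just the numbers 0 to 38,999.

```python
import random


def sample_hits(hit_ids, k, seed=2014):
    """Draw k hits without replacement; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(hit_ids), k))


hit_ids = range(39_000)          # stand-in for the 39,000 search hits
subset = sample_hits(hit_ids, k=400)
print(f"reading {len(subset)} of {len(hit_ids)} hits")
```

Whether such a sample is representative depends on the question being asked: if hits cluster by year or by site, sampling within each group (stratified sampling) may be a better fit than a single random draw.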

However, I think we can push those limits further. The very notion that we would do a close reading of the Internet is one that I think only historians would suggest. It shows how deeply the value of close reading is held in the profession, even if it proves entirely inappropriate. We need to move on from that belief: that you can only know something if you’ve read it carefully. If we hold on to this mentality we’re going to lose our chance to discover anything at scale. We’ll be unable to pursue the longue durée that Guldi advocated for in our previous seminar.

Sitting in the audience I couldn’t help but think that the solution wasn’t in sampling and close reading. It was in corpus linguistics, data manipulation, clustering algorithms, and distant reading: skills that are so rarely taught in our history programmes, but that this experiment made clear need to become part of our disciplinary toolkit. And if not our toolkit, then we need to ingrain the value of collaboration. If you can’t do it, find someone who can and who wants to work with you.

The days of the lone scholar intent on close reading are numbered. The UK Web Archive has shown us that. So what are we going to do about it?

Adam Crymble is a convenor of the Digital History seminar at the IHR and a lecturer in digital history at the University of Hertfordshire. The UK Web Archive is available to search now. In addition there are a variety of related research projects such as the Big UK Domain Data for the Arts and Humanities (BUDDAH) Project. Analysis of the sustainability of the dataset can be found on the website for the Analytical Access to the Domain Dark Archive (AADDA), and examination of the potential value of the UK Web Domain dataset can be found on the Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research website.


Tuesday 4 November – Interrogating the archived UK web: Historians and Social Scientists Research Experiences

The IHR Seminar in Digital History would like to welcome you to its second seminar of the 2014 autumn term.

Presenters:  Dr Gareth Millward (London School of Hygiene and Tropical Medicine), Dr Peter Webster (British Library Web Archiving Team), & Richard Deswarte (UEA).

Title:  ‘Interrogating the archived UK web: Historians and Social Scientists Research Experiences’

Date:  4 November, 2014

Time:  5:15 PM (GMT)

Venue:  John S Cohen Room 203, 2nd floor, IHR, North block, Senate House or live online via the Digital History Seminar blog.

Live Stream: 

Slides: Peter Webster, Richard Deswarte, Gareth Millward

Abstract:  The emergence of the WWW has been one of the most profound and influential phenomena of the last twenty years.  One of the dominant features of the WWW is its changing nature both in terms of content and its technological underpinnings.  The content of the WWW is an immense resource full of potential for academic researchers both in its current state and in its previous constantly changing forms.  Over the last decade, in particular, archives of WWW materials have been emerging.  These archives are still very much in a nascent form but are beginning to be made available and to be utilised by a range of scholars.  The UK Web Archive hosted by the British Library is at the forefront of trawling and making available for researchers archived versions of the UK WWW dating back to the 1990s.  It is currently engaged jointly with the Institute of Historical Research (IHR) and the Oxford Internet Institute (OII) in the ‘Big UK Domain Data for the Arts and Humanities Project’ (BUDDAH) where a new research interface is being developed in conjunction with a number of humanities scholars who are at the same time exploring the UK Web Archive to identify its strengths and weaknesses for academic research.  Peter Webster will introduce Web Archiving, the BUDDAH project and the new research interface, while Gareth Millward and Richard Deswarte will relate their experiences in using the resource to research respectively the history of disabled people and accessibility on the WWW, and Euroscepticism.


Dr Gareth Millward is currently a Research Fellow at the Centre for History in Public Health at the London School of Hygiene and Tropical Medicine.  He has research interests in disability and government policy, and more recently notions of the ‘public’ in British vaccination programmes.  For the BUDDAH project he is researching disabled people and the Web.

Richard Deswarte is a Lecturer in Modern European History at UEA with research interests in the European idea and integration, as well as Digital Humanities.  On the BUDDAH project he is examining the presence and rise of Euroscepticism.

Dr Peter Webster is currently the British Library lead on the BUDDAH project and Web Archiving Engagement and Liaison Officer at the BL.  Alongside scholarly interests in Web Archiving and Digital Humanities, Peter researches the history of religion, the Anglican Church and the relation between church, law and state in 19th and 20th century Britain.


Introducing Paper Machines – postscript

In the welcome surroundings of the refurbished Institute of Historical Research, Jo Guldi (Brown University) kicked off the 2014 Autumn Term programme of the IHR Digital History Seminar. In town to discuss The History Manifesto, her new open access book co-authored with David Armitage, Guldi gave a talk that ranged from the public role of historians, the Digital Humanities and new models of publishing to impending environmental catastrophe, the need for deep history and data processing tools that can help citizens and scholars alike overcome the problems of modern bureaucracy. To see how Guldi wove all these threads together, you’ll need to watch the video below. Here I just want to tease out, in no particular order, a few of the threads that stuck in my mind, threads that pertain to most, if not all, digital history projects that pass through the seminar.

Tools as provocations: Paper Machines is a research tool. But it is also a provocation, an experiment with using large swathes of information to inform historical research in the longue durée, a vantage point that, the tool’s makers argue, historians do not take often enough. The tool, in short, is the argument.

What we need now: As we sit on the precipice of environmental catastrophe, does it not behove us to think about what digital projects we need? Do we want digital projects that analyse art for art’s sake, that recapitulate old research paradigms and do not address problems of a wider, public relevance?

Hypothesis generation: At the heart of Paper Machines is hypothesis generation. It allows the scholar to take a vast paper archive and facet that archive, make visualisations, and select where to read closely. How that macro to micro scaling changes the history that is written, and how scholarly debates mature to integrate the inevitable discrepancies between interpretations made at these scales, are the challenges historians must re-engage with.

Being bold about method: Works that change the focus of disciplines usually open their accounts by stating ‘you missed this because your method was wrong’. Digital history can and should do the same: it can and should be bold about how it comes to the conclusions it does, rather than hide the methods, ways, and means that underpin its particular take on historical phenomena.

My partial, incomplete, CC BY notes on the seminar are available on GitHub Gist.

The next Digital History seminar, ‘Interrogating the archived UK web: Historians and Social Scientists Research Experiences’, will take place on 4 November and a full listing of Autumn Term seminars is available on the IHR Website.

James Baker (Curator, Digital Research, British Library)

This post is licensed under a Creative Commons Attribution 4.0 International License.
