Historians are becoming computer science customers – postscript

By Adam Crymble

Historians are becoming the clients of computer science departments. It’s a quiet conspiracy to steal our research funding and crowd out employment opportunities for our junior historians, whose skills just don’t fit the needs of these projects. In our final seminar of the 2014/15 season, we heard from two interdisciplinary projects that included a mixture of historians, archivists, computer scientists, and natural language processing experts. If you want to find out about the projects in detail, check out their websites, or watch the talks online:

  • ChartEx: Charter Excavator
    • [Columbia University, University of Brighton, University of Leiden, University of Toronto, University of York, University of Washington]
  • Traces Through Time
    • [The National Archives, Institute of Historical Research, University of Brighton, University of Leiden]

Both projects depended on natural language processing – the rapidly improving ability of computers to structure and break down digitised texts in ways that are not entirely unlike how humans perceive their own language: full of types of content with various meanings that fit together in flexible but not entirely random patterns, rather than an unrelated series of letters, numbers, and punctuation. It’s safe to say that neither project could have gone ahead without natural language processing. I’m not sure we can say the same of the historians. Without them the results might have looked different (probably even worse), but the more important point is that without the historians the projects wouldn’t have happened at all, because nobody would have proposed them. In this case, what we see is effectively historians (or archivists) hiring computer scientists to solve their problems. Or to put that another way: both of these projects were computer science projects with applications for historians.

Don’t get me wrong, the results are fantastic, and I think both of these projects are great as are all the team members involved. I also think that both historical studies and computer sciences are benefitting from these collaborations, and I hope we see more of them. This post is not a reaction to what I saw yesterday at the seminar or anyone involved in it, but is instead a reflection on some of the issues that hearing about these types of projects raised for me as a member of the audience.

I am concerned about the power dynamic this model is creating in the sector, particularly because of who is picking up the tab. At the moment, despite the fact that these are computer science projects, these types of projects are being initiated by historians – they have the problems that need solving. That means historians are finding the partners, pitching the challenges in interesting ways, and then turning to the funding bodies that they know best: in the UK that means the Arts and Humanities Research Council (AHRC) or the Economic and Social Research Council (ESRC). It would be utterly unfathomable for a historian to apply for a grant to the natural home of computer scientists, the Engineering and Physical Sciences Research Council (EPSRC), but that’s exactly where we need to go.

The fact that these computer science projects are of benefit to historians is completely irrelevant. If an engineer’s research has direct relevance to medicine (such as research into the intervertebral disc in the human spine with the aim to ‘inform implant design’ – a clear medical application), it’s still funded by the EPSRC rather than the Medical Research Council. That’s because it’s an engineering project that happens to have applications for other fields. But within the digital humanities, it’s almost exclusively the AHRC and ESRC that are footing these bills and supporting this work.

That’s a problem because, according to a report by the Department for Business, Innovation & Skills (BIS) in the UK, the AHRC and ESRC are the poor siblings of higher education funding, accounting for a combined 9.4% of the pot – less than any other research council gets on its own. That in itself is a powerful statement about the government’s belief in the value of arts, humanities, and social sciences. But the problem becomes even worse if that meagre budget must filter its way into the pockets of computer science departments instead of into the hands of our next generation of talented humanities scholars.

Table 1: ‘The Allocations’, The Allocation of Science and Research Funding 2015/16, Department for Business, Innovation & Skills, (May 2014).

Research Council    Funding £M (2015-16)    % of total
AHRC                 98.3                     3.7
ESRC                153.2                     5.7
EPSRC               793.5                    29.8
BBSRC               351.2                    13.2
MRC                 580.3                    21.8
NERC                289.0                    10.8
STFC                400.0                    15.0

What we’re seeing, then, is historians paying for the upkeep and development of future computer scientists, while their own graduates scramble for a handful of poverty-waged Junior Research Fellowships at Oxbridge, or take on equally meagre wages as hourly paid lecturers while they finish the ‘book’ that they’ve been promised will be the solution to all of their employment problems (it won’t be, by the way. Don’t believe that). That’s not to say junior scholars in the sciences don’t struggle to get by, but in my experience (through a spouse in academic engineering), students in engineering cannot fathom the idea of doing work for free (writing a book), because they have been raised to believe that their work has monetary value, and the relative abundance of funding used to build teams of scholars under the mentorship of a senior scholar supports that belief. In contrast, humanities scholars are taught that their work must be a labour of love, and that any money that comes their way is incidental to their progress as a scholar.

That has in part led us towards this client-supplier model, in which historians are forced to hire computer scientists (who have obvious value and costs) to solve their problems, while history graduates (who have no obvious value) cannot be fitted into the grant because there isn’t really all that much they’re needed for on these computer science initiatives. What we need to see is a shift away from the client-supplier model and towards one that is mutually supported, in which the EPSRC or the commercial tech sector supports the computer science work, and the AHRC or ESRC supports the humanities contribution to those initiatives. That requires a rethink of the boundaries between our research councils, but it also means that the junior historians we want to hire need to make sure they’ve got skills that make them employable on these types of projects. Sure, you’ve got a PhD. But what can you do on my data mining project?

Adam Crymble is a convenor of the Digital History seminar at the IHR and a lecturer of digital history at the University of Hertfordshire.

Posted in Postscript | Leave a comment

23 June 2015 – Exploring Big and Small Historical Datasets: reflections on two recent projects

Please join us for our last session of the year. This is a joint session with the Archives and Society seminar.
Presenter(s): Sarah Rees Jones, Helen Petrie, Sonia Ranade, Emma Bayne, and Roger Evans
Date:  23 June 2015
Time:  5:15 PM (GMT)
Venue:  John S Cohen Room 203, 2nd floor, IHR, North block, Senate House or live online via the Digital History Seminar blog.
Live Stream

Slide Shows

http://www.slideshare.net/historyspot/petrie-ihr-presentation

Abstract: Researchers from two recently funded projects, ChartEx (Digging into Data Challenge, 2012-14) and Traces Through Time (AHRC, 2014-15), reflect on the development of new tools for historians working with digital data, employing analytical solutions from Natural Language Processing, Data Mining and Human-Computer Interaction.
Part 1: Sarah Rees Jones and Helen Petrie: ‘Chartex overview and next steps’ (20 minutes)
Part 2: Sonia Ranade and Emma Bayne: ‘Traces Through Time overview and next steps’ (20 minutes)
Part 3: Roger Evans: ‘NLP and Data Mining: From Chartex to Traces Through Time and beyond’ (10 minutes)
Posted in Events | Leave a comment

Digital History and being afraid of being insufficiently digital

The A Big Data History of Music project uses metadata about sheet music publication to explore music history. The data the project uses comes from MARC records converted into tabular form with MARCedit. Inconsistencies in the data – inevitable with catalogue records created by people over long periods of time – were resolved with OpenRefine, the data ported back into tabular form (and, for the intrepid, into RDF/XML), and graphs built (for the most part) in Excel: graphs that show steep declines in score publication in Venice at times of plague (1576/7), and steep rises – smoothed against overall publication trends – in scores whose titles reference Scotland during the 1790s-1810s peak of the English invention of ‘Scottish’ identity. The use of bibliographic data in the Big Data History of Music project confirms existing suspicions, challenges established interpretations, and opens up fresh lines of historical enquiry. It is a project to be celebrated.
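The kind of inconsistency-resolution that OpenRefine performs can be sketched in a few lines of Python. The snippet below is not the project’s actual workflow – the imprint strings are invented, and the keying function is a simplified version of OpenRefine’s ‘fingerprint’ clustering method – but it shows the basic idea: variant spellings that reduce to the same normalised key are grouped so a cataloguer can reconcile them.

```python
import re
from collections import defaultdict

def fingerprint(value):
    """A simplified version of OpenRefine's 'fingerprint' keying method:
    lowercase, strip punctuation, split into tokens, de-duplicate and sort."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster(values):
    """Group variant spellings that share the same fingerprint."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [vs for vs in groups.values() if len(vs) > 1]

# Hypothetical imprint fields of the kind a catalogue might contain
imprints = ["Venetia", "venetia.", "Venetia,", "In Venetia", "London"]
print(cluster(imprints))  # [['Venetia', 'venetia.', 'Venetia,']]
```

Note that ‘In Venetia’ survives as a separate value: automatic clustering only proposes candidates, and a human still decides whether two variants really refer to the same thing.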

We might say that MARCedit and OpenRefine are hardly the most sophisticated of research tools. Both are tools that manipulate data through the use of Graphical User Interfaces (GUIs), visual interpretations of programmatic functions that a humanist could – given time – construct herself. We might say that tabulated data is hardly the most sophisticated of research data formats. It struggles to express multiple values in a given field (for example, multiple creators of a work) or the hierarchical relationship between fields (for example, a creator of a work and an editor of a work). And we might say that Excel is hardly the most sophisticated of research environments. Built around graphical input, it encourages a range of practices that are not machine readable (you can’t do a ctrl+f for bold text or for cells filled in yellow), suffers over time from compatibility issues that make visualisations from data tricky to reproduce, and struggles to handle massive datasets.
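The multi-value problem is easy to see with a toy record (the names below are invented): a flat table forces multiple creators into one delimited cell, which must then be re-parsed by anyone who wants to use it, whereas a nested structure keeps each agent and their role distinct and queryable.

```python
import csv, io

# Flat, spreadsheet-style representation: multiple creators crammed
# into one cell with an ad hoc delimiter.
flat = 'title,creators\n"A Collection of Airs","Smith, J.; Jones, T. (ed.)"\n'
row = next(csv.DictReader(io.StringIO(flat)))
print(row["creators"])  # one opaque string that must be re-parsed

# Nested representation: each agent keeps its own name and role.
record = {
    "title": "A Collection of Airs",
    "agents": [
        {"name": "Smith, J.", "role": "composer"},
        {"name": "Jones, T.", "role": "editor"},
    ],
}
editors = [a["name"] for a in record["agents"] if a["role"] == "editor"]
print(editors)  # ['Jones, T.']
```

Neither form is wrong; the point is that the flat form trades expressiveness for the convenience of tools like Excel.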

And so we perhaps have a mismatch. As historians, we celebrate findings that could – potentially – change the course of historiographical debate. As digital historians tapped into research software engineering and computational science, we wonder about the suitability, interoperability, and sustainability of the decisions made.

It is easy to get sucked into the latter perspective. And as research projects grow, issues of suitability, interoperability, and sustainability must be thrust front and centre. But as we teased out during the Q&A that followed Stephen Rose’s excellent talk on the A Big Data History of Music project, we must not be afraid of being insufficiently digital. We must not be afraid of using GUI tools that may not be there tomorrow to get the job done, of using data formats that suit our local and community needs to express our findings, and of using software environments that are not the epitome of best practice to interpret our work. For we are historians first and foremost, and for all that projects such as The Programming Historian (for which, I should add, I have written) do wonderful work getting historians from GUI to Python, from .xls to .xml, and from Excel to R, the latter must not impose themselves on our work at the expense of gaining a deeper understanding of historical phenomena.

The Big Data History of Music project is a shining example of why digital history must not be afraid of being insufficiently digital. We look forward to seeing more projects pass through the Digital History seminar in coming months that embrace this spirit of getting stuff done, of making digital tools, data, and methods work towards enhancing our collective understanding of the past rather than the other way round.

James Baker

Posted in Events | Tagged , , , , , , | Leave a comment

Seeking Postgraduate Convenor for 2015/16

'Turing Bombe' by Tris Linnell

The Digital History Seminar at the Institute of Historical Research in London (IHR) is seeking applications from postgraduate students to join as a Postgraduate Seminar Convenor for the 2015/16 academic year. The role is well suited to an individual who is interested in digital history (broadly construed) and who is looking to build their professional network and skills portfolio. The successful applicant will be directly involved with running and planning future seminars, and will be an integral part of the project team. Seminars are held approximately 8 times per year during term time, on Tuesdays from 5:15-7:15pm at the IHR in London. The seminar normally moves to a local pub and later to a restaurant where there are additional opportunities to network and discuss ideas.

Interested applicants should send a one-page CV and cover letter to Adam Crymble (adam.crymble@gmail.com) by 26 June 2015. Please let your students or contacts know about this opportunity; Adam is happy to respond to any queries. This is a non-stipendiary volunteer academic service position, and there are no mandatory costs associated with the role.

Adam Crymble

Convenor, IHR Digital History Seminar

Lecturer of Digital History,

University of Hertfordshire

adam.crymble@gmail.com

Posted in Uncategorized | Leave a comment

2015/16 Call for Papers

Are you a historian trying out some digital methods, tools or resources as a means of exploring historical phenomena? Do you have a work in progress project? Are you seeking a friendly, critical environment in which to share your preliminary findings, successes, and failures?

The Institute of Historical Research Digital History Seminar brings together a range of historians to discuss and debate cutting edge historical research that incorporates digital resources and methods. We aim not to drive the agenda, but for the agenda to be driven by that discussion and debate. So although the seminar has a great 2015/16 programme (details coming soon!) we are still looking for papers from historians at any stage of their career, including those visiting the UK in the next 12 months, to make it even better.

So if you are interested in giving a paper, please email a description (circa two to three sentences in length) of your proposed paper to James Baker.

Posted in Uncategorized | Leave a comment

Tuesday 9 June 2015 – Writing a Big Data History of Music

Please join us for our next seminar.

Presenter:  Stephen Rose (RHUL)

Title:  Writing a Big Data History of Music

Date:  9 June 2015

Time:  5:15 PM (GMT)

Venue:  John S Cohen Room 203, 2nd floor, IHR, North block, Senate House or live online via the Digital History Seminar blog.

Live Stream

Abstract: This seminar introduces the project A Big Data History of Music, which aimed to unlock the musical-bibliographical data held by libraries in order to create new research opportunities. The project cleaned and enhanced aspects of the British Library catalogues of printed and manuscript music, which are now available as open data from www.bl.uk/bibliographic/download.html. Analyses and visualisations of these datasets exposed previously uncharted patterns in the history of music, for instance involving the rise and fall of music printing in 16th- and 17th-century Europe, or the rise of nationalist colourings in music of the late 18th and early 19th centuries. The detection of these long-term trends permits new ways of linking music history to wider histories of culture, economics, society and politics.

Seminars are normally streamed live online on this blog and on YouTube. To keep in touch, follow us on Twitter (@IHRDigHist) or at the hashtag #dhist.

Posted in Events | Leave a comment

Tuesday 26 May – Virtual Rome: a digital reconstruction of the ancient city

Presenter:  Matthew Nicholls (Reading)

Title:  Virtual Rome: a digital reconstruction of the ancient city

Date:  26 May 2015

Time:  5:15 PM (GMT)

Venue:  John S Cohen Room 203, 2nd floor, IHR, North block, Senate House or live online via the Digital History Seminar blog.

Live Stream

Due to a fire alarm part way through the seminar, the live stream of this event was split into two videos. These have now been merged and will be displayed here until the final edited version of the video is available in a few weeks’ time.

Abstract: 

Dr Matthew Nicholls of the Department of Classics at the University of Reading has made a detailed digital reconstruction of the city of Rome as it appeared c. AD 315. In this talk he will introduce the model and discuss some of the tools and methodology involved in its creation, including questions about date, level of detail, and conjecture. He will then talk about the pedagogical uses of digital modelling and the digital Rome model’s potential as a research tool: current work includes investigation of illumination at specific times of day and year, and of sightlines within the ancient city to, from, and between major monuments.

Profile:

Matthew Nicholls read Literae Humaniores at St John’s College, Oxford, and was a Junior Research Fellow at The Queen’s College, before taking up a lectureship in Classics at Reading, where his work includes running an MA in the City of Rome. His research includes the study of ancient books and libraries, including a newly discovered text by the 2nd-century AD medical writer Galen. He is also interested in the digital reconstruction of ancient buildings and places, initially for teaching and outreach work and increasingly for research. His work in this area won the 2014 Guardian/Higher Education Academy national Teaching Excellence award, and he currently holds a British Academy Rising Star Engagement Award for work on digital visualisation in the humanities. As part of this scheme he will be running an introductory workshop on software skills for digital visualisation and welcomes enquiries about participation.

Seminars are normally streamed live online on this blog and on YouTube. To keep in touch, follow us on Twitter (@IHRDigHist) or at the hashtag #dhist.

Posted in Events | Leave a comment

Tuesday 24 March – Text Mining the History of Medicine

Please join us for our next seminar.

Presenter:  Sophia Ananiadou (University of Manchester)

Title:  Text Mining the History of Medicine

Date:  24 March 2015

Time:  5:15 PM (GMT)

Venue:  John S Cohen Room 203, 2nd floor, IHR, North block, Senate House or live online via the Digital History Seminar blog.

Live Stream

Slide Show 

Abstract: I will present the results of a collaborative and interdisciplinary project between the National Centre for Text Mining (NaCTeM) and the Centre for the History of Science, Technology and Medicine (CHSTM) at the University of Manchester, demonstrating the capabilities of innovative text mining tools to allow the automatic extraction of information from two historical archives: the British Medical Journal (BMJ) (1840 – present) and the London-area Medical Officer of Health (MOH) reports (1848-1972). NaCTeM’s text mining tools have automatically enriched these historical archives with semantic metadata by extracting terms, named entities and events. The project also developed a semantic search system focused on understanding historical changes in lung diseases since 1840.

Seminars are normally streamed live online on this blog and on YouTube. To keep in touch, follow us on Twitter (@IHRDigHist) or at the hashtag #dhist.

Posted in Events | Leave a comment

Will historians of the future be able to study Twitter?

Over the last year or so, our seminar has become increasingly web-focussed. Last week we had an excellent paper from Jack Grieve of Aston University on the tracking of newly emerging words as they appeared in large corpora of tweets from the UK and the US. By amassing very large tweet datasets, he and his colleagues are able to observe the early traces of newly emerging words, and also (when those tweets were submitted from devices which attach geo-references) to see where those new words first appear, and how they spread. Jack and his colleagues are finding that words quite often emerge first (in the US) in the east and south-east (or California) and then spread towards the centre of the continent. They don’t necessarily spread in even waves across space, or jump between urban centres before reaching rural areas (as would have been my uneducated guess). Read more at the project site, treets.net, or watch the paper.

This kind of approach is quite impossible without the very large-scale natural language data that social media afford. This is particularly so as most words are (perhaps counter-intuitively) rather rare: in the corpus in question, the majority of the 67,000 most common words appear only once in 25 million words. Given this, datasets of billions of tweets are the minimum size necessary to be able to see the patterns.
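This long tail is easy to demonstrate on any text: count the words in even a tiny corpus and most distinct words turn out to occur exactly once (so-called hapax legomena). The corpus below is invented purely for illustration, not drawn from the seminar’s data.

```python
from collections import Counter

# A toy corpus (invented) to illustrate the long tail of word frequencies:
# even in large collections, most distinct words occur only once.
text = ("the cat sat on the mat the dog saw the cat "
        "serendipity defenestration perspicacious")
counts = Counter(text.split())

hapaxes = [w for w, n in counts.items() if n == 1]
print(f"{len(counts)} distinct words, {len(hapaxes)} appear only once")
# 10 distinct words, 8 appear only once
```

Scale the same arithmetic up to tens of millions of words and the need for billions of tweets follows: to see a rare word appear more than a handful of times, the corpus has to be enormous.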

It was interesting to me as a convenor to see the rather different spread of people who came to this paper, as opposed to the more usual digital history work the seminar showcases. Jack focussed on tweets posted since 2013, a time span that even the most contemporary historian would struggle to call their own, and so not so many of them came along – but we had perhaps our first mathematician instead. This was a shame, as Jack’s paper was a fascinating glimpse into the way that historical linguistics, and indeed other types of historical enquiry, might look in a couple of decades’ time.

But there is a caveat to this, which was beyond the scope of Jack’s paper, to do with the means by which this data will be accessible to scholars of 2014 working in (say) 2044. Jack and his colleagues work directly from the so-called Twitter “firehose”; they harvest every tweet coming from the Twitter API, and (on their own hardware) process each tweet and discard those that are not geo-coded to within the study area. This kind of work involves considerable local computing firepower, and (more importantly) is concerned with the now. It creates data in real time to answer questions of the very recent past.
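The filtering step described above can be sketched in a few lines. This is not the project’s actual code: the bounding box and the sample tweets are invented, and while the field layout follows the general shape of Twitter’s tweet JSON (a GeoJSON-style `coordinates` object with longitude before latitude), treat the details as assumptions.

```python
# A minimal sketch of filtering a tweet stream down to geo-coded tweets
# inside a (hypothetical) study-area bounding box.

UK_BBOX = (-8.6, 49.9, 1.8, 60.9)  # lon_min, lat_min, lon_max, lat_max

def in_study_area(tweet, bbox=UK_BBOX):
    """Keep only tweets geo-coded to a point inside the bounding box."""
    coords = (tweet.get("coordinates") or {}).get("coordinates")
    if not coords:
        return False  # not geo-coded: discard
    lon, lat = coords
    lon_min, lat_min, lon_max, lat_max = bbox
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max

# Invented sample of a tweet stream
stream = [
    {"text": "wagwan", "coordinates": {"coordinates": [-0.13, 51.5]}},
    {"text": "hello", "coordinates": None},
    {"text": "howdy", "coordinates": {"coordinates": [-97.7, 30.3]}},
]
kept = [t for t in stream if in_study_area(t)]
print([t["text"] for t in kept])  # ['wagwan']
```

Note how much is thrown away: anything without coordinates, and anything outside the study area, which is precisely why this kind of work has to happen in real time against the firehose rather than retrospectively.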

Researchers working in 2044 and interested in 2014 may well be able to re-use this particular bespoke dataset (assuming it is preserved – a different matter of research data management, for another post sometime). However, they may equally well want to ask completely different questions, and so need data prepared in a quite different way. Right now, the future of the vast ocean of past tweets is not certain; and so it is not clear whether the scholar of 2044 will be able to create their own bespoke subset of data from the archive. The Library of Congress, to be sure, are receiving an archive of data from Twitter; but the access arrangements for this data are not clear, and (at present) are zero. So, in the same way that historians need to take some ownership of the future of the archived web, we need to become much more concerned about the future of social media: the primary sources that our graduate students, and their graduate students in turn, will need to work with two generations down the line.

Certainly, historians have always been used to working around and across the gaps in the historical record; it’s part of the basic skillset, to deal with the fragmentary survival of the record. But there is right now a moment in which major strategic decisions are to be made about that survival, and historians need to make themselves heard.

This post was written by Peter Webster who can also be found on his own blog Webstory.

Posted in Postscript | Tagged , , , , | Leave a comment

Tuesday 10 March – Lost Visions: retrieving the visual element of printed books

The IHR Seminar in Digital History would like to welcome you to its second seminar of 2015.

Presenters:  Julia Thomas, Nicky Lloyd and Ian Harvey (Cardiff)

Title:  Lost Visions: retrieving the visual element of printed books

Date:  10 March 2015

Time:  5:15 PM (GMT)

Venue:  John S Cohen Room 203, 2nd floor, IHR, North block, Senate House or live online via the Digital History Seminar blog.

Live Stream

The live stream for this session did not work properly. Please check back for the edited version of the video in the postscript section of the blog. Thank you.

Slide Show

Abstract: Despite the mass digitization of books, illustrations have remained more or less invisible. As an aesthetic form, illustration is conventionally positioned at the bottom of a hierarchy that places painting and sculpture at the top. The hybridity or bimediality of illustration is also problematic, the genre having fallen between the cracks of literary studies and art history. In a digital context, illustration has fared no better: new technologies can aid the editing of a literary text far more successfully than they can deal with the images that accompany it.

This paper focuses on the challenges and the implications of an AHRC-funded Big Data project that will make searchable online over a million book illustrations from the British Library’s collections. The images span the late eighteenth to the early twentieth century, cover a variety of reproductive techniques (including etching, wood engraving, lithography and photography), and are taken from around 68,000 works of literature, history, geography and philosophy.

The paper identifies issues relating to the improvement of bibliographic metadata and the analysis of the iconographic features of the images, which impact on our understanding of ‘the image’ in Digital Humanities and the negotiation of Big Data more generally. The work undertaken as part of the Lost Visions project allows for the further development of Illustration Studies, repositioning visual culture in the largely text-based process of digitisation and problematising modes of textual production.

Seminars are normally streamed live online on this blog and on YouTube. To keep in touch, follow us on Twitter (@IHRDigHist) or at the hashtag #dhist.

Posted in Events | Leave a comment