Tuesday 19 November 2024 – Colin Greenstreet: AI assistants and agents: A New Skill Set for Historians?
This seminar takes place 5:30 pm – 7:00 pm GMT, live on Zoom at https://zoom.us/j/92542420101, and will later be posted to our YouTube channel.
Session chair: James Baker
Abstract: A New Skill Set for Historians explores the potential for intelligent assistants and agents based on large language models to support historical research. The speaker makes the case for historians and archivists acting together to build knowledgeable, effective, and serendipitous assistants and agents within the history domain, and explores two parallel routes to do so. Firstly, augmenting large language models with retrieval-augmented generation (RAG) techniques that draw on specialized, domain-specific vector databases. Secondly, constructing large-scale, licence-clear datasets of historical manuscript and printed knowledge for the fine-tuning of medium-sized large language models.
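The first of these routes can be sketched in a few lines of Python. Everything below is an illustrative assumption rather than MarineLives code: the deposition snippets are invented, and the bag-of-words embedding is a toy stand-in for the learned sentence embeddings and vector database a real system would use.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a production RAG system would use a
    learned sentence-embedding model and a real vector database."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented snippets standing in for an indexed deposition corpus.
corpus = [
    "The master of the ship deposed that the cargo was laden at Livorno.",
    "The mariner swore the vessel was seized by a Dunkirk privateer.",
    "The merchant testified the freight was insured at Amsterdam.",
]
index = [(doc, embed(doc)) for doc in corpus]

def retrieve(query, k=2):
    """Return the k corpus passages most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query):
    """Prepend the retrieved passages as grounding context for a model."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The design point is the one the abstract makes: the language model stays general-purpose, while domain knowledge lives in the retrieval index and can be updated without retraining.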
The speaker looks at two areas of potential impact of large language models on historical research. Firstly, on historians’ interactions with archives and on their personal archival research practices. Secondly, on the types of research questions historians can ask and answer, enabled by much larger, more complex, and more interlinked datasets and metadata, and supported by sophisticated, flexible, and easy-to-use analytical tools.
The speaker reviews the history of technology uptake within historical research practice and asks what needs to be done to encourage the widespread adoption and embedding of techniques enabled by large language models into research practice. As a contribution to the exploration and adoption of such techniques, the speaker is launching a MarineLives-Collaboratory for doctoral students interested in applying large language models in their own research design and research practices. This will provide the opportunity to work on specific historical use cases.
The speaker illustrates his broad proposals with his own hands-on work at MarineLives:
- The publication of 6 million words of semantically searchable and summarizable English High Court of Admiralty depositions using Google’s NotebookLM
- The creation of a bespoke Pinecone vector database using sentence and paragraph embeddings for interrogation by researchers
- The fine-tuning of a mid-sized multilingual large language model to clean up raw machine transcriptions
- The use of frontier large language models (Claude 3.5 Sonnet, Gemini 1.5, OpenAI o1-preview) to perform high-grade analytical and ontological summarization as part of a pipeline from machine transcription through to linked open data creation
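The fine-tuning step in the list above is usually framed as supervised learning over pairs of raw and corrected text, serialised in an instruction format. The example pairs and field names below are hypothetical illustrations, not MarineLives training data or its actual schema:

```python
import json

# Hypothetical training pairs: raw machine transcription (HTR output)
# alongside a hand-corrected reading. Invented examples, not real data.
pairs = [
    {
        "raw": "the sayd shipp the Mary ann of London wasbound for ye streights",
        "clean": "the said ship the Mary Ann of London was bound for the Straits",
    },
    {
        "raw": "deposeth that hee sawe the goodes laden aboord",
        "clean": "deposeth that he saw the goods laden aboard",
    },
]

def to_jsonl(records):
    """Serialise raw/clean pairs in the instruction-tuning JSONL shape
    commonly used for supervised fine-tuning of a cleanup model."""
    lines = []
    for r in records:
        lines.append(json.dumps({
            "instruction": "Correct this machine transcription of a "
                           "seventeenth-century deposition.",
            "input": r["raw"],
            "output": r["clean"],
        }))
    return "\n".join(lines)
```

One line per training example keeps the dataset streamable, which matters once a corpus of this kind reaches millions of words.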
The talk concludes with a vision of multi-agent/multi-player historical simulations to be integrated into graduate teaching and looks at the structure of such a simulation of international investment in the seventeenth century.