The Gale Review

A blog from Gale International


Leveraging Large Language Models for Post-OCR Correction of Nineteenth-Century British Newspapers

September 3, 2024 by Gale Review Team

By Alan Thomas, AI Research Engineer at the Centre for Machine Intelligence, University of Sheffield

Poor optical character recognition (OCR) quality is a major obstacle for humanities scholars seeking to make use of digitised primary sources such as historical newspapers. To improve the quality of noisily OCR’d historical documents, we introduce BLN600 – an open-access dataset derived from Gale’s British Library Newspapers – and showcase the potential of large language models (LLMs) for post-OCR correction using Llama.

Background

Digital archives have become an indispensable resource for humanities research. Primary sources such as newspapers, early printed books, and handwritten documents have been digitised and preserved in searchable online databases such as Gale’s British Library Newspapers using OCR technology to convert scanned images of historical documents into machine-readable text.

However, a persistent challenge faced by scholars seeking to make use of these resources is the low quality of transcriptions produced by OCR. Due to the age and condition of the original documents, the OCR process often results in inaccurate transcriptions, creating obstacles for researchers who rely on these texts for their work.

Example of low-quality OCR text with source image from “LAW NOTICES.-THIS DAY.”, Morning Chronicle, 29 October 1835. British Library Newspapers, https://link.gale.com/apps/doc/BA3207647413/BNCN?u=su_uk&sid=bookmark-BNCN&xid=45ba0d38

At the Centre for Machine Intelligence at the University of Sheffield, we are working on a collaborative project with the Digital Humanities Institute that applies advanced artificial intelligence methods to improve the quality of OCR transcriptions for historical documents. In this blog post, we detail how LLMs can be used for post-OCR correction, which involves refining and correcting the textual output produced by OCR technology.

BLN600 – An Open-Source Dataset

Improving OCR quality, especially for historical documents, remains a significant challenge with limited publicly available resources. To address this, we released BLN600, a publicly available parallel corpus of nineteenth-century newspaper text, focused on crime in London. This corpus is derived from Parts I and II of Gale’s British Library Newspapers collection. BLN600 comprises 600 newspaper excerpts, each containing the original source image, a machine-generated OCR transcription, and a manually created gold standard transcription.

A sample from BLN600 with OCR text, source image and ground truth, using “A COURAGEOUS POLICEMAN.”, Lloyd’s Illustrated Newspaper, 20 June 1880. British Library Newspapers, https://link.gale.com/apps/doc/BC3206247284/BNCN?u=su_uk&sid=bookmark-BNCN&xid=d1652e94

British Library Newspapers spans over 200 years of British newspaper history, featuring more than 240 different publications. To curate BLN600 from this extensive archive, we ran a custom query to identify crime-related articles published in London-specific newspapers, yielding 10,000 full-page images. From these, we randomly selected 600 images based on the presence of crime-related content and the readability of the text. Each image was manually rekeyed and aligned with the corresponding OCR text from British Library Newspapers to produce a complete sample.

BLN600 is a valuable resource for historians and digital humanities researchers exploring nineteenth-century crime journalism: its gold-standard transcriptions facilitate the application of natural language processing (NLP) techniques. The source images allow researchers to use BLN600 as a benchmark dataset for tracking and measuring improvements in OCR engine performance for historical documents. The parallel OCR text and ground truth can be used to support the development and training of post-OCR correction models.

Distribution of BLN600 samples over publication and decade.

Post-OCR Correction With LLMs

Llama is a family of pre-trained and fine-tuned LLMs (large language models) released by Meta AI. The fine-tuned chat model is designed for assistant-like chat and optimised for dialogue applications, similar to ChatGPT. The pre-trained base model is a causal language model, designed to predict the next word in a sequence, which can be adapted for various natural language generation tasks including post-OCR correction. We opted to use Llama 2 due to its open-access nature and availability of various versions.

Using BLN600, we created a dataset of sequence pairs by dividing the text into segments, which can be sentences, short titles, or longer passages. After generating these pairs, we split them into training and evaluation sets. The training set is used to build an instruction-tuning dataset, featuring instruction, input, and response fields to guide the model’s response.

Instruction-tuning data breakdown (top) and example (bottom).
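As a rough illustration of the step described above, the following Python sketch turns sequence pairs into records with instruction, input, and response fields and splits them into training and evaluation sets. The instruction wording, the 80/20 split, the JSON Lines output, and the example pairs are all assumptions made for illustration; they are not taken from BLN600 or from the authors' pipeline.

```python
import json
import random

# Hypothetical instruction wording; the source does not give the exact prompt.
INSTRUCTION = "Correct the OCR errors in the following text."


def build_records(pairs):
    """Turn (ocr_text, gold_text) pairs into instruction/input/response records."""
    return [
        {"instruction": INSTRUCTION, "input": ocr, "response": gold}
        for ocr, gold in pairs
    ]


def split_pairs(pairs, eval_fraction=0.2, seed=0):
    """Shuffle the pairs reproducibly, then split into (training, evaluation)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Invented sequence pairs for illustration only (not real BLN600 samples).
pairs = [
    ("Tbe prisoncr was rernanded.", "The prisoner was remanded."),
    ("A COURAGEOUS POLICEMAN.", "A COURAGEOUS POLICEMAN."),
    ("He was charged witb stealing a watcb.", "He was charged with stealing a watch."),
    ("Tho magistrate committed hirn for trial.", "The magistrate committed him for trial."),
    ("Mr. Smitb appeared for the defence.", "Mr. Smith appeared for the defence."),
]

train_pairs, eval_pairs = split_pairs(pairs)
train_records = build_records(train_pairs)

# One JSON object per line is a common on-disk format for instruction-tuning data.
jsonl = "\n".join(json.dumps(record, ensure_ascii=False) for record in train_records)
print(jsonl)
```

Only the training pairs are converted into instruction records here; the evaluation pairs are held back so that the fine-tuned model can later be scored on text it has not seen.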

Once the base Llama 2 model has been fine-tuned on our instruction-tuning dataset, it can be used to generate corrections over the evaluation set. To measure the performance of our models, we compute the percentage reduction in character error rate (CER) relative to the original OCR text. CER measures how often characters are incorrect in a transcribed text: the number of character-level errors divided by the total number of characters, typically counted in the ground truth. Llama 2 7B achieves a 43.26% reduction in CER, whilst Llama 2 13B achieves a 54.51% reduction, suggesting that these models can roughly halve the number of errors in OCR text.
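To make the metric concrete, here is a minimal Python sketch of CER and its percentage reduction. It assumes the common definition of CER as the Levenshtein (edit) distance between a transcription and the ground truth, divided by the length of the ground truth; the exact normalisation and aggregation used in the paper may differ, and the example strings are invented.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of a transcription against the ground-truth reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


def cer_reduction(cer_before: float, cer_after: float) -> float:
    """Percentage reduction in CER; positive values mean the correction helped."""
    if cer_before == 0:
        raise ValueError("cer_before must be greater than zero")
    return 100.0 * (cer_before - cer_after) / cer_before


# Invented example strings (not real BLN600 samples).
reference = "The prisoner was remanded."
ocr_text = "Tbe prisoncr was rernanded."    # 4 character edits away from the reference
corrected = "The prisoner was rernanded."   # 2 character edits remain after correction

before = cer(ocr_text, reference)
after = cer(corrected, reference)
print(f"CER before: {before:.4f}, after: {after:.4f}")
print(f"Reduction: {cer_reduction(before, after):.2f}%")
```

In this toy case the correction fixes two of the four character errors, so the CER falls from 4/26 to 2/26, a 50% reduction. A negative reduction would mean the model introduced more errors than it fixed.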

Llama 2 13B corrections on different error types.

By employing LLMs for post-OCR correction, we can achieve a significant reduction in the number of errors in BLN600, paving the way for future work leveraging LLMs to improve the accessibility and unlock the full potential of historical texts for humanities research. Since the completion of this work, Llama 3 has been released with greater capabilities, indicating the potential for further improvement.

BLN600 is publicly accessible at https://doi.org/10.15131/shef.data.25439023.

For more details regarding this work, please refer to the following papers:

  1. Booth, Callum William, Alan Thomas, and Robert Gaizauskas. “BLN600: A Parallel Corpus of Machine/Human Transcribed Nineteenth Century Newspaper Texts.” Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources
  2. Thomas, Alan, Robert Gaizauskas, and Haiping Lu. “Leveraging LLMs for Post-OCR Correction of Historical Newspapers.” Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024. 2024.


About the Author

Alan Thomas is an AI Research Engineer at the Centre for Machine Intelligence at the University of Sheffield. His research focuses on the application of generative AI for advancing interdisciplinary research.
