By Alan Thomas, AI Research Engineer at the Centre for Machine Intelligence, University of Sheffield
Poor optical character recognition (OCR) quality is a major obstacle for humanities scholars seeking to make use of digitised primary sources such as historical newspapers. To improve the quality of noisily OCR’d historical documents, we introduce BLN600 – an open-access dataset derived from Gale’s British Library Newspapers – and showcase the potential of large language models (LLMs) for post-OCR correction using Llama.