AI-powered OCR for historical archives transcription

The state of the art in Optical Character Recognition (OCR) technology has evolved significantly over the years, incorporating advancements in machine learning and artificial intelligence. The technology behind OCR in the 2000s was very different from today's advanced systems, relying on rule-based approaches and template matching, with limited language support. This meant that complex recognition projects involving non-standard characters or handwriting, such as historical archives, were beyond what OCR systems could handle.

Fast forward to the age of Machine Learning and LLMs, and OCR has become more than just "character recognition". Modern AI OCR systems, like the ones offered by Zetta Cloud (https://zettacloud.ai/optical-character-recognition/), can preprocess scanned documents to make them suitable for further ingestion, can identify (segment) text lines, and even come with their own language models to provide the best possible output.

Zetta Cloud's first experience with this novel approach to OCR was a joint project with Babes Bolyai University in Cluj-Napoca involving the full digitalization of historical archives, with IHRA (https://www.holocaustremembrance.com/) as the beneficiary.

The archive, stored on microfilm, was processed with a specialized appliance that transformed each microfilm frame into a high-quality digital image file. The images were then processed with our AI OCR engines into machine-readable and editable formats, automatically translated into English, and indexed in a DSpace repository for researchers. In addition, all documents were run through our Named Entity Recognition engine, which extracted names of people, organizations, locations, temporal references, and titles to be used as indexes into the database, so that the resulting repository already had AI-generated metadata for triage and indexing.
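For illustration, a named-entity extraction step of this kind can be sketched with the open-source spaCy library, as below. This is only a minimal stand-in: the article does not describe Zetta Cloud's internal NER engine, and the model name ("en_core_web_sm") and the sample sentence are assumptions made purely for demonstration.

    # Minimal NER sketch using spaCy (illustrative only, not Zetta Cloud's engine).
    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Hypothetical sentence from an already-translated document.
    text = "On 14 March 1902 the town council of Cluj wrote to the National Archives."
    doc = nlp(text)

    # Group recognized entities by type; persons, organizations, locations and
    # temporal references map naturally onto the index fields described above.
    metadata = {}
    for ent in doc.ents:
        metadata.setdefault(ent.label_, []).append(ent.text)

    print(metadata)  # e.g. {'DATE': ['14 March 1902'], 'GPE': ['Cluj'], ...}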

But the most challenging OCR project we encountered at Zetta was the digitalization of Medieval Scripts in the Latin Alphabet dating from the 12th century.

This project once again relied on scientists from Babes Bolyai University in Cluj-Napoca, especially Professor Adinel C. DINCĂ, Ph.D., a world-renowned expert in paleography (the academic discipline concerned with the analysis of historical writing systems and the historicity of manuscripts and texts). The results of this project have been published on ResearchGate and on the University's DigiHub Center webpage.

So what was so special about this project?

First of all, the status quo in paleography and the transcription of historical archives is still, in most instances, a human-specialist approach, because of the very specific nature not just of the writing system itself, but of what it contains. A paleographer needs to be an expert in the language and the writing system as well as in the domain. "E-paleography" is a brand-new approach in this field, and it is not yet in widespread use. Because of this, projects are long-lasting, human-intensive, and very expensive. All that knowledge waiting to be uncovered!

Secondly, there is no out-of-the-box OCR system that can process the 12th-century Carolingian Minuscule script. The variability of the handwriting, the complexity of the vocabulary and styles, and the difficulty of segmenting the text lines make currently available Optical Character Recognition (OCR) and/or Handwritten Text Recognition (HTR) systems unable to automatically recognize and transcribe medieval texts.

We decided to build a custom AI engine that would successfully handle these types of documents, using a combination of layout analysis and handwritten OCR that would be trained on sample data consisting of human-transcribed texts.

One of the key elements of producing a training corpus for the custom AI OCR model was to identify the line segments in the image files and provide the paleographers with a simple and intuitive web interface for producing the human transcription. The concept is very similar to CAT (Computer-Aided Translation) tools, where a translator works not on the entire document but on individual segments. The human transcription was also aided by a binarization step, which converts a color or grayscale input into an image containing only two color levels: white (background) and black (foreground, i.e. the text). This helps both the segmentation process and, from a UX perspective, the human paleographer; a minimal sketch of this preprocessing step follows below.
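The sketch uses OpenCV's Otsu thresholding for binarization and a naive horizontal-projection profile for line segmentation. The file name and the projection approach itself are assumptions; the article does not describe the exact segmentation method used in production.

    # Binarization + naive line segmentation sketch (illustrative assumptions only).
    import cv2

    # Hypothetical scanned page, loaded as grayscale.
    img = cv2.imread("manuscript_page.png", cv2.IMREAD_GRAYSCALE)

    # Otsu's method chooses the threshold automatically. THRESH_BINARY_INV makes
    # the ink white and the parchment black, which simplifies the row profile below
    # (the interface shown to paleographers keeps the conventional white
    # background / black text polarity).
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Sum ink pixels per row: rows containing ink belong to a text line,
    # empty rows between them mark line boundaries.
    profile = binary.sum(axis=1)
    lines, in_line, start = [], False, 0
    for y, value in enumerate(profile):
        if value > 0 and not in_line:
            in_line, start = True, y
        elif value == 0 and in_line:
            in_line = False
            lines.append(binary[start:y, :])  # one candidate line segment

    print(f"found {len(lines)} candidate line segments")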

Once the training corpus was generated, an AI OCR system was adapted, resulting in a custom-built engine that would be able to transcribe the Carolingian Minuscule script into Latin sentences, as shown in the image below. Corrections and post-editing of the output would also be possible through the web user interface.
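The article does not disclose the architecture of the adapted engine; the sketch below shows the kind of CRNN-plus-CTC model commonly used for handwritten text recognition on segmented line images, written in PyTorch, with all layer sizes and the alphabet length chosen arbitrarily for illustration.

    # Illustrative CRNN + CTC training step for line-level HTR (not the actual engine).
    import torch
    import torch.nn as nn

    class CRNN(nn.Module):
        # CNN feature extractor + bidirectional LSTM + per-timestep character classifier.
        def __init__(self, num_classes, img_height=32):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            )
            feat_h = img_height // 4               # height after two 2x2 poolings
            self.rnn = nn.LSTM(128 * feat_h, 256, bidirectional=True, batch_first=True)
            self.fc = nn.Linear(512, num_classes)  # num_classes includes the CTC blank

        def forward(self, x):                      # x: (batch, 1, height, width)
            f = self.cnn(x)                        # (batch, 128, H/4, W/4)
            b, c, h, w = f.size()
            f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # one timestep per image column
            out, _ = self.rnn(f)
            return self.fc(out)                    # (batch, timesteps, num_classes)

    # One illustrative CTC training step on dummy data (alphabet size 80 is arbitrary).
    model = CRNN(num_classes=80)
    ctc_loss = nn.CTCLoss(blank=0)
    images = torch.randn(4, 1, 32, 256)            # batch of binarized line images
    targets = torch.randint(1, 80, (4, 20))        # encoded characters of the human transcription
    log_probs = model(images).log_softmax(2)       # (batch, timesteps, classes)
    input_lengths = torch.full((4,), log_probs.size(1), dtype=torch.long)
    target_lengths = torch.full((4,), 20, dtype=torch.long)
    loss = ctc_loss(log_probs.permute(1, 0, 2), targets, input_lengths, target_lengths)
    loss.backward()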

Project Outcomes

The resulting AI solution provided end-to-end capabilities for the automated transcription of the Carolingian Minuscule 12th Century archives:

  • A lightweight web user interface to perform human transcriptions and corrections on the binarized and segmented images.
  • A custom-trained AI OCR engine using Deep Neural Networks.
  • A web user interface for uploading single or multiple documents for inference (running the AI model against the entire archive).

The quality of the resulting custom-trained AI OCR engine was automatically computed at an accuracy of 94.61%, using training data of only 2972 segments extracted from 38 pages. This quality is high even compared to long-running projects involving human paleographers, where the quality of the output tends to decrease over time to below 90% due to fatigue and other factors.
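As a rough illustration of how such a figure can be computed, the sketch below scores a hypothesis transcription against a human reference using character-level edit distance (one minus the character error rate); the article does not state the exact metric used, so treat this as an assumption.

    # Character-level accuracy sketch (the exact production metric is not disclosed).
    def levenshtein(ref: str, hyp: str) -> int:
        # Minimum number of single-character edits turning hyp into ref.
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            curr = [i]
            for j, h in enumerate(hyp, 1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (r != h)))  # substitution
            prev = curr
        return prev[-1]

    def char_accuracy(reference: str, hypothesis: str) -> float:
        # 1 - CER, expressed as a percentage.
        return 100.0 * (1 - levenshtein(reference, hypothesis) / max(len(reference), 1))

    # Hypothetical segment: one substituted character out of sixteen.
    print(char_accuracy("in nomine domini", "in nomine dominl"))  # 93.75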

Even training the engine with more data (5427 segments from 68 pages and 7741 segments from 100 pages) resulted in only marginal quality gains: 95.09% and 95.53% respectively, which means that near-optimal output quality is obtained with a few dozen human-transcribed pages, making the entire process quite efficient. We concluded that the effort inflection point is at 2168 segments (approximately 28 pages), meaning that we could have trained the algorithm with just 28 pages of manual transcription and achieved an accuracy very close to what we obtained by manually transcribing 100 pages.

The value of custom-built AI OCR engines for historical archives is indisputable, as proven by our past projects.

Manuscripts and documents from centuries past, often written by hand; local and regional archives, such as those maintained by small towns, villages, or specific communities; historical newspapers and magazines; personal archives, like diaries and correspondence, especially from notable historical figures; and government and legal records all hold immense knowledge waiting to be used. The digitalization of these archives is crucial for preserving historical knowledge and making it more accessible. However, the process is often slow due to the need for careful handling, specialized equipment, and significant funding. As technology advances, particularly in areas like high-resolution scanning and AI-based text recognition, the pace of digitalization is likely to increase, helping to preserve these valuable historical resources.