Rescuing Data from Non-Searchable PDFs: An Open Source Workflow for Librarians
- Ehsan Moghadam

- Dec 17, 2025
- 9 min read

If you work in libraries or archives long enough, someone eventually sends you a PDF that looks perfectly normal in a viewer, but refuses to cooperate. You cannot search it, you cannot copy text, and any attempt to extract content gives you nothing useful.
This happened to me with a conference program and abstract booklet that a researcher needed to search and mine for content. The PDF looked crisp. The fonts were clean. It behaved like a real document in every way except the one that mattered: there was no searchable text layer.
Under the hood, the text had been converted to shapes. In other words, what looked like letters on the screen were actually vector outlines, not characters that software can recognize as text.
I wanted a way to turn that kind of PDF into a fully searchable document using only open source tools, in a workflow that other librarians could reproduce and share. This post walks through what we used, how it works, and how the community could push this workflow into much more ambitious territory, including handwriting and historical material.
The Tools: A Small Open Source Stack
The core idea is simple:
Take a PDF that has no useful text layer.
Treat each page as an image.
Run Optical Character Recognition (OCR) on those images.
Embed the recognized text back into a new PDF.
We used three open source tools together:
OCRmyPDF
This is the orchestrator. OCRmyPDF takes a PDF as input, applies OCR, and writes a new PDF that has a real text layer behind the page images. It uses Tesseract and Ghostscript behind the scenes.
Tesseract OCR
Tesseract is the actual OCR engine. It looks at each image of a page and tries to recognize characters and words. It supports many languages and has been used in research, libraries, and industry for years.
Ghostscript
Ghostscript is a PDF and PostScript interpreter. OCRmyPDF uses it for tasks like rendering pages, handling images, and optimizing the final output.
The important point is that every part of this stack is free and open source. No proprietary software is required, and the workflow can be scripted, shared, and adapted in many different environments.
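Because every piece of the stack is an ordinary command-line program, a script can check that everything is available before any processing starts. The sketch below is just for illustration (the missing_tools helper is not part of any of these tools); note that Ghostscript's executable is named gs:

```python
import shutil

def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    # shutil.which returns None when a program is not on the PATH.
    return [name for name in names if shutil.which(name) is None]

if __name__ == "__main__":
    # "gs" is the Ghostscript executable name on macOS and Linux.
    missing = missing_tools(["ocrmypdf", "tesseract", "gs"])
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All OCR stack tools are available.")
```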
What Was Wrong With The Original PDF?
The conference booklet that started all this had a problem that is very common:
The content had originally been set in some publication software.
At some point, when the PDF was produced, the actual text was converted into vector outlines.
That means every letter on the page became a tiny shape. When you try to search for a word, there is nothing to search because, as far as the PDF is concerned, there are only drawings, not text.
Scanned material has a similar problem, but with one additional complication: instead of vectors, you have raster images of pages. In both cases, OCR is required to reconstruct text.
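One crude way to spot this kind of PDF programmatically: text on a PDF page is drawn with the Tj and TJ operators inside content streams, so a file whose raw bytes contain neither operator very likely has no text layer. This is only a heuristic, and an important caveat applies: most real PDFs compress their content streams, which hides these operators, so treat a "no text" answer as a hint to inspect further, not proof.

```python
def looks_searchable(pdf_bytes: bytes) -> bool:
    """
    Heuristic: True if uncompressed text-showing operators (Tj/TJ)
    appear in the raw bytes. Compressed content streams (the common
    case) defeat this check, so a False result is only suggestive.
    """
    return b"Tj" in pdf_bytes or b"TJ" in pdf_bytes

# Tiny illustrative fragments (not complete PDFs):
with_text = b"BT /F1 12 Tf (Hello) Tj ET"   # text-showing operator present
outlines_only = b"0 0 m 10 10 l S"           # only vector path operators
```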
The Workflow, Step By Step
1. Install the tools
On macOS with Homebrew:
brew install ocrmypdf tesseract ghostscript
On Ubuntu or Debian:
sudo apt update
sudo apt install -y ocrmypdf tesseract-ocr ghostscript
On Windows with Chocolatey (in an Administrator PowerShell):
choco install ocrmypdf tesseract ghostscript --yes
Once installed, you can run:
ocrmypdf --version
tesseract --version
gs --version
to confirm that everything is available on your system.
2. The Python Helper Script
To make this easy for librarians and archivists who may not want to remember long command line invocations, I wrote a small Python script that wraps OCRmyPDF in a way that is friendly and reusable.
Save this as scroll-scribe.py:
#!/usr/bin/env python3
import argparse
import os
import shutil
import subprocess
import sys


def ensure_dep(name: str) -> None:
    """
    Check that a required command-line program is available in the PATH.
    If it is missing, print an error message and exit the script.
    """
    if shutil.which(name) is None:
        print(f"[ERROR] Required dependency not found in PATH: {name}")
        print("Hint: install it first. For example, on macOS with Homebrew:")
        print(f"  brew install {name}")
        sys.exit(1)


def main() -> None:
    parser = argparse.ArgumentParser(
        description=(
            "Apply OCR to every page of a PDF to create a fully searchable PDF. "
            "This is useful for vectorized PDFs or scanned documents that "
            "look like text but are not actually searchable."
        )
    )
    parser.add_argument(
        "input_pdf",
        help="Path to the source PDF that is non-searchable or only partly searchable.",
    )
    parser.add_argument(
        "output_pdf",
        help="Path to the output searchable PDF that will be created.",
    )
    parser.add_argument(
        "--lang",
        default="eng",
        help="Tesseract language or languages, for example 'eng' or 'eng+deu'. Default: eng",
    )
    parser.add_argument(
        "--dpi",
        default="300",
        help="Minimum oversampling DPI for OCR. Higher means better text, but larger files. Default: 300",
    )
    parser.add_argument(
        "--jobs",
        type=int,
        default=0,
        help=(
            "Number of parallel jobs (CPU cores) for ocrmypdf to use. "
            "Default 0 means let ocrmypdf decide. "
            "Try 4 or 8 on a multi-core machine for faster processing."
        ),
    )
    args = parser.parse_args()

    # Make sure the required tools are installed.
    for dep in ("ocrmypdf", "tesseract", "gs"):
        ensure_dep(dep)

    # Build the ocrmypdf command.
    #
    # Key flags:
    #   --force-ocr
    #       Rasterize every page and run OCR, even if the page already has
    #       a text layer or, as in this case, text converted to vector
    #       outlines that OCR must replace.
    #   --rotate-pages / --deskew
    #       Try to straighten and rotate pages automatically.
    #   --clean-final
    #       Clean up the images to reduce background noise. This requires
    #       the optional unpaper tool, so it is only added when unpaper
    #       is actually installed.
    #   --pdf-renderer sandwich
    #       Keep the original page image, but overlay an invisible text layer.
    cmd = [
        "ocrmypdf",
        "--force-ocr",
        "--rotate-pages",
        "--deskew",
        "--optimize", "1",
        "--output-type", "pdf",
        "--tesseract-pagesegmode", "1",
        "--language", args.lang,
        "--pdf-renderer", "sandwich",
        "--oversample", args.dpi,
    ]
    if shutil.which("unpaper"):
        cmd.append("--clean-final")
    if args.jobs:
        cmd += ["--jobs", str(args.jobs)]
    cmd += [args.input_pdf, args.output_pdf]

    print("[INFO] Running:", " ".join(cmd))
    try:
        subprocess.check_call(cmd)
    except subprocess.CalledProcessError as e:
        print(f"[ERROR] OCR failed with exit code {e.returncode}")
        sys.exit(e.returncode)

    if os.path.exists(args.output_pdf):
        print(f"[OK] Wrote searchable PDF -> {args.output_pdf}")
    else:
        print("[ERROR] Output PDF not created.")
        sys.exit(1)


if __name__ == "__main__":
    main()
3. Running The Script
Place your PDF and scroll-scribe.py together in a folder. For example:
Documents/
Project/
scroll-scribe.py
Scroll.pdf
Then, from that folder in a terminal or in the integrated terminal in VS Code:
python3 scroll-scribe.py "Scroll.pdf" "scroll-scribe.pdf" --lang eng --dpi 300
After some time, you should see:
[OK] Wrote searchable PDF -> scroll-scribe.pdf
Open the output file, use Ctrl+F or Command+F, search for an author name or keyword, and you should be able to search and copy text throughout the document.
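For a whole folder of PDFs, the script can be driven from a small batch wrapper. The build_batch_commands helper below is a hypothetical sketch, not part of scroll-scribe.py; it assumes the script sits in the current directory and writes each output next to its input:

```python
import subprocess
from pathlib import Path

def build_batch_commands(folder, lang="eng", dpi="300"):
    """
    Build one scroll-scribe.py command per PDF in `folder`, writing each
    output as <name>-searchable.pdf alongside the input. Note: on a
    second run the glob would also pick up the *-searchable.pdf outputs,
    so in practice you would filter those out or use separate folders.
    """
    commands = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        out = pdf.with_name(pdf.stem + "-searchable.pdf")
        commands.append([
            "python3", "scroll-scribe.py",
            str(pdf), str(out),
            "--lang", lang, "--dpi", dpi,
        ])
    return commands

if __name__ == "__main__":
    # Requires ocrmypdf, tesseract, and ghostscript to be installed.
    for cmd in build_batch_commands("."):
        subprocess.check_call(cmd)
```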
How Well Does This Work?
For printed material, the results are often excellent.
Works very well for
Conference programs and abstract books, like in this example
Scanned articles and book chapters
Reports, forms, and institutional documents
Phone photos of printed pages with decent lighting
Mixed results for
Neat, modern handwriting in block letters
Simple pen or pencil notes
Struggles without specialized methods
Cursive handwriting
Damaged or heavily stained manuscripts
Ancient scrolls and papyri
Non-Latin scripts that do not match Tesseract training data
Anything with ornate calligraphy or very inconsistent handwriting
This is where the workflow becomes more interesting and points toward possible future work and collaboration.
Beyond Printed Text: How This Workflow Could Evolve
The project started with a very practical goal: help a researcher by making a conference booklet searchable. Once you see that work, it is hard not to imagine what else might be possible.
Could a similar workflow help with handwritten notebooks, excavation diaries, or even digitized scrolls and papyri?
The honest answer is yes in principle, but with important caveats. The limitations are less about the PDF wrapper and more about what OCR engines can reliably read.
Below are some ways this workflow could grow, and areas where community contributions would be invaluable.
1. Cursive Handwriting
Cursive handwriting varies dramatically from person to person. Letters connect, loops are inconsistent, and spacing is irregular. Tesseract is optimized for printed fonts, not for free-flowing cursive.
Considerations:
Integrate or document training of custom Tesseract models for common handwriting styles.
Explore handwriting-specific systems such as Transkribus or Kraken for cases where Tesseract is not adequate.
Provide example notebooks and training data preparation scripts for users who want to train models on local collections, faculty notebooks, or institutional archives.
In the repository, contributors could add:
Example datasets of cursive writing with transcriptions.
Instructions for training and evaluating a cursive-specific model.
Comparisons of recognition quality across tools.
2. Damaged Manuscripts and Challenging Scans
Historical and archival material often has fading inks, bleed-through, staining, warping, and partial loss. OCR engines see all of that as visual noise, and their accuracy drops quickly as image quality declines.
Considerations:
Add preprocessing steps before OCR, using tools such as OpenCV or Pillow, for example:
adaptive thresholding for uneven backgrounds
contrast enhancement
basic deblurring
more advanced deskewing
Provide presets for common scenarios:
faded typewritten pages
photocopies of photocopies
pages with significant bleed-through
Contributors could help by:
Adding small modular preprocessing scripts that can be toggled on or off in the Python wrapper.
Supplying sample images and before and after comparisons.
Sharing domain-specific presets for the kinds of material they handle, for example microfilm scans or local newspaper archives.
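The adaptive-thresholding idea is worth seeing concretely. The toy version below works on a plain 2D list of grayscale values so it needs no imaging library; a real preprocessing step would use OpenCV or Pillow instead, as suggested above:

```python
def adaptive_threshold(gray, window=3, offset=10):
    """
    Binarize a 2D grid of grayscale values (0 = black ink, 255 = white
    paper). Each pixel is compared to the mean of its local window, so a
    page with an uneven background (e.g. a shadow across a scan) still
    separates ink from paper, where a single global threshold would fail.
    """
    h, w = len(gray), len(gray[0])
    r = window // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Mean of the window around (y, x), clipped at the edges.
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            vals = [gray[j][i] for j in ys for i in xs]
            mean = sum(vals) / len(vals)
            # Mark as ink only if clearly darker than its neighbourhood.
            row.append(0 if gray[y][x] < mean - offset else 255)
        out.append(row)
    return out
```

The offset parameter plays the same role as the constant subtracted in OpenCV's cv2.adaptiveThreshold: it keeps mild background texture from being misread as ink.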
3. Ancient Scrolls, Papyri, and Non Latin Scripts
At this point we move from standard OCR into the realm of Handwritten Text Recognition (HTR) and historical script analysis. Scripts may differ completely from modern alphabets, characters can be broken or incomplete, and the writing surface itself may be curved, cracked, or discolored.
Considerations:
Integrate or document how to pair this workflow with HTR engines such as Kraken, Calamari, or Transkribus for:
Greek uncials
Hebrew manuscripts
Syriac, Coptic, and other historical scripts
Latin cursive from the medieval or early modern period
Provide hooks or templates so that the Python script can call external engines instead of, or in addition to, Tesseract.
Support more complex input such as multiple images per page, or images that come from virtual unwrapping workflows where scrolls or codices are reconstructed digitally.
Contributors working with digital scholarship or specific traditions could add:
Example configurations for particular scripts.
Links to open training sets for historical handwritings.
Documentation on how to export results from Transkribus or Kraken into text layers that can be embedded back into PDFs.
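One possible glue step for that export path: both Tesseract and Kraken can emit hOCR, an HTML-based format that wraps each recognized word in a span with class "ocrx_word". The stdlib sketch below pulls the words out of an hOCR fragment; a full text-layer builder would also read the bounding boxes carried in each span's title attribute, which this sketch ignores:

```python
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    """Collect the text of every hOCR ocrx_word span, in reading order."""
    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "ocrx_word" in classes.split():
            self._in_word = True

    def handle_endtag(self, tag):
        self._in_word = False

    def handle_data(self, data):
        if self._in_word and data.strip():
            self.words.append(data.strip())

def hocr_to_words(hocr: str):
    parser = HocrWords()
    parser.feed(hocr)
    return parser.words
```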
4. Ornate and Inconsistent Handwriting
Calligraphic scripts, flourished signatures, decorative marginal notes, and idiosyncratic personal styles are very challenging for conventional OCR engines. The variation is not just between writers, but often within a single page.
Considerations:
Add a classification step that tries to detect whether a page contains mostly printed text, simple handwriting, or ornate handwriting.
Route pages to different OCR models based on that classification. For example, Tesseract for clean print, Kraken for cursive or historical scripts.
Experiment with ensemble approaches where multiple engines produce candidates and a simple model chooses the best result.
In the repository this could look like:
A plugin system that lets users register different OCR back ends and define when to use each one.
Sample classification models and example code that shows how to direct pages based on handwriting style.
Test sets where users can compare outputs across engines.
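The routing idea can be sketched as a small registry that maps a page classification label to a back end. Everything here is a placeholder: the labels, the engine names, and the classifier (stubbed out entirely) would all come from local experimentation:

```python
# Hypothetical back-end registry. A real classifier might be a small
# image model trained on local material; here the labels are given.
ENGINES = {
    "print": "tesseract",        # clean printed text
    "handwriting": "kraken",     # cursive or historical scripts
    "ornate": "manual-review",   # flag for human transcription
}

def route_page(label: str) -> str:
    """Pick an engine for a page, falling back to Tesseract."""
    return ENGINES.get(label, "tesseract")

def route_document(labels):
    """Group 1-based page numbers by the engine that should process them."""
    plan = {}
    for page_no, label in enumerate(labels, start=1):
        plan.setdefault(route_page(label), []).append(page_no)
    return plan
```

A plan like {"tesseract": [1, 3], "kraken": [2]} could then drive separate OCR passes whose text layers are merged back into one PDF.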
Why This Is Exciting For Libraries And Archives
Even in its current form, this workflow already gives librarians and archivists a very practical capability:
Take non searchable PDFs from publishers, departments, or legacy systems.
Convert them into searchable, copyable documents using tools that anyone can install.
Integrate the script into local digitization practices and institutional repositories.
The more interesting part is what can happen if a community of practitioners builds on it.
There is a growing convergence between library practice, digital scholarship, machine learning, and open source tooling. A small script like this can become a hub for experiments in text recovery, especially if:
People contribute documentation for different environments, such as public libraries, special collections, and university archives.
Researchers add workflows for challenging material like field notebooks, excavation diaries, or lab records.
Digital humanists share scripts and training data for historical scripts.
How Others Can Contribute
This workflow is now published on GitHub, where I invite contributions in several specific ways:
Improvements to installation instructions for Windows, macOS, and Linux.
Example workflows for:
institutional repositories
special collections
teaching and student projects
Test files and sample datasets that let others validate and compare results.
Preprocessing presets for particular types of material.
Optional integration with HTR tools like Transkribus and Kraken, when licenses and terms allow.
Documentation that explains when users should treat OCR output as a draft that requires human correction, especially in historical or high-stakes contexts.
Conclusion
What began as a very practical response to a researcher’s request turned into a small but illustrative example of what is now possible for librarians using open source tools. With OCRmyPDF, Tesseract, and Ghostscript, you can take a non searchable PDF and reconstruct a usable text layer that supports search, copy, and basic text mining.
That alone can be transformative for conference materials, internal documents, and legacy collections that previously could not be searched. With further work, shared datasets, and contributions from the library and digital humanities communities, similar workflows can start to reach more ambitious goals, including challenging handwriting and historical material.
If you try this workflow, adapt it for your institution, or extend it to handle new kinds of documents, consider sharing your results. Each small improvement makes it easier for the next librarian, archivist, or researcher to rescue text that would otherwise remain locked in images.
Resources:
OCRmyPDF. (n.d.). OCRmyPDF documentation. Read the Docs. Retrieved December 1, 2025, from https://ocrmypdf.readthedocs.io/en/latest/
OCRmyPDF. (n.d.). OCRmyPDF [Computer software]. GitHub. Retrieved December 1, 2025, from https://github.com/ocrmypdf/OCRmyPDF
Tesseract OCR. (n.d.). Tesseract user manual. GitHub Pages. Retrieved December 1, 2025, from https://tesseract-ocr.github.io/tessdoc/
Tesseract OCR. (n.d.). Tesseract Open Source OCR Engine [Computer software]. GitHub. Retrieved December 17, 2025, from https://github.com/tesseract-ocr/tesseract
Artifex Software, Inc. (n.d.). Ghostscript [Computer software]. Retrieved December 1, 2025, from https://www.ghostscript.com/
Transkribus. (n.d.). AI text recognition [Web application]. Transkribus. Retrieved December 1, 2025, from https://www.transkribus.org/en/ai-text-recognition/
Mittagessen. (n.d.). Kraken: OCR engine for all the languages [Computer software]. GitHub. Retrieved December 1, 2025, from https://github.com/mittagessen/kraken
Ghent Centre for Digital Humanities. (n.d.). Transkribus: Historical documents with AI. Ghent Centre for Digital Humanities. Retrieved December 1, 2025, from https://www.ghentcdh.ugent.be/transkribus-historical-documents-ai
Digital Orientalist. (2023, September 26). Train your own OCR/HTR models with Kraken, part 1. The Digital Orientalist. Retrieved December 1, 2025, from https://digitalorientalist.com/2023/09/26/train-your-own-ocr-htr-models-with-kraken-part-1/