Configurable OCR for documents in any language. Built with Tesseract and optimized for PDF processing.
- Process PDFs with high accuracy
- Support for any language Tesseract supports
- Memory efficient processing
- Optional GPU acceleration via OpenCV
- Graceful fallback to CPU processing
- Docker-based deployment
- 7z archive support for batch processing
# Build with default language (English)
docker build -t tess-pdfseq .
# Process a 7z archive (CPU mode)
./bin/pdfseq input.7z
# Process with GPU acceleration
GPU=1 ./bin/pdfseq input.7z
# Process with custom settings
LANGS="eng+ara" DPI=300 WIDTH=2400 GPU=1 ./bin/pdfseq docs.7z
Environment variables for bin/pdfseq
:
LANGS
: OCR languages (default: eng)DPI
: PDF scan resolution (default: 500)WIDTH
: Target width in pixels (default: 1800)MAX_PAGES
: Max pages per PDF (default: 3)WORKDIR
: Working directory (default: current directory)GPU
: Enable GPU acceleration (default: 0)TAG
: Docker image tag (default: latest)
First, install languages during Docker build (plus-separated):
# Build with multiple languages available
docker build -t tess-pdfseq --build-arg TESS_LANGS="eng+fas+ara" .
Then use any of the installed languages at runtime:
# Use English (default)
./bin/pdfseq docs.7z
# Use Persian/Farsi
LANGS=fas ./bin/pdfseq docs.7z
# Use multiple languages together
LANGS="eng+fas" ./bin/pdfseq docs.7z
Note: You can only use languages at runtime that were installed during build.
List available languages:
# On Arch Linux
pacman -Ss tesseract-data | grep -Po 'tesseract-data-\K[^/\s]*(?=\s)'
# Using Docker
docker run --rm python:3.12-alpine sh -c \
"apk add --no-cache tesseract-ocr && tesseract --list-langs"
Common languages:
eng
- Englishheb
- Hebrewjpn
- Japanesekor
- Koreanrus
- Russianfas
- Persian/Farsichi_sim
- Chinese Simplified
- Optimizes images for OCR (1800px width)
- GPU acceleration with OpenCV UMat when enabled
- Automatic fallback to CPU if GPU fails
- Outputs plain text files
- Docker (GPU support optional)
- NVIDIA Container Toolkit for GPU support
- ~1GB disk space for Docker image
- p7zip for archive handling
git clone /~https://github.com/metaory/tess-pdfseq
cd tess-pdfseq
# Build with custom languages
docker build -t tess-pdfseq --build-arg TESS_LANGS="eng+ara" .