PDF Preprocessing Configuration Guide¶

This guide explains how to configure PDF preprocessing in Extralit Server using the PDFPreprocessingSettings class for optimal results with different document types.

Overview¶

Extralit Server uses OCRmyPDF for PDF preprocessing, which performs OCR (Optical Character Recognition), rotation correction, optimization, and cleanup. The preprocessing pipeline also includes PDF layout analysis to extract margin and structure information.

All settings can be configured via environment variables with the PREPROCESSING_ prefix.
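
As a rough illustration of the prefix convention, the sketch below collects `PREPROCESSING_*` variables into a plain dictionary keyed by lowercase field name. The helper `read_preprocessing_env` is hypothetical, and the JSON-style value parsing is an assumption based on the list syntax used in this guide (for example `["eng"]`); how `PDFPreprocessingSettings` actually loads its values is not shown here.

```python
import json
import os

PREFIX = "PREPROCESSING_"


def read_preprocessing_env(environ=None):
    """Collect PREPROCESSING_* variables into a dict of lowercase field names.

    Hypothetical helper for illustration only; not part of Extralit Server.
    """
    environ = os.environ if environ is None else environ
    values = {}
    for key, raw in environ.items():
        if not key.startswith(PREFIX):
            continue
        name = key[len(PREFIX):].lower()
        try:
            # "true" -> True, "180" -> 180, '["eng"]' -> ["eng"]
            values[name] = json.loads(raw)
        except json.JSONDecodeError:
            # Bare strings such as hocr or pdf are kept unchanged
            values[name] = raw
    return values


example = {
    "PREPROCESSING_FORCE_OCR": "true",
    "PREPROCESSING_LANGUAGE": '["eng", "spa"]',
    "PREPROCESSING_PDF_RENDERER": "hocr",
    "UNRELATED_VARIABLE": "1",
}
print(read_preprocessing_env(example))
```

Variables without the prefix are ignored, which is why `UNRELATED_VARIABLE` does not appear in the result.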

Quick Start¶

Digital Research Papers (Born-Digital PDFs)¶

For modern PDFs that already contain searchable text:

# Minimal processing - just analysis and optimization
PREPROCESSING_ENABLED=true
PREPROCESSING_ENABLE_ANALYSIS=true
PREPROCESSING_SKIP_TEXT=true           # Skip OCR on text pages
PREPROCESSING_FORCE_OCR=false
PREPROCESSING_TESSERACT_TIMEOUT=0      # No timeout (not "skip OCR")
PREPROCESSING_OPTIMIZE=1               # Lossless optimization
PREPROCESSING_CLEAN=false              # No cleanup needed
PREPROCESSING_DESKEW=false             # Usually not needed

Performance: ~0.5-2s per page (mostly analysis)

Scanned Research Papers (Image-Based PDFs)¶

For scanned documents or image-only PDFs that need OCR:

# Full OCR processing
PREPROCESSING_ENABLED=true
PREPROCESSING_ENABLE_ANALYSIS=true
PREPROCESSING_FORCE_OCR=true           # OCR all pages
PREPROCESSING_SKIP_TEXT=false          # Process text layers
PREPROCESSING_TESSERACT_TIMEOUT=180    # 3 minutes per page
PREPROCESSING_LANGUAGE=["eng"]         # Add more as needed
PREPROCESSING_ROTATE_PAGES=true        # Auto-rotate pages
PREPROCESSING_DESKEW=true              # Fix skewed scans
PREPROCESSING_CLEAN=true               # Remove scan artifacts
PREPROCESSING_OPTIMIZE=2               # Lossy compression

Performance: ~2-5s per page for good quality scans

Mixed Document Collections¶

For collections with both digital and scanned papers:

# Balanced approach
PREPROCESSING_ENABLED=true
PREPROCESSING_ENABLE_ANALYSIS=true
PREPROCESSING_SKIP_TEXT=true           # Only OCR image pages
PREPROCESSING_FORCE_OCR=false          # Detect existing text
PREPROCESSING_REDO_OCR=false           # Don't re-OCR
PREPROCESSING_TESSERACT_TIMEOUT=120    # 2 minutes timeout
PREPROCESSING_ROTATE_PAGES=true
PREPROCESSING_DESKEW=false
PREPROCESSING_CLEAN=true
PREPROCESSING_OPTIMIZE=1
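
The three profiles above can also be kept as data and rendered into `.env` lines when a deployment needs to switch between them. The following is a minimal sketch: `PRESETS` and `to_env_lines` are hypothetical names, and the values are copied from the Quick Start profiles in this guide.

```python
import json

# Hypothetical preset table; values mirror the Quick Start profiles above.
PRESETS = {
    "digital": {
        "enabled": True, "enable_analysis": True, "skip_text": True,
        "force_ocr": False, "tesseract_timeout": 0, "optimize": 1,
        "clean": False, "deskew": False,
    },
    "scanned": {
        "enabled": True, "enable_analysis": True, "force_ocr": True,
        "skip_text": False, "tesseract_timeout": 180, "language": ["eng"],
        "rotate_pages": True, "deskew": True, "clean": True, "optimize": 2,
    },
    "mixed": {
        "enabled": True, "enable_analysis": True, "skip_text": True,
        "force_ocr": False, "redo_ocr": False, "tesseract_timeout": 120,
        "rotate_pages": True, "deskew": False, "clean": True, "optimize": 1,
    },
}


def to_env_lines(preset):
    """Render a preset as PREPROCESSING_* lines suitable for a .env file."""
    # json.dumps yields true/false for booleans and ["eng"] for lists
    return [
        f"PREPROCESSING_{name.upper()}={json.dumps(value)}"
        for name, value in preset.items()
    ]


print("\n".join(to_env_lines(PRESETS["mixed"])))
```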

Configuration Reference¶

Core Settings¶

`PREPROCESSING_ENABLED`¶

Type: bool
Default: true
Description: Master switch for PDF preprocessing. When false, only layout analysis runs (if enable_analysis=true).
Use Case: Set to false to disable all OCR processing while keeping layout analysis.

`PREPROCESSING_ENABLE_ANALYSIS`¶

Type: bool
Default: true
Description: Enable PDF layout analysis and margin detection using PDFAnalyzer.
Use Case: Disable if you don't need structural metadata extraction.

OCR Settings¶

`PREPROCESSING_LANGUAGE`¶

Type: list[str]
Default: ["eng"]
Options: Three-letter Tesseract language codes (e.g., ["eng", "spa", "fra", "deu"]). Each code requires the matching Tesseract language pack to be installed.
Description: Languages for OCR recognition. Multiple languages increase processing time.
Use Case:
Single language papers: ["eng"]
Multilingual papers: ["eng", "spa"]
International collections: Add all expected languages
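
The examples in this guide write the language list in JSON form. The sketch below parses such a value and joins it into the `eng+spa` form that the Tesseract command line accepts for its `-l` option. The helper `tesseract_lang_arg` is hypothetical and only illustrates the format; it does not describe how Extralit Server passes languages to OCRmyPDF.

```python
import json


def tesseract_lang_arg(raw):
    """Turn a JSON list such as '["eng", "spa"]' into 'eng+spa'.

    Hypothetical helper; raises ValueError for anything that is not a
    non-empty list of non-empty strings.
    """
    langs = json.loads(raw)
    if not isinstance(langs, list) or not langs:
        raise ValueError("expected a non-empty JSON list of language codes")
    if not all(isinstance(code, str) and code for code in langs):
        raise ValueError("every language code must be a non-empty string")
    return "+".join(langs)


print(tesseract_lang_arg('["eng", "spa"]'))
```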

`PREPROCESSING_TESSERACT_TIMEOUT`¶

Type: int (seconds)
Default: 0
Description: Timeout for Tesseract OCR per page. 0 means no timeout (unlimited time), not "skip OCR". To skip OCR entirely, set PREPROCESSING_ENABLED=false.
Use Case:
0: No timeout - best for accuracy (default)
60-120: Standard scanned papers with time constraints
180-300: Complex layouts, low-quality scans
600+: Historical documents, very poor scan quality

`PREPROCESSING_FORCE_OCR`¶

Type: bool
Default: false
Description: Force OCR on all pages, even those with existing text.
Use Case:
true: Scanned documents, poor existing OCR
false: Digital PDFs, mixed collections (recommended)

`PREPROCESSING_SKIP_TEXT`¶

Type: bool
Default: true
Description: Skip OCR on pages that already have text. Only process image-only pages.
Use Case:
true: Digital PDFs, mixed collections (recommended)
false: Force OCR on all pages

`PREPROCESSING_REDO_OCR`¶

Type: bool
Default: false
Description: Redo OCR on pages that already have OCR text.
Use Case: Set to true only if existing OCR is poor quality.
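
OCRmyPDF treats its `--force-ocr`, `--skip-text`, and `--redo-ocr` options as mutually exclusive. Whether `PDFPreprocessingSettings` validates the combination is not stated in this guide, so a small pre-flight check like the one below can catch conflicting values early. The function name is hypothetical; the defaults mirror the ones documented above.

```python
def ocr_mode_conflicts(force_ocr=False, skip_text=True, redo_ocr=False):
    """Return the names of the OCR mode flags that are enabled together.

    An empty list means at most one mode is active. Hypothetical helper.
    """
    flags = (
        ("force_ocr", force_ocr),
        ("skip_text", skip_text),
        ("redo_ocr", redo_ocr),
    )
    enabled = [name for name, is_on in flags if is_on]
    return enabled if len(enabled) > 1 else []


# Defaults: only skip_text is on, so there is no conflict
print(ocr_mode_conflicts())
# Turning on force_ocr without turning off skip_text is flagged
print(ocr_mode_conflicts(force_ocr=True))
```

This is why the scanned-paper profile sets `PREPROCESSING_SKIP_TEXT=false` alongside `PREPROCESSING_FORCE_OCR=true`.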

Page Processing¶

`PREPROCESSING_ROTATE_PAGES`¶

Type: bool
Default: true
Description: Automatically rotate pages to the correct orientation based on the detected direction of their text.
Use Case: Keep true for scanned documents; safe for digital PDFs.

`PREPROCESSING_ROTATE_PAGES_THRESHOLD`¶

Type: float
Default: 2.0
Description: Confidence threshold for rotation (higher = more conservative).
Use Case: Lower (1.0-1.5) for aggressive rotation; higher (3.0+) to avoid false rotations.

`PREPROCESSING_DESKEW`¶

Type: bool
Default: false
Description: Correct skewed/tilted text in scanned documents.
Use Case:
true: Scanned documents with visible skew
false: Digital PDFs (adds processing time)

`PREPROCESSING_CLEAN`¶

Type: bool
Default: true
Description: Use unpaper to remove scan artifacts, borders, and noise.
Use Case:
true: Scanned documents, photocopies
false: Clean digital PDFs (saves processing time)

Output Optimization¶

`PREPROCESSING_OPTIMIZE`¶

Type: int
Default: 1
Options:
0: No optimization (largest file size)
1: Lossless optimization (recommended for digital PDFs)
2: Lossy compression (good for scanned documents)
3: Aggressive compression (smallest size, some quality loss)
Use Case:
Digital PDFs: 1 (preserve quality)
Scanned documents: 2 (balance size/quality)
Large collections: 3 (minimize storage)

`PREPROCESSING_PDF_RENDERER`¶

Type: str
Default: "hocr"
Options: "auto", "hocr", "sandwich"
Description:
"hocr": Embed invisible text layer (best for most documents)
"sandwich": Invisible text layer produced by Tesseract's own PDF renderer and merged with the page image (preserves appearance)
"auto": Let OCRmyPDF choose
Use Case:
Digital papers: "hocr" (smaller files)
Scanned papers: "hocr" or "sandwich" (depending on preference)

`PREPROCESSING_OUTPUT_TYPE`¶

Type: str
Default: "pdf"
Options: "pdf", "pdfa", "pdfa-1", "pdfa-2", "pdfa-3"
Description: Output PDF format. "pdf" skips PDF/A conversion.
Use Case: Use "pdf" for speed; PDF/A formats for long-term archival.

`PREPROCESSING_FAST_WEB_VIEW`¶

Type: int
Default: 999999 (effectively disabled)
Description: File-size threshold in MB above which the PDF is linearized (restructured) for fast web viewing. A very high value effectively disables linearization.
Use Case: Set to 1 for web-served PDFs; keep default for processing pipelines.

Performance Settings¶

`PREPROCESSING_JOBS`¶

Type: int
Default: 1
Description: Number of parallel worker processes for OCR.
Use Case:
Docker/limited CPU: 1 (avoid oversubscription)
Multi-core servers: 2-4 (balance speed/resources)
High-memory systems: 4-8 (maximum parallelism)
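
One way to turn the guidance above into a starting value is a simple heuristic based on the visible CPU count. The sketch below is an assumption, not a rule from Extralit Server: it uses one worker on small machines and otherwise half the cores, capped at a configurable maximum. Note that `os.cpu_count()` may report host cores inside a container, so an explicit count is safer under Docker CPU limits.

```python
import os


def suggest_jobs(cpu_count=None, cap=4):
    """Suggest a PREPROCESSING_JOBS value from the CPU count.

    Hypothetical heuristic: 1 worker for up to 2 CPUs, otherwise half
    the CPUs, never more than `cap`.
    """
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    if cpus <= 2:
        return 1
    return max(1, min(cap, cpus // 2))


print(f"PREPROCESSING_JOBS={suggest_jobs()}")
```

Raise `cap` toward 8 only on machines with enough memory for each worker.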

`PREPROCESSING_SKIP_BIG`¶

Type: float (MB)
Default: 100.0
Description: Skip OCR on images larger than this threshold to avoid timeouts.
Use Case:
High-quality scans: 50-100 MB
Standard documents: 100-200 MB
Large format papers: 200+ MB

`PREPROCESSING_PROGRESS_BAR`¶

Type: bool
Default: false
Description: Show progress bar during processing (useful for CLI, not for background jobs).
Use Case: true for interactive processing; false for production.

Troubleshooting¶

Issue: Timeout Errors¶

Symptoms: TesseractTimeout errors in logs

Solutions:
1. Increase PREPROCESSING_TESSERACT_TIMEOUT (try 300-600)
2. Lower PREPROCESSING_SKIP_BIG so that oversized, problematic pages are skipped
3. Reduce PREPROCESSING_JOBS to avoid resource contention
4. Set PREPROCESSING_CLEAN=false to skip image preprocessing

Issue: Poor OCR Quality¶

Symptoms: Garbled or missing text extraction

Solutions:
1. Set PREPROCESSING_DESKEW=true for skewed scans
2. Set PREPROCESSING_CLEAN=true to remove artifacts
3. Set PREPROCESSING_FORCE_OCR=true (with PREPROCESSING_SKIP_TEXT=false, as in the scanned-paper profile) to redo existing OCR
4. Add the document's languages to PREPROCESSING_LANGUAGE
5. Adjust PREPROCESSING_ROTATE_PAGES_THRESHOLD if pages are incorrectly rotated

Issue: Processing Too Slow¶

Symptoms: Long wait times for document processing

Solutions:
1. Set PREPROCESSING_ENABLED=false for digital PDFs (only run analysis)
2. Set a finite PREPROCESSING_TESSERACT_TIMEOUT (try 60-120); the default of 0 means unlimited
3. Ensure PREPROCESSING_SKIP_TEXT=true for hybrid documents
4. Reduce the PREPROCESSING_OPTIMIZE level
5. Set PREPROCESSING_CLEAN=false and PREPROCESSING_DESKEW=false

Issue: High Memory Usage¶

Symptoms: Out-of-memory errors, system slowdown

Solutions:
1. Set PREPROCESSING_JOBS=1 (most important)
2. Reduce the PREPROCESSING_SKIP_BIG threshold (e.g., 50 MB)
3. Set PREPROCESSING_OPTIMIZE=3 to reduce output size
4. Process documents in smaller batches
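
Processing in smaller batches can be as simple as slicing the list of input files before handing each group to the preprocessor. The generator below is a generic sketch; `in_batches` is a hypothetical helper and the batch size is an arbitrary example value to tune against available memory.

```python
from itertools import islice


def in_batches(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


paths = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"]
for batch in in_batches(paths, 2):
    # Each batch would be preprocessed and released before the next starts
    print(batch)
```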

Integration Examples¶

Programmatic Configuration (Python)¶

from pathlib import Path

from extralit_server.contexts.document.preprocessing import (
    PDFPreprocessor,
    PDFPreprocessingSettings,
)

# Custom settings for scanned documents
settings = PDFPreprocessingSettings(
    enabled=True,
    enable_analysis=True,
    force_ocr=True,
    skip_text=False,  # as in the scanned-paper profile; the default is True
    tesseract_timeout=180,
    language=["eng"],
    deskew=True,
    clean=True,
    optimize=2,
)

# Read the source PDF into memory
pdf_bytes = Path("document.pdf").read_bytes()

preprocessor = PDFPreprocessor(settings=settings)
result = preprocessor.preprocess(pdf_bytes, "document.pdf")

# Access processed data and metadata
processed_pdf = result.processed_data
metadata = result.metadata
print(f"Processing time: {metadata.processing_time:.2f}s")
print(f"Analysis results: {metadata.analysis_results}")

Environment Variables (Docker/Production)¶

Create a .env file:

# Extralit Server Configuration
EXTRALIT_DATABASE_URL=postgresql://user:pass@localhost/extralit
EXTRALIT_REDIS_URL=redis://localhost:6379/0

# PDF Preprocessing for Scanned Documents
PREPROCESSING_ENABLED=true
PREPROCESSING_ENABLE_ANALYSIS=true
PREPROCESSING_FORCE_OCR=true
PREPROCESSING_SKIP_TEXT=false          # As in the scanned-paper profile
PREPROCESSING_TESSERACT_TIMEOUT=180
PREPROCESSING_LANGUAGE=["eng", "spa"]
PREPROCESSING_DESKEW=true
PREPROCESSING_CLEAN=true
PREPROCESSING_OPTIMIZE=2
PREPROCESSING_JOBS=2

| File | Purpose |
| --- | --- |
| `preprocessing.py` | Core preprocessing logic and settings |
| `margin.py` | PDF layout analysis and margin detection |
| `api/schemas/v1/document/preprocessing.py` | API metadata schema |

Important Notes¶

Environment Variables: All settings can be overridden via PREPROCESSING_* env vars
OCRmyPDF Dependency: Requires ocrmypdf and tesseract installed
Lazy Loading: ocrmypdf is lazy-loaded to avoid import overhead
Error Handling: Falls back to temp files if BytesIO approach fails
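
The dependency and lazy-loading notes above can be illustrated with a short sketch. It checks whether the `ocrmypdf` package and the `tesseract` executable are discoverable, and defers the actual import until first use. Both function names are hypothetical, and this is not the code Extralit Server uses internally.

```python
import importlib
import importlib.util
import shutil


def missing_dependencies():
    """Return the names of required tools that cannot be found."""
    missing = []
    # find_spec returns None when the package is not installed
    if importlib.util.find_spec("ocrmypdf") is None:
        missing.append("ocrmypdf")
    # which returns None when the executable is not on PATH
    if shutil.which("tesseract") is None:
        missing.append("tesseract")
    return missing


_ocrmypdf = None


def get_ocrmypdf():
    """Import ocrmypdf on first use so module import stays cheap."""
    global _ocrmypdf
    if _ocrmypdf is None:
        _ocrmypdf = importlib.import_module("ocrmypdf")
    return _ocrmypdf


problems = missing_dependencies()
if problems:
    print("Missing dependencies:", ", ".join(problems))
else:
    print("ocrmypdf and tesseract are available")
```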
