Page count mismatches

When the two PDFs have different page counts, the tool prints a warning and still runs the comparison instead of aborting.

Scenario: Extra pages in one PDF

python pdf_visual_diff.py example-pdfs/test1_original.pdf example-pdfs/test3_different_pages.pdf

Output behavior

Warning: PDFs have different page counts. PDF1: 1 pages, PDF2: 2 pages.
Comparing up to the lower page count.
Extra pages only in PDF2: 2
Diff images saved to: /path/to/diff_output/20261202_171728_diff/

Generated files

For extra pages, the tool saves snapshots with descriptive filenames:
diff_output/20261202_171728_diff/
├── extra_page_2_only_in_pdf2.png
└── results.json
The tool compares pages up to the minimum page count, then separately captures any extra pages from the longer PDF.
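The split between compared pages and extra pages can be sketched in a few lines. This is an illustration of the behavior described here, not the tool's actual code; the function name `split_pages` is hypothetical, and page numbers are 1-based to match the output shown above.

```python
def split_pages(pdf1_pages: int, pdf2_pages: int):
    """Return (pages present in both PDFs, extra pages, which PDF holds the extras)."""
    common = min(pdf1_pages, pdf2_pages)
    longest = max(pdf1_pages, pdf2_pages)

    compared = list(range(1, common + 1))        # compared pairwise
    extra = list(range(common + 1, longest + 1))  # only in the longer PDF

    if pdf1_pages == pdf2_pages:
        extra_in = None
    elif pdf1_pages > pdf2_pages:
        extra_in = "PDF1"
    else:
        extra_in = "PDF2"
    return compared, extra, extra_in


compared, extra, extra_in = split_pages(1, 2)
print(compared, extra, extra_in)  # [1] [2] PDF2
```

For the scenario above (1 page versus 2 pages), page 1 is compared and page 2 is reported as an extra page in PDF2.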

Results JSON structure

{
  "timestamp": "20261202_171728",
  "status": "error",
  "description": "Extra pages only in PDF2: 2",
  "pdf1": "/absolute/path/to/test1_original.pdf",
  "pdf2": "/absolute/path/to/test3_different_pages.pdf",
  "pdf1_pages": 1,
  "pdf2_pages": 2,
  "threshold": 1,
  "identical": false,
  "diff_pages": [],
  "extra_pages": [2],
  "extra_pages_in": "PDF2"
}

Page size mismatches

The tool automatically handles PDFs with different page dimensions.

How it works

From the source code (pdf_visual_diff.py:45-47):
if pil_img1.size != pil_img2.size:
    # Resize images to be the same size for comparison
    pil_img2 = pil_img2.resize(pil_img1.size, Image.LANCZOS)
The Structural Similarity Index (SSIM) requires both images to have identical dimensions. The tool resizes the second PDF’s pages to match the first PDF’s dimensions using high-quality LANCZOS resampling.

Important considerations

# These PDFs contain identical content but different page sizes
python pdf_visual_diff.py letter-size.pdf a4-size.pdf
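Resizing can introduce minor visual artifacts. When the two page sizes also have different aspect ratios, as Letter and A4 do, the resize stretches the content, which can lower the SSIM score even when the content is the same. For best results, ensure both PDFs use the same page dimensions.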
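A rough calculation shows what the resize does to a Letter/A4 pair. It assumes the standard page sizes in points (Letter 612 x 792, A4 rounded to 595 x 842) and a 2x render zoom; the helper names are made up for illustration and are not part of the tool.

```python
ZOOM = 2  # assumed render zoom


def rendered_size(width_pt, height_pt, zoom=ZOOM):
    """Pixel dimensions of a page rendered at the given zoom."""
    return round(width_pt * zoom), round(height_pt * zoom)


def resize_scale(target, source):
    """Horizontal and vertical scale factors applied when source is resized to target."""
    return target[0] / source[0], target[1] / source[1]


letter = rendered_size(612, 792)  # first PDF, keeps its size
a4 = rendered_size(595, 842)      # second PDF, gets resized
sx, sy = resize_scale(letter, a4)

print(letter, a4)                                # (1224, 1584) (1190, 1684)
print(f"x scale: {sx:.3f}, y scale: {sy:.3f}")   # x scale: 1.029, y scale: 0.941
```

The two factors are not equal, so the A4 page is widened by about 3% and shortened by about 6%, and text that is identical in both files no longer lines up pixel for pixel.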

Tuning the similarity threshold

The SSIM threshold controls how sensitive the comparison is. The default is 1.0 (exact match).

Understanding SSIM values

In theory the Structural Similarity Index can range from -1.0 to 1.0, but for rendered pages the score normally falls between 0.0 and 1.0:
  • 1.0: Perfect match
  • 0.999: Nearly identical (minor rendering differences)
  • 0.95: Noticeable differences
  • 0.5: Significant differences
  • 0.0: Completely different

Adjusting sensitivity

# More tolerant (ignores minor rendering variations)
python pdf_visual_diff.py file1.pdf file2.pdf --threshold 0.999
# Default behavior (exact match required)
python pdf_visual_diff.py file1.pdf file2.pdf --threshold 1

When to adjust the threshold

Different systems may render fonts slightly differently. If you’re seeing false positives from font anti-aliasing, try --threshold 0.999.
Some PDF libraries introduce minor pixel differences even when content is identical. A threshold of 0.995 to 0.999 can help filter these out.
Keep the default 1.0 threshold when you want to catch even the smallest visual differences during regression testing.

Example: Filtering rendering artifacts

# With default threshold (1.0)
python pdf_visual_diff.py reference.pdf generated.pdf
Visual differences found on pages: 1, 2, 3, 4, 5
Diff images saved to: /path/to/diff_output/20261202_171728_diff/
# With relaxed threshold (0.999)
python pdf_visual_diff.py reference.pdf generated.pdf --threshold 0.999
All pages are visually identical.
The threshold is applied per-page. Each page’s SSIM score must meet or exceed the threshold to be considered identical.
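The per-page rule can be written out as a short sketch. The scores below are made up for illustration, and `pages_below_threshold` is a hypothetical name rather than a function from the tool.

```python
def pages_below_threshold(scores, threshold=1.0):
    """Return 1-based page numbers whose SSIM score is below the threshold."""
    return [page for page, score in enumerate(scores, start=1) if score < threshold]


scores = [1.0, 0.9995, 0.9991, 0.97]  # made-up per-page SSIM scores

print(pages_below_threshold(scores))         # [2, 3, 4]
print(pages_below_threshold(scores, 0.999))  # [4]
```

With the default threshold, any page that is not a perfect match is flagged. Relaxing the threshold to 0.999 drops pages 2 and 3 from the list while still catching page 4.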

Combining multiple options

You can combine output directory and threshold settings:
python pdf_visual_diff.py \
  original.pdf \
  modified.pdf \
  --output regression_tests/run_1 \
  --threshold 0.999

High-resolution rendering

The tool renders PDFs at 2x zoom (144 DPI) for better difference detection:
zoom = 2  # DPI = 144
mat = fitz.Matrix(zoom, zoom)
This ensures that small visual differences are captured accurately in the diff images.
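PDF coordinates are measured in points, with 72 points per inch, so a zoom factor of 1 corresponds to 72 DPI and the conversion is a single multiplication. The helpers below are hypothetical and only illustrate the arithmetic; in the snippet above the zoom is a fixed value in the source.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def zoom_to_dpi(zoom):
    """Effective render resolution for a given zoom factor."""
    return POINTS_PER_INCH * zoom


def dpi_to_zoom(dpi):
    """Zoom factor needed to reach a target resolution."""
    return dpi / POINTS_PER_INCH


print(zoom_to_dpi(2))              # 144
print(round(dpi_to_zoom(300), 2))  # 4.17
```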

Batch testing workflow

For CI/CD pipelines, you might want to test multiple PDF pairs:
#!/bin/bash

for test_case in test_cases/*.json; do
  pdf1=$(jq -r '.reference' "$test_case")
  pdf2=$(jq -r '.generated' "$test_case")
  # Fall back to the default threshold when the test case does not set one
  threshold=$(jq -r '.threshold // 1' "$test_case")
  
  echo "Testing: $test_case"
  python pdf_visual_diff.py "$pdf1" "$pdf2" \
    --output "results/$(basename "$test_case" .json)" \
    --threshold "$threshold"
done
The tool doesn’t currently return specific exit codes for failures. Check the results.json file’s status field to determine test outcomes programmatically.
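One way to turn the results files into a pass/fail signal is a small checker like the sketch below. It relies only on the `identical`, `diff_pages`, and `extra_pages` fields shown in the example JSON earlier, because the full set of `status` values is not documented here. The function names are hypothetical, and the paths to each `results.json` are assumed to be supplied by the caller.

```python
import json
from pathlib import Path


def run_passed(results_path):
    """True if a results.json reports identical PDFs with no diff or extra pages."""
    results = json.loads(Path(results_path).read_text(encoding="utf-8"))
    return (
        results.get("identical") is True
        and not results.get("diff_pages")
        and not results.get("extra_pages")
    )


def failed_runs(results_paths):
    """Return the subset of results.json paths that did not pass."""
    return [str(path) for path in results_paths if not run_passed(path)]
```

A CI wrapper could collect the paths after the batch loop, call `failed_runs`, and exit with a non-zero code when the returned list is not empty.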