Overview

Each comparison run generates a timestamped output directory containing:
  1. Diff images - Visual representations of differences
  2. Extra page images - Pages that exist in only one PDF
  3. results.json - Machine-readable comparison report

Directory structure

The output follows this structure:
<output_dir>/
└── <timestamp>_diff/
    ├── diff_page_1.png
    ├── diff_page_3.png
    ├── extra_page_5_only_in_pdf2.png
    └── results.json
Implementation: pdf_visual_diff.py:14-17
timestamp = datetime.now().strftime("%Y%d%m_%H%M%S")
output_dir = os.path.join(output_dir, f"{timestamp}_diff")
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
If PDFs are identical, the directory is still created but will only contain results.json.
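Because the format string above puts the day before the month, run directory names do not sort chronologically when compared as plain strings. The following is a minimal sketch of how a consumer might recover the actual run time from a directory name; the helper names parse_run_dir and sort_runs_chronologically are hypothetical and not part of pdf_visual_diff.py.

```python
from datetime import datetime

# Format string copied from the implementation excerpt (day precedes month).
RUN_DIR_FORMAT = "%Y%d%m_%H%M%S"


def parse_run_dir(name: str) -> datetime:
    """Convert a '<timestamp>_diff' directory name into a datetime."""
    stamp = name.removesuffix("_diff")  # requires Python 3.9+
    return datetime.strptime(stamp, RUN_DIR_FORMAT)


def sort_runs_chronologically(names: list[str]) -> list[str]:
    """Order run directory names by the time they encode, oldest first."""
    return sorted(names, key=parse_run_dir)
```

Sorting by the parsed value rather than by name avoids treating a run from 31 January as newer than one from 1 February.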

Diff images

Diff images highlight visual differences between corresponding pages.

File naming

diff_page_<N>.png
Where N is the 1-indexed page number. Examples:
  • diff_page_1.png - Differences on page 1
  • diff_page_3.png - Differences on page 3
  • diff_page_10.png - Differences on page 10
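Since the page number is not zero-padded, a plain alphabetical listing places diff_page_10.png before diff_page_3.png. A small sketch of extracting the page numbers and ordering them numerically follows; the function name diff_page_numbers is an assumption made for illustration.

```python
import re

# Matches the diff image naming pattern described above.
DIFF_NAME = re.compile(r"^diff_page_(\d+)\.png$")


def diff_page_numbers(filenames):
    """Return the 1-indexed page numbers of all diff images, in numeric order."""
    pages = []
    for name in filenames:
        match = DIFF_NAME.match(name)
        if match:
            pages.append(int(match.group(1)))
    return sorted(pages)
```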

How diff images are generated

The process involves four steps:
1. Render pages to images (pdf_visual_diff.py:31-43)
zoom = 2  # DPI = 144
mat = fitz.Matrix(zoom, zoom)

page1 = pdf1.load_page(i)
page2 = pdf2.load_page(i)

img1 = page1.get_pixmap(matrix=mat)
img2 = page2.get_pixmap(matrix=mat)

pil_img1 = Image.frombytes("RGB", [img1.width, img1.height], img1.samples)
pil_img2 = Image.frombytes("RGB", [img2.width, img2.height], img2.samples)
2. Calculate pixel-level differences (pdf_visual_diff.py:59-60)
# Use ImageChops to find the difference
diff = ImageChops.difference(pil_img1, pil_img2)
3. Apply threshold to make differences visible (pdf_visual_diff.py:62-63)
# Threshold to make the diff more visible
thresholded_diff = diff.point(lambda p: 255 if p > 20 else 0)
4. Highlight differences in red (pdf_visual_diff.py:65-69)
if thresholded_diff.getbbox():
    drawing_layer = Image.new("RGBA", pil_img1.size, (0,0,0,0))
    drawing_layer.paste((255,0,0,128), mask=thresholded_diff.convert('L'))
    highlighted_img = Image.alpha_composite(pil_img1.convert("RGBA"), drawing_layer)
    highlighted_img.convert("RGB").save(os.path.join(output_dir, f"diff_page_{i+1}.png"))
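
To make steps 2 and 3 concrete, here is a dependency-free sketch that applies the same arithmetic to individual RGB pixels. It mirrors what ImageChops.difference and the point() threshold do per channel, but it is an illustration only: the helper names are hypothetical and the tool itself works on whole Pillow images.

```python
THRESHOLD = 20  # same cut-off as the excerpt: channel values above 20 become 255


def pixel_difference(p1, p2):
    """Per-channel absolute difference of two RGB pixels."""
    return tuple(abs(a - b) for a, b in zip(p1, p2))


def apply_threshold(diff_pixel, cutoff=THRESHOLD):
    """Map each channel to 255 if it exceeds the cut-off, otherwise 0."""
    return tuple(255 if c > cutoff else 0 for c in diff_pixel)


def has_visible_difference(pixels1, pixels2):
    """True if any channel of any pixel survives the threshold.

    This corresponds to getbbox() returning a bounding box instead of None.
    """
    return any(
        any(apply_threshold(pixel_difference(p1, p2)))
        for p1, p2 in zip(pixels1, pixels2)
    )
```

A channel difference of exactly 20 is discarded, while 21 is kept, because the comparison is strictly greater than.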

Visual characteristics

  • Base image: Original content from PDF1
  • Red overlay: Areas with differences (semi-transparent, 50% opacity)
  • Unchanged areas: Original colors from PDF1
Diff images are only saved when thresholded_diff.getbbox() returns a bounding box. If differences exist but are too subtle (every per-channel difference is 20 or less), no image is saved, even if the page's SSIM is below the threshold.
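
One practical consequence is that a page could be reported as different without a matching image on disk. The sketch below cross-checks the two, assuming that such a page still appears in the diff_pages field of results.json (the excerpts shown here do not confirm this); pages_without_diff_image is a hypothetical helper.

```python
def pages_without_diff_image(diff_pages, filenames):
    """Return pages listed in diff_pages that have no diff_page_<N>.png file.

    diff_pages: list of 1-indexed page numbers from results.json.
    filenames:  names of the files found in the run directory.
    """
    present = set(filenames)
    return [page for page in diff_pages if f"diff_page_{page}.png" not in present]
```

In practice, filenames would come from os.listdir() on the run directory and diff_pages from the parsed results.json.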

Extra page images

When PDFs have different page counts, extra pages are rendered as standalone images.

File naming

extra_page_<N>_only_in_pdf<X>.png
Where:
  • N = Page number (1-indexed)
  • X = Which PDF contains the extra page (1 or 2)
Examples:
  • extra_page_5_only_in_pdf1.png - Page 5 exists only in the first PDF
  • extra_page_9_only_in_pdf2.png - Page 9 exists only in the second PDF
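
The naming pattern carries both the page number and the source PDF, so a consumer can recover them without opening results.json. A minimal sketch, with parse_extra_page as an illustrative name rather than part of the tool:

```python
import re

# Matches extra_page_<N>_only_in_pdf<X>.png where X is 1 or 2.
EXTRA_NAME = re.compile(r"^extra_page_(\d+)_only_in_pdf([12])\.png$")


def parse_extra_page(name):
    """Return (page_number, "PDF1" or "PDF2"), or None if the name does not match."""
    match = EXTRA_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1)), f"PDF{match.group(2)}"
```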

Generation logic

For extra pages in PDF1: (pdf_visual_diff.py:74-81)
if len(pdf1) > len(pdf2):
    longer_pdf = "PDF1"
    for i in range(page_count, len(pdf1)):
        extra_pages.append(i + 1)
        page = pdf1.load_page(i)
        img = page.get_pixmap(matrix=mat)
        pil_img = Image.frombytes("RGB", [img.width, img.height], img.samples)
        pil_img.save(os.path.join(output_dir, f"extra_page_{i+1}_only_in_pdf1.png"))
For extra pages in PDF2: (pdf_visual_diff.py:82-89)
elif len(pdf2) > len(pdf1):
    longer_pdf = "PDF2"
    for i in range(page_count, len(pdf2)):
        extra_pages.append(i + 1)
        page = pdf2.load_page(i)
        img = page.get_pixmap(matrix=mat)
        pil_img = Image.frombytes("RGB", [img.width, img.height], img.samples)
        pil_img.save(os.path.join(output_dir, f"extra_page_{i+1}_only_in_pdf2.png"))
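
The page-number arithmetic in both branches can be isolated from rendering. The sketch below reproduces it as a pure function, assuming page_count is the smaller of the two page counts (the excerpts do not show where page_count is defined); extra_page_numbers is a hypothetical name.

```python
def extra_page_numbers(pdf1_pages, pdf2_pages):
    """Return (longer_pdf, extra_pages) for two page counts.

    longer_pdf is "PDF1", "PDF2", or None when the counts match.
    extra_pages holds the 1-indexed pages that exist in only the longer PDF.
    """
    page_count = min(pdf1_pages, pdf2_pages)  # assumption: pages compared pairwise
    if pdf1_pages > pdf2_pages:
        return "PDF1", [i + 1 for i in range(page_count, pdf1_pages)]
    if pdf2_pages > pdf1_pages:
        return "PDF2", [i + 1 for i in range(page_count, pdf2_pages)]
    return None, []
```

For a 10-page PDF1 and an 8-page PDF2, this yields pages 9 and 10 as extra pages in PDF1.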

Results JSON file

Every comparison generates a results.json file with detailed metadata.

Schema

Implementation: pdf_visual_diff.py:109-126
results = {
    "timestamp": timestamp,
    "status": "success" if (not diff_pages and not extra_pages) else "error",
    "description": description,
    "pdf1": os.path.abspath(pdf1_path),
    "pdf2": os.path.abspath(pdf2_path),
    "pdf1_pages": pdf1_page_count,
    "pdf2_pages": pdf2_page_count,
    "threshold": threshold,
    "identical": not diff_pages and not extra_pages,
    "diff_pages": diff_pages,
    "extra_pages": extra_pages,
    "extra_pages_in": longer_pdf,
}

with open(os.path.join(output_dir, "results.json"), "w") as f:
    json.dump(results, f, indent=2)

Field reference

timestamp
string
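Timestamp when the comparison was run.
Format: YYYYDDMM_HHMMSS (year, day, month, then hour, minute, second). The day comes before the month, so values do not sort chronologically across months.
Example: "20260304_143052" (3 April 2026, 14:30:52)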
status
string
Comparison result status.
  • "success" - PDFs are identical
  • "error" - Differences or extra pages found
Note: "error" does not mean the tool failed, only that differences exist.
description
string
Human-readable summary of the comparison result.
Examples:
  • "All pages are visually identical."
  • "Visual differences found on pages: 1, 3, 5"
  • "Visual differences found on pages: 2 Extra pages only in PDF1: 9, 10"
pdf1
string
Absolute path to the first PDF file.
Example: "/home/user/documents/baseline.pdf"
pdf2
string
Absolute path to the second PDF file.
Example: "/home/user/documents/updated.pdf"
pdf1_pages
integer
Total number of pages in the first PDF.
Example: 10
pdf2_pages
integer
Total number of pages in the second PDF.
Example: 8
threshold
float
SSIM threshold value used for the comparison.
Example: 0.999
identical
boolean
Whether the PDFs are visually identical.
  • true - No differences found
  • false - Differences or extra pages exist
Logic: pdf_visual_diff.py:118
"identical": not diff_pages and not extra_pages
diff_pages
array
Array of page numbers (1-indexed) with visual differences.
Examples:
  • [] - No differences
  • [1, 3, 5] - Pages 1, 3, and 5 have differences
extra_pages
array
Array of page numbers (1-indexed) that exist in only one PDF.
Examples:
  • [] - Same page count
  • [9, 10] - Pages 9 and 10 exist in only one PDF
extra_pages_in
string | null
Which PDF contains the extra pages.
  • "PDF1" - First PDF has extra pages
  • "PDF2" - Second PDF has extra pages
  • null - PDFs have the same page count
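
The field reference above can be turned into a light consistency check for consumers that do not want to trust the file blindly. This is a sketch under the assumption that the documented types and the relationship between status, identical, diff_pages, and extra_pages always hold; validate_results is not part of pdf_visual_diff.py.

```python
# Expected Python types after json.load(), taken from the field reference.
EXPECTED_TYPES = {
    "timestamp": str, "status": str, "description": str,
    "pdf1": str, "pdf2": str,
    "pdf1_pages": int, "pdf2_pages": int,
    "threshold": (int, float),
    "identical": bool,
    "diff_pages": list, "extra_pages": list,
    "extra_pages_in": (str, type(None)),
}


def validate_results(results):
    """Return a list of problems found in a parsed results.json (empty if none)."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in results:
            problems.append(f"missing field: {field}")
        elif not isinstance(results[field], expected):
            problems.append(f"wrong type for {field}")
    if problems:
        return problems  # skip cross-field checks on malformed input

    # identical is defined as "no diff pages and no extra pages".
    no_findings = not results["diff_pages"] and not results["extra_pages"]
    if results["identical"] != no_findings:
        problems.append("identical does not match diff_pages/extra_pages")
    # status is "success" exactly when the PDFs are identical.
    if (results["status"] == "success") != results["identical"]:
        problems.append("status does not match identical")
    return problems
```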

Example JSON outputs

{
  "timestamp": "20260304_143052",
  "status": "success",
  "description": "All pages are visually identical.",
  "pdf1": "/home/user/baseline.pdf",
  "pdf2": "/home/user/updated.pdf",
  "pdf1_pages": 5,
  "pdf2_pages": 5,
  "threshold": 0.999,
  "identical": true,
  "diff_pages": [],
  "extra_pages": [],
  "extra_pages_in": null
}

Programmatic usage

Parsing results in shell scripts

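#!/bin/bash
# Use the most recently modified results.json; several runs may share the output directory.
RESULT_FILE=$(ls -t diff_output/*_diff/results.json 2>/dev/null | head -n 1)

if [ -z "$RESULT_FILE" ]; then
  echo "No results.json found in diff_output" >&2
  exit 2
fi

if jq -e '.identical == true' "$RESULT_FILE" > /dev/null; then
  echo "✓ PDFs are identical"
  exit 0
else
  echo "✗ Differences found"
  jq -r '.description' "$RESULT_FILE"
  exit 1
fi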

Parsing results in Python

import json
import glob
import os

def get_latest_result(output_dir):
    """Find and parse the most recent results.json file."""
    pattern = os.path.join(output_dir, "*_diff", "results.json")
    result_files = glob.glob(pattern)
    
    if not result_files:
        return None
    
    # Get the most recent file
    latest_file = max(result_files, key=os.path.getmtime)
    
    with open(latest_file, 'r') as f:
        return json.load(f)

# Usage
result = get_latest_result("diff_output")

if result:
    if result["identical"]:
        print("✓ PDFs are identical")
    else:
        print(f"✗ {result['description']}")
        
        if result["diff_pages"]:
            print(f"  Diff pages: {result['diff_pages']}")
        
        if result["extra_pages"]:
            print(f"  Extra pages in {result['extra_pages_in']}: {result['extra_pages']}")

CI/CD integration example

# GitHub Actions example
name: PDF Regression Test

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Install dependencies
        run: |
          pip install PyMuPDF Pillow scikit-image numpy
      
      - name: Generate baseline PDF
        run: python generate_report.py --output baseline.pdf
      
      - name: Generate updated PDF
        run: python generate_report.py --output updated.pdf
      
      - name: Compare PDFs
        run: |
          python pdf_visual_diff.py baseline.pdf updated.pdf \
            --output ci_results \
            --threshold 0.999
      
      - name: Check results
        run: |
          if jq -e '.identical == true' ci_results/*/results.json > /dev/null; then
            echo "✓ PDFs are identical"
          else
            echo "✗ Visual differences detected"
            jq '.' ci_results/*/results.json
            exit 1
          fi
      
      - name: Upload diff artifacts
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: pdf-diffs
          path: ci_results/

File size considerations

Typical sizes

  • Diff images: 100KB - 5MB per page (depends on page complexity)
  • Extra page images: 50KB - 3MB per page
  • results.json: < 1KB
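
To see how a particular run compares with these ranges, the sizes can be grouped by file type. The sketch below works on (filename, size in bytes) pairs so it stays independent of the filesystem; summarize_sizes and its category names are assumptions made for illustration.

```python
def summarize_sizes(entries):
    """Total the sizes of a run directory's files by category.

    entries: iterable of (filename, size_in_bytes) pairs.
    """
    totals = {"diff_images": 0, "extra_page_images": 0, "other": 0}
    for name, size in entries:
        if name.startswith("diff_page_") and name.endswith(".png"):
            totals["diff_images"] += size
        elif name.startswith("extra_page_") and name.endswith(".png"):
            totals["extra_page_images"] += size
        else:
            totals["other"] += size  # results.json and anything unexpected
    return totals
```

The pairs can be produced with os.scandir() on a run directory, using each entry's name and stat().st_size.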

Storage management

# Archive results older than 30 days
find diff_output -maxdepth 1 -type d -name "*_diff" -mtime +30 | while IFS= read -r dir; do
  # Remove the directory only if the archive was written successfully
  tar -czf "${dir}.tar.gz" "$dir" && rm -rf "$dir"
done

# Delete images older than 7 days but keep JSON files
find diff_output -type f -name "*.png" -mtime +7 -delete

# Delete oldest results while the directory exceeds 1 GB (du -sb requires GNU coreutils)
while [ "$(du -sb diff_output | cut -f1)" -gt 1073741824 ]; do
  OLDEST=$(ls -td diff_output/*_diff 2>/dev/null | tail -n 1)
  [ -n "$OLDEST" ] || break  # nothing left to delete
  rm -rf "$OLDEST"
done
