Output formats - PDF Visual Regression

Overview

Each comparison run generates a timestamped output directory containing:

Diff images - Visual representations of differences
Extra page images - Pages that exist in only one PDF
results.json - Machine-readable comparison report

Directory structure

The output follows this structure:

<output_dir>/
└── <timestamp>_diff/
    ├── diff_page_1.png
    ├── diff_page_3.png
    ├── extra_page_5_only_in_pdf2.png
    └── results.json

Implementation: pdf_visual_diff.py:14-17

timestamp = datetime.now().strftime("%Y%d%m_%H%M%S")
output_dir = os.path.join(output_dir, f"{timestamp}_diff")
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

If the PDFs are identical, the directory is still created but contains only results.json.
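Because the format string above puts the day before the month, run directory names do not sort chronologically as plain strings. The following is a minimal sketch of recovering the run time from a directory name, assuming the naming shown above; `parse_run_timestamp` and `RUN_FORMAT` are hypothetical names, not part of the tool.

```python
from datetime import datetime

# Same format string as the implementation excerpt: year, DAY, month.
RUN_FORMAT = "%Y%d%m_%H%M%S"
SUFFIX = "_diff"

def parse_run_timestamp(dir_name):
    """Return the datetime encoded in a '<timestamp>_diff' directory name."""
    stamp = dir_name[:-len(SUFFIX)] if dir_name.endswith(SUFFIX) else dir_name
    return datetime.strptime(stamp, RUN_FORMAT)

# 31 January 2026 and 1 February 2026, both at 09:00:00.
runs = ["20263101_090000_diff", "20260102_090000_diff"]

# Sorting by the parsed time, not by name, gives chronological order.
ordered = sorted(runs, key=parse_run_timestamp)
```

Sorting the same list by name would put the February run first, which is why the parsed value (or the file modification time) is the safer sort key.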

Diff images

Diff images highlight visual differences between corresponding pages.

File naming

diff_page_<N>.png

Where N is the page number (1-indexed). Examples:

diff_page_1.png - Differences on page 1
diff_page_3.png - Differences on page 3
diff_page_10.png - Differences on page 10

How diff images are generated

The process involves four steps:

1. Render pages to images (pdf_visual_diff.py:31-43)

zoom = 2  # DPI = 144
mat = fitz.Matrix(zoom, zoom)

page1 = pdf1.load_page(i)
page2 = pdf2.load_page(i)

img1 = page1.get_pixmap(matrix=mat)
img2 = page2.get_pixmap(matrix=mat)

pil_img1 = Image.frombytes("RGB", [img1.width, img1.height], img1.samples)
pil_img2 = Image.frombytes("RGB", [img2.width, img2.height], img2.samples)

2. Calculate pixel-level differences (pdf_visual_diff.py:59-60)

# Use ImageChops to find the difference
diff = ImageChops.difference(pil_img1, pil_img2)

3. Apply threshold to make differences visible (pdf_visual_diff.py:62-63)

# Threshold to make the diff more visible
thresholded_diff = diff.point(lambda p: 255 if p > 20 else 0)

4. Highlight differences in red (pdf_visual_diff.py:65-69)

if thresholded_diff.getbbox():
    drawing_layer = Image.new("RGBA", pil_img1.size, (0,0,0,0))
    drawing_layer.paste((255,0,0,128), mask=thresholded_diff.convert('L'))
    highlighted_img = Image.alpha_composite(pil_img1.convert("RGBA"), drawing_layer)
    highlighted_img.convert("RGB").save(os.path.join(output_dir, f"diff_page_{i+1}.png"))
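Steps 2 to 4 can be tried in isolation on synthetic images, without rendering a PDF. This is a minimal sketch that assumes Pillow is installed and mirrors the excerpts above; `highlight_differences` and its `cutoff` parameter are illustrative names, not part of pdf_visual_diff.py.

```python
from PIL import Image, ImageChops

def highlight_differences(img1, img2, cutoff=20):
    """Return img1 with a red overlay on changed pixels, or None if nothing exceeds cutoff."""
    # Per-channel absolute difference between the two images.
    diff = ImageChops.difference(img1, img2)
    # Push every channel value above the cutoff to 255 and everything else to 0.
    thresholded = diff.point(lambda p: 255 if p > cutoff else 0)
    # getbbox() returns None for an all-zero image, meaning nothing to highlight.
    if thresholded.getbbox() is None:
        return None
    overlay = Image.new("RGBA", img1.size, (0, 0, 0, 0))
    overlay.paste((255, 0, 0, 128), mask=thresholded.convert("L"))
    return Image.alpha_composite(img1.convert("RGBA"), overlay).convert("RGB")

# Two 4x4 white images that differ in a single pixel.
base = Image.new("RGB", (4, 4), "white")
changed = base.copy()
changed.putpixel((1, 1), (0, 0, 0))

result = highlight_differences(base, changed)        # image with one red-tinted pixel
unchanged = highlight_differences(base, base.copy())  # None: nothing to highlight
```

Only the differing pixel is tinted red; every other pixel keeps its original color from the first image.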

Visual characteristics

Colors

Base image: Original content from PDF1
Red overlay: Areas with differences (semi-transparent, 50% opacity)
Unchanged areas: Original colors from PDF1
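The rendering step uses zoom = 2, which the code comment equates to 144 DPI because PDF page dimensions are measured in points (72 per inch). The arithmetic can be sketched as follows; `rendered_size` is a hypothetical helper, and the pixel dimensions PyMuPDF actually produces may differ by a pixel because of rounding.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points

def rendered_size(width_pt, height_pt, zoom=2):
    """Approximate pixel size and DPI of a page rendered with fitz.Matrix(zoom, zoom)."""
    dpi = POINTS_PER_INCH * zoom
    return round(width_pt * zoom), round(height_pt * zoom), dpi

# A page of 595 x 842 points (roughly A4) at the zoom used by the tool.
a4 = rendered_size(595, 842)  # (1190, 1684, 144)
```

Doubling the zoom quadruples the pixel count, which is worth keeping in mind when estimating image sizes.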

Diff images are saved only when thresholded_diff.getbbox() returns a bounding box, that is, when at least one channel of at least one pixel differs by more than 20 (on a 0-255 scale). If every difference is 20 or less, no image is saved, even if the page's SSIM score is below the threshold.
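A consumer of the output can therefore not assume that every reported page has an image. The sketch below lists such pages, assuming (as the paragraph above implies) that a page can appear in diff_pages on the basis of SSIM alone; `pages_without_diff_image` is a hypothetical helper.

```python
import os

def pages_without_diff_image(results, run_dir):
    """Return pages listed in diff_pages that have no diff_page_<N>.png in run_dir."""
    return [
        page
        for page in results.get("diff_pages", [])
        if not os.path.exists(os.path.join(run_dir, f"diff_page_{page}.png"))
    ]
```

An empty list means every reported page has a matching image; any page returned differed only by subtle, sub-cutoff pixel changes.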

Extra page images

When PDFs have different page counts, extra pages are rendered as standalone images.

File naming

extra_page_<N>_only_in_pdf<X>.png

Where:

N = Page number (1-indexed)
X = Which PDF contains the extra page (1 or 2)

Examples:

extra_page_5_only_in_pdf1.png - Page 5 exists only in the first PDF
extra_page_9_only_in_pdf2.png - Page 9 exists only in the second PDF
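Both naming patterns are regular enough to parse back into structured data. A minimal sketch, assuming only the file names documented on this page; `classify_output_file` and the dictionary keys it returns are illustrative.

```python
import re

DIFF_RE = re.compile(r"^diff_page_(\d+)\.png$")
EXTRA_RE = re.compile(r"^extra_page_(\d+)_only_in_pdf([12])\.png$")

def classify_output_file(name):
    """Map an output file name to its kind and page number, or None if unrecognized."""
    match = DIFF_RE.match(name)
    if match:
        return {"kind": "diff", "page": int(match.group(1))}
    match = EXTRA_RE.match(name)
    if match:
        return {"kind": "extra", "page": int(match.group(1)), "pdf": int(match.group(2))}
    if name == "results.json":
        return {"kind": "report"}
    return None
```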

Generation logic

For extra pages in PDF1: (pdf_visual_diff.py:74-81)

if len(pdf1) > len(pdf2):
    longer_pdf = "PDF1"
    for i in range(page_count, len(pdf1)):
        extra_pages.append(i + 1)
        page = pdf1.load_page(i)
        img = page.get_pixmap(matrix=mat)
        pil_img = Image.frombytes("RGB", [img.width, img.height], img.samples)
        pil_img.save(os.path.join(output_dir, f"extra_page_{i+1}_only_in_pdf1.png"))

For extra pages in PDF2: (pdf_visual_diff.py:82-89)

elif len(pdf2) > len(pdf1):
    longer_pdf = "PDF2"
    for i in range(page_count, len(pdf2)):
        extra_pages.append(i + 1)
        page = pdf2.load_page(i)
        img = page.get_pixmap(matrix=mat)  
        pil_img = Image.frombytes("RGB", [img.width, img.height], img.samples)
        pil_img.save(os.path.join(output_dir, f"extra_page_{i+1}_only_in_pdf2.png"))
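The page-number bookkeeping in the two branches can be checked without opening any PDF. This sketch assumes that page_count in the excerpts is the number of pages the two documents share (the smaller of the two counts); `extra_page_info` is a hypothetical helper.

```python
def extra_page_info(pdf1_pages, pdf2_pages):
    """Return (longer_pdf, extra_pages) with 1-indexed page numbers."""
    # Assumption: pages up to the shorter document's length are compared pairwise.
    common = min(pdf1_pages, pdf2_pages)
    if pdf1_pages > pdf2_pages:
        return "PDF1", list(range(common + 1, pdf1_pages + 1))
    if pdf2_pages > pdf1_pages:
        return "PDF2", list(range(common + 1, pdf2_pages + 1))
    return None, []
```

For a 10-page first PDF and an 8-page second PDF this yields ("PDF1", [9, 10]), matching the extra_pages and extra_pages_in fields of the report.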

Results JSON file

Every comparison generates a results.json file with detailed metadata.

Schema

Implementation: pdf_visual_diff.py:109-126

results = {
    "timestamp": timestamp,
    "status": "success" if (not diff_pages and not extra_pages) else "error",
    "description": description,
    "pdf1": os.path.abspath(pdf1_path),
    "pdf2": os.path.abspath(pdf2_path),
    "pdf1_pages": pdf1_page_count,
    "pdf2_pages": pdf2_page_count,
    "threshold": threshold,
    "identical": not diff_pages and not extra_pages,
    "diff_pages": diff_pages,
    "extra_pages": extra_pages,
    "extra_pages_in": longer_pdf,
}

with open(os.path.join(output_dir, "results.json"), "w") as f:
    json.dump(results, f, indent=2)

Field reference

timestamp

string

Timestamp when the comparison was run.

Format: YYYYDDMM_HHMMSS (year, day, month, then hour, minute, second). Note that the day precedes the month.

Example: "20260304_143052" (3 April 2026, 14:30:52)

status

string

Comparison result status.

"success" - PDFs are identical
"error" - Differences or extra pages found

Note: "error" does not mean the tool failed, only that differences exist.

description

string

Human-readable summary of the comparison result.

Examples:

"All pages are visually identical."
"Visual differences found on pages: 1, 3, 5"
"Visual differences found on pages: 2 Extra pages only in PDF1: 9, 10"

pdf1

string

Absolute path to the first PDF file.

Example: "/home/user/documents/baseline.pdf"

pdf2

string

Absolute path to the second PDF file.

Example: "/home/user/documents/updated.pdf"

pdf1_pages

integer

Total number of pages in the first PDF.

Example: 10

pdf2_pages

integer

Total number of pages in the second PDF.

Example: 8

threshold

float

SSIM threshold value used for the comparison.

Example: 0.999

identical

boolean

Whether the PDFs are visually identical.

true - No differences found
false - Differences or extra pages exist

Logic: pdf_visual_diff.py:118

"identical": not diff_pages and not extra_pages

diff_pages

array

Array of page numbers (1-indexed) with visual differences.

Examples:

[] - No differences
[1, 3, 5] - Pages 1, 3, and 5 have differences

extra_pages

array

Array of page numbers (1-indexed) that exist in only one PDF.

Examples:

[] - Same page count
[9, 10] - Pages 9 and 10 exist in only one PDF

extra_pages_in

string | null

Which PDF contains the extra pages.

"PDF1" - First PDF has extra pages
"PDF2" - Second PDF has extra pages
null - PDFs have the same page count
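The field reference above can be turned into a lightweight sanity check before a report is consumed. This is a sketch under the types and rules documented on this page; `validate_results` and `EXPECTED_TYPES` are hypothetical names and not part of the tool.

```python
# Field names and types taken from the field reference above.
EXPECTED_TYPES = {
    "timestamp": str,
    "status": str,
    "description": str,
    "pdf1": str,
    "pdf2": str,
    "pdf1_pages": int,
    "pdf2_pages": int,
    "threshold": (int, float),
    "identical": bool,
    "diff_pages": list,
    "extra_pages": list,
    "extra_pages_in": (str, type(None)),
}

def validate_results(results):
    """Return a list of problems found in a parsed results.json (empty if none)."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in results:
            problems.append(f"missing field: {field}")
        elif not isinstance(results[field], expected):
            problems.append(f"wrong type for {field}")
    if not problems:
        # "identical" and "status" are both derived from the two page lists.
        clean = not results["diff_pages"] and not results["extra_pages"]
        if results["identical"] != clean:
            problems.append("identical disagrees with diff_pages/extra_pages")
        if results["status"] != ("success" if clean else "error"):
            problems.append("status disagrees with diff_pages/extra_pages")
    return problems
```

An empty return value means the report has every documented field with the documented type and that the derived fields are mutually consistent.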

Example JSON outputs

{
  "timestamp": "20260304_143052",
  "status": "success",
  "description": "All pages are visually identical.",
  "pdf1": "/home/user/baseline.pdf",
  "pdf2": "/home/user/updated.pdf",
  "pdf1_pages": 5,
  "pdf2_pages": 5,
  "threshold": 0.999,
  "identical": true,
  "diff_pages": [],
  "extra_pages": [],
  "extra_pages_in": null
}

Programmatic usage

Parsing results in shell scripts

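#!/bin/bash
# Use the most recently written report; a bare glob would match every past run.
RESULT_FILE=$(ls -t diff_output/*_diff/results.json 2>/dev/null | head -n 1)

if [ -z "$RESULT_FILE" ]; then
  echo "No results.json found" >&2
  exit 2
fi

if jq -e '.identical == true' "$RESULT_FILE" > /dev/null; then
  echo "✓ PDFs are identical"
  exit 0
else
  echo "✗ Differences found"
  jq -r '.description' "$RESULT_FILE"
  exit 1
fi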

Parsing results in Python

import json
import glob
import os

def get_latest_result(output_dir):
    """Find and parse the most recent results.json file."""
    pattern = os.path.join(output_dir, "*_diff", "results.json")
    result_files = glob.glob(pattern)
    
    if not result_files:
        return None
    
    # Get the most recent file
    latest_file = max(result_files, key=os.path.getmtime)
    
    with open(latest_file, 'r') as f:
        return json.load(f)

# Usage
result = get_latest_result("diff_output")

if result:
    if result["identical"]:
        print("✓ PDFs are identical")
    else:
        print(f"✗ {result['description']}")
        
        if result["diff_pages"]:
            print(f"  Diff pages: {result['diff_pages']}")
        
        if result["extra_pages"]:
            print(f"  Extra pages in {result['extra_pages_in']}: {result['extra_pages']}")

CI/CD integration example

# GitHub Actions example
name: PDF Regression Test

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Install dependencies
        run: |
          pip install PyMuPDF Pillow scikit-image numpy
      
      - name: Generate baseline PDF
        run: python generate_report.py --output baseline.pdf
      
      - name: Generate updated PDF
        run: python generate_report.py --output updated.pdf
      
      - name: Compare PDFs
        run: |
          python pdf_visual_diff.py baseline.pdf updated.pdf \
            --output ci_results \
            --threshold 0.999
      
      - name: Check results
        run: |
          if jq -e '.identical == true' ci_results/*/results.json > /dev/null; then
            echo "✓ PDFs are identical"
          else
            echo "✗ Visual differences detected"
            jq '.' ci_results/*/results.json
            exit 1
          fi
      
      - name: Upload diff artifacts
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: pdf-diffs
          path: ci_results/
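Where jq is not available on the runner, the "Check results" step could be replaced by a short Python check. The sketch below assumes the directory layout and results.json fields described on this page; `ci_exit_code` and the exit code 2 for a missing report are illustrative choices.

```python
import glob
import json
import os
import sys

def ci_exit_code(results_dir):
    """Return 0 if the newest run is identical, 1 if it differs, 2 if no report exists."""
    reports = glob.glob(os.path.join(results_dir, "*_diff", "results.json"))
    if not reports:
        print(f"No results.json found under {results_dir}", file=sys.stderr)
        return 2
    # Pick the most recently written report.
    latest = max(reports, key=os.path.getmtime)
    with open(latest, encoding="utf-8") as f:
        result = json.load(f)
    print(result["description"])
    return 0 if result["identical"] else 1
```

In a workflow step the function would typically be wrapped as `sys.exit(ci_exit_code("ci_results"))`, so that a non-zero code fails the job and triggers the artifact upload.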

File size considerations

Typical sizes

Diff images: 100 KB to 5 MB per page (depends on page complexity)
Extra page images: 50 KB to 3 MB per page
results.json: under 1 KB
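Actual sizes depend heavily on the documents being compared, so measuring an existing output directory is more reliable than the ranges above. A minimal sketch that totals bytes per file kind, assuming the file naming documented on this page; `size_by_kind` is a hypothetical helper.

```python
import os
from collections import defaultdict

def size_by_kind(output_dir):
    """Return total bytes per file kind ('diff', 'extra', 'report', 'other')."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            if name.startswith("diff_page_"):
                kind = "diff"
            elif name.startswith("extra_page_"):
                kind = "extra"
            elif name == "results.json":
                kind = "report"
            else:
                kind = "other"
            totals[kind] += os.path.getsize(os.path.join(root, name))
    return dict(totals)
```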

Storage management

Compress old results

# Archive results older than 30 days
find diff_output -type d -name "*_diff" -mtime +30 | while IFS= read -r dir; do
  # Remove the directory only if the archive was created successfully
  tar -czf "${dir}.tar.gz" "$dir" && rm -rf "$dir"
done

Keep only JSON reports

# Delete images older than 7 days but keep JSON files
find diff_output -type f -name "*.png" -mtime +7 -delete

Limit output size

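# Delete oldest results when directory exceeds 1GB (du -b requires GNU coreutils)
while [ "$(du -sb diff_output | cut -f1)" -gt 1073741824 ]; do
  OLDEST=$(ls -td diff_output/*_diff 2>/dev/null | tail -n 1)
  [ -n "$OLDEST" ] || break  # nothing left to delete; avoid looping forever
  rm -rf "$OLDEST"
done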
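The same size limit can be enforced from Python, which avoids the dependency on GNU du. This is a sketch assuming run directories end in _diff as described above; `dir_size` and `prune_oldest_runs` are hypothetical helpers, and sizes are summed from file lengths, not from disk blocks.

```python
import os
import shutil

def dir_size(path):
    """Total size in bytes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def prune_oldest_runs(output_dir, max_bytes):
    """Delete the oldest *_diff directories until output_dir fits in max_bytes."""
    runs = sorted(
        (entry.path for entry in os.scandir(output_dir)
         if entry.is_dir() and entry.name.endswith("_diff")),
        key=os.path.getmtime,  # oldest first
    )
    removed = []
    # Stop when the limit is met or when there is nothing left to delete.
    while runs and dir_size(output_dir) > max_bytes:
        oldest = runs.pop(0)
        shutil.rmtree(oldest)
        removed.append(oldest)
    return removed
```

Calling `prune_oldest_runs("diff_output", 1024**3)` would mirror the 1 GB limit of the shell loop and return the paths that were removed.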
