# Output formats

> Understanding the diff images and JSON reports generated by pdf-visual-diff

## Overview

Each comparison run generates a timestamped output directory containing:

1. **Diff images** - Visual representations of differences
2. **Extra page images** - Pages that exist in only one PDF
3. **results.json** - Machine-readable comparison report

## Directory structure

The output follows this structure:

```
<output_dir>/
└── <timestamp>_diff/
    ├── diff_page_1.png
    ├── diff_page_3.png
    ├── extra_page_5_only_in_pdf2.png
    └── results.json
```

**Implementation:** `pdf_visual_diff.py:14-17`

```python theme={null}
timestamp = datetime.now().strftime("%Y%d%m_%H%M%S")
output_dir = os.path.join(output_dir, f"{timestamp}_diff")
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
```

<Note>
  If PDFs are identical, the directory is still created but will only contain `results.json`.
</Note>
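The format string above places the day before the month, so tooling that needs a real `datetime` from a directory name has to parse it with the same format. A minimal sketch, assuming the directory name always ends in `_diff`; `parse_run_timestamp` is a hypothetical helper, not part of the tool:

```python
from datetime import datetime

# Same format string the tool passes to strftime (day before month).
RUN_FORMAT = "%Y%d%m_%H%M%S"


def parse_run_timestamp(dir_name: str) -> datetime:
    """Turn '<timestamp>_diff' into a datetime (hypothetical helper)."""
    suffix = "_diff"
    if not dir_name.endswith(suffix):
        raise ValueError(f"not a run directory: {dir_name!r}")
    return datetime.strptime(dir_name[: -len(suffix)], RUN_FORMAT)


print(parse_run_timestamp("20260304_143052_diff"))  # 2026-04-03 14:30:52
```

Sorting runs by the parsed value, rather than by directory name, gives a chronological order.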

## Diff images

Diff images highlight visual differences between corresponding pages.

### File naming

```
diff_page_<N>.png
```

Where `N` is the page number (1-indexed).

**Example:**

* `diff_page_1.png` - Differences on page 1
* `diff_page_3.png` - Differences on page 3
* `diff_page_10.png` - Differences on page 10

### How diff images are generated

The process involves multiple steps:

**1. Render pages to images** (`pdf_visual_diff.py:31-43`)

```python theme={null}
zoom = 2  # DPI = 144
mat = fitz.Matrix(zoom, zoom)

page1 = pdf1.load_page(i)
page2 = pdf2.load_page(i)

img1 = page1.get_pixmap(matrix=mat)
img2 = page2.get_pixmap(matrix=mat)

pil_img1 = Image.frombytes("RGB", [img1.width, img1.height], img1.samples)
pil_img2 = Image.frombytes("RGB", [img2.width, img2.height], img2.samples)
```

**2. Calculate pixel-level differences** (`pdf_visual_diff.py:59-60`)

```python theme={null}
# Use ImageChops to find the difference
diff = ImageChops.difference(pil_img1, pil_img2)
```

**3. Apply threshold to make differences visible** (`pdf_visual_diff.py:62-63`)

```python theme={null}
# Threshold to make the diff more visible
thresholded_diff = diff.point(lambda p: 255 if p > 20 else 0)
```

**4. Highlight differences in red** (`pdf_visual_diff.py:65-69`)

```python theme={null}
if thresholded_diff.getbbox():
    drawing_layer = Image.new("RGBA", pil_img1.size, (0,0,0,0))
    drawing_layer.paste((255,0,0,128), mask=thresholded_diff.convert('L'))
    highlighted_img = Image.alpha_composite(pil_img1.convert("RGBA"), drawing_layer)
    highlighted_img.convert("RGB").save(os.path.join(output_dir, f"diff_page_{i+1}.png"))
```
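To make the cutoff in steps 2 and 3 concrete, the following plain-Python sketch mirrors what they do to a single pixel, without Pillow: an absolute per-channel difference followed by the `> 20` test. `channel_mask` is a hypothetical name used only for illustration.

```python
def channel_mask(pixel_a, pixel_b, cutoff=20):
    """Per-channel |a - b|, mapped to 255 if above cutoff, else 0."""
    return tuple(255 if abs(a - b) > cutoff else 0 for a, b in zip(pixel_a, pixel_b))


# A difference of exactly 20 is ignored; 21 is kept.
print(channel_mask((100, 100, 100), (120, 121, 100)))  # (0, 255, 0)
```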

### Visual characteristics

<Tabs>
  <Tab title="Colors">
    * **Base image**: Original content from PDF1
    * **Red overlay**: Areas with differences (semi-transparent, 50% opacity)
    * **Unchanged areas**: Original colors from PDF1
  </Tab>

  <Tab title="Resolution">
    * **DPI**: 144 (2x zoom factor)
    * **Color space**: RGB
    * **Format**: PNG (lossless)
    * **Dimensions**: Match the original PDF page size at 144 DPI
  </Tab>

  <Tab title="Thresholding">
    * Per-channel differences of 20 or less (out of 255) are ignored
    * Differences greater than 20 are highlighted
    * This prevents highlighting minor compression artifacts
  </Tab>
</Tabs>
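The pixel dimensions follow from the page size in PDF points (1 pt = 1/72 inch) multiplied by the zoom factor. A rough sketch, assuming simple rounding; PyMuPDF's own rounding may differ by a pixel for fractional page sizes, and `expected_pixels` is a hypothetical helper:

```python
POINTS_PER_INCH = 72


def expected_pixels(width_pt, height_pt, zoom=2):
    """Approximate rendered size in pixels for a page given in points."""
    return round(width_pt * zoom), round(height_pt * zoom)


print(POINTS_PER_INCH * 2)        # 144 (effective DPI at zoom 2)
print(expected_pixels(612, 792))  # (1224, 1584) for US Letter
```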

<Note>
  Diff images are only generated when `thresholded_diff.getbbox()` returns a bounding box. If differences exist but are too subtle (every per-channel difference is 20 or less), no image is saved even if the SSIM is below threshold.
</Note>

## Extra page images

When PDFs have different page counts, extra pages are rendered as standalone images.

### File naming

```
extra_page_<N>_only_in_pdf<X>.png
```

Where:

* `N` = Page number (1-indexed)
* `X` = Which PDF contains the extra page (1 or 2)

**Examples:**

* `extra_page_5_only_in_pdf1.png` - Page 5 exists only in the first PDF
* `extra_page_9_only_in_pdf2.png` - Page 9 exists only in the second PDF
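Both naming patterns are regular enough to parse when post-processing an output directory. A minimal sketch based only on the patterns documented on this page; `classify` and the two regex names are hypothetical:

```python
import re

DIFF_RE = re.compile(r"^diff_page_(\d+)\.png$")
EXTRA_RE = re.compile(r"^extra_page_(\d+)_only_in_pdf([12])\.png$")


def classify(filename):
    """Return ('diff', page), ('extra', page, pdf_number), or None."""
    if m := DIFF_RE.match(filename):
        return ("diff", int(m.group(1)))
    if m := EXTRA_RE.match(filename):
        return ("extra", int(m.group(1)), int(m.group(2)))
    return None


print(classify("diff_page_10.png"))               # ('diff', 10)
print(classify("extra_page_5_only_in_pdf1.png"))  # ('extra', 5, 1)
print(classify("results.json"))                   # None
```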

### Generation logic

**For extra pages in PDF1:** (`pdf_visual_diff.py:74-81`)

```python theme={null}
if len(pdf1) > len(pdf2):
    longer_pdf = "PDF1"
    for i in range(page_count, len(pdf1)):
        extra_pages.append(i + 1)
        page = pdf1.load_page(i)
        img = page.get_pixmap(matrix=mat)
        pil_img = Image.frombytes("RGB", [img.width, img.height], img.samples)
        pil_img.save(os.path.join(output_dir, f"extra_page_{i+1}_only_in_pdf1.png"))
```

**For extra pages in PDF2:** (`pdf_visual_diff.py:82-89`)

```python theme={null}
elif len(pdf2) > len(pdf1):
    longer_pdf = "PDF2"
    for i in range(page_count, len(pdf2)):
        extra_pages.append(i + 1)
        page = pdf2.load_page(i)
        img = page.get_pixmap(matrix=mat)  
        pil_img = Image.frombytes("RGB", [img.width, img.height], img.samples)
        pil_img.save(os.path.join(output_dir, f"extra_page_{i+1}_only_in_pdf2.png"))
```
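The two branches reduce to a small pure function. The sketch below assumes `page_count` is the smaller of the two page counts (the excerpts do not show where it is set); `find_extra_pages` is a hypothetical name, and rendering is left out.

```python
def find_extra_pages(len_pdf1, len_pdf2):
    """Return (extra_pages, longer_pdf) using 1-indexed page numbers."""
    page_count = min(len_pdf1, len_pdf2)  # assumption: pages are compared pairwise
    if len_pdf1 > len_pdf2:
        return list(range(page_count + 1, len_pdf1 + 1)), "PDF1"
    if len_pdf2 > len_pdf1:
        return list(range(page_count + 1, len_pdf2 + 1)), "PDF2"
    return [], None


print(find_extra_pages(10, 8))  # ([9, 10], 'PDF1')
print(find_extra_pages(7, 10))  # ([8, 9, 10], 'PDF2')
print(find_extra_pages(5, 5))   # ([], None)
```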

## Results JSON file

Every comparison generates a `results.json` file with detailed metadata.

### Schema

**Implementation:** `pdf_visual_diff.py:109-126`

```python theme={null}
results = {
    "timestamp": timestamp,
    "status": "success" if (not diff_pages and not extra_pages) else "error",
    "description": description,
    "pdf1": os.path.abspath(pdf1_path),
    "pdf2": os.path.abspath(pdf2_path),
    "pdf1_pages": pdf1_page_count,
    "pdf2_pages": pdf2_page_count,
    "threshold": threshold,
    "identical": not diff_pages and not extra_pages,
    "diff_pages": diff_pages,
    "extra_pages": extra_pages,
    "extra_pages_in": longer_pdf,
}

with open(os.path.join(output_dir, "results.json"), "w") as f:
    json.dump(results, f, indent=2)
```

### Field reference

<ParamField path="timestamp" type="string">
  Timestamp when the comparison was run.

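  Format: `YYYYDDMM_HHMMSS` (year, day, month, then hour, minute, second), produced by `strftime("%Y%d%m_%H%M%S")`. The day comes before the month, so timestamps do not sort chronologically across months.

  **Example:** `"20260304_143052"` (3 April 2026, 14:30:52)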
</ParamField>

<ParamField path="status" type="string">
  Comparison result status.

  * `"success"` - PDFs are identical
  * `"error"` - Differences or extra pages found

  **Note:** "error" does not mean the tool failed, just that differences exist.
</ParamField>

<ParamField path="description" type="string">
  Human-readable summary of the comparison result.

  **Examples:**

  * `"All pages are visually identical."`
  * `"Visual differences found on pages: 1, 3, 5"`
  * `"Visual differences found on pages: 2 Extra pages only in PDF1: 9, 10"`
</ParamField>

<ParamField path="pdf1" type="string">
  Absolute path to the first PDF file.

  **Example:** `"/home/user/documents/baseline.pdf"`
</ParamField>

<ParamField path="pdf2" type="string">
  Absolute path to the second PDF file.

  **Example:** `"/home/user/documents/updated.pdf"`
</ParamField>

<ParamField path="pdf1_pages" type="integer">
  Total number of pages in the first PDF.

  **Example:** `10`
</ParamField>

<ParamField path="pdf2_pages" type="integer">
  Total number of pages in the second PDF.

  **Example:** `8`
</ParamField>

<ParamField path="threshold" type="float">
  SSIM threshold value used for the comparison.

  **Example:** `0.999`
</ParamField>

<ParamField path="identical" type="boolean">
  Whether the PDFs are visually identical.

  * `true` - No differences found
  * `false` - Differences or extra pages exist

  **Logic:** `pdf_visual_diff.py:118`

  ```python theme={null}
  "identical": not diff_pages and not extra_pages
  ```
</ParamField>

<ParamField path="diff_pages" type="array">
  Array of page numbers (1-indexed) with visual differences.

  **Examples:**

  * `[]` - No differences
  * `[1, 3, 5]` - Pages 1, 3, and 5 have differences
</ParamField>

<ParamField path="extra_pages" type="array">
  Array of page numbers (1-indexed) that exist in only one PDF.

  **Examples:**

  * `[]` - Same page count
  * `[9, 10]` - Pages 9 and 10 exist in only one PDF
</ParamField>

<ParamField path="extra_pages_in" type="string | null">
  Which PDF contains the extra pages.

  * `"PDF1"` - First PDF has extra pages
  * `"PDF2"` - Second PDF has extra pages
  * `null` - PDFs have the same page count
</ParamField>
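For consumers that prefer typed access over raw dictionaries, the fields above map directly onto a dataclass. A minimal sketch, assuming the report contains exactly the twelve documented keys; `ComparisonResult` and `load_result` are hypothetical names:

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComparisonResult:
    timestamp: str
    status: str
    description: str
    pdf1: str
    pdf2: str
    pdf1_pages: int
    pdf2_pages: int
    threshold: float
    identical: bool
    diff_pages: List[int]
    extra_pages: List[int]
    extra_pages_in: Optional[str]


def load_result(text: str) -> ComparisonResult:
    """Parse results.json content; raises TypeError on missing or unknown keys."""
    return ComparisonResult(**json.loads(text))


sample = """{"timestamp": "20260304_151204", "status": "error",
 "description": "Extra pages only in PDF1: 9, 10",
 "pdf1": "/home/user/full_report.pdf", "pdf2": "/home/user/summary_report.pdf",
 "pdf1_pages": 10, "pdf2_pages": 8, "threshold": 1.0, "identical": false,
 "diff_pages": [], "extra_pages": [9, 10], "extra_pages_in": "PDF1"}"""
result = load_result(sample)
print(result.extra_pages_in, result.extra_pages)  # PDF1 [9, 10]
```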

### Example JSON outputs

<CodeGroup>
  ```json Identical PDFs theme={null}
  {
    "timestamp": "20260304_143052",
    "status": "success",
    "description": "All pages are visually identical.",
    "pdf1": "/home/user/baseline.pdf",
    "pdf2": "/home/user/updated.pdf",
    "pdf1_pages": 5,
    "pdf2_pages": 5,
    "threshold": 0.999,
    "identical": true,
    "diff_pages": [],
    "extra_pages": [],
    "extra_pages_in": null
  }
  ```

  ```json Differences found theme={null}
  {
    "timestamp": "20260304_145633",
    "status": "error",
    "description": "Visual differences found on pages: 1, 3, 5",
    "pdf1": "/home/user/report_jan.pdf",
    "pdf2": "/home/user/report_feb.pdf",
    "pdf1_pages": 8,
    "pdf2_pages": 8,
    "threshold": 0.95,
    "identical": false,
    "diff_pages": [1, 3, 5],
    "extra_pages": [],
    "extra_pages_in": null
  }
  ```

  ```json Different page counts theme={null}
  {
    "timestamp": "20260304_151204",
    "status": "error",
    "description": "Extra pages only in PDF1: 9, 10",
    "pdf1": "/home/user/full_report.pdf",
    "pdf2": "/home/user/summary_report.pdf",
    "pdf1_pages": 10,
    "pdf2_pages": 8,
    "threshold": 1.0,
    "identical": false,
    "diff_pages": [],
    "extra_pages": [9, 10],
    "extra_pages_in": "PDF1"
  }
  ```

  ```json Combined differences theme={null}
  {
    "timestamp": "20260304_153847",
    "status": "error",
    "description": "Visual differences found on pages: 2 Extra pages only in PDF2: 8, 9, 10",
    "pdf1": "/home/user/draft.pdf",
    "pdf2": "/home/user/final.pdf",
    "pdf1_pages": 7,
    "pdf2_pages": 10,
    "threshold": 0.999,
    "identical": false,
    "diff_pages": [2],
    "extra_pages": [8, 9, 10],
    "extra_pages_in": "PDF2"
  }
  ```
</CodeGroup>
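All four examples satisfy a few invariants that follow from the schema code: `identical` and `status` are both derived from `diff_pages` and `extra_pages`, and `extra_pages_in` is set only when extra pages exist. A hedged sketch of a consistency check; `check_invariants` is a hypothetical helper, useful mainly for hand-edited or archived reports:

```python
def check_invariants(result: dict) -> list:
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    clean = not result["diff_pages"] and not result["extra_pages"]
    if result["identical"] != clean:
        problems.append("identical does not match diff_pages/extra_pages")
    if result["status"] != ("success" if clean else "error"):
        problems.append("status does not match diff_pages/extra_pages")
    if bool(result["extra_pages"]) != (result["extra_pages_in"] is not None):
        problems.append("extra_pages_in does not match extra_pages")
    return problems


combined = {
    "status": "error",
    "identical": False,
    "diff_pages": [2],
    "extra_pages": [8, 9, 10],
    "extra_pages_in": "PDF2",
}
print(check_invariants(combined))  # []
```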

## Programmatic usage

### Parsing results in shell scripts

<CodeGroup>
  ```bash Check if identical theme={null}
  #!/bin/bash
  # Pick the most recent results file (the glob may match several runs)
  RESULT_FILE=$(ls -t diff_output/*_diff/results.json | head -n 1)

  if jq -e '.identical == true' "$RESULT_FILE" > /dev/null; then
    echo "✓ PDFs are identical"
    exit 0
  else
    echo "✗ Differences found"
    jq -r '.description' "$RESULT_FILE"
    exit 1
  fi
  ```

  ```bash Extract diff pages theme={null}
  #!/bin/bash
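  # Pick the most recent results file (the glob may match several runs)
  RESULT_FILE=$(ls -t diff_output/*_diff/results.json | head -n 1)

  DIFF_PAGES=$(jq -r '.diff_pages | map(tostring) | join(", ")' "$RESULT_FILE")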

  if [ -n "$DIFF_PAGES" ]; then
    echo "Pages with differences: $DIFF_PAGES"
  fi
  ```

  ```bash Generate report theme={null}
  #!/bin/bash
  # Pick the most recent results file (the glob may match several runs)
  RESULT_FILE=$(ls -t diff_output/*_diff/results.json | head -n 1)

  echo "PDF Comparison Report"
  echo "===================="
  echo "PDF 1: $(jq -r '.pdf1' "$RESULT_FILE")"
  echo "PDF 2: $(jq -r '.pdf2' "$RESULT_FILE")"
  echo "Pages: $(jq -r '.pdf1_pages' "$RESULT_FILE") vs $(jq -r '.pdf2_pages' "$RESULT_FILE")"
  echo "Threshold: $(jq -r '.threshold' "$RESULT_FILE")"
  echo "Status: $(jq -r '.status' "$RESULT_FILE")"
  echo ""
  jq -r '.description' "$RESULT_FILE"
  ```
</CodeGroup>

### Parsing results in Python

```python theme={null}
import json
import glob
import os

def get_latest_result(output_dir):
    """Find and parse the most recent results.json file."""
    pattern = os.path.join(output_dir, "*_diff", "results.json")
    result_files = glob.glob(pattern)
    
    if not result_files:
        return None
    
    # Get the most recent file
    latest_file = max(result_files, key=os.path.getmtime)
    
    with open(latest_file, 'r') as f:
        return json.load(f)

# Usage
result = get_latest_result("diff_output")

if result:
    if result["identical"]:
        print("✓ PDFs are identical")
    else:
        print(f"✗ {result['description']}")
        
        if result["diff_pages"]:
            print(f"  Diff pages: {result['diff_pages']}")
        
        if result["extra_pages"]:
            print(f"  Extra pages in {result['extra_pages_in']}: {result['extra_pages']}")
```

### CI/CD integration example

```yaml theme={null}
# GitHub Actions example
name: PDF Regression Test

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Install dependencies
        run: |
          pip install PyMuPDF Pillow scikit-image numpy
      
      - name: Generate baseline PDF
        run: python generate_report.py --output baseline.pdf
      
      - name: Generate updated PDF
        run: python generate_report.py --output updated.pdf
      
      - name: Compare PDFs
        run: |
          python pdf_visual_diff.py baseline.pdf updated.pdf \
            --output ci_results \
            --threshold 0.999
      
      - name: Check results
        run: |
          if jq -e '.identical == true' ci_results/*/results.json > /dev/null; then
            echo "✓ PDFs are identical"
          else
            echo "✗ Visual differences detected"
            jq '.' ci_results/*/results.json
            exit 1
          fi
      
      - name: Upload diff artifacts
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: pdf-diffs
          path: ci_results/
```
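Where `jq` is not available on the runner, the check step could be replaced by a short Python function. A minimal sketch, assuming the same `ci_results` directory as the workflow above; `gate` is a hypothetical name, not part of the tool:

```python
import glob
import json
import os


def gate(output_dir: str) -> int:
    """Return 0 if the newest results.json reports identical PDFs, else 1."""
    files = glob.glob(os.path.join(output_dir, "*_diff", "results.json"))
    if not files:
        print(f"no results.json found under {output_dir}")
        return 1
    with open(max(files, key=os.path.getmtime)) as f:
        result = json.load(f)
    print(result["description"])
    return 0 if result["identical"] else 1
```

A CI step would pass the return value to `sys.exit`, for example `sys.exit(gate("ci_results"))`. Treating a missing report as a failure is a deliberate choice here, so a crashed comparison does not pass silently.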

## File size considerations

### Typical sizes

* **Diff images**: 100KB - 5MB per page (depends on page complexity)
* **Extra page images**: 50KB - 3MB per page
* **results.json**: \< 1KB

### Storage management

<Accordion title="Compress old results">
  ```bash theme={null}
  # Archive results older than 30 days
  find diff_output -type d -name "*_diff" -mtime +30 | while IFS= read -r dir; do
    # Remove the directory only if the archive was written successfully
    tar -czf "${dir}.tar.gz" "$dir" && rm -rf "$dir"
  done
  ```
</Accordion>

<Accordion title="Keep only JSON reports">
  ```bash theme={null}
  # Delete images older than 7 days but keep JSON files
  find diff_output -type f -name "*.png" -mtime +7 -delete
  ```
</Accordion>

<Accordion title="Limit output size">
  ```bash theme={null}
  # Delete oldest results when directory exceeds 1GB (du -b requires GNU coreutils)
  while [ "$(du -sb diff_output | cut -f1)" -gt 1073741824 ]; do
    OLDEST=$(ls -td diff_output/*_diff 2>/dev/null | tail -n 1)
    [ -n "$OLDEST" ] || break  # nothing left to delete
    rm -rf "$OLDEST"
  done
  ```
</Accordion>
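On systems without GNU `du`, the size limit can be enforced from Python instead. A portable sketch, assuming run directories end in `_diff` and that modification time is an acceptable proxy for age; `dir_size` and `prune_runs` are hypothetical helpers:

```python
import os
import shutil


def dir_size(path):
    """Total size in bytes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def prune_runs(output_dir, max_bytes):
    """Delete the oldest *_diff directories until output_dir fits max_bytes."""
    runs = [
        os.path.join(output_dir, name)
        for name in os.listdir(output_dir)
        if name.endswith("_diff") and os.path.isdir(os.path.join(output_dir, name))
    ]
    runs.sort(key=os.path.getmtime)  # oldest first
    removed = []
    while runs and dir_size(output_dir) > max_bytes:
        oldest = runs.pop(0)
        shutil.rmtree(oldest)
        removed.append(oldest)
    return removed
```

Calling `prune_runs("diff_output", 1024 ** 3)` would apply the same 1GB limit; the loop stops once no run directories remain, even if other files keep the total above the limit.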

## See also

* [Command reference](/usage/command-reference) - Complete CLI documentation
* [Configuration options](/usage/configuration-options) - Threshold and output settings
* [Basic comparison](/usage/basic-comparison) - Getting started guide
