Contributing - PDF Visual Regression

Getting started

We welcome contributions to improve the PDF visual diff tool! Whether you’re fixing bugs, adding features, or improving documentation, your help is appreciated.

Prerequisites

Before contributing, ensure you have:

Python 3.7 or higher
Git for version control
A text editor or IDE
Basic understanding of Python and PDF processing

Initial setup

Fork and clone the repository

Fork the repository on GitHub and clone your fork locally:

git clone https://github.com/YOUR_USERNAME/pdf-visual-diff.git
cd pdf-visual-diff

Install dependencies

Install the required Python packages:

make install
# or
python3 -m pip install -r requirements.txt

Dependencies from requirements.txt:

PyMuPDF - PDF rendering
scikit-image - SSIM algorithm
Pillow - Image processing
numpy - Numerical operations
reportlab - Test PDF generation
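
If the install step reports success but imports still fail, a quick importability check can narrow down which package is at fault. The sketch below is illustrative and not part of the repository; `missing_modules` and `report` are hypothetical helpers. Note that the import names differ from the package names: PyMuPDF is imported as `fitz`, scikit-image as `skimage`, and Pillow as `PIL`.

```python
import importlib.util

# Import names for the packages in requirements.txt (these differ from the
# pip package names for PyMuPDF, scikit-image, and Pillow).
REQUIRED_MODULES = ["fitz", "skimage", "PIL", "numpy", "reportlab"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def report(names=REQUIRED_MODULES):
    """Print a one-line status for each module and return the missing ones."""
    missing = missing_modules(names)
    for name in names:
        status = "MISSING" if name in missing else "ok"
        print(f"{name:12s} {status}")
    return missing
```

Calling `report()` in the environment you installed into prints one status line per module and returns the list of missing ones.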

Run the test suite

Verify your setup by running tests:

make test

All tests should pass on a fresh installation.

Codebase structure

The project follows a simple, focused structure:

.
├── pdf_visual_diff.py       # Main script with comparison logic
├── tests/
│   ├── test_diff_script.py  # Unittest test cases
│   ├── create_test_pdfs.py  # Test fixture generator
│   ├── test_pdfs/           # Generated test PDFs (gitignored)
│   └── test_output/         # Test results (gitignored)
├── requirements.txt         # Python dependencies
├── Makefile                 # Build automation
└── README.md               # Project documentation

Core modules

pdf_visual_diff.py

The main entry point containing all comparison logic.

Key functions:

compare_pdfs(pdf1_path, pdf2_path, output_dir, threshold) - Core comparison function (lines 10-136)
main() - CLI argument parsing and script entry (lines 137-148)

Key sections:

PDF loading and validation (lines 19-30)
Page rendering loop (lines 34-69)
Extra page handling (lines 72-89)
Results generation (lines 97-126)
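
To make the flow of these sections concrete, here is a minimal sketch of the control logic only, with PDF rendering and SSIM scoring left out. It assumes that a page counts as different when its SSIM score falls below the threshold, and that pages beyond the end of the shorter document are reported as extras. `plan_comparison` and `failing_pages` are hypothetical names, not functions in pdf_visual_diff.py.

```python
def plan_comparison(page_count_1, page_count_2):
    """Split page indices into comparable pairs and unmatched extras.

    Returns (shared, extras): `shared` lists the 0-based indices present in
    both documents; `extras` lists (document_number, index) for pages that
    exist only in the longer document.
    """
    common = min(page_count_1, page_count_2)
    shared = list(range(common))
    longer_doc = 1 if page_count_1 > page_count_2 else 2
    extras = [(longer_doc, i) for i in range(common, max(page_count_1, page_count_2))]
    return shared, extras


def failing_pages(scores, threshold=0.999):
    """Return indices whose SSIM score is below `threshold` (assumed rule)."""
    return [i for i, score in enumerate(scores) if score < threshold]


shared, extras = plan_comparison(3, 5)
print(shared)   # [0, 1, 2]
print(extras)   # [(2, 3), (2, 4)]
print(failing_pages([1.0, 0.98, 0.9995]))  # [1]
```

The default of 0.999 mirrors the `compare_pdfs` signature shown under Python style; the real script may treat boundary cases differently.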

tests/test_diff_script.py

Integration tests using subprocess to test the CLI.

Test class:

TestPdfVisualDiff - Main test suite with three test methods

Test methods:

test_identical_pdfs() - Verifies identical PDFs pass
test_different_text_pdfs() - Checks diff detection
test_different_page_count_pdfs() - Tests page count handling

tests/create_test_pdfs.py

Test fixture generator using ReportLab.

Functions:

create_test_pdf(filename, text_content) - Creates a simple one-page PDF
setup_test_files() - Generates all test fixtures

Makefile

Build automation with common development tasks.

Targets:

make install - Install dependencies
make test - Run test suite
make setup - Generate test PDFs
make clean - Remove generated files

Development workflow

Making changes

Create a feature branch

Always work on a separate branch:

git checkout -b feature/my-improvement

Use descriptive branch names:

feature/add-threshold-auto-detect
fix/memory-leak-large-pdfs
docs/improve-readme

Make your changes

Edit the relevant files. Common areas:

Core logic: Modify pdf_visual_diff.py
Tests: Add/update tests/test_diff_script.py
Test fixtures: Update tests/create_test_pdfs.py
Dependencies: Update requirements.txt if needed

Test your changes

Run the test suite to ensure nothing broke:

make clean
make test

For manual testing:

python3 pdf_visual_diff.py test1.pdf test2.pdf --output my_test

Commit your changes

Write clear, descriptive commit messages:

git add .
git commit -m "Add auto-threshold detection based on PDF content"

Good commit messages explain why, not just what.

Push and create a pull request

Push your branch and open a PR:

git push origin feature/my-improvement

In your PR description:

Explain what problem you’re solving
Describe your solution approach
Mention any breaking changes
Include example usage if relevant

Code style guidelines

Follow these conventions to maintain consistency:

Python style

Follow the PEP 8 style guide
Use 4 spaces for indentation (no tabs)
Maximum line length: 100 characters
Use descriptive variable names
Add docstrings to all functions

Example:

def compare_pdfs(pdf1_path, pdf2_path, output_dir, threshold=0.999):
    """
    Compares two PDFs page by page for visual differences.
    
    Args:
        pdf1_path: Path to the first PDF file
        pdf2_path: Path to the second PDF file
        output_dir: Directory to save difference images
        threshold: SSIM threshold (0.0 to 1.0)
    
    Returns:
        None (outputs to console and files)
    """
    # Implementation...

Testing conventions

Write tests for all new features
Use descriptive test method names
Include docstrings explaining what each test verifies
Follow the Arrange-Act-Assert pattern

Example:

import subprocess
import unittest


class TestPdfVisualDiff(unittest.TestCase):
    def test_custom_threshold(self):
        """Test that custom SSIM threshold is respected."""
        # Arrange
        pdf1 = "tests/test_pdfs/test1.pdf"
        pdf2 = "tests/test_pdfs/test2.pdf"

        # Act: run the CLI as a subprocess, as the existing tests do
        result = subprocess.run(
            ["python3", "pdf_visual_diff.py", pdf1, pdf2, "--threshold", "0.95"],
            capture_output=True, text=True
        )

        # Assert
        self.assertIn("threshold: 0.95", result.stdout)

Common contribution areas

Feature additions

Potential features to implement:

Multi-format support: Export diffs as PDF, HTML reports
Threshold auto-tuning: Automatically determine optimal threshold
Batch comparison: Compare multiple PDF pairs
Ignore regions: Mask specific areas from comparison
Performance optimization: Parallel page processing
CI/CD integration: GitHub Actions workflow examples
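
As a starting point for the batch comparison idea, the sketch below pairs PDFs that share a file name across two directories. It is one possible design, not an existing feature; `find_pdf_pairs` and the directory layout (a baseline folder and a candidate folder) are assumptions made for illustration.

```python
from pathlib import Path


def find_pdf_pairs(baseline_dir, candidate_dir):
    """Pair PDFs that share a file name in both directories.

    Returns (pairs, unmatched): `pairs` is a list of
    (baseline_path, candidate_path) tuples sorted by file name; `unmatched`
    is a sorted list of file names found in only one of the directories.
    """
    baseline = {p.name: p for p in Path(baseline_dir).glob("*.pdf")}
    candidate = {p.name: p for p in Path(candidate_dir).glob("*.pdf")}
    common = sorted(baseline.keys() & candidate.keys())
    pairs = [(baseline[name], candidate[name]) for name in common]
    unmatched = sorted(baseline.keys() ^ candidate.keys())
    return pairs, unmatched
```

Each pair could then be passed to `compare_pdfs(pdf1_path, pdf2_path, output_dir, threshold)` with a separate output directory per pair, and the unmatched names reported as missing counterparts.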

Bug fixes

When fixing bugs:

Create a test that reproduces the bug
Verify the test fails before your fix
Implement the fix
Verify the test passes
Check that existing tests still pass
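
The steps above can be sketched with a small, self-contained example. The bug here is invented for illustration: a threshold parser that silently accepted values outside the documented 0.0 to 1.0 range. `parse_threshold` is a hypothetical helper, not a function in pdf_visual_diff.py.

```python
import unittest


def parse_threshold(text):
    """Hypothetical helper: parse an SSIM threshold and validate its range."""
    value = float(text)
    # The fix: reject values outside the documented range instead of
    # silently accepting them.
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold must be between 0.0 and 1.0, got {value}")
    return value


class TestParseThreshold(unittest.TestCase):
    def test_rejects_out_of_range_threshold(self):
        """Regression test: fails until the range check is added."""
        with self.assertRaises(ValueError):
            parse_threshold("1.5")

    def test_accepts_valid_threshold(self):
        """Existing behaviour that must keep working after the fix."""
        self.assertEqual(parse_threshold("0.95"), 0.95)
```

Writing the regression test first and watching it fail confirms that the test exercises the bug rather than passing by accident.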

Documentation improvements

Improve code comments
Add usage examples to README
Create troubleshooting guides
Document edge cases

Reviewing code

When reviewing contributions, check for:

Correctness: Does it solve the stated problem?
Tests: Are there tests covering the new code?
Style: Does it follow project conventions?
Performance: Are there any obvious bottlenecks?
Documentation: Are changes documented?

Release process

For maintainers releasing new versions:

Update version number

Update version in relevant files (if applicable)

Update changelog

Document all changes since last release

Run full test suite

make clean
make test

Tag release

git tag -a v1.2.0 -m "Release version 1.2.0"
git push origin v1.2.0
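
A small check before tagging can catch malformed tag names. This sketch assumes tags follow the `vMAJOR.MINOR.PATCH` form used in the example above; the project may follow a different convention, and `parse_release_tag` is a hypothetical helper.

```python
import re

# Assumed convention: "v" followed by MAJOR.MINOR.PATCH, as in v1.2.0.
TAG_PATTERN = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_release_tag(tag):
    """Return (major, minor, patch) for a tag like 'v1.2.0', else raise."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {tag!r}")
    return tuple(int(part) for part in match.groups())


print(parse_release_tag("v1.2.0"))  # (1, 2, 0)
```

Because the result is a tuple of integers, two tags can be compared directly to confirm that a new release is higher than the previous one.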

Publish release notes

Create a GitHub release with the changelog and binaries

Getting help

If you need assistance:

Issues: Open a GitHub issue for bugs or feature requests
Discussions: Use GitHub Discussions for questions
Code review: Tag maintainers in your PR for review

Before opening an issue, search existing issues to avoid duplicates. When you do open one, provide:

Clear description of the problem
Steps to reproduce
Expected vs actual behavior
System information (OS, Python version)
Sample PDFs if applicable (without sensitive data)
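
To gather the system information in a consistent form, a short script along these lines can help. It uses only the standard library; `collect_environment` and `format_environment` are hypothetical helpers, not part of the repository.

```python
import platform
import sys


def collect_environment():
    """Gather the system details requested in bug reports."""
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "executable": sys.executable,
    }


def format_environment(info):
    """Render the details as plain 'key: value' lines for pasting."""
    return "\n".join(f"{key}: {value}" for key, value in info.items())


print(format_environment(collect_environment()))
```

Paste the printed lines into the issue alongside the steps to reproduce.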

Code of conduct

We expect all contributors to:

Be respectful and constructive
Welcome newcomers and help them get started
Focus on what’s best for the project and community
Accept constructive criticism gracefully
Show empathy towards other community members

Thanks for contributing! Every improvement, no matter how small, makes this tool better for everyone.
