API TUTORIAL

Automating PDF Reports with Python

Turn a list of URLs into styled PDFs with a few lines of Python. This tutorial covers single requests, batch processing, error handling, and scheduling for automated documentation and report workflows.

API access requires the Pro+ plan ($12/mo). Full docs at api.prettypdfprinter.com/docs.

Why automate PDF generation?

Manual saving is fine for a few pages. But when you need to archive 50 pages of documentation, generate weekly reports from dashboards, or snapshot competitor pages on a schedule, automation is the answer.

Python combined with the Pretty PDF API gives you a scriptable pipeline that converts any URL or HTML into a professionally styled PDF. Instead of clicking through a browser extension one page at a time, you write a script once and run it whenever you need fresh PDFs.

Automation eliminates repetitive manual work, reduces errors from copy-paste workflows, and ensures consistent formatting across every document. Whether you are archiving documentation before a version update, generating compliance evidence on a schedule, or building a reporting pipeline that runs in CI, the pattern is the same: a list of URLs goes in, a folder of styled PDFs comes out.

The rest of this tutorial walks through the complete workflow — from a single API call to a production-ready batch script with error handling and scheduling.

Setup and installation

You need Python 3.7 or later and the requests library. The setup takes less than a minute.

Install the requests library

pip install requests

Store your API key securely

Never hardcode API keys in your scripts. Use an environment variable instead. Generate your API key from the Pretty PDF dashboard under Settings, then export it in your shell:

# Linux / macOS
export PRETTYPDF_API_KEY="your-api-key-here"

# Windows (PowerShell)
$env:PRETTYPDF_API_KEY = "your-api-key-here"
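
To confirm the variable is visible to your shell before running any scripts, print it back:

# Linux / macOS
echo "$PRETTYPDF_API_KEY"

# Windows (PowerShell)
$env:PRETTYPDF_API_KEY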

Basic configuration

Create a Python file with the base configuration that every script in this tutorial will use:

import os
import requests

API_KEY = os.environ["PRETTYPDF_API_KEY"]
API_BASE = "https://api.prettypdfprinter.com/v1"

HEADERS = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json",
}

If the PRETTYPDF_API_KEY environment variable is not set, the script will raise a KeyError immediately rather than failing silently later. This is intentional — you want to know right away if the key is missing.
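
If you prefer a clear error message over a raw traceback, an explicit check works too; the scheduled script later in this tutorial uses the same pattern:

import os
import sys

API_KEY = os.environ.get("PRETTYPDF_API_KEY")
if not API_KEY:
    sys.exit("PRETTYPDF_API_KEY environment variable not set")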

Single URL conversion

Start with the simplest case: convert one URL to a PDF and save it to disk.

Convert a URL to PDF

The POST /v1/generate/url endpoint accepts a URL and returns a document record. Send the URL, then use the returned document id to fetch the finished PDF from the files endpoint:

import os
import requests

API_KEY = os.environ["PRETTYPDF_API_KEY"]
API_BASE = "https://api.prettypdfprinter.com/v1"
HEADERS = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json",
}


def convert_url_to_pdf(url, output_path, template="clean"):
    """Convert a single URL to PDF and save to disk."""
    # Step 1: Request PDF generation
    response = requests.post(
        f"{API_BASE}/generate/url",
        headers=HEADERS,
        json={
            "url": url,
            "template": template,
        },
    )
    response.raise_for_status()
    result = response.json()

    # Step 2: Download the generated PDF
    doc_id = result["id"]
    download_url = f"{API_BASE}/files/{doc_id}"
    pdf_response = requests.get(download_url, headers=HEADERS)
    pdf_response.raise_for_status()

    # Step 3: Save to disk
    with open(output_path, "wb") as f:
        f.write(pdf_response.content)

    print(f"Saved: {output_path} ({len(pdf_response.content)} bytes)")
    return result


# Usage
convert_url_to_pdf(
    "https://docs.python.org/3/tutorial/index.html",
    "python-tutorial.pdf",
    template="clean",
)

What happens in this request

The API receives your URL, fetches the page server-side, runs the content extraction pipeline to strip navigation and ads, applies the template you specified, and renders the result to PDF with WeasyPrint. The response includes a document ID that you use to download the file. The entire round trip typically takes 2 to 5 seconds depending on page complexity.
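
To measure the round trip for your own pages, you can wrap the call with a standard-library timer. This sketch reuses convert_url_to_pdf from above and nothing else:

import time

start = time.perf_counter()
convert_url_to_pdf(
    "https://docs.python.org/3/tutorial/index.html",
    "timed-test.pdf",
)
print(f"Round trip: {time.perf_counter() - start:.1f}s")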

Batch processing

Process a list of URLs with rate limiting, progress tracking, and meaningful filenames.

Process multiple URLs

When you have a list of URLs to convert, iterate through them sequentially with a delay between requests to respect rate limits. The function below reads URLs from a list, generates a filename from each URL, and tracks progress:

import os
import re
import time
import requests

API_KEY = os.environ["PRETTYPDF_API_KEY"]
API_BASE = "https://api.prettypdfprinter.com/v1"
HEADERS = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json",
}

DELAY_BETWEEN_REQUESTS = 3  # seconds (safe for Pro+)


def url_to_filename(url):
    """Convert a URL to a safe filename."""
    name = url.split("//")[-1]
    name = re.sub(r"[^\w\-.]", "_", name)
    return name[:100] + ".pdf"


def batch_convert(urls, output_dir="pdfs", template="clean"):
    """Convert a list of URLs to PDFs with rate limiting."""
    os.makedirs(output_dir, exist_ok=True)
    results = {"success": [], "failed": []}

    for i, url in enumerate(urls, 1):
        print(f"[{i}/{len(urls)}] Processing: {url}")

        try:
            # Generate PDF
            response = requests.post(
                f"{API_BASE}/generate/url",
                headers=HEADERS,
                json={"url": url, "template": template},
            )
            response.raise_for_status()
            result = response.json()

            # Download PDF
            doc_id = result["id"]
            pdf_response = requests.get(
                f"{API_BASE}/files/{doc_id}",
                headers=HEADERS,
            )
            pdf_response.raise_for_status()

            # Save to disk
            filename = url_to_filename(url)
            filepath = os.path.join(output_dir, filename)
            with open(filepath, "wb") as f:
                f.write(pdf_response.content)

            print(f"  Saved: {filepath}")
            results["success"].append(url)

        except requests.RequestException as e:
            print(f"  Failed: {e}")
            results["failed"].append({"url": url, "error": str(e)})

        # Rate limit delay (skip after last URL)
        if i < len(urls):
            time.sleep(DELAY_BETWEEN_REQUESTS)

    print(f"\nDone: {len(results['success'])} succeeded, "
          f"{len(results['failed'])} failed")
    return results


# Usage
urls = [
    "https://docs.python.org/3/tutorial/index.html",
    "https://docs.python.org/3/library/os.html",
    "https://docs.python.org/3/library/pathlib.html",
    "https://docs.python.org/3/library/json.html",
]
batch_convert(urls, output_dir="python-docs")
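
For reference, url_to_filename flattens each URL into a filesystem-safe name, so the output directory stays readable:

url_to_filename("https://docs.python.org/3/library/os.html")
# -> 'docs.python.org_3_library_os.html.pdf'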

Reading URLs from a file

For larger batches, keep your URLs in a text file (one per line) and read them in:

def load_urls(filepath):
    """Load URLs from a text file, one URL per line."""
    with open(filepath) as f:
        return [line.strip() for line in f if line.strip()]

urls = load_urls("urls.txt")
batch_convert(urls)
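
A matching urls.txt holds nothing but URLs; blank lines are skipped, and every other non-empty line is treated as a URL:

https://docs.python.org/3/tutorial/index.html
https://docs.python.org/3/library/os.html
https://docs.python.org/3/library/pathlib.html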

Error handling and retries

A production script needs to handle rate limits, server errors, and network failures gracefully. This section adds retry logic to keep your batch running.

Retry with exponential backoff

The function below handles three failure modes: HTTP 429 (rate limited) by waiting for the interval in the Retry-After header, HTTP 5xx (server error) by retrying with exponential backoff, and network errors (connection failures and timeouts) by retrying with the same backoff. Permanent client errors (4xx other than 429) raise immediately so the calling code can log and skip them.

import time
import requests


def request_with_retry(method, url, max_retries=3, **kwargs):
    """Make an HTTP request with retry logic for rate limits and errors."""
    for attempt in range(max_retries + 1):
        try:
            response = method(url, timeout=30, **kwargs)

            # Success
            if response.status_code < 400:
                return response

            # Rate limited: wait and retry
            if response.status_code == 429:
                retry_after = int(response.headers.get("Retry-After", 10))
                print(f"  Rate limited. Waiting {retry_after}s...")
                time.sleep(retry_after)
                continue

            # Server error: retry with backoff
            if response.status_code >= 500:
                if attempt < max_retries:
                    wait = 2 ** attempt  # 1s, 2s, 4s
                    print(f"  Server error {response.status_code}. "
                          f"Retrying in {wait}s...")
                    time.sleep(wait)
                    continue

            # Permanent client error: do not retry
            response.raise_for_status()

        except requests.ConnectionError:
            if attempt < max_retries:
                wait = 2 ** attempt
                print(f"  Connection error. Retrying in {wait}s...")
                time.sleep(wait)
                continue
            raise

        except requests.Timeout:
            if attempt < max_retries:
                wait = 2 ** attempt
                print(f"  Timeout. Retrying in {wait}s...")
                time.sleep(wait)
                continue
            raise

    # Retries exhausted: raise if the last response was still an error
    # (e.g., a persistent 429) instead of returning it silently
    response.raise_for_status()
    return response

Using the retry function in batch processing

Replace direct requests.post() and requests.get() calls with request_with_retry():

# Instead of:
response = requests.post(f"{API_BASE}/generate/url", headers=HEADERS, json=payload)

# Use:
response = request_with_retry(
    requests.post,
    f"{API_BASE}/generate/url",
    headers=HEADERS,
    json=payload,
)
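
Putting it together, here is a sketch of one URL's generate-and-download sequence built on the retry helper. convert_with_retry is a name introduced here, and it assumes the API_BASE and HEADERS configuration from earlier:

def convert_with_retry(url, filepath, template="clean"):
    """Generate and download one PDF, retrying transient failures."""
    # Generate the PDF (retries on 429, 5xx, and network errors)
    response = request_with_retry(
        requests.post,
        f"{API_BASE}/generate/url",
        headers=HEADERS,
        json={"url": url, "template": template},
    )
    doc_id = response.json()["id"]

    # Download the finished file with the same retry behavior
    pdf_response = request_with_retry(
        requests.get,
        f"{API_BASE}/files/{doc_id}",
        headers=HEADERS,
    )
    with open(filepath, "wb") as f:
        f.write(pdf_response.content)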

With retry logic in place, transient failures no longer stop the entire batch. The script waits when rate limited, backs off on server errors, and only skips URLs that fail permanently.

Scheduling with cron

Run your script on a schedule to automate recurring PDF generation. Archive documentation every Monday, snapshot dashboards every morning, or back up content before monthly releases.

Self-contained script for cron

A cron job runs in a minimal environment, so the script should be self-contained with absolute paths and logging to a file:

#!/usr/bin/env python3
"""Weekly documentation archival script.
Run via cron: 0 6 * * 1 /usr/bin/python3 /home/user/scripts/archive_docs.py
"""
import os
import sys
import time
import logging
from datetime import datetime

import requests

# Configuration
API_KEY = os.environ.get("PRETTYPDF_API_KEY")
if not API_KEY:
    sys.exit("PRETTYPDF_API_KEY environment variable not set")

API_BASE = "https://api.prettypdfprinter.com/v1"
HEADERS = {"X-API-Key": API_KEY, "Content-Type": "application/json"}
OUTPUT_DIR = "/home/user/archives/docs"
URL_FILE = "/home/user/scripts/doc_urls.txt"
DELAY = 3

# Logging
date_str = datetime.now().strftime("%Y-%m-%d")
log_file = f"/home/user/logs/archive_{date_str}.log"
os.makedirs(os.path.dirname(log_file), exist_ok=True)
logging.basicConfig(filename=log_file, level=logging.INFO,
                    format="%(asctime)s %(message)s")

def main():
    # Create dated output directory
    out_dir = os.path.join(OUTPUT_DIR, date_str)
    os.makedirs(out_dir, exist_ok=True)

    # Load URLs
    with open(URL_FILE) as f:
        urls = [line.strip() for line in f if line.strip()]

    logging.info(f"Starting archive of {len(urls)} URLs")
    success, failed = 0, 0

    for i, url in enumerate(urls, 1):
        try:
            resp = requests.post(f"{API_BASE}/generate/url",
                                 headers=HEADERS,
                                 json={"url": url, "template": "clean"},
                                 timeout=30)
            resp.raise_for_status()
            doc_id = resp.json()["id"]

            pdf = requests.get(f"{API_BASE}/files/{doc_id}",
                               headers=HEADERS, timeout=30)
            pdf.raise_for_status()

            filename = f"{i:03d}.pdf"
            filepath = os.path.join(out_dir, filename)
            with open(filepath, "wb") as f:
                f.write(pdf.content)

            logging.info(f"OK: {url} -> {filepath}")
            success += 1
        except Exception as e:
            logging.error(f"FAIL: {url} -> {e}")
            failed += 1

        if i < len(urls):
            time.sleep(DELAY)

    logging.info(f"Done: {success} succeeded, {failed} failed")

if __name__ == "__main__":
    main()
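
Before scheduling, run the script once by hand to confirm the paths and permissions; this uses the paths from the configuration above:

chmod +x /home/user/scripts/archive_docs.py
PRETTYPDF_API_KEY="your-api-key-here" /home/user/scripts/archive_docs.py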

Crontab entry

Edit your crontab with crontab -e and add the following line to run the script every Monday at 6:00 AM:

# Archive documentation every Monday at 6 AM
0 6 * * 1 PRETTYPDF_API_KEY="your-key" /usr/bin/python3 /home/user/scripts/archive_docs.py

Windows Task Scheduler

On Windows, create a scheduled task that runs python C:\scripts\archive_docs.py on your preferred schedule. Set the PRETTYPDF_API_KEY environment variable in the task's environment settings or in your system environment variables.
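
One way to register the task from an elevated command prompt is schtasks; the task name and script path below are placeholders:

schtasks /create /tn "PrettyPDF Archive" /sc weekly /d MON /st 06:00 /tr "python C:\scripts\archive_docs.py"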

Use cases

Python automation with the Pretty PDF API fits a range of workflows where you need PDFs generated consistently and on schedule.

Automated weekly reports

Point the script at your internal dashboards and reporting pages. Every Monday morning, the cron job fetches each URL, generates a styled PDF, and saves it to a shared drive or uploads it to Slack. Stakeholders get consistent, readable reports without anyone manually exporting pages.

Documentation archival

Before a major version update, run the batch script against your documentation URLs to create a complete PDF archive of the current version. Store the snapshots alongside your release artifacts so you always have a historical record of what the docs said at each version.

Competitive analysis snapshots

Track competitor pricing pages, feature announcements, and landing pages by snapshotting them weekly. The dated output directories give you a timeline of changes. PDFs preserve the full layout and content in a format that is easy to share and reference in strategy meetings.

Compliance evidence collection

Regulatory and legal teams need proof of what was published on a specific date. Schedule the script to capture policy pages, terms of service, and public disclosures on a recurring basis. Each run produces timestamped PDFs that serve as evidence in audits and legal proceedings.

Content backup before migrations

Before migrating a website to a new CMS or redesigning a site, run the batch script against every page to create a complete PDF backup. If anything goes wrong during migration, you have a full archive of the original content and layout for reference and recovery.

Frequently asked questions about automating PDFs with Python

What Python version and libraries do I need?

Python 3.7 or later. The requests library is the only external dependency. You can verify your version by running python --version in your terminal. If you are using a virtual environment, make sure the environment is activated before installing dependencies.

How do I avoid hitting rate limits?

Add a delay between requests — 2 to 3 seconds for Pro+ accounts, 1 second for API Scale accounts. If you receive a 429 response, read the Retry-After header and wait that many seconds before continuing. Implementing exponential backoff for repeated 429 responses prevents your script from hammering the API during high-traffic periods.

Can I process hundreds of URLs in one run?

Yes. With API Scale (30 requests per minute), you can process hundreds of URLs in a single session. Implement proper rate limiting and error handling, and the script will work through the list sequentially. For very large batches, consider splitting the work across multiple scheduled runs rather than processing everything in one session.

How do I save the generated PDF to disk?

The API returns a download URL for each generated PDF. Use requests.get() to download the file and write it to disk with open(filename, 'wb'). Make sure to use binary write mode ('wb') since PDFs are binary files. You can derive meaningful filenames from the URL or the document title returned in the API response metadata.

Start automating PDF generation with Python

Get your API key, install requests, and turn any list of URLs into styled PDFs in minutes.