Web pages disappear, change, and break. A document archival system preserves critical content as searchable, permanent PDFs. This guide covers the architecture for building one with the Pretty PDF API.
API access requires a Pro+ plan ($12/mo). Full documentation is at api.prettypdfprinter.com/docs.
Link rot is real. Studies show the average web page has a half-life of about 2 years. The content you rely on today may not exist tomorrow.
Government notices, legal documents, regulatory pages, competitor content, research sources, supplier terms — all can vanish without warning. A URL that works today may return a 404 next month, redirect to an unrelated page next quarter, or disappear entirely when a company restructures its website.
PDF archival creates permanent, timestamped snapshots that serve as evidence, reference, and backup. Unlike bookmarks or saved links, a PDF captures the actual content at a specific point in time. You can prove what a page said, when you captured it, and reference the original URL for context.
Common archival use cases include:
- Regulatory and compliance monitoring: preserving government notices and regulatory pages as they stood on a given date
- Legal and contractual evidence: proving what a page said and when you captured it
- Competitive intelligence: tracking competitor pricing and product pages over time
- Research: keeping permanent copies of the sources you cite
- Vendor management: snapshotting supplier terms and policies
A document archival system has five components. Each is swappable, so you can start simple and scale each layer independently.
The high-level data flow looks like this:
URL Source → Scheduler → Pretty PDF API → Storage → Index
┌─────────────┐    ┌───────────┐    ┌──────────────┐    ┌──────────┐    ┌───────┐
│ URL Source  │───→│ Scheduler │───→│  Pretty PDF  │───→│ Storage  │───→│ Index │
│             │    │           │    │     API      │    │          │    │       │
│ - YAML list │    │ - cron    │    │  POST /v1/   │    │ - Local  │    │ SQLite│
│ - RSS feed  │    │ - Celery  │    │  generate/   │    │ - S3     │    │ FTS5  │
│ - Sitemap   │    │ - Airflow │    │  url         │    │ - GCS    │    │       │
│ - Web scrape│    │ - schedule│    │              │    │          │    │       │
└─────────────┘    └───────────┘    └──────────────┘    └──────────┘    └───────┘
URL Source provides the list of pages to archive. This can be a static file, a dynamic feed, or a programmatic discovery mechanism.
Scheduler triggers archival runs at defined intervals. A simple cron job works for small archives. Celery or Airflow handle complex workflows with dependencies and retries.
Pretty PDF API fetches each URL, extracts the content, applies a template, and returns a styled PDF. The API handles content extraction, site-specific parsing, and PDF rendering — your system just sends URLs and receives PDFs.
Storage holds the generated PDF files. Start with local filesystem storage and move to cloud object storage (S3, GCS) as your archive grows.
Index stores metadata about each archived document — URL, title, archive date, file path, tags — and provides search capability over the archive.
Your archive is only as good as your URL list. There are several approaches to building and maintaining it, from static files to automated discovery.
The simplest approach: maintain a YAML or JSON file with the URLs you want to archive. This works well for a known, stable set of pages — regulatory sites, competitor pages, key documentation.
# archive_urls.yaml
sources:
  - url: https://example.gov/regulations/2026-update
    category: regulatory
    frequency: weekly
  - url: https://competitor.com/pricing
    category: competitive
    frequency: daily
  - url: https://docs.supplier.com/api/terms
    category: vendor
    frequency: monthly
import yaml

def load_urls(config_path="archive_urls.yaml"):
    with open(config_path) as f:
        config = yaml.safe_load(f)
    return config["sources"]

urls = load_urls()
for source in urls:
    print(f"{source['url']} — {source['frequency']}")
For news sites and blogs, parse RSS feeds to discover new articles automatically. Use the feedparser library to extract URLs from any RSS or Atom feed. New entries get added to your archival queue without manual intervention.
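A minimal sketch of feed-based discovery using feedparser; discover_from_feed, the feed URL, and the default category are illustrative names, and load_urls is the loader from the static-list example above:

import feedparser

def discover_from_feed(feed_url, category="news"):
    """Turn every entry in an RSS/Atom feed into an archive source."""
    feed = feedparser.parse(feed_url)
    return [
        {"url": entry.get("link"), "category": category, "frequency": "daily"}
        for entry in feed.entries
        if entry.get("link")
    ]

# Merge discovered articles into the list loaded from archive_urls.yaml
urls = load_urls()
urls.extend(discover_from_feed("https://example.com/blog/rss.xml"))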
To archive an entire site, parse its sitemap.xml. Sitemaps list every public URL on a site and include last-modified dates, making it easy to detect new or changed pages. Use the xml.etree.ElementTree module or the ultimate-sitemap-parser library for nested sitemap indexes.
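For a flat sitemap, a sketch like the following works with httpx and the standard library; a sitemap index that nests other sitemaps would need recursion or the ultimate-sitemap-parser library mentioned above. discover_from_sitemap is an illustrative name:

import httpx
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def discover_from_sitemap(sitemap_url, category="reference"):
    """Return an archive source for every <url> entry in a flat sitemap.xml."""
    response = httpx.get(sitemap_url, timeout=30.0)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    sources = []
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url_el.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc:
            # lastmod lets you skip pages that have not changed since the last run
            sources.append({"url": loc, "lastmod": lastmod,
                            "category": category, "frequency": "weekly"})
    return sources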
For sites without feeds or sitemaps, use a scraper to discover URLs. Crawl a seed page, extract links matching your criteria (e.g., paths containing /blog/ or /docs/), and add them to your queue. Libraries like httpx and beautifulsoup4 handle this well. Be mindful of rate limits and robots.txt.
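A minimal single-page crawler sketch, assuming httpx and beautifulsoup4 are installed; discover_from_page and its defaults are illustrative:

import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def discover_from_page(seed_url, path_filter="/blog/", category="research"):
    """Fetch one seed page and collect links whose URL contains path_filter."""
    response = httpx.get(seed_url, timeout=30.0, follow_redirects=True)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    found = set()
    for a in soup.find_all("a", href=True):
        absolute = urljoin(seed_url, a["href"])
        if path_filter in absolute:
            found.add(absolute)
    return [{"url": u, "category": category, "frequency": "weekly"}
            for u in sorted(found)]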
Send URLs to the API with rate limiting, track status per URL, and store results with metadata. A processing loop with error handling keeps your archive running reliably.
Walk through your URL list, call the API for each one, and record the result. Handle failures gracefully — network errors, API rate limits, and timeouts should not stop the entire batch.
import httpx
import json
import os
import time
from datetime import datetime

API_KEY = "your_api_key"
API_URL = "https://api.prettypdfprinter.com/v1/generate/url"

def archive_urls(urls, output_dir="archive/"):
    os.makedirs(output_dir, exist_ok=True)  # ensure the output directory exists
    results = []
    for source in urls:
        url = source["url"]
        status = "pending"
        print(f"Archiving: {url}")
        try:
            response = httpx.post(
                API_URL,
                headers={"X-API-Key": API_KEY},
                json={"url": url, "template": "clean"},
                timeout=60.0
            )
            if response.status_code == 200:
                # Save the PDF (build_filename is defined in the storage section below)
                filename = build_filename(url)
                filepath = f"{output_dir}{filename}"
                with open(filepath, "wb") as f:
                    f.write(response.content)
                status = "completed"
                results.append({
                    "url": url,
                    "status": status,
                    "filepath": filepath,
                    "timestamp": datetime.utcnow().isoformat(),
                    "size_bytes": len(response.content)
                })
            elif response.status_code == 429:
                # Rate limited: wait, then leave the URL pending for the next run
                retry_after = int(response.headers.get("Retry-After", 60))
                print(f"Rate limited. Waiting {retry_after}s...")
                time.sleep(retry_after)
                results.append({
                    "url": url,
                    "status": "pending",  # recorded so the next run retries it
                    "timestamp": datetime.utcnow().isoformat()
                })
            else:
                status = "failed"
                results.append({
                    "url": url,
                    "status": status,
                    "error": f"HTTP {response.status_code}",
                    "timestamp": datetime.utcnow().isoformat()
                })
        except httpx.TimeoutException:
            status = "failed"
            results.append({
                "url": url,
                "status": status,
                "error": "timeout",
                "timestamp": datetime.utcnow().isoformat()
            })

        # Respect rate limits: pause between requests
        time.sleep(6)  # 10 requests/min = 1 every 6 seconds

    return results
Track every URL through four states: pending (not yet processed), processing (API call in progress), completed (PDF saved), and failed (error occurred). Store status in your index database so you can resume interrupted batches, retry failures, and report on archive completeness.
For each archived PDF, store: the source URL, the archive timestamp, the document ID or file path, the page title (from the API response or parsed from HTML), the file size, the category or tag from your URL list, and the content hash for change detection.
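As a sketch, each result dictionary returned by archive_urls above can be written into the archived_documents table defined in the storage section below. record_result is an illustrative helper, and SHA-256 is an assumed choice of content hash:

import hashlib
import sqlite3
from urllib.parse import urlparse

def record_result(db_path, result, category=None, frequency=None):
    """Insert one archival result into the metadata index."""
    content_hash = None
    if result.get("filepath"):
        with open(result["filepath"], "rb") as f:
            content_hash = hashlib.sha256(f.read()).hexdigest()
    conn = sqlite3.connect(db_path)
    # title is omitted here; fill it from the API response or parsed HTML if available
    conn.execute(
        """INSERT INTO archived_documents
           (url, domain, filepath, file_size, content_hash, category, frequency, status)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            result["url"],
            urlparse(result["url"]).netloc,
            result.get("filepath"),
            result.get("size_bytes"),
            content_hash,
            category,
            frequency,
            result["status"],
        ),
    )
    conn.commit()
    conn.close()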
Set up recurring archival runs at intervals that match how often your target pages change. Daily for news and regulatory sites, weekly for documentation, monthly for reference pages.
The simplest scheduler is a cron job. Add entries to your crontab for each archival frequency:
# Archive regulatory pages daily at 6am
0 6 * * * /usr/bin/python3 /opt/archiver/run.py --frequency daily
# Archive documentation weekly on Sunday at 2am
0 2 * * 0 /usr/bin/python3 /opt/archiver/run.py --frequency weekly
# Archive reference pages monthly on the 1st at 3am
0 3 1 * * /usr/bin/python3 /opt/archiver/run.py --frequency monthly
For a self-contained Python process, the schedule library provides a lightweight alternative to cron:
import schedule
import time
from archiver import run_archival

def daily_archive():
    run_archival(frequency="daily")

def weekly_archive():
    run_archival(frequency="weekly")

def monthly_archive():
    run_archival(frequency="monthly")

schedule.every().day.at("06:00").do(daily_archive)
schedule.every().sunday.at("02:00").do(weekly_archive)
schedule.every(30).days.do(monthly_archive)

while True:
    schedule.run_pending()
    time.sleep(60)
Before calling the API, check whether the URL has already been archived within the current period. This avoids generating duplicate PDFs and wasting API credits:
from datetime import datetime, timedelta

def should_archive(url, frequency, last_archived):
    if last_archived is None:
        return True
    now = datetime.utcnow()
    intervals = {
        "daily": timedelta(days=1),
        "weekly": timedelta(weeks=1),
        "monthly": timedelta(days=30)
    }
    threshold = intervals.get(frequency, timedelta(days=1))
    return (now - last_archived) >= threshold
A well-organized archive is a searchable archive. Consistent file naming, directory structure, and metadata storage make it easy to find any document months or years later.
Organize archived PDFs by date, source domain, or category. A date-based structure works well for most archives because it naturally partitions files and makes cleanup straightforward:
archive/
├── 2026/
│   ├── 02/
│   │   ├── 13/
│   │   │   ├── example-gov_regulations-2026-update_2026-02-13.pdf
│   │   │   ├── competitor-com_pricing_2026-02-13.pdf
│   │   │   └── docs-supplier-com_api-terms_2026-02-13.pdf
│   │   └── 14/
│   │       └── ...
│   └── 03/
│       └── ...
└── metadata.db
Use a consistent naming pattern: {domain}_{slug}_{date}.pdf. This makes files identifiable at a glance without needing the index database. Sanitize domain and slug to remove special characters:
import re
from urllib.parse import urlparse
from datetime import date

def build_filename(url):
    parsed = urlparse(url)
    domain = parsed.netloc.replace(".", "-")
    slug = parsed.path.strip("/").replace("/", "-")
    slug = re.sub(r"[^a-zA-Z0-9-]", "", slug)[:80]
    today = date.today().isoformat()
    return f"{domain}_{slug}_{today}.pdf"
Store metadata in a SQLite database alongside your archive. This gives you fast queries, full-text search, and a single file to back up:
CREATE TABLE archived_documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL,
    domain TEXT NOT NULL,
    title TEXT,
    filepath TEXT NOT NULL,
    archived_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    file_size INTEGER,
    content_hash TEXT,
    category TEXT,
    frequency TEXT,
    status TEXT DEFAULT 'completed'
);

CREATE INDEX idx_archived_url ON archived_documents(url);
CREATE INDEX idx_archived_domain ON archived_documents(domain);
CREATE INDEX idx_archived_date ON archived_documents(archived_at);
For archives exceeding a few thousand documents, consider cloud object storage (S3, Google Cloud Storage) for the PDF files while keeping the metadata index in SQLite locally. This separates the cheap, scalable blob storage from the fast, queryable metadata layer. Use the same filename conventions and organize objects with key prefixes that mirror your date-based directory structure.
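A minimal upload sketch with boto3; the bucket name is a placeholder, credentials come from the usual AWS configuration, and it assumes PDFs are saved under a local archive/ root with the date-based layout from the previous section:

import boto3

s3 = boto3.client("s3")

def upload_to_s3(filepath, bucket="my-archive-bucket"):
    """Upload one archived PDF, mirroring the local layout as the object key."""
    # e.g. archive/2026/02/13/example-gov_..._2026-02-13.pdf
    #   -> key 2026/02/13/example-gov_..._2026-02-13.pdf
    key = filepath.removeprefix("archive/")
    s3.upload_file(filepath, bucket, key)
    return f"s3://{bucket}/{key}"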
An archive you cannot search is just a pile of files. Build a search interface over your metadata so you can find any document by title, URL, domain, date, or tag.
SQLite's FTS5 extension provides fast full-text search across your metadata. Create a virtual table that indexes titles, URLs, and tags:
CREATE VIRTUAL TABLE archive_search USING fts5(
    title,
    url,
    category,
    content='archived_documents',
    content_rowid='id'
);

-- Search for documents related to regulations
SELECT ad.url, ad.title, ad.archived_at, ad.filepath
FROM archive_search AS s
JOIN archived_documents AS ad ON s.rowid = ad.id
WHERE archive_search MATCH 'regulations'
ORDER BY ad.archived_at DESC
LIMIT 20;
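A small Python wrapper around the query above, using the standard sqlite3 module; note that an external-content FTS5 table is not kept in sync automatically, so you either add triggers on archived_documents or rebuild the search index after bulk inserts:

import sqlite3

def search_archive(db_path, query, limit=20):
    """Full-text search over the metadata index.

    Returns (url, title, archived_at, filepath) rows, newest first.
    """
    conn = sqlite3.connect(db_path)
    # If rows were inserted without FTS-aware triggers, rebuild the index first:
    #   INSERT INTO archive_search(archive_search) VALUES('rebuild');
    rows = conn.execute(
        """SELECT ad.url, ad.title, ad.archived_at, ad.filepath
           FROM archive_search AS s
           JOIN archived_documents AS ad ON s.rowid = ad.id
           WHERE archive_search MATCH ?
           ORDER BY ad.archived_at DESC
           LIMIT ?""",
        (query, limit),
    ).fetchall()
    conn.close()
    return rows

print(search_archive("archive/metadata.db", "regulations"))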
Find everything archived within a specific period — useful for compliance audits and periodic reviews:
SELECT url, title, archived_at, filepath
FROM archived_documents
WHERE archived_at BETWEEN '2026-01-01' AND '2026-02-01'
AND category = 'regulatory'
ORDER BY archived_at DESC;
If you use Pretty PDF's cloud storage, your archived documents are automatically indexed and searchable through the cloud library interface. The GET /v1/documents API endpoint lets you search, filter, and paginate through your PDF library programmatically — no separate index database needed. This is the simplest path for teams that want archival without building their own metadata layer.
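A hedged sketch of what a programmatic lookup might look like; the query parameters and response shape shown here are assumptions, so check api.prettypdfprinter.com/docs for the exact contract:

import httpx

# Hypothetical query parameters; only the endpoint and API key header come from this guide.
response = httpx.get(
    "https://api.prettypdfprinter.com/v1/documents",
    headers={"X-API-Key": API_KEY},
    params={"search": "regulations", "page": 1},
    timeout=30.0,
)
response.raise_for_status()
for doc in response.json().get("documents", []):
    print(doc)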
Permanent, searchable PDFs from any web page. One API call per document, no browser automation required.