Web pages disappear, change, and break. A document archival system preserves critical content as searchable, permanent PDFs. This guide covers the architecture for building one with the Pretty PDF API.
API access requires a Pro+ plan ($12/mo). Full documentation is at api.prettypdfprinter.com/docs.
Link rot is real. Studies show the average web page has a half-life of about 2 years. The content you rely on today may not exist tomorrow.
Government notices, legal documents, regulatory pages, competitor content, research sources, supplier terms — all can vanish without warning. A URL that works today may return a 404 next month, redirect to an unrelated page next quarter, or disappear entirely when a company restructures its website.
PDF archival creates permanent, timestamped snapshots that serve as evidence, reference, and backup. Unlike bookmarks or saved links, a PDF captures the actual content at a specific point in time. You can prove what a page said, when you captured it, and reference the original URL for context.
Common archival use cases include:
- Regulatory and compliance monitoring: preserving government notices and regulatory pages as they stood on a given date
- Legal and contractual evidence: proving what a page said and when you captured it
- Competitive intelligence: tracking competitor pricing and product pages over time
- Research: keeping permanent copies of the sources you cite
- Vendor management: snapshotting supplier terms and policies
A document archival system has five components. Each is swappable, so you can start simple and scale each layer independently.
The high-level data flow looks like this:
URL Source → Scheduler → Pretty PDF API → Storage → Index
┌─────────────┐    ┌───────────┐    ┌──────────────┐    ┌──────────┐    ┌───────┐
│ URL Source  │───→│ Scheduler │───→│  Pretty PDF  │───→│ Storage  │───→│ Index │
│             │    │           │    │     API      │    │          │    │       │
│ - YAML list │    │ - cron    │    │  POST /v1/   │    │ - Local  │    │ SQLite│
│ - RSS feed  │    │ - Celery  │    │  generate/   │    │ - S3     │    │ FTS5  │
│ - Sitemap   │    │ - Airflow │    │  url         │    │ - GCS    │    │       │
│ - Web scrape│    │ - schedule│    │              │    │          │    │       │
└─────────────┘    └───────────┘    └──────────────┘    └──────────┘    └───────┘
URL Source provides the list of pages to archive. This can be a static file, a dynamic feed, or a programmatic discovery mechanism.
Scheduler triggers archival runs at defined intervals. A simple cron job works for small archives. Celery or Airflow handle complex workflows with dependencies and retries.
Pretty PDF API fetches each URL, extracts the content, applies a template, and returns a styled PDF. The API handles content extraction, site-specific parsing, and PDF rendering — your system just sends URLs and receives PDFs.
Storage holds the generated PDF files. Start with local filesystem storage and move to cloud object storage (S3, GCS) as your archive grows.
Index stores metadata about each archived document — URL, title, archive date, file path, tags — and provides search capability over the archive.
Your archive is only as good as your URL list. There are several approaches to building and maintaining it, from static files to automated discovery.
The simplest approach: maintain a YAML or JSON file with the URLs you want to archive. This works well for a known, stable set of pages — regulatory sites, competitor pages, key documentation.
# archive_urls.yaml
sources:
  - url: https://example.gov/regulations/2026-update
    category: regulatory
    frequency: weekly
  - url: https://competitor.com/pricing
    category: competitive
    frequency: daily
  - url: https://docs.supplier.com/api/terms
    category: vendor
    frequency: monthly
import yaml

def load_urls(config_path="archive_urls.yaml"):
    with open(config_path) as f:
        config = yaml.safe_load(f)
    return config["sources"]

urls = load_urls()
for source in urls:
    print(f"{source['url']} — {source['frequency']}")
For news sites and blogs, parse RSS feeds to discover new articles automatically. Use the feedparser library to extract URLs from any RSS or Atom feed. New entries get added to your archival queue without manual intervention.
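A minimal sketch of feed-based discovery using feedparser; discover_from_feed, the feed URL, and the default category are illustrative names, and load_urls is the loader from the static-list example above:

import feedparser

def discover_from_feed(feed_url, category="news"):
    """Turn every entry in an RSS/Atom feed into an archive source."""
    feed = feedparser.parse(feed_url)
    return [
        {"url": entry.get("link"), "category": category, "frequency": "daily"}
        for entry in feed.entries
        if entry.get("link")
    ]

# Merge discovered articles into the list loaded from archive_urls.yaml
urls = load_urls()
urls.extend(discover_from_feed("https://example.com/blog/rss.xml"))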
To archive an entire site, parse its sitemap.xml. Sitemaps list every public URL on a site and include last-modified dates, making it easy to detect new or changed pages. Use the xml.etree.ElementTree module or the ultimate-sitemap-parser library for nested sitemap indexes.
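For a flat sitemap, a sketch like the following works with httpx and the standard library; a sitemap index that nests other sitemaps would need recursion or the ultimate-sitemap-parser library mentioned above. discover_from_sitemap is an illustrative name:

import httpx
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def discover_from_sitemap(sitemap_url, category="reference"):
    """Return an archive source for every <url> entry in a flat sitemap.xml."""
    response = httpx.get(sitemap_url, timeout=30.0)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    sources = []
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url_el.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc:
            # lastmod lets you skip pages that have not changed since the last run
            sources.append({"url": loc, "lastmod": lastmod,
                            "category": category, "frequency": "weekly"})
    return sources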
For sites without feeds or sitemaps, use a scraper to discover URLs. Crawl a seed page, extract links matching your criteria (e.g., paths containing /blog/ or /docs/), and add them to your queue. Libraries like httpx and beautifulsoup4 handle this well. Be mindful of rate limits and robots.txt.
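A minimal single-page crawler sketch, assuming httpx and beautifulsoup4 are installed; discover_from_page and its defaults are illustrative:

import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def discover_from_page(seed_url, path_filter="/blog/", category="research"):
    """Fetch one seed page and collect links whose URL contains path_filter."""
    response = httpx.get(seed_url, timeout=30.0, follow_redirects=True)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    found = set()
    for a in soup.find_all("a", href=True):
        absolute = urljoin(seed_url, a["href"])
        if path_filter in absolute:
            found.add(absolute)
    return [{"url": u, "category": category, "frequency": "weekly"}
            for u in sorted(found)]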
Send URLs to the API with rate limiting, track status per URL, and store results with metadata. A processing loop with error handling keeps your archive running reliably.
Walk through your URL list, call the API for each one, and record the result. Handle failures gracefully — network errors, API rate limits, and timeouts should not stop the entire batch.
import httpx
import json
import os
import time
from datetime import datetime

API_KEY = "your_api_key"
API_URL = "https://api.prettypdfprinter.com/v1/generate/url"

def archive_urls(urls, output_dir="archive/"):
    os.makedirs(output_dir, exist_ok=True)  # ensure the output directory exists
    results = []
    for source in urls:
        url = source["url"]
        status = "pending"
        print(f"Archiving: {url}")
        try:
            response = httpx.post(
                API_URL,
                headers={"X-API-Key": API_KEY},
                json={"url": url, "template": "clean"},
                timeout=60.0
            )
            if response.status_code == 200:
                # Save the PDF (build_filename is defined in the storage section below)
                filename = build_filename(url)
                filepath = f"{output_dir}{filename}"
                with open(filepath, "wb") as f:
                    f.write(response.content)
                status = "completed"
                results.append({
                    "url": url,
                    "status": status,
                    "filepath": filepath,
                    "timestamp": datetime.utcnow().isoformat(),
                    "size_bytes": len(response.content)
                })
            elif response.status_code == 429:
                # Rate limited: wait, then leave the URL pending for the next run
                retry_after = int(response.headers.get("Retry-After", 60))
                print(f"Rate limited. Waiting {retry_after}s...")
                time.sleep(retry_after)
                results.append({
                    "url": url,
                    "status": "pending",  # recorded so the next run retries it
                    "timestamp": datetime.utcnow().isoformat()
                })
            else:
                status = "failed"
                results.append({
                    "url": url,
                    "status": status,
                    "error": f"HTTP {response.status_code}",
                    "timestamp": datetime.utcnow().isoformat()
                })
        except httpx.TimeoutException:
            status = "failed"
            results.append({
                "url": url,
                "status": status,
                "error": "timeout",
                "timestamp": datetime.utcnow().isoformat()
            })

        # Respect rate limits: pause between requests
        time.sleep(6)  # 10 requests/min = 1 every 6 seconds

    return results
Track every URL through four states: pending (not yet processed), processing (API call in progress), completed (PDF saved), and failed (error occurred). Store status in your index database so you can resume interrupted batches, retry failures, and report on archive completeness.
For each archived PDF, store: the source URL, the archive timestamp, the document ID or file path, the page title (from the API response or parsed from HTML), the file size, the category or tag from your URL list, and the content hash for change detection.
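As a sketch, each result dictionary returned by archive_urls above can be written into the archived_documents table defined in the storage section below. record_result is an illustrative helper, and SHA-256 is an assumed choice of content hash:

import hashlib
import sqlite3
from urllib.parse import urlparse

def record_result(db_path, result, category=None, frequency=None):
    """Insert one archival result into the metadata index."""
    content_hash = None
    if result.get("filepath"):
        with open(result["filepath"], "rb") as f:
            content_hash = hashlib.sha256(f.read()).hexdigest()
    conn = sqlite3.connect(db_path)
    # title is omitted here; fill it from the API response or parsed HTML if available
    conn.execute(
        """INSERT INTO archived_documents
           (url, domain, filepath, file_size, content_hash, category, frequency, status)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            result["url"],
            urlparse(result["url"]).netloc,
            result.get("filepath"),
            result.get("size_bytes"),
            content_hash,
            category,
            frequency,
            result["status"],
        ),
    )
    conn.commit()
    conn.close()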
Set up recurring archival runs at intervals that match how often your target pages change. Daily for news and regulatory sites, weekly for documentation, monthly for reference pages.
The simplest scheduler is a cron job. Add entries to your crontab for each archival frequency:
# Archive regulatory pages daily at 6am
0 6 * * * /usr/bin/python3 /opt/archiver/run.py --frequency daily
# Archive documentation weekly on Sunday at 2am
0 2 * * 0 /usr/bin/python3 /opt/archiver/run.py --frequency weekly
# Archive reference pages monthly on the 1st at 3am
0 3 1 * * /usr/bin/python3 /opt/archiver/run.py --frequency monthly
For a self-contained Python process, the schedule library provides a lightweight alternative to cron:
import schedule
import time
from archiver import run_archival

def daily_archive():
    run_archival(frequency="daily")

def weekly_archive():
    run_archival(frequency="weekly")

def monthly_archive():
    run_archival(frequency="monthly")

schedule.every().day.at("06:00").do(daily_archive)
schedule.every().sunday.at("02:00").do(weekly_archive)
schedule.every(30).days.do(monthly_archive)

while True:
    schedule.run_pending()
    time.sleep(60)
Before calling the API, check whether the URL has already been archived within the current period. This avoids generating duplicate PDFs and wasting API credits:
from datetime import datetime, timedelta

def should_archive(url, frequency, last_archived):
    if last_archived is None:
        return True
    now = datetime.utcnow()
    intervals = {
        "daily": timedelta(days=1),
        "weekly": timedelta(weeks=1),
        "monthly": timedelta(days=30)
    }
    threshold = intervals.get(frequency, timedelta(days=1))
    return (now - last_archived) >= threshold
A well-organized archive is a searchable archive. Consistent file naming, directory structure, and metadata storage make it easy to find any document months or years later.
Organize archived PDFs by date, source domain, or category. A date-based structure works well for most archives because it naturally partitions files and makes cleanup straightforward:
archive/
├── 2026/
│   ├── 02/
│   │   ├── 13/
│   │   │   ├── example-gov_regulations-2026-update_2026-02-13.pdf
│   │   │   ├── competitor-com_pricing_2026-02-13.pdf
│   │   │   └── docs-supplier-com_api-terms_2026-02-13.pdf
│   │   └── 14/
│   │       └── ...
│   └── 03/
│       └── ...
└── metadata.db
Use a consistent naming pattern: {domain}_{slug}_{date}.pdf. This makes files identifiable at a glance without needing the index database. Sanitize domain and slug to remove special characters:
import re
from urllib.parse import urlparse
from datetime import date

def build_filename(url):
    parsed = urlparse(url)
    domain = parsed.netloc.replace(".", "-")
    slug = parsed.path.strip("/").replace("/", "-")
    slug = re.sub(r"[^a-zA-Z0-9-]", "", slug)[:80]
    today = date.today().isoformat()
    return f"{domain}_{slug}_{today}.pdf"
Store metadata in a SQLite database alongside your archive. This gives you fast queries, full-text search, and a single file to back up:
CREATE TABLE archived_documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL,
    domain TEXT NOT NULL,
    title TEXT,
    filepath TEXT NOT NULL,
    archived_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    file_size INTEGER,
    content_hash TEXT,
    category TEXT,
    frequency TEXT,
    status TEXT DEFAULT 'completed'
);

CREATE INDEX idx_archived_url ON archived_documents(url);
CREATE INDEX idx_archived_domain ON archived_documents(domain);
CREATE INDEX idx_archived_date ON archived_documents(archived_at);
For archives exceeding a few thousand documents, consider cloud object storage (S3, Google Cloud Storage) for the PDF files while keeping the metadata index in SQLite locally. This separates the cheap, scalable blob storage from the fast, queryable metadata layer. Use the same filename conventions and organize objects with key prefixes that mirror your date-based directory structure.
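A minimal upload sketch with boto3; the bucket name is a placeholder, credentials come from the usual AWS configuration, and it assumes PDFs are saved under a local archive/ root with the date-based layout from the previous section:

import boto3

s3 = boto3.client("s3")

def upload_to_s3(filepath, bucket="my-archive-bucket"):
    """Upload one archived PDF, mirroring the local layout as the object key."""
    # e.g. archive/2026/02/13/example-gov_..._2026-02-13.pdf
    #   -> key 2026/02/13/example-gov_..._2026-02-13.pdf
    key = filepath.removeprefix("archive/")
    s3.upload_file(filepath, bucket, key)
    return f"s3://{bucket}/{key}"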
An archive you cannot search is just a pile of files. Build a search interface over your metadata so you can find any document by title, URL, domain, date, or tag.
SQLite's FTS5 extension provides fast full-text search across your metadata. Create a virtual table that indexes titles, URLs, and tags:
CREATE VIRTUAL TABLE archive_search USING fts5(
    title,
    url,
    category,
    content='archived_documents',
    content_rowid='id'
);

-- Search for documents related to regulations
SELECT ad.url, ad.title, ad.archived_at, ad.filepath
FROM archive_search AS s
JOIN archived_documents AS ad ON s.rowid = ad.id
WHERE archive_search MATCH 'regulations'
ORDER BY ad.archived_at DESC
LIMIT 20;
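A small Python wrapper around the query above, using the standard sqlite3 module; note that an external-content FTS5 table is not kept in sync automatically, so you either add triggers on archived_documents or rebuild the search index after bulk inserts:

import sqlite3

def search_archive(db_path, query, limit=20):
    """Full-text search over the metadata index.

    Returns (url, title, archived_at, filepath) rows, newest first.
    """
    conn = sqlite3.connect(db_path)
    # If rows were inserted without FTS-aware triggers, rebuild the index first:
    #   INSERT INTO archive_search(archive_search) VALUES('rebuild');
    rows = conn.execute(
        """SELECT ad.url, ad.title, ad.archived_at, ad.filepath
           FROM archive_search AS s
           JOIN archived_documents AS ad ON s.rowid = ad.id
           WHERE archive_search MATCH ?
           ORDER BY ad.archived_at DESC
           LIMIT ?""",
        (query, limit),
    ).fetchall()
    conn.close()
    return rows

print(search_archive("archive/metadata.db", "regulations"))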
Find everything archived within a specific period — useful for compliance audits and periodic reviews:
SELECT url, title, archived_at, filepath
FROM archived_documents
WHERE archived_at BETWEEN '2026-01-01' AND '2026-02-01'
AND category = 'regulatory'
ORDER BY archived_at DESC;
If you use Pretty PDF's cloud storage, your archived documents are automatically indexed and searchable through the cloud library interface. The GET /v1/documents API endpoint lets you search, filter, and paginate through your PDF library programmatically — no separate index database needed. This is the simplest path for teams that want archival without building their own metadata layer.
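A hedged sketch of what a programmatic lookup might look like; the query parameters and response shape shown here are assumptions, so check api.prettypdfprinter.com/docs for the exact contract:

import httpx

# Hypothetical query parameters; only the endpoint and API key header come from this guide.
response = httpx.get(
    "https://api.prettypdfprinter.com/v1/documents",
    headers={"X-API-Key": API_KEY},
    params={"search": "regulations", "page": 1},
    timeout=30.0,
)
response.raise_for_status()
for doc in response.json().get("documents", []):
    print(doc)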
Permanent, searchable PDFs from any web page. One API call per document, no browser automation required.