FEATURE DEEP-DIVE

Site-Specific Optimization — Smart Parsing for 8+ Platforms

Generic extraction works for most sites. But for the platforms you use every day, Pretty PDF goes further — with custom parsers that understand each site's unique structure.

Free — 3 PDFs per month. No credit card required.

Why generic extraction isn't enough

Every platform structures content differently, and a one-size-fits-all approach inevitably misses important details or captures irrelevant page chrome.

GitHub renders markdown with syntax-highlighted code blocks, collapsible details sections, and rich issue and PR discussion threads. Notion builds pages from dynamic blocks — toggles, databases, callouts, and embedded content — with class names that change on every page load. Reddit wraps its post content inside shadow DOM elements that are invisible to standard HTML parsers. Confluence hides content inside expand macros and renders code blocks with JavaScript.

A generic extraction algorithm can identify the main content area on most pages, but it does not understand these platform-specific patterns. It cannot expand a Notion toggle to reveal hidden content. It cannot reach inside Reddit's shadow DOM to extract the post body. It cannot reconstruct a GitHub discussion thread with proper nesting. These platforms need parsers that are specifically engineered to understand their unique HTML structures.

That is why Pretty PDF includes dedicated parsers for 8 platforms. Each parser is purpose-built to handle the idiosyncrasies of its platform, producing higher-quality extraction than any generic approach could achieve.

8 supported platforms

Each platform has a dedicated parser that understands its unique content structure. Here is what each parser handles.

GitHub

README rendering, issue threads, PR discussions, wiki pages, and code files with syntax highlighting. Multi-page support captures the full context of any GitHub page.

Stack Overflow

Question plus accepted answer and top answers, code blocks with formatting, and vote counts. Also covers Stack Exchange, Super User, Server Fault, and Ask Ubuntu.

Medium

Article extraction with custom domain support and member-only content when accessible. Detects Medium-powered sites via meta generator tags regardless of the domain name.

Notion

Toggle expansion, database views, callouts, code blocks, and embedded content. Resolves dynamic class names that change on every page load to ensure complete capture.

Dev.to

Liquid tag conversion for GitHub embeds, CodePen, and YouTube. Series navigation links are preserved so you can reference related articles in the same series.

Substack

Newsletter content with custom domain detection via meta generator tags and substackcdn URLs. Embedded media and images are preserved in the output PDF.

Reddit

Shadow DOM serialization from shreddit-post elements, post content plus top comments, and nested thread structure. Captures content that is invisible to standard HTML parsers.

Confluence

Expand macros, code macros, page hierarchy, and table rendering. Captures content hidden inside expandable sections and preserves the full page structure.

How auto-detection works

Pretty PDF automatically identifies which parser to use based on the URL or the site type provided by the Chrome extension.

When you send a URL to the API, Pretty PDF checks it against URL patterns registered by each of the 8 parsers. If the URL matches a known pattern — for example, a github.com path or a stackoverflow.com question link — the corresponding parser is selected automatically. No configuration is needed on your part.
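
To make the matching concrete, here is a minimal sketch in TypeScript of how a URL-to-parser registry could work. The pattern list, type names, and function name are illustrative assumptions, not Pretty PDF's actual implementation:

    // Illustrative sketch of URL-based parser selection. The patterns and
    // names below are assumptions for illustration only.
    type SiteType =
      | "github" | "stackoverflow" | "medium" | "notion"
      | "devto" | "substack" | "reddit" | "confluence";

    // Each parser registers one or more URL patterns it can handle.
    const URL_PATTERNS: Array<{ site: SiteType; pattern: RegExp }> = [
      { site: "github", pattern: /^https?:\/\/(www\.)?github\.com\// },
      { site: "stackoverflow", pattern: /^https?:\/\/([\w.-]+\.)?(stackoverflow|stackexchange|superuser|serverfault|askubuntu)\.com\// },
      { site: "devto", pattern: /^https?:\/\/(www\.)?dev\.to\// },
      { site: "reddit", pattern: /^https?:\/\/(www\.|old\.)?reddit\.com\// },
      { site: "notion", pattern: /^https?:\/\/(www\.)?notion\.(so|site)\// },
      // ...the remaining platforms follow the same shape
    ];

    // Returns the matching site type, or null to fall back to generic extraction.
    function detectSiteType(url: string): SiteType | null {
      const match = URL_PATTERNS.find(({ pattern }) => pattern.test(url));
      return match ? match.site : null;
    }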

When you use the Chrome extension, detection happens even earlier. The extension's content script identifies the site type before capturing the page HTML and sends it along with the request. The server uses this pre-identified site type to select the correct parser instantly, without any URL analysis.

For platforms like Notion, Reddit, and Confluence that rely on client-side rendering, the extension goes a step further. Before capturing the HTML, the site_preprocessor.js script runs platform-specific preprocessing on the live DOM — expanding toggles, serializing shadow DOM content, and capturing dynamically rendered elements. This preprocessed HTML and any associated site hints are sent to the server alongside the site type.
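
As a rough illustration, the payload the extension sends could take a shape like the following. The site_type and site_hints names appear on this page; the remaining fields are assumptions:

    // Rough shape of the request the extension could send after preprocessing.
    // Only site_type and site_hints are named on this page; the other fields
    // are assumptions for illustration.
    interface ExtensionCapture {
      html: string;                         // page HTML captured after live-DOM preprocessing
      site_type: string;                    // e.g. "notion", identified by the content script
      site_hints?: Record<string, unknown>; // platform-specific data gathered during preprocessing
      url?: string;                         // original page URL, kept for reference
    }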

If no parser matches the URL and no site type is provided, Pretty PDF falls back to its generic trafilatura-based extraction pipeline, which works well for any standard HTML content page.

Fidelity comparison

Not every entry point produces the same quality of extraction. The Chrome extension provides the highest fidelity across all platforms, while the API achieves full fidelity for server-rendered sites.

Entry point         | 5 server-rendered sites  | 3 DOM-dependent sites
Chrome Extension    | Full fidelity            | Full fidelity
API with URL        | Full fidelity            | Reduced fidelity
URL Fetch endpoint  | Full fidelity            | Reduced fidelity
API without URL     | Generic extraction only  | Generic extraction only

The five server-rendered sites — GitHub, Stack Overflow, Medium, Dev.to, and Substack — serve their content as fully rendered HTML. This means the API can extract content at the same quality as the Chrome extension because the raw HTML already contains everything the parser needs.

The three DOM-dependent sites — Notion, Reddit, and Confluence — rely on client-side JavaScript to render their content. Toggle blocks, shadow DOM elements, and expand macros are not present in the raw HTML. The API still works with these sites, but some content may be missing or incompletely rendered. For complete capture of these platforms, use the Chrome extension.

Extension preprocessing

Three platforms need live DOM access to achieve full fidelity. The Chrome extension handles this automatically before capturing the page HTML.

Notion

Notion builds pages from dynamic blocks with class names that are generated at runtime. The extension preprocessor expands all toggle blocks to reveal their hidden content, resolves dynamic class names to stable selectors, and captures database views, callouts, and embedded content in their fully rendered state. Without this preprocessing, exported Notion content would contain collapsed toggles and broken references.
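
A minimal sketch of what toggle expansion on a live page could look like, assuming collapsed toggles expose an aria-expanded attribute. The selector is an assumption, since Notion's real markup uses generated class names:

    // Minimal sketch of toggle expansion on a live Notion page. The selector is
    // an assumption; a robust preprocessor would key off more stable attributes.
    async function expandNotionToggles(root: Document = document): Promise<void> {
      // Repeat a few passes, because expanding one toggle can reveal nested toggles.
      for (let pass = 0; pass < 5; pass++) {
        const collapsed = root.querySelectorAll<HTMLElement>(
          '[role="button"][aria-expanded="false"]' // assumed selector for collapsed toggles
        );
        if (collapsed.length === 0) break;
        collapsed.forEach((el) => el.click());
        // Give the client-side renderer a moment to insert the revealed blocks.
        await new Promise((resolve) => setTimeout(resolve, 300));
      }
    }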

Reddit

Reddit's modern interface wraps post content inside shadow DOM elements using the shreddit-post custom element. Shadow DOM content is invisible to standard HTML capture methods — it exists in a separate DOM tree that is not part of the page's main HTML. The extension preprocessor serializes the shadow DOM content into regular HTML so it can be sent to the server and processed by the Reddit parser.
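
A sketch of how such serialization could work, assuming the shadow roots are open and reachable from a content script. The shreddit-post element comes from this page; the output format is illustrative:

    // Sketch of shadow DOM serialization for Reddit posts. shreddit-post is the
    // custom element named above; the serialization format is an assumption.
    function serializeShredditPosts(root: Document = document): string[] {
      const serialized: string[] = [];
      root.querySelectorAll("shreddit-post").forEach((post) => {
        // Only open shadow roots are reachable from a content script.
        const shadowHtml = post.shadowRoot ? post.shadowRoot.innerHTML : "";
        // Inline the shadow content as regular HTML so a server-side parser can see it.
        serialized.push(`<div data-shreddit-post>${shadowHtml}${post.innerHTML}</div>`);
      });
      return serialized;
    }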

Confluence

Confluence uses expand macros to hide content behind clickable headers, and code macros to render syntax-highlighted code blocks via JavaScript. The extension preprocessor opens all expand macro sections to capture their content and extracts the rendered code from code macros. It also preserves the page hierarchy and table rendering that Confluence generates dynamically.
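
A simplified sketch of opening expand macros before capture. The class names are assumptions based on common Confluence markup and may differ between versions:

    // Simplified sketch of opening expand macros before capture. Class names are
    // assumptions and may differ by Confluence version.
    function openConfluenceExpands(root: Document = document): void {
      root.querySelectorAll<HTMLElement>(".expand-control").forEach((trigger) => {
        const container = trigger.closest(".expand-container");    // assumed wrapper class
        const body = container?.querySelector<HTMLElement>(".expand-content");
        if (body && body.offsetParent === null) {
          trigger.click();           // open the section so its content enters the layout
          body.style.display = "";   // clear any inline style still hiding the body
        }
      });
    }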

All three preprocessors run inside the extension's site_preprocessor.js before the content script captures the page HTML. The preprocessing is transparent to the user — the extension detects the platform, runs the appropriate preprocessor, and includes the results as site_hints alongside the captured HTML.

Adding your own sites

The parser system is designed to be extensible, and the API accepts pre-processed HTML from any source.

If your site is not on the list of 8 supported platforms, the generic extraction pipeline still produces high-quality results for most standard content pages. Articles, blog posts, documentation, tutorials, and forum threads are all handled well by the trafilatura-based extraction engine.

For custom integrations, the API accepts pre-processed HTML that bypasses auto-detection entirely. If you submit a request with HTML but no URL and no site_type, the generic extraction pipeline processes the content directly. This means you can build your own preprocessing pipeline for any site and send the cleaned HTML to Pretty PDF for template application and PDF rendering.
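
For example, a custom integration could post its own cleaned HTML along these lines. The endpoint path, the html field name, and the template identifier are assumptions for illustration:

    // Sketch of a custom integration that posts pre-cleaned HTML directly.
    // With no url and no site_type in the body, the generic pipeline handles it.
    // The endpoint path and the "template" field are assumptions.
    async function renderCleanHtml(cleanHtml: string, apiKey: string): Promise<Blob> {
      const response = await fetch("https://api.example.com/v1/pdf", { // hypothetical endpoint
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Authorization": `Bearer ${apiKey}`,
        },
        body: JSON.stringify({
          html: cleanHtml,   // output of your own preprocessing pipeline
          template: "clean", // assumed template identifier
        }),
      });
      if (!response.ok) throw new Error(`PDF generation failed: ${response.status}`);
      return response.blob(); // the rendered PDF
    }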

Developers who want to submit well-structured HTML from a specific platform can pre-extract the content on their end and send it to the API. The API will apply your chosen template, embed fonts, set margins, and generate a professionally formatted PDF — even if it has never seen that particular site before.

Frequently asked questions about site-specific optimization

Which platforms have dedicated parsers?

Pretty PDF includes dedicated parsers for 8 platforms: GitHub (README files, issues, PRs, discussions, wiki pages, code files with syntax highlighting), Stack Overflow (questions, accepted and top answers, code blocks, vote counts — also covers Stack Exchange, Super User, Server Fault, and Ask Ubuntu), Medium (article extraction with custom domain support), Notion (toggle expansion, database views, callouts, code blocks, embedded content), Dev.to (liquid tag conversion for GitHub embeds, CodePen, YouTube, plus series navigation), Substack (newsletter content with custom domain detection and embedded media), Reddit (shadow DOM serialization, posts and top comments with nested thread structure), and Confluence (expand macros, code macros, page hierarchy, table rendering).

What about sites that are not on the list?

Generic trafilatura-based extraction works for any HTML website. It handles articles, blog posts, documentation, and most standard content pages by analyzing content density and separating the main content from site chrome like navigation, ads, and sidebars. The generic extraction produces high-quality results for the vast majority of websites — the dedicated parsers simply provide an extra level of optimization for the 8 supported platforms.

How does Pretty PDF know which parser to use?

When you submit a URL, Pretty PDF matches it against registered parser URL patterns to determine the site type. When using the Chrome extension, the extension's content script detects the site type before capturing the page and sends it along with the request, so the server knows exactly which parser to use without any URL analysis. If no parser matches the URL or site type, the generic trafilatura extraction pipeline is used instead.

Why do Notion, Reddit, and Confluence require the Chrome extension for full fidelity?

These three platforms rely heavily on client-side rendering. Notion uses dynamic blocks with toggle elements and dynamically generated class names that only exist in the live DOM. Reddit wraps post content inside shadow DOM elements (shreddit-post) that are not present in the raw HTML source. Confluence uses expand macros and code macros that are rendered by JavaScript after page load. The Chrome extension captures the fully rendered DOM and preprocesses these platform-specific elements before sending the HTML to the server, which is why it achieves full fidelity on these platforms.

Does the API work as well as the Chrome extension?

For GitHub, Stack Overflow, Medium, Dev.to, and Substack — yes, the API provides full fidelity because these platforms serve their content as server-rendered HTML. The API can extract everything the extension can from these sites. For Notion, Reddit, and Confluence, the API works but with reduced fidelity because these platforms depend on client-side JavaScript rendering. To get complete, fully rendered content from these three platforms, use the Chrome extension.

Platform-aware PDFs — try it free

Custom parsers for 8 platforms. Full fidelity from the Chrome extension. Professional templates for every document.

Install Free Extension