Feature Deep-Dive

Smart content extraction — how we remove the clutter

Every webpage is a mix of content you want and clutter you don't. Ads, navigation bars, cookie banners, social widgets, comment sections, related article links — they all end up in your PDF when you use Ctrl+P. Pretty PDF's content extraction engine separates the signal from the noise automatically, so your PDF contains only what matters.

What gets removed

Pretty PDF's extraction engine identifies and strips away the elements that have no place in a clean document. These are the six most common categories of clutter that the engine removes automatically before your PDF is generated.

Display ads and sponsored content

Banner ads, interstitial placements, sponsored post blocks, and ad-network iframes are all identified and removed. The engine detects common ad container patterns, known ad-serving domains, and sponsored content markers so none of them reach your final PDF.
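
For a concrete picture of what this kind of filtering looks like, here is a minimal sketch built on BeautifulSoup. The selectors and ad-network domains below are illustrative assumptions about common patterns, not Pretty PDF's actual rule set.

```python
from bs4 import BeautifulSoup

# Illustrative patterns only -- deliberately simplified, not the real rule set.
AD_SELECTORS = [
    '[class*="ad-"]', '[id*="ad-"]',           # common ad container naming
    '[class*="sponsored"]', '[data-ad-slot]',  # sponsored blocks and ad slots
]
AD_DOMAINS = ("doubleclick.net", "googlesyndication.com", "adnxs.com")

def strip_ads(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose class/id/data attributes look like ad containers.
    for selector in AD_SELECTORS:
        for node in soup.select(selector):
            if not node.decomposed:
                node.decompose()
    # Remove iframes served from known ad networks.
    for iframe in soup.find_all("iframe", src=True):
        if any(domain in iframe["src"] for domain in AD_DOMAINS):
            iframe.decompose()
    return str(soup)
```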

Navigation menus and site headers

Top navigation bars, hamburger menus, site-wide headers, and footer link blocks are stripped out. These elements are essential for browsing a website but serve no purpose in a document. Removing them often eliminates two or more pages of wasted space.

Cookie consent banners and modals

GDPR cookie notices, privacy consent overlays, and modal popups that cover page content are detected and removed. These elements frequently appear on top of the content area, and when printed via Ctrl+P they obscure the text underneath.

Social sharing buttons

Share-to-Twitter buttons, Facebook like widgets, LinkedIn share bars, and floating social sidebars are all removed. These interactive elements are useless in a PDF and often introduce layout-breaking iframes and fixed-position overlays.

Comment sections

User comment threads, Disqus embeds, and discussion areas below articles are discarded. Comment sections can add dozens of pages to a PDF and are rarely the content you are trying to save. If you do need comments, switch to Full Page mode.

Related article widgets and sidebar content

Recommended reading carousels, "you might also like" sections, sidebar widgets, and newsletter signup forms are all removed. These elements pad out the page count without adding to the content you came to save.

What gets preserved

Removing clutter is only half the job. The extraction engine is equally careful about preserving the content and structure that make a document useful. Semantic structure is maintained throughout — headings remain headings, links remain clickable, and lists keep their hierarchy.

Article text and headings

All body text, paragraphs, and heading levels (H1 through H6) are preserved with their original hierarchy intact. The heading structure flows naturally into the PDF template, producing a document with clear visual organization and a logical reading order.

Images and figures with captions

Inline images, figure elements, and their associated captions are retained and properly positioned within the document flow. Lazy-loaded images are resolved before rendering so nothing appears as a blank placeholder in the final PDF.
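
Lazy-loading is usually implemented by parking the real image URL in a data-* attribute until the image scrolls into view, so a pre-render pass can promote that URL into src before the PDF is generated. The sketch below shows the general idea; the attribute names are common conventions, not an exhaustive list of what the engine handles.

```python
from bs4 import BeautifulSoup

# Attribute names commonly used by lazy-loading libraries (illustrative list).
LAZY_ATTRS = ("data-src", "data-lazy-src", "data-original")

def resolve_lazy_images(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        for attr in LAZY_ATTRS:
            real_url = img.get(attr)
            if real_url:
                # Overwrite the placeholder src with the deferred image URL
                # so the PDF renderer loads the actual picture.
                img["src"] = real_url
                break
    return str(soup)
```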

Code blocks and preformatted text

Fenced code blocks, inline code spans, and preformatted text are preserved with monospace formatting and proper line wrapping. Long lines wrap cleanly instead of overflowing off the page edge, and indentation is maintained exactly as it appears on the original site.

Tables with column alignment

Data tables are preserved with their column alignment, borders, header rows, and cell spacing intact. The template styling applies clean table formatting that is optimized for print, so columns do not collapse or overflow regardless of content width.

Lists (ordered and unordered)

Bullet lists, numbered lists, and nested sublists maintain their hierarchy and indentation in the output PDF. Definition lists and checklists are also supported. The semantic list structure is preserved so the PDF reads the same way the original webpage does.

How it works — the three-layer pipeline

Pretty PDF does not use a single blunt filter. The extraction engine runs your page through three distinct layers, each designed to handle a different aspect of content identification and cleanup. Here is what happens between the moment you click "Generate PDF" and the moment your clean document downloads.

Layer 1: Site-specific parsing

For eight supported platforms — GitHub, Medium, Stack Overflow, Notion, Dev.to, Substack, Reddit, and Confluence — Pretty PDF uses dedicated parsers that understand each site's unique HTML structure. These parsers know exactly where the content lives on each platform and what to extract.

The GitHub parser, for example, handles README files, issue threads, pull request discussions, wiki pages, and individual code files with proper syntax formatting. The Stack Overflow parser extracts the question and accepted answer while discarding sidebar ads and related question links. The Notion parser expands toggle blocks and resolves dynamic class names that would otherwise produce blank output. Each parser is purpose-built to produce the highest-quality extraction for its platform.

When you use the Chrome extension, it detects the current site and tells the server which parser to use. For platforms like Notion, Reddit, and Confluence that rely on client-side rendering, the extension also preprocesses the live DOM — expanding toggles, serializing shadow DOM content, and capturing dynamic elements — before sending the HTML to the server.
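
As a rough sketch, dispatching to a site-specific parser can be as simple as matching the page's hostname against a registry and falling back to the general extractor for everything else. The domain keys and parser names below are illustrative assumptions rather than Pretty PDF's internal layout.

```python
from urllib.parse import urlparse

# Hypothetical hostname-to-parser registry (illustrative names).
PARSERS = {
    "github.com": "github",
    "medium.com": "medium",
    "stackoverflow.com": "stackoverflow",
    "notion.so": "notion",
    "dev.to": "devto",
    "substack.com": "substack",
    "reddit.com": "reddit",
    "atlassian.net": "confluence",
}

def pick_parser(url: str) -> str:
    host = urlparse(url).hostname or ""
    for domain, parser in PARSERS.items():
        if host == domain or host.endswith("." + domain):
            return parser
    return "trafilatura"  # fall through to Layer 2 for all other sites
```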

Layer 2: Trafilatura extraction

For all other websites — and there are millions of them — the engine uses trafilatura-based extraction. This is the same class of algorithms that power browser reader modes, tuned specifically for PDF output quality.

The algorithm analyzes content density across the page, measuring the ratio of visible text to HTML markup in each section. It identifies the main content area by looking for the largest contiguous block of high-density text. It detects article boundaries, scores page elements by their likelihood of being actual content versus site chrome, and assembles the extracted content into a clean, well-ordered document. Navigation bars score low. Article paragraphs score high. The engine keeps the high scorers and discards the rest.
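
For readers who want to see what this layer looks like in code, here is a minimal sketch using the open-source trafilatura library. The specific option values are assumptions about settings that suit PDF output, not Pretty PDF's exact configuration.

```python
import trafilatura

def extract_main_content(url: str) -> str | None:
    downloaded = trafilatura.fetch_url(url)  # raw HTML of the page
    if downloaded is None:
        return None
    # By default extract() returns plain text; output_format can request
    # structured output instead.
    return trafilatura.extract(
        downloaded,
        include_comments=False,  # drop comment threads
        include_tables=True,     # keep data tables
        include_images=True,     # keep inline figures
        include_links=True,      # keep hyperlinks
        favor_precision=True,    # prefer dropping borderline boilerplate
    )
```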

Layer 3: HTML sanitization

After extraction, the content passes through a BeautifulSoup-based sanitizer that handles the fine details. This layer removes residual scripts, tracking pixels, hidden elements, event handlers, and potentially dangerous markup. It strips inline styles that would conflict with the PDF template, resolves relative URLs so images and links work correctly in the output, and normalizes the HTML structure so WeasyPrint can render it cleanly.
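
A simplified sanitizer along these lines can be written with BeautifulSoup in a few dozen lines. The rules shown are illustrative; the production pass covers many more cases.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def sanitize(html: str, base_url: str) -> str:
    soup = BeautifulSoup(html, "html.parser")

    # Drop scripts, styles, and other non-content elements outright.
    for tag in soup(["script", "style", "noscript", "iframe", "template"]):
        tag.decompose()

    # Drop elements hidden from view (a simple heuristic).
    for tag in soup.find_all(True):
        if tag.decomposed:
            continue
        if tag.get("hidden") is not None or tag.get("aria-hidden") == "true":
            tag.decompose()

    # Remove inline event handlers (onclick, onload, ...) and inline styles.
    for tag in soup.find_all(True):
        for attr in [a for a in tag.attrs if a.startswith("on") or a == "style"]:
            del tag[attr]

    # Resolve relative URLs so links and images still work in the PDF.
    for tag in soup.find_all(["a", "img"]):
        for attr in ("href", "src"):
            if tag.get(attr):
                tag[attr] = urljoin(base_url, tag[attr])

    return str(soup)
```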

The result is a sanitized, well-structured HTML fragment containing only the visible content from the original page. This fragment is then wrapped in your chosen template and rendered into the final PDF.

Extraction in action

The difference between a browser PDF and an extracted PDF is not subtle. Here is how page counts compare across four common content types when the extraction engine removes the clutter.

Content type | Before (Ctrl+P) | After (Pretty PDF)
News article | 9 pages with ads, nav, and related links | 2 pages of clean article text
Blog post | 6 pages with sidebar, popups, and comment section | 3 pages of focused content
Documentation page | 5 pages with nav tree, breadcrumbs, and footer | 4 pages of reference content
Forum thread | 8 pages with UI chrome, user badges, and sidebars | 3 pages of Q&A content

Page count reduction depends on how cluttered the original page is. Ad-heavy news sites typically see the largest improvement — sometimes a 75% reduction in page count. Documentation pages with minimal chrome see smaller but still meaningful improvements. In every case, the result is a tighter document that contains only what you intended to save.

When to use Full Page instead

Content extraction is not always what you want. Sometimes the "clutter" is actually content you need to preserve.

Article mode is the default because it produces the best results for the most common use case: saving an article, blog post, or documentation page as a focused document. But there are situations where you want everything on the page, not just the extracted content.

Full Page mode captures everything visible on the page without any content filtering. The template styling still applies — your PDF will have professional typography and clean margins — but no elements are removed. This mode is useful when you need to:

  • Archive a landing page design — preserve the complete layout, including hero sections, feature grids, pricing tables, and footer content, exactly as they appear on screen.
  • Capture dashboards and data views — analytics dashboards, admin panels, and data-heavy interfaces where every widget and sidebar is part of the content you need.
  • Preserve page layouts for evidence — legal documentation, compliance snapshots, or screenshots-for-the-record where the complete page state matters.
  • Save pages where the "clutter" is content — e-commerce product pages where the sidebar shows specifications, or wiki pages where the table of contents is essential context.

You can switch between Article and Full Page mode in the extension popup before generating your PDF. If you only need a specific section of the page, Selection mode lets you highlight exactly the content you want.

Frequently asked questions about content extraction

How does the extraction engine decide what to remove?

The extraction engine uses a three-layer pipeline. First, it checks whether the page comes from one of 8 supported platforms (GitHub, Medium, Stack Overflow, Notion, Dev.to, Substack, Reddit, Confluence) and applies a site-specific parser that knows exactly where the content lives. For all other sites, it uses trafilatura-based extraction — the same class of algorithms behind browser reader modes — which analyzes content density, text-to-markup ratios, and DOM structure to score every element on the page. Elements with high content scores (article text, headings, images) are kept. Elements with low scores (ads, navigation, sidebars) are discarded. Finally, a sanitization pass removes any residual scripts, tracking pixels, and hidden elements.

Can I keep something that the extraction removed?

Yes. If the automatic extraction removes something you want to keep, switch to Full Page mode in the extension popup. Full Page captures everything visible on the page without any content filtering. You can also use Selection mode to highlight exactly the content you want — only your selection will appear in the final PDF. Between Article, Full Page, and Selection modes, you have full control over what ends up in your document.

What if the extraction misses content I wanted?

This can happen on pages with unusual layouts where content appears in sidebars or non-standard containers. If you notice missing content, try Full Page mode to capture everything, or use Selection mode to manually highlight the specific content you need. For the 8 supported platforms, the dedicated parsers are tuned to capture all relevant content correctly. If you encounter a consistent issue on a specific site, contact our support team — we actively refine the extraction algorithms based on user feedback.

Does content extraction work on every website?

The extraction engine works on any website that serves HTML content. It performs best on standard content pages like articles, blog posts, documentation, and forum threads. Pages that rely heavily on JavaScript rendering (single-page applications with no server-side content) may produce limited results when using the URL fetch API, but work correctly through the Chrome extension since the extension captures the fully rendered DOM. The 8 supported platforms receive the highest-quality extraction through their dedicated parsers.

How is this different from my browser's reader mode?

Browser reader mode and Pretty PDF's extraction engine use similar underlying algorithms for identifying main content, but they serve different purposes and produce different results. Reader mode reformats content for on-screen reading with a single font and minimal styling. Pretty PDF extracts the same content but then applies one of 5 professional PDF templates with embedded fonts, proper page margins, and print-optimized typography. Additionally, Pretty PDF's site-specific parsers for 8 platforms go beyond what reader mode can do — handling shadow DOM content, expanding toggle blocks, resolving dynamic class names, and preserving platform-specific formatting like code syntax highlighting and vote counts.

See the extraction in action — try it free

No more ads, no more clutter, no more wasted pages. Just the content you want, professionally styled.

Install Free Extension