Pricing

from $3.90 / 1,000 pages converted

PDF to HTML Converter

Convert PDFs to clean HTML preserving formatting, headings, tables, and layout. Multi-page support with per-page or combined output. OCR fallback for image PDFs. Inline CSS styling. Download via API.

Developer

junipr

Last modified: a month ago

Introduction

Convert any PDF document to clean, semantic HTML that preserves the original document structure. Unlike most PDF-to-HTML tools that produce visual HTML with absolutely-positioned divs and spans (mimicking PDF layout pixel-by-pixel), this actor generates real semantic elements: <h1>-<h6> headings, <table> with <thead>/<tbody>, <ul>/<ol> lists, and <p> paragraphs. The output is valid HTML5, screen-reader accessible, and ready for web publishing, CMS import, content migration, or further processing in any pipeline. Batch processing is supported — convert hundreds of PDFs in a single run with configurable styling options.

Why Use This Actor

Most PDF-to-HTML converters produce "visual HTML" — absolute-positioned divs that look like the PDF but have no semantic meaning. This means tables are rendered as scattered text spans, headings are just bigger fonts, and lists become disconnected bullet characters. Our actor produces semantic HTML that browsers, search engines, screen readers, and CMS platforms can actually understand.

| Feature | This Actor | pdf2htmlEX | Adobe API | Online Tools |
| --- | --- | --- | --- | --- |
| Semantic HTML | Headings, tables, lists | Absolute positioning | Partial | Rarely |
| Table detection | Proper `<table>` | Positioned text | Yes | Poor |
| List detection | `<ul>` + `<ol>` | None | Partial | None |
| CSS options | Inline / class / none | Inline only | Class-based | Inline |
| Batch processing | Yes (up to 5,000 PDFs) | CLI only | Yes | Single file |
| Cost | $3/1K pages | Free (self-hosted) | $0.05/page | Freemium |
| Setup | Zero config | CLI install + Docker | API key required | Upload UI |

The output is WCAG-friendly: screen readers can navigate headings, read table headers, and traverse list items — something impossible with visual HTML output.
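To illustrate what navigable headings make possible, here is a minimal sketch that builds a document outline from semantic HTML using only Python's standard library. The sample markup mirrors the kind of output this actor describes; `OutlineParser` and `heading_outline` are names invented for this example and are not part of the actor.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect (level, text) pairs for every <h1>-<h6> element."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []    # finished (level, text) pairs
        self._level = None   # level of the heading currently open, if any
        self._chunks = []    # text pieces seen inside the open heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._chunks = []

    def handle_data(self, data):
        if self._level is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.outline.append((self._level, "".join(self._chunks).strip()))
            self._level = None


def heading_outline(html):
    """Return the heading outline of an HTML string as (level, text) pairs."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline


sample = (
    '<h1 class="heading-1">Annual Report 2024</h1>'
    '<p class="paragraph">Revenue grew 23% year-over-year...</p>'
    '<h2 class="heading-2">Financial Summary</h2>'
)

# Print an indented outline, one heading per line.
for level, text in heading_outline(sample):
    print("  " * (level - 1) + text)
```

The same approach would find nothing in absolutely positioned output, because that markup contains no heading elements to collect.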

How to Use

Zero-config example — just provide a PDF URL:

{
  "sources": [
    { "url": "https://example.com/report.pdf" }
  ]
}

Node.js (Apify Client):

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('junipr/pdf-to-html').call({
    sources: [{ url: 'https://example.com/document.pdf' }],
    stylingMode: 'class',
    wrapInDocument: true,
});

// Each converted PDF is one item in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].html);

Python:

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("junipr/pdf-to-html").call(run_input={
    "sources": [{"url": "https://example.com/document.pdf"}],
    "stylingMode": "class",
})

# Each converted PDF is one item in the run's default dataset.
items = client.dataset(run["defaultDatasetId"]).list_items().items
print(items[0]["html"])

Load from Apify Key-Value Store:

{
  "sources": [
    { "kvStoreKey": "my-document.pdf", "kvStoreId": "abc123" }
  ]
}

Input Configuration

All parameters except sources are optional. Common recipes:

Quick conversion — URL only, all defaults:

{ "sources": [{ "url": "https://example.com/doc.pdf" }] }

Web publishing — styled HTML document with WebP images:

{
  "sources": [{ "url": "https://example.com/doc.pdf" }],
  "stylingMode": "class",
  "wrapInDocument": true,
  "imageFormat": "webp",
  "includeDefaultStyles": true
}

CMS import — pure semantic HTML, no styling, no page breaks:

{
  "sources": [{ "url": "https://example.com/doc.pdf" }],
  "stylingMode": "none",
  "pageBreakMode": "none",
  "wrapInDocument": false
}

Print archive — preserve all formatting with inline CSS:

{
  "sources": [{ "url": "https://example.com/doc.pdf" }],
  "stylingMode": "inline",
  "preserveFontSizes": true,
  "preserveColors": true,
  "preserveFontStyles": true
}

See the Input tab for the full list of parameters with descriptions and defaults.
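The recipes above can also be assembled programmatically. The sketch below copies the option sets from this section into a lookup table; the recipe labels (`quick`, `web`, `cms`, `print`) and the `build_input` helper are invented for this example and are not values the actor recognizes.

```python
import json

# Option sets copied from the recipes in this section. The keys of this
# dictionary are labels made up for the sketch, not actor parameters.
RECIPES = {
    "quick": {},
    "web": {
        "stylingMode": "class",
        "wrapInDocument": True,
        "imageFormat": "webp",
        "includeDefaultStyles": True,
    },
    "cms": {
        "stylingMode": "none",
        "pageBreakMode": "none",
        "wrapInDocument": False,
    },
    "print": {
        "stylingMode": "inline",
        "preserveFontSizes": True,
        "preserveColors": True,
        "preserveFontStyles": True,
    },
}


def build_input(urls, recipe="quick"):
    """Return an actor input dict for the given PDF URLs and recipe label."""
    if not urls:
        raise ValueError("at least one PDF URL is required")
    run_input = {"sources": [{"url": url} for url in urls]}
    run_input.update(RECIPES[recipe])
    return run_input


print(json.dumps(build_input(["https://example.com/doc.pdf"], "cms"), indent=2))
```

The resulting dictionary can be passed as `run_input` to the Python client call shown in How to Use.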

Output Format

Each converted PDF produces one dataset item with the full HTML, per-page breakdown, image references, metadata, and conversion stats. Example output (fragment mode):

<h1 class="heading-1">Annual Report 2024</h1>
<p class="paragraph">Revenue grew 23% year-over-year...</p>

<h2 class="heading-2">Financial Summary</h2>
<table class="pdf-table">
  <thead>
    <tr><th>Quarter</th><th>Revenue</th><th>Growth</th></tr>
  </thead>
  <tbody>
    <tr><td>Q1</td><td>$2.1M</td><td>18%</td></tr>
    <tr><td>Q2</td><td>$2.5M</td><td>25%</td></tr>
  </tbody>
</table>

<ul class="pdf-list">
  <li>Expanded into 3 new markets</li>
  <li>Launched enterprise tier</li>
</ul>

When wrapInDocument is enabled, the output includes <!DOCTYPE html>, <html>, <head> with meta tags from PDF metadata, and a <style> block with the default or custom stylesheet.

Extracted images are stored in the run's Key-Value Store and referenced in the images array with dimensions, format, and originating page number.
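Since each dataset item carries the full HTML for one PDF, a common follow-up step is writing each item to its own file. The sketch below assumes only the `html` field, which the usage examples above read; the `save_html` helper and the `document-0001.html` naming scheme are choices made for this example.

```python
import tempfile
from pathlib import Path


def save_html(items, out_dir):
    """Write each item's HTML to out_dir and return the paths written.

    Items without an "html" value (for example, failed conversions, if the
    actor emits such items) are skipped but still consume an index, so file
    numbers line up with dataset positions.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, item in enumerate(items, start=1):
        html = item.get("html")
        if not html:
            continue
        path = out / f"document-{index:04d}.html"
        path.write_text(html, encoding="utf-8")
        written.append(path)
    return written


# Demo with stand-in items; real items would come from the dataset listing.
with tempfile.TemporaryDirectory() as tmp:
    paths = save_html([{"html": "<p>First</p>"}, {"html": "<p>Second</p>"}], tmp)
    print([p.name for p in paths])
```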

Tips and Advanced Usage

Multi-column PDFs: Leave detectColumns enabled (the default) so that multi-column text is merged in natural reading order instead of being interleaved across columns.
Custom CSS: Use customCss with stylingMode: "class" to inject your own styles. Class names like .heading-1, .paragraph, .pdf-table, .pdf-list are consistent across all documents.
Scanned PDFs: This actor works with text-based PDFs only. For scanned documents, run them through an OCR actor first, then convert the output.
Page selection: Use pageRange to convert only specific pages (e.g., "1-3,7") — you only pay for pages actually converted.
Batch optimization: Process up to 5,000 PDFs per run. Each PDF is processed sequentially to manage memory. For very large batches, increase the memory allocation to 4096 MB or higher.
CMS integration: Use stylingMode: "none" and wrapInDocument: false for WordPress, Contentful, or Strapi imports — these platforms apply their own styling.
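For collections larger than the 5,000-PDF limit mentioned above, one option is to split the source list into several run inputs. This is a minimal sketch; `chunk_sources` is a hypothetical helper, and starting one run per chunk is left to the client code shown in How to Use.

```python
def chunk_sources(urls, max_per_run=5000):
    """Split PDF URLs into actor inputs of at most max_per_run sources each."""
    if max_per_run < 1:
        raise ValueError("max_per_run must be at least 1")
    sources = [{"url": url} for url in urls]
    return [
        {"sources": sources[start:start + max_per_run]}
        for start in range(0, len(sources), max_per_run)
    ]


# Demo: 12,001 placeholder URLs are spread over three run inputs.
urls = [f"https://example.com/doc-{n}.pdf" for n in range(12001)]
batches = chunk_sources(urls)
print([len(batch["sources"]) for batch in batches])
```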

Pricing

This actor uses Pay-Per-Event (PPE) pricing at $3.00 per 1,000 pages converted ($0.003 per page).

| Scenario | Pages | Cost |
| --- | --- | --- |
| Single 10-page document | 10 | $0.03 |
| Product catalog (100 pages) | 100 | $0.30 |
| Legal contract batch (50 docs x 20 pages) | 1,000 | $3.00 |
| Website migration (500 PDFs x 5 pages) | 2,500 | $7.50 |
| Document archive (10K pages) | 10,000 | $30.00 |

Not billed: pages that fail to convert, scanned pages with no text, pages skipped by pageRange, empty pages, and duplicate PDFs. You only pay for successfully converted pages.

Compared to Adobe Document Services API ($0.05/page = $50/1K pages), this actor is 94% cheaper at any scale.
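The figures in this section follow from simple per-page arithmetic, which can be checked with a short sketch. The two rates are the ones quoted above; `estimate_cost` and `savings_percent` are names made up for this example, and `Decimal` is used so the dollar amounts are exact.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.003")       # $3.00 per 1,000 pages, as quoted above
ADOBE_PRICE_PER_PAGE = Decimal("0.05")  # comparison rate quoted above


def estimate_cost(pages, price_per_page=PRICE_PER_PAGE):
    """Return the cost in dollars for a number of successfully converted pages."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    return Decimal(pages) * price_per_page


def savings_percent(ours=PRICE_PER_PAGE, theirs=ADOBE_PRICE_PER_PAGE):
    """Return how much cheaper `ours` is than `theirs`, in percent."""
    return (1 - ours / theirs) * 100


for pages in (10, 100, 1000, 2500, 10000):
    print(f"{pages:>6} pages: ${estimate_cost(pages):.2f}")
print(f"Savings versus $0.05/page: {savings_percent():.0f}%")
```

Because both rates are flat per page, the percentage does not depend on the page count.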

FAQ

What makes this different from pdf2htmlEX?

pdf2htmlEX produces visual HTML — every text element is absolutely positioned with pixel coordinates, mimicking the PDF layout. While it looks identical to the original, the HTML has no semantic meaning. Our actor produces real <h1>, <table>, <ul>, and <p> elements that browsers, search engines, and screen readers understand. The tradeoff is that our output may not be pixel-perfect, but it is actually useful as HTML.

Can it handle tables with merged cells?

Yes. The actor detects table structures and attempts to identify colspan/rowspan relationships. For very complex tables (deeply nested or irregularly merged), it falls back to a <pre> block with formatted text to preserve readability.

What happens with scanned/image-only PDF pages?

Scanned pages are detected automatically and produce a SCANNED_PAGE_DETECTED warning. These pages are skipped (not billed) because there is no text to convert. For scanned documents, use an OCR actor first to extract text, then run this actor on the result.

Does it support password-protected PDFs?

Yes. Provide the password via the password input field (applies to all sources) or per-source via sources[].password. If a PDF is encrypted and no password is provided, you get an ENCRYPTED_NO_PASSWORD error. If the password is wrong, you get an INVALID_PASSWORD error.
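As a sketch of the two password options, the helper below builds an input that sets `sources[].password` for individual files and the top-level `password` field as a shared value. The `build_protected_input` function is hypothetical, and how the actor resolves a source that has both a shared and a per-source password is not stated here, so treat that combination as untested.

```python
def build_protected_input(sources, shared_password=None):
    """Build an actor input from (url, password) pairs.

    A password of None leaves the per-source field out, so that source
    relies on the shared password, if one is given.
    """
    run_input = {"sources": []}
    for url, password in sources:
        entry = {"url": url}
        if password is not None:
            entry["password"] = password
        run_input["sources"].append(entry)
    if shared_password is not None:
        run_input["password"] = shared_password
    return run_input


example = build_protected_input(
    [
        ("https://example.com/contract.pdf", "s3cret"),  # per-source password
        ("https://example.com/appendix.pdf", None),      # uses the shared one
    ],
    shared_password="team-default",
)
print(example)
```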

Can I customize the CSS output?

Yes. Choose between three styling modes: "class" (CSS classes + <style> block), "inline" (style attributes on each element), or "none" (pure semantic HTML). With class-based styling, you can inject custom CSS via the customCss field and toggle the built-in default styles with includeDefaultStyles.

How are images handled in the output?

When extractImages is enabled, embedded images are extracted from the PDF, converted to your chosen format (PNG, JPEG, or WebP), and stored in the run's Key-Value Store. The HTML output references images via <img> tags, and the images array in the dataset provides each image's KV store key, dimensions, format, and originating page number.

What's the maximum PDF file size?

The limit is configurable via maxFileSizeMb: the default is 100 MB and the maximum is 500 MB. For very large PDFs, increase the actor's memory allocation proportionally. A 200-page PDF with many images may need 4096 MB of memory.

Can I convert only specific pages?

Yes. Use the pageRange field with ranges like "1-5", "1,3,5", or "1-3,7,9-12". Pages outside the range are skipped and not billed. The output includes only the requested pages.
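To estimate how many pages a given range selects before starting a run, a range string in the formats shown above can be expanded locally. This is a sketch of one plausible interpretation; the actor's own parsing rules (whitespace, overlapping ranges, pages beyond the end of the document) are not documented here, and `parse_page_range` is a name invented for this example.

```python
def parse_page_range(spec):
    """Expand a string such as "1-3,7,9-12" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


selected = parse_page_range("1-3,7,9-12")
print(selected, len(selected))
```

Multiplying the page count by the per-page rate gives an upper bound on the charge for that document, since pages that fail to convert are not billed.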
