PDF to HTML Converter

Pricing

from $3.00 / 1,000 pages converted

Convert PDFs to clean HTML preserving formatting, headings, tables, and layout. Multi-page support with per-page or combined output. OCR fallback for image PDFs. Inline CSS styling. Download via API.

Developer

junipr

Maintained by Community

Introduction

Convert any PDF document to clean, semantic HTML that preserves the original document structure. Unlike most PDF-to-HTML tools that produce visual HTML with absolutely-positioned divs and spans (mimicking PDF layout pixel-by-pixel), this actor generates real semantic elements: <h1>-<h6> headings, <table> with <thead>/<tbody>, <ul>/<ol> lists, and <p> paragraphs. The output is valid HTML5, screen-reader accessible, and ready for web publishing, CMS import, content migration, or further processing in any pipeline. Batch processing is supported — convert hundreds of PDFs in a single run with configurable styling options.

Why Use This Actor

Most PDF-to-HTML converters produce "visual HTML" — absolute-positioned divs that look like the PDF but have no semantic meaning. This means tables are rendered as scattered text spans, headings are just bigger fonts, and lists become disconnected bullet characters. Our actor produces semantic HTML that browsers, search engines, screen readers, and CMS platforms can actually understand.

| Feature | This Actor | pdf2htmlEX | Adobe API | Online Tools |
| --- | --- | --- | --- | --- |
| Semantic HTML | Headings, tables, lists | Absolute positioning | Partial | Rarely |
| Table detection | Proper <table> | Positioned text | Yes | Poor |
| List detection | <ul> + <ol> | None | Partial | None |
| CSS options | Inline / class / none | Inline only | Class-based | Inline |
| Batch processing | Yes (up to 5,000 PDFs) | CLI only | Yes | Single file |
| Cost | $3/1K pages | Free (self-hosted) | $0.05/page | Freemium |
| Setup | Zero config | CLI install + Docker | API key required | Upload UI |

The output is WCAG-friendly: screen readers can navigate headings, read table headers, and traverse list items — something impossible with visual HTML output.

How to Use

Zero-config example — just provide a PDF URL:

{
  "sources": [
    { "url": "https://example.com/report.pdf" }
  ]
}

Node.js (Apify Client):

import { ApifyClient } from 'apify-client';
// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_TOKEN' });
// Start the actor and wait for the run to finish.
const run = await client.actor('junipr/pdf-to-html').call({
  sources: [{ url: 'https://example.com/document.pdf' }],
  stylingMode: 'class',
  wrapInDocument: true,
});
// Each converted PDF is one item in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].html);

Python:

from apify_client import ApifyClient
# Authenticate with your Apify API token.
client = ApifyClient("YOUR_TOKEN")
# Start the actor and wait for the run to finish.
run = client.actor("junipr/pdf-to-html").call(run_input={
    "sources": [{"url": "https://example.com/document.pdf"}],
    "stylingMode": "class",
})
# Each converted PDF is one item in the run's default dataset.
items = client.dataset(run["defaultDatasetId"]).list_items().items
print(items[0]["html"])

Load from Apify Key-Value Store:

{
  "sources": [
    { "kvStoreKey": "my-document.pdf", "kvStoreId": "abc123" }
  ]
}

Input Configuration

All parameters except sources are optional. Common recipes:

Quick conversion — URL only, all defaults:

{ "sources": [{ "url": "https://example.com/doc.pdf" }] }

Web publishing — styled HTML document with WebP images:

{
  "sources": [{ "url": "https://example.com/doc.pdf" }],
  "stylingMode": "class",
  "wrapInDocument": true,
  "imageFormat": "webp",
  "includeDefaultStyles": true
}

CMS import — pure semantic HTML, no styling, no page breaks:

{
  "sources": [{ "url": "https://example.com/doc.pdf" }],
  "stylingMode": "none",
  "pageBreakMode": "none",
  "wrapInDocument": false
}

Print archive — preserve all formatting with inline CSS:

{
  "sources": [{ "url": "https://example.com/doc.pdf" }],
  "stylingMode": "inline",
  "preserveFontSizes": true,
  "preserveColors": true,
  "preserveFontStyles": true
}

See the Input tab for the full list of parameters with descriptions and defaults.
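
The recipes above differ only in a handful of option values, so one way to reuse them is to keep them as named presets and merge the chosen preset with a list of sources. The sketch below does that in Python. `RECIPES` and `build_input` are hypothetical helper names introduced for illustration; the option names and values are copied from the recipes in this section.

```python
import json

# Option sets copied from the recipes in this section.
RECIPES = {
    "quick": {},
    "web": {
        "stylingMode": "class",
        "wrapInDocument": True,
        "imageFormat": "webp",
        "includeDefaultStyles": True,
    },
    "cms": {
        "stylingMode": "none",
        "pageBreakMode": "none",
        "wrapInDocument": False,
    },
    "print": {
        "stylingMode": "inline",
        "preserveFontSizes": True,
        "preserveColors": True,
        "preserveFontStyles": True,
    },
}


def build_input(urls, recipe="quick", **overrides):
    """Return an actor input dict for the given PDF URLs and recipe name."""
    if not urls:
        raise ValueError("at least one PDF URL is required")
    if recipe not in RECIPES:
        raise ValueError(f"unknown recipe: {recipe!r}")
    run_input = {"sources": [{"url": url} for url in urls]}
    run_input.update(RECIPES[recipe])
    run_input.update(overrides)  # explicit options win over the preset
    return run_input


if __name__ == "__main__":
    print(json.dumps(build_input(["https://example.com/doc.pdf"], "cms"), indent=2))
```

The returned dict can be passed as `run_input` to `client.actor("junipr/pdf-to-html").call(...)` in the same way as the Python example under How to Use.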

Output Format

Each converted PDF produces one dataset item with the full HTML, per-page breakdown, image references, metadata, and conversion stats. Example output (fragment mode):

<h1 class="heading-1">Annual Report 2024</h1>
<p class="paragraph">Revenue grew 23% year-over-year...</p>
<h2 class="heading-2">Financial Summary</h2>
<table class="pdf-table">
  <thead>
    <tr><th>Quarter</th><th>Revenue</th><th>Growth</th></tr>
  </thead>
  <tbody>
    <tr><td>Q1</td><td>$2.1M</td><td>18%</td></tr>
    <tr><td>Q2</td><td>$2.5M</td><td>25%</td></tr>
  </tbody>
</table>
<ul class="pdf-list">
  <li>Expanded into 3 new markets</li>
  <li>Launched enterprise tier</li>
</ul>

When wrapInDocument is enabled, the output includes <!DOCTYPE html>, <html>, <head> with meta tags from PDF metadata, and a <style> block with the default or custom stylesheet.

Extracted images are stored in the run's Key-Value Store and referenced in the images array with dimensions, format, and originating page number.
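
Because the output uses real heading and table elements, it can be post-processed with an ordinary HTML parser. The following is a minimal sketch using only Python's standard library; `OutlineParser` is a hypothetical helper, and `FRAGMENT` is a shortened copy of the example output above standing in for the `html` field of a dataset item.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect headings and table rows from a semantic HTML fragment."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []    # list of (tag, text) tuples
        self.rows = []        # one list of cell texts per <tr>
        self._capture = None  # tag whose text is currently being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in self.HEADINGS or tag in ("th", "td"):
            self._capture = tag
            self._text = []

    def handle_data(self, data):
        if self._capture:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = "".join(self._text).strip()
        if tag in self.HEADINGS:
            self.headings.append((tag, text))
        elif self.rows:
            self.rows[-1].append(text)
        self._capture = None


FRAGMENT = """
<h1 class="heading-1">Annual Report 2024</h1>
<h2 class="heading-2">Financial Summary</h2>
<table class="pdf-table">
<thead><tr><th>Quarter</th><th>Revenue</th><th>Growth</th></tr></thead>
<tbody>
<tr><td>Q1</td><td>$2.1M</td><td>18%</td></tr>
<tr><td>Q2</td><td>$2.5M</td><td>25%</td></tr>
</tbody>
</table>
"""

parser = OutlineParser()
parser.feed(FRAGMENT)
print(parser.headings)
print(parser.rows)
```

The same traversal is not possible with absolutely positioned output, where a heading is only a larger font and a table is a set of unrelated text spans.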

Tips and Advanced Usage

  • Multi-column PDFs: Leave detectColumns enabled (it is on by default) so that multi-column text is merged in natural reading order instead of being interleaved across columns.
  • Custom CSS: Use customCss with stylingMode: "class" to inject your own styles. Class names like .heading-1, .paragraph, .pdf-table, .pdf-list are consistent across all documents.
  • Scanned PDFs: This actor works with text-based PDFs only. For scanned documents, run them through an OCR actor first, then convert the output.
  • Page selection: Use pageRange to convert only specific pages (e.g., "1-3,7") — you only pay for pages actually converted.
  • Batch optimization: Process up to 5,000 PDFs per run. Each PDF is processed sequentially to manage memory. For very large batches, increase the memory allocation to 4096 MB or higher.
  • CMS integration: Use stylingMode: "none" and wrapInDocument: false for WordPress, Contentful, or Strapi imports — these platforms apply their own styling.
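
For collections larger than the 5,000-PDF limit mentioned in the batch tip, the source list has to be split across several runs. The sketch below shows one simple way to do that; `chunk_sources` is a hypothetical helper and the example URLs are placeholders.

```python
def chunk_sources(urls, max_per_run=5000):
    """Split URLs into lists of at most max_per_run 'sources' entries."""
    if max_per_run < 1:
        raise ValueError("max_per_run must be positive")
    sources = [{"url": url} for url in urls]
    return [sources[i:i + max_per_run] for i in range(0, len(sources), max_per_run)]


# Placeholder URLs standing in for a real document collection.
urls = [f"https://example.com/docs/{n}.pdf" for n in range(12000)]
batches = chunk_sources(urls)
print([len(batch) for batch in batches])  # [5000, 5000, 2000]
```

Each batch can then be used as the `sources` value of a separate run.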

Pricing

This actor uses Pay-Per-Event (PPE) pricing at $3.00 per 1,000 pages converted ($0.003 per page).

| Scenario | Pages | Cost |
| --- | --- | --- |
| Single 10-page document | 10 | $0.03 |
| Product catalog (100 pages) | 100 | $0.30 |
| Legal contract batch (50 docs x 20 pages) | 1,000 | $3.00 |
| Website migration (500 PDFs x 5 pages) | 2,500 | $7.50 |
| Document archive (10K pages) | 10,000 | $30.00 |

Not billed: pages that fail to convert, scanned pages with no text, pages skipped by pageRange, empty pages, and duplicate PDFs. You only pay for successfully converted pages.

Compared to Adobe Document Services API ($0.05/page = $50/1K pages), this actor is 94% cheaper at any scale.
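
The figures in this section follow directly from the per-page price, and the short sketch below reproduces them. The helper names are hypothetical. Both prices are the ones quoted in this section, and the estimate assumes every page counted is successfully converted and therefore billable.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.003")       # $3.00 per 1,000 pages, as stated above
ADOBE_PRICE_PER_PAGE = Decimal("0.05")  # comparison figure quoted above


def estimate_cost(billable_pages, price_per_page=PRICE_PER_PAGE):
    """Return the estimated charge in dollars for successfully converted pages."""
    if billable_pages < 0:
        raise ValueError("billable_pages must not be negative")
    return price_per_page * billable_pages


def savings_percent(price=PRICE_PER_PAGE, reference=ADOBE_PRICE_PER_PAGE):
    """Return how much cheaper price is than reference, as a percentage."""
    return (1 - price / reference) * 100


for pages in (10, 100, 1000, 2500, 10000):
    print(f"{pages:>6} pages -> ${estimate_cost(pages):.2f}")
print(f"{savings_percent():.0f}% cheaper than the quoted Adobe price")
```

Decimal arithmetic is used so that amounts such as $0.003 per page are not distorted by binary floating-point rounding.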

FAQ

What makes this different from pdf2htmlEX?

pdf2htmlEX produces visual HTML — every text element is absolutely positioned with pixel coordinates, mimicking the PDF layout. While it looks identical to the original, the HTML has no semantic meaning. Our actor produces real <h1>, <table>, <ul>, and <p> elements that browsers, search engines, and screen readers understand. The tradeoff is that our output may not be pixel-perfect, but it is actually useful as HTML.

Can it handle tables with merged cells?

Yes. The actor detects table structures and attempts to identify colspan/rowspan relationships. For very complex tables (deeply nested or irregularly merged), it falls back to a <pre> block with formatted text to preserve readability.

What happens with scanned/image-only PDF pages?

Scanned pages are detected automatically and produce a SCANNED_PAGE_DETECTED warning. These pages are skipped (not billed) because there is no text to convert. For scanned documents, use an OCR actor first to extract text, then run this actor on the result.

Does it support password-protected PDFs?

Yes. Provide the password via the password input field (applies to all sources) or per-source via sources[].password. If a PDF is encrypted and no password is provided, you get an ENCRYPTED_NO_PASSWORD error. If the password is wrong, you get an INVALID_PASSWORD error.
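
As an illustration, the sketch below builds an input that combines a shared password with a per-source one. `build_protected_input` is a hypothetical helper and the URLs and passwords are placeholders. The text above does not say which value wins when both are set for the same PDF, so the comment about per-source precedence is an assumption.

```python
def build_protected_input(sources, default_password=None):
    """Build an input dict from (url, password_or_None) pairs."""
    run_input = {"sources": []}
    if default_password is not None:
        run_input["password"] = default_password  # applies to all sources
    for url, password in sources:
        entry = {"url": url}
        if password is not None:
            # Assumption: a per-source password is used for this PDF only.
            entry["password"] = password
        run_input["sources"].append(entry)
    return run_input


run_input = build_protected_input(
    [
        ("https://example.com/contract-a.pdf", None),
        ("https://example.com/contract-b.pdf", "placeholder-b"),
    ],
    default_password="placeholder-shared",
)
print(run_input)
```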

Can I customize the CSS output?

Yes. Choose between three styling modes: "class" (CSS classes + <style> block), "inline" (style attributes on each element), or "none" (pure semantic HTML). With class-based styling, you can inject custom CSS via the customCss field and toggle the built-in default styles with includeDefaultStyles.
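
A possible class-mode input with a custom stylesheet is sketched below. The CSS rules themselves are made up for illustration; the class names are the ones listed under Tips and Advanced Usage, and the option names are the ones described in this answer.

```python
import json

# Hypothetical stylesheet; the class names are the ones listed under Tips.
CUSTOM_CSS = """
.heading-1 { font-size: 2rem; margin-bottom: 0.5em; }
.pdf-table { border-collapse: collapse; width: 100%; }
.pdf-table th, .pdf-table td { border: 1px solid #ccc; padding: 4px 8px; }
.pdf-list { padding-left: 1.5em; }
""".strip()

run_input = {
    "sources": [{"url": "https://example.com/doc.pdf"}],
    "stylingMode": "class",         # customCss is documented for class mode
    "includeDefaultStyles": False,  # rely on the custom rules only
    "customCss": CUSTOM_CSS,
    "wrapInDocument": True,
}

print(json.dumps(run_input, indent=2))
```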

How are images handled in the output?

When extractImages is enabled, embedded images are extracted from the PDF, converted to your chosen format (PNG, JPEG, or WebP), and stored in the run's Key-Value Store. The HTML output references images via <img> tags, and the images array in the dataset provides each image's KV store key, dimensions, format, and originating page number.

What's the maximum PDF file size?

The limit is configurable via maxFileSizeMb: the default is 100 MB and the maximum is 500 MB. For very large PDFs, increase the actor's memory allocation proportionally. A 200-page PDF with many images may need 4096 MB of memory.

Can I convert only specific pages?

Yes. Use the pageRange field with ranges like "1-5", "1,3,5", or "1-3,7,9-12". Pages outside the range are skipped and not billed. The output includes only the requested pages.
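
To estimate how many pages a given range selects before starting a run, the range syntax can be expanded locally. The sketch below is one plausible reading of that syntax, not the actor's own parser; `parse_page_range` is a hypothetical helper.

```python
def parse_page_range(spec):
    """Expand a spec such as "1-3,7,9-12" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty segment in page range: {spec!r}")
        start, sep, end = part.partition("-")
        first = int(start)
        last = int(end) if sep else first
        if first < 1 or last < first:
            raise ValueError(f"invalid segment: {part!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)


for spec in ("1-5", "1,3,5", "1-3,7,9-12"):
    pages = parse_page_range(spec)
    print(spec, "->", pages, f"({len(pages)} pages)")
```

The page count gives an upper bound on the billed pages for that PDF, since pages that fail to convert or contain no text are not billed.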