HTML to JSON Smart Parser

Convert HTML to structured JSON using AI! Uses OpenAI to extract and structure data from HTML into clean JSON format. Perfect for developers and data analysts who need to transform HTML into structured data without manual parsing.

Pricing: Pay per event
Rating: 5.0 (2 reviews)
Developer: ParseForge (Maintained by Community)
Actor stats: 0 bookmarks · 34 total users · 1 monthly active user · last modified 12 hours ago


🧩 HTML to JSON Smart Parser

🚀 Convert HTML into structured JSON in seconds. Bring your own OpenAI API key. URL fetch, paste HTML, or upload files. No bespoke parsers.

🕒 Last updated: 2026-05-09 · 🧠 BYO OpenAI key · 📥 URL / paste / file upload · 🔑 Selectable OpenAI model

Convert HTML into clean structured JSON without writing a parser per page. Provide one or more URLs, paste HTML directly, or upload HTML files, then specify (or auto-detect) which fields to extract. The actor sends the HTML to your OpenAI account using your API key, parses the response, and returns one structured record per input. Built for developers who want layout-agnostic HTML extraction without bespoke selector code.

You bring your own OpenAI API key, so all model usage is billed directly to your OpenAI account. Choose the model (gpt-4o, gpt-4o-mini, gpt-3.5-turbo, etc.) based on your accuracy and cost trade-offs.

| 👥 Built for | 🎯 Primary use cases |
| --- | --- |
| Developers | Skip writing CSS selectors and XPath queries |
| Data engineers | Build layout-agnostic data pipelines |
| AI ops | Convert HTML into structured prompts for LLM workflows |
| Researchers | Index HTML archives without bespoke parsers |
| Content ops | Migrate HTML content into structured DBs |
| Indie devs | Add HTML parsing to side projects without writing a custom parser |

📋 What the HTML to JSON Smart Parser does

  • 🌐 Three input modes. URL fetch, paste raw HTML, or upload HTML file URLs.
  • 🧠 AI-driven extraction. Sends HTML to OpenAI with your key for layout-agnostic parsing.
  • 🎯 Field selection. Specify which fields to extract or let the AI auto-detect.
  • 🤖 Model choice. gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo, or gpt-5 when available.
  • ✏️ Custom prompts. Optional system prompt to bias the extraction.
  • 🆔 Per-input metadata. Each record carries the source URL, prompt, and timestamp.

The actor processes inputs in the order you provide them. Records stream into the dataset as parsing completes.

💡 Why it matters: writing a parser per page type costs hours and breaks with every layout change. AI-driven extraction adapts to layout variation without code changes, so dev teams can ship structured-data features faster.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing URL input, custom field extraction, and how to feed the output into a downstream pipeline.


⚙️ Input

| Field | Type | Name | Description |
| --- | --- | --- | --- |
| `url` | array | URL (Fetch HTML) | URLs to fetch HTML from. The actor performs a plain HTTP GET. |
| `htmlContent` | string | HTML Content (Paste) | Optional. Paste raw HTML directly. |
| `htmlFileUrl` | array | HTML File URL (Upload) | Optional. URLs to uploaded HTML files. |
| `openAIApiKey` | string | OpenAI API Key | Required. Your OpenAI API key, used for the model call. |
| `model` | enum | OpenAI Model | `gpt-4o-mini` (default), `gpt-4o`, `gpt-4-turbo`, `gpt-3.5-turbo`, `gpt-5`. |

Example 1. URL extraction with default model.

```json
{
  "url": [{ "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html" }],
  "openAIApiKey": "sk-...",
  "model": "gpt-4o-mini"
}
```

Example 2. Paste HTML directly.

```json
{
  "htmlContent": "<html><body><h1>Title</h1><p>Body</p></body></html>",
  "openAIApiKey": "sk-...",
  "model": "gpt-4o"
}
```

⚠️ Good to Know: you must supply your own OpenAI API key. All model usage is billed to your OpenAI account.
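The input payloads above can also be assembled programmatically before a run. The sketch below is illustrative, not part of the Actor: the field names (`url`, `htmlContent`, `htmlFileUrl`, `openAIApiKey`, `model`) come from the input table, but `build_run_input` and its validation rules are hypothetical helpers, and the shape of `htmlFileUrl` entries is assumed to mirror the `url` field.

```python
def build_run_input(api_key, urls=None, html_content=None, html_file_urls=None,
                    model="gpt-4o-mini"):
    """Assemble a run-input dict matching the Actor's input schema.

    Hypothetical helper for illustration; only the field names come from
    the input table above.
    """
    if not api_key:
        raise ValueError("openAIApiKey is required")
    if not (urls or html_content or html_file_urls):
        raise ValueError("provide at least one input: URLs, pasted HTML, or file URLs")

    run_input = {"openAIApiKey": api_key, "model": model}
    if urls:
        # The url field is an array of {"url": ...} objects.
        run_input["url"] = [{"url": u} for u in urls]
    if html_content:
        run_input["htmlContent"] = html_content
    if html_file_urls:
        # Shape assumed to mirror the url field.
        run_input["htmlFileUrl"] = [{"url": u} for u in html_file_urls]
    return run_input
```

Passing the resulting dict as the run input reproduces Example 1 above.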


📊 Output

The dataset returns one structured record per input. Each record carries the source identifier, extracted JSON, the model used, and a timestamp. Consume the dataset as JSON, CSV, Excel, XML, or RSS via the Apify console or API.
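For programmatic consumption, Apify's public Dataset API serves items at `/v2/datasets/{datasetId}/items`, with the export format chosen by a query parameter. A minimal sketch of building that URL (the dataset ID and token values are placeholders; `dataset_items_url` is a hypothetical helper):

```python
def dataset_items_url(dataset_id, fmt="json", token=None):
    # Apify's Dataset API serves items at /v2/datasets/{id}/items; the
    # format query parameter selects the export format.
    allowed = {"json", "jsonl", "csv", "xlsx", "html", "xml", "rss"}
    if fmt not in allowed:
        raise ValueError(f"unsupported format: {fmt}")
    url = f"https://api.apify.com/v2/datasets/{dataset_id}/items?format={fmt}"
    if token:
        # Private datasets need an API token.
        url += f"&token={token}"
    return url
```

Fetching that URL with any HTTP client returns the dataset in the chosen format.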

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| 🌐 `sourceUrl` | string (url) or null | `https://books.toscrape.com/.../1000/index.html` |
| 📦 `parsedData` | object | `{"title":"A Light in the Attic","price":51.77,"availability":"In stock"}` |
| 🤖 `model` | string | `gpt-4o-mini` |
| 🎯 `prompt` | string | Extract title, price, and availability |
| 📅 `timestamp` | ISO datetime | `2026-05-09T12:00:00.000Z` |
| `error` | string or null | `null` |

📦 Sample records

1. URL extraction (book product page)

```json
{
  "sourceUrl": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "parsedData": {
    "title": "A Light in the Attic",
    "price": 51.77,
    "availability": "In stock",
    "rating": "Three",
    "description": "It's hard to imagine a world without A Light in the Attic..."
  },
  "model": "gpt-4o-mini",
  "prompt": "Extract title, price, availability, rating, and description",
  "timestamp": "2026-05-09T12:00:00.000Z",
  "error": null
}
```

2. Pasted HTML (simple page)

```json
{
  "sourceUrl": null,
  "parsedData": {
    "title": "Welcome",
    "body": "Today we launched our new product..."
  },
  "model": "gpt-4o",
  "timestamp": "2026-05-09T12:00:00.000Z",
  "error": null
}
```

3. Failed parse (missing API key)

```json
{
  "sourceUrl": "https://example.com/page.html",
  "parsedData": null,
  "model": "gpt-4o-mini",
  "timestamp": "2026-05-09T12:00:00.000Z",
  "error": "Missing OpenAI API key"
}
```
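Because every record carries its own `error` field, a downstream consumer can separate successes from failures without aborting the batch. A minimal sketch, using record shapes that follow the samples above (`split_records` is an illustrative helper, not part of the Actor):

```python
def split_records(records):
    """Partition dataset records into (parsed, failed) using the error field."""
    parsed, failed = [], []
    for rec in records:
        (failed if rec.get("error") else parsed).append(rec)
    return parsed, failed

# Record shapes follow the sample records above.
records = [
    {"sourceUrl": None, "parsedData": {"title": "Welcome"}, "error": None},
    {"sourceUrl": "https://example.com/page.html", "parsedData": None,
     "error": "Missing OpenAI API key"},
]
ok, bad = split_records(records)
```

Feeding `ok` into a pipeline while logging `bad` mirrors the Actor's per-input error reporting.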

✨ Why choose this Actor

  • 🎯 Built for the job. Single-purpose HTML-to-JSON pipeline with sensible defaults.
  • 🧠 BYO OpenAI key. All model usage is billed directly to your OpenAI account.
  • ⚙️ Model choice. Pick a model based on your accuracy and cost trade-offs.
  • 🔁 Live processing. Each run executes end to end with no caching of input HTML.
  • 🌐 No infra to manage. Apify handles compute, scaling, scheduling, and storage.
  • 🛡️ Reliable. Per-input error reporting means one bad URL does not kill the whole run.
  • 🚫 No code required. Configure everything in the UI; developers can also run it from the CLI, schedule it, or call it from any language with the Apify SDK.

📊 Production-grade HTML-to-JSON conversion without writing or maintaining custom parsers.


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Quality | Setup |
| --- | --- | --- | --- | --- | --- |
| ⭐ HTML to JSON Smart Parser (this Actor) | $5 free credit + your OpenAI usage | Any HTML | Live per run | High, layout-agnostic | ⚡ 2 min |
| Hand-written parsers | Engineering hours | Per layout | Whenever you maintain it | High but brittle | 🐢 Days to weeks |
| Paid HTML-extraction SaaS | $$ monthly | Limited | Live | Variable | ⏳ Hours |
| Manual review | Hours per file | One at a time | Stale | Highest | 🕒 Variable |

Pick this Actor when you want flexible, layout-agnostic HTML parsing without owning the model integration.


🚀 How to use

  1. 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the HTML to JSON Smart Parser page on the Apify Store.
  3. 🎯 Set inputs. Provide URLs, paste HTML, or upload files. Add your OpenAI API key.
  4. 🚀 Run it. Click Start and let the Actor parse each input.
  5. 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to first parsed JSON: 3-5 minutes for a single URL.


💼 Business use cases

📊 Data engineering

  • Build layout-agnostic data pipelines
  • Skip CSS selectors and XPath queries
  • Replace bespoke parsers across products
  • Power ETL of HTML archives

🏢 AI ops and product

  • Convert HTML into structured prompts
  • Build LLM-driven content workflows
  • Power RAG ingestion from HTML sources
  • Surface structured data from emails

🎯 Research and migration

  • Index HTML archives without bespoke parsers
  • Migrate legacy HTML content into structured DBs
  • Build content audits from CMS exports
  • Power knowledge-base ingestion

🛠️ Engineering and product

  • Add HTML parsing to your apps
  • Wire parsing into CMS via webhooks
  • Build prototype scrapers fast
  • Skip the model-integration maintenance entirely

🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

  • Empirical datasets for papers, thesis work, and coursework
  • Longitudinal studies tracking changes across snapshots
  • Reproducible research with cited, versioned data pulls
  • Classroom exercises on data analysis and ethical scraping

🎨 Personal and creative

  • Side projects, portfolio demos, and indie app launches
  • Data visualizations, dashboards, and infographics
  • Content research for bloggers, YouTubers, and podcasters
  • Hobbyist collections and personal trackers

🤝 Non-profit and civic

  • Transparency reporting and accountability projects
  • Advocacy campaigns backed by public-interest data
  • Community-run databases for local issues
  • Investigative journalism on public records

🧪 Experimentation

  • Prototype AI and machine-learning pipelines with real data
  • Validate product-market hypotheses before engineering spend
  • Train small domain-specific models on niche corpora
  • Test dashboard concepts with live input

🔌 Automating HTML to JSON Smart Parser

This Actor exposes a REST endpoint, so you can drive it from any language or workflow tool.

Schedules. Use Apify Scheduler to batch-parse a folder of HTML inputs. Combine with webhooks to trigger downstream workflows when parsing completes.
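A webhook definition submitted to the Apify platform might look like the sketch below. `ACTOR.RUN.SUCCEEDED` is a standard Apify webhook event type; the actor ID and request URL are placeholders for your own values:

```json
{
  "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
  "condition": { "actorId": "YOUR_ACTOR_ID" },
  "requestUrl": "https://example.com/hooks/parse-complete"
}
```

When a run finishes successfully, Apify POSTs the run details to `requestUrl`, so your downstream workflow can fetch the dataset immediately.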


❓ Frequently Asked Questions

💳 Do I need a paid Apify plan to run this actor?

No, but you do need an OpenAI API key. You can start the actor on the free Apify plan (which includes $5 in monthly credit), but model calls are billed to your OpenAI account.

🚨 What happens if my run fails or returns no results?

Failed runs are not charged on Apify. If a single input fails, the actor records the error on that record only. If the OpenAI key is invalid or out of credits, the actor logs the error.

🧠 Why do I need to bring my own OpenAI key?

So your model usage is metered against your OpenAI account, with full control over rate limits and billing. We never see or store your key.

🤖 Which model should I pick?

gpt-4o-mini is the recommended default for cost. gpt-4o is more accurate for complex layouts. gpt-3.5-turbo is cheapest but less reliable on dense pages.

📥 Which input mode should I use?

URLs are simplest for public pages. Paste HTML when you have content not on the public web. Upload HTML files for bulk processing.

🧑‍💻 Can I call this actor from my own code?

Yes. Apify exposes every actor as a REST endpoint and ships first-class SDKs for Node.js and Python.
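As a stdlib-only sketch of the REST route: `POST /v2/acts/{actorId}/runs` is Apify's standard run-start endpoint, but the actor ID shown is hypothetical and the token is a placeholder. `build_run_request` is an illustrative helper, not part of any SDK.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def build_run_request(actor_id, token, run_input):
    # POST /v2/acts/{actorId}/runs starts an actor run; the run input
    # travels as the JSON request body.
    url = f"{API_BASE}/acts/{actor_id}/runs?token={token}"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_run_request(
        "parseforge~html-to-json-smart-parser",  # hypothetical actor ID
        "YOUR_APIFY_TOKEN",
        {
            "url": [{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"}],
            "openAIApiKey": "sk-...",
            "model": "gpt-4o-mini",
        },
    )
    # Requires a valid Apify token; the response carries the run ID and
    # the defaultDatasetId from which results can be read.
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["data"]["id"])
```

The official `apify-client` packages for Node.js and Python wrap this same endpoint with retries and pagination, so prefer them in production code.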

📤 How do I export the data?

Every Apify dataset can be downloaded in one click as CSV, JSON, JSONL, Excel, HTML, XML, or RSS.

📅 Can I schedule the actor to run automatically?

Yes. Use the Apify scheduler to parse new URLs on a cadence. Wire to webhooks for trigger-driven parsing.

🏪 Can I use the data commercially?

Yes. Parsed data is yours to use, subject to your rights to the source HTML.

💼 Which plan should I pick for production use?

Apify's Starter and Scale plans are designed for production workloads. OpenAI usage is billed separately to your OpenAI account.

🛠️ Can you add other LLM providers?

Open the contact form and tell us about your use case. We add features regularly when there is a clear use case behind the request.

⚖️ Can I legally run other people's HTML through this actor?

Yes, provided you have rights to the source HTML. You are responsible for compliance with OpenAI's terms, source-site terms, and applicable copyright laws.


🔌 Integrate with any app

HTML to JSON Smart Parser connects to any cloud service via Apify integrations:

  • Make - Automate multi-step workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Get run notifications in your channels
  • Airbyte - Pipe results into your warehouse
  • GitHub - Trigger runs from commits and releases
  • Google Drive - Export datasets straight to Sheets

You can also use webhooks to trigger downstream actions when a run finishes.


💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.


🆘 Need Help? Open our contact form to request a new actor, propose a custom project, or report an issue.


⚠️ Disclaimer. This Actor is an independent tool. The actor processes only HTML you supply by URL, paste, or upload, and is intended for legitimate data-extraction workflows. Users are responsible for ensuring they hold the rights to the source content and for compliance with copyright, OpenAI's terms of service, and applicable law in their jurisdiction.