Pricing

from $0.50 / 1,000 extracted webpages

Webpage Text Extractor

Extract clean text, article text, and Markdown from public web pages. Get titles, metadata, headings, links, word counts, final URLs, and timestamps for LLM prompts, RAG inputs, reviews, and exports.

Developer

Maxime Dupré

Last modified

a month ago

📄 Webpage text extractor for LLM-ready content

Webpage Text Extractor turns public web pages into clean text, article text, or Markdown for LLM prompts, RAG inputs, content review, and spreadsheet exports. Add one URL or a batch of URLs, choose the text shape, and the Actor returns readable page content with source metadata, headings, links, counts, redirects, and scrape timestamps.

Use it when you need the text from pages such as Example Domain, documentation pages, blog posts, help-center articles, landing pages, or public knowledge-base pages without copying each page by hand. It is built for public HTML pages that can be opened without logging in.

For a quick first run, keep the prefilled public webpage list, leave Extraction mode on Clean page text, and run the Actor. You will get a representative batch of output items that shows the full row shape before you add your own URLs.

🧭 What this Actor does

Extracts clean text from public HTML web pages.
Supports Clean page text, Article text, and Markdown for LLMs modes.
Removes common page noise such as scripts, styles, navigation, headers, footers, forms, and hidden elements before extracting text.
Includes useful page details by default: title, meta description, author, published date, language, headings, links, canonical URL, final URL, HTTP status, word count, and character count.
Saves one output item per successfully extracted webpage.
Marks sparse but usable pages as partial so you can review them.
Logs skipped URLs when a page is invalid, unavailable, non-HTML, empty, private, blocked, or too slow to load.

The Actor is focused on webpage text extraction. It does not handle PDFs, Word documents, OCR of images, video transcripts, private dashboards, or logged-in pages, and it does not fully render every JavaScript-heavy web app.

📊 Data you can extract

Each dataset item is one successfully extracted webpage. Rows can include:

type - always webpage_text
status - ok or partial
inputIndex - submitted URL position
requestedUrl - original URL from the input
finalUrl - final page URL after redirects
canonicalUrl - canonical page URL when the page provides one
httpStatusCode and contentType - response details for the extracted page
extractionMode - cleanText, articleText, or markdown
title, metaDescription, author, publishedAt, and language
excerpt - short preview of the extracted text
text - main extracted text in the selected mode
markdown - Markdown text when Markdown mode is selected
wordCount and charCount
headings - page heading outline with level and text
links - visible page links with text, absolute URL, and external-link flag
quality - sparse-content and redirect flags
scrapedAt - UTC timestamp when the page was saved

You can export the dataset as JSON, CSV, Excel, XML, RSS, or HTML, or use the same output through the Apify API, schedules, webhooks, and integrations.
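
If you prefer the API route over manual exports, the flow can be sketched in Python. This is a minimal sketch, not an official snippet: `run_and_collect` is a hypothetical helper, `actor_id` is a placeholder because this README does not state the Actor's ID, and `client` is assumed to behave like `ApifyClient` from the `apify-client` package (an `actor(...).call(run_input=...)` method that returns run details containing `defaultDatasetId`, and a `dataset(...).iterate_items()` method).

```python
def run_and_collect(client, actor_id, run_input):
    """Start the Actor, wait for the run to finish, and return its dataset items.

    `client` is assumed to behave like apify_client.ApifyClient.
    `actor_id` is a placeholder; copy the real ID from the Actor's page.
    """
    # call() starts the run and blocks until it finishes.
    run = client.actor(actor_id).call(run_input=run_input)
    if not run:
        raise RuntimeError("Actor run did not return run details")

    # Each item in the default dataset is one successfully extracted webpage.
    dataset_id = run["defaultDatasetId"]
    return list(client.dataset(dataset_id).iterate_items())
```

With the real client you would pass something like `ApifyClient("<YOUR_API_TOKEN>")` as `client` and the Actor's ID as `actor_id`.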

🚀 How to run it

Open the Input tab.
Add one or more public webpage URLs in Webpage URLs.
Choose Extraction mode.
Keep Maximum pages small for your first run, then raise it when the output looks right.
Run the Actor and open the dataset.

Use Clean page text for a general page-to-text scraper. Use Article text for blog posts, articles, and reader-style pages where the main content matters most. Use Markdown for LLMs when you want headings and links represented in Markdown for prompts, RAG ingestion, or documentation workflows.

🧾 Input example

{
	"startUrls": [
		{ "url": "https://example.com" },
		{ "url": "https://www.iana.org/domains/reserved" }
	],
	"extractionMode": "markdown",
	"maxPages": 2
}

Webpage URLs is the only required input. Add public http or https pages that can be opened without a login.

Extraction mode controls the main text format saved in text. The supported values are cleanText, articleText, and markdown.

Maximum pages caps how many submitted URLs can be extracted in one run. The public maximum is 100.
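
If you build the input in code, a small validator can catch mistakes before a run starts. The sketch below mirrors the three fields from the input example and the rules described above. `build_run_input` is a hypothetical helper, and both the default of 10 pages and the lower bound of 1 are assumptions made for illustration, not documented Actor defaults.

```python
from urllib.parse import urlparse

ALLOWED_MODES = {"cleanText", "articleText", "markdown"}
MAX_PAGES_LIMIT = 100  # public maximum stated in this README


def build_run_input(urls, extraction_mode="cleanText", max_pages=10):
    """Return an input dict shaped like the example above, or raise ValueError.

    The max_pages default and the minimum of 1 are illustrative assumptions.
    """
    if not urls:
        raise ValueError("at least one webpage URL is required")
    if extraction_mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported extraction mode: {extraction_mode!r}")
    if not 1 <= max_pages <= MAX_PAGES_LIMIT:
        raise ValueError(f"maxPages must be between 1 and {MAX_PAGES_LIMIT}")

    start_urls = []
    for url in urls:
        parsed = urlparse(url)
        # Only http and https pages are accepted.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an http(s) URL: {url!r}")
        start_urls.append({"url": url})

    return {
        "startUrls": start_urls,
        "extractionMode": extraction_mode,
        "maxPages": max_pages,
    }
```

Calling `build_run_input(["https://example.com", "https://www.iana.org/domains/reserved"], "markdown", 2)` produces the same structure as the input example.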

📦 Output example

{
	"type": "webpage_text",
	"status": "ok",
	"inputIndex": 1,
	"requestedUrl": "https://example.com",
	"finalUrl": "https://example.com/",
	"canonicalUrl": null,
	"httpStatusCode": 200,
	"contentType": "text/html",
	"extractionMode": "markdown",
	"title": "Example Domain",
	"metaDescription": null,
	"author": null,
	"publishedAt": null,
	"language": "en",
	"excerpt": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
	"text": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
	"markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
	"wordCount": 20,
	"charCount": 127,
	"headings": [
		{ "level": 1, "text": "Example Domain" }
	],
	"links": [
		{
			"text": "More information...",
			"url": "https://www.iana.org/domains/example",
			"isExternal": true
		}
	],
	"quality": {
		"isSparse": false,
		"wasRedirected": true,
		"reason": null
	},
	"scrapedAt": "2026-06-13T14:12:00.000Z"
}
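
For LLM prompts or RAG ingestion, you will usually want the text together with its source details. The sketch below shows one possible way to turn an output item like the one above into a source-attributed text block. `to_prompt_document` and its header layout are illustrative choices, not something the Actor produces.

```python
def to_prompt_document(item):
    """Format one dataset item as a text block with source attribution."""
    # Prefer Markdown when the run used Markdown mode, else fall back to text.
    body = item.get("markdown") or item.get("text") or ""

    header = [
        f"Title: {item.get('title') or '(untitled)'}",
        f"Source: {item.get('finalUrl') or item.get('requestedUrl')}",
        f"Scraped at: {item.get('scrapedAt')}",
    ]
    # Partial rows are sparse but usable, so flag them for review.
    if item.get("status") == "partial":
        header.append("Note: sparse page, review before use")

    return "\n".join(header) + "\n\n" + body
```

Keeping the final URL and timestamp next to the text makes it easier to cite or refresh a page later.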

💸 Pricing

This Actor uses pay-per-event pricing. You are charged for each successfully extracted webpage. Failed, invalid, unavailable, empty, or non-HTML URLs are skipped and are not saved as output items.

Current event prices are:

Tier	Price per extracted webpage
FREE	$0.00090
BRONZE	$0.00090
SILVER	$0.00070
GOLD	$0.00050
PLATINUM	$0.00035
DIAMOND	$0.00025

There is no separate Actor-start charge.
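
To budget a run, multiply the number of pages you expect to extract by your tier's price. The sketch below uses the prices from the table above. `estimate_cost` is a hypothetical helper and covers only the per-page event charge, so treat the result as an estimate.

```python
from decimal import Decimal

# Per-page event prices in USD, copied from the table above.
PRICE_PER_PAGE = {
    "FREE": Decimal("0.00090"),
    "BRONZE": Decimal("0.00090"),
    "SILVER": Decimal("0.00070"),
    "GOLD": Decimal("0.00050"),
    "PLATINUM": Decimal("0.00035"),
    "DIAMOND": Decimal("0.00025"),
}


def estimate_cost(extracted_pages, tier):
    """Return the estimated event charge in USD for successfully extracted pages."""
    if extracted_pages < 0:
        raise ValueError("extracted_pages must not be negative")
    # Skipped URLs are not charged, so count only pages you expect to succeed.
    return PRICE_PER_PAGE[tier.upper()] * extracted_pages
```

For example, 1,000 extracted pages on the GOLD tier come to $0.50, which matches the headline price.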

⚠️ Limits and caveats

Webpage Text Extractor works best on public HTML pages with readable content in the initial page response. Pages that require login, block access, return non-HTML files, or rely heavily on client-side rendering may produce no output item or a partial row.

The Actor does not crawl a whole website from one URL. It extracts the submitted URLs only. If you need a link map first, use a crawler to collect URLs, then pass selected pages to this Actor.
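
When a crawler hands you more URLs than one run accepts, you can split the list into several runs. The sketch below assumes the public maximum of 100 pages per run mentioned in the input section. `batch_urls` is a hypothetical helper, and dropping exact duplicates first is an optional choice to avoid paying for the same page twice.

```python
def batch_urls(urls, batch_size=100):
    """Split a URL list into run-sized batches, dropping exact duplicates."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # dict.fromkeys keeps the first occurrence of each URL in original order.
    unique = list(dict.fromkeys(urls))

    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]
```

Each batch can then be submitted as the Webpage URLs input of a separate run.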

❓ FAQ

🧾 Can I use this as a webpage to Markdown converter?

Yes. Choose Markdown for LLMs. The main text field will contain Markdown, and the markdown field will contain the same Markdown value for easy filtering.

🔗 Does it include links and headings?

Yes. Headings and visible links are included by default when the page provides them. You do not need to turn on separate metadata options.

🔒 Does it scrape private pages?

No. This Actor is for public web pages. It does not accept cookies, sessions, API keys, or login credentials.

⚠️ What happens when a URL fails?

The Actor logs the skipped URL and continues with the rest of the input. Only successfully extracted pages are saved to the dataset.
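
Because skipped URLs never reach the dataset, you can also find them by comparing your input with the `requestedUrl` values in the output. This is a minimal sketch with a hypothetical helper name. It assumes `requestedUrl` repeats the submitted URL exactly, as in the output example, so verify that on your own data before relying on it.

```python
def find_skipped_urls(submitted_urls, dataset_items):
    """Return submitted URLs that have no matching row in the dataset."""
    # Assumption: requestedUrl is the submitted URL, unchanged.
    extracted = {item["requestedUrl"] for item in dataset_items}
    return [url for url in submitted_urls if url not in extracted]
```

The returned list keeps the input order, which makes it easy to retry or review those URLs.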

📝 Changelog

0.1: Initial release.

🆘 Support

For issues, questions, or feature requests, file a ticket and I'll fix or implement it in less than 24h 🫡

🔗 Other actors

Website URL Crawler ↗ - Crawl public websites and export a clean link map before extracting selected page text.
Web Images Scraper ↗ - Extract image URLs, metadata, and optional image files from public webpages.
Website Emails Scraper ↗ - Find public contact emails from websites and keep the source URL attached.
Font Detector ↗ - Detect web fonts, CSS font families, and font source evidence from public pages.
Ahrefs Free Website Stats Scraper ↗ - Extract public Ahrefs website metrics for SEO research and website audits.

Made with ❤️ by Maxime Dupré
