Webpage Text Extractor

Pricing

from $0.50 / 1,000 extracted webpages

Extract clean text, article text, and Markdown from public web pages. Get titles, metadata, headings, links, word counts, final URLs, and timestamps for LLM prompts, RAG inputs, reviews, and exports.

Developer

Maxime Dupré

Maintained by Community

📄 Webpage text extractor for LLM-ready content

Webpage Text Extractor turns public web pages into clean text, article text, or Markdown for LLM prompts, RAG inputs, content review, and spreadsheet exports. Add one URL or a batch of URLs, choose the text shape, and the Actor returns readable page content with source metadata, headings, links, counts, redirects, and scrape timestamps.

Use it when you need the text from pages such as the Example Domain page (https://example.com), documentation pages, blog posts, help-center articles, landing pages, or public knowledge-base pages without copying each page by hand. It is built for public HTML pages that can be opened without logging in.

For a quick first run, keep the prefilled public webpage list, leave Extraction mode on Clean page text, and run the Actor. You will get a representative batch of output items that shows the full row shape before you add your own URLs.

🧭 What this Actor does

  • Extracts clean text from public HTML web pages.
  • Supports Clean page text, Article text, and Markdown for LLMs modes.
  • Removes common page noise such as scripts, styles, navigation, headers, footers, forms, and hidden elements before extracting text.
  • Includes useful page details by default: title, meta description, author, published date, language, headings, links, canonical URL, final URL, HTTP status, word count, and character count.
  • Saves one output item per successfully extracted webpage.
  • Marks sparse but usable pages as partial so you can review them.
  • Logs skipped URLs when a page is invalid, unavailable, non-HTML, empty, private, blocked, or too slow to load.

The Actor is focused on webpage text extraction. It does not extract PDFs, Word documents, OCR from images, video transcripts, private dashboards, logged-in pages, or full rendered content from every JavaScript-heavy web app.

📊 Data you can extract

Each dataset item is one successfully extracted webpage. Rows can include:

  • type - always webpage_text
  • status - ok or partial
  • inputIndex - submitted URL position
  • requestedUrl - original URL from the input
  • finalUrl - final page URL after redirects
  • canonicalUrl - canonical page URL when the page provides one
  • httpStatusCode and contentType - response details for the extracted page
  • extractionMode - cleanText, articleText, or markdown
  • title, metaDescription, author, publishedAt, and language
  • excerpt - short preview of the extracted text
  • text - main extracted text in the selected mode
  • markdown - Markdown text when Markdown mode is selected
  • wordCount and charCount
  • headings - page heading outline with level and text
  • links - visible page links with text, absolute URL, and external-link flag
  • quality - sparse-content and redirect flags
  • scrapedAt - UTC timestamp when the page was saved

You can export the dataset as JSON, CSV, Excel, XML, RSS, or HTML, or use the same output through the Apify API, schedules, webhooks, and integrations.
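Once the dataset is exported as JSON, the status field makes it easy to set aside rows that need a manual look. The snippet below is a minimal sketch, assuming the rows have already been loaded as a list of dictionaries; the function name split_by_status and the sample rows are hypothetical and are not part of the Actor.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def split_by_status(rows: Iterable[Row]) -> Tuple[List[Row], List[Row]]:
    """Return (ok_rows, review_rows).

    Rows whose status is exactly "ok" go to the first list. Everything else,
    including "partial" rows, goes to the second list for manual review.
    """
    ok_rows: List[Row] = []
    review_rows: List[Row] = []
    for row in rows:
        if row.get("status") == "ok":
            ok_rows.append(row)
        else:
            review_rows.append(row)
    return ok_rows, review_rows


# Hypothetical sample rows that use the field names listed above.
sample_rows = [
    {"status": "ok", "finalUrl": "https://example.com/", "wordCount": 20},
    {"status": "partial", "finalUrl": "https://example.org/", "wordCount": 3},
]
ok_rows, review_rows = split_by_status(sample_rows)
```

Routing anything other than ok to the review list is a deliberately cautious choice, so an unexpected status value is never silently treated as clean content.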

🚀 How to run it

  1. Open the Input tab.
  2. Add one or more public webpage URLs in Webpage URLs.
  3. Choose Extraction mode.
  4. Keep Maximum pages small for your first run, then raise it when the output looks right.
  5. Run the Actor and open the dataset.

Use Clean page text for a general page-to-text scraper. Use Article text for blog posts, articles, and reader-style pages where the main content matters most. Use Markdown for LLMs when you want headings and links represented in Markdown for prompts, RAG ingestion, or documentation workflows.

🧾 Input example

{
  "startUrls": [
    { "url": "https://example.com" },
    { "url": "https://www.iana.org/domains/reserved" }
  ],
  "extractionMode": "markdown",
  "maxPages": 2
}

Webpage URLs is the only required input. Add public http or https pages that can be opened without a login.

Extraction mode controls the main text format saved in text. The supported values are cleanText, articleText, and markdown.

Maximum pages caps how many submitted URLs can be extracted in one run. The public maximum is 100.
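If you generate the input programmatically, it can help to check it against the rules above before starting a run. The following is a minimal sketch, assuming only the three documented fields (startUrls, extractionMode, maxPages); the helper name build_input is hypothetical, and the checks mirror this page rather than the Actor's actual validation.

```python
from typing import Any, Dict, List
from urllib.parse import urlparse

ALLOWED_MODES = {"cleanText", "articleText", "markdown"}
MAX_PAGES_LIMIT = 100  # public maximum stated on this page


def build_input(urls: List[str], mode: str, max_pages: int) -> Dict[str, Any]:
    """Build an input object in the documented shape, rejecting obvious mistakes."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"Unsupported extraction mode: {mode!r}")
    if not 1 <= max_pages <= MAX_PAGES_LIMIT:
        raise ValueError(f"maxPages must be between 1 and {MAX_PAGES_LIMIT}")
    if not urls:
        raise ValueError("At least one webpage URL is required")

    start_urls = []
    for url in urls:
        parsed = urlparse(url)
        # Only http and https URLs with a host are accepted.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not an http(s) URL: {url!r}")
        start_urls.append({"url": url})

    return {"startUrls": start_urls, "extractionMode": mode, "maxPages": max_pages}


run_input = build_input(
    ["https://example.com", "https://www.iana.org/domains/reserved"],
    mode="markdown",
    max_pages=2,
)
```

The resulting run_input matches the input example above and can be pasted into the Input tab as JSON or sent through the Apify API.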

📦 Output example

{
  "type": "webpage_text",
  "status": "ok",
  "inputIndex": 1,
  "requestedUrl": "https://example.com",
  "finalUrl": "https://example.com/",
  "canonicalUrl": null,
  "httpStatusCode": 200,
  "contentType": "text/html",
  "extractionMode": "markdown",
  "title": "Example Domain",
  "metaDescription": null,
  "author": null,
  "publishedAt": null,
  "language": "en",
  "excerpt": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
  "text": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
  "wordCount": 20,
  "charCount": 127,
  "headings": [
    { "level": 1, "text": "Example Domain" }
  ],
  "links": [
    {
      "text": "More information...",
      "url": "https://www.iana.org/domains/example",
      "isExternal": true
    }
  ],
  "quality": {
    "isSparse": false,
    "wasRedirected": true,
    "reason": null
  },
  "scrapedAt": "2026-06-13T14:12:00.000Z"
}
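For RAG ingestion, a row like the one above usually needs to be reduced to an identifier, the content, and a little metadata. Here is a minimal sketch of that mapping, assuming the field names shown in the output example; the function name to_rag_document and the shape of the returned record are hypothetical and should be adapted to your own vector store or prompt format.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def to_rag_document(row: Row) -> Dict[str, Any]:
    """Map one dataset row to a small record for indexing or prompting."""
    # Prefer the Markdown field when present, otherwise fall back to text.
    content = row.get("markdown") or row.get("text") or ""
    # Prefer the canonical URL as a stable id, then the final URL after redirects.
    doc_id = row.get("canonicalUrl") or row.get("finalUrl") or row.get("requestedUrl")
    return {
        "id": doc_id,
        "content": content,
        "metadata": {
            "title": row.get("title"),
            "language": row.get("language"),
            "scrapedAt": row.get("scrapedAt"),
            "headings": [h["text"] for h in (row.get("headings") or [])],
        },
    }


# Trimmed copy of the output example above.
example_row = {
    "requestedUrl": "https://example.com",
    "finalUrl": "https://example.com/",
    "canonicalUrl": None,
    "title": "Example Domain",
    "language": "en",
    "text": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
    "headings": [{"level": 1, "text": "Example Domain"}],
    "scrapedAt": "2026-06-13T14:12:00.000Z",
}
document = to_rag_document(example_row)
```

Falling back from canonicalUrl to finalUrl keeps the id stable for pages that do not declare a canonical URL, as in the example row.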

💸 Pricing

This Actor uses pay-per-event pricing. You are charged for each successfully extracted webpage. Failed, invalid, unavailable, empty, or non-HTML URLs are skipped and are not saved as output items.

Current event prices are:

Tier        Price per extracted webpage
FREE        $0.00090
BRONZE      $0.00090
SILVER      $0.00070
GOLD        $0.00050
PLATINUM    $0.00035
DIAMOND     $0.00025

There is no separate Actor-start charge; you pay only the per-webpage event price.
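To budget a run, multiply the number of pages you expect to be extracted successfully by the price for your tier. The sketch below assumes the per-webpage prices listed above and no other charges; the function name estimate_cost is hypothetical, and the prices may change, so check the current values before relying on the result.

```python
from decimal import Decimal

# Per-webpage event prices in USD, copied from the list above.
PRICE_PER_PAGE = {
    "FREE": Decimal("0.00090"),
    "BRONZE": Decimal("0.00090"),
    "SILVER": Decimal("0.00070"),
    "GOLD": Decimal("0.00050"),
    "PLATINUM": Decimal("0.00035"),
    "DIAMOND": Decimal("0.00025"),
}


def estimate_cost(extracted_pages: int, tier: str) -> Decimal:
    """Return the estimated charge in USD for successfully extracted pages."""
    if extracted_pages < 0:
        raise ValueError("extracted_pages must not be negative")
    try:
        price = PRICE_PER_PAGE[tier.upper()]
    except KeyError:
        raise ValueError(f"Unknown tier: {tier!r}") from None
    # Decimal avoids floating-point rounding noise in small prices.
    return price * extracted_pages


gold_cost = estimate_cost(1000, "GOLD")      # 1,000 pages on the GOLD tier
free_cost = estimate_cost(1000, "FREE")      # 1,000 pages on the FREE tier
```

Only successfully extracted pages count toward extracted_pages, since skipped URLs are not charged as extracted webpages.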

⚠️ Limits and caveats

Webpage Text Extractor works best on public HTML pages with readable content in the initial page response. Pages that require login, block access, return non-HTML files, or rely heavily on client-side rendering may produce no output item or a partial row.

The Actor does not crawl a whole website from one URL. It extracts the submitted URLs only. If you need a link map first, use a crawler to collect URLs, then pass selected pages to this Actor.
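If a crawler gives you more URLs than one run accepts, you can split them into several inputs. This is a minimal sketch, assuming the 100-page public maximum mentioned earlier and the documented input fields; the function name batch_inputs is hypothetical, and removing duplicate URLs is my own choice rather than documented Actor behavior.

```python
from typing import Any, Dict, List

MAX_PAGES_LIMIT = 100  # public maximum stated on this page


def batch_inputs(urls: List[str], mode: str, batch_size: int = MAX_PAGES_LIMIT) -> List[Dict[str, Any]]:
    """Split a URL list into run inputs of at most batch_size URLs each."""
    if not 1 <= batch_size <= MAX_PAGES_LIMIT:
        raise ValueError(f"batch_size must be between 1 and {MAX_PAGES_LIMIT}")
    # Drop exact duplicates while keeping the original order.
    unique_urls = list(dict.fromkeys(urls))
    batches = []
    for start in range(0, len(unique_urls), batch_size):
        chunk = unique_urls[start:start + batch_size]
        batches.append({
            "startUrls": [{"url": url} for url in chunk],
            "extractionMode": mode,
            "maxPages": len(chunk),
        })
    return batches


# Hypothetical URLs collected by a separate crawler.
collected = [f"https://example.com/docs/page-{i}" for i in range(250)]
inputs = batch_inputs(collected, mode="articleText")
```

Each element of inputs can then be submitted as its own run.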

❓ FAQ

🧾 Can I use this as a webpage to Markdown converter?

Yes. Choose Markdown for LLMs. The main text field will contain Markdown, and the markdown field will contain the same Markdown value for easy filtering.

🔗 Does the output include headings and links?

Yes. Headings and visible links are included by default when the page provides them. You do not need to turn on separate metadata options.

🔒 Does it scrape private pages?

No. This Actor is for public web pages. It does not accept cookies, sessions, API keys, or login credentials.

⚠️ What happens when a URL fails?

The Actor logs the skipped URL and continues with the rest of the input. Only successfully extracted pages are saved to the dataset.
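Because skipped URLs appear only in the log, you may want to work out which inputs produced no row. The sketch below compares the submitted URLs with the requestedUrl values in the dataset. It assumes requestedUrl echoes the submitted string unchanged, which the output example suggests but this page does not guarantee; the function name find_skipped and the sample data are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def find_skipped(submitted_urls: List[str], rows: Iterable[Dict[str, Any]]) -> List[str]:
    """Return submitted URLs that have no matching dataset row, in input order."""
    extracted = {row.get("requestedUrl") for row in rows}
    return [url for url in submitted_urls if url not in extracted]


# Hypothetical run: two URLs submitted, one row saved.
submitted = ["https://example.com", "https://example.com/missing-page"]
dataset_rows = [{"requestedUrl": "https://example.com", "status": "ok"}]
skipped = find_skipped(submitted, dataset_rows)
```

The skipped list can be retried in a later run or checked against the run log for the reason each URL was skipped.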

📝 Changelog

  • 0.1: Initial release.

🆘 Support

For issues, questions, or feature requests, file a ticket and I'll fix or implement it in less than 24h 🫡

Made with ❤️ by Maxime Dupré