Webpage Text Extractor
Pricing: Pay per event
Developer: Stas Persiianenko
Extract clean text content from web pages. Strips HTML and returns structured text with headings, links, metadata, and word count.
What does Webpage Text Extractor do?
This actor fetches web pages and extracts their clean text content by stripping all HTML tags, scripts, and styles. It identifies the main content area (article, main, etc.), extracts headings structure, page links, and metadata like author, publish date, and language. Use it for LLM input preparation, content analysis, text mining, or feeding clean text into downstream data pipelines.
Use cases
- AI/LLM engineers -- convert web pages to clean text for RAG pipelines, fine-tuning datasets, or prompt context
- Content analysts -- extract text for sentiment analysis, topic modeling, keyword extraction, or NLP processing
- Data journalists -- collect article text from multiple news sources for comparison and analysis
- Accessibility auditors -- extract text structure and heading hierarchy to verify correct semantic markup
- Data pipeline builders -- feed clean, structured text into downstream processing tools and databases
Why use Webpage Text Extractor?
- AI-ready clean text -- strips all HTML, scripts, styles, and ads to return structured output ready for LLM training, RAG pipelines, and AI agent workflows
- Rich metadata -- extracts title, meta description, author, publish date, language, and Open Graph tags
- Heading structure -- returns all headings with their level (H1-H6) for document outline analysis
- Link extraction -- captures all links with text, href, and internal/external classification
- Configurable metadata -- toggle metadata inclusion with the `includeMetadata` option to control output size
- Pay-per-event pricing -- costs just $0.001 per URL with no monthly subscription
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string[] | Yes | -- | List of web page URLs to extract text from |
| `includeMetadata` | boolean | No | true | Include links and extra metadata in the output |
Example input
{"urls": ["https://en.wikipedia.org/wiki/Web_scraping","https://blog.apify.com"],"includeMetadata": true}
Output example
{"url": "https://en.wikipedia.org/wiki/Web_scraping","title": "Web scraping - Wikipedia","metaDescription": "...","author": null,"publishedDate": null,"language": "en","mainText": "Web scraping is the process of...","headings": [{ "level": 1, "text": "Web scraping" },{ "level": 2, "text": "Techniques" }],"links": [{ "text": "data extraction", "href": "/wiki/Data_extraction", "isExternal": false }],"wordCount": 3450,"charCount": 21000,"error": null,"extractedAt": "2026-03-01T12:00:00.000Z"}
Output fields
| Field | Type | Description |
|---|---|---|
| `url` | string | The extracted page URL |
| `title` | string | The page title |
| `metaDescription` | string | The meta description tag content |
| `author` | string | Author name if detected from meta tags |
| `publishedDate` | string | Publish date if detected from meta tags |
| `language` | string | Page language from the `lang` attribute |
| `mainText` | string | Clean text content with HTML stripped |
| `headings` | array | List of headings with level (1-6) and text |
| `links` | array | List of links with text, href, and isExternal flag |
| `wordCount` | number | Total words in the extracted text |
| `charCount` | number | Total characters in the extracted text |
| `error` | string | Error message if extraction failed, null otherwise |
| `extractedAt` | string | ISO timestamp of the extraction |
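As an illustration of how to work with this output shape, here is a minimal Python sketch that turns the `headings` array into an indented outline. The `item` record below is an abbreviated copy of the output example above, not live API output:

```python
# A dataset item shaped like the output example above (abbreviated).
item = {
    "title": "Web scraping - Wikipedia",
    "headings": [
        {"level": 1, "text": "Web scraping"},
        {"level": 2, "text": "Techniques"},
    ],
    "wordCount": 3450,
}

def outline(item):
    """Render the headings array as an indented bullet outline."""
    return "\n".join(
        "  " * (h["level"] - 1) + "- " + h["text"] for h in item["headings"]
    )

print(outline(item))
```

For the sample item this prints a two-level outline: `- Web scraping` followed by an indented `- Techniques`.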
How to extract text from web pages
- Open Webpage Text Extractor on Apify.
- Enter one or more web page URLs in the `urls` field.
- Choose whether to include metadata (links, headings, author info) by setting `includeMetadata`.
- Click Start and wait for the run to finish.
- Download the extracted text as JSON, CSV, or Excel from the Dataset tab.
How much does it cost to extract text from web pages?
Webpage Text Extractor uses Apify's pay-per-event pricing model. You only pay for what you use.
| Event | Price | Description |
|---|---|---|
| Start | $0.035 | One-time per run |
| URL extracted | $0.001 | Per page extracted |
Example costs:
- 10 pages: $0.035 + 10 x $0.001 = $0.045
- 100 pages: $0.035 + 100 x $0.001 = $0.135
- 1,000 pages: $0.035 + 1,000 x $0.001 = $1.035
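The totals above follow a simple linear formula, sketched here with the fees taken from the pricing table:

```python
START_FEE = 0.035  # one-time fee per run
PER_URL = 0.001    # fee per page extracted

def run_cost(num_urls: int) -> float:
    """Estimated cost of a single run extracting num_urls pages."""
    return START_FEE + num_urls * PER_URL

for n in (10, 100, 1000):
    print(f"{n} pages: ${run_cost(n):.3f}")
```

This reproduces the example costs listed above ($0.045, $0.135, and $1.035).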
Using the Apify API
You can start Webpage Text Extractor programmatically from your own applications using the Apify API. The following examples show how to run the actor and retrieve results in both Node.js and Python.
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('automation-lab/webpage-text-extractor').call({
    urls: ['https://en.wikipedia.org/wiki/Web_scraping'],
    includeMetadata: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')

run = client.actor('automation-lab/webpage-text-extractor').call(run_input={
    'urls': ['https://en.wikipedia.org/wiki/Web_scraping'],
    'includeMetadata': True,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```
cURL
```shell
curl -X POST "https://api.apify.com/v2/acts/automation-lab~webpage-text-extractor/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://en.wikipedia.org/wiki/Web_scraping"], "includeMetadata": true}'
```
Use with Claude AI (MCP)
This actor is available as a tool in Claude AI through the Model Context Protocol (MCP). Add it to Claude Desktop, Cursor, Windsurf, or any MCP-compatible client to let Claude extract clean text from web pages directly in your conversation.
Setup
Claude Desktop (claude_desktop_config.json):
{"mcpServers": {"apify": {"command": "npx","args": ["-y", "@anthropic-ai/mcp-apify"],"env": {"APIFY_TOKEN": "your-apify-api-token"}}}}
Claude Code CLI:
```shell
claude mcp add apify -- npx -y @anthropic-ai/mcp-apify
```
Example prompts
- "Extract the main text content from this article: https://example.com/blog/post"
- "Get clean text from these web pages and summarize them"
- "How many words are on this page and what is the heading structure?"
Integrations
Webpage Text Extractor works with all major automation platforms available on Apify. Export results to Google Sheets to build a text content database for analysis. Use Zapier or Make to trigger text extraction whenever new URLs are added to a watchlist. Send extracted text to Slack channels for quick review. Pipe results into n8n workflows to feed clean text into LLM APIs, vector databases, or NLP pipelines. Set up webhooks to get notified when extraction finishes and automatically pass text to downstream processing.
Tips and best practices
- Set `includeMetadata` to false if you only need the main text -- this reduces output size significantly, especially for pages with hundreds of links
- Use the `headings` array to understand document structure before feeding text into LLMs -- heading hierarchy provides valuable context for summarization and Q&A
- Filter by `language` when processing multilingual sites to route text to the correct NLP model or translation pipeline
- Combine with Content Readability Checker to get both the raw text and readability scores for each page
- Chain with Sitemap URL Extractor to first get all URLs from a sitemap, then extract clean text from every page for a complete content export
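The language-routing tip above can be sketched as follows, assuming dataset items shaped like the output example (the sample records here are made up for illustration):

```python
# Hypothetical dataset items, shaped like the actor's output records.
items = [
    {"url": "https://example.com/en-post", "language": "en", "mainText": "Hello world"},
    {"url": "https://example.com/de-post", "language": "de", "mainText": "Hallo Welt"},
    {"url": "https://example.com/no-lang", "language": None, "mainText": "..."},
]

# Group extracted pages by detected language so each bucket can be
# routed to the right NLP model or translation pipeline.
by_language = {}
for item in items:
    by_language.setdefault(item.get("language") or "unknown", []).append(item)

for lang, pages in by_language.items():
    print(lang, len(pages))
```

Pages with no `language` value land in an `"unknown"` bucket, which you might route to a language-detection step first.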
Legality
This tool analyzes publicly accessible web content. Automated analysis of public web resources is standard practice in SEO and web development. Always respect robots.txt directives and rate limits when analyzing third-party websites. For personal data processing, ensure compliance with applicable privacy regulations.
FAQ
Does the actor render JavaScript? No. The actor uses plain HTTP requests and extracts text from the initial HTML response. Pages that load content dynamically via JavaScript after page load may return incomplete text.
What is the mainText field?
It contains the clean text content extracted from the page's main content area, with all HTML tags, scripts, styles, and navigation elements stripped out. This is the primary output field for most use cases.
Can I extract text from PDF or Word documents? No. The actor only processes HTML web pages. For document conversion, use a dedicated file processing tool or actor.
The extracted text includes navigation menu and footer text. How do I get only the article content?
The actor tries to detect the main content area using semantic HTML elements (<article>, <main>). If the website does not use these elements, the actor falls back to <body> and strips common non-content elements. Check the contentArea field in the output -- if it says "body", the site likely lacks proper semantic markup, which can cause nav/footer text to be included.
The actor returns very little or no text for a page that has content. Why? The page likely loads its content via client-side JavaScript (React, Angular, Vue, etc.). The actor uses plain HTTP requests and parses the initial HTML response without executing JavaScript. For JavaScript-heavy sites, you may need a browser-based scraping solution.
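One way to catch such pages in a large batch is to flag runs that succeeded but returned suspiciously little text. This is only a heuristic sketch; the `min_words` threshold is a made-up default you should tune to your content:

```python
def looks_js_rendered(item, min_words=50):
    """Flag results that extracted successfully but returned very little
    text, which often indicates a client-side-rendered (JS-heavy) page."""
    return item.get("error") is None and item.get("wordCount", 0) < min_words

sample = {"url": "https://example.com/spa", "error": None, "wordCount": 3}
print(looks_js_rendered(sample))
```

Flagged URLs can then be re-run through a browser-based scraper instead.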
Other SEO and content tools on Apify
- Word Counter -- count words, sentences, and paragraphs on any web page
- Website Language Detector -- detect the language of web pages from HTML attributes
- Website Performance Checker -- measure TTFB, page size, and compression
- Website Carbon Calculator -- estimate the carbon footprint of any web page
- Website Health Report -- comprehensive website health audit with scoring
