Webpage Text Extractor
Pricing: Pay per event
Developer: Stas Persiianenko
Extract clean text content from web pages. Strips HTML and returns structured text with headings, links, metadata, and word count.
What does Webpage Text Extractor do?
This actor fetches web pages and extracts their clean text content by stripping all HTML tags, scripts, and styles. It identifies the main content area (article, main, etc.), extracts headings structure, page links, and metadata like author, publish date, and language. Use it for LLM input preparation, content analysis, text mining, or feeding clean text into downstream data pipelines.
Use cases
- AI/LLM engineers -- convert web pages to clean text for RAG pipelines, fine-tuning datasets, or prompt context
- Content analysts -- extract text for sentiment analysis, topic modeling, keyword extraction, or NLP processing
- Data journalists -- collect article text from multiple news sources for comparison and analysis
- Accessibility auditors -- extract text structure and heading hierarchy to verify correct semantic markup
- Data pipeline builders -- feed clean, structured text into downstream processing tools and databases
Why use Webpage Text Extractor?
- AI-ready clean text -- strips all HTML, scripts, styles, and ads to return structured output ready for LLM training, RAG pipelines, and AI agent workflows
- Rich metadata -- extracts title, meta description, author, publish date, language, and Open Graph tags
- Heading structure -- returns all headings with their level (H1-H6) for document outline analysis
- Link extraction -- captures all links with text, href, and internal/external classification
- Configurable metadata -- toggle metadata inclusion with the `includeMetadata` option to control output size
- Pay-per-event pricing -- costs just $0.001 per URL with no monthly subscription
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string[] | Yes | -- | List of web page URLs to extract text from |
| `includeMetadata` | boolean | No | true | Include links and extra metadata in the output |
Example input
{"urls": ["https://en.wikipedia.org/wiki/Web_scraping","https://blog.apify.com"],"includeMetadata": true}
Output example
{"url": "https://en.wikipedia.org/wiki/Web_scraping","title": "Web scraping - Wikipedia","metaDescription": "...","author": null,"publishedDate": null,"language": "en","mainText": "Web scraping is the process of...","headings": [{ "level": 1, "text": "Web scraping" },{ "level": 2, "text": "Techniques" }],"links": [{ "text": "data extraction", "href": "/wiki/Data_extraction", "isExternal": false }],"wordCount": 3450,"charCount": 21000,"error": null,"extractedAt": "2026-03-01T12:00:00.000Z"}
Output fields
| Field | Type | Description |
|---|---|---|
| `url` | string | The extracted page URL |
| `title` | string | The page title |
| `metaDescription` | string | The meta description tag content |
| `author` | string | Author name if detected from meta tags |
| `publishedDate` | string | Publish date if detected from meta tags |
| `language` | string | Page language from the `lang` attribute |
| `mainText` | string | Clean text content with HTML stripped |
| `headings` | array | List of headings with level (1-6) and text |
| `links` | array | List of links with text, href, and isExternal flag |
| `wordCount` | number | Total words in the extracted text |
| `charCount` | number | Total characters in the extracted text |
| `error` | string | Error message if extraction failed, null otherwise |
| `extractedAt` | string | ISO timestamp of the extraction |
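As an illustration of how to work with this output shape, here is a minimal Python sketch that turns the `headings` array into an indented outline. The `item` record below is an abbreviated copy of the output example above, not live API output:

```python
# A dataset item shaped like the output example above (abbreviated).
item = {
    "title": "Web scraping - Wikipedia",
    "headings": [
        {"level": 1, "text": "Web scraping"},
        {"level": 2, "text": "Techniques"},
    ],
    "wordCount": 3450,
}

def outline(item):
    """Render the headings array as an indented bullet outline."""
    return "\n".join(
        "  " * (h["level"] - 1) + "- " + h["text"] for h in item["headings"]
    )

print(outline(item))
```

For the sample item this prints a two-level outline: `- Web scraping` followed by an indented `- Techniques`.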
How to extract text from web pages
- Open Webpage Text Extractor on Apify.
- Enter one or more web page URLs in the `urls` field.
- Choose whether to include metadata (links, headings, author info) by setting `includeMetadata`.
- Click Start and wait for the run to finish.
- Download the extracted text as JSON, CSV, or Excel from the Dataset tab.
How much does it cost to extract text from web pages?
Webpage Text Extractor uses Apify's pay-per-event pricing model. You only pay for what you use.
| Event | Price | Description |
|---|---|---|
| Start | $0.035 | One-time per run |
| URL extracted | $0.001 | Per page extracted |
Example costs:
- 10 pages: $0.035 + 10 x $0.001 = $0.045
- 100 pages: $0.035 + 100 x $0.001 = $0.135
- 1,000 pages: $0.035 + 1,000 x $0.001 = $1.035
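The totals above follow a simple linear formula, sketched here with the fees taken from the pricing table:

```python
START_FEE = 0.035  # one-time fee per run
PER_URL = 0.001    # fee per page extracted

def run_cost(num_urls: int) -> float:
    """Estimated cost of a single run extracting num_urls pages."""
    return START_FEE + num_urls * PER_URL

for n in (10, 100, 1000):
    print(f"{n} pages: ${run_cost(n):.3f}")
```

This reproduces the example costs listed above ($0.045, $0.135, and $1.035).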
Using the Apify API
You can start Webpage Text Extractor programmatically from your own applications using the Apify API. The following examples show how to run the actor and retrieve results in both Node.js and Python.
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('automation-lab/webpage-text-extractor').call({
    urls: ['https://en.wikipedia.org/wiki/Web_scraping'],
    includeMetadata: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')

run = client.actor('automation-lab/webpage-text-extractor').call(run_input={
    'urls': ['https://en.wikipedia.org/wiki/Web_scraping'],
    'includeMetadata': True,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```
cURL
```shell
curl -X POST "https://api.apify.com/v2/acts/automation-lab~webpage-text-extractor/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://en.wikipedia.org/wiki/Web_scraping"], "includeMetadata": true}'
```
Use with Claude AI (MCP)
This actor is available as a tool in Claude AI through the Model Context Protocol (MCP). Add it to Claude Desktop, Cursor, Windsurf, or any MCP-compatible client to let Claude extract clean text from web pages directly in your conversation.
Setup
Claude Desktop (claude_desktop_config.json):
{"mcpServers": {"apify": {"command": "npx","args": ["-y", "@anthropic-ai/mcp-apify"],"env": {"APIFY_TOKEN": "your-apify-api-token"}}}}
Claude Code CLI:
```shell
claude mcp add apify -- npx -y @anthropic-ai/mcp-apify
```
Example prompts
- "Extract the main text content from this article: https://example.com/blog/post"
- "Get clean text from these web pages and summarize them"
- "How many words are on this page and what is the heading structure?"
Integrations
Webpage Text Extractor works with all major automation platforms available on Apify. Export results to Google Sheets to build a text content database for analysis. Use Zapier or Make to trigger text extraction whenever new URLs are added to a watchlist. Send extracted text to Slack channels for quick review. Pipe results into n8n workflows to feed clean text into LLM APIs, vector databases, or NLP pipelines. Set up webhooks to get notified when extraction finishes and automatically pass text to downstream processing.
Tips and best practices
- Set `includeMetadata` to false if you only need the main text -- this reduces output size significantly, especially for pages with hundreds of links
- Use the `headings` array to understand document structure before feeding text into LLMs -- heading hierarchy provides valuable context for summarization and Q&A
- Filter by `language` when processing multilingual sites to route text to the correct NLP model or translation pipeline
- Combine with Content Readability Checker to get both the raw text and readability scores for each page
- Chain with Sitemap URL Extractor to first get all URLs from a sitemap, then extract clean text from every page for a complete content export
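The language-routing tip above can be sketched as follows, assuming dataset items shaped like the output example (the sample records here are made up for illustration):

```python
# Hypothetical dataset items, shaped like the actor's output records.
items = [
    {"url": "https://example.com/en-post", "language": "en", "mainText": "Hello world"},
    {"url": "https://example.com/de-post", "language": "de", "mainText": "Hallo Welt"},
    {"url": "https://example.com/no-lang", "language": None, "mainText": "..."},
]

# Group extracted pages by detected language so each bucket can be
# routed to the right NLP model or translation pipeline.
by_language = {}
for item in items:
    by_language.setdefault(item.get("language") or "unknown", []).append(item)

for lang, pages in by_language.items():
    print(lang, len(pages))
```

Pages with no `language` value land in an `"unknown"` bucket, which you might route to a language-detection step first.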
Legality
This tool analyzes publicly accessible web content. Automated analysis of public web resources is standard practice in SEO and web development. Always respect robots.txt directives and rate limits when analyzing third-party websites. For personal data processing, ensure compliance with applicable privacy regulations.
FAQ
Does the actor render JavaScript? No. The actor uses plain HTTP requests and extracts text from the initial HTML response. Pages that load content dynamically via JavaScript after page load may return incomplete text.
What is the mainText field?
It contains the clean text content extracted from the page's main content area, with all HTML tags, scripts, styles, and navigation elements stripped out. This is the primary output field for most use cases.
Can I extract text from PDF or Word documents? No. The actor only processes HTML web pages. For document conversion, use a dedicated file processing tool or actor.
The extracted text includes navigation menu and footer text. How do I get only the article content?
The actor tries to detect the main content area using semantic HTML elements (<article>, <main>). If the website does not use these elements, the actor falls back to <body> and strips common non-content elements. Check the contentArea field in the output -- if it says "body", the site likely lacks proper semantic markup, which can cause nav/footer text to be included.
The actor returns very little or no text for a page that has content. Why? The page likely loads its content via client-side JavaScript (React, Angular, Vue, etc.). The actor uses plain HTTP requests and parses the initial HTML response without executing JavaScript. For JavaScript-heavy sites, you may need a browser-based scraping solution.
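One way to catch such pages in a large batch is to flag runs that succeeded but returned suspiciously little text. This is only a heuristic sketch; the `min_words` threshold is a made-up default you should tune to your content:

```python
def looks_js_rendered(item, min_words=50):
    """Flag results that extracted successfully but returned very little
    text, which often indicates a client-side-rendered (JS-heavy) page."""
    return item.get("error") is None and item.get("wordCount", 0) < min_words

sample = {"url": "https://example.com/spa", "error": None, "wordCount": 3}
print(looks_js_rendered(sample))
```

Flagged URLs can then be re-run through a browser-based scraper instead.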
Other SEO and content tools on Apify
- Word Counter -- count words, sentences, and paragraphs on any web page
- Website Language Detector -- detect the language of web pages from HTML attributes
- Website Performance Checker -- measure TTFB, page size, and compression
- Website Carbon Calculator -- estimate the carbon footprint of any web page
- Website Health Report -- comprehensive website health audit with scoring
