Webpage Text Extractor
Developer: Stas Persiianenko
Extract clean text content from web pages. Strips HTML and returns structured text with headings, links, metadata, and word count.
What does Webpage Text Extractor do?
This actor fetches web pages and extracts their clean text content by stripping all HTML tags, scripts, and styles. It identifies the main content area (article, main, etc.), extracts headings structure, page links, and metadata like author, publish date, and language. Use it for LLM input preparation, content analysis, text mining, or feeding clean text into downstream data pipelines.
Use cases
- AI/LLM engineers -- convert web pages to clean text for RAG pipelines, fine-tuning datasets, or prompt context
- Content analysts -- extract text for sentiment analysis, topic modeling, keyword extraction, or NLP processing
- Data journalists -- collect article text from multiple news sources for comparison and analysis
- Accessibility auditors -- extract text structure and heading hierarchy to verify correct semantic markup
- Data pipeline builders -- feed clean, structured text into downstream processing tools and databases
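For the RAG and fine-tuning use cases above, the extracted `mainText` usually needs to be split into chunks before embedding. A minimal word-based chunking sketch; the chunk size and overlap values are illustrative choices, not actor settings:

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into overlapping word-based chunks for embedding."""
    words = text.split()
    if not words:
        return []
    chunks = []
    step = chunk_size - overlap  # assumes overlap < chunk_size
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Example: a 500-word text yields 3 overlapping chunks
sample = " ".join(f"word{i}" for i in range(500))
chunks = chunk_text(sample, chunk_size=200, overlap=40)
print(len(chunks), len(chunks[0].split()))
```

The overlap keeps sentence context intact across chunk boundaries, which tends to help retrieval quality.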
Why use Webpage Text Extractor?
- Clean text output -- strips all HTML, scripts, styles, and ads to return only readable content
- Rich metadata -- extracts title, meta description, author, publish date, language, and Open Graph tags
- Heading structure -- returns all headings with their level (H1-H6) for document outline analysis
- Link extraction -- captures all links with text, href, and internal/external classification
- Configurable metadata -- toggle metadata inclusion with the `includeMetadata` option to control output size
- Pay-per-event pricing -- costs just $0.001 per URL with no monthly subscription
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | string[] | Yes | -- | List of web page URLs to extract text from |
| includeMetadata | boolean | No | true | Include links and extra metadata in the output |
Example input
```json
{
  "urls": [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://blog.apify.com"
  ],
  "includeMetadata": true
}
```
Output example
```json
{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping - Wikipedia",
  "metaDescription": "...",
  "author": null,
  "publishedDate": null,
  "language": "en",
  "mainText": "Web scraping is the process of...",
  "headings": [
    { "level": 1, "text": "Web scraping" },
    { "level": 2, "text": "Techniques" }
  ],
  "links": [
    { "text": "data extraction", "href": "/wiki/Data_extraction", "isExternal": false }
  ],
  "wordCount": 3450,
  "charCount": 21000,
  "error": null,
  "extractedAt": "2026-03-01T12:00:00.000Z"
}
```
Output fields
| Field | Type | Description |
|---|---|---|
| url | string | The extracted page URL |
| title | string | The page title |
| metaDescription | string | The meta description tag content |
| author | string | Author name if detected from meta tags |
| publishedDate | string | Publish date if detected from meta tags |
| language | string | Page language from the lang attribute |
| mainText | string | Clean text content with HTML stripped |
| headings | array | List of headings with level (1-6) and text |
| links | array | List of links with text, href, and isExternal flag |
| wordCount | number | Total words in the extracted text |
| charCount | number | Total characters in the extracted text |
| error | string | Error message if extraction failed, null otherwise |
| extractedAt | string | ISO timestamp of the extraction |
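Dataset items with these fields can be post-processed directly. A small sketch, assuming an item shaped like the output example above, that rebuilds a document outline from `headings` and splits `links` by the `isExternal` flag:

```python
def summarize_item(item):
    """Condense one dataset item into an outline plus link counts."""
    # Indent each heading by its level to show the document outline
    outline = [
        "  " * (h["level"] - 1) + h["text"]
        for h in item.get("headings", [])
    ]
    links = item.get("links", [])
    internal = [l["href"] for l in links if not l["isExternal"]]
    external = [l["href"] for l in links if l["isExternal"]]
    return {
        "url": item["url"],
        "outline": outline,
        "internalLinks": len(internal),
        "externalLinks": len(external),
        "words": item.get("wordCount", 0),
    }

item = {
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
    "headings": [
        {"level": 1, "text": "Web scraping"},
        {"level": 2, "text": "Techniques"},
    ],
    "links": [
        {"text": "data extraction", "href": "/wiki/Data_extraction", "isExternal": False},
    ],
    "wordCount": 3450,
}
print(summarize_item(item))
```

The same loop works over the full list returned by the dataset API, since every item carries the same schema.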
How much does it cost?
Webpage Text Extractor uses Apify's pay-per-event pricing model. You only pay for what you use.
| Event | Price | Description |
|---|---|---|
| Start | $0.035 | One-time per run |
| URL extracted | $0.001 | Per page extracted |
Example costs:
- 10 pages: $0.035 + 10 x $0.001 = $0.045
- 100 pages: $0.035 + 100 x $0.001 = $0.135
- 1,000 pages: $0.035 + 1,000 x $0.001 = $1.035
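The run cost follows directly from the two events in the table: one Start charge per run plus one URL-extracted charge per page. A quick sketch of the same arithmetic:

```python
START_FEE = 0.035  # one-time charge per run
PER_URL = 0.001    # charge per page extracted

def run_cost(num_urls):
    """Estimated cost in USD of a single run over num_urls pages."""
    return START_FEE + num_urls * PER_URL

for n in (10, 100, 1000):
    print(f"{n} pages: ${run_cost(n):.3f}")
```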
Using the Apify API
You can start Webpage Text Extractor programmatically from your own applications using the Apify API. The following examples show how to run the actor and retrieve results in both Node.js and Python.
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('automation-lab/webpage-text-extractor').call({
    urls: ['https://en.wikipedia.org/wiki/Web_scraping'],
    includeMetadata: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')

run = client.actor('automation-lab/webpage-text-extractor').call(run_input={
    'urls': ['https://en.wikipedia.org/wiki/Web_scraping'],
    'includeMetadata': True,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```
Integrations
Webpage Text Extractor works with all major automation platforms available on Apify. Export results to Google Sheets to build a text content database for analysis. Use Zapier or Make to trigger text extraction whenever new URLs are added to a watchlist. Send extracted text to Slack channels for quick review. Pipe results into n8n workflows to feed clean text into LLM APIs, vector databases, or NLP pipelines. Set up webhooks to get notified when extraction finishes and automatically pass text to downstream processing.
Tips and best practices
- Set `includeMetadata` to false if you only need the main text -- this reduces output size significantly, especially for pages with hundreds of links
- Use the `headings` array to understand document structure before feeding text into LLMs -- heading hierarchy provides valuable context for summarization and Q&A
- Filter by `language` when processing multilingual sites to route text to the correct NLP model or translation pipeline
- Combine with Content Readability Checker to get both the raw text and readability scores for each page
- Chain with Sitemap URL Extractor to first get all URLs from a sitemap, then extract clean text from every page for a complete content export
FAQ
Does the actor render JavaScript?
No. The actor uses plain HTTP requests and extracts text from the initial HTML response. Pages that load content dynamically via JavaScript after page load may return incomplete text.
What is the mainText field?
It contains the clean text content extracted from the page's main content area, with all HTML tags, scripts, styles, and navigation elements stripped out. This is the primary output field for most use cases.
Can I extract text from PDF or Word documents?
No. The actor only processes HTML web pages. For document conversion, use a dedicated file processing tool or actor.