Webpage Text Extractor

Pricing

Pay per event

Rating

0.0

(0)

Developer

Stas Persiianenko

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: a day ago

Extract clean text content from web pages. Strips HTML and returns structured text with headings, links, metadata, and word count.

What does Webpage Text Extractor do?

This actor fetches web pages and extracts clean text by stripping all HTML tags, scripts, and styles. It identifies the main content area (article, main, etc.) and extracts the heading structure, page links, and metadata such as author, publish date, and language. Use it for LLM input preparation, content analysis, text mining, or feeding clean text into downstream data pipelines.

Use cases

  • AI/LLM engineers -- convert web pages to clean text for RAG pipelines, fine-tuning datasets, or prompt context
  • Content analysts -- extract text for sentiment analysis, topic modeling, keyword extraction, or NLP processing
  • Data journalists -- collect article text from multiple news sources for comparison and analysis
  • Accessibility auditors -- extract text structure and heading hierarchy to verify correct semantic markup
  • Data pipeline builders -- feed clean, structured text into downstream processing tools and databases

Why use Webpage Text Extractor?

  • Clean text output -- strips all HTML, scripts, styles, and ads to return only readable content
  • Rich metadata -- extracts title, meta description, author, publish date, language, and Open Graph tags
  • Heading structure -- returns all headings with their level (H1-H6) for document outline analysis
  • Link extraction -- captures all links with text, href, and internal/external classification
  • Configurable metadata -- toggle metadata inclusion with the includeMetadata option to control output size
  • Pay-per-event pricing -- costs $0.001 per URL plus a one-time $0.035 start fee per run, with no monthly subscription

Input parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | string[] | Yes | -- | List of web page URLs to extract text from |
| includeMetadata | boolean | No | true | Include links and extra metadata in the output |

Example input

    {
      "urls": [
        "https://en.wikipedia.org/wiki/Web_scraping",
        "https://blog.apify.com"
      ],
      "includeMetadata": true
    }
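If the URL list comes from a spreadsheet or crawl, it can help to clean it before starting a run. The following is a minimal sketch, not part of the actor: `build_input` is a hypothetical helper, and the rule that only absolute http(s) URLs are accepted is an assumption rather than documented actor behavior.

```python
from urllib.parse import urlparse


def build_input(urls, include_metadata=True):
    """Build an actor input dict from raw URL strings (hypothetical helper)."""
    seen = set()
    cleaned = []
    for raw in urls:
        url = raw.strip()
        parsed = urlparse(url)
        # Assumption: only absolute http(s) URLs are useful to the actor.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not an absolute http(s) URL: {raw!r}")
        if url not in seen:  # drop exact duplicates, keep first-seen order
            seen.add(url)
            cleaned.append(url)
    if not cleaned:
        raise ValueError("urls is required and must not be empty")
    # Key names match the input parameters table: urls, includeMetadata.
    return {"urls": cleaned, "includeMetadata": include_metadata}


run_input = build_input(
    ["https://en.wikipedia.org/wiki/Web_scraping", "https://blog.apify.com"]
)
print(run_input)
```

The resulting dict can be passed as the run input in the API examples further down. Dropping duplicates up front may also avoid paying the per-URL event twice for the same page.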

Output example

    {
      "url": "https://en.wikipedia.org/wiki/Web_scraping",
      "title": "Web scraping - Wikipedia",
      "metaDescription": "...",
      "author": null,
      "publishedDate": null,
      "language": "en",
      "mainText": "Web scraping is the process of...",
      "headings": [
        { "level": 1, "text": "Web scraping" },
        { "level": 2, "text": "Techniques" }
      ],
      "links": [
        { "text": "data extraction", "href": "/wiki/Data_extraction", "isExternal": false }
      ],
      "wordCount": 3450,
      "charCount": 21000,
      "error": null,
      "extractedAt": "2026-03-01T12:00:00.000Z"
    }

Output fields

| Field | Type | Description |
|---|---|---|
| url | string | URL of the extracted page |
| title | string | The page title |
| metaDescription | string | The meta description tag content |
| author | string | Author name if detected from meta tags, null otherwise |
| publishedDate | string | Publish date if detected from meta tags, null otherwise |
| language | string | Page language from the lang attribute |
| mainText | string | Clean text content with HTML stripped |
| headings | array | List of headings with level (1-6) and text |
| links | array | List of links with text, href, and isExternal flag |
| wordCount | number | Total words in the extracted text |
| charCount | number | Total characters in the extracted text |
| error | string | Error message if extraction failed, null otherwise |
| extractedAt | string | ISO timestamp of the extraction |
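For downstream code it can be convenient to give the output a typed shape and to separate failed extractions early. The sketch below mirrors the field names listed above; the `Optional` markers beyond `author`, `publishedDate`, and `error`, and the `split_results` helper itself, are assumptions made for illustration rather than a schema published by the actor.

```python
from typing import List, Optional, Tuple, TypedDict


class Heading(TypedDict):
    level: int  # 1-6
    text: str


class Link(TypedDict):
    text: str
    href: str
    isExternal: bool


class PageResult(TypedDict, total=False):
    url: str
    title: Optional[str]
    metaDescription: Optional[str]
    author: Optional[str]
    publishedDate: Optional[str]
    language: Optional[str]
    mainText: str
    headings: List[Heading]
    links: List[Link]
    wordCount: int
    charCount: int
    error: Optional[str]
    extractedAt: str


def split_results(items: List[PageResult]) -> Tuple[List[PageResult], List[PageResult]]:
    """Split dataset items into (succeeded, failed) using the error field."""
    succeeded = [item for item in items if item.get("error") is None]
    failed = [item for item in items if item.get("error") is not None]
    return succeeded, failed


items: List[PageResult] = [
    {"url": "https://en.wikipedia.org/wiki/Web_scraping", "wordCount": 3450, "error": None},
    {"url": "https://blog.apify.com", "wordCount": 0, "error": "Request failed"},
]
ok, failed = split_results(items)
print(len(ok), "succeeded;", len(failed), "failed")
```

`total=False` is used so the type does not claim that every key is always present.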

How much does it cost?

Webpage Text Extractor uses Apify's pay-per-event pricing model. You only pay for what you use.

| Event | Price | Description |
|---|---|---|
| Start | $0.035 | One-time per run |
| URL extracted | $0.001 | Per page extracted |

Example costs:

  • 10 pages: $0.035 + 10 x $0.001 = $0.045
  • 100 pages: $0.035 + 100 x $0.001 = $0.135
  • 1,000 pages: $0.035 + 1,000 x $0.001 = $1.035
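The same arithmetic can be wrapped in a small function for budgeting other batch sizes. This is a sketch using the two prices from the pricing table; `estimate_run_cost` is a hypothetical helper name, and it assumes every requested page is billed as one "URL extracted" event.

```python
from decimal import Decimal

# Prices copied from the pricing table above.
START_FEE = Decimal("0.035")    # charged once per run
PER_URL_FEE = Decimal("0.001")  # charged per page extracted


def estimate_run_cost(page_count: int) -> Decimal:
    """Return the estimated USD cost of one run that extracts page_count pages."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return START_FEE + PER_URL_FEE * page_count


for pages in (10, 100, 1000):
    print(f"{pages} pages: ${estimate_run_cost(pages)}")
# 10 pages: $0.045
# 100 pages: $0.135
# 1000 pages: $1.035
```

Because the start fee is charged once per run, batching many URLs into a single run is cheaper than starting a separate run for each URL.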

Using the Apify API

You can start Webpage Text Extractor programmatically from your own applications using the Apify API. The following examples show how to run the actor and retrieve results in both Node.js and Python.

Node.js

    import { ApifyClient } from 'apify-client';

    // Authenticate with your Apify API token
    const client = new ApifyClient({ token: 'YOUR_TOKEN' });

    // Start the actor and wait for the run to finish
    const run = await client.actor('automation-lab/webpage-text-extractor').call({
        urls: ['https://en.wikipedia.org/wiki/Web_scraping'],
        includeMetadata: true,
    });

    // Fetch the extracted pages from the run's default dataset
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(items);

Python

    from apify_client import ApifyClient

    # Authenticate with your Apify API token
    client = ApifyClient('YOUR_TOKEN')

    # Start the actor and wait for the run to finish
    run = client.actor('automation-lab/webpage-text-extractor').call(run_input={
        'urls': ['https://en.wikipedia.org/wiki/Web_scraping'],
        'includeMetadata': True,
    })

    # Fetch the extracted pages from the run's default dataset
    items = client.dataset(run['defaultDatasetId']).list_items().items
    print(items)

Integrations

Webpage Text Extractor works with all major automation platforms available on Apify. Export results to Google Sheets to build a text content database for analysis. Use Zapier or Make to trigger text extraction whenever new URLs are added to a watchlist. Send extracted text to Slack channels for quick review. Pipe results into n8n workflows to feed clean text into LLM APIs, vector databases, or NLP pipelines. Set up webhooks to get notified when extraction finishes and automatically pass text to downstream processing.

Tips and best practices

  • Set includeMetadata to false if you only need the main text -- this reduces output size significantly, especially for pages with hundreds of links
  • Use the headings array to understand document structure before feeding text into LLMs -- heading hierarchy provides valuable context for summarization and Q&A
  • Filter by language when processing multilingual sites to route text to the correct NLP model or translation pipeline
  • Combine with Content Readability Checker to get both the raw text and readability scores for each page
  • Chain with Sitemap URL Extractor to first get all URLs from a sitemap, then extract clean text from every page for a complete content export
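Two of the tips above, using the headings array for structure and filtering by language, can be sketched as post-processing on the dataset items. Both helpers are hypothetical and work only on the documented `headings` and `language` fields; reducing language tags such as "en-US" to "en" and using "unknown" for missing values are assumptions.

```python
from collections import defaultdict


def format_outline(headings):
    """Render the headings array as an indented outline, one line per heading."""
    lines = []
    for heading in headings:
        indent = "  " * (heading["level"] - 1)  # H1 has no indent, H2 two spaces, ...
        lines.append(f"{indent}- {heading['text']}")
    return "\n".join(lines)


def group_by_language(items):
    """Group dataset items by primary language subtag ("en-US" becomes "en")."""
    groups = defaultdict(list)
    for item in items:
        language = (item.get("language") or "unknown").split("-")[0].lower()
        groups[language].append(item)
    return dict(groups)


headings = [
    {"level": 1, "text": "Web scraping"},
    {"level": 2, "text": "Techniques"},
]
print(format_outline(headings))
# - Web scraping
#   - Techniques
```

An outline like this can be placed in front of `mainText` in a prompt so the model sees the document structure before the body text.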

FAQ

Does the actor render JavaScript?

No. The actor uses plain HTTP requests and extracts text from the initial HTML response. Pages that load content dynamically via JavaScript after page load may return incomplete text.

What is the mainText field?

It contains the clean text content extracted from the page's main content area, with all HTML tags, scripts, styles, and navigation elements stripped out. This is the primary output field for most use cases.

Can I extract text from PDF or Word documents?

No. The actor only processes HTML web pages. For document conversion, use a dedicated file processing tool or actor.
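Since JavaScript-heavy pages may come back with incomplete text, one rough way to spot them is to flag successful results with very few words. This is only a heuristic sketch: `find_thin_pages` is a hypothetical helper, and the default threshold of 50 words is an arbitrary assumption that should be tuned to your content.

```python
def find_thin_pages(items, min_words=50):
    """Return URLs of successful extractions whose wordCount is below min_words."""
    return [
        item["url"]
        for item in items
        # Only look at pages that did not fail outright.
        if item.get("error") is None and (item.get("wordCount") or 0) < min_words
    ]


items = [
    {"url": "https://en.wikipedia.org/wiki/Web_scraping", "wordCount": 3450, "error": None},
    {"url": "https://example.com/app", "wordCount": 12, "error": None},
]
print(find_thin_pages(items))
```

Pages flagged this way are candidates for a browser-based scraper; a low word count alone does not prove that content was rendered by JavaScript.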