Website Content Extractor

CAPABILITIES: extract_content, convert_to_markdown, batch_urls, extract_metadata. INPUT: URLs (single or array), with optional selectors and output format. OUTPUT: structured JSON with title, text, metadata, word_count. FORMATS: json, markdown, text. PRICING: PPE $0.001/page.

Pricing: Pay per usage
Rating: 0.0 (0)
Developer: Bado (Maintained by Community)
Actor stats: 1 bookmark · 3 total users · 3 monthly active users · last modified 15 hours ago


What does Website Content Extractor do?

Extract clean, structured content from any web page — optimized for AI agents, RAG pipelines, and LLM context windows. Get page title, author, publication date, main content as clean text or Markdown, and word/token counts — all in a single API call. Built by Tropical Tools for the AI-agent ecosystem.

Whether you're building a RAG pipeline that needs web scraping at scale, feeding content into an LLM for summarization, or letting an AI agent research topics autonomously, Website Content Extractor delivers exactly what your system needs: noise-free content extraction with machine-readable metadata.

Why use this over alternatives?

  • AI-Native Output — Clean JSON with _tropicalTools metadata, word counts, and token estimates. No HTML noise.
  • 3 Output Formats — JSON (structured), Markdown (with frontmatter), or Plain Text (for direct LLM injection).
  • Batch Processing — Send 100+ URLs in one run. Get partial results even when some URLs fail.
  • MCP-Compatible — Machine-readable CAPABILITIES description for Model Context Protocol discovery.
  • Pay-Per-Event — Only pay for pages actually extracted. $0.001 per page.

Features

  • Extract page title, description, author, publish date, canonical URL, OG image
  • Convert any web page to clean Markdown automatically
  • Readability-powered content extraction (strips ads, nav, sidebars, footers)
  • Word count and LLM token estimates included in every result
  • Batch URL processing with per-URL error reporting
  • Residential proxy support for hard-to-reach sites
  • _tropicalTools metadata block for cross-actor discovery

Input Configuration

Field            Type     Description                           Default
urls             array    URLs to extract content from          (required)
outputFormat     string   json, markdown, or text               json
includeMetadata  boolean  Include title, author, date, OG tags  true
maxPagesPerUrl   integer  Max pages per starting URL            1
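The table above maps directly onto an Apify run input. A minimal sketch of calling the actor with the official apify-client Python package; the actor ID `bado/website-content-extractor` is an assumption inferred from this listing and may differ from the actual ID:

```python
def build_run_input(urls, output_format="json", include_metadata=True, max_pages_per_url=1):
    """Assemble actor input from the fields in the table above."""
    if output_format not in ("json", "markdown", "text"):
        raise ValueError(f"unsupported outputFormat: {output_format}")
    return {
        "urls": urls,
        "outputFormat": output_format,
        "includeMetadata": include_metadata,
        "maxPagesPerUrl": max_pages_per_url,
    }

def extract(token, urls, **options):
    """Run the actor and return its dataset items (one result per URL)."""
    from apify_client import ApifyClient  # pip install apify-client
    client = ApifyClient(token)
    run = client.actor("bado/website-content-extractor").call(
        run_input=build_run_input(urls, **options)
    )
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

`extract` blocks until the run finishes; for large batches you could start the run asynchronously with the client's `start` method instead of `call`.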

Output Example

{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping - Wikipedia",
  "author": "Contributors to Wikimedia projects",
  "description": "Web scraping is data scraping used for extracting data from websites.",
  "content": "Web scraping is the process of...",
  "markdown": "# Web scraping\n\nWeb scraping is...",
  "wordCount": 3888,
  "tokenEstimate": 5054,
  "_tropicalTools": {
    "actorName": "website-content-extractor",
    "extractedAt": "2026-03-19T22:30:00.000Z",
    "processingTimeMs": 2358
  }
}
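A short sketch of consuming one of these result objects in Python; the ~1.3 tokens-per-word ratio is inferred from the sample numbers above (5054 / 3888 ≈ 1.3) and is an assumption, not a documented formula:

```python
def summarize_result(result):
    """Pull the fields an LLM pipeline typically needs from one result dict."""
    return {
        "url": result["url"],
        "title": result.get("title", ""),
        "words": result.get("wordCount", 0),
        "tokens": result.get("tokenEstimate", 0),
    }

def estimate_tokens(word_count, tokens_per_word=1.3):
    """Rough token budget; the 1.3 ratio is inferred from the sample output."""
    return int(word_count * tokens_per_word)
```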

Cost Estimation

Volume        Estimated Cost  Time
10 pages      $0.01           ~10 sec
100 pages     $0.10           ~2 min
1,000 pages   $1.00           ~15 min
10,000 pages  $10.00          ~2.5 hr
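The flat per-page rate makes budgeting straightforward. A tiny helper matching the table above, assuming the $0.001/page rate from the pricing section:

```python
PRICE_PER_PAGE = 0.001  # PPE rate from the pricing section

def estimate_cost(pages, price_per_page=PRICE_PER_PAGE):
    """Pay-per-event cost in USD for a given page count."""
    return round(pages * price_per_page, 2)

def pages_for_budget(budget_usd, price_per_page=PRICE_PER_PAGE):
    """How many pages a budget covers at the listed rate."""
    return int(round(budget_usd / price_per_page))
```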

Use Cases

  • RAG Pipelines — Feed web pages into vector databases with clean text + metadata
  • AI Agent Research — Let agents read and understand any web page
  • Content Monitoring — Track changes on competitor pages, news sites, blogs
  • Data Enrichment — Add web page content to CRM records, lead lists
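For the RAG use case, the extracted `content` field usually gets chunked before embedding. A minimal sketch using overlapping word windows; real pipelines would chunk by the embedding model's tokenizer, and the 200/20 defaults here are illustrative assumptions:

```python
def chunk_text(text, max_words=200, overlap=20):
    """Split extracted content into overlapping word-window chunks for embedding."""
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # last window already reached the end of the text
    return chunks
```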

Integration with AI Agents (MCP)

This actor is optimized for AI agent discovery via MCP. Agents can find it by searching for:

  • extract_content, convert_to_markdown, batch_urls, extract_metadata
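A sketch of how an agent-side registry might parse the machine-readable CAPABILITIES string shown in the actor description at the top of this page; the assumed format is `CAPABILITIES: a, b, c.` followed by other labeled sections:

```python
def parse_capabilities(description):
    """Extract the capability list from a 'CAPABILITIES: a, b, c.' description."""
    for part in description.split(". "):
        if part.startswith("CAPABILITIES:"):
            return [c.strip().rstrip(".") for c in part.split(":", 1)[1].split(",")]
    return []
```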

FAQ

Q: Can it handle JavaScript-rendered pages (SPAs)? A: Yes, it uses a headless browser for pages that require JavaScript rendering.

Q: What happens if a URL fails? A: You get a result with the URL and error message. Other URLs continue processing. You're only charged for successful extractions.
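Per this answer, a batch run yields a mix of successful items and per-URL failures. A minimal sketch of splitting them; the `error` field name is an assumption based on the FAQ wording:

```python
def split_results(items):
    """Separate successful extractions from per-URL failures (assumes an 'error' key on failures)."""
    ok = [item for item in items if "error" not in item]
    failed = [item for item in items if "error" in item]
    return ok, failed
```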

Q: Does it respect robots.txt? A: Yes, the actor respects robots.txt directives by default.

Q: How is this different from apify/website-content-crawler? A: This actor is specifically optimized for AI agent consumption — clean JSON output, token estimates, MCP metadata, and batch processing with partial results.