Website Content Extractor

CAPABILITIES: extract_content, convert_to_markdown, batch_urls, extract_metadata. INPUT: URLs (single or array), with optional selectors and output format. OUTPUT: structured JSON with title, text, metadata, word_count. FORMATS: json, markdown, text. PRICING: PPE $0.001/page.

Pricing: Pay per usage
Rating: 0.0 (0)
Developer: Bado (Maintained by Community)
Actor stats: 1 bookmark · 3 total users · 3 monthly active users · last modified 15 hours ago


What does Website Content Extractor do?

Extract clean, structured content from any web page — optimized for AI agents, RAG pipelines, and LLM context windows. Get page title, author, publication date, main content as clean text or Markdown, and word/token counts — all in a single API call. Built by Tropical Tools for the AI-agent ecosystem.

Whether you're building a RAG pipeline that needs web scraping at scale, feeding content into an LLM for summarization, or letting an AI agent research topics autonomously, Website Content Extractor delivers exactly what your system needs: noise-free content extraction with machine-readable metadata.

Why use this over alternatives?

  • AI-Native Output — Clean JSON with _tropicalTools metadata, word counts, and token estimates. No HTML noise.
  • 3 Output Formats — JSON (structured), Markdown (with frontmatter), or Plain Text (for direct LLM injection).
  • Batch Processing — Send 100+ URLs in one run. Get partial results even when some URLs fail.
  • MCP-Compatible — Machine-readable CAPABILITIES description for Model Context Protocol discovery.
  • Pay-Per-Event — Only pay for pages actually extracted. $0.001 per page.

Features

  • Extract page title, description, author, publish date, canonical URL, OG image
  • Convert any web page to clean Markdown automatically
  • Readability-powered content extraction (strips ads, nav, sidebars, footers)
  • Word count and LLM token estimates included in every result
  • Batch URL processing with per-URL error reporting
  • Residential proxy support for hard-to-reach sites
  • _tropicalTools metadata block for cross-actor discovery

Input Configuration

Field            Type     Description                           Default
urls             array    URLs to extract content from          (required)
outputFormat     string   json, markdown, or text               json
includeMetadata  boolean  Include title, author, date, OG tags  true
maxPagesPerUrl   integer  Max pages per starting URL            1
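The table above maps directly onto an Apify run input. A minimal sketch of calling the actor with the official apify-client Python package; the actor ID `bado/website-content-extractor` is an assumption inferred from this listing and may differ from the actual ID:

```python
def build_run_input(urls, output_format="json", include_metadata=True, max_pages_per_url=1):
    """Assemble actor input from the fields in the table above."""
    if output_format not in ("json", "markdown", "text"):
        raise ValueError(f"unsupported outputFormat: {output_format}")
    return {
        "urls": urls,
        "outputFormat": output_format,
        "includeMetadata": include_metadata,
        "maxPagesPerUrl": max_pages_per_url,
    }

def extract(token, urls, **options):
    """Run the actor and return its dataset items (one result per URL)."""
    from apify_client import ApifyClient  # pip install apify-client
    client = ApifyClient(token)
    run = client.actor("bado/website-content-extractor").call(
        run_input=build_run_input(urls, **options)
    )
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

`extract` blocks until the run finishes; for large batches you could start the run asynchronously with the client's `start` method instead of `call`.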

Output Example

{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping - Wikipedia",
  "author": "Contributors to Wikimedia projects",
  "description": "Web scraping is data scraping used for extracting data from websites.",
  "content": "Web scraping is the process of...",
  "markdown": "# Web scraping\n\nWeb scraping is...",
  "wordCount": 3888,
  "tokenEstimate": 5054,
  "_tropicalTools": {
    "actorName": "website-content-extractor",
    "extractedAt": "2026-03-19T22:30:00.000Z",
    "processingTimeMs": 2358
  }
}
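A short sketch of consuming one of these result objects in Python; the ~1.3 tokens-per-word ratio is inferred from the sample numbers above (5054 / 3888 ≈ 1.3) and is an assumption, not a documented formula:

```python
def summarize_result(result):
    """Pull the fields an LLM pipeline typically needs from one result dict."""
    return {
        "url": result["url"],
        "title": result.get("title", ""),
        "words": result.get("wordCount", 0),
        "tokens": result.get("tokenEstimate", 0),
    }

def estimate_tokens(word_count, tokens_per_word=1.3):
    """Rough token budget; the 1.3 ratio is inferred from the sample output."""
    return int(word_count * tokens_per_word)
```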

Cost Estimation

Volume        Estimated Cost  Time
10 pages      $0.01           ~10 sec
100 pages     $0.10           ~2 min
1,000 pages   $1.00           ~15 min
10,000 pages  $10.00          ~2.5 hr
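The flat per-page rate makes budgeting straightforward. A tiny helper matching the table above, assuming the $0.001/page rate from the pricing section:

```python
PRICE_PER_PAGE = 0.001  # PPE rate from the pricing section

def estimate_cost(pages, price_per_page=PRICE_PER_PAGE):
    """Pay-per-event cost in USD for a given page count."""
    return round(pages * price_per_page, 2)

def pages_for_budget(budget_usd, price_per_page=PRICE_PER_PAGE):
    """How many pages a budget covers at the listed rate."""
    return int(round(budget_usd / price_per_page))
```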

Use Cases

  • RAG Pipelines — Feed web pages into vector databases with clean text + metadata
  • AI Agent Research — Let agents read and understand any web page
  • Content Monitoring — Track changes on competitor pages, news sites, blogs
  • Data Enrichment — Add web page content to CRM records, lead lists
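For the RAG use case, the extracted `content` field usually gets chunked before embedding. A minimal sketch using overlapping word windows; real pipelines would chunk by the embedding model's tokenizer, and the 200/20 defaults here are illustrative assumptions:

```python
def chunk_text(text, max_words=200, overlap=20):
    """Split extracted content into overlapping word-window chunks for embedding."""
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # last window already reached the end of the text
    return chunks
```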

Integration with AI Agents (MCP)

This actor is optimized for AI agent discovery via MCP. Agents can find it by searching for:

  • extract_content, convert_to_markdown, batch_urls, extract_metadata
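A sketch of how an agent-side registry might parse the machine-readable CAPABILITIES string shown in the actor description at the top of this page; the assumed format is `CAPABILITIES: a, b, c.` followed by other labeled sections:

```python
def parse_capabilities(description):
    """Extract the capability list from a 'CAPABILITIES: a, b, c.' description."""
    for part in description.split(". "):
        if part.startswith("CAPABILITIES:"):
            return [c.strip().rstrip(".") for c in part.split(":", 1)[1].split(",")]
    return []
```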

FAQ

Q: Can it handle JavaScript-rendered pages (SPAs)? A: Yes, it uses a headless browser for pages that require JavaScript rendering.

Q: What happens if a URL fails? A: You get a result with the URL and error message. Other URLs continue processing. You're only charged for successful extractions.
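Per this answer, a batch run yields a mix of successful items and per-URL failures. A minimal sketch of splitting them; the `error` field name is an assumption based on the FAQ wording:

```python
def split_results(items):
    """Separate successful extractions from per-URL failures (assumes an 'error' key on failures)."""
    ok = [item for item in items if "error" not in item]
    failed = [item for item in items if "error" in item]
    return ok, failed
```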

Q: Does it respect robots.txt? A: Yes, the actor respects robots.txt directives by default.

Q: How is this different from apify/website-content-crawler? A: This actor is specifically optimized for AI agent consumption — clean JSON output, token estimates, MCP metadata, and batch processing with partial results.