Website Content Extractor
Pricing
Pay per usage
CAPABILITIES: extract_content, convert_to_markdown, batch_urls, extract_metadata. INPUT: URLs (single or array), with optional selectors and output format. OUTPUT: structured JSON with title, text, metadata, word_count. FORMATS: json, markdown, text. PRICING: PPE $0.001/page.
Developer
Bado
Actor stats
- Bookmarked: 1
- Total users: 3
- Monthly active users: 3
- Last modified: 15 hours ago
What does Website Content Extractor do?
Extract clean, structured content from any web page — optimized for AI agents, RAG pipelines, and LLM context windows. Get page title, author, publication date, main content as clean text or Markdown, and word/token counts — all in a single API call. Built by Tropical Tools for the AI-agent ecosystem.
Whether you're building a RAG pipeline that needs web scraping at scale, feeding content into an LLM for summarization, or letting an AI agent research topics autonomously, Website Content Extractor delivers exactly what your system needs: noise-free content extraction with machine-readable metadata.
Why use this over alternatives?
- AI-Native Output — Clean JSON with `_tropicalTools` metadata, word counts, and token estimates. No HTML noise.
- 3 Output Formats — JSON (structured), Markdown (with frontmatter), or Plain Text (for direct LLM injection)
- Batch Processing — Send 100+ URLs in one run. Get partial results even when some URLs fail.
- MCP-Compatible — Machine-readable CAPABILITIES description for Model Context Protocol discovery
- Pay-Per-Event — Only pay for pages actually extracted. $0.001 per page.
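The batch behavior above (partial results even when some URLs fail) can be handled with a small sketch like the following. The exact shape of a failed record (a `url` plus an `error` field) is an assumption based on the FAQ wording in this listing; adjust it to the actual output your runs return.

```python
# Sketch: split a batch run's results into successes and failures.
# The error-record shape ({"url": ..., "error": ...}) is an ASSUMPTION
# inferred from the listing's FAQ, not a documented schema.
results = [
    {"url": "https://example.com/a", "title": "A", "wordCount": 120},
    {"url": "https://example.com/b", "error": "HTTP 404"},
]

# Successful extractions carry content fields; failures carry "error".
succeeded = [r for r in results if "error" not in r]
failed = [r for r in results if "error" in r]

print(len(succeeded), "ok;", len(failed), "failed")
for r in failed:
    # Failed URLs are good candidates for a retry queue.
    print("retry candidate:", r["url"], "-", r["error"])
```

Since you are only charged for successful extractions, retrying the `failed` list in a follow-up run costs nothing extra for the URLs that failed again.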
Features
- Extract page title, description, author, publish date, canonical URL, OG image
- Convert any web page to clean Markdown automatically
- Readability-powered content extraction (strips ads, nav, sidebars, footers)
- Word count and LLM token estimates included in every result
- Batch URL processing with per-URL error reporting
- Residential proxy support for hard-to-reach sites
- `_tropicalTools` metadata block for cross-actor discovery
Input Configuration
| Field | Type | Description | Default |
|---|---|---|---|
| urls | array | URLs to extract content from | required |
| outputFormat | string | `json`, `markdown`, or `text` | `json` |
| includeMetadata | boolean | Include title, author, date, OG tags | `true` |
| maxPagesPerUrl | integer | Max pages per starting URL | `1` |
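A minimal input payload built from the fields above might look like this. The field names come from the Input Configuration table; the validation helper is a client-side sketch, not part of the actor, and the actual run call (platform client, endpoint) is omitted.

```python
import json

# Input payload using the documented fields; values here are examples.
run_input = {
    "urls": ["https://en.wikipedia.org/wiki/Web_scraping"],
    "outputFormat": "json",   # one of: json, markdown, text
    "includeMetadata": True,  # default: true
    "maxPagesPerUrl": 1,      # default: 1
}

def validate_input(payload: dict) -> None:
    """Basic client-side checks mirroring the documented schema."""
    if not isinstance(payload.get("urls"), list) or not payload["urls"]:
        raise ValueError("urls is required and must be a non-empty array")
    if payload.get("outputFormat", "json") not in {"json", "markdown", "text"}:
        raise ValueError("outputFormat must be json, markdown, or text")

validate_input(run_input)
print(json.dumps(run_input, indent=2))
```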
Output Example
```json
{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping - Wikipedia",
  "author": "Contributors to Wikimedia projects",
  "description": "Web scraping is data scraping used for extracting data from websites.",
  "content": "Web scraping is the process of...",
  "markdown": "# Web scraping\n\nWeb scraping is...",
  "wordCount": 3888,
  "tokenEstimate": 5054,
  "_tropicalTools": {
    "actorName": "website-content-extractor",
    "extractedAt": "2026-03-19T22:30:00.000Z",
    "processingTimeMs": 2358
  }
}
```
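The output record can be consumed directly in a pipeline. A small sketch of pulling the fields an LLM workflow typically needs (the 8,000-token budget is an arbitrary example value, not an actor limit):

```python
import json

# Trimmed output record based on the example above.
raw = (
    '{"url": "https://en.wikipedia.org/wiki/Web_scraping",'
    ' "title": "Web scraping - Wikipedia",'
    ' "wordCount": 3888, "tokenEstimate": 5054,'
    ' "content": "Web scraping is the process of..."}'
)
record = json.loads(raw)

def fits_context(rec: dict, token_budget: int = 8000) -> bool:
    """Check the tokenEstimate against a model's context budget
    before sending the content to an LLM."""
    return rec.get("tokenEstimate", 0) <= token_budget

print(record["title"], fits_context(record))
```

Using `tokenEstimate` this way lets you skip or chunk oversized pages before they ever reach the model.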
Cost Estimation
| Volume | Estimated Cost | Time |
|---|---|---|
| 10 pages | $0.01 | ~10 sec |
| 100 pages | $0.10 | ~2 min |
| 1,000 pages | $1.00 | ~15 min |
| 10,000 pages | $10.00 | ~2.5 hr |
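The cost column above follows directly from the $0.001-per-page PPE price; a one-line estimator reproduces it:

```python
PRICE_PER_PAGE = 0.001  # PPE price from this listing: $0.001 per page

def estimate_cost(pages: int) -> float:
    """Estimated cost in USD for extracting `pages` pages."""
    return round(pages * PRICE_PER_PAGE, 2)

for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} pages -> ${estimate_cost(n):.2f}")
```

Remember that failed URLs are not charged, so actual cost can come in under the estimate for batches with errors.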
Use Cases
- RAG Pipelines — Feed web pages into vector databases with clean text + metadata
- AI Agent Research — Let agents read and understand any web page
- Content Monitoring — Track changes on competitor pages, news sites, blogs
- Data Enrichment — Add web page content to CRM records, lead lists
Integration with AI Agents (MCP)
This actor is optimized for AI agent discovery via MCP. Agents can find it by searching for:
`extract_content`, `convert_to_markdown`, `batch_urls`, `extract_metadata`
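An agent can parse the machine-readable `CAPABILITIES:` field shown at the top of this listing to decide whether the actor matches its needs. This parser is a sketch that assumes the field's format as it appears in this listing (a comma-separated list terminated by a period); it is not part of any MCP library.

```python
# Sketch: parse the CAPABILITIES field from an actor description and
# check membership. The format is taken from this listing; other actors
# may format their descriptions differently.
description = (
    "CAPABILITIES: extract_content, convert_to_markdown, batch_urls, "
    "extract_metadata. INPUT: URLs (single or array)."
)

def parse_capabilities(desc: str) -> set:
    """Return the capability names listed after 'CAPABILITIES:'."""
    head = desc.split("CAPABILITIES:", 1)[1].split(".", 1)[0]
    return {c.strip() for c in head.split(",")}

caps = parse_capabilities(description)
print("convert_to_markdown" in caps)
```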
FAQ
Q: Can it handle JavaScript-rendered pages (SPAs)?
A: Yes, it uses a headless browser for pages that require JavaScript rendering.
Q: What happens if a URL fails?
A: You get a result with the URL and error message. Other URLs continue processing. You're only charged for successful extractions.
Q: Does it respect robots.txt?
A: Yes, the actor respects robots.txt directives by default.
Q: How is this different from apify/website-content-crawler?
A: This actor is specifically optimized for AI agent consumption — clean JSON output, token estimates, MCP metadata, and batch processing with partial results.