Universal Markdown Scraper for LLMs
Pricing
$1.00 / 1,000 results
Developer
BotFlowTech
Transform any webpage into clean, token-efficient Markdown optimized for ChatGPT, Claude, and other Large Language Models. This Actor automatically removes ads, navigation bars, footers, and other noise that wastes valuable API tokens.
Why Use This Actor?
Most web scrapers return messy JSON or raw HTML that's unsuitable for LLM context windows. This Actor solves that problem by:
- Extracting only main content using Mozilla's Readability algorithm
- Converting to clean Markdown format that LLMs process efficiently
- Removing token-wasting elements like ads, sidebars, cookie banners, and navigation
- Providing token estimates so you know exactly how much context you're using
- Processing at scale with concurrent URL handling
Perfect for AI developers building RAG systems, chatbots, research assistants, and content analysis tools.
Use Cases
- RAG (Retrieval-Augmented Generation): Extract clean content for vector databases and knowledge bases
- AI Research Assistants: Feed web articles directly into ChatGPT/Claude for analysis
- Content Summarization: Get article text without noise for LLM-powered summarizers
- Documentation Processing: Convert technical docs to Markdown for AI-powered Q&A systems
- News Monitoring: Extract clean article content for sentiment analysis and topic modeling
- Training Data Preparation: Collect high-quality text data for fine-tuning LLMs
Input
The Actor accepts the following input parameters:
| Field | Type | Required | Description |
|---|---|---|---|
| startUrls | Array | Yes | List of URLs to scrape. Format: [{ "url": "https://example.com" }] |
| maxConcurrency | Integer | No | Number of pages to process simultaneously (1-50, default: 10) |
| removeImages | Boolean | No | Strip all images to save tokens (default: false) |
| removeLinks | Boolean | No | Convert hyperlinks to plain text to save tokens (default: false) |
Example Input
    {
      "startUrls": [
        { "url": "https://apify.com/about" },
        { "url": "https://openai.com/research" },
        { "url": "https://www.anthropic.com/news" }
      ],
      "maxConcurrency": 5,
      "removeImages": false,
      "removeLinks": false
    }
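The constraints in the input table can be checked locally before a run is started. The following is a minimal Python sketch, assuming a hypothetical helper named `build_input` that is not part of the Actor or the Apify client. It only mirrors the documented fields, defaults, and the 1-50 range; rejecting an empty URL list is an assumption based on the field being required.

```python
def build_input(urls, max_concurrency=10, remove_images=False, remove_links=False):
    """Build an input dict for the Actor from plain URL strings.

    Hypothetical helper for illustration only: it mirrors the documented
    fields and defaults and does not call the Actor.
    """
    if not urls:
        # Assumption: a required field is treated as "must not be empty"
        raise ValueError("startUrls is required and must not be empty")
    if not 1 <= max_concurrency <= 50:
        raise ValueError("maxConcurrency must be between 1 and 50")
    return {
        # The documented format is a list of {"url": ...} objects, not bare strings
        "startUrls": [{"url": u} for u in urls],
        "maxConcurrency": max_concurrency,
        "removeImages": remove_images,
        "removeLinks": remove_links,
    }


actor_input = build_input(["https://apify.com/about"], max_concurrency=5)
```

The resulting dictionary can be passed as the run input when calling the Actor through the API.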
Output
The Actor outputs clean Markdown with metadata for each URL processed. Results are stored in the default dataset.
Example Output
    {
      "url": "https://apify.com/about",
      "title": "About Apify - Web Scraping and Automation Platform",
      "markdown": "# About Apify\n\nApify is a cloud platform for web scraping...",
      "author": "Apify Team",
      "excerpt": "Learn about Apify's mission to make the web more accessible...",
      "contentLength": 4521,
      "markdownLength": 3842,
      "estimatedTokens": 960,
      "processedAt": "2025-12-06T06:44:22.195Z",
      "success": true,
      "error": null
    }
Output Fields
- url: Original URL that was scraped
- title: Extracted page title
- markdown: Clean Markdown content ready for LLM input
- author: Article author (if detected)
- excerpt: Brief content summary
- contentLength: Character count of extracted content
- markdownLength: Character count of Markdown output
- estimatedTokens: Approximate token count (1 token ≈ 4 characters)
- processedAt: ISO timestamp of processing
- success: Boolean indicating if extraction succeeded
- error: Error message (if success is false)
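The 4-characters-per-token rule of thumb behind estimatedTokens can be reproduced in a few lines. This is a sketch, not the Actor's code: the function name `estimate_tokens` is hypothetical, and rounding down is an assumption that happens to agree with the example output (markdownLength 3842, estimatedTokens 960).

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Approximate a token count with the 1 token ≈ 4 characters heuristic."""
    # Floor division is an assumption; the Actor's exact rounding is not documented.
    return len(text) // chars_per_token


# A stand-in string with the same length as markdownLength in the example output
sample = "x" * 3842
print(estimate_tokens(sample))  # prints 960
```

Real tokenizers split text differently per model, so this figure is best treated as a rough budget rather than an exact count.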
Features
Intelligent Content Extraction
Uses Mozilla's Readability library to identify main article content while automatically removing:
- Navigation menus and headers
- Sidebars and advertisements
- Footers and copyright notices
- Cookie banners and popups
- Social media widgets
- Comment sections
- Related article suggestions
Token Optimization
- Markdown format: More efficient than HTML for LLM processing
- Optional image removal: Save tokens by excluding image references
- Optional link removal: Convert links to plain text when URLs aren't needed
- Token estimates: Know upfront how much of your context window you'll use
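To illustrate what image and link removal does to Markdown, here is a rough Python sketch using regular expressions. It is not the Actor's implementation; the function names are hypothetical, and the patterns only cover simple inline syntax (no reference-style links, nested brackets, or parentheses inside URLs).

```python
import re

# Inline Markdown image: ![alt](target)
IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# Inline Markdown link that is not an image: [text](target)
LINK_RE = re.compile(r"(?<!!)\[([^\]]*)\]\([^)]*\)")


def strip_images(markdown: str) -> str:
    """Drop inline image references entirely."""
    return IMAGE_RE.sub("", markdown)


def strip_links(markdown: str) -> str:
    """Replace inline links with their visible text, dropping the URL."""
    return LINK_RE.sub(r"\1", markdown)


text = "Read the [docs](https://example.com/docs) ![logo](https://example.com/logo.png)"
cleaned = strip_links(strip_images(text))
print(len(text), "->", len(cleaned), "characters")
```

Comparing the character counts before and after gives a quick sense of how many tokens URLs and image references cost on a given page.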
Production-Ready
- Error handling: Gracefully handles failed requests and parsing errors
- Concurrent processing: Process multiple URLs simultaneously
- Detailed logging: Track processing status in real-time
- Fallback extraction: Uses body content if Readability fails
Cost Efficiency
Running this Actor is cost-effective for AI development:
- Compute units: ~0.01 CU per page (approximately $0.003 USD)
- Speed: Average 2-3 seconds per URL
- Batch processing: Process 100 URLs for ~$0.30
Compare this to the cost of wasted API tokens from unprocessed HTML!
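The batch figure above follows directly from the per-page estimate. A small Python sketch of the arithmetic, assuming the approximate $0.003 per page quoted in this section (the helper name is hypothetical, and the result covers compute cost only):

```python
from decimal import Decimal

# Approximate compute cost per page quoted in this section (an estimate, not a guarantee)
USD_PER_PAGE = Decimal("0.003")


def estimate_compute_cost(pages: int, usd_per_page: Decimal = USD_PER_PAGE) -> Decimal:
    """Multiply the page count by the approximate per-page compute cost."""
    return usd_per_page * pages


batch_cost = estimate_compute_cost(100)  # 100 pages at about $0.003 each
```

Decimal is used instead of float so that small currency amounts are not distorted by binary rounding.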
How to Use
Via Apify Console
- Open the Actor in Apify Console
- Click "Try for free"
- Enter your URLs in the startUrls field
- Configure optional parameters
- Click "Start" and wait for results
- Download Markdown from the dataset
Via API
    const { ApifyClient } = require('apify-client');

    const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

    (async () => {
        // Start the Actor and wait for the run to finish
        const run = await client.actor('YOUR_USERNAME/universal-markdown-scraper-llm').call({
            startUrls: [{ url: 'https://example.com/article' }],
            removeImages: true,
        });

        // items is an array with one result per URL
        const { items } = await client.dataset(run.defaultDatasetId).listItems();
        for (const item of items) {
            if (item.success) console.log(item.markdown);
        }
    })();
Integration with LangChain
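    from apify_client import ApifyClient
    # Older LangChain releases expose this loader as langchain.document_loaders
    from langchain_community.document_loaders import ApifyDatasetLoader
    from langchain_core.documents import Document

    client = ApifyClient('YOUR_API_TOKEN')

    # Run the Actor and wait for it to finish
    run = client.actor('YOUR_USERNAME/universal-markdown-scraper-llm').call(
        run_input={'startUrls': [{'url': 'https://example.com'}]}
    )

    # Load into LangChain; the mapping function must return a Document
    loader = ApifyDatasetLoader(
        dataset_id=run['defaultDatasetId'],
        dataset_mapping_function=lambda item: Document(
            page_content=item['markdown'] or '',
            metadata={'source': item['url']},
        ),
    )

    docs = loader.load()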
Limitations
- JavaScript-heavy sites: Some dynamic content may not render. Consider using a browser-based scraper for SPAs.
- Paywalled content: Cannot access content behind authentication walls
- Rate limiting: Respect target website rate limits using maxConcurrency
- Token estimation: Approximate only; actual tokens vary by model tokenizer
Tips for Best Results
- Start small: Test with 5-10 URLs before scaling up
- Enable token optimization: Use removeImages and removeLinks for RAG systems that don't need them
- Monitor output: Check the estimatedTokens field to stay within context limits
- Handle errors: Always check the success field before using Markdown output
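The last two tips can be combined when assembling LLM context from dataset items. The sketch below is one possible approach in Python, not something the Actor provides: `select_within_budget` is a hypothetical helper, and the shape of the failed item (null markdown, zero tokens) is an assumption made for the sample data.

```python
def select_within_budget(items, max_tokens):
    """Collect Markdown from successful items until the token budget is used up."""
    selected, used = [], 0
    for item in items:
        if not item.get("success"):
            continue  # skip failed extractions; item["error"] holds the reason
        tokens = item.get("estimatedTokens", 0)
        if used + tokens > max_tokens:
            break  # the next document would exceed the budget
        selected.append(item["markdown"])
        used += tokens
    return selected, used


# Sample items shaped like the Actor's output fields (values are made up)
items = [
    {"success": True, "markdown": "# A", "estimatedTokens": 960, "error": None},
    {"success": False, "markdown": None, "estimatedTokens": 0, "error": "timeout"},
    {"success": True, "markdown": "# B", "estimatedTokens": 500, "error": None},
    {"success": True, "markdown": "# C", "estimatedTokens": 900, "error": None},
]
docs, used = select_within_budget(items, max_tokens=2000)
```

Because estimatedTokens is approximate, leaving some headroom below the model's real context limit is a sensible precaution.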
Support & Feedback
Need help or have suggestions?
- Issues: Report bugs via the Issues tab
- Questions: Contact us through Apify support
- Feature requests: We're actively developing this Actor and welcome feedback
Version History
v1.0.0 (December 2025)
- Initial release
- Mozilla Readability integration
- Token estimation
- Configurable image and link removal
- Batch processing support
Built with ❤️ for the AI development community. Save tokens, save money, build better AI applications.