Universal Markdown Scraper for LLMs

Pricing: $1.00 / 1,000 results

Developer: BotFlowTech (Maintained by Community)

Transform any webpage into clean, token-efficient Markdown optimized for ChatGPT, Claude, and other Large Language Models. This Actor automatically removes ads, navigation bars, footers, and other noise that wastes valuable API tokens.

Why Use This Actor?

Most web scrapers return messy JSON or raw HTML that's unsuitable for LLM context windows. This Actor solves that problem by:

  • Extracting only main content using Mozilla's Readability algorithm
  • Converting to clean Markdown format that LLMs process efficiently
  • Removing token-wasting elements like ads, sidebars, cookie banners, and navigation
  • Providing token estimates so you know exactly how much context you're using
  • Processing at scale with concurrent URL handling

Perfect for AI developers building RAG systems, chatbots, research assistants, and content analysis tools.

Use Cases

  • RAG (Retrieval Augmented Generation): Extract clean content for vector databases and knowledge bases
  • AI Research Assistants: Feed web articles directly into ChatGPT/Claude for analysis
  • Content Summarization: Get article text without noise for LLM-powered summarizers
  • Documentation Processing: Convert technical docs to Markdown for AI-powered Q&A systems
  • News Monitoring: Extract clean article content for sentiment analysis and topic modeling
  • Training Data Preparation: Collect high-quality text data for fine-tuning LLMs

Input

The Actor accepts the following input parameters:

  • startUrls (Array, required) - List of URLs to scrape. Format: [{ "url": "https://example.com" }]
  • maxConcurrency (Integer, optional) - Number of pages to process simultaneously (1-50, default: 10)
  • removeImages (Boolean, optional) - Strip all images to save tokens (default: false)
  • removeLinks (Boolean, optional) - Convert hyperlinks to plain text to save tokens (default: false)

Example Input

{ "startUrls": [ { "url": "https://apify.com/about" }, { "url": "https://openai.com/research" }, { "url": "https://www.anthropic.com/news" } ], "maxConcurrency": 5, "removeImages": false, "removeLinks": false }

Output

The Actor outputs clean Markdown with metadata for each URL processed. Results are stored in the default dataset.

Example Output

{ "url": "https://apify.com/about", "title": "About Apify - Web Scraping and Automation Platform", "markdown": "# About Apify\n\nApify is a cloud platform for web scraping...", "author": "Apify Team", "excerpt": "Learn about Apify's mission to make the web more accessible...", "contentLength": 4521, "markdownLength": 3842, "estimatedTokens": 960, "processedAt": "2025-12-06T06:44:22.195Z", "success": true, "error": null }

Output Fields

  • url - Original URL that was scraped
  • title - Extracted page title
  • markdown - Clean Markdown content ready for LLM input
  • author - Article author (if detected)
  • excerpt - Brief content summary
  • contentLength - Character count of extracted content
  • markdownLength - Character count of Markdown output
  • estimatedTokens - Approximate token count (1 token ≈ 4 characters)
  • processedAt - ISO timestamp of processing
  • success - Boolean indicating if extraction succeeded
  • error - Error message (if success is false)

Features

Intelligent Content Extraction

Uses Mozilla's Readability library to identify main article content while automatically removing:

  • Navigation menus and headers
  • Sidebars and advertisements
  • Footers and copyright notices
  • Cookie banners and popups
  • Social media widgets
  • Comment sections
  • Related article suggestions
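
For reference, here is a minimal sketch of how this kind of extraction can be wired together in Node.js. Readability is the library named above; the use of jsdom for parsing and turndown for the HTML-to-Markdown step is an assumption for illustration, not necessarily this Actor's exact implementation.

const { JSDOM } = require('jsdom');
const { Readability } = require('@mozilla/readability');
const TurndownService = require('turndown');

// Sketch only: extract the main article from raw HTML and convert it to Markdown.
function htmlToMarkdown(html, url) {
  const dom = new JSDOM(html, { url }); // pass the URL so relative links resolve
  const article = new Readability(dom.window.document).parse(); // null if no article is found
  const turndown = new TurndownService({ headingStyle: 'atx', codeBlockStyle: 'fenced' });
  // Fall back to the whole <body> when Readability cannot isolate an article.
  const sourceHtml = article ? article.content : dom.window.document.body.innerHTML;
  return {
    title: article ? article.title : dom.window.document.title,
    markdown: turndown.turndown(sourceHtml),
  };
}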

Token Optimization

  • Markdown format: More efficient than HTML for LLM processing
  • Optional image removal: Save tokens by excluding image references
  • Optional link removal: Convert links to plain text when URLs aren't needed
  • Token estimates: Know upfront how much of your context window you'll use
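
The token estimate in the output follows the rough 4-characters-per-token rule noted under Output Fields. As a sketch (the exact rounding the Actor applies is an assumption):

// Rough heuristic: ~4 characters per token. The rounding mode is an assumption, not the Actor's exact code.
const estimateTokens = (markdownLength) => Math.floor(markdownLength / 4);

estimateTokens(3842); // ≈ 960 tokens, matching the example output above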

Production-Ready

  • Error handling: Gracefully handles failed requests and parsing errors
  • Concurrent processing: Process multiple URLs simultaneously
  • Detailed logging: Track processing status in real-time
  • Fallback extraction: Uses body content if Readability fails
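
The batching behind maxConcurrency can be pictured roughly as follows. This is a simplified sketch with a hypothetical processUrl() helper that returns the output record shown earlier; the Actor's real scheduling may differ.

// Sketch: process URLs in batches of maxConcurrency, capturing failures per URL
// so one bad page does not abort the whole run.
async function processAll(urls, maxConcurrency, processUrl) {
  const results = [];
  for (let i = 0; i < urls.length; i += maxConcurrency) {
    const batch = urls.slice(i, i + maxConcurrency);
    const settled = await Promise.allSettled(batch.map((url) => processUrl(url)));
    settled.forEach((outcome, j) => {
      results.push(
        outcome.status === 'fulfilled'
          ? outcome.value
          : { url: batch[j], success: false, error: String(outcome.reason) }
      );
    });
  }
  return results;
}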

Cost Efficiency

Running this Actor is cost-effective for AI development:

  • Compute units: ~0.01 CU per page (approximately $0.003 USD)
  • Speed: Average 2-3 seconds per URL
  • Batch processing: Process 100 URLs for ~$0.30

Compare this to the cost of wasted API tokens from unprocessed HTML!

How to Use

Via Apify Console

  1. Open the Actor in Apify Console
  2. Click "Try for free"
  3. Enter your URLs in the startUrls field
  4. Configure optional parameters
  5. Click "Start" and wait for results
  6. Download Markdown from the dataset

Via API

const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the Actor run and wait for it to finish
const run = await client.actor('YOUR_USERNAME/universal-markdown-scraper-llm').call({
  startUrls: [{ url: 'https://example.com/article' }],
  removeImages: true,
});

// Fetch the results from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].markdown);

Integration with LangChain

from apify_client import ApifyClient
from langchain.docstore.document import Document
from langchain.document_loaders import ApifyDatasetLoader

client = ApifyClient('YOUR_API_TOKEN')

# Run the Actor and wait for it to finish
run = client.actor('YOUR_USERNAME/universal-markdown-scraper-llm').call(
    run_input={'startUrls': [{'url': 'https://example.com'}]}
)

# Load each dataset item into LangChain as a Document
loader = ApifyDatasetLoader(
    dataset_id=run['defaultDatasetId'],
    dataset_mapping_function=lambda item: Document(page_content=item['markdown'] or ''),
)

docs = loader.load()
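
The resulting docs list contains LangChain Document objects, ready to be split and embedded into a vector store for RAG.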

Limitations

  • JavaScript-heavy sites: Some dynamic content may not render. Consider using a browser-based scraper for SPAs.
  • Paywalled content: Cannot access content behind authentication walls
  • Rate limiting: Respect target websites' rate limits by lowering maxConcurrency
  • Token estimation: Approximate only; actual tokens vary by model tokenizer

Tips for Best Results

  • Start small: Test with 5-10 URLs before scaling up
  • Enable token optimization: Use removeImages and removeLinks for RAG systems that don't need them
  • Monitor output: Check the estimatedTokens field to stay within context limits
  • Handle errors: Always check the success field before using Markdown output
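
As a small illustration of that last tip, continuing the "Via API" example above (variable names assumed from that snippet):

// Keep only successful extractions before sending the Markdown to an LLM.
const { items } = await client.dataset(run.defaultDatasetId).listItems();

const usable = items.filter((item) => item.success);
items
  .filter((item) => !item.success)
  .forEach((item) => console.warn(`Skipping ${item.url}: ${item.error}`));

// Join the clean pages into one context block for the model.
const context = usable.map((item) => item.markdown).join('\n\n---\n\n');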

Support & Feedback

Need help or have suggestions?

  • Issues: Report bugs via the Issues tab
  • Questions: Contact us through Apify support
  • Feature requests: We're actively developing this Actor and welcome feedback

Version History

v1.0.0 (December 2025)

  • Initial release
  • Mozilla Readability integration
  • Token estimation
  • Configurable image and link removal
  • Batch processing support

Built with ❤️ for the AI development community. Save tokens, save money, build better AI applications.