Universal Markdown Scraper for LLMs

Pricing: $1.00 / 1,000 results

Developer: BotFlowTech (Maintained by Community)

Transform any webpage into clean, token-efficient Markdown optimized for ChatGPT, Claude, and other Large Language Models. This Actor automatically removes ads, navigation bars, footers, and other noise that wastes valuable API tokens.

Why Use This Actor?

Most web scrapers return messy JSON or raw HTML that's unsuitable for LLM context windows. This Actor solves that problem by:

  • Extracting only main content using Mozilla's Readability algorithm
  • Converting to clean Markdown format that LLMs process efficiently
  • Removing token-wasting elements like ads, sidebars, cookie banners, and navigation
  • Providing token estimates so you know exactly how much context you're using
  • Processing at scale with concurrent URL handling

Perfect for AI developers building RAG systems, chatbots, research assistants, and content analysis tools.

Use Cases

  • RAG (Retrieval Augmented Generation): Extract clean content for vector databases and knowledge bases
  • AI Research Assistants: Feed web articles directly into ChatGPT/Claude for analysis
  • Content Summarization: Get article text without noise for LLM-powered summarizers
  • Documentation Processing: Convert technical docs to Markdown for AI-powered Q&A systems
  • News Monitoring: Extract clean article content for sentiment analysis and topic modeling
  • Training Data Preparation: Collect high-quality text data for fine-tuning LLMs

Input

The Actor accepts the following input parameters:

  • startUrls (Array, required) - List of URLs to scrape. Format: [{ "url": "https://example.com" }]
  • maxConcurrency (Integer, optional) - Number of pages to process simultaneously (1-50, default: 10)
  • removeImages (Boolean, optional) - Strip all images to save tokens (default: false)
  • removeLinks (Boolean, optional) - Convert hyperlinks to plain text to save tokens (default: false)

Example Input

{ "startUrls": [ { "url": "https://apify.com/about" }, { "url": "https://openai.com/research" }, { "url": "https://www.anthropic.com/news" } ], "maxConcurrency": 5, "removeImages": false, "removeLinks": false }

Output

The Actor outputs clean Markdown with metadata for each URL processed. Results are stored in the default dataset.

Example Output

{ "url": "https://apify.com/about", "title": "About Apify - Web Scraping and Automation Platform", "markdown": "# About Apify\n\nApify is a cloud platform for web scraping...", "author": "Apify Team", "excerpt": "Learn about Apify's mission to make the web more accessible...", "contentLength": 4521, "markdownLength": 3842, "estimatedTokens": 960, "processedAt": "2025-12-06T06:44:22.195Z", "success": true, "error": null }

Output Fields

  • url - Original URL that was scraped
  • title - Extracted page title
  • markdown - Clean Markdown content ready for LLM input
  • author - Article author (if detected)
  • excerpt - Brief content summary
  • contentLength - Character count of extracted content
  • markdownLength - Character count of Markdown output
  • estimatedTokens - Approximate token count (1 token ≈ 4 characters)
  • processedAt - ISO timestamp of processing
  • success - Boolean indicating if extraction succeeded
  • error - Error message (if success is false)

Features

Intelligent Content Extraction

Uses Mozilla's Readability library to identify main article content while automatically removing:

  • Navigation menus and headers
  • Sidebars and advertisements
  • Footers and copyright notices
  • Cookie banners and popups
  • Social media widgets
  • Comment sections
  • Related article suggestions
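
For reference, here is a minimal sketch of how this kind of extraction can be wired together in Node.js. Readability is the library named above; the use of jsdom for parsing and turndown for the HTML-to-Markdown step is an assumption for illustration, not necessarily this Actor's exact implementation.

const { JSDOM } = require('jsdom');
const { Readability } = require('@mozilla/readability');
const TurndownService = require('turndown');

// Sketch only: extract the main article from raw HTML and convert it to Markdown.
function htmlToMarkdown(html, url) {
  const dom = new JSDOM(html, { url }); // pass the URL so relative links resolve
  const article = new Readability(dom.window.document).parse(); // null if no article is found
  const turndown = new TurndownService({ headingStyle: 'atx', codeBlockStyle: 'fenced' });
  // Fall back to the whole <body> when Readability cannot isolate an article.
  const sourceHtml = article ? article.content : dom.window.document.body.innerHTML;
  return {
    title: article ? article.title : dom.window.document.title,
    markdown: turndown.turndown(sourceHtml),
  };
}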

Token Optimization

  • Markdown format: More efficient than HTML for LLM processing
  • Optional image removal: Save tokens by excluding image references
  • Optional link removal: Convert links to plain text when URLs aren't needed
  • Token estimates: Know upfront how much of your context window you'll use
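
The token estimate in the output follows the rough 4-characters-per-token rule noted under Output Fields. As a sketch (the exact rounding the Actor applies is an assumption):

// Rough heuristic: ~4 characters per token. The rounding mode is an assumption, not the Actor's exact code.
const estimateTokens = (markdownLength) => Math.floor(markdownLength / 4);

estimateTokens(3842); // ≈ 960 tokens, matching the example output above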

Production-Ready

  • Error handling: Gracefully handles failed requests and parsing errors
  • Concurrent processing: Process multiple URLs simultaneously
  • Detailed logging: Track processing status in real-time
  • Fallback extraction: Uses body content if Readability fails
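
The batching behind maxConcurrency can be pictured roughly as follows. This is a simplified sketch with a hypothetical processUrl() helper that returns the output record shown earlier; the Actor's real scheduling may differ.

// Sketch: process URLs in batches of maxConcurrency, capturing failures per URL
// so one bad page does not abort the whole run.
async function processAll(urls, maxConcurrency, processUrl) {
  const results = [];
  for (let i = 0; i < urls.length; i += maxConcurrency) {
    const batch = urls.slice(i, i + maxConcurrency);
    const settled = await Promise.allSettled(batch.map((url) => processUrl(url)));
    settled.forEach((outcome, j) => {
      results.push(
        outcome.status === 'fulfilled'
          ? outcome.value
          : { url: batch[j], success: false, error: String(outcome.reason) }
      );
    });
  }
  return results;
}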

Cost Efficiency

Running this Actor is cost-effective for AI development:

  • Compute units: ~0.01 CU per page (approximately $0.003 USD)
  • Speed: Average 2-3 seconds per URL
  • Batch processing: Process 100 URLs for ~$0.30

Compare this to the cost of wasted API tokens from unprocessed HTML!

How to Use

Via Apify Console

  1. Open the Actor in Apify Console
  2. Click "Try for free"
  3. Enter your URLs in the startUrls field
  4. Configure optional parameters
  5. Click "Start" and wait for results
  6. Download Markdown from the dataset

Via API

const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the Actor run and wait for it to finish
const run = await client.actor('YOUR_USERNAME/universal-markdown-scraper-llm').call({
  startUrls: [{ url: 'https://example.com/article' }],
  removeImages: true,
});

// Fetch the results from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].markdown);

Integration with LangChain

from apify_client import ApifyClient
from langchain.docstore.document import Document
from langchain.document_loaders import ApifyDatasetLoader

client = ApifyClient('YOUR_API_TOKEN')

# Run the Actor and wait for it to finish
run = client.actor('YOUR_USERNAME/universal-markdown-scraper-llm').call(
    run_input={'startUrls': [{'url': 'https://example.com'}]}
)

# Load each dataset item into LangChain as a Document
loader = ApifyDatasetLoader(
    dataset_id=run['defaultDatasetId'],
    dataset_mapping_function=lambda item: Document(page_content=item['markdown'] or ''),
)

docs = loader.load()
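
The resulting docs list contains LangChain Document objects, ready to be split and embedded into a vector store for RAG.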

Limitations

  • JavaScript-heavy sites: Some dynamic content may not render. Consider using a browser-based scraper for SPAs.
  • Paywalled content: Cannot access content behind authentication walls
  • Rate limiting: Respect target websites' rate limits by lowering maxConcurrency
  • Token estimation: Approximate only; actual tokens vary by model tokenizer

Tips for Best Results

  • Start small: Test with 5-10 URLs before scaling up
  • Enable token optimization: Use removeImages and removeLinks for RAG systems that don't need them
  • Monitor output: Check the estimatedTokens field to stay within context limits
  • Handle errors: Always check the success field before using Markdown output
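
As a small illustration of that last tip, continuing the "Via API" example above (variable names assumed from that snippet):

// Keep only successful extractions before sending the Markdown to an LLM.
const { items } = await client.dataset(run.defaultDatasetId).listItems();

const usable = items.filter((item) => item.success);
items
  .filter((item) => !item.success)
  .forEach((item) => console.warn(`Skipping ${item.url}: ${item.error}`));

// Join the clean pages into one context block for the model.
const context = usable.map((item) => item.markdown).join('\n\n---\n\n');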

Support & Feedback

Need help or have suggestions?

  • Issues: Report bugs via the Issues tab
  • Questions: Contact us through Apify support
  • Feature requests: We're actively developing this Actor and welcome feedback

Version History

v1.0.0 (December 2025)

  • Initial release
  • Mozilla Readability integration
  • Token estimation
  • Configurable image and link removal
  • Batch processing support

Built with ❤️ for the AI development community. Save tokens, save money, build better AI applications.