
Website to Markdown - LLM-Ready Content Extractor

Transform any webpage into clean, structured Markdown perfect for AI/LLM applications, content archiving, and documentation.


⚑ Quick Start (30 seconds)

  1. Open the Actor in the Apify Console
  2. Paste your URLs
  3. Click Start

That's it! No configuration needed.

```json
{
  "urls": ["https://example.com/article"]
}
```
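
Prefer to start runs from code? The sketch below uses the official apify-client Python package; the API token and Actor ID are placeholders you would replace with your own values from the Apify Console.

```python
# pip install apify-client
from apify_client import ApifyClient

# Placeholders: use your own Apify API token and this Actor's ID from the Console.
client = ApifyClient("<YOUR_APIFY_TOKEN>")

# call() starts the Actor and waits for the run to finish.
run = client.actor("<ACTOR_ID>").call(run_input={
    "urls": ["https://example.com/article"],
})

# Each dataset item is one output record (recordType "success" or "error").
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["recordType"], item["url"], item.get("markdownLength"))
```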

💰 Pricing

Pay-Per-Result - You only pay for what you use.

| Usage | Estimated Cost |
| --- | --- |
| 100 pages | $1.19 |
| 1,000 pages | $11.99 |
| 10,000 pages | $119.90 |

✨ What This Actor Does

  • ✅ Extract main content from any webpage (articles, blogs, documentation)
  • ✅ Convert HTML to clean, well-formatted Markdown
  • ✅ Preserve headings, links, images, lists, and tables
  • ✅ Remove ads, navigation, footers, and clutter automatically
  • ✅ Process multiple URLs concurrently (up to 100 at once)
  • ✅ Handle JavaScript-rendered pages with Playwright
  • ✅ Bypass bot detection with stealth mode
  • ✅ Retry failed requests automatically with exponential backoff

📊 Sample Output

This is exactly what you'll get:

```json
{
  "recordType": "success",
  "url": "https://example.com/article",
  "finalUrl": "https://example.com/article",
  "title": "Scientists Print Working Electrodes Directly on Skin With Light",
  "markdown": "# Scientists Print Working Electrodes Directly on Skin With Light\n\n...",
  "markdownLength": 4521,
  "processingTimeMs": 2340,
  "timestamp": "2025-01-15T10:30:00.000Z",
  "metadata": {
    "description": "A new study shows visible light can form electrodes from conductive plastics",
    "lang": "en",
    "wordCount": 892,
    "imageCount": 4,
    "linkCount": 12,
    "headingCount": 5
  }
}
```

The markdown field contains clean, structured content ready for LLM processing, RAG pipelines, or documentation.
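
As a small illustration, the sketch below (assuming the dataset was exported to a file named dataset.json from the Storage tab) writes each successful record's markdown field to its own .md file:

```python
import json
import re
from pathlib import Path

# "dataset.json" is an assumed name for a dataset exported from the Storage tab.
items = json.loads(Path("dataset.json").read_text(encoding="utf-8"))

out_dir = Path("markdown_output")
out_dir.mkdir(exist_ok=True)

for item in items:
    if item.get("recordType") != "success":
        continue  # skip error records
    # Derive a file name from the title, falling back to the URL.
    name = item.get("title") or item["url"]
    slug = re.sub(r"[^A-Za-z0-9]+", "-", name).strip("-").lower()[:80] or "page"
    (out_dir / f"{slug}.md").write_text(item["markdown"], encoding="utf-8")
```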


🎯 Use Cases

🤖 AI & LLM Training Data

Collect clean text from web sources for fine-tuning models, building knowledge bases, or RAG applications.

📚 Documentation Archiving

Convert online documentation, wikis, and help centers into portable Markdown files.

📰 Content Aggregation

Gather articles from multiple news sources for analysis, summarization, or newsletters.

🔬 Research & Analysis

Extract research papers, blog posts, and technical content for systematic analysis.

📝 Blog Migration

Move content from one platform to another by extracting articles as Markdown.

🗂️ Knowledge Management

Build searchable knowledge bases from scattered web resources.


📝 How to Use

Step 1: Prepare Your URLs

Gather the URLs you want to convert. Each URL should:

  • Start with http:// or https://
  • Point to a page with readable content (articles, blogs, docs)

Examples:

https://docs.python.org/3/tutorial/index.html
https://blog.example.com/my-article
https://news.ycombinator.com/item?id=12345

Step 2: Configure the Actor

In the Apify Console, enter your URLs in the URLs field. You can:

  • Paste URLs one per line
  • Use the JSON editor for bulk input

Minimal configuration:

```json
{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2"
  ]
}
```

Step 3: Adjust Settings (Optional)

For most websites, default settings work great. Adjust only if needed:

  • Max Concurrency: Increase for faster processing (default: 5)
  • Stealth Mode: Keep ON for protected sites (default: ON)
  • Proxy: Enable for rate-limited or geo-restricted sites

Step 4: Run the Actor

  1. Click the Start button
  2. Watch the Log tab for progress
  3. Wait for completion (usually seconds per page)
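
If you start runs through the API instead of the Console, the same progress can be followed by polling the run's status. A hedged sketch with apify-client (placeholder token and Actor ID):

```python
import time
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# start() returns immediately; call() would block until the run finishes.
run = client.actor("<ACTOR_ID>").start(run_input={"urls": ["https://example.com"]})

# Poll the run until it reaches a terminal state.
while True:
    info = client.run(run["id"]).get()
    print("Run status:", info["status"])
    if info["status"] not in ("READY", "RUNNING"):
        break
    time.sleep(5)
```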

Step 5: Download Results

When finished, go to the Storage tab:

  • Dataset: Click to view all results
  • Export: Download as JSON, CSV, or Excel
  • API: Use the dataset ID in your applications
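
For the API route, results can be fetched straight from the dataset items endpoint. The sketch below assumes a dataset ID copied from the Storage tab and a valid API token:

```python
import requests

DATASET_ID = "<YOUR_DATASET_ID>"     # copy from the Storage tab
APIFY_TOKEN = "<YOUR_APIFY_TOKEN>"

# The same endpoint also supports format=csv or format=xlsx for exports.
resp = requests.get(
    f"https://api.apify.com/v2/datasets/{DATASET_ID}/items",
    params={"format": "json", "clean": "true", "token": APIFY_TOKEN},
    timeout=60,
)
resp.raise_for_status()
items = resp.json()
print(f"Fetched {len(items)} records")
```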

🔧 Input Parameters (Full Reference)

| Parameter | Required | Type | Default | What It Does |
| --- | --- | --- | --- | --- |
| urls | YES | Array | - | List of webpage URLs to convert to Markdown |
| maxConcurrency | No | Integer | 5 | How many pages to process simultaneously (1-100) |
| pageLoadTimeout | No | Integer | 30000 | Max time to wait for page load in milliseconds (5000-120000) |
| maxRetries | No | Integer | 3 | Retry attempts for failed pages (0-10) |
| stealthMode | No | Boolean | true | Enable anti-bot-detection techniques |
| proxyConfiguration | No | Object | - | Proxy settings for accessing restricted content |
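
Putting the reference together, a complete input with every parameter set explicitly could look like the sketch below. It is written as a Python run_input dict for use with apify-client; in the Console's JSON editor, use the JSON equivalent (lowercase true). The values are illustrative, not recommendations:

```python
# Illustrative run_input covering every parameter in the table above.
run_input = {
    "urls": [
        "https://example.com/page1",
        "https://example.com/page2",
    ],
    "maxConcurrency": 10,        # 1-100 pages processed in parallel
    "pageLoadTimeout": 60000,    # milliseconds, 5000-120000
    "maxRetries": 3,             # 0-10 retry attempts per failed page
    "stealthMode": True,         # keep anti-bot-detection enabled
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
    },
}
```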

About Stealth Mode

Stealth mode helps access websites that block automated requests:

  • Randomizes request timing
  • Uses browser fingerprint rotation
  • Mimics human browsing patterns

Keep it ON for most websites. Only disable for trusted internal sites where speed is critical.

About Proxy Configuration

Use proxies when:

  • Websites rate-limit your requests
  • Content is geo-restricted
  • You need to distribute requests across IPs

Configure in the Apify Console using the built-in proxy selector.


📤 Output Data (What You Get)

Every result includes these fields:

✅ Success Records

| Field | Type | What It Contains |
| --- | --- | --- |
| recordType | Text | Always "success" for successful extractions |
| url | Text | The original URL you provided |
| finalUrl | Text | The actual URL after any redirects |
| title | Text | Page title from the <title> tag |
| markdown | Text | The extracted content as Markdown |
| markdownLength | Number | Character count of the Markdown content |
| processingTimeMs | Number | How long extraction took in milliseconds |
| timestamp | Text | When the extraction happened (ISO 8601) |
| metadata | Object | Additional page info (description, language, word count, etc.) |

❌ Error Records

| Field | Type | What It Contains |
| --- | --- | --- |
| recordType | Text | Always "error" for failed extractions |
| url | Text | The URL that failed |
| errorType | Text | Category: http_error, timeout, bot_detection, etc. |
| severity | Text | "warning" (retried) or "error" (final failure) |
| message | Text | Human-readable explanation of what went wrong |
| httpStatus | Number | HTTP status code if applicable (404, 500, etc.) |
| retryCount | Number | How many retry attempts were made |
| processingTimeMs | Number | Time spent before failure |
| timestamp | Text | When the error occurred |
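
Downstream code usually splits the two record types before doing anything else. A minimal sketch, assuming the dataset has already been fetched as a list of dicts (for example with the snippets earlier in this README):

```python
from collections import Counter

def split_records(items):
    """Separate success and error records and summarize error types."""
    successes = [it for it in items if it.get("recordType") == "success"]
    errors = [it for it in items if it.get("recordType") == "error"]
    error_types = Counter(it.get("errorType", "unknown") for it in errors)
    return successes, errors, error_types

# Example usage with an already-fetched `items` list:
# successes, errors, error_types = split_records(items)
# print(len(successes), "succeeded,", len(errors), "failed:", dict(error_types))
```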

💡 Common Questions

❓ How long does it take to process pages?

Most pages complete in 2-5 seconds. Complex JavaScript-heavy pages may take up to 30 seconds. Processing 100 pages with default concurrency (5) typically takes 1-2 minutes.

❓ How much does it cost to run?

Pricing is pay-per-result, so cost scales with the number of pages processed; typical cost is around $0.01 per page (see the pricing table above). Use Apify's free tier or set spending limits in your account settings to keep costs under control.

❓ Do I need any API keys or logins?

No! This actor works out of the box. No external API keys, no website logins, no configuration files needed.

❓ What if a page fails to load?

The actor automatically retries failed pages up to 3 times (configurable) with exponential backoff. If all retries fail, you'll get an error record explaining what went wrong.
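
For reference, exponential backoff simply doubles the wait between attempts. The generic pattern looks like the sketch below; the Actor's exact internal delays and jitter are not documented here, so the numbers are illustrative only:

```python
import random

def backoff_delays(max_retries=3, base_seconds=1.0):
    """Yield exponentially growing delays with a little random jitter."""
    for attempt in range(1, max_retries + 1):
        yield attempt, base_seconds * (2 ** (attempt - 1)) + random.uniform(0, 0.5)

for attempt, delay in backoff_delays():
    # In real retry logic you would sleep for `delay` seconds before retrying.
    print(f"retry {attempt}: wait about {delay:.1f}s")
```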

❓ Does this work with JavaScript-heavy websites?

Yes! The actor uses Playwright to fully render JavaScript before extraction. Single-page apps (SPAs), React sites, and dynamic content are all supported.

❓ Can I use this commercially?

Yes! The extracted content is yours to use. However, always respect the source website's terms of service and copyright.

❓ What happens with paywalled content?

The actor extracts only publicly visible content. It cannot bypass paywalls, login walls, or access restricted content without proper authentication.

❓ Can I process thousands of URLs at once?

Yes! Increase maxConcurrency (up to 100) for faster processing. For very large jobs, consider using Apify's scheduling and webhook features.
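
Since a single run is limited to 10,000 URLs (see Limitations below), very large jobs are typically split into batches. A hedged sketch with apify-client, using a placeholder token and Actor ID and an illustrative URL list:

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Illustrative URL list; in practice load your own URLs from a file or database.
all_urls = [f"https://example.com/page/{i}" for i in range(25_000)]

BATCH_SIZE = 10_000  # per-run limit noted in the Limitations section

dataset_ids = []
for start in range(0, len(all_urls), BATCH_SIZE):
    batch = all_urls[start:start + BATCH_SIZE]
    # call() runs the batches one after another; use start() to launch them in parallel.
    run = client.actor("<ACTOR_ID>").call(run_input={
        "urls": batch,
        "maxConcurrency": 50,
    })
    dataset_ids.append(run["defaultDatasetId"])

print("Datasets to merge:", dataset_ids)
```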


⚠️ Troubleshooting

Problem: "Bot detection" error

What it means: The website identified the request as automated and blocked it.

Solution:

  1. Make sure Stealth Mode is enabled (it's ON by default)
  2. Enable Proxy Configuration using Apify's residential proxies
  3. Reduce Max Concurrency to 2-3 to appear more human-like
  4. Add delays between requests by lowering concurrency
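
Taken together, those adjustments correspond to an input like the following, written as a Python run_input dict (the JSON equivalent uses lowercase true); the URL and values are illustrative:

```python
# Input tuned for sites with aggressive bot protection (illustrative values).
run_input = {
    "urls": ["https://protected-site.example.com/article"],
    "stealthMode": True,      # keep anti-detection techniques enabled
    "maxConcurrency": 2,      # slower, more human-like request rate
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],  # residential IPs for protected sites
    },
}
```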

Problem: "Timeout" error

What it means: The page took too long to load or render.

Solution:

  1. Increase Page Load Timeout to 60000ms (60 seconds)
  2. Check if the URL is valid and the site is online
  3. Some pages may require proxy to avoid rate limiting

Problem: "Extraction failed" error

What it means: The page loaded but the content extractor couldn't find readable content.

Solution:

  1. Verify the page has actual text content (not just images/video)
  2. Some pages (login walls, empty pages) genuinely have no extractable content
  3. Check if the URL is correct and the page exists

Problem: Empty or low-quality Markdown output

What it means: Content was extracted but may not be complete or well-structured.

Solution:

  1. Check the markdownLength and metadata.wordCount fields to assess content size
  2. Pages with unusual HTML structure may need manual review
  3. The content extractor works best with article-style pages
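
One quick way to flag suspiciously thin results in bulk is to filter on those two fields. The thresholds in the sketch below are arbitrary examples, not recommendations:

```python
# Flag success records whose extracted content looks suspiciously thin.
MIN_MARKDOWN_CHARS = 500   # arbitrary example threshold
MIN_WORD_COUNT = 100       # arbitrary example threshold

def needs_review(item):
    """Return True for success records that are probably incomplete."""
    if item.get("recordType") != "success":
        return False
    words = item.get("metadata", {}).get("wordCount", 0)
    return item.get("markdownLength", 0) < MIN_MARKDOWN_CHARS or words < MIN_WORD_COUNT

# Example usage with an already-fetched `items` list:
# for item in filter(needs_review, items):
#     print("Review:", item["url"], item.get("markdownLength"))
```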

Problem: "HTTP 403/404/500" errors

What it means:

  • 403: Access forbidden (blocked or requires authentication)
  • 404: Page not found (URL is wrong or page deleted)
  • 500: Server error (website is having problems)

Solution:

  1. For 403: Try enabling proxy; the site may be blocking your region
  2. For 404: Verify the URL is correct
  3. For 500: Wait and retry; the website may be temporarily down

Still stuck? Check the error message details in your results. Most issues can be resolved by enabling proxies or adjusting timeout settings.


🌍 Proxy Configuration

For most users: Default settings work fine.

If you need proxies:

  1. Go to the Input tab
  2. Scroll to Proxy Configuration
  3. Select from Apify's proxy options:
    • Datacenter proxies: Fast and cheap, good for most sites
    • Residential proxies: Better success rate for protected sites

Example configuration:

```json
{
  "urls": ["https://example.com"],
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```

Cost implications: Residential proxies cost more but have higher success rates. Start with datacenter proxies and upgrade only if you see bot detection errors.


❌ Limitations

  • 🚫 Cannot bypass paywalls or login-protected content
  • 🚫 Cannot extract content from PDFs or non-HTML documents
  • 🚫 Cannot process more than 10,000 URLs per run (split into multiple runs)
  • 🚫 Cannot guarantee extraction quality for pages with unusual HTML structure
  • 🚫 Cannot access content blocked by geographic restrictions without appropriate proxies

🚀 Getting Started Now

  1. Copy this example:

```json
{
  "urls": [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://docs.apify.com/academy/web-scraping-for-beginners"
  ]
}
```

  2. Paste it into the Input JSON editor

  3. Click Start

  4. Wait ~10 seconds

  5. Download results from the Storage tab as JSON or CSV

That's it! You're done. 🎉


📞 Support & Maintenance

Response time: We respond within 48 hours

Contact: Open an issue on the Actor's page in Apify Console

How we help:

  • Debug failed extractions
  • Optimize settings for your use case
  • Answer questions about output format

Maintenance:

  • We actively monitor for website changes affecting extraction
  • Updates are rolled out automatically
  • You'll see release notes for major changes

🔄 Version History

v0.0 - Initial Release

  • Core content extraction with Mozilla Readability
  • HTML-to-Markdown conversion with Turndown
  • Concurrent URL processing with Crawlee
  • Stealth mode for bot detection avoidance
  • Automatic retry with exponential backoff
  • Rich metadata output including word count, image count, and more

Last updated: December 2025. We maintain this actor actively and update it as websites and web standards evolve.