
Website to Markdown - LLM-Ready Content Extractor

Transform any webpage into clean, structured Markdown perfect for AI/LLM applications, content archiving, and documentation.


⚑ Quick Start (30 seconds)

  1. Open the Actor in the Apify Console
  2. Paste your URLs
  3. Click Start

That's it! No configuration needed.

```json
{
  "urls": ["https://example.com/article"]
}
```
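
Prefer to start runs from code? The sketch below uses the official apify-client Python package; the API token and Actor ID are placeholders you would replace with your own values from the Apify Console.

```python
# pip install apify-client
from apify_client import ApifyClient

# Placeholders: use your own Apify API token and this Actor's ID from the Console.
client = ApifyClient("<YOUR_APIFY_TOKEN>")

# call() starts the Actor and waits for the run to finish.
run = client.actor("<ACTOR_ID>").call(run_input={
    "urls": ["https://example.com/article"],
})

# Each dataset item is one output record (recordType "success" or "error").
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["recordType"], item["url"], item.get("markdownLength"))
```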

💰 Pricing

Pay-Per-Result - You only pay for what you use.

| Usage | Estimated Cost |
| --- | --- |
| 100 pages | $1.19 |
| 1,000 pages | $11.99 |
| 10,000 pages | $119.90 |

✨ What This Actor Does

  • ✅ Extract main content from any webpage (articles, blogs, documentation)
  • ✅ Convert HTML to clean, well-formatted Markdown
  • ✅ Preserve headings, links, images, lists, and tables
  • ✅ Remove ads, navigation, footers, and clutter automatically
  • ✅ Process multiple URLs concurrently (up to 100 at once)
  • ✅ Handle JavaScript-rendered pages with Playwright
  • ✅ Bypass bot detection with stealth mode
  • ✅ Retry failed requests automatically with exponential backoff

📊 Sample Output

This is exactly what you'll get:

```json
{
  "recordType": "success",
  "url": "https://example.com/article",
  "finalUrl": "https://example.com/article",
  "title": "Scientists Print Working Electrodes Directly on Skin With Light",
  "markdown": "# Scientists Print Working Electrodes Directly on Skin With Light\n\n...",
  "markdownLength": 4521,
  "processingTimeMs": 2340,
  "timestamp": "2025-01-15T10:30:00.000Z",
  "metadata": {
    "description": "A new study shows visible light can form electrodes from conductive plastics",
    "lang": "en",
    "wordCount": 892,
    "imageCount": 4,
    "linkCount": 12,
    "headingCount": 5
  }
}
```

The markdown field contains clean, structured content ready for LLM processing, RAG pipelines, or documentation.
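
As a small illustration, the sketch below (assuming the dataset was exported to a file named dataset.json from the Storage tab) writes each successful record's markdown field to its own .md file:

```python
import json
import re
from pathlib import Path

# "dataset.json" is an assumed name for a dataset exported from the Storage tab.
items = json.loads(Path("dataset.json").read_text(encoding="utf-8"))

out_dir = Path("markdown_output")
out_dir.mkdir(exist_ok=True)

for item in items:
    if item.get("recordType") != "success":
        continue  # skip error records
    # Derive a file name from the title, falling back to the URL.
    name = item.get("title") or item["url"]
    slug = re.sub(r"[^A-Za-z0-9]+", "-", name).strip("-").lower()[:80] or "page"
    (out_dir / f"{slug}.md").write_text(item["markdown"], encoding="utf-8")
```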


🎯 Use Cases

🤖 AI & LLM Training Data

Collect clean text from web sources for fine-tuning models, building knowledge bases, or RAG applications.

📚 Documentation Archiving

Convert online documentation, wikis, and help centers into portable Markdown files.

📰 Content Aggregation

Gather articles from multiple news sources for analysis, summarization, or newsletters.

🔬 Research & Analysis

Extract research papers, blog posts, and technical content for systematic analysis.

📝 Blog Migration

Move content from one platform to another by extracting articles as Markdown.

🗂️ Knowledge Management

Build searchable knowledge bases from scattered web resources.


📝 How to Use

Step 1: Prepare Your URLs

Gather the URLs you want to convert. Each URL should:

  • Start with http:// or https://
  • Point to a page with readable content (articles, blogs, docs)

Examples:

https://docs.python.org/3/tutorial/index.html
https://blog.example.com/my-article
https://news.ycombinator.com/item?id=12345

Step 2: Configure the Actor

In the Apify Console, enter your URLs in the URLs field. You can:

  • Paste URLs one per line
  • Use the JSON editor for bulk input

Minimal configuration:

```json
{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2"
  ]
}
```

Step 3: Adjust Settings (Optional)

For most websites, default settings work great. Adjust only if needed:

  • Max Concurrency: Increase for faster processing (default: 5)
  • Stealth Mode: Keep ON for protected sites (default: ON)
  • Proxy: Enable for rate-limited or geo-restricted sites

Step 4: Run the Actor

  1. Click the Start button
  2. Watch the Log tab for progress
  3. Wait for completion (usually seconds per page)
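
If you start runs through the API instead of the Console, the same progress can be followed by polling the run's status. A hedged sketch with apify-client (placeholder token and Actor ID):

```python
import time
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# start() returns immediately; call() would block until the run finishes.
run = client.actor("<ACTOR_ID>").start(run_input={"urls": ["https://example.com"]})

# Poll the run until it reaches a terminal state.
while True:
    info = client.run(run["id"]).get()
    print("Run status:", info["status"])
    if info["status"] not in ("READY", "RUNNING"):
        break
    time.sleep(5)
```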

Step 5: Download Results

When finished, go to the Storage tab:

  • Dataset: Click to view all results
  • Export: Download as JSON, CSV, or Excel
  • API: Use the dataset ID in your applications
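
For the API route, results can be fetched straight from the dataset items endpoint. The sketch below assumes a dataset ID copied from the Storage tab and a valid API token:

```python
import requests

DATASET_ID = "<YOUR_DATASET_ID>"     # copy from the Storage tab
APIFY_TOKEN = "<YOUR_APIFY_TOKEN>"

# The same endpoint also supports format=csv or format=xlsx for exports.
resp = requests.get(
    f"https://api.apify.com/v2/datasets/{DATASET_ID}/items",
    params={"format": "json", "clean": "true", "token": APIFY_TOKEN},
    timeout=60,
)
resp.raise_for_status()
items = resp.json()
print(f"Fetched {len(items)} records")
```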

🔧 Input Parameters (Full Reference)

| Parameter | Required | Type | Default | What It Does |
| --- | --- | --- | --- | --- |
| urls | YES | Array | - | List of webpage URLs to convert to Markdown |
| maxConcurrency | No | Integer | 5 | How many pages to process simultaneously (1-100) |
| pageLoadTimeout | No | Integer | 30000 | Max time to wait for page load in milliseconds (5000-120000) |
| maxRetries | No | Integer | 3 | Retry attempts for failed pages (0-10) |
| stealthMode | No | Boolean | true | Enable anti-bot-detection techniques |
| proxyConfiguration | No | Object | - | Proxy settings for accessing restricted content |
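
Putting the reference together, a complete input with every parameter set explicitly could look like the sketch below. It is written as a Python run_input dict for use with apify-client; in the Console's JSON editor, use the JSON equivalent (lowercase true). The values are illustrative, not recommendations:

```python
# Illustrative run_input covering every parameter in the table above.
run_input = {
    "urls": [
        "https://example.com/page1",
        "https://example.com/page2",
    ],
    "maxConcurrency": 10,        # 1-100 pages processed in parallel
    "pageLoadTimeout": 60000,    # milliseconds, 5000-120000
    "maxRetries": 3,             # 0-10 retry attempts per failed page
    "stealthMode": True,         # keep anti-bot-detection enabled
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
    },
}
```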

About Stealth Mode

Stealth mode helps access websites that block automated requests:

  • Randomizes request timing
  • Uses browser fingerprint rotation
  • Mimics human browsing patterns

Keep it ON for most websites. Only disable for trusted internal sites where speed is critical.

About Proxy Configuration

Use proxies when:

  • Websites rate-limit your requests
  • Content is geo-restricted
  • You need to distribute requests across IPs

Configure in the Apify Console using the built-in proxy selector.


📤 Output Data (What You Get)

Every result includes these fields:

✅ Success Records

| Field | Type | What It Contains |
| --- | --- | --- |
| recordType | Text | Always "success" for successful extractions |
| url | Text | The original URL you provided |
| finalUrl | Text | The actual URL after any redirects |
| title | Text | Page title from the <title> tag |
| markdown | Text | The extracted content as Markdown |
| markdownLength | Number | Character count of the Markdown content |
| processingTimeMs | Number | How long extraction took in milliseconds |
| timestamp | Text | When the extraction happened (ISO 8601) |
| metadata | Object | Additional page info (description, language, word count, etc.) |

❌ Error Records

| Field | Type | What It Contains |
| --- | --- | --- |
| recordType | Text | Always "error" for failed extractions |
| url | Text | The URL that failed |
| errorType | Text | Category: http_error, timeout, bot_detection, etc. |
| severity | Text | "warning" (retried) or "error" (final failure) |
| message | Text | Human-readable explanation of what went wrong |
| httpStatus | Number | HTTP status code if applicable (404, 500, etc.) |
| retryCount | Number | How many retry attempts were made |
| processingTimeMs | Number | Time spent before failure |
| timestamp | Text | When the error occurred |
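
Downstream code usually splits the two record types before doing anything else. A minimal sketch, assuming the dataset has already been fetched as a list of dicts (for example with the snippets earlier in this README):

```python
from collections import Counter

def split_records(items):
    """Separate success and error records and summarize error types."""
    successes = [it for it in items if it.get("recordType") == "success"]
    errors = [it for it in items if it.get("recordType") == "error"]
    error_types = Counter(it.get("errorType", "unknown") for it in errors)
    return successes, errors, error_types

# Example usage with an already-fetched `items` list:
# successes, errors, error_types = split_records(items)
# print(len(successes), "succeeded,", len(errors), "failed:", dict(error_types))
```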

💡 Common Questions

❓ How long does it take to process pages?

Most pages complete in 2-5 seconds. Complex JavaScript-heavy pages may take up to 30 seconds. Processing 100 pages with default concurrency (5) typically takes 1-2 minutes.

❓ How much does it cost to run?

Pricing is pay-per-result, so cost scales with the number of pages processed; typical cost is around $0.01 per page (see the pricing table above). Use Apify's free tier or set spending limits in your account settings to keep costs under control.

❓ Do I need any API keys or logins?

No! This actor works out of the box. No external API keys, no website logins, no configuration files needed.

❓ What if a page fails to load?

The actor automatically retries failed pages up to 3 times (configurable) with exponential backoff. If all retries fail, you'll get an error record explaining what went wrong.
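
For reference, exponential backoff simply doubles the wait between attempts. The generic pattern looks like the sketch below; the Actor's exact internal delays and jitter are not documented here, so the numbers are illustrative only:

```python
import random

def backoff_delays(max_retries=3, base_seconds=1.0):
    """Yield exponentially growing delays with a little random jitter."""
    for attempt in range(1, max_retries + 1):
        yield attempt, base_seconds * (2 ** (attempt - 1)) + random.uniform(0, 0.5)

for attempt, delay in backoff_delays():
    # In real retry logic you would sleep for `delay` seconds before retrying.
    print(f"retry {attempt}: wait about {delay:.1f}s")
```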

❓ Does this work with JavaScript-heavy websites?

Yes! The actor uses Playwright to fully render JavaScript before extraction. Single-page apps (SPAs), React sites, and dynamic content are all supported.

❓ Can I use this commercially?

Yes! The extracted content is yours to use. However, always respect the source website's terms of service and copyright.

❓ What happens with paywalled content?

The actor extracts only publicly visible content. It cannot bypass paywalls, login walls, or access restricted content without proper authentication.

❓ Can I process thousands of URLs at once?

Yes! Increase maxConcurrency (up to 100) for faster processing. For very large jobs, consider using Apify's scheduling and webhook features.
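
Since a single run is limited to 10,000 URLs (see Limitations below), very large jobs are typically split into batches. A hedged sketch with apify-client, using a placeholder token and Actor ID and an illustrative URL list:

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Illustrative URL list; in practice load your own URLs from a file or database.
all_urls = [f"https://example.com/page/{i}" for i in range(25_000)]

BATCH_SIZE = 10_000  # per-run limit noted in the Limitations section

dataset_ids = []
for start in range(0, len(all_urls), BATCH_SIZE):
    batch = all_urls[start:start + BATCH_SIZE]
    # call() runs the batches one after another; use start() to launch them in parallel.
    run = client.actor("<ACTOR_ID>").call(run_input={
        "urls": batch,
        "maxConcurrency": 50,
    })
    dataset_ids.append(run["defaultDatasetId"])

print("Datasets to merge:", dataset_ids)
```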


⚠️ Troubleshooting

Problem: "Bot detection" error

What it means: The website identified the request as automated and blocked it.

Solution:

  1. Make sure Stealth Mode is enabled (it's ON by default)
  2. Enable Proxy Configuration using Apify's residential proxies
  3. Reduce Max Concurrency to 2-3 to appear more human-like
  4. Add delays between requests by lowering concurrency
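
Taken together, those adjustments correspond to an input like the following, written as a Python run_input dict (the JSON equivalent uses lowercase true); the URL and values are illustrative:

```python
# Input tuned for sites with aggressive bot protection (illustrative values).
run_input = {
    "urls": ["https://protected-site.example.com/article"],
    "stealthMode": True,      # keep anti-detection techniques enabled
    "maxConcurrency": 2,      # slower, more human-like request rate
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],  # residential IPs for protected sites
    },
}
```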

Problem: "Timeout" error

What it means: The page took too long to load or render.

Solution:

  1. Increase Page Load Timeout to 60000ms (60 seconds)
  2. Check if the URL is valid and the site is online
  3. Some pages may require proxy to avoid rate limiting

Problem: "Extraction failed" error

What it means: The page loaded but the content extractor couldn't find readable content.

Solution:

  1. Verify the page has actual text content (not just images/video)
  2. Some pages (login walls, empty pages) genuinely have no extractable content
  3. Check if the URL is correct and the page exists

Problem: Empty or low-quality Markdown output

What it means: Content was extracted but may not be complete or well-structured.

Solution:

  1. Check the markdownLength and metadata.wordCount fields to assess content size
  2. Pages with unusual HTML structure may need manual review
  3. The content extractor works best with article-style pages
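
One quick way to flag suspiciously thin results in bulk is to filter on those two fields. The thresholds in the sketch below are arbitrary examples, not recommendations:

```python
# Flag success records whose extracted content looks suspiciously thin.
MIN_MARKDOWN_CHARS = 500   # arbitrary example threshold
MIN_WORD_COUNT = 100       # arbitrary example threshold

def needs_review(item):
    """Return True for success records that are probably incomplete."""
    if item.get("recordType") != "success":
        return False
    words = item.get("metadata", {}).get("wordCount", 0)
    return item.get("markdownLength", 0) < MIN_MARKDOWN_CHARS or words < MIN_WORD_COUNT

# Example usage with an already-fetched `items` list:
# for item in filter(needs_review, items):
#     print("Review:", item["url"], item.get("markdownLength"))
```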

Problem: "HTTP 403/404/500" errors

What it means:

  • 403: Access forbidden (blocked or requires authentication)
  • 404: Page not found (URL is wrong or page deleted)
  • 500: Server error (website is having problems)

Solution:

  1. For 403: Try enabling proxy; the site may be blocking your region
  2. For 404: Verify the URL is correct
  3. For 500: Wait and retry; the website may be temporarily down

Still stuck? Check the error message details in your results. Most issues can be resolved by enabling proxies or adjusting timeout settings.


🌍 Proxy Configuration

For most users: Default settings work fine.

If you need proxies:

  1. Go to the Input tab
  2. Scroll to Proxy Configuration
  3. Select from Apify's proxy options:
    • Datacenter proxies: Fast and cheap, good for most sites
    • Residential proxies: Better success rate for protected sites

Example configuration:

```json
{
  "urls": ["https://example.com"],
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```

Cost implications: Residential proxies cost more but have higher success rates. Start with datacenter proxies and upgrade only if you see bot detection errors.


❌ Limitations

  • 🚫 Cannot bypass paywalls or login-protected content
  • 🚫 Cannot extract content from PDFs or non-HTML documents
  • 🚫 Cannot process more than 10,000 URLs per run (split into multiple runs)
  • 🚫 Cannot guarantee extraction quality for pages with unusual HTML structure
  • 🚫 Cannot access content blocked by geographic restrictions without appropriate proxies

🚀 Getting Started Now

  1. Copy this example:

```json
{
  "urls": [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://docs.apify.com/academy/web-scraping-for-beginners"
  ]
}
```

  2. Paste it into the Input JSON editor

  3. Click Start

  4. Wait ~10 seconds

  5. Download results from the Storage tab as JSON or CSV

That's it! You're done. 🎉


📞 Support & Maintenance

Response time: We respond within 48 hours

Contact: Open an issue on the Actor's page in Apify Console

How we help:

  • Debug failed extractions
  • Optimize settings for your use case
  • Answer questions about output format

Maintenance:

  • We actively monitor for website changes affecting extraction
  • Updates are rolled out automatically
  • You'll see release notes for major changes

🔄 Version History

v0.0 - Initial Release

  • Core content extraction with Mozilla Readability
  • HTML-to-Markdown conversion with Turndown
  • Concurrent URL processing with Crawlee
  • Stealth mode for bot detection avoidance
  • Automatic retry with exponential backoff
  • Rich metadata output including word count, image count, and more

Last updated: December 2025. We maintain this actor actively and update it as websites and web standards evolve.