Website To Markdown
Convert any webpage into clean, LLM-ready Markdown in seconds, perfect for AI training data, RAG pipelines, and content archiving.

Pricing: from $9.00 / 1,000 webpages
Rating: 5.0 (1 review)
Developer: SmartApi
Actor stats: 0 bookmarked, 2 total users, 1 monthly active user
Last modified: 2 days ago
Website to Markdown - LLM-Ready Content Extractor
Transform any webpage into clean, structured Markdown perfect for AI/LLM applications, content archiving, and documentation.
Quick Start (30 seconds)
- Click Start in the Apify Console
- Paste your URLs
- Click Run
That's it! No configuration needed.
```json
{
  "urls": ["https://example.com/article"]
}
```
Pricing
Pay-Per-Result - You only pay for what you use.
| Usage | Estimated Cost |
|---|---|
| 100 pages | $1.19 |
| 1,000 pages | $11.99 |
| 10,000 pages | $119.90 |
What This Actor Does
- Extract main content from any webpage (articles, blogs, documentation)
- Convert HTML to clean, well-formatted Markdown
- Preserve headings, links, images, lists, and tables
- Remove ads, navigation, footers, and clutter automatically
- Process multiple URLs concurrently (up to 100 at once)
- Handle JavaScript-rendered pages with Playwright
- Bypass bot detection with stealth mode
- Retry failed requests automatically with exponential backoff
Sample Output
This is exactly what you'll get:
```json
{
  "recordType": "success",
  "url": "https://example.com/article",
  "finalUrl": "https://example.com/article",
  "title": "Scientists Print Working Electrodes Directly on Skin With Light",
  "markdown": "# Scientists Print Working Electrodes Directly on Skin With Light\n\n...",
  "markdownLength": 4521,
  "processingTimeMs": 2340,
  "timestamp": "2025-01-15T10:30:00.000Z",
  "metadata": {
    "description": "A new study shows visible light can form electrodes from conductive plastics",
    "lang": "en",
    "wordCount": 892,
    "imageCount": 4,
    "linkCount": 12,
    "headingCount": 5
  }
}
```
The markdown field contains clean, structured content ready for LLM processing, RAG pipelines, or documentation.
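Once you export the dataset (Storage tab, JSON format), a few lines of Python can separate successes from failures. This is a minimal sketch: the field names follow the sample record above, and the inline `records` list stands in for your real export file.

```python
import json

def split_records(records):
    """Separate success records from error records by their recordType field."""
    ok = [r for r in records if r.get("recordType") == "success"]
    bad = [r for r in records if r.get("recordType") == "error"]
    return ok, bad

# Inline records shaped like the sample output above; in practice you would
# load them with json.load(open("results.json")) from your dataset export.
records = [
    {"recordType": "success", "url": "https://example.com/article",
     "title": "Example", "markdown": "# Example\n\nBody text.",
     "metadata": {"wordCount": 2}},
    {"recordType": "error", "url": "https://example.com/missing",
     "errorType": "http_error", "httpStatus": 404},
]
ok, bad = split_records(records)
print(len(ok), len(bad))  # 1 1
```

From here the `markdown` field of each success record can be written straight to a `.md` file or fed into a RAG ingestion pipeline.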
Use Cases
AI & LLM Training Data
Collect clean text from web sources for fine-tuning models, building knowledge bases, or RAG applications.
Documentation Archiving
Convert online documentation, wikis, and help centers into portable Markdown files.
Content Aggregation
Gather articles from multiple news sources for analysis, summarization, or newsletters.
Research & Analysis
Extract research papers, blog posts, and technical content for systematic analysis.
Blog Migration
Move content from one platform to another by extracting articles as Markdown.
Knowledge Management
Build searchable knowledge bases from scattered web resources.
How to Use
Step 1: Prepare Your URLs
Gather the URLs you want to convert. Each URL should:
- Start with `http://` or `https://`
- Point to a page with readable content (articles, blogs, docs)
Examples:
- https://docs.python.org/3/tutorial/index.html
- https://blog.example.com/my-article
- https://news.ycombinator.com/item?id=12345
Step 2: Configure the Actor
In the Apify Console, enter your URLs in the URLs field. You can:
- Paste URLs one per line
- Use the JSON editor for bulk input
Minimal configuration:
```json
{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2"
  ]
}
```
Step 3: Adjust Settings (Optional)
For most websites, default settings work great. Adjust only if needed:
- Max Concurrency: Increase for faster processing (default: 5)
- Stealth Mode: Keep ON for protected sites (default: ON)
- Proxy: Enable for rate-limited or geo-restricted sites
Step 4: Run the Actor
- Click the Start button
- Watch the Log tab for progress
- Wait for completion (usually seconds per page)
Step 5: Download Results
When finished, go to the Storage tab:
- Dataset: Click to view all results
- Export: Download as JSON, CSV, or Excel
- API: Use the dataset ID in your applications
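If you prefer to drive the actor programmatically, runs can be started through Apify's REST API. The sketch below only builds the HTTP request; the actor ID `username~website-to-markdown` and the `MY_TOKEN` value are placeholders you must replace with your own.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def build_run_request(actor_id: str, token: str, urls: list) -> urllib.request.Request:
    """Build the POST request that starts an actor run via Apify's REST API."""
    body = json.dumps({"urls": urls}).encode()
    return urllib.request.Request(
        f"{API_BASE}/acts/{actor_id}/runs?token={token}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder actor ID and token; substitute your own before sending.
req = build_run_request("username~website-to-markdown", "MY_TOKEN",
                       ["https://example.com/article"])
print(req.full_url)
# Sending the request with urllib.request.urlopen(req) starts the run;
# the response includes a defaultDatasetId you can poll for results.
```

The official `apify-client` package wraps this same API if you want run polling and dataset pagination handled for you.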
Input Parameters (Full Reference)
| Parameter | Required | Type | Default | What It Does |
|---|---|---|---|---|
| `urls` | YES | Array | (none) | List of webpage URLs to convert to Markdown |
| `maxConcurrency` | No | Integer | 5 | How many pages to process simultaneously (1-100) |
| `pageLoadTimeout` | No | Integer | 30000 | Max time to wait for page load in milliseconds (5000-120000) |
| `maxRetries` | No | Integer | 3 | Retry attempts for failed pages (0-10) |
| `stealthMode` | No | Boolean | true | Enable anti-bot-detection techniques |
| `proxyConfiguration` | No | Object | (none) | Proxy settings for accessing restricted content |
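Putting the table together, a run that spells out every optional parameter alongside its documented default might look like this (the proxy block is optional and shown only for completeness):

```json
{
  "urls": ["https://example.com/article"],
  "maxConcurrency": 5,
  "pageLoadTimeout": 30000,
  "maxRetries": 3,
  "stealthMode": true,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```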
About Stealth Mode
Stealth mode helps access websites that block automated requests:
- Randomizes request timing
- Uses browser fingerprint rotation
- Mimics human browsing patterns
Keep it ON for most websites. Only disable for trusted internal sites where speed is critical.
About Proxy Configuration
Use proxies when:
- Websites rate-limit your requests
- Content is geo-restricted
- You need to distribute requests across IPs
Configure in the Apify Console using the built-in proxy selector.
Output Data (What You Get)
Every result includes these fields:
Success Records
| Field | Type | What It Contains |
|---|---|---|
| `recordType` | Text | Always "success" for successful extractions |
| `url` | Text | The original URL you provided |
| `finalUrl` | Text | The actual URL after any redirects |
| `title` | Text | Page title from the `<title>` tag |
| `markdown` | Text | The extracted content as Markdown |
| `markdownLength` | Number | Character count of the Markdown content |
| `processingTimeMs` | Number | How long extraction took in milliseconds |
| `timestamp` | Text | When the extraction happened (ISO 8601) |
| `metadata` | Object | Additional page info (description, language, word count, etc.) |
Error Records
| Field | Type | What It Contains |
|---|---|---|
| `recordType` | Text | Always "error" for failed extractions |
| `url` | Text | The URL that failed |
| `errorType` | Text | Category: http_error, timeout, bot_detection, etc. |
| `severity` | Text | "warning" (retried) or "error" (final failure) |
| `message` | Text | Human-readable explanation of what went wrong |
| `httpStatus` | Number | HTTP status code if applicable (404, 500, etc.) |
| `retryCount` | Number | How many retry attempts were made |
| `processingTimeMs` | Number | Time spent before failure |
| `timestamp` | Text | When the error occurred |
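Error records are easy to aggregate before deciding which URLs to re-run or which settings to change. A small sketch, assuming only the documented `errorType` field:

```python
from collections import Counter

def summarize_errors(error_records):
    """Count failures per errorType so you can decide what to retry or reconfigure."""
    return Counter(r.get("errorType", "unknown") for r in error_records)

# Inline examples shaped like the error-record table above.
errors = [
    {"errorType": "timeout", "url": "https://example.com/a"},
    {"errorType": "timeout", "url": "https://example.com/b"},
    {"errorType": "bot_detection", "url": "https://example.com/c"},
]
summary = summarize_errors(errors)
print(summary.most_common())  # [('timeout', 2), ('bot_detection', 1)]
```

Lots of `timeout` entries suggest raising `pageLoadTimeout`; lots of `bot_detection` entries suggest enabling residential proxies (see Troubleshooting below).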
Common Questions
How long does it take to process pages?
Most pages complete in 2-5 seconds. Complex JavaScript-heavy pages may take up to 30 seconds. Processing 100 pages with default concurrency (5) typically takes 1-2 minutes.
How much does it cost to run?
Costs depend on page complexity and compute time. Typical costs are around $0.01 per page. Enable the free tier or set spending limits in your Apify account settings.
Do I need any API keys or logins?
No! This actor works out of the box. No external API keys, no website logins, no configuration files needed.
What if a page fails to load?
The actor automatically retries failed pages up to 3 times (configurable) with exponential backoff. If all retries fail, you'll get an error record explaining what went wrong.
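The actor's exact retry delays aren't documented, but exponential backoff generally follows the shape below (illustrative values only, with a hypothetical 1-second base delay and doubling factor):

```python
def backoff_delays(max_retries: int = 3, base_ms: int = 1000, factor: int = 2):
    """Illustrative exponential backoff: the wait doubles after each failed attempt."""
    return [base_ms * factor ** attempt for attempt in range(max_retries)]

print(backoff_delays())  # [1000, 2000, 4000]
```

The point of the growing delays is to give a struggling server (or a rate limiter) progressively more breathing room instead of hammering it at a fixed interval.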
Does this work with JavaScript-heavy websites?
Yes! The actor uses Playwright to fully render JavaScript before extraction. Single-page apps (SPAs), React sites, and dynamic content are all supported.
Can I use this commercially?
Yes! The extracted content is yours to use. However, always respect the source website's terms of service and copyright.
What happens with paywalled content?
The actor extracts only publicly visible content. It cannot bypass paywalls, login walls, or access restricted content without proper authentication.
Can I process thousands of URLs at once?
Yes! Increase maxConcurrency (up to 100) for faster processing. For very large jobs, consider using Apify's scheduling and webhook features.
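For jobs beyond the 10,000-URL-per-run cap noted under Limitations, split the list into batches and submit one run per batch. A minimal sketch:

```python
def chunk_urls(urls, chunk_size=10_000):
    """Split a large URL list into run-sized batches (one run accepts at most 10,000 URLs)."""
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]

# A hypothetical 25,000-URL job becomes three runs.
batches = chunk_urls([f"https://example.com/page/{n}" for n in range(25_000)])
print([len(b) for b in batches])  # [10000, 10000, 5000]
```

Each batch then becomes the `urls` array of a separate run input, which you can submit by hand or via a scheduler.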
Troubleshooting
Problem: "Bot detection" error
What it means: The website identified the request as automated and blocked it.
Solution:
- Make sure Stealth Mode is enabled (it's ON by default)
- Enable Proxy Configuration using Apify's residential proxies
- Reduce Max Concurrency to 2-3 to appear more human-like
- Add delays between requests by lowering concurrency
Problem: "Timeout" error
What it means: The page took too long to load or render.
Solution:
- Increase Page Load Timeout to 60000ms (60 seconds)
- Check if the URL is valid and the site is online
- Some pages may require proxy to avoid rate limiting
Problem: "Extraction failed" error
What it means: The page loaded but the content extractor couldn't find readable content.
Solution:
- Verify the page has actual text content (not just images/video)
- Some pages (login walls, empty pages) genuinely have no extractable content
- Check if the URL is correct and the page exists
Problem: Empty or low-quality Markdown output
What it means: Content was extracted but may not be complete or well-structured.
Solution:
- Check the `markdownLength` and `metadata.wordCount` fields to assess content size
- Pages with unusual HTML structure may need manual review
- The content extractor works best with article-style pages
Problem: "HTTP 403/404/500" errors
What it means:
- 403: Access forbidden (blocked or requires authentication)
- 404: Page not found (URL is wrong or page deleted)
- 500: Server error (website is having problems)
Solution:
- For 403: Try enabling proxy; the site may be blocking your region
- For 404: Verify the URL is correct
- For 500: Wait and retry; the website may be temporarily down
Still stuck? Check the error message details in your results. Most issues can be resolved by enabling proxies or adjusting timeout settings.
Proxy Configuration
For most users: Default settings work fine.
If you need proxies:
- Go to the Input tab
- Scroll to Proxy Configuration
- Select from Apify's proxy options:
- Datacenter proxies: Fast and cheap, good for most sites
- Residential proxies: Better success rate for protected sites
Example configuration:
```json
{
  "urls": ["https://example.com"],
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```
Cost implications: Residential proxies cost more but have higher success rates. Start with datacenter proxies and upgrade only if you see bot detection errors.
Limitations
- Cannot bypass paywalls or login-protected content
- Cannot extract content from PDFs or non-HTML documents
- Cannot process more than 10,000 URLs per run (split into multiple runs)
- Cannot guarantee extraction quality for pages with unusual HTML structure
- Cannot access content blocked by geographic restrictions without appropriate proxies
Getting Started Now
1. Copy this example:
```json
{
  "urls": [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://docs.apify.com/academy/web-scraping-for-beginners"
  ]
}
```
2. Paste into the Input JSON editor
3. Click Start
4. Wait ~10 seconds
5. Download results from the Storage tab as JSON or CSV
That's it! You're done.
Support & Maintenance
Response time: We respond within 48 hours
Contact: Open an issue on the Actor's page in Apify Console
How we help:
- Debug failed extractions
- Optimize settings for your use case
- Answer questions about output format
Maintenance:
- We actively monitor for website changes affecting extraction
- Updates are rolled out automatically
- You'll see release notes for major changes
Version History
v0.0 – Initial Release
- Core content extraction with Mozilla Readability
- HTML-to-Markdown conversion with Turndown
- Concurrent URL processing with Crawlee
- Stealth mode for bot detection avoidance
- Automatic retry with exponential backoff
- Rich metadata output including word count, image count, and more
Last updated: December 2025. We maintain this actor actively and update it as websites and web standards evolve.