πŸ“ Markdown Maker: HTML to AI-Ready Text avatar
πŸ“ Markdown Maker: HTML to AI-Ready Text

Pricing

Pay per usage

Go to Apify Store
πŸ“ Markdown Maker: HTML to AI-Ready Text

πŸ“ Markdown Maker: HTML to AI-Ready Text

Instantly convert complex HTML into clean, structured Markdown. This lightweight actor is optimized to render web content into a format that is easily readable for AI LLMs, reducing token usage and improving context. Perfect for RAG pipelines and preparing data for training.

Pricing

Pay per usage

Rating

0.0

(0)

Developer

Shahid Irfan

Shahid Irfan

Maintained by Community

Actor stats

0

Bookmarked

1

Total users

1

Monthly active users

2 days ago

Last modified

Share

Markdown Maker

Convert any web page into clean, AI-ready markdown format in seconds. Perfect for feeding content to AI models, creating documentation, or archiving web content in a portable format.

Apify Actor Markdown AI Ready

πŸ“‹ What This Actor Does

Markdown Maker automatically transforms web pages into clean, well-formatted markdown that's optimized for AI processing and human readability. Whether you're building an AI training dataset, creating documentation, or archiving web content, this tool extracts the main content from any URL and converts it to structured markdownβ€”eliminating ads, navigation menus, and other clutter.

Perfect for:

  • AI Training Data - Convert documentation and articles into markdown for feeding to language models
  • Content Archiving - Save web content in a portable, future-proof format
  • Documentation Migration - Extract content from old sites to import into new documentation platforms
  • Research - Collect and organize content from multiple sources
  • Data Analysis - Convert web content to structured format for text analysis

✨ Key Features

  • 🎯 Smart Content Extraction - Automatically identifies and filters out ads, navigation, and clutter
  • πŸ“ GitHub-Flavored Markdown - Clean, standardized markdown with proper table syntax and formatting
  • ⚑ Batch Processing - Process multiple URLs at once with optional delays
  • πŸ”’ Reliable Scraping - Built-in proxy rotation and retry logic for consistent results
  • 🌐 Universal Compatibility - Works on any website including JavaScript-heavy pages
  • πŸš€ Production Ready - Optimized for speed and reliability

πŸš€ Quick Start

Basic Usage - Single URL

{
"startUrls": [
{
"url": "https://docs.apify.com/api/v2"
}
]
}

Multiple URLs

{
"startUrls": [
{
"url": "https://docs.apify.com/api/v2"
},
{
"url": "https://example.com/article"
},
{
"url": "https://blog.example.com/post"
}
],
"maxItems": 10
}

With Rate Limiting

{
"startUrls": [
{
"url": "https://docs.example.com/page1"
},
{
"url": "https://docs.example.com/page2"
}
],
"delayBetweenRequests": 2,
"proxyConfiguration": {
"useApifyProxy": true
}
}

πŸ“Š Input Parameters

ParameterTypeRequiredDescriptionExample
startUrlsarrayβœ… YesList of URLs to convert to markdown[{"url": "https://example.com"}]
maxItemsinteger❌ NoMaximum number of pages to process10 (default: unlimited)
delayBetweenRequestsinteger❌ NoSeconds to wait between processing each URL (0-300)2 (default: 0)
proxyConfigurationobject❌ NoProxy settings for reliable access{"useApifyProxy": true}

πŸ“ˆ Output Data Structure

Each converted page provides clean markdown with metadata:

{
"url": "https://docs.apify.com/api/v2",
"title": "Apify API Documentation",
"markdown": "# Apify API Documentation\n\n**URL Source:** https://docs.apify.com/api/v2\n\n---\n\nThe Apify API provides programmatic access...\n\n## Authentication\n\n...",
"timestamp": "2024-12-13T10:30:00.000Z"
}

Output Fields

  • url - Source web page URL
  • title - Extracted page title
  • markdown - Full content converted to clean markdown format
  • timestamp - When the page was processed

Markdown Format Features

  • βœ… Proper heading hierarchy (H1-H6)
  • βœ… Clean table syntax with pipes (|)
  • βœ… Bullet points using asterisks (*)
  • βœ… Code blocks with triple backticks
  • βœ… Strikethrough and emphasis preserved
  • βœ… Horizontal rules under major sections
  • βœ… Source URL included in output

🎯 Use Cases & Applications

AI & Machine Learning

  • Training Data Preparation - Convert documentation for AI model training
  • RAG Systems - Prepare content for retrieval-augmented generation
  • Knowledge Bases - Build searchable AI knowledge repositories
  • Prompt Engineering - Create clean context for LLM prompts

Documentation & Content

  • Documentation Migration - Move content to modern markdown-based systems
  • Content Archiving - Preserve web content in portable format
  • Static Site Generation - Feed content to Jekyll, Hugo, or Next.js
  • Knowledge Management - Build internal wikis and documentation

Research & Analysis

  • Academic Research - Collect and analyze web content
  • Market Research - Extract competitor information
  • Text Mining - Prepare web data for NLP analysis
  • Content Monitoring - Track changes to web pages over time

⚑ Performance & Cost Optimization

Use CaseMax ItemsDelayEst. Time
Quick Test50~30 seconds
Documentation Site501~2 minutes
Content Archive2002~8 minutes
Large Dataset500+2~20 minutes

Plan Limits

  • Free Plan: Limited to 100 pages per run
  • Paid Plans: Unlimited page processing

Upgrade to a paid plan to process unlimited pages.

Best Practices

  • Start Small: Test with 5-10 URLs first to verify output quality
  • Use Delays: Set delayBetweenRequests to avoid overwhelming servers
  • Enable Proxies: Use Apify Proxy for reliable access to any website
  • Batch Processing: Process URLs in batches for better control
  • Monitor Output: Check markdown quality and adjust as needed

πŸ”§ Configuration Examples

Documentation Site

Convert entire documentation site for AI training:

{
"startUrls": [
{"url": "https://docs.example.com/getting-started"},
{"url": "https://docs.example.com/api-reference"},
{"url": "https://docs.example.com/tutorials"}
],
"maxItems": 50,
"delayBetweenRequests": 1,
"proxyConfiguration": {
"useApifyProxy": true
}
}

Blog Archive

Archive blog posts in markdown format:

{
"startUrls": [
{"url": "https://blog.example.com/2024/post-1"},
{"url": "https://blog.example.com/2024/post-2"}
],
"maxItems": 100,
"delayBetweenRequests": 2
}

Research Collection

Gather content from multiple sources:

{
"startUrls": [
{"url": "https://wikipedia.org/wiki/Topic"},
{"url": "https://example.com/research-paper"},
{"url": "https://news.example.com/article"}
],
"proxyConfiguration": {
"useApifyProxy": true
}
}

Quick Single Page

Convert a single page quickly:

{
"startUrls": [
{"url": "https://example.com/important-page"}
]
}

πŸ“‹ Supported Content & Features

Website Compatibility

  • βœ… Static HTML pages
  • βœ… JavaScript-rendered content (SPA, React, Vue, Angular)
  • βœ… Documentation sites (GitBook, Docusaurus, MkDocs)
  • βœ… Blog platforms (WordPress, Medium, Ghost)
  • βœ… Wiki pages (Wikipedia, Confluence)
  • βœ… News articles and magazines
  • βœ… Product pages and landing pages

Content Extraction

  • Smart Filtering: Automatically removes ads, navigation, footers, and sidebars
  • Semantic Analysis: Identifies main content using multiple algorithms
  • Structure Preservation: Maintains headings, lists, tables, and code blocks
  • Link Handling: Preserves hyperlinks in markdown format
  • Image Alt Text: Includes image descriptions when available

Language Support

  • Works with any language (Unicode support)
  • Preserves special characters and formatting
  • Handles RTL (right-to-left) text

πŸ†˜ Troubleshooting

Common Issues

Empty or Poor Quality Markdown

  • Page may have aggressive anti-scraping measures
  • Enable proxyConfiguration with Apify Proxy
  • Some pages may have no extractable content
  • Try increasing delayBetweenRequests

Timeout Errors

  • Reduce the number of URLs in startUrls
  • Increase delayBetweenRequests to slow down processing
  • Enable proxy configuration for better reliability
  • Split large jobs into smaller batches

Missing Content

  • JavaScript-heavy sites may need more processing time
  • Some content may be dynamically loaded after page render
  • Check if the page requires authentication

Rate Limiting

  • Increase delayBetweenRequests (e.g., 2-5 seconds)
  • Enable Apify Proxy to rotate IP addresses
  • Process fewer URLs per run

Support

For issues or feature requests:

  • Email: Contact via Google Form
  • Documentation: Check Apify documentation
  • Community: Visit Apify Discord community

We're here to help! Fill out the form at https://docs.google.com/forms/d/e/1FAIpQLSfsKyzZ3nRED7mML47I4LAfNh_mBwkuFMp1FgYYJ4AkDRgaRw/viewform to get support.

οΏ½ Export Options

The Apify platform provides multiple ways to export your markdown data:

JSON Format

Perfect for programmatic use or integration with other tools:

[
{
"url": "https://example.com",
"title": "Example Page",
"markdown": "# Example Page\n\n..."
}
]

CSV Format

Great for opening in Excel or Google Sheets - each row contains one URL and its markdown content.

Integration Options

  • Webhooks - Send results to your own API
  • Google Sheets - Automatically populate a spreadsheet
  • Make.com / Zapier - Trigger workflows based on results
  • Other Apify Actors - Chain multiple actors together

πŸ”— API Integration

Access your results programmatically:

# Get the dataset
curl https://api.apify.com/v2/datasets/{DATASET_ID}/items

Results are stored in Apify's dataset storage and remain available for download even after the actor finishes running.

πŸ“„ License & Terms

This actor extracts publicly available web content in accordance with applicable web scraping regulations and respects robots.txt directives.


Built with ❀️ by Shahid

Keywords: markdown converter, web scraping, ai training data, content extraction, documentation tools, markdown generator, web to markdown, apify actor, content archiving, ai-ready data