Clean Web Scraper - Markdown for AI πŸ”₯ Firecrawl API

Convert any website to clean, LLM-optimized markdown using Firecrawl. Perfect for RAG pipelines, AI training data, and knowledge bases. No login required, 25% cheaper than Firecrawl direct. Batch process hundreds of URLs. Supports PDF/DOCX. Pay only $0.004 per page - no monthly fees.

Pricing: Pay per event
Developer: ClearPath (Maintained by Community)

Clean Web Scraper - Markdown for AI | Firecrawl Powered

The easiest way to convert any website to clean, LLM-optimized markdown β€” no login required, no cookies needed, just paste URLs and get structured content ready for RAG pipelines, fine-tuning, and knowledge bases.

Built on Firecrawl's production-grade infrastructure, this Actor delivers 25% cheaper pricing than subscribing to Firecrawl directly. Pay only for what you scrape β€” no monthly commitments.

  • βœ… No authentication required β€” works without browser sessions or cookies
  • βœ… Production-grade reliability β€” powered by Firecrawl's battle-tested infrastructure
  • βœ… LLM-optimized output β€” clean markdown stripped of navigation, ads, and clutter
  • βœ… Web search β€” search the web and scrape results in one step
  • βœ… PDF support β€” pass any PDF URL and get clean markdown automatically
  • βœ… Batch processing β€” scrape URLs in parallel
  • βœ… Website crawling β€” discover and scrape all pages from a site automatically
  • βœ… Multiple formats β€” markdown, HTML, raw HTML, links, or screenshots

Why Firecrawl?

Built on Firecrawl's enterprise infrastructure, you get these capabilities automatically β€” no configuration required:

| Feature | What It Does |
| --- | --- |
| Smart Wait | Intelligently waits for JavaScript content to load. Dynamic SPAs, lazy-loaded content, and client-rendered pages just work. |
| Stealth Mode | Handles anti-bot protection automatically. Rotates user agents, manages browser fingerprinting, retries with stealth proxies when needed. |
| Intelligent Caching | Recently scraped pages are cached for up to 500% faster repeated requests. |
| Media Parsing | Native PDF and DOCX parsing. Pass any document URL and get clean markdown. |
| Ad Blocking | Ads, cookie banners, and popups automatically blocked for cleaner output. |

⚑ Key Features

πŸ“ LLM-Optimized Content Extraction

  • Clean markdown output β€” Headers, footers, navigation, ads automatically removed
  • Smart content detection β€” Firecrawl identifies and extracts the main article/content
  • Preserves semantic structure β€” Headings, lists, tables, code blocks intact
  • Native document parsing β€” PDFs and DOCX files converted to markdown automatically
  • Multiple formats β€” Get markdown, HTML, raw HTML, links, or screenshots

πŸ” Web Search + Scrape

  • Search mode β€” Search the web and scrape results in one API call
  • Combine with URLs β€” Run search AND scrape specific URLs together
  • Configurable limit β€” Return 1-100 search results

πŸš€ High-Performance Batch Processing

  • Single URL mode β€” Quick scrape for one page
  • Batch mode β€” Process hundreds of URLs in parallel
  • Auto-detection β€” Automatically chooses optimal mode based on input
  • Progress tracking β€” Real-time status updates during batch jobs

πŸ’° Pay-Per-Use Pricing

  • No monthly fees β€” Pay only for pages you scrape
  • 25% cheaper β€” Lower cost than Firecrawl Hobby plan
  • Predictable costs β€” $0.004 per page, no hidden fees
  • No commitment β€” Scale up or down instantly

Use Cases

For Lead Generation & Sales

  • Company research β€” Extract company profiles, team info, and contact details from YC, Crunchbase, LinkedIn
  • Prospect enrichment β€” Scrape about pages, team bios, and social links at scale
  • Competitive intelligence β€” Monitor competitor websites, pricing pages, and product updates
  • Investment research β€” Gather startup data, funding info, and founder backgrounds

For AI/ML Engineers

  • Build RAG pipelines β€” Convert documentation sites to vector embeddings
  • Create training datasets β€” Scrape clean text for LLM fine-tuning
  • Feed knowledge bases β€” Extract content for AI assistants
  • Process research papers β€” Convert PDFs to structured markdown
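For RAG use, the Actor's markdown output still needs to be chunked before embedding. A minimal sketch under my own assumptions (heading-based splitting and a 1,000-character cap are illustrative choices, not part of the Actor):

```python
# Split Actor markdown output into heading-delimited chunks for embedding.
# The splitting strategy and chunk size here are assumptions, not Actor behavior.
import re

def chunk_markdown(markdown, max_chars=1000):
    """Split markdown at each heading, then cap every chunk at max_chars."""
    # Zero-width split before every markdown heading line (#, ##, ...).
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Hard-wrap oversized sections so each chunk fits an embedding window.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks

doc = "# Airbnb\n\nIntro text.\n\n## About\n\nFounded in 2008..."
print(chunk_markdown(doc))
```

Each chunk can then be passed to whatever embedding model backs your vector store.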

For Developers

  • Documentation scraping β€” Mirror docs for offline access
  • Content migration β€” Move websites between platforms
  • Data extraction β€” Pull structured content from any page
  • API integration β€” Automate content pipelines

For Content Teams

  • Competitive analysis β€” Extract competitor content for review
  • Content auditing β€” Bulk export site content
  • Archive creation β€” Preserve web content in clean format
  • Research compilation β€” Gather sources into structured documents

Quick Start

Web Search (Search Mode)

Web results:

{
  "query": "AI startups funding 2025",
  "searchLimit": 10
}

Image results:

{
  "query": "golden retriever puppy",
  "searchSources": ["images"],
  "searchLimit": 10
}

News results:

{
  "query": "climate change policy",
  "searchSources": ["news"],
  "searchTimeFilter": "week",
  "searchLimit": 10
}

Search + Specific URLs (Combined)

{
  "query": "best project management software",
  "urls": ["https://www.ycombinator.com/companies/asana", "https://www.ycombinator.com/companies/notion"],
  "searchLimit": 5
}

Single URL (Scrape Mode)

{
  "urls": ["https://docs.firecrawl.dev/introduction"]
}

Multiple URLs (Batch Mode)

{
  "urls": [
    "https://www.ycombinator.com/companies/airbnb",
    "https://www.ycombinator.com/companies/stripe",
    "https://www.ycombinator.com/companies/openai"
  ],
  "formats": ["markdown", "links"]
}

Crawl Entire Site (Crawl Mode)

{
  "crawlUrl": "https://docs.firecrawl.dev",
  "crawlLimit": 50,
  "crawlDepth": 2
}

PDF to Markdown (via URL)

{
  "urls": ["https://www.orimi.com/pdf-test.pdf"]
}

PDF/DOCX Upload (Direct File)

You can also upload PDF or DOCX files directly using the Upload PDF or DOCX field in the Apify Console. The file is stored in a key-value store and processed automatically β€” no hosting required.

Company Research (Lead Gen)

{
  "urls": [
    "https://www.ycombinator.com/companies/airbnb",
    "https://www.notion.so/about",
    "https://linear.app/about"
  ],
  "formats": ["markdown", "links"],
  "onlyMainContent": true
}

Input Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| query | string | No* | - | Search query. Results are scraped and returned as markdown. Can be combined with URLs. |
| searchLimit | integer | No | 5 | Maximum search results to return (1-100). |
| searchSources | array | No | ["web"] | Types of results: web, images, news. Can combine multiple. |
| searchTimeFilter | string | No | any | Filter by recency: any, hour, day, week, month, year. |
| searchLocation | string | No | - | Geographic location (e.g., San Francisco,California,United States). |
| searchCategories | array | No | [] | Filter web results: github, research, pdf. |
| urls | array | No* | - | One or more URLs to scrape. A single URL triggers scrape mode; multiple URLs trigger batch mode. |
| fileUpload | string | No* | - | Upload a PDF or DOCX file directly. The file is stored in a key-value store and processed automatically. |
| crawlUrl | string | No* | - | Base URL to start crawling. Discovers and scrapes all internal pages. |
| crawlLimit | integer | No | 500 | Maximum pages to crawl (1-10000). |
| crawlDepth | integer | No | 2 | Maximum link depth from the starting URL. 0 = starting page only. |
| includePaths | array | No | [] | Only crawl URLs matching these patterns (regex). |
| excludePaths | array | No | [] | Skip URLs matching these patterns (regex). |
| formats | array | No | ["markdown"] | Output formats to include: markdown, html, rawHtml, links, screenshot. |
| onlyMainContent | boolean | No | true | When enabled, strips headers, footers, and navigation for cleaner LLM-ready output. |

*At least one of crawlUrl, query, urls, or fileUpload is required. All can be combined in one run.
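The "at least one of" rule can be checked client-side before starting a run. A hypothetical helper — the field names come from the parameter table, but the validation logic itself is illustrative:

```python
# Build a run_input dict for the Actor and enforce the "at least one of
# crawlUrl, query, urls, or fileUpload" rule from the parameter table.
# Defaults mirror the table (formats=["markdown"], onlyMainContent=true).

def build_run_input(urls=None, query=None, crawl_url=None, file_upload=None,
                    formats=("markdown",), only_main_content=True):
    run_input = {
        "formats": list(formats),
        "onlyMainContent": only_main_content,
    }
    if urls:
        run_input["urls"] = list(urls)
    if query:
        run_input["query"] = query
    if crawl_url:
        run_input["crawlUrl"] = crawl_url
    if file_upload:
        run_input["fileUpload"] = file_upload
    if not any(k in run_input for k in ("urls", "query", "crawlUrl", "fileUpload")):
        raise ValueError("Provide at least one of urls, query, crawlUrl, fileUpload")
    return run_input

print(build_run_input(urls=["https://docs.firecrawl.dev/introduction"]))
```

The resulting dict can be passed directly as `run_input` in the API examples below.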

Output Formats Explained

| Format | Description | Best For |
| --- | --- | --- |
| markdown | Clean, structured markdown | RAG, LLMs, documentation |
| html | Cleaned HTML with structure preserved | Web apps, rendering |
| rawHtml | Original HTML, untouched | Archival, debugging |
| links | All links found on the page | Site mapping, crawling |
| screenshot | Full-page screenshot | Visual verification |

Output

Each scraped page returns:

{
  "url": "https://www.ycombinator.com/companies/airbnb",
  "success": true,
  "markdown": "# Airbnb\n\nBook accommodations around the world.\n\nY Combinator Winter 2009 | Public | San Francisco\n\n## About\n\nFounded in August of 2008 and based in San Francisco, California, Airbnb is a trusted community marketplace for people to list, discover, and book unique accommodations around the world...",
  "links": [
    "https://twitter.com/bchesky",
    "https://www.linkedin.com/in/brianchesky/",
    "https://www.linkedin.com/company/airbnb/"
  ],
  "metadata": {
    "title": "Airbnb: Book accommodations around the world. | Y Combinator",
    "description": "Book accommodations around the world. Founded in 2008.",
    "language": "en",
    "statusCode": 200
  },
  "scraped_at": "2025-01-15T10:30:00.000Z"
}
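Each dataset item can be flattened into a compact record for spreadsheets or logs. A sketch over the output schema above (the sample item is illustrative):

```python
# Reduce one scraped dataset item to a flat summary record.
# Field names follow the output schema shown above.

def summarize_item(item):
    meta = item.get("metadata", {})
    return {
        "url": item["url"],
        "ok": item.get("success", False),
        "title": meta.get("title"),
        "markdown_chars": len(item.get("markdown", "")),
        "link_count": len(item.get("links", [])),
    }

item = {"url": "https://example.com", "success": True,
        "markdown": "# Example\n\nText.", "links": ["https://example.com/a"],
        "metadata": {"title": "Example", "statusCode": 200}}
print(summarize_item(item))
```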

Search Output (Web)

When using search mode with searchSources: ["web"] (default):

{
  "url": "https://github.com/talkpython/async-techniques-python-course",
  "success": true,
  "sourceType": "web",
  "title": "GitHub - talkpython/async-techniques-python-course",
  "description": "Async Techniques and Examples in Python Course.",
  "query": "python async programming",
  "markdown": "# Async Techniques and Examples in Python Course\n\nPython's async and parallel programming support is highly underrated...",
  "metadata": {
    "title": "GitHub - talkpython/async-techniques-python-course",
    "language": "en",
    "statusCode": 200
  },
  "scraped_at": "2025-01-15T10:30:00.000Z"
}

Search Output (Images)

When using searchSources: ["images"]:

{
  "url": "https://www.akc.org/expert-advice/dog-breeds/golden-retriever-puppy-training/",
  "success": true,
  "sourceType": "image",
  "title": "How to Train a Golden Retriever Puppy",
  "description": "Golden Retrievers are known for their calm demeanor.",
  "query": "golden retriever puppy",
  "imageUrl": "https://s3.amazonaws.com/cdn-origin-etr.akc.org/wp-content/uploads/Golden-Retriever-puppy.jpg",
  "markdown": "# How to Train a Golden Retriever Puppy\n\nGolden Retriever puppies are eager to please...",
  "metadata": {
    "title": "How to Train a Golden Retriever Puppy",
    "og_image": "https://s3.amazonaws.com/cdn-origin-etr.akc.org/wp-content/uploads/Golden-Retriever-puppy.jpg",
    "statusCode": 200
  },
  "scraped_at": "2025-01-15T10:30:00.000Z"
}

Search Output (News)

When using searchSources: ["news"]:

{
  "url": "https://ec.europa.eu/eurostat/web/products-eurostat-news/w/ddn-20251211-2",
  "success": true,
  "sourceType": "news",
  "title": "20% of EU enterprises use AI technologies",
  "description": "In 2025, 20.0% of EU enterprises used AI technologies.",
  "query": "AI technology 2024",
  "publishedDate": "2025-12-11T10:00:00Z",
  "imageUrl": "https://ec.europa.eu/eurostat/documents/4187653/15566025/image.jpg",
  "markdown": "# 20% of EU enterprises use AI technologies\n\nIn 2025, 20.0% of EU enterprises with 10 or more employees used artificial intelligence...",
  "metadata": {
    "title": "20% of EU enterprises use AI technologies",
    "published_time": "2025-12-11T10:00:00Z",
    "statusCode": 200
  },
  "scraped_at": "2025-01-15T10:30:00.000Z"
}

Batch Output

When scraping multiple URLs, each page is saved as a separate item in the dataset. Access results via:

  • Apify Console β€” View and export from the Dataset tab
  • API β€” Fetch via GET /datasets/{datasetId}/items
  • Integrations β€” Connect to Google Sheets, Airtable, webhooks
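For the API route, the dataset-items URL can be assembled like this. The endpoint path and `format` parameter follow Apify's public API; the helper itself is an illustrative sketch:

```python
# Construct the Apify dataset-items URL used to fetch batch results.
# fmt can be json, csv, or xlsx, matching the export options in the console.

def dataset_items_url(dataset_id, fmt="json", token=""):
    url = "https://api.apify.com/v2/datasets/{}/items?format={}".format(dataset_id, fmt)
    if token:
        url += "&token=" + token
    return url

print(dataset_items_url("xyz789"))
```

Fetching that URL with any HTTP client returns one record per scraped page.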

Crawl Output

When crawling a website, each discovered page is saved as a separate item. The output format is identical to scrape mode:

{
  "url": "https://docs.firecrawl.dev/features/crawl",
  "success": true,
  "markdown": "# Crawl\n\nFirecrawl can crawl a URL and all accessible subpages...",
  "metadata": {
    "title": "Crawl - Firecrawl Docs",
    "description": "Learn how to crawl websites with Firecrawl",
    "language": "en",
    "statusCode": 200
  },
  "scraped_at": "2025-01-15T10:30:00.000Z"
}

A crawl with crawlLimit: 50 produces up to 50 dataset items β€” one per discovered page.


Pricing - Pay Per Event (PPE)

Transparent, predictable pricing with no monthly fees

| Event | Price | Description |
| --- | --- | --- |
| page_scraped | $0.004 | Charged per URL successfully scraped |

Cost Comparison vs Firecrawl Direct

| Pages | This Actor | Firecrawl Hobby ($16/mo) | Savings |
| --- | --- | --- | --- |
| 100 | $0.40 | $16.00 | 97% |
| 1,000 | $4.00 | $16.00 | 75% |
| 3,000 | $12.00 | $16.00 | 25% |

Pricing Examples

| Scenario | Pages | Cost |
| --- | --- | --- |
| Research 50 YC companies | 50 | $0.20 |
| Scrape competitor about pages | 100 | $0.40 |
| Build prospect database | 500 | $2.00 |
| Weekly company monitoring | 1,000 | $4.00 |
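All of the figures above reduce to a single per-page constant. A quick sketch that reproduces them (percentages are truncated, matching the comparison table):

```python
# Reproduce the pricing tables: $0.004 per scraped page versus the flat
# $16/month Firecrawl Hobby plan referenced above.
PRICE_PER_PAGE = 0.004
HOBBY_MONTHLY = 16.00

def actor_cost(pages):
    """Cost in dollars for a given page count."""
    return round(pages * PRICE_PER_PAGE, 2)

def savings_vs_hobby(pages):
    """Percent saved versus the $16/month plan, truncated like the table."""
    return int((1 - actor_cost(pages) / HOBBY_MONTHLY) * 100)

def breakeven_pages():
    """Page count at which this Actor costs the same as the Hobby plan."""
    return round(HOBBY_MONTHLY / PRICE_PER_PAGE)

print(actor_cost(1000), savings_vs_hobby(1000), breakeven_pages())
```

Below 4,000 pages per month, pay-per-page comes out cheaper than the flat subscription.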

API Integration

Python

from apify_client import ApifyClient

client = ApifyClient("your_api_token")

run = client.actor("clearpath/web-to-markdown").call(
    run_input={
        "urls": [
            "https://www.ycombinator.com/companies/stripe",
            "https://www.ycombinator.com/companies/openai",
            "https://www.ycombinator.com/companies/airbnb"
        ],
        "formats": ["markdown", "links"],
        "onlyMainContent": True
    }
)

# Fetch results - extract founder info
dataset = client.dataset(run["defaultDatasetId"])
for item in dataset.iterate_items():
    print(f"Company: {item['url'].split('/')[-1]}")
    print(f"LinkedIn links: {[l for l in item.get('links', []) if 'linkedin' in l]}")
    print("---")

JavaScript

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'your_api_token' });

const run = await client.actor('clearpath/web-to-markdown').call({
    urls: [
        'https://www.ycombinator.com/companies/notion',
        'https://www.ycombinator.com/companies/vercel'
    ],
    formats: ['markdown', 'links'],
    onlyMainContent: true
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => {
    // Failed items may not include a links array, so guard the access.
    console.log(`${item.url}: ${(item.links || []).length} links extracted`);
});

cURL

curl -X POST "https://api.apify.com/v2/acts/clearpath~web-to-markdown/runs?token=your_api_token" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://www.ycombinator.com/companies/airbnb"],
    "formats": ["markdown", "links"]
  }'

Advanced Usage

Batch Company Research

{
  "urls": [
    "https://www.ycombinator.com/companies/stripe",
    "https://www.ycombinator.com/companies/openai",
    "https://www.ycombinator.com/companies/airbnb",
    "https://www.ycombinator.com/companies/dropbox"
  ],
  "formats": ["markdown", "links"],
  "onlyMainContent": true
}

Scrape About/Team Pages

{
  "urls": [
    "https://www.notion.so/about",
    "https://linear.app/about",
    "https://vercel.com/about"
  ],
  "formats": ["markdown", "links"]
}

Extract Pricing Pages

{
  "urls": [
    "https://linear.app/pricing",
    "https://www.notion.so/pricing"
  ],
  "formats": ["markdown"],
  "onlyMainContent": true
}

Map Directory Links

{
  "urls": ["https://www.ycombinator.com/companies"],
  "formats": ["links"]
}
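The links output from a directory page like the one above can be filtered down to profile URLs in post-processing. A sketch with illustrative sample data:

```python
# Filter a "links" format result down to YC company profile URLs.
# The sample list stands in for the links array of a scraped dataset item.

def company_links(links):
    return [l for l in links if "/companies/" in l]

sample = [
    "https://www.ycombinator.com/companies/airbnb",
    "https://www.ycombinator.com/about",
    "https://www.ycombinator.com/companies/stripe",
]
print(company_links(sample))
```

The filtered URLs can then be fed back into a batch run as the `urls` input.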

Technical Requirements

| Requirement | Value |
| --- | --- |
| Memory | 256-512 MB recommended |
| Timeout | 30 seconds per page (default) |
| Proxy | Not required (handled by Firecrawl) |
| Rate limits | Managed automatically |
| Anti-bot | Automatic (stealth mode) |
| JS rendering | Automatic (smart wait) |
| Caching | 2-day default |

Data Export

Export your scraped data in multiple formats:

  • JSON β€” Structured data for programmatic access
  • CSV β€” Spreadsheet-compatible for analysis
  • Excel β€” Ready for business reporting
  • XML β€” Integration with enterprise systems

Access exports from the Apify Console Dataset tab or via API.


Automation

Scheduled Runs

Set up recurring scrapes β€” hourly, daily, or weekly β€” directly in Apify Console.

Webhooks

Receive notifications when scraping completes:

{
  "event": "ACTOR.RUN.SUCCEEDED",
  "data": {
    "actorRunId": "abc123",
    "defaultDatasetId": "xyz789"
  }
}
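A webhook consumer typically only needs the dataset ID from this payload. A minimal parsing sketch, assuming the payload shape shown above:

```python
import json

# Extract the dataset ID from an ACTOR.RUN.SUCCEEDED webhook payload,
# mirroring the example payload above.
def dataset_id_from_webhook(body):
    payload = json.loads(body)
    if payload.get("event") != "ACTOR.RUN.SUCCEEDED":
        return None  # Ignore other run states.
    return payload["data"]["defaultDatasetId"]

body = '{"event": "ACTOR.RUN.SUCCEEDED", "data": {"actorRunId": "abc123", "defaultDatasetId": "xyz789"}}'
print(dataset_id_from_webhook(body))  # xyz789
```

With the dataset ID in hand, the receiver can fetch results from the dataset-items API endpoint.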

Integrations

Connect to 100+ apps via Apify integrations:

  • Google Sheets
  • Airtable
  • Slack
  • Zapier
  • Make (Integromat)

FAQ

Q: Do I need a Firecrawl account? A: No. This Actor handles all Firecrawl authentication internally. Just run and get results.

Q: How does search mode work? A: Provide a query parameter and the Actor searches the web, then scrapes each result page. You get both search metadata (title, description) and full page content (markdown). You can combine search with specific URLs to run both in one call.

Q: What websites can I scrape? A: Most public websites work. Firecrawl covers 96% of the web, including JavaScript-heavy and protected pages.

Q: How does it handle JavaScript-heavy sites? A: Firecrawl uses smart wait technology that automatically detects when content has finished loading. Dynamic SPAs, lazy-loaded content, and client-rendered pages work without any configuration.

Q: What about sites with anti-bot protection? A: Stealth mode is enabled by default. Firecrawl automatically handles browser fingerprinting, user agent rotation, and retries with stealth proxies when basic requests fail.

Q: Is there any caching? A: Yes. Firecrawl caches recently scraped pages (default 2 days) for faster repeated requests. You're only charged once per unique scrape within the cache window.

Q: How many URLs can I scrape at once? A: No hard limit. Batch mode processes URLs in parallel for maximum efficiency. For very large jobs (10,000+ URLs), consider splitting into multiple runs.

Q: Is the data real-time? A: Mostly. Each run fetches data directly from the target websites, though pages scraped within the cache window (default 2 days) may be served from cache.

Q: What if a page fails to scrape? A: Failed pages return "success": false with error details. You're only charged for successful scrapes.

Q: Can I scrape PDFs? A: Yes. Firecrawl natively parses PDFs and converts them to markdown. Just provide the PDF URL.

Q: How does pricing compare to Firecrawl direct? A: At $0.004/page, you save 25% compared to Firecrawl's Hobby plan ($16/month for ~3,000 credits). Plus, no monthly commitment β€” pay only for what you use.

Q: Can I use my own Firecrawl API key? A: Currently, the Actor uses a managed Firecrawl account. Contact us if you need custom API key support.

Q: What's the difference between crawl and batch scrape? A: Batch scrape takes explicit URLs you provide. Crawl mode discovers pages automatically β€” you give it a starting URL and it follows internal links up to your specified depth and limit. Use crawl for "scrape this entire site" and batch for "scrape these specific pages."


Getting Started

1. Create Account

  1. Sign up for Apify (free)
  2. No credit card required for free tier
  3. $5 free platform credit included

2. Configure Input

  1. Add your target URLs
  2. Choose output formats (markdown recommended for LLMs)
  3. Enable onlyMainContent for cleaner output

3. Run Actor

  1. Click Start to begin scraping
  2. Monitor progress in real-time
  3. View results in Dataset tab

4. Export & Integrate

  1. Download as JSON, CSV, or Excel
  2. Set up scheduled runs for automation
  3. Connect webhooks for real-time notifications

Support

  • πŸ“§ Email: max@mapa.slmail.me
  • πŸ’‘ Feature requests: Email or Issues Tab
  • ⏱️ Response time: Within 24 hours

Legal & Compliance

This Actor extracts publicly available web content. Users are responsible for:

  • Complying with target website Terms of Service
  • Respecting robots.txt directives
  • Following data protection regulations (GDPR, CCPA)
  • Using extracted data ethically and legally

Content Ownership: Only scrape content you have rights to use.


πŸš€ Start Scraping Websites to Markdown Now


Convert any website to LLM-ready markdown in seconds. No setup, no monthly fees, no hassle.