Clean Web Scraper - Markdown for AI πŸ”₯ Firecrawl API

Convert any website to clean, LLM-optimized markdown using Firecrawl. Perfect for RAG pipelines, AI training data, and knowledge bases. No login required, 25% cheaper than Firecrawl direct. Batch process hundreds of URLs. Supports PDF/DOCX. Pay only $0.004 per page - no monthly fees.

Pricing: Pay per event
Developer: ClearPath (Maintained by Community)

Clean Web Scraper - Markdown for AI | Firecrawl Powered

The easiest way to convert any website to clean, LLM-optimized markdown β€” no login required, no cookies needed, just paste URLs and get structured content ready for RAG pipelines, fine-tuning, and knowledge bases.

Built on Firecrawl's production-grade infrastructure, this Actor delivers 25% cheaper pricing than subscribing to Firecrawl directly. Pay only for what you scrape β€” no monthly commitments.

  • βœ… No authentication required β€” works without browser sessions or cookies
  • βœ… Production-grade reliability β€” powered by Firecrawl's battle-tested infrastructure
  • βœ… LLM-optimized output β€” clean markdown stripped of navigation, ads, and clutter
  • βœ… Web search β€” search the web and scrape results in one step
  • βœ… PDF support β€” pass any PDF URL and get clean markdown automatically
  • βœ… Batch processing β€” scrape URLs in parallel
  • βœ… Website crawling β€” discover and scrape all pages from a site automatically
  • βœ… Multiple formats β€” markdown, HTML, raw HTML, links, or screenshots

Why Firecrawl?

Built on Firecrawl's enterprise infrastructure, you get these capabilities automatically β€” no configuration required:

| Feature | What It Does |
| --- | --- |
| Smart Wait | Intelligently waits for JavaScript content to load. Dynamic SPAs, lazy-loaded content, and client-rendered pages just work. |
| Stealth Mode | Handles anti-bot protection automatically. Rotates user agents, manages browser fingerprinting, retries with stealth proxies when needed. |
| Intelligent Caching | Recently scraped pages are cached for up to 500% faster repeated requests. |
| Media Parsing | Native PDF and DOCX parsing. Pass any document URL and get clean markdown. |
| Ad Blocking | Ads, cookie banners, and popups automatically blocked for cleaner output. |

⚑ Key Features

πŸ“ LLM-Optimized Content Extraction

  • Clean markdown output β€” Headers, footers, navigation, ads automatically removed
  • Smart content detection β€” Firecrawl identifies and extracts the main article/content
  • Preserves semantic structure β€” Headings, lists, tables, code blocks intact
  • Native document parsing β€” PDFs and DOCX files converted to markdown automatically
  • Multiple formats β€” Get markdown, HTML, raw HTML, links, or screenshots

πŸ” Web Search + Scrape

  • Search mode β€” Search the web and scrape results in one API call
  • Combine with URLs β€” Run search AND scrape specific URLs together
  • Configurable limit β€” Return 1-100 search results

πŸš€ High-Performance Batch Processing

  • Single URL mode β€” Quick scrape for one page
  • Batch mode β€” Process hundreds of URLs in parallel
  • Auto-detection β€” Automatically chooses optimal mode based on input
  • Progress tracking β€” Real-time status updates during batch jobs

πŸ’° Pay-Per-Use Pricing

  • No monthly fees β€” Pay only for pages you scrape
  • 25% cheaper β€” Lower cost than Firecrawl Hobby plan
  • Predictable costs β€” $0.004 per page, no hidden fees
  • No commitment β€” Scale up or down instantly

Use Cases

For Lead Generation & Sales

  • Company research β€” Extract company profiles, team info, and contact details from YC, Crunchbase, LinkedIn
  • Prospect enrichment β€” Scrape about pages, team bios, and social links at scale
  • Competitive intelligence β€” Monitor competitor websites, pricing pages, and product updates
  • Investment research β€” Gather startup data, funding info, and founder backgrounds

For AI/ML Engineers

  • Build RAG pipelines β€” Convert documentation sites to vector embeddings
  • Create training datasets β€” Scrape clean text for LLM fine-tuning
  • Feed knowledge bases β€” Extract content for AI assistants
  • Process research papers β€” Convert PDFs to structured markdown
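For RAG use, the Actor's markdown output still needs to be chunked before embedding. A minimal sketch under my own assumptions (heading-based splitting and a 1,000-character cap are illustrative choices, not part of the Actor):

```python
# Split Actor markdown output into heading-delimited chunks for embedding.
# The splitting strategy and chunk size here are assumptions, not Actor behavior.
import re

def chunk_markdown(markdown, max_chars=1000):
    """Split markdown at each heading, then cap every chunk at max_chars."""
    # Zero-width split before every markdown heading line (#, ##, ...).
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Hard-wrap oversized sections so each chunk fits an embedding window.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks

doc = "# Airbnb\n\nIntro text.\n\n## About\n\nFounded in 2008..."
print(chunk_markdown(doc))
```

Each chunk can then be passed to whatever embedding model backs your vector store.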

For Developers

  • Documentation scraping β€” Mirror docs for offline access
  • Content migration β€” Move websites between platforms
  • Data extraction β€” Pull structured content from any page
  • API integration β€” Automate content pipelines

For Content Teams

  • Competitive analysis β€” Extract competitor content for review
  • Content auditing β€” Bulk export site content
  • Archive creation β€” Preserve web content in clean format
  • Research compilation β€” Gather sources into structured documents

Quick Start

Web Search (Search Mode)

Web results:

{
  "query": "AI startups funding 2025",
  "searchLimit": 10
}

Image results:

{
  "query": "golden retriever puppy",
  "searchSources": ["images"],
  "searchLimit": 10
}

News results:

{
  "query": "climate change policy",
  "searchSources": ["news"],
  "searchTimeFilter": "week",
  "searchLimit": 10
}

Search + Specific URLs (Combined)

{
  "query": "best project management software",
  "urls": ["https://www.ycombinator.com/companies/asana", "https://www.ycombinator.com/companies/notion"],
  "searchLimit": 5
}

Single URL (Scrape Mode)

{
  "urls": ["https://docs.firecrawl.dev/introduction"]
}

Multiple URLs (Batch Mode)

{
  "urls": [
    "https://www.ycombinator.com/companies/airbnb",
    "https://www.ycombinator.com/companies/stripe",
    "https://www.ycombinator.com/companies/openai"
  ],
  "formats": ["markdown", "links"]
}

Crawl Entire Site (Crawl Mode)

{
  "crawlUrl": "https://docs.firecrawl.dev",
  "crawlLimit": 50,
  "crawlDepth": 2
}

PDF to Markdown (via URL)

{
  "urls": ["https://www.orimi.com/pdf-test.pdf"]
}

PDF/DOCX Upload (Direct File)

You can also upload PDF or DOCX files directly using the Upload PDF or DOCX field in the Apify Console. The file is stored in a key-value store and processed automatically β€” no hosting required.

Company Research (Lead Gen)

{
  "urls": [
    "https://www.ycombinator.com/companies/airbnb",
    "https://www.notion.so/about",
    "https://linear.app/about"
  ],
  "formats": ["markdown", "links"],
  "onlyMainContent": true
}

Input Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| query | string | No* | - | Search query. Results are scraped and returned as markdown. Can be combined with URLs. |
| searchLimit | integer | No | 5 | Maximum search results to return (1-100). |
| searchSources | array | No | ["web"] | Types of results: web, images, news. Can combine multiple. |
| searchTimeFilter | string | No | any | Filter by recency: any, hour, day, week, month, year. |
| searchLocation | string | No | - | Geographic location (e.g., San Francisco,California,United States). |
| searchCategories | array | No | [] | Filter web results: github, research, pdf. |
| urls | array | No* | - | One or more URLs to scrape. A single URL triggers scrape mode; multiple URLs trigger batch mode. |
| fileUpload | string | No* | - | Upload a PDF or DOCX file directly. The file is stored in a key-value store and processed automatically. |
| crawlUrl | string | No* | - | Base URL to start crawling. Discovers and scrapes all internal pages. |
| crawlLimit | integer | No | 500 | Maximum pages to crawl (1-10000). |
| crawlDepth | integer | No | 2 | Maximum link depth from the starting URL. 0 = starting page only. |
| includePaths | array | No | [] | Only crawl URLs matching these patterns (regex). |
| excludePaths | array | No | [] | Skip URLs matching these patterns (regex). |
| formats | array | No | ["markdown"] | Output formats to include: markdown, html, rawHtml, links, screenshot. |
| onlyMainContent | boolean | No | true | When enabled, strips headers, footers, and navigation for cleaner LLM-ready output. |

*At least one of crawlUrl, query, urls, or fileUpload is required. All can be combined in one run.
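The "at least one of" rule can be checked client-side before starting a run. A hypothetical helper — the field names come from the parameter table, but the validation logic itself is illustrative:

```python
# Build a run_input dict for the Actor and enforce the "at least one of
# crawlUrl, query, urls, or fileUpload" rule from the parameter table.
# Defaults mirror the table (formats=["markdown"], onlyMainContent=true).

def build_run_input(urls=None, query=None, crawl_url=None, file_upload=None,
                    formats=("markdown",), only_main_content=True):
    run_input = {
        "formats": list(formats),
        "onlyMainContent": only_main_content,
    }
    if urls:
        run_input["urls"] = list(urls)
    if query:
        run_input["query"] = query
    if crawl_url:
        run_input["crawlUrl"] = crawl_url
    if file_upload:
        run_input["fileUpload"] = file_upload
    if not any(k in run_input for k in ("urls", "query", "crawlUrl", "fileUpload")):
        raise ValueError("Provide at least one of urls, query, crawlUrl, fileUpload")
    return run_input

print(build_run_input(urls=["https://docs.firecrawl.dev/introduction"]))
```

The resulting dict can be passed directly as `run_input` in the API examples below.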

Output Formats Explained

| Format | Description | Best For |
| --- | --- | --- |
| markdown | Clean, structured markdown | RAG, LLMs, documentation |
| html | Cleaned HTML with structure preserved | Web apps, rendering |
| rawHtml | Original HTML, untouched | Archival, debugging |
| links | All links found on the page | Site mapping, crawling |
| screenshot | Full-page screenshot | Visual verification |

Output

Each scraped page returns:

{
  "url": "https://www.ycombinator.com/companies/airbnb",
  "success": true,
  "markdown": "# Airbnb\n\nBook accommodations around the world.\n\nY Combinator Winter 2009 | Public | San Francisco\n\n## About\n\nFounded in August of 2008 and based in San Francisco, California, Airbnb is a trusted community marketplace for people to list, discover, and book unique accommodations around the world...",
  "links": [
    "https://twitter.com/bchesky",
    "https://www.linkedin.com/in/brianchesky/",
    "https://www.linkedin.com/company/airbnb/"
  ],
  "metadata": {
    "title": "Airbnb: Book accommodations around the world. | Y Combinator",
    "description": "Book accommodations around the world. Founded in 2008.",
    "language": "en",
    "statusCode": 200
  },
  "scraped_at": "2025-01-15T10:30:00.000Z"
}
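Each dataset item can be flattened into a compact record for spreadsheets or logs. A sketch over the output schema above (the sample item is illustrative):

```python
# Reduce one scraped dataset item to a flat summary record.
# Field names follow the output schema shown above.

def summarize_item(item):
    meta = item.get("metadata", {})
    return {
        "url": item["url"],
        "ok": item.get("success", False),
        "title": meta.get("title"),
        "markdown_chars": len(item.get("markdown", "")),
        "link_count": len(item.get("links", [])),
    }

item = {"url": "https://example.com", "success": True,
        "markdown": "# Example\n\nText.", "links": ["https://example.com/a"],
        "metadata": {"title": "Example", "statusCode": 200}}
print(summarize_item(item))
```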

Search Output (Web)

When using search mode with searchSources: ["web"] (default):

{
  "url": "https://github.com/talkpython/async-techniques-python-course",
  "success": true,
  "sourceType": "web",
  "title": "GitHub - talkpython/async-techniques-python-course",
  "description": "Async Techniques and Examples in Python Course.",
  "query": "python async programming",
  "markdown": "# Async Techniques and Examples in Python Course\n\nPython's async and parallel programming support is highly underrated...",
  "metadata": {
    "title": "GitHub - talkpython/async-techniques-python-course",
    "language": "en",
    "statusCode": 200
  },
  "scraped_at": "2025-01-15T10:30:00.000Z"
}

Search Output (Images)

When using searchSources: ["images"]:

{
  "url": "https://www.akc.org/expert-advice/dog-breeds/golden-retriever-puppy-training/",
  "success": true,
  "sourceType": "image",
  "title": "How to Train a Golden Retriever Puppy",
  "description": "Golden Retrievers are known for their calm demeanor.",
  "query": "golden retriever puppy",
  "imageUrl": "https://s3.amazonaws.com/cdn-origin-etr.akc.org/wp-content/uploads/Golden-Retriever-puppy.jpg",
  "markdown": "# How to Train a Golden Retriever Puppy\n\nGolden Retriever puppies are eager to please...",
  "metadata": {
    "title": "How to Train a Golden Retriever Puppy",
    "og_image": "https://s3.amazonaws.com/cdn-origin-etr.akc.org/wp-content/uploads/Golden-Retriever-puppy.jpg",
    "statusCode": 200
  },
  "scraped_at": "2025-01-15T10:30:00.000Z"
}

Search Output (News)

When using searchSources: ["news"]:

{
  "url": "https://ec.europa.eu/eurostat/web/products-eurostat-news/w/ddn-20251211-2",
  "success": true,
  "sourceType": "news",
  "title": "20% of EU enterprises use AI technologies",
  "description": "In 2025, 20.0% of EU enterprises used AI technologies.",
  "query": "AI technology 2024",
  "publishedDate": "2025-12-11T10:00:00Z",
  "imageUrl": "https://ec.europa.eu/eurostat/documents/4187653/15566025/image.jpg",
  "markdown": "# 20% of EU enterprises use AI technologies\n\nIn 2025, 20.0% of EU enterprises with 10 or more employees used artificial intelligence...",
  "metadata": {
    "title": "20% of EU enterprises use AI technologies",
    "published_time": "2025-12-11T10:00:00Z",
    "statusCode": 200
  },
  "scraped_at": "2025-01-15T10:30:00.000Z"
}

Batch Output

When scraping multiple URLs, each page is saved as a separate item in the dataset. Access results via:

  • Apify Console β€” View and export from the Dataset tab
  • API β€” Fetch via GET /datasets/{datasetId}/items
  • Integrations β€” Connect to Google Sheets, Airtable, webhooks
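For the API route, the dataset-items URL can be assembled like this. The endpoint path and `format` parameter follow Apify's public API; the helper itself is an illustrative sketch:

```python
# Construct the Apify dataset-items URL used to fetch batch results.
# fmt can be json, csv, or xlsx, matching the export options in the console.

def dataset_items_url(dataset_id, fmt="json", token=""):
    url = "https://api.apify.com/v2/datasets/{}/items?format={}".format(dataset_id, fmt)
    if token:
        url += "&token=" + token
    return url

print(dataset_items_url("xyz789"))
```

Fetching that URL with any HTTP client returns one record per scraped page.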

Crawl Output

When crawling a website, each discovered page is saved as a separate item. The output format is identical to scrape mode:

{
  "url": "https://docs.firecrawl.dev/features/crawl",
  "success": true,
  "markdown": "# Crawl\n\nFirecrawl can crawl a URL and all accessible subpages...",
  "metadata": {
    "title": "Crawl - Firecrawl Docs",
    "description": "Learn how to crawl websites with Firecrawl",
    "language": "en",
    "statusCode": 200
  },
  "scraped_at": "2025-01-15T10:30:00.000Z"
}

A crawl with crawlLimit: 50 produces up to 50 dataset items β€” one per discovered page.


Pricing - Pay Per Event (PPE)

Transparent, predictable pricing with no monthly fees

| Event | Price | Description |
| --- | --- | --- |
| page_scraped | $0.004 | Charged per URL successfully scraped |

Cost Comparison vs Firecrawl Direct

| Pages | This Actor | Firecrawl Hobby ($16/mo) | Savings |
| --- | --- | --- | --- |
| 100 | $0.40 | $16.00 | 97% |
| 1,000 | $4.00 | $16.00 | 75% |
| 3,000 | $12.00 | $16.00 | 25% |

Pricing Examples

| Scenario | Pages | Cost |
| --- | --- | --- |
| Research 50 YC companies | 50 | $0.20 |
| Scrape competitor about pages | 100 | $0.40 |
| Build prospect database | 500 | $2.00 |
| Weekly company monitoring | 1,000 | $4.00 |
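All of the figures above reduce to a single per-page constant. A quick sketch that reproduces them (percentages are truncated, matching the comparison table):

```python
# Reproduce the pricing tables: $0.004 per scraped page versus the flat
# $16/month Firecrawl Hobby plan referenced above.
PRICE_PER_PAGE = 0.004
HOBBY_MONTHLY = 16.00

def actor_cost(pages):
    """Cost in dollars for a given page count."""
    return round(pages * PRICE_PER_PAGE, 2)

def savings_vs_hobby(pages):
    """Percent saved versus the $16/month plan, truncated like the table."""
    return int((1 - actor_cost(pages) / HOBBY_MONTHLY) * 100)

def breakeven_pages():
    """Page count at which this Actor costs the same as the Hobby plan."""
    return round(HOBBY_MONTHLY / PRICE_PER_PAGE)

print(actor_cost(1000), savings_vs_hobby(1000), breakeven_pages())
```

Below 4,000 pages per month, pay-per-page comes out cheaper than the flat subscription.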

API Integration

Python

from apify_client import ApifyClient

client = ApifyClient("your_api_token")

run = client.actor("clearpath/web-to-markdown").call(
    run_input={
        "urls": [
            "https://www.ycombinator.com/companies/stripe",
            "https://www.ycombinator.com/companies/openai",
            "https://www.ycombinator.com/companies/airbnb"
        ],
        "formats": ["markdown", "links"],
        "onlyMainContent": True
    }
)

# Fetch results - extract founder info
dataset = client.dataset(run["defaultDatasetId"])
for item in dataset.iterate_items():
    print(f"Company: {item['url'].split('/')[-1]}")
    print(f"LinkedIn links: {[l for l in item.get('links', []) if 'linkedin' in l]}")
    print("---")

JavaScript

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'your_api_token' });

const run = await client.actor('clearpath/web-to-markdown').call({
    urls: [
        'https://www.ycombinator.com/companies/notion',
        'https://www.ycombinator.com/companies/vercel'
    ],
    formats: ['markdown', 'links'],
    onlyMainContent: true
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => {
    // Failed items may not include a links array, so guard the access.
    console.log(`${item.url}: ${(item.links || []).length} links extracted`);
});

cURL

curl -X POST "https://api.apify.com/v2/acts/clearpath~web-to-markdown/runs?token=your_api_token" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://www.ycombinator.com/companies/airbnb"],
    "formats": ["markdown", "links"]
  }'

Advanced Usage

Batch Company Research

{
  "urls": [
    "https://www.ycombinator.com/companies/stripe",
    "https://www.ycombinator.com/companies/openai",
    "https://www.ycombinator.com/companies/airbnb",
    "https://www.ycombinator.com/companies/dropbox"
  ],
  "formats": ["markdown", "links"],
  "onlyMainContent": true
}

Scrape About/Team Pages

{
  "urls": [
    "https://www.notion.so/about",
    "https://linear.app/about",
    "https://vercel.com/about"
  ],
  "formats": ["markdown", "links"]
}

Extract Pricing Pages

{
  "urls": [
    "https://linear.app/pricing",
    "https://www.notion.so/pricing"
  ],
  "formats": ["markdown"],
  "onlyMainContent": true
}

Map Directory Links

{
  "urls": ["https://www.ycombinator.com/companies"],
  "formats": ["links"]
}
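The links output from a directory page like the one above can be filtered down to profile URLs in post-processing. A sketch with illustrative sample data:

```python
# Filter a "links" format result down to YC company profile URLs.
# The sample list stands in for the links array of a scraped dataset item.

def company_links(links):
    return [l for l in links if "/companies/" in l]

sample = [
    "https://www.ycombinator.com/companies/airbnb",
    "https://www.ycombinator.com/about",
    "https://www.ycombinator.com/companies/stripe",
]
print(company_links(sample))
```

The filtered URLs can then be fed back into a batch run as the `urls` input.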

Technical Requirements

| Requirement | Value |
| --- | --- |
| Memory | 256-512 MB recommended |
| Timeout | 30 seconds per page (default) |
| Proxy | Not required (handled by Firecrawl) |
| Rate limits | Managed automatically |
| Anti-bot | Automatic (stealth mode) |
| JS rendering | Automatic (smart wait) |
| Caching | 2-day default |

Data Export

Export your scraped data in multiple formats:

  • JSON β€” Structured data for programmatic access
  • CSV β€” Spreadsheet-compatible for analysis
  • Excel β€” Ready for business reporting
  • XML β€” Integration with enterprise systems

Access exports from the Apify Console Dataset tab or via API.


Automation

Scheduled Runs

Set up recurring scrapes β€” hourly, daily, or weekly β€” directly in Apify Console.

Webhooks

Receive notifications when scraping completes:

{
  "event": "ACTOR.RUN.SUCCEEDED",
  "data": {
    "actorRunId": "abc123",
    "defaultDatasetId": "xyz789"
  }
}
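A webhook consumer typically only needs the dataset ID from this payload. A minimal parsing sketch, assuming the payload shape shown above:

```python
import json

# Extract the dataset ID from an ACTOR.RUN.SUCCEEDED webhook payload,
# mirroring the example payload above.
def dataset_id_from_webhook(body):
    payload = json.loads(body)
    if payload.get("event") != "ACTOR.RUN.SUCCEEDED":
        return None  # Ignore other run states.
    return payload["data"]["defaultDatasetId"]

body = '{"event": "ACTOR.RUN.SUCCEEDED", "data": {"actorRunId": "abc123", "defaultDatasetId": "xyz789"}}'
print(dataset_id_from_webhook(body))  # xyz789
```

With the dataset ID in hand, the receiver can fetch results from the dataset-items API endpoint.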

Integrations

Connect to 100+ apps via Apify integrations:

  • Google Sheets
  • Airtable
  • Slack
  • Zapier
  • Make (Integromat)

FAQ

Q: Do I need a Firecrawl account? A: No. This Actor handles all Firecrawl authentication internally. Just run and get results.

Q: How does search mode work? A: Provide a query parameter and the Actor searches the web, then scrapes each result page. You get both search metadata (title, description) and full page content (markdown). You can combine search with specific URLs to run both in one call.

Q: What websites can I scrape? A: Most public websites work. Firecrawl covers 96% of the web, including JavaScript-heavy and protected pages.

Q: How does it handle JavaScript-heavy sites? A: Firecrawl uses smart wait technology that automatically detects when content has finished loading. Dynamic SPAs, lazy-loaded content, and client-rendered pages work without any configuration.

Q: What about sites with anti-bot protection? A: Stealth mode is enabled by default. Firecrawl automatically handles browser fingerprinting, user agent rotation, and retries with stealth proxies when basic requests fail.

Q: Is there any caching? A: Yes. Firecrawl caches recently scraped pages (default 2 days) for faster repeated requests. You're only charged once per unique scrape within the cache window.

Q: How many URLs can I scrape at once? A: No hard limit. Batch mode processes URLs in parallel for maximum efficiency. For very large jobs (10,000+ URLs), consider splitting into multiple runs.

Q: Is the data real-time? A: Mostly. Each run fetches data directly from the target websites, though pages scraped within the cache window (default 2 days) may be served from cache.

Q: What if a page fails to scrape? A: Failed pages return "success": false with error details. You're only charged for successful scrapes.

Q: Can I scrape PDFs? A: Yes. Firecrawl natively parses PDFs and converts them to markdown. Just provide the PDF URL.

Q: How does pricing compare to Firecrawl direct? A: At $0.004/page, you save 25% compared to Firecrawl's Hobby plan ($16/month for ~3,000 credits). Plus, no monthly commitment β€” pay only for what you use.

Q: Can I use my own Firecrawl API key? A: Currently, the Actor uses a managed Firecrawl account. Contact us if you need custom API key support.

Q: What's the difference between crawl and batch scrape? A: Batch scrape takes explicit URLs you provide. Crawl mode discovers pages automatically β€” you give it a starting URL and it follows internal links up to your specified depth and limit. Use crawl for "scrape this entire site" and batch for "scrape these specific pages."


Getting Started

1. Create Account

  1. Sign up for Apify (free)
  2. No credit card required for free tier
  3. $5 free platform credit included

2. Configure Input

  1. Add your target URLs
  2. Choose output formats (markdown recommended for LLMs)
  3. Enable onlyMainContent for cleaner output

3. Run Actor

  1. Click Start to begin scraping
  2. Monitor progress in real-time
  3. View results in Dataset tab

4. Export & Integrate

  1. Download as JSON, CSV, or Excel
  2. Set up scheduled runs for automation
  3. Connect webhooks for real-time notifications

Support

  • πŸ“§ Email: max@mapa.slmail.me
  • πŸ’‘ Feature requests: Email or Issues Tab
  • ⏱️ Response time: Within 24 hours

Legal & Compliance

This Actor extracts publicly available web content. Users are responsible for:

  • Complying with target website Terms of Service
  • Respecting robots.txt directives
  • Following data protection regulations (GDPR, CCPA)
  • Using extracted data ethically and legally

Content Ownership: Only scrape content you have rights to use.


πŸš€ Start Scraping Websites to Markdown Now


Convert any website to LLM-ready markdown in seconds. No setup, no monthly fees, no hassle.