Pricing

from $1.00 / 1,000 results

Web Scraper for LLMs

Stealth web scraping engine built for LLMs. Converts any web page to clean markdown or HTML.

Developer

Abot API

Last modified

6 days ago

Quick Start

Scrape a list of URLs:

{
  "urls": ["https://example.com", "https://medium.com/"]
}
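
The input above can be submitted over HTTP. The following is a minimal sketch, assuming the Actor is run through Apify's general `run-sync-get-dataset-items` API endpoint. The Actor ID shown is a hypothetical placeholder (this page does not state the real one), and `build_run_request` and `scrape` are illustrative helper names, not part of the Actor.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder: replace with the Actor's real "username~actor-name" ID.
ACTOR_ID = "your-username~your-actor-name"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that runs an Actor and asks for its dataset items."""
    query = urllib.parse.urlencode({"token": token})
    url = f"https://api.apify.com/v2/acts/{actor_id}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(actor_id: str, token: str, urls: list[str]) -> list[dict]:
    """Send the request and decode the JSON array of result items."""
    request = build_run_request(actor_id, token, {"urls": urls})
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `scrape(ACTOR_ID, token, ["https://example.com", "https://medium.com/"])` with a valid API token should return one result item per URL. The synchronous endpoint suits short runs; longer crawls are probably better started asynchronously and polled.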

Crawl a website and scrape all discovered pages:

{
  "urls": ["https://docs.example.com"],
  "crawl": true,
  "crawlDepth": 2,
  "crawlMaxPages": 50
}
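
To illustrate how `crawlDepth` and `crawlMaxPages` might interact, here is a breadth-first sketch over a hypothetical in-memory link graph. The Actor's actual crawl strategy is not documented on this page, so the traversal order, and the choice to count the seed toward the page cap, are assumptions.

```python
from collections import deque


def crawl_order(
    seed: str,
    links: dict[str, list[str]],
    crawl_depth: int,
    crawl_max_pages: int,
) -> list[str]:
    """Return pages in breadth-first order, limited by link depth and page count.

    `links` is a hypothetical stand-in for the web: page URL -> outgoing links.
    Assumption: the seed itself counts toward `crawl_max_pages`.
    """
    visited = [seed]
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue and len(visited) < crawl_max_pages:
        url, depth = queue.popleft()
        if depth >= crawl_depth:
            continue  # pages at the depth limit are kept, but their links are not followed
        for target in links.get(url, []):
            if target in seen:
                continue
            if len(visited) >= crawl_max_pages:
                break
            seen.add(target)
            visited.append(target)
            queue.append((target, depth + 1))
    return visited
```

Under these assumptions, a depth of 2 reaches pages linked from the seed and pages linked from those, and the page cap stops discovery early on large sites.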

Input Parameters

Parameter	Type	Default	Description
`urls`	Array	required	URLs to scrape, or seed URLs to crawl from
`crawl`	Boolean	`false`	Follow links to discover additional pages
`crawlDepth`	Integer	`1`	Maximum link hops from the seed URL (crawl only)
`crawlMaxPages`	Integer	`20`	Max pages to discover per seed (crawl only)
`formats`	Array	`["markdown"]`	Output formats: `markdown`, `html`, or both
`concurrency`	Integer	`3`	Number of URLs processed in parallel
`maxRetries`	Integer	`2`	Retry attempts for failed URLs (scrape only)
`timeoutMs`	Integer	`30000`	Timeout per URL in milliseconds
`onlyMainContent`	Boolean	`true`	Strip nav/header/footer/sidebar (scrape only)
`removeAds`	Boolean	`true`	Remove ads and tracking elements
`removeBase64Images`	Boolean	`true`	Remove inline base64 images
`includeTags`	Array	-	CSS selectors to keep (scrape only)
`excludeTags`	Array	-	CSS selectors to remove (scrape only)
`includePatterns`	Array	-	Regex URL filters: only matching URLs are included
`excludePatterns`	Array	-	Regex URL filters: matching URLs are skipped
`waitForSelector`	String	-	Wait for CSS selector before extraction (scrape only)
`proxyConfiguration`	Object	-	Apify proxy settings
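
The defaults in the table can be mirrored on the client side to check an input before a run is started. This is a sketch only: `resolve_input` is a hypothetical helper, and the Actor may validate its input differently (for example, how it treats an empty `urls` array is an assumption here).

```python
# Defaults copied from the parameter table above.
DEFAULTS = {
    "crawl": False,
    "crawlDepth": 1,
    "crawlMaxPages": 20,
    "formats": ["markdown"],
    "concurrency": 3,
    "maxRetries": 2,
    "timeoutMs": 30000,
    "onlyMainContent": True,
    "removeAds": True,
    "removeBase64Images": True,
}
ALLOWED_FORMATS = {"markdown", "html"}


def resolve_input(user_input: dict) -> dict:
    """Merge user input over the documented defaults and run basic checks."""
    urls = user_input.get("urls")
    if not isinstance(urls, list) or not urls:
        # Assumption: an empty array is treated the same as a missing one.
        raise ValueError("'urls' is required and must be a non-empty array")
    resolved = {**DEFAULTS, **user_input}
    # Copy the list so callers cannot mutate the shared default.
    resolved["formats"] = list(resolved["formats"])
    unknown = set(resolved["formats"]) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported formats: {sorted(unknown)}")
    return resolved
```

For example, `resolve_input({"urls": ["https://example.com"], "formats": ["markdown", "html"]})` fills in every other documented default, such as `timeoutMs` of `30000`.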

Output

{
  "url": "https://medium.com/",
  "title": "Medium: Read and write stories.",
  "description": null,
  "markdown": "## Human stories & ideas\n\nA place to read, write, and deepen your understanding...",
  "html": null,
  "metadata": {
    "title": "Medium: Read and write stories.",
    "language": "en",
    "favicon": "https://miro.medium.com/...",
    "canonical": "https://medium.com/",
    "openGraph": null,
    "twitter": null
  },
  "duration": 5725,
  "scrapedAt": "2026-02-24T03:36:28.990Z",
  "success": true,
  "error": null
}
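
Downstream code will usually want to separate successful items from failed ones before using the markdown. The sketch below relies only on the `success`, `error`, `url`, `title`, and `markdown` fields shown above. The shape of a failed item (`success` false with a message in `error`) is an assumption, since only a successful item is shown.

```python
def split_results(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split result items into (successful, failed) based on the `success` flag."""
    ok = [item for item in items if item.get("success")]
    failed = [item for item in items if not item.get("success")]
    return ok, failed


def to_documents(items: list[dict]) -> list[dict]:
    """Keep successful items that carry markdown, reduced to the fields an LLM pipeline needs."""
    ok, _ = split_results(items)
    return [
        {"url": item["url"], "title": item.get("title"), "markdown": item["markdown"]}
        for item in ok
        if item.get("markdown")
    ]


def failure_report(items: list[dict]) -> dict[str, str]:
    """Map each failed URL to its error message (assumed to be in `error`)."""
    _, failed = split_results(items)
    return {item["url"]: str(item.get("error")) for item in failed}
```

Note that `html` is `null` in the sample because only the `markdown` format was requested, so code that reads `html` should handle a missing value.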

Use Cases

RAG pipelines - Feed clean markdown into LLM knowledge bases
Content monitoring - Track changes across a set of pages
Research - Bulk extract articles, documentation, or product pages
Site migration - Crawl and export an entire site as markdown
Data extraction - Scrape structured content from specific CSS selectors
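
For the RAG use case, the returned markdown is typically split into chunks before embedding. The following is a minimal sketch of heading-based chunking. It is not part of the Actor; `chunk_markdown` and its `max_chars` parameter are hypothetical, and a production pipeline would likely count tokens rather than characters.

```python
def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split markdown at heading lines, then pack sections into chunks.

    Chunks aim to stay within `max_chars`; a single section longer than
    that is kept whole rather than cut mid-section.
    Limitation: a line starting with '#' inside a code fence is treated as a heading.
    """
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks: list[str] = []
    buffer = ""
    for section in sections:
        if not section:
            continue
        candidate = f"{buffer}\n\n{section}" if buffer else section
        if len(candidate) <= max_chars:
            buffer = candidate
        else:
            if buffer:
                chunks.append(buffer)
            buffer = section
    if buffer:
        chunks.append(buffer)
    return chunks
```

Splitting at headings keeps each chunk topically coherent, which tends to help retrieval more than fixed-width slicing.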
