Deprecated

Pricing

Pay per usage

Contextractor — clean web content extraction for LLMs

Crawl any website and extract clean main-content text as Markdown, plain text, JSON, or HTML — ready for LLMs, RAG pipelines, and vector databases. Built on the rs-trafilatura engine and an adaptive Crawlee + Playwright crawler.

Developer

Glueo

Last modified: 5 days ago

Contextractor

Crawl any website and extract clean, boilerplate-free main content as Markdown, plain text, JSON, HTML, or raw original HTML — ready to feed LLMs, RAG pipelines, and vector databases. Contextractor uses the rs-trafilatura extraction engine to strip away navigation, ads, and cookie banners, and an adaptive Crawlee + Playwright crawler that automatically switches between a real browser and fast HTTP — with proxy rotation and anti-blocking handled for you.

Point it at a single page or crawl an entire site: Contextractor returns only the content that matters, in the exact format your AI workflow needs.

Homepage & docs: www.contextractor.com · Apify guide

What can Contextractor do?

Extract clean main content — the rs-trafilatura engine isolates the article body and removes navigation, headers, footers, ads, and cookie banners.
Five output formats — Markdown, plain text (txt), JSON, cleaned HTML, and the original raw HTML, saved individually or together.
Adaptive crawling — automatically switches between a headless browser (for JavaScript-heavy pages) and fast raw HTTP per page; or force Chromium, Firefox, or HTTP-only.
Whole-site crawling — follow links with a CSS selector and scope the crawl with include/exclude URL globs, sitemaps, and depth and page limits.
Tunable extraction — choose precision, balanced, or recall, and toggle tables, links, images (alt text), and comments.
Built-in anti-blocking — proxy rotation, persistent session pools, and automatic IP/fingerprint rotation when a block is detected.
Page metadata — captures title, author, publication date, description, site name, and detected language.
Handles modern pages — dismisses cookie modals, waits for selectors or network idle, scrolls lazy-loaded content, and accepts custom cookies and HTTP headers for logged-in or gated pages.
Deduplication — skip already-seen pages by canonical URL or by extracted-content hash.

Designed for LLMs, RAG, and AI pipelines

Contextractor turns messy web pages into clean, structured text that's ready for AI:

Build RAG knowledge bases — crawl docs, blogs, or help centers and ingest clean Markdown into a vector database.
Feed and contextualize LLMs — supply boilerplate-free content as context for ChatGPT, Claude, or your custom GPTs.
Create training and fine-tuning datasets — gather large volumes of clean article text.
Bulk content processing — summarize, translate, classify, or proofread pages at scale.
Content and SEO research — archive competitor or reference content as plain text or JSON.

Each output format is suited to a different job:

Format	Best for
`markdown`	Chunking and embeddings, chat context, notebooks — the default for RAG.
`txt`	Lightweight NLP, keyword stats, and simple text pipelines.
`json`	Structured, programmatic downstream processing.
`html`	Layout-aware processing or feeding other HTML tools.
`original`	The full, unmodified page for re-processing, archival, or auditing.
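
As a rough illustration of the `markdown` row, the sketch below splits extracted Markdown into heading-delimited chunks before embedding. The function name `chunk_markdown` and the `max_chars` limit are hypothetical and not part of Contextractor; any splitter could take its place.

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split Markdown at headings into chunks of at most max_chars characters."""
    # Split before every ATX heading line (#, ##, ...) while keeping the heading.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Oversized sections are cut into fixed-size pieces as a simple fallback.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks
```

This sketch does not special-case fenced code blocks, so a `# comment` line inside a fence would be treated as a heading; a production splitter would likely need to handle that.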

How does it work?

Contextractor runs a simple three-stage pipeline for every page:

Crawl — an adaptive Crawlee + Playwright crawler fetches each page and follows links within the scope you set (selectors, URL globs, depth, sitemaps), respecting robots.txt when enabled.
Extract — the rs-trafilatura engine isolates the main content and discards navigation, ads, and cookie modals, using your chosen precision/balanced/recall mode.
Output — each page is emitted in the formats you selected, with an MD5 hash and byte length, and saved to your dataset or key-value store.
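
The hash and byte length attached in the Output stage can be pictured with a small sketch. It assumes both values are computed over the UTF-8 encoding of the content, which the text above does not state; `content_node` is a hypothetical helper name.

```python
import hashlib


def content_node(content: str) -> dict:
    """Build a hash/bytes descriptor for one piece of extracted content."""
    # Assumption: hash and length are taken over the UTF-8 bytes of the content.
    data = content.encode("utf-8")
    return {
        "hash": hashlib.md5(data).hexdigest(),  # 32-character hex digest
        "bytes": len(data),
    }
```

Comparing the `hash` of two runs is a cheap way to detect whether a page's extracted content changed.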

How to use Contextractor

No code required — run it straight from the Apify Console:

Add your start URLs — one or more pages or site sections you want to extract.
Choose what to save and where — the Save field takes format-destination tokens (e.g. Markdown → Key-value store, Original HTML → Dataset). Pick a format for each destination you want; selecting the same format for both the dataset and the key-value store saves it to both.
(Optional) Set the crawl scope and behavior — link selector, include/exclude URL globs, depth, and page limits to follow links across a site; enable proxy rotation, robots.txt, or render waits as needed.
Click Start and watch the run progress live.
Download your data — from the dataset (JSON, CSV, Excel) or the key-value store, or pull it programmatically via the Apify API.

Every option is documented in the input form and in the Input table below.

Input

Configure Contextractor entirely from the input form in the Apify Console — every field below is also editable as JSON. Start URLs are the only required field; everything else has a sensible default.

A minimal input looks like this:

```json
{
  "startUrls": [{ "url": "https://blog.example.com/" }],
  "selector": "a[href]",
  "globs": [{ "glob": "https://blog.example.com/**" }],
  "maxCrawlDepth": 2,
  "save": ["markdown-kvs"]
}
```

startUrls, globs, and exclude each take an array of objects, not bare strings: { "url": … } for startUrls and { "glob": … } for globs and exclude. save takes format-destination tokens such as markdown-kvs or txt-dataset.
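
When the input is generated from code, a small helper can enforce these shapes. The sketch below is illustrative only: `build_input` is a hypothetical name, and the set of destinations (`dataset`, `kvs`) is inferred from the example tokens above rather than from a published schema.

```python
FORMATS = {"txt", "markdown", "json", "html", "original"}
DESTINATIONS = {"dataset", "kvs"}  # inferred from tokens like "markdown-kvs"


def build_input(start_urls, globs=(), save=("markdown-kvs",), **options):
    """Assemble an input dict with object-wrapped URLs and validated save tokens."""
    for token in save:
        fmt, _, dest = token.rpartition("-")
        if fmt not in FORMATS or dest not in DESTINATIONS:
            raise ValueError(f"invalid save token: {token!r}")
    run_input = {
        "startUrls": [{"url": url} for url in start_urls],
        "save": list(save),
        **options,  # any other input fields, passed through unchanged
    }
    if globs:
        run_input["globs"] = [{"glob": pattern} for pattern in globs]
    return run_input
```

For example, `build_input(["https://blog.example.com/"], globs=["https://blog.example.com/**"], selector="a[href]", maxCrawlDepth=2)` produces the same structure as the minimal input shown earlier.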

Field	Type	Default	Description
`startUrls`	array	required	URLs to extract content from
`crawlerType`	enum (`playwright-adaptive` | `playwright-firefox` | `playwright-chromium` | `cheerio`)	`"playwright-adaptive"`	Browser engine or HTTP client for crawling. playwright-adaptive automatically switches between browser and HTTP client per page. cheerio uses raw HTTP only (fastest, no JS).
`renderingTypeDetectionRatio`	number	`0.1`	(Adaptive only) Ratio (0–1) of pages on which the crawler runs a rendering-type detection probe. Higher values are more accurate but slower.
`globs`	array	`[]`	Glob patterns matching URLs of pages that will be included in crawling. Setting this option allows you to customize the crawling scope. For example `https://{store,docs}.example.com/**` lets the craw…
`exclude`	array	`[]`	Glob patterns matching URLs of pages that will be excluded from crawling. Note that this affects only links found on pages, but not Start URLs, which are always crawled.
`selector`	string	`""`	CSS selector for links to enqueue. Leave empty to disable link enqueueing.
`keepUrlFragment`	boolean	`false`	URL fragments (the parts of URL after a #) are not considered when the scraper determines whether a URL has already been visited. Turn this on to treat URLs with different fragments as different page…
`useSitemaps`	boolean	`false`	If enabled, the crawler looks for sitemap.xml at the root of each start URL domain and enqueues matching URLs from it in addition to link-following.
`deduplication`	enum (`minimal` | `standard` | `aggressive`)	`"standard"`	Deduplication level applied on top of Crawlee's built-in URL deduplication. standard (default): skip pages whose canonical URL was already extracted, across all handler types. aggressive: al…
`respectRobotsTxtFile`	boolean	`false`	If enabled, the crawler will consult the robots.txt file for each domain before crawling pages.
`initialCookies`	array	optional	Cookies that will be pre-set to all pages the scraper opens. This is useful for pages that require login. The value is expected to be a JSON array of objects with `name` and `value` properties. For e…
`customHttpHeaders`	object	optional	HTTP headers that will be added to all requests made by the crawler. This is useful for setting custom authentication headers or other headers required by the target website. The value is expected to…
`maxRequestsPerCrawl`	integer	`0`	Maximum number of requests the crawler will handle. Counts handled page outcomes (successes and final failures), including start URLs and pagination pages. The crawler automatically finishes after re…
`maxResultsPerCrawl`	integer	`0`	Maximum number of results that will be saved to dataset. The scraper will terminate after reaching this number. 0 means unlimited.
`maxCrawlDepth`	integer	`0`	Maximum link depth from Start URLs. Pages discovered further from start URLs than this limit will not be crawled. 0 means unlimited.
`initialConcurrency`	integer	`0`	Initial number of browser pages or HTTP clients running in parallel. Crawlee auto-scales up to maxConcurrency. 0 lets Crawlee pick the default.
`maxConcurrency`	integer	`3`	Maximum number of browser pages running in parallel. Kept low by default because the browser crawler cannot abort in-flight pages, so concurrency is the only hard cap on peak memory — large pages can…
`maxRequestRetries`	integer	`3`	Maximum number of retries for failed requests on network, proxy, or server errors.
`mode`	enum (`precision` | `balanced` | `recall`)	`"balanced"`	Extraction mode. precision minimizes noise (may miss some content); recall maximizes content (may include noise); balanced is the default.
`includeComments`	boolean	`true`	Include comment sections (such as reader comments below an article) in the extracted text.
`includeTables`	boolean	`true`	Include table content in the extracted text.
`includeImages`	boolean	`false`	Include image alt text and captions in the extracted text.
`includeLinks`	boolean	`true`	Include hyperlinks in the extracted text.
`languageCode`	string	`""`	Filter extracted content by language code (e.g. "en"). Leave empty to accept any language.
`save`	array	`["markdown-kvs"]`	What to save and where, as `format-destination` tokens. Format is one of `txt`, `markdown`, `json`, `html`, `original` (raw page HTML before extraction); destination is `dataset` (inline in the datas…
`datasetName`	string	optional	Name or ID of the dataset for storing results. Leave empty to use the default run dataset.
`keyValueStoreName`	string	optional	Name or ID of the key-value store for content files. Leave empty to use the default store.
`requestQueueName`	string	optional	Name of the request queue for pending URLs. Leave empty to use the default queue.
`storeSkippedUrls`	boolean	`false`	If enabled, pushes a dataset record for each URL skipped during crawling (excluded by globs, robots.txt, depth limit, or concurrency cap). Can produce high record volume — enable for auditing only.
`proxyConfiguration`	object	optional	Loads websites from IP addresses in specific geographies and helps circumvent blocking.
`proxyRotation`	enum (`recommended` | `per-request` | `until-failure`)	`"recommended"`	Proxy rotation strategy. recommended automatically picks the best proxies. per-request uses a new proxy for each request. until-failure uses one proxy until it fails.
`sessionPoolName`	string	optional	Name for a persistent, shared session pool. Sessions (IP + cookies) are saved under this key and reused across Actor runs. Useful when proxies are frequently blocked — previously working sessions are…
`maxSessionRotations`	integer	`10`	Maximum number of session (IP + browser fingerprint) rotations per request on block detection. Independent of maxRequestRetries. Set to 0 to disable session rotation.
`navigationTimeoutSecs`	integer	`60`	Maximum time to wait for page navigation, in seconds.
`blockMedia`	boolean	`true`	Block loading of images, stylesheets, fonts (.woff), PDFs, and ZIPs. On by default: it cuts browser memory and bandwidth substantially, which helps avoid out-of-memory on large pages. Disable it (set…
`waitForSelector`	string	`""`	Wait for this CSS selector to appear before extracting content. The request fails and is retried if the selector does not appear within the timeout. Leave empty to disable.
`softWaitForSelector`	string	`""`	Wait for this CSS selector to appear before extracting content. Unlike waitForSelector, the request continues even if the selector does not appear within the timeout. Leave empty to disable.
`waitForDynamicContentSecs`	integer	`0`	Maximum seconds to wait for dynamic page content to load after navigation. The crawler continues when the network goes idle or this timeout elapses, whichever comes first. 0 disables this wait. Also…
`waitUntil`	enum (`load` | `domcontentloaded` | `networkidle` | `commit`)	`"load"`	When to consider navigation finished. networkidle waits for 500 ms of network silence (best for JS-heavy SPAs, slower); load waits for the load event (default, good for most articles); domcontentloade…
`headless`	boolean	`true`	Run the browser in headless mode.
`ignoreCorsAndCsp`	boolean	`false`	Ignore Content Security Policy and Cross-Origin Resource Sharing restrictions. Enables free XHR/Fetch requests from pages.
`closeCookieModals`	boolean	`true`	Automatically handle cookie consent: Ghostery-based ad/tracker blocking, accepting consent walls that replace the page (e.g. consent-or-pay) via the site’s own consent manager and re-fetching the art…
`maxScrollHeight`	integer	`5000`	Maximum number of pixels to scroll down the page while waiting for content to load. Set to 0 to disable scrolling.
`userAgent`	string	`""`	Custom User-Agent string for the browser. Leave empty to use the default browser User-Agent.
`ignoreHttpsErrors`	boolean	`false`	Ignore HTTPS certificate errors. Use at your own risk.

Crawler and extraction options

This section lists the most important multiple-choice settings and what each value means. See the Input table above for the full list of fields.

`crawlerType` (default `playwright-adaptive`)

Value	Title
`playwright-adaptive`	Adaptive switching (Recommended)
`playwright-firefox`	Headless browser (Firefox+Playwright)
`playwright-chromium`	Headless browser (Chromium+Playwright)
`cheerio`	Raw HTTP client (Cheerio)

`deduplication` (default `standard`)

Value	Title
`minimal`	Minimal — Crawlee URL dedup only
`standard`	Standard — + canonical URL (default)
`aggressive`	Aggressive — + content hash

`mode` (default `balanced`)

Value	Title
`precision`	Precision (less noise)
`balanced`	Balanced (default)
`recall`	Recall (more content)

`proxyRotation` (default `recommended`)

Value	Title
`recommended`	Recommended
`per-request`	Rotate per request
`until-failure`	Use until failure

`waitUntil` (default `load`)

Value	Title
`load`	Load event
`domcontentloaded`	DOM content loaded
`networkidle`	Network idle
`commit`	Commit

What data does Contextractor return?

Every crawled page becomes one dataset record. Successful pages carry status: "success" with the extracted content and metadata. Failed pages are recorded too, and skipped URLs are recorded when storeSkippedUrls is enabled, so nothing is silently dropped.

Field	Description
`url`	The original request URL.
`status`	Record outcome: `success`, `failed`, or `skipped`.
`metadata`	Extracted page metadata: `title`, `author`, `publishedAt`, `description`, `siteName`, `languageCode`.
`crawl`	Crawl provenance: `loadedUrl` (final URL after redirects), `loadedTime`, `httpStatusCode`, `depth` (link distance from a start URL), `referrerUrl` (the linking page).
`original`	The raw page HTML as a content node — `hash` (MD5) and `bytes` always present; `content`, or `key` + `url`, added when `original` is saved.
`txt`, `markdown`, `json`, `html`	One content node per saved format — `hash` and `bytes`, plus inline `content` (dataset) or a `key` + `url` reference (key-value store).
`errors`, `retryCount`, `crawledTime`	On `failed` records only: the error messages, number of retries, and when the request was abandoned.
`skipReason`	On `skipped` records only: `robotsTxt`, `limit`, `enqueueLimit`, `filters`, `redirect`, or `depth`.

Example success record (default settings — Markdown saved to the key-value store):

```json
{
  "url": "https://blog.example.com/why-rag-matters",
  "status": "success",
  "metadata": {
    "title": "Why RAG Matters",
    "author": "Jane Doe",
    "publishedAt": "2026-01-15",
    "description": "A practical look at retrieval-augmented generation.",
    "siteName": "Example Blog",
    "languageCode": "en"
  },
  "crawl": {
    "loadedUrl": "https://blog.example.com/why-rag-matters",
    "loadedTime": "2026-05-31T10:00:00.000Z",
    "httpStatusCode": 200,
    "depth": 1,
    "referrerUrl": "https://blog.example.com/"
  },
  "original": {
    "hash": "f8e6bd335e04d03e1be6798c2c72349c",
    "bytes": 89898
  },
  "markdown": {
    "hash": "43f204bfbee5dbe6862cb38620f257b5",
    "bytes": 5234,
    "key": "markdown-c485356090a92c6a45e8c1155c14d8ee.md",
    "url": "https://api.apify.com/v2/key-value-stores/<storeId>/records/markdown-c485356090a92c6a45e8c1155c14d8ee.md"
  }
}
```
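
Because a content node holds either inline `content` or a `key` + `url` reference, downstream code usually has to handle both. The following is a minimal sketch under that reading of the record layout; `read_format` and its `fetch` parameter are hypothetical names, and the default fetcher assumes the record URL is readable without extra authentication, which may not hold for private stores.

```python
from typing import Callable, Optional
from urllib.request import urlopen


def read_format(record: dict, fmt: str = "markdown",
                fetch: Optional[Callable[[str], str]] = None) -> Optional[str]:
    """Return the saved content for `fmt` from one dataset record, or None."""
    if record.get("status") != "success":
        return None  # only success records are described as carrying content
    node = record.get(fmt)
    if not node:
        return None  # this format was not saved for the page
    if "content" in node:
        return node["content"]  # saved inline in the dataset
    if "url" not in node:
        return None  # e.g. an `original` node holding only hash and bytes
    if fetch is None:
        # Default fetcher: plain HTTP GET of the key-value store record URL.
        fetch = lambda url: urlopen(url).read().decode("utf-8")
    return fetch(node["url"])
```

Passing a custom `fetch` makes it easy to add an API token, caching, or retries without changing the record-handling logic.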

Where your content is saved

Key-value store (default) — each format is stored as a separate file keyed {format}-{md5(url)}.{ext} (e.g. markdown-1a2b3c4d….md), and the dataset record references it by key and url. Best for large content and bulk download.
Dataset — the extracted content is embedded inline in each record under content. Best when you want everything in a single JSON, CSV, or Excel export.

Choose one or both with the Save option.
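
The key pattern can be sketched as follows. This is an assumption-heavy illustration: the text only shows the `.md` extension, so the other extensions in `EXTENSIONS` are guesses, and it is not stated which URL string (request URL, loaded URL, or a normalized form) the Actor hashes. `store_key` is a hypothetical helper.

```python
import hashlib

# Assumption: only ".md" appears in the text; the remaining extensions are guesses.
EXTENSIONS = {"markdown": "md", "txt": "txt", "json": "json", "html": "html", "original": "html"}


def store_key(fmt: str, url: str) -> str:
    """Build a key of the form {format}-{md5(url)}.{ext}."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return f"{fmt}-{digest}.{EXTENSIONS[fmt]}"
```

In practice it is safer to read the `key` and `url` fields from each dataset record than to recompute keys, since the exact hashing input is not documented here.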

Integrations and automation

Contextractor outputs standard JSON and Markdown, so its results drop straight into AI and data pipelines:

Apify API & SDKs — start runs, stream the dataset, and fetch key-value-store files programmatically from Python or JavaScript.
Scheduling & monitoring — schedule recurring runs and monitor them from the Apify Console.
MCP server — expose the Actor to AI agents through the Model Context Protocol.
No-code connectors — pipe results into Make, Zapier, n8n, Google Drive, Slack, and more via Apify's integrations.
LLM frameworks — feed the extracted Markdown or JSON into LangChain, LlamaIndex, or a vector database such as Pinecone, Qdrant, Weaviate, or Chroma for retrieval-augmented generation.
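
For the Apify API & SDKs route, a run might be started from Python roughly as below. This is a sketch assuming the `apify-client` package, where `call()` returns the run as a dict containing `defaultDatasetId`; the Actor ID is passed in as a parameter because its exact identifier is not given here, and `run_contextractor` is a hypothetical helper name.

```python
def run_contextractor(token: str, actor_id: str, run_input: dict) -> list[dict]:
    """Start a run, wait for it to finish, and return its dataset records."""
    # Imported lazily so this module loads even without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    # call() blocks until the run finishes and returns the run details.
    run = client.actor(actor_id).call(run_input=run_input)
    # Read every record from the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call could look like `run_contextractor(os.environ["APIFY_TOKEN"], "<username>/<actor-name>", {"startUrls": [{"url": "https://blog.example.com/"}]})`, where both the environment variable and the Actor ID placeholder are illustrative.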

FAQ

Is it legal to scrape website content?

Scraping publicly available, non-personal data is generally legal in most jurisdictions. Contextractor can honor each site's robots.txt (enable Respect robots.txt), and you remain responsible for complying with each site's Terms of Service and for how you use extracted content — especially copyrighted material you intend to republish.

Why is some content missing or noisy?

Switch the extraction mode: precision removes more boilerplate (and may drop borderline content), while recall keeps more (and may include some noise). For pages that load content with JavaScript, add a Wait for selector, increase Wait for dynamic content, or raise Max scroll height so lazy-loaded sections appear before extraction.

How do I avoid getting blocked?

Enable Proxy configuration with proxy rotation, set a Session pool name to reuse working sessions across runs, and allow session rotations so the crawler switches IP and fingerprint when a block is detected.

How do I crawl an entire website?

Set a Link selector (e.g. a[href]) to follow links, then bound the crawl with include/exclude URL globs, Max crawl depth, and Max requests per crawl. Enable Use sitemaps to also pull URLs from each domain's sitemap.xml.

How do I remove duplicate pages?

Use Deduplication: standard (the default) skips pages whose canonical URL was already extracted; aggressive additionally skips pages with identical extracted text; minimal keeps only Crawlee's built-in URL deduplication.
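
The three levels can be modeled with a toy class. This is only a conceptual sketch of the behavior described above, not the Actor's implementation; the class name `Deduplicator` and the choice of MD5 over the extracted text are assumptions.

```python
import hashlib


class Deduplicator:
    """Toy model of the minimal, standard, and aggressive levels."""

    def __init__(self, level: str = "standard"):
        if level not in ("minimal", "standard", "aggressive"):
            raise ValueError(f"unknown level: {level}")
        self.level = level
        self.seen_canonical: set[str] = set()
        self.seen_hashes: set[str] = set()

    def is_duplicate(self, canonical_url: str, text: str) -> bool:
        """Return True if the page should be skipped; otherwise remember it."""
        if self.level == "minimal":
            return False  # rely on the crawler's own URL deduplication
        if canonical_url in self.seen_canonical:
            return True  # standard and aggressive: canonical URL already seen
        if self.level == "aggressive":
            digest = hashlib.md5(text.encode("utf-8")).hexdigest()
            if digest in self.seen_hashes:
                return True  # aggressive only: identical extracted text
            self.seen_hashes.add(digest)
        self.seen_canonical.add(canonical_url)
        return False
```

With `aggressive`, two different URLs that extract to the same text (for example a print view and the article itself) count as one page, whereas `standard` keeps both.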

Found a bug or have a feature request?

We respond to issues on the Issues tab — please open one and we'll take a look.
