
GPT Crawler MCP — Build knowledge files for ChatGPT, Claude Projects & RAG in one click

Crawl any website. Get a clean JSON knowledge file. Plug it into your custom GPT, Claude Project, or RAG pipeline. Now also available as an MCP server for AI agents.

Apify Actor · License: ISC · Built on BuilderIO/gpt-crawler


🎯 Why this Actor

  • No more git clone + npm install broken at 11pm. One click, no setup, no local Chromium dance.
  • Built on the legendary BuilderIO/gpt-crawler (19k+ GitHub stars, ISC) — battle-tested crawl logic, wrapped for Apify and extended with MCP.
  • Pay only for what you crawl, no subscription: $0.001 per page in batch mode (or a flat $0.05 per MCP tool call), hard-capped by your maxPagesToCrawl. No monthly fee.

📚 What is a "knowledge file"?

A single JSON (or Markdown / plain-text) file containing the cleaned content of every page on a docs site, blog, or knowledge base. You upload it to:

  • ChatGPT → custom GPT → "Knowledge" → drop the file.
  • Claude Projects → "Project knowledge" → drop the file.
  • RAG pipelines → embed it, store in Pinecone / pgvector / Weaviate.
  • AI agents → call this Actor's MCP server live, no pre-indexing.

That's the whole pitch: turn a website into LLM-ready context in one click.


⚖️ How it stacks up

| Feature | Run BuilderIO/gpt-crawler locally | Firecrawl ($39/mo+) | GPT Crawler MCP (this Actor) |
| --- | --- | --- | --- |
| Setup time | 15 min (clone, npm i, Playwright install, fight ESM errors) | 5 min (account + API key) | 0 — one click |
| Pricing | Free + your time + your laptop | $39/mo flat | $0.001 / page (batch PPE) or $0.05 / MCP tool call, no subscription |
| MCP server mode for AI agents | No | No | Yes — Apify Standby |
| Auto retries / proxy rotation | Manual | Yes | Yes (Apify infra) |
| n8n / Zapier / Make integrations | No | No | Yes (Apify connectors) |
| Output as JSON / Markdown / plain text | JSON only | JSON / Markdown | JSON / Markdown / TXT |
| Headless browser (JS-rendered sites) | Yes | Yes | Yes (Playwright + Chromium) |

🚀 Quick start

1. From the Apify Console

  1. Go to apify.com/kazkn/gpt-crawler-mcp.
  2. Click Try for free.
  3. Paste your start URL (e.g. https://docs.your-product.com), set match pattern (https://docs.your-product.com/**) and maxPagesToCrawl, and click Start.
  4. When the run finishes, download output.json from the Storage → Key-value store tab (or grab the dataset).

2. From the API

```bash
curl -X POST "https://api.apify.com/v2/acts/kazkn~gpt-crawler-mcp/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.your-product.com"],
    "match": "https://docs.your-product.com/**",
    "maxPagesToCrawl": 50
  }'
```
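
The same run can be triggered from Node with the apify-client npm package. A minimal sketch (error handling omitted, input values illustrative):

```typescript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish.
const run = await client.actor("kazkn/gpt-crawler-mcp").call({
  urls: ["https://docs.your-product.com"],
  match: "https://docs.your-product.com/**",
  maxPagesToCrawl: 50,
});

// Each crawled page lands in the run's default dataset (see output example below).
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Crawled ${items.length} pages`);
```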

3. From n8n / Zapier / Make

Search for the Apify connector → action Run an Actor → pick kazkn/gpt-crawler-mcp → wire your inputs.


⚙️ Input parameters

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | string[] | — (required) | Start URLs. Sitemap .xml URLs are auto-detected. |
| match | string | ** | Glob pattern controlling which links to follow. |
| selector | string | body | CSS or XPath selector for content extraction. |
| maxPagesToCrawl | integer | 10 | Hard cap on pages crawled (also caps your cost). Max 1000. |
| outputFileName | string | output.json | Name of the combined knowledge file. |
| outputFormat | enum | json | json / markdown / txt. |
| headless | boolean | true | Run Chromium headless. |
| waitForSelectorTimeout | integer | 1000 | Milliseconds to wait for the selector. |
| cookie | string | — | Optional name=value cookie (for cookie-walls or auth). |
| maxTokens | integer | 0 | Optional cap on tokens per output file. 0 = no limit. |
| mcpMode | boolean | false | Run as MCP server (Standby mode). |
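
For orientation, here is a fuller input touching the optional fields, written as a TypeScript object literal that maps 1:1 to the JSON you would paste into the Console (all values, including the selector, are illustrative, not defaults):

```typescript
// Illustrative input: crawl a docs site, extract the main article container,
// and emit one Markdown knowledge file capped at 200 pages.
const input = {
  urls: ["https://docs.your-product.com"],   // sitemap .xml URLs also work
  match: "https://docs.your-product.com/**", // follow only docs links
  selector: "article.main-content",          // hypothetical selector; default is "body"
  maxPagesToCrawl: 200,                      // hard cost/page cap (max 1000)
  outputFileName: "output.json",
  outputFormat: "markdown",                  // "json" | "markdown" | "txt"
  waitForSelectorTimeout: 3000,              // give JS-heavy pages time to hydrate
  cookie: "session=YOUR_SESSION_COOKIE",     // only if the site needs auth
};
```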

📦 Output format example

Each page becomes a dataset item:

```json
{
  "url": "https://docs.your-product.com/getting-started",
  "title": "Getting started — YourProduct docs",
  "html": "Welcome to YourProduct...",
  "text": "Welcome to YourProduct. This guide walks you through the first 5 minutes...",
  "tokens": 412,
  "crawledAt": "2026-04-27T09:14:22.181Z"
}
```

The combined knowledge file (Key-value store → output.json) is the same array, ready to upload to ChatGPT / Claude / a vector store.
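
If you consume the file programmatically, a minimal TypeScript shape for each entry, inferred from the example above rather than an official type export:

```typescript
interface KnowledgePage {
  url: string;        // canonical URL of the crawled page
  title: string;      // <title> of the page
  html: string;       // raw extracted HTML
  text: string;       // cleaned plain text, ready for embedding
  tokens: number;     // token count of `text`
  crawledAt: string;  // ISO 8601 timestamp
}

// output.json is simply an array of these entries:
import { readFileSync } from "node:fs";
const pages: KnowledgePage[] = JSON.parse(readFileSync("output.json", "utf8"));
```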


🤖 MCP server mode (for ChatGPT, Claude Desktop & AI agents)

This Actor can run as a persistent MCP server via Apify Standby. Instead of pre-crawling a site and uploading a static file, your AI agent calls the crawl_to_knowledge tool live, on demand.

🔧 Tool exposed

| Tool | Description |
| --- | --- |
| crawl_to_knowledge | Crawl a website and return a JSON knowledge file (array of pages with title, url, text, tokens). |

💬 Add to Claude Desktop

Edit ~/Library/Application Support/Claude/claude_desktop_config.json:

```json
{
  "mcpServers": {
    "gpt-crawler": {
      "url": "https://kazkn--gpt-crawler-mcp.apify.actor/mcp?token=YOUR_APIFY_TOKEN"
    }
  }
}
```

🟢 Add to ChatGPT (custom GPT, MCP-compatible clients)

Use the same https://kazkn--gpt-crawler-mcp.apify.actor/mcp?token=... URL as your MCP server endpoint.

The tool is then callable in-conversation: "Crawl docs.stripe.com/api, return the first 30 pages as a knowledge file".
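
For programmatic agents, a hedged sketch using the official @modelcontextprotocol/sdk npm package; the tool name is from the table above, and the argument names are assumed to mirror the batch input:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

const transport = new StreamableHTTPClientTransport(
  new URL(`https://kazkn--gpt-crawler-mcp.apify.actor/mcp?token=${process.env.APIFY_TOKEN}`)
);
const client = new Client({ name: "example-agent", version: "1.0.0" });
await client.connect(transport);

// Crawls can take minutes: pass a generous per-request timeout (see next section).
const result = await client.callTool(
  {
    name: "crawl_to_knowledge",
    arguments: {
      urls: ["https://docs.stripe.com/api"],
      match: "https://docs.stripe.com/api/**",
      maxPagesToCrawl: 30,
    },
  },
  undefined,
  { timeout: 180_000 }
);
console.log(result.content);
```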

⏱️ Client compatibility & timeouts

Read this if you see "interrupted connection" or "invalid authentication" errors.

A crawl_to_knowledge call takes anywhere from about 15 seconds to several minutes, depending on maxPagesToCrawl, target-site latency, and JS-rendering needs:

| Pages crawled | Typical wall-clock |
| --- | --- |
| 5 pages | 10-25 s |
| 30 pages | 45-90 s |
| 100 pages | 2-4 min |
| 500+ pages | 5+ min |

Most MCP clients ship with default timeouts of 30 seconds — too short. Configure your client with 120 seconds minimum (180 s if you crawl 100+ pages).

| Client | Default | Recommended | How to configure |
| --- | --- | --- | --- |
| Claude Desktop | 30 s | 180 s | Add "timeout": 180000 to your server entry in claude_desktop_config.json |
| Cursor IDE | 30 s | 180 s | Settings → MCP → Request timeout (ms) → 180000 |
| Windsurf | 60 s | 180 s | MCP config → requestTimeoutMs: 180000 |
| Continue.dev | 30 s | 180 s | requestTimeoutMs: 180000 in MCP config |
| langchain-mcp (Python) | none | 180 s | MultiServerMCPClient(..., timeout=180) |
| @modelcontextprotocol/sdk (npm) | 30 s | 180 s | new Client({...}, { requestTimeoutMs: 180000 }) |

Claude Desktop config example

```json
{
  "mcpServers": {
    "gpt-crawler": {
      "type": "url",
      "url": "https://kazkn--gpt-crawler-mcp.apify.actor/mcp?token=YOUR_APIFY_TOKEN",
      "timeout": 180000
    }
  }
}
```

Best practices to avoid timeouts

  1. Start with maxPagesToCrawl: 10 to validate that the site crawls cleanly, then scale up (see the sketch after this list).
  2. One crawl at a time per client. Sequential calls are reliable; concurrent calls hit your client's pool limit.
  3. Cold-start adds 5-8 s. First request after idle wakes the Actor. Consecutive crawls within 60 s share a warm instance.
  4. For very large sites (1000+ pages), use batch mode (run the Actor with input from the Console) instead of MCP. Batch runs have a 1-hour default timeout and no MCP-client-side timeout to fight.
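
A minimal sketch of practice 1, assuming the connected client from the MCP SDK example above: probe with 10 pages, then scale up only if the probe succeeds.

```typescript
// Assumes `client` is the connected MCP client from the sketch above.
const crawl = (maxPagesToCrawl: number) =>
  client.callTool(
    {
      name: "crawl_to_knowledge",
      arguments: { urls: ["https://docs.your-product.com"], match: "**", maxPagesToCrawl },
    },
    undefined,
    { timeout: 180_000 }
  );

const probe = await crawl(10);   // cheap validation crawl first
if (!probe.isError) {
  await crawl(200);              // then scale up, sequentially, never in parallel
}
```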

Troubleshooting common errors

| Error message you see | Likely cause | Fix |
| --- | --- | --- |
| "Invalid or expired MCP authentication" | Client closed the connection before the crawl finished | Increase MCP timeout to 180 s |
| "interrupted network connection" | Same as above | Increase MCP timeout to 180 s |
| "Tool call returned no content" | Site blocked or no pages matched the match pattern | Verify match pattern; try match: "**" |
| "403 / blocked by target site" | Aggressive anti-bot on target | Try headless: true (default) or use batch mode with a custom proxy |
| "Bad Request: No valid session ID" | You called /mcp without the initialize handshake | Use a real MCP client, not raw curl |

Verifying it works

After connecting, ask your AI assistant:

"Use gpt-crawler to crawl https://docs.stripe.com/api/customers, max 5 pages, return as JSON."

You should see a JSON knowledge file with 5 page entries (title, url, text, tokens) within 30 seconds. If you get "interrupted connection", your client timeout is the issue.

If problems persist after raising the timeout, open an issue with your Apify Run ID — server logs always tell the truth and we can pinpoint the cause.


💰 Pricing

Pay-Per-Event (PPE):

| Event | Price | When charged |
| --- | --- | --- |
| Actor Start (apify-actor-start) | $0.00005 (one-time per GB) | Each cold-start |
| MCP Tool Call (tool-request) ⭐ Primary | $0.05 | Each crawl_to_knowledge invocation in MCP Standby mode |
| Page crawled — batch only (apify-default-dataset-item) | $0.001 | Each page written to dataset (batch mode only — never charged in MCP mode) |
| Capability Discovery (list-request) | $0.0001 | When client lists tools/resources/prompts |
| Resource Read (resource-request) | $0.0001 | When client reads a server resource |
| Prompt Request (prompt-request) | $0.0001 | When client requests a prompt |
| Completion Request (completion-request) | $0.0001 | When client requests a completion |

Examples:

  • 1 MCP crawl returning 30 pages → $0.05 (one tool-request)
  • 1 MCP crawl returning 200 pages → $0.05 (still one tool-request — the tool returns all pages in one response)
  • Batch run via Console crawling 30 pages → $0.03 (30 × $0.001)
  • Batch run crawling 200 pages → $0.20
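
The same arithmetic as a quick sketch, with prices hard-coded from the PPE table above (no subscription-tier discount applied):

```typescript
const TOOL_CALL_USD = 0.05; // flat per MCP crawl_to_knowledge call
const PAGE_USD = 0.001;     // per dataset item, batch mode only

// MCP mode: one flat fee regardless of page count.
const mcpCost = (_pages: number) => TOOL_CALL_USD;
// Batch mode: linear in pages crawled.
const batchCost = (pages: number) => pages * PAGE_USD;

console.log(mcpCost(200));   // 0.05
console.log(batchCost(200)); // 0.2
```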

Why MCP mode is flat-rate per call: the tool returns the entire knowledge file in a single response, so we charge once per call regardless of page count. The page cap is enforced by your maxPagesToCrawl input — set it conservatively to control crawl duration.

Apify subscription tier discounts (Bronze 10%, Silver 13%, Gold 20%) apply automatically. There is no monthly fee — if you don't run the Actor, you don't pay.


💡 Use cases

Real workflows people use this Actor for. Pick the one closest to yours; the input config is almost identical.

🎓 Build a Custom GPT for your product docs

Crawl docs.your-product.com, drop the JSON file into ChatGPT → Create a GPT → Knowledge. Your GPT now answers support questions in your product's voice, cites exact URLs, and stops hallucinating about features that don't exist.

📊 Sales objection handler for AI agencies

Crawl your competitor's website + your own pricing page + your case studies. The combined knowledge file becomes a Custom GPT that any sales rep can talk to: "Why are we more expensive than X?" and the GPT answers with your pre-vetted positioning.

🔎 Live RAG for customer support agents

Run the Actor in MCP standby mode. Your support agent (Claude Desktop or a custom n8n workflow) calls crawl_to_knowledge whenever the user asks about a topic that isn't already in the cache. Always-fresh context, zero pre-indexing.

📚 Train a RAG pipeline (LangChain / LlamaIndex / pgvector)

Crawl 200 pages of technical content, get a clean JSON file, embed each text field with OpenAI / Cohere / Voyage embeddings, and store the vectors in Pinecone or pgvector. The output JSON is already chunk-friendly, with a tokens field per page.
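
As one hedged sketch of that pipeline, embedding every page's text field with the openai npm package (model choice is yours; oversized entries would need chunking first):

```typescript
import OpenAI from "openai";
import { readFileSync } from "node:fs";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const pages = JSON.parse(readFileSync("output.json", "utf8"));

// One embedding per page; the `tokens` field helps you pre-filter oversized entries.
const { data } = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: pages.map((p: { text: string }) => p.text),
});

// Pair each vector with its source URL for upsert into Pinecone / pgvector.
const records = data.map((d, i) => ({ url: pages[i].url, values: d.embedding }));
```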

🛒 Competitive intelligence for B2B SaaS

Schedule a weekly crawl of your top 3 competitors' marketing sites. Diff the resulting knowledge files to detect new features, pricing changes, or messaging pivots before they hit your Slack.
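
A minimal sketch of the diff step, comparing two weekly knowledge files by URL and content hash (file names are hypothetical):

```typescript
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

type Page = { url: string; text: string };
const load = (file: string): Map<string, string> =>
  new Map(
    (JSON.parse(readFileSync(file, "utf8")) as Page[]).map(
      (p) => [p.url, createHash("sha256").update(p.text).digest("hex")] as [string, string]
    )
  );

const prev = load("competitor-last-week.json");
const curr = load("competitor-this-week.json");

for (const [url, hash] of curr) {
  if (!prev.has(url)) console.log(`NEW     ${url}`);
  else if (prev.get(url) !== hash) console.log(`CHANGED ${url}`);
}
```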



❓ FAQ

How is this different from running BuilderIO/gpt-crawler locally?

Same crawl logic (we wrap their core.ts 1:1 + a tiny adapter). What you get on top: zero local setup, hosted Chromium, automatic retries, Apify proxy rotation, scheduled runs, n8n/Zapier/Make integrations, and an MCP server mode that doesn't exist upstream.

Will this work on JavaScript-heavy sites (React / Vue / Next.js)?

Yes. We use Playwright + Chromium under the hood (same as upstream), so client-rendered content is fully supported. Use the selector input to target the exact container after JS hydration.

Does it respect robots.txt / can I crawl any site?

You are responsible for what you crawl. The Actor will fetch what you ask it to. For competitive/copyrighted content, don't. For your own docs, your customer's docs (with permission), or public technical documentation that explicitly invites indexing — go for it.

Can I crawl pages behind a login or cookie wall?

Yes — use the cookie input (name=value format). For more complex auth (OAuth, multi-step login), open an issue on GitHub and we'll add it.

What's the difference between batch mode and MCP mode?

  • Batch: you specify URLs once, get a file. Best for building a static knowledge base you'll upload to a custom GPT or RAG store.
  • MCP: an AI agent calls the crawler live, mid-conversation. Best for agentic workflows where the URL to crawl isn't known ahead of time.

🏗️ Built on

This Actor is a thin Apify wrapper around BuilderIO/gpt-crawler (ISC licensed, 19k+ stars). All credit for the core crawl logic goes to the Builder.io team and the upstream contributors. The original ISC license text is preserved in the source repository.

Wrapper authored by KazKN — see all my Actors on Apify Store.

📜 License

ISC — same as upstream. Free for personal and commercial use.

🆘 Support

  • 🐛 Issues / feature requests: open one on GitHub — fastest reply.
  • 💬 Apify Console: use the Issues tab on the Actor page to report bugs directly to the maintainer with run IDs attached.
  • Liked it? Leave a 5-star rating on the Actor page — that's how this Actor stays alive and improves.