GPT Crawler MCP — Knowledge files for ChatGPT, Claude, RAG
Pricing: from $35.00 / 1,000 MCP tool calls
Crawl any website and turn it into a clean knowledge file for your custom GPT, Claude Project, or RAG pipeline. Native MCP server in Standby mode + classic batch mode.
Developer: KazKN
GPT Crawler MCP — Build knowledge files for ChatGPT, Claude Projects & RAG in one click
Crawl any website. Get a clean JSON knowledge file. Plug it into your custom GPT, Claude Project, or RAG pipeline. Now also as MCP server for AI agents.
🎯 Why this Actor
- No more `pip install` + Python venv broken at 11pm. One click, no setup, no local Chromium dance.
- Built on the legendary BuilderIO/gpt-crawler (19k+ GitHub stars, ISC) — battle-tested crawl logic, wrapped for Apify and extended with MCP.
- Pay only for what you crawl, no subscription. $0.005 per page, hard-capped by your `maxPagesToCrawl`. No monthly fee.
📚 What is a "knowledge file"?
A single JSON (or Markdown / plain-text) file containing the cleaned content of every page on a docs site, blog, or knowledge base. You upload it to:
- ChatGPT → custom GPT → "Knowledge" → drop the file.
- Claude Projects → "Project knowledge" → drop the file.
- RAG pipelines → embed it, store in Pinecone / pgvector / Weaviate.
- AI agents → call this Actor's MCP server live, no pre-indexing.
That's the whole pitch: turn a website into LLM-ready context in one click.
⚖️ How it stacks up
| Feature | Run BuilderIO/gpt-crawler locally | Firecrawl ($39/mo+) | GPT Crawler MCP (this Actor) |
|---|---|---|---|
| Setup time | 15 min (clone, npm i, Playwright install, fight ESM errors) | 5 min (account + API key) | 0 — one click |
| Pricing | Free + your time + your laptop | $39/mo flat | $0.005 / page (PPE), no subscription |
| MCP server mode for AI agents | No | No | Yes — Apify Standby |
| Auto retries / proxy rotation | Manual | Yes | Yes (Apify infra) |
| n8n / Zapier / Make integrations | No | No | Yes (Apify connectors) |
| Output as JSON / Markdown / plain text | JSON only | JSON / Markdown | JSON / Markdown / TXT |
| Headless browser (JS-rendered sites) | Yes | Yes | Yes (Playwright + Chromium) |
🚀 Quick start
1. From the Apify Console (recommended)
- Go to apify.com/kazkn/gpt-crawler-mcp.
- Click Try for free.
- Paste your start URL (e.g. https://docs.your-product.com), set the match pattern (https://docs.your-product.com/**) and `maxPagesToCrawl`, and click Start.
- When the run finishes, download `output.json` from the Storage → Key-value store tab (or grab the dataset).
2. From the API
```bash
curl -X POST "https://api.apify.com/v2/acts/kazkn~gpt-crawler-mcp/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "urls": ["https://docs.your-product.com"],
        "match": "https://docs.your-product.com/**",
        "maxPagesToCrawl": 50
      }'
```
3. From n8n / Zapier / Make
Search for the Apify connector → action Run an Actor → pick kazkn/gpt-crawler-mcp → wire your inputs.
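4. From Python (apify-client)
The same run can be scripted with Apify's official Python client. A minimal sketch, assuming the apify-client package is installed and you use your own token; input values are illustrative:
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the Actor and wait for the run to finish
run = client.actor("kazkn/gpt-crawler-mcp").call(run_input={
    "urls": ["https://docs.your-product.com"],
    "match": "https://docs.your-product.com/**",
    "maxPagesToCrawl": 50,
})

# Each crawled page is one dataset item (url, title, text, tokens, ...)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["tokens"])
```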
⚙️ Input parameters
| Field | Type | Default | Description |
|---|---|---|---|
| `urls` | string[] | — (required) | Start URLs. Sitemap .xml URLs are auto-detected. |
| `match` | string | `**` | Glob pattern controlling which links to follow. |
| `selector` | string | `body` | CSS or XPath selector for content extraction. |
| `maxPagesToCrawl` | integer | 10 | Hard cap on pages crawled (also caps your cost). Max 1000. |
| `outputFileName` | string | `output.json` | Name of the combined knowledge file. |
| `outputFormat` | enum | `json` | `json` / `markdown` / `txt`. |
| `headless` | boolean | true | Run Chromium headless. |
| `waitForSelectorTimeout` | integer | 1000 | Milliseconds to wait for the selector. |
| `cookie` | string | — | Optional `name=value` cookie (for cookie-walls or auth). |
| `maxTokens` | integer | 0 | Optional cap on tokens per output file. 0 = no limit. |
| `mcpMode` | boolean | false | Run as MCP server (Standby mode). |
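Putting it together, a typical batch-mode input looks like this (values are illustrative; fields you omit fall back to the defaults above):
```json
{
  "urls": ["https://docs.your-product.com"],
  "match": "https://docs.your-product.com/**",
  "selector": "main",
  "maxPagesToCrawl": 100,
  "outputFileName": "output.json",
  "outputFormat": "json"
}
```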
📦 Output format example
Each page becomes a dataset item:
{"url": "https://docs.your-product.com/getting-started","title": "Getting started — YourProduct docs","html": "Welcome to YourProduct...","text": "Welcome to YourProduct. This guide walks you through the first 5 minutes...","tokens": 412,"crawledAt": "2026-04-27T09:14:22.181Z"}
The combined knowledge file (Key-value store → output.json) is the same array, ready to upload to ChatGPT / Claude / a vector store.
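If you want to inspect or trim the file before uploading, a few lines of Python will do (a minimal sketch, assuming the array structure shown above):
```python
import json

# The combined knowledge file is an array of page objects
with open("output.json", encoding="utf-8") as f:
    pages = json.load(f)

total_tokens = sum(p.get("tokens", 0) for p in pages)
print(f"{len(pages)} pages, ~{total_tokens} tokens")

# Drop near-empty pages before uploading to ChatGPT / Claude
pages = [p for p in pages if len(p.get("text", "")) > 200]
with open("output.trimmed.json", "w", encoding="utf-8") as f:
    json.dump(pages, f, ensure_ascii=False)
```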
🤖 MCP server mode (for ChatGPT, Claude Desktop & AI agents)
This Actor can run as a persistent MCP server via Apify Standby. Instead of pre-crawling a site and uploading a static file, your AI agent calls the crawl_to_knowledge tool live, on demand.
🔧 Tool exposed
| Tool | Description |
|---|---|
| `crawl_to_knowledge` | Crawl a website and return a JSON knowledge file (array of pages with title, url, text, tokens). |
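The argument schema mirrors the Actor input above; a representative call payload (argument names assumed to match the batch-mode fields) looks like this:
```json
{
  "name": "crawl_to_knowledge",
  "arguments": {
    "urls": ["https://docs.stripe.com/api"],
    "match": "https://docs.stripe.com/api/**",
    "maxPagesToCrawl": 30
  }
}
```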
💬 Add to Claude Desktop
Edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{"mcpServers": {"gpt-crawler": {"url": "https://kazkn--gpt-crawler-mcp.apify.actor/mcp?token=YOUR_APIFY_TOKEN"}}}
🟢 Add to ChatGPT (custom GPT, MCP-compatible clients)
Use the same https://kazkn--gpt-crawler-mcp.apify.actor/mcp?token=... URL as your MCP server endpoint.
The tool is then callable in-conversation: "Crawl docs.stripe.com/api, return the first 30 pages as a knowledge file".
⏱️ Client compatibility & timeouts
Read this if you see "interrupted connection" or "invalid authentication" errors.
A crawl_to_knowledge call takes 15 seconds to 3 minutes depending on maxPagesToCrawl, target site latency, and JS-rendering needs:
| Pages crawled | Typical wall-clock |
|---|---|
| 5 pages | 10-25 s |
| 30 pages | 45-90 s |
| 100 pages | 2-4 min |
| 500+ pages | 5+ min |
Most MCP clients ship with default timeouts of 30 seconds — too short. Configure your client with 120 seconds minimum (180 s if you crawl 100+ pages).
Recommended timeout per client
| Client | Default | Recommended | How to configure |
|---|---|---|---|
| Claude Desktop | 30 s | 180 s | Add "timeout": 180000 to your server entry in claude_desktop_config.json |
| Cursor IDE | 30 s | 180 s | Settings → MCP → Request timeout (ms) → 180000 |
| Windsurf | 60 s | 180 s | MCP config → requestTimeoutMs: 180000 |
| Continue.dev | 30 s | 180 s | requestTimeoutMs: 180000 in MCP config |
| langchain-mcp (Python) | none | 180 s | MultiServerMCPClient(..., timeout=180) |
| @modelcontextprotocol/sdk (npm) | 30 s | 180 s | new Client({...}, { requestTimeoutMs: 180000 }) |
Claude Desktop config example
{"mcpServers": {"gpt-crawler": {"type": "url","url": "https://kazkn--gpt-crawler-mcp.apify.actor/mcp?token=YOUR_APIFY_TOKEN","timeout": 180000}}}
Best practices to avoid timeouts
- Start with `maxPagesToCrawl: 10` to validate the site works, then scale up.
- One crawl at a time per client. Sequential calls are reliable; concurrent calls hit your client's pool limit.
- Cold-start adds 5-8 s. First request after idle wakes the Actor. Consecutive crawls within 60 s share a warm instance.
- For very large sites (1000+ pages), use batch mode (run the Actor with input from the Console) instead of MCP. Batch runs have a 1-hour default timeout, no MCP-client-side timeout to fight.
Troubleshooting common errors
| Error message you see | Likely cause | Fix |
|---|---|---|
| "Invalid or expired MCP authentication" | Client closed the connection before the crawl finished | Increase MCP timeout to 180 s |
| "interrupted network connection" | Same as above | Increase MCP timeout to 180 s |
| "Tool call returned no content" | Site blocked or no pages matched the match pattern | Verify match pattern; try match: "**" |
| "403 / blocked by target site" | Aggressive anti-bot on target | Try headless: true (default) or use batch mode with custom proxy |
| "Bad Request: No valid session ID" | You called /mcp without the initialize handshake | Use a real MCP client, not raw curl |
Verifying it works
After connecting, ask your AI assistant:
"Use gpt-crawler to crawl
https://docs.stripe.com/api/customers, max 5 pages, return as JSON."
You should see a JSON knowledge file with 5 page entries (title, url, text, tokens) within 30 seconds. If you get "interrupted connection", your client timeout is the issue.
If problems persist after raising the timeout, open an issue with your Apify Run ID — server logs always tell the truth and we can pinpoint the cause.
💰 Pricing
Pay-Per-Event (PPE):
| Event | Price | When charged |
|---|---|---|
| Actor Start (`apify-actor-start`) | $0.00005 (one-time per GB) | Each cold start |
| MCP Tool Call (`tool-request`) ⭐ Primary | $0.05 | Each `crawl_to_knowledge` invocation in MCP Standby mode |
| Page crawled — batch only (`apify-default-dataset-item`) | $0.001 | Each page written to the dataset (batch mode only; never charged in MCP mode) |
| Capability Discovery (`list-request`) | $0.0001 | When the client lists tools/resources/prompts |
| Resource Read (`resource-request`) | $0.0001 | When the client reads a server resource |
| Prompt Request (`prompt-request`) | $0.0001 | When the client requests a prompt |
| Completion Request (`completion-request`) | $0.0001 | When the client requests a completion |
Examples:
- 1 MCP crawl returning 30 pages → $0.05 (one tool-request)
- 1 MCP crawl returning 200 pages → $0.05 (still one tool-request — the tool returns all pages in one response)
- Batch run via Console crawling 30 pages → $0.03 (30 × $0.001)
- Batch run crawling 200 pages → $0.20
Why MCP mode is flat-rate per call: the tool returns the entire knowledge file in a single response, so we charge once per call regardless of page count. The page cap is enforced by your maxPagesToCrawl input — set it conservatively to control crawl duration.
Apify subscription tier discounts (Bronze 10%, Silver 13%, Gold 20%) apply automatically. There is no monthly fee — if you don't run the Actor, you don't pay.
💡 Use cases
Real workflows people use this Actor for. Pick the one closest to yours; the input config is almost identical.
🎓 Build a Custom GPT for your product docs
Crawl docs.your-product.com, drop the JSON file into ChatGPT → Create a GPT → Knowledge. Your GPT now answers support questions in your product's voice, cites exact URLs, and stops hallucinating about features that don't exist.
📊 Sales objection handler for AI agencies
Crawl your competitor's website + your own pricing page + your case studies. The combined knowledge file becomes a Custom GPT that any sales rep can talk to: "Why are we more expensive than X?" and the GPT answers with your pre-vetted positioning.
🔎 Live RAG for customer support agents
Run the Actor in MCP standby mode. Your support agent (Claude Desktop or a custom n8n workflow) calls crawl_to_knowledge whenever the user asks about a topic that isn't already in the cache. Always-fresh context, zero pre-indexing.
📚 Train a RAG pipeline (LangChain / LlamaIndex / pgvector)
Crawl 200 pages of technical content, get a clean JSON, embed each text field with OpenAI / Cohere / Voyage embeddings, store in Pinecone or pgvector. Output JSON is already chunk-friendly with tokens field included.
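A minimal embedding pass over the knowledge file could look like this (a sketch assuming the OpenAI Python SDK; swap in Cohere / Voyage embeddings and your own vector store as needed):
```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("output.json", encoding="utf-8") as f:
    pages = json.load(f)

# One embedding per crawled page; split long pages into smaller chunks first
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[p["text"] for p in pages],
)

records = [
    {"url": p["url"], "title": p["title"], "embedding": e.embedding}
    for p, e in zip(pages, response.data)
]
# records is ready to upsert into Pinecone / pgvector / Weaviate
```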
🛒 Competitive intelligence for B2B SaaS
Schedule a weekly crawl of your top 3 competitors' marketing sites. Diff the resulting knowledge files to detect new features, pricing changes, or messaging pivots before they hit your Slack.
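The week-over-week diff can be as simple as comparing two knowledge files by URL (a sketch; last_week.json and this_week.json are assumed to be the outputs of two scheduled runs):
```python
import json

def load(path):
    with open(path, encoding="utf-8") as f:
        return {p["url"]: p["text"] for p in json.load(f)}

old, new = load("last_week.json"), load("this_week.json")

added = new.keys() - old.keys()
removed = old.keys() - new.keys()
changed = [url for url in old.keys() & new.keys() if old[url] != new[url]]

print("New pages:", sorted(added))
print("Removed pages:", sorted(removed))
print("Changed pages:", sorted(changed))
```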
🔗 Other Actors by KazKN
If this Actor helps you, you might also like:
- Vinted MCP Server — MCP server exposing 5 Vinted tools (search, compare prices across 19 EU countries, trending, seller intel).
- Vinted Smart Scraper — bulk Vinted data extraction with cross-country price comparison and arbitrage detection.
- Vinted Turbo Scraper — fast URL-based extractor for individual Vinted listings.
- App Store Localization Scraper — find App Store localization gaps across 175 regions.
❓ FAQ
How is this different from running BuilderIO/gpt-crawler locally?
Same crawl logic (we wrap their core.ts 1:1 + a tiny adapter). What you get on top: zero local setup, hosted Chromium, automatic retries, Apify proxy rotation, scheduled runs, n8n/Zapier/Make integrations, and an MCP server mode that doesn't exist upstream.
Will this work on JavaScript-heavy sites (React / Vue / Next.js)?
Yes. We use Playwright + Chromium under the hood (same as upstream), so client-rendered content is fully supported. Use the selector input to target the exact container after JS hydration.
Does it respect robots.txt / can I crawl any site?
You are responsible for what you crawl. The Actor will fetch what you ask it to. For competitive/copyrighted content, don't. For your own docs, your customer's docs (with permission), or public technical documentation that explicitly invites indexing — go for it.
Can I crawl behind a cookie-wall or auth?
Yes — use the cookie input (name=value format). For more complex auth (OAuth, multi-step login), open an issue on GitHub and we'll add it.
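For example, a batch-mode input with a session cookie (the cookie name and value are illustrative):
```json
{
  "urls": ["https://docs.your-product.com"],
  "match": "https://docs.your-product.com/**",
  "maxPagesToCrawl": 50,
  "cookie": "session_id=YOUR_SESSION_VALUE"
}
```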
What's the difference between batch mode and MCP mode?
- Batch: you specify URLs once, get a file. Best for building a static knowledge base you'll upload to a custom GPT or RAG store.
- MCP: an AI agent calls the crawler live, mid-conversation. Best for agentic workflows where the URL to crawl isn't known ahead of time.
🏗️ Built on
This Actor is a thin Apify wrapper around BuilderIO/gpt-crawler (ISC licensed, 19k+ stars). All credit for the core crawl logic goes to the Builder.io team and the upstream contributors. The original ISC license text is preserved in the source repository.
Wrapper authored by KazKN — see all my Actors on Apify Store.
📜 License
ISC — same as upstream. Free for personal and commercial use.
🆘 Support
- 🐛 Issues / feature requests: open one on GitHub — fastest reply.
- 💬 Apify Console: use the Issues tab on the Actor page to report bugs directly to the maintainer with run IDs attached.
- ⭐ Liked it? Leave a 5-star rating on the Actor page — that's how this Actor stays alive and improves.