Website to API & MCP Generator

Turn public websites into structured data, OpenAPI specs, and MCP-ready descriptors. Crawl pages, detect forms and API-like endpoints, and export clean outputs for agents, chatbots, automation, and developer workflows.

Pricing: from $0.50 / 1,000 results
Developer: Solutions Smart
Website to API & MCP Generator ✨
Turn almost any public website into structured data, an OpenAPI spec, and MCP descriptors in one run.
Video walkthrough 🎥
Watch the demo here: https://youtu.be/-B3d1CYdRhE?si=wxhW0tdLJl2MGQmj
This Actor crawls pages from your start URLs, discovers detail pages, extracts entities (product, article, job, profile, form, fallback page), and stores normalized results in an Apify dataset. It is built for technical users who want configurable crawling and predictable output artifacts.
Quick start ⚡
- Open the Actor in Apify Console or run it locally with `apify run`.
- Add one or more URLs to `startUrls`.
- Keep `extraction.mode` set to `auto` for the first run.
- Start with a small crawl such as `maxPages: 10`.
- Review the dataset plus `output-*.json` artifacts, then scale up.
Recommended first run ✅
- Use `startUrls` with the main public docs or product pages you want to map.
- Keep `maxPages` between `10` and `50` until you confirm the output quality.
- Keep `maxDepth: 2` or `3` for most sites.
- Keep `concurrency: 5` to `10` for stable first runs.
- Leave `extraction.rendering.waitForSelector` empty unless the site is strongly JS-driven and you know a reliable selector.
- Set `proxy.useApifyProxy` to `false` for easy public sites and `true` for harder targets or region-sensitive sites.
- Enable `emitOpenApi` and `emitMcp` if you want the generated API and MCP artifacts right away.
What can this Actor do? 🚀
- Crawl websites with depth and page limits (`maxDepth`, `maxPages`)
- Use hybrid crawling: Cheerio HTML extraction first, with adaptive Playwright fallback for SPA shells
- Discover list/detail patterns automatically (or use manual extraction mode)
- Extract structured fields from JSON-LD, OpenGraph, and DOM content
- Track per-domain rendering hit rates to reduce unnecessary browser work
- Discover HTML forms and extract field/action schemas (including file uploads and submit selectors)
- Capture same-site `fetch`/`xhr` API endpoints during Playwright fallback
- Deduplicate records by URL or URL + content fingerprint
- Track changes between runs (added, removed, modified entities)
- Generate machine-friendly artifacts: `output-schema.json`, `output-index.json`, `output-changes.json`, `output-capabilities.json`, `output-api-endpoints.json`, `output-rendering-stats.json`, `output-openapi.json` (optional), `output-postman-collection.json`, and `output-mcp.json` and `output-tools.json` (optional)
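The change-tracking idea behind `output-changes.json` can be pictured with a small sketch. The Actor's real fingerprint algorithm is internal; hashing the URL plus the extracted text, as below, is an assumption chosen only to illustrate the added/removed/modified classification:

```python
import hashlib


def fingerprint(url: str, text: str) -> str:
    """Illustrative content fingerprint: SHA-256 over URL + extracted text.
    The Actor's actual fingerprinting is internal; this only mimics the idea."""
    return hashlib.sha256(f"{url}\n{text}".encode("utf-8")).hexdigest()


def diff_runs(previous: dict, current: dict) -> dict:
    """Classify entities between two runs. Each dict maps
    canonical URL -> fingerprint."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    modified = sorted(u for u in current.keys() & previous.keys()
                      if current[u] != previous[u])
    return {"added": added, "removed": removed, "modified": modified}


old = {"https://a.example/1": fingerprint("https://a.example/1", "v1")}
new = {"https://a.example/1": fingerprint("https://a.example/1", "v2"),
       "https://a.example/2": fingerprint("https://a.example/2", "hi")}
print(diff_runs(old, new))
```

With the inputs above, `https://a.example/2` is reported as added and `https://a.example/1` as modified, which is the same three-way split the Actor emits between runs.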
Why use it on Apify? ☁️
Running this Actor on Apify gives you more than a scraper script:
- Scheduled runs and easy automation
- API access to dataset and key-value store outputs
- Built-in run logs and monitoring
- Proxy configuration support (`RESIDENTIAL`, country targeting)
- Integrations and webhooks for downstream workflows
What data can it extract? 🧩
The Actor always emits a normalized entity object and enriches fields by detected entity type.
| Field | Description |
|---|---|
| `type` | Entity type (`product`, `article`, `job`, `profile`, `form`, `page`) |
| `id` | Stable SHA-256 hash of the canonical URL (or a form-specific seed for form entities) |
| `sourceUrl` | Original crawled URL |
| `canonicalUrl` | Normalized canonical URL |
| `title` | Best detected page/entity title |
| `fields` | Type-specific extracted fields (price, author, company, etc.) |
| `text` | Optional extracted text/markdown |
| `images` | Collected image URLs |
| `metadata.confidence` | Extraction confidence score |
| `metadata.fingerprint` | Content fingerprint used for change detection |
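The `id` scheme can be approximated in a few lines. The Actor may normalize the URL before hashing, so treat this as an illustration of the idea rather than the exact algorithm:

```python
import hashlib


def entity_id(canonical_url: str) -> str:
    # Hex SHA-256 digest of the canonical URL string: 64 hex characters,
    # stable across runs for the same URL.
    return hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()


eid = entity_id("https://docs.apify.com/academy")
print(len(eid))  # 64
```

Because the hash is deterministic, the same canonical URL always yields the same `id`, which is what makes deduplication and change detection across runs possible.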
How to use this Actor 🛠️
- Add at least one URL in `startUrls`.
- Set crawl controls (`maxPages`, `maxDepth`, `concurrency`).
- Optionally tune `includePatterns`/`excludePatterns`.
- Keep `extraction.mode` as `auto`, or switch to `manual` and provide selectors.
- Optionally tune `extraction.rendering.timeoutSecs` and `extraction.rendering.waitForSelector` for JS-heavy targets.
- Run the Actor.
- Read results in the dataset and `output-*.json` artifacts in the default key-value store.
Input example using https://docs.apify.com/ 📝
```json
{
  "debug": false,
  "maxPages": 10,
  "startUrls": [{ "url": "https://docs.apify.com/" }],
  "maxDepth": 3,
  "concurrency": 10,
  "includePatterns": ["**/*"],
  "excludePatterns": ["**/*.pdf", "**/*.jpg", "**/*.png", "**/*.zip", "**/wp-admin/**"],
  "entityHints": ["product", "article", "job", "profile"],
  "extraction": {
    "mode": "auto",
    "manual": {
      "listPageUrl": "",
      "listItemSelector": "",
      "detailLinkSelector": "",
      "fields": []
    },
    "rendering": { "timeoutSecs": 8, "waitForSelector": "" }
  },
  "output": {
    "datasetName": "entities",
    "emitOpenApi": true,
    "emitMcp": true,
    "emitMarkdown": false
  },
  "dedupe": {
    "enabled": true,
    "strategy": "url+contentHash",
    "changeDetection": true
  },
  "proxy": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"],
    "apifyProxyCountry": "DE"
  }
}
```
Hybrid crawling notes:
- Cheerio is always the fast default path for HTML extraction.
- Playwright is used only when the page looks like an SPA shell, HTML extraction is too thin, or the domain has shown a strong render hit rate earlier in the run.
- `extraction.rendering.timeoutSecs` controls how long the fallback renderer waits for useful content.
- `extraction.rendering.waitForSelector` lets you prioritize a specific selector on JS-heavy pages.
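The fallback decision can be pictured as a simple heuristic. The thresholds and signals below are illustrative assumptions, not the Actor's actual tuning:

```python
def should_render(html_text_length: int, looks_like_spa_shell: bool,
                  domain_render_hit_rate: float) -> bool:
    """Decide whether to escalate from Cheerio to Playwright.
    Both thresholds are assumed values for illustration only."""
    MIN_USEFUL_TEXT = 500   # below this, HTML extraction counts as "too thin"
    STRONG_HIT_RATE = 0.6   # domain historically needed rendering this often
    if looks_like_spa_shell:
        return True
    if html_text_length < MIN_USEFUL_TEXT:
        return True
    return domain_render_hit_rate >= STRONG_HIT_RATE


print(should_render(5000, False, 0.1))  # rich HTML, no SPA signals
print(should_render(120, False, 0.0))   # extraction too thin
```

The per-domain hit rate is what makes the fallback adaptive: once a domain keeps requiring rendering, later pages from that domain skip the cheap path sooner.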
Output example from https://docs.apify.com/ 📦
```json
{
  "type": "page",
  "id": "7d48ca31ddac67e2ad26b02c4fa26b9656c527eb537d269d46925f3aab45596d",
  "sourceUrl": "https://docs.apify.com/academy",
  "canonicalUrl": "https://docs.apify.com/academy",
  "title": "Apify Academy | Academy | Apify Documentation",
  "fields": {
    "description": "Learn everything about web scraping and automation with our free courses that will turn you into an expert scraper developer."
  },
  "text": "Apify AcademyCopy for LLMLearn everything about web scraping and automation...",
  "images": [
    "https://apify.com/og-image/docs-article?title=Apify+Academy",
    "https://docs.apify.com/img/apify_sdk.svg"
  ],
  "metadata": {
    "discoveredAt": "2026-03-14T09:57:20.651Z",
    "fingerprint": "bcd44f5d99e7ac5dc275f27246634c5ef745c103297de301be6b7dd99435fd6b",
    "listUrl": "https://docs.apify.com/academy",
    "confidence": 0.45
  }
}
```
Form entity example on a site that contains a form:
```json
{
  "type": "form",
  "id": "bcda...",
  "sourceUrl": "https://target-site.example/contact",
  "canonicalUrl": "https://target-site.example/contact#form-1",
  "title": "Contact Form",
  "fields": {
    "formId": "contact_form",
    "method": "POST",
    "target": "https://target-site.example/contact",
    "fields": [
      { "name": "name", "type": "text", "required": true },
      { "name": "birthDate", "type": "date" },
      { "name": "documents", "type": "file" }
    ],
    "actions": [
      { "type": "submit", "selector": "button[type=\"submit\"]", "method": "POST" }
    ],
    "supportsFileUpload": true
  },
  "images": [],
  "metadata": {
    "discoveredAt": "2026-03-04T12:00:00.000Z",
    "fingerprint": "5c21...",
    "confidence": 0.96
  }
}
```
How to use the API outputs 🔌
After the Actor finishes, you usually use the outputs in one of these ways:
- Use the dataset as your main API for extracted entities.
- Use `output-openapi.json` as a contract for documentation, client generation, or downstream API tooling.
- Use `output-api-endpoints.json` and `output-postman-collection.json` to inspect and test real `fetch`/`xhr` endpoints observed on the target site.
Example dataset API call:
```shell
curl "https://api.apify.com/v2/datasets/<DATASET_ID>/items?token=<APIFY_TOKEN>&format=json&clean=true"
```
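The same call works from Python with only the standard library. This is a sketch: substitute your real dataset ID and token, and note that the query parameters mirror the curl example above:

```python
import json
import urllib.parse
import urllib.request


def dataset_items_url(dataset_id: str, token: str) -> str:
    """Build the dataset-items URL used in the curl example."""
    query = urllib.parse.urlencode(
        {"token": token, "format": "json", "clean": "true"})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"


def fetch_items(dataset_id: str, token: str) -> list:
    """Download all clean dataset items as a list of entity dicts."""
    with urllib.request.urlopen(dataset_items_url(dataset_id, token)) as resp:
        return json.load(resp)


# items = fetch_items("<DATASET_ID>", "<APIFY_TOKEN>")
print(dataset_items_url("<DATASET_ID>", "<APIFY_TOKEN>"))
```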
Useful artifact URLs:
- `https://api.apify.com/v2/key-value-stores/<KV_STORE_ID>/records/output-schema.json?token=<APIFY_TOKEN>`
- `https://api.apify.com/v2/key-value-stores/<KV_STORE_ID>/records/output-openapi.json?token=<APIFY_TOKEN>`
- `https://api.apify.com/v2/key-value-stores/<KV_STORE_ID>/records/output-api-endpoints.json?token=<APIFY_TOKEN>`
- `https://api.apify.com/v2/key-value-stores/<KV_STORE_ID>/records/output-postman-collection.json?token=<APIFY_TOKEN>`
Important note:
- `output-openapi.json` describes a stable API contract, but this Actor does not start a permanent standalone API server by itself.
- The discovered site APIs in `output-api-endpoints.json` may require auth headers, cookies, or CSRF tokens depending on the target website.
Postman example 📬
- Download `output-postman-collection.json` from the Actor run's default key-value store.
- Open Postman and use `Import` to load that collection.
- Open `Actor Outputs` to read the generated dataset and key-value store artifacts, or open `Discovered APIs` if the run captured real network endpoints.
- `Run Actor (Async)` returns a run object immediately. It does not wait for the Actor to finish.
- If you want Postman to wait for the run and return extracted entities directly, use `Run Actor Sync (Dataset Items)` instead.
- After `Run Actor (Async)`, use `Wait For Run Finish` or `Get Run Status` before fetching dataset or key-value store outputs.
- If the target site requires authentication, add the needed headers, cookies, or bearer token in Postman before retrying.
Example Postman flow:
- Import `output-postman-collection.json`
- Open `Actor Outputs`
- For quick testing, run `Run Actor Sync (Dataset Items)`
- If you need the run object and IDs first, run `Run Actor (Async)`
- The collection stores `{{run_id}}`, `{{dataset_id}}`, and `{{key_value_store_id}}` from the response automatically
- Run `Wait For Run Finish`
- When the run status becomes `SUCCEEDED`, select `Get Dataset Items`
- Fill `{{apify_token}}` if needed
- Review the prefilled URL
- Click `Send`
- Inspect the JSON response with the extracted entities, and save the request into your workspace if needed
Pricing expectations 💰
This Actor uses Apify platform resources (compute units, proxy traffic if enabled, and storage). Total cost depends on:
- Number of pages crawled (`maxPages`)
- Target website complexity and latency
- Proxy usage and retries
- Whether Playwright fallback is needed for JS-heavy pages
To keep runs cheap, start with a small `maxPages` value, review outputs, then scale gradually.
Limitations ⚠️
- No login/paywall bypass
- Heuristic extraction (no LLM post-processing)
- JS-heavy websites may still be only partially parsed, even with the Playwright fallback
- Proxy/network instability may cause retries and longer runs
FAQ ❓
Why does it crawl fewer pages than `maxPages`? 📉
`maxPages` is an upper bound, not a guarantee. The Actor may stop early if it cannot discover more valid URLs under your filters and depth limits.
Why do I see example.com in logs? 👀
If the input is missing or malformed, the UI prefill may be used. Always verify the run input and confirm `startUrls` contains your intended domain.
How do I get only specific pages? 🎯
Use `includePatterns`, `excludePatterns`, and a lower `maxDepth`. For strict control, use `extraction.mode = manual` with explicit selectors.
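To sanity-check pattern choices before a run, you can preview which URLs a pattern pair would keep. This sketch approximates the Actor's glob matching with Python's `fnmatch`, whose `*` already crosses `/` boundaries, so it is only a rough preview rather than the Actor's exact matcher:

```python
from fnmatch import fnmatch


def keep_url(url: str, include: list, exclude: list) -> bool:
    """Rough preview of include/exclude filtering (fnmatch approximation)."""
    included = any(fnmatch(url, pat) for pat in include)
    excluded = any(fnmatch(url, pat) for pat in exclude)
    return included and not excluded


include = ["**/*"]
exclude = ["**/*.pdf", "**/*.png"]
urls = ["https://docs.apify.com/academy", "https://docs.apify.com/manual.pdf"]
print([u for u in urls if keep_url(u, include, exclude)])
```

Running this locally against a list of candidate URLs is a cheap way to catch an overly broad `excludePatterns` entry before spending crawl budget on it.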
Can I use this as an MCP server directly? 🧩
The Actor generates MCP descriptor artifacts (`output-mcp.json`, `output-tools.json`). It does not run a permanent MCP server inside Apify.
Does it discover APIs too? 🔌
Yes. When a page needs the Playwright fallback, the Actor captures same-site `fetch` and `xhr` traffic that looks API-like (for example, JSON responses or `/api/*` endpoints). Those observations are stored in `output-api-endpoints.json` and exported as `output-postman-collection.json`.
What is `output-rendering-stats.json`? 🎭
It summarizes hybrid crawl behavior per domain, including HTML-only pages, fallback count, SPA shell detections, adaptive fallbacks, and render hit rate.
What does `output-capabilities.json` contain? 🗺️
A compact capability graph for agents, for example:
```json
{
  "entities": ["article", "form", "product"],
  "actions": ["fillField", "submitForm", "uploadDocument"],
  "auth": false,
  "pagination": true
}
```
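An agent can gate its behavior on this file. A minimal sketch, assuming exactly the structure shown above:

```python
import json

# Parse the capability graph (same shape as the example above).
capabilities = json.loads("""
{"entities": ["article", "form", "product"],
 "actions": ["fillField", "submitForm", "uploadDocument"],
 "auth": false,
 "pagination": true}
""")


def can(action: str) -> bool:
    """Check whether the crawled site exposes a given action."""
    return action in capabilities["actions"]


print(can("submitForm"))     # True
print(can("deleteAccount"))  # False
```

An agent would call `can(...)` before advertising or invoking a tool, so unsupported actions are never offered for that site.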
Ecosystem examples 🌐
You can present this Actor as a producer in a broader MCP ecosystem:
- web-mcp-hub
  - Use it as a reference for how MCP tools are organized and discovered.
  - Position this Actor as an upstream source that generates MCP-ready tool descriptors from crawled websites.
- webmcp-extension
  - Use it as a client-side integration example.
  - Demonstrate how generated artifacts (`output-mcp.json`, `output-tools.json`) can be consumed by extension/client workflows.
Suggested demo storyline:
- Run this Actor on a target site (for example `webmcp.dev`).
- Show extracted entities in the dataset.
- Open `output-schema.json` and `output-tools.json`.
- Explain how the generated MCP descriptors can plug into hub/extension-style consumers.
Support 🤝
- Open an issue in the Actor's Issues tab if results are unexpected
- Share run input (without secrets), run ID, and sample URLs for faster debugging
- Feature requests are welcome