Website to API & MCP Generator

Turn public websites into structured data, OpenAPI specs, and MCP-ready descriptors. Crawl pages, detect forms and API-like endpoints, and export clean outputs for agents, chatbots, automation, and developer workflows.

Pricing: from $0.50 / 1,000 results
Developer: Solutions Smart (Maintained by Community)
Actor stats: 0 bookmarks · 2 total users · 1 monthly active user · last modified 2 days ago

Website to API & MCP Generator ✨

Turn almost any public website into structured data, an OpenAPI spec, and MCP descriptors in one run.

Video walkthrough 🎥

Watch the demo here: https://youtu.be/-B3d1CYdRhE?si=wxhW0tdLJl2MGQmj

This Actor crawls pages from your start URLs, discovers detail pages, extracts entities (product, article, job, profile, form, fallback page), and stores normalized results in an Apify dataset. It is built for technical users who want configurable crawling and predictable output artifacts.

Quick start ⚡

  1. Open the Actor in Apify Console or run it locally with apify run.
  2. Add one or more URLs to startUrls.
  3. Keep extraction.mode set to auto for the first run.
  4. Start with a small crawl such as maxPages: 10.
  5. Review the dataset plus output-*.json artifacts, then scale up.
Recommended starting settings:

  • Use startUrls with the main public docs or product pages you want to map.
  • Keep maxPages between 10 and 50 until you confirm the output quality.
  • Keep maxDepth: 2 or 3 for most sites.
  • Keep concurrency: 5 to 10 for stable first runs.
  • Leave extraction.rendering.waitForSelector empty unless the site is strongly JS-driven and you know a reliable selector.
  • Set proxy.useApifyProxy to false for easy public sites and true for harder targets or region-sensitive sites.
  • Enable emitOpenApi and emitMcp if you want the generated API and MCP artifacts right away.

What can this Actor do? 🚀

  • Crawl websites with depth and page limits (maxDepth, maxPages)
  • Use hybrid crawling: Cheerio HTML extraction first, with adaptive Playwright fallback for SPA shells
  • Discover list/detail patterns automatically (or use manual extraction mode)
  • Extract structured fields from JSON-LD, OpenGraph, and DOM content
  • Track per-domain rendering hit rates to reduce unnecessary browser work
  • Discover HTML forms and extract field/action schemas (including file uploads and submit selectors)
  • Capture same-site fetch / xhr API endpoints during Playwright fallback
  • Deduplicate records by URL or URL+content fingerprint
  • Track changes between runs (added, removed, modified entities)
  • Generate machine-friendly artifacts:
    • output-schema.json
    • output-index.json
    • output-changes.json
    • output-capabilities.json
    • output-api-endpoints.json
    • output-rendering-stats.json
    • output-openapi.json (optional)
    • output-postman-collection.json
    • output-mcp.json and output-tools.json (optional)
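The change tracking behind output-changes.json can be thought of as a diff of two runs' id-to-fingerprint maps. The sketch below is a simplified illustration of that idea; the function and field names are assumptions, not the Actor's internal code:

```python
def diff_runs(prev: dict, curr: dict) -> dict:
    """Compare two runs' {entity_id: fingerprint} maps and classify changes."""
    added = sorted(k for k in curr if k not in prev)
    removed = sorted(k for k in prev if k not in curr)
    # Same id present in both runs but with a different content fingerprint.
    modified = sorted(k for k in curr if k in prev and curr[k] != prev[k])
    return {"added": added, "removed": removed, "modified": modified}

prev = {"a": "f1", "b": "f2"}
curr = {"b": "f2-changed", "c": "f3"}
print(diff_runs(prev, curr))  # {'added': ['c'], 'removed': ['a'], 'modified': ['b']}
```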

Why use it on Apify? ☁️

Running this Actor on Apify gives you more than a scraper script:

  • Scheduled runs and easy automation
  • API access to dataset and key-value store outputs
  • Built-in run logs and monitoring
  • Proxy configuration support (RESIDENTIAL, country targeting)
  • Integrations and webhooks for downstream workflows

What data can it extract? 🧩

The Actor always emits a normalized entity object and enriches fields by detected entity type.

Field                  Description
type                   Entity type (product, article, job, profile, form, page)
id                     Stable SHA-256 hash of the canonical URL (or a form-specific seed for form entities)
sourceUrl              Original crawled URL
canonicalUrl           Normalized canonical URL
title                  Best detected page/entity title
fields                 Type-specific extracted fields (price, author, company, etc.)
text                   Optional extracted text/markdown
images                 Collected image URLs
metadata.confidence    Extraction confidence score
metadata.fingerprint   Content fingerprint used for change detection
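For example, an id of this kind can be reproduced as the SHA-256 hex digest of the canonical URL. This is a sketch of the documented scheme only; the Actor's exact URL normalization (and the form-specific seeding) is not shown here:

```python
import hashlib

def entity_id(canonical_url: str) -> str:
    # SHA-256 hex digest of the canonical URL gives a stable 64-character id.
    return hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()

print(len(entity_id("https://docs.apify.com/academy")))  # 64
```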

How to use this Actor 🛠️

  1. Add at least one URL in startUrls.
  2. Set crawl controls (maxPages, maxDepth, concurrency).
  3. Optionally tune includePatterns / excludePatterns.
  4. Keep extraction.mode as auto, or switch to manual and provide selectors.
  5. Optionally tune extraction.rendering.timeoutSecs and extraction.rendering.waitForSelector for JS-heavy targets.
  6. Run the Actor.
  7. Read results in the dataset and output-*.json artifacts in the default key-value store.

Input example using https://docs.apify.com/ 📝

{
  "debug": false,
  "maxPages": 10,
  "startUrls": [
    { "url": "https://docs.apify.com/" }
  ],
  "maxDepth": 3,
  "concurrency": 10,
  "includePatterns": ["**/*"],
  "excludePatterns": [
    "**/*.pdf",
    "**/*.jpg",
    "**/*.png",
    "**/*.zip",
    "**/wp-admin/**"
  ],
  "entityHints": ["product", "article", "job", "profile"],
  "extraction": {
    "mode": "auto",
    "manual": {
      "listPageUrl": "",
      "listItemSelector": "",
      "detailLinkSelector": "",
      "fields": []
    },
    "rendering": {
      "timeoutSecs": 8,
      "waitForSelector": ""
    }
  },
  "output": {
    "datasetName": "entities",
    "emitOpenApi": true,
    "emitMcp": true,
    "emitMarkdown": false
  },
  "dedupe": {
    "enabled": true,
    "strategy": "url+contentHash",
    "changeDetection": true
  },
  "proxy": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"],
    "apifyProxyCountry": "DE"
  }
}

Hybrid crawling notes:

  • Cheerio is always the fast default path for HTML extraction.
  • Playwright is used only when the page looks like an SPA shell, HTML extraction is too thin, or the domain has shown a strong render hit rate earlier in the run.
  • extraction.rendering.timeoutSecs controls how long the fallback renderer waits for useful content.
  • extraction.rendering.waitForSelector lets you prioritize a specific selector on JS-heavy pages.
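The fallback decision described above can be sketched as a simple heuristic. The threshold values and names below are illustrative assumptions, not the Actor's actual internals:

```python
def should_render(text_len: int, render_hits: int, pages_seen: int,
                  min_text_len: int = 500, hit_rate_threshold: float = 0.5) -> bool:
    # Thin extracted HTML suggests an SPA shell: fall back to Playwright.
    if text_len < min_text_len:
        return True
    # A domain that frequently needed rendering before likely needs it again.
    if pages_seen > 0 and render_hits / pages_seen >= hit_rate_threshold:
        return True
    # Otherwise stay on the fast Cheerio path.
    return False

print(should_render(text_len=120, render_hits=0, pages_seen=0))    # True
print(should_render(text_len=4000, render_hits=1, pages_seen=10))  # False
```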

Output example from https://docs.apify.com/ 📦

{
  "type": "page",
  "id": "7d48ca31ddac67e2ad26b02c4fa26b9656c527eb537d269d46925f3aab45596d",
  "sourceUrl": "https://docs.apify.com/academy",
  "canonicalUrl": "https://docs.apify.com/academy",
  "title": "Apify Academy | Academy | Apify Documentation",
  "fields": {
    "description": "Learn everything about web scraping and automation with our free courses that will turn you into an expert scraper developer."
  },
  "text": "Apify AcademyCopy for LLMLearn everything about web scraping and automation...",
  "images": [
    "https://apify.com/og-image/docs-article?title=Apify+Academy",
    "https://docs.apify.com/img/apify_sdk.svg"
  ],
  "metadata": {
    "discoveredAt": "2026-03-14T09:57:20.651Z",
    "fingerprint": "bcd44f5d99e7ac5dc275f27246634c5ef745c103297de301be6b7dd99435fd6b",
    "listUrl": "https://docs.apify.com/academy",
    "confidence": 0.45
  }
}

Form entity example on a site that contains a form:

{
  "type": "form",
  "id": "bcda...",
  "sourceUrl": "https://target-site.example/contact",
  "canonicalUrl": "https://target-site.example/contact#form-1",
  "title": "Contact Form",
  "fields": {
    "formId": "contact_form",
    "method": "POST",
    "target": "https://target-site.example/contact",
    "fields": [
      { "name": "name", "type": "text", "required": true },
      { "name": "birthDate", "type": "date" },
      { "name": "documents", "type": "file" }
    ],
    "actions": [
      { "type": "submit", "selector": "button[type=\"submit\"]", "method": "POST" }
    ],
    "supportsFileUpload": true
  },
  "images": [],
  "metadata": {
    "discoveredAt": "2026-03-04T12:00:00.000Z",
    "fingerprint": "5c21...",
    "confidence": 0.96
  }
}

How to use the API outputs 🔌

After the Actor finishes, you usually use the outputs in one of these ways:

  • Use the dataset as your main API for extracted entities.
  • Use output-openapi.json as a contract for documentation, client generation, or downstream API tooling.
  • Use output-api-endpoints.json and output-postman-collection.json to inspect and test real fetch / xhr endpoints observed on the target site.

Example dataset API call:

curl "https://api.apify.com/v2/datasets/<DATASET_ID>/items?token=<APIFY_TOKEN>&format=json&clean=true"

Useful artifact URLs:

  • https://api.apify.com/v2/key-value-stores/<KV_STORE_ID>/records/output-schema.json?token=<APIFY_TOKEN>
  • https://api.apify.com/v2/key-value-stores/<KV_STORE_ID>/records/output-openapi.json?token=<APIFY_TOKEN>
  • https://api.apify.com/v2/key-value-stores/<KV_STORE_ID>/records/output-api-endpoints.json?token=<APIFY_TOKEN>
  • https://api.apify.com/v2/key-value-stores/<KV_STORE_ID>/records/output-postman-collection.json?token=<APIFY_TOKEN>
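A small helper can build these record URLs for any artifact key. This is a convenience sketch; it simply reproduces the URL shape listed above:

```python
def artifact_url(kv_store_id: str, key: str, token: str) -> str:
    # Apify key-value store record endpoint for a single artifact.
    return (f"https://api.apify.com/v2/key-value-stores/{kv_store_id}"
            f"/records/{key}?token={token}")

print(artifact_url("STORE123", "output-openapi.json", "TOKEN"))
# https://api.apify.com/v2/key-value-stores/STORE123/records/output-openapi.json?token=TOKEN
```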

Important note:

  • output-openapi.json describes a stable API contract, but this Actor does not start a permanent standalone API server by itself.
  • The discovered site APIs in output-api-endpoints.json may require auth headers, cookies, or CSRF tokens depending on the target website.

Postman example 📬

  1. Download output-postman-collection.json from the Actor run's default key-value store.
  2. Open Postman and use Import to load that collection.
  3. Open Actor Outputs to read the generated dataset and key-value store artifacts, or open Discovered APIs if the run captured real network endpoints.
  4. Run Actor (Async) returns a run object immediately. It does not wait for the actor to finish.
  5. If you want Postman to wait for the run and return extracted entities directly, use Run Actor Sync (Dataset Items) instead.
  6. After Run Actor (Async), use Wait For Run Finish or Get Run Status before fetching dataset or key-value store outputs.
  7. If the target site requires authentication, add the needed headers, cookies, or bearer token in Postman before retrying.

Example Postman flow:

  • Import output-postman-collection.json
  • Open Actor Outputs
  • For quick testing, run Run Actor Sync (Dataset Items)
  • If you need the run object and IDs first, run Run Actor (Async)
  • The collection stores {{run_id}}, {{dataset_id}}, and {{key_value_store_id}} from the response automatically
  • Run Wait For Run Finish
  • When the run status becomes SUCCEEDED, select Get Dataset Items
  • Fill {{apify_token}} if needed
  • Review the prefilled URL
  • Click Send
  • Inspect the JSON response with the extracted entities and save the request into your workspace if needed

Pricing expectations 💰

This Actor uses Apify platform resources (compute units, proxy traffic if enabled, and storage). Total cost depends on:

  • Number of pages crawled (maxPages)
  • Target website complexity and latency
  • Proxy usage and retries
  • Whether Playwright fallback is needed for JS-heavy pages

To keep runs cheap, start with a small maxPages value, review outputs, then scale gradually.

Limitations ⚠️

  • No login/paywall bypass
  • Heuristic extraction (no LLM post-processing)
  • JS-heavy websites may still be only partially parsed, even with the Playwright fallback
  • Proxy/network instability may cause retries and longer runs

FAQ ❓

Why does it crawl fewer pages than maxPages? 📉

maxPages is an upper bound, not a guarantee. The Actor may stop early if it cannot discover more valid URLs under your filters and depth limits.

Why do I see example.com in logs? 👀

If input is missing or malformed, the UI prefill may be used. Always verify the run input and confirm startUrls contains your intended domain.

How do I get only specific pages? 🎯

Use includePatterns, excludePatterns, and lower maxDepth. For strict control, use extraction.mode = manual with explicit selectors.

Can I use this as an MCP server directly? 🧩

The Actor generates MCP descriptor artifacts (output-mcp.json, output-tools.json). It does not run a permanent MCP server inside Apify.

Does it discover APIs too? 🔌

Yes. When a page needs Playwright fallback, the Actor captures same-site fetch and xhr traffic that looks API-like (for example JSON or /api/* endpoints). Those observations are stored in output-api-endpoints.json and exported as output-postman-collection.json.
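The "looks API-like" test can be approximated with a heuristic like the one below. This is an illustration of the signals mentioned above (JSON responses, /api/* paths), not the Actor's exact filter:

```python
from urllib.parse import urlparse

def looks_api_like(url: str, content_type: str = "") -> bool:
    # A JSON content type is the strongest signal of an API endpoint.
    if "application/json" in content_type.lower():
        return True
    # Otherwise fall back to a path convention such as /api/.
    return "/api/" in urlparse(url).path

print(looks_api_like("https://site.example/api/v1/items"))       # True
print(looks_api_like("https://site.example/page", "text/html"))  # False
```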

What is output-rendering-stats.json? 🎭

It summarizes hybrid crawl behavior per domain, including HTML-only pages, fallback count, SPA shell detections, adaptive fallbacks, and render hit rate.

What does output-capabilities.json contain? 🗺️

A compact capability graph for agents, for example:

{
  "entities": ["article", "form", "product"],
  "actions": ["fillField", "submitForm", "uploadDocument"],
  "auth": false,
  "pagination": true
}

Ecosystem examples 🌐

You can present this Actor as a producer in a broader MCP ecosystem:

  • web-mcp-hub
    • Use it as a reference for how MCP tools are organized and discovered.
    • Position this Actor as an upstream source that generates MCP-ready tool descriptors from crawled websites.
  • webmcp-extension
    • Use it as a client-side integration example.
    • Demonstrate how generated artifacts (output-mcp.json, output-tools.json) can be consumed by extension/client workflows.

Suggested demo storyline:

  1. Run this Actor on a target site (for example webmcp.dev).
  2. Show extracted entities in the dataset.
  3. Open output-schema.json and output-tools.json.
  4. Explain how the generated MCP descriptors can plug into hub/extension-style consumers.

Support 🤝

  • Open an issue in the Actor's Issues tab if results are unexpected
  • Share run input (without secrets), run ID, and sample URLs for faster debugging
  • Feature requests are welcome