Website to API & MCP Generator

Turn public websites into structured data, OpenAPI specs, and MCP-ready descriptors. Crawl pages, detect forms and API-like endpoints, and export clean outputs for agents, chatbots, automation, and developer workflows.

Pricing: from $0.50 / 1,000 results
Developer: Solutions Smart
Website to API & MCP Generator ✨
Turn almost any public website into structured data, an OpenAPI spec, and MCP descriptors in one run.
Video walkthrough 🎥
Watch the demo here: https://youtu.be/-B3d1CYdRhE?si=wxhW0tdLJl2MGQmj
This Actor crawls pages from your start URLs, discovers detail pages, extracts entities (product, article, job, profile, form, fallback page), and stores normalized results in an Apify dataset. It is built for technical users who want configurable crawling and predictable output artifacts.
Quick start ⚡
- Open the Actor in Apify Console or run it locally with `apify run`.
- Add one or more URLs to `startUrls`.
- Keep `extraction.mode` set to `auto` for the first run.
- Start with a small crawl such as `maxPages: 10`.
- Review the dataset plus `output-*.json` artifacts, then scale up.
Recommended first run ✅
- Use `startUrls` with the main public docs or product pages you want to map.
- Keep `maxPages` between `10` and `50` until you confirm the output quality.
- Keep `maxDepth: 2` or `3` for most sites.
- Keep `concurrency: 5` to `10` for stable first runs.
- Leave `extraction.rendering.waitForSelector` empty unless the site is strongly JS-driven and you know a reliable selector.
- Set `proxy.useApifyProxy` to `false` for easy public sites and `true` for harder targets or region-sensitive sites.
- Enable `emitOpenApi` and `emitMcp` if you want the generated API and MCP artifacts right away.
What can this Actor do? 🚀
- Crawl websites with depth and page limits (`maxDepth`, `maxPages`)
- Use hybrid crawling: Cheerio HTML extraction first, with adaptive Playwright fallback for SPA shells
- Discover list/detail patterns automatically (or use manual extraction mode)
- Extract structured fields from JSON-LD, OpenGraph, and DOM content
- Track per-domain rendering hit rates to reduce unnecessary browser work
- Discover HTML forms and extract field/action schemas (including file uploads and submit selectors)
- Capture same-site `fetch`/`xhr` API endpoints during Playwright fallback
- Deduplicate records by URL or URL + content fingerprint
- Track changes between runs (added, removed, modified entities)
- Generate machine-friendly artifacts: `output-schema.json`, `output-index.json`, `output-changes.json`, `output-capabilities.json`, `output-api-endpoints.json`, `output-rendering-stats.json`, `output-openapi.json` (optional), `output-postman-collection.json`, and `output-mcp.json` and `output-tools.json` (optional)
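The change-tracking idea behind `output-changes.json` can be pictured with a small sketch. The Actor's real fingerprint algorithm is internal; hashing the URL plus the extracted text, as below, is an assumption chosen only to illustrate the added/removed/modified classification:

```python
import hashlib


def fingerprint(url: str, text: str) -> str:
    """Illustrative content fingerprint: SHA-256 over URL + extracted text.
    The Actor's actual fingerprinting is internal; this only mimics the idea."""
    return hashlib.sha256(f"{url}\n{text}".encode("utf-8")).hexdigest()


def diff_runs(previous: dict, current: dict) -> dict:
    """Classify entities between two runs. Each dict maps
    canonical URL -> fingerprint."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    modified = sorted(u for u in current.keys() & previous.keys()
                      if current[u] != previous[u])
    return {"added": added, "removed": removed, "modified": modified}


old = {"https://a.example/1": fingerprint("https://a.example/1", "v1")}
new = {"https://a.example/1": fingerprint("https://a.example/1", "v2"),
       "https://a.example/2": fingerprint("https://a.example/2", "hi")}
print(diff_runs(old, new))
```

With the inputs above, `https://a.example/2` is reported as added and `https://a.example/1` as modified, which is the same three-way split the Actor emits between runs.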
Why use it on Apify? ☁️
Running this Actor on Apify gives you more than a scraper script:
- Scheduled runs and easy automation
- API access to dataset and key-value store outputs
- Built-in run logs and monitoring
- Proxy configuration support (`RESIDENTIAL`, country targeting)
- Integrations and webhooks for downstream workflows
What data can it extract? 🧩
The Actor always emits a normalized entity object and enriches fields by detected entity type.
| Field | Description |
|---|---|
| `type` | Entity type (`product`, `article`, `job`, `profile`, `form`, `page`) |
| `id` | Stable SHA-256 hash of the canonical URL (or a form-specific seed for form entities) |
| `sourceUrl` | Original crawled URL |
| `canonicalUrl` | Normalized canonical URL |
| `title` | Best detected page/entity title |
| `fields` | Type-specific extracted fields (price, author, company, etc.) |
| `text` | Optional extracted text/markdown |
| `images` | Collected image URLs |
| `metadata.confidence` | Extraction confidence score |
| `metadata.fingerprint` | Content fingerprint used for change detection |
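The `id` scheme can be approximated in a few lines. The Actor may normalize the URL before hashing, so treat this as an illustration of the idea rather than the exact algorithm:

```python
import hashlib


def entity_id(canonical_url: str) -> str:
    # Hex SHA-256 digest of the canonical URL string: 64 hex characters,
    # stable across runs for the same URL.
    return hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()


eid = entity_id("https://docs.apify.com/academy")
print(len(eid))  # 64
```

Because the hash is deterministic, the same canonical URL always yields the same `id`, which is what makes deduplication and change detection across runs possible.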
How to use this Actor 🛠️
- Add at least one URL in `startUrls`.
- Set crawl controls (`maxPages`, `maxDepth`, `concurrency`).
- Optionally tune `includePatterns`/`excludePatterns`.
- Keep `extraction.mode` as `auto`, or switch to `manual` and provide selectors.
- Optionally tune `extraction.rendering.timeoutSecs` and `extraction.rendering.waitForSelector` for JS-heavy targets.
- Run the Actor.
- Read results in the dataset and `output-*.json` artifacts in the default key-value store.
Input example using https://docs.apify.com/ 📝
```json
{
  "debug": false,
  "maxPages": 10,
  "startUrls": [{ "url": "https://docs.apify.com/" }],
  "maxDepth": 3,
  "concurrency": 10,
  "includePatterns": ["**/*"],
  "excludePatterns": ["**/*.pdf", "**/*.jpg", "**/*.png", "**/*.zip", "**/wp-admin/**"],
  "entityHints": ["product", "article", "job", "profile"],
  "extraction": {
    "mode": "auto",
    "manual": {
      "listPageUrl": "",
      "listItemSelector": "",
      "detailLinkSelector": "",
      "fields": []
    },
    "rendering": { "timeoutSecs": 8, "waitForSelector": "" }
  },
  "output": {
    "datasetName": "entities",
    "emitOpenApi": true,
    "emitMcp": true,
    "emitMarkdown": false
  },
  "dedupe": {
    "enabled": true,
    "strategy": "url+contentHash",
    "changeDetection": true
  },
  "proxy": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"],
    "apifyProxyCountry": "DE"
  }
}
```
Hybrid crawling notes:
- Cheerio is always the fast default path for HTML extraction.
- Playwright is used only when the page looks like an SPA shell, HTML extraction is too thin, or the domain has shown a strong render hit rate earlier in the run.
- `extraction.rendering.timeoutSecs` controls how long the fallback renderer waits for useful content.
- `extraction.rendering.waitForSelector` lets you prioritize a specific selector on JS-heavy pages.
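The fallback decision can be pictured as a simple heuristic. The thresholds and signals below are illustrative assumptions, not the Actor's actual tuning:

```python
def should_render(html_text_length: int, looks_like_spa_shell: bool,
                  domain_render_hit_rate: float) -> bool:
    """Decide whether to escalate from Cheerio to Playwright.
    Both thresholds are assumed values for illustration only."""
    MIN_USEFUL_TEXT = 500   # below this, HTML extraction counts as "too thin"
    STRONG_HIT_RATE = 0.6   # domain historically needed rendering this often
    if looks_like_spa_shell:
        return True
    if html_text_length < MIN_USEFUL_TEXT:
        return True
    return domain_render_hit_rate >= STRONG_HIT_RATE


print(should_render(5000, False, 0.1))  # rich HTML, no SPA signals
print(should_render(120, False, 0.0))   # extraction too thin
```

The per-domain hit rate is what makes the fallback adaptive: once a domain keeps requiring rendering, later pages from that domain skip the cheap path sooner.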
Output example from https://docs.apify.com/ 📦
```json
{
  "type": "page",
  "id": "7d48ca31ddac67e2ad26b02c4fa26b9656c527eb537d269d46925f3aab45596d",
  "sourceUrl": "https://docs.apify.com/academy",
  "canonicalUrl": "https://docs.apify.com/academy",
  "title": "Apify Academy | Academy | Apify Documentation",
  "fields": {
    "description": "Learn everything about web scraping and automation with our free courses that will turn you into an expert scraper developer."
  },
  "text": "Apify AcademyCopy for LLMLearn everything about web scraping and automation...",
  "images": [
    "https://apify.com/og-image/docs-article?title=Apify+Academy",
    "https://docs.apify.com/img/apify_sdk.svg"
  ],
  "metadata": {
    "discoveredAt": "2026-03-14T09:57:20.651Z",
    "fingerprint": "bcd44f5d99e7ac5dc275f27246634c5ef745c103297de301be6b7dd99435fd6b",
    "listUrl": "https://docs.apify.com/academy",
    "confidence": 0.45
  }
}
```
Form entity example on a site that contains a form:
```json
{
  "type": "form",
  "id": "bcda...",
  "sourceUrl": "https://target-site.example/contact",
  "canonicalUrl": "https://target-site.example/contact#form-1",
  "title": "Contact Form",
  "fields": {
    "formId": "contact_form",
    "method": "POST",
    "target": "https://target-site.example/contact",
    "fields": [
      { "name": "name", "type": "text", "required": true },
      { "name": "birthDate", "type": "date" },
      { "name": "documents", "type": "file" }
    ],
    "actions": [
      { "type": "submit", "selector": "button[type=\"submit\"]", "method": "POST" }
    ],
    "supportsFileUpload": true
  },
  "images": [],
  "metadata": {
    "discoveredAt": "2026-03-04T12:00:00.000Z",
    "fingerprint": "5c21...",
    "confidence": 0.96
  }
}
```
How to use the API outputs 🔌
After the Actor finishes, you usually use the outputs in one of these ways:
- Use the dataset as your main API for extracted entities.
- Use `output-openapi.json` as a contract for documentation, client generation, or downstream API tooling.
- Use `output-api-endpoints.json` and `output-postman-collection.json` to inspect and test real `fetch`/`xhr` endpoints observed on the target site.
Example dataset API call:
```shell
curl "https://api.apify.com/v2/datasets/<DATASET_ID>/items?token=<APIFY_TOKEN>&format=json&clean=true"
```
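The same call works from Python with only the standard library. This is a sketch: substitute your real dataset ID and token, and note that the query parameters mirror the curl example above:

```python
import json
import urllib.parse
import urllib.request


def dataset_items_url(dataset_id: str, token: str) -> str:
    """Build the dataset-items URL used in the curl example."""
    query = urllib.parse.urlencode(
        {"token": token, "format": "json", "clean": "true"})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"


def fetch_items(dataset_id: str, token: str) -> list:
    """Download all clean dataset items as a list of entity dicts."""
    with urllib.request.urlopen(dataset_items_url(dataset_id, token)) as resp:
        return json.load(resp)


# items = fetch_items("<DATASET_ID>", "<APIFY_TOKEN>")
print(dataset_items_url("<DATASET_ID>", "<APIFY_TOKEN>"))
```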
Useful artifact URLs:
- `https://api.apify.com/v2/key-value-stores/<KV_STORE_ID>/records/output-schema.json?token=<APIFY_TOKEN>`
- `https://api.apify.com/v2/key-value-stores/<KV_STORE_ID>/records/output-openapi.json?token=<APIFY_TOKEN>`
- `https://api.apify.com/v2/key-value-stores/<KV_STORE_ID>/records/output-api-endpoints.json?token=<APIFY_TOKEN>`
- `https://api.apify.com/v2/key-value-stores/<KV_STORE_ID>/records/output-postman-collection.json?token=<APIFY_TOKEN>`
Important note:
- `output-openapi.json` describes a stable API contract, but this Actor does not start a permanent standalone API server by itself.
- The discovered site APIs in `output-api-endpoints.json` may require auth headers, cookies, or CSRF tokens depending on the target website.
Postman example 📬
- Download `output-postman-collection.json` from the Actor run's default key-value store.
- Open Postman and use `Import` to load that collection.
- Open `Actor Outputs` to read the generated dataset and key-value store artifacts, or open `Discovered APIs` if the run captured real network endpoints.
- `Run Actor (Async)` returns a run object immediately. It does not wait for the Actor to finish.
- If you want Postman to wait for the run and return extracted entities directly, use `Run Actor Sync (Dataset Items)` instead.
- After `Run Actor (Async)`, use `Wait For Run Finish` or `Get Run Status` before fetching dataset or key-value store outputs.
- If the target site requires authentication, add the needed headers, cookies, or bearer token in Postman before retrying.
Example Postman flow:
- Import `output-postman-collection.json`
- Open `Actor Outputs`
- For quick testing, run `Run Actor Sync (Dataset Items)`
- If you need the run object and IDs first, run `Run Actor (Async)`
- The collection stores `{{run_id}}`, `{{dataset_id}}`, and `{{key_value_store_id}}` from the response automatically
- Run `Wait For Run Finish`
- When the run status becomes `SUCCEEDED`, select `Get Dataset Items`
- Fill `{{apify_token}}` if needed
- Review the prefilled URL
- Click `Send`
- Inspect the JSON response with the extracted entities, and save the request into your workspace if needed
Pricing expectations 💰
This Actor uses Apify platform resources (compute units, proxy traffic if enabled, and storage). Total cost depends on:
- Number of pages crawled (`maxPages`)
- Target website complexity and latency
- Proxy usage and retries
- Whether Playwright fallback is needed for JS-heavy pages
To keep runs cheap, start with a small `maxPages` value, review outputs, then scale gradually.
Limitations ⚠️
- No login/paywall bypass
- Heuristic extraction (no LLM post-processing)
- JS-heavy websites may still be only partially parsed, even with the Playwright fallback
- Proxy/network instability may cause retries and longer runs
FAQ ❓
Why does it crawl fewer pages than `maxPages`? 📉
`maxPages` is an upper bound, not a guarantee. The Actor may stop early if it cannot discover more valid URLs under your filters and depth limits.
Why do I see example.com in logs? 👀
If the input is missing or malformed, the UI prefill may be used. Always verify the run input and confirm `startUrls` contains your intended domain.
How do I get only specific pages? 🎯
Use `includePatterns`, `excludePatterns`, and a lower `maxDepth`. For strict control, use `extraction.mode = manual` with explicit selectors.
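To sanity-check pattern choices before a run, you can preview which URLs a pattern pair would keep. This sketch approximates the Actor's glob matching with Python's `fnmatch`, whose `*` already crosses `/` boundaries, so it is only a rough preview rather than the Actor's exact matcher:

```python
from fnmatch import fnmatch


def keep_url(url: str, include: list, exclude: list) -> bool:
    """Rough preview of include/exclude filtering (fnmatch approximation)."""
    included = any(fnmatch(url, pat) for pat in include)
    excluded = any(fnmatch(url, pat) for pat in exclude)
    return included and not excluded


include = ["**/*"]
exclude = ["**/*.pdf", "**/*.png"]
urls = ["https://docs.apify.com/academy", "https://docs.apify.com/manual.pdf"]
print([u for u in urls if keep_url(u, include, exclude)])
```

Running this locally against a list of candidate URLs is a cheap way to catch an overly broad `excludePatterns` entry before spending crawl budget on it.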
Can I use this as an MCP server directly? 🧩
The Actor generates MCP descriptor artifacts (`output-mcp.json`, `output-tools.json`). It does not run a permanent MCP server inside Apify.
Does it discover APIs too? 🔌
Yes. When a page needs the Playwright fallback, the Actor captures same-site `fetch` and `xhr` traffic that looks API-like (for example, JSON responses or `/api/*` endpoints). Those observations are stored in `output-api-endpoints.json` and exported as `output-postman-collection.json`.
What is `output-rendering-stats.json`? 🎭
It summarizes hybrid crawl behavior per domain, including HTML-only pages, fallback count, SPA shell detections, adaptive fallbacks, and render hit rate.
What does `output-capabilities.json` contain? 🗺️
A compact capability graph for agents, for example:
```json
{
  "entities": ["article", "form", "product"],
  "actions": ["fillField", "submitForm", "uploadDocument"],
  "auth": false,
  "pagination": true
}
```
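An agent can gate its behavior on this file. A minimal sketch, assuming exactly the structure shown above:

```python
import json

# Parse the capability graph (same shape as the example above).
capabilities = json.loads("""
{"entities": ["article", "form", "product"],
 "actions": ["fillField", "submitForm", "uploadDocument"],
 "auth": false,
 "pagination": true}
""")


def can(action: str) -> bool:
    """Check whether the crawled site exposes a given action."""
    return action in capabilities["actions"]


print(can("submitForm"))     # True
print(can("deleteAccount"))  # False
```

An agent would call `can(...)` before advertising or invoking a tool, so unsupported actions are never offered for that site.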
Ecosystem examples 🌐
You can present this Actor as a producer in a broader MCP ecosystem:
- web-mcp-hub
  - Use it as a reference for how MCP tools are organized and discovered.
  - Position this Actor as an upstream source that generates MCP-ready tool descriptors from crawled websites.
- webmcp-extension
  - Use it as a client-side integration example.
  - Demonstrate how generated artifacts (`output-mcp.json`, `output-tools.json`) can be consumed by extension/client workflows.
Suggested demo storyline:
- Run this Actor on a target site (for example `webmcp.dev`).
- Show extracted entities in the dataset.
- Open `output-schema.json` and `output-tools.json`.
- Explain how the generated MCP descriptors can plug into hub/extension-style consumers.
Support 🤝
- Open an issue in the Actor's Issues tab if results are unexpected
- Share run input (without secrets), run ID, and sample URLs for faster debugging
- Feature requests are welcome