Page to API - Sitemap to JSON
Pricing
Pay per usage
Turn any public site URL or sitemap.xml into a clean API-style JSON feed. Crawls a bounded set of pages (hard cap 50/run) and returns one structured record per page: title, meta, headings, links, main text, JSON-LD + OpenGraph. SSRF-guarded, pure code, no AI by default.
Rating
0.0
(0)
Developer
Ahmed Moussa
Maintained by Community
Actor stats
Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 4 days ago
Page to API — Sitemap to JSON
Turn any public site URL or sitemap.xml into a clean, API-style JSON feed.
What it does
Give it a page URL or a sitemap, and it crawls a bounded set of pages and returns one structured record per page (title, meta, headings, links, JSON-LD, OpenGraph and main text).
What you get per page
    {
      "url": "https://example.com/",
      "status": "ok",
      "fetched_at": "2026-06-23T12:00:00+00:00",
      "title": "Example Domain",
      "meta_description": "...",
      "meta": { "og:title": "...", "description": "..." },
      "headings": [ { "level": "h1", "text": "Example Domain" } ],
      "links": [ { "href": "https://www.iana.org/domains/example", "text": "More information..." } ],
      "structured_data": [ { "@context": "https://schema.org", "@type": "WebPage" } ],
      "main_text": "Example Domain This domain is for use in...",
      "word_count": 28
    }
Input
| Field | Type | Default | Notes |
|---|---|---|---|
| `url` | string | none | A single page URL. If it points to a sitemap, it is crawled as one. |
| `sitemap_url` | string | none | A sitemap.xml (or sitemap index) URL. |
| `max_pages` | integer | 20 | Bounded crawl cap. Hard-capped at 50 per run. |
| `llm_api_key` | string (secret) | none | Optional, your own key. The default path uses no AI. |
Provide either url or sitemap_url.
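
The input rules above (at least one source, `max_pages` defaulting to 20 and capped at 50) can be sketched as a small validation step. This is an illustration only, not the actor's actual code: the helper name `normalize_input` and the lower bound of 1 are assumptions.

```python
from typing import Any

DEFAULT_MAX_PAGES = 20  # documented default
HARD_CAP = 50           # documented per-run ceiling


def normalize_input(raw: dict[str, Any]) -> dict[str, Any]:
    """Validate actor input and clamp max_pages (hypothetical helper)."""
    url = (raw.get("url") or "").strip()
    sitemap_url = (raw.get("sitemap_url") or "").strip()
    if not url and not sitemap_url:
        raise ValueError("Provide either url or sitemap_url")

    requested = raw.get("max_pages", DEFAULT_MAX_PAGES)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(requested, bool) or not isinstance(requested, int):
        raise ValueError("max_pages must be an integer")

    # Assumption: values below 1 are raised to 1; only the upper cap is documented.
    max_pages = max(1, min(requested, HARD_CAP))

    return {
        "url": url or None,
        "sitemap_url": sitemap_url or None,
        "max_pages": max_pages,
    }
```

With this sketch, a request for `max_pages: 500` is reduced to 50, and an input with neither field raises an error before any fetch happens.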
Output
One item per page is pushed to the actor's default dataset (see the per-page schema above).
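
A downstream consumer will usually want to separate successful records from failed ones before indexing. The sketch below assumes the dataset items have already been loaded as a list of dicts (for example from a JSON export). The helper names and the `id` / `body` index fields are hypothetical; only `url`, `status`, `title`, `main_text` and `word_count` come from the per-page schema.

```python
from typing import Any, Iterable

Record = dict[str, Any]


def split_records(items: Iterable[Record]) -> tuple[list[Record], list[Record]]:
    """Partition dataset items into (ok, failed) by their status field."""
    ok: list[Record] = []
    failed: list[Record] = []
    for item in items:
        (ok if item.get("status") == "ok" else failed).append(item)
    return ok, failed


def to_index_doc(record: Record) -> Record:
    """Map one page record to a flat search-index document (assumed shape)."""
    return {
        "id": record["url"],
        "title": record.get("title") or "",
        "body": record.get("main_text") or "",
        "word_count": record.get("word_count", 0),
    }
```

Failed records keep their `url`, so they can be logged or retried in a later run rather than silently dropped.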
Use cases
- Headless-CMS-style JSON feed from a static or marketing site.
- Bulk-ingest a site's pages (via sitemap) into a search index or RAG pipeline.
- Snapshot a site's structured data (JSON-LD / OpenGraph) for monitoring.
How it works (deterministic, code-only)
Each URL (and every sitemap entry) is fetched through an SSRF-guarded client and parsed with regex + stdlib into title, meta, headings, links, JSON-LD, OpenGraph and main text. The crawl is hard-capped at 50 pages/run. No AI on the default path.
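
To make "regex + stdlib" parsing concrete, here is a minimal sketch that pulls the title and headings out of an HTML string. The patterns and function names are illustrative assumptions, not the actor's actual implementation, and regex-based HTML parsing will miss some malformed markup.

```python
import html
import re

TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# \1 forces the closing tag to match the opening heading level.
HEADING_RE = re.compile(r"<(h[1-6])\b[^>]*>(.*?)</\1\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def clean(fragment: str) -> str:
    """Strip nested tags, decode HTML entities, collapse whitespace."""
    text = html.unescape(TAG_RE.sub(" ", fragment))
    return " ".join(text.split())


def extract_title_and_headings(page: str) -> dict:
    """Return the title and headings in the shape of the per-page record."""
    match = TITLE_RE.search(page)
    return {
        "title": clean(match.group(1)) if match else None,
        "headings": [
            {"level": level.lower(), "text": clean(body)}
            for level, body in HEADING_RE.findall(page)
        ],
    }
```

The same approach extends to meta tags and links; JSON-LD blocks can be located with a similar pattern and decoded with `json.loads`.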
Cost-safety & security (always on)
- Deterministic, code-only parsing (regex + stdlib). No LLM, no paid API by default → no per-run AI/API cost.
- Bounded crawl cap: at most `max_pages` pages, hard-ceilinged at 50 regardless of input, so a run can never explode compute/cost.
- SSRF guard: every fetch (including the sitemap and every redirect hop) is re-validated; private / loopback / link-local / reserved IPs are blocked (fail-closed).
- Bounded fetch: hard size cap (2 MB/page), connect/read timeouts, max 3 redirects, content-type allowlist.
- Domain blocklist for login-walled / ToS-sensitive sites.
- The extractor never hangs or raises; any failure yields a record with a non-`ok` `status` and an `error` message.
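
The SSRF guard described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the actor's code: the function names are hypothetical, the `resolver` parameter exists only so the check can be exercised without network access, and the multicast and unspecified checks go beyond the categories the README lists.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_blocked_ip(address: str) -> bool:
    """True for private, loopback, link-local or reserved addresses."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        # Assumption: also block multicast and unspecified addresses.
        or ip.is_multicast
        or ip.is_unspecified
    )


def resolve_host(host: str) -> list[str]:
    """Resolve a hostname to all of its IP addresses."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def is_safe_url(url: str, resolver=resolve_host) -> bool:
    """Fail closed: any parse or resolution error counts as unsafe."""
    try:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.hostname:
            return False
        addresses = resolver(parts.hostname)
        # Every resolved address must be public, not just the first one.
        return bool(addresses) and not any(is_blocked_ip(a) for a in addresses)
    except (OSError, ValueError):
        return False
```

To match the "every redirect hop" behavior, a caller would disable automatic redirects and run `is_safe_url` on each `Location` target before following it. Note that checking and then fetching with a separate DNS lookup leaves a rebinding window; a stricter client would connect to the validated IP directly.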
Limitations (honest)
- Hard cap of 50 pages per run — for larger sites, page through multiple runs.
- Client-side-rendered pages (heavy JS) expose less content; there is no headless browser.
- Login-walled / blocklisted domains are refused with a non-`ok` `status`.