Page to API - Sitemap to JSON

Pricing

Pay per usage

Turn any public site URL or sitemap.xml into a clean API-style JSON feed. Crawls a bounded set of pages (hard cap 50/run) and returns one structured record per page: title, meta, headings, links, main text, JSON-LD + OpenGraph. SSRF-guarded, pure code, no AI by default.

Developer

Ahmed Moussa

Maintained by Community

4 days ago

Last modified

Page to API — Sitemap to JSON

Turn any public site URL or sitemap.xml into a clean, API-style JSON feed.

What it does

Give it a page URL or a sitemap, and it crawls a bounded set of pages and returns one structured record per page (title, meta, headings, links, JSON-LD, OpenGraph and main text).

What you get per page

{
  "url": "https://example.com/",
  "status": "ok",
  "fetched_at": "2026-06-23T12:00:00+00:00",
  "title": "Example Domain",
  "meta_description": "...",
  "meta": { "og:title": "...", "description": "..." },
  "headings": [ { "level": "h1", "text": "Example Domain" } ],
  "links": [ { "href": "https://www.iana.org/domains/example", "text": "More information..." } ],
  "structured_data": [ { "@context": "https://schema.org", "@type": "WebPage" } ],
  "main_text": "Example Domain This domain is for use in...",
  "word_count": 28
}

Input

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| url | string | | A single page URL. If it points to a sitemap it is crawled as one. |
| sitemap_url | string | | A sitemap.xml (or sitemap index) URL. |
| max_pages | integer | 20 | Bounded crawl cap. Hard-capped at 50 per run. |
| llm_api_key | string (secret) | | Optional, your own key. The default path uses no AI. |

Provide either url or sitemap_url.
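
As an illustration of the input rules above, here is a minimal client-side sketch that builds a run input before anything is sent to the actor. The helper name `build_run_input` and the constants are hypothetical, not part of the actor; only the field names, the default of 20 and the cap of 50 come from this README. What the actor does when both `url` and `sitemap_url` are set is not documented here, so the sketch simply passes both through.

```python
from typing import Optional

HARD_CAP = 50           # per-run ceiling stated in this README
DEFAULT_MAX_PAGES = 20  # default from the input table


def build_run_input(
    url: Optional[str] = None,
    sitemap_url: Optional[str] = None,
    max_pages: int = DEFAULT_MAX_PAGES,
) -> dict:
    """Build an input dict using the documented field names (hypothetical helper)."""
    if not url and not sitemap_url:
        raise ValueError("provide either url or sitemap_url")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")

    # Clamp locally so the request reflects what the actor will actually crawl.
    run_input: dict = {"max_pages": min(max_pages, HARD_CAP)}
    if url:
        run_input["url"] = url
    if sitemap_url:
        run_input["sitemap_url"] = sitemap_url
    return run_input
```

Clamping on the client is optional, since the actor enforces the cap itself; it only makes the effective page count visible to the caller.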

Output

One item per page is pushed to the actor's default dataset (see the per-page schema above).
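
One possible way to consume that dataset from Python is sketched below. It assumes the official `apify-client` package is installed and that you supply your own API token and the actor's ID; `fetch_items` and `split_by_status` are hypothetical helper names. Because the exact non-ok status values are not listed in this README, the sketch treats anything other than "ok" as a failure.

```python
from typing import Iterable


def split_by_status(items: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Separate records with status "ok" from everything else."""
    ok, failed = [], []
    for item in items:
        (ok if item.get("status") == "ok" else failed).append(item)
    return ok, failed


def fetch_items(token: str, actor_id: str, run_input: dict) -> list[dict]:
    """Run the actor and return the items from its default dataset."""
    # Imported lazily so split_by_status works without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A typical call would be `ok, failed = split_by_status(fetch_items(token, "<username>/<actor-name>", {"url": "https://example.com/"}))`, where the actor ID is a placeholder to replace with the real one.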

Use cases

  • Headless-CMS-style JSON feed from a static or marketing site.
  • Bulk-ingest a site's pages (via sitemap) into a search index or RAG pipeline.
  • Snapshot a site's structured data (JSON-LD / OpenGraph) for monitoring.

How it works (deterministic, code-only)

Each URL (and every sitemap entry) is fetched through an SSRF-guarded client and parsed with regex + stdlib into title, meta, headings, links, JSON-LD, OpenGraph and main text. The crawl is hard-capped at 50 pages/run. No AI on the default path.
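
To make "regex + stdlib" concrete, the following is a rough sketch of how a title, headings and JSON-LD blocks could be pulled from raw HTML with only the standard library. It is not the actor's actual source: the patterns, `clean` and `parse_page` are assumptions for illustration, and only a subset of the record fields is produced.

```python
import html
import json
import re

TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.I | re.S)
HEADING_RE = re.compile(r"<(h[1-6])\b[^>]*>(.*?)</\1>", re.I | re.S)
JSONLD_RE = re.compile(
    r"<script[^>]*type=[\"']application/ld\+json[\"'][^>]*>(.*?)</script>",
    re.I | re.S,
)
TAG_RE = re.compile(r"<[^>]+>")


def clean(fragment: str) -> str:
    """Strip tags, decode HTML entities and collapse whitespace."""
    return " ".join(html.unescape(TAG_RE.sub(" ", fragment)).split())


def parse_page(url: str, body: str) -> dict:
    """Build a partial per-page record from an HTML string."""
    title = TITLE_RE.search(body)
    structured = []
    for raw in JSONLD_RE.findall(body):
        try:
            structured.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD instead of failing the page
    return {
        "url": url,
        "status": "ok",
        "title": clean(title.group(1)) if title else "",
        "headings": [
            {"level": level.lower(), "text": clean(text)}
            for level, text in HEADING_RE.findall(body)
        ],
        "structured_data": structured,
    }
```

Regex parsing of this kind is fast and dependency-free, but it only sees the HTML as served, which is consistent with the limitation on client-side-rendered pages noted below.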

Cost-safety & security (always on)

  • Deterministic, code-only parsing (regex + stdlib). No LLM, no paid API by default → no per-run AI/API cost.
  • Bounded crawl cap: at most max_pages pages per run, hard-capped at 50 regardless of input, so compute and cost per run stay bounded.
  • SSRF guard: every fetch (including the sitemap and every redirect hop) is re-validated; private / loopback / link-local / reserved IPs are blocked (fail-closed).
  • Bounded fetch: hard size cap (2 MB/page), connect/read timeouts, max 3 redirects, content-type allowlist.
  • Domain blocklist for login-walled / ToS-sensitive sites.
  • The extractor never hangs or raises — any failure yields a record with a non-ok status and an error message.
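
The SSRF guard described in the list above can be approximated with the standard library as follows. This is a sketch of the general technique, not the actor's implementation: `is_blocked_ip` and `is_safe_url` are hypothetical names, and the exact set of blocked ranges is an assumption based on the categories listed (private, loopback, link-local, reserved).

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_blocked_ip(ip_text: str) -> bool:
    """Return True for addresses that must not be fetched.

    Unparseable input is treated as blocked (fail-closed).
    """
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return True
    return (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )


def is_safe_url(url: str) -> bool:
    """Check the scheme and every resolved address; any failure means unsafe."""
    try:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.hostname:
            return False
        port = parts.port or (443 if parts.scheme == "https" else 80)
        infos = socket.getaddrinfo(parts.hostname, port, type=socket.SOCK_STREAM)
    except (ValueError, OSError):
        return False
    # Every address the hostname resolves to must be allowed.
    return bool(infos) and all(not is_blocked_ip(info[4][0]) for info in infos)
```

To match the behavior described above, a check like this would have to run again on each redirect target rather than only on the first URL. A resolve-then-fetch check can also be bypassed if DNS answers change between the check and the request, so a production guard would typically pin the connection to the validated address.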

Limitations (honest)

  • Hard cap of 50 pages per run; for larger sites, split the crawl across multiple runs.
  • Client-side-rendered pages (heavy JS) expose less content; there is no headless browser.
  • Login-walled / blocklisted domains are refused with a non-ok status.