Pricing

$1.00 / 1,000 results

Page to API - Sitemap to JSON

Turn any public site URL or sitemap.xml into a clean API-style JSON feed. Crawls a bounded set of pages (hard cap 50/run) and returns one structured record per page: title, meta, headings, links, main text, JSON-LD + OpenGraph. SSRF-guarded, pure code, no AI by default.

Developer

Ahmed Moussa

What it does

Give it a page URL or a sitemap, and it crawls a bounded set of pages and returns one structured record per page (title, meta, headings, links, JSON-LD, OpenGraph and main text).

What you get per page

{
  "url": "https://example.com/",
  "status": "ok",
  "fetched_at": "2026-06-23T12:00:00+00:00",
  "title": "Example Domain",
  "meta_description": "...",
  "meta": { "og:title": "...", "description": "..." },
  "headings": [ { "level": "h1", "text": "Example Domain" } ],
  "links": [ { "href": "https://www.iana.org/domains/example", "text": "More information..." } ],
  "structured_data": [ { "@context": "https://schema.org", "@type": "WebPage" } ],
  "main_text": "Example Domain This domain is for use in...",
  "word_count": 28
}

Input

| Field | Type | Default | Notes |
|---|---|---|---|
| `url` | string | (none) | A single page URL. If it points to a sitemap, it is crawled as a sitemap. |
| `sitemap_url` | string | (none) | A `sitemap.xml` (or sitemap index) URL. |
| `max_pages` | integer | 20 | Bounded crawl cap. Hard-capped at 50 per run. |
| `llm_api_key` | string (secret) | (none) | Optional; your own key. The default path uses no AI. |

Provide either url or sitemap_url.
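
The input rules above can be sketched as a small validation step. This is only an illustration: `normalize_input` is a hypothetical helper, not the actor's code, and both the `ValueError` on missing input and the lower bound of 1 are assumptions. The README does not say what happens when both fields are supplied, so the sketch passes both through unchanged.

```python
from typing import Optional

HARD_CAP = 50           # per-run ceiling stated in this README
DEFAULT_MAX_PAGES = 20  # default from the input table


def normalize_input(raw: dict) -> dict:
    """Validate a run input and clamp max_pages (hypothetical helper)."""
    url: Optional[str] = raw.get("url") or None
    sitemap_url: Optional[str] = raw.get("sitemap_url") or None
    if not url and not sitemap_url:
        # Assumption: the real actor may report this differently.
        raise ValueError("provide either url or sitemap_url")

    max_pages = int(raw.get("max_pages", DEFAULT_MAX_PAGES))
    # Clamp into [1, HARD_CAP]; the lower bound of 1 is an assumption.
    max_pages = max(1, min(max_pages, HARD_CAP))

    return {"url": url, "sitemap_url": sitemap_url, "max_pages": max_pages}


if __name__ == "__main__":
    print(normalize_input({"sitemap_url": "https://example.com/sitemap.xml",
                           "max_pages": 500}))
```

Requesting 500 pages in this sketch yields a `max_pages` of 50, matching the hard cap described above.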

Output

One item per page is pushed to the actor's default dataset (see the per-page schema above).
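
Because failed pages still produce a record, a consumer will usually want to separate usable records from failures before indexing. A minimal sketch, assuming the items have already been downloaded as a list of dicts; `split_by_status` is a hypothetical helper, and the `"error"` status value and `error` field name in the sample data are assumptions, since the README only promises a non-ok status plus an error message.

```python
from typing import Iterable


def split_by_status(records: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Partition dataset items into (ok, failed) by their status field."""
    ok: list[dict] = []
    failed: list[dict] = []
    for rec in records:
        (ok if rec.get("status") == "ok" else failed).append(rec)
    return ok, failed


if __name__ == "__main__":
    items = [
        {"url": "https://example.com/", "status": "ok", "word_count": 28},
        # Field names and values below are illustrative assumptions.
        {"url": "https://example.com/app", "status": "error", "error": "timeout"},
    ]
    good, bad = split_by_status(items)
    print(len(good), "ok;", len(bad), "failed")
```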

Use cases

Headless-CMS-style JSON feed from a static or marketing site.
Bulk-ingest a site's pages (via sitemap) into a search index or RAG pipeline.
Snapshot a site's structured data (JSON-LD / OpenGraph) for monitoring.

How it works (deterministic, code-only)

Each URL (and every sitemap entry) is fetched through an SSRF-guarded client and parsed with regex + stdlib into title, meta, headings, links, JSON-LD, OpenGraph and main text. The crawl is hard-capped at 50 pages/run. No AI on the default path.
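
To give a feel for what regex + stdlib parsing can look like, here is a minimal sketch covering only the title, headings and JSON-LD fields. The patterns and the `parse_page` name are assumptions for illustration; the actor's actual expressions are not published in this README and are likely more thorough.

```python
import html
import json
import re

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)
HEADING_RE = re.compile(r"<(h[1-6])\b[^>]*>(.*?)</\1\s*>", re.I | re.S)
JSONLD_RE = re.compile(
    r"<script[^>]*type=[\"']application/ld\+json[\"'][^>]*>(.*?)</script>",
    re.I | re.S,
)
TAG_RE = re.compile(r"<[^>]+>")


def _text(fragment: str) -> str:
    """Strip tags, decode HTML entities, collapse whitespace."""
    return " ".join(html.unescape(TAG_RE.sub(" ", fragment)).split())


def parse_page(markup: str) -> dict:
    """Extract a subset of the per-page record (hypothetical sketch)."""
    title = TITLE_RE.search(markup)
    structured = []
    for block in JSONLD_RE.findall(markup):
        try:
            structured.append(json.loads(block))
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD rather than fail the page
    return {
        "title": _text(title.group(1)) if title else None,
        "headings": [
            {"level": tag.lower(), "text": _text(body)}
            for tag, body in HEADING_RE.findall(markup)
        ],
        "structured_data": structured,
    }


if __name__ == "__main__":
    page = (
        "<html><head><title>Example Domain</title>"
        '<script type="application/ld+json">'
        '{"@context": "https://schema.org", "@type": "WebPage"}</script>'
        "</head><body><h1>Example Domain</h1></body></html>"
    )
    print(parse_page(page))
```

Regex-based parsing keeps the dependency footprint at zero, at the price of being less tolerant of unusual markup than a full HTML parser.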

Cost-safety & security (always on)

Deterministic, code-only parsing (regex + stdlib). No LLM, no paid API by default → no per-run AI/API cost.
Bounded crawl cap: at most max_pages pages, hard-ceilinged at 50 regardless of input, so a single run cannot incur unbounded compute or cost.
SSRF guard: every fetch (including the sitemap and every redirect hop) is re-validated; private / loopback / link-local / reserved IPs are blocked (fail-closed).
Bounded fetch: hard size cap (2 MB/page), connect/read timeouts, max 3 redirects, content-type allowlist.
Domain blocklist for login-walled / ToS-sensitive sites.
The extractor never hangs or raises — any failure yields a record with a non-ok status and an error message.
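
The SSRF guard in the list above can be approximated with the standard library alone. The sketch below is an assumption about the general technique (resolve the host, reject any non-public address, fail closed on errors), not the actor's implementation; `is_public_ip` and `is_url_allowed` are hypothetical names.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(addr: str) -> bool:
    """True only for addresses outside private/loopback/link-local/reserved ranges."""
    ip = ipaddress.ip_address(addr)
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )


def is_url_allowed(url: str) -> bool:
    """Fail closed: any parse or resolution problem means 'blocked'."""
    try:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.hostname:
            return False
        infos = socket.getaddrinfo(
            parts.hostname, parts.port or 443, proto=socket.IPPROTO_TCP
        )
        # Every address the host resolves to must be public.
        return bool(infos) and all(is_public_ip(info[4][0]) for info in infos)
    except (ValueError, OSError):
        return False


if __name__ == "__main__":
    for candidate in ("http://127.0.0.1/", "http://169.254.169.254/latest/",
                      "ftp://example.com/"):
        print(candidate, is_url_allowed(candidate))
```

To match the behaviour described above, a caller would run the same check on the sitemap URL, on every page URL, and again on each redirect target before following it. A production guard would also need to connect to the address it validated, which this sketch does not attempt.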

Limitations (honest)

Hard cap of 50 pages per run; larger sites have to be covered across multiple runs.
Client-side-rendered pages (heavy JS) expose less content; there is no headless browser.
Login-walled / blocklisted domains are refused with a non-ok status.
