DataPulse URL Extractor
Pricing
Pay per usage
Deterministic, SSRF-guarded structured-data extraction from any public URL. Returns title, meta tags, headings, links and clean text with a code-computed summary. Optional AI enrichment.
Developer
Ahmed Moussa
Maintained by Community
What it does
For each input URL the actor returns a single dataset item with:
- url: the requested URL
- status: completed, blocked, failed, or empty
- data: the structured extraction, with title, meta_tags, headings, links, text (and, when AI enrichment is enabled, an ai_extracted block)
- meta: code-owned and deterministic, with extracted_at (real server time), method, and a summary (row_count, numeric stats) computed in code from the extracted data and never taken from any model
Input
{
  "url": "https://example.com",
  "urls": ["https://www.iana.org/help/example-domains"],
  "schema_hint": "company info",
  "llm_api_key": "(optional; enables AI enrichment)",
  "llm_model": "deepseek/deepseek-chat"
}
url is a single public http(s) URL and urls is a list of them. If neither llm_api_key nor the
OPENROUTER_API_KEY secret is set, the actor returns a fully deterministic, code-only extraction.
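As a rough illustration of the input rules above, the sketch below merges url and urls into one list and decides whether AI enrichment would be enabled. The helper names collect_urls and enrichment_enabled are hypothetical, and the de-duplication step is an assumption; the actor's real input handling may differ.

```python
import os


def collect_urls(actor_input: dict) -> list[str]:
    """Merge `url` and `urls` into one ordered list without duplicates.

    Assumption: the actor treats both fields as one work list; this is
    not confirmed by the documentation.
    """
    candidates = []
    single = actor_input.get("url")
    if single:
        candidates.append(single)
    candidates.extend(actor_input.get("urls") or [])

    seen = set()
    ordered = []
    for candidate in candidates:
        if candidate not in seen:
            seen.add(candidate)
            ordered.append(candidate)
    return ordered


def enrichment_enabled(actor_input: dict, env=None) -> bool:
    """Return True only if a non-empty key is supplied via input or secret."""
    env = os.environ if env is None else env
    key = actor_input.get("llm_api_key") or env.get("OPENROUTER_API_KEY") or ""
    return bool(key.strip())
```

With an empty llm_api_key and no OPENROUTER_API_KEY in the environment, enrichment_enabled returns False, which corresponds to the code-only path.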
Output
One JSON item per URL pushed to the default dataset:
{
  "url": "https://example.com",
  "status": "completed",
  "data": {
    "title": "Example Domain",
    "meta_tags": { "description": "..." },
    "headings": [ { "level": "h1", "text": "Example Domain" } ],
    "links": [ { "href": "https://www.iana.org/domains/example", "text": "More information..." } ],
    "text": "Example Domain This domain is for use in..."
  },
  "meta": {
    "extracted_at": "2026-06-23T00:00:00+00:00",
    "method": "deterministic_code",
    "summary": { "row_count": 1 }
  }
}
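A downstream consumer will usually want to separate completed items from blocked, failed, or empty ones before reading data. The sketch below assumes the items have already been loaded from the dataset as Python dicts shaped like the example above; group_by_status and completed_titles are hypothetical helper names, and whether non-completed items carry a data block is not stated, so the code guards against its absence.

```python
from collections import defaultdict


def group_by_status(items):
    """Bucket dataset items by their `status` field."""
    groups = defaultdict(list)
    for item in items:
        groups[item.get("status", "unknown")].append(item)
    return dict(groups)


def completed_titles(items):
    """Map URL -> title for items that completed and carry a data block."""
    return {
        item["url"]: item["data"].get("title", "")
        for item in items
        if item.get("status") == "completed" and item.get("data")
    }


sample_items = [
    {"url": "https://example.com", "status": "completed",
     "data": {"title": "Example Domain"}},
    {"url": "https://www.linkedin.com/in/someone", "status": "blocked"},
]
```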
Use cases
- Turn an arbitrary public page into clean, structured JSON for a pipeline or LLM prompt.
- Pull title / meta / headings / links / body text for SEO, monitoring or content audits.
- Lightweight "fetch + parse" step that is safe to run on untrusted URLs (SSRF-guarded).
How it works (deterministic, code-only)
Pure code: HTTP fetch through an SSRF-guarded client, then stdlib/regex HTML parsing
to extract title, meta tags, headings, links and clean text. The meta.summary is
computed in code from the parsed data. No headless browser, no AI on the default path.
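To make the "stdlib HTML parsing" step concrete, here is a minimal sketch built on Python's html.parser. It is not the actor's actual code: the class name PageExtractor and the function extract are hypothetical, the field layout simply mirrors the output example, and the meta.summary step is left out because its exact definition is not documented here.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects title, meta tags, headings, links and visible text."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP = {"script", "style"}  # content of these tags is not visible text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta_tags = {}
        self.headings = []
        self.links = []
        self.text_parts = []
        self._open = []       # (tag, href, collected text parts)
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "meta":
            name = attrs.get("name") or attrs.get("property")
            if name and attrs.get("content") is not None:
                self.meta_tags[name] = attrs["content"]
        elif tag == "title" or tag in self.HEADINGS or (tag == "a" and attrs.get("href")):
            self._open.append((tag, attrs.get("href"), []))

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        # Close the innermost open capture with the same tag name, if any.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                _, href, parts = self._open.pop(i)
                text = " ".join("".join(parts).split())
                if tag == "title":
                    self.title = text
                elif tag == "a":
                    self.links.append({"href": href, "text": text})
                else:
                    self.headings.append({"level": tag, "text": text})
                break

    def handle_data(self, data):
        if self._skip_depth:
            return
        for _, _, parts in self._open:
            parts.append(data)
        # The <title> is reported separately, so keep it out of body text.
        if not any(tag == "title" for tag, _, _ in self._open):
            self.text_parts.append(data)


def extract(html: str) -> dict:
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "meta_tags": parser.meta_tags,
        "headings": parser.headings,
        "links": parser.links,
        "text": " ".join(" ".join(parser.text_parts).split()),
    }
```

Joining text fragments with a space keeps adjacent block elements from running together, at the cost of occasionally splitting a word that spans inline tags.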
Safety (always on)
- SSRF guard — any URL that resolves to a private / loopback / link-local / reserved address is blocked (fail-closed); each redirect hop is re-validated before being followed.
- Blocklist — login-walled / ToS-sensitive domains (LinkedIn, Facebook, Instagram, X, Glassdoor, Indeed, Zillow, Yelp, …) are refused.
- Bounded fetch — 5s connect / 10s read timeout, 2 MB hard size cap, content-type allowlist, max 3 redirects. Never crashes or hangs.
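The SSRF rule in the list above can be sketched with the standard ipaddress and socket modules. This is an illustration under assumptions, not the actor's guard: is_public_ip and url_is_allowed are hypothetical names, and the resolve parameter exists only so the check can be exercised without real DNS. A caller following redirects would run the same check on every hop.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(address: str) -> bool:
    """Return False for private, loopback, link-local and similar ranges."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # unparseable address: fail closed
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) before checking.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def url_is_allowed(url: str, resolve=socket.getaddrinfo) -> bool:
    """Allow only http(s) URLs whose host resolves solely to public IPs."""
    try:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.hostname:
            return False
        port = parts.port or (443 if parts.scheme == "https" else 80)
        infos = resolve(parts.hostname, port, type=socket.SOCK_STREAM)
    except (OSError, ValueError):
        return False  # resolution or parsing failed: fail closed
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```

Checking the resolved addresses is only part of a guard: a complete implementation would also connect to the exact address it validated, so that a second DNS lookup cannot return a different, internal address.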
Limitations (honest)
- Pages that render entirely client-side (heavy JS, no server-side HTML) expose less content — there is no headless browser on the default path.
- Login-walled / blocklisted domains return status: "blocked" by design.
- AI enrichment requires your own OpenRouter key; the actor ships no built-in key.
Powered by OMEGA · DataPulse extraction core.