DataPulse URL Extractor

Pricing

Pay per usage

Deterministic, SSRF-guarded structured-data extraction from any public URL. Returns title, meta tags, headings, links and clean text with a code-computed summary. Optional AI enrichment.

Developer

Ahmed Moussa

Maintained by Community

What it does

For each input URL the actor returns a single dataset item with:

  • url: the requested URL
  • status: one of completed, blocked, failed, or empty
  • data: the structured extraction, with title, meta_tags, headings, links, and text (plus an ai_extracted block when AI enrichment is enabled)
  • meta: code-owned and deterministic. It holds extracted_at (real server time), method, and a summary (row_count, numeric stats) computed in code from the extracted data and never taken from model output.

Input

{
  "url": "https://example.com",
  "urls": ["https://www.iana.org/help/example-domains"],
  "schema_hint": "company info",
  "llm_api_key": "(optional: enables AI enrichment)",
  "llm_model": "deepseek/deepseek-chat"
}

url takes a single public http(s) URL and urls takes a list of them. If both llm_api_key and the OPENROUTER_API_KEY secret are empty, the actor returns a fully deterministic, code-only extraction.
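As a rough illustration of the input rules above, the sketch below merges url and urls into one list and decides whether AI enrichment would be switched on. The function names collect_urls and ai_enrichment_enabled are hypothetical, and the de-duplication and the skipping of non-http(s) entries are assumptions rather than documented behaviour of the actor.

```python
import os


def collect_urls(actor_input: dict) -> list[str]:
    """Merge `url` and `urls` into one ordered, de-duplicated list."""
    candidates = []
    single = actor_input.get("url")
    if single:
        candidates.append(single)
    candidates.extend(actor_input.get("urls") or [])

    seen = set()
    result = []
    for raw in candidates:
        url = raw.strip()
        # Assumption: entries that are not http(s) URLs are skipped here.
        if not url.lower().startswith(("http://", "https://")):
            continue
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


def ai_enrichment_enabled(actor_input: dict, env=os.environ) -> bool:
    """True when a key is supplied via input or the OPENROUTER_API_KEY secret."""
    key = actor_input.get("llm_api_key") or env.get("OPENROUTER_API_KEY") or ""
    return bool(key.strip())


if __name__ == "__main__":
    sample = {
        "url": "https://example.com",
        "urls": ["https://www.iana.org/help/example-domains"],
        "llm_api_key": "",
    }
    print(collect_urls(sample))
    print(ai_enrichment_enabled(sample, env={}))
```

With an empty llm_api_key and no OPENROUTER_API_KEY in the environment, the second call reports that enrichment is off, which corresponds to the code-only path.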

Output

One JSON item per URL pushed to the default dataset:

{
  "url": "https://example.com",
  "status": "completed",
  "data": {
    "title": "Example Domain",
    "meta_tags": { "description": "..." },
    "headings": [ { "level": "h1", "text": "Example Domain" } ],
    "links": [ { "href": "https://www.iana.org/domains/example", "text": "More information..." } ],
    "text": "Example Domain This domain is for use in..."
  },
  "meta": {
    "extracted_at": "2026-06-23T00:00:00+00:00",
    "method": "deterministic_code",
    "summary": { "row_count": 1 }
  }
}
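A downstream consumer would typically branch on status before reading data. The following is a minimal sketch of that pattern, assuming items shaped like the example above; summarize_item is a hypothetical helper and not part of the actor.

```python
def summarize_item(item: dict) -> str:
    """Build a one-line description of a dataset item, keyed on its status."""
    url = item.get("url", "")
    status = item.get("status")
    if status != "completed":
        # blocked, failed and empty items are reported by status only.
        return f"{url}: {status}"
    data = item.get("data") or {}
    title = data.get("title") or "(no title)"
    n_headings = len(data.get("headings") or [])
    n_links = len(data.get("links") or [])
    return f"{url}: {title} ({n_headings} headings, {n_links} links)"


if __name__ == "__main__":
    item = {
        "url": "https://example.com",
        "status": "completed",
        "data": {
            "title": "Example Domain",
            "headings": [{"level": "h1", "text": "Example Domain"}],
            "links": [{"href": "https://www.iana.org/domains/example",
                       "text": "More information..."}],
        },
    }
    print(summarize_item(item))
    print(summarize_item({"url": "https://www.linkedin.com", "status": "blocked"}))
```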

Use cases

  • Turn an arbitrary public page into clean, structured JSON for a pipeline or LLM prompt.
  • Pull title / meta / headings / links / body text for SEO, monitoring or content audits.
  • Lightweight "fetch + parse" step that is safe to run on untrusted URLs (SSRF-guarded).

How it works (deterministic, code-only)

Pure code: HTTP fetch through an SSRF-guarded client, then stdlib/regex HTML parsing to extract title, meta tags, headings, links and clean text. The meta.summary is computed in code from the parsed data. No headless browser, no AI on the default path.
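To make the "stdlib parsing" idea concrete, here is a small sketch built on Python's html.parser. It is not the actor's actual code: the class name PageExtractor, the function extract, and the choice to skip script and style content and to leave the title out of the body text are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP = {"script", "style"}  # assumption: their content is not page text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta_tags = {}
        self.headings = []
        self.links = []
        self.text_parts = []
        self._open = []        # stack of (tag, href, text parts) being captured
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "meta":
            name = attrs.get("name") or attrs.get("property")
            if name and attrs.get("content") is not None:
                self.meta_tags[name] = attrs["content"]
        elif tag == "title" or tag in self.HEADINGS or (tag == "a" and attrs.get("href")):
            self._open.append((tag, attrs.get("href"), []))

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if self._open and self._open[-1][0] == tag:
            _, href, parts = self._open.pop()
            text = " ".join("".join(parts).split())  # collapse whitespace
            if tag == "title":
                self.title = text
            elif tag == "a":
                self.links.append({"href": href, "text": text})
            else:
                self.headings.append({"level": tag, "text": text})

    def handle_data(self, data):
        if self._skip_depth:
            return
        for _, _, parts in self._open:
            parts.append(data)
        # The <title> text is kept out of the body text in this sketch.
        if not any(tag == "title" for tag, _, _ in self._open):
            self.text_parts.append(data)


def extract(html: str) -> dict:
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "meta_tags": parser.meta_tags,
        "headings": parser.headings,
        "links": parser.links,
        "text": " ".join(" ".join(parser.text_parts).split()),
    }
```

Because the parser only sees the HTML the server returns, a page that builds its content in the browser would yield little beyond the title and meta tags.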

Safety (always on)

  • SSRF guard — any URL that resolves to a private / loopback / link-local / reserved address is blocked (fail-closed); each redirect hop is re-validated before being followed.
  • Blocklist — login-walled / ToS-sensitive domains (LinkedIn, Facebook, Instagram, X, Glassdoor, Indeed, Zillow, Yelp, …) are refused.
  • Bounded fetch — 5s connect / 10s read timeout, 2 MB hard size cap, content-type allowlist, max 3 redirects. Never crashes or hangs.
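The first two checks in this list can be sketched with the standard library alone. This is an illustration under stated assumptions, not the actor's implementation: check_url, is_public_ip and is_blocklisted are hypothetical names, BLOCKED_DOMAINS is only a subset of the domains named above, and treating multicast and unspecified addresses as non-public is an extra assumption.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Illustrative subset only; the actor's real blocklist is longer.
BLOCKED_DOMAINS = {"linkedin.com", "facebook.com", "instagram.com", "x.com"}


def is_public_ip(addr: str) -> bool:
    """False for private, loopback, link-local, reserved or unparsable addresses."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return False  # fail closed on anything that does not parse
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped  # judge ::ffff:a.b.c.d by its IPv4 address
    return not (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def is_blocklisted(host: str) -> bool:
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def check_url(url: str, resolve=socket.getaddrinfo) -> str:
    """Return "ok" or "blocked"; `resolve` is injectable for testing."""
    parts = urlsplit(url)
    host = parts.hostname
    if parts.scheme not in ("http", "https") or not host:
        return "blocked"
    if is_blocklisted(host):
        return "blocked"
    try:
        infos = resolve(host, None, type=socket.SOCK_STREAM)
    except OSError:
        return "blocked"  # resolution failure is treated as blocked
    addrs = {info[4][0] for info in infos}
    # Every resolved address must be public, otherwise the URL is refused.
    if not addrs or not all(is_public_ip(a) for a in addrs):
        return "blocked"
    return "ok"
```

To match the redirect rule above, a fetcher would call a check like this again on every Location target before following it. A production guard would usually also connect to the address it validated, so that a second DNS lookup cannot return a different one.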

Limitations (honest)

  • Pages that render entirely client-side (heavy JS, no server-side HTML) expose less content — there is no headless browser on the default path.
  • Login-walled / blocklisted domains return status: "blocked" by design.
  • AI enrichment requires your own OpenRouter key; the actor ships no built-in key.

Powered by OMEGA · DataPulse extraction core.