SEO Content Extraction

Pricing

Pay per usage

Extract SEO-ready content from public web pages with robots.txt checks, strict limits, SSRF protection, and clean structured output.

Developer

ping

Maintained by Community

Last modified

24 days ago

SEO Content Extraction reads public web pages and returns clean structured data: title, meta description, headings, body text, and normalized links.

It is built for lightweight SEO audits, content inventories, RAG inputs, and agent workflows that need page content without the risk of unsafe network access.

The Actor is intentionally conservative. It does not log in, bypass access controls, execute page JavaScript, solve CAPTCHAs, or scrape private networks.

What It Returns

Each dataset item contains:

  • url and finalUrl
  • HTTP statusCode and contentType
  • title
  • meta description
  • headings (h1, h2, h3)
  • cleaned text
  • normalized outbound links
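
As a rough illustration of how a dataset item might be consumed, the sketch below types one item and runs a tiny audit over it. Only url, finalUrl, statusCode, contentType, and title are names taken from the list above. The keys metaDescription, headings, text, and links, the shape of headings, and the helper seo_issues are assumptions made for this example, so check them against a real run before relying on them.

```python
from typing import TypedDict


class Headings(TypedDict):
    h1: list[str]
    h2: list[str]
    h3: list[str]


class PageItem(TypedDict):
    url: str
    finalUrl: str
    statusCode: int
    contentType: str
    title: str
    metaDescription: str  # assumed key name
    headings: Headings    # assumed shape: one list per heading level
    text: str             # assumed key name for the cleaned text
    links: list[str]      # assumed key name for normalized outbound links


def seo_issues(item: PageItem) -> list[str]:
    """Return basic on-page SEO problems found in one item (hypothetical helper)."""
    issues = []
    if not (item.get("title") or "").strip():
        issues.append("missing title")
    if not (item.get("metaDescription") or "").strip():
        issues.append("missing meta description")
    h1 = (item.get("headings") or {}).get("h1") or []
    if len(h1) != 1:
        issues.append(f"expected 1 h1, found {len(h1)}")
    return issues
```

Calling seo_issues on each item from a run gives a quick per-page checklist without any extra parsing.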

Example Input

{
  "startUrls": ["https://example.com"],
  "maxPages": 3,
  "maxDepth": 1,
  "sameDomainOnly": true,
  "respectRobotsTxt": true,
  "includeLinks": true,
  "textMaxChars": 4000
}

Input Notes

  • startUrls: 1 to 10 public HTTP/HTTPS URLs.
  • maxPages: 1 to 25 pages per run.
  • maxDepth: link-following depth, 0 to 3.
  • sameDomainOnly: enabled by default.
  • respectRobotsTxt: enabled by default.
  • includeHtml: disabled by default.
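
The limits above can be checked on the caller's side before starting a run. The following is a minimal sketch, not the Actor's own validation: the function name validate_input and the error messages are made up for this example, and it only covers the three ranges stated in the notes.

```python
from urllib.parse import urlsplit

# Inclusive ranges taken from the input notes above.
LIMITS = {"maxPages": (1, 25), "maxDepth": (0, 3)}


def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input fits the limits."""
    errors = []
    urls = run_input.get("startUrls", [])
    if not 1 <= len(urls) <= 10:
        errors.append("startUrls must contain 1 to 10 URLs")
    for url in urls:
        if urlsplit(url).scheme not in ("http", "https"):
            errors.append(f"not an HTTP/HTTPS URL: {url}")
    for key, (low, high) in LIMITS.items():
        value = run_input.get(key)
        # Fields left out are skipped, since the notes do not state their defaults.
        if value is not None and not low <= value <= high:
            errors.append(f"{key} must be between {low} and {high}")
    return errors
```

Running it on the example input returns an empty list, while an input with "maxPages": 26 reports the range violation.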

Good Uses

  • SEO page inventory
  • Title, meta description, and heading extraction
  • Lightweight content checks for public websites
  • RAG and agent data collection from public pages
  • Internal link discovery within a small site section

Security And Privacy

The Actor blocks:

  • localhost and private network targets
  • link-local and metadata IP targets
  • special-use hostnames such as .local and .internal
  • URLs with embedded credentials
  • shell/process/proxy override fields in input JSON
  • script-like input strings

The Actor does not accept custom proxy settings, shell commands, environment variables, worker URLs, or worker tokens from callers.
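
To make the URL-level rules concrete, here is a minimal sketch of this kind of check using only the Python standard library. It is not the Actor's implementation: the name is_blocked_target and the exact rule set are assumptions for illustration. It also inspects only the URL as written, so a real SSRF guard would additionally have to check the IP addresses a hostname resolves to, including after redirects.

```python
import ipaddress
from urllib.parse import urlsplit

# Special-use suffixes named in the list above.
BLOCKED_SUFFIXES = (".local", ".internal")


def is_blocked_target(url: str) -> bool:
    """Return True if the URL should be refused before any request is made."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return True  # malformed URL, e.g. a broken IPv6 literal
    if parts.scheme not in ("http", "https"):
        return True
    if parts.username is not None or parts.password is not None:
        return True  # embedded credentials
    host = (parts.hostname or "").lower().rstrip(".")
    if not host or host == "localhost" or host.endswith(BLOCKED_SUFFIXES):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a plain hostname; resolved addresses still need checking
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # covers 169.254.169.254, a common metadata address
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```

For example, is_blocked_target("http://169.254.169.254/latest/meta-data/") returns True, while is_blocked_target("https://example.com/") returns False.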

Limitations

This is a public-page content extractor. It is not a browser automation Actor, does not render JavaScript-only content, and is not designed for login-only sites, CAPTCHA flows, anti-bot bypass, or high-volume harvesting.