SEO Fields Scraper

Extracts website SEO metadata with titles, descriptions, canonicals, robots tags, headings, Open Graph fields, and audit issues. Export data, run via API, schedule and monitor runs, or integrate with other tools.

Pricing

from $0.85 / 1,000 URLs


Developer

Trove Vault

Maintained by Community

Actor stats

  • Bookmarked: 1
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 8 days ago


SEO Fields Scraper extracts page-level SEO metadata from public websites and turns it into an audit-ready dataset. Give it a homepage, a list of URLs, or a sitemap URL and it returns titles, meta descriptions, canonicals, robots directives, headings, Open Graph fields, Twitter Card fields, and practical issue flags.

Use it when you need a fast metadata inventory for site QA, content migrations, competitor checks, agency reporting, or scheduled monitoring. The actor is HTTP-first, so it is cheap to run and suitable for repeated audits where you want clear fields instead of screenshots or bulky crawl exports.

Why Use This Actor

  • Audit title tags and meta descriptions across important pages.
  • Catch missing or weak canonicals, noindex directives, and H1 problems.
  • Review Open Graph and Twitter Card metadata for social sharing previews.
  • Sample pages from a site without running a heavy browser crawler.
  • Export structured metadata to Apify datasets, API clients, spreadsheets, or downstream workflows.

What It Extracts

For each processed page, the actor can return:

  • title, titleLength, and titleStatus
  • metaDescription, metaDescriptionLength, and metaDescriptionStatus
  • canonicalUrl and canonicalStatus
  • robotsMeta, isNoindex, and isNofollow
  • h1, h1Count, and h2Sample
  • Open Graph fields such as openGraphTitle, openGraphDescription, and openGraphImage
  • Twitter Card fields such as twitterTitle, twitterDescription, and twitterImage
  • renderingUsed, showing whether the row came from http or browser
  • seoScore, issues, and warnings
  • structured error fields when a URL cannot be fetched

The seoScore is a simple completeness score from 0 to 100. It is meant for triage and prioritization, not as a replacement for a full SEO strategy. Use renderingUsed to understand whether a row came from the fast HTTP path or from Playwright browser rendering.
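
The exact weighting is internal to the actor, but a completeness score of this kind can be sketched in a few lines. The field names below match the output schema; the weights themselves are illustrative, not the actor's real values:

```python
def completeness_score(row: dict) -> int:
    """Illustrative 0-100 completeness score. Weights are hypothetical."""
    score = 100
    if not row.get("title"):
        score -= 30
    elif row.get("titleStatus") != "ok":
        score -= 10
    if not row.get("metaDescription"):
        score -= 25
    elif row.get("metaDescriptionStatus") != "ok":
        score -= 10
    if not row.get("canonicalUrl"):
        score -= 15
    if row.get("h1Count", 0) != 1:
        score -= 10
    if row.get("isNoindex"):
        score -= 20  # an accidental noindex is usually the most urgent flag
    return max(score, 0)
```

A score like this is only useful for sorting pages into a review queue; treat the actor's seoScore the same way.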

Use Cases

Content teams can check whether newly published pages have complete search and social metadata.

SEO consultants can produce a quick metadata export for audits, migration QA, or retainer reporting.

Growth teams can compare competitor landing pages and identify common metadata patterns.

Developers can run the actor after releases to catch missing titles, incorrect canonicals, or accidental noindex tags.

Automation teams can schedule runs and append results to an existing dataset for monitoring.

Input Example

{
  "startUrls": [
    { "url": "https://apify.com/" }
  ],
  "maxPages": 10,
  "crawlDepth": 1,
  "requestTimeoutSecs": 30,
  "renderingMode": "BROWSER_FALLBACK",
  "browserWaitSecs": 5,
  "sameDomainOnly": true,
  "includeOpenGraph": true,
  "includeTwitterCards": true,
  "includeHeadings": true
}

Use crawlDepth: 0 when you already have the exact URLs you want to audit. Use crawlDepth: 1 for a fast same-domain sample from a homepage. Use a sitemap URL when the site exposes one and you want broader coverage.

Input Reference

  • startUrls (array): Website URLs or sitemap URLs to audit.
  • maxPages (integer): Maximum number of successful page rows to create.
  • crawlDepth (integer): Number of HTML link levels to follow from each start URL.
  • requestTimeoutSecs (integer): HTTP timeout for each page or sitemap request.
  • renderingMode (string): HTTP_ONLY, BROWSER_FALLBACK, or BROWSER_ONLY.
  • browserWaitSecs (integer): Extra wait time after browser page load when Playwright is used.
  • sameDomainOnly (boolean): Keeps discovered links on the same hostname as the start URL.
  • includeOpenGraph (boolean): Extracts Open Graph social preview fields.
  • includeTwitterCards (boolean): Extracts Twitter Card social preview fields.
  • includeHeadings (boolean): Extracts H1 and H2 heading signals.
  • proxyConfiguration (object): Optional Apify Proxy settings for blocked sites.
  • datasetId (string): Optional existing dataset to append results to.
  • runId (string): Optional upstream run ID copied into output rows.

Output Example

{
  "url": "https://apify.com/",
  "finalUrl": "https://apify.com/",
  "statusCode": 200,
  "title": "Apify: Full-stack web scraping and data extraction platform",
  "titleLength": 61,
  "titleStatus": "ok",
  "metaDescription": "Apify is a full-stack web scraping and browser automation platform.",
  "metaDescriptionLength": 70,
  "metaDescriptionStatus": "ok",
  "canonicalUrl": "https://apify.com/",
  "canonicalStatus": "self",
  "renderingUsed": "http",
  "robotsMeta": null,
  "isNoindex": false,
  "h1": "Web scraping, automation, and AI agents",
  "h1Count": 1,
  "seoScore": 88,
  "issues": [],
  "warningCount": 2,
  "warnings": ["Missing Twitter Card image", "Missing Open Graph image"],
  "discoveredVia": "input",
  "scrapedAt": "2026-04-27T12:00:00.000Z",
  "error": false
}
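
Once rows like the one above land in a dataset, triage is a plain filter over the output fields. A minimal sketch using the field names from the example (pages_needing_attention is a hypothetical helper, not part of the actor):

```python
def pages_needing_attention(rows: list) -> list:
    """Keep rows worth a manual look: fetch errors, noindex pages, or flagged issues."""
    return [
        row
        for row in rows
        if row.get("error") or row.get("isNoindex") or row.get("issues")
    ]
```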

API Usage

curl "https://api.apify.com/v2/acts/TroveVault~seo-fields-scraper/runs" \
  -X POST \
  -H "Authorization: Bearer $APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{ "url": "https://apify.com/" }],
    "maxPages": 10,
    "crawlDepth": 1,
    "renderingMode": "BROWSER_FALLBACK",
    "sameDomainOnly": true
  }'

After the run finishes, download results from the default dataset URL in the run output or from the Apify Console.
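
If you are scripting the download, the run object returned by the call above includes a defaultDatasetId, and the items can be fetched from the Apify API's dataset items endpoint. A small helper to build that URL (the endpoint and format parameter follow the Apify API; check the API docs for other export formats):

```python
def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the download URL for an Apify dataset's items."""
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?format={fmt}"
```

Pass "csv" as the format for a spreadsheet-friendly export.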

How to Use Browser Rendering

Use HTTP_ONLY for most websites. It is the default, fastest, and cheapest mode. It works when SEO tags are present in the raw HTML returned by the server, which is common for well-built marketing sites, content sites, and many server-rendered apps.

Use BROWSER_FALLBACK when you are not sure. The actor first tries HTTP, checks whether core metadata is present, and only opens Playwright when the raw HTML looks sparse. This is the best setting for mixed crawls because normal pages stay cheap while JavaScript-rendered pages get a second pass.
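
The fallback decision can be pictured as a small heuristic over the raw HTML. This is not the actor's actual implementation, just a sketch of the idea:

```python
import re

def looks_sparse(html: str) -> bool:
    """Heuristic: treat raw HTML as sparse when core SEO tags are absent."""
    has_title = bool(re.search(r"<title[^>]*>\s*\S", html, re.IGNORECASE))
    has_description = bool(
        re.search(r'<meta[^>]+name=["\']description["\']', html, re.IGNORECASE)
    )
    # Fall back to browser rendering only when both core signals are missing.
    return not has_title and not has_description
```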

Use BROWSER_ONLY when you know the site renders metadata or headings in JavaScript. This opens a browser for every page, so it is slower and more expensive. Start with maxPages: 1 to 5, keep crawlDepth: 0 for tests, and increase browserWaitSecs only if the site is slow to hydrate.

For protected websites such as Amazon, browser mode may still return empty or blocked data. Amazon often serves bot checks, alternate HTML, regional pages, or sparse responses to automation. Try BROWSER_FALLBACK or BROWSER_ONLY with Apify Proxy enabled and a very small page limit, but do not expect guaranteed extraction from strongly protected domains.

Limitations

Browser rendering is available through Playwright, but it should be used deliberately because it costs more than HTTP extraction.

It does not perform keyword research, backlink analysis, Core Web Vitals testing, search ranking checks, or screenshot analysis. It focuses on metadata extraction and lightweight page-level QA.

Some websites block automated HTTP clients. If you see BLOCKED, retry with Apify Proxy enabled and keep maxPages small.

Troubleshooting

If results are empty, check that the URL returns public HTML or XML and is not a PDF, image, or login page.

If many pages are blocked, enable Apify Proxy and reduce concurrency by using a smaller maxPages and crawlDepth.

If a canonical appears different from the final URL, inspect redirects and trailing slash behavior before treating it as an error.
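
A quick way to check whether a canonical mismatch is only a trailing-slash or redirect artifact is to normalize both URLs before comparing. A sketch (same_page is a hypothetical helper, not an actor field):

```python
from urllib.parse import urlsplit

def same_page(canonical: str, final_url: str) -> bool:
    """Compare two URLs, ignoring host case and a trailing slash in the path."""
    def norm(url: str):
        parts = urlsplit(url)
        return (
            parts.scheme.lower(),
            parts.netloc.lower(),
            parts.path.rstrip("/") or "/",
            parts.query,
        )
    return norm(canonical) == norm(final_url)
```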

If social metadata is missing, verify whether the site uses Open Graph, Twitter Card tags, or JavaScript-rendered metadata. Retry with renderingMode: "BROWSER_FALLBACK" before assuming the tags do not exist.

If the actor finds too many irrelevant URLs, keep sameDomainOnly enabled and start from a narrower section URL or sitemap.

FAQ

Can it crawl a whole website?

It can crawl same-domain links up to the maxPages and crawlDepth limits. For very large websites, use a sitemap and a deliberate page cap.

Does it support sitemap XML?

Yes. Add a sitemap URL to startUrls and the actor will enqueue URLs from <loc> entries when they match the domain rules.
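
Sitemap handling boils down to reading <loc> entries and applying the domain filter. A simplified sketch with Python's standard library (the namespace is the standard sitemap schema; this is not the actor's code):

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(xml_text: str, allowed_host: str) -> list:
    """Extract <loc> URLs from sitemap XML, keeping only the allowed hostname."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        if url and urlparse(url).hostname == allowed_host:
            urls.append(url)
    return urls
```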

Does it use a browser?

Only when you ask it to. HTTP_ONLY never opens a browser. BROWSER_FALLBACK opens a browser only when the raw HTML is missing useful metadata. BROWSER_ONLY opens a browser for every page.

Can I monitor metadata changes?

Yes. Schedule the actor and compare datasets over time in your own workflow or append runs to a shared datasetId.
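
Comparing two scheduled runs does not need special tooling: key both datasets by url and diff the fields you care about. A sketch for title changes (changed_titles is a hypothetical helper):

```python
def changed_titles(old_rows: list, new_rows: list) -> list:
    """Return (url, old_title, new_title) for pages whose title changed between runs."""
    old_titles = {row["url"]: row.get("title") for row in old_rows}
    return [
        (row["url"], old_titles[row["url"]], row.get("title"))
        for row in new_rows
        if row["url"] in old_titles and old_titles[row["url"]] != row.get("title")
    ]
```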

What does seoScore mean?

It is a lightweight completeness score based on missing or weak metadata fields. Use it to prioritize QA, not as a universal ranking metric.

Does it follow links to other domains?

By default, no. sameDomainOnly keeps the crawl focused on the start URL hostname.

Can it scrape blocked websites?

Sometimes. Enable Apify Proxy when a site blocks datacenter traffic, but always follow the target website's terms and applicable laws.

Should I use browser mode for Amazon?

You can try it, but Amazon is heavily protected and may still return sparse or blocked pages. Use Apify Proxy, maxPages: 1, crawlDepth: 0, and BROWSER_ONLY for a small test before running a larger job.

Use this actor with product, catalog, review, or competitor intelligence actors when you need both page metadata and business data in the same workflow.

Changelog

  • 0.1 Initial release with HTTP-first metadata extraction, social fields, headings, scoring, structured errors, and dataset append support.

Support

Open an issue from the Apify actor page or contact TroveVault with the run ID, input, and a short description of the page behavior you expected.