SEO Fields Scraper
Pricing
from $0.85 / 1,000 URLs
Extracts website SEO metadata with titles, descriptions, canonicals, robots tags, headings, Open Graph fields, and audit issues. Export data, run via API, schedule and monitor runs, or integrate with other tools.
Developer: Trove Vault
SEO Fields Scraper extracts page-level SEO metadata from public websites and turns it into an audit-ready dataset. Give it a homepage, a list of URLs, or a sitemap URL and it returns titles, meta descriptions, canonicals, robots directives, headings, Open Graph fields, Twitter Card fields, and practical issue flags.
Use it when you need a fast metadata inventory for site QA, content migrations, competitor checks, agency reporting, or scheduled monitoring. The actor is HTTP-first, so it is cheap to run and suitable for repeated audits where you want clear fields instead of screenshots or bulky crawl exports.
Why Use This Actor
- Audit title tags and meta descriptions across important pages.
- Catch missing or weak canonicals, noindex directives, and H1 problems.
- Review Open Graph and Twitter Card metadata for social sharing previews.
- Sample pages from a site without running a heavy browser crawler.
- Export structured metadata to Apify datasets, API clients, spreadsheets, or downstream workflows.
What It Extracts
For each processed page, the actor can return:
- `title`, `titleLength`, and `titleStatus`
- `metaDescription`, `metaDescriptionLength`, and `metaDescriptionStatus`
- `canonicalUrl` and `canonicalStatus`
- `robotsMeta`, `isNoindex`, and `isNofollow`
- `h1`, `h1Count`, and `h2Sample`
- Open Graph fields such as `openGraphTitle`, `openGraphDescription`, and `openGraphImage`
- Twitter Card fields such as `twitterTitle`, `twitterDescription`, and `twitterImage`
- `renderingUsed`, showing whether the row came from `http` or `browser`
- `seoScore`, `issues`, and `warnings`
- structured error fields when a URL cannot be fetched
The seoScore is a simple completeness score from 0 to 100. It is meant for triage and prioritization, not as a replacement for a full SEO strategy. Use renderingUsed to understand whether a row came from the fast HTTP path or from Playwright browser rendering.
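The actor's exact scoring formula is not documented here, but a completeness score of this kind can be sketched as follows. The field weights below are assumptions for illustration, not the actor's actual formula:

```python
# Illustrative completeness score. The weights are assumptions chosen
# for this sketch, not the actor's documented scoring rules.
FIELD_WEIGHTS = {
    "title": 25,
    "metaDescription": 25,
    "canonicalUrl": 20,
    "h1": 20,
    "openGraphTitle": 5,
    "twitterTitle": 5,
}

def completeness_score(row: dict) -> int:
    """Return 0-100 based on which metadata fields are present and non-empty."""
    score = sum(weight for field, weight in FIELD_WEIGHTS.items() if row.get(field))
    return min(score, 100)
```

A row with every watched field present would score 100 under these weights, while a page with only a title would score 25, which is enough to sort a dataset for triage.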
Use Cases
Content teams can check whether newly published pages have complete search and social metadata.
SEO consultants can produce a quick metadata export for audits, migration QA, or retainer reporting.
Growth teams can compare competitor landing pages and identify common metadata patterns.
Developers can run the actor after releases to catch missing titles, incorrect canonicals, or accidental noindex tags.
Automation teams can schedule runs and append results to an existing dataset for monitoring.
Input Example
```json
{
  "startUrls": [{ "url": "https://apify.com/" }],
  "maxPages": 10,
  "crawlDepth": 1,
  "requestTimeoutSecs": 30,
  "renderingMode": "BROWSER_FALLBACK",
  "browserWaitSecs": 5,
  "sameDomainOnly": true,
  "includeOpenGraph": true,
  "includeTwitterCards": true,
  "includeHeadings": true
}
```
Use crawlDepth: 0 when you already have the exact URLs you want to audit. Use crawlDepth: 1 for a fast same-domain sample from a homepage. Use a sitemap URL when the site exposes one and you want broader coverage.
Input Reference
| Field | Type | Description |
|---|---|---|
| startUrls | array | Website URLs or sitemap URLs to audit. |
| maxPages | integer | Maximum number of successful page rows to create. |
| crawlDepth | integer | Number of HTML link levels to follow from each start URL. |
| requestTimeoutSecs | integer | HTTP timeout for each page or sitemap request. |
| renderingMode | string | HTTP_ONLY, BROWSER_FALLBACK, or BROWSER_ONLY. |
| browserWaitSecs | integer | Extra wait time after browser page load when Playwright is used. |
| sameDomainOnly | boolean | Keeps discovered links on the same hostname as the start URL. |
| includeOpenGraph | boolean | Extracts Open Graph social preview fields. |
| includeTwitterCards | boolean | Extracts Twitter Card social preview fields. |
| includeHeadings | boolean | Extracts H1 and H2 heading signals. |
| proxyConfiguration | object | Optional Apify Proxy settings for blocked sites. |
| datasetId | string | Optional existing dataset to append results to. |
| runId | string | Optional upstream run ID copied into output rows. |
Output Example
```json
{
  "url": "https://apify.com/",
  "finalUrl": "https://apify.com/",
  "statusCode": 200,
  "title": "Apify: Full-stack web scraping and data extraction platform",
  "titleLength": 61,
  "titleStatus": "ok",
  "metaDescription": "Apify is a full-stack web scraping and browser automation platform.",
  "metaDescriptionLength": 70,
  "metaDescriptionStatus": "ok",
  "canonicalUrl": "https://apify.com/",
  "canonicalStatus": "self",
  "renderingUsed": "http",
  "robotsMeta": null,
  "isNoindex": false,
  "h1": "Web scraping, automation, and AI agents",
  "h1Count": 1,
  "seoScore": 88,
  "issues": [],
  "warningCount": 2,
  "warnings": ["Missing Twitter Card image", "Missing Open Graph image"],
  "discoveredVia": "input",
  "scrapedAt": "2026-04-27T12:00:00.000Z",
  "error": false
}
```
API Usage
```bash
curl "https://api.apify.com/v2/acts/TroveVault~seo-fields-scraper/runs" \
  -X POST \
  -H "Authorization: Bearer $APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{ "url": "https://apify.com/" }],
    "maxPages": 10,
    "crawlDepth": 1,
    "renderingMode": "BROWSER_FALLBACK",
    "sameDomainOnly": true
  }'
```
After the run finishes, download results from the default dataset URL in the run output or from the Apify Console.
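As a sketch using only the Python standard library, the finished run's rows can be fetched from the dataset items endpoint. The dataset ID placeholder below is something you would copy from the run's `defaultDatasetId`:

```python
import json
import os
import urllib.request

def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the Apify dataset items URL for a finished run."""
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?format={fmt}"

def fetch_items(dataset_id: str, token: str) -> list:
    """Download all dataset rows as a list of dicts."""
    req = urllib.request.Request(
        dataset_items_url(dataset_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__" and os.environ.get("APIFY_TOKEN"):
    # Replace YOUR_DATASET_ID with the run's defaultDatasetId.
    rows = fetch_items("YOUR_DATASET_ID", os.environ["APIFY_TOKEN"])
    print(len(rows), "rows")
```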
How to Use Browser Rendering
Use HTTP_ONLY for most websites. It is the default, fastest, and cheapest mode. It works when SEO tags are present in the raw HTML returned by the server, which is common for well-built marketing sites, content sites, and many server-rendered apps.
Use BROWSER_FALLBACK when you are not sure. The actor first tries HTTP, checks whether core metadata is present, and only opens Playwright when the raw HTML looks sparse. This is the best setting for mixed crawls because normal pages stay cheap while JavaScript-rendered pages get a second pass.
Use BROWSER_ONLY when you know the site renders metadata or headings in JavaScript. This opens a browser for every page, so it is slower and more expensive. Start with maxPages: 1 to 5, keep crawlDepth: 0 for tests, and increase browserWaitSecs only if the site is slow to hydrate.
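Following that advice, a small BROWSER_ONLY test input might look like this (the URL and values are illustrative):

```json
{
  "startUrls": [{ "url": "https://example.com/" }],
  "maxPages": 3,
  "crawlDepth": 0,
  "renderingMode": "BROWSER_ONLY",
  "browserWaitSecs": 5
}
```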
For protected websites such as Amazon, browser mode may still return empty or blocked data. Amazon often serves bot checks, alternate HTML, regional pages, or sparse responses to automation. Try BROWSER_FALLBACK or BROWSER_ONLY with Apify Proxy enabled and a very small page limit, but do not expect guaranteed extraction from strongly protected domains.
Limitations
Browser rendering is available through Playwright, but it should be used deliberately because it costs more than HTTP extraction.
It does not perform keyword research, backlink analysis, Core Web Vitals testing, search ranking checks, or screenshot analysis. It focuses on metadata extraction and lightweight page-level QA.
Some websites block automated HTTP clients. If you see BLOCKED, retry with Apify Proxy enabled and keep maxPages small.
Troubleshooting
If results are empty, check that the URL returns public HTML or XML and is not a PDF, image, or login page.
If many pages are blocked, enable Apify Proxy and reduce concurrency by using a smaller maxPages and crawlDepth.
If a canonical appears different from the final URL, inspect redirects and trailing slash behavior before treating it as an error.
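A simple normalization pass before flagging mismatches might look like this. It is a sketch, not the actor's internal logic; adjust the rules to your own canonicalization policy:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Lowercase scheme and host and drop a trailing slash on the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def canonical_mismatch(row: dict) -> bool:
    """True only when canonicalUrl and finalUrl still differ after normalization."""
    canonical, final = row.get("canonicalUrl"), row.get("finalUrl")
    if not canonical or not final:
        return False
    return normalize(canonical) != normalize(final)
```

With this check, `https://apify.com/` and `https://apify.com` compare as equal, so only substantive differences survive for manual review.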
If social metadata is missing, verify whether the site uses Open Graph, Twitter Card tags, or JavaScript-rendered metadata. Retry with renderingMode: "BROWSER_FALLBACK" before assuming the tags do not exist.
If the actor finds too many irrelevant URLs, keep sameDomainOnly enabled and start from a narrower section URL or sitemap.
FAQ
Can it crawl a whole website?
It can crawl same-domain links up to the maxPages and crawlDepth limits. For very large websites, use a sitemap and a deliberate page cap.
Does it support sitemap XML?
Yes. Add a sitemap URL to startUrls and the actor will enqueue URLs from <loc> entries when they match the domain rules.
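The <loc> extraction can be approximated like this; it is a sketch of the general technique, not the actor's internal code:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Standard sitemap XML namespace from sitemaps.org.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def locs_from_sitemap(xml_text, allowed_host=None):
    """Return <loc> URLs, optionally filtered to one hostname (sameDomainOnly)."""
    root = ET.fromstring(xml_text)
    urls = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
    if allowed_host:
        urls = [u for u in urls if urlsplit(u).hostname == allowed_host]
    return urls
```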
Does it use a browser?
Only when you ask it to. HTTP_ONLY never opens a browser. BROWSER_FALLBACK opens a browser only when the raw HTML is missing useful metadata. BROWSER_ONLY opens a browser for every page.
Can I monitor metadata changes?
Yes. Schedule the actor and compare datasets over time in your own workflow or append runs to a shared datasetId.
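A minimal diff between two exported runs might look like this sketch, which keys rows by `url` and compares a few watched fields (the field list is a reasonable default, not a requirement):

```python
# Fields worth watching between scheduled runs; adjust to taste.
WATCHED_FIELDS = ("title", "metaDescription", "canonicalUrl", "isNoindex")

def metadata_changes(old_rows, new_rows, fields=WATCHED_FIELDS):
    """Return {url: {field: (old, new)}} for rows whose watched fields changed."""
    old_by_url = {row["url"]: row for row in old_rows}
    changes = {}
    for row in new_rows:
        before = old_by_url.get(row["url"])
        if before is None:
            continue  # new URL, nothing to compare against
        diff = {f: (before.get(f), row.get(f))
                for f in fields if before.get(f) != row.get(f)}
        if diff:
            changes[row["url"]] = diff
    return changes
```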
What does seoScore mean?
It is a lightweight completeness score based on missing or weak metadata fields. Use it to prioritize QA, not as a universal ranking metric.
Will it follow external links?
By default, no. sameDomainOnly keeps the crawl focused on the start URL hostname.
Can it scrape blocked websites?
Sometimes. Enable Apify Proxy when a site blocks datacenter traffic, but always follow the target website's terms and applicable laws.
Should I use browser mode for Amazon?
You can try it, but Amazon is heavily protected and may still return sparse or blocked pages. Use Apify Proxy, maxPages: 1, crawlDepth: 0, and BROWSER_ONLY for a small test before running a larger job.
Related Actors
Use this actor with product, catalog, review, or competitor intelligence actors when you need both page metadata and business data in the same workflow.
Changelog
0.1: Initial release with HTTP-first metadata extraction, social fields, headings, scoring, structured errors, and dataset append support.
Support
Open an issue from the Apify actor page or contact TroveVault with the run ID, input, and a short description of the page behavior you expected.