Schema Markup Scraper & SEO Auditor
Pricing: from $3.00 / 1,000 results
Extract JSON-LD, Microdata, RDFa, Open Graph & Twitter Cards. Runs a 0-100 SEO audit — checks canonical, hreflang, headings, image alt, EEAT author signals. Detects 80+ schema.org types including LocalBusiness with NAP, geo coordinates, and Google Place IDs.
Developer: Richard Feng
Actor stats: 6 bookmarked · 77 total users · 7 monthly active users
Last modified: 11 days ago
Schema Markup Scraper & SEO Auditor
Extract structured data, metadata, and SEO signals from any web page. Built for technical SEO auditing, local business intelligence, content aggregation, and competitive analysis.
What it does
This scraper visits one or more URLs and extracts everything a search engine sees: structured data (JSON-LD, Microdata, RDFa), social meta tags (Open Graph, Twitter Cards), and dozens of SEO signals. It then runs an automated audit and returns a 0-100 SEO score with actionable issues — aligned with Google's 2025 ranking signals and EEAT guidelines.
Key capabilities
Structured data extraction
- JSON-LD — Parses all `<script type="application/ld+json">` blocks, including nested `@graph` structures
- Microdata — Extracts `itemscope`/`itemprop` schema.org markup with full nesting support
- RDFa (opt-in) — Parses `typeof`/`property`/`vocab` attributes with schema.org vocabulary resolution
- Schema type detection — Identifies all schema.org types present (Product, Article, LocalBusiness, BreadcrumbList, etc.)
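As a concrete illustration of the `@graph` handling, collecting schema.org types from parsed JSON-LD can be sketched like this (a hypothetical helper, not the Actor's actual code):

```javascript
// Sketch: collect schema.org @type values from parsed JSON-LD blocks,
// descending into nested objects and @graph arrays.
function collectSchemaTypes(blocks) {
  const types = new Set();
  const visit = (node) => {
    if (Array.isArray(node)) { node.forEach(visit); return; }
    if (node === null || typeof node !== 'object') return;
    const t = node['@type'];
    if (typeof t === 'string') types.add(t);
    else if (Array.isArray(t)) t.forEach((x) => { if (typeof x === 'string') types.add(x); });
    Object.values(node).forEach(visit); // recurse into nested properties
  };
  visit(blocks);
  return [...types];
}

// A page with a nested @graph structure:
const blocks = [{
  '@context': 'https://schema.org',
  '@graph': [
    { '@type': 'Product', offers: { '@type': 'Offer' } },
    { '@type': 'BreadcrumbList' },
  ],
}];
console.log(collectSchemaTypes(blocks)); // → [ 'Product', 'Offer', 'BreadcrumbList' ]
```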
Social & meta tags
- Open Graph — All `og:*` properties (title, description, image, type, locale, etc.)
- Twitter Cards — All `twitter:*` properties with special handling for `summary_large_image`
- Dublin Core — `DC.*` and `DCTerms.*` academic/institutional metadata
- Standard meta tags — viewport, description, keywords, robots, theme-color, and all others
SEO analysis
- Canonical URL — Detects `<link rel="canonical">`
- Robots meta — Extracts directives for `robots`, `googlebot`, `bingbot`, etc.
- Heading hierarchy — Maps H1–H6 structure, counts H1 tags, detects skipped levels
- Image alt text audit — Counts images with/without alt attributes, calculates coverage percentage
- Viewport & charset — Verifies mobile-first indexing prerequisites
- SEO score (0-100) — Automated audit checking 15 Google ranking signals with error/warning/info severity
International SEO
- Hreflang tags — Extracts all `<link rel="alternate" hreflang="...">` with built-in validation:
  - Flags missing `x-default` fallback
  - Validates ISO 639-1 language codes (catches common mistakes like `en-UK` → should be `en-GB`)
  - Detects missing self-referencing tags
- Language detection — `<html lang>`, `<meta http-equiv="content-language">`, `og:locale`
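The validation rules above can be sketched as a small function (a hypothetical helper with simplified checks, not the Actor's actual code; the real ISO validation covers more than the `en-UK` case):

```javascript
// Sketch of the hreflang checks: x-default fallback, a common bad
// region code, and a self-referencing tag for the page's own URL.
function validateHreflang(tags, pageUrl) {
  const issues = [];
  if (!tags.some((t) => t.lang.toLowerCase() === 'x-default')) {
    issues.push('Missing x-default fallback');
  }
  for (const t of tags) {
    // "UK" is not an ISO 3166-1 region code; the correct code is GB
    if (/^en-uk$/i.test(t.lang)) {
      issues.push(`Invalid code "${t.lang}" (should be en-GB)`);
    }
  }
  if (!tags.some((t) => t.url === pageUrl)) {
    issues.push('Missing self-referencing hreflang tag');
  }
  return issues;
}

const issues = validateHreflang(
  [{ lang: 'en-UK', url: 'https://example.com/uk/' }],
  'https://example.com/'
);
// issues now flags all three problems
```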
EEAT & author signals
- Author extraction — Pulls author info from JSON-LD (`Person` type with `sameAs` links), `<meta name="author">`, and `<a rel="author">`
- Article metadata — `datePublished`, `dateModified`, `headline`, `wordCount`, `publisher` from Article/NewsArticle/BlogPosting schema
Local / Geo SEO
- LocalBusiness extraction — Detects 80+ schema.org LocalBusiness subtypes (Restaurant, Hotel, Dentist, Store, etc.) and extracts NAP, geo coordinates, opening hours, price range
- NAP (Name/Address/Phone) — From any Organization or LocalBusiness schema
- Geo meta tags — `geo.region`, `geo.placename`, `geo.position`, `ICBM`
- Google Maps references — Embedded map iframes, Place IDs, CID numbers
- hCard/vCard (opt-in) — `.vcard`/`.h-card` microformat contact data
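For illustration, pulling NAP out of a parsed LocalBusiness JSON-LD node might look like this (a hypothetical helper; the field names follow the schema.org `PostalAddress` vocabulary):

```javascript
// Sketch: extract NAP (Name / Address / Phone) from a LocalBusiness node.
function extractNap(node) {
  const addr = node.address || {};
  return {
    name: node.name ?? null,
    address: [addr.streetAddress, addr.addressLocality, addr.addressRegion, addr.postalCode]
      .filter(Boolean)
      .join(', ') || null,
    phone: node.telephone ?? null,
  };
}

// Example LocalBusiness node (made-up data):
const nap = extractNap({
  '@type': 'Restaurant',
  name: 'Blue Plate',
  telephone: '+1-415-555-0100',
  address: {
    '@type': 'PostalAddress',
    streetAddress: '3218 Mission St',
    addressLocality: 'San Francisco',
    addressRegion: 'CA',
    postalCode: '94110',
  },
});
// → { name: 'Blue Plate', address: '3218 Mission St, San Francisco, CA, 94110', phone: '+1-415-555-0100' }
```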
Breadcrumb validation
- Extracts `BreadcrumbList` schema items with position, name, and URL
- Validates sequential positions and flags relative URLs
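Those two checks can be sketched as follows (a hypothetical helper, not the Actor's actual code):

```javascript
// Sketch of the breadcrumb checks: positions must run 1, 2, 3, …
// and item URLs should be absolute.
function validateBreadcrumbs(items) {
  const issues = [];
  items.forEach((item, i) => {
    if (item.position !== i + 1) {
      issues.push(`Non-sequential position ${item.position} at index ${i}`);
    }
    if (item.url && !/^https?:\/\//i.test(item.url)) {
      issues.push(`Relative URL in breadcrumb position ${item.position}: "${item.url}"`);
    }
  });
  return issues;
}

const issues = validateBreadcrumbs([
  { position: 1, name: 'Women Home', url: '/shopping/women/items.aspx' },
  { position: 2, name: 'Jacquemus', url: 'https://www.farfetch.com/shopping/women/jacquemus/items.aspx' },
]);
// → one issue, about the relative URL in position 1
```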
Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | Array | (required) | URLs to scrape |
| `proxy` | Object | Apify Proxy | Proxy configuration |
| `maxRequestsPerCrawl` | Integer | 100 | Maximum pages to scrape (1–100,000) |
| `maxConcurrency` | Integer | 10 | Parallel pages (1–100) |
| `extractMetaTags` | Boolean | true | Extract all meta tags |
| `extractSeoAnalysis` | Boolean | true | SEO signals: canonical, hreflang, robots, headings, author, images, breadcrumbs, Dublin Core, viewport, charset |
| `extractGeoData` | Boolean | true | Geo tags, LocalBusiness, NAP, Google Maps references |
| `computeSeoScore` | Boolean | true | Run SEO audit (0-100 score + issues list) |
| `extractRdfa` | Boolean | false | RDFa structured data (opt-in) |
| `extractHCard` | Boolean | false | hCard/vCard microformats (opt-in) |
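A minimal input combining these parameters might look like the following (the URL is a placeholder, and `startUrls` is assumed to use Apify's usual `{ "url": ... }` request format):

```json
{
  "startUrls": [{ "url": "https://example.com/" }],
  "maxRequestsPerCrawl": 50,
  "maxConcurrency": 10,
  "extractMetaTags": true,
  "extractSeoAnalysis": true,
  "extractGeoData": true,
  "computeSeoScore": true,
  "extractRdfa": false,
  "extractHCard": false
}
```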
Output fields
Each scraped page produces a JSON object with these fields:
Core metadata
| Field | Type | Description |
|---|---|---|
| `url` | String | Final URL after redirects |
| `title` | String | Page `<title>` content |
| `icon` | String | Favicon/apple-touch-icon URL |
| `linkedData` | Array | JSON-LD structured data blocks |
| `microdata` | Array | Microdata (schema.org) items |
| `openGraph` | Object | Open Graph properties |
| `twitterCard` | Object | Twitter Card properties |
| `metaTags` | Object | All other meta tags |
SEO analysis (when `extractSeoAnalysis` is enabled)
| Field | Type | Description |
|---|---|---|
| `canonical` | String/null | Canonical URL |
| `robotsMeta` | Object | Robots directives (`{ robots: "index, follow", googlebot: "noarchive" }`) |
| `hreflang` | Object | `{ tags: [{lang, url}], hasXDefault: bool, issues: string[] }` |
| `language` | Object | `{ htmlLang, contentLanguage, ogLocale }` |
| `dublinCore` | Object | Dublin Core metadata |
| `viewport` | String/null | Viewport meta tag content |
| `charset` | String/null | Character encoding |
| `headings` | Object | `{ headings: [{level, text}], h1Count, issues[] }` |
| `imageAudit` | Object | `{ totalImages, imagesWithAlt, imagesWithoutAlt, altTexts[], issues[] }` |
| `authorInfo` | Object/null | `{ name, url, sameAs[], jobTitle, source }` |
| `schemaTypes` | Array | All schema.org types detected (e.g., `["Product", "BreadcrumbList"]`) |
| `articleMetadata` | Object/null | `{ datePublished, dateModified, headline, description, wordCount, publisher }` |
| `breadcrumbs` | Object/null | `{ items: [{position, name, url}], issues[] }` |
Geo / Local SEO (when `extractGeoData` is enabled)
| Field | Type | Description |
|---|---|---|
| `geoTags` | Object/null | Geo meta tags (`{ region, placename, position, icbm }`) |
| `localBusiness` | Object/null | LocalBusiness schema data with NAP, geo coordinates, opening hours |
| `nap` | Object/null | Name, Address, Phone from Organization/LocalBusiness |
| `mapReferences` | Object/null | `{ googleMapsEmbeds[], placeIds[], cids[] }` |
SEO audit (when `computeSeoScore` is enabled)
| Field | Type | Description |
|---|---|---|
| `seoAudit` | Object | `{ score: 0-100, issues: [{severity, code, message}] }` |
Optional extractors
| Field | Type | Description |
|---|---|---|
| `rdfa` | Array | RDFa structured data (when `extractRdfa` is enabled) |
| `hCards` | Array | hCard/vCard contact data (when `extractHCard` is enabled) |
SEO audit checks
The audit starts at 100 and deducts points for each issue found:
| Check | Severity | Points | What it catches |
|---|---|---|---|
| Missing `<title>` | Error | -10 | Core ranking signal |
| Title > 60 chars | Warning | -5 | SERP truncation |
| Missing meta description | Error | -10 | CTR impact |
| Description > 160 chars | Warning | -5 | SERP truncation |
| Missing or multiple H1 | Error | -10 | Content hierarchy |
| Missing canonical URL | Warning | -5 | Duplicate content risk |
| Missing og:title | Warning | -5 | Social share CTR |
| Missing og:description | Warning | -5 | Social share CTR |
| Missing og:image | Warning | -5 | Social CTR (40-60% impact) |
| No structured data | Warning | -5 | Rich results eligibility |
| Missing favicon | Warning | -5 | Brand trust signal |
| Missing viewport | Warning | -5 | Mobile-first indexing |
| Missing hreflang x-default | Warning | -5 | International SEO trust |
| Missing author (on articles) | Warning | -5 | EEAT signal |
| Missing datePublished (on articles) | Warning | -5 | Freshness signal |
| > 50% images without alt | Warning | -5 | Accessibility + AI |
| No BreadcrumbList (deep pages) | Info | -1 | Navigation hierarchy |
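The arithmetic reduces to subtracting points per issue from a starting score of 100, floored at 0. A sketch (this is not the Actor's actual code; the deduction values mirror the severity levels in the table):

```javascript
// Sketch of the audit arithmetic: 100 minus per-issue deductions,
// clamped so the score never goes below 0.
function computeScore(issues) {
  const points = { error: 10, warning: 5, info: 1 }; // per the table above
  const total = issues.reduce((sum, i) => sum + (points[i.severity] ?? 0), 0);
  return Math.max(0, 100 - total);
}

const score = computeScore([
  { severity: 'error', code: 'MISSING_DESCRIPTION' },
  { severity: 'warning', code: 'MISSING_OG_IMAGE' },
  { severity: 'info', code: 'NO_BREADCRUMBS' },
]);
// 100 - (10 + 5 + 1) = 84
```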
Use cases
Technical SEO audit
Crawl your site and get an instant SEO health check across every page. Identify missing titles, broken heading hierarchies, absent structured data, and more — with a prioritized issues list.
E-commerce competitive analysis
Extract product schema (pricing, availability, reviews, return policies), breadcrumb structures, and rich snippet eligibility from competitor product pages.
Local business intelligence
Scrape LocalBusiness schema from directories, review sites, or business websites. Extract NAP data, opening hours, geo coordinates, Google Place IDs, and CID numbers for lead generation or data enrichment.
International SEO validation
Audit hreflang implementations across multilingual sites. Catch the errors that 75% of implementations contain: missing x-default fallbacks, invalid ISO codes, missing self-references.
EEAT & content analysis
Extract author information, sameAs links to verified profiles, publication dates, and publisher data from article pages. Monitor how well your content signals expertise and authority.
Social media preview testing
Verify how pages will appear when shared on Facebook (Open Graph) and Twitter/X (Twitter Cards). Check for missing images, truncated descriptions, and incomplete metadata.
Content aggregation
Build news aggregators or content feeds by extracting article metadata, publication dates, authors, and descriptions from multiple sources in a single crawl.
Example output
News article (CNN)
{"url": "https://edition.cnn.com/2025/04/18/politics/...","title": "Supreme Court temporarily pauses deportations under Alien Enemies Act","canonical": "https://www.cnn.com/2025/04/18/politics/...","language": {"htmlLang": "en","contentLanguage": null,"ogLocale": "en_US"},"hreflang": {"tags": [{ "lang": "en-gb", "url": "https://edition.cnn.com/..." },{ "lang": "en-us", "url": "https://www.cnn.com/..." },{ "lang": "x-default", "url": "https://edition.cnn.com/..." }],"hasXDefault": true,"issues": ["Missing self-referencing hreflang tag"]},"authorInfo": {"name": "Tierney Sneed, John Fritze","url": null,"sameAs": [],"jobTitle": null,"source": "meta"},"schemaTypes": ["NewsArticle", "Person", "ImageObject", "Organization", "WebPage", "NewsMediaOrganization"],"seoAudit": {"score": 79,"issues": [{ "severity": "warning", "code": "TITLE_TOO_LONG", "message": "Title is 94 chars (recommended: max 60)" },{ "severity": "warning", "code": "DESCRIPTION_TOO_LONG", "message": "Meta description is 267 chars (recommended: max 160)" },{ "severity": "warning", "code": "MISSING_DATE_PUBLISHED", "message": "Article page missing datePublished (impacts freshness signals)" },{ "severity": "info", "code": "NO_BREADCRUMBS", "message": "Deep page with no BreadcrumbList schema (helps navigation hierarchy)" }]}}
E-commerce product (Farfetch)
{"url": "https://www.farfetch.com/shopping/women/jacquemus-les-doubles-sandals-item-28543291.aspx","title": "Jacquemus Les Doubles Sandals | Brown | FARFETCH","canonical": "https://www.farfetch.com/shopping/women/jacquemus-les-doubles-sandals-item-28543291.aspx","schemaTypes": ["ProductGroup", "ImageObject", "Brand", "Product", "Offer", "MerchantReturnPolicy", "UnitPriceSpecification", "BreadcrumbList", "ListItem"],"breadcrumbs": {"items": [{ "position": 1, "name": "Women Home", "url": "/shopping/women/items.aspx" },{ "position": 2, "name": "Jacquemus", "url": "/shopping/women/jacquemus/items.aspx" },{ "position": 3, "name": "Shoes", "url": "/shopping/women/jacquemus/shoes-1/items.aspx" },{ "position": 4, "name": "Heeled Sandals", "url": "/shopping/women/jacquemus/heeled-sandals-1/items.aspx" }],"issues": ["Relative URL in breadcrumb position 1: \"/shopping/women/items.aspx\""]},"headings": {"h1Count": 1,"issues": ["Skipped heading level: h2 to h4"]},"imageAudit": {"totalImages": 6,"imagesWithAlt": 5,"imagesWithoutAlt": 1,"issues": ["1 image(s) missing alt attribute"]},"robotsMeta": { "robots": "noindex" },"seoAudit": { "score": 100, "issues": [] }}
Wikipedia (RDFa + hCard)
{"url": "https://en.wikipedia.org/wiki/San_Francisco","title": "San Francisco - Wikipedia","canonical": "https://en.wikipedia.org/wiki/San_Francisco","schemaTypes": ["Article", "Organization", "ImageObject"],"authorInfo": {"name": "Contributors to Wikimedia projects","source": "json-ld"},"articleMetadata": {"datePublished": "2001-11-13T04:30:40Z","dateModified": "2026-03-28T17:14:57Z","headline": "consolidated city and county in California, United States","publisher": { "name": "Wikimedia Foundation, Inc." }},"rdfa": ["... 114 RDFa items extracted ..."],"hCards": [{ "name": "San Francisco", "... ": "..." }],"imageAudit": {"totalImages": 119,"imagesWithAlt": 43,"imagesWithoutAlt": 76,"issues": ["64% of images missing alt text (accessibility + AI understanding)"]},"seoAudit": {"score": 80,"issues": [{ "severity": "error", "code": "MISSING_DESCRIPTION", "message": "Page has no meta description" },{ "severity": "warning", "code": "MISSING_OG_DESCRIPTION", "message": "Missing og:description meta tag" },{ "severity": "warning", "code": "IMAGES_MISSING_ALT", "message": "64% of images missing alt text" }]}}
Technical details
- Engine: CheerioCrawler (server-side HTML parsing, no JavaScript execution)
- Runtime: Node.js 22, Apify SDK 3.5.3, Crawlee 3.15.3
- Performance: Lightweight and fast — no browser overhead
- Sessions: Automatic session rotation with cookie persistence
- Proxies: Full proxy support including Apify Proxy residential groups
- Retries: Up to 3 retries per request with automatic session rotation on blocks
- Link following: Optional crawling with configurable depth
Limitations
- No JavaScript rendering — Pages that require JS to load content (e.g., SPAs, IMDb) will return minimal data. Use a browser-based scraper for these.
- Anti-bot protection — Some sites (Yelp, BBC, Medium) may block requests even with residential proxies. Results depend on proxy quality.
- RDFa complexity — The RDFa extractor handles the common schema.org vocabulary case. Exotic namespace prefixes may not be fully resolved.