Schema Markup Scraper & SEO Auditor

Pricing: from $3.00 / 1,000 results

Extract JSON-LD, Microdata, RDFa, Open Graph & Twitter Cards. Runs a 0-100 SEO audit — checks canonical, hreflang, headings, image alt, EEAT author signals. Detects 80+ schema.org types including LocalBusiness with NAP, geo coordinates, and Google Place IDs.

Developer: Richard Feng (Maintained by Community)

Actor stats: 6 bookmarks · 77 total users · 7 monthly active users · last modified 11 days ago

Schema Markup Scraper & SEO Auditor

Extract structured data, metadata, and SEO signals from any web page. Built for technical SEO auditing, local business intelligence, content aggregation, and competitive analysis.

What it does

This scraper visits one or more URLs and extracts everything a search engine sees: structured data (JSON-LD, Microdata, RDFa), social meta tags (Open Graph, Twitter Cards), and dozens of SEO signals. It then runs an automated audit and returns a 0-100 SEO score with actionable issues — aligned with Google's 2025 ranking signals and EEAT guidelines.

Key capabilities

Structured data extraction

  • JSON-LD — Parses all <script type="application/ld+json"> blocks, including nested @graph structures
  • Microdata — Extracts itemscope/itemprop schema.org markup with full nesting support
  • RDFa (opt-in) — Parses typeof/property/vocab attributes with schema.org vocabulary resolution
  • Schema type detection — Identifies all schema.org types present (Product, Article, LocalBusiness, BreadcrumbList, etc.)
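
The JSON-LD step can be sketched in plain Node.js, assuming raw HTML as input. The `extractJsonLd` helper below is illustrative, not the actor's internal code; it shows the `@graph` flattening described above:

```javascript
// Hypothetical sketch of JSON-LD extraction with @graph flattening.
function extractJsonLd(html) {
  const blocks = [];
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let match;
  while ((match = re.exec(html)) !== null) {
    try {
      const parsed = JSON.parse(match[1]);
      // A top-level @graph wraps several nodes in one block; flatten it
      // so each schema.org node becomes its own item.
      if (parsed && Array.isArray(parsed["@graph"])) {
        blocks.push(...parsed["@graph"]);
      } else {
        blocks.push(parsed);
      }
    } catch {
      // Skip malformed JSON rather than failing the whole page.
    }
  }
  return blocks;
}
```

A real extractor would use an HTML parser rather than a regex, but the parse-then-flatten flow is the same.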

Social & meta tags

  • Open Graph — All og:* properties (title, description, image, type, locale, etc.)
  • Twitter Cards — All twitter:* properties with special handling for summary_large_image
  • Dublin Core — DC.* and DCTerms.* academic/institutional metadata
  • Standard meta tags — viewport, description, keywords, robots, theme-color, and all others

SEO analysis

  • Canonical URL — Detects <link rel="canonical">
  • Robots meta — Extracts directives for robots, googlebot, bingbot, etc.
  • Heading hierarchy — Maps H1–H6 structure, counts H1 tags, detects skipped levels
  • Image alt text audit — Counts images with/without alt attributes, calculates coverage percentage
  • Viewport & charset — Verifies mobile-first indexing prerequisites
  • SEO score (0-100) — Automated audit checking 15 Google ranking signals with error/warning/info severity
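
The heading-hierarchy check above can be illustrated like this, assuming headings have already been collected into a flat list of levels in document order (`auditHeadings` is a hypothetical name, not the actor's API):

```javascript
// Illustrative heading audit: count H1s and detect skipped levels.
function auditHeadings(levels) {
  const issues = [];
  const h1Count = levels.filter((l) => l === 1).length;
  if (h1Count === 0) issues.push("Missing H1");
  if (h1Count > 1) issues.push(`Multiple H1 tags (${h1Count})`);
  for (let i = 1; i < levels.length; i++) {
    // A jump of more than one level downward (e.g. h2 -> h4)
    // counts as a skipped level.
    if (levels[i] - levels[i - 1] > 1) {
      issues.push(`Skipped heading level: h${levels[i - 1]} to h${levels[i]}`);
    }
  }
  return { h1Count, issues };
}
```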

International SEO

  • Hreflang tags — Extracts all <link rel="alternate" hreflang="..."> with built-in validation:
    • Flags missing x-default fallback
    • Validates ISO 639-1 language codes (catches common mistakes like en-UK → should be en-GB)
    • Detects missing self-referencing tags
  • Language detection — <html lang>, <meta http-equiv="content-language">, og:locale
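
The three hreflang validations above might look roughly like the sketch below. The `{ lang, url }` tag shape matches the documented output; `validateHreflang` itself and the correction table are illustrative assumptions:

```javascript
// Common region-code mistakes and their corrections (illustrative subset).
const REGION_FIXES = { "en-uk": "en-gb" };

function validateHreflang(tags, pageUrl) {
  const issues = [];
  const langs = tags.map((t) => t.lang.toLowerCase());
  // 1. Every cluster should declare a fallback for unmatched locales.
  if (!langs.includes("x-default")) {
    issues.push("Missing x-default fallback");
  }
  // 2. Catch invalid ISO 639-1 / region combinations.
  for (const lang of langs) {
    if (REGION_FIXES[lang]) {
      issues.push(`Invalid code ${lang} (should be ${REGION_FIXES[lang]})`);
    }
  }
  // 3. Each page should also reference itself in its own tag set.
  if (!tags.some((t) => t.url === pageUrl)) {
    issues.push("Missing self-referencing hreflang tag");
  }
  return { hasXDefault: langs.includes("x-default"), issues };
}
```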

EEAT & author signals

  • Author extraction — Pulls author info from JSON-LD (Person type with sameAs links), <meta name="author">, and <a rel="author">
  • Article metadata — datePublished, dateModified, headline, wordCount, publisher from Article/NewsArticle/BlogPosting schema

Local / Geo SEO

  • LocalBusiness extraction — Detects 80+ schema.org LocalBusiness subtypes (Restaurant, Hotel, Dentist, Store, etc.) and extracts NAP, geo coordinates, opening hours, price range
  • NAP (Name/Address/Phone) — From any Organization or LocalBusiness schema
  • Geo meta tags — geo.region, geo.placename, geo.position, ICBM
  • Google Maps references — Embedded map iframes, Place IDs, CID numbers
  • hCard/vCard (opt-in) — .vcard/.h-card microformat contact data
  • Breadcrumbs — Extracts BreadcrumbList schema items with position, name, and URL; validates sequential positions and flags relative URLs
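
The two breadcrumb checks can be sketched as follows; the item shape mirrors the documented `{ position, name, url }` output, while `validateBreadcrumbs` is a hypothetical helper:

```javascript
// Illustrative breadcrumb validation: sequential positions, absolute URLs.
function validateBreadcrumbs(items) {
  const issues = [];
  items.forEach((item, i) => {
    // Positions should run 1, 2, 3, ... without gaps or reordering.
    if (item.position !== i + 1) {
      issues.push(`Non-sequential position at index ${i}: ${item.position}`);
    }
    // Absolute URLs are recommended in BreadcrumbList items.
    if (item.url && !/^https?:\/\//i.test(item.url)) {
      issues.push(`Relative URL in breadcrumb position ${item.position}: "${item.url}"`);
    }
  });
  return issues;
}
```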

Input parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | Array | (required) | URLs to scrape |
| proxy | Object | Apify Proxy | Proxy configuration |
| maxRequestsPerCrawl | Integer | 100 | Maximum pages to scrape (1–100,000) |
| maxConcurrency | Integer | 10 | Parallel pages (1–100) |
| extractMetaTags | Boolean | true | Extract all meta tags |
| extractSeoAnalysis | Boolean | true | SEO signals: canonical, hreflang, robots, headings, author, images, breadcrumbs, Dublin Core, viewport, charset |
| extractGeoData | Boolean | true | Geo tags, LocalBusiness, NAP, Google Maps references |
| computeSeoScore | Boolean | true | Run SEO audit (0-100 score + issues list) |
| extractRdfa | Boolean | false | RDFa structured data (opt-in) |
| extractHCard | Boolean | false | hCard/vCard microformats (opt-in) |
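
A minimal run input exercising these parameters might look like this. The URL is a placeholder, and startUrls is assumed to follow Apify's usual `{ "url": ... }` request format:

```json
{
  "startUrls": [{ "url": "https://example.com/" }],
  "maxRequestsPerCrawl": 50,
  "maxConcurrency": 10,
  "extractMetaTags": true,
  "extractSeoAnalysis": true,
  "extractGeoData": true,
  "computeSeoScore": true,
  "extractRdfa": false,
  "extractHCard": false
}
```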

Output fields

Each scraped page produces a JSON object with these fields:

Core metadata

| Field | Type | Description |
|---|---|---|
| url | String | Final URL after redirects |
| title | String | Page <title> content |
| icon | String | Favicon/apple-touch-icon URL |
| linkedData | Array | JSON-LD structured data blocks |
| microdata | Array | Microdata (schema.org) items |
| openGraph | Object | Open Graph properties |
| twitterCard | Object | Twitter Card properties |
| metaTags | Object | All other meta tags |

SEO analysis (when extractSeoAnalysis is enabled)

| Field | Type | Description |
|---|---|---|
| canonical | String/null | Canonical URL |
| robotsMeta | Object | Robots directives ({ robots: "index, follow", googlebot: "noarchive" }) |
| hreflang | Object | { tags: [{lang, url}], hasXDefault: bool, issues: string[] } |
| language | Object | { htmlLang, contentLanguage, ogLocale } |
| dublinCore | Object | Dublin Core metadata |
| viewport | String/null | Viewport meta tag content |
| charset | String/null | Character encoding |
| headings | Object | { headings: [{level, text}], h1Count, issues[] } |
| imageAudit | Object | { totalImages, imagesWithAlt, imagesWithoutAlt, altTexts[], issues[] } |
| authorInfo | Object/null | { name, url, sameAs[], jobTitle, source } |
| schemaTypes | Array | All schema.org types detected (e.g., ["Product", "BreadcrumbList"]) |
| articleMetadata | Object/null | { datePublished, dateModified, headline, description, wordCount, publisher } |
| breadcrumbs | Object/null | { items: [{position, name, url}], issues[] } |

Geo / Local SEO (when extractGeoData is enabled)

| Field | Type | Description |
|---|---|---|
| geoTags | Object/null | Geo meta tags ({ region, placename, position, icbm }) |
| localBusiness | Object/null | LocalBusiness schema data with NAP, geo coordinates, opening hours |
| nap | Object/null | Name, Address, Phone from Organization/LocalBusiness |
| mapReferences | Object/null | { googleMapsEmbeds[], placeIds[], cids[] } |

SEO audit (when computeSeoScore is enabled)

| Field | Type | Description |
|---|---|---|
| seoAudit | Object | { score: 0-100, issues: [{severity, code, message}] } |

Optional extractors

| Field | Type | Description |
|---|---|---|
| rdfa | Array | RDFa structured data (when extractRdfa is enabled) |
| hCards | Array | hCard/vCard contact data (when extractHCard is enabled) |

SEO audit checks

The audit starts at 100 and deducts points for each issue found:

| Check | Severity | Points | What it catches |
|---|---|---|---|
| Missing <title> | Error | -10 | Core ranking signal |
| Title > 60 chars | Warning | -5 | SERP truncation |
| Missing meta description | Error | -10 | CTR impact |
| Description > 160 chars | Warning | -5 | SERP truncation |
| Missing or multiple H1 | Error | -10 | Content hierarchy |
| Missing canonical URL | Warning | -5 | Duplicate content risk |
| Missing og:title | Warning | -5 | Social share CTR |
| Missing og:description | Warning | -5 | Social share CTR |
| Missing og:image | Warning | -5 | Social CTR (40-60% impact) |
| No structured data | Warning | -5 | Rich results eligibility |
| Missing favicon | Warning | -5 | Brand trust signal |
| Missing viewport | Warning | -5 | Mobile-first indexing |
| Missing hreflang x-default | Warning | -5 | International SEO trust |
| Missing author (on articles) | Warning | -5 | EEAT signal |
| Missing datePublished (on articles) | Warning | -5 | Freshness signal |
| > 50% images without alt | Warning | -5 | Accessibility + AI |
| No BreadcrumbList (deep pages) | Info | -1 | Navigation hierarchy |
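
The deduction model can be sketched as below. This only illustrates the table's arithmetic (per-severity penalties, clamped at 0); the actor's exact bookkeeping may differ, and a real run can surface additional internal checks:

```javascript
// Illustrative per-severity penalties matching the table above.
const PENALTY = { error: 10, warning: 5, info: 1 };

// Start at 100, subtract each triggered check's points, never go below 0.
function computeScore(issues) {
  const total = issues.reduce(
    (sum, issue) => sum + (PENALTY[issue.severity] || 0),
    0
  );
  return { score: Math.max(0, 100 - total), issues };
}
```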

Use cases

Technical SEO audit

Crawl your site and get an instant SEO health check across every page. Identify missing titles, broken heading hierarchies, absent structured data, and more — with a prioritized issues list.

E-commerce competitive analysis

Extract product schema (pricing, availability, reviews, return policies), breadcrumb structures, and rich snippet eligibility from competitor product pages.

Local business intelligence

Scrape LocalBusiness schema from directories, review sites, or business websites. Extract NAP data, opening hours, geo coordinates, Google Place IDs, and CID numbers for lead generation or data enrichment.

International SEO validation

Audit hreflang implementations across multilingual sites. Catch the errors that 75% of implementations contain: missing x-default fallbacks, invalid ISO codes, missing self-references.

EEAT & content analysis

Extract author information, sameAs links to verified profiles, publication dates, and publisher data from article pages. Monitor how well your content signals expertise and authority.

Social media preview testing

Verify how pages will appear when shared on Facebook (Open Graph) and Twitter/X (Twitter Cards). Check for missing images, truncated descriptions, and incomplete metadata.

Content aggregation

Build news aggregators or content feeds by extracting article metadata, publication dates, authors, and descriptions from multiple sources in a single crawl.

Example output

News article (CNN)

```json
{
  "url": "https://edition.cnn.com/2025/04/18/politics/...",
  "title": "Supreme Court temporarily pauses deportations under Alien Enemies Act",
  "canonical": "https://www.cnn.com/2025/04/18/politics/...",
  "language": {
    "htmlLang": "en",
    "contentLanguage": null,
    "ogLocale": "en_US"
  },
  "hreflang": {
    "tags": [
      { "lang": "en-gb", "url": "https://edition.cnn.com/..." },
      { "lang": "en-us", "url": "https://www.cnn.com/..." },
      { "lang": "x-default", "url": "https://edition.cnn.com/..." }
    ],
    "hasXDefault": true,
    "issues": ["Missing self-referencing hreflang tag"]
  },
  "authorInfo": {
    "name": "Tierney Sneed, John Fritze",
    "url": null,
    "sameAs": [],
    "jobTitle": null,
    "source": "meta"
  },
  "schemaTypes": ["NewsArticle", "Person", "ImageObject", "Organization", "WebPage", "NewsMediaOrganization"],
  "seoAudit": {
    "score": 79,
    "issues": [
      { "severity": "warning", "code": "TITLE_TOO_LONG", "message": "Title is 94 chars (recommended: max 60)" },
      { "severity": "warning", "code": "DESCRIPTION_TOO_LONG", "message": "Meta description is 267 chars (recommended: max 160)" },
      { "severity": "warning", "code": "MISSING_DATE_PUBLISHED", "message": "Article page missing datePublished (impacts freshness signals)" },
      { "severity": "info", "code": "NO_BREADCRUMBS", "message": "Deep page with no BreadcrumbList schema (helps navigation hierarchy)" }
    ]
  }
}
```

E-commerce product (Farfetch)

```json
{
  "url": "https://www.farfetch.com/shopping/women/jacquemus-les-doubles-sandals-item-28543291.aspx",
  "title": "Jacquemus Les Doubles Sandals | Brown | FARFETCH",
  "canonical": "https://www.farfetch.com/shopping/women/jacquemus-les-doubles-sandals-item-28543291.aspx",
  "schemaTypes": ["ProductGroup", "ImageObject", "Brand", "Product", "Offer", "MerchantReturnPolicy", "UnitPriceSpecification", "BreadcrumbList", "ListItem"],
  "breadcrumbs": {
    "items": [
      { "position": 1, "name": "Women Home", "url": "/shopping/women/items.aspx" },
      { "position": 2, "name": "Jacquemus", "url": "/shopping/women/jacquemus/items.aspx" },
      { "position": 3, "name": "Shoes", "url": "/shopping/women/jacquemus/shoes-1/items.aspx" },
      { "position": 4, "name": "Heeled Sandals", "url": "/shopping/women/jacquemus/heeled-sandals-1/items.aspx" }
    ],
    "issues": ["Relative URL in breadcrumb position 1: \"/shopping/women/items.aspx\""]
  },
  "headings": {
    "h1Count": 1,
    "issues": ["Skipped heading level: h2 to h4"]
  },
  "imageAudit": {
    "totalImages": 6,
    "imagesWithAlt": 5,
    "imagesWithoutAlt": 1,
    "issues": ["1 image(s) missing alt attribute"]
  },
  "robotsMeta": { "robots": "noindex" },
  "seoAudit": { "score": 100, "issues": [] }
}
```

Wikipedia (RDFa + hCard)

```json
{
  "url": "https://en.wikipedia.org/wiki/San_Francisco",
  "title": "San Francisco - Wikipedia",
  "canonical": "https://en.wikipedia.org/wiki/San_Francisco",
  "schemaTypes": ["Article", "Organization", "ImageObject"],
  "authorInfo": {
    "name": "Contributors to Wikimedia projects",
    "source": "json-ld"
  },
  "articleMetadata": {
    "datePublished": "2001-11-13T04:30:40Z",
    "dateModified": "2026-03-28T17:14:57Z",
    "headline": "consolidated city and county in California, United States",
    "publisher": { "name": "Wikimedia Foundation, Inc." }
  },
  "rdfa": ["... 114 RDFa items extracted ..."],
  "hCards": [{ "name": "San Francisco", "... ": "..." }],
  "imageAudit": {
    "totalImages": 119,
    "imagesWithAlt": 43,
    "imagesWithoutAlt": 76,
    "issues": ["64% of images missing alt text (accessibility + AI understanding)"]
  },
  "seoAudit": {
    "score": 80,
    "issues": [
      { "severity": "error", "code": "MISSING_DESCRIPTION", "message": "Page has no meta description" },
      { "severity": "warning", "code": "MISSING_OG_DESCRIPTION", "message": "Missing og:description meta tag" },
      { "severity": "warning", "code": "IMAGES_MISSING_ALT", "message": "64% of images missing alt text" }
    ]
  }
}
```

Technical details

  • Engine: CheerioCrawler (server-side HTML parsing, no JavaScript execution)
  • Runtime: Node.js 22, Apify SDK 3.5.3, Crawlee 3.15.3
  • Performance: Lightweight and fast — no browser overhead
  • Sessions: Automatic session rotation with cookie persistence
  • Proxies: Full proxy support including Apify Proxy residential groups
  • Retries: Up to 3 retries per request with automatic session rotation on blocks
  • Link following: Optional crawling with configurable depth

Limitations

  • No JavaScript rendering — Pages that require JS to load content (e.g., SPAs, IMDb) will return minimal data. Use a browser-based scraper for these.
  • Anti-bot protection — Some sites (Yelp, BBC, Medium) may block requests even with residential proxies. Results depend on proxy quality.
  • RDFa complexity — The RDFa extractor handles the common schema.org vocabulary case. Exotic namespace prefixes may not be fully resolved.