Structured Data Scraper & Validator
Crawl websites to extract JSON-LD and Microdata, validate schema markup syntax, and flag missing fields across massive URL lists.
Pricing
from $9.00 / 1,000 results
Developer: Taro Yamada
Actor stats: Bookmarked 0 · Total users 6 · Monthly active users 2 · Last modified 4 days ago
Structured Data Validator API | Schema Coverage & Errors
Crawling websites to extract structured data is essential for maintaining robust organic search visibility. Broken markup often goes unnoticed until traffic and rich snippets drop, costing you valuable clicks. This schema validation tool acts as a rigorous web scraper that parses every page to evaluate JSON-LD and Microdata quality at scale. Instead of manually checking individual URLs in external tools, you can schedule automated schema audits across thousands of pages to ensure semantic web standards are perfectly maintained.
Built for technical SEO and recurring compliance workflows, this scraper identifies missing schema fields, exposes invalid JSON syntax, and flags recommended Schema.org properties you might have overlooked. Every scraped page receives a clear 0-100 quality score, giving SEO and operations teams immediate, actionable insight into technical website health.
Easily schedule daily or weekly runs to monitor critical landing pages, product pages, or massive URL lists. By extracting the exact schema output mapped against search engine guidelines, you can rapidly export a comprehensive health report, catch syntax errors early, and secure your organic SERP real estate without manual intervention.
Store Quickstart
- Start with Quickstart (Dataset) to validate the score and error model on two public pages.
- For full audits, use Batch Validation for multi-page quality scoring.
- For recurring monitoring, use Webhook Alert to catch schema errors immediately.
Key Features
- 🔍 JSON-LD + Microdata extraction — Both formats supported
- 📊 Quality scoring — 0-100 with A-F grade per page
- ⚠️ Error detection — Missing `@type`, invalid JSON, missing `@context`
- 💡 Warnings — Sparse data, missing recommended properties
- 📋 Bulk processing — Check up to 200 URLs per run
- 🪝 Webhook delivery — Integrate into SEO monitoring workflows
Use Cases
| Who | Why |
|---|---|
| Developers | Automate recurring data fetches without building custom scrapers |
| Data teams | Pipe structured output into analytics warehouses |
| Ops teams | Monitor changes via webhook alerts |
| Product managers | Track competitor/market signals without engineering time |
Input
| Field | Type | Default | Description |
|---|---|---|---|
| urls | array | prefilled | List of page URLs to check for Schema.org structured data (JSON-LD, Microdata). Max 200. |
| delivery | string | "dataset" | How to deliver results. 'dataset' saves to Apify Dataset (recommended), 'webhook' sends to a URL. |
| webhookUrl | string | — | Webhook URL to send results to (only used when delivery is 'webhook'). Works with Slack, Discord, or any HTTP endpoint. |
| concurrency | integer | 3 | Maximum number of parallel requests. Higher = faster but may trigger rate limits. |
| dryRun | boolean | false | If true, runs without saving results or sending webhooks. Useful for testing. |
Input Example
{"urls": ["https://www.google.com", "https://github.com", "https://schema.org"],"concurrency": 3}
Output
| Field | Type | Description |
|---|---|---|
| meta | object | Run metadata |
| results | array | One entry per checked URL |
| results[].url | string (url) | The page URL that was checked |
| results[].jsonLd | array | Extracted JSON-LD blocks |
| results[].microdata | array | Extracted Microdata items |
| results[].errors | array | Detected schema errors |
| results[].warnings | array | Detected warnings |
| results[].score | object | Quality score: total (0-100) and grade (A-F) |
| results[].error | string \| null | Failure reason if the URL was unreachable, otherwise null |
| results[].checkedAt | timestamp | When the page was checked |
Output Example
{"url": "https://www.google.com","jsonLd": [{ "type": "WebSite", "context": "https://schema.org", "name": "Google", "_keyCount": 7 }],"microdata": [],"errors": [],"warnings": [],"score": { "total": 80, "grade": "A" }}
API Usage
Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.
cURL
```bash
curl -X POST "https://api.apify.com/v2/acts/taroyamada~structured-data-validator/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "urls": ["https://www.google.com", "https://github.com", "https://schema.org"], "concurrency": 3 }'
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("taroyamada/structured-data-validator").call(run_input={
    "urls": ["https://www.google.com", "https://github.com", "https://schema.org"],
    "concurrency": 3,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```
JavaScript / Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('taroyamada/structured-data-validator').call({
  urls: ['https://www.google.com', 'https://github.com', 'https://schema.org'],
  concurrency: 3,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Tips & Limitations
- Keep concurrency ≤ 5 when auditing production sites to avoid WAF rate-limit triggers.
- Use webhook delivery for recurring cron runs — push only deltas to downstream systems.
- Enable `dryRun` for cheap validation before committing to a paid cron schedule.
- Results are dataset-first; use the Apify API `run-sync-get-dataset-items` endpoint for instant JSON in CI pipelines (see the sketch below).
- Run a tiny URL count first, review the sample, then scale up — pay-per-event means you only pay for what you use.
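As a sketch of that CI pattern: the `run-sync-get-dataset-items` endpoint from the cURL example above, called from Node 18+ (global fetch, saved as `.mjs` for top-level await), failing the build when any page reports schema errors. The response is assumed to be the dataset items as a JSON array, matching the Output section:

```javascript
// ci-schema-gate.mjs — sketch of a CI gate; fails the build on schema errors.
// Assumes an APIFY_TOKEN environment variable is set.
const endpoint =
  'https://api.apify.com/v2/acts/taroyamada~structured-data-validator' +
  `/run-sync-get-dataset-items?token=${process.env.APIFY_TOKEN}`;

const res = await fetch(endpoint, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ urls: ['https://www.example.com'], concurrency: 3 }),
});
const items = await res.json();

const failing = items.filter((r) => (r.errors?.length ?? 0) > 0);
if (failing.length > 0) {
  for (const r of failing) console.error(`${r.url}: ${r.errors.length} error(s)`);
  process.exit(1);
}
console.log(`All ${items.length} pages passed schema validation.`);
```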
FAQ
Is there a rate limit?
Built-in concurrency throttling keeps requests polite. Most public sites tolerate 1–10 parallel requests from this actor without issues.
What happens when the input URL is unreachable?
The actor records an error row with the failure reason — successful URLs keep processing.
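The exact shape of such an error row isn't documented here; as a rough, hypothetical sketch based on the Output table above (field values invented for illustration):

```json
{
  "url": "https://unreachable.example.com",
  "jsonLd": [],
  "microdata": [],
  "errors": [],
  "warnings": [],
  "score": null,
  "error": "Request timed out",
  "checkedAt": "2025-01-01T00:00:00Z"
}
```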
Can I schedule recurring runs?
Yes — use Apify Schedules to run this actor on a cron (hourly, daily, weekly). Combine with webhook delivery for change alerts.
Does this actor respect robots.txt?
Yes — requests use a standard User-Agent and honor site rate limits. For aggressive audits, set a higher concurrency only on your own properties.
Can I integrate with Google Sheets or Airtable?
Use webhook delivery with a Zapier/Make/n8n catcher, or call the Apify REST API from Apps Script / Airtable automations.
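To inspect the webhook payload before wiring up Zapier or Sheets, you can run a minimal local catcher. A sketch using Node's built-in `http` module; expose it with a tunnel (e.g. ngrok) and pass that URL as `webhookUrl`. The payload shape is an assumption, so log it to confirm:

```javascript
import { createServer } from 'node:http';

// Minimal local catcher: logs whatever the actor POSTs in webhook mode.
createServer((req, res) => {
  let body = '';
  req.on('data', (chunk) => (body += chunk));
  req.on('end', () => {
    console.log('Webhook payload:', body); // inspect the real shape here
    res.writeHead(200);
    res.end('ok');
  });
}).listen(3000, () => console.log('Listening on http://localhost:3000'));
```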
Complete Your Website Health Audit
Website Health Suite — Build a comprehensive compliance and trust monitoring workflow:
1. Link & URL Health
- 🔗 Broken Link Checker — Find broken links across your entire site structure
- 🔗 Bulk URL Health Checker — Validate HTTP status, redirects, SSL, and response times
2. SEO & Metadata Quality (you are here)
- Schema.org Validator — Validate JSON-LD and Microdata with quality scoring
- 🏷️ Meta Tag Analyzer — Audit title tags, Open Graph, Twitter Cards, and hreflang
3. Security & Email Deliverability
- DNS/DMARC Security Checker — Audit SPF, DKIM, DMARC, and MX records
4. Historical Data & Recovery
- 📚 Wayback Machine Checker — Find archived snapshots for content recovery
Recommended workflow: Weekly schema validation → Fix errors/warnings → Validate metadata with Meta Tag Analyzer → Monitor with webhooks → Track rich snippet performance in Search Console.
Other Website Tools:
- Sitemap Analyzer — SEO sitemap audit
- Site Governance Monitor — Robots.txt and schema monitoring
- Domain Trust Monitor — SSL expiry and security headers
Cost
Pay Per Event:
- actor-start: $0.01 (flat fee per run)
- dataset-item: $0.003 per output item
Example: 1,000 items = $0.01 + (1,000 × $0.003) = $3.01
No subscription required — you only pay for what you use.
⭐ Was this helpful?
If this actor saved you time, please leave a ★ rating on Apify Store. It takes 10 seconds, helps other developers discover it, and keeps updates free.
Bug report or feature request? Open an issue on the Issues tab of this actor.