Structured Data Scraper & Validator
Crawl websites to extract JSON-LD and Microdata, validate schema markup syntax, and flag missing fields across massive URL lists.
Pricing
from $9.00 / 1,000 results
Developer: Taro Yamada
Actor stats: Bookmarked 0 · Total users 6 · Monthly active users 2 · Last modified 4 days ago
Structured Data Validator API | Schema Coverage & Errors
Crawling websites to extract structured data is essential for maintaining robust organic search visibility. Broken markup often goes unnoticed until traffic and rich snippets drop, costing you valuable clicks. This schema validation tool acts as a rigorous web scraper that parses every page to evaluate JSON-LD and Microdata quality at scale. Instead of manually checking individual URLs in external tools, you can schedule automated schema audits across thousands of pages to ensure semantic web standards are perfectly maintained.
Built for technical SEO and recurring compliance workflows, this scraper identifies missing schema fields, exposes invalid JSON syntax, and flags recommended Schema.org properties you might have overlooked. Every scraped page receives a clear 0-100 quality score, giving SEO and operations teams immediate, actionable insight into technical website health.
Easily schedule daily or weekly runs to monitor critical landing pages, product pages, or massive URL lists. By extracting the exact schema output mapped against search engine guidelines, you can rapidly export a comprehensive health report, catch syntax errors early, and secure your organic SERP real estate without manual intervention.
Store Quickstart
- Start with Quickstart (Dataset) to validate the score and error model on two public pages.
- For full audits, use Batch Validation for multi-page quality scoring.
- For recurring monitoring, use Webhook Alert to catch schema errors immediately.
Key Features
- 🔍 JSON-LD + Microdata extraction — Both formats supported
- 📊 Quality scoring — 0-100 with A-F grade per page
- ⚠️ Error detection — Missing `@type`, invalid JSON, missing `@context`
- 💡 Warnings — Sparse data, missing recommended properties
- 📋 Bulk processing — Check up to 200 URLs per run
- 🪝 Webhook delivery — Integrate into SEO monitoring workflows
Use Cases
| Who | Why |
|---|---|
| Developers | Automate recurring data fetches without building custom scrapers |
| Data teams | Pipe structured output into analytics warehouses |
| Ops teams | Monitor changes via webhook alerts |
| Product managers | Track competitor/market signals without engineering time |
Input
| Field | Type | Default | Description |
|---|---|---|---|
| urls | array | prefilled | List of page URLs to check for Schema.org structured data (JSON-LD, Microdata). Max 200. |
| delivery | string | "dataset" | How to deliver results. 'dataset' saves to Apify Dataset (recommended), 'webhook' sends to a URL. |
| webhookUrl | string | — | Webhook URL to send results to (only used when delivery is 'webhook'). Works with Slack, Discord, or any HTTP endpoint. |
| concurrency | integer | 3 | Maximum number of parallel requests. Higher = faster but may trigger rate limits. |
| dryRun | boolean | false | If true, runs without saving results or sending webhooks. Useful for testing. |
Input Example
{"urls": ["https://www.google.com", "https://github.com", "https://schema.org"],"concurrency": 3}
Output
| Field | Type | Description |
|---|---|---|
| meta | object | Run metadata |
| results | array | One entry per checked URL |
| results[].url | string (url) | The page URL that was checked |
| results[].jsonLd | array | Extracted JSON-LD blocks |
| results[].microdata | array | Extracted Microdata items |
| results[].errors | array | Detected schema errors |
| results[].warnings | array | Detected warnings |
| results[].score | object | Quality score: total (0-100) and grade (A-F) |
| results[].error | string \| null | Failure reason if the URL was unreachable, otherwise null |
| results[].checkedAt | timestamp | When the page was checked |
Output Example
{"url": "https://www.google.com","jsonLd": [{ "type": "WebSite", "context": "https://schema.org", "name": "Google", "_keyCount": 7 }],"microdata": [],"errors": [],"warnings": [],"score": { "total": 80, "grade": "A" }}
API Usage
Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.
cURL
```bash
curl -X POST "https://api.apify.com/v2/acts/taroyamada~structured-data-validator/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "urls": ["https://www.google.com", "https://github.com", "https://schema.org"], "concurrency": 3 }'
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("taroyamada/structured-data-validator").call(run_input={
    "urls": ["https://www.google.com", "https://github.com", "https://schema.org"],
    "concurrency": 3,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```
JavaScript / Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('taroyamada/structured-data-validator').call({
  urls: ['https://www.google.com', 'https://github.com', 'https://schema.org'],
  concurrency: 3,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Tips & Limitations
- Keep concurrency ≤ 5 when auditing production sites to avoid WAF rate-limit triggers.
- Use webhook delivery for recurring cron runs — push only deltas to downstream systems.
- Enable `dryRun` for cheap validation before committing to a paid cron schedule.
- Results are dataset-first; use the Apify API `run-sync-get-dataset-items` endpoint for instant JSON in CI pipelines (see the sketch below).
- Run a tiny URL count first, review the sample, then scale up — pay-per-event means you only pay for what you use.
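As a sketch of that CI pattern: the `run-sync-get-dataset-items` endpoint from the cURL example above, called from Node 18+ (global fetch, saved as `.mjs` for top-level await), failing the build when any page reports schema errors. The response is assumed to be the dataset items as a JSON array, matching the Output section:

```javascript
// ci-schema-gate.mjs — sketch of a CI gate; fails the build on schema errors.
// Assumes an APIFY_TOKEN environment variable is set.
const endpoint =
  'https://api.apify.com/v2/acts/taroyamada~structured-data-validator' +
  `/run-sync-get-dataset-items?token=${process.env.APIFY_TOKEN}`;

const res = await fetch(endpoint, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ urls: ['https://www.example.com'], concurrency: 3 }),
});
const items = await res.json();

const failing = items.filter((r) => (r.errors?.length ?? 0) > 0);
if (failing.length > 0) {
  for (const r of failing) console.error(`${r.url}: ${r.errors.length} error(s)`);
  process.exit(1);
}
console.log(`All ${items.length} pages passed schema validation.`);
```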
FAQ
Is there a rate limit?
Built-in concurrency throttling keeps requests polite. Most public sites tolerate 1–10 parallel requests from this actor without issues.
What happens when the input URL is unreachable?
The actor records an error row with the failure reason — successful URLs keep processing.
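The exact shape of such an error row isn't documented here; as a rough, hypothetical sketch based on the Output table above (field values invented for illustration):

```json
{
  "url": "https://unreachable.example.com",
  "jsonLd": [],
  "microdata": [],
  "errors": [],
  "warnings": [],
  "score": null,
  "error": "Request timed out",
  "checkedAt": "2025-01-01T00:00:00Z"
}
```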
Can I schedule recurring runs?
Yes — use Apify Schedules to run this actor on a cron (hourly, daily, weekly). Combine with webhook delivery for change alerts.
Does this actor respect robots.txt?
Yes — requests use a standard User-Agent and honor site rate limits. For aggressive audits, set a higher concurrency only on your own properties.
Can I integrate with Google Sheets or Airtable?
Use webhook delivery with a Zapier/Make/n8n catcher, or call the Apify REST API from Apps Script / Airtable automations.
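To inspect the webhook payload before wiring up Zapier or Sheets, you can run a minimal local catcher. A sketch using Node's built-in `http` module; expose it with a tunnel (e.g. ngrok) and pass that URL as `webhookUrl`. The payload shape is an assumption, so log it to confirm:

```javascript
import { createServer } from 'node:http';

// Minimal local catcher: logs whatever the actor POSTs in webhook mode.
createServer((req, res) => {
  let body = '';
  req.on('data', (chunk) => (body += chunk));
  req.on('end', () => {
    console.log('Webhook payload:', body); // inspect the real shape here
    res.writeHead(200);
    res.end('ok');
  });
}).listen(3000, () => console.log('Listening on http://localhost:3000'));
```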
Complete Your Website Health Audit
Website Health Suite — Build a comprehensive compliance and trust monitoring workflow:
1. Link & URL Health
- 🔗 Broken Link Checker — Find broken links across your entire site structure
- 🔗 Bulk URL Health Checker — Validate HTTP status, redirects, SSL, and response times
2. SEO & Metadata Quality (you are here)
- Schema.org Validator — Validate JSON-LD and Microdata with quality scoring
- 🏷️ Meta Tag Analyzer — Audit title tags, Open Graph, Twitter Cards, and hreflang
3. Security & Email Deliverability
- DNS/DMARC Security Checker — Audit SPF, DKIM, DMARC, and MX records
4. Historical Data & Recovery
- 📚 Wayback Machine Checker — Find archived snapshots for content recovery
Recommended workflow: Weekly schema validation → Fix errors/warnings → Validate metadata with Meta Tag Analyzer → Monitor with webhooks → Track rich snippet performance in Search Console.
Other Website Tools:
- Sitemap Analyzer — SEO sitemap audit
- Site Governance Monitor — Robots.txt and schema monitoring
- Domain Trust Monitor — SSL expiry and security headers
Cost
Pay Per Event:
- actor-start: $0.01 (flat fee per run)
- dataset-item: $0.003 per output item
Example: 1,000 items = $0.01 + (1,000 × $0.003) = $3.01
No subscription required — you only pay for what you use.
⭐ Was this helpful?
If this actor saved you time, please leave a ★ rating on Apify Store. It takes 10 seconds, helps other developers discover it, and keeps updates free.
Bug report or feature request? Open an issue on the Issues tab of this actor.