Structured Data Scraper & Validator

Crawl websites to extract JSON-LD and Microdata, validate schema markup syntax, and flag missing fields across massive URL lists.

Pricing: from $9.00 / 1,000 results
Developer: 太郎 山田 (Maintained by Community)

Structured Data Validator API | Schema Coverage & Errors

Crawling websites to extract structured data is essential for maintaining robust organic search visibility. Broken markup often goes unnoticed until traffic and rich snippets drop, costing you valuable clicks. This schema validation tool acts as a rigorous web scraper that parses every page to evaluate JSON-LD and Microdata quality at scale. Instead of manually checking individual URLs in external tools, you can schedule automated schema audits across thousands of pages and keep your markup consistently aligned with semantic web standards.

Built for technical SEO and recurring compliance workflows, this scraper identifies missing schema fields, catches invalid JSON syntax, and flags recommended Schema.org properties you might have overlooked. Every scraped page receives a clear 0-100 quality score, giving SEO and operations teams immediate, actionable insight into technical website health.

Easily schedule daily or weekly runs to monitor critical landing pages, product pages, or massive URL lists. By extracting the exact schema output mapped against search engine guidelines, you can rapidly export a comprehensive health report, catch syntax errors early, and secure your organic SERP real estate without manual intervention.

Store Quickstart

  • Start with Quickstart (Dataset) to validate the score and error model on two public pages.
  • For full audits, use Batch Validation for multi-page quality scoring.
  • For recurring monitoring, use Webhook Alert to catch schema errors immediately.

Key Features

  • 🔍 JSON-LD + Microdata extraction — Both formats supported
  • 📊 Quality scoring — 0-100 with A-F grade per page
  • ⚠️ Error detection — Missing @type, invalid JSON, missing @context
  • 💡 Warnings — Sparse data, missing recommended properties
  • 📋 Bulk processing — Check up to 200 URLs per run
  • 🪝 Webhook delivery — Integrate into SEO monitoring workflows
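
The scoring rubric itself is internal to the actor, but the error and warning classes above correspond to simple structural checks. A minimal sketch of what such validation looks like; the weights and the sparse-data threshold below are illustrative assumptions, not the actor's actual rubric:

```python
import json

def validate_json_ld(raw: str) -> dict:
    """Structural checks mirroring the error/warning classes above.

    The score weights and sparse-data threshold are illustrative
    assumptions, not the actor's actual rubric.
    """
    errors, warnings = [], []
    try:
        data = json.loads(raw)  # assume a single top-level object
    except json.JSONDecodeError as exc:
        return {"errors": [f"invalid JSON: {exc}"], "warnings": [], "score": 0}
    if "@context" not in data:
        errors.append("missing @context")
    if "@type" not in data:
        errors.append("missing @type")
    if len(data) < 4:
        warnings.append("sparse data: consider adding recommended properties")
    score = max(0, 100 - 40 * len(errors) - 10 * len(warnings))
    return {"errors": errors, "warnings": warnings, "score": score}

print(validate_json_ld('{"@context": "https://schema.org", "@type": "WebSite", "name": "Google"}'))
```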

Use Cases

| Who | Why |
| --- | --- |
| Developers | Automate recurring data fetches without building custom scrapers |
| Data teams | Pipe structured output into analytics warehouses |
| Ops teams | Monitor changes via webhook alerts |
| Product managers | Track competitor/market signals without engineering time |

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | array | prefilled | List of page URLs to check for Schema.org structured data (JSON-LD, Microdata). Max 200. |
| delivery | string | "dataset" | How to deliver results: 'dataset' saves to an Apify Dataset (recommended), 'webhook' sends to a URL. |
| webhookUrl | string | | Webhook URL to send results to (only used when delivery is 'webhook'). Works with Slack, Discord, or any HTTP endpoint. |
| concurrency | integer | 3 | Maximum number of parallel requests. Higher = faster but may trigger rate limits. |
| dryRun | boolean | false | If true, runs without saving results or sending webhooks. Useful for testing. |

Input Example

```json
{
  "urls": ["https://www.google.com", "https://github.com", "https://schema.org"],
  "concurrency": 3
}
```
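
The delivery-related fields combine the same way. A hypothetical webhook run (the endpoint URL is a placeholder); note that dryRun suppresses both dataset writes and webhook calls, so disable it once the test looks right:

```json
{
  "urls": ["https://www.example.com/products/1"],
  "delivery": "webhook",
  "webhookUrl": "https://hooks.example.com/schema-audit",
  "concurrency": 3,
  "dryRun": true
}
```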

Output

| Field | Type | Description |
| --- | --- | --- |
| meta | object | Run metadata |
| results | array | One entry per checked URL |
| results[].url | string (url) | The page that was checked |
| results[].jsonLd | array | Extracted JSON-LD blocks |
| results[].microdata | array | Extracted Microdata items |
| results[].errors | array | Detected errors (missing @type, invalid JSON, missing @context) |
| results[].warnings | array | Warnings (sparse data, missing recommended properties) |
| results[].score | object | Quality score: 0-100 total with A-F grade |
| results[].error | string \| null | Failure reason if the page could not be fetched, otherwise null |
| results[].checkedAt | timestamp | When the page was checked |

Output Example

```json
{
  "url": "https://www.google.com",
  "jsonLd": [
    { "type": "WebSite", "context": "https://schema.org", "name": "Google", "_keyCount": 7 }
  ],
  "microdata": [],
  "errors": [],
  "warnings": [],
  "score": { "total": 80, "grade": "A" }
}
```

API Usage

Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.

cURL

```bash
curl -X POST "https://api.apify.com/v2/acts/taroyamada~structured-data-validator/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "urls": ["https://www.google.com", "https://github.com", "https://schema.org"], "concurrency": 3 }'
```

Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("taroyamada/structured-data-validator").call(run_input={
    "urls": ["https://www.google.com", "https://github.com", "https://schema.org"],
    "concurrency": 3,
})
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```

JavaScript / Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });
const run = await client.actor('taroyamada/structured-data-validator').call({
    urls: ['https://www.google.com', 'https://github.com', 'https://schema.org'],
    concurrency: 3,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
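
Once a run finishes, dataset items can be filtered client-side to surface only pages that need attention. A short Python sketch, assuming the score/grade shape shown in the Output Example; the D/F threshold is a policy choice, not part of the actor:

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("taroyamada/structured-data-validator").call(run_input={
    "urls": ["https://www.example.com", "https://www.example.com/pricing"],
})

# Keep pages that failed outright or scored poorly (assumed threshold).
failing = []
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    grade = (item.get("score") or {}).get("grade", "F")
    if item.get("error") or grade in ("D", "F"):
        failing.append(item)

for page in failing:
    print(page["url"], page.get("errors"), page.get("warnings"))
```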

Tips & Limitations

  • Keep concurrency ≤ 5 when auditing production sites to avoid WAF rate-limit triggers.
  • Use webhook delivery for recurring cron runs — push only deltas to downstream systems.
  • Enable dryRun for cheap validation before committing to a paid cron schedule.
  • Results are dataset-first; use Apify API run-sync-get-dataset-items for instant JSON in CI pipelines.
  • Run a tiny URL count first, review the sample, then scale up — pay-per-event means you only pay for what you use.
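
To make the CI tip above concrete, here is a minimal sketch of a pipeline gate using run-sync-get-dataset-items; the failure policy (any page with schema errors fails the build) is an assumption to adapt to your pipeline:

```python
import sys

import requests

ACTOR = "taroyamada~structured-data-validator"
API = f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items"

resp = requests.post(
    API,
    params={"token": "YOUR_API_TOKEN"},
    json={"urls": ["https://www.example.com"], "concurrency": 2},
    timeout=300,
)
resp.raise_for_status()

# Fail the build if any page reports schema errors (assumed policy).
broken = [item["url"] for item in resp.json() if item.get("errors")]
if broken:
    print("Schema errors on:", ", ".join(broken))
    sys.exit(1)
print("All pages passed schema validation.")
```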

FAQ

Is there a rate limit?

Built-in concurrency throttling keeps requests polite. Most public sites tolerate 1–10 parallel requests without issues.

What happens when the input URL is unreachable?

The actor records an error row with the failure reason — successful URLs keep processing.

Can I schedule recurring runs?

Yes — use Apify Schedules to run this actor on a cron (hourly, daily, weekly). Combine with webhook delivery for change alerts.

Does this actor respect robots.txt?

Yes — requests use a standard User-Agent and honor site rate limits. For aggressive audits, set a higher concurrency only on your own properties.

Can I integrate with Google Sheets or Airtable?

Use webhook delivery with a Zapier/Make/n8n catcher, or call the Apify REST API from Apps Script / Airtable automations.
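
If you prefer a custom catcher over Zapier/Make/n8n, any small HTTP endpoint will do. A minimal Flask sketch; the payload shape here is an assumption based on the Output Example above:

```python
from flask import Flask, request

app = Flask(__name__)

@app.post("/schema-audit")
def schema_audit():
    # Assumed payload: the result objects shown in the Output Example.
    payload = request.get_json(force=True)
    results = payload if isinstance(payload, list) else payload.get("results", [])
    for page in results:
        if page.get("errors"):
            print(f"ALERT {page['url']}: {page['errors']}")
    return {"ok": True}

if __name__ == "__main__":
    app.run(port=8000)
```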

Complete Your Website Health Audit

Website Health Suite — Build a comprehensive compliance and trust monitoring workflow:

1. Link & URL Health

2. SEO & Metadata Quality (you are here)

3. Security & Email Deliverability

4. Historical Data & Recovery

Recommended workflow: Weekly schema validation → Fix errors/warnings → Validate metadata with Meta Tag Analyzer → Monitor with webhooks → Track rich snippet performance in Search Console.

Cost

Pay Per Event:

  • actor-start: $0.01 (flat fee per run)
  • dataset-item: $0.003 per output item

Example: 1,000 items = $0.01 + (1,000 × $0.003) = $3.01

No subscription required — you only pay for what you use.

⭐ Was this helpful?

If this actor saved you time, please leave a ★ rating on Apify Store. It takes 10 seconds, helps other developers discover it, and keeps updates free.

Bug report or feature request? Open an issue on the Issues tab of this actor.