Pricing

From $0.50 per 1,000 URLs processed

Schema.org JSON-LD Extractor

Extract Schema.org JSON-LD structured data from any website. A fast, lightweight HTTP-based scraper that pulls every JSON-LD script from a page, suited to SEO analysis, product data extraction, and AI/RAG pipelines. No browser overhead.

Developer

Alam

Schema.org JSON-LD Extractor

Extract structured Schema.org (JSON-LD) data from websites with a simple HTTP request. No browser is needed: the actor fetches the page over HTTP and locates the JSON-LD blocks with regex parsing.

Features

🚀 Fast — HTTP request only, no browser rendering overhead
🎯 Universal — Works across thousands of sites using Schema.org
📊 Structured — Returns clean JSON-LD ready for analysis
🔍 Comprehensive — Extracts all Schema.org types (Product, Article, LocalBusiness, etc.)
🔄 Proxy Support — Optional Apify Proxy for bypassing anti-bot measures

What It Does

Schema.org is a standard vocabulary for structured data on the web. Major sites (news publishers, e-commerce, Wikipedia, etc.) embed this data in <script type="application/ld+json"> tags for SEO purposes.

This actor extracts that structured data without needing complex HTML parsing or browser automation.
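The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration, not the actor's actual source: the pattern `JSON_LD_RE`, the function `extract_json_ld`, and the shape of the error entry are all assumptions made for the example.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script>, tolerating extra
# attributes, either quote style, and mixed case. Assumed pattern, not the
# actor's real one.
JSON_LD_RE = re.compile(
    r'<script\b[^>]*\btype\s*=\s*["\']application/ld\+json["\'][^>]*>(.*?)</script\s*>',
    re.IGNORECASE | re.DOTALL,
)


def extract_json_ld(html: str) -> list:
    """Return one entry per JSON-LD block; record parse failures instead of raising."""
    results = []
    for match in JSON_LD_RE.finditer(html):
        raw = match.group(1).strip()
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError as exc:
            # Hypothetical error shape; the actor's real error format is not documented here.
            results.append({"error": str(exc), "raw": raw[:200]})
    return results


sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "WebSite", "name": "Example Website"}
</script>
<script type='application/ld+json'>{not valid json}</script>
</head></html>
"""

blocks = extract_json_ld(sample_html)
print(len(blocks))          # one entry per script tag, valid or not
print(blocks[0]["@type"])   # the first block parses cleanly
```

Keeping a failed block as an error entry rather than dropping it mirrors the documented behavior of the `schemas` field, which may hold error details when parsing fails.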

Use Cases

SEO Analysis — Audit competitors' structured data implementation
Price Comparison — Aggregate product data from multiple e-commerce sites
Lead Generation — Extract LocalBusiness data (contact info, addresses, business hours)
AI/RAG Development — Feed clean structured data to LLMs and knowledge bases
Content Platforms — Pull article metadata for aggregators and news feeds
Research — Analyze Schema.org adoption patterns across the web

Input

Field	Type	Description	Default
`startUrls`	array	URLs to extract Schema.org data from	`[{ "url": "https://example.com" }]`
`timeout`	integer	Request timeout in seconds (1-60)	`10`
`userAgent`	string	Custom User-Agent header	`Mozilla/5.0 (compatible; SchemaOrgExtractor/1.0; +https://github.com/)`
`proxy`	object	Proxy configuration (Apify Proxy)	`{ "useApifyProxy": false }`
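As a rough sketch of how these fields and defaults fit together, the snippet below fills in the documented defaults and checks the documented timeout range. The helper `normalize_input` and its validation rules are hypothetical; the actor may validate its input differently.

```python
# Default taken from the input table above.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; SchemaOrgExtractor/1.0; +https://github.com/)"


def normalize_input(raw: dict) -> dict:
    """Apply the documented defaults and check the documented timeout range (1-60)."""
    start_urls = raw.get("startUrls") or [{"url": "https://example.com"}]
    timeout = raw.get("timeout", 10)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, int) or not 1 <= timeout <= 60:
        raise ValueError("timeout must be an integer between 1 and 60 seconds")
    for entry in start_urls:
        url = entry.get("url", "") if isinstance(entry, dict) else ""
        if not str(url).startswith(("http://", "https://")):
            raise ValueError(f"invalid startUrls entry: {entry!r}")
    return {
        "startUrls": start_urls,
        "timeout": timeout,
        "userAgent": raw.get("userAgent") or DEFAULT_USER_AGENT,
        "proxy": raw.get("proxy") or {"useApifyProxy": False},
    }


config = normalize_input({"startUrls": [{"url": "https://example.com/product/123"}], "timeout": 30})
print(config["timeout"], config["proxy"])
```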

Output

Each URL produces one dataset item with the following structure:

{
  "#url": "https://example.com",
  "#schemaCount": 2,
  "#contexts": ["https://schema.org"],
  "#types": ["WebSite", "Organization"],
  "schemas": [
    {
      "@context": "https://schema.org",
      "@type": "WebSite",
      "url": "https://example.com",
      "name": "Example Website"
    },
    {
      "@context": "https://schema.org",
      "@type": "Organization",
      "name": "Example Inc.",
      "url": "https://example.com"
    }
  ]
}

Fields

#url — Source URL
#schemaCount — Number of JSON-LD blocks found
#contexts — Unique @context values found
#types — Unique @type values found
schemas — Array of parsed Schema.org objects (or error details if parsing failed)
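To make the relationship between the summary fields and `schemas` concrete, here is one way the `#`-prefixed fields could be derived from a list of parsed blocks. The function `summarize` is a hypothetical helper; whether the actor also looks inside nested structures such as `@graph` is not stated, so this sketch only reads top-level keys.

```python
def summarize(url: str, schemas: list) -> dict:
    """Build the summary fields from parsed JSON-LD blocks, keeping first-seen order."""
    contexts, types = [], []
    for schema in schemas:
        if not isinstance(schema, dict):
            continue
        for key, bucket in (("@context", contexts), ("@type", types)):
            value = schema.get(key)
            # @context and @type may each be a single string or a list.
            values = value if isinstance(value, list) else [value]
            for item in values:
                if isinstance(item, str) and item not in bucket:
                    bucket.append(item)
    return {
        "#url": url,
        "#schemaCount": len(schemas),
        "#contexts": contexts,
        "#types": types,
        "schemas": schemas,
    }


item = summarize(
    "https://example.com",
    [
        {"@context": "https://schema.org", "@type": "WebSite", "name": "Example Website"},
        {"@context": "https://schema.org", "@type": "Organization", "name": "Example Inc."},
    ],
)
print(item["#schemaCount"], item["#contexts"], item["#types"])
```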

Usage Examples

Basic Extraction

{
  "startUrls": [
    { "url": "https://www.nytimes.com/" }
  ]
}

Multiple URLs with Proxy

{
  "startUrls": [
    { "url": "https://www.bbc.com/news/technology" },
    { "url": "https://en.wikipedia.org/wiki/Artificial_intelligence" }
  ],
  "timeout": 15,
  "proxy": {
    "useApifyProxy": true
  }
}

Custom Timeout and User-Agent

{
  "startUrls": [
    { "url": "https://example.com/product/123" }
  ],
  "timeout": 30,
  "userAgent": "Mozilla/5.0 (compatible; MyBot/1.0)"
}

Schema.org Coverage

This actor works on sites that implement Schema.org JSON-LD. Adoption varies by industry:

News publishers: ~70% adoption (NYTimes, BBC, etc.)
E-commerce: ~50-60% adoption (Amazon, eBay, etc.)
Wikipedia: 100% adoption
Tech blogs: Medium-high adoption
Recipe sites: Very high adoption

Sites without Schema.org will return schemas: [] — this is expected behavior, not an error.

Limitations

Blocked Sites: Some sites block simple HTTP requests (403, timeouts). Enable proxy support if needed.
No Rendering: Static HTTP extraction only. The actor does not execute JavaScript, so JSON-LD injected client-side is not captured.
Success Rate: Expect usable JSON-LD from roughly 30-60% of URLs, depending on industry and blocking countermeasures.
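To see where a given URL list falls in that range, coverage can be measured from the dataset items after a run. The sketch below assumes every input URL yields one item with a `#schemaCount` field, as described in the Output section; the `coverage` helper and the sample items are illustrative only.

```python
def coverage(items: list) -> float:
    """Fraction of dataset items with at least one JSON-LD block (0.0 for an empty list)."""
    if not items:
        return 0.0
    hits = sum(1 for item in items if item.get("#schemaCount", 0) > 0)
    return hits / len(items)


items = [
    {"#url": "https://example.com", "#schemaCount": 2, "schemas": [{}, {}]},
    {"#url": "https://example.org", "#schemaCount": 0, "schemas": []},
]
print(f"{coverage(items):.0%}")  # prints 50%
```

A low figure on a first pass may point to blocking rather than missing markup, so it can be worth rerunning the empty URLs with the proxy enabled before drawing conclusions.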

Common Schema.org Types Extracted

Product — E-commerce products with prices, availability, SKUs
Article — News articles with authors, publish dates, headlines
LocalBusiness — Business contact info, addresses, opening hours
Organization — Company details, logos, social links
WebSite — Site metadata, search action URLs
BreadcrumbList — Navigation breadcrumbs
FAQPage — Frequently asked questions
Recipe — Recipe ingredients, cooking times, nutrition
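When consuming the output, a common task is pulling out every object of one type. The sketch below filters a dataset item by `@type`, handling two JSON-LD conventions: `@type` given as a list, and nodes grouped under `@graph`. The helpers `iter_nodes` and `find_by_type` are hypothetical and operate on the output structure shown earlier.

```python
def iter_nodes(schema):
    """Yield a schema object plus any nodes nested under its @graph."""
    if isinstance(schema, list):
        for entry in schema:
            yield from iter_nodes(entry)
    elif isinstance(schema, dict):
        yield schema
        # @graph may be absent, a list of nodes, or a single node.
        yield from iter_nodes(schema.get("@graph"))


def find_by_type(item: dict, wanted: str) -> list:
    """Return every node in item["schemas"] whose @type is or includes `wanted`."""
    matches = []
    for schema in item.get("schemas", []):
        for node in iter_nodes(schema):
            node_type = node.get("@type")
            types = node_type if isinstance(node_type, list) else [node_type]
            if wanted in types:
                matches.append(node)
    return matches


dataset_item = {
    "#url": "https://example.com/product/123",
    "schemas": [
        {"@context": "https://schema.org", "@type": "Product", "name": "Example Product"},
        {
            "@context": "https://schema.org",
            "@graph": [
                {"@type": ["Article", "NewsArticle"], "headline": "Example headline"},
                {"@type": "Organization", "name": "Example Inc."},
            ],
        },
    ],
}
products = find_by_type(dataset_item, "Product")
articles = find_by_type(dataset_item, "Article")
print(len(products), len(articles))
```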

Why This Actor?

vs. Custom HTML Scrapers

Universal — Same code works across thousands of sites
No Maintenance — Schema.org is a standard, not site-specific markup
Clean Data — Structured JSON-LD vs. messy HTML parsing

vs. Browser-Based Extractors

Fast — HTTP request only, no rendering overhead
Cheap — No browser memory/CPU costs
Reliable — No JavaScript execution issues

vs. Similar Actors

Simplicity — Clear output structure with summary fields
Proxy Support — Built-in Apify Proxy integration
Comprehensive — Extracts all schema types, not just specific ones

Credits

Built by Sync Computers

License

Apache 2.0
