Pricing

from $0.50 / 1,000 URLs processed

Schema.org JSON-LD Extractor

Extract Schema.org JSON-LD structured data from any website. Fast, lightweight HTTP-based scraper that pulls all JSON-LD scripts - perfect for SEO analysis, product data extraction, and AI/RAG pipelines. No browser overhead.

Developer

Alam

Last modified

20 days ago

Schema.org JSON-LD Extractor

Extract structured Schema.org (JSON-LD) data from websites via a simple HTTP request. No browser is needed: the actor fetches the raw HTML and pulls out the JSON-LD blocks with regex parsing.

Features

🚀 Fast — HTTP request only, no browser rendering overhead
🎯 Universal — Works across thousands of sites using Schema.org
📊 Structured — Returns clean JSON-LD ready for analysis
🔍 Comprehensive — Extracts all Schema.org types (Product, Article, LocalBusiness, etc.)
🔄 Proxy Support — Optional Apify Proxy for bypassing anti-bot measures

What It Does

Schema.org is a standard vocabulary for structured data on the web. Major sites (news publishers, e-commerce, Wikipedia, etc.) embed this data in <script type="application/ld+json"> tags for SEO purposes.

This actor extracts that structured data without needing complex HTML parsing or browser automation.
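
The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration, not the actor's actual source: the function name `extract_json_ld`, the regular expression, and the shape of the error record are all assumptions made for the example.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script>, case-insensitively,
# tolerating extra attributes and either quote style. (Assumed pattern.)
JSON_LD_RE = re.compile(
    r'<script\b[^>]*\btype\s*=\s*["\']application/ld\+json["\'][^>]*>(.*?)</script\s*>',
    re.IGNORECASE | re.DOTALL,
)


def extract_json_ld(html: str) -> list:
    """Return one entry per JSON-LD block: the parsed data, or an error record."""
    results = []
    for raw in JSON_LD_RE.findall(html):
        text = raw.strip()
        try:
            results.append(json.loads(text))
        except json.JSONDecodeError as exc:
            # Hypothetical error shape; the actor's real format is not documented here.
            results.append({"error": str(exc), "raw": text[:200]})
    return results


sample = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "WebSite", "name": "Example Website"}
</script>
<script type="application/ld+json">{ not valid json }</script>
</head></html>
"""

blocks = extract_json_ld(sample)
for block in blocks:
    print(block)
```

Because the pattern only looks at the HTML returned by the server, JSON-LD that a page injects with client-side JavaScript will not be found by this approach.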

Use Cases

SEO Analysis — Audit competitors' structured data implementation
Price Comparison — Aggregate product data from multiple e-commerce sites
Lead Generation — Extract LocalBusiness data (contact info, addresses, business hours)
AI/RAG Development — Feed clean structured data to LLMs and knowledge bases
Content Platforms — Pull article metadata for aggregators and news feeds
Research — Analyze Schema.org adoption patterns across the web

Input

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| `startUrls` | array | URLs to extract Schema.org data from | `[{ "url": "https://example.com" }]` |
| `timeout` | integer | Request timeout in seconds (1-60) | `10` |
| `userAgent` | string | Custom User-Agent header | `Mozilla/5.0 (compatible; SchemaOrgExtractor/1.0; +https://github.com/)` |
| `proxy` | object | Proxy configuration (Apify Proxy) | `{ "useApifyProxy": false }` |
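
If you build the input programmatically, it can help to apply the documented defaults and check the `timeout` range before starting a run. The sketch below assumes nothing beyond the table above; `DEFAULTS` and `normalize_input` are hypothetical helper names, and the actor itself may validate input differently.

```python
# Defaults copied from the input table; the helper names are illustrative only.
DEFAULTS = {
    "startUrls": [{"url": "https://example.com"}],
    "timeout": 10,
    "userAgent": "Mozilla/5.0 (compatible; SchemaOrgExtractor/1.0; +https://github.com/)",
    "proxy": {"useApifyProxy": False},
}


def normalize_input(raw: dict) -> dict:
    """Merge user input over the documented defaults and validate it."""
    merged = {**DEFAULTS, **raw}

    timeout = merged["timeout"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, int) or not 1 <= timeout <= 60:
        raise ValueError("timeout must be an integer between 1 and 60 seconds")

    for entry in merged["startUrls"]:
        if not isinstance(entry, dict) or not str(entry.get("url", "")).startswith(("http://", "https://")):
            raise ValueError(f"invalid startUrls entry: {entry!r}")

    return merged


config = normalize_input({
    "startUrls": [{"url": "https://example.com/product/123"}],
    "timeout": 30,
})
print(config["timeout"], config["proxy"])
```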

Output

Each URL produces one dataset item with the following structure:

{
  "#url": "https://example.com",
  "#schemaCount": 2,
  "#contexts": ["https://schema.org"],
  "#types": ["WebSite", "Organization"],
  "schemas": [
    {
      "@context": "https://schema.org",
      "@type": "WebSite",
      "url": "https://example.com",
      "name": "Example Website"
    },
    {
      "@context": "https://schema.org",
      "@type": "Organization",
      "name": "Example Inc.",
      "url": "https://example.com"
    }
  ]
}

Fields

#url — Source URL
#schemaCount — Number of JSON-LD blocks found
#contexts — Unique @context values found
#types — Unique @type values found
schemas — Array of parsed Schema.org objects (or error details if parsing failed)
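
The summary fields can be derived from the `schemas` array alone. The following sketch shows one way to compute them; `summarize` is a hypothetical helper, and the actor's real handling of edge cases (list-valued `@type`, nested `@graph` nodes, error entries) is not documented here, so treat those details as assumptions.

```python
def summarize(url: str, schemas: list) -> dict:
    """Build a dataset item with the '#'-prefixed summary fields."""
    contexts, types = [], []
    for schema in schemas:
        if not isinstance(schema, dict):
            continue  # e.g. a top-level JSON array; ignored in this sketch
        for key, bucket in (("@context", contexts), ("@type", types)):
            value = schema.get(key)
            # @type (and sometimes @context) may be a string or a list.
            values = value if isinstance(value, list) else [value]
            for v in values:
                # Keep unique string values, in first-seen order.
                if isinstance(v, str) and v not in bucket:
                    bucket.append(v)
    return {
        "#url": url,
        "#schemaCount": len(schemas),
        "#contexts": contexts,
        "#types": types,
        "schemas": schemas,
    }


item = summarize(
    "https://example.com",
    [
        {"@context": "https://schema.org", "@type": "WebSite", "name": "Example Website"},
        {"@context": "https://schema.org", "@type": "Organization", "name": "Example Inc."},
    ],
)
print(item["#schemaCount"], item["#contexts"], item["#types"])
```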

Usage Examples

Basic Extraction

{
  "startUrls": [
    { "url": "https://www.nytimes.com/" }
  ]
}

Multiple URLs with Proxy

{
  "startUrls": [
    { "url": "https://www.bbc.com/news/technology" },
    { "url": "https://en.wikipedia.org/wiki/Artificial_intelligence" }
  ],
  "timeout": 15,
  "proxy": {
    "useApifyProxy": true
  }
}

Custom Timeout and User-Agent

{
  "startUrls": [
    { "url": "https://example.com/product/123" }
  ],
  "timeout": 30,
  "userAgent": "Mozilla/5.0 (compatible; MyBot/1.0)"
}

Schema.org Coverage

This actor works on sites that implement Schema.org JSON-LD. Adoption varies by industry:

News publishers: ~70% adoption (NYTimes, BBC, etc.)
E-commerce: ~50-60% adoption (Amazon, eBay, etc.)
Wikipedia: 100% adoption
Tech blogs: Medium-high adoption
Recipe sites: Very high adoption

Sites without Schema.org will return schemas: [] — this is expected behavior, not an error.

Limitations

Blocked Sites: Some sites block simple HTTP requests (403, timeouts). Enable proxy support if needed.
No Rendering: Static HTTP extraction only — doesn't execute JavaScript.
Success Rate: ~30-60% coverage depending on industry and blocking countermeasures.

Common Schema.org Types Extracted

Product — E-commerce products with prices, availability, SKUs
Article — News articles with authors, publish dates, headlines
LocalBusiness — Business contact info, addresses, opening hours
Organization — Company details, logos, social links
WebSite — Site metadata, search action URLs
BreadcrumbList — Navigation breadcrumbs
FAQPage — Frequently asked questions
Recipe — Recipe ingredients, cooking times, nutrition
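
Downstream code usually wants only one of these types. A small post-processing sketch for a dataset item follows; it assumes the output structure shown earlier, and `iter_nodes` and `find_by_type` are hypothetical helpers, not part of the actor. It walks nested objects so that nodes inside `@graph` or properties such as `offers` are found as well.

```python
def iter_nodes(node):
    """Yield every dict nested anywhere inside a JSON-LD value."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from iter_nodes(value)
    elif isinstance(node, list):
        for element in node:
            yield from iter_nodes(element)


def find_by_type(item: dict, wanted: str) -> list:
    """Return all nodes in a dataset item whose @type includes `wanted`."""
    matches = []
    for node in iter_nodes(item.get("schemas", [])):
        node_type = node.get("@type")
        types = node_type if isinstance(node_type, list) else [node_type]
        if wanted in types:
            matches.append(node)
    return matches


# Illustrative dataset item; the product data is made up for the example.
item = {
    "#url": "https://example.com/product/123",
    "schemas": [
        {
            "@context": "https://schema.org",
            "@type": "Product",
            "name": "Example Widget",
            "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"},
        },
        {"@context": "https://schema.org", "@type": "BreadcrumbList", "itemListElement": []},
    ],
}

products = find_by_type(item, "Product")
offers = find_by_type(item, "Offer")
print(products[0]["name"], offers[0]["price"])
```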

Why This Actor?

vs. Custom HTML Scrapers

Universal — Same code works across thousands of sites
No Maintenance — Schema.org is a standard, not site-specific markup
Clean Data — Structured JSON-LD vs. messy HTML parsing

vs. Browser-Based Extractors

Fast — HTTP request only, no rendering overhead
Cheap — No browser memory/CPU costs
Reliable — No JavaScript execution issues

vs. Similar Actors

Simplicity — Clear output structure with summary fields
Proxy Support — Built-in Apify Proxy integration
Comprehensive — Extracts all schema types, not just specific ones

Credits

Built by Sync Computers

License

Apache 2.0
