Schema.Org Json Ld Extractor

Pricing

from $0.50 / 1,000 URLs processed

Extract Schema.org JSON-LD structured data from any website. Fast, lightweight HTTP-based scraper that pulls all JSON-LD scripts - perfect for SEO analysis, product data extraction, and AI/RAG pipelines. No browser overhead.

Rating

0.0

(0)

Developer

Alam

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 3
  • Monthly active users: 2
  • Last modified: 20 days ago

Schema.org JSON-LD Extractor

Extract structured Schema.org (JSON-LD) data from websites via a simple HTTP request. No browser is needed: the actor fetches the page over HTTP and locates the JSON-LD blocks with regex parsing.

Features

  • 🚀 Fast — HTTP request only, no browser rendering overhead
  • 🎯 Universal — Works across thousands of sites using Schema.org
  • 📊 Structured — Returns clean JSON-LD ready for analysis
  • 🔍 Comprehensive — Extracts all Schema.org types (Product, Article, LocalBusiness, etc.)
  • 🔄 Proxy Support — Optional Apify Proxy for bypassing anti-bot measures

What It Does

Schema.org is a standard vocabulary for structured data on the web. Major sites (news publishers, e-commerce, Wikipedia, etc.) embed this data in <script type="application/ld+json"> tags for SEO purposes.

This actor extracts that structured data without needing complex HTML parsing or browser automation.
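The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration, not the actor's actual source: the regex, the function name `extract_json_ld`, and the shape of the error record are all assumptions made for the example.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script>, case-insensitively,
# allowing extra attributes and either quote style around the type value.
JSON_LD_RE = re.compile(
    r"<script\b[^>]*\btype\s*=\s*[\"']application/ld\+json[\"'][^>]*>(.*?)</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_json_ld(html: str) -> list:
    """Return one entry per JSON-LD block found in the HTML.

    Blocks that fail to parse become error records. The error shape used
    here is hypothetical; the actor's real format is not documented above.
    """
    results = []
    for match in JSON_LD_RE.finditer(html):
        raw = match.group(1).strip()
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError as exc:
            results.append({"error": str(exc), "raw": raw[:200]})
    return results
```

A regex is enough here because only the contents of one well-known tag are needed; a full HTML parser would be more robust against unusual markup, at the cost of extra dependencies.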

Use Cases

  • SEO Analysis — Audit competitors' structured data implementation
  • Price Comparison — Aggregate product data from multiple e-commerce sites
  • Lead Generation — Extract LocalBusiness data (contact info, addresses, business hours)
  • AI/RAG Development — Feed clean structured data to LLMs and knowledge bases
  • Content Platforms — Pull article metadata for aggregators and news feeds
  • Research — Analyze Schema.org adoption patterns across the web

Input

| Field | Type | Description | Default |
|---|---|---|---|
| startUrls | array | URLs to extract Schema.org data from | [{ "url": "https://example.com" }] |
| timeout | integer | Request timeout in seconds (1-60) | 10 |
| userAgent | string | Custom User-Agent header | Mozilla/5.0 (compatible; SchemaOrgExtractor/1.0; +https://github.com/) |
| proxy | object | Proxy configuration (Apify Proxy) | { "useApifyProxy": false } |
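When the input is generated from code rather than typed by hand, a small helper can enforce the documented constraints before a run is started. The sketch below is an assumption-laden illustration: `build_run_input` is a hypothetical name, and only the field names and the 1-60 second timeout range come from the input fields listed above.

```python
import json


def build_run_input(urls, timeout=10, user_agent=None, use_apify_proxy=False):
    """Build an input object using the documented field names."""
    if not urls:
        raise ValueError("at least one URL is required")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, int) or not 1 <= timeout <= 60:
        raise ValueError("timeout must be an integer between 1 and 60 seconds")
    run_input = {
        "startUrls": [{"url": url} for url in urls],
        "timeout": timeout,
        "proxy": {"useApifyProxy": use_apify_proxy},
    }
    # Leave userAgent out unless set, so the actor's default applies.
    if user_agent is not None:
        run_input["userAgent"] = user_agent
    return run_input


if __name__ == "__main__":
    print(json.dumps(build_run_input(["https://example.com"], timeout=15), indent=2))
```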

Output

Each URL produces one dataset item with the following structure:

{
  "#url": "https://example.com",
  "#schemaCount": 2,
  "#contexts": ["https://schema.org"],
  "#types": ["WebSite", "Organization"],
  "schemas": [
    {
      "@context": "https://schema.org",
      "@type": "WebSite",
      "url": "https://example.com",
      "name": "Example Website"
    },
    {
      "@context": "https://schema.org",
      "@type": "Organization",
      "name": "Example Inc.",
      "url": "https://example.com"
    }
  ]
}

Fields

  • #url — Source URL
  • #schemaCount — Number of JSON-LD blocks found
  • #contexts — Unique @context values found
  • #types — Unique @type values found
  • schemas — Array of parsed Schema.org objects (or error details if parsing failed)
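The summary fields can be derived from the parsed blocks alone. The following sketch shows one plausible way to compute them; the function name `summarize` is hypothetical, and how the actor itself treats nested structures such as `@graph` or top-level arrays is not stated above, so this version only inspects top-level objects.

```python
def summarize(url, schemas):
    """Build a dataset item with the documented summary fields."""
    contexts, types = [], []
    for schema in schemas:
        if not isinstance(schema, dict):
            continue
        context = schema.get("@context")
        if isinstance(context, str) and context not in contexts:
            contexts.append(context)
        raw_type = schema.get("@type")
        # In JSON-LD, @type may be a single string or a list of strings.
        if isinstance(raw_type, str):
            candidates = [raw_type]
        elif isinstance(raw_type, list):
            candidates = raw_type
        else:
            candidates = []
        for type_name in candidates:
            if isinstance(type_name, str) and type_name not in types:
                types.append(type_name)
    return {
        "#url": url,
        "#schemaCount": len(schemas),
        "#contexts": contexts,
        "#types": types,
        "schemas": schemas,
    }
```

Lists are used instead of sets so that the unique values keep the order in which they first appear on the page.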

Usage Examples

Basic Extraction

{
  "startUrls": [
    { "url": "https://www.nytimes.com/" }
  ]
}

Multiple URLs with Proxy

{
  "startUrls": [
    { "url": "https://www.bbc.com/news/technology" },
    { "url": "https://en.wikipedia.org/wiki/Artificial_intelligence" }
  ],
  "timeout": 15,
  "proxy": {
    "useApifyProxy": true
  }
}

Custom Timeout and User-Agent

{
  "startUrls": [
    { "url": "https://example.com/product/123" }
  ],
  "timeout": 30,
  "userAgent": "Mozilla/5.0 (compatible; MyBot/1.0)"
}

Schema.org Coverage

This actor works on sites that implement Schema.org JSON-LD. Adoption varies by industry:

  • News publishers: ~70% adoption (NYTimes, BBC, etc.)
  • E-commerce: ~50-60% adoption (Amazon, eBay, etc.)
  • Wikipedia: 100% adoption
  • Tech blogs: Medium-high adoption
  • Recipe sites: Very high adoption

Sites without Schema.org will return schemas: [] — this is expected behavior, not an error.

Limitations

  • Blocked Sites: Some sites block simple HTTP requests (403, timeouts). Enable proxy support if needed.
  • No Rendering: Static HTTP extraction only. The actor doesn't execute JavaScript, so JSON-LD that a page injects client-side is not captured.
  • Success Rate: ~30-60% coverage depending on industry and blocking countermeasures.
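To make the blocking behavior concrete, here is a rough sketch of a plain HTTP fetch that reports 403s and timeouts as error strings instead of raising. It uses only Python's standard library and is not the actor's implementation; the names `build_request` and `fetch_html` are hypothetical, and proxy handling is left out entirely.

```python
import urllib.error
import urllib.request

# Default taken from the input field list above.
DEFAULT_UA = "Mozilla/5.0 (compatible; SchemaOrgExtractor/1.0; +https://github.com/)"


def build_request(url, user_agent=DEFAULT_UA):
    """Create a GET request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10, user_agent=DEFAULT_UA):
    """Return (html, error); exactly one of the two is None."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace"), None
    except urllib.error.HTTPError as exc:
        # Blocked sites typically surface here, for example as HTTP 403.
        return None, f"HTTP {exc.code}"
    except OSError as exc:
        # URLError and socket timeouts are both OSError subclasses.
        return None, f"request failed: {exc}"
```

`HTTPError` is caught first because it is itself a subclass of `URLError`; reversing the order would hide the status code.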

Common Schema.org Types Extracted

  • Product — E-commerce products with prices, availability, SKUs
  • Article — News articles with authors, publish dates, headlines
  • LocalBusiness — Business contact info, addresses, opening hours
  • Organization — Company details, logos, social links
  • WebSite — Site metadata, search action URLs
  • BreadcrumbList — Navigation breadcrumbs
  • FAQPage — Frequently asked questions
  • Recipe — Recipe ingredients, cooking times, nutrition
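Downstream code usually cares about one or two of these types. The sketch below filters a dataset item by `@type`; `schemas_of_type` is a hypothetical helper, and the sample item is made-up data that follows the output structure shown earlier.

```python
def schemas_of_type(item, wanted):
    """Return the top-level schemas in a dataset item whose @type includes `wanted`."""
    found = []
    for schema in item.get("schemas", []):
        if not isinstance(schema, dict):
            continue
        raw_type = schema.get("@type")
        # @type may be a single string or a list of strings.
        types = raw_type if isinstance(raw_type, list) else [raw_type]
        if wanted in types:
            found.append(schema)
    return found


# Illustrative item only; values are invented for the example.
sample_item = {
    "#url": "https://example.com/product/123",
    "schemas": [
        {"@type": "BreadcrumbList", "itemListElement": []},
        {"@type": "Product", "name": "Widget", "offers": {"price": "9.99", "priceCurrency": "USD"}},
    ],
}

products = schemas_of_type(sample_item, "Product")
```

An item from a site without Schema.org has an empty `schemas` array, so the helper simply returns an empty list for it.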

Why This Actor?

vs. Custom HTML Scrapers

  • Universal — Same code works across thousands of sites
  • No Maintenance — Schema.org is a standard, not site-specific markup
  • Clean Data — Structured JSON-LD vs. messy HTML parsing

vs. Browser-Based Extractors

  • Fast — HTTP request only, no rendering overhead
  • Cheap — No browser memory/CPU costs
  • Reliable — No JavaScript execution issues

vs. Similar Actors

  • Simplicity — Clear output structure with summary fields
  • Proxy Support — Built-in Apify Proxy integration
  • Comprehensive — Extracts all schema types, not just specific ones

Credits

Built by Sync Computers

License

Apache 2.0