Schema.Org Json Ld Extractor

Pricing

from $0.50 / 1,000 URLs processed

Extract Schema.org JSON-LD structured data from any website. Fast, lightweight HTTP-based scraper that pulls all JSON-LD scripts - perfect for SEO analysis, product data extraction, and AI/RAG pipelines. No browser overhead.

Developer

Alam

Maintained by Community

Schema.org JSON-LD Extractor

Extract structured Schema.org (JSON-LD) data from websites via a simple HTTP request. No browser needed: pure HTTP extraction with regex parsing.

Features

  • 🚀 Fast: HTTP request only, no browser rendering overhead
  • 🎯 Universal: Works across thousands of sites using Schema.org
  • 📊 Structured: Returns clean JSON-LD ready for analysis
  • 🔍 Comprehensive: Extracts all Schema.org types (Product, Article, LocalBusiness, etc.)
  • 🔄 Proxy Support: Optional Apify Proxy for bypassing anti-bot measures

What It Does

Schema.org is a standard vocabulary for structured data on the web. Major sites (news publishers, e-commerce, Wikipedia, etc.) embed this data in <script type="application/ld+json"> tags for SEO purposes.

This actor extracts that structured data without needing complex HTML parsing or browser automation.
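
As a rough illustration of the regex-based approach described above, the following is a minimal sketch and not the actor's actual source. The names `JSON_LD_RE` and `extract_json_ld`, and the shape of the error entry, are assumptions made for this example.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script>, tolerating extra
# attributes, either quote style, and any letter case.
JSON_LD_RE = re.compile(
    r'<script\b[^>]*\btype\s*=\s*["\']application/ld\+json["\'][^>]*>(.*?)</script\s*>',
    re.IGNORECASE | re.DOTALL,
)


def extract_json_ld(html: str) -> list:
    """Return one entry per JSON-LD script block found in `html`."""
    results = []
    for match in JSON_LD_RE.finditer(html):
        raw = match.group(1).strip()
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError as exc:
            # Hypothetical error shape: keep a record of the failure
            # instead of silently dropping the block.
            results.append({"error": str(exc), "raw": raw[:200]})
    return results
```

A page with no matching script tags simply yields an empty list, which lines up with the `schemas: []` behavior described later in this document.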

Use Cases

  • SEO Analysis: Audit competitors' structured data implementation
  • Price Comparison: Aggregate product data from multiple e-commerce sites
  • Lead Generation: Extract LocalBusiness data (contact info, addresses, business hours)
  • AI/RAG Development: Feed clean structured data to LLMs and knowledge bases
  • Content Platforms: Pull article metadata for aggregators and news feeds
  • Research: Analyze Schema.org adoption patterns across the web

Input

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| startUrls | array | URLs to extract Schema.org data from | [{ "url": "https://example.com" }] |
| timeout | integer | Request timeout in seconds (1-60) | 10 |
| userAgent | string | Custom User-Agent header | Mozilla/5.0 (compatible; SchemaOrgExtractor/1.0; +https://github.com/) |
| proxy | object | Proxy configuration (Apify Proxy) | { "useApifyProxy": false } |
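
The input fields above can be assembled programmatically. The helper below is a hypothetical sketch (`build_run_input` is not part of the actor) that fills in the documented defaults and checks the documented 1-60 second timeout range before a run is started.

```python
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (compatible; SchemaOrgExtractor/1.0; +https://github.com/)"
)


def build_run_input(urls, timeout=10, user_agent=DEFAULT_USER_AGENT,
                    use_apify_proxy=False):
    """Build an input object matching the field table above."""
    # The table documents timeout as an integer between 1 and 60 seconds.
    if not 1 <= timeout <= 60:
        raise ValueError("timeout must be between 1 and 60 seconds")
    return {
        "startUrls": [{"url": u} for u in urls],
        "timeout": timeout,
        "userAgent": user_agent,
        "proxy": {"useApifyProxy": use_apify_proxy},
    }
```

With the `apify-client` Python package, a dict like this would typically be passed as `run_input` when calling an actor. The actor ID is not stated in this document, so it is left out here.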

Output

Each URL produces one dataset item with the following structure:

{
  "#url": "https://example.com",
  "#schemaCount": 2,
  "#contexts": ["https://schema.org"],
  "#types": ["WebSite", "Organization"],
  "schemas": [
    {
      "@context": "https://schema.org",
      "@type": "WebSite",
      "url": "https://example.com",
      "name": "Example Website"
    },
    {
      "@context": "https://schema.org",
      "@type": "Organization",
      "name": "Example Inc.",
      "url": "https://example.com"
    }
  ]
}

Fields

  • #url: Source URL
  • #schemaCount: Number of JSON-LD blocks found
  • #contexts: Unique @context values found
  • #types: Unique @type values found
  • schemas: Array of parsed Schema.org objects (or error details if parsing failed)
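
To make the relationship between the summary fields and `schemas` concrete, here is a minimal sketch of how such a record could be derived. It is an assumption about the logic, not the actor's code: `summarize` and `_as_list` are hypothetical names, and nested structures such as `@graph` are deliberately not traversed.

```python
def _as_list(value):
    """Normalize a missing, scalar, or list value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def summarize(url, schemas):
    """Build a dataset item with the summary fields listed above."""
    contexts, types = [], []
    for schema in schemas:
        if not isinstance(schema, dict):
            continue  # e.g. a top-level JSON array; ignored in this sketch
        for ctx in _as_list(schema.get("@context")):
            # Only string contexts are collected; order of first appearance is kept.
            if isinstance(ctx, str) and ctx not in contexts:
                contexts.append(ctx)
        for type_name in _as_list(schema.get("@type")):
            if isinstance(type_name, str) and type_name not in types:
                types.append(type_name)
    return {
        "#url": url,
        "#schemaCount": len(schemas),
        "#contexts": contexts,
        "#types": types,
        "schemas": schemas,
    }
```

Applied to the two objects in the sample output, this yields a count of 2, a single context, and the types WebSite and Organization.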

Usage Examples

Basic Extraction

{
  "startUrls": [
    { "url": "https://www.nytimes.com/" }
  ]
}

Multiple URLs with Proxy

{
  "startUrls": [
    { "url": "https://www.bbc.com/news/technology" },
    { "url": "https://en.wikipedia.org/wiki/Artificial_intelligence" }
  ],
  "timeout": 15,
  "proxy": {
    "useApifyProxy": true
  }
}

Custom Timeout and User-Agent

{
  "startUrls": [
    { "url": "https://example.com/product/123" }
  ],
  "timeout": 30,
  "userAgent": "Mozilla/5.0 (compatible; MyBot/1.0)"
}

Schema.org Coverage

This actor works on sites that implement Schema.org JSON-LD. Adoption varies by industry:

  • News publishers: ~70% adoption (NYTimes, BBC, etc.)
  • E-commerce: ~50-60% adoption (Amazon, eBay, etc.)
  • Wikipedia: 100% adoption
  • Tech blogs: Medium-high adoption
  • Recipe sites: Very high adoption

Sites without Schema.org markup return schemas: []. This is expected behavior, not an error.

Limitations

  • Blocked Sites: Some sites block simple HTTP requests (403, timeouts). Enable proxy support if needed.
  • No Rendering: Static HTTP extraction only. The actor doesn't execute JavaScript, so JSON-LD injected client-side is not captured.
  • Success Rate: ~30-60% coverage depending on industry and blocking countermeasures.

Common Schema.org Types Extracted

  • Product: E-commerce products with prices, availability, SKUs
  • Article: News articles with authors, publish dates, headlines
  • LocalBusiness: Business contact info, addresses, opening hours
  • Organization: Company details, logos, social links
  • WebSite: Site metadata, search action URLs
  • BreadcrumbList: Navigation breadcrumbs
  • FAQPage: Frequently asked questions
  • Recipe: Recipe ingredients, cooking times, nutrition
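
Downstream code usually wants only one of these types. The helper below is a hypothetical sketch (`schemas_of_type` is not provided by the actor) that filters a dataset item's `schemas` array by `@type`, assuming the item has the output structure shown earlier and allowing `@type` to be either a string or a list.

```python
def schemas_of_type(item, wanted):
    """Return the schemas in a dataset item whose @type includes `wanted`."""
    matches = []
    for schema in item.get("schemas", []):
        if not isinstance(schema, dict):
            continue  # skip anything that is not a JSON object
        declared = schema.get("@type")
        declared = declared if isinstance(declared, list) else [declared]
        if wanted in declared:
            matches.append(schema)
    return matches
```

For a price-comparison workflow, for example, calling `schemas_of_type(item, "Product")` on each dataset item would leave only the product records to inspect.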

Why This Actor?

vs. Custom HTML Scrapers

  • Universal: Same code works across thousands of sites
  • No Maintenance: Schema.org is a standard, not site-specific markup
  • Clean Data: Structured JSON-LD vs. messy HTML parsing

vs. Browser-Based Extractors

  • Fast: HTTP request only, no rendering overhead
  • Cheap: No browser memory/CPU costs
  • Reliable: No JavaScript execution issues

vs. Similar Actors

  • Simplicity: Clear output structure with summary fields
  • Proxy Support: Built-in Apify Proxy integration
  • Comprehensive: Extracts all schema types, not just specific ones

Credits

Built by Sync Computers

License

Apache 2.0