Schema.Org Json Ld Extractor
Pricing
from $0.50 / 1,000 URLs processed
Extract Schema.org JSON-LD structured data from any website. Fast, lightweight HTTP-based scraper that pulls all JSON-LD scripts - perfect for SEO analysis, product data extraction, and AI/RAG pipelines. No browser overhead.
Developer

Alam
3 days ago
Last modified
Schema.org JSON-LD Extractor
Extract structured Schema.org (JSON-LD) data from websites via a simple HTTP request. No browser needed: pure HTTP extraction with regex parsing.
Features
- Fast: HTTP request only, no browser rendering overhead
- Universal: Works across thousands of sites using Schema.org
- Structured: Returns clean JSON-LD ready for analysis
- Comprehensive: Extracts all Schema.org types (Product, Article, LocalBusiness, etc.)
- Proxy Support: Optional Apify Proxy for bypassing anti-bot measures
What It Does
Schema.org is a standard vocabulary for structured data on the web. Major sites (news publishers, e-commerce, Wikipedia, etc.) embed this data in <script type="application/ld+json"> tags for SEO purposes.
This actor extracts that structured data without needing complex HTML parsing or browser automation.
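The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration of regex-based JSON-LD extraction, not the actor's actual source: the function name `extract_json_ld`, the pattern, and the shape of the error record are all assumptions made for this example.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script>, tolerating extra
# attributes, single or double quotes, and mixed-case tag names.
JSON_LD_RE = re.compile(
    r"<script\b[^>]*\btype\s*=\s*[\"']application/ld\+json[\"'][^>]*>(.*?)</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_json_ld(html: str) -> list:
    """Return one entry per JSON-LD block: the parsed value or an error record."""
    results = []
    for match in JSON_LD_RE.finditer(html):
        raw = match.group(1).strip()
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError as exc:
            # Hypothetical error shape; the actor's real format may differ.
            results.append({"error": str(exc), "raw": raw[:200]})
    return results
```

Because the pattern only inspects the raw HTML string, it sees what the server sent and nothing that JavaScript would add later, which matches the "No Rendering" limitation listed further down.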
Use Cases
- SEO Analysis: Audit competitors' structured data implementation
- Price Comparison: Aggregate product data from multiple e-commerce sites
- Lead Generation: Extract LocalBusiness data (contact info, addresses, business hours)
- AI/RAG Development: Feed clean structured data to LLMs and knowledge bases
- Content Platforms: Pull article metadata for aggregators and news feeds
- Research: Analyze Schema.org adoption patterns across the web
Input
| Field | Type | Description | Default |
|---|---|---|---|
| startUrls | array | URLs to extract Schema.org data from | [{ "url": "https://example.com" }] |
| timeout | integer | Request timeout in seconds (1-60) | 10 |
| userAgent | string | Custom User-Agent header | Mozilla/5.0 (compatible; SchemaOrgExtractor/1.0; +https://github.com/) |
| proxy | object | Proxy configuration (Apify Proxy) | { "useApifyProxy": false } |
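As a rough sketch of how these fields and defaults fit together, the following Python helper fills in missing values from the table above. The helper name `normalize_input` is hypothetical, and clamping `timeout` into the 1-60 range is an assumption for illustration; the actor itself may instead reject out-of-range values.

```python
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (compatible; SchemaOrgExtractor/1.0; +https://github.com/)"
)


def normalize_input(raw: dict) -> dict:
    """Apply the documented defaults to a raw input object (illustrative only)."""
    start_urls = raw.get("startUrls") or [{"url": "https://example.com"}]
    urls = [entry["url"] for entry in start_urls if entry.get("url")]

    # Assumption: out-of-range timeouts are clamped to 1-60 seconds.
    timeout = max(1, min(60, int(raw.get("timeout", 10))))

    return {
        "urls": urls,
        "timeout": timeout,
        "userAgent": raw.get("userAgent") or DEFAULT_USER_AGENT,
        "proxy": raw.get("proxy") or {"useApifyProxy": False},
    }
```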
Output
Each URL produces one dataset item with the following structure:
    {
        "#url": "https://example.com",
        "#schemaCount": 2,
        "#contexts": ["https://schema.org"],
        "#types": ["WebSite", "Organization"],
        "schemas": [
            {
                "@context": "https://schema.org",
                "@type": "WebSite",
                "url": "https://example.com",
                "name": "Example Website"
            },
            {
                "@context": "https://schema.org",
                "@type": "Organization",
                "name": "Example Inc.",
                "url": "https://example.com"
            }
        ]
    }
Fields
- #url: Source URL
- #schemaCount: Number of JSON-LD blocks found
- #contexts: Unique @context values found
- #types: Unique @type values found
- schemas: Array of parsed Schema.org objects (or error details if parsing failed)
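To make the relationship between the summary fields and `schemas` concrete, here is a minimal Python sketch that derives them from a list of parsed blocks. The function `build_item` is a hypothetical name, and the sketch assumes that only top-level `@context` and `@type` values are collected (it does not descend into `@graph`); the actor may compute these fields differently.

```python
def build_item(url: str, schemas: list) -> dict:
    """Assemble one dataset item with summary fields (illustrative only)."""
    contexts, types = [], []

    def add_unique(target: list, value) -> None:
        # @context and @type may each be a single string or a list of strings.
        values = value if isinstance(value, list) else [value]
        for v in values:
            if isinstance(v, str) and v not in target:
                target.append(v)

    for schema in schemas:
        if isinstance(schema, dict):
            add_unique(contexts, schema.get("@context"))
            add_unique(types, schema.get("@type"))

    return {
        "#url": url,
        "#schemaCount": len(schemas),
        "#contexts": contexts,
        "#types": types,
        "schemas": schemas,
    }
```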
Usage Examples
Basic Extraction
    {
        "startUrls": [{ "url": "https://www.nytimes.com/" }]
    }
Multiple URLs with Proxy
    {
        "startUrls": [
            { "url": "https://www.bbc.com/news/technology" },
            { "url": "https://en.wikipedia.org/wiki/Artificial_intelligence" }
        ],
        "timeout": 15,
        "proxy": {
            "useApifyProxy": true
        }
    }
Custom Timeout and User-Agent
    {
        "startUrls": [{ "url": "https://example.com/product/123" }],
        "timeout": 30,
        "userAgent": "Mozilla/5.0 (compatible; MyBot/1.0)"
    }
Schema.org Coverage
This actor works on sites that implement Schema.org JSON-LD. Adoption varies by industry:
- News publishers: ~70% adoption (NYTimes, BBC, etc.)
- E-commerce: ~50-60% adoption (Amazon, eBay, etc.)
- Wikipedia: 100% adoption
- Tech blogs: Medium-high adoption
- Recipe sites: Very high adoption
Sites without Schema.org will return schemas: []. This is expected behavior, not an error.
Limitations
- Blocked Sites: Some sites block simple HTTP requests (403, timeouts). Enable proxy support if needed.
- No Rendering: Static HTTP extraction only; doesn't execute JavaScript.
- Success Rate: ~30-60% coverage depending on industry and blocking countermeasures.
Common Schema.org Types Extracted
- Product: E-commerce products with prices, availability, SKUs
- Article: News articles with authors, publish dates, headlines
- LocalBusiness: Business contact info, addresses, opening hours
- Organization: Company details, logos, social links
- WebSite: Site metadata, search action URLs
- BreadcrumbList: Navigation breadcrumbs
- FAQPage: Frequently asked questions
- Recipe: Recipe ingredients, cooking times, nutrition
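When consuming the output, it is often useful to pull out only the nodes of one type. The sketch below is a possible post-processing helper in Python, not part of the actor: `find_by_type` is a hypothetical name, and it assumes the `schemas` array may contain nested objects, `@graph` arrays, and `@type` values given either as a string or as a list.

```python
def find_by_type(node, wanted: str) -> list:
    """Recursively collect every JSON-LD object whose @type includes `wanted`."""
    found = []
    if isinstance(node, list):
        for child in node:
            found.extend(find_by_type(child, wanted))
    elif isinstance(node, dict):
        node_type = node.get("@type")
        type_names = node_type if isinstance(node_type, list) else [node_type]
        if wanted in type_names:
            found.append(node)
        # Descend into nested objects and arrays such as @graph or offers.
        for value in node.values():
            if isinstance(value, (dict, list)):
                found.extend(find_by_type(value, wanted))
    return found
```

For example, calling `find_by_type(item["schemas"], "Product")` on a dataset item would return every Product object in it, including ones wrapped inside an `@graph` array.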
Why This Actor?
vs. Custom HTML Scrapers
- Universal: Same code works across thousands of sites
- No Maintenance: Schema.org is a standard, not site-specific markup
- Clean Data: Structured JSON-LD vs. messy HTML parsing
vs. Browser-Based Extractors
- Fast: HTTP request only, no rendering overhead
- Cheap: No browser memory/CPU costs
- Reliable: No JavaScript execution issues
vs. Similar Actors
- Simplicity: Clear output structure with summary fields
- Proxy Support: Built-in Apify Proxy integration
- Comprehensive: Extracts all schema types, not just specific ones
Credits
Built by Sync Computers
License
Apache 2.0