Structured Data Extractor
Developer: Stas Persiianenko
Pricing: Pay per event
Extract JSON-LD, Microdata, and RDFa structured data from web pages for SEO auditing and Schema.org validation.
What does Structured Data Extractor do?
This actor extracts structured data markup from web pages. It parses all three major formats: JSON-LD (`<script type="application/ld+json">`), Microdata (`itemscope`/`itemprop`), and RDFa (`typeof`/`property`). For each page, it returns the full structured data objects, detected Schema.org types, and format counts. Use it to audit rich snippet eligibility, verify Schema.org implementation, or monitor structured data across your entire site.
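To illustrate the JSON-LD part of this extraction, here is a minimal Python sketch (not the actor's actual implementation) that pulls `<script type="application/ld+json">` blocks out of raw HTML using only the standard library:

```python
import json
from html.parser import HTMLParser

class JsonLdParser(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []  # parsed JSON-LD objects

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            try:
                self.blocks.append(json.loads(data))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page

def extract_json_ld(html: str) -> list:
    """Return all parseable JSON-LD objects found in an HTML string."""
    parser = JsonLdParser()
    parser.feed(html)
    return parser.blocks
```

The actor additionally walks the DOM for Microdata and RDFa attributes, which this sketch omits.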
Use cases
- SEO specialists -- verify Schema.org markup implementation across hundreds of pages in a single run
- Rich snippet auditors -- check that pages have the right structured data types for Google rich results (Product, Article, FAQ, etc.)
- Competitive analysts -- see what structured data competitors use and identify markup opportunities you are missing
- Migration testers -- ensure structured data survives CMS, domain, or URL migrations without data loss
- Content monitoring teams -- track structured data changes across pages over time to catch regressions
- AI/ML engineers -- extract structured Schema.org data to build knowledge graphs, enrich RAG pipelines, or create training datasets with clean entity relationships
Why use Structured Data Extractor?
- All three formats -- extracts JSON-LD, Microdata, and RDFa in a single pass, so you never miss markup regardless of implementation
- Full data objects -- returns the complete structured data payload, not just type names, so you can inspect every property
- Batch processing -- analyze hundreds of URLs at once instead of checking pages one at a time in Google's testing tool
- AI-ready structured output -- each result includes format counts, detected Schema.org types, and boolean flags, ready for LLM training data or knowledge graph construction
- API and integration ready -- trigger runs programmatically or connect to dashboards via Google Sheets, Zapier, and more
- Pay-per-event pricing -- only pay for pages you actually analyze, starting at $0.001 per URL
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | string[] | Yes | -- | List of web page URLs to extract structured data from |
Example input
{"urls": ["https://www.google.com","https://en.wikipedia.org/wiki/Web_scraping","https://www.imdb.com/title/tt0111161/"]}
Output example
{"url": "https://en.wikipedia.org/wiki/Web_scraping","title": "Web scraping - Wikipedia","structuredDataCount": 2,"jsonLdCount": 1,"microdataCount": 1,"rdfaCount": 0,"schemaTypes": ["Article", "BreadcrumbList"],"structuredData": [{"type": "Article","format": "json-ld","data": { "@type": "Article", "name": "Web scraping", "headline": "Web scraping" }}],"hasJsonLd": true,"hasMicrodata": true,"hasRdfa": false,"error": null,"extractedAt": "2026-03-01T12:00:00.000Z"}
Output fields
| Field | Type | Description |
|---|---|---|
| `url` | string | The analyzed page URL |
| `title` | string | The page title |
| `structuredDataCount` | number | Total number of structured data items found |
| `jsonLdCount` | number | Number of JSON-LD blocks found |
| `microdataCount` | number | Number of Microdata items found |
| `rdfaCount` | number | Number of RDFa items found |
| `schemaTypes` | string[] | List of detected Schema.org types |
| `structuredData` | array | Full structured data objects with type, format, and data |
| `hasJsonLd` | boolean | Whether the page contains any JSON-LD |
| `hasMicrodata` | boolean | Whether the page contains any Microdata |
| `hasRdfa` | boolean | Whether the page contains any RDFa |
| `error` | string | Error message if extraction failed, `null` otherwise |
| `extractedAt` | string | ISO timestamp of the extraction |
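Fields like `schemaTypes` and `structuredDataCount` make audits easy to script once you have the dataset. A small Python sketch (field names from the table above; the result list is an illustrative snapshot, not real actor output):

```python
def audit(results, required_type="Product"):
    """Split actor results into pages with no markup at all and pages
    that have markup but are missing one required Schema.org type."""
    no_markup = [r["url"] for r in results if r["structuredDataCount"] == 0]
    missing_type = [
        r["url"]
        for r in results
        if r["structuredDataCount"] > 0 and required_type not in r["schemaTypes"]
    ]
    return no_markup, missing_type

# Illustrative dataset items (only the fields the audit needs):
results = [
    {"url": "https://example.com/p1", "structuredDataCount": 2,
     "schemaTypes": ["Product", "BreadcrumbList"]},
    {"url": "https://example.com/p2", "structuredDataCount": 1,
     "schemaTypes": ["Article"]},
    {"url": "https://example.com/p3", "structuredDataCount": 0,
     "schemaTypes": []},
]

no_markup, missing_product = audit(results)
```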
How to extract structured data from web pages
- Go to Structured Data Extractor on Apify Store
- Enter one or more URLs in the `urls` field
- Click Start to run the extractor
- Wait for results -- each page is analyzed in seconds
- Review the output for JSON-LD, Microdata, and RDFa structured data found on each page
- Download results as JSON, CSV, or Excel, or connect via API
How much does it cost to extract structured data?
Structured Data Extractor uses Apify's pay-per-event pricing model. You only pay for what you use.
| Event | Price | Description |
|---|---|---|
| Start | $0.035 | One-time per run |
| URL extracted | $0.001 | Per page extracted |
Example costs:
- 10 pages: $0.035 + 10 x $0.001 = $0.045
- 100 pages: $0.035 + 100 x $0.001 = $0.135
- 1,000 pages: $0.035 + 1,000 x $0.001 = $1.035
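Because pricing is linear in page count, run cost is easy to estimate up front. A quick sketch of the math (prices taken from the table above; computed in tenths of a cent to avoid float rounding):

```python
def run_cost_usd(pages: int) -> float:
    """Estimated run cost: $0.035 start fee plus $0.001 per extracted URL."""
    return (35 + pages) / 1000
```

For example, `run_cost_usd(100)` reproduces the $0.135 figure above.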
Using the Apify API
You can start Structured Data Extractor programmatically from your own applications using the Apify API. The following examples show how to run the actor and retrieve results in both Node.js and Python.
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('automation-lab/structured-data-extractor').call({
    urls: ['https://en.wikipedia.org/wiki/Web_scraping'],
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')

run = client.actor('automation-lab/structured-data-extractor').call(run_input={
    'urls': ['https://en.wikipedia.org/wiki/Web_scraping'],
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```
cURL
```bash
curl -X POST "https://api.apify.com/v2/acts/automation-lab~structured-data-extractor/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://en.wikipedia.org/wiki/Web_scraping"]}'
```
Use with Claude AI (MCP)
This actor is available as a tool in Claude AI through the Model Context Protocol (MCP). Add it to Claude Desktop, Cursor, Windsurf, or any MCP-compatible client.
Setup for Claude Code
```bash
claude mcp add --transport http apify "https://mcp.apify.com"
```
Setup for Claude Desktop, Cursor, or VS Code
Add this to your MCP config file:
{"mcpServers": {"apify": {"url": "https://mcp.apify.com"}}}
Example prompts
- "Extract structured data from this product page: https://www.example.com/product/123"
- "Get schema.org markup from these URLs and tell me which types they use"
- "Check if these pages have JSON-LD structured data for rich snippets"
Learn more in the Apify MCP documentation.
Integrations
Structured Data Extractor works with all major automation platforms available on Apify. Export results to Google Sheets to build a structured data audit dashboard across your site. Use Zapier or Make to trigger extraction runs whenever new pages are published. Send alerts to Slack when pages are missing expected Schema.org types. Pipe results into n8n workflows for custom validation logic, or set up webhooks to trigger downstream actions as soon as a run finishes. Chain it with JSON-LD Validator to first extract and then validate your structured data.
Tips and best practices
- Focus on pages eligible for rich results -- prioritize product pages, articles, FAQ pages, and recipe pages where structured data directly impacts search appearance
- Filter by `schemaTypes` to quickly find pages missing specific types like Product, Article, or BreadcrumbList
- Use `structuredDataCount: 0` to find pages with no markup -- these are your biggest opportunities for SEO improvement
- Combine with JSON-LD Validator to first extract structured data with this actor, then validate the JSON-LD blocks for errors and warnings
- Schedule regular runs to catch structured data regressions after site deployments or CMS updates
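Scheduled runs can be diffed to catch those regressions automatically. A minimal sketch (the two result lists are illustrative snapshots from consecutive runs, keyed on the `url` and `schemaTypes` output fields):

```python
def find_regressions(before, after):
    """Report URLs whose detected Schema.org types shrank between two runs."""
    prev = {r["url"]: set(r["schemaTypes"]) for r in before}
    regressions = {}
    for r in after:
        lost = prev.get(r["url"], set()) - set(r["schemaTypes"])
        if lost:
            regressions[r["url"]] = sorted(lost)  # types that disappeared
    return regressions

# Illustrative snapshots: the Product type vanished after a deployment.
before = [{"url": "https://example.com/a", "schemaTypes": ["Product", "BreadcrumbList"]}]
after = [{"url": "https://example.com/a", "schemaTypes": ["BreadcrumbList"]}]
```

A non-empty return value is a good trigger for a Slack alert via one of the integrations above.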
Legality
This tool analyzes publicly accessible web content. Automated analysis of public web resources is standard practice in SEO and web development. Always respect robots.txt directives and rate limits when analyzing third-party websites. For personal data processing, ensure compliance with applicable privacy regulations.
FAQ
What structured data formats does this actor support?
It extracts all three major formats: JSON-LD (script tags), Microdata (`itemscope`/`itemprop` attributes), and RDFa (`typeof`/`property` attributes).
Does it validate the structured data?
No. This actor extracts and reports what structured data exists on a page. For validation of JSON-LD syntax and required fields, use the JSON-LD Validator actor.
Can it extract structured data from JavaScript-rendered pages?
No. The actor uses plain HTTP requests and parses the initial HTML response. Structured data that is injected by client-side JavaScript after page load will not be captured.
The actor returns `structuredDataCount: 0` for a page I know has structured data. Why?
The actor uses plain HTTP requests and parses the initial HTML. If the structured data is injected by client-side JavaScript after page load (common with React, Angular, or Vue apps), it will not be captured. Test by viewing the page source (Ctrl+U) rather than the browser's inspector to see what the actor receives.
Why does the actor find Microdata but not JSON-LD on a page?
Some websites use Microdata (HTML attributes like `itemscope` and `itemprop`) instead of JSON-LD script tags. Both are valid formats for structured data. The actor extracts both, and the `format` field in each `structuredData` entry tells you which format was used.
Other SEO tools
- JSON-LD Validator -- Validate JSON-LD structured data for errors and warnings
- OG Meta Extractor -- Extract Open Graph meta tags from web pages
- SEO Title Checker -- Check page titles for SEO best practices
- Subdomain Finder -- Discover subdomains via certificate transparency logs
- Domain Availability Checker -- Check if domain names are available for registration

