Structured Data Extractor

Pricing: Pay per event

Developer: Stas Persiianenko (Maintained by Community)

Actor stats: 1 bookmarked · 5 total users · 3 monthly active users · last modified 10 hours ago


Extract JSON-LD, Microdata, and RDFa structured data from web pages for SEO auditing and Schema.org validation.

What does Structured Data Extractor do?

This actor extracts structured data markup from web pages. It parses all three major formats: JSON-LD (`<script type="application/ld+json">`), Microdata (`itemscope`/`itemprop`), and RDFa (`typeof`/`property`). For each page, it returns the full structured data objects, detected Schema.org types, and format counts. Use it to audit rich snippet eligibility, verify Schema.org implementation, or monitor structured data across your entire site.
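The actor's internal parser is not published, but to illustrate what JSON-LD extraction involves, here is a minimal stdlib-only Python sketch: it walks the HTML, collects the contents of `<script type="application/ld+json">` blocks, and reads out the `@type` values. The `JsonLdExtractor` class name and the sample HTML are illustrative, not part of the actor.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collects parsed contents of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            try:
                self.blocks.append(json.loads(data))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than failing the page

html = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Web scraping"}
</script>
</head><body></body></html>"""

parser = JsonLdExtractor()
parser.feed(html)
types = [b.get("@type") for b in parser.blocks]
print(types)  # ['Article']
```

A production extractor also has to handle Microdata and RDFa attributes and deduplicate nested items, which is exactly the work this actor does for you.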

Use cases

  • SEO specialists -- verify Schema.org markup implementation across hundreds of pages in a single run
  • Rich snippet auditors -- check that pages have the right structured data types for Google rich results (Product, Article, FAQ, etc.)
  • Competitive analysts -- see what structured data competitors use and identify markup opportunities you are missing
  • Migration testers -- ensure structured data survives CMS, domain, or URL migrations without data loss
  • Content monitoring teams -- track structured data changes across pages over time to catch regressions
  • AI/ML engineers -- extract structured Schema.org data to build knowledge graphs, enrich RAG pipelines, or create training datasets with clean entity relationships

Why use Structured Data Extractor?

  • All three formats -- extracts JSON-LD, Microdata, and RDFa in a single pass, so you never miss markup regardless of implementation
  • Full data objects -- returns the complete structured data payload, not just type names, so you can inspect every property
  • Batch processing -- analyze hundreds of URLs at once instead of checking pages one at a time in Google's testing tool
  • AI-ready structured output -- each result includes format counts, detected Schema.org types, and boolean flags, ready for LLM training data or knowledge graph construction
  • API and integration ready -- trigger runs programmatically or connect to dashboards via Google Sheets, Zapier, and more
  • Pay-per-event pricing -- only pay for pages you actually analyze, starting at $0.001 per URL

Input parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `urls` | string[] | Yes | -- | List of web page URLs to extract structured data from |

Example input

```json
{
  "urls": [
    "https://www.google.com",
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://www.imdb.com/title/tt0111161/"
  ]
}
```

Output example

```json
{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping - Wikipedia",
  "structuredDataCount": 2,
  "jsonLdCount": 1,
  "microdataCount": 1,
  "rdfaCount": 0,
  "schemaTypes": ["Article", "BreadcrumbList"],
  "structuredData": [
    {
      "type": "Article",
      "format": "json-ld",
      "data": { "@type": "Article", "name": "Web scraping", "headline": "Web scraping" }
    }
  ],
  "hasJsonLd": true,
  "hasMicrodata": true,
  "hasRdfa": false,
  "error": null,
  "extractedAt": "2026-03-01T12:00:00.000Z"
}
```

Output fields

| Field | Type | Description |
| --- | --- | --- |
| `url` | string | The analyzed page URL |
| `title` | string | The page title |
| `structuredDataCount` | number | Total number of structured data items found |
| `jsonLdCount` | number | Number of JSON-LD blocks found |
| `microdataCount` | number | Number of Microdata items found |
| `rdfaCount` | number | Number of RDFa items found |
| `schemaTypes` | string[] | List of detected Schema.org types |
| `structuredData` | array | Full structured data objects with type, format, and data |
| `hasJsonLd` | boolean | Whether the page contains any JSON-LD |
| `hasMicrodata` | boolean | Whether the page contains any Microdata |
| `hasRdfa` | boolean | Whether the page contains any RDFa |
| `error` | string | Error message if extraction failed, null otherwise |
| `extractedAt` | string | ISO timestamp of the extraction |

How to extract structured data from web pages

  1. Go to Structured Data Extractor on Apify Store
  2. Enter one or more URLs in the `urls` field
  3. Click Start to run the extractor
  4. Wait for results -- each page is analyzed in seconds
  5. Review the output for JSON-LD, Microdata, and RDFa structured data found on each page
  6. Download results as JSON, CSV, or Excel, or connect via API

How much does it cost to extract structured data?

Structured Data Extractor uses Apify's pay-per-event pricing model. You only pay for what you use.

| Event | Price | Description |
| --- | --- | --- |
| Start | $0.035 | One-time per run |
| URL extracted | $0.001 | Per page extracted |

Example costs:

  • 10 pages: $0.035 + 10 x $0.001 = $0.045
  • 100 pages: $0.035 + 100 x $0.001 = $0.135
  • 1,000 pages: $0.035 + 1,000 x $0.001 = $1.035
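The cost formula above is just a flat start fee plus a per-URL fee, so you can estimate any run size with a one-line function (a sketch using the listed prices; `run_cost` is an illustrative helper, not part of any API):

```python
START_FEE = 0.035    # one-time fee per run, in USD
PER_URL_FEE = 0.001  # fee per page extracted, in USD

def run_cost(num_urls: int) -> float:
    """Estimated cost in USD for a single run over num_urls pages."""
    return START_FEE + num_urls * PER_URL_FEE

for n in (10, 100, 1000):
    print(f"{n} pages: ${run_cost(n):.3f}")
```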

Using the Apify API

You can start Structured Data Extractor programmatically from your own applications using the Apify API. The following examples show how to run the actor and retrieve results in both Node.js and Python.

Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('automation-lab/structured-data-extractor').call({
    urls: ['https://en.wikipedia.org/wiki/Web_scraping'],
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```

Python

```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')

run = client.actor('automation-lab/structured-data-extractor').call(run_input={
    'urls': ['https://en.wikipedia.org/wiki/Web_scraping'],
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```

cURL

```shell
curl -X POST "https://api.apify.com/v2/acts/automation-lab~structured-data-extractor/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://en.wikipedia.org/wiki/Web_scraping"]
  }'
```

Use with Claude AI (MCP)

This actor is available as a tool in Claude AI through the Model Context Protocol (MCP). Add it to Claude Desktop, Cursor, Windsurf, or any MCP-compatible client.

Setup for Claude Code

```shell
claude mcp add --transport http apify "https://mcp.apify.com"
```

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

```json
{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com"
    }
  }
}
```

Example prompts

  • "Extract structured data from this product page: https://www.example.com/product/123"
  • "Get schema.org markup from these URLs and tell me which types they use"
  • "Check if these pages have JSON-LD structured data for rich snippets"

Learn more in the Apify MCP documentation.

Integrations

Structured Data Extractor works with all major automation platforms available on Apify. Export results to Google Sheets to build a structured data audit dashboard across your site. Use Zapier or Make to trigger extraction runs whenever new pages are published. Send alerts to Slack when pages are missing expected Schema.org types. Pipe results into n8n workflows for custom validation logic, or set up webhooks to trigger downstream actions as soon as a run finishes. Chain it with JSON-LD Validator to first extract and then validate your structured data.

Tips and best practices

  • Focus on pages eligible for rich results -- prioritize product pages, articles, FAQ pages, and recipe pages where structured data directly impacts search appearance
  • Filter by `schemaTypes` to quickly find pages missing specific types like Product, Article, or BreadcrumbList
  • Use `structuredDataCount: 0` to find pages with no markup -- these are your biggest opportunities for SEO improvement
  • Combine with JSON-LD Validator to first extract structured data with this actor, then validate the JSON-LD blocks for errors and warnings
  • Schedule regular runs to catch structured data regressions after site deployments or CMS updates
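The filtering tips above are plain list comprehensions over the dataset items. A sketch, using made-up sample items in the actor's output shape (in practice you would fetch `items` via the API as shown earlier):

```python
# Hypothetical sample of dataset items in the actor's output shape.
items = [
    {"url": "https://example.com/p/1", "schemaTypes": ["Product", "BreadcrumbList"], "structuredDataCount": 2},
    {"url": "https://example.com/p/2", "schemaTypes": [], "structuredDataCount": 0},
    {"url": "https://example.com/blog/a", "schemaTypes": ["Article"], "structuredDataCount": 1},
]

# Pages with no structured data at all: the biggest SEO opportunities.
no_markup = [i["url"] for i in items if i["structuredDataCount"] == 0]

# Product pages (here: URLs containing /p/) missing Product markup.
missing_product = [
    i["url"] for i in items
    if "/p/" in i["url"] and "Product" not in i["schemaTypes"]
]

print(no_markup)         # ['https://example.com/p/2']
print(missing_product)   # ['https://example.com/p/2']
```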

Legality

This tool analyzes publicly accessible web content. Automated analysis of public web resources is standard practice in SEO and web development. Always respect robots.txt directives and rate limits when analyzing third-party websites. For personal data processing, ensure compliance with applicable privacy regulations.

FAQ

What structured data formats does this actor support? It extracts all three major formats: JSON-LD (script tags), Microdata (itemscope/itemprop attributes), and RDFa (typeof/property attributes).

Does it validate the structured data? No. This actor extracts and reports what structured data exists on a page. For validation of JSON-LD syntax and required fields, use the JSON-LD Validator actor.

Can it extract structured data from JavaScript-rendered pages? No. The actor uses plain HTTP requests and parses the initial HTML response. Structured data that is injected by client-side JavaScript after page load will not be captured.

The actor returns structuredDataCount: 0 for a page I know has structured data. Why? The actor uses plain HTTP requests and parses the initial HTML. If the structured data is injected by client-side JavaScript after page load (common with React, Angular, or Vue apps), it will not be captured. Test by viewing the page source (Ctrl+U) rather than the browser's inspector to see what the actor receives.

Why does the actor find Microdata but not JSON-LD on a page? Some websites use Microdata (HTML attributes like itemscope and itemprop) instead of JSON-LD script tags. Both are valid formats for structured data. The actor extracts both, and the format field in each structuredData entry tells you which format was used.
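To make the Microdata-vs-JSON-LD distinction concrete: Microdata lives in plain HTML attributes, so detecting it means scanning tag attributes rather than script contents. A minimal stdlib sketch (the `MicrodataCounter` class and sample HTML are illustrative, not the actor's implementation):

```python
from html.parser import HTMLParser

class MicrodataCounter(HTMLParser):
    """Counts itemscope elements and collects their itemtype values."""
    def __init__(self):
        super().__init__()
        self.itemscopes = 0
        self.itemtypes = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # valueless attributes like itemscope map to None
        if "itemscope" in a:
            self.itemscopes += 1
            if "itemtype" in a:
                self.itemtypes.append(a["itemtype"])

html = ('<div itemscope itemtype="https://schema.org/Product">'
        '<span itemprop="name">Widget</span></div>')

p = MicrodataCounter()
p.feed(html)
print(p.itemscopes, p.itemtypes)  # 1 ['https://schema.org/Product']
```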
