JSON-LD Schema & Meta Tag Extractor
Pricing
from $3.50 / 1,000 results
Extract JSON-LD/Schema.org structured data, Meta tags, OpenGraph and Twitter Cards from any URL. Get page title + meta description with a clean JSON output for SEO audits, validation, competitor research and AI datasets. Proxy-ready for large crawls.
Developer
Logiover
Maintained by Community
JSON-LD Schema & Meta Tag Extractor: Scrape Schema.org, OpenGraph & Meta Tags

A fast way to pull structured data from any webpage at scale. Give this scraper a list of URLs and it returns JSON-LD / Schema.org markup, page title, meta description, Open Graph tags and Twitter Cards: one clean, normalized record per URL, ready for SEO audits, schema validation, AEO/AIO research and AI training datasets.
What this scraper does
The JSON-LD Schema & Meta Tag Extractor is a focused, production-grade Apify Actor that crawls any list of URLs and pulls out every signal search engines, social platforms and AI answer engines use to understand a page:
- All `<script type="application/ld+json">` blocks (Schema.org / structured data)
- The `<title>` tag and meta description
- Every `og:*` Open Graph property (image, type, url, site_name, locale and more)
- Every `twitter:*` Twitter Card property (card, image, title, description, site, creator)
- A scrape timestamp on every record so you can diff results over time
Output is flat, normalized JSON, well suited to piping into a BigQuery / Postgres warehouse, a Looker dashboard, an SEO platform or a custom QA pipeline. Unlike browser-rendered scrapers, this Actor is HTTP-only, which makes it considerably cheaper and faster on large URL lists (think 50k to 500k URL crawls).
If you've been pasting URLs into Google's Rich Results Test or Schema.org's validator one at a time, this is the bulk-processing tool you've been missing.
Key features
| Feature | What you get |
|---|---|
| JSON-LD / Schema.org extraction | All `<script type="application/ld+json">` blocks, multiple per page if present |
| Meta tag extraction | Page `<title>` and meta description |
| Open Graph parser | Complete `og:*` capture: image, type, url, site_name, locale, video, audio |
| Twitter Card extraction | All `twitter:*` properties: card, image, title, description, site, creator, player |
| Normalized output | One flat record per URL, with no nested mess; easy to load into SQL or Sheets |
| Scrape timestamp | Every record carries `scrapeDate` so you can diff and track regressions |
| HTTP-only crawler | No headless browser; hundreds of URLs per minute at default concurrency (see the Performance table) |
| Proxy support | Apify Proxy out of the box, residential proxies for stricter sites |
| Bulk-friendly | Tested on lists of 100k+ URLs in a single run |
| Export-ready | JSON, CSV, Excel, XML, or stream via the Apify API |
| Schedulable | Run hourly / daily / weekly to track schema drift after deploys |
| AEO / AIO ready | Schema markup is a key structured signal for answer engines such as ChatGPT, Perplexity & Google AI Overviews |
Built for these use cases
- SEO audit dashboards: feed every URL in a sitemap and surface pages missing Product, Article, FAQ or Organization schema. Plug results into Looker / Power BI to visualize coverage by section.
- Structured data validation & QA: find malformed JSON-LD, missing required properties (e.g. `Product.offers.price`) or incorrectly nested entities before Googlebot does.
- Schema.org compliance audits: verify your entire site uses the right `@type` values and that breadcrumbs, Articles, Products and LocalBusiness entities are consistent across templates.
- Knowledge graph building: extract Organization, Person, Product and Event entities at scale to feed a private knowledge graph or an enrichment pipeline.
- AEO (Answer Engine Optimization) & AIO research: ChatGPT, Claude, Perplexity, Gemini and Google AI Overviews lean heavily on Schema.org. Track which competitors mark up which entities and where the gaps are.
- Competitor SEO analysis: reverse-engineer the schema strategies of the top 10 ranking pages for any query, then close the gap on your own pages.
- Social preview QA: catch missing `og:image`, broken `twitter:card` types or pages where the OG description doesn't match the meta description.
- AI training data: Schema.org markup is one of the cleanest sources of structured product / recipe / article data on the open web, well suited to fine-tuning or RAG corpora.
Inputs
| Field | Type | Required | Description |
|---|---|---|---|
| `startUrls` | array | Yes | List of target URLs to scrape structured data and metadata from. Accepts the standard Apify Request List format. |
| `proxyConfiguration` | object | Yes | Proxy settings used to avoid blocking on large crawls. Defaults to `{ useApifyProxy: true }`. |
Example input: quick test on two pages
{
  "startUrls": [
    { "url": "https://www.imdb.com/title/tt0111161/" },
    { "url": "https://www.allrecipes.com/recipe/158968/spinach-and-feta-turkey-burgers/" }
  ],
  "proxyConfiguration": { "useApifyProxy": true }
}
Example input: bulk e-commerce audit
{
  "startUrls": [
    { "url": "https://shop.example.com/products/red-shoes" },
    { "url": "https://shop.example.com/products/blue-shoes" },
    { "url": "https://shop.example.com/products/green-shoes" }
  ],
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
Example input: sitemap audit (paste in URLs from your sitemap.xml)
{
  "startUrls": [
    { "url": "https://www.example.com/" },
    { "url": "https://www.example.com/about" },
    { "url": "https://www.example.com/blog/post-1" },
    { "url": "https://www.example.com/blog/post-2" }
  ],
  "proxyConfiguration": { "useApifyProxy": true }
}
Tip: Combine with the Sitemap to URL List Crawler Actor to auto-generate the `startUrls` list from any site's `/sitemap.xml`.
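If you would rather assemble the input yourself, the following is a minimal sketch of turning a sitemap into the input shape shown in the examples. It assumes a standard sitemaps.org `<urlset>` document that has already been downloaded as a string; `build_input` is a hypothetical helper name for this illustration, not part of the Actor.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org <urlset> documents.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def build_input(sitemap_xml: str, use_residential: bool = False) -> dict:
    """Turn a sitemap.xml string into an input dict with startUrls (hypothetical helper)."""
    root = ET.fromstring(sitemap_xml)
    start_urls = []
    seen = set()
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        if url and url not in seen:  # drop blanks and duplicates, keep order
            seen.add(url)
            start_urls.append({"url": url})
    proxy = {"useApifyProxy": True}
    if use_residential:
        proxy["apifyProxyGroups"] = ["RESIDENTIAL"]
    return {"startUrls": start_urls, "proxyConfiguration": proxy}
```

Note that a sitemap index file lists child sitemaps rather than pages, so its `<loc>` entries would need to be fetched and expanded before being used as `startUrls`.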
Output
Each dataset item is a structured-data report for one URL. The built-in Schema Report view focuses on URL, page title, JSON-LD and Open Graph so you can quickly spot missing or incorrect schema.
Field reference
| Field | Type | Description |
|---|---|---|
| `url` | string | The scraped URL |
| `pageTitle` | string | HTML page title (`<title>`) |
| `metaDescription` | string | Meta description tag content |
| `jsonLd` | array / object | Extracted JSON-LD / Schema.org objects from the page |
| `openGraph` | object | All Open Graph (`og:*`) tags found |
| `scrapeDate` | string | ISO-8601 scrape timestamp |
Sample output: Recipe page
{
  "url": "https://www.allrecipes.com/recipe/158968/spinach-and-feta-turkey-burgers/",
  "pageTitle": "Spinach and Feta Turkey Burgers Recipe",
  "metaDescription": "Lean ground turkey is mixed with spinach and feta...",
  "jsonLd": [
    {
      "@context": "https://schema.org",
      "@type": "Recipe",
      "name": "Spinach and Feta Turkey Burgers",
      "recipeIngredient": ["1 lb ground turkey", "1/2 cup feta", "1 cup spinach"],
      "aggregateRating": { "@type": "AggregateRating", "ratingValue": "4.8", "ratingCount": 1234 }
    }
  ],
  "openGraph": {
    "og:title": "Spinach and Feta Turkey Burgers Recipe",
    "og:type": "article",
    "og:image": "https://www.allrecipes.com/...spinach-feta.jpg",
    "og:url": "https://www.allrecipes.com/recipe/158968/spinach-and-feta-turkey-burgers/"
  },
  "scrapeDate": "2026-05-16T12:00:00.000Z"
}
Sample output: Product page
{
  "url": "https://example.com/product/abc",
  "pageTitle": "ABC Product - Example",
  "metaDescription": "Buy ABC Product with fast shipping.",
  "jsonLd": [
    {
      "@context": "https://schema.org",
      "@type": "Product",
      "name": "ABC Product",
      "brand": { "@type": "Brand", "name": "Example Co" },
      "offers": { "@type": "Offer", "price": "49.99", "priceCurrency": "USD", "availability": "https://schema.org/InStock" }
    }
  ],
  "openGraph": {
    "og:title": "ABC Product - Example",
    "og:type": "product",
    "og:image": "https://example.com/images/abc.jpg",
    "og:url": "https://example.com/product/abc"
  },
  "scrapeDate": "2026-05-16T12:00:00.000Z"
}
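Records shaped like the samples above lend themselves to simple automated checks. The sketch below flags three of the issues named in the use cases: no JSON-LD at all, a Product without `offers.price`, and a missing `og:image`. The function names and the issue strings are assumptions for illustration; the checks are deliberately simple and are not a full Schema.org validator.

```python
from typing import Any, Iterator


def iter_entities(json_ld: Any) -> Iterator[dict]:
    """Yield every JSON-LD object, whether jsonLd is a list, a single object, or uses @graph."""
    if isinstance(json_ld, list):
        for item in json_ld:
            yield from iter_entities(item)
    elif isinstance(json_ld, dict):
        yield json_ld
        yield from iter_entities(json_ld.get("@graph"))


def has_type(entity: dict, wanted: str) -> bool:
    """@type may be a single string or a list of strings."""
    value = entity.get("@type")
    return wanted in value if isinstance(value, list) else value == wanted


def audit_record(record: dict) -> list[str]:
    """Return a list of issue labels for one dataset item (labels are illustrative)."""
    issues = []
    entities = list(iter_entities(record.get("jsonLd")))
    if not entities:
        issues.append("no JSON-LD found")
    for product in (e for e in entities if has_type(e, "Product")):
        offers = product.get("offers")
        offer_list = offers if isinstance(offers, list) else [offers]
        priced = any(
            isinstance(o, dict) and o.get("price") not in (None, "")
            for o in offer_list
        )
        if not priced:
            issues.append("Product without offers.price")
    if not (record.get("openGraph") or {}).get("og:image"):
        issues.append("missing og:image")
    return issues
```

One caveat on the design: an `AggregateOffer` carries `lowPrice` / `highPrice` rather than `price`, so this sketch would flag it; extend the price check if your templates use that type.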
How it works
- Input parsing: the Actor reads `startUrls` and the proxy configuration.
- HTTP fetch: each URL is fetched via an HTTP request through Apify Proxy (no browser).
- HTML parsing: the response is parsed and the Actor walks the DOM for:
  - `<title>` tag
  - `<meta name="description">`
  - `<meta property="og:*">`
  - `<meta name="twitter:*">`
  - All `<script type="application/ld+json">` blocks
- JSON-LD safe-parse: each JSON-LD block is parsed with error guards so one malformed script never crashes the whole page record.
- Normalization: Open Graph and Twitter Card properties are collected into clean key-value objects.
- Dataset push: one record per URL is pushed to the Apify dataset with a `scrapeDate` timestamp.
- Pagination & batching: for very large URL lists, Apify automatically distributes work and persists state so you can pause/resume.
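The parsing, safe-parse and normalization steps above can be sketched in Python using only the standard library. This illustrates the technique and is not the Actor's actual source: the names `MetaExtractor` and `extract` and the `twitterCard` output key are assumptions made for the example (the field reference does not name a Twitter Card field), and the HTML is assumed to have been fetched already.

```python
import json
from datetime import datetime, timezone
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects title, meta description, og:*, twitter:* and JSON-LD (illustrative)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta_description = ""
        self.open_graph = {}
        self.twitter = {}
        self.json_ld = []
        self._in_title = False
        self._title_done = False
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "title" and not self._title_done:
            self._in_title = True
        elif tag == "meta":
            key = (a.get("property") or a.get("name") or "").lower()
            content = a.get("content", "")
            if key == "description":
                self.meta_description = content
            elif key.startswith("og:"):
                self.open_graph[key] = content
            elif key.startswith("twitter:"):
                self.twitter[key] = content
        elif tag == "script" and a.get("type", "").lower() == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_json_ld:
            self._buffer.append(data)  # script text may arrive in several chunks

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._title_done = True  # ignore later <title> elements, e.g. inside SVG
        elif tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # safe-parse: skip the malformed block, keep the rest of the record


def extract(url: str, html: str) -> dict:
    """Build one record per URL from already-fetched HTML."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "pageTitle": parser.title.strip(),
        "metaDescription": parser.meta_description,
        "jsonLd": parser.json_ld,
        "openGraph": parser.open_graph,
        "twitterCard": parser.twitter,  # hypothetical key, not in the field reference
        "scrapeDate": datetime.now(timezone.utc).isoformat(),
    }
```

Because the parser only sees the raw HTML, schema injected client-side by JavaScript would not appear in `jsonLd`, which matches the HTTP-only behaviour described in the FAQ.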
Performance
| Workload | Approx. runtime | Notes |
|---|---|---|
| 100 URLs | < 1 minute | Single proxy session, perfect for spot-checks |
| 1,000 URLs | 3-6 minutes | Default concurrency |
| 10,000 URLs | 30-60 minutes | Enable Apify Proxy, datacenter group is fine |
| 100,000 URLs | 4-8 hours | Recommended: split into 5-10 parallel runs |
| 1,000,000 URLs | ~1 day per run × N runs | Split by sitemap section / batch and run on a schedule |
Speed depends primarily on the target site's response time and whether residential proxies are required. Static sites and most e-commerce platforms tend to finish toward the faster end of these ranges.
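For the larger workloads in the table, where splitting into parallel runs is recommended, the split itself can be done with a few lines of Python. This is a minimal sketch; `split_batches` is a hypothetical helper, and each resulting batch would become the `startUrls` list of one run.

```python
def split_batches(urls: list[str], runs: int) -> list[list[str]]:
    """Split urls into at most `runs` contiguous batches whose sizes differ by at most one."""
    if runs < 1:
        raise ValueError("runs must be >= 1")
    size, extra = divmod(len(urls), runs)
    batches = []
    start = 0
    for i in range(runs):
        # The first `extra` batches take one additional URL each.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty batches when there are fewer URLs than runs
            batches.append(urls[start:end])
        start = end
    return batches
```

Contiguous batches keep URLs from the same sitemap section together, which fits the table's suggestion to split by section.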
Cost model
This Actor is HTTP-only, with no headless browser, which keeps platform usage low. Costs scale linearly with the number of URLs processed. For typical sites, you can audit 10,000 URLs for a small fraction of a dollar in Apify Compute Units, plus standard proxy usage if you enable residential proxies. Run a 100-URL spot check first to estimate your per-URL cost before scaling up.
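The listing also quotes a price of "from $3.50 / 1,000 results". How that per-result price combines with compute and proxy usage is not spelled out here, so the sketch below covers only the per-result component, under two stated assumptions: one result per URL and the listed starting price. `estimate_result_cost` is a hypothetical helper.

```python
def estimate_result_cost(url_count: int, price_per_1000: float = 3.50) -> float:
    """Estimated per-result charge in USD, assuming one result per URL (assumption)."""
    return round(url_count / 1000 * price_per_1000, 2)


spot_check = estimate_result_cost(100)     # 100-URL spot check
full_audit = estimate_result_cost(10_000)  # 10,000-URL audit
```

Under these assumptions, the 100-URL spot check comes to $0.35 and a 10,000-URL audit to $35.00 in result charges; check your actual run's billing for the real figure.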
Schedule for continuous monitoring
Schema drift is real. A theme update, a CMS migration or a careless deploy can wipe out structured data overnight. Use Apify Schedules to:
- Run a daily audit of your top 1,000 URLs and alert when JSON-LD count drops.
- Run a weekly full-site sweep off your sitemap to catch missing `og:image` on new pages.
- Run post-deploy diffs: schedule a run after every release and compare against the previous dataset to catch regressions in CI.
Pair with Apify Integrations (Slack, Zapier, Make, Webhooks) to ping your team the moment schema coverage drops.
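The comparison between two runs can be sketched as follows, assuming both datasets have been exported as lists of records in the output format documented above. The function names and the keys of the returned report (`jsonLdDropped`, `lostOgImage`) are assumptions for illustration.

```python
def count_blocks(record: dict) -> int:
    """Number of JSON-LD objects in a record; jsonLd may be a list or a single object."""
    json_ld = record.get("jsonLd")
    if isinstance(json_ld, list):
        return len(json_ld)
    return 1 if json_ld else 0


def schema_regressions(previous: list[dict], current: list[dict]) -> list[dict]:
    """Report URLs whose JSON-LD count dropped or whose og:image disappeared."""
    before = {record["url"]: record for record in previous}
    regressions = []
    for record in current:
        old = before.get(record["url"])
        if old is None:
            continue  # new URL, nothing to compare against
        dropped = count_blocks(old) - count_blocks(record)
        had_image = bool((old.get("openGraph") or {}).get("og:image"))
        has_image = bool((record.get("openGraph") or {}).get("og:image"))
        lost_og_image = had_image and not has_image
        if dropped > 0 or lost_og_image:
            regressions.append({
                "url": record["url"],
                "jsonLdDropped": max(dropped, 0),
                "lostOgImage": lost_og_image,
            })
    return regressions
```

A non-empty result is what you would forward to Slack or a webhook, or use to fail a CI step after a deploy.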
FAQ
Do I need an API key or login? No. The Actor fetches publicly available webpages, so no API key and no login are required.
Is this allowed / legal? The Actor reads only publicly available HTML markup from the pages you provide. Always use it responsibly and respect each site's terms of service and robots.txt.
How many URLs can I process in one run? Tens of thousands comfortably; hundreds of thousands with batching. For million-URL crawls, split into multiple scheduled runs.
Why is the JSON-LD empty for some pages? Three common reasons: (1) the page genuinely doesn't ship structured data, (2) the schema is injected client-side by JavaScript and isn't in the raw HTML, or (3) the page blocked the request. Spot-check with a single URL using residential proxies to confirm.
Does it extract Open Graph and Twitter Card tags too?
Yes. Every `og:*` and `twitter:*` meta tag is captured alongside JSON-LD and the standard meta tags, so you get a complete picture of your social + structured-data footprint.
Can it handle multiple JSON-LD blocks per page?
Yes. Pages often ship Product + BreadcrumbList + Organization + Article in separate `<script>` tags, and the `jsonLd` field captures all of them.
Can I track changes over time?
Yes. Every record carries a `scrapeDate`. Schedule daily / weekly runs and diff outputs to detect schema or metadata regressions.
Will it render JavaScript-injected schema? Not in this version. This Actor is HTTP-only by design for speed and cost. For JS-rendered schema, pre-render the page (e.g. via an SSR endpoint) or pair this with a browser-based Actor.
Does it follow links / crawl deeper? No. By design it processes only the URLs you provide. Use the Sitemap to URL Crawler Actor first to build the URL list.
What output formats are supported? Results are stored as structured JSON and can be exported to JSON, CSV, Excel, XML or HTML, or accessed live via the Apify API.
Can I send results to Google Sheets / Slack / a webhook? Yes. Apify Integrations support Google Sheets, Slack, Zapier, Make, webhooks, S3, Snowflake and more directly from the run page.
What about rate limits?
Use Apify Proxy (default) and the Actor distributes requests across IPs. For very strict sites, switch to the RESIDENTIAL proxy group.
Related scrapers
Combine this Actor with the rest of the SEO / web-data toolkit:
- Sitemap to URL Crawler: turn any `sitemap.xml` into a ready-to-paste `startUrls` list.
- Website Contact Scraper: pull emails, phones and social links from the same URL list.
- Bulk URL Status Checker: find 404s, redirects and slow pages across thousands of URLs.
Keyword cloud
Core: JSON-LD extractor, Schema.org scraper, structured data scraper, meta tag scraper, open graph parser, twitter card extractor, rich results data, schema markup extractor, JSON-LD validator, schema crawler, structured data audit, meta tag checker, rich snippet scraper.
Niche: AEO answer engine optimization, AIO AI overview optimization, generative search optimization, knowledge graph extraction, breadcrumb schema scraper, product schema audit, recipe schema scraper, FAQ schema scraper, article schema scraper, organization schema scraper, local business schema scraper, event schema extractor.
Use case: technical SEO audit, schema validation, structured data QA, social preview QA, OG image audit, sitemap-wide audit, post-deploy regression, schema coverage report, schema diff over time, content automation, AI training data, RAG corpus, knowledge graph enrichment.
Audience: SEO consultants, in-house SEO teams, technical SEO managers, content engineers, growth engineers, e-commerce SEO, news publishers, recipe sites, local SEO agencies, AI/ML engineers, data scientists, web analytics teams.