JSON-LD Schema & Meta Tag Extractor

Pricing

from $3.50 / 1,000 results

Extract JSON-LD/Schema.org structured data, Meta tags, OpenGraph and Twitter Cards from any URL. Get page title + meta description with a clean JSON output for SEO audits, validation, competitor research and AI datasets. Proxy-ready for large crawls.

Rating: 0.0 (0)

Developer: Logiover (maintained by Community)

Actor stats: 1 bookmark, 31 total users, 5 monthly active users, last modified 3 days ago

🧩 JSON-LD Schema & Meta Tag Extractor: Scrape Schema.org, OpenGraph & Meta Tags

A fast, HTTP-only way to pull structured data from any webpage at scale. Give this scraper a list of URLs and it returns JSON-LD / Schema.org markup, page title, meta description, Open Graph tags and Twitter Cards: one clean, normalized record per URL, ready for SEO audits, schema validation, AEO/AIO research and AI training datasets.


🔎 What this scraper does

The JSON-LD Schema & Meta Tag Extractor is a focused, production-grade Apify Actor that crawls any list of URLs and pulls out every signal search engines, social platforms and AI answer engines use to understand a page:

  • All <script type="application/ld+json"> blocks (Schema.org / structured data)
  • The <title> tag and meta description
  • Every og:* Open Graph property (image, type, url, site_name, locale and more)
  • Every twitter:* Twitter Card property (card, image, title, description, site, creator)
  • A scrape timestamp on every record so you can diff results over time

Output is flat, normalized JSON, perfect for piping into a BigQuery / Postgres warehouse, a Looker dashboard, an SEO platform or a custom QA pipeline. Unlike browser-rendered scrapers, this Actor is HTTP-only, which makes it dramatically cheaper and faster on large URL lists (think 50k–500k URL crawls).

If you've been pasting URLs into Google's Rich Results Test or Schema.org's validator one at a time, this is the bulk-processing tool you've been missing.


✨ Key features

| Feature | What you get |
| --- | --- |
| 🧩 JSON-LD / Schema.org extraction | All <script type="application/ld+json"> blocks, multiple per page if present |
| 🏷️ Meta tag extraction | Page <title> and meta description |
| 📱 Open Graph parser | Complete og:* capture: image, type, url, site_name, locale, video, audio |
| 🐦 Twitter Card extraction | All twitter:* properties: card, image, title, description, site, creator, player |
| 🧱 Normalized output | One flat record per URL, no nested mess, easy to load into SQL or Sheets |
| 🕒 Scrape timestamp | Every record carries scrapeDate so you can diff and track regressions |
| ⚡ HTTP-only crawler | No headless browser, so throughput stays high (see the Performance table for typical runtimes) |
| 🛡️ Proxy support | Apify Proxy out of the box, residential proxies for stricter sites |
| 📈 Bulk-friendly | Tested on lists of 100k+ URLs in a single run |
| 📤 Export-ready | JSON, CSV, Excel, XML, or stream via the Apify API |
| 🔁 Schedulable | Run hourly / daily / weekly to track schema drift after deploys |
| 🤖 AEO / AIO ready | Schema markup is the #1 input for ChatGPT, Perplexity & Google AI Overviews |

🎯 Built for these use cases

  1. SEO audit dashboards: feed every URL in a sitemap and surface pages missing Product, Article, FAQ or Organization schema. Plug results into Looker / Power BI to visualize coverage by section.
  2. Structured data validation & QA: find malformed JSON-LD, missing required properties (e.g. Product.offers.price) or incorrectly nested entities before Googlebot does.
  3. Schema.org compliance audits: verify your entire site uses the right @type values and that breadcrumbs, Articles, Products and LocalBusiness entities are consistent across templates.
  4. Knowledge graph building: extract Organization, Person, Product and Event entities at scale to feed a private knowledge graph or an enrichment pipeline.
  5. AEO (Answer Engine Optimization) & AIO research: ChatGPT, Claude, Perplexity, Gemini and Google AI Overviews lean heavily on Schema.org. Track which competitors mark up which entities and where the gaps are.
  6. Competitor SEO analysis: reverse-engineer the schema strategies of the top 10 ranking pages for any query, then close the gap on your own pages.
  7. Social preview QA: catch missing og:image, broken twitter:card types or pages where the OG description doesn't match the meta description.
  8. AI training data: Schema.org is the cleanest, most reliable source of structured product / recipe / article data on the open web, which makes it a good fit for fine-tuning or RAG corpora.

📥 Inputs

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| startUrls | array | ✅ Yes | List of target URLs to scrape structured data and metadata from. Accepts the standard Apify Request List format. |
| proxyConfiguration | object | ✅ Yes | Proxy settings used to avoid blocking on large crawls. Defaults to { useApifyProxy: true }. |

Example input: quick test on two pages

{
  "startUrls": [
    { "url": "https://www.imdb.com/title/tt0111161/" },
    { "url": "https://www.allrecipes.com/recipe/158968/spinach-and-feta-turkey-burgers/" }
  ],
  "proxyConfiguration": { "useApifyProxy": true }
}

Example input: bulk e-commerce audit

{
  "startUrls": [
    { "url": "https://shop.example.com/products/red-shoes" },
    { "url": "https://shop.example.com/products/blue-shoes" },
    { "url": "https://shop.example.com/products/green-shoes" }
  ],
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

Example input: sitemap audit (paste in URLs from your sitemap.xml)

{
  "startUrls": [
    { "url": "https://www.example.com/" },
    { "url": "https://www.example.com/about" },
    { "url": "https://www.example.com/blog/post-1" },
    { "url": "https://www.example.com/blog/post-2" }
  ],
  "proxyConfiguration": { "useApifyProxy": true }
}

Tip: Combine with the Sitemap to URL Crawler Actor to auto-generate the startUrls list from any site's /sitemap.xml.
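If you already have a sitemap.xml on disk, the input object can also be built locally. The following is a minimal sketch, assuming a plain urlset sitemap (not a sitemap index); `build_input` and `SAMPLE_SITEMAP` are hypothetical names used only for illustration and are not part of the Actor.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol for <urlset>, <url> and <loc>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def build_input(sitemap_xml: str, use_residential: bool = False) -> dict:
    """Turn the <loc> entries of a sitemap into an Actor input object."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
    proxy = {"useApifyProxy": True}
    if use_residential:
        proxy["apifyProxyGroups"] = ["RESIDENTIAL"]
    return {
        "startUrls": [{"url": url} for url in urls],
        "proxyConfiguration": proxy,
    }


SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>"""

if __name__ == "__main__":
    # Prints JSON in the same shape as the example inputs above.
    print(json.dumps(build_input(SAMPLE_SITEMAP), indent=2))
```

The result has the same two keys as the example inputs, so it can be pasted into the Actor's input editor as-is.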


📤 Output

Each dataset item is a structured-data report for one URL. The built-in Schema Report view focuses on URL, page title, JSON-LD and Open Graph so you can quickly spot missing or incorrect schema.

Field reference

| Field | Type | Description |
| --- | --- | --- |
| url | string | The scraped URL |
| pageTitle | string | HTML page title (<title>) |
| metaDescription | string | Meta description tag content |
| jsonLd | array / object | Extracted JSON-LD / Schema.org objects from the page |
| openGraph | object | All Open Graph (og:*) tags found |
| scrapeDate | string | ISO-8601 scrape timestamp |

Sample output: Recipe page

{
  "url": "https://www.allrecipes.com/recipe/158968/spinach-and-feta-turkey-burgers/",
  "pageTitle": "Spinach and Feta Turkey Burgers Recipe",
  "metaDescription": "Lean ground turkey is mixed with spinach and feta...",
  "jsonLd": [
    {
      "@context": "https://schema.org",
      "@type": "Recipe",
      "name": "Spinach and Feta Turkey Burgers",
      "recipeIngredient": ["1 lb ground turkey", "1/2 cup feta", "1 cup spinach"],
      "aggregateRating": { "@type": "AggregateRating", "ratingValue": "4.8", "ratingCount": 1234 }
    }
  ],
  "openGraph": {
    "og:title": "Spinach and Feta Turkey Burgers Recipe",
    "og:type": "article",
    "og:image": "https://www.allrecipes.com/...spinach-feta.jpg",
    "og:url": "https://www.allrecipes.com/recipe/158968/spinach-and-feta-turkey-burgers/"
  },
  "scrapeDate": "2026-05-16T12:00:00.000Z"
}
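For a schema coverage report, the @type values have to be pulled out of jsonLd first. Here is a minimal sketch, assuming records shaped like the sample above and assuming jsonLd may be a single object, a list, or an object wrapping entities in @graph; the helper names `iter_entities`, `schema_types` and `coverage` are hypothetical.

```python
from typing import Any, Iterator


def iter_entities(json_ld: Any) -> Iterator[dict]:
    """Yield JSON-LD entities, unwrapping lists and @graph containers."""
    if isinstance(json_ld, list):
        for item in json_ld:
            yield from iter_entities(item)
    elif isinstance(json_ld, dict):
        if isinstance(json_ld.get("@graph"), list):
            yield from iter_entities(json_ld["@graph"])
        else:
            yield json_ld


def schema_types(record: dict) -> set[str]:
    """Collect every @type declared on a record (string or list form)."""
    types: set[str] = set()
    for entity in iter_entities(record.get("jsonLd")):
        value = entity.get("@type")
        if isinstance(value, str):
            types.add(value)
        elif isinstance(value, list):
            types.update(t for t in value if isinstance(t, str))
    return types


def coverage(records: list[dict], wanted: set[str]) -> dict[str, list[str]]:
    """Map each wanted type to the URLs that do NOT declare it."""
    return {
        wanted_type: [r["url"] for r in records if wanted_type not in schema_types(r)]
        for wanted_type in sorted(wanted)
    }
```

Calling `coverage(records, {"Product", "BreadcrumbList"})` on an exported dataset gives, per type, the list of URLs to fix, which loads easily into a dashboard or spreadsheet.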

Sample output: Product page

{
  "url": "https://example.com/product/abc",
  "pageTitle": "ABC Product - Example",
  "metaDescription": "Buy ABC Product with fast shipping.",
  "jsonLd": [
    {
      "@context": "https://schema.org",
      "@type": "Product",
      "name": "ABC Product",
      "brand": { "@type": "Brand", "name": "Example Co" },
      "offers": { "@type": "Offer", "price": "49.99", "priceCurrency": "USD", "availability": "https://schema.org/InStock" }
    }
  ],
  "openGraph": {
    "og:title": "ABC Product - Example",
    "og:type": "product",
    "og:image": "https://example.com/images/abc.jpg",
    "og:url": "https://example.com/product/abc"
  },
  "scrapeDate": "2026-05-16T12:00:00.000Z"
}
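A record like this can be checked for missing properties such as Product.offers.price. The sketch below is illustrative only: the `REQUIRED` table is an assumption made for the example, not Google's or Schema.org's official requirement list, and `get_path` and `missing_properties` are hypothetical helpers.

```python
from typing import Any

# Illustrative rules only; adjust to the rich-result types you care about.
REQUIRED = {"Product": ["name", "offers.price", "offers.priceCurrency"]}


def get_path(entity: Any, path: str) -> Any:
    """Follow a dotted path through nested dicts; in a list, take the first item that has the key."""
    current = entity
    for key in path.split("."):
        if isinstance(current, list):
            current = next(
                (item[key] for item in current if isinstance(item, dict) and key in item),
                None,
            )
        elif isinstance(current, dict):
            current = current.get(key)
        else:
            return None
    return current


def missing_properties(record: dict, required: dict = REQUIRED) -> list[str]:
    """Return 'Type.path' strings for every required property that is absent or empty."""
    json_ld = record.get("jsonLd") or []
    entities = json_ld if isinstance(json_ld, list) else [json_ld]
    problems = []
    for entity in entities:
        if not isinstance(entity, dict):
            continue
        entity_type = entity.get("@type")
        if not isinstance(entity_type, str):
            continue  # list-valued @type is skipped in this sketch
        for path in required.get(entity_type, []):
            if get_path(entity, path) in (None, ""):
                problems.append(f"{entity_type}.{path}")
    return problems
```

Run against the Product sample above, the function reports nothing; drop the price from offers and it reports `Product.offers.price`.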

⚙️ How it works

  1. Input parsing: the Actor reads startUrls and the proxy configuration.
  2. HTTP fetch: each URL is fetched via an HTTP request through Apify Proxy (no browser).
  3. HTML parsing: the response is parsed and the Actor walks the DOM for:
    • <title> tag
    • <meta name="description">
    • <meta property="og:*">
    • <meta name="twitter:*">
    • All <script type="application/ld+json"> blocks
  4. JSON-LD safe-parse: each JSON-LD block is parsed with error guards so one malformed script never crashes the whole page record.
  5. Normalization: Open Graph and Twitter Card properties are collected into clean key-value objects.
  6. Dataset push: one record per URL is pushed to the Apify dataset with a scrapeDate timestamp.
  7. Pagination & batching: for very large URL lists, Apify automatically distributes work and persists state so you can pause/resume.
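Steps 3 to 6 can be approximated with the Python standard library. This is a sketch of the general technique, not the Actor's actual source: the `MetaExtractor` class, the `extract` function and the `twitterCard` key are assumptions made for illustration, and the HTTP fetch through a proxy is left out so the HTML is passed in as a string.

```python
import json
from datetime import datetime, timezone
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects title, meta description, og:*, twitter:* and JSON-LD blocks."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta_description = ""
        self.open_graph: dict[str, str] = {}
        self.twitter: dict[str, str] = {}
        self.json_ld: list = []
        self._capture = None  # "title" or "jsonld" while inside that element
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "title" and not self.title:
            self._capture, self._buffer = "title", []
        elif tag == "script" and attr.get("type", "").strip().lower() == "application/ld+json":
            self._capture, self._buffer = "jsonld", []
        elif tag == "meta":
            key = (attr.get("property") or attr.get("name") or "").strip().lower()
            content = attr.get("content", "")
            if key == "description":
                self.meta_description = content
            elif key.startswith("og:"):
                self.open_graph[key] = content
            elif key.startswith("twitter:"):
                self.twitter[key] = content

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._capture == "title":
            self.title = "".join(self._buffer).strip()
            self._capture = None
        elif tag == "script" and self._capture == "jsonld":
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # safe-parse: a malformed block is skipped, the record survives
            self._capture = None


def extract(url: str, html: str) -> dict:
    """Build one record per URL in the shape of the Output section."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "url": url,
        "pageTitle": parser.title,
        "metaDescription": parser.meta_description,
        "jsonLd": parser.json_ld,
        "openGraph": parser.open_graph,
        "twitterCard": parser.twitter,  # hypothetical field name
        "scrapeDate": now.replace("+00:00", "Z"),
    }
```

Because the parser only sees the raw HTML, schema injected client-side by JavaScript never reaches it, which matches the HTTP-only limitation described in the FAQ.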

⚡ Performance

| Workload | Approx. runtime | Notes |
| --- | --- | --- |
| 100 URLs | < 1 minute | Single proxy session, perfect for spot-checks |
| 1,000 URLs | 3–6 minutes | Default concurrency |
| 10,000 URLs | 30–60 minutes | Enable Apify Proxy, datacenter group is fine |
| 100,000 URLs | 4–8 hours | Recommended: split into 5–10 parallel runs |
| 1,000,000 URLs | ~1 day per run × N runs | Split by sitemap section / batch and run on a schedule |

Speed depends primarily on the target site's response time and whether residential proxies are required. Static sites and most e-commerce platforms tend to finish at the faster end of these ranges.
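Splitting a large list into parallel runs, as recommended for 100,000+ URLs, needs nothing more than a chunking helper. A minimal sketch follows; `split_batches` is a hypothetical name and the batch count is up to you.

```python
def split_batches(urls: list[str], parts: int) -> list[list[str]]:
    """Split urls into up to `parts` contiguous batches whose sizes differ by at most one."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    size, extra = divmod(len(urls), parts)
    batches: list[list[str]] = []
    start = 0
    for index in range(parts):
        # The first `extra` batches take one additional URL.
        end = start + size + (1 if index < extra else 0)
        if end > start:  # skip empty batches when there are fewer URLs than parts
            batches.append(urls[start:end])
        start = end
    return batches


def to_inputs(batches: list[list[str]]) -> list[dict]:
    """Wrap each batch in the Actor input shape shown in the Inputs section."""
    return [
        {
            "startUrls": [{"url": url} for url in batch],
            "proxyConfiguration": {"useApifyProxy": True},
        }
        for batch in batches
    ]
```

Each element of `to_inputs(split_batches(all_urls, 10))` is then one run's input.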


💰 Cost model

This Actor is HTTP-only (no headless browser), which keeps platform usage extremely low. Costs scale linearly with the number of URLs processed. For typical sites, you can audit 10,000 URLs for a small fraction of a dollar in Apify Compute Units, plus standard proxy usage if you enable residential. Run a 100-URL spot check first to estimate your per-URL cost before scaling up.
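The listing also quotes a per-result price of "from $3.50 / 1,000 results". A rough estimate of that fee can be sketched as below, assuming the quoted from-price applies uniformly to every result and leaving out proxy usage and any plan-specific pricing; `estimate_result_fee` is a hypothetical helper.

```python
from decimal import Decimal

# "from $3.50 / 1,000 results" as quoted on the listing; your plan may differ.
PRICE_PER_1000_RESULTS = Decimal("3.50")


def estimate_result_fee(url_count: int, price_per_1000: Decimal = PRICE_PER_1000_RESULTS) -> Decimal:
    """Linear estimate: one result per URL, rounded to whole cents."""
    fee = Decimal(url_count) * price_per_1000 / Decimal(1000)
    return fee.quantize(Decimal("0.01"))


if __name__ == "__main__":
    for count in (100, 10_000, 100_000):
        print(count, estimate_result_fee(count))
```

Under that assumption a 100-URL spot check comes to 0.35 and a 10,000-URL audit to 35.00; treat these as ballpark figures and confirm against your own run's usage page.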


🔄 Schedule for continuous monitoring

Schema drift is real. A theme update, a CMS migration or a careless deploy can wipe out structured data overnight. Use Apify Schedules to:

  • Run a daily audit of your top 1,000 URLs and alert when JSON-LD count drops.
  • Run a weekly full-site sweep off your sitemap to catch missing og:image on new pages.
  • Run post-deploy diffs: schedule a run after every release and compare against the previous dataset to catch regressions in CI.

Pair with Apify Integrations (Slack, Zapier, Make, Webhooks) to ping your team the moment schema coverage drops.
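Comparing two runs comes down to joining records on url. The sketch below assumes both datasets have been exported as lists of records in the shape of the Output section; `json_ld_count` and `find_regressions` are hypothetical helpers, and what counts as a regression (fewer JSON-LD blocks, or an og:* key that disappeared) is an illustrative choice.

```python
def json_ld_count(record: dict) -> int:
    """Number of JSON-LD blocks on a record; a single object counts as one."""
    json_ld = record.get("jsonLd")
    if isinstance(json_ld, list):
        return len(json_ld)
    return 1 if json_ld else 0


def find_regressions(previous: list[dict], current: list[dict]) -> list[dict]:
    """Report URLs whose JSON-LD count dropped or that lost Open Graph keys."""
    before = {record["url"]: record for record in previous}
    findings = []
    for record in current:
        old = before.get(record["url"])
        if old is None:
            continue  # new URL, nothing to compare against
        old_count, new_count = json_ld_count(old), json_ld_count(record)
        lost_og = sorted(
            set(old.get("openGraph") or {}) - set(record.get("openGraph") or {})
        )
        if new_count < old_count or lost_og:
            findings.append({
                "url": record["url"],
                "jsonLdBefore": old_count,
                "jsonLdAfter": new_count,
                "lostOpenGraph": lost_og,
            })
    return findings
```

A non-empty result is a natural trigger for a CI failure or a team notification.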


🛠️ FAQ

Do I need an API key or login? No. The Actor fetches publicly available webpages, so no API key and no login are required.

Is this allowed / legal? The Actor reads only publicly available HTML markup from the pages you provide. Always use it responsibly and respect each site's terms of service and robots.txt.

How many URLs can I process in one run? Tens of thousands comfortably; hundreds of thousands with batching. For million-URL crawls, split into multiple scheduled runs.

Why is the JSON-LD empty for some pages? Three common reasons: (1) the page genuinely doesn't ship structured data, (2) the schema is injected client-side by JavaScript and isn't in the raw HTML, or (3) the page blocked the request. Spot-check with a single URL using residential proxies to confirm.

Does it extract Open Graph and Twitter Card tags too? Yes. Every og:* and twitter:* meta tag is captured alongside JSON-LD and the standard meta tags, so you get a complete picture of your social + structured-data footprint.

Can it handle multiple JSON-LD blocks per page? Yes. Pages often ship Product + BreadcrumbList + Organization + Article in separate <script> tags, and the jsonLd field captures all of them.

Can I track changes over time? Yes. Every record carries a scrapeDate. Schedule daily / weekly runs and diff outputs to detect schema or metadata regressions.

Will it render JavaScript-injected schema? Not in this version. This Actor is HTTP-only by design for speed and cost. For JS-rendered schema, pre-render the page (e.g. via an SSR endpoint) or pair this with a browser-based Actor.

Does it follow links / crawl deeper? No. By design it processes only the URLs you provide. Use the Sitemap to URL Crawler Actor first to build the URL list.

What output formats are supported? Results are stored as structured JSON and can be exported to JSON, CSV, Excel, XML or HTML, or accessed live via the Apify API.

Can I send results to Google Sheets / Slack / a webhook? Yes. Apify Integrations support Google Sheets, Slack, Zapier, Make, webhooks, S3, Snowflake and more directly from the run page.

What about rate limits? Use Apify Proxy (default) and the Actor distributes requests across IPs. For very strict sites, switch to the RESIDENTIAL proxy group.


Combine this Actor with the rest of the SEO / web-data toolkit:

  • Sitemap to URL Crawler: turn any sitemap.xml into a ready-to-paste startUrls list.
  • Website Contact Scraper: pull emails, phones and social links from the same URL list.
  • Bulk URL Status Checker: find 404s, redirects and slow pages across thousands of URLs.

🔑 Keyword cloud

Core: JSON-LD extractor, Schema.org scraper, structured data scraper, meta tag scraper, open graph parser, twitter card extractor, rich results data, schema markup extractor, JSON-LD validator, schema crawler, structured data audit, meta tag checker, rich snippet scraper.

Niche: AEO answer engine optimization, AIO AI overview optimization, generative search optimization, knowledge graph extraction, breadcrumb schema scraper, product schema audit, recipe schema scraper, FAQ schema scraper, article schema scraper, organization schema scraper, local business schema scraper, event schema extractor.

Use case: technical SEO audit, schema validation, structured data QA, social preview QA, OG image audit, sitemap-wide audit, post-deploy regression, schema coverage report, schema diff over time, content automation, AI training data, RAG corpus, knowledge graph enrichment.

Audience: SEO consultants, in-house SEO teams, technical SEO managers, content engineers, growth engineers, e-commerce SEO, news publishers, recipe sites, local SEO agencies, AI/ML engineers, data scientists, web analytics teams.