# Schema Markup Extractor - Structured Data & SEO (`pink_comic/schema-markup-extractor`) Actor

Extract JSON-LD structured data, Open Graph tags, Twitter Card metadata, and all meta tags from any URL. Returns @type values, schema objects, og: properties. Fast pure-HTTP SEO audit tool.

- **URL**: https://apify.com/pink_comic/schema-markup-extractor.md
- **Developed by:** [Ava Torres](https://apify.com/pink_comic) (community)
- **Categories:** Developer tools, AI
- **Stats:** 3 total users, 2 monthly users, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper the higher subscription plan you have.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Schema Markup & SEO Data Extractor

Extract **JSON-LD structured data**, **Open Graph tags**, **Twitter Card metadata**, and **meta tags** from any URL. Built for SEO auditors, developers, and data engineers who need structured page metadata at scale.

**Pricing: $0.002 per URL** (~$2 per 1,000 URLs)

---

### What It Extracts

| Data Type | Examples |
|-----------|---------|
| **JSON-LD** | Product, Article, BreadcrumbList, FAQPage, LocalBusiness, WebSite, Person, Organization |
| **Open Graph** | og:title, og:description, og:image, og:url, og:type, og:site_name |
| **Twitter Card** | twitter:card, twitter:title, twitter:description, twitter:image, twitter:site |
| **Meta Tags** | description, keywords, author, robots, viewport, canonical |
| **Schema Types** | Deduplicated list of all @type values found on the page |

---

### Input

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `urls` | array | required | URLs to extract from |
| `includeJsonLd` | boolean | `true` | Parse JSON-LD script blocks |
| `includeOpenGraph` | boolean | `true` | Parse og: meta properties |
| `includeTwitterCard` | boolean | `true` | Parse twitter: meta tags |
| `includeMetaTags` | boolean | `true` | Parse all `<meta name=...>` tags |
| `concurrency` | integer | `5` | Parallel requests (1-20) |
| `timeout` | integer | `30` | Per-URL timeout in seconds |
| `maxResults` | integer | `50` | Cap on URLs processed |

---

### Output

Each URL produces one dataset record:

```json
{
  "url": "https://example.com/product/widget",
  "jsonLd": [
    {
      "@context": "https://schema.org",
      "@type": "Product",
      "name": "Widget Pro",
      "description": "A professional widget",
      "offers": {
        "@type": "Offer",
        "price": "29.99",
        "priceCurrency": "USD"
      }
    }
  ],
  "openGraph": {
    "title": "Widget Pro - Best Widgets",
    "description": "A professional widget for professionals",
    "image": "https://example.com/widget.jpg",
    "type": "product"
  },
  "twitterCard": {
    "card": "summary_large_image",
    "title": "Widget Pro",
    "image": "https://example.com/widget-twitter.jpg"
  },
  "metaTags": [
    { "name": "description", "content": "A professional widget for professionals" },
    { "name": "keywords", "content": "widget, pro, professional" }
  ],
  "schemaTypes": ["Product", "Offer"]
}
````

If a URL fails to fetch or parse, the record includes an `error` field and empty arrays/objects for the structured data fields.
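
When post-processing a run, it can help to separate failed records before reading the structured data fields. The following is a minimal sketch, assuming that a failed record carries a non-empty `error` value and a successful one has no `error` field (or an empty one). The `split_records` helper and the sample error text are hypothetical and not part of the Actor.

```python
from typing import Any


def split_records(
    records: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return (successful, failed) records.

    Assumption: a record counts as failed when it has a truthy `error` field.
    """
    ok: list[dict[str, Any]] = []
    failed: list[dict[str, Any]] = []
    for record in records:
        (failed if record.get("error") else ok).append(record)
    return ok, failed


# Sample records shaped like the output above; the error text is made up.
records = [
    {
        "url": "https://example.com/a",
        "jsonLd": [{"@type": "Product"}],
        "schemaTypes": ["Product"],
    },
    {
        "url": "https://example.com/b",
        "error": "request timed out",
        "jsonLd": [],
        "openGraph": {},
        "schemaTypes": [],
    },
]

ok, failed = split_records(records)
print(len(ok), "succeeded;", len(failed), "failed")
for record in failed:
    print(record["url"], "->", record["error"])
```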

---

### Use Cases

- **SEO audits** — verify JSON-LD is present and correct across hundreds of pages
- **Competitor research** — see what schema types competitors implement
- **Rich result eligibility** — check if pages qualify for Google rich results (Product, FAQ, Article, etc.)
- **Content aggregation** — extract og:image and og:title for link previews
- **Schema validation** — identify missing or malformed structured data before a site launch
- **Crawl pipelines** — feed output into downstream validators or dashboards
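
As an illustration of the link-preview use case, the sketch below turns one output record into a small preview object. It assumes the record shape shown in the Output section (`openGraph` and `twitterCard` keys without the `og:` / `twitter:` prefixes). The `build_link_preview` helper and its fallback order are hypothetical choices, not behavior of the Actor.

```python
from typing import Any, Optional


def build_link_preview(record: dict[str, Any]) -> dict[str, Optional[str]]:
    """Pick preview fields, preferring Open Graph and falling back to Twitter Card."""
    og = record.get("openGraph") or {}
    tw = record.get("twitterCard") or {}
    return {
        # Use the fetched URL when the page declares no og:url
        "url": og.get("url") or record.get("url"),
        "title": og.get("title") or tw.get("title"),
        "description": og.get("description") or tw.get("description"),
        "image": og.get("image") or tw.get("image"),
    }


# Record trimmed from the sample output above
record = {
    "url": "https://example.com/product/widget",
    "openGraph": {
        "title": "Widget Pro - Best Widgets",
        "image": "https://example.com/widget.jpg",
    },
    "twitterCard": {"title": "Widget Pro"},
}

preview = build_link_preview(record)
print(preview)
```

Fields that neither tag family provides come back as `None`, so a caller can decide whether to skip the preview or fall back to the bare URL.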

---

### Notes

- Uses a pure HTTP client — no browser required, fast and cost-efficient
- Handles `@graph` arrays in JSON-LD (common on WordPress/Yoast sites)
- Handles both `property="twitter:..."` and `name="twitter:..."` meta tag formats
- Follows up to 10 redirects per URL
- Response body capped at 10 MB per page
- No API key required
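
To make the `@graph` note concrete, here is one way a deduplicated list of `@type` values could be derived from JSON-LD blocks. This is a sketch under stated assumptions, not the Actor's actual implementation: the `iter_nodes` and `collect_types` helpers are hypothetical, nested objects are included (consistent with `["Product", "Offer"]` in the sample output), and `@context` is skipped so that term definitions are not mistaken for types.

```python
from typing import Any, Iterator


def iter_nodes(value: Any) -> Iterator[dict]:
    """Yield every JSON object in a JSON-LD value, including @graph members
    and nested objects. The @context entry is skipped on purpose."""
    if isinstance(value, list):
        for item in value:
            yield from iter_nodes(item)
    elif isinstance(value, dict):
        yield value
        for key, child in value.items():
            if key != "@context":
                yield from iter_nodes(child)


def collect_types(json_ld_blocks: list) -> list[str]:
    """Return @type values in order of first appearance, without duplicates."""
    seen: dict[str, None] = {}  # dict keeps insertion order
    for node in iter_nodes(json_ld_blocks):
        node_type = node.get("@type")
        # @type may be a single string or a list of strings
        candidates = node_type if isinstance(node_type, list) else [node_type]
        for candidate in candidates:
            if isinstance(candidate, str):
                seen.setdefault(candidate, None)
    return list(seen)


# A Yoast-style block where all entities live under @graph (sample data)
blocks = [
    {
        "@context": "https://schema.org",
        "@graph": [
            {"@type": "WebSite", "name": "Example"},
            {
                "@type": ["Article", "NewsArticle"],
                "author": {"@type": "Person", "name": "A. Writer"},
            },
            {"@type": "WebSite", "name": "Duplicate type"},
        ],
    }
]

print(collect_types(blocks))
```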

# Actor input Schema

## `urls` (type: `array`):

List of URLs to extract structured data from. Accepts any publicly accessible HTTP/HTTPS URL.

## `includeJsonLd` (type: `boolean`):

Extract JSON-LD structured data blocks (application/ld+json scripts). Includes all @type objects and @graph arrays.

## `includeOpenGraph` (type: `boolean`):

Extract Open Graph metadata (og:title, og:description, og:image, og:url, og:type, etc.).

## `includeTwitterCard` (type: `boolean`):

Extract Twitter Card metadata (twitter:card, twitter:title, twitter:description, twitter:image, etc.).

## `includeMetaTags` (type: `boolean`):

Extract all `<meta name=...>` tags (description, keywords, author, robots, viewport, etc.).

## `concurrency` (type: `integer`):

Number of URLs to fetch in parallel. Keep low (3-5) for polite crawling.

## `timeout` (type: `integer`):

Maximum seconds to wait for each URL to respond.

## `maxResults` (type: `integer`):

Maximum number of URLs to process. Set to 0 to process all URLs in the list.

## Actor input object example

```json
{
  "urls": [
    "https://schema.org/",
    "https://developers.google.com/search/docs/appearance/structured-data/intro-structured-data",
    "https://www.nytimes.com/",
    "https://www.imdb.com/title/tt0111161/",
    "https://en.wikipedia.org/wiki/Structured_data"
  ],
  "includeJsonLd": true,
  "includeOpenGraph": true,
  "includeTwitterCard": true,
  "includeMetaTags": true,
  "concurrency": 5,
  "timeout": 30,
  "maxResults": 50
}
```
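
Checking an input object on the client side before starting a run can surface mistakes early. The sketch below is a hypothetical `build_input` helper, not part of any Apify client. It uses the defaults from the input table and the `minimum`/`maximum` values listed in the OpenAPI specification in this document. Rejecting an empty `urls` list is an assumption made here.

```python
from typing import Iterable

# Bounds copied from the OpenAPI specification in this document
LIMITS = {
    "concurrency": (1, 20),
    "timeout": (5, 120),
    "maxResults": (1, 10000),
}


def build_input(
    urls: Iterable[str],
    concurrency: int = 5,
    timeout: int = 30,
    max_results: int = 50,
) -> dict:
    """Build an Actor input dict, raising ValueError on out-of-range values."""
    url_list = list(urls)
    if not url_list:
        # Assumption: an empty list is treated as a caller error
        raise ValueError("urls is required and must not be empty")
    for url in url_list:
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an HTTP/HTTPS URL: {url!r}")

    actor_input = {
        "urls": url_list,
        "concurrency": concurrency,
        "timeout": timeout,
        "maxResults": max_results,
    }
    for field, (low, high) in LIMITS.items():
        value = actor_input[field]
        if not low <= value <= high:
            raise ValueError(f"{field} must be between {low} and {high}, got {value}")
    return actor_input


print(build_input(["https://schema.org/"], concurrency=3))
```

The returned dict can be passed as the run input in the client examples in the API section.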

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        "https://schema.org/",
        "https://developers.google.com/search/docs/appearance/structured-data/intro-structured-data",
        "https://www.nytimes.com/",
        "https://www.imdb.com/title/tt0111161/",
        "https://en.wikipedia.org/wiki/Structured_data"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("pink_comic/schema-markup-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "urls": [
        "https://schema.org/",
        "https://developers.google.com/search/docs/appearance/structured-data/intro-structured-data",
        "https://www.nytimes.com/",
        "https://www.imdb.com/title/tt0111161/",
        "https://en.wikipedia.org/wiki/Structured_data",
    ] }

# Run the Actor and wait for it to finish
run = client.actor("pink_comic/schema-markup-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "urls": [
    "https://schema.org/",
    "https://developers.google.com/search/docs/appearance/structured-data/intro-structured-data",
    "https://www.nytimes.com/",
    "https://www.imdb.com/title/tt0111161/",
    "https://en.wikipedia.org/wiki/Structured_data"
  ]
}' |
apify call pink_comic/schema-markup-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=pink_comic/schema-markup-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Schema Markup Extractor - Structured Data & SEO",
        "description": "Extract JSON-LD structured data, Open Graph tags, Twitter Card metadata, and all meta tags from any URL. Returns @type values, schema objects, og: properties. Fast pure-HTTP SEO audit tool.",
        "version": "1.0",
        "x-build-id": "VqBygqML8rarkevxw"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/pink_comic~schema-markup-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-pink_comic-schema-markup-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/pink_comic~schema-markup-extractor/runs": {
            "post": {
                "operationId": "runs-sync-pink_comic-schema-markup-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/pink_comic~schema-markup-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-pink_comic-schema-markup-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "urls"
                ],
                "properties": {
                    "urls": {
                        "title": "URLs",
                        "type": "array",
                        "description": "List of URLs to extract structured data from. Accepts any publicly accessible HTTP/HTTPS URL.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "includeJsonLd": {
                        "title": "Include JSON-LD",
                        "type": "boolean",
                        "description": "Extract JSON-LD structured data blocks (application/ld+json scripts). Includes all @type objects and @graph arrays.",
                        "default": true
                    },
                    "includeOpenGraph": {
                        "title": "Include Open Graph",
                        "type": "boolean",
                        "description": "Extract Open Graph metadata (og:title, og:description, og:image, og:url, og:type, etc.).",
                        "default": true
                    },
                    "includeTwitterCard": {
                        "title": "Include Twitter Card",
                        "type": "boolean",
                        "description": "Extract Twitter Card metadata (twitter:card, twitter:title, twitter:description, twitter:image, etc.).",
                        "default": true
                    },
                    "includeMetaTags": {
                        "title": "Include Meta Tags",
                        "type": "boolean",
                        "description": "Extract all <meta name=...> tags (description, keywords, author, robots, viewport, etc.).",
                        "default": true
                    },
                    "concurrency": {
                        "title": "Concurrency",
                        "minimum": 1,
                        "maximum": 20,
                        "type": "integer",
                        "description": "Number of URLs to fetch in parallel. Keep low (3-5) for polite crawling.",
                        "default": 5
                    },
                    "timeout": {
                        "title": "Request Timeout (seconds)",
                        "minimum": 5,
                        "maximum": 120,
                        "type": "integer",
                        "description": "Maximum seconds to wait for each URL to respond.",
                        "default": 30
                    },
                    "maxResults": {
                        "title": "Max Results",
                        "minimum": 1,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Maximum number of URLs to process. Set to 0 to process all URLs in the list.",
                        "default": 50
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
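
For projects without an Apify client, the `run-sync-get-dataset-items` path from the specification above can be called with the Python standard library alone. This is a minimal sketch: the `build_request` and `run_actor` helpers are hypothetical, the 300-second client timeout is an arbitrary choice, and the response body is assumed to be JSON. Only the request is built here; `run_actor` is defined but not called, since it needs a real token.

```python
import json
from typing import Any
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Server URL and path taken from the OpenAPI specification above
API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/pink_comic~schema-markup-extractor/run-sync-get-dataset-items"


def build_request(token: str, actor_input: dict) -> Request:
    """Create the POST request; the token goes in the query string per the spec."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(actor_input).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(token: str, actor_input: dict, timeout_secs: float = 300.0) -> Any:
    """Send the request and decode the response (assumed to be JSON)."""
    with urlopen(build_request(token, actor_input), timeout=timeout_secs) as response:
        return json.load(response)


# Inspect the request without sending it
request = build_request("<YOUR_API_TOKEN>", {"urls": ["https://schema.org/"]})
print(request.get_method(), request.full_url)
```

Because the synchronous endpoint waits for the run to finish, long URL lists may be better served by the `/runs` path, followed by reading the dataset separately.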
