# URL Metadata & OpenGraph Extractor (`thoob/url-metadata-extractor`) Actor

Reads a page's own public head tags (OpenGraph, Twitter card, title, description, canonical, favicon, and language) for clean link previews and RAG ingestion. Respects robots.txt by default. Billed only per URL successfully read.

- **URL**: https://apify.com/thoob/url-metadata-extractor.md
- **Developed by:** [Pono Data](https://apify.com/thoob) (community)
- **Categories:** SEO tools
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

$1.00 / 1,000 URL reads

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
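
As a rough illustration of the arithmetic, the sketch below estimates the charge for a run from the listed price of $1.00 per 1,000 URL reads. The function name `estimate_cost_usd` is hypothetical, and the sketch assumes URL reads are the only billed event, as the pricing line suggests.

```python
from decimal import Decimal

# Listed price from the Pricing section: $1.00 per 1,000 URL reads.
PRICE_PER_1000_READS_USD = Decimal("1.00")


def estimate_cost_usd(successful_reads: int) -> Decimal:
    """Estimate the charge for a number of successfully read URLs.

    Hypothetical helper: assumes only successful reads are billed.
    """
    if successful_reads < 0:
        raise ValueError("successful_reads must be non-negative")
    return PRICE_PER_1000_READS_USD * successful_reads / 1000


if __name__ == "__main__":
    # 250 successful reads at $1.00 per 1,000 reads.
    print(estimate_cost_usd(250))
```

`Decimal` is used instead of `float` so that small per-URL amounts add up without binary rounding noise.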

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## URL Metadata & OpenGraph Extractor

Give it a list of page URLs and get back the metadata each page publishes for
previews: OpenGraph (`og:*`), Twitter card (`twitter:*`), `<title>`, meta
description, canonical link, declared favicon, and page language. Clean,
flat rows, built for link previews and for feeding RAG pipelines with consistent
per-link metadata.

### Input

- **URLs**: one per line.
- **Respect robots.txt**: when on (default), the host's robots.txt is checked and
  disallowed URLs are skipped.
- **Max delivered URLs**: cap on billed rows (0 = no cap).
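
To make the **Respect robots.txt** option concrete, here is a minimal sketch of such a check using Python's standard `urllib.robotparser`. It is not the Actor's actual implementation; the helper `is_allowed`, the sample rules, and the `example-bot` agent name are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper: parses robots.txt content that was already downloaded.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Invented robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for candidate in ("https://example.com/public", "https://example.com/private/page"):
        verdict = "fetch" if is_allowed(SAMPLE_ROBOTS, "example-bot", candidate) else "skip"
        print(candidate, "->", verdict)
```

In this sketch a disallowed URL is simply skipped, which mirrors the behavior described for the option when it is on.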

### Output

One row per URL: `url`, `finalUrl`, `httpStatus`, `title`, `description`,
`canonical`, the `og*` fields, the `twitter*` fields, `favicon`, `lang`, plus
provenance (`sourceUrl`, `retrievedAt`, `confidence`, `dataSource`).

### How it works

Sites publish these head tags specifically so other tools can render previews.
The Actor fetches each page politely with a declared User-Agent, reads only the
head, and copies the tags verbatim. Relative `og:image`, `canonical`, and favicon
URLs are resolved to absolute against the page URL; nothing else is transformed,
and a tag the page does not declare is null, never invented. A URL that robots
disallows, or that fails to fetch, is written to the free `rejected` dataset and
is not billed. A site owner can ask us to skip their domain at
https://ponodata.com/opt-out ; opted-out hosts are skipped and never charged.
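
The relative-to-absolute resolution described above can be sketched with the standard library's `urljoin`. This is an illustration under assumptions, not the Actor's code: the function `resolve_tag_urls` and the dictionary layout keyed by `og:image`, `canonical`, and `favicon` are hypothetical.

```python
from typing import Dict, Optional
from urllib.parse import urljoin

# Tags whose values are URLs and may be written relative to the page.
URL_KEYS = ("og:image", "canonical", "favicon")


def resolve_tag_urls(page_url: str, tags: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Resolve relative URL-valued tags against page_url.

    Tags the page did not declare stay None, and every other value is
    copied unchanged.
    """
    resolved = dict(tags)
    for key in URL_KEYS:
        value = resolved.get(key)
        if value is not None:
            resolved[key] = urljoin(page_url, value)
    return resolved


if __name__ == "__main__":
    # Invented tag values, for illustration only.
    tags = {
        "og:image": "/img/cover.png",
        "canonical": "https://example.com/blog/post",
        "favicon": None,
        "og:title": "A post",
    }
    print(resolve_tag_urls("https://example.com/blog/post?ref=feed", tags))
```

Already-absolute values pass through `urljoin` unchanged, so the same call is safe for every URL-valued tag.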

### Billing

Pay per URL successfully read. Robots-disallowed and failed URLs cost nothing.

### Sample output

A real run reading each page's own public head tags (one row per URL):

| URL | title | description | OG type |
| --- | --- | --- | --- |
| https://www.cloudflare.com | Cloudflare: Build for the… | Welcome to Cloudflare - Powering … | website |
| https://stripe.com | Stripe / Financial Infras… | Stripe is a financial services pl… | website |
| https://www.python.org | Welcome to Python.org | The official home of the Python P… | website |
| https://kubernetes.io | Kubernetes | Kubernetes, also known as K8s, is… | website |

Every row carries a `sourceUrl` (the page read), for example `https://www.cloudflare.com`. Pages that return no metadata are routed to the free `rejected` dataset.

### See also

More clean, pay-only-for-results data tools from Pono Data:

- [Sitemap Extractor](https://apify.com/thoob/sitemap-extractor) - every URL from any sitemap
- [Bulk DNS Lookup](https://apify.com/thoob/dns-bulk-lookup) - DNS records plus SPF, DMARC, and CAA
- [Domain WHOIS via RDAP](https://apify.com/thoob/rdap-domain-lookup) - registration data, structured from RDAP

Full catalog: https://apify.com/thoob

# Actor input Schema

## `urls` (type: `array`):

Page URLs to read metadata from, one per line. The actor fetches each page's HTML head and extracts the tags the site publishes for previews.

## `respectRobots` (type: `boolean`):

Check the host's robots.txt before fetching and skip any URL it disallows for our agent. Recommended on.

## `maxUrls` (type: `integer`):

Cap on delivered, billed rows. 0 means no cap. The platform spend cap is honored regardless.

## Actor input object example

```json
{
  "urls": [
    "https://github.com",
    "https://www.bbc.com/news"
  ],
  "respectRobots": true,
  "maxUrls": 0
}
````
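
Before starting a run, it can help to check an input object on the client side. The sketch below builds one using the constraints published in the OpenAPI specification further down (`urls` required with at least one string, `maxUrls` between 0 and 200000, `respectRobots` defaulting to true). The helper name `build_input` is hypothetical and is not part of any Apify client.

```python
from typing import Any, Dict, List

MAX_URLS_LIMIT = 200_000  # "maximum" for maxUrls in the published schema


def build_input(urls: List[str], respect_robots: bool = True, max_urls: int = 0) -> Dict[str, Any]:
    """Build an Actor input dict, rejecting values the schema would not accept."""
    if not urls:
        raise ValueError("urls must contain at least one URL")
    if not all(isinstance(u, str) for u in urls):
        raise TypeError("every entry in urls must be a string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_urls, bool) or not isinstance(max_urls, int):
        raise TypeError("max_urls must be an integer")
    if not 0 <= max_urls <= MAX_URLS_LIMIT:
        raise ValueError(f"max_urls must be between 0 and {MAX_URLS_LIMIT}")
    return {
        "urls": list(urls),
        "respectRobots": bool(respect_robots),
        "maxUrls": max_urls,
    }


if __name__ == "__main__":
    print(build_input(["https://github.com", "https://www.bbc.com/news"]))
```

The returned dictionary has the same shape as the example object above and can be passed as `run_input` to the Python client.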

# Actor output Schema

## `pages` (type: `string`):

One row per URL read.
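
As one possible way to consume these rows, the sketch below turns a row into a small link-preview record using field names listed in the README's Output section (`url`, `finalUrl`, `canonical`, `title`, `description`, `favicon`). The function `to_link_preview`, its fallback order, and the sample row are assumptions for illustration, not behavior of the Actor.

```python
from typing import Any, Dict, Mapping, Optional


def to_link_preview(row: Mapping[str, Any]) -> Dict[str, Optional[str]]:
    """Map one output row to a minimal link-preview record.

    Assumed fallback order for the link: canonical, then finalUrl, then url.
    Fields the page did not declare are expected to be None and stay None.
    """
    link = row.get("canonical") or row.get("finalUrl") or row.get("url")
    return {
        "link": link,
        "title": row.get("title"),
        "description": row.get("description"),
        "icon": row.get("favicon"),
    }


if __name__ == "__main__":
    # Invented row, for illustration only.
    sample_row = {
        "url": "https://example.com/a?utm=1",
        "finalUrl": "https://example.com/a",
        "canonical": None,
        "title": "Example page",
        "description": None,
        "favicon": "https://example.com/favicon.ico",
    }
    print(to_link_preview(sample_row))
```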

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        "https://github.com",
        "https://www.bbc.com/news"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("thoob/url-metadata-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "urls": [
        "https://github.com",
        "https://www.bbc.com/news",
    ],
}

# Run the Actor and wait for it to finish
run = client.actor("thoob/url-metadata-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "urls": [
    "https://github.com",
    "https://www.bbc.com/news"
  ]
}' |
apify call thoob/url-metadata-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=thoob/url-metadata-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "URL Metadata & OpenGraph Extractor",
        "description": "Reads a page's own public head tags, OpenGraph, Twitter card, title, description, canonical, favicon, and language, for clean link previews and RAG ingestion. Respects robots.txt by default. Billed only per URL successfully read.",
        "version": "0.0",
        "x-build-id": "sVZSNFpAhcGCpr2OM"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/thoob~url-metadata-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-thoob-url-metadata-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/thoob~url-metadata-extractor/runs": {
            "post": {
                "operationId": "runs-sync-thoob-url-metadata-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/thoob~url-metadata-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-thoob-url-metadata-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "urls"
                ],
                "properties": {
                    "urls": {
                        "title": "URLs",
                        "minItems": 1,
                        "type": "array",
                        "description": "Page URLs to read metadata from, one per line. The actor fetches each page's HTML head and extracts the tags the site publishes for previews.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "respectRobots": {
                        "title": "Respect robots.txt",
                        "type": "boolean",
                        "description": "Check the host's robots.txt before fetching and skip any URL it disallows for our agent. Recommended on.",
                        "default": true
                    },
                    "maxUrls": {
                        "title": "Max delivered URLs",
                        "minimum": 0,
                        "maximum": 200000,
                        "type": "integer",
                        "description": "Cap on delivered, billed rows. 0 means no cap. The platform spend cap is honored regardless.",
                        "default": 0
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
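
For projects that do not use a client library, the `run-sync-get-dataset-items` path from the specification above can be called with the Python standard library alone. The following is a minimal sketch: the helper names `build_request` and `run_and_fetch` are hypothetical, and the response is assumed to be a JSON array of dataset items. The token is passed as the `token` query parameter, as the specification declares.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "https://api.apify.com/v2"
RUN_SYNC_PATH = "/acts/thoob~url-metadata-extractor/run-sync-get-dataset-items"


def build_request(token: str, actor_input: Dict[str, Any]) -> Request:
    """Build the POST request without sending it."""
    url = f"{API_BASE}{RUN_SYNC_PATH}?{urlencode({'token': token})}"
    body = json.dumps(actor_input).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_and_fetch(token: str, actor_input: Dict[str, Any], timeout: float = 300.0) -> List[Dict[str, Any]]:
    """Send the request and decode the response body.

    Assumption: the body is a JSON array of dataset items.
    """
    with urlopen(build_request(token, actor_input), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    request = build_request("<YOUR_API_TOKEN>", {"urls": ["https://github.com"]})
    print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the URL and payload easy to inspect or log before any billed run starts.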
