URL Metadata & OpenGraph Extractor

Extract clean metadata from any list of URLs: title, description, preview image, favicon, site name, and more. Reconciles OpenGraph, Twitter Card, and standard meta tags into one tidy record per URL.

Pricing

from $1.00 / 1,000 results

Developer

Nicolas van Arkens

Last modified

11 days ago

.dockerignore

.git
.venv
__pycache__
*.pyc
storage
test_*.py

requirements.txt

apify < 3.0
pydantic >= 2.8, < 2.12
browserforge == 1.2.3
httpx

.actor/actor.json

{
    "actorSpecification": 1,
    "name": "url-metadata",
    "title": "URL Metadata & OpenGraph Extractor",
    "description": "Extract clean metadata from any list of URLs: title, description, preview image, favicon, site name, and more. Reconciles OpenGraph, Twitter Card, and standard meta tags into one tidy record per URL.",
    "version": "0.1",
    "buildTag": "latest",
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile",
    "storages": { "dataset": "./dataset_schema.json" }
}

.actor/dataset_schema.json

{
    "actorSpecification": 1,
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": { "fields": ["url","title","description","image","siteName","type","favicon"] },
            "display": {
                "component": "table",
                "properties": {
                    "url": { "label": "URL", "format": "link" },
                    "title": { "label": "Title", "format": "text" },
                    "description": { "label": "Description", "format": "text" },
                    "image": { "label": "Preview image", "format": "link" },
                    "siteName": { "label": "Site", "format": "text" },
                    "type": { "label": "Type", "format": "text" },
                    "favicon": { "label": "Favicon", "format": "link" }
                }
            }
        }
    }
}

.actor/Dockerfile

FROM apify/actor-python:3.12
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . ./
CMD ["python", "-m", "src"]

.actor/input_schema.json

{
    "title": "URL Metadata & OpenGraph Extractor",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "urls": {
            "title": "URLs",
            "type": "array",
            "description": "List of page URLs to extract metadata from. A scheme is added automatically if you omit it (e.g. 'example.com' becomes 'https://example.com').",
            "editor": "stringList",
            "prefill": ["https://www.apify.com", "https://github.com", "https://news.ycombinator.com"]
        }
    },
    "required": []
}

.actor/README.md

# URL Metadata & OpenGraph Extractor 🔗

Give it any list of URLs and get back clean, unified metadata for each page: **title, description, preview image, favicon, site name, canonical URL, author, and published date**. It reconciles **OpenGraph**, **Twitter Card**, and standard `<meta>` tags into one tidy record per URL, so you don't have to care which tags a given site happens to use.

Perfect for building link previews, social-media tools, bookmarking apps, content curation, and SEO audits.

## Why use it

- 🔀 **Reconciled metadata**: OpenGraph → Twitter Card → standard meta → `<title>`, in priority order
- 🖼️ **Preview image & favicon**: resolved to absolute URLs (with a sensible favicon fallback)
- 🔗 **Canonical URL**: and the final URL after redirects
- 🏷️ **Rich fields**: site name, type, author, published date, theme color, Twitter card type, keywords
- 📦 **Batch**: process many URLs in a single run
- 🪶 **Fast & light**: metadata-only extraction, no heavy rendering
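
The priority order above can be sketched in a few lines of Python. This is a minimal illustration, not the Actor's exact code: `first_non_empty` and `reconcile` are hypothetical names, and the three input dictionaries are assumed to hold already-parsed `og:*`, `twitter:*`, and plain `<meta name=...>` values.

```python
from urllib.parse import urljoin


def first_non_empty(*values):
    """Return the first truthy value, or None if every candidate is empty."""
    for value in values:
        if value:
            return value
    return None


def reconcile(og, twitter, meta, html_title, base_url):
    """Merge already-parsed tag dictionaries into one record (illustrative only)."""
    image = first_non_empty(og.get("image"), twitter.get("image"))
    return {
        # OpenGraph wins, then Twitter Card, then standard meta, then <title>.
        "title": first_non_empty(
            og.get("title"), twitter.get("title"), meta.get("title"), html_title
        ),
        "description": first_non_empty(
            og.get("description"), twitter.get("description"), meta.get("description")
        ),
        # Relative image paths are resolved against the page URL.
        "image": urljoin(base_url, image) if image else None,
    }


record = reconcile(
    og={"image": "/img/cover.png"},
    twitter={"title": "Twitter title"},
    meta={"description": "Plain meta description"},
    html_title="HTML title",
    base_url="https://example.com/post/1",
)
print(record)
```

Here the page declares no `og:title`, so the Twitter Card title is used, and the relative image path becomes `https://example.com/img/cover.png`.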

## Use cases

- **Link previews**: build rich cards like those shown in Slack, Discord, and iMessage
- **Social media tools**: pull share metadata for scheduling and preview
- **Bookmarking & read-later apps**: auto-title and illustrate saved links
- **Content curation & newsletters**: enrich link lists with titles and images
- **SEO audits**: check OG/Twitter tags across a set of pages at once

## Input

| Field | Description |
|-------|-------------|
| **URLs** | List of page URLs (scheme added automatically if omitted). |
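
As a rough sketch of what "scheme added automatically" means in practice: blank entries are dropped, surrounding whitespace is trimmed, and `https://` is prefixed when no `http://` or `https://` scheme is present. The helper name `normalize_urls` is hypothetical and used here only for illustration.

```python
import re

# Case-insensitive check for an existing http:// or https:// scheme.
_HAS_SCHEME = re.compile(r"^https?://", re.IGNORECASE)


def normalize_urls(raw_urls):
    """Trim whitespace, drop blanks, and prefix https:// where no scheme is given."""
    cleaned = []
    for raw in raw_urls:
        url = (raw or "").strip()
        if not url:
            continue
        if not _HAS_SCHEME.match(url):
            url = "https://" + url
        cleaned.append(url)
    return cleaned


normalized = normalize_urls(["example.com", " HTTP://example.org/a ", "", "https://apify.com"])
print(normalized)
# ['https://example.com', 'HTTP://example.org/a', 'https://apify.com']
```

An existing scheme is left untouched, including its original casing.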

## Output

```json
{
  "url": "https://example.com/page",
  "finalUrl": "https://example.com/page",
  "success": true,
  "statusCode": 200,
  "title": "OG Title Wins",
  "description": "The OpenGraph description.",
  "image": "https://cdn.example.com/img.jpg",
  "siteName": "Example Site",
  "type": "article",
  "canonicalUrl": "https://example.com/canonical-page",
  "favicon": "https://example.com/favicon.png",
  "author": "Jane Author",
  "publishedDate": "2026-05-01T10:00:00Z",
  "twitterCard": "summary_large_image"
}
```

Each record also carries `themeColor` and `keywords`, which are `null` when the page does not declare them. URLs that cannot be fetched, or that do not return HTML, produce a short record with `url`, `finalUrl`, `success: false`, `statusCode`, and `error`.

Export to JSON, CSV, or Excel, or pull via the Apify API.
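
Once the records are exported, a consumer might turn them into link-preview cards along these lines. This is only a sketch under assumptions: `to_link_preview` and the keys of the dictionary it returns (`href`, `heading`, `summary`, `thumbnail`, `source`) are hypothetical, while the input field names follow the output record shown above.

```python
def to_link_preview(record):
    """Turn one output record into a minimal link-preview dict, or None on failure."""
    if not record.get("success"):
        return None
    return {
        # Prefer the canonical URL, then the post-redirect URL, then the input URL.
        "href": record.get("canonicalUrl") or record.get("finalUrl") or record["url"],
        "heading": record.get("title") or record["url"],
        "summary": record.get("description") or "",
        # Fall back to the favicon when the page declares no preview image.
        "thumbnail": record.get("image") or record.get("favicon"),
        "source": record.get("siteName"),
    }


records = [
    {
        "url": "https://example.com/page",
        "finalUrl": "https://example.com/page",
        "success": True,
        "statusCode": 200,
        "title": "OG Title Wins",
        "description": "The OpenGraph description.",
        "image": None,
        "siteName": "Example Site",
        "canonicalUrl": "https://example.com/canonical-page",
        "favicon": "https://example.com/favicon.png",
    },
    {
        "url": "https://example.org/missing",
        "finalUrl": "https://example.org/missing",
        "success": False,
        "statusCode": 404,
        "error": "Could not fetch or not an HTML page",
    },
]

# Failed records are skipped rather than rendered as empty cards.
previews = [p for p in map(to_link_preview, records) if p is not None]
print(previews)
```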

## Notes

- Extracts metadata from publicly accessible pages only. Some sites block automated requests or require JavaScript rendering, in which case fewer fields may be available.
- Independent tool; respects each site's access policy. Please use responsibly.

src/main.py

"""URL Metadata & OpenGraph Extractor: Apify Actor.

Give it a list of URLs and it fetches each page and extracts a clean, unified
set of metadata: title, description, preview image, site name, favicon,
canonical URL, author, published date, and type, reconciling OpenGraph,
Twitter Card, standard <meta> tags, and HTML fallbacks in priority order.

The value is the reconciliation: real pages scatter this info across og:*,
twitter:*, name="description", <title>, etc. This returns one tidy record per
URL regardless of which tags a given site happens to use.

Parsing is done with the standard library + regex (no heavyweight HTML deps),
which keeps the actor light and fast for metadata-only extraction.
"""

from __future__ import annotations

import asyncio
import html
import re
from urllib.parse import urljoin, urlparse

import httpx
from apify import Actor

# --- meta extraction helpers -------------------------------------------------

# Match <meta ...> tags (any attribute order). We capture the whole tag and
# then pull property/name/content out of it individually for robustness.
_META_TAG = re.compile(r"<meta\b[^>]*>", re.I)
_TITLE_TAG = re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)
_LINK_TAG = re.compile(r"<link\b[^>]*>", re.I)
_ATTR = re.compile(r"""(\w[\w:-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""")


def _attrs(tag: str) -> dict[str, str]:
    """Parse a tag's attributes into a dict with lowercased keys."""
    out: dict[str, str] = {}
    for m in _ATTR.finditer(tag):
        key = m.group(1).lower()
        # Exactly one value group matches: "double-quoted", 'single-quoted', or bare.
        val = next((g for g in m.groups()[1:] if g is not None), "")
        out[key] = html.unescape(val.strip())
    return out


def parse_metadata(page_html: str, base_url: str) -> dict:
    """Extract and reconcile metadata from a page's HTML."""
    og: dict[str, str] = {}
    tw: dict[str, str] = {}
    meta: dict[str, str] = {}

    for tag in _META_TAG.findall(page_html):
        a = _attrs(tag)
        content = a.get("content")
        if content is None:
            continue
        prop = a.get("property", "").lower()
        name = a.get("name", "").lower()
        if prop.startswith("og:"):
            og[prop[3:]] = content
        elif prop.startswith(("article:", "profile:")):
            og[prop] = content  # keep namespaced keys too, e.g. "article:author"
        elif name.startswith("twitter:"):
            tw[name[8:]] = content
        elif name:
            meta[name] = content

    # <title> fallback
    title_tag = _TITLE_TAG.search(page_html)
    html_title = html.unescape(title_tag.group(1).strip()) if title_tag else None

    # canonical + favicon from <link>; the first declaration of each wins
    canonical = None
    favicon = None
    for tag in _LINK_TAG.findall(page_html):
        a = _attrs(tag)
        rel = a.get("rel", "").lower()
        href = a.get("href")
        if not href:
            continue
        if "canonical" in rel and canonical is None:
            canonical = urljoin(base_url, href)
        if "icon" in rel and favicon is None:  # matches "icon", "shortcut icon", "apple-touch-icon"
            favicon = urljoin(base_url, href)

    # Default favicon guess if none declared.
    if favicon is None:
        parsed = urlparse(base_url)
        if parsed.scheme and parsed.netloc:
            favicon = f"{parsed.scheme}://{parsed.netloc}/favicon.ico"

    def pick(*vals):
        """Return the first truthy candidate, or None."""
        for v in vals:
            if v:
                return v
        return None

    image = pick(og.get("image"), tw.get("image"), tw.get("image:src"))
    if image:
        image = urljoin(base_url, image)

    return {
        "title": pick(og.get("title"), tw.get("title"), meta.get("title"), html_title),
        "description": pick(og.get("description"), tw.get("description"), meta.get("description")),
        "image": image,
        "siteName": pick(og.get("site_name"), tw.get("site")),
        "type": og.get("type"),
        "canonicalUrl": canonical,
        "favicon": favicon,
        "author": pick(meta.get("author"), og.get("article:author")),
        "publishedDate": pick(og.get("article:published_time"), meta.get("date")),
        "themeColor": meta.get("theme-color"),
        "twitterCard": tw.get("card"),
        "keywords": meta.get("keywords"),
    }


async def fetch_page(client: httpx.AsyncClient, url: str, log) -> tuple[str | None, int | None, str | None]:
    """Fetch a page with up to three attempts. Returns (html, status_code, final_url)."""
    for attempt in range(1, 4):
        try:
            resp = await client.get(url)
            status = resp.status_code
            if status >= 400:
                # HTTP error responses are reported as-is, not retried.
                return None, status, str(resp.url)
            ctype = resp.headers.get("content-type", "")
            if ctype and "html" not in ctype and "xml" not in ctype:
                # Not an HTML page (e.g. a PDF or image), so there is no metadata to parse.
                return None, status, str(resp.url)
            # The whole body has already been downloaded at this point; keep only
            # the first 500,000 characters, since metadata lives in <head>.
            return resp.text[:500000], status, str(resp.url)
        except httpx.HTTPError as exc:
            log.warning(f"Fetch attempt {attempt} failed for {url}: {exc}")
            if attempt < 3:
                # Linear backoff: 2 s after the first failure, 4 s after the second.
                await asyncio.sleep(attempt * 2)
    return None, None, url


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        urls = [u.strip() for u in (actor_input.get("urls") or []) if u and u.strip()]

        if not urls:
            # Nothing to do; an empty run leaves the dataset empty.
            Actor.log.warning("No URLs provided.")
            return

        # Normalize: add https:// if no scheme.
        normalized = []
        for u in urls:
            if not re.match(r"^https?://", u, re.I):
                u = "https://" + u
            normalized.append(u)

        Actor.log.info(f"Extracting metadata from {len(normalized)} URL(s).")

        async with httpx.AsyncClient(
            timeout=30.0,
            follow_redirects=True,
            headers={
                "User-Agent": "Mozilla/5.0 (compatible; scrapeworks-url-metadata/0.1; +https://apify.com/scrapeworks)",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            },
        ) as client:
            # URLs are processed one at a time, in input order.
            for url in normalized:
                page_html, status, final_url = await fetch_page(client, url, Actor.log)

                if page_html is None:
                    await Actor.push_data([{
                        "url": url,
                        "finalUrl": final_url,
                        "success": False,
                        "statusCode": status,
                        "error": "Could not fetch or not an HTML page",
                    }])
                    Actor.log.info(f"  {url}: failed (status {status})")
                    continue

                md = parse_metadata(page_html, final_url or url)
                record = {"url": url, "finalUrl": final_url, "success": True,
                          "statusCode": status, **md}
                await Actor.push_data([record])
                Actor.log.info(f"  {url}: '{md.get('title') or '(no title)'}'")

        Actor.log.info("Done.")

src/__main__.py

import asyncio
from .main import main

asyncio.run(main())
