Pricing

from $1.00 / 1,000 results

Site Researcher

Extract structured intelligence from any website: title, meta description, Open Graph tags, JSON-LD structured data, headings, images, videos, and a tech-stack fingerprint. The actor walks the sitemap to discover pages.

Developer

Crawler Bros

Last modified

2 months ago

What it does

You give it a start URL. The actor:

1. Optionally fetches /sitemap.xml (and follows sitemap-index files) to discover up to maxPages pages.
2. Optionally follows internal <a href> links from each crawled page (breadth-first, same host only) to fill the rest of the budget.
3. For each page, extracts:
- Title (<title>).
- Description (<meta name="description">).
- Canonical URL (<link rel="canonical">).
- Open Graph tags — og:title, og:description, og:image, og:type, etc.
- Twitter Cards — twitter:card, twitter:site, etc.
- JSON-LD blocks — every application/ld+json script, plus a flat jsonLdTypes array of @type values.
- Headings — first 20 unique values per level (h1 / h2 / h3).
- Images — every <img src> (and data-src lazy variants), absolutised, with alt text.
- Videos — every <video src> and <source> inside <video>.
- Tech-stack — lightweight Wappalyzer-style fingerprint (WordPress, Shopify, Next.js, React, Vue, Angular, Webflow, Cloudflare, GTM, Google Analytics, HubSpot, Intercom, Zendesk, Facebook Pixel, etc.).
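
The discovery order described above (start URL, then sitemap pages, then same-host links found breadth-first) can be sketched roughly as follows. This is an illustration under stated assumptions, not the actor's actual code: `plan_crawl`, `sitemap_urls`, and the `links_of` callback are hypothetical names, and page fetching is stubbed out so the sketch runs offline.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit


def plan_crawl(start_url, sitemap_urls, links_of, max_pages=20):
    """Return (url, discovered_via) pairs in crawl order, capped at max_pages.

    links_of(url) is a hypothetical callback returning the raw href values
    found on a page; a real crawler would fetch and parse the page here.
    """
    host = urlsplit(start_url).netloc
    seen = set()
    queue = deque()

    def enqueue(url, via):
        url = url.split("#", 1)[0]  # ignore fragments
        # Same-host only, and never queue the same URL twice.
        if url not in seen and urlsplit(url).netloc == host:
            seen.add(url)
            queue.append((url, via))

    enqueue(start_url, "start-url")
    for url in sitemap_urls:
        enqueue(url, "sitemap")

    plan = []
    while queue and len(plan) < max_pages:
        url, via = queue.popleft()
        plan.append((url, via))
        # BFS: links found on this page go to the back of the queue.
        for href in links_of(url):
            enqueue(urljoin(url, href), "internal-link")
    return plan


if __name__ == "__main__":
    site = {
        "https://example.com/": ["/about", "https://other.example/x", "/blog#top"],
        "https://example.com/pricing": ["/about"],
    }
    for url, via in plan_crawl(
        "https://example.com/",
        ["https://example.com/pricing"],
        lambda u: site.get(u, []),
    ):
        print(via, url)
```

Because sitemap URLs are queued before any page is visited, they are always consumed ahead of links discovered during the crawl, which matches the "fill the rest of the budget" wording.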

Input

| Field | Type | Default | Description |
|---|---|---|---|
| `startUrl` | string (required) | `https://apify.com` | Root URL to research. |
| `crawlSitemap` | boolean | `true` | Parse `/sitemap.xml` to discover pages. |
| `followInternalLinks` | boolean | `true` | Follow internal `<a href>` links to extend the page set. |
| `maxPages` | integer | `20` (range 1–200) | Hard cap on pages researched. |
| `extractMedia` | boolean | `true` | Include image/video URLs in each page record. |
| `extractTechStack` | boolean | `true` | Run the Wappalyzer-style scan. |
| `userAgent` | string (optional) | (Chrome 131) | Override only if a server filters by UA. |

Example input

{
  "startUrl": "https://apify.com",
  "crawlSitemap": true,
  "followInternalLinks": true,
  "maxPages": 30,
  "extractMedia": true,
  "extractTechStack": true
}
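
If you build the input programmatically, a small validation layer can catch mistakes before a run is started. The sketch below mirrors the fields and defaults from the input table; `ResearchInput` and `to_payload` are hypothetical helper names, and rejecting an out-of-range `maxPages` is an assumption (the listing does not say whether the actor rejects or clamps such values).

```python
from dataclasses import asdict, dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class ResearchInput:
    """Hypothetical mirror of the input table; defaults copied from it."""

    startUrl: str
    crawlSitemap: bool = True
    followInternalLinks: bool = True
    maxPages: int = 20
    extractMedia: bool = True
    extractTechStack: bool = True
    userAgent: Optional[str] = None  # None means "use the actor's default UA"

    def __post_init__(self):
        parts = urlsplit(self.startUrl)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError("startUrl must be an absolute http(s) URL")
        # Assumption: values outside the documented 1-200 range are rejected.
        if not 1 <= self.maxPages <= 200:
            raise ValueError("maxPages must be between 1 and 200")

    def to_payload(self):
        # Drop unset optional fields so the payload looks like the example input.
        return {k: v for k, v in asdict(self).items() if v is not None}


if __name__ == "__main__":
    print(ResearchInput(startUrl="https://apify.com", maxPages=30).to_payload())
```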

Output

One record per researched page. Empty fields are omitted.

{
  "url": "https://apify.com/",
  "title": "Apify · The full-stack web-scraping & automation platform",
  "description": "Apify is the all-in-one web scraping…",
  "canonical": "https://apify.com/",
  "ogTags": {
    "title": "Apify · The full-stack web-scraping & automation platform",
    "description": "Apify is the all-in-one web scraping…",
    "image": "https://apify.com/img/og-image.jpg",
    "type": "website"
  },
  "twitterTags": {
    "card": "summary_large_image",
    "site": "@apify"
  },
  "headings": {
    "h1": ["Build, deploy & monetize web scrapers and AI agents"],
    "h2": ["Trusted by 60,000+ developers", "Why Apify?"]
  },
  "jsonLdTypes": ["Organization", "WebSite"],
  "jsonLd": [ /* full ld+json blocks */ ],
  "images": [
    {"url": "https://apify.com/img/hero.png", "alt": "Apify hero"}
  ],
  "imageCount": 24,
  "videos": [],
  "techStack": ["Cloudflare", "Google Tag Manager", "Next.js", "React"],
  "seoSummary": {
    "titleLength": 58,
    "metaDescriptionLength": 142,
    "h1": "Build, deploy & monetize web scrapers and AI agents",
    "h1Count": 1,
    "wordCount": 1247,
    "imageCount": 24,
    "imagesWithoutAlt": 2,
    "internalLinkCount": 38,
    "externalLinkCount": 12,
    "pageSizeBytes": 78213,
    "hasCanonical": true,
    "hasStructuredData": true,
    "hasOgTags": true,
    "hasTwitterTags": true
  },
  "seoScores": {
    "metaTags": 100,
    "headings": 100,
    "images": 92,
    "links": 100,
    "structuredData": 100,
    "socialMeta": 100,
    "contentQuality": 100,
    "performance": 100,
    "technicalSeo": 100,
    "overallScore": 99,
    "overallGrade": "A",
    "issues": [
      {"category": "images", "message": "2/24 images missing alt text"}
    ],
    "issuesSummary": "1 issue(s) detected"
  },
  "discoveredVia": "start-url",
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}

Output fields

url — page URL (absolute).
title / description — <title> and <meta name="description">.
canonical — <link rel="canonical"> URL.
ogTags — flat dict of og:* properties without the prefix.
twitterTags — flat dict of twitter:* properties without the prefix.
headings — {h1: [...], h2: [...], h3: [...]} with first 20 unique values per level.
jsonLd — every parseable application/ld+json block (raw payloads).
jsonLdTypes — sorted list of @type values across all blocks.
images / imageCount — array of {url, alt} and total count.
videos / videoCount — array of {url, type?} and total count.
techStack — sorted list of detected stack tokens (CMS, frameworks, analytics, CDNs, web server).
seoSummary — at-a-glance SEO metrics block: titleLength, metaDescriptionLength, h1, h1Count, wordCount, imageCount, imagesWithoutAlt, internalLinkCount, externalLinkCount, pageSizeBytes, plus presence flags hasCanonical, hasStructuredData, hasOgTags, hasTwitterTags.
seoScores — 9 category scores (0–100) — metaTags, headings, images, links, structuredData, socialMeta, contentQuality, performance, technicalSeo — plus overallScore (0–100), overallGrade (A/B/C/D/F), and an issues[] array with concrete recommendations (e.g. "Title length 18 chars (recommended 30-60)").
discoveredVia — "start-url" (the starting page), "sitemap" (from sitemap.xml), or "internal-link" (BFS from another page).
scrapedAt — ISO-8601 timestamp.
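
To make the field definitions above more concrete, here is a minimal standard-library sketch of how a few of them could be derived from raw HTML. It is not the actor's implementation: `MetaExtractor`, `json_ld_types`, and `first_unique` are hypothetical names, and the JSON-LD helper only inspects top-level `@type` values (nested `@graph` entries are ignored in this sketch).

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects title, description, canonical, og:* and twitter:* tags."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "link" and (a.get("rel") or "").lower() == "canonical":
            if a.get("href"):
                self.record["canonical"] = a["href"]
        elif tag == "meta":
            key = (a.get("property") or a.get("name") or "").lower()
            content = a.get("content")
            if not content:
                return  # empty values are omitted, as in the output format
            if key == "description":
                self.record["description"] = content
            elif key.startswith("og:"):
                # Flat dict keyed without the "og:" prefix.
                self.record.setdefault("ogTags", {})[key[3:]] = content
            elif key.startswith("twitter:"):
                self.record.setdefault("twitterTags", {})[key[8:]] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.record["title"] = data.strip()


def json_ld_types(blocks):
    """Sorted, de-duplicated top-level @type values across JSON-LD blocks."""
    found = set()
    for block in blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            t = item.get("@type") if isinstance(item, dict) else None
            if isinstance(t, str):
                found.add(t)
            elif isinstance(t, list):
                found.update(x for x in t if isinstance(x, str))
    return sorted(found)


def first_unique(values, limit=20):
    """First `limit` unique values, preserving document order."""
    return list(dict.fromkeys(values))[:limit]
```

Calling `feed()` on a `MetaExtractor` with a page's HTML fills `record` with only the keys that were actually present, which is consistent with the "empty fields are omitted" rule.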

Tech-stack signatures

The detector pattern-matches on HTML body and response headers. Coverage is intentionally narrow but high-confidence:

CMS / builders: WordPress, Shopify, Squarespace, Wix, Webflow, Drupal, Ghost.
SPA frameworks: Next.js, React, Vue (incl. Nuxt), Angular, Svelte.
Analytics / tag managers: Google Tag Manager, Google Analytics, HubSpot, Facebook Pixel, Amplitude, Segment.
Customer support: Intercom, Zendesk.
CDNs: Cloudflare (incl. via cf-ray header), Amazon CloudFront, Fastly, Akamai.
Web servers: nginx, Apache (via Server header).
Backends: Express, ASP.NET, PHP (via X-Powered-By).
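
A detector of this kind can be approximated with a handful of regular expressions over the HTML body plus a few header checks. The sketch below is illustrative only: `detect_stack` and the `BODY_SIGNATURES` table are hypothetical, the patterns are a small hand-picked sample rather than the actor's real signature set, and the header checks follow the sources named above (cf-ray, Server, X-Powered-By).

```python
import re

# Illustrative body signatures only; not the actor's actual pattern list.
BODY_SIGNATURES = {
    "WordPress": re.compile(r"/wp-content/|/wp-includes/", re.I),
    "Next.js": re.compile(r"/_next/static/|__NEXT_DATA__"),
    "Shopify": re.compile(r"cdn\.shopify\.com", re.I),
    "Google Tag Manager": re.compile(r"googletagmanager\.com/gtm\.js", re.I),
}


def detect_stack(html, headers):
    """Return a sorted list of stack tokens matched in the body or headers."""
    # Header names are case-insensitive, so normalise them first.
    headers = {k.lower(): v for k, v in headers.items()}
    found = {name for name, rx in BODY_SIGNATURES.items() if rx.search(html)}

    if "cf-ray" in headers:
        found.add("Cloudflare")

    server = headers.get("server", "").lower()
    if "nginx" in server:
        found.add("nginx")
    if "apache" in server:
        found.add("Apache")

    powered = headers.get("x-powered-by", "").lower()
    if "express" in powered:
        found.add("Express")
    if "php" in powered:
        found.add("PHP")
    if "asp.net" in powered:
        found.add("ASP.NET")

    return sorted(found)


if __name__ == "__main__":
    html = '<script src="/_next/static/chunks/main.js"></script>'
    print(detect_stack(html, {"CF-RAY": "abc", "X-Powered-By": "Express"}))
```

Keeping the signature list short and specific trades coverage for fewer false positives, which is the same trade-off the section describes.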

Use cases

Competitor research — quickly fingerprint a competitor's tech stack and content structure.
SEO audits — verify every page has a title, meta description, canonical URL, and Open Graph image.
Sales enablement — extract pages tagged with specific JSON-LD @type (e.g. Product, Article, Event).
Brand monitoring — pull every image and video URL for asset auditing.
Lead enrichment — combine site title, description, and tech stack into a single CRM-ready record.
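
As an example of the SEO-audit use case, the records can be post-processed with a few lines of Python. This is a sketch that assumes the record shape shown in the Output section; `audit` and `REQUIRED` are hypothetical names, and because empty fields are omitted from records, a missing key is treated as a failed check.

```python
REQUIRED = ("title", "description", "canonical")


def audit(records):
    """Map each page URL to the list of SEO fields it is missing (if any)."""
    report = {}
    for rec in records:
        # Omitted and empty values both count as missing.
        missing = [field for field in REQUIRED if not rec.get(field)]
        if not rec.get("ogTags", {}).get("image"):
            missing.append("ogTags.image")
        if missing:
            report[rec["url"]] = missing
    return report


if __name__ == "__main__":
    sample = [
        {
            "url": "https://example.com/",
            "title": "Home",
            "description": "Welcome",
            "canonical": "https://example.com/",
            "ogTags": {"image": "https://example.com/og.png"},
        },
        {"url": "https://example.com/a", "title": "A"},
    ]
    for url, missing in audit(sample).items():
        print(url, "is missing:", ", ".join(missing))
```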

FAQ

Does it need a proxy? No. Most public web pages are accessible from datacenter IPs. A few sites with aggressive WAFs may block the request; records for those pages come back with techStack, images, and videos omitted.

Does it work on JavaScript-rendered (SPA) pages? Partially. The actor sees only the server-rendered HTML, not the DOM produced after client-side JavaScript runs. For Next.js pages this is usually fine, since Next.js pre-renders pages on the server by default. For purely client-side-rendered React/Vue apps, meta tags present in the initial HTML are still captured, but headings, images, and videos may be sparse.

How many pages does it crawl? Up to maxPages (default 20). The discovery order is: start URL → sitemap pages → internal links from researched pages.

Does it download images / video binaries? No. It only collects URLs and metadata; combine it with a downloader actor to fetch the bytes.

What if the site has no sitemap? The actor falls back to internal-link BFS from the start URL. Set crawlSitemap: false to skip the probe entirely.
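
The sitemap probe, including the sitemap-index case, can be sketched with the standard library. This is a rough illustration rather than the actor's code: `parse_sitemap`, `collect_urls`, and the `fetch` callback are hypothetical, and the sketch assumes sitemaps that use the standard sitemaps.org namespace. When `fetch` returns `None` (no sitemap), the result is an empty list, which is the point at which a crawler would fall back to internal links.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (kind, locs) where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    child = "sm:sitemap" if kind == "index" else "sm:url"
    locs = [
        loc.text.strip()
        for loc in root.findall(f"{child}/sm:loc", NS)
        if loc.text
    ]
    return kind, locs


def collect_urls(start_sitemap, fetch, max_pages=20, max_sitemaps=50):
    """Collect up to max_pages page URLs, following sitemap-index files.

    fetch(url) is a hypothetical callback returning the XML text, or None
    when the sitemap does not exist or cannot be downloaded.
    """
    pending, pages, visited = [start_sitemap], [], set()
    while pending and len(pages) < max_pages and len(visited) < max_sitemaps:
        url = pending.pop(0)
        if url in visited:
            continue
        visited.add(url)
        xml_text = fetch(url)
        if xml_text is None:
            continue
        try:
            kind, locs = parse_sitemap(xml_text)
        except ET.ParseError:
            continue  # malformed XML: skip this sitemap
        if kind == "index":
            pending.extend(locs)  # child sitemaps to visit later
        else:
            pages.extend(locs[: max_pages - len(pages)])
    return pages
```

The `max_sitemaps` cap is an added safeguard in this sketch so that a very large or self-referencing sitemap index cannot keep the probe running indefinitely.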

Why is jsonLd sometimes missing? Many sites don't ship structured data. The output omits jsonLd and jsonLdTypes when zero parseable blocks are found.
