Site Researcher

Pricing

from $1.00 / 1,000 results

Extract structured intelligence from any website: title, meta description, Open Graph tags, JSON-LD structured data, headings, images, videos, tech-stack fingerprint. Walks the sitemap to discover pages.

Developer

Crawler Bros

Maintained by Community

Last modified

25 days ago

Extract structured intelligence from any website. For each page, the actor pulls the title, meta description, Open Graph tags, Twitter Cards, JSON-LD structured data, heading inventory, image and video URLs, and a tech-stack fingerprint. Discovers pages via sitemap-walk and/or internal-link BFS. HTTP-only — no proxy, no browser, no API key.

What it does

You give it a start URL. The actor:

  1. Optionally fetches /sitemap.xml (and follows sitemap-index files) to discover up to maxPages pages.
  2. Optionally follows internal <a href> links from each crawled page (BFS, same-host only) to fill the rest of the budget.
  3. Per page, extracts:
    • Title (<title>).
    • Description (<meta name="description">).
    • Canonical URL (<link rel="canonical">).
    • Open Graph tags (og:title, og:description, og:image, og:type, etc.).
    • Twitter Cards (twitter:card, twitter:site, etc.).
    • JSON-LD blocks — every application/ld+json script, plus a flat jsonLdTypes array of @type values.
    • Headings — first 20 unique values per level (h1 / h2 / h3).
    • Images — every <img src> (and data-src lazy variants), absolutised, with alt text.
    • Videos — every <video src> and <source> inside <video>.
    • Tech-stack — lightweight Wappalyzer-style fingerprint (WordPress, Shopify, Next.js, React, Vue, Angular, Webflow, Cloudflare, GTM, Google Analytics, HubSpot, Intercom, Zendesk, Facebook Pixel, etc.).
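
As a rough illustration of the head-metadata part of step 3, the sketch below pulls the title, description, canonical URL, Open Graph tags, and Twitter Card tags out of raw HTML using only the Python standard library. The class and function names (HeadMetaParser, extract_head_meta) are hypothetical and this is not the actor's own code; it only mirrors the field names shown in the output section.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collects title, description, canonical, Open Graph and Twitter tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.record = {"ogTags": {}, "twitterTags": {}}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        a = {k.lower(): (v or "") for k, v in attrs}
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Sites use either property= or name= for og:* and twitter:* keys.
            key = (a.get("property") or a.get("name") or "").lower()
            content = a.get("content", "").strip()
            if not content:
                return
            if key == "description":
                self.record["description"] = content
            elif key.startswith("og:"):
                self.record["ogTags"][key[3:]] = content
            elif key.startswith("twitter:"):
                self.record["twitterTags"][key[8:]] = content
        elif tag == "link" and "canonical" in a.get("rel", "").lower().split():
            if a.get("href"):
                self.record["canonical"] = a["href"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_head_meta(html: str) -> dict:
    parser = HeadMetaParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser._title_parts).strip()
    if title:
        parser.record["title"] = title
    # Assumption: drop empty values, mirroring "Empty fields are omitted".
    return {k: v for k, v in parser.record.items() if v}
```

The og: and twitter: prefixes are stripped so the keys line up with the ogTags and twitterTags shapes documented under Output fields.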

Input

| Field | Type | Default | Description |
|---|---|---|---|
| startUrl | string (required) | https://apify.com | Root URL to research. |
| crawlSitemap | boolean | true | Parse /sitemap.xml to discover pages. |
| followInternalLinks | boolean | true | Follow internal <a href> links to extend the page set. |
| maxPages | integer | 20 (range 1–200) | Hard cap on pages researched. |
| extractMedia | boolean | true | Include image/video URLs in each page record. |
| extractTechStack | boolean | true | Run the Wappalyzer-style scan. |
| userAgent | string (optional) | (Chrome 131) | Override only if a server filters by UA. |

Example input

{
  "startUrl": "https://apify.com",
  "crawlSitemap": true,
  "followInternalLinks": true,
  "maxPages": 30,
  "extractMedia": true,
  "extractTechStack": true
}
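
If you assemble the input programmatically, a small helper can apply the documented defaults and catch an out-of-range maxPages before a run starts. This is a minimal sketch: build_input and DEFAULTS are hypothetical names, and the http(s) check on startUrl is an assumption rather than a documented rule.

```python
# Defaults copied from the input table above.
DEFAULTS = {
    "crawlSitemap": True,
    "followInternalLinks": True,
    "maxPages": 20,
    "extractMedia": True,
    "extractTechStack": True,
}


def build_input(start_url: str, **overrides) -> dict:
    """Merge overrides onto the documented defaults and check basic constraints."""
    # Assumption: startUrl should be an absolute http(s) URL.
    if not start_url.startswith(("http://", "https://")):
        raise ValueError("startUrl must be an absolute http(s) URL")
    unknown = set(overrides) - set(DEFAULTS) - {"userAgent"}
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    run_input = {"startUrl": start_url, **DEFAULTS, **overrides}
    max_pages = run_input["maxPages"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int):
        raise ValueError("maxPages must be an integer")
    if not 1 <= max_pages <= 200:
        raise ValueError("maxPages must be between 1 and 200")
    return run_input
```

Calling build_input("https://apify.com", maxPages=30) produces the same dictionary as the example input above.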

Output

One record per researched page. Empty fields are omitted.

{
  "url": "https://apify.com/",
  "title": "Apify · The full-stack web-scraping & automation platform",
  "description": "Apify is the all-in-one web scraping…",
  "canonical": "https://apify.com/",
  "ogTags": {
    "title": "Apify · The full-stack web-scraping & automation platform",
    "description": "Apify is the all-in-one web scraping…",
    "image": "https://apify.com/img/og-image.jpg",
    "type": "website"
  },
  "twitterTags": {
    "card": "summary_large_image",
    "site": "@apify"
  },
  "headings": {
    "h1": ["Build, deploy & monetize web scrapers and AI agents"],
    "h2": ["Trusted by 60,000+ developers", "Why Apify?"]
  },
  "jsonLdTypes": ["Organization", "WebSite"],
  "jsonLd": [ /* full ld+json blocks */ ],
  "images": [
    {"url": "https://apify.com/img/hero.png", "alt": "Apify hero"}
  ],
  "imageCount": 24,
  "videos": [],
  "techStack": ["Cloudflare", "Google Tag Manager", "Next.js", "React"],
  "seoSummary": {
    "titleLength": 58,
    "metaDescriptionLength": 142,
    "h1": "Build, deploy & monetize web scrapers and AI agents",
    "h1Count": 1,
    "wordCount": 1247,
    "imageCount": 24,
    "imagesWithoutAlt": 2,
    "internalLinkCount": 38,
    "externalLinkCount": 12,
    "pageSizeBytes": 78213,
    "hasCanonical": true,
    "hasStructuredData": true,
    "hasOgTags": true,
    "hasTwitterTags": true
  },
  "seoScores": {
    "metaTags": 100,
    "headings": 100,
    "images": 92,
    "links": 100,
    "structuredData": 100,
    "socialMeta": 100,
    "contentQuality": 100,
    "performance": 100,
    "technicalSeo": 100,
    "overallScore": 99,
    "overallGrade": "A",
    "issues": [
      {"category": "images", "message": "2/24 images missing alt text"}
    ],
    "issuesSummary": "1 issue(s) detected"
  },
  "discoveredVia": "start-url",
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}

Output fields

  • url: page URL (absolute).
  • title / description: <title> and <meta name="description">.
  • canonical: <link rel="canonical"> URL.
  • ogTags: flat dict of og:* properties without the prefix.
  • twitterTags: flat dict of twitter:* properties without the prefix.
  • headings: {h1: [...], h2: [...], h3: [...]} with the first 20 unique values per level.
  • jsonLd: every parseable application/ld+json block (raw payloads).
  • jsonLdTypes: sorted list of @type values across all blocks.
  • images / imageCount: array of {url, alt} and total count.
  • videos / videoCount: array of {url, type?} and total count.
  • techStack: sorted list of detected stack tokens (CMS, frameworks, analytics, CDNs, web server).
  • seoSummary: at-a-glance SEO metrics block with titleLength, metaDescriptionLength, h1, h1Count, wordCount, imageCount, imagesWithoutAlt, internalLinkCount, externalLinkCount, pageSizeBytes, plus the presence flags hasCanonical, hasStructuredData, hasOgTags, hasTwitterTags.
  • seoScores: 9 category scores, each 0–100 (metaTags, headings, images, links, structuredData, socialMeta, contentQuality, performance, technicalSeo), plus overallScore (0–100), overallGrade (A/B/C/D/F), and an issues[] array with concrete recommendations (e.g. "Title length 18 chars (recommended 30-60)").
  • discoveredVia: "start-url" (the starting page), "sitemap" (from sitemap.xml), or "internal-link" (BFS from another page).
  • scrapedAt: ISO-8601 timestamp.
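
Because empty fields are omitted, code that consumes these records should read them defensively. The sketch below tallies seoScores issues by category and lists pages graded below A; summarize_issues and the keys of its return value are hypothetical names, and records is assumed to be the list of page records downloaded from a run's dataset.

```python
from collections import Counter


def summarize_issues(records: list[dict]) -> dict:
    """Tally seoScores.issues by category and list pages graded below 'A'."""
    by_category = Counter()
    below_a = []
    for rec in records:
        # Any block may be absent, so fall back to empty containers.
        scores = rec.get("seoScores", {})
        for issue in scores.get("issues", []):
            by_category[issue.get("category", "unknown")] += 1
        grade = scores.get("overallGrade")
        if grade and grade != "A":
            below_a.append((rec.get("url"), grade))
    return {"issuesByCategory": dict(by_category), "pagesBelowA": below_a}
```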

Tech-stack signatures

The detector pattern-matches on HTML body and response headers. Coverage is intentionally narrow but high-confidence:

  • CMS / builders: WordPress, Shopify, Squarespace, Wix, Webflow, Drupal, Ghost.
  • SPA frameworks: Next.js, React, Vue (incl. Nuxt), Angular, Svelte.
  • Analytics / tag managers: Google Tag Manager, Google Analytics, HubSpot, Facebook Pixel, Amplitude, Segment.
  • Customer support: Intercom, Zendesk.
  • CDNs: Cloudflare (incl. via cf-ray header), Amazon CloudFront, Fastly, Akamai.
  • Web servers: nginx, Apache (via Server header).
  • Backends: Express, ASP.NET, PHP (via X-Powered-By).
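
To make "pattern-matches on HTML body and response headers" concrete, here is a minimal sketch of that style of detector. The signature tables are illustrative guesses covering a handful of technologies, not the actor's real rules, and detect_tech_stack is a hypothetical name.

```python
import re

# Hypothetical body signatures: illustrative patterns only.
BODY_SIGNATURES = {
    "WordPress": re.compile(r"/wp-(content|includes)/", re.I),
    "Shopify": re.compile(r"cdn\.shopify\.com", re.I),
    "Next.js": re.compile(r"/_next/static/|__NEXT_DATA__"),
    "Google Tag Manager": re.compile(r"googletagmanager\.com/gtm\.js", re.I),
}

# Hypothetical header signatures: (header name, value pattern or None).
HEADER_SIGNATURES = {
    "Cloudflare": ("cf-ray", None),  # presence of the header is enough
    "nginx": ("server", re.compile(r"nginx", re.I)),
    "Apache": ("server", re.compile(r"apache", re.I)),
    "Express": ("x-powered-by", re.compile(r"express", re.I)),
    "PHP": ("x-powered-by", re.compile(r"php", re.I)),
}


def detect_tech_stack(html: str, headers: dict[str, str]) -> list[str]:
    """Return a sorted list of technology names matched in body or headers."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    found = {name for name, rx in BODY_SIGNATURES.items() if rx.search(html)}
    for name, (header, rx) in HEADER_SIGNATURES.items():
        value = lowered.get(header)
        if value is not None and (rx is None or rx.search(value)):
            found.add(name)
    return sorted(found)
```

Keeping each signature narrow trades coverage for fewer false positives, which matches the "narrow but high-confidence" goal stated above.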

Use cases

  • Competitor research — quickly fingerprint a competitor's tech stack and content structure.
  • SEO audits — verify every page has a title, meta description, canonical URL, and Open Graph image.
  • Sales enablement — extract pages tagged with specific JSON-LD @type (e.g. Product, Article, Event).
  • Brand monitoring — pull every image and video URL for asset auditing.
  • Lead enrichment — combine site title, description, and tech stack into a single CRM-ready record.
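
Two of these use cases reduce to short filters over the output records. The sketch below shows one possible shape for the SEO-audit check and the JSON-LD @type filter; audit_record and pages_with_type are hypothetical helpers, and records is assumed to be a list of page records as described under Output.

```python
def audit_record(rec: dict) -> list[str]:
    """Return the names of the basic SEO fields a page record is missing."""
    missing = [f for f in ("title", "description", "canonical") if not rec.get(f)]
    # ogTags keys carry no "og:" prefix, so the Open Graph image is ogTags["image"].
    if not rec.get("ogTags", {}).get("image"):
        missing.append("ogTags.image")
    return missing


def pages_with_type(records: list[dict], wanted: str) -> list[str]:
    """URLs of pages whose jsonLdTypes include the given @type (e.g. 'Product')."""
    return [r["url"] for r in records if wanted in r.get("jsonLdTypes", [])]
```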

FAQ

Does it need a proxy? No. Public web pages are accessible from datacenter IPs. A few sites with aggressive WAFs may block the actor; those pages fall through with an empty techStack and missing images / videos.

Does it work on JavaScript-rendered (SPA) pages? Partially. The actor sees the server-rendered HTML, not what runs after the page boots. For Next.js pages this is usually fine, because Next.js renders pages on the server. For pure client-side-rendered React/Vue apps, the meta tags are still visible, but content arrays such as headings and images may be sparse.

How many pages does it crawl? Up to maxPages (default 20). The discovery order is: start URL → sitemap pages → internal links from researched pages.
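
The discovery order can be sketched as a single FIFO queue that is seeded with the start URL, then the sitemap pages, and then extended with same-host links as pages are crawled. This is an illustration of the described ordering, not the actor's code: plan_crawl is a hypothetical name, and get_links stands in for whatever fetches a page and returns its absolute link URLs.

```python
from collections import deque
from urllib.parse import urlsplit


def plan_crawl(start_url, sitemap_urls, get_links, max_pages=20):
    """Return (url, discoveredVia) pairs in order: start URL, sitemap, then BFS."""
    host = urlsplit(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, "start-url")])
    for url in sitemap_urls:
        if url not in seen:
            seen.add(url)
            queue.append((url, "sitemap"))
    visited = []
    while queue and len(visited) < max_pages:  # maxPages is a hard cap
        url, via = queue.popleft()
        visited.append((url, via))
        for link in get_links(url):
            # Same-host only; links already queued or visited are skipped.
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, "internal-link"))
    return visited
```

Because internal links are appended behind the sitemap entries, they only consume whatever budget the sitemap leaves unused.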

Does it download images / video binaries? No — only collects URLs and metadata. Combine with a downloader actor for the bytes.

What if the site has no sitemap? The actor falls back to internal-link BFS from the start URL. Set crawlSitemap: false to skip the probe entirely.
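
For reference, a sitemap probe with this kind of fallback might be sketched as follows, using the standard library XML parser. parse_sitemap is a hypothetical helper: it reports whether the document is a sitemap index or a plain URL set, and returns kind "none" with no URLs when the response is not a usable sitemap, in which case a caller would rely on internal links alone.

```python
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ('index' | 'urlset' | 'none', [loc, ...]) for a sitemap document."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "none", []  # e.g. an HTML error page served at /sitemap.xml
    kinds = {"sitemapindex": "index", "urlset": "urlset"}
    kind = kinds.get(_local(root.tag), "none")
    if kind == "none":
        return "none", []
    locs = []
    for entry in root:  # <sitemap> entries in an index, <url> entries in a URL set
        for child in entry:
            if _local(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return kind, locs
```

When the kind is "index", each returned location is itself a sitemap to fetch and parse the same way.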

Why is jsonLd sometimes missing? Many sites don't ship structured data. The output omits jsonLd and jsonLdTypes when zero parseable blocks are found.
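
A minimal sketch of how parseable blocks and the flat jsonLdTypes list could be derived from the raw script contents is shown below. collect_json_ld is a hypothetical name, and walking nested structures such as @graph to find @type values is an assumption about the behaviour, not something the actor documents.

```python
import json


def collect_json_ld(raw_blocks: list[str]) -> dict:
    """Parse ld+json script bodies; return jsonLd and jsonLdTypes, or {} if none parse."""
    parsed = []
    types = set()

    def walk(node):
        # Assumption: @type values are collected at any nesting depth.
        if isinstance(node, dict):
            t = node.get("@type")
            if isinstance(t, str):
                types.add(t)
            elif isinstance(t, list):
                types.update(x for x in t if isinstance(x, str))
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    for raw in raw_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable block: skip it
        parsed.append(data)
        walk(data)
    if not parsed:
        return {}  # both fields omitted when nothing parses
    return {"jsonLd": parsed, "jsonLdTypes": sorted(types)}
```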