Site Researcher

Pricing

from $1.00 / 1,000 results


Developer

Crawler Bros

Maintained by Community

Extract structured intelligence from any website. For each page, the actor pulls the title, meta description, Open Graph tags, Twitter Cards, JSON-LD structured data, heading inventory, image and video URLs, and a tech-stack fingerprint. Discovers pages via sitemap-walk and/or internal-link BFS. HTTP-only — no proxy, no browser, no API key.

What it does

You give it a start URL. The actor:

  1. Optionally fetches /sitemap.xml (and follows sitemap-index files) to discover up to maxPages pages.
  2. Optionally follows internal <a href> links from each crawled page (BFS, same-host only) to fill the rest of the budget.
  3. Per page, extracts:
    • Title (<title>).
    • Description (<meta name="description">).
    • Canonical URL (<link rel="canonical">).
    • Open Graph tags (og:title, og:description, og:image, og:type, etc.).
    • Twitter Cards (twitter:card, twitter:site, etc.).
    • JSON-LD blocks — every application/ld+json script, plus a flat jsonLdTypes array of @type values.
    • Headings — first 20 unique values per level (h1 / h2 / h3).
    • Images — every <img src> (and data-src lazy variants), absolutised, with alt text.
    • Videos — every <video src> and <source> inside <video>.
    • Tech-stack — lightweight Wappalyzer-style fingerprint (WordPress, Shopify, Next.js, React, Vue, Angular, Webflow, Cloudflare, GTM, Google Analytics, HubSpot, Intercom, Zendesk, Facebook Pixel, etc.).
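
The head-level part of step 3 can be sketched with Python's standard html.parser. This is a minimal illustration, not the actor's code: the class name HeadMetaParser and the function extract_head are hypothetical, and only title, description, canonical, og:* and twitter:* are covered.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collects title, meta description, canonical, og:* and twitter:* tags."""

    def __init__(self):
        super().__init__()
        self.record = {"ogTags": {}, "twitterTags": {}}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph usually uses property=..., the others use name=...
            key = (a.get("property") or a.get("name") or "").lower()
            content = a.get("content", "")
            if key == "description":
                self.record["description"] = content
            elif key.startswith("og:"):
                self.record["ogTags"][key[len("og:"):]] = content
            elif key.startswith("twitter:"):
                self.record["twitterTags"][key[len("twitter:"):]] = content
        elif tag == "link" and a.get("rel", "").lower() == "canonical":
            self.record["canonical"] = a.get("href", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_head(html):
    """Return a page record holding only the non-empty head fields."""
    parser = HeadMetaParser()
    parser.feed(html)
    parser.close()
    record = dict(parser.record)
    record["title"] = "".join(parser._title_parts).strip()
    # Drop empty values, mirroring the "empty fields are omitted" behaviour.
    return {k: v for k, v in record.items() if v}
```

Stripping the og: and twitter: prefixes here matches the flat ogTags / twitterTags shape shown in the output section.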

Input

Field | Type | Default | Description
--- | --- | --- | ---
startUrl | string (required) | https://apify.com | Root URL to research.
crawlSitemap | boolean | true | Parse /sitemap.xml to discover pages.
followInternalLinks | boolean | true | Follow internal <a href> links to extend the page set.
maxPages | integer | 20 (1–200) | Hard cap on pages researched.
extractMedia | boolean | true | Include image/video URLs in each page record.
extractTechStack | boolean | true | Run the Wappalyzer-style scan.
userAgent | string (optional) | (Chrome 131) | Override only if a server filters by UA.

Example input

{
  "startUrl": "https://apify.com",
  "crawlSitemap": true,
  "followInternalLinks": true,
  "maxPages": 30,
  "extractMedia": true,
  "extractTechStack": true
}
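
If you build the input programmatically, a small helper can apply the documented defaults and catch an out-of-range maxPages before the run starts. The sketch below is an assumption-laden convenience, not part of the actor: build_input and DEFAULTS are hypothetical names, and the only rule enforced is the 1–200 range from the input table.

```python
# Defaults copied from the input table; userAgent has no default here
# because the table only describes it as "(Chrome 131)".
DEFAULTS = {
    "crawlSitemap": True,
    "followInternalLinks": True,
    "maxPages": 20,
    "extractMedia": True,
    "extractTechStack": True,
}


def build_input(start_url, **overrides):
    """Merge overrides onto the documented defaults and validate maxPages."""
    unknown = set(overrides) - set(DEFAULTS) - {"userAgent"}
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    run_input = {"startUrl": start_url, **DEFAULTS, **overrides}
    max_pages = run_input["maxPages"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int):
        raise ValueError("maxPages must be an integer")
    if not 1 <= max_pages <= 200:
        raise ValueError("maxPages must be between 1 and 200")
    return run_input
```

Calling build_input("https://apify.com", maxPages=30) yields the same object as the example input above.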

Output

One record per researched page. Empty fields are omitted.

{
  "url": "https://apify.com/",
  "title": "Apify · The full-stack web-scraping & automation platform",
  "description": "Apify is the all-in-one web scraping…",
  "canonical": "https://apify.com/",
  "ogTags": {
    "title": "Apify · The full-stack web-scraping & automation platform",
    "description": "Apify is the all-in-one web scraping…",
    "image": "https://apify.com/img/og-image.jpg",
    "type": "website"
  },
  "twitterTags": {
    "card": "summary_large_image",
    "site": "@apify"
  },
  "headings": {
    "h1": ["Build, deploy & monetize web scrapers and AI agents"],
    "h2": ["Trusted by 60,000+ developers", "Why Apify?"]
  },
  "jsonLdTypes": ["Organization", "WebSite"],
  "jsonLd": [ /* full ld+json blocks */ ],
  "images": [
    {"url": "https://apify.com/img/hero.png", "alt": "Apify hero"}
  ],
  "imageCount": 24,
  "videos": [],
  "techStack": ["Cloudflare", "Google Tag Manager", "Next.js", "React"],
  "seoSummary": {
    "titleLength": 58,
    "metaDescriptionLength": 142,
    "h1": "Build, deploy & monetize web scrapers and AI agents",
    "h1Count": 1,
    "wordCount": 1247,
    "imageCount": 24,
    "imagesWithoutAlt": 2,
    "internalLinkCount": 38,
    "externalLinkCount": 12,
    "pageSizeBytes": 78213,
    "hasCanonical": true,
    "hasStructuredData": true,
    "hasOgTags": true,
    "hasTwitterTags": true
  },
  "seoScores": {
    "metaTags": 100,
    "headings": 100,
    "images": 92,
    "links": 100,
    "structuredData": 100,
    "socialMeta": 100,
    "contentQuality": 100,
    "performance": 100,
    "technicalSeo": 100,
    "overallScore": 99,
    "overallGrade": "A",
    "issues": [
      {"category": "images", "message": "2/24 images missing alt text"}
    ],
    "issuesSummary": "1 issue(s) detected"
  },
  "discoveredVia": "start-url",
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}

Output fields

  • url: page URL (absolute).
  • title / description: contents of <title> and <meta name="description">.
  • canonical: the <link rel="canonical"> URL.
  • ogTags: flat dict of og:* properties without the prefix.
  • twitterTags: flat dict of twitter:* properties without the prefix.
  • headings: {h1: [...], h2: [...], h3: [...]} with the first 20 unique values per level.
  • jsonLd: every parseable application/ld+json block (raw payloads).
  • jsonLdTypes: sorted list of @type values across all blocks.
  • images / imageCount: array of {url, alt} and total count.
  • videos / videoCount: array of {url, type?} and total count.
  • techStack: sorted list of detected stack tokens (CMS, frameworks, analytics, CDNs, web server).
  • seoSummary: at-a-glance SEO metrics (titleLength, metaDescriptionLength, h1, h1Count, wordCount, imageCount, imagesWithoutAlt, internalLinkCount, externalLinkCount, pageSizeBytes) plus the presence flags hasCanonical, hasStructuredData, hasOgTags, hasTwitterTags.
  • seoScores: nine category scores, each 0–100 (metaTags, headings, images, links, structuredData, socialMeta, contentQuality, performance, technicalSeo), plus overallScore (0–100), overallGrade (A/B/C/D/F), an issues[] array with concrete recommendations (e.g. "Title length 18 chars (recommended 30-60)"), and an issuesSummary string.
  • discoveredVia: "start-url" (the starting page), "sitemap" (from sitemap.xml), or "internal-link" (BFS from another page).
  • scrapedAt: ISO-8601 timestamp.

Tech-stack signatures

The detector pattern-matches on HTML body and response headers. Coverage is intentionally narrow but high-confidence:

  • CMS / builders: WordPress, Shopify, Squarespace, Wix, Webflow, Drupal, Ghost.
  • SPA frameworks: Next.js, React, Vue (incl. Nuxt), Angular, Svelte.
  • Analytics / tag managers: Google Tag Manager, Google Analytics, HubSpot, Facebook Pixel, Amplitude, Segment.
  • Customer support: Intercom, Zendesk.
  • CDNs: Cloudflare (incl. via cf-ray header), Amazon CloudFront, Fastly, Akamai.
  • Web servers: nginx, Apache (via Server header).
  • Backends: Express, ASP.NET, PHP (via X-Powered-By).
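
As a rough illustration of how body-plus-header matching can work, here is a small sketch. The header checks (cf-ray, Server, X-Powered-By) follow the list above; the body regexes in BODY_SIGNATURES are assumptions chosen for the example, not the actor's actual signatures, and detect_tech_stack is a hypothetical name.

```python
import re

# Illustrative body patterns only (assumed, not the actor's real signatures).
BODY_SIGNATURES = {
    "WordPress": re.compile(r"/wp-content/|/wp-includes/", re.I),
    "Shopify": re.compile(r"cdn\.shopify\.com", re.I),
    "Next.js": re.compile(r"/_next/static/|__NEXT_DATA__"),
    "Google Tag Manager": re.compile(r"googletagmanager\.com/gtm\.js", re.I),
}

# Substrings looked up in the Server and X-Powered-By response headers.
SERVER_TOKENS = {"nginx": "nginx", "apache": "Apache"}
POWERED_BY_TOKENS = {"express": "Express", "asp.net": "ASP.NET", "php": "PHP"}


def detect_tech_stack(html, headers):
    """Return a sorted list of stack tokens matched in the body or headers."""
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    found = {name for name, pattern in BODY_SIGNATURES.items() if pattern.search(html)}
    if "cf-ray" in lowered:
        found.add("Cloudflare")
    for needle, name in SERVER_TOKENS.items():
        if needle in lowered.get("server", ""):
            found.add(name)
    for needle, name in POWERED_BY_TOKENS.items():
        if needle in lowered.get("x-powered-by", ""):
            found.add(name)
    return sorted(found)
```

Keeping each signature to one unambiguous marker is what makes a narrow detector high-confidence: a page either carries the marker or it does not.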

Use cases

  • Competitor research — quickly fingerprint a competitor's tech stack and content structure.
  • SEO audits — verify every page has a title, meta description, canonical URL, and OpenGraph image.
  • Sales enablement — extract pages tagged with specific JSON-LD @type (e.g. Product, Article, Event).
  • Brand monitoring — pull every image and video URL for asset auditing.
  • Lead enrichment — combine site title, description, and tech stack into a single CRM-ready record.
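
Two of these use cases reduce to simple filters over the output records. The sketch below assumes the records have already been loaded as a list of dicts shaped like the example output; missing_seo_basics and pages_with_type are hypothetical helper names.

```python
def missing_seo_basics(record):
    """Return the names of the basic SEO fields absent from one page record."""
    checks = {
        "title": record.get("title"),
        "description": record.get("description"),
        "canonical": record.get("canonical"),
        "og:image": record.get("ogTags", {}).get("image"),
    }
    # Empty fields are omitted from the output, so .get() returns None for them.
    return [name for name, value in checks.items() if not value]


def pages_with_type(records, ld_type):
    """Return the URLs of pages whose jsonLdTypes include ld_type."""
    return [r["url"] for r in records if ld_type in r.get("jsonLdTypes", [])]
```

For an SEO audit, run missing_seo_basics over every record and report the pages with a non-empty result; for sales enablement, pages_with_type(records, "Product") lists the product pages.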

FAQ

Does it need a proxy? No. Public web pages are accessible from datacenter IPs. A few sites with aggressive WAFs may block the request; records for those pages come back with an empty techStack and no images / videos.

Does it work on JavaScript-rendered (SPA) pages? Partially. The actor sees the server-rendered HTML, not what runs after the page boots. For Next.js pages this is usually fine, because Next.js renders on the server. For pure client-side-rendered React/Vue apps, the meta tags are still visible, but content arrays such as headings, images, and videos may be sparse.

How many pages does it crawl? Up to maxPages (default 20). The discovery order is: start URL → sitemap pages → internal links from researched pages.
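
The discovery order described above can be sketched as a single FIFO queue. This is an illustration under stated assumptions, not the actor's implementation: plan_crawl is a hypothetical name, get_links stands in for "fetch the page and return its <a href> targets", and URL normalisation (fragments, trailing slashes) is left out.

```python
from collections import deque
from urllib.parse import urlsplit


def plan_crawl(start_url, sitemap_urls, get_links, max_pages=20):
    """Return [(url, discoveredVia), ...] in visit order, capped at max_pages."""
    host = urlsplit(start_url).netloc
    seen = set()
    queue = deque()

    def enqueue(url, via):
        if url not in seen:
            seen.add(url)
            queue.append((url, via))

    # Start URL first, then sitemap pages; internal links join the queue later.
    enqueue(start_url, "start-url")
    for url in sitemap_urls:
        enqueue(url, "sitemap")

    visited = []
    while queue and len(visited) < max_pages:
        url, via = queue.popleft()
        visited.append((url, via))
        for link in get_links(url):
            # Same-host only, as in the BFS step of "What it does".
            if urlsplit(link).netloc == host:
                enqueue(link, "internal-link")
    return visited
```

Because sitemap URLs are queued before any page is fetched, links found on the start page always land behind them, which gives the start URL, sitemap, internal links ordering.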

Does it download images / video binaries? No. It collects only URLs and metadata. Combine it with a downloader actor for the bytes.

What if the site has no sitemap? The actor falls back to internal-link BFS from the start URL. Set crawlSitemap: false to skip the probe entirely.
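
For reference, a sitemap probe that also follows sitemap-index files can be sketched with the standard library. The function names and the fetch_text callable (anything that returns the XML text for a URL) are hypothetical, and a sitemap that fails to parse is simply skipped, which is an assumption about reasonable behaviour rather than a description of the actor.

```python
import xml.etree.ElementTree as ET


def parse_sitemap(xml_text):
    """Return (page_urls, child_sitemap_urls) from one sitemap document."""
    root = ET.fromstring(xml_text)

    def local(tag):
        # Tags are namespaced, e.g. "{http://www.sitemaps.org/...}loc".
        return tag.rsplit("}", 1)[-1]

    locs = [
        (local(parent.tag), (loc.text or "").strip())
        for parent in root
        for loc in parent
        if local(loc.tag) == "loc"
    ]
    pages = [url for kind, url in locs if kind == "url" and url]
    children = [url for kind, url in locs if kind == "sitemap" and url]
    return pages, children


def discover_from_sitemap(sitemap_url, fetch_text, max_pages=20):
    """Walk a sitemap (and any index files) until max_pages URLs are found."""
    pending, pages, seen = [sitemap_url], [], set()
    while pending and len(pages) < max_pages:
        url = pending.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            found, children = parse_sitemap(fetch_text(url))
        except ET.ParseError:
            continue  # not valid XML: treat as "no sitemap here"
        pages.extend(found)
        pending.extend(children)
    return pages[:max_pages]
```

An empty result from discover_from_sitemap is the case where the internal-link BFS has to fill the whole page budget.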

Why is jsonLd sometimes missing? Many sites don't ship structured data. The output omits jsonLd and jsonLdTypes when zero parseable blocks are found.
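
To make "parseable" concrete, here is a minimal sketch of JSON-LD extraction with the standard library. The names are hypothetical, and two choices are assumptions rather than documented behaviour: blocks that fail json.loads are skipped silently, and @type values are collected recursively (so types nested under @graph are included).

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the parsed payload of every application/ld+json script."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = None  # list of text chunks while inside an ld+json script

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buf is not None:
            try:
                self.blocks.append(json.loads("".join(self._buf)))
            except json.JSONDecodeError:
                pass  # unparseable block: skip it
            self._buf = None


def collect_types(node, out):
    """Recursively add every string @type found in node to the set out."""
    if isinstance(node, dict):
        value = node.get("@type")
        if isinstance(value, str):
            out.add(value)
        elif isinstance(value, list):
            out.update(v for v in value if isinstance(v, str))
        for child in node.values():
            collect_types(child, out)
    elif isinstance(node, list):
        for item in node:
            collect_types(item, out)


def extract_json_ld(html):
    """Return {"jsonLd": [...], "jsonLdTypes": [...]}, or {} if none parse."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return {}
    types = set()
    collect_types(parser.blocks, types)
    return {"jsonLd": parser.blocks, "jsonLdTypes": sorted(types)}
```

Returning an empty dict when nothing parses mirrors the omission of both fields from the record.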