Site Researcher
Pricing
from $1.00 / 1,000 results
Extract structured intelligence from any website: title, meta description, Open Graph tags, JSON-LD structured data, headings, images, videos, tech-stack fingerprint. Walks the sitemap to discover pages.
Rating: 5.0 (14)
Developer: Crawler Bros
Maintained by Community
Actor stats: 14 bookmarked, 2 total users, 1 monthly active user
Last modified: 4 days ago
Extract structured intelligence from any website. For each page, the actor pulls the title, meta description, Open Graph tags, Twitter Cards, JSON-LD structured data, heading inventory, image and video URLs, and a tech-stack fingerprint. Discovers pages via sitemap-walk and/or internal-link BFS. HTTP-only — no proxy, no browser, no API key.
What it does
You give it a start URL. The actor:
- Optionally fetches `/sitemap.xml` (and follows sitemap-index files) to discover up to `maxPages` pages.
- Optionally follows internal `<a href>` links from each crawled page (BFS, same-host only) to fill the rest of the budget.
- Per page, extracts:
  - Title (`<title>`).
  - Description (`<meta name="description">`).
  - Canonical URL (`<link rel="canonical">`).
  - Open Graph tags: `og:title`, `og:description`, `og:image`, `og:type`, etc.
  - Twitter Cards: `twitter:card`, `twitter:site`, etc.
  - JSON-LD blocks: every `application/ld+json` script, plus a flat `jsonLdTypes` array of `@type` values.
  - Headings: first 20 unique values per level (h1 / h2 / h3).
  - Images: every `<img src>` (and `data-src` lazy variants), absolutised, with alt text.
  - Videos: every `<video src>` and `<source>` inside `<video>`.
  - Tech-stack: lightweight Wappalyzer-style fingerprint (WordPress, Shopify, Next.js, React, Vue, Angular, Webflow, Cloudflare, GTM, Google Analytics, HubSpot, Intercom, Zendesk, Facebook Pixel, etc.).
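The per-page extraction described above can be sketched with Python's standard-library HTML parser. This is a minimal illustration, not the actor's actual code: the class name `PageMetaParser` and the function `extract_page` are hypothetical, and only a subset of the listed fields (title, description, canonical, Open Graph, Twitter Cards, images) is covered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageMetaParser(HTMLParser):
    """Hypothetical collector for title, meta tags, canonical URL and images."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.record = {"ogTags": {}, "twitterTags": {}, "images": []}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = a.get("property") or a.get("name") or ""
            content = a.get("content")
            if content is None:
                return
            if key == "description":
                self.record["description"] = content
            elif key.startswith("og:"):
                # Store og:* properties without the prefix.
                self.record["ogTags"][key[3:]] = content
            elif key.startswith("twitter:"):
                self.record["twitterTags"][key[8:]] = content
        elif tag == "link":
            if (a.get("rel") or "").lower() == "canonical" and a.get("href"):
                self.record["canonical"] = urljoin(self.base_url, a["href"])
        elif tag == "img":
            # Fall back to the lazy-loading attribute when src is absent.
            src = a.get("src") or a.get("data-src")
            if src:
                self.record["images"].append(
                    {"url": urljoin(self.base_url, src), "alt": a.get("alt") or ""}
                )

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_page(html, base_url):
    """Parse one HTML document and return a record with empty fields dropped."""
    parser = PageMetaParser(base_url)
    parser.feed(html)
    parser.close()
    record = dict(parser.record)
    title = "".join(parser._title_parts).strip()
    if title:
        record["title"] = title
    return {key: value for key, value in record.items() if value}


SAMPLE = (
    '<html><head><title> Demo </title>'
    '<meta name="description" content="Hi">'
    '<meta property="og:title" content="OG">'
    '<link rel="canonical" href="/home"></head>'
    '<body><img data-src="/a.png" alt="A"></body></html>'
)
print(extract_page(SAMPLE, "https://example.com/x/"))
```

Relative URLs are resolved against the page URL with `urljoin`, which is one straightforward way to get the "absolutised" image URLs the list mentions.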
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `startUrl` | string (required) | `https://apify.com` | Root URL to research. |
| `crawlSitemap` | boolean | `true` | Parse `/sitemap.xml` to discover pages. |
| `followInternalLinks` | boolean | `true` | Follow internal `<a href>` links to extend the page set. |
| `maxPages` | integer | 20 (1–200) | Hard cap on pages researched. |
| `extractMedia` | boolean | `true` | Include image/video URLs in each page record. |
| `extractTechStack` | boolean | `true` | Run the Wappalyzer-style scan. |
| `userAgent` | string (optional) | (Chrome 131) | Override only if a server filters by UA. |
Example input
    {
      "startUrl": "https://apify.com",
      "crawlSitemap": true,
      "followInternalLinks": true,
      "maxPages": 30,
      "extractMedia": true,
      "extractTechStack": true
    }
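If you assemble the input programmatically, a small helper can apply the documented defaults and enforce the 1–200 range on `maxPages` before the run starts. The sketch below is an assumption about how you might do this on the client side; `build_input` is a hypothetical name, and the check that `startUrl` is an absolute http(s) URL is my own addition, not a documented rule.

```python
import json

# Defaults copied from the input table; userAgent is left unset on purpose.
DEFAULTS = {
    "crawlSitemap": True,
    "followInternalLinks": True,
    "maxPages": 20,
    "extractMedia": True,
    "extractTechStack": True,
}


def build_input(start_url, **overrides):
    """Return a run-input dict, raising ValueError on obviously bad values."""
    if not start_url.startswith(("http://", "https://")):
        # Assumption: the actor expects an absolute http(s) URL.
        raise ValueError("startUrl must be an absolute http(s) URL")
    unknown = set(overrides) - set(DEFAULTS) - {"userAgent"}
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    run_input = {"startUrl": start_url, **DEFAULTS, **overrides}
    max_pages = run_input["maxPages"]
    is_int = isinstance(max_pages, int) and not isinstance(max_pages, bool)
    if not is_int or not 1 <= max_pages <= 200:
        raise ValueError("maxPages must be an integer between 1 and 200")
    return run_input


print(json.dumps(build_input("https://apify.com", maxPages=30), indent=2))
```

Rejecting unknown keys early catches typos such as `maxPage` that would otherwise be silently ignored or fail later.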
Output
One record per researched page. Empty fields are omitted.
    {
      "url": "https://apify.com/",
      "title": "Apify · The full-stack web-scraping & automation platform",
      "description": "Apify is the all-in-one web scraping…",
      "canonical": "https://apify.com/",
      "ogTags": {
        "title": "Apify · The full-stack web-scraping & automation platform",
        "description": "Apify is the all-in-one web scraping…",
        "image": "https://apify.com/img/og-image.jpg",
        "type": "website"
      },
      "twitterTags": {
        "card": "summary_large_image",
        "site": "@apify"
      },
      "headings": {
        "h1": ["Build, deploy & monetize web scrapers and AI agents"],
        "h2": ["Trusted by 60,000+ developers", "Why Apify?"]
      },
      "jsonLdTypes": ["Organization", "WebSite"],
      "jsonLd": [ /* full ld+json blocks */ ],
      "images": [
        {"url": "https://apify.com/img/hero.png", "alt": "Apify hero"}
      ],
      "imageCount": 24,
      "videos": [],
      "techStack": ["Cloudflare", "Google Tag Manager", "Next.js", "React"],
      "seoSummary": {
        "titleLength": 58,
        "metaDescriptionLength": 142,
        "h1": "Build, deploy & monetize web scrapers and AI agents",
        "h1Count": 1,
        "wordCount": 1247,
        "imageCount": 24,
        "imagesWithoutAlt": 2,
        "internalLinkCount": 38,
        "externalLinkCount": 12,
        "pageSizeBytes": 78213,
        "hasCanonical": true,
        "hasStructuredData": true,
        "hasOgTags": true,
        "hasTwitterTags": true
      },
      "seoScores": {
        "metaTags": 100,
        "headings": 100,
        "images": 92,
        "links": 100,
        "structuredData": 100,
        "socialMeta": 100,
        "contentQuality": 100,
        "performance": 100,
        "technicalSeo": 100,
        "overallScore": 99,
        "overallGrade": "A",
        "issues": [
          {"category": "images", "message": "2/24 images missing alt text"}
        ],
        "issuesSummary": "1 issue(s) detected"
      },
      "discoveredVia": "start-url",
      "scrapedAt": "2024-12-16T14:23:11+00:00"
    }
Output fields
- `url`: page URL (absolute).
- `title` / `description`: `<title>` and `<meta name="description">`.
- `canonical`: `<link rel="canonical">` URL.
- `ogTags`: flat dict of `og:*` properties without the prefix.
- `twitterTags`: flat dict of `twitter:*` properties without the prefix.
- `headings`: `{h1: [...], h2: [...], h3: [...]}` with first 20 unique values per level.
- `jsonLd`: every parseable `application/ld+json` block (raw payloads).
- `jsonLdTypes`: sorted list of `@type` values across all blocks.
- `images` / `imageCount`: array of `{url, alt}` and total count.
- `videos` / `videoCount`: array of `{url, type?}` and total count.
- `techStack`: sorted list of detected stack tokens (CMS, frameworks, analytics, CDNs, web server).
- `seoSummary`: at-a-glance SEO metrics block: `titleLength`, `metaDescriptionLength`, `h1`, `h1Count`, `wordCount`, `imageCount`, `imagesWithoutAlt`, `internalLinkCount`, `externalLinkCount`, `pageSizeBytes`, plus presence flags `hasCanonical`, `hasStructuredData`, `hasOgTags`, `hasTwitterTags`.
- `seoScores`: 9 category scores, each 0–100 (`metaTags`, `headings`, `images`, `links`, `structuredData`, `socialMeta`, `contentQuality`, `performance`, `technicalSeo`), plus `overallScore` (0–100), `overallGrade` (A/B/C/D/F), and an `issues[]` array with concrete recommendations (e.g. `"Title length 18 chars (recommended 30-60)"`).
- `discoveredVia`: `"start-url"` (the starting page), `"sitemap"` (from `sitemap.xml`), or `"internal-link"` (BFS from another page).
- `scrapedAt`: ISO-8601 timestamp.
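As an illustration of working with these fields, the sketch below scans a list of output records and reports pages that have listed issues, lack a canonical URL, or score below a threshold. The function name `audit`, the `min_score` threshold, and the shape of the returned findings are all hypothetical choices for this example. Because empty fields are omitted from records, every lookup uses `.get` with a fallback.

```python
def audit(records, min_score=90):
    """Return findings for pages with issues, sorted by lowest score first."""
    findings = []
    for rec in records:
        scores = rec.get("seoScores", {})
        summary = rec.get("seoSummary", {})
        problems = [issue["message"] for issue in scores.get("issues", [])]
        # Only flag a missing canonical when the flag is explicitly false.
        if summary.get("hasCanonical") is False:
            problems.append("no canonical URL")
        score = scores.get("overallScore")
        if problems or (score is not None and score < min_score):
            findings.append({"url": rec["url"], "score": score, "problems": problems})
    # Pages without a score sort last.
    return sorted(findings, key=lambda f: (f["score"] is None, f["score"] or 0))


records = [
    {
        "url": "https://example.com/",
        "seoSummary": {"hasCanonical": True},
        "seoScores": {
            "overallScore": 99,
            "issues": [{"category": "images", "message": "2/24 images missing alt text"}],
        },
    },
    {
        "url": "https://example.com/pricing",
        "seoSummary": {"hasCanonical": False},
        "seoScores": {"overallScore": 70, "issues": []},
    },
]
for finding in audit(records):
    print(finding["url"], finding["score"], finding["problems"])
```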
Tech-stack signatures
The detector pattern-matches on HTML body and response headers. Coverage is intentionally narrow but high-confidence:
CMS / builders: WordPress, Shopify, Squarespace, Wix, Webflow, Drupal, Ghost.
SPA frameworks: Next.js, React, Vue (incl. Nuxt), Angular, Svelte.
Analytics / tag managers: Google Tag Manager, Google Analytics, HubSpot, Facebook Pixel, Amplitude, Segment.
Customer support: Intercom, Zendesk.
CDNs: Cloudflare (incl. via cf-ray header), Amazon CloudFront, Fastly, Akamai.
Web servers: nginx, Apache (via Server header).
Backends: Express, ASP.NET, PHP (via X-Powered-By).
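To make the matching approach concrete, here is a minimal sketch of a detector that checks the HTML body against regular expressions and inspects the `cf-ray`, `Server`, and `X-Powered-By` headers named above. The body patterns in `BODY_SIGNATURES` are illustrative assumptions, not the actor's real signature set, and `detect_stack` is a hypothetical function name.

```python
import re

# Illustrative body patterns only; the actor's real signatures are not published here.
BODY_SIGNATURES = {
    "WordPress": re.compile(r"/wp-(?:content|includes)/", re.I),
    "Shopify": re.compile(r"cdn\.shopify\.com", re.I),
    "Next.js": re.compile(r"/_next/static/|__NEXT_DATA__"),
    "Google Tag Manager": re.compile(r"googletagmanager\.com/gtm\.js", re.I),
}

# Substring of the X-Powered-By header -> reported token.
POWERED_BY = (("express", "Express"), ("asp.net", "ASP.NET"), ("php", "PHP"))


def detect_stack(html, headers):
    """Return a sorted list of stack tokens found in the body or headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    found = {name for name, pattern in BODY_SIGNATURES.items() if pattern.search(html)}

    if "cf-ray" in lowered:
        found.add("Cloudflare")

    server = lowered.get("server", "").lower()
    if "nginx" in server:
        found.add("nginx")
    if "apache" in server:
        found.add("Apache")

    powered = lowered.get("x-powered-by", "").lower()
    for needle, token in POWERED_BY:
        if needle in powered:
            found.add(token)

    return sorted(found)


html = '<script src="/_next/static/chunks/main.js"></script>'
headers = {"CF-Ray": "abc123", "Server": "nginx/1.25", "X-Powered-By": "Express"}
print(detect_stack(html, headers))
```

Header names are lower-cased before lookup because HTTP header names are case-insensitive.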
Use cases
- Competitor research: quickly fingerprint a competitor's tech stack and content structure.
- SEO audits: verify every page has a title, meta description, canonical URL, and Open Graph image.
- Sales enablement: extract pages tagged with specific JSON-LD `@type` (e.g. `Product`, `Article`, `Event`).
- Brand monitoring: pull every image and video URL for asset auditing.
- Lead enrichment: combine site title, description, and tech stack into a single CRM-ready record.
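Two of these use cases reduce to a few lines over the output records. The sketch below is one possible approach: `pages_with_type` selects pages by JSON-LD `@type`, and `crm_record` merges the start page's title and description with the tech stack seen across all pages. Both function names and the shape of the merged record are hypothetical.

```python
def pages_with_type(records, wanted):
    """Return URLs of pages whose jsonLdTypes include any of the wanted types."""
    wanted = set(wanted)
    return [rec["url"] for rec in records if wanted & set(rec.get("jsonLdTypes", []))]


def crm_record(records):
    """Merge page records into one site-level record (field names are illustrative)."""
    # Prefer the start page for title/description; fall back to the first record.
    home = next((r for r in records if r.get("discoveredVia") == "start-url"), records[0])
    stack = sorted({token for rec in records for token in rec.get("techStack", [])})
    return {
        "site": home["url"],
        "title": home.get("title"),
        "description": home.get("description"),
        "techStack": stack,
    }


records = [
    {"url": "https://example.com/", "title": "Example", "description": "Demo site",
     "discoveredVia": "start-url", "techStack": ["Cloudflare", "React"],
     "jsonLdTypes": ["Organization"]},
    {"url": "https://example.com/shoes", "discoveredVia": "sitemap",
     "techStack": ["Cloudflare", "Shopify"], "jsonLdTypes": ["Product"]},
]
print(pages_with_type(records, ["Product", "Event"]))
print(crm_record(records))
```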
FAQ
Does it need a proxy?
No. Public web pages are accessible from datacenter IPs. A few sites with aggressive WAFs may block requests; records for those pages will have an empty techStack and no images or videos.
Does it work on JavaScript-rendered (SPA) pages?
Partially. The actor sees the server-rendered HTML, not what runs after the page boots. For Next.js pages this is usually fine (Next.js SSRs). For pure CSR React/Vue apps, the meta tags are still visible but content arrays may be sparse.
How many pages does it crawl?
Up to maxPages (default 20). The discovery order is: start URL → sitemap pages → internal links from researched pages.
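The discovery order can be modelled as a single FIFO queue seeded with the start URL followed by the sitemap URLs, with internal links appended as pages are visited. The sketch below is an assumption about the scheduling, not the actor's source: `plan_crawl` is a hypothetical name, and `links_of` is a stand-in callable for "fetch the page and return its `<a href>` targets".

```python
from collections import deque
from urllib.parse import urlsplit


def plan_crawl(start_url, sitemap_urls, links_of, max_pages=20):
    """Return (url, discoveredVia) pairs in visit order, capped at max_pages."""
    host = urlsplit(start_url).netloc
    queue = deque([(start_url, "start-url")])
    queue.extend((url, "sitemap") for url in sitemap_urls)
    seen = set()
    visited = []

    while queue and len(visited) < max_pages:
        url, via = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        visited.append((url, via))
        # Internal links go to the back of the queue, so they are only
        # reached after the start URL and all sitemap pages.
        for link in links_of(url):
            if link not in seen and urlsplit(link).netloc == host:
                queue.append((link, "internal-link"))

    return visited


# Tiny in-memory link graph standing in for real HTTP fetches.
graph = {
    "https://a.com/": ["https://a.com/x", "https://b.com/", "https://a.com/s1"],
    "https://a.com/s1": ["https://a.com/y"],
}
for url, via in plan_crawl("https://a.com/", ["https://a.com/s1"], lambda u: graph.get(u, []), max_pages=4):
    print(via, url)
```

Passing `lambda u: []` as `links_of` corresponds to `followInternalLinks: false`, and an empty `sitemap_urls` list corresponds to `crawlSitemap: false`.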
Does it download images / video binaries?
No. It only collects URLs and metadata. Combine with a downloader actor for the bytes.
What if the site has no sitemap?
The actor falls back to internal-link BFS from the start URL. Set crawlSitemap: false to skip the probe entirely.
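A sitemap probe that tolerates a missing or malformed file might look like the sketch below. It distinguishes a `<sitemapindex>` (whose `<loc>` entries point at further sitemaps) from a `<urlset>` (whose `<loc>` entries are pages) and returns an empty list when nothing usable is found, which is the signal to fall back to link-following. The names `parse_sitemap` and `discover`, and the `fetch` callable, are hypothetical.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Return (kind, locs) where kind is 'urlset' or 'sitemapindex'."""
    root = ET.fromstring(xml_text)
    kind = _local(root.tag)
    if kind not in ("urlset", "sitemapindex"):
        raise ValueError(f"not a sitemap document: <{kind}>")
    locs = []
    for entry in root:  # each <url> or <sitemap> element
        for child in entry:
            if _local(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return kind, locs


def discover(fetch, sitemap_url, max_pages):
    """Collect page URLs; fetch(url) returns XML text or None when unavailable."""
    pages, pending, seen = [], [sitemap_url], set()
    while pending and len(pages) < max_pages:
        url = pending.pop(0)
        if url in seen:
            continue
        seen.add(url)
        text = fetch(url)
        if text is None:
            continue
        try:
            kind, locs = parse_sitemap(text)
        except (ET.ParseError, ValueError):
            continue  # malformed or unrelated XML: skip it
        if kind == "sitemapindex":
            pending.extend(locs)
        else:
            pages.extend(locs)
    return pages[:max_pages]


NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
docs = {
    "https://example.com/sitemap.xml":
        f"<sitemapindex {NS}><sitemap><loc>https://example.com/s1.xml</loc></sitemap></sitemapindex>",
    "https://example.com/s1.xml":
        f"<urlset {NS}><url><loc>https://example.com/a</loc></url></urlset>",
}
print(discover(docs.get, "https://example.com/sitemap.xml", 10))
```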
Why is jsonLd sometimes missing?
Many sites don't ship structured data. The output omits jsonLd and jsonLdTypes when zero parseable blocks are found.
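The omission rule can be illustrated with a short sketch that collects `application/ld+json` scripts, skips blocks that fail to parse, and returns an empty dict when nothing parseable remains. `JsonLdParser` and `extract_json_ld` are hypothetical names, and collecting `@type` values from nested objects is an assumption about how the flat type list is built.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the parsed payload of every application/ld+json script."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            raw, self._buffer = "".join(self._buffer), None
            try:
                self.blocks.append(json.loads(raw))
            except json.JSONDecodeError:
                pass  # unparseable blocks are dropped


def _collect_types(node, out):
    """Walk dicts and lists, gathering string @type values (nested included)."""
    if isinstance(node, dict):
        value = node.get("@type")
        if isinstance(value, str):
            out.add(value)
        elif isinstance(value, list):
            out.update(item for item in value if isinstance(item, str))
        for child in node.values():
            _collect_types(child, out)
    elif isinstance(node, list):
        for child in node:
            _collect_types(child, out)


def extract_json_ld(html):
    """Return {'jsonLd', 'jsonLdTypes'} or {} when no parseable block exists."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return {}
    types = set()
    _collect_types(parser.blocks, types)
    return {"jsonLd": parser.blocks, "jsonLdTypes": sorted(types)}


page = (
    '<script type="application/ld+json">{"@type": "Organization"}</script>'
    '<script type="application/ld+json">{broken</script>'
)
print(extract_json_ld(page))
```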