Site Researcher
Pricing: from $1.00 / 1,000 results
Developer: Crawler Bros (maintained by Community)
Extract structured intelligence from any website. For each page, the actor pulls the title, meta description, Open Graph tags, Twitter Cards, JSON-LD structured data, heading inventory, image and video URLs, and a tech-stack fingerprint. Discovers pages via sitemap-walk and/or internal-link BFS. HTTP-only — no proxy, no browser, no API key.
What it does
You give it a start URL. The actor:
- Optionally fetches `/sitemap.xml` (and follows sitemap-index files) to discover up to `maxPages` pages.
- Optionally follows internal `<a href>` links from each crawled page (BFS, same-host only) to fill the rest of the budget.
- Per page, extracts:
  - Title (`<title>`).
  - Description (`<meta name="description">`).
  - Canonical URL (`<link rel="canonical">`).
  - Open Graph tags: `og:title`, `og:description`, `og:image`, `og:type`, etc.
  - Twitter Cards: `twitter:card`, `twitter:site`, etc.
  - JSON-LD blocks: every `application/ld+json` script, plus a flat `jsonLdTypes` array of `@type` values.
  - Headings: first 20 unique values per level (h1 / h2 / h3).
  - Images: every `<img src>` (and `data-src` lazy variants), absolutised, with alt text.
  - Videos: every `<video src>` and `<source>` inside `<video>`.
  - Tech-stack: lightweight Wappalyzer-style fingerprint (WordPress, Shopify, Next.js, React, Vue, Angular, Webflow, Cloudflare, GTM, Google Analytics, HubSpot, Intercom, Zendesk, Facebook Pixel, etc.).
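The discovery steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actor's actual code: the function names (`parse_sitemap`, `same_host_links`, `discover`) and the injected `fetch_html` callable are hypothetical, and the ordering simply mirrors the documented one (start URL, then sitemap pages, then internal links).

```python
import xml.etree.ElementTree as ET
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


def _local(tag):
    """Drop the XML namespace from a tag name, if present."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Return (page_urls, child_sitemap_urls) for a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    locs = []
    for entry in root:              # <url> or <sitemap> entries
        for child in entry:         # only direct <loc> children
            if _local(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    if _local(root.tag) == "sitemapindex":
        return [], locs
    return locs, []


class _LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_host_links(page_url, html):
    """Absolute, fragment-free, same-host links in document order."""
    collector = _LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        absolute = urldefrag(urljoin(page_url, href)).url
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            if absolute not in seen:
                seen.add(absolute)
                links.append(absolute)
    return links


def discover(start_url, sitemap_urls, fetch_html, max_pages=20):
    """Visit the start URL, then sitemap pages, then BFS over internal links."""
    queue, queued = deque(), set()
    seeds = [(start_url, "start-url")] + [(u, "sitemap") for u in sitemap_urls]
    for url, via in seeds:
        if url not in queued:
            queued.add(url)
            queue.append((url, via))
    visited = []
    while queue and len(visited) < max_pages:
        url, via = queue.popleft()
        visited.append((url, via))
        for link in same_host_links(url, fetch_html(url)):
            if link not in queued:
                queued.add(link)
                queue.append((link, "internal-link"))
    return visited
```

Passing `fetch_html` in as a parameter keeps the sketch free of network code; in practice it would wrap whatever HTTP client is in use.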
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `startUrl` | string (required) | `https://apify.com` | Root URL to research. |
| `crawlSitemap` | boolean | `true` | Parse `/sitemap.xml` to discover pages. |
| `followInternalLinks` | boolean | `true` | Follow internal `<a href>` links to extend the page set. |
| `maxPages` | integer | `20` (1–200) | Hard cap on pages researched. |
| `extractMedia` | boolean | `true` | Include image/video URLs in each page record. |
| `extractTechStack` | boolean | `true` | Run the Wappalyzer-style scan. |
| `userAgent` | string (optional) | (Chrome 131) | Override only if a server filters by UA. |
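The defaults and bounds in the table can be captured in a small helper that builds a run input before it is submitted. This is a sketch only: `DEFAULTS` and `build_input` are hypothetical names, and the checks reflect just what the table states (a required `startUrl`, boolean flags, `maxPages` between 1 and 200).

```python
DEFAULTS = {
    "crawlSitemap": True,
    "followInternalLinks": True,
    "maxPages": 20,
    "extractMedia": True,
    "extractTechStack": True,
}
BOOLEAN_FIELDS = ("crawlSitemap", "followInternalLinks", "extractMedia", "extractTechStack")


def build_input(start_url, **overrides):
    """Merge overrides onto the documented defaults and check basic bounds."""
    if not isinstance(start_url, str) or not start_url.startswith(("http://", "https://")):
        raise ValueError("startUrl must be an absolute http(s) URL")
    unknown = set(overrides) - set(DEFAULTS) - {"userAgent"}
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")

    run_input = {"startUrl": start_url, **DEFAULTS, **overrides}

    max_pages = run_input["maxPages"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int):
        raise ValueError("maxPages must be an integer")
    if not 1 <= max_pages <= 200:
        raise ValueError("maxPages must be between 1 and 200")
    for key in BOOLEAN_FIELDS:
        if not isinstance(run_input[key], bool):
            raise ValueError(f"{key} must be a boolean")
    return run_input
```

For example, `build_input("https://apify.com", maxPages=30)` yields a dict with every flag at its default and the page cap raised to 30.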
Example input
{
  "startUrl": "https://apify.com",
  "crawlSitemap": true,
  "followInternalLinks": true,
  "maxPages": 30,
  "extractMedia": true,
  "extractTechStack": true
}
Output
One record per researched page. Empty fields are omitted.
{
  "url": "https://apify.com/",
  "title": "Apify · The full-stack web-scraping & automation platform",
  "description": "Apify is the all-in-one web scraping…",
  "canonical": "https://apify.com/",
  "ogTags": {
    "title": "Apify · The full-stack web-scraping & automation platform",
    "description": "Apify is the all-in-one web scraping…",
    "image": "https://apify.com/img/og-image.jpg",
    "type": "website"
  },
  "twitterTags": {
    "card": "summary_large_image",
    "site": "@apify"
  },
  "headings": {
    "h1": ["Build, deploy & monetize web scrapers and AI agents"],
    "h2": ["Trusted by 60,000+ developers", "Why Apify?"]
  },
  "jsonLdTypes": ["Organization", "WebSite"],
  "jsonLd": [ /* full ld+json blocks */ ],
  "images": [{"url": "https://apify.com/img/hero.png", "alt": "Apify hero"}],
  "imageCount": 24,
  "videos": [],
  "techStack": ["Cloudflare", "Google Tag Manager", "Next.js", "React"],
  "seoSummary": {
    "titleLength": 58,
    "metaDescriptionLength": 142,
    "h1": "Build, deploy & monetize web scrapers and AI agents",
    "h1Count": 1,
    "wordCount": 1247,
    "imageCount": 24,
    "imagesWithoutAlt": 2,
    "internalLinkCount": 38,
    "externalLinkCount": 12,
    "pageSizeBytes": 78213,
    "hasCanonical": true,
    "hasStructuredData": true,
    "hasOgTags": true,
    "hasTwitterTags": true
  },
  "seoScores": {
    "metaTags": 100,
    "headings": 100,
    "images": 92,
    "links": 100,
    "structuredData": 100,
    "socialMeta": 100,
    "contentQuality": 100,
    "performance": 100,
    "technicalSeo": 100,
    "overallScore": 99,
    "overallGrade": "A",
    "issues": [{"category": "images", "message": "2/24 images missing alt text"}],
    "issuesSummary": "1 issue(s) detected"
  },
  "discoveredVia": "start-url",
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}
Output fields
- `url`: page URL (absolute).
- `title` / `description`: `<title>` and `<meta name="description">`.
- `canonical`: `<link rel="canonical">` URL.
- `ogTags`: flat dict of `og:*` properties without the prefix.
- `twitterTags`: flat dict of `twitter:*` properties without the prefix.
- `headings`: `{h1: [...], h2: [...], h3: [...]}` with the first 20 unique values per level.
- `jsonLd`: every parseable `application/ld+json` block (raw payloads).
- `jsonLdTypes`: sorted list of `@type` values across all blocks.
- `images` / `imageCount`: array of `{url, alt}` and total count.
- `videos` / `videoCount`: array of `{url, type?}` and total count.
- `techStack`: sorted list of detected stack tokens (CMS, frameworks, analytics, CDNs, web server).
- `seoSummary`: at-a-glance SEO metrics block with `titleLength`, `metaDescriptionLength`, `h1`, `h1Count`, `wordCount`, `imageCount`, `imagesWithoutAlt`, `internalLinkCount`, `externalLinkCount`, `pageSizeBytes`, plus presence flags `hasCanonical`, `hasStructuredData`, `hasOgTags`, `hasTwitterTags`.
- `seoScores`: 9 category scores (0–100) named `metaTags`, `headings`, `images`, `links`, `structuredData`, `socialMeta`, `contentQuality`, `performance`, `technicalSeo`, plus `overallScore` (0–100), `overallGrade` (A/B/C/D/F), and an `issues[]` array with concrete recommendations (e.g. `"Title length 18 chars (recommended 30-60)"`).
- `discoveredVia`: `"start-url"` (the starting page), `"sitemap"` (from `sitemap.xml`), or `"internal-link"` (BFS from another page).
- `scrapedAt`: ISO-8601 timestamp.
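To make the "flat dict without the prefix" shape of `ogTags` and `twitterTags` concrete, here is a minimal sketch built on Python's standard `html.parser`. The class and function names are hypothetical, and keeping the first value when a property repeats is an assumption; the actor may resolve duplicates differently.

```python
from html.parser import HTMLParser


class SocialMetaParser(HTMLParser):
    """Collect og:* and twitter:* meta tags into flat dicts, prefix removed."""

    def __init__(self):
        super().__init__()
        self.og_tags = {}
        self.twitter_tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Open Graph uses property=, Twitter Cards usually use name=.
        key = (attr.get("property") or attr.get("name") or "").strip().lower()
        content = attr.get("content")
        if not content:
            return
        if key.startswith("og:"):
            # Assumption: the first occurrence of a property wins.
            self.og_tags.setdefault(key[len("og:"):], content)
        elif key.startswith("twitter:"):
            self.twitter_tags.setdefault(key[len("twitter:"):], content)


def social_tags(html):
    """Return a partial page record; empty groups are left out entirely."""
    parser = SocialMetaParser()
    parser.feed(html)
    record = {}
    if parser.og_tags:
        record["ogTags"] = parser.og_tags
    if parser.twitter_tags:
        record["twitterTags"] = parser.twitter_tags
    return record
```

Leaving out empty groups matches the rule that empty fields are omitted from each record.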
Tech-stack signatures
The detector pattern-matches on HTML body and response headers. Coverage is intentionally narrow but high-confidence:
CMS / builders: WordPress, Shopify, Squarespace, Wix, Webflow, Drupal, Ghost.
SPA frameworks: Next.js, React, Vue (incl. Nuxt), Angular, Svelte.
Analytics / tag managers: Google Tag Manager, Google Analytics, HubSpot, Facebook Pixel, Amplitude, Segment.
Customer support: Intercom, Zendesk.
CDNs: Cloudflare (incl. via cf-ray header), Amazon CloudFront, Fastly, Akamai.
Web servers: nginx, Apache (via Server header).
Backends: Express, ASP.NET, PHP (via X-Powered-By).
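Pattern-matching on the body and headers can be pictured roughly as follows. The signature tables are illustrative assumptions based on widely known markers (for example `/wp-content/` paths for WordPress or the `cf-ray` header for Cloudflare); they are not the actor's real rule set, and `detect_tech_stack` is a hypothetical name.

```python
import re

# Illustrative body markers only; a real detector would carry many more.
BODY_SIGNATURES = {
    "WordPress": re.compile(r"/wp-content/|/wp-includes/", re.I),
    "Shopify": re.compile(r"cdn\.shopify\.com", re.I),
    "Next.js": re.compile(r"__NEXT_DATA__|/_next/static/", re.I),
    "Google Tag Manager": re.compile(r"googletagmanager\.com/gtm\.js", re.I),
}

# (token, header name, value pattern); a pattern of None means presence is enough.
HEADER_SIGNATURES = [
    ("Cloudflare", "cf-ray", None),
    ("Cloudflare", "server", re.compile(r"cloudflare", re.I)),
    ("nginx", "server", re.compile(r"nginx", re.I)),
    ("Apache", "server", re.compile(r"apache", re.I)),
    ("Express", "x-powered-by", re.compile(r"express", re.I)),
    ("PHP", "x-powered-by", re.compile(r"php", re.I)),
    ("ASP.NET", "x-powered-by", re.compile(r"asp\.net", re.I)),
]


def detect_tech_stack(html, headers):
    """Return a sorted list of stack tokens matched in the body or headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    found = set()
    for token, pattern in BODY_SIGNATURES.items():
        if pattern.search(html):
            found.add(token)
    for token, header, pattern in HEADER_SIGNATURES:
        value = lowered.get(header)
        if value is not None and (pattern is None or pattern.search(value)):
            found.add(token)
    return sorted(found)
```

Collecting matches in a set and sorting at the end gives the same deduplicated, sorted shape as the `techStack` output field.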
Use cases
- Competitor research: quickly fingerprint a competitor's tech stack and content structure.
- SEO audits: verify every page has a title, meta description, canonical URL, and Open Graph image.
- Sales enablement: extract pages tagged with specific JSON-LD `@type` (e.g. `Product`, `Article`, `Event`).
- Brand monitoring: pull every image and video URL for asset auditing.
- Lead enrichment: combine site title, description, and tech stack into a single CRM-ready record.
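Two of these use cases reduce to short filters over the output records. The sketch below assumes the records have already been loaded as a list of dicts in the documented shape; `pages_with_type` and `audit_gaps` are hypothetical helper names. Because empty fields are omitted from the output, every lookup goes through `.get()`.

```python
def pages_with_type(records, wanted_type):
    """URLs of pages whose JSON-LD declares the given @type."""
    return [r["url"] for r in records if wanted_type in r.get("jsonLdTypes", [])]


def audit_gaps(records):
    """Map each URL to the basic SEO fields it lacks (pages with no gaps are skipped)."""
    gaps = {}
    for record in records:
        missing = [
            field
            for field in ("title", "description", "canonical")
            if not record.get(field)
        ]
        if not record.get("ogTags", {}).get("image"):
            missing.append("ogTags.image")
        if missing:
            gaps[record["url"]] = missing
    return gaps
```

For instance, `pages_with_type(records, "Product")` lists product pages, and `audit_gaps(records)` returns only the pages that need attention.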
FAQ
Does it need a proxy?
No. Public web pages are accessible from datacenter IPs. A few sites with aggressive WAFs may block the requests; those pages fall through with an empty techStack and missing images / videos.
Does it work on JavaScript-rendered (SPA) pages?
Partially. The actor sees the server-rendered HTML, not what runs after the page boots. For Next.js pages this is usually fine, because Next.js server-side renders. For pure client-side-rendered React/Vue apps, the meta tags are still visible but content arrays may be sparse.
How many pages does it crawl?
Up to maxPages (default 20). The discovery order is: start URL → sitemap pages → internal links from researched pages.
Does it download images / video binaries?
No. It only collects URLs and metadata. Combine with a downloader actor for the bytes.
What if the site has no sitemap?
The actor falls back to internal-link BFS from the start URL. Set crawlSitemap: false to skip the probe entirely.
Why is jsonLd sometimes missing?
Many sites don't ship structured data. The output omits jsonLd and jsonLdTypes when zero parseable blocks are found.