Company Deep Research — AI Agent Dossier API
Pricing
from $15.00 / 1,000 results
One-call company intelligence dossier for AI agents: website meta, tech stack, socials, contacts, SEO basics, news signals, competitors and blog/RSS from a company name or domain. No API key, no browser.
Developer: Logiover
Maintained by Community
Company Deep Research Scraper — AI Agent Company Intelligence API
One-call company intelligence dossier for AI agents, LLM apps, sales/rev-tech and market researchers. Drop in a company name or domain and get back a clean, structured JSON record with the website meta, tech stack, social profiles, contact channels, SEO basics, recent news signals, competitor domains and RSS/blog feeds — all from a single Apify Actor run. No API key, no headless browser, no per-site scrapers to maintain.
Built for the new wave of AI agents that research companies autonomously — coding agents that need to understand a vendor before integrating, sales agents that qualify accounts, analyst agents that build market maps, and RAG pipelines that ground LLM answers in fresh, structured company data instead of stale training-set knowledge.
🎯 What this Actor is for
Large language models have a knowledge cutoff and no live web access. When an AI agent is asked "research Acme Corp" or "who are Notion's competitors and what stack do they run?", it needs structured, fresh, machine-readable company data — not a pile of raw HTML to re-parse every time. company-deep-research-scraper is that data layer:
- One call, one dossier. Pass `stripe.com` and get back a single JSON object with ~40 fields covering identity, tech, socials, contacts, SEO, news and competitors. No need to chain five separate scrapers.
- AI-agent friendly schema. Predictable field names, nullable values, arrays where multiple values exist, ISO timestamps. Drop straight into a prompt or a vector DB.
- Batch by default. Feed 500 domains and get 500 dossiers back, enriched in parallel: perfect for building a company database or enriching a CRM.
- Name → domain resolution. Don't know the website? Pass `"Figma"` or `"Linear"` and the Actor resolves it to `figma.com` / `linear.app` via Clearbit's company autocomplete, then enriches.
- No keys, no browser. Pure HTTP + lightweight HTML parsing on a small Node 20 container. Cheap, fast, and resilient, with no Playwright to keep warm.
✨ Key features
- 🧠 Company identity: name, description, logo, favicon, OG metadata, site name, detected languages.
- 🛠️ Tech stack fingerprinting: 40+ technologies detected from HTML + HTTP headers, including Next.js, React, Vue, Angular, Svelte, Nuxt, Gatsby, Astro, Remix, WordPress, Shopify, Webflow, Squarespace, Wix, Drupal, Joomla, Ghost, jQuery, Tailwind, Bootstrap, Cloudflare, Vercel, Netlify, Google Analytics, GTM, Facebook Pixel, Hotjar, HubSpot, Intercom, Segment, Mixpanel, Stripe, PayPal, Sentry, Datadog, Algolia, Elasticsearch, Font Awesome, Google Fonts, and more.
- 🔗 Social profiles: LinkedIn, X/Twitter, GitHub, Facebook, Instagram, YouTube, TikTok, Discord, Telegram URLs extracted from the homepage + `schema.org` `sameAs`.
- ✉️ Contact channels: emails (from `mailto:` + text regex, junk filtered) and phone numbers (from `tel:` + international regex).
- 📈 SEO basics: title tag, meta description, H1/H2 headings, `robots.txt`, discovered sitemaps.
- 📰 News signals: recent news for the company (last 30 days) via Google News RSS, with title, URL, publish date and source.
- ⚔️ Competitor discovery: alternative/competitor domains mined from DuckDuckGo result pages (junk domains filtered).
- 📡 RSS / blog feeds: `<link rel=alternate>` feed tags, blog links and common feed-path guesses (`/feed`, `/rss`, `/blog/feed`).
- 🏢 Organization JSON-LD: `schema.org` `Organization` / `Corporation` parsed for founded year, employee range, industry, country, city.
- 🌐 Proxy-aware: Apify datacenter proxy by default (avoids per-IP rate limits), optional residential for bot-walled targets.
- 💰 Pay-per-result: you're charged per company dossier produced, not per run. Empty/blocked results are not billed.
🤖 Why AI agents need this
Modern agentic workflows (Claude, ChatGPT, Cursor, LangGraph, CrewAI, AutoGen, Microsoft Copilot, Google Gemini) increasingly delegate research to tools. The bottleneck is no longer reasoning — it's grounding: getting current, structured facts about real-world entities. Company research is one of the highest-frequency grounding tasks:
- Vendor / tool evaluation. A coding agent is asked to pick a payment provider. It needs to know each candidate's website, tech stack, docs URL and recent news — not just the name.
- Sales account research. A GTM agent enriches a lead list with industry, employee size, tech stack, LinkedIn and a competitor set before drafting outreach.
- Market mapping. An analyst agent builds a landscape of 200 companies in a niche and needs structured rows it can cluster, filter and rank.
- RAG grounding. A support agent answers "does Acme integrate with Stripe?" by pulling Acme's tech stack dossier instead of guessing from training data.
- Due diligence. A researcher agent checks a startup's footprint: founded year, location, social reach, news momentum, competitors.
Each of these is one Actor call per company. Batch a list and you've got a database.
📦 What you get (output schema)
Every run streams one dossier per company to the default dataset. A dossier looks like:
{
  "input": "stripe.com",
  "resolvedDomain": "stripe.com",
  "websiteUrl": "https://stripe.com/",
  "companyName": "Stripe",
  "title": "Stripe | Financial Infrastructure to Grow Your Revenue",
  "description": "Stripe is a suite of APIs powering online payment processing...",
  "logo": "https://stripe.com/img/v3/home/twitter.png",
  "favicon": "https://www.google.com/s2/favicons?domain=stripe.com&sz=64",
  "languages": ["en-US"],
  "ogType": "website",
  "techStack": ["Next.js", "React", "Cloudflare", "Stripe", "Google Analytics", "Segment", "Sentry", "Google Fonts"],
  "socials": [
    { "platform": "twitter", "url": "https://twitter.com/stripe" },
    { "platform": "linkedin", "url": "https://www.linkedin.com/company/stripe" },
    { "platform": "github", "url": "https://github.com/stripe" }
  ],
  "emails": ["support@stripe.com"],
  "phones": [],
  "linkedinUrl": "https://www.linkedin.com/company/stripe",
  "twitterUrl": "https://twitter.com/stripe",
  "githubUrl": "https://github.com/stripe",
  "facebookUrl": null,
  "instagramUrl": null,
  "youtubeUrl": "https://www.youtube.com/stripe",
  "employeesRange": "5001-10000",
  "foundedYear": 2010,
  "industry": "Fintech",
  "country": "US",
  "city": "South San Francisco",
  "seoTitleTag": "Stripe | Financial Infrastructure to Grow Your Revenue",
  "seoMetaDescription": "Millions of businesses of all sizes...",
  "seoHeadings": ["Payments", "Online payments", "In-person payments"],
  "robotsTxt": "User-agent: *\nDisallow: /...",
  "sitemapUrls": ["https://stripe.com/sitemap.xml"],
  "news": [
    { "title": "Stripe raises new round...", "url": "https://...", "publishedAt": "2026-06-...", "source": "TechCrunch" }
  ],
  "competitors": ["paypal.com", "adyen.com", "braintreepayments.com", "square.com"],
  "rssFeeds": ["https://stripe.com/blog/feed"],
  "blogUrl": "https://stripe.com/blog",
  "httpsValid": true,
  "scrapedAt": "2026-07-02T12:00:00.000Z"
}
Use the Overview view to scan many companies, the Social & contact view for outreach lists, and the News signals view for monitoring.
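Because the schema is flat and nullable, a dossier can be turned into a short text chunk for a prompt or an embedding model with very little code. The following is a minimal sketch, not part of the Actor: `dossierToChunk` is a hypothetical helper, and the choice of fields and labels is an assumption based on the sample dossier above.

```javascript
// Hypothetical helper: flatten one dossier into a short text chunk.
// Field names follow the sample dossier; null, missing and empty values are skipped.
function dossierToChunk(dossier) {
  const lines = [];
  const add = (label, value) => {
    // Skip null/undefined, empty strings and empty arrays.
    if (value == null || value === '' || (Array.isArray(value) && value.length === 0)) return;
    lines.push(`${label}: ${Array.isArray(value) ? value.join(', ') : value}`);
  };

  add('Company', dossier.companyName ?? dossier.resolvedDomain);
  add('Website', dossier.websiteUrl);
  add('Description', dossier.description);
  add('Industry', dossier.industry);
  add('Employees', dossier.employeesRange);
  add('Founded', dossier.foundedYear);
  add('Tech stack', dossier.techStack);
  add('Headings', dossier.seoHeadings);
  add('Competitors', dossier.competitors);
  return lines.join('\n');
}

// Usage with a trimmed-down dossier.
const sampleDossier = {
  companyName: 'Stripe',
  websiteUrl: 'https://stripe.com/',
  industry: 'Fintech',
  foundedYear: 2010,
  techStack: ['Next.js', 'React'],
  phones: [],
  facebookUrl: null,
};
console.log(dossierToChunk(sampleDossier));
```

Skipping empty values keeps the chunk short, so sections you did not request simply drop out of the text.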
🚀 How to use
1. Enrich a batch of domains (highest volume)
{
  "mode": "domain",
  "domains": ["stripe.com", "linear.app", "figma.com", "notion.so", "vercel.com"],
  "sections": ["meta", "techStack", "socials", "contacts", "seo"],
  "concurrency": 5
}
2. Resolve company names → websites → dossiers
{
  "mode": "name",
  "companyNames": ["Notion", "Figma", "Linear", "Vercel", "Supabase"],
  "sections": ["meta", "techStack", "socials", "contacts", "news", "competitors"],
  "concurrency": 4
}
3. Deep single dossier (all sections)
{
  "mode": "single",
  "domain": "openai.com",
  "maxNews": 15,
  "maxCompetitors": 10
}
From code (Apify JavaScript client)
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.actor('logiover/company-deep-research-scraper').call({
  mode: 'domain',
  domains: ['stripe.com', 'linear.app'],
  sections: ['meta', 'techStack', 'socials', 'contacts'],
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items); // array of company dossiers
As an MCP tool for AI agents
This Actor pairs naturally with an MCP server that wraps Apify. An agent calls the tool with a company name or domain and receives the structured dossier in its context — no browsing, no HTML parsing on the agent side.
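As a rough illustration of the agent-facing side, the sketch below pairs a tool descriptor (shaped like an MCP tool definition with `name`, `description` and `inputSchema`) with a function that maps the agent's argument onto the Actor input modes documented on this page. The tool name `research_company`, the `company` argument and the domain-detection heuristic are all assumptions made for this example, not part of the Actor or of any particular MCP server.

```javascript
// Hypothetical tool descriptor, shaped like an MCP tool definition.
const researchCompanyTool = {
  name: 'research_company',
  description: 'Return a structured company dossier for a company name or domain.',
  inputSchema: {
    type: 'object',
    properties: {
      company: { type: 'string', description: 'Company name or domain, e.g. "Figma" or "figma.com"' },
    },
    required: ['company'],
  },
};

// Heuristic (assumption): a value with a dot and no whitespace is treated as a domain.
function looksLikeDomain(value) {
  return /^[^\s]+\.[^\s.]+$/.test(value.trim());
}

// Map the tool arguments to Actor input, using the documented `single` and `name` modes.
function toActorInput(args) {
  const company = String(args.company ?? '').trim();
  if (!company) throw new Error('company is required');
  return looksLikeDomain(company)
    ? { mode: 'single', domain: company }
    : { mode: 'name', companyNames: [company] };
}

console.log(researchCompanyTool.name, toActorInput({ company: 'Figma' }));
```

The heuristic is deliberately simple; a name that happens to contain a dot and no spaces would be routed to `single` mode, so a stricter check may be worth adding for messy inputs.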
🔧 Input fields
| Field | Type | Default | Description |
|---|---|---|---|
| `mode` | enum | `domain` | `domain` (batch enrich), `name` (resolve names → website → enrich), `single` (one deep dossier). |
| `domains` | array | - | Domains/URLs for `domain`/`single` modes. Normalized (protocol + `www` stripped). |
| `domain` | string | - | Single domain for `single` mode. |
| `companyNames` | array | - | Free-text company names for `name` mode (resolved via Clearbit autocomplete). |
| `sections` | array | all | Which dossier sections to collect: `meta`, `techStack`, `socials`, `contacts`, `seo`, `news`, `competitors`, `rss`. Fewer = faster. |
| `maxNews` | int | 10 | Max news items per company (0–50). |
| `maxCompetitors` | int | 10 | Max competitor domains per company (0–30). |
| `concurrency` | int | 5 | Parallel companies in batch modes (1–20). |
| `useApifyProxy` | bool | true | Route through Apify datacenter proxy. Recommended. |
| `proxyGroups` | array | - | Override proxy group, e.g. `["RESIDENTIAL"]` for bot-walled targets. |
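
When input is assembled programmatically, it can help to deduplicate domains and keep numeric fields inside the documented ranges before starting a run. This is a minimal client-side sketch under stated assumptions: `normalizeDomain`, `clamp` and `buildDomainInput` are hypothetical helpers, and stripping the path is a choice made here (the table above only mentions protocol and `www`). The Actor performs its own normalization regardless.

```javascript
// Hypothetical helper: "https://www.Stripe.com/pricing" becomes "stripe.com".
function normalizeDomain(value) {
  return value
    .trim()
    .toLowerCase()
    .replace(/^[a-z][a-z0-9+.-]*:\/\//, '') // protocol
    .replace(/^www\./, '')                  // leading www
    .replace(/[/?#].*$/, '');               // path, query, fragment
}

// Keep an integer inside [min, max].
const clamp = (n, min, max) => Math.min(max, Math.max(min, Math.trunc(n)));

// Build a `domain` mode input; ranges come from the input table above.
function buildDomainInput(rawDomains, options = {}) {
  const domains = [...new Set(rawDomains.map(normalizeDomain).filter(Boolean))];
  return {
    mode: 'domain',
    domains,
    // Only send `sections` when the caller chose some; otherwise the Actor default (all) applies.
    ...(options.sections ? { sections: options.sections } : {}),
    maxNews: clamp(options.maxNews ?? 10, 0, 50),
    maxCompetitors: clamp(options.maxCompetitors ?? 10, 0, 30),
    concurrency: clamp(options.concurrency ?? 5, 1, 20),
  };
}

console.log(buildDomainInput(['https://www.Stripe.com/pricing', 'stripe.com'], { concurrency: 50 }));
```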
🧩 How it works
- Resolve. In `name` mode, the company name is sent to Clearbit's keyless company autocomplete API, which returns the canonical domain (`"Notion"` → `notion.so`).
- Fetch homepage. The website root is fetched over the Apify proxy with a browser-like User-Agent, following redirects. The final URL, HTTP headers and HTML body are captured.
- Parse meta. `<title>`, meta description, Open Graph / Twitter Card tags, `<html lang>`, `hreflang` alternates, favicon and apple-touch-icon are extracted from the `<head>`.
- Detect tech stack. A library of 40+ regex fingerprints runs against the HTML, script `src` attributes, stylesheet `href`s and HTTP response headers (e.g. `cf-ray` → Cloudflare, `server: vercel` → Vercel, `wp-content` → WordPress). The `generator` meta tag is also checked.
- Extract socials. Link `href`s and `schema.org` `sameAs` arrays are matched against LinkedIn, X/Twitter, GitHub, Facebook, Instagram, YouTube, TikTok, Discord and Telegram patterns.
- Extract contacts. `mailto:` links and inline email regex (junk/image/sentry domains filtered), plus `tel:` links and international phone regex.
- SEO basics. Title tag, meta description, H1/H2 text. The `/robots.txt` is fetched and parsed for `Sitemap:` directives.
- JSON-LD organization. `<script type="application/ld+json">` blocks are parsed; `Organization` / `Corporation` blocks fill in name, founded year, employee range, industry, country, city and additional `sameAs` socials.
- News signals. Google News RSS is queried with the company name (last 30 days, US locale) and parsed into title/url/date/source items.
- Competitors. DuckDuckGo's HTML endpoint is queried with `"<company> alternatives competitors"`; result domains are extracted, junk (Wikipedia, YouTube, social) filtered, and the target's own domain excluded.
- RSS / blog. `<link rel=alternate type=application/rss+xml>` tags, blog links and common feed paths (`/feed`, `/rss`, `/blog/feed`) are collected.
- Stream. The complete dossier is pushed to the dataset and one `result` event is charged per company.
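
To make the tech-stack detection step more concrete, here is a minimal sketch of regex fingerprinting over HTML and response headers. It is not the Actor's actual fingerprint library: the `FINGERPRINTS` table and `detectTechStack` function are hypothetical, and the three rules only mirror the examples named above (`cf-ray`, `server: vercel`, `wp-content`).

```javascript
// Illustrative fingerprints only; a real library would hold many more rules.
const FINGERPRINTS = [
  { tech: 'Cloudflare', header: 'cf-ray', pattern: /./ },      // any non-empty cf-ray header
  { tech: 'Vercel', header: 'server', pattern: /vercel/i },    // server: vercel
  { tech: 'WordPress', html: /wp-content/i },                  // asset paths in the markup
];

function detectTechStack(html, headers = {}) {
  // Lowercase header names so lookups are case-insensitive.
  const normalized = Object.fromEntries(
    Object.entries(headers).map(([key, value]) => [key.toLowerCase(), String(value)]),
  );
  const found = new Set();
  for (const rule of FINGERPRINTS) {
    if (rule.header && rule.pattern.test(normalized[rule.header] ?? '')) found.add(rule.tech);
    if (rule.html && rule.html.test(html)) found.add(rule.tech);
  }
  return [...found];
}

console.log(detectTechStack('<link href="/wp-content/themes/x.css">', { 'CF-Ray': '8a1b2c' }));
```

Each rule is checked independently, which is why this style of detection only sees signals that reach the browser: markup, asset paths and response headers.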
💡 Tips & best practices
- Batch for cost efficiency. `domain` mode with 5–10 concurrency is the sweet spot. A run of 100 companies typically completes in a few minutes.
- Trim sections for speed. If you only need tech stack + socials, set `sections: ["meta","techStack","socials"]`. News and competitors add extra HTTP calls per company.
- Use `name` mode for messy lead lists. Got a column of company names from a form? `name` mode cleans them into domains and enriches in one step.
- Residential proxy for tough targets. If a site blocks datacenter IPs (rare for homepages, common for some SaaS), set `proxyGroups: ["RESIDENTIAL"]`.
- Pipe to a vector DB. Serialize each dossier's description + tech stack + headings into a text chunk and embed it. Now your agent can semantically search companies.
- Schedule recurring runs. News and competitors change weekly. Schedule a weekly run over your watchlist and diff the datasets to track movement.
- Combine with related Actors. Pair with `website-contact-scraper`, `website-tech-stack-detector` and `linkedin-company-scraper` for deeper enrichment on a shortlist.
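
For the recurring-run tip, diffing two datasets can be as simple as comparing the `competitors` arrays per company. The sketch below assumes you have already loaded the items of two runs into arrays; `diffCompetitors` is a hypothetical helper, and keying on `resolvedDomain` is an assumption based on the sample dossier.

```javascript
// Hypothetical helper: report competitor changes per company between two runs.
function diffCompetitors(previousItems, currentItems) {
  // Index the earlier run by domain for quick lookup.
  const previousByDomain = new Map(
    previousItems.map((d) => [d.resolvedDomain, new Set(d.competitors ?? [])]),
  );
  const changes = [];
  for (const dossier of currentItems) {
    const before = previousByDomain.get(dossier.resolvedDomain) ?? new Set();
    const after = new Set(dossier.competitors ?? []);
    const added = [...after].filter((c) => !before.has(c));
    const removed = [...before].filter((c) => !after.has(c));
    // Only report companies whose competitor set actually changed.
    if (added.length || removed.length) {
      changes.push({ domain: dossier.resolvedDomain, added, removed });
    }
  }
  return changes;
}

const lastWeek = [{ resolvedDomain: 'stripe.com', competitors: ['paypal.com', 'adyen.com'] }];
const thisWeek = [{ resolvedDomain: 'stripe.com', competitors: ['paypal.com', 'square.com'] }];
console.log(diffCompetitors(lastWeek, thisWeek));
```

The same pattern works for `news` if you compare item URLs instead of domains.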
❓ FAQ
Does this Actor need any API keys?
No. It uses only keyless public endpoints (Clearbit company autocomplete, Google News RSS, DuckDuckGo HTML) and direct HTTP fetches of company websites. Just an Apify account.
How accurate is the tech stack detection?
The fingerprint library covers 40+ of the most common technologies. It detects client-side signals (script paths, framework markers) and server headers. It will miss server-only tech (databases, backend languages) and deeply bundled apps — but for vendor reconnaissance and quick triage it's reliable. For exhaustive Wappalyzer-grade detection, pair with website-tech-stack-detector.
Will it work on sites behind Cloudflare/login?
Homepages rarely block. If a target returns 403/503, the dossier is still produced with the fields that could be collected and the domain recorded. For systematically walled targets, enable residential proxy.
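One way to act on partial dossiers is to collect the domains that look blocked and re-run only those with the residential proxy group. This is a sketch under assumptions: this page does not document an explicit status field, so `looksBlocked` uses a guessed heuristic (no `title` and an empty `techStack`, which presumes the `meta` and `techStack` sections were requested). Both helpers are hypothetical.

```javascript
// Heuristic (assumption): a dossier with no title and no detected tech was probably blocked.
function looksBlocked(dossier) {
  const noTitle = !dossier.title;
  const noTech = !Array.isArray(dossier.techStack) || dossier.techStack.length === 0;
  return noTitle && noTech;
}

// Build a follow-up input for the blocked domains, or null when nothing needs a retry.
function buildRetryInput(items) {
  const domains = items
    .filter(looksBlocked)
    .map((d) => d.resolvedDomain ?? d.input)
    .filter(Boolean);
  if (domains.length === 0) return null;
  return { mode: 'domain', domains, useApifyProxy: true, proxyGroups: ['RESIDENTIAL'] };
}

const runItems = [
  { input: 'a.example', resolvedDomain: 'a.example', title: 'A', techStack: ['React'] },
  { input: 'b.example', resolvedDomain: 'b.example', title: null, techStack: [] },
];
console.log(buildRetryInput(runItems));
```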
How many companies can I enrich per run?
Practically unlimited — the Actor streams results and charges per dossier. Concurrency is capped at 20 to be polite. A few hundred per run is comfortable; for thousands, split into batches.
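Splitting a large list into batches needs only a small helper. The sketch below is illustrative: `chunk` and `toBatchInputs` are hypothetical names, and the default batch size of 300 is an assumption chosen to match "a few hundred per run".

```javascript
// Split a list into consecutive slices of at most `size` items.
function chunk(list, size) {
  if (!Number.isInteger(size) || size < 1) throw new RangeError('size must be a positive integer');
  const batches = [];
  for (let i = 0; i < list.length; i += size) batches.push(list.slice(i, i + size));
  return batches;
}

// One Actor input per batch, in the documented `domain` mode.
function toBatchInputs(domains, batchSize = 300) {
  return chunk(domains, batchSize).map((batch) => ({ mode: 'domain', domains: batch }));
}

const manyDomains = Array.from({ length: 1000 }, (_, i) => `company-${i}.example`);
console.log(toBatchInputs(manyDomains).map((input) => input.domains.length));
```

Each input can then be passed to a separate run, sequentially or with a small amount of parallelism.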
What's the difference between domain and single mode?
domain is batch-oriented (many companies, lean sections). single is one company with all sections at full depth. The output schema is identical.
Does it find personal emails of employees?
No. It collects contact emails published on the company homepage (support, info, press). For employee or personal emails, use a dedicated Actor such as linkedin-profile-scraper or youtube-creator-email-finder.
Can I get historical news?
The news section covers the last 30 days via Google News RSS. For deeper history, combine with google-news-scraper or archive.org.
How is this priced?
Pay-per-result: you're charged one result event per company dossier produced. Runs that yield zero companies (bad input) are free.
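For budgeting, the arithmetic is one multiplication. The sketch below uses the listed starting price of $15.00 per 1,000 results as its default; because that is a "from" price, the rate that applies to your account may differ, so treat the constant as an assumption.

```javascript
// Listed starting price; the actual rate may differ (assumption).
const PRICE_PER_1000_RESULTS_USD = 15.0;

// Estimated cost in USD for a number of billed dossiers.
function estimateCostUsd(dossierCount, pricePer1000 = PRICE_PER_1000_RESULTS_USD) {
  return (dossierCount * pricePer1000) / 1000;
}

console.log(estimateCostUsd(500).toFixed(2)); // "7.50"
```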
Is the output schema stable?
Yes. Fields are additive — new sections may add fields, but existing field names and types won't change within a major version. Nullable fields are marked.
Can AI agents call this directly?
Yes. Expose it through an MCP server or an Apify tool integration; the agent passes a company name/domain and gets structured JSON back. This is the primary design target.
🔗 Related Actors
- website-tech-stack-detector — deeper, Wappalyzer-style tech detection on a single site.
- website-contact-scraper — emails, phones, socials, addresses from a whole site crawl.
- linkedin-company-scraper — LinkedIn company page (size, industry, specialties).
- clutch-co-scraper / goodfirms-scraper — B2B agency profiles & reviews.
- y-combinator-companies-directory-scraper — YC startup directory.
- bulk-social-profile-extractor — pull social profiles from a list of URLs.
- subdomain-finder — find subdomains for a company domain (recon).
- news-intelligence-scraper — topic-level multi-source news + sentiment.
📝 Changelog
2026-07-02 — v1.0
- Initial release.
- 3 modes: `domain` (batch), `name` (resolve → enrich), `single` (deep).
- 8 dossier sections: meta, techStack, socials, contacts, seo, news, competitors, rss.
- 40+ tech-stack fingerprints.
- schema.org `Organization` JSON-LD parsing.
- Apify datacenter proxy by default, residential opt-in.
- Pay-per-result (`result` event per dossier).
⚖️ Disclaimer
This Actor fetches publicly available web pages and keyless public APIs. It does not authenticate, bypass access controls, or scrape behind logins. Respect each website's Terms of Service and robots.txt. Use for research, sales intelligence and AI-agent grounding on data that is already public.