Company Deep Research — AI Agent Dossier API

Pricing

from $15.00 / 1,000 results

One-call company intelligence dossier for AI agents: website meta, tech stack, socials, contacts, SEO basics, news signals, competitors and blog/RSS from a company name or domain. No API key, no browser.

Rating

0.0

(0)

Developer

Logiover

Maintained by Community

Actor stats

Bookmarked: 0

Total users: 2

Monthly active users: 1

Last modified: 2 days ago

Company Deep Research Scraper — AI Agent Company Intelligence API

One-call company intelligence dossier for AI agents, LLM apps, sales/rev-tech and market researchers. Drop in a company name or domain and get back a clean, structured JSON record with the website meta, tech stack, social profiles, contact channels, SEO basics, recent news signals, competitor domains and RSS/blog feeds — all from a single Apify Actor run. No API key, no headless browser, no per-site scrapers to maintain.

Built for the new wave of AI agents that research companies autonomously — coding agents that need to understand a vendor before integrating, sales agents that qualify accounts, analyst agents that build market maps, and RAG pipelines that ground LLM answers in fresh, structured company data instead of stale training-set knowledge.


🎯 What this Actor is for

Large language models have a knowledge cutoff and no live web access. When an AI agent is asked "research Acme Corp" or "who are Notion's competitors and what stack do they run?", it needs structured, fresh, machine-readable company data — not a pile of raw HTML to re-parse every time. company-deep-research-scraper is that data layer:

  • One call, one dossier. Pass stripe.com and get back a single JSON object with ~40 fields covering identity, tech, socials, contacts, SEO, news and competitors. No need to chain five separate scrapers.
  • AI-agent friendly schema. Predictable field names, nullable values, arrays where multiple values exist, ISO timestamps. Drop straight into a prompt or a vector DB.
  • Batch by default. Feed 500 domains and get 500 dossiers back, enriched in parallel — perfect for building a company database or enriching a CRM.
  • Name → domain resolution. Don't know the website? Pass "Figma" or "Linear" and the Actor resolves it to figma.com / linear.app via Clearbit's company autocomplete, then enriches.
  • No keys, no browser. Pure HTTP + lightweight HTML parsing on a small Node 20 container. Cheap, fast, and resilient — no Playwright to keep warm.

✨ Key features

  • 🧠 Company identity — name, description, logo, favicon, OG metadata, site name, detected languages.
  • 🛠️ Tech stack fingerprinting — 40+ technologies detected from HTML + HTTP headers: Next.js, React, Vue, Angular, Svelte, Nuxt, Gatsby, Astro, Remix, WordPress, Shopify, Webflow, Squarespace, Wix, Drupal, Joomla, Ghost, jQuery, Tailwind, Bootstrap, Cloudflare, Vercel, Netlify, Google Analytics, GTM, Facebook Pixel, Hotjar, HubSpot, Intercom, Segment, Mixpanel, Stripe, PayPal, Sentry, Datadog, Algolia, Elasticsearch, Font Awesome, Google Fonts, and more.
  • 🔗 Social profiles — LinkedIn, X/Twitter, GitHub, Facebook, Instagram, YouTube, TikTok, Discord, Telegram URLs extracted from the homepage + schema.org sameAs.
  • ✉️ Contact channels — emails (from mailto: + text regex, junk filtered) and phone numbers (from tel: + international regex).
  • 📈 SEO basics — title tag, meta description, H1/H2 headings, robots.txt, discovered sitemaps.
  • 📰 News signals — recent news for the company (last 30 days) via Google News RSS, with title, URL, publish date and source.
  • ⚔️ Competitor discovery — alternative/competitor domains mined from DuckDuckGo result pages (junk domains filtered).
  • 📡 RSS / blog feeds: <link rel=alternate> feed tags, blog links and common feed-path guesses (/feed, /rss, /blog/feed).
  • 🏢 Organization JSON-LD: schema.org Organization/Corporation blocks parsed for founded year, employee range, industry, country, city.
  • 🌐 Proxy-aware — Apify datacenter proxy by default (avoids per-IP rate limits), optional residential for bot-walled targets.
  • 💰 Pay-per-result — you're charged per company dossier produced, not per run. Empty/blocked results are not billed.
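
To make the social-profile feature more concrete, here is a minimal sketch of matching link URLs against per-platform hostname patterns. The `SOCIAL_PATTERNS` table and the `extractSocials` function are hypothetical names chosen for illustration; the Actor's real pattern list is not shown in this README and may differ.

```javascript
// Hypothetical hostname patterns; the Actor's real pattern list may differ.
const SOCIAL_PATTERNS = {
  linkedin: /^(www\.)?linkedin\.com$/i,
  twitter: /^(www\.)?(twitter|x)\.com$/i,
  github: /^(www\.)?github\.com$/i,
  facebook: /^(www\.)?facebook\.com$/i,
  instagram: /^(www\.)?instagram\.com$/i,
  youtube: /^(www\.)?youtube\.com$/i,
};

// Returns [{ platform, url }] with at most one entry per platform (first match wins).
function extractSocials(hrefs) {
  const found = new Map();
  for (const href of hrefs) {
    let parsed;
    try {
      parsed = new URL(href);
    } catch {
      continue; // skip relative or malformed links
    }
    for (const [platform, pattern] of Object.entries(SOCIAL_PATTERNS)) {
      if (pattern.test(parsed.hostname) && !found.has(platform)) {
        found.set(platform, { platform, url: parsed.href });
      }
    }
  }
  return [...found.values()];
}

const socials = extractSocials([
  'https://twitter.com/stripe',
  '/pricing',
  'https://www.linkedin.com/company/stripe',
  'https://x.com/stripe',
]);
console.log(socials);
```

Matching on the parsed hostname rather than on the raw string avoids false positives such as a blog post whose path happens to contain "twitter.com".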

🤖 Why AI agents need this

Modern agentic workflows (Claude, ChatGPT, Cursor, LangGraph, CrewAI, AutoGen, Microsoft Copilot, Google Gemini) increasingly delegate research to tools. The bottleneck is no longer reasoning — it's grounding: getting current, structured facts about real-world entities. Company research is one of the highest-frequency grounding tasks:

  1. Vendor / tool evaluation. A coding agent is asked to pick a payment provider. It needs to know each candidate's website, tech stack, docs URL and recent news — not just the name.
  2. Sales account research. A GTM agent enriches a lead list with industry, employee size, tech stack, LinkedIn and a competitor set before drafting outreach.
  3. Market mapping. An analyst agent builds a landscape of 200 companies in a niche and needs structured rows it can cluster, filter and rank.
  4. RAG grounding. A support agent answers "does Acme integrate with Stripe?" by pulling Acme's tech stack dossier instead of guessing from training data.
  5. Due diligence. A researcher agent checks a startup's footprint: founded year, location, social reach, news momentum, competitors.

Each of these is one Actor call per company. Batch a list and you've got a database.


📦 What you get (output schema)

Every run streams one dossier per company to the default dataset. A dossier looks like:

{
  "input": "stripe.com",
  "resolvedDomain": "stripe.com",
  "websiteUrl": "https://stripe.com/",
  "companyName": "Stripe",
  "title": "Stripe | Financial Infrastructure to Grow Your Revenue",
  "description": "Stripe is a suite of APIs powering online payment processing...",
  "logo": "https://stripe.com/img/v3/home/twitter.png",
  "favicon": "https://www.google.com/s2/favicons?domain=stripe.com&sz=64",
  "languages": ["en-US"],
  "ogType": "website",
  "techStack": ["Next.js", "React", "Cloudflare", "Stripe", "Google Analytics", "Segment", "Sentry", "Google Fonts"],
  "socials": [
    { "platform": "twitter", "url": "https://twitter.com/stripe" },
    { "platform": "linkedin", "url": "https://www.linkedin.com/company/stripe" },
    { "platform": "github", "url": "https://github.com/stripe" }
  ],
  "emails": ["support@stripe.com"],
  "phones": [],
  "linkedinUrl": "https://www.linkedin.com/company/stripe",
  "twitterUrl": "https://twitter.com/stripe",
  "githubUrl": "https://github.com/stripe",
  "facebookUrl": null,
  "instagramUrl": null,
  "youtubeUrl": "https://www.youtube.com/stripe",
  "employeesRange": "5001-10000",
  "foundedYear": 2010,
  "industry": "Fintech",
  "country": "US",
  "city": "South San Francisco",
  "seoTitleTag": "Stripe | Financial Infrastructure to Grow Your Revenue",
  "seoMetaDescription": "Millions of businesses of all sizes...",
  "seoHeadings": ["Payments", "Online payments", "In-person payments"],
  "robotsTxt": "User-agent: *\nDisallow: /...",
  "sitemapUrls": ["https://stripe.com/sitemap.xml"],
  "news": [
    { "title": "Stripe raises new round...", "url": "https://...", "publishedAt": "2026-06-...", "source": "TechCrunch" }
  ],
  "competitors": ["paypal.com", "adyen.com", "braintreepayments.com", "square.com"],
  "rssFeeds": ["https://stripe.com/blog/feed"],
  "blogUrl": "https://stripe.com/blog",
  "httpsValid": true,
  "scrapedAt": "2026-07-02T12:00:00.000Z"
}

Use the Overview view to scan many companies, the Social & contact view for outreach lists, and the News signals view for monitoring.
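
Because every dossier shares the same shape, a batch can be regrouped with a few lines of code. The sketch below builds an index from technology name to the domains that use it, relying only on the `techStack` and `resolvedDomain` fields from the sample dossier. The `indexByTech` helper and the `.example` domains are illustrative placeholders, not real output.

```javascript
// Builds a Map from technology name to the domains whose dossier lists it.
function indexByTech(dossiers) {
  const index = new Map();
  for (const dossier of dossiers) {
    // techStack is treated as optional here, since schema fields may be null.
    for (const tech of dossier.techStack ?? []) {
      if (!index.has(tech)) index.set(tech, []);
      index.get(tech).push(dossier.resolvedDomain);
    }
  }
  return index;
}

// Placeholder dossiers, trimmed to the two fields the helper reads.
const sampleDossiers = [
  { resolvedDomain: 'a.example', techStack: ['React', 'Stripe'] },
  { resolvedDomain: 'b.example', techStack: ['React'] },
  { resolvedDomain: 'c.example', techStack: null },
];

const techIndex = indexByTech(sampleDossiers);
console.log(techIndex.get('React'));
```

An agent building a market map could use the same pattern to group companies by `industry` or `country` instead.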


🚀 How to use

1. Enrich a batch of domains (highest volume)

{
  "mode": "domain",
  "domains": ["stripe.com", "linear.app", "figma.com", "notion.so", "vercel.com"],
  "sections": ["meta", "techStack", "socials", "contacts", "seo"],
  "concurrency": 5
}

2. Resolve company names → websites → dossiers

{
  "mode": "name",
  "companyNames": ["Notion", "Figma", "Linear", "Vercel", "Supabase"],
  "sections": ["meta", "techStack", "socials", "contacts", "news", "competitors"],
  "concurrency": 4
}

3. Deep single dossier (all sections)

{
  "mode": "single",
  "domain": "openai.com",
  "maxNews": 15,
  "maxCompetitors": 10
}

From code (Apify SDK)

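// Run as an ES module (for example a .mjs file) so that top-level await works.
import { ApifyClient } from 'apify-client';

// The API token is read from the APIFY_TOKEN environment variable.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish.
const run = await client.actor('logiover/company-deep-research-scraper').call({
  mode: 'domain',
  domains: ['stripe.com', 'linear.app'],
  sections: ['meta', 'techStack', 'socials', 'contacts'],
});

// Read the dossiers from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items); // array of company dossiers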

As an MCP tool for AI agents

This Actor pairs naturally with an MCP server that wraps Apify. An agent calls the tool with a company name or domain and receives the structured dossier in its context — no browsing, no HTML parsing on the agent side.


🔧 Input fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| mode | enum | domain | domain (batch enrich), name (resolve names → website → enrich), single (one deep dossier). |
| domains | array | | Domains/URLs for domain/single modes. Normalized (protocol + www stripped). |
| domain | string | | Single domain for single mode. |
| companyNames | array | | Free-text company names for name mode (resolved via Clearbit autocomplete). |
| sections | array | all | Which dossier sections to collect: meta, techStack, socials, contacts, seo, news, competitors, rss. Fewer = faster. |
| maxNews | int | 10 | Max news items per company (0–50). |
| maxCompetitors | int | 10 | Max competitor domains per company (0–30). |
| concurrency | int | 5 | Parallel companies in batch modes (1–20). |
| useApifyProxy | bool | true | Route through Apify datacenter proxy. Recommended. |
| proxyGroups | array | | Override proxy group, e.g. ["RESIDENTIAL"] for bot-walled targets. |
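
The ranges and defaults in the table can be enforced on the caller's side before a run is started. The sketch below builds a domain-mode input, clamping numeric options to the documented ranges and deduplicating domains so the same company is not submitted twice. `normalizeDomain`, `clamp` and `buildDomainInput` are hypothetical helpers, and stripping the URL path is an assumption: the table only mentions protocol and www.

```javascript
// Strips the protocol and a leading "www."; dropping the path is an extra assumption.
function normalizeDomain(value) {
  return value
    .trim()
    .toLowerCase()
    .replace(/^[a-z][a-z0-9+.-]*:\/\//, '')
    .replace(/^www\./, '')
    .replace(/[/?#].*$/, '');
}

const clamp = (n, min, max) => Math.min(max, Math.max(min, n));

// Builds an input object for domain mode using the documented defaults and ranges.
function buildDomainInput(domains, options = {}) {
  const unique = [...new Set(domains.map(normalizeDomain).filter(Boolean))];
  return {
    mode: 'domain',
    domains: unique,
    maxNews: clamp(options.maxNews ?? 10, 0, 50),
    maxCompetitors: clamp(options.maxCompetitors ?? 10, 0, 30),
    concurrency: clamp(options.concurrency ?? 5, 1, 20),
  };
}

console.log(
  buildDomainInput(['https://www.stripe.com/', 'stripe.com', 'linear.app'], { concurrency: 50 })
);
```

The Actor normalizes domains itself, so this step matters mainly for deduplication when the source list mixes bare domains and full URLs.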

🧩 How it works

  1. Resolve. In name mode, the company name is sent to Clearbit's keyless company autocomplete API, which returns the canonical domain ("Notion" → notion.so).
  2. Fetch homepage. The website root is fetched over the Apify proxy with a browser-like User-Agent, following redirects. The final URL, HTTP headers and HTML body are captured.
  3. Parse meta. <title>, meta description, Open Graph / Twitter Card tags, <html lang>, hreflang alternates, favicon and apple-touch-icon are extracted from the <head>.
  4. Detect tech stack. A library of 40+ regex fingerprints runs against the HTML, script src attributes, stylesheet hrefs and HTTP response headers (e.g. cf-ray → Cloudflare, server: vercel → Vercel, wp-content → WordPress). The generator meta tag is also checked.
  5. Extract socials. Link hrefs and schema.org sameAs arrays are matched against LinkedIn, X/Twitter, GitHub, Facebook, Instagram, YouTube, TikTok, Discord and Telegram patterns.
  6. Extract contacts. Emails are collected from mailto: links and an inline email regex (junk, image and Sentry domains filtered); phone numbers come from tel: links and an international phone regex.
  7. SEO basics. The title tag, meta description and H1/H2 text are recorded, and /robots.txt is fetched and parsed for Sitemap: directives.
  8. JSON-LD organization. <script type="application/ld+json"> blocks are parsed; Organization/Corporation blocks fill in name, founded year, employee range, industry, country, city and additional sameAs socials.
  9. News signals. Google News RSS is queried with the company name (last 30 days, US locale) and parsed into title/url/date/source items.
  10. Competitors. DuckDuckGo's HTML endpoint is queried with "<company> alternatives competitors"; result domains are extracted, junk (Wikipedia, YouTube, social) filtered, and the target's own domain excluded.
  11. RSS / blog. <link rel=alternate type=application/rss+xml> tags, blog links and common feed paths (/feed, /rss, /blog/feed) are collected.
  12. Stream. The complete dossier is pushed to the dataset and one result event is charged per company.
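
As one concrete illustration of the steps above, the robots.txt part of step 7 can be sketched as a small parser that collects the URLs from Sitemap: lines. The `parseSitemaps` function is a hypothetical stand-in, not the Actor's own code, and the sample robots.txt uses placeholder URLs.

```javascript
// Extracts the URLs from "Sitemap:" lines; the field name is matched case-insensitively.
function parseSitemaps(robotsTxt) {
  const urls = [];
  for (const line of robotsTxt.split(/\r?\n/)) {
    const match = line.match(/^\s*sitemap\s*:\s*(\S+)/i);
    if (match) urls.push(match[1]);
  }
  // Deduplicate while keeping first-seen order.
  return [...new Set(urls)];
}

const sampleRobots = [
  'User-agent: *',
  'Disallow: /admin',
  '# Sitemap: https://example.com/commented-out.xml',
  'Sitemap: https://example.com/sitemap.xml',
  'sitemap: https://example.com/news.xml',
].join('\n');

console.log(parseSitemaps(sampleRobots));
```

Anchoring the pattern at the start of the line means commented-out directives are ignored.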

💡 Tips & best practices

  • Batch for cost efficiency. domain mode with 5–10 concurrency is the sweet spot. A run of 100 companies typically completes in a few minutes.
  • Trim sections for speed. If you only need tech stack + socials, set sections: ["meta","techStack","socials"]. News and competitors add extra HTTP calls per company.
  • Use name mode for messy lead lists. Got a column of company names from a form? name mode cleans them into domains and enriches in one step.
  • Residential proxy for tough targets. If a site blocks datacenter IPs (rare for homepages, common for some SaaS), set proxyGroups: ["RESIDENTIAL"].
  • Pipe to a vector DB. Serialize each dossier's description + tech stack + headings into a text chunk and embed it. Now your agent can semantically search companies.
  • Schedule recurring runs. News and competitors change weekly. Schedule a weekly run over your watchlist and diff the datasets to track movement.
  • Combine with related Actors. Pair with website-contact-scraper, website-tech-stack-detector and linkedin-company-scraper for deeper enrichment on a shortlist.
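
For the vector DB tip above, a minimal sketch of the serialization step might look like the following. It joins the description, tech stack and headings of a dossier into one text chunk and skips fields that are null or empty. The `dossierToChunk` name and the line layout are assumptions; the embedding call itself is left out because it depends on the provider you use.

```javascript
// Serializes the fields named in the tip into one newline-separated text chunk.
function dossierToChunk(dossier) {
  const parts = [
    dossier.companyName ?? dossier.resolvedDomain,
    dossier.description,
    dossier.techStack?.length ? `Tech stack: ${dossier.techStack.join(', ')}` : null,
    dossier.seoHeadings?.length ? `Headings: ${dossier.seoHeadings.join('; ')}` : null,
  ];
  // Drop null, undefined and empty entries so sparse dossiers stay compact.
  return parts.filter(Boolean).join('\n');
}

// Values taken from the sample dossier in this README.
console.log(
  dossierToChunk({
    companyName: 'Stripe',
    description: 'Stripe is a suite of APIs powering online payment processing...',
    techStack: ['Next.js', 'React'],
    seoHeadings: ['Payments', 'Online payments'],
  })
);
```

Storing `resolvedDomain` and `scrapedAt` as metadata alongside the embedding makes it easier to refresh stale entries later.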

❓ FAQ

Does this Actor need any API keys?

No. It uses only keyless public endpoints (Clearbit company autocomplete, Google News RSS, DuckDuckGo HTML) and direct HTTP fetches of company websites. All you need is an Apify account.

How accurate is the tech stack detection?

The fingerprint library covers 40+ of the most common technologies. It detects client-side signals (script paths, framework markers) and server headers. It will miss server-only tech (databases, backend languages) and deeply bundled apps — but for vendor reconnaissance and quick triage it's reliable. For exhaustive Wappalyzer-grade detection, pair with website-tech-stack-detector.
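
To show what fingerprint-based detection looks like in practice, here is a small sketch that checks HTML and response headers against a table of signals. The three fingerprints are the examples named in this README (wp-content, cf-ray, server: vercel); the `FINGERPRINTS` table and `detectTech` function are hypothetical, and the Actor's real library is larger and may be structured differently.

```javascript
// Hypothetical fingerprint table built from the signals named in this README.
const FINGERPRINTS = [
  { name: 'WordPress', html: /wp-content/i },
  { name: 'Cloudflare', header: 'cf-ray' },
  { name: 'Vercel', header: 'server', value: /vercel/i },
];

// Returns the names of all fingerprints that match the HTML body or the headers.
function detectTech(html, headers) {
  // Header names are case-insensitive, so lowercase them once up front.
  const lower = Object.fromEntries(
    Object.entries(headers).map(([key, value]) => [key.toLowerCase(), String(value)])
  );
  return FINGERPRINTS.filter((fp) => {
    if (fp.html && fp.html.test(html)) return true;
    if (fp.header && fp.header in lower) {
      // A fingerprint without a value pattern matches on header presence alone.
      return fp.value ? fp.value.test(lower[fp.header]) : true;
    }
    return false;
  }).map((fp) => fp.name);
}

console.log(
  detectTech('<link rel="stylesheet" href="/wp-content/themes/site.css">', {
    'CF-Ray': 'abc123',
    Server: 'nginx',
  })
);
```

This also shows why server-only technology goes undetected: a database or backend language leaves no trace in the HTML or headers for a pattern to match.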

Will it work on sites behind Cloudflare/login?

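Public homepages rarely block requests. If a target returns 403/503, the dossier is still produced with whatever fields could be collected, and the domain is recorded. For targets that systematically block datacenter IPs, enable the residential proxy. Pages behind a login are out of scope: the Actor does not authenticate.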

How many companies can I enrich per run?

Practically unlimited — the Actor streams results and charges per dossier. Concurrency is capped at 20 to be polite. A few hundred per run is comfortable; for thousands, split into batches.
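
Splitting a large list into batches can be done with a small helper like the one below, where each batch becomes the `domains` array of a separate run. The `chunk` function is a generic utility, and the batch size is an assumption based on the "a few hundred per run" guidance above.

```javascript
// Splits an array into consecutive batches of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Placeholder domain list; in practice this would be your own watchlist.
const allDomains = Array.from({ length: 7 }, (_, i) => `company${i + 1}.example`);
const batches = chunk(allDomains, 3);
console.log(batches.map((batch) => batch.length));
```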

What's the difference between domain and single mode?

domain is batch-oriented (many companies, lean sections). single is one company with all sections at full depth. The output schema is identical.

Does it find personal emails of employees?

No. It collects only contact emails published on the company homepage (support, info, press). For employee or personal emails, use Actors along the lines of linkedin-profile-scraper or youtube-creator-email-finder.

Can I get historical news?

The news section covers the last 30 days via Google News RSS. For deeper history, combine with google-news-scraper or archive.org.

How is this priced?

Pay-per-result: you're charged one result event per company dossier produced. Runs that yield zero companies (bad input) are free.

Is the output schema stable?

Yes. Fields are additive — new sections may add fields, but existing field names and types won't change within a major version. Nullable fields are marked.

Can AI agents call this directly?

Yes. Expose it through an MCP server or an Apify tool integration; the agent passes a company name/domain and gets structured JSON back. This is the primary design target.


🔗 Related Actors

  • website-tech-stack-detector — deeper, Wappalyzer-style tech detection on a single site.
  • website-contact-scraper — emails, phones, socials, addresses from a whole site crawl.
  • linkedin-company-scraper — LinkedIn company page (size, industry, specialties).
  • clutch-co-scraper / goodfirms-scraper — B2B agency profiles & reviews.
  • y-combinator-companies-directory-scraper — YC startup directory.
  • bulk-social-profile-extractor — pull social profiles from a list of URLs.
  • subdomain-finder — find subdomains for a company domain (recon).
  • news-intelligence-scraper — topic-level multi-source news + sentiment.

📝 Changelog

2026-07-02 — v1.0

  • Initial release.
  • 3 modes: domain (batch), name (resolve → enrich), single (deep).
  • 8 dossier sections: meta, techStack, socials, contacts, seo, news, competitors, rss.
  • 40+ tech-stack fingerprints.
  • schema.org Organization JSON-LD parsing.
  • Apify datacenter proxy by default, residential opt-in.
  • Pay-per-result (result event per dossier).

⚖️ Disclaimer

This Actor fetches publicly available web pages and keyless public APIs. It does not authenticate, bypass access controls, or scrape behind logins. Respect each website's Terms of Service and robots.txt. Use for research, sales intelligence and AI-agent grounding on data that is already public.