Google AI Overview Citation Scraper

Pricing

Pay per event

Scrape which domains Google's AI Overview cites for your target queries — one row per (query × cited source) with position, snippet, and selector telemetry — export to JSON or CSV. AEO / generative-SEO data for 2026, with captcha-aware retry and Pydantic-validated output.

Developer

DevilScrapes

Maintained by Community

Google AI Overview Tracker — Citation Scraper

▶️ Full tutorial on YouTube

▶️ 45-second demo on YouTube

We do the dirty work so your dataset stays clean. 😈

About $5.50 / 1,000 (query × citation) rows. Track which domains Google's AI Overview cites for your target queries. Answer Engine Optimization (AEO) is a fast-growing SEO discipline in 2026: AI Overview now surfaces above the organic results for many informational searches, and there is no first-party API for citation attribution. Pass in a list of queries; we handle the browser fingerprinting, proxy rotation, consent dialogs, and the retry loop, and you get one clean Pydantic-validated row per (query × cited source). Pay only for results that land. No credit card required to try.

🎯 What this scrapes

For each query you pass in, this Actor:

  1. Opens https://www.google.com/search?q=<query>&hl=<language>&gl=<country> in a fresh Camoufox page.
  2. Dismisses any EU consent dialog automatically.
  3. Waits 4–15 seconds for the AI Overview block to lazy-render.
  4. Probes an 8-selector priority battery to find the AI Overview container, recording which selector matched on every row — so you can detect Google rotating their markup with a single GROUP BY selector_used query.
  5. Extracts every citation link inside the carousel: URL, registrable domain, anchor text, and 1-based position.
  6. Emits one row per citation, or — when AI Overview did not appear — a single marker row with ai_overview_appeared=false. Absence of AI Overview is itself a meaningful AEO data point.
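Steps 1 and 4 can be sketched in a few lines. This is a minimal illustration, not the Actor's source: `build_search_url` and `pick_selector` are hypothetical names, the second entry in `SELECTORS` is a placeholder (only `div[aria-label="AI Overview"]` appears in this listing), and `matches` stands in for whatever the browser layer uses to test a selector against the page.

```python
from typing import Callable, Optional, Sequence
from urllib.parse import urlencode


def build_search_url(query: str, language: str = "en", country: str = "us") -> str:
    """Build the step-1 search URL, percent-encoding the query string."""
    params = {"q": query, "hl": language, "gl": country}
    return "https://www.google.com/search?" + urlencode(params)


# Hypothetical priority order. Only the first selector is taken from the
# example output in this listing; the second is a made-up placeholder.
SELECTORS: Sequence[str] = (
    'div[aria-label="AI Overview"]',
    'div[data-placeholder="ai-overview"]',
)


def pick_selector(
    selectors: Sequence[str], matches: Callable[[str], bool]
) -> Optional[str]:
    """Probe every selector, then return the highest-priority hit (or None)."""
    hits = [selector for selector in selectors if matches(selector)]
    return hits[0] if hits else None
```

The value returned by `pick_selector` is what would land in `selector_used`; `None` corresponds to the marker-row case.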

Output fields:

| Field | Type | Description |
| --- | --- | --- |
| query | string | The query the row was produced from |
| country | string | ISO-3166 alpha-2 country code (gl=) |
| language | string | ISO-639-1 language code (hl=) |
| ai_overview_appeared | boolean | True when an AI Overview block was rendered |
| ai_overview_text_excerpt | string or null | First 200 chars of the AI Overview body |
| citation_position | integer or null | 1-based position in the citation carousel |
| source_domain | string or null | Registrable domain (e.g. imf.org) |
| source_url | string or null | Full https:// URL Google rendered |
| source_title | string or null | Anchor text Google rendered |
| selector_used | string or null | Which selector matched (drift telemetry) |
| scraped_at | string | ISO 8601 UTC timestamp |

🔥 Features

  • Camoufox-rendered, not headless Chromium — we use a Firefox fork with anti-detection patches. Standard Playwright and Selenium emit fingerprints that Google's defences pick off instantly; Camoufox is the only browser we allow for scraping (ADR-0002).
  • We rotate residential proxies on every block — fresh session_id, fresh exit IP via Apify Proxy RESIDENTIAL. You never fight for bandwidth against someone else's CAPTCHA loop.
  • We handle CAPTCHA interstitials — when Google serves the sorry/index reCAPTCHA page, we rotate the proxy session and retry before emitting a marker row. The run never silently returns an empty dataset.
  • 8-selector priority battery with drift telemetry — Google has rotated the AI Overview DOM label at least three times since launch. We maintain an ordered selector list, probe all eight on each page, and record the winner in selector_used so you can chart selector drift over time.
  • We retry with exponential backoff on network failures and rate-limit responses. Up to five attempts per query before we surface a partial-success status.
  • Per-query session isolation — fresh proxy session and fresh browser page per query. Cookies and rate-limit state from one query never bleed into the next.
  • Pydantic v2 input + output validation — invalid input fails fast before any browser starts; row schema is enforced at push time. You get typed columns, not a bag of string fields.
  • Pay-Per-Event pricing — $0.05 start + $0.005 per row. No data, no charge beyond the warm-up fee.
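The retry behaviour described above (a fresh proxy session per attempt, exponential backoff, up to five attempts) might look roughly like the sketch below. Everything here is an assumption for illustration: `Blocked`, `fetch_with_retry`, the one-second base delay, and the injected `fetch` and `new_session_id` callables are hypothetical and not taken from the Actor's code.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class Blocked(Exception):
    """Hypothetical error for a CAPTCHA page, rate limit, or network failure."""


def fetch_with_retry(
    fetch: Callable[[str], T],
    new_session_id: Callable[[], str],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,  # assumed value; the listing gives no delay
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Try fetch with a fresh session each attempt; return None if all fail."""
    for attempt in range(max_attempts):
        session_id = new_session_id()  # new session id, so a new exit IP
        try:
            return fetch(session_id)
        except Blocked:
            if attempt < max_attempts - 1:
                # Delay doubles after each failed attempt.
                sleep(base_delay_s * 2 ** attempt)
    return None  # the caller would then emit a marker row
```

Passing `sleep` as a parameter keeps the function easy to exercise in tests without real waiting.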

💡 Use cases

  • AEO dashboard — schedule a weekly run for your 50 highest-priority queries; chart source_domain share-of-citation over time alongside ai_overview_appeared rate. Detect when AI Overview starts citing a new competitor in your space.
  • Pre-launch content gap analysis — feed in the queries you want to rank for, see which domains Google currently cites, and target outreach to publishers in the cite list rather than chasing pure backlink volume.
  • Brand citation monitoring — does AI Overview cite your domain for queries where your brand is the answer? Most brands have zero instrumentation here today; this is the direct way to find out.
  • Competitive intelligence — track exactly which 3–5 sources Google's generative system trusts for each of your category's head queries; compare to traditional SERP rank.
  • Selector drift monitoring — the selector_used column is a leading indicator of Google rotating AI Overview markup. Useful for SaaS observability of generative search behaviour even if you don't mine the citation data directly.
  • Localised AEO — pair country=us and country=gb runs over the same query list to detect locale-specific citation behaviour.
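For the dashboard use case, the two headline metrics can be computed straight from the exported rows. The following is a rough sketch that assumes rows are plain dicts with the output fields listed earlier; `citation_share` and `appearance_rate` are illustrative names, not part of the Actor.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def citation_share(rows: Iterable[Mapping[str, object]]) -> Dict[str, float]:
    """Fraction of all citations that went to each source_domain."""
    counts = Counter(
        row["source_domain"]
        for row in rows
        if row.get("ai_overview_appeared") and row.get("source_domain")
    )
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()} if total else {}


def appearance_rate(rows: Iterable[Mapping[str, object]]) -> float:
    """Fraction of distinct queries for which an AI Overview appeared."""
    appeared: Dict[object, bool] = {}
    for row in rows:
        query = row["query"]
        appeared[query] = appeared.get(query, False) or bool(row["ai_overview_appeared"])
    return sum(appeared.values()) / len(appeared) if appeared else 0.0
```

Running both functions on each weekly export and storing the results gives the time series the dashboard needs.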

⚙️ How to use it

  1. Open the Actor input form.
  2. Paste your Search queries (1–50). Informational queries (what is, how to, best X 2026) have the highest AI Overview trigger rate.
  3. (Optional) Set Country and Language (default us / en).
  4. (Optional) Cap with Max queries per run (default 25).
  5. (Optional) Raise Wait after DOM ready to 12000–15000 ms for slow proxy exits.
  6. Pick a Proxy — leave the default RESIDENTIAL. We fall back to the BUYPROXIES94952 group automatically when residential is unavailable on your plan.
  7. Click Start. Rows stream into the default dataset as each query completes.

Quick examples

Two informational queries, default settings (the QA fixture):

{
  "queries": ["best running shoes 2026", "what causes inflation"],
  "country": "us",
  "language": "en",
  "maxQueries": 2,
  "waitMsAfterLoad": 8000,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

UK English with a longer wait window:

{
  "queries": ["best mortgage rates", "how does pension lump sum tax work"],
  "country": "gb",
  "language": "en",
  "maxQueries": 10,
  "waitMsAfterLoad": 12000,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

📥 Input

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| queries | array of strings | yes | | 1–50 search queries to probe |
| country | string | no | "us" | ISO-3166 alpha-2 (lowercase); maps to gl= |
| language | string | no | "en" | ISO-639-1 (lowercase); maps to hl= |
| maxQueries | integer | no | 25 | Hard cap per run (1–50) |
| waitMsAfterLoad | integer | no | 8000 | Milliseconds to wait after DOMContentLoaded (4000–15000) |
| proxyConfiguration | object | yes | RESIDENTIAL | Apify Proxy config; required |
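If you build inputs programmatically, it can help to check them against the documented bounds before starting a run. The Actor itself validates with Pydantic v2; the sketch below deliberately uses a standard-library dataclass instead so it runs without dependencies. `ActorInput` is a hypothetical name, and `proxyConfiguration` is left out for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ActorInput:
    """Client-side mirror of the documented input bounds (illustrative only)."""

    queries: List[str]
    country: str = "us"
    language: str = "en"
    maxQueries: int = 25
    waitMsAfterLoad: int = 8000

    def __post_init__(self) -> None:
        if not 1 <= len(self.queries) <= 50:
            raise ValueError("queries must contain 1-50 items")
        for name in ("country", "language"):
            code = getattr(self, name)
            ok = len(code) == 2 and code.isascii() and code.isalpha() and code.islower()
            if not ok:
                raise ValueError(f"{name} must be a lowercase two-letter code")
        if not 1 <= self.maxQueries <= 50:
            raise ValueError("maxQueries must be between 1 and 50")
        if not 4000 <= self.waitMsAfterLoad <= 15000:
            raise ValueError("waitMsAfterLoad must be between 4000 and 15000")
```

Constructing `ActorInput(queries=[...])` raises `ValueError` on the first violated bound, which mirrors the fail-fast behaviour described in the features list.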

📤 Output

One JSON row per (query x citation), or a single marker row when AI Overview did not appear for that query. Example citation row:

{
  "query": "what causes inflation",
  "country": "us",
  "language": "en",
  "ai_overview_appeared": true,
  "ai_overview_text_excerpt": "Inflation is caused by a combination of demand-pull factors, cost-push factors...",
  "citation_position": 1,
  "source_domain": "imf.org",
  "source_url": "https://www.imf.org/en/Publications/fandd/issues/Series/Back-to-Basics/Inflation",
  "source_title": "Inflation: Prices on the Rise",
  "selector_used": "div[aria-label=\"AI Overview\"]",
  "scraped_at": "2026-05-16T20:50:00.000Z"
}

Example no-AI-Overview marker row:

{
  "query": "spinach recipe",
  "country": "us",
  "language": "en",
  "ai_overview_appeared": false,
  "ai_overview_text_excerpt": null,
  "citation_position": null,
  "source_domain": null,
  "source_url": null,
  "source_title": null,
  "selector_used": null,
  "scraped_at": "2026-05-16T20:50:00.000Z"
}
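Both row shapes can come from one small expansion step. The sketch below is an assumption about how such a step could be written, not the Actor's code: `build_rows` is a hypothetical helper, and the `domain`, `url`, and `title` keys of each citation dict are made up for the example. How an overview with zero citations is handled is not stated in this listing, so the sketch simply returns no rows in that case.

```python
from typing import Any, Dict, List, Optional


def build_rows(
    query: str,
    country: str,
    language: str,
    scraped_at: str,
    selector_used: Optional[str] = None,
    overview_text: Optional[str] = None,
    citations: Optional[List[Dict[str, str]]] = None,
) -> List[Dict[str, Any]]:
    """Return one row per citation, or a single marker row if nothing matched."""
    base = {"query": query, "country": country, "language": language}
    if selector_used is None:
        # No selector matched: AI Overview did not appear for this query.
        return [{
            **base,
            "ai_overview_appeared": False,
            "ai_overview_text_excerpt": None,
            "citation_position": None,
            "source_domain": None,
            "source_url": None,
            "source_title": None,
            "selector_used": None,
            "scraped_at": scraped_at,
        }]
    excerpt = overview_text[:200] if overview_text else None
    return [
        {
            **base,
            "ai_overview_appeared": True,
            "ai_overview_text_excerpt": excerpt,
            "citation_position": position,  # 1-based carousel position
            "source_domain": cite["domain"],
            "source_url": cite["url"],
            "source_title": cite["title"],
            "selector_used": selector_used,
            "scraped_at": scraped_at,
        }
        for position, cite in enumerate(citations or [], start=1)
    ]
```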

💰 Pricing

Pay-Per-Event:

| Event | Price (USD) | Trigger |
| --- | --- | --- |
| actor-start | $0.05 | Once per run when the Actor begins |
| result-row | $0.005 | Per dataset row pushed |

Worked example: a 50-query run with a ~30% AI Overview hit rate and ~4 citations per hit yields roughly 50 * 0.7 + 50 * 0.3 * 4 = 95 rows (35 marker rows plus 60 citation rows). Charge = $0.05 + 95 * $0.005 = $0.525, which works out to about $5.50 per 1,000 rows at this run size. The Actor is priced above commodity SERP scrapers because cross-domain citation attribution is not available from any first-party source.
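To budget a run before launching it, the same arithmetic can be wrapped in two small functions. This is only an estimator built from the event prices in the table above; the function names are illustrative, and the hit rate and citations-per-hit figures are inputs you supply, not guarantees.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.05")   # charged once per run
RESULT_ROW_USD = Decimal("0.005")   # charged per dataset row


def expected_rows(queries: int, hit_rate: float, citations_per_hit: float) -> float:
    """Marker rows for misses plus citation rows for hits."""
    misses = queries * (1 - hit_rate)
    hits = queries * hit_rate
    return misses + hits * citations_per_hit


def run_cost_usd(rows: int) -> Decimal:
    """Total charge for one run that pushes the given number of rows."""
    return ACTOR_START_USD + RESULT_ROW_USD * rows


def cost_per_thousand_rows(rows: int) -> Decimal:
    """Effective price per 1,000 rows, including the start fee."""
    return run_cost_usd(rows) / rows * 1000
```

Because the start fee is fixed, the effective per-1,000 price falls slightly as runs get larger.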

🚧 Limitations

  • AI Overview triggers on ~30% of queries today. Queries that look transactional, navigational, or trademark-heavy will mostly produce ai_overview_appeared=false marker rows. That absence data is still valuable — and you're charged the same per row either way.
  • v0.1 is English-tuned. The text-based selector fallback looks for the literal string AI Overview. Non-English interface languages (e.g. hl=de) may emit false negatives on the fallback path. The CSS selector battery is locale-agnostic.
  • Apify Proxy is required. Google blocks datacenter IPs without proxy enrichment. The Actor fails fast at startup with a clear status message when no proxy group is reachable.
  • Mobile SERP is out of scope. Mobile AI Overview uses a different DOM structure; a separate Actor variant is planned.
  • No following citation links. This Actor records the cited URL but does not visit it. Pair with a downstream HTTP scraper when you need destination content.

❓ FAQ

Q: What is a Google AI Overview tracker and why do I need one? Google AI Overview is the AI-generated summary block that appears at the top of ~30% of Google searches and cites 3–8 external sources. There is no official API for citation attribution. An AI overview scraper like this one gives SEO teams and content strategists raw data on which domains Google's generative engine trusts — the same data the major SEO platforms are now billing $300–1,500/mo to approximate.

Q: How is this different from Ahrefs or Semrush's AEO features? Those platforms track your own domain's citation appearances. This Actor lets you track any domain across any query list — useful for competitor analysis, category audits, and building your own AEO share-of-voice metrics. You own the raw dataset; no subscription lock-in.

Q: Why Camoufox instead of plain Playwright? Google's defences fingerprint headless Chromium via the WebDriver/CDP signature, predictable navigator properties, and missing iframes-API behaviour. We use Camoufox — a Firefox fork with those signals patched — because it is the only browser automation layer that survives Google's current detection stack. Plain Playwright would be blocked before the AI Overview block ever loads.

Q: Why one row per citation instead of one row per query with an array? Tabular tools (Sheets, BigQuery, Excel pivots) handle nested arrays poorly. Long-form rows let you GROUP BY source_domain directly. If you need wide-form output you can pivot in five lines of SQL.
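If you would rather pivot in Python than SQL, the long-form rows collapse to one record per query in a few lines. This is a minimal sketch assuming rows are dicts with the output fields above; `to_wide` is an illustrative name.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def to_wide(rows: Iterable[Mapping[str, Any]]) -> Dict[str, List[str]]:
    """Map each query to its cited domains, ordered by citation_position."""
    grouped: Dict[str, list] = defaultdict(list)
    for row in rows:
        cites = grouped[row["query"]]  # marker-only queries still get an entry
        if row["citation_position"] is not None:
            cites.append((row["citation_position"], row["source_domain"]))
    return {query: [domain for _, domain in sorted(cites)]
            for query, cites in grouped.items()}
```

Queries that produced only a marker row map to an empty list, so absence of AI Overview stays visible in the wide form.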

Q: Why charge for marker rows when AI Overview didn't appear? ai_overview_appeared=false is a meaningful AEO signal — knowing which of your queries do not trigger AI Overview is half the dashboard. Fair per-row pricing keeps the Actor sustainable.

Q: Does this work on the Apify FREE tier? Partially. FREE tier has no RESIDENTIAL proxy group, so the Actor falls back to BUYPROXIES94952 (5 IPs). Google serves CAPTCHAs more often on those exit IPs; expect a higher marker-row rate. Paid Apify plans with RESIDENTIAL get cleaner runs.

Q: How do I detect Google rotating their AI Overview DOM? Filter the dataset by selector_used over time. When the highest-priority selector stops hitting and a lower-priority one starts winning, the markup has shifted — raise an issue on the Store listing and we'll add the new selector to the battery.
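One way to automate that check is to find the days on which the most frequently winning selector changes. The sketch below assumes rows are dicts with `query`, `selector_used`, and an ISO 8601 `scraped_at`, and counts each query once per day so that queries with many citations do not dominate. The function names are illustrative.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, List, Mapping, Tuple


def selectors_by_day(rows: Iterable[Mapping[str, Any]]) -> Dict[str, Counter]:
    """Count winning selectors per UTC day, one vote per (day, query)."""
    seen = set()
    per_day: Dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        selector = row.get("selector_used")
        if selector is None:
            continue  # marker rows carry no selector
        day = row["scraped_at"][:10]  # YYYY-MM-DD prefix of the timestamp
        key = (day, row["query"])
        if key not in seen:
            seen.add(key)
            per_day[day][selector] += 1
    return dict(sorted(per_day.items()))


def drift_days(rows: Iterable[Mapping[str, Any]]) -> List[Tuple[str, str, str]]:
    """Return (day, previous_top, new_top) whenever the top selector changes."""
    changes = []
    previous = None
    for day, counts in selectors_by_day(rows).items():
        top = counts.most_common(1)[0][0]
        if previous is not None and top != previous:
            changes.append((day, previous, top))
        previous = top
    return changes
```

A non-empty result from `drift_days` is a reasonable trigger for inspecting the page markup by hand.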

Q: What is the Google SGE tracker use case? Google SGE (Search Generative Experience) was the predecessor name for what is now called Google AI Overview. If you were tracking SGE citations, this is the equivalent tool for the current AI Overview product.

💬 Your feedback

Hit a selector miss, a parser edge case, or a feature gap? Open an issue on the Apify Store listing and we'll respond within a week. Pull requests are welcome — the source is structured for easy selector-battery extension (see src/parser.py).