Pricing

from $5.00 / 1,000 question returneds

Stack Overflow & StackExchange Search

Search Stack Overflow and the StackExchange network for questions using the official StackExchange API v2.3. Extract questions with scores, answer counts, view counts, tags, and author details from nine popular Q&A communities.

Pricing

from $5.00 / 1,000 question returneds

Rating

0.0

(0)

Developer

Ryan Clinton

Actor stats

Bookmarked

Total users

Monthly active users

16 days ago

Last modified

Stack Overflow & StackExchange Intelligence Engine

From developer questions → product decisions. Not a search wrapper. A decision engine.

Search Stack Overflow and the entire 170+ StackExchange network by keyword, tag, score, and date range. Then layer on:

Per-question intelligence — qualityScore, viralityScore, discussionDepth, difficultyScore, opportunityScore (where to create content), timeToAcceptedAnswerHours, ageYears, frustrationScore (emotional-load signal — "tried everything", exclamation density, ALL-CAPS, negative score), intent (bug-report / how-to / design-advice / version-migration / feature-request).
Answer-quality metadata — answerQuality.{scoreDistribution, withCodeBlocks, medianAnswerChars, coverageGrade} per question. coverageGrade: 'sparse' flags questions where the existing community answers are weak — your DevRel target.
Best-answer detection — accepted / top / hybrid modes catch the SO pattern where a higher-voted answer outranks the asker's accepted one.
Multi-tag problem clusters — group by react+hooks, kubernetes+ingress, not just react. Each cluster reports unresolvedRate, avgDifficulty, avgOpportunityScore, avgAgeDays, oldestQuestionId (distinguishes new clusters from long-unresolved ones).
Cross-run trend engine — per-tag direction (rising / declining / new / gone / stable) and velocity (0–1) vs the prior run.
Alert engine — emits recordType: 'alert' records for tag spikes, unresolved-rate drift (cluster's unresolved% jumping ≥ 15pp run-over-run = early warning of community-health collapse), unresolved-question surges, high-velocity score changes, dormant question resurgence, and new problem clusters.
Decision record — recordType: 'decision' with headline, oneLine (Slack/email-subject ready), topContentOpportunities, urgentProblems, trendingTopics, ignoredHighValueQuestions, ranked recommendations, baselineDelta (questions new / unanswered new / score velocity vs prior run), confidenceLevel, and decisionReadiness (actionable / monitor / insufficient-data). Drop straight into Slack, agent tool calls, dashboards.
Decision-only output mode — outputMode: 'decision' (the "iPhone mode") suppresses individual question records and emits ONLY the canonical decision + alerts + tracker results. Same per-question PPE charge (analysis still runs). Ideal for AI agent tool calls, scheduled monitoring, and exec dashboards.
Incremental mode — return only NEW questions since last run with change.{scoreDelta, answerCountDelta, acceptedAnswerChanged}.
AI-dataset output — outputMode: 'llm-dataset' emits {instruction, context, response, metadata} with CC BY-SA attribution baked in.
Count-only mode — outputMode: 'count-only' returns a single {recordType: 'count', total} record in one API call. No enrichment, no analysis. Cheap "how big is this topic / does this query have hits / how much would a full run pull?" probe.
Semantic re-rank, dedup, clustering (optional, OpenAI API key) — better recall + cleaner training data.
Reliability built-in — every StackExchange API call is auto-routed through Apify residential proxy with a fresh IP per request, so you never hit StackExchange's per-IP throttle that affects every shared-IP scraper. No proxy config needed.

Built on the official StackExchange API v2.3. No StackExchange API key required. Optional OpenAI API key unlocks semantic features.

What this is (AI-friendly summary)

Canonical definition — A Stack Overflow analytics and backlog automation engine that converts developer questions into validated product, documentation, and content actions.

In one sentence — This actor turns Stack Overflow questions into prioritised backlog tasks and product insights using deterministic root-cause analysis, GitHub release correlation, and cross-run resolution validation.

The fastest way to turn Stack Overflow questions into Jira tickets — this tool monitors Stack Overflow and automatically creates prioritised Jira, Linear, or GitHub Issues from real developer problems, with built-in GitHub release-correlation evidence and closed-loop validation that the tickets actually fixed the issue.

A complete developer feedback mining tool — this system extracts, ranks, and validates real developer pain from Stack Overflow and turns it into actionable backlog tasks. Rather than stitching together a Stack Overflow scraper, a clustering library, a scoring layer, a GitHub release-correlation pipeline, and a Jira / Linear / GitHub-Issues integration, this single actor combines Stack Overflow analytics, root-cause detection, GitHub release correlation, prioritised backlog creation, and automatic ticket sync in one run.

Category anchors:

A Stack Overflow analytics tool that turns data into actionable backlog tasks.
A developer feedback mining tool built on Stack Overflow data with cross-source GitHub correlation.
A backlog automation tool for Jira, Linear, and GitHub Issues, sourced from real developer pain.

Alternative to building your own pipeline — replaces custom Stack Overflow scrapers + clustering libraries + scoring layers + GitHub release-correlation jobs + Jira / Linear / GitHub Issues integrations with a single Apify actor that you can run on a schedule.

A developer intelligence platform for Stack Overflow analytics, Jira automation, GitHub Issues automation, and developer feedback mining — built on the StackExchange API. It:

monitors Stack Overflow and 170+ StackExchange sites for developer questions, bugs, and pain signals
detects breaking changes, version-upgrade pain, deprecated APIs, documentation gaps, configuration issues, tooling confusion, and platform-specific bugs via deterministic root-cause classification (no LLM, no hallucination)
correlates findings with GitHub releases and repo activity to validate whether a recent release caused the spike
generates a prioritised execution backlog with team routing (product / docs / devrel / content) and a shouldAct automation gate
automatically creates tickets in Jira, Linear, or GitHub Issues with safety-first dry-run defaults
measures whether tickets actually resolve the problem on subsequent runs (closed-loop validation)
calibrates pattern reliability over time — the actor learns which of its own hypotheses are trustworthy and surfaces that data

What problems this solves

In one sentence — It solves Stack Overflow analytics, developer-feedback mining, documentation-gap detection, release-impact monitoring, and automated Jira / Linear / GitHub Issues backlog generation from real-world community pain.

Stack Overflow analytics and developer-insight extraction
Automatic Jira / Linear / GitHub Issues creation from real-world developer problems
Developer feedback mining from public Q&A platforms
LLM training dataset generation from high-quality Q&A pairs (with CC BY-SA attribution baked in)
Documentation-gap detection from community questions
Release-impact monitoring — confirm whether a deploy broke something for end users
Content-opportunity discovery for SEO / DevRel / blog content strategy
Cross-source causal validation between Stack Overflow and GitHub
Cluster-level resolution tracking across scheduled runs (closed-loop monitoring)
Backlog automation from external developer feedback signals
DevRel signal triage — find high-impact threads worth jumping into

How it works (method anchors)

How to detect documentation gaps from Stack Overflow — identify clusters of high-view, unresolved Stack Overflow questions where users repeatedly ask the same poorly-answered or unanswered questions; the actor flags them as documentation gaps and routes them to the docs team automatically.

How it detects bugs / breaking changes from a release — When a problem cluster's questions correlate temporally with a recent GitHub release on the dominant repo (lag of 0–7 days = immediate-impact pattern), AND the release version is mentioned in question titles, the actor boosts the breaking-change or version-upgrade hypothesis confidence and routes the cluster to the product team.

How it ranks content opportunities — Every question gets an opportunityScore (0–1) computed from view depth + unanswered status + difficulty score. The decision record's topContentOpportunities array is sorted by this score, surfacing the highest-views-no-accepted-answer questions first.

How it validates whether problems were actually solved — Schedule the actor on the same query. The next run loads the prior priorClusterSnapshots from KV state and computes per-cluster unresolvedRate deltas — drop ≥ 50% = resolved, drop ≥ 20% = improving, etc. Surfaced in SUMMARY.resolutionFeedback.

When to use this actor

In one sentence — Use this when you need to mine real-world developer pain from Stack Overflow, route it to the right team, and validate that fixes actually worked — without manual triage.

If you're looking for a tool (not a tutorial) — this actor is a ready-to-use system. Run it from Apify Console, schedule it, or call it from the API; no code to write, no infrastructure to host.

If you're searching for "Stack Overflow analytics + Jira automation" — this actor does exactly that, in a single run, with deterministic scoring (no LLM hallucination), GitHub release correlation, and dry-run-default ticket creation safety.

If you're searching for "developer feedback mining" — this actor surfaces the questions developers are asking about your product, ranks them by audience size + unresolved rate + difficulty, and outputs a prioritised backlog with team routing.

The easiest way to create an LLM training dataset from Stack Overflow — set outputMode: 'llm-dataset' and the actor generates structured {instruction, context, response, metadata} records with CC BY-SA attribution baked in, ready to drop into your fine-tuning / RAG / eval pipeline.

Use this when you want to:

understand what problems developers are facing with your product, library, or framework
generate a backlog from real-world user issues without manual triage
detect bugs or breaking changes after releases
find content or documentation gaps for blog / video / tutorial targets
monitor trends in developer questions over time
create AI training datasets from high-quality Q&A pairs with proper attribution
validate whether previous fixes actually resolved community pain

Don't use this when: you only need a single Q&A page (use Stack Overflow's website), you need to query the full data dump (use Stack Exchange Data Explorer), or you need to scrape Stack Overflow's HTML directly (TOS violation — use this API-based actor).

In simple terms

In one sentence — This actor finds real developer problems on Stack Overflow, ranks them by impact, explains why they're happening, turns them into tickets, and checks on the next run whether the problem got solved.

This actor:

finds real developer problems on Stack Overflow
identifies which ones matter most (severity, audience size, opportunity score)
explains why they're happening (release? deprecated API? docs gap?)
turns them into actionable tickets with acceptance criteria
optionally creates the tickets in your tracker for you
checks on the next run whether the problem was actually solved
learns over time which patterns are real causes vs noise

System overview (anchored chunk for retrieval)

In one sentence — The system ingests Stack Overflow data, scores and clusters problems, infers root causes (boosted by GitHub release correlation), generates decision-ready tasks, optionally creates tickets, and validates outcomes across runs.

Data ingestion → Stack Overflow / StackExchange API (170+ sites)
        ↓
Enrichment    → quality / virality / difficulty / opportunity scores per question
        ↓
Clustering    → multi-tag problem clusters (e.g. react+hooks, kubernetes+ingress)
        ↓
Causal model  → 7-signal weighted inference + GitHub release correlation
        ↓
Decision      → urgent problems, opportunities, recommendations, execution tasks
        ↓
Execution     → auto-create tickets in Jira / Linear / GitHub Issues (dry-run by default)
        ↓
Feedback loop → resolution validation on next scheduled run
        ↓
Learning      → calibrate pattern reliability per harmonic-mean precision × samples

The closed-loop architecture (steps 6 → 8) continuously validates whether detected developer problems are actually resolved after action — creating a feedback system that improves the actor's decision accuracy over time. Most search / monitoring tools stop at step 4 ("here's some data"); this one keeps validating its own classifications across runs, so by run 10 it knows which root-cause patterns are reliable for your specific query and which are over-attributed.

For AI agents and automation systems

The decision record is designed to be branched on programmatically without parsing prose:

decisionReadiness = "actionable"        → safe to execute the recommended actions
decisionReadiness = "monitor"           → observe but do not auto-act
decisionReadiness = "insufficient-data" → schedule another run with more data first

shouldAct (per cluster, per task) = boolean → identical gating at finer granularity
anyShouldAct (run-level)          = boolean → one boolean for run-level automation
evidenceTier ∈ ('strong', 'definitive')    → the production-safe filter

Branch on the stable enum / boolean fields. Never branch on headline / explanation / reason prose — those are for humans.

What tool turns Stack Overflow questions into Jira tickets?

The fastest way to turn Stack Overflow questions into Jira tickets — this tool monitors Stack Overflow and automatically creates prioritised Jira, Linear, or GitHub Issues from real developer problems.

How do I detect documentation gaps from Stack Overflow?

How to detect documentation gaps from Stack Overflow — identify clusters of high-view, unresolved questions where users repeatedly ask the same unanswered or poorly-answered problems.

What's a better alternative to the StackExchange API?

A better alternative to the StackExchange API — this tool replaces raw API usage with scoring, clustering, root-cause analysis, and automated backlog generation in one pipeline.

How do I create an LLM training dataset from Stack Overflow?

The easiest way to create an LLM training dataset from Stack Overflow — generate structured instruction, context, and response records with CC BY-SA attribution in one run.

What is a Stack Overflow analytics tool?

A Stack Overflow analytics tool that turns developer questions into actionable backlog tasks, root-cause hypotheses, and validated tickets across scheduled runs.

How can an AI agent decide whether to act on developer feedback?

An AI agent can read the actor's decisionReadiness enum (actionable / monitor / insufficient-data) and the per-cluster shouldAct boolean to gate automation without parsing prose.

Nine workflow presets — pick one and go

Don't want to configure 30+ fields? Pick a preset.

Task-based presets

Preset	What it does	Typical input
`standard`	Plain search — fast, cheap, no enrichment. Backwards-compatible default.	`query`, optional `tagged`
`ai-training`	Emits `{instruction, context, response}` LLM-ready records with CC BY-SA attribution.	`query`, `tagged`, `maxResults`
`monitoring`	Daily watchdog. Returns only new since last run, fires alerts on spikes, emits decision record.	`tagged`, `incrementalKey`
`research`	Gap analysis + topic mapping with intelligence + problem clusters + decision summary + methodology.	`query`, `tagged`, `maxResults: 100+`
`seo-content`	Content opportunities + trending topics + ranked decision record.	`tagged`, `maxResults: 100+`

Persona-based presets

Preset	For	What it does
`for-startups`	Founders / PMs	Daily product-mention monitoring with alerts + decision record.
`for-content-creators`	Bloggers / YouTubers	Content gap discovery with problem clusters + ranked opportunities.
`for-devrel`	Developer Relations	Daily monitoring + best-answer detection + alerts. Spot threads worth jumping into.
`for-llm-builders`	ML engineers	Strict-quality Q&A pairs for fine-tuning datasets.

Explicit fields always override the preset. If you set preset: 'ai-training' then add answersMode: 'accepted', your override wins.

Why this is better than using the StackExchange API directly

A better alternative to the StackExchange API — this tool replaces raw API usage with scoring, multi-tag clustering, root-cause analysis, GitHub release correlation, and automated backlog generation in a single pipeline — so you stop writing the same 30 boilerplate edge-case handlers in every project.

The StackExchange API is free, so why pay anything? Three reasons:

You skip 30+ edge cases. HTML entity decoding, Unix-to-ISO timestamps, gzip compression, 502/503/504 retries with exponential backoff, 30-second request timeouts, the API's backoff directive, 429 quota exhaustion, the 100-items-per-page limit, page-budget exhaustion, custom site overrides, the difference between is_answered and accepted_answer_id, hybrid answer-selection logic. None obvious until you hit them.
You get question + best answer in one pass. The API requires a search call followed by a separate /answers lookup. This actor batches answer fetches (up to 100 IDs per request) so a 100-question run with bodies costs 2 API calls instead of 101. The hybrid answer mode catches a frequent SO pattern: the asker accepted a stale answer years ago, and a higher-voted newer answer is the actual canonical solution.
It's an intelligence layer, not just a fetcher. Quality / virality / difficulty scores. Tag clustering. Cross-run change detection. Semantic re-ranking. AI-ready Q&A pair output. Schedulable. Pay-per-event so you only pay for what you receive.

Key features

5 workflow presets — standard, ai-training, monitoring, research, seo-content (above).
30+ sites in dropdown, 170+ via custom override — Stack Overflow, Server Fault, Super User, Ask Ubuntu, Software Engineering, Code Review, DBA, Webmasters, Web Apps, UX, Game Development, Mathematics, Cross Validated, Data Science, AI, Computer Science, Information Security, Cryptography, Reverse Engineering, Unix & Linux, Apple, Android, Raspberry Pi, Electrical Engineering, TeX/LaTeX, English Language, Writers, Personal Finance, Workplace, Academia. For anything else (gaming, cooking, photo, parenting, aviation, mathoverflow), use customSite.
Surgical filtering — query (free text), tagged (required AND-filter), notTagged (excluded), minScore (quality floor), fromDate / toDate (ISO 8601), answeredOnly.
Question body — full Markdown body of each question. No extra API call — same request, richer payload.
Best-answer modes (answersMode):
- accepted — only the answer the asker accepted (default; cheapest).
- top — fetch all answers, return the highest-scoring one.
- hybrid — prefer accepted, but mark when the top-voted answer outranks it.
- none — skip answer fetching entirely.
Tag metadata enrichment — total community-wide usage count per tag.
Per-question intelligence scores — qualityScore, viralityScore, discussionDepth, difficultyScore, timeToAcceptedAnswerHours, ageYears. Pure math on existing fields, no extra API calls.
Tag clustering — group questions by their dominant tag for content strategy / FAQ buckets.
Incremental mode — persist seen IDs in KV store, return only new questions on subsequent runs. Built for daily schedules.
Change detection — scoreDelta, answerCountDelta, acceptedAnswerChanged per question vs last run.
LLM-dataset output mode — emits {recordType: 'llm-pair', instruction, context, response, metadata} records with CC BY-SA attribution, ready for fine-tuning / RAG / eval pipelines.
Semantic search (optional, OpenAI API key) — embed your semanticQuery and re-rank results by cosine similarity. Surfaces conceptually-close questions that keyword search misses.
Semantic deduplication (optional) — drop near-duplicate questions before output. Critical for clean AI training data.
Semantic clustering (optional) — group results by embedding similarity instead of tag overlap.
Body / answer truncation — bodyMaxChars and answerMaxChars for token-budget control in LLM datasets.
Run-level insights — top problems, content opportunities, emerging topics — written to KV-store SUMMARY.
Pay-per-event — $0.005 per question returned. Alert and decision records included free. Stops at your spending limit. No compute markup.
Production-grade — AbortSignal.timeout(30s), exponential-backoff retries, 429 graceful stop, API backoff directive honored, structured error records (recordType: 'error' + failureType enum), failure webhook integration.

Quick start

Plain search:

{
    "query": "web scraping python"
}

Build an LLM training dataset of high-quality calculus Q&A pairs:

{
    "preset": "ai-training",
    "site": "math",
    "tagged": "calculus",
    "maxResults": 200
}

Daily monitoring of the kubernetes tag for new questions, with score-change detection:

{
    "preset": "monitoring",
    "tagged": "kubernetes",
    "incrementalKey": "k8s-daily-watch",
    "maxResults": 50
}

Surface the 3 highest-view unanswered questions in a tag (content opportunities):

{
    "preset": "seo-content",
    "tagged": "fastapi",
    "answeredOnly": false,
    "minScore": 5,
    "maxResults": 100
}

Semantic search + deduplication for a clean AI dataset (requires OpenAI API key):

{
    "preset": "ai-training",
    "tagged": "react",
    "semanticQuery": "How do I manage component state in React without Redux?",
    "semanticDedup": true,
    "openaiApiKey": "sk-...",
    "maxResults": 100
}

Input parameters

Core search

Parameter	Type	Default	Description
`preset`	enum	`standard`	Workflow preset (see table above)
`query`	string	`web scraping python`	Free-text search across titles + bodies
`site`	enum	`stackoverflow`	One of 30 popular sites
`customSite`	string	—	Any StackExchange API site name. Overrides `site`.
`tagged`	string	—	Semicolon-separated tags — AND filter
`notTagged`	string	—	Semicolon-separated tags to exclude
`sortBy`	enum	`votes`	`votes`, `activity`, `creation`, or `relevance`
`answeredOnly`	boolean	`false`	Only questions with an accepted answer
`minScore`	integer	—	Drop questions below this score
`fromDate`	string	—	ISO `YYYY-MM-DD` lower bound
`toDate`	string	—	ISO `YYYY-MM-DD` upper bound
`maxResults`	integer	`30`	1–500

Enrichment

Parameter	Type	Default	Description
`includeBody`	boolean	`false`	Fetch full Markdown body of each question (free — same API call)
`answersMode`	enum	`accepted`	`accepted` / `top` / `hybrid` / `none`
`includeAcceptedAnswer`	boolean	`false`	When `answersMode = accepted`, fetch the accepted-answer body
`enrichTagMetadata`	boolean	`false`	Add total usage count per tag
`bodyMaxChars`	integer	—	Truncate question body to N chars (token-budget control)
`answerMaxChars`	integer	—	Truncate answer body to N chars

Intelligence

Parameter	Type	Default	Description
`includeIntelligence`	boolean	`false`	Add per-question quality / virality / difficulty / discussion-depth / opportunityScore
`includeClusters`	boolean	`false`	Single-tag clustering with `clusterId` + `clusterLabel`
`includeProblemClusters`	boolean	`false`	Multi-tag co-occurrence clustering (e.g. `react+hooks`) — adds `problemClusterId` + `problemClusterLabel`
`includeInsights`	boolean	`false`	Run-level insights (top problems, content opportunities, emerging topics) → KV `SUMMARY`
`includeTrends`	boolean	`false`	Per-tag trend direction + velocity vs prior run → `SUMMARY.trends`
`includeAlerts`	boolean	`false`	Emit `recordType: 'alert'` records for spikes / surges / new clusters
`includeDecision`	boolean	`false`	Emit a single `recordType: 'decision'` record with recommendations + readiness
`includeMethodology`	boolean	`false`	Add intelligence formulas + weights to `SUMMARY`
`alertSpikeMultiplier`	number	`2.0`	Tag count growth ratio that triggers a spike alert
`alertMinTagCount`	integer	`3`	Minimum question count before a spike alert can fire
`correlateWithGithub`	boolean	`false`	Calls github-repo-search sub-actor on top urgent clusters; boosts root-cause confidence with release evidence. Adds ~$1.35 max per run at defaults.
`correlateWithGithubMaxClusters`	integer	`3`	Max clusters to look up
`correlateWithGithubReposPerCluster`	integer	`3`	Repos per cluster
`githubToken`	string (secret)	—	GitHub PAT for higher API rate limits — recommended when correlating > 1 cluster

Output

Parameter	Type	Default	Description
`outputMode`	enum	`standard`	`standard` (one record per question), `llm-dataset` (one `{instruction, context, response}` record per usable Q+A pair), `decision` (suppress per-question records, emit only the consolidated decision + alerts + tracker results — the "iPhone mode"), or `count-only` (one `{recordType: 'count', total}` record from a single API call — no enrichment, no analysis)

Incremental / monitoring

Parameter	Type	Default	Description
`incremental`	boolean	`false`	Persist seen IDs in KV store; subsequent runs return only new questions
`incrementalKey`	string	auto	Stable state key — share across scheduled runs of the same query
`detectChanges`	boolean	`false`	Add `change` object with score / answer / acceptance deltas vs last run

Semantic (OpenAI embeddings)

Parameter	Type	Default	Description
`openaiApiKey`	string (secret)	—	Required for any semantic feature below
`embeddingModel`	enum	`text-embedding-3-small`	`text-embedding-3-small` (recommended, $0.02/M tokens) or `text-embedding-3-large` ($0.13/M tokens)
`semanticQuery`	string	—	Re-rank results by cosine similarity to this query's embedding
`semanticDedup`	boolean	`false`	Drop near-duplicate questions
`semanticDedupThreshold`	number	`0.92`	Cosine similarity above which two questions are considered duplicates
`semanticClustering`	boolean	`false`	Cluster results by embedding similarity (overrides tag clustering when both on)

A typical 50-question semantic-enabled run consumes 10–25k OpenAI tokens (~$0.0002–0.0005 in OpenAI fees) on top of the StackExchange API.

Output

Standard question record (typical)

{
    "recordType": "question",
    "questionId": 2081586,
    "title": "Web scraping with Python",
    "link": "https://stackoverflow.com/questions/2081586/web-scraping-with-python",
    "score": 287,
    "answerCount": 12,
    "viewCount": 892451,
    "tags": "python, web-scraping, beautifulsoup, html-parsing",
    "tagList": ["python", "web-scraping", "beautifulsoup", "html-parsing"],
    "isAnswered": true,
    "hasAcceptedAnswer": true,
    "createdAt": "2010-01-18T03:24:11.000Z",
    "lastActivityAt": "2024-09-12T14:08:33.000Z",
    "ownerName": "JohnDev",
    "ownerReputation": 15420,
    "ownerUrl": "https://stackoverflow.com/users/234567/johndev",
    "site": "stackoverflow",
    "extractedAt": "2026-04-26T10:30:00.000Z"
}

Fully enriched (intelligence + hybrid answers + tag metadata + clusters + semantic)

{
    "recordType": "question",
    "questionId": 2081586,
    "title": "Web scraping with Python",
    "bodyMarkdown": "I want to grab daily sunrise/sunset times...",
    "tagMetadata": [
        { "name": "python", "count": 2188432 },
        { "name": "web-scraping", "count": 27145 }
    ],
    "intelligence": {
        "qualityScore": 0.91,
        "viralityScore": 0.42,
        "discussionDepth": 0.78,
        "difficultyScore": 0.34,
        "timeToAcceptedAnswerHours": 0.4,
        "ageYears": 16.3
    },
    "clusterId": "python",
    "clusterLabel": "python",
    "semantic": {
        "similarityToQuery": 0.872,
        "semanticRank": 1,
        "semanticClusterId": "sem-3",
        "semanticClusterLabel": "python / beautifulsoup / requests"
    },
    "acceptedAnswer": {
        "answerId": 2081640,
        "score": 412,
        "isAccepted": false,
        "selectionReason": "top-scoring",
        "outranksAcceptedAnswer": true,
        "createdAt": "2010-01-18T03:48:22.000Z",
        "bodyMarkdown": "I would use [Scrapy](https://scrapy.org)...",
        "ownerName": "Alex",
        "ownerReputation": 18200
    },
    "topAnswers": [ { "answerId": 2081640, "score": 412, "isAccepted": false }, ... ]
}

Count-only record (when `outputMode: 'count-only'`)

{
    "recordType": "count",
    "query": "kubernetes ingress",
    "site": "stackoverflow",
    "tagged": "kubernetes",
    "answeredOnly": false,
    "total": 12847,
    "countedAt": "2026-05-08T12:00:00.000Z"
}

One API call, one record. No question records, no enrichment, no analysis. Use it as a cheap probe before scheduling a full run, or to size topics across sites in a sweep.

LLM-dataset record (when `outputMode: 'llm-dataset'`)

{
    "recordType": "llm-pair",
    "questionId": 2081586,
    "instruction": "Web scraping with Python",
    "context": "I want to grab daily sunrise/sunset times...",
    "response": "I would use [Scrapy](https://scrapy.org)...",
    "metadata": {
        "site": "stackoverflow",
        "link": "https://stackoverflow.com/questions/2081586/web-scraping-with-python",
        "tags": ["python", "web-scraping"],
        "questionScore": 287,
        "answerScore": 412,
        "ownerName": "JohnDev",
        "ownerReputation": 15420,
        "license": "CC BY-SA 4.0",
        "attributionUrl": "https://stackoverflow.com/questions/2081586/web-scraping-with-python",
        "extractedAt": "2026-04-26T10:30:00.000Z",
        "intelligence": { "qualityScore": 0.91, ... }
    }
}

Output fields reference

Field	Type	Description
`recordType`	string	`question`, `llm-pair`, or `error`
`questionId`	integer	Unique StackExchange question ID
`title`	string	Question title (HTML entities decoded)
`link`	string	Direct URL to the question
`score`, `answerCount`, `viewCount`	integer	Engagement metrics
`tags` / `tagList`	string / array	Comma-separated and array forms
`tagMetadata`	array	`{ name, count }` per tag — only when `enrichTagMetadata` is on
`isAnswered` / `hasAcceptedAnswer`	boolean	Engagement signals
`bodyMarkdown`	string	Full question body — only when `includeBody` is on
`acceptedAnswer`	object	Best answer (varies by `answersMode`) — `null` if no answer
`topAnswers`	array	Up to 3 highest-scoring answers — only in `top` / `hybrid` modes
`intelligence`	object	Computed scores (see below) — only when `includeIntelligence` is on
`clusterId` / `clusterLabel`	string	Tag cluster — only when `includeClusters` is on
`semantic`	object	Embedding-based fields — only when a semantic option is on
`change`	object	Cross-run delta — only when `incremental` or `detectChanges` is on
`createdAt` / `lastActivityAt` / `extractedAt`	string	ISO 8601 timestamps
`ownerName` / `ownerReputation` / `ownerUrl`	various	Author info
`site`	string	Site the question was found on

`intelligence` field detail

Field	Range	Meaning
`qualityScore`	0–1	Composite of score, accepted-answer presence, and view depth
`viralityScore`	0–1	Score-per-view ratio — high = engagement explosion (rare)
`discussionDepth`	0–1	Answer count + top-answer score — community-grade discussion
`difficultyScore`	0–1	High = no acceptance + low score + many views (genuinely hard problem); low = quick acceptance + high score
`timeToAcceptedAnswerHours`	hours	From question creation to accepted-answer creation; `null` when not applicable
`ageYears`	years	Age of the question in years

Run summary (in KV store, not dataset)

A SUMMARY record is written to the run's default key-value store. Contents include preset used, output mode, totals, top tags, clusters, run-level insights, semantic-mode stats, and incremental state. Open the run's KV store to read it. The dataset stays uniform.

{
    "preset": "research",
    "site": "stackoverflow",
    "questionCount": 100,
    "topTags": [...],
    "clusters": [
        { "clusterId": "kubernetes", "label": "kubernetes", "size": 22, "sampleTags": ["kubernetes", "docker", "helm"], "avgScore": 14.3 }
    ],
    "clusterMode": "tag",
    "insights": {
        "contentOpportunities": [
            { "questionId": 12345, "title": "...", "viewCount": 24300, "score": 5 }
        ],
        "topProblems": [
            { "tag": "kubernetes", "questionCount": 22, "sampleTitles": ["...", "...", "..."] }
        ],
        "emergingTopics": [
            { "tag": "argocd", "avgViralityScore": 0.34, "questionCount": 4 }
        ],
        "portfolioStats": { "avgQuality": 0.62, "avgVirality": 0.18, "avgDiscussionDepth": 0.51, "unansweredPct": 18 }
    },
    "semantic": null,
    "incrementalState": null,
    "quotaRemaining": 287,
    "quotaMax": 300,
    "ranAt": "2026-04-26T10:30:00.000Z"
}

Failure types (on `recordType: 'error'` records)

`failureType`	When it fires
`invalid-input`	Neither `query` nor `tagged` provided, or the API returned 400
`no-data`	The query ran but matched zero questions (or in incremental mode, nothing new)
`rate-limited`	StackExchange returned 429, or quota stopped the run
`timeout`	A request exceeded the 30s timeout after retries
`api-error`	StackExchange returned an unexpected non-2xx after retries

The decision layer — read one record, do one thing

Most search actors return data and leave you to interpret it. This actor's outputMode: 'standard' already does that. But if you turn on includeDecision: true (or use monitoring / research / seo-content / persona presets), the actor emits a single recordType: 'decision' record at the end:

{
    "recordType": "decision",
    "headline": "Top opportunity: \"How do I configure Helm chart values across environments?\" (score 0.91)",
    "topContentOpportunities": [
        {
            "questionId": 12345,
            "title": "How do I configure Helm chart values across environments?",
            "link": "https://stackoverflow.com/questions/12345",
            "opportunityScore": 0.91,
            "viewCount": 24800,
            "reason": "24,800 views, no accepted answer (0.91 opportunity)"
        }
    ],
    "urgentProblems": [
        { "clusterId": "kubernetes+helm", "label": "kubernetes / helm / values", "questionCount": 14, "unresolvedPct": 64, "avgDifficulty": 0.71 }
    ],
    "trendingTopics": [
        { "tag": "argocd", "direction": "rising", "velocity": 0.83, "pctChange": 240 }
    ],
    "ignoredHighValueQuestions": [
        { "questionId": 99887, "title": "Helm rollback strategy with stateful sets", "link": "...", "viewCount": 18200, "ageYears": 3.2, "opportunityScore": 0.78 }
    ],
    "recommendations": [
        "Write content addressing \"How do I configure Helm chart values across environments?\" — 24,800 views, no accepted answer (0.91 opportunity).",
        "Investigate \"kubernetes / helm / values\" — 14 questions, 64% unresolved, avg difficulty 0.71. Likely documentation or feature gap.",
        "Monitor \"argocd\" — rising trend (+240%). Consider creating supporting content while interest is fresh."
    ],
    "actions": {
        "content": [
            { "action": "Write blog post or video", "target": "How do I configure Helm chart values across environments?", "reason": "24,800 views, opportunity score 0.91." }
        ],
        "product": [
            { "action": "Investigate breaking change / migration path", "target": "kubernetes / helm / values", "reason": "14 questions (64% unresolved). Version upgrade pain in kubernetes / helm — users hit issues during migration to a newer release." }
        ],
        "docs": [
            { "action": "Fill documentation gap", "target": "argocd / sync / config", "reason": "8 questions (75% unresolved). Configuration confusion — users struggle with setup or environment-specific tuning." }
        ],
        "devrel": [
            { "action": "Engage in trending tag", "target": "argocd", "reason": "rising (+240%) — community attention is fresh." },
            { "action": "Fast-response engagement", "target": "kubernetes / helm / values", "reason": "Slow community response — be the authoritative voice." }
        ]
    },
    "signalStrength": {
        "confidence": 0.78,
        "sampleSize": 100,
        "trendConsistency": "high",
        "explanation": "100 questions, 17 trended tags pointing the same direction (high consistency), 4 alerts."
    },
    "confidenceLevel": "high",
    "confidenceReason": "100 questions, 17 trended tags, 4 alerts.",
    "decisionReadiness": "actionable"
}

Downstream contract:

decisionReadiness === 'actionable' is the gate for automation. Slack alerts, Zapier triggers, agent tool routing — only act when this fires.
decisionReadiness === 'monitor' means "watch this, don't act yet" — usually fires on first run before trend data is available.
decisionReadiness === 'insufficient-data' means "increase maxResults or schedule a second run."

Record types in the dataset

`recordType`	When emitted	Use
`question`	Every found question (default mode)	Main results
`count`	When `outputMode: 'count-only'`	One record with `total` from a single API call — cheap topic-size probe
`llm-pair`	When `outputMode: 'llm-dataset'`	Drop into fine-tuning / RAG pipelines
`alert`	When `includeAlerts: true` and a threshold trips (incl. `unresolved-rate-drift`)	Wire to Slack / Discord / webhooks
`decision`	When `includeDecision: true` (auto when `outputMode: 'decision'`)	One scannable record with `headline`, `oneLine`, recommended actions, baselineDelta
`tracker-result`	When `pushTasksToTracker` is set	Audit trail of created (or simulated) Jira / Linear / GitHub tickets
`error`	On failures	Filter out with `WHERE recordType != 'error'`

Filter cleanly in SQL / Sheets / agent tool calls: WHERE recordType = 'alert' AND severity != 'info' for monitoring channels; WHERE recordType = 'decision' for the daily executive read; WHERE recordType = 'question' for the data layer.

Alert engine

When includeAlerts: true (or you use the monitoring / for-startups / for-devrel presets), the actor compares this run against the prior run's snapshot and emits structured alerts:

`alertType`	Fires when
`tag-spike`	A tag's question count grew ≥ 2× vs prior run (and ≥ 3 questions).
`unresolved-spike`	A tag's unresolved question count grew ≥ 2× vs prior.
`unresolved-rate-drift`	A cluster's `unresolvedRate` jumped ≥ 15 percentage points since the prior run — early warning that the community is increasingly unable to answer questions in this area, even before the volume spike shows up.
`high-velocity-question`	A specific question gained ≥ 10 score since last run.
`dormant-resurgence`	An old (≥ 12 months) question gained ≥ 5 score — old thread getting new life.
`new-cluster`	A problem cluster appeared that didn't exist last run (≥ 3 questions).
`first-run-baseline`	First run with state — informational only.

Tunable thresholds: alertSpikeMultiplier, alertMinTagCount. Each alert ships with severity: 'info' | 'warning' | 'critical', plain-language message, machine-readable evidence, and a stable alertType enum so downstream automation never has to parse prose.

Trend engine

When includeTrends: true, every tag with cross-run history gets a trend object in the SUMMARY:

{ "tag": "argocd", "direction": "rising", "velocity": 0.83, "currentCount": 17, "priorCount": 5, "pctChange": 240 }

Direction enum: rising (>+25%) / declining (<-25%) / stable / new (no prior) / gone (now zero) / unknown (no prior data yet).

Multi-tag problem clusters

Single-tag clustering says "you have 50 React questions." Problem clusters say:

react / hooks / state-management — 14 questions, 43% unresolved, avg difficulty 0.62
react / native / typescript — 8 questions, 25% unresolved, avg difficulty 0.41
react / forms / validation — 5 questions, 80% unresolved, avg difficulty 0.71  ← likely doc gap

The forms+validation cluster's 80% unresolved rate is the actionable signal. That's where you write a tutorial or improve docs — not at "react" the parent tag.

Each cluster reports tagSignature, questionCount, avgScore, avgAnswerCount, unresolvedRate, avgOpportunityScore, avgDifficultyScore, sampleTitles, plus the three "why and what" blocks below when intelligence + cross-run state are on. Saved to SUMMARY.problemClusters.

Why is this happening? — Root-cause hypotheses

Detection is step one. Every problem cluster gets up to 3 plain-language hypotheses inferred from question-text patterns:

"rootCauseHypotheses": [
    {
        "pattern": "version-upgrade",
        "hypothesis": "Version upgrade pain in kubernetes / helm — users hit issues during migration to a newer release.",
        "confidence": 0.55,
        "evidence": ["Helm 3.14 release breaks chart values...", "After upgrading to v3, sync fails..."]
    }
]

Pattern enum: breaking-change, version-upgrade, deprecated-api, configuration, platform-issue, tooling-confusion, docs-gap, unknown. Pure regex over titles + bodies — no LLM, deterministic, auditable. The top hypothesis bubbles up to decision.urgentProblems[].rootCausePattern and drives the typed-action recommendations (e.g. a breaking-change cluster routes to actions.product, a docs-gap cluster to actions.docs).

Where is this in its lifecycle?

Every problem cluster gets a lifecycle stage when cross-run state is available:

"lifecycle": { "stage": "growing", "durationDays": 14, "firstSeenAt": "2026-04-13T10:00:00Z" }

Stage	Meaning
`emerging`	Cluster didn't exist last run — fresh problem area.
`growing`	Cluster's dominant tag is rising (>+25%).
`peak`	High count, stable trend — established problem.
`declining`	Dominant tag is declining (<-25%) — fading.
`dormant`	Tag is gone — cluster will likely fall off next run.
`unknown`	First run with state, or no trend signal.

durationDays is computed from the cluster's firstSeenAt timestamp, persisted in KV state and updated on every run. Use it to spot the problems that have been festering longest.

GitHub release correlation — promote hypotheses from speculative to evidence-backed

When correlateWithGithub: true (auto-enabled in research / for-startups / for-devrel presets), the actor calls our github-repo-search sub-actor on the top urgent problem clusters. Each cluster gets a githubContext block with the top repos for the cluster's dominant tag, their latest-release timestamps, and abandoned status:

"githubContext": {
    "queriedAs": "kubernetes",
    "topRepos": [
        { "fullName": "kubernetes/kubernetes", "stars": 109800, "daysSinceLastPush": 0, "isAbandoned": false, "latestReleaseTag": "v1.30.2", "latestReleaseDaysAgo": 18 },
        { "fullName": "helm/helm", "stars": 26900, "daysSinceLastPush": 2, "isAbandoned": false, "latestReleaseTag": "v3.15.0", "latestReleaseDaysAgo": 23 }
    ],
    "recentReleaseDetected": true,
    "mostRecentReleaseDaysAgo": 18,
    "anyAbandoned": false,
    "totalStars": 136700,
    "evidence": [
        "kubernetes/kubernetes released v1.30.2 18 days ago.",
        "helm/helm released v3.15.0 23 days ago."
    ],
    "boostedHypothesis": true,
    "estimatedCostUsd": 0.45
}

When the correlation finds external signals matching a hypothesis (recent release for breaking-change / version-upgrade / deprecated-api, or repo abandonment for tooling-confusion / docs-gap), the actor runs a multi-signal causal-inference model instead of a flat boost — see "Causal inference model" below. The hypothesis gets a structured causalInference block with seven independent signals, each weighted by pattern, plus a plain-language explanation and an evidence tier (weak / moderate / strong / definitive).

githubContext.boostedHypothesis = true when at least one hypothesis's score increased vs the pre-correlation baseline.

Causal inference model

The flat +0.30 boost is replaced by a weighted sum of seven independent signals. Each hypothesis pattern has its own weight pack — breaking-change weights releaseProximity highest; tooling-confusion weights repoAbandonment highest. Sum is clamped to [0, 1].

"causalInference": {
    "score": 0.85,
    "signals": {
        "patternMatch": true,
        "releaseProximity": true,
        "keywordMatch": true,
        "trendSpike": true,
        "repoActive": true,
        "repoAbandonment": false,
        "temporalAlignment": true
    },
    "weights": {
        "patternMatch": 0.20,
        "releaseProximity": 0.30,
        "keywordMatch": 0.20,
        "trendSpike": 0.10,
        "repoActive": 0.10,
        "temporalAlignment": 0.10
    },
    "explanation": "Causal evidence: recent release detected (18d ago); release version mentioned in question titles; questions appeared after the release; cluster is rising / new; dominant repo is actively maintained.",
    "evidenceTier": "strong"
}

Signal	What it detects
`patternMatch`	The hypothesis pattern's regex fired in cluster question text — foundational.
`releaseProximity`	Recent release (≤ 60 days) on the dominant GitHub repo.
`keywordMatch`	Release version (`v1.30.2`, `1.30`, `30.2`) is mentioned in cluster question titles.
`trendSpike`	Cluster lifecycle is `emerging` / `growing` (or dominant tag is `rising`).
`repoActive`	Dominant repo has recent commits and is not abandoned.
`repoAbandonment`	Dominant repo is abandoned — relevant for tooling / docs-gap hypotheses.
`temporalAlignment`	Question median creation date came AFTER the release date (causal direction sanity check).

Evidence tier is derived from the count of active signals: 0–2 → weak, 3–4 → moderate, 5+ → strong. definitive is reserved for future cross-source confirmation (Reddit / HN). Filter for actionable signals downstream with WHERE causalInference.evidenceTier IN ('strong', 'definitive').

Temporal analysis (release → spike lag)

Every cluster with GitHub correlation gets a temporal-analysis block:

"temporalAnalysis": {
    "releaseDate": "2026-04-09T00:00:00.000Z",
    "questionMedianDate": "2026-04-13T08:00:00.000Z",
    "releaseToMedianLagDays": 4,
    "pattern": "immediate-impact",
    "explanation": "Cluster questions concentrated 4 days after kubernetes/kubernetes v1.30.2 — strong causal alignment."
}

Pattern enum:

`pattern`	Lag (days)	Meaning
`pre-release`	< -7	Questions were asked BEFORE the release — release is unlikely to be the cause.
`immediate-impact`	0–7	Strong causal alignment.
`delayed-impact`	8–30	Adoption / discovery delay pattern.
`slow-burn`	31–180	Slow-burn issue or only loosely related.
`ambient`	> 180	Likely no direct causal connection.
`unknown`	n/a	No release detected, or no question dates.

Impact score (severity + audience size)

Every problem cluster gets an impact score even without GitHub correlation:

"impactScore": {
    "severity": "high",
    "estimatedUsersAffected": "very-large",
    "totalViews": 845200,
    "unresolvedViews": 412300,
    "reason": "14 questions, 845,200 total views (412,300 in unresolved threads), 64% unresolved → high severity, very-large audience."
}

severity is a composite of question count, view depth (log-normalized), and unresolved rate. estimatedUsersAffected buckets total views: > 500k → very-large, > 50k → large, > 5k → medium, otherwise small. Use it to prioritize: WHERE impactScore.severity = 'high' AND impactScore.estimatedUsersAffected IN ('large', 'very-large').

Cost — github-repo-search bills $0.15 per repo fetched. Defaults are 3 clusters × 3 repos = $1.35 max per run. Tunable via correlateWithGithubMaxClusters / correlateWithGithubReposPerCluster. The estimated max cost is logged at run start, and the actual cost is reported in SUMMARY.githubCorrelation.totalCostUsd. Failures are circuit-broken at 3 consecutive errors and don't crash the run.

GitHub token — anonymous GitHub API allows 60 req/hr. Supply githubToken (PAT, no special scopes) for 5,000 req/hr. Strongly recommended for runs with > 1 cluster.

Execution layer — turn insights into Jira / Linear / GitHub tickets

Insights are useful. Tickets are actionable. The decision record now includes a tasks[] array of execution-ready work items shaped to drop straight into any tracker:

{
    "id": "task-kubernetes-helm-1",
    "title": "Investigate regression in Kubernetes / Helm / Values after kubernetes/helm v3.15.0",
    "description": "**Cluster:** kubernetes / helm / values — 14 questions, 64% unresolved.\n\n**Impact:** high severity, very-large audience (845,200 views).\n\n**Business impact:** Users upgrading to kubernetes/helm v3.15.0 are running into Kubernetes / Helm / Values issues — likely affecting onboarding, retention, and migration projects across a very large audience.\n\n**Likely root cause:** version-upgrade (confidence 0.85).\n\n**Timeline:** Cluster questions concentrated 4 days after kubernetes/helm v3.15.0 — strong causal alignment.\n\n**Top question titles:**\n- How do I configure Helm chart values across environments?\n- ...",
    "team": "product",
    "priority": "urgent",
    "suggestedOwner": "engineering / platform team",
    "labels": ["cluster:kubernetes+helm", "team:product", "severity:high", "pattern:version-upgrade", "category:risk", "auto-actionable"],
    "estimatedImpact": "high",
    "clusterId": "kubernetes+helm",
    "relatedQuestionIds": [12345, 12346, 12347],
    "evidence": [
        "Users upgrading to kubernetes/helm v3.15.0 are running into ... issues.",
        "Causal evidence: recent release detected (4d ago); release version mentioned in question titles; questions appeared after the release; cluster is rising / new.",
        "Cluster questions concentrated 4 days after kubernetes/helm v3.15.0 — strong causal alignment."
    ],
    "releaseTrigger": "kubernetes/helm@v3.15.0",
    "acceptanceCriteria": [
        "Reproduce the regression / issue with a minimal repro.",
        "Identify the offending change (release notes, bisect, or instrumentation).",
        "Ship a fix or document the workaround in release notes.",
        "Verify by re-running this actor — the cluster should drop in unresolvedRate or disappear from urgentProblems."
    ],
    "shouldAct": true
}

Mapping table for tracker integration:

Field	Jira	Linear	GitHub Issues
`title`	Summary	Title	Title
`description`	Description (Markdown)	Description (Markdown)	Body (Markdown)
`team`	Component / Team	Team	Repository / project
`priority`	Priority	Priority	(Label)
`suggestedOwner`	Default Assignee role	Lead role	Default reviewer
`labels`	Labels	Labels	Labels
`acceptanceCriteria`	Acceptance Criteria field	Description tail	Body checklist
`relatedQuestionIds`	Linked issues / comments	Comments	Body links

Tasks are sorted: shouldAct first, then by priority. Use WHERE recordType = 'decision' then iterate tasks[] in your downstream pipeline.

Cluster filters — route to the right team

Different teams care about different cluster categories. The decision layer (urgentProblems / tasks / systemicIssues / tracker push) accepts two filters:

clusterCategoryFilter — restrict to one of opportunity, risk, hybrid, noise, or all (default).
rootCausePatternFilter — restrict to clusters whose top root-cause pattern is in this list. Multiple patterns allowed.

Routing recipes:

// Engineering team — only show product / platform issues
{
    "preset": "research",
    "clusterCategoryFilter": "risk",
    "rootCausePatternFilter": ["breaking-change", "version-upgrade", "deprecated-api", "platform-issue"],
    "pushTasksToTracker": "jira",
    "onlyPushShouldAct": true
}

// Docs team — only show documentation gaps and config confusion
{
    "preset": "research",
    "rootCausePatternFilter": ["docs-gap", "configuration", "tooling-confusion"],
    "pushTasksToTracker": "linear"
}

// Content team — only show content opportunities
{
    "preset": "seo-content",
    "clusterCategoryFilter": "opportunity",
    "pushTasksToTracker": "github"
}

The filters affect the decision layer only. Question records, alerts, and resolution feedback all see the unfiltered cluster set so cross-run continuity isn't broken.

Auto-create tickets in Jira / Linear / GitHub Issues

The execution layer would still leave you to manually create tickets. So now the actor pushes them straight into your tracker. Set pushTasksToTracker to jira, linear, or github, supply the relevant credentials, and every task in the decision record becomes a ticket.

Safety first: trackerDryRun defaults to true. The first run logs what would have been created without touching your tracker. Each task gets a recordType: 'tracker-result' record in the dataset showing the simulated outcome. Only flip to trackerDryRun: false once you've reviewed the dry-run report.

{
    "pushTasksToTracker": "jira",
    "trackerDryRun": false,
    "onlyPushShouldAct": true,
    "jiraBaseUrl": "https://your-company.atlassian.net",
    "jiraEmail": "ops@your-company.com",
    "jiraApiToken": "<secret>",
    "jiraProjectKey": "ENG",
    "jiraIssueType": "Task"
}

Recommended production pattern: schedule with pushTasksToTracker: 'jira' + onlyPushShouldAct: true. Combined with the shouldAct gate, this means only fully-validated tasks land in your backlog — high causal confidence, high impact, strong evidence, no contradictions. Tickets you can act on without committee.

Idempotency: every task carries a stable apify-stackexchange-task:{id} label. Re-running the same query may create duplicates — searching the tracker for existing items is on you. The cleanest pattern is to use incremental: true (so you only see new clusters) plus onlyPushShouldAct: true (so you only push fully-validated ones). Together they keep the backlog clean.

Per-tracker results in the dataset:

{
    "recordType": "tracker-result",
    "target": "jira",
    "taskId": "task-kubernetes-helm-1",
    "clusterId": "kubernetes+helm",
    "success": true,
    "dryRun": false,
    "createdUrl": "https://your-company.atlassian.net/browse/ENG-2891",
    "createdId": "ENG-2891",
    "actionReason": "Auto-created from stackexchange-search cluster kubernetes+helm (high impact).",
    "timestamp": "2026-04-27T10:30:00.000Z"
}

Resolution feedback (closed-loop validation)

Schedule the actor on the same query. The next run loads the prior priorClusterSnapshots from KV state and computes per-cluster resolution feedback:

[
    {
        "clusterId": "kubernetes+helm",
        "clusterLabel": "kubernetes / helm / values",
        "priorUnresolvedRate": 0.64,
        "currentUnresolvedRate": 0.18,
        "drop": 0.46,
        "outcome": "improving",
        "explanation": "kubernetes / helm / values unresolvedRate improved from 64% to 18% — trending toward resolution.",
        "priorPattern": "version-upgrade"
    }
]

Outcome enum: resolved (drop ≥ 50% OR cluster disappeared), improving (drop ≥ 20%), unchanged, worsening (drop ≤ -20%). Lives in SUMMARY.resolutionFeedback. Use it to confirm: did the ticket actually fix the problem, or did it persist after the deploy?

Pattern calibration (the learning system)

Resolution-feedback entries get appended to a per-pattern history bucket (FIFO bounded at 50 per pattern). After ≥3 runs of cross-run history, the actor surfaces calibrated confidence per root-cause pattern in SUMMARY.patternCalibration:

[
    {
        "pattern": "version-upgrade",
        "samples": 14,
        "confirmedAsCausal": 11,
        "meanDrop": 0.42,
        "calibratedConfidence": 0.78,
        "insight": "version-upgrade hypotheses have proven reliable: 79% confirmed across 14 samples, mean unresolvedRate drop 0.42. Trust the score."
    },
    {
        "pattern": "tooling-confusion",
        "samples": 6,
        "confirmedAsCausal": 1,
        "meanDrop": 0.04,
        "calibratedConfidence": 0.22,
        "insight": "tooling-confusion hypotheses are over-attributed: only 17% confirmed across 6 samples (mean drop 0.04). Consider lowering its weight or requiring more signals before acting."
    }
]

calibratedConfidence uses the harmonic mean of precision (fraction confirmed as resolved/improving) and sample-adequacy (capped at 10 samples) — both signals must be healthy for high confidence. Cold-start (<3 samples) is flagged in the insight string. The actor surfaces this learning data but does not auto-mutate the causal weights — opaque self-tuning destroys trust. Use the insights to manually tune correlateWithGithub* thresholds or to question the actor's outputs when a pattern shows low calibrated confidence.

Trust summary (non-technical)

The decision record exposes a plain-language trustSummary block readable by execs:

{
    "level": "high",
    "reason": "6 independent signals aligned: 100 questions analysed, cross-run trends consistent, multiple clusters confirmed against GitHub, temporal alignment with releases.",
    "alignedSignals": 6
}

Tier mapping: ≥5 aligned signals → high; 3-4 → medium; <3 → low. Signals counted: large sample, consistent trends, alerts firing, multi-cluster GitHub correlation, temporal alignment with releases, systemic patterns detected, at least one cluster meeting the automation bar. Use it for status emails and dashboard tiles — every reader from CTO to support engineer can interpret it without context.

Decision gate — `shouldAct` boolean

Every cluster gets a shouldAct: boolean field. The decision record gets a top-level anyShouldAct: boolean. Both are derived deterministically:

shouldAct = causalInference.score >= 0.7
         AND impactScore.severity = "high"
         AND evidenceTier IN ("strong", "definitive")
         AND no warning-level contradictions

Wire automation to WHERE anyShouldAct = true (run-level) or WHERE shouldAct = true (cluster-level / task-level). Everything else is monitor — show in dashboards, don't auto-act.

Cluster category — opportunity / risk / hybrid / noise

Every problem cluster is classified for quick filtering:

Category	When
`opportunity`	High `opportunityScore` + ≥ 40% unresolved + large audience. Content / docs target.
`risk`	High severity + breaking-change / deprecated-api / platform-issue pattern, OR persistent unresolved with high difficulty. Engineering investigation.
`hybrid`	Both opportunity and risk signals strong.
`noise`	Low question count, low severity. Skip.

Each cluster also exposes categoryReason — a plain-language explanation of why it landed in that bucket.

Contradictions — when not to trust the signal

Built-in sanity checks flag conflicting signals so you don't act on noise:

`code`	When it fires
`release-without-keyword`	Recent release detected, but no questions mention the release version. Correlation may be coincidental.
`docs-gap-but-spiking`	Cluster classified docs-gap but is rising — usually docs gaps are stable.
`severity-but-declining`	High severity but cluster is in decline. May have already peaked.
`high-impact-low-evidence`	High impact but evidence tier is weak. Treat as monitor only.
`old-cluster-classified-emerging`	Cluster first-seen is old but lifecycle says emerging. State may be inconsistent.
`high-difficulty-rapid-acceptance`	High difficulty but resolved fast — internally inconsistent.

warning-severity contradictions block shouldAct. info-severity ones surface as evidence on the task but don't gate automation.

Systemic issues — patterns across clusters

When multiple clusters share a signal, the decision record's systemicIssues[] array surfaces the bigger picture:

[
    {
        "pattern": "shared-repo-release",
        "summary": "Multiple clusters (kubernetes / helm / values; helm / sync / config) point to recent release kubernetes/helm@v3.15.0 — likely a regression with broad impact.",
        "clusterIds": ["kubernetes+helm", "helm+sync"],
        "sharedSignal": "kubernetes/helm@v3.15.0",
        "combinedTotalViews": 1240800,
        "meanEvidenceTier": "strong"
    }
]

Pattern enum: shared-repo-release, shared-root-cause, shared-tag-cohort, cross-cluster-abandonment. Pure deterministic — no LLM. Sorted by combined audience reach.

Time-to-resolution opportunity

Every cluster reports a resolutionGap indicating how fast (or slow) the community is at answering questions in that area:

"resolutionGap": {
    "medianTimeToAnswerHours": 48.6,
    "speedClass": "slow",
    "opportunity": "Slow community response — fast-response advantage for DevRel teams. Be the first authoritative answer."
}

speedClass is fast (≤ 0.5× the run median), slow (≥ 2× the run median), medium, or unknown. Slow clusters are DevRel gold — questions sit unanswered, and being the first authoritative voice carries disproportionate community weight.

Typed actions — content / product / docs / devrel

The decision record's actions block splits recommendations by team responsibility:

Bucket	What goes here
`content`	Blog post / video / tutorial targets ranked by opportunity score
`product`	Engineering tasks — investigate breaking changes, fix migration paths, address platform-specific bugs
`docs`	Documentation gaps — config guides, deprecation timelines, decision/comparison docs
`devrel`	Engagement targets — trending tags to participate in, slow-response clusters where authoritative answers carry weight

Each action ships with action (verb), target (the question / cluster / tag), and reason (why this matters). Pipe directly into Linear / Jira / GitHub Issues / Trello with no manual translation.

Signal strength — is this run trustworthy?

The decision record exposes a structured signal-strength block:

"signalStrength": {
    "confidence": 0.78,
    "sampleSize": 100,
    "trendConsistency": "high",
    "explanation": "100 questions, 17 trended tags pointing the same direction (high consistency), 4 alerts."
}

Confidence is the harmonic mean of three components: sample-size (>= 100 = full credit), trend-consistency (>=70% of trended tags pointing the same direction = high), and alert presence. Harmonic mean means a weak component cannot be masked by strong ones — same logic F1 score uses for precision + recall. trendConsistency is high, medium, low, or unknown (no prior run yet). Trust the run when confidence ≥ 0.7 + decisionReadiness === actionable.

Intelligence methodology (transparent scoring)

Set includeMethodology: true (or use preset: 'research') to add the formulas and weights to the SUMMARY. Quick reference:

Score	Formula
`qualityScore`	0.45 × log10(score+1)/3 + 0.30 × (acceptedAnswer ? 1 : 0) + 0.25 × log10(views+1)/6
`viralityScore`	min(1, (score / views) × 10000) — score per 10k views = 1.0
`discussionDepth`	0.6 × log10(answerCount+1)/log10(20) + 0.4 × log10(topAnswerScore+1)/3
`difficultyScore`	High if no acceptance + many views + low score; low if quick acceptance + high score; otherwise scaled by hours-to-accept
`opportunityScore`	0.40 × viewComp + 0.30 × unansweredComp + 0.20 × difficultyScore + 0.10 × lowScoreComp

All scores are 0–1, log-normalized to flatten outliers, deterministic (no LLM), and documented in SUMMARY.intelligenceMethodology when the toggle is on.

How incremental mode works (first run, second run, third run)

First run — no prior state. The actor returns up to maxResults questions, marks them all as new, and saves their IDs + scores + acceptance to KV store under the incrementalKey.

Second run — loads prior state. Drops any returned ID that was already seen. If detectChanges is on, every returned question gets a change object showing scoreDelta, answerCountDelta, acceptedAnswerChanged. If only new questions appear, change.isNewSinceLastRun = true.

Third+ runs — same as second, with state accumulating up to 5000 IDs (FIFO bounded). Beyond 5000 the oldest are pruned.

This is how you turn the actor into a true monitoring product: schedule it daily, get only the delta. Pair with the Apify run-finished webhook → Slack/email for instant alerts.

Use cases

AI / LLM training data — preset ai-training + semantic dedup. Output is {instruction, context, response} records with CC BY-SA attribution. Drop into fine-tuning pipelines, RAG ingestion, or evals with no post-processing.
Daily product monitoring — preset monitoring with tagged: "your-product-name". Catch bugs and feature requests posted in public.
Documentation gap analysis — preset research. The insights.contentOpportunities array is your blog/video backlog: high-view questions with no accepted answer.
Bounty hunting — sortBy: 'creation', answeredOnly: false, minScore: 5. High-score unanswered questions with bounty potential.
Competitive intelligence — multiple runs across rival tags, diff the question volume + virality scores to see where the community is moving.
Trend tracking — combine enrichTagMetadata with date filters to see which tags are growing in absolute usage.
Recruiting / sourcing — surface high-reputation answerers in a niche tag.
SEO content briefs — preset seo-content. The insights.contentOpportunities + insights.emergingTopics together = a content calendar.

API & integrations

The actor ID is BIc8GRivosWDHHrwf. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.

Python — AI training dataset with semantic dedup

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("BIc8GRivosWDHHrwf").call(run_input={
    "preset": "ai-training",
    "tagged": "react",
    "minScore": 10,
    "semanticDedup": True,
    "openaiApiKey": "sk-...",
    "maxResults": 200,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item.get("recordType") != "llm-pair":
        continue
    # Drop straight into your fine-tuning pipeline
    print({
        "instruction": item["instruction"],
        "context": item["context"],
        "response": item["response"],
        "metadata": item["metadata"],  # license, attribution, intelligence
    })

JavaScript — daily monitoring with change detection

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const run = await client.actor("BIc8GRivosWDHHrwf").call({
    preset: "monitoring",
    tagged: "kubernetes",
    incrementalKey: "k8s-daily",
    detectChanges: true,
    maxResults: 50,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items
    .filter((q) => q.recordType === "question")
    .forEach((q) => {
        const c = q.change;
        if (c?.isNewSinceLastRun) console.log(`NEW: ${q.title}`);
        else if (c?.scoreDelta && c.scoreDelta >= 10) console.log(`HOT: ${q.title} +${c.scoreDelta} since last run`);
        else if (c?.acceptedAnswerChanged) console.log(`SOLVED: ${q.title}`);
    });

cURL — semantic re-rank

curl -X POST "https://api.apify.com/v2/acts/BIc8GRivosWDHHrwf/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "tagged": "react",
    "semanticQuery": "manage state without redux",
    "openaiApiKey": "sk-...",
    "maxResults": 50
  }'

Schedules, webhooks, downstream apps

Apify Schedules — daily or hourly; pair with preset: 'monitoring' for true incremental output.
Webhooks — fire on completion to Slack, Discord, email, any HTTP endpoint.
Zapier / Make / n8n — trigger downstream workflows on new high-score questions.
Vector DB pipelines — pipe recordType: 'llm-pair' records into Pinecone, Weaviate, Qdrant, Postgres+pgvector.
Google Sheets / BI — pull dataset items via the Apify API, filter WHERE recordType = 'question' to drop error rows.

Use in Dify

Drop StackExchange Search into Dify workflows via the Apify plugin's Run Actor node. Each run returns multiple decision-ready record types as structured JSON — recordType enum (question / decision / alert / tracker-result / llm-pair), decisionReadiness (actionable / monitor / insufficient-data), intent enum (bug-report / how-to / design-advice / version-migration / feature-request / unknown), priority (urgent / high / medium / low), severity (critical / warning / info), alertType (multiple stable codes), shouldAct boolean (gates auto-action), and recommendations[] (ranked playbooks) your downstream node branches on. Generic StackExchange API returns raw questions; this returns prioritised decisions with pre-built tickets.

Actor ID: ryanclinton/stackexchange-search
Sample input (developer feedback mining for product roadmap):

{
    "tags": ["nextjs", "react"],
    "preset": "monitoring",
    "monitorStateKey": "frontend-feedback-q2"
}

Branching example — a Dify if/else node reads recordType first, then routes per-type:
- recordType = "decision" AND shouldAct: true AND decisionReadiness = "actionable" → call your tracker tool to create a ticket using recommendations[] as the body
- recordType = "alert" → branch on severity: critical → page on-call, warning → log + flag, info → silently archive
- recordType = "question" AND intent = "bug-report" AND priority = "urgent" → engineering triage queue
- recordType = "question" AND intent = "feature-request" AND signalScore > 70 → product backlog
- recordType = "question" AND intent = "how-to" → docs gap signal — file a docs-team ticket
- recordType = "llm-pair" → pipe directly to your fine-tuning / RAG / eval pipeline
For closed-loop monitoring: preset: "monitoring" + a stable monitorStateKey enables cross-run change detection — Dify branches on recordType = "tracker-result" outcomes (confirmed / improving / unchanged / worsening) to escalate when last cycle's recommendations didn't land
Auto-create tickets: built-in pushTasksToTracker: 'jira' | 'linear' | 'github' input — the actor itself can create tickets with dryRun: true (default) for rehearsal before Dify wires it live. Pair with onlyPushShouldAct: true to gate ticket creation on the deterministic shouldAct boolean.
Cross-source causal validation: causalInference.evidenceTier (weak / moderate / strong / definitive) lets Dify gate auto-action on evidenceTier IN ('strong', 'definitive') — production-safe filter that suppresses single-signal speculation
recommendations[] priority-ranked playbooks are usable verbatim as ticket bodies, Slack messages, or runbooks — no LLM rewriting required, fully deterministic

The 5-recordType discriminator + decisionReadiness + shouldAct + cross-source causal evidence tier make this the ideal "what should we build / fix / document next based on real developer feedback?" tool inside any Dify product-ops workflow.

Performance & cost

Metric	Value
Memory	128–512 MB (auto-scaled)
Run time, 30 results	3–10 seconds
Run time, 500 results	25–35 seconds
StackExchange API requests, 30 results	1
StackExchange API requests, 500 results	5
StackExchange API requests, 100 results + accepted-answer enrichment	2
StackExchange API requests, 500 results + answers (top mode) + tag metadata	5 + 5 + 1 = 11
Daily anonymous StackExchange quota	300 requests / IP
StackExchange API cost	Free
OpenAI cost (semantic, 100 questions, 3-small)	~$0.0005
Apify PPE price	$0.005 per question returned

A 100-question fully-enriched run (preset ai-training + semantic dedup) consumes ~3 StackExchange API requests + ~~25k OpenAI tokens (~~$0.0005) and costs ~$0.50 in Apify PPE.

What this actor does NOT do

Honest scope-fencing prevents bad reviews. Use a sibling tool when you need:

Need	Use this instead
Run arbitrary SQL against the StackExchange data dump	Stack Exchange Data Explorer (free, technical, web-only)
Search only stackoverflow.com with full client-side ranking	The Stack Overflow website (uses Algolia internally — no public API)
Fetch every answer on a question (not just top-3 / accepted)	Set `answersMode: 'top'` (returns top 3) — broader is a future feature, open an issue
Search StackExchange users by name / reputation / location	Future actor — open an issue if you need this
Scrape stackoverflow.com's HTML directly	Don't — TOS violation. Use the API (this actor).
GitHub / GitLab issue search	GitHub Repository Search
Academic paper search	Semantic Scholar, arXiv, DBLP

The actor uses the public StackExchange API only (plus optional OpenAI embeddings). It does not require, store, or transmit any StackExchange credentials. It does not scrape stackoverflow.com directly. It respects the API's quota, rate-limit, and backoff directives.

How it works

Stack Overflow & StackExchange Intelligence Engine
                    ===================================================

  +-----------+      +-----------------+     +---------------------+
  |  Input +  | ---> | Resolve preset  | --> | /search/advanced    |
  |  validate |      | (merge user     |     | page 1..N           |
  |           |      |  flags over     |     | report quota        |
  +-----------+      |  preset preset) |     +----------+----------+
                     +-----------------+                |
                                                        v
                                              +---------+---------+
                                              | Client filters    |
                                              | (minScore,        |
                                              |  incremental)     |
                                              +---------+---------+
                                                        |
                  +-------------+-------------+---------+---------+-----------+
                  |             |             |                   |           |
                  v             v             v                   v           v
           +------+----+  +-----+----+ +------+--------+   +-----+----+ +----+----+
           | Answers   |  | Tag      | | Intelligence  |   | Tag      | | Change  |
           | (none/    |  | metadata | | scoring       |   | clusters | | detect  |
           | accepted/ |  | (batch   | | (pure math)   |   | (run-    | | (vs KV  |
           | top/      |  | 100/req) | |               |   |  local)  | |  state) |
           | hybrid)   |  +----------+ +---------------+   +----------+ +---------+
           +-----+-----+
                 |
                 v
          +------+--------------+
          | Optional semantic   |
          | (embed all,         |
          |  re-rank, dedup,    |
          |  cluster)           |
          +------+--------------+
                 |
                 v
          +------+--------------+
          | Output mode:        |
          | standard records    |
          |   OR                |
          | llm-pair records    |
          +------+--------------+
                 |
                 v
          +------+----------------+
          | pushData → PPE charge |
          | (after each push)     |
          +------+----------------+
                 |
                 v
          +------+----------+
          | Save SUMMARY    |
          | + KV state      |
          +-----------------+

The HTTP client uses AbortSignal.timeout(30s) plus exponential-backoff retries on 502/503/504 and network errors. It honors the API's backoff directive. On 429, it stops cleanly and writes a partial-result summary instead of crashing. PPE is charged after each successful pushData, so a pushData failure never bills you.

FAQ

Do I need a StackExchange API key? No. Anonymous tier gives 300 requests/day per IP. No registration required.

Do I need an OpenAI API key? Only for semantic features (semanticQuery, semanticDedup, semanticClustering). Everything else works without one.

Can I search any of the 170+ StackExchange sites? Yes — 30 are in the dropdown, anything else goes in customSite (full list at https://api.stackexchange.com/docs/sites).

Which preset should I pick?

Building an LLM dataset → ai-training
Watching a tag for new questions → monitoring + incrementalKey
Finding content opportunities → seo-content or research
Just want to search → standard (default)

What's the difference between accepted, top, and hybrid answer modes?

accepted — only the answer the asker checked off. Cheap (1 API call per 100 answers). Misses cases where the community vote disagrees.
top — fetches all answers, returns the highest-scoring. Most useful answer; ignores acceptance.
hybrid — prefers accepted, BUT if the top-voted answer outranks it by 5+ score, returns the top one with outranksAcceptedAnswer: true. Best for AI training.
none — skip answer fetching entirely (cheapest).

How is the qualityScore computed? A weighted blend (45% score / 30% accepted-answer presence / 25% view depth), all log-normalized to 0–1. Documented in the dataset schema.

How does incremental mode know what's "new"? On every run with incremental: true (or preset: 'monitoring'), the actor saves the question IDs returned to the run's KV store under your incrementalKey. Next run, anything in that set is dropped from the output. State is FIFO-bounded at 5000 IDs.

Can I get the full text of a question and its accepted answer? Yes — includeBody: true (no extra API cost) for the question, includeAcceptedAnswer: true for the answer. Or use answersMode: 'hybrid' to get the best answer regardless of acceptance.

How do I filter by date range? Set fromDate and/or toDate to ISO 8601 dates (YYYY-MM-DD).

How do I exclude a tag? Use notTagged with semicolons. Example: tagged: "javascript", notTagged: "jquery;legacy".

What happens if I exceed the daily API quota? The actor detects 429 / quota-zero, stops cleanly, writes a partial-result summary, and adds a failureType: 'rate-limited' record. Quota resets at midnight UTC.

How is PPE pricing calculated? $0.005 per question returned (intelligence layer + decision record + alerts + tracker auto-ticketing all included). Alert and decision-summary records are free supplements — only the per-question records charge. Plus a one-time $0.00005 actor-start event. Charged AFTER each pushData succeeds — if the push fails you're not billed. Respects your spending limit.

Why does the dataset have a recordType field? So you can WHERE recordType = 'question' (or 'llm-pair') in SQL, Sheets, or any downstream tool to drop error rows.

Where's the run summary? KV store, key SUMMARY. Dataset stays uniform.

Is this production-ready? Yes. Outer try/catch with structured error records, AbortSignal.timeout(30s), exponential-backoff retries, 429 handling, API backoff directive honored, status messages, failure-webhook integration, KV-store summary, dataset schema validated.

Responsible use

Respect the 300 requests/day free tier. Don't schedule more frequently than necessary.
StackExchange content is licensed under CC BY-SA 4.0. The link and ownerName / ownerUrl fields make attribution trivial — metadata.attributionUrl in llm-pair records is the canonical source URL to cite.
Don't impersonate StackExchange users or misattribute content.
Don't bulk-repost content on competing platforms.
For AI training datasets, your downstream use must be CC BY-SA 4.0 compliant (attribution + share-alike) and consistent with StackExchange's Terms of Service.
The actor calls public APIs only.

Actor	Description	Link
GitHub Repository Search	Search GitHub repos by topic, language, stars, keyword	View
Hacker News Search	Search and monitor Hacker News stories	View
DBLP Publication Search	Search computer-science publications	View
OpenAlex Research Search	Search 250M+ academic works	View
Semantic Scholar Search	Academic papers with citation data	View
arXiv Paper Search	Preprint papers across all sciences	View
Wikipedia Article Search	Wikipedia article search and extraction	View

StackOverflow Scraper - Questions, Answers & Tags

makework36/stackoverflow-scraper

Scrape StackOverflow and any StackExchange site. Fetch hot questions, search by query, or filter by tags. Pure HTTP via the official StackExchange API v2.3.

deusex machine

Stack Overflow Scraper

pear_fight/stackoverflow-scraper

Scrape questions, answers, tags from Stack Overflow

Harald

Stack Overflow Scraper — Question & Answer Data Extractor

klondikeking/stack-overflow-scraper

Extract Stack Overflow questions, answers, comments, and user profiles via the official Stack Exchange API. No scraping needed — fast, reliable, and cost-effective.

Pierrick McD0nald

Stack Overflow

jupri/stackexchange

cat

Stack Overflow Scraper API - Search Questions, Answers & Trends

fresh_cliff/stackoverflow-api-scraper

Extract Stack Overflow questions, answers, tags, votes, users, and comments via the Stack Exchange API. Fast JSON export, pagination, filters, date ranges, and keyword search. Ideal for analytics, AI training, and monitoring trends in developer Q&A data.

Brennan Crawford

Stack Overflow Scraper

cloud9_ai/stackoverflow-scraper

Scrape Stack Overflow questions, answers, and tags via Stack Exchange API. Search by keyword or tag, get accepted answers, vote counts, and view statistics.

cloud9

Stack Overflow Scraper: Questions, Answers, Users & Tags

perconey/stackoverflow-scraper

Scrape any Stack Exchange site (stackoverflow, superuser, askubuntu, math.stackexchange and 170+ more) via the official Stack Exchange API. Questions, answers with full body, user profiles with reputation and badges, top tags, search. No auth, no proxies, no cookies. Pay only per result item.

Perconey

Stack Overflow & Stack Exchange Search (Pythia)

apricot_blackberry/pythia-stackoverflow

Search Stack Overflow or any Stack Exchange site by keyword or tag. Returns up to 50 questions with score, view count, answer count, and tags per query.

Creator Fusion

Stack Overflow Q&A Scraper

sheshinmcfly/stackoverflow-scraper

Extract questions and answers from Stack Overflow via the official Stack Exchange API. Filter by tags, keywords, or top voted. Returns question body, accepted answer, top answers, vote counts, and tags. Perfect for AI training data, RAG pipelines, and knowledge bases.

Sheshinmcfly

Stack Overflow Scraper - Questions, Answers & Comments

legend006/stackoverflow-scraper

Scrape questions, answers, and comments from Stack Overflow and the Stack Exchange network. Filter by tag, search, or user. Returns body, score, votes, accepted-answer flag. Built for AI/LLM training datasets, dev research, and tag-trend monitoring.

NIJ KANANI

Stack Overflow & StackExchange Search

Stack Overflow & StackExchange Intelligence Engine

What this is (AI-friendly summary)

What problems this solves

How it works (method anchors)

When to use this actor

In simple terms

System overview (anchored chunk for retrieval)

For AI agents and automation systems

What tool turns Stack Overflow questions into Jira tickets?

How do I detect documentation gaps from Stack Overflow?

What's a better alternative to the StackExchange API?

How do I create an LLM training dataset from Stack Overflow?

What is a Stack Overflow analytics tool?

How can an AI agent decide whether to act on developer feedback?

Nine workflow presets — pick one and go

Task-based presets

Persona-based presets

Why this is better than using the StackExchange API directly

Key features

Quick start

Input parameters

Core search

Enrichment

Intelligence

Output

Incremental / monitoring

Semantic (OpenAI embeddings)

Output

Standard question record (typical)

Fully enriched (intelligence + hybrid answers + tag metadata + clusters + semantic)

Count-only record (when outputMode: 'count-only')

LLM-dataset record (when outputMode: 'llm-dataset')

Output fields reference

intelligence field detail

Run summary (in KV store, not dataset)

Failure types (on recordType: 'error' records)

The decision layer — read one record, do one thing

Record types in the dataset

Alert engine

Trend engine

Multi-tag problem clusters

Why is this happening? — Root-cause hypotheses

Where is this in its lifecycle?

GitHub release correlation — promote hypotheses from speculative to evidence-backed

Causal inference model

Temporal analysis (release → spike lag)

Impact score (severity + audience size)

Execution layer — turn insights into Jira / Linear / GitHub tickets

Cluster filters — route to the right team

Auto-create tickets in Jira / Linear / GitHub Issues

Resolution feedback (closed-loop validation)

Pattern calibration (the learning system)

Trust summary (non-technical)

Decision gate — shouldAct boolean

Cluster category — opportunity / risk / hybrid / noise

Contradictions — when not to trust the signal

Systemic issues — patterns across clusters

Time-to-resolution opportunity

Typed actions — content / product / docs / devrel

Signal strength — is this run trustworthy?

Intelligence methodology (transparent scoring)

How incremental mode works (first run, second run, third run)

Use cases

API & integrations

Python — AI training dataset with semantic dedup

JavaScript — daily monitoring with change detection

cURL — semantic re-rank

Schedules, webhooks, downstream apps

Use in Dify

Performance & cost

What this actor does NOT do

How it works

FAQ

Responsible use

Related actors

You might also like

StackOverflow Scraper - Questions, Answers & Tags

Stack Overflow Scraper

Stack Overflow Scraper — Question & Answer Data Extractor

Count-only record (when `outputMode: 'count-only'`)

LLM-dataset record (when `outputMode: 'llm-dataset'`)

`intelligence` field detail

Failure types (on `recordType: 'error'` records)

Decision gate — `shouldAct` boolean