Pricing

Pay per event

Stack Overflow Scraper — Stack Exchange Questions

Search and scrape questions across Stack Overflow and every Stack Exchange site by tag, search query, or user. Each row captures title, body, tags, score, views, answer count, accepted answer, asker, and timestamps, ready to export as a JSON or CSV dataset. Built on the Stack Exchange v2.3 API.

Developer

DevilScrapes

Last modified: 2 months ago

🎯 What this scrapes

The Stack Exchange API (api.stackexchange.com/2.3) covers every site in the network: Stack Overflow, Server Fault, Super User, Cross Validated, plus 170+ topic communities. This Actor wraps the questions endpoint, paginates while honouring the API's backoff field, rotates API quota keys, and writes one clean row per question with body, tags, and key metadata. No quota overruns, no partial failures swept under the rug.
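A minimal sketch of backoff-aware pagination, assuming the documented Stack Exchange response wrapper fields (`items`, `has_more`, `backoff`). The names `collect_questions` and `fetch_page` are hypothetical stand-ins for illustration, not the Actor's actual code.

```python
import time
from typing import Callable, Iterator


def collect_questions(
    fetch_page: Callable[[int], dict],
    max_results: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield up to max_results questions, pausing whenever the API asks for it.

    fetch_page(page) is a hypothetical callable that returns the parsed JSON
    wrapper for one page: {"items": [...], "has_more": bool, "backoff": int}.
    """
    yielded = 0
    page = 1
    while yielded < max_results:
        wrapper = fetch_page(page)
        for item in wrapper.get("items", []):
            if yielded >= max_results:
                break
            yield item
            yielded += 1
        # Stop when the API has no more pages or we already have enough rows.
        if not wrapper.get("has_more") or yielded >= max_results:
            break
        # "backoff" is the number of seconds the API asks callers to wait
        # before hitting the same method again.
        backoff = wrapper.get("backoff")
        if backoff:
            sleep(backoff)
        page += 1
```

Passing `sleep` as a parameter keeps the loop easy to test: a fake `fetch_page` and a list's `append` method are enough to check that the wait happens only between pages.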

Stack Overflow is the world's largest developer Q&A corpus. With the SO data dump paused, this is the freshest pipeline you can run without getting tangled in the CC BY-SA attribution maze yourself. Every row carries link, owner_display_name, posted_at, and a stable question_id, so downstream attribution stays traceable to the original post.

🔥 Features

We absorb every failure mode that would otherwise block your pipeline:

🛡️ Browser fingerprint rotation — curl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not a Python script.
🌐 Residential proxy rotation via Apify Proxy — fresh session and exit IP on every block, so your quota never drains from a single flagged address.
🔁 Retries with exponential backoff on 408 / 429 / 5xx — up to 5 attempts per page, Retry-After honoured, no silent empty results.
🧱 Rate-limit-aware pacing — when Stack Exchange pushes back, we slow down instead of burning your daily quota.
🧊 Clean, typed dataset rows — Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
💰 Pay-Per-Event pricing — you only pay for results that hit your dataset. No data, no charge (beyond the tiny warm-up fee).
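The retry policy in the list above (408 / 429 / 5xx, up to 5 attempts, Retry-After honoured) can be sketched roughly as follows. The base delay, the cap, and the 10% jitter are assumptions chosen for illustration, and only the seconds form of Retry-After is handled here.

```python
import random
from typing import Optional

RETRYABLE_STATUSES = {408, 429} | set(range(500, 600))
MAX_ATTEMPTS = 5  # "up to 5 attempts per page"


def should_retry(status: int, attempt: int) -> bool:
    """attempt is 1-based; give up once MAX_ATTEMPTS tries have been made."""
    return status in RETRYABLE_STATUSES and attempt < MAX_ATTEMPTS


def retry_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,   # assumed base delay in seconds
    cap: float = 60.0,   # assumed upper bound in seconds
) -> float:
    """Seconds to wait before the next try."""
    # A numeric Retry-After header wins over the computed delay.
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after)
    # Exponential backoff: base, 2*base, 4*base, ... up to the cap.
    delay = min(cap, base * 2 ** (attempt - 1))
    # A little jitter avoids many clients retrying in lockstep.
    return delay + random.uniform(0, delay * 0.1)
```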

💡 Use cases

RAG corpus pipeline — feed Q&A bodies from your product's tag into a LangChain / LlamaIndex vector store for a domain copilot.
DevRel feedback signal — daily diffs on unanswered questions in your tag to surface gaps before users churn.
Competitor-tag intelligence — compare question volume and score trends across react vs vue vs angular over time.
Help-center seed — pull the top-200 voted questions per tag to pre-populate an internal knowledge base.
Recruiter outreach — extract active askers from a niche tag, score by reputation (via the user endpoint).
AI training dataset — build a deduped, fresh alternative to the paused SO data dump; each row includes attribution fields required by CC BY-SA 4.0.
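As one illustration of the DevRel use case, a daily diff can be computed from two exported datasets. This is a sketch that assumes each snapshot is a list of rows in the output format documented on this page; the function name is hypothetical.

```python
def new_unanswered(yesterday: list, today: list) -> list:
    """Return rows that are unanswered today and were absent yesterday."""
    # Key on (site, question_id) so rows from different sites never collide.
    seen = {(row["site"], row["question_id"]) for row in yesterday}
    return [
        row
        for row in today
        if not row["is_answered"]
        and (row["site"], row["question_id"]) not in seen
    ]
```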

⚙️ How to use it

Click Try for free at the top of the page.
Fill in the input form — most fields have sensible defaults (site: stackoverflow, mode: tagged, tag: python).
Click Start. Results stream into the run's dataset in real time.
Export from Storage → Dataset as JSON, CSV, or Excel — or pull rows via the Apify API into your own pipeline.
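For the API route, a rough sketch using the `apify-client` Python package could look like this. The `actor_id` value is a placeholder you must replace with this Actor's real ID from the Console, and `build_run_input` and `fetch_rows` are hypothetical helper names.

```python
from typing import Optional


def build_run_input(
    tags,
    max_results: int = 100,
    site: str = "stackoverflow",
    sort_by: str = "votes",
    api_key: Optional[str] = None,
) -> dict:
    """Assemble an input object using the field names from the Input table."""
    run_input = {
        "site": site,
        "mode": "tagged",
        "tags": list(tags),
        "sortBy": sort_by,
        "maxResults": max_results,
        "includeBody": True,
    }
    if api_key:
        run_input["apiKey"] = api_key
    return run_input


def fetch_rows(token: str, actor_id: str, run_input: dict) -> list:
    """Run the Actor and return its dataset rows as a list of dicts."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    # call() starts the run and waits for it to finish.
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (placeholder values, not real identifiers):
# rows = fetch_rows("<APIFY_TOKEN>", "<username>/<actor-name>", build_run_input(["python"]))
```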

For large jobs (>10 000 questions), add your own Stack Exchange API key in the apiKey field to lift the daily quota from 300 to 10 000 requests. The key is free and takes about 30 seconds to register at stackapps.com.

📥 Input

Field	Type	Required	Default	Notes
`site`	`string`	no	`stackoverflow`	Site slug, e.g. `stackoverflow`, `superuser`, `serverfault`, `askubuntu`, `stats` (Cross Validated), `math`.
`mode`	`string`	no	`tagged`	How to find questions: `tagged`, `search`, or `user_questions`.
`tags`	`array`	no	`["python"]`	Tags to filter by (mode=tagged). Multiple tags = OR.
`searchQuery`	`string`	no	—	Free-text search query (mode=search).
`userId`	`integer`	no	—	Numeric Stack Exchange user id (mode=user_questions).
`sortBy`	`string`	no	`activity`	Sort order: `activity`, `votes`, `creation`, or `hot`.
`maxResults`	`integer`	no	`30`	Max questions across pages. API caps page size at 100; we paginate automatically.
`includeBody`	`boolean`	no	`true`	Request `filter=withbody` to include the full question body HTML. Slightly larger payload.
`apiKey`	`string`	no	—	Stack Exchange API key from stackapps.com — lifts daily quota from 300 to 10 000 requests.
`proxyConfiguration`	`object`	no	`{"useApifyProxy": false}`	Optional and off by default. When enabled, the Actor rotates proxies automatically when blocks occur.

Example input

{
  "site": "stackoverflow",
  "mode": "tagged",
  "tags": ["python"],
  "sortBy": "votes",
  "maxResults": 100,
  "includeBody": true,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
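To make the mode-specific fields concrete, here is a small sketch that applies the documented defaults and checks that the field matching `mode` is present. Whether the Actor itself rejects such inputs is an assumption, and `normalise_input` is a hypothetical name.

```python
DEFAULTS = {
    "site": "stackoverflow",
    "mode": "tagged",
    "tags": ["python"],
    "sortBy": "activity",
    "maxResults": 30,
    "includeBody": True,
}
# Which input field each mode relies on, per the Input table.
FIELD_FOR_MODE = {
    "tagged": "tags",
    "search": "searchQuery",
    "user_questions": "userId",
}
SORT_ORDERS = {"activity", "votes", "creation", "hot"}


def normalise_input(raw: dict) -> dict:
    """Merge defaults into raw input and validate it (illustrative only)."""
    merged = {**DEFAULTS, **raw}
    merged["tags"] = list(merged["tags"])  # avoid sharing the default list
    mode = merged["mode"]
    if mode not in FIELD_FOR_MODE:
        raise ValueError(f"unknown mode: {mode!r}")
    if merged["sortBy"] not in SORT_ORDERS:
        raise ValueError(f"unknown sortBy: {merged['sortBy']!r}")
    needed = FIELD_FOR_MODE[mode]
    if not merged.get(needed):
        raise ValueError(f"mode={mode} needs a value for {needed!r}")
    if merged["maxResults"] < 1:
        raise ValueError("maxResults must be at least 1")
    return merged
```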

📤 Output

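Every run writes one dataset item per question. posted_at and scraped_at are ISO-8601 UTC strings, creation_date and last_activity_date keep the API's raw Unix timestamps, and all IDs are stable integers. Rows are Pydantic-validated before they land, so required fields never arrive as surprise nulls.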

Field	Type	Notes
`question_id`	`integer`	Stack Exchange question id (stable, but unique only within a site; pair it with `site` for a network-wide key).
`site`	`string`	Site slug the question came from.
`title`	`string`	Question title.
`body_html`	`string \| null`	Question body in HTML (when `includeBody=true`).
`tags`	`array`	Tags applied to the question.
`score`	`integer`	Net score (upvotes minus downvotes).
`view_count`	`integer`	Question views.
`answer_count`	`integer`	Number of answers.
`is_answered`	`boolean`	Has an accepted answer or any positive-score answer.
`accepted_answer_id`	`integer \| null`	Accepted answer id, when present.
`link`	`string`	Canonical question URL.
`owner_user_id`	`integer \| null`	Asker user id (null for deleted accounts).
`owner_display_name`	`string \| null`	Asker display name.
`creation_date`	`integer`	Unix timestamp — created at.
`last_activity_date`	`integer`	Unix timestamp — last activity.
`posted_at`	`string`	ISO-8601 UTC derived from `creation_date`.
`scraped_at`	`string`	ISO-8601 UTC time when this row was recorded by the Actor.

Example output

{
  "question_id": 1234567,
  "site": "stackoverflow",
  "title": "How do I close a connection cleanly in asyncio?",
  "body_html": "<p>I'm trying to gracefully shut down an asyncio server…</p>",
  "tags": ["python", "asyncio"],
  "score": 142,
  "view_count": 48300,
  "answer_count": 3,
  "is_answered": true,
  "accepted_answer_id": 1234570,
  "link": "https://stackoverflow.com/questions/1234567/...",
  "owner_user_id": 987654,
  "owner_display_name": "asyncio_dev",
  "creation_date": 1609459200,
  "last_activity_date": 1712345678,
  "posted_at": "2021-01-01T00:00:00Z",
  "scraped_at": "2026-06-01T10:00:00Z"
}
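The relationship between `creation_date` and `posted_at` in the example can be reproduced with the standard library. This is a sketch of the conversion, not necessarily how the Actor computes it; `to_iso_utc` is a hypothetical name.

```python
from datetime import datetime, timezone


def to_iso_utc(unix_seconds: int) -> str:
    """Convert a Unix timestamp in seconds to an ISO-8601 UTC string."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_iso_utc(1609459200))  # prints 2021-01-01T00:00:00Z
```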

Attribution note: Stack Exchange content is licensed under CC BY-SA 4.0. Each row includes link and owner_display_name — the fields required for proper attribution when you redistribute or display the content.

💰 Pricing

Pay-Per-Event — you pay only when these events fire:

Event	USD	What it is
`actor-start`	$0.005	One-off warm-up charge per run
`result`	$0.0015	Per question written to the dataset

1 000 questions ≈ $1.50. No subscription, no minimum spend, no credit card to try — Apify gives every new account $5 of free credit.
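The arithmetic behind that figure, as a small sketch using the two event prices from the table. `Decimal` avoids floating-point rounding on fractions of a cent; `estimate_cost` is a hypothetical helper name.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.005")  # charged once per run
RESULT_USD = Decimal("0.0015")      # charged per question written


def estimate_cost(questions: int, runs: int = 1) -> Decimal:
    """Estimated charge in USD for a number of questions spread over runs."""
    return ACTOR_START_USD * runs + RESULT_USD * questions


print(estimate_cost(1000))  # 1.50 for results plus 0.005 for the start event
```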

🚧 Limitations

Question bodies only — comments, voting graphs, revision history, and answer bodies are not in scope for this Actor.
Search ranking is Stack Exchange's own, which differs from the site's visible UI sort in subtle ways.
Daily quota — without an API key you get 300 requests/day (each page = 1 request, 100 questions/page). With a free key from stackapps.com you get 10 000 requests/day. For very large jobs, plan your daily budget accordingly.
Deleted users — some owner_user_id and owner_display_name fields will be null where the asker's account was removed.
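To plan a daily budget, the request count follows directly from the page size. The sketch below is a best-case estimate: it assumes full 100-question pages and no quota spent on retries or shared with other traffic from the same IP.

```python
import math

PAGE_SIZE = 100          # questions per request
KEYLESS_QUOTA = 300      # requests per day without an API key
KEYED_QUOTA = 10_000     # requests per day with an API key


def requests_needed(questions: int) -> int:
    """Number of API pages required to fetch this many questions."""
    return math.ceil(questions / PAGE_SIZE)


def days_needed(questions: int, has_key: bool) -> int:
    """Minimum number of daily quota windows the job spans."""
    quota = KEYED_QUOTA if has_key else KEYLESS_QUOTA
    return math.ceil(requests_needed(questions) / quota)
```

For example, 3 000 000 questions need 30 000 requests, which is 100 days without a key and 3 days with one.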

❓ FAQ

Why use this instead of the Stack Overflow data dump?

The official SO data dump is paused (as of mid-2025) and the last release is years stale. Even when active it was quarterly, unfiltered by tag, and shipped as multi-GB XML — you had to ETL it yourself. This Actor gives you fresh questions filtered by tag or query, clean JSON rows, and attribution fields ready to use, in minutes rather than days.

Why is the API quota so low without a key?

Stack Exchange caps unauthenticated usage at 300 requests/day per IP. Add a free API key from stackapps.com and you get 10 000 requests/day — enough for most jobs. For bulk corpus pulls, schedule multiple smaller runs across days.

Can I get answers too?

Not in this Actor — answers are a separate endpoint with different pagination and field shapes. A sibling stackexchange-answers-scraper Actor is planned. For now, the accepted answer id is included so you can cross-reference.
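Until then, one way to cross-reference is to call the Stack Exchange `/answers/{ids}` method yourself with the accepted answer ids from the dataset. The sketch below only builds the request URL; the helper name is hypothetical, and the 100-id limit per call reflects the API's documented batch size.

```python
from typing import Optional
from urllib.parse import urlencode

API_ROOT = "https://api.stackexchange.com/2.3"


def accepted_answers_url(rows: list, site: str, api_key: Optional[str] = None) -> str:
    """Build an /answers/{ids} URL for the accepted answers found in rows."""
    ids = sorted(
        {
            row["accepted_answer_id"]
            for row in rows
            if row.get("accepted_answer_id") is not None
        }
    )
    if not ids:
        raise ValueError("no accepted answers in these rows")
    if len(ids) > 100:
        raise ValueError("split the ids into batches of at most 100")
    params = {"site": site, "filter": "withbody"}
    if api_key:
        params["key"] = api_key
    # Ids are semicolon-delimited in the URL path.
    id_path = ";".join(str(answer_id) for answer_id in ids)
    return f"{API_ROOT}/answers/{id_path}?{urlencode(params)}"
```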

Do I have to worry about CC BY-SA attribution?

Yes — Stack Exchange content is licensed CC BY-SA 4.0. Your downstream use must attribute the source. Each row includes link (the canonical question URL) and owner_display_name, which are the minimum fields required. Do not redistribute the dataset commercially without including those attribution fields.
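A sketch of turning a dataset row into a credit line is shown below. The wording is an illustrative assumption rather than an official or legally reviewed template, and `attribution_line` is a hypothetical name; it falls back to a generic label when the asker's account was deleted.

```python
def attribution_line(row: dict) -> str:
    """Format a human-readable credit line from one dataset row."""
    # owner_display_name can be null for deleted accounts.
    author = row.get("owner_display_name") or "a deleted user"
    return (
        f'"{row["title"]}" by {author}, {row["link"]} '
        "(Stack Exchange, CC BY-SA 4.0)"
    )
```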

What about voting or posting?

We do not write to Stack Exchange. This Actor uses read-only API access.

Why are some user fields null?

Some questions were asked by accounts that have since been deleted. The Stack Exchange API returns null for those owner fields — we pass it through as-is.

💬 Your feedback

Spotted a bug, hit a quota edge case, or need a new field? Open an issue on the Actor's Issues tab on Apify Console — we ship fixes weekly and we read every report.
