Stack Overflow Scraper — Stack Exchange Questions

Search and scrape questions across Stack Overflow and every Stack Exchange site by tag, search query, or user. Each row captures title, body, tags, score, views, answers, accepted answer, asker, and timestamps, ready to export as a JSON or CSV dataset. Built on the Stack Exchange v2.3 API.

Developer: DevilScrapes (maintained by Community)

🎯 What this scrapes

The Stack Exchange API (api.stackexchange.com/2.3) covers every site in the network: Stack Overflow, Server Fault, Super User, Cross Validated, plus 170+ topic communities. This Actor wraps the questions endpoint, paginates while honouring the API's backoff field, rotates API quota keys, and writes one clean row per question with body, tags, and key metadata. It is built to stay within quota and to report failures rather than silently return partial results.
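The pagination behaviour described above can be sketched roughly as follows. The Stack Exchange API wraps each response in an envelope with items, has_more, and an optional backoff field (the number of seconds to wait before calling the same method again). The names paginate and fetch_page are hypothetical and chosen for illustration; this is not the Actor's actual code.

```python
import time
from typing import Callable, Iterator


def paginate(
    fetch_page: Callable[[int], dict],
    max_results: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield up to max_results questions, honouring the API's backoff field.

    fetch_page is a hypothetical callable: it takes a 1-based page number
    and returns the parsed JSON envelope for that page.
    """
    page, yielded = 1, 0
    while yielded < max_results:
        envelope = fetch_page(page)
        for item in envelope.get("items", []):
            if yielded >= max_results:
                return
            yield item
            yielded += 1
        # Stop when we have enough rows or the API reports no further pages.
        if yielded >= max_results or not envelope.get("has_more"):
            return
        # backoff is only present when the API asks the client to wait.
        wait = envelope.get("backoff")
        if wait:
            sleep(wait)
        page += 1
```

Injecting sleep as a parameter keeps the sketch testable without real delays.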

Stack Overflow is the world's largest developer Q&A corpus. With the Stack Overflow data dump paused, this Actor is a way to pull fresh questions without working through the CC BY-SA attribution details yourself: every row includes posted_at and a stable question_id, so downstream attribution can always point back to the exact question.

🔥 Features

We absorb every failure mode that would otherwise block your pipeline:

  • 🛡️ Browser fingerprint rotation: curl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not a Python script.
  • 🌐 Residential proxy rotation via Apify Proxy — fresh session and exit IP on every block, so your quota never drains from a single flagged address.
  • 🔁 Retries with exponential backoff on 408 / 429 / 5xx — up to 5 attempts per page, Retry-After honoured, no silent empty results.
  • 🧱 Rate-limit-aware pacing — when Stack Exchange pushes back, we slow down instead of burning your daily quota.
  • 🧊 Clean, typed dataset rows — Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
  • 💰 Pay-Per-Event pricing — you only pay for results that hit your dataset. No data, no charge (beyond the tiny warm-up fee).
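As an illustration of the retry behaviour listed above (408 / 429 / 5xx, up to 5 attempts, Retry-After honoured), here is a minimal sketch. The send callable, the 1-second base delay, and the doubling schedule are assumptions made for the example; the Actor's real implementation may differ.

```python
import time
from typing import Callable, Optional, Tuple

RETRYABLE = {408, 429}


def is_retryable(status: int) -> bool:
    """True for the statuses the feature list names: 408, 429, and any 5xx."""
    return status in RETRYABLE or 500 <= status <= 599


def fetch_with_retries(
    send: Callable[[], Tuple[int, Optional[float], bytes]],
    max_attempts: int = 5,
    base_delay: float = 1.0,  # assumed value, not taken from the Actor
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call send() until it succeeds or max_attempts is exhausted.

    send is hypothetical: it performs one request and returns
    (status, retry_after_seconds_or_None, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, retry_after, body = send()
        if not is_retryable(status):
            if status >= 400:
                raise RuntimeError(f"non-retryable HTTP {status}")
            return body
        if attempt == max_attempts:
            break
        # Prefer the server's Retry-After; otherwise double the delay each time.
        delay = retry_after if retry_after is not None else base_delay * 2 ** (attempt - 1)
        sleep(delay)
    raise RuntimeError(f"gave up after {max_attempts} attempts (last status {status})")
```

Raising after the final attempt, rather than returning an empty body, is what keeps a failed page from looking like an empty result.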

💡 Use cases

  • RAG corpus pipeline — feed Q&A bodies from your product's tag into a LangChain / LlamaIndex vector store for a domain copilot.
  • DevRel feedback signal — daily diffs on unanswered questions in your tag to surface gaps before users churn.
  • Competitor-tag intelligence — compare question volume and score trends across react vs vue vs angular over time.
  • Help-center seed — pull the top-200 voted questions per tag to pre-populate an internal knowledge base.
  • Recruiter outreach — extract active askers from a niche tag, score by reputation (via the user endpoint).
  • AI training dataset — build a deduped, fresh alternative to the paused SO data dump; each row includes attribution fields required by CC BY-SA 4.0.
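To make the DevRel use case concrete, here is a small sketch of a daily diff over two exported datasets. It assumes each export is a list of rows shaped like the output documented on this page; the function name new_unanswered is hypothetical.

```python
from typing import Iterable, List


def new_unanswered(previous_rows: Iterable[dict], current_rows: Iterable[dict]) -> List[dict]:
    """Return unanswered questions in current_rows that were not already
    unanswered in previous_rows.

    Questions are keyed by (site, question_id) so that ids from
    different sites are never confused with each other.
    """
    seen = {
        (row["site"], row["question_id"])
        for row in previous_rows
        if not row["is_answered"]
    }
    return [
        row
        for row in current_rows
        if not row["is_answered"] and (row["site"], row["question_id"]) not in seen
    ]
```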

⚙️ How to use it

  1. Click Try for free at the top of the page.
  2. Fill in the input form — most fields have sensible defaults (site: stackoverflow, mode: tagged, tag: python).
  3. Click Start. Results stream into the run's dataset in real time.
  4. Export from Storage → Dataset as JSON, CSV, or Excel — or pull rows via the Apify API into your own pipeline.
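For step 4, rows can be fetched from the Apify API's dataset items endpoint. The sketch below only builds the request URL; the helper name dataset_items_url and the short list of formats it accepts are illustrative choices, and the dataset id is whatever your run produced.

```python
from typing import Optional
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
# Subset of export formats named on this page (illustrative, not exhaustive).
EXPORT_FORMATS = {"json", "csv", "xlsx"}


def dataset_items_url(dataset_id: str, fmt: str = "json", limit: Optional[int] = None) -> str:
    """Build the URL for downloading a run's dataset items."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    params = {"format": fmt}
    if limit is not None:
        params["limit"] = limit
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"
```

When calling the URL from your own pipeline, send your Apify API token in an Authorization: Bearer header rather than embedding it in the URL.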

For large jobs (>10 000 questions): add your own Stack Exchange API key in the apiKey field to lift the daily quota from 300 to 10 000 requests. Free key, takes 30 seconds at stackapps.com.

📥 Input

Field               Type     Required  Default                    Notes
site                string   no        stackoverflow              Site slug, e.g. stackoverflow, superuser, serverfault, askubuntu, stats (Cross Validated), math.
mode                string   no        tagged                     How to find questions: tagged, search, or user_questions.
tags                array    no        ["python"]                 Tags to filter by (mode=tagged). Multiple tags = OR.
searchQuery         string   no        (none)                     Free-text search query (mode=search).
userId              integer  no        (none)                     Numeric Stack Exchange user id (mode=user_questions).
sortBy              string   no        activity                   Sort order: activity, votes, creation, or hot.
maxResults          integer  no        30                         Max questions across pages. API caps page size at 100; we paginate automatically.
includeBody         boolean  no        true                       Request filter=withbody to include the full question body HTML. Slightly larger payload.
apiKey              string   no        (none)                     Stack Exchange API key from stackapps.com; lifts daily quota from 300 to 10 000 requests.
proxyConfiguration  object   no        {"useApifyProxy": false}   Optional. We rotate proxies automatically when blocks occur.

Example input

{
  "site": "stackoverflow",
  "mode": "tagged",
  "tags": ["python"],
  "sortBy": "votes",
  "maxResults": 100,
  "includeBody": true,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
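Before submitting an input like the one above, it can help to check it locally. The sketch below applies the defaults from the input table and verifies that the field matching the chosen mode is present. The function validate_input is hypothetical, and whether the Actor itself rejects such inputs in the same way is an assumption.

```python
# Defaults copied from the input table on this page.
DEFAULTS = {
    "site": "stackoverflow",
    "mode": "tagged",
    "tags": ["python"],
    "sortBy": "activity",
    "maxResults": 30,
    "includeBody": True,
}
# Which field each mode relies on, per the Notes column.
REQUIRED_BY_MODE = {"tagged": "tags", "search": "searchQuery", "user_questions": "userId"}
SORT_ORDERS = {"activity", "votes", "creation", "hot"}


def validate_input(actor_input: dict) -> dict:
    """Merge documented defaults into actor_input and run basic checks."""
    merged = {**DEFAULTS, **actor_input}
    mode = merged["mode"]
    if mode not in REQUIRED_BY_MODE:
        raise ValueError(f"unknown mode: {mode}")
    field = REQUIRED_BY_MODE[mode]
    if not merged.get(field):
        raise ValueError(f"mode={mode} needs a value for {field}")
    if merged["sortBy"] not in SORT_ORDERS:
        raise ValueError(f"unknown sortBy: {merged['sortBy']}")
    return merged
```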

📤 Output

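Every run writes one dataset item per question. posted_at and scraped_at are ISO-8601 UTC strings, while creation_date and last_activity_date keep the API's raw Unix timestamps; all IDs are stable integers. Rows are Pydantic-validated before they land, so required fields are never unexpectedly null.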

Field               Type            Notes
question_id         integer         Stack Exchange question id (stable; unique within a site, so pair it with site across the network).
site                string          Site slug the question came from.
title               string          Question title.
body_html           string | null   Question body in HTML (when includeBody=true).
tags                array           Tags applied to the question.
score               integer         Net score (upvotes minus downvotes).
view_count          integer         Question views.
answer_count        integer         Number of answers.
is_answered         boolean         Has an accepted answer or any positive-score answer.
accepted_answer_id  integer | null  Accepted answer id, when present.
link                string          Canonical question URL.
owner_user_id       integer | null  Asker user id (null for deleted accounts).
owner_display_name  string | null   Asker display name.
creation_date       integer         Unix timestamp: created at.
last_activity_date  integer         Unix timestamp: last activity.
posted_at           string          ISO-8601 UTC derived from creation_date.
scraped_at          string          When this row was recorded by the Actor.

Example output

{
  "question_id": 1234567,
  "site": "stackoverflow",
  "title": "How do I close a connection cleanly in asyncio?",
  "body_html": "<p>I'm trying to gracefully shut down an asyncio server…</p>",
  "tags": ["python", "asyncio"],
  "score": 142,
  "view_count": 48300,
  "answer_count": 3,
  "is_answered": true,
  "accepted_answer_id": 1234570,
  "link": "https://stackoverflow.com/questions/1234567/...",
  "owner_user_id": 987654,
  "owner_display_name": "asyncio_dev",
  "creation_date": 1609459200,
  "last_activity_date": 1712345678,
  "posted_at": "2021-01-01T00:00:00Z",
  "scraped_at": "2026-06-01T10:00:00Z"
}
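The relationship between creation_date and posted_at in the example row can be reproduced with the standard library. This is a sketch of the conversion, with to_iso_utc as a hypothetical helper name, not necessarily how the Actor derives the field.

```python
from datetime import datetime, timezone


def to_iso_utc(epoch_seconds: int) -> str:
    """Convert a Unix timestamp in seconds to an ISO-8601 UTC string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_iso_utc(1609459200))  # 2021-01-01T00:00:00Z
```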

Attribution note: Stack Exchange content is licensed under CC BY-SA 4.0. Each row includes link and owner_display_name — the fields required for proper attribution when you redistribute or display the content.

💰 Pricing

Pay-Per-Event — you pay only when these events fire:

Event        USD      What it is
actor-start  $0.005   One-off warm-up charge per run
result       $0.0015  Per question written to the dataset

1 000 questions ≈ $1.50. No subscription, no minimum spend, no credit card to try — Apify gives every new account $5 of free credit.
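The arithmetic behind that estimate can be sketched with the two event prices from the table. The function estimate_cost is a hypothetical helper; it uses Decimal to avoid floating-point rounding and assumes every question fetched is written to the dataset.

```python
from decimal import Decimal

# Event prices from the pricing table above.
ACTOR_START_USD = Decimal("0.005")
RESULT_USD = Decimal("0.0015")


def estimate_cost(questions: int, runs: int = 1) -> Decimal:
    """Estimated charge in USD: one start fee per run plus one fee per result."""
    return ACTOR_START_USD * runs + RESULT_USD * questions
```

For 1 000 questions in a single run this comes to $1.505, which is where the rounded $1.50 figure comes from. Splitting the same job over more runs adds only $0.005 per extra run.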

🚧 Limitations

  • Question bodies only — comments, voting graphs, revision history, and answer bodies are not in scope for this Actor.
  • Search ranking is Stack Exchange's own, which differs from the site's visible UI sort in subtle ways.
  • Daily quota — without an API key you get 300 requests/day (each page = 1 request, 100 questions/page). With a free key from stackapps.com you get 10 000 requests/day. For very large jobs, plan your daily budget accordingly.
  • Deleted users — some owner_user_id and owner_display_name fields will be null where the asker's account was removed.
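The daily quota limit above translates into a simple budget calculation. This sketch counts only page requests (100 questions per page) and ignores any retries, which would consume additional quota; the helper names are hypothetical.

```python
import math

PAGE_SIZE = 100  # maximum questions per API page


def requests_needed(questions: int) -> int:
    """Number of page requests required to fetch the given number of questions."""
    return math.ceil(questions / PAGE_SIZE)


def days_needed(questions: int, daily_quota: int) -> int:
    """Number of days required at the given requests-per-day quota."""
    return math.ceil(requests_needed(questions) / daily_quota)
```

For example, 50 000 questions need 500 requests: two days at the keyless quota of 300 requests/day, or a fraction of one day with a 10 000 requests/day key.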

❓ FAQ

Why use this instead of the Stack Overflow data dump?

The official SO data dump is paused (as of mid-2025) and the last release is years stale. Even when active it was quarterly, unfiltered by tag, and shipped as multi-GB XML — you had to ETL it yourself. This Actor gives you fresh questions filtered by tag or query, clean JSON rows, and attribution fields ready to use, in minutes rather than days.

Why is the API quota so low without a key?

Stack Exchange caps unauthenticated usage at 300 requests/day per IP. Add a free API key from stackapps.com and you get 10 000 requests/day — enough for most jobs. For bulk corpus pulls, schedule multiple smaller runs across days.

Can I get answers too?

Not in this Actor — answers are a separate endpoint with different pagination and field shapes. A sibling stackexchange-answers-scraper Actor is planned. For now, the accepted answer id is included so you can cross-reference.

Do I have to worry about CC BY-SA attribution?

Yes — Stack Exchange content is licensed CC BY-SA 4.0. Your downstream use must attribute the source. Each row includes link (the canonical question URL) and owner_display_name, which are the minimum fields required. Do not redistribute the dataset commercially without including those attribution fields.

What about voting or posting?

We do not write to Stack Exchange. This Actor only reads from the API, so it cannot vote or post.

Why are some user fields null?

Some questions were asked by accounts that have since been deleted. The Stack Exchange API returns null for those owner fields — we pass it through as-is.

💬 Your feedback

Spotted a bug, hit a quota edge case, or need a new field? Open an issue on the Actor's Issues tab on Apify Console — we ship fixes weekly and we read every report.