Stack Overflow & Stack Exchange Scraper

Pricing

from $2.50 / 1,000 questions

[💰 $2.5 / 1K] Extract questions from Stack Overflow and the 170+ site Stack Exchange network. Search by keyword or tag, sort by votes/activity, or pull specific questions by URL. Optionally collect answers and comments as linked rows.

Developer

SolidCode

Maintained by Community

Pull questions, plus full answer and comment threads, from Stack Overflow and the wider Stack Exchange network in one run, with question, answer, and comment bodies in Markdown, vote scores, accepted-answer flags, view counts, tags, and author reputation. Search by keyword, filter by tag, browse trending threads, or fetch specific questions by URL. Built for developer-tooling teams, LLM/AI training-data builders, and technical researchers who need clean, structured Q&A content without hand-rolling paginated API calls and juggling anonymous request limits.

Why This Scraper?

  • 170+ Stack Exchange communities from one field: Stack Overflow, Server Fault, Super User, Ask Ubuntu, Mathematics, Cross Validated, Unix & Linux, Data Science, and 20+ more curated sites, all selectable without touching a URL.
  • Full question + answer + comment bodies in Markdown, not just titles and metadata. Pull the actual content behind every thread, ready for content analysis and LLM pipelines.
  • Answers with vote score, accepted-answer flag, and author reputation: every answer row carries score, isAccepted, and the answerer's authorReputation, so you can rank canonical solutions instantly.
  • Comments on both questions and answers: each comment row is tagged postType (question or answer), carries the postId of the exact question or answer it hangs off, and is linked back to its parent question by ID. Answer comments arrive when you enable both answers and comments.
  • Linked three-record output: question, answer, and comment rows share questionId so you can reassemble whole threads or load each record type into its own table.
  • Tag AND-filtering: pass python + pandas and get only questions carrying every tag, filtered on Stack Exchange's side so you never pay for off-target rows.
  • 6 sort modes: Recent activity, Newest, Most votes, Hot, Top this week, and Top this month, plus a date range for precise "new since yesterday" windows.
  • Search a keyword or fetch exact questions by URL/ID: run a full-text search across titles and bodies, or paste specific question links like stackoverflow.com/questions/11227809/... to pull those threads directly.

Use Cases

Developer Tooling & IDE Plugins

  • Feed an in-editor "top accepted answers" panel for a language or framework tag
  • Surface the highest-voted solution for an error message inside a support bot
  • Keep a curated snippet library fresh from canonical Q&A threads

LLM & AI Training Data

  • Build instruction-tuning datasets of real questions paired with accepted answers
  • Extract Markdown code blocks and explanations for code-model pretraining
  • Assemble evaluation sets of high-score answers with their vote signals

Technical Research & Trend Analysis

  • Track which frameworks and libraries are gaining question volume over a date range
  • Analyze answer quality by score distribution across a tag
  • Compare activity between Stack Overflow and niche communities like Data Science or DevOps

Community & Reputation Monitoring

  • Watch a tag for newly asked, still-unanswered questions to jump on
  • Track top contributors by author reputation across a community
  • Alert on trending "Hot" threads in your product's ecosystem

Content & Documentation

  • Mine frequently asked questions to prioritize docs and knowledge-base articles
  • Pull real user phrasing for FAQ and help-center content
  • Source vetted code examples with attribution back to the original thread

Getting Started

One topic, up to 100 questions (the default cap), ordered by recent activity (the default sort):

{
  "site": "stackoverflow",
  "searchQuery": "pandas groupby performance"
}

Tag Filter + Sort + Date Range

The highest-voted questions tagged both kubernetes and networking, asked in 2024:

{
  "site": "stackoverflow",
  "tags": ["kubernetes", "networking"],
  "sort": "votes",
  "fromDate": "2024-01-01",
  "toDate": "2024-12-31",
  "maxResults": 200
}

Specific Questions with Answers + Comments

Pull two exact threads with question bodies, up to five answers each, and comments:

{
  "questionUrlsOrIds": [
    "https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster",
    "231767"
  ],
  "includeQuestionBody": true,
  "includeAnswers": true,
  "maxAnswersPerQuestion": 5,
  "includeComments": true
}

Browse a Whole Community

The most recently active questions on Unix & Linux, no keyword needed:

{
  "site": "unix",
  "sort": "activity",
  "maxResults": 500
}

Input Reference

What to Scrape

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| site | string | "stackoverflow" | Which Stack Exchange community to pull from: Stack Overflow, Server Fault, Super User, Ask Ubuntu, Mathematics, Data Science, Cross Validated, and more. |
| searchQuery | string | "" | Full-text search across question titles and bodies (e.g. "kubernetes ingress timeout"). Leave blank to browse by tag and sort order instead. |
| tags | array | [] | Only include questions carrying ALL of these tags (e.g. python, pandas). Use exact tag names as they appear on the site. Leave empty to include every tag. |
| questionUrlsOrIds | array | [] | Fetch specific questions directly by URL or numeric ID. When set, the keyword/tag/sort finders are ignored for those questions. |

Filters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sort | string | "activity" | Order questions are collected in: Recent activity, Newest, Most votes, Hot, Top this week, or Top this month. Ignored when fetching specific question URLs/IDs. |
| fromDate | string | "" | Only include questions created on or after this date (YYYY-MM-DD). Perfect for scheduled "new since yesterday" runs. |
| toDate | string | "" | Only include questions created on or before this date (YYYY-MM-DD). |

Limits & Content

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| maxResults | integer | 100 | Maximum number of questions to collect. Set to 0 for as many as the site returns. The last page of results is kept whole, so the total can slightly overshoot this limit. Ignored when fetching specific question URLs/IDs. |
| includeQuestionBody | boolean | false | Include each question's full body text (Markdown), not just its title. |
| includeAnswers | boolean | false | Also collect each question's answers (with body, score, accepted flag, and author) as separate linked rows. |
| maxAnswersPerQuestion | integer | 0 | Cap how many answers to collect per question when answers are enabled. 0 = all. |
| includeComments | boolean | false | Also collect comments as separate linked rows. Question comments are always included; answer comments are included when both answers and comments are enabled. |
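
The parameters above can also be assembled and sanity-checked in code before a run is started. The following is a minimal Python sketch, not part of the actor: `build_input` and `check_date` are hypothetical helpers, and of the `sort` values only "activity", "votes", and "creation" appear in this README; "hot", "week", and "month" are assumed identifiers for the remaining three modes.

```python
import re
from datetime import date

# "activity", "votes" and "creation" appear in this README. The other three
# are ASSUMED identifiers for Hot, Top this week and Top this month.
KNOWN_SORTS = {"activity", "votes", "creation", "hot", "week", "month"}
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_date(name, value):
    """Accept an empty string or a real calendar date written as YYYY-MM-DD."""
    if not value:
        return
    if not DATE_PATTERN.fullmatch(value):
        raise ValueError(f"{name} must look like YYYY-MM-DD, got {value!r}")
    date.fromisoformat(value)  # rejects impossible dates such as 2024-02-30


def build_input(site="stackoverflow", search_query="", tags=(), sort="activity",
                from_date="", to_date="", max_results=100, **content_flags):
    """Return an input dict using the parameter names from the tables above."""
    if sort not in KNOWN_SORTS:
        raise ValueError(f"unknown sort value: {sort!r}")
    check_date("fromDate", from_date)
    check_date("toDate", to_date)
    # Zero-padded ISO dates compare correctly as plain strings.
    if from_date and to_date and from_date > to_date:
        raise ValueError("fromDate must not be later than toDate")
    if max_results < 0:
        raise ValueError("maxResults must be 0 (no cap) or a positive integer")
    run_input = {
        "site": site,
        "searchQuery": search_query,
        "tags": list(tags),
        "sort": sort,
        "fromDate": from_date,
        "toDate": to_date,
        "maxResults": max_results,
    }
    run_input.update(content_flags)  # e.g. includeAnswers=True
    return run_input


# Rebuilds the "Tag Filter + Sort + Date Range" example from Getting Started.
example = build_input(tags=["kubernetes", "networking"], sort="votes",
                      from_date="2024-01-01", to_date="2024-12-31",
                      max_results=200)
```

Validating locally catches a malformed date or a reversed range before any billable run starts; how the actor itself reacts to such input is not documented here.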

Output

Every row carries a recordType field (question, answer, or comment) and shares a questionId so you can rejoin whole threads or load each type into its own table.

Question (recordType: "question")

{
  "recordType": "question",
  "questionId": 11227809,
  "title": "Why is processing a sorted array faster than processing an unsorted array?",
  "link": "https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array",
  "site": "stackoverflow",
  "tags": ["java", "c++", "performance", "cpu-architecture", "branch-prediction"],
  "author": "GManNickG",
  "authorId": 87234,
  "authorReputation": 511234,
  "score": 27543,
  "viewCount": 1850342,
  "answerCount": 25,
  "commentCount": 12,
  "isAnswered": true,
  "hasAcceptedAnswer": true,
  "acceptedAnswerId": 11227902,
  "body": "Here is a piece of C++ code that shows some very peculiar behavior...",
  "createdAt": "2012-06-27T13:51:36+00:00",
  "lastActivityAt": "2024-05-10T09:12:04+00:00",
  "scrapedAt": "2026-07-02T10:15:00+00:00"
}

| Field | Type | Description |
| --- | --- | --- |
| recordType | string | Always "question" |
| questionId | integer | Stack Exchange question ID; the link key for answers and comments |
| title | string | Question title |
| link | string | Canonical question URL |
| site | string | Source community (e.g. stackoverflow) |
| tags | array | Tags on the question |
| author | string | Asker display name |
| authorId | integer | Asker account ID (null for deleted accounts) |
| authorReputation | integer | Asker reputation |
| score | integer | Net votes |
| viewCount | integer | Total views |
| answerCount | integer | Number of answers |
| commentCount | integer | Number of comments on the question |
| isAnswered | boolean | Whether the question is marked answered |
| hasAcceptedAnswer | boolean | Whether an accepted answer exists |
| acceptedAnswerId | integer | ID of the accepted answer, if any |
| body | string | Question body in Markdown; present only when includeQuestionBody is on |
| createdAt | string | Creation timestamp (ISO 8601) |
| lastActivityAt | string | Last-activity timestamp (ISO 8601) |
| scrapedAt | string | Collection timestamp (ISO 8601) |

Answer (recordType: "answer")

Emitted only when includeAnswers is on.

{
  "recordType": "answer",
  "answerId": 11227902,
  "questionId": 11227809,
  "site": "stackoverflow",
  "body": "**Branch prediction.**\n\nWith a sorted array, the condition is predictable...",
  "score": 36012,
  "isAccepted": true,
  "author": "Mysticial",
  "authorId": 922184,
  "authorReputation": 481203,
  "commentCount": 8,
  "createdAt": "2012-06-27T13:56:42+00:00",
  "lastActivityAt": "2023-08-14T18:33:20+00:00"
}

| Field | Type | Description |
| --- | --- | --- |
| recordType | string | Always "answer" |
| answerId | integer | Answer ID |
| questionId | integer | Parent question ID (link key) |
| site | string | Source community |
| body | string | Answer body in Markdown |
| score | integer | Net votes |
| isAccepted | boolean | Whether this is the accepted answer |
| author | string | Answerer display name |
| authorId | integer | Answerer account ID (null for deleted accounts) |
| authorReputation | integer | Answerer reputation |
| commentCount | integer | Number of comments on this answer |
| createdAt | string | Creation timestamp (ISO 8601) |
| lastActivityAt | string | Last-activity timestamp (ISO 8601) |
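
Because each question row carries acceptedAnswerId and each answer row carries answerId and isAccepted, question/accepted-answer pairs (for example for an instruction-tuning dataset) can be joined in a few lines. This is a sketch under assumptions: `accepted_pairs` is a hypothetical helper, `rows` stands in for exported dataset items, and the sample rows are trimmed copies of the examples above. Whether the accepted answer survives a low maxAnswersPerQuestion cap is not stated in this README, so the code simply skips questions whose accepted answer was not collected.

```python
def accepted_pairs(rows):
    """Yield (question_row, answer_row) for each question whose accepted
    answer is present among the collected rows."""
    answers = {r["answerId"]: r for r in rows if r.get("recordType") == "answer"}
    for row in rows:
        if row.get("recordType") != "question":
            continue
        answer = answers.get(row.get("acceptedAnswerId"))
        # Require the flag as well, in case the two fields ever disagree.
        if answer is not None and answer.get("isAccepted"):
            yield row, answer


# Trimmed sample rows based on the examples in this section.
rows = [
    {"recordType": "question", "questionId": 11227809,
     "title": "Why is processing a sorted array faster than processing an unsorted array?",
     "acceptedAnswerId": 11227902},
    {"recordType": "answer", "answerId": 11227902, "questionId": 11227809,
     "isAccepted": True, "score": 36012,
     "body": "**Branch prediction.**\n\nWith a sorted array, the condition is predictable..."},
]

pairs = [
    {"prompt": question["title"], "completion": answer["body"], "score": answer["score"]}
    for question, answer in accepted_pairs(rows)
]
```

Using the question title as the prompt is only one option; with includeQuestionBody enabled, the body field gives a fuller prompt.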

Comment (recordType: "comment")

Emitted only when includeComments is on.

{
  "recordType": "comment",
  "commentId": 14738201,
  "postId": 11227902,
  "postType": "answer",
  "questionId": 11227809,
  "site": "stackoverflow",
  "body": "This is the clearest explanation of branch prediction I have ever read.",
  "score": 214,
  "author": "user1234",
  "authorId": 445566,
  "createdAt": "2012-06-28T08:04:11+00:00"
}

| Field | Type | Description |
| --- | --- | --- |
| recordType | string | Always "comment" |
| commentId | integer | Comment ID |
| postId | integer | ID of the question or answer the comment belongs to |
| postType | string | "question" or "answer": what the comment is attached to |
| questionId | integer | Parent question ID (link key) |
| site | string | Source community |
| body | string | Comment body in Markdown |
| score | integer | Net votes |
| author | string | Commenter display name |
| authorId | integer | Commenter account ID (null for deleted accounts) |
| createdAt | string | Creation timestamp (ISO 8601) |
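
The three record types can be rejoined into whole threads by grouping on questionId and then attaching comments through postType and postId. The sketch below assumes the rows have already been exported as a list of dicts; `build_threads` and `comments_on` are hypothetical helper names, and the second answer in the sample data is made up for illustration.

```python
def build_threads(rows):
    """Group flat rows into {questionId: {"question", "answers", "comments"}}."""
    threads = {}
    for row in rows:
        thread = threads.setdefault(
            row["questionId"], {"question": None, "answers": [], "comments": []}
        )
        kind = row["recordType"]
        if kind == "question":
            thread["question"] = row
        elif kind == "answer":
            thread["answers"].append(row)
        elif kind == "comment":
            thread["comments"].append(row)
    for thread in threads.values():
        # Highest-scored answers first.
        thread["answers"].sort(key=lambda a: a["score"], reverse=True)
    return threads


def comments_on(thread, post_type, post_id):
    """Return the comments attached to one specific question or answer."""
    return [c for c in thread["comments"]
            if c["postType"] == post_type and c["postId"] == post_id]


# Trimmed sample rows; answer 11227999 is invented for the example.
rows = [
    {"recordType": "comment", "commentId": 14738201, "postId": 11227902,
     "postType": "answer", "questionId": 11227809, "score": 214},
    {"recordType": "answer", "answerId": 11227999, "questionId": 11227809, "score": 12},
    {"recordType": "answer", "answerId": 11227902, "questionId": 11227809, "score": 36012},
    {"recordType": "question", "questionId": 11227809, "score": 27543},
]

threads = build_threads(rows)
top_answer = threads[11227809]["answers"][0]
top_answer_comments = comments_on(threads[11227809], "answer", top_answer["answerId"])
```

The grouping does not depend on row order, so it works whether the export interleaves record types or lists them in separate batches.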

Tips for Best Results

  • Mine canonical answers with votes sort + tag AND-filtering. Combine two or three specific tags with sort: "votes" to surface the definitive, highest-scored solutions for a topic, ideal for training data and snippet libraries.
  • Set a date range for trend windows. Pair fromDate and toDate to isolate a quarter or a release window and measure how question volume for a framework shifts over time.
  • Reach beyond Stack Overflow with site. The same run works on Server Fault, Super User, Data Science, Unix & Linux, and 170+ other communities; switch site to pull domain-specific Q&A the main site doesn't cover.
  • Cap answers on popular threads. Canonical questions can carry 30–50+ answers. Set maxAnswersPerQuestion to keep only the top few and control run size and cost. Note that answerCount always reports the question's true total on the site, independent of how many answer rows you actually collect.
  • Turn on bodies only when you need content. Leave includeQuestionBody, includeAnswers, and includeComments off for lightweight metadata runs; enable them when you need the actual Markdown text.
  • Use Newest for scheduled incremental runs. sort: "creation" with a rolling fromDate reliably catches only questions added since your last run.
  • Fetch exact threads by URL for deep dives. Paste question links into questionUrlsOrIds to pull specific high-value threads with all their answers and comments in one shot.
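
The scheduled incremental pattern from the tips above can be sketched as follows. Since fromDate is a whole day and includes questions created "on or after" it, two consecutive runs can return the same boundary-day rows, so this sketch also drops rows already seen. `incremental_input`, `row_key`, and `drop_seen` are hypothetical helpers, and how the set of seen keys is persisted between runs is left open.

```python
from datetime import date

ID_FIELD = {"question": "questionId", "answer": "answerId", "comment": "commentId"}


def incremental_input(last_run, site="stackoverflow", tags=()):
    """Input for a run that collects questions created since the last run date."""
    return {
        "site": site,
        "tags": list(tags),
        "sort": "creation",               # the "Newest" mode named in the tips
        "fromDate": last_run.isoformat(),  # YYYY-MM-DD, inclusive
        "maxResults": 0,                   # 0 = as many as the site returns
    }


def row_key(row):
    """Identify a row by its record type plus its own ID."""
    return row["recordType"], row[ID_FIELD[row["recordType"]]]


def drop_seen(rows, seen):
    """Return only rows not seen before, and record them in `seen`."""
    fresh = []
    for row in rows:
        key = row_key(row)
        if key not in seen:
            seen.add(key)
            fresh.append(row)
    return fresh


run_input = incremental_input(date(2024, 5, 9), tags=["python", "pandas"])
```

Deduplicating on the consumer side keeps downstream tables clean; it does not change which rows a run collects or bills.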

Pricing

From $2.50 per 1,000 questions at the Gold tier ($3.00 with no discount), which undercuts the market rate for Stack Exchange extraction. Answers and comments, when you enable them, are billed separately at much lower rates. You pay only for the results you collect.

This actor uses a per-result model split by record type. Prices below are per 1,000 rows of that type; Bronze, Silver, and Gold subscribers pay progressively less.

| Record type | No discount | Bronze | Silver | Gold |
| --- | --- | --- | --- | --- |
| Question | $3.00 | $2.80 | $2.65 | $2.50 |
| Answer | $0.60 | $0.56 | $0.53 | $0.50 |
| Comment | $0.24 | $0.22 | $0.21 | $0.20 |

Plus a small fixed $0.005 per-run start fee.

Because answers and comments are far cheaper than questions, your real total depends on the mix you collect. Example totals at the Gold tier:

| What you collect | Rows | Cost at Gold (including the $0.005 start fee) |
| --- | --- | --- |
| 100 questions only | 100 questions | $0.255 |
| 100 questions + ~3 answers each | 100 questions + 300 answers | $0.405 |
| 100 questions + 300 answers + 500 comments | 900 rows | $0.505 |

No compute or time-based charges: you pay only for the results you collect, plus the small fixed per-run start fee. Answers and comments are billed only when you turn them on. Platform fees (storage, data transfer) depend on your Apify plan.
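
The example totals above follow directly from the per-1,000 prices plus the start fee, and the same arithmetic can estimate other mixes. The sketch below copies the prices from the pricing table; `estimate_cost` is a hypothetical helper, the tier keys are illustrative names, and platform fees (storage, data transfer) are not included.

```python
from decimal import Decimal

# Prices per 1,000 rows, copied from the pricing table in this section.
PRICE_PER_1000 = {
    "none":   {"question": "3.00", "answer": "0.60", "comment": "0.24"},
    "bronze": {"question": "2.80", "answer": "0.56", "comment": "0.22"},
    "silver": {"question": "2.65", "answer": "0.53", "comment": "0.21"},
    "gold":   {"question": "2.50", "answer": "0.50", "comment": "0.20"},
}
START_FEE = Decimal("0.005")  # fixed per-run start fee


def estimate_cost(questions, answers=0, comments=0, tier="gold"):
    """Estimated run cost in USD, excluding Apify platform fees."""
    prices = PRICE_PER_1000[tier]
    counts = {"question": questions, "answer": answers, "comment": comments}
    rows_cost = sum(Decimal(prices[kind]) * count / 1000
                    for kind, count in counts.items())
    return rows_cost + START_FEE


# Reproduces the three example totals from the table above.
assert estimate_cost(100) == Decimal("0.255")
assert estimate_cost(100, answers=300) == Decimal("0.405")
assert estimate_cost(100, answers=300, comments=500) == Decimal("0.505")
```

Decimal is used instead of float so that sub-cent amounts add up exactly.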

Integrations

Export data in JSON, CSV, Excel, XML, or RSS. Connect to 1,500+ apps via:

  • Zapier / Make / n8n: Workflow automation
  • Google Sheets: Direct spreadsheet export
  • Slack / Email: Notifications on new results
  • Webhooks: Trigger custom APIs on run completion
  • Apify API: Full programmatic access

This actor is designed for legitimate research, developer tooling, dataset building, and market intelligence. Users are responsible for complying with applicable laws and Stack Exchange's terms of service, including content-attribution and licensing requirements for any questions, answers, and comments collected. Do not use extracted data for spam, harassment, or any illegal purpose.