Stack Overflow & Stack Exchange Scraper
Extract questions from Stack Overflow and the Stack Exchange network of 170+ sites. Search by keyword or tag, sort by votes or activity, or pull specific questions by URL. Optionally collect answers and comments as linked rows.
Developer: SolidCode (maintained by Community)
Pull questions, plus full answer and comment threads, from Stack Overflow and the wider Stack Exchange network in one run, with question, answer, and comment bodies in Markdown, vote scores, accepted-answer flags, view counts, tags, and author reputation. Search by keyword, filter by tag, browse trending threads, or fetch specific questions by URL. Built for developer-tooling teams, LLM/AI training-data builders, and technical researchers who need clean, structured Q&A content without hand-rolling paginated API calls and juggling anonymous request limits.
Why This Scraper?
- 170+ Stack Exchange communities from one field: Stack Overflow, Server Fault, Super User, Ask Ubuntu, Mathematics, Cross Validated, Unix & Linux, Data Science, and 20+ more curated sites, all selectable without touching a URL.
- Full question + answer + comment bodies in Markdown, not just titles and metadata. Pull the actual content behind every thread, ready for content analysis and LLM pipelines.
- Answers with vote score, accepted-answer flag, and author reputation: every answer row carries `score`, `isAccepted`, and the answerer's `authorReputation`, so you can rank canonical solutions instantly.
- Comments on both questions and answers: each comment row is tagged with `postType` (`question` or `answer`), carries the `postId` of the exact question or answer it hangs off, and is linked back to its parent question by ID. Answer comments arrive when you enable both answers and comments.
- Linked three-record output: question, answer, and comment rows share `questionId` so you can reassemble whole threads or load each record type into its own table.
- Tag AND-filtering: pass `python` + `pandas` and get only questions carrying every tag, filtered on Stack Exchange's side so you never pay for off-target rows.
- 6 sort modes: Recent activity, Newest, Most votes, Hot, Top this week, and Top this month, plus a date range for precise "new since yesterday" windows.
- Search a keyword or fetch exact questions by URL/ID: run a full-text search across titles and bodies, or paste specific question links like `stackoverflow.com/questions/11227809/...` to pull those threads directly.
Use Cases
Developer Tooling & IDE Plugins
- Feed an in-editor "top accepted answers" panel for a language or framework tag
- Surface the highest-voted solution for an error message inside a support bot
- Keep a curated snippet library fresh from canonical Q&A threads
LLM & AI Training Data
- Build instruction-tuning datasets of real questions paired with accepted answers
- Extract Markdown code blocks and explanations for code-model pretraining
- Assemble evaluation sets of high-score answers with their vote signals
Technical Research & Trend Analysis
- Track which frameworks and libraries are gaining question volume over a date range
- Analyze answer quality by score distribution across a tag
- Compare activity between Stack Overflow and niche communities like Data Science or DevOps
Community & Reputation Monitoring
- Watch a tag for newly asked, still-unanswered questions to jump on
- Track top contributors by author reputation across a community
- Alert on trending "Hot" threads in your product's ecosystem
Content & Documentation
- Mine frequently asked questions to prioritize docs and knowledge-base articles
- Pull real user phrasing for FAQ and help-center content
- Source vetted code examples with attribution back to the original thread
Getting Started
Simple Keyword Search
One topic, up to 100 questions (the default `maxResults`), ordered by recent activity (the default `sort`):

    {
      "site": "stackoverflow",
      "searchQuery": "pandas groupby performance"
    }
Tag Filter + Sort + Date Range
The highest-voted questions tagged with both `kubernetes` and `networking`, asked in 2024:

    {
      "site": "stackoverflow",
      "tags": ["kubernetes", "networking"],
      "sort": "votes",
      "fromDate": "2024-01-01",
      "toDate": "2024-12-31",
      "maxResults": 200
    }
Specific Questions with Answers + Comments
Pull two exact threads with their question bodies, up to five answers each, and comments:

    {
      "questionUrlsOrIds": [
        "https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster",
        "231767"
      ],
      "includeQuestionBody": true,
      "includeAnswers": true,
      "maxAnswersPerQuestion": 5,
      "includeComments": true
    }
Browse a Whole Community
The most recently active questions on Unix & Linux, no keyword needed:

    {
      "site": "unix",
      "sort": "activity",
      "maxResults": 500
    }
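Payloads like the ones above can also be assembled in code. The following is a minimal Python sketch, not part of the actor: `build_input` is a hypothetical helper that reuses the field names from the examples, normalizes dates to YYYY-MM-DD, and rejects an inverted date range before a run is started.

```python
import json
from datetime import date


def build_input(site="stackoverflow", search_query="", tags=(), sort="activity",
                from_date=None, to_date=None, max_results=100):
    """Assemble an input payload using the field names from the examples above."""
    payload = {"site": site, "sort": sort, "maxResults": max_results}
    if search_query:
        payload["searchQuery"] = search_query
    if tags:
        payload["tags"] = list(tags)
    # date.fromisoformat raises ValueError for strings that are not valid ISO dates.
    start = date.fromisoformat(from_date) if from_date else None
    end = date.fromisoformat(to_date) if to_date else None
    if start and end and start > end:
        raise ValueError("fromDate must not be later than toDate")
    if start:
        payload["fromDate"] = start.isoformat()
    if end:
        payload["toDate"] = end.isoformat()
    return payload


example = build_input(tags=["kubernetes", "networking"], sort="votes",
                      from_date="2024-01-01", to_date="2024-12-31", max_results=200)
print(json.dumps(example, indent=2))
```

The printed JSON can be pasted into the actor's input editor or sent through whatever client you use to start runs.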
Input Reference
What to Scrape
| Parameter | Type | Default | Description |
|---|---|---|---|
| `site` | string | `"stackoverflow"` | Which Stack Exchange community to pull from: Stack Overflow, Server Fault, Super User, Ask Ubuntu, Mathematics, Data Science, Cross Validated, and more. |
| `searchQuery` | string | `""` | Full-text search across question titles and bodies (e.g. `"kubernetes ingress timeout"`). Leave blank to browse by tag and sort order instead. |
| `tags` | array | `[]` | Only include questions carrying ALL of these tags (e.g. `python`, `pandas`). Use exact tag names as they appear on the site. Leave empty to include every tag. |
| `questionUrlsOrIds` | array | `[]` | Fetch specific questions directly by URL or numeric ID. When set, the keyword/tag/sort finders are ignored for those questions. |
Filters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `sort` | string | `"activity"` | Order in which questions are collected: Recent activity, Newest, Most votes, Hot, Top this week, or Top this month. Ignored when fetching specific question URLs/IDs. |
| `fromDate` | string | `""` | Only include questions created on or after this date (YYYY-MM-DD). Useful for scheduled "new since yesterday" runs. |
| `toDate` | string | `""` | Only include questions created on or before this date (YYYY-MM-DD). |
Limits & Content
| Parameter | Type | Default | Description |
|---|---|---|---|
| `maxResults` | integer | `100` | Maximum number of questions to collect. Set to 0 for as many as the site returns. The last page is kept in full, so the total can slightly overshoot the limit. Ignored when fetching specific question URLs/IDs. |
| `includeQuestionBody` | boolean | `false` | Include each question's full body text (Markdown), not just its title. |
| `includeAnswers` | boolean | `false` | Also collect each question's answers, with body, score, accepted flag, and author, as separate linked rows. |
| `maxAnswersPerQuestion` | integer | `0` | Cap how many answers to collect per question when answers are enabled. 0 = all. |
| `includeComments` | boolean | `false` | Also collect comments as separate linked rows. Question comments are always included; answer comments are included when both answers and comments are enabled. |
Output
Every row carries a `recordType` field (`question`, `answer`, or `comment`) and shares a `questionId` so you can rejoin whole threads or load each type into its own table.
Question (`recordType: "question"`)

    {
      "recordType": "question",
      "questionId": 11227809,
      "title": "Why is processing a sorted array faster than processing an unsorted array?",
      "link": "https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array",
      "site": "stackoverflow",
      "tags": ["java", "c++", "performance", "cpu-architecture", "branch-prediction"],
      "author": "GManNickG",
      "authorId": 87234,
      "authorReputation": 511234,
      "score": 27543,
      "viewCount": 1850342,
      "answerCount": 25,
      "commentCount": 12,
      "isAnswered": true,
      "hasAcceptedAnswer": true,
      "acceptedAnswerId": 11227902,
      "body": "Here is a piece of C++ code that shows some very peculiar behavior...",
      "createdAt": "2012-06-27T13:51:36+00:00",
      "lastActivityAt": "2024-05-10T09:12:04+00:00",
      "scrapedAt": "2026-07-02T10:15:00+00:00"
    }

| Field | Type | Description |
|---|---|---|
| `recordType` | string | Always `"question"` |
| `questionId` | integer | Stack Exchange question ID; the link key for answers and comments |
| `title` | string | Question title |
| `link` | string | Canonical question URL |
| `site` | string | Source community (e.g. `stackoverflow`) |
| `tags` | array | Tags on the question |
| `author` | string | Asker display name |
| `authorId` | integer | Asker account ID (null for deleted accounts) |
| `authorReputation` | integer | Asker reputation |
| `score` | integer | Net votes |
| `viewCount` | integer | Total views |
| `answerCount` | integer | Number of answers |
| `commentCount` | integer | Number of comments on the question |
| `isAnswered` | boolean | Whether the question is marked answered |
| `hasAcceptedAnswer` | boolean | Whether an accepted answer exists |
| `acceptedAnswerId` | integer | ID of the accepted answer, if any |
| `body` | string | Question body in Markdown; present only when `includeQuestionBody` is on |
| `createdAt` | string | Creation timestamp (ISO 8601) |
| `lastActivityAt` | string | Last-activity timestamp (ISO 8601) |
| `scrapedAt` | string | Collection timestamp (ISO 8601) |
Answer (`recordType: "answer"`)
Emitted only when `includeAnswers` is on.

    {
      "recordType": "answer",
      "answerId": 11227902,
      "questionId": 11227809,
      "site": "stackoverflow",
      "body": "**Branch prediction.**\n\nWith a sorted array, the condition is predictable...",
      "score": 36012,
      "isAccepted": true,
      "author": "Mysticial",
      "authorId": 922184,
      "authorReputation": 481203,
      "commentCount": 8,
      "createdAt": "2012-06-27T13:56:42+00:00",
      "lastActivityAt": "2023-08-14T18:33:20+00:00"
    }

| Field | Type | Description |
|---|---|---|
| `recordType` | string | Always `"answer"` |
| `answerId` | integer | Answer ID |
| `questionId` | integer | Parent question ID (link key) |
| `site` | string | Source community |
| `body` | string | Answer body in Markdown |
| `score` | integer | Net votes |
| `isAccepted` | boolean | Whether this is the accepted answer |
| `author` | string | Answerer display name |
| `authorId` | integer | Answerer account ID (null for deleted accounts) |
| `authorReputation` | integer | Answerer reputation |
| `commentCount` | integer | Number of comments on this answer |
| `createdAt` | string | Creation timestamp (ISO 8601) |
| `lastActivityAt` | string | Last-activity timestamp (ISO 8601) |
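
Because answer rows carry `isAccepted` and the parent `questionId`, question and accepted-answer pairs can be built with a single pass over the dataset. The sketch below assumes the rows have already been loaded as a list of dicts; `accepted_pairs` and the output keys `prompt`, `completion`, and `answerScore` are illustrative names, not actor output fields. If an answer cap leaves the accepted answer uncollected, that question yields no pair.

```python
def accepted_pairs(rows):
    """Pair each question with its accepted answer, if that answer row was collected."""
    questions = {r["questionId"]: r for r in rows if r["recordType"] == "question"}
    pairs = []
    for row in rows:
        if row["recordType"] != "answer" or not row["isAccepted"]:
            continue
        question = questions.get(row["questionId"])
        if question is None:
            continue  # answer row whose question is not in this batch
        pairs.append({
            "questionId": question["questionId"],
            # body is only present when includeQuestionBody was on; fall back to the title
            "prompt": question.get("body") or question["title"],
            "completion": row["body"],
            "answerScore": row["score"],
        })
    return pairs


rows = [
    {"recordType": "question", "questionId": 11227809,
     "title": "Why is processing a sorted array faster than processing an unsorted array?",
     "acceptedAnswerId": 11227902},
    {"recordType": "answer", "answerId": 11227902, "questionId": 11227809,
     "body": "**Branch prediction.**", "score": 36012, "isAccepted": True},
    # Hypothetical second answer, added only to show that non-accepted rows are skipped.
    {"recordType": "answer", "answerId": 1, "questionId": 11227809,
     "body": "Another take.", "score": 10, "isAccepted": False},
]
pairs = accepted_pairs(rows)
print(pairs)
```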
Comment (`recordType: "comment"`)
Emitted only when `includeComments` is on.

    {
      "recordType": "comment",
      "commentId": 14738201,
      "postId": 11227902,
      "postType": "answer",
      "questionId": 11227809,
      "site": "stackoverflow",
      "body": "This is the clearest explanation of branch prediction I have ever read.",
      "score": 214,
      "author": "user1234",
      "authorId": 445566,
      "createdAt": "2012-06-28T08:04:11+00:00"
    }

| Field | Type | Description |
|---|---|---|
| `recordType` | string | Always `"comment"` |
| `commentId` | integer | Comment ID |
| `postId` | integer | ID of the question or answer the comment belongs to |
| `postType` | string | `"question"` or `"answer"`: what the comment is attached to |
| `questionId` | integer | Parent question ID (link key) |
| `site` | string | Source community |
| `body` | string | Comment body in Markdown |
| `score` | integer | Net votes |
| `author` | string | Commenter display name |
| `authorId` | integer | Commenter account ID (null for deleted accounts) |
| `createdAt` | string | Creation timestamp (ISO 8601) |
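
With all three record types in hand, whole threads can be rebuilt from the flat rows using `questionId`, `postType`, and `postId`. This is a rough sketch under the assumption that the dataset is loaded as a list of dicts; `build_threads` and the nested `answers` and `comments` keys are hypothetical names chosen for the example, and sorting answers by score is a design choice, not something the actor does.

```python
from collections import defaultdict


def build_threads(rows):
    """Nest answers and comments under their parent question."""
    threads = {}
    answers = defaultdict(list)   # questionId -> answer rows
    comments = defaultdict(list)  # (postType, postId) -> comment rows
    for row in rows:
        kind = row["recordType"]
        if kind == "question":
            threads[row["questionId"]] = {**row, "answers": [], "comments": []}
        elif kind == "answer":
            answers[row["questionId"]].append({**row, "comments": []})
        elif kind == "comment":
            comments[(row["postType"], row["postId"])].append(row)
    for qid, thread in threads.items():
        # For question comments, postId is the question's own ID.
        thread["comments"] = comments.get(("question", qid), [])
        ranked = sorted(answers.get(qid, []), key=lambda a: a["score"], reverse=True)
        for answer in ranked:
            answer["comments"] = comments.get(("answer", answer["answerId"]), [])
            thread["answers"].append(answer)
    return threads


rows = [
    {"recordType": "question", "questionId": 11227809, "title": "Why is processing a sorted array faster?"},
    {"recordType": "answer", "answerId": 11227902, "questionId": 11227809, "score": 36012},
    {"recordType": "comment", "commentId": 14738201, "postId": 11227902,
     "postType": "answer", "questionId": 11227809},
    # Hypothetical question-level comment, added to show both attachment points.
    {"recordType": "comment", "commentId": 1, "postId": 11227809,
     "postType": "question", "questionId": 11227809},
]
threads = build_threads(rows)
print(len(threads[11227809]["answers"]), len(threads[11227809]["comments"]))
```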
Tips for Best Results
- Mine canonical answers with `votes` sort + tag AND-filtering. Combine two or three specific tags with `sort: "votes"` to surface the definitive, highest-scored solutions for a topic, ideal for training data and snippet libraries.
- Set a date range for trend windows. Pair `fromDate` and `toDate` to isolate a quarter or a release window and measure how question volume for a framework shifts over time.
- Reach beyond Stack Overflow with `site`. The same run works on Server Fault, Super User, Data Science, Unix & Linux, and 170+ other communities; switch `site` to pull domain-specific Q&A the main site doesn't cover.
- Cap answers on popular threads. Canonical questions can carry 30 to 50+ answers. Set `maxAnswersPerQuestion` to keep only the top few and control run size and cost. Note that `answerCount` always reports the question's true total on the site, independent of how many answer rows you actually collect.
- Turn on bodies only when you need content. Leave `includeQuestionBody`, `includeAnswers`, and `includeComments` off for lightweight metadata runs; enable them when you need the actual Markdown text.
- Use `Newest` for scheduled incremental runs. `sort: "creation"` with a rolling `fromDate` reliably catches only questions added since your last run.
- Fetch exact threads by URL for deep dives. Paste question links into `questionUrlsOrIds` to pull specific high-value threads with all their answers and comments in one shot.
Pricing
From $2.50 per 1,000 questions (the Gold-tier rate; $3.00 without a discount), which undercuts the market rate for Stack Exchange extraction. Answers and comments, when you enable them, are billed separately at much lower rates. You pay only for the results you collect.
This actor uses a per-result model split by record type. Prices below are per 1,000 rows of that type; Bronze, Silver, and Gold subscribers pay progressively less.
| Record type | No discount | Bronze | Silver | Gold |
|---|---|---|---|---|
| Question | $3.00 | $2.80 | $2.65 | $2.50 |
| Answer | $0.60 | $0.56 | $0.53 | $0.50 |
| Comment | $0.24 | $0.22 | $0.21 | $0.20 |
Plus a small fixed $0.005 per-run start fee.
Because answers and comments are far cheaper than questions, your real total depends on the mix you collect. Example totals at the Gold tier:
| What you collect | Rows | Cost at Gold |
|---|---|---|
| 100 questions only | 100 questions | $0.255 |
| 100 questions + ~3 answers each | 100 questions + 300 answers | $0.405 |
| 100 questions + 300 answers + 500 comments | 900 rows | $0.505 |
No compute or time-based charges: you pay only for the results you collect, plus the small fixed per-run start fee. Answers and comments are billed only when you turn them on. Platform fees (storage, data transfer) depend on your Apify plan.
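The example totals above follow directly from the rate table: each row count is multiplied by its per-1,000 rate, and the start fee is added once per run. The sketch below copies the rates from the table and reproduces the three Gold-tier examples; `estimate_cost` and the tier keys are illustrative names, and platform fees are not included.

```python
# Per-1,000-row rates in USD, copied from the pricing table above.
RATES = {
    "none":   {"question": 3.00, "answer": 0.60, "comment": 0.24},
    "bronze": {"question": 2.80, "answer": 0.56, "comment": 0.22},
    "silver": {"question": 2.65, "answer": 0.53, "comment": 0.21},
    "gold":   {"question": 2.50, "answer": 0.50, "comment": 0.20},
}
START_FEE = 0.005  # fixed fee per run


def estimate_cost(questions, answers=0, comments=0, tier="gold"):
    """Estimate the cost of one run in USD, excluding platform fees."""
    rate = RATES[tier]
    per_row = (questions * rate["question"]
               + answers * rate["answer"]
               + comments * rate["comment"]) / 1000
    return round(START_FEE + per_row, 4)


print(estimate_cost(100))
print(estimate_cost(100, answers=300))
print(estimate_cost(100, answers=300, comments=500))
```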
Integrations
Export data in JSON, CSV, Excel, XML, or RSS. Connect to 1,500+ apps via:
- Zapier / Make / n8n: Workflow automation
- Google Sheets: Direct spreadsheet export
- Slack / Email: Notifications on new results
- Webhooks: Trigger custom APIs on run completion
- Apify API: Full programmatic access
Legal & Ethical Use
This actor is designed for legitimate research, developer tooling, dataset building, and market intelligence. Users are responsible for complying with applicable laws and Stack Exchange's terms of service, including content-attribution and licensing requirements for any questions, answers, and comments collected. Do not use extracted data for spam, harassment, or any illegal purpose.