4Chan Thread & Board Scraper
Pricing
from $0.35 / 1,000 results
Scrape threads from one or more 4chan boards using the official 4chan JSON API. Collect structured thread data, original posts, optional replies, attachments, extracted links, participant summaries, and thread-level metadata for research, monitoring, archiving, and downstream analysis.
Developer
Inus Grobler
Maintained by Community
4chan JSON API Scraper
At a glance: this actor scrapes public 4chan board and thread data through the official JSON API. Input is a list of board names or direct thread URLs. Output is a set of post rows carrying thread, attachment, and reply data. Typical use cases are research, monitoring, and archiving. Limitations, troubleshooting, and cost notes are covered below.
Scrape threads from one or more 4chan boards through the official 4chan JSON API.
This actor is useful when you want structured thread data for research, monitoring, archiving, enrichment, or downstream analysis. You choose which boards to scrape, how many threads to collect from each board, and whether to include replies or just the original post.
What The Actor Does
- Fetches the current catalog for each selected board
- Collects up to your chosen number of threads per board
- Stores thread metadata, original post data, attachment details, and optional replies
- Enriches output with catalog context, participation summaries, and extracted links
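The first two steps above can be sketched as follows. This is a minimal illustration, not the actor's actual code: it assumes the public catalog endpoint `https://a.4cdn.org/{board}/catalog.json`, which returns a list of pages that each hold a `threads` array, and it assumes threads are taken in catalog order. The function name `pick_thread_ids` and the sample data are hypothetical.

```python
from typing import Any

# Public 4chan catalog endpoint; {board} is a board name without slashes.
CATALOG_URL = "https://a.4cdn.org/{board}/catalog.json"


def pick_thread_ids(catalog: list[dict[str, Any]], max_threads: int) -> list[int]:
    """Return up to max_threads thread numbers, walking pages in catalog order."""
    ids: list[int] = []
    for page in catalog:
        for thread in page.get("threads", []):
            if len(ids) >= max_threads:
                return ids
            ids.append(thread["no"])
    return ids


# Hand-written sample shaped like a catalog.json response (not real data).
sample_catalog = [
    {"page": 1, "threads": [{"no": 101, "replies": 4}, {"no": 102, "replies": 0}]},
    {"page": 2, "threads": [{"no": 103, "replies": 9}]},
]
selected = pick_thread_ids(sample_catalog, max_threads=2)
```

In a real run the catalog would be downloaded from `CATALOG_URL.format(board="g")` first; the selection logic is kept separate here so it can be checked without network access.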
Input
boards
List of boards to scrape, without slashes.
Example:
["g", "biz", "tv"]
Default: ["g"]
maxThreadsPerBoard
Maximum number of threads to collect from each selected board.
Example:
20
Default: 10
threadUrls
Optional direct thread URLs to scrape. Use this when you already know the threads you need and want to skip board catalog discovery.
Example:
["https://boards.4chan.org/g/thread/105076684"]
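As a rough illustration of why direct thread URLs can skip catalog discovery, a thread URL already contains the board and thread number needed to address the thread's JSON endpoint. The sketch below assumes the public endpoint pattern `https://a.4cdn.org/{board}/thread/{id}.json`; the helper name `thread_api_url` is hypothetical and the actor's own parsing may differ.

```python
import re
from urllib.parse import urlparse

# Matches paths such as /g/thread/105076684 (a trailing slug is tolerated).
THREAD_PATH = re.compile(r"^/(?P<board>[a-z0-9]+)/thread/(?P<thread_id>\d+)")


def thread_api_url(thread_url: str) -> str:
    """Map a boards.4chan.org thread URL to its JSON API URL."""
    match = THREAD_PATH.match(urlparse(thread_url).path)
    if match is None:
        raise ValueError(f"not a thread URL: {thread_url}")
    return f"https://a.4cdn.org/{match['board']}/thread/{match['thread_id']}.json"


api_url = thread_api_url("https://boards.4chan.org/g/thread/105076684")
```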
maxRepliesPerThread
Optional cap on stored replies for each thread when scrapeReplies is enabled. The newest replies are kept. Leave empty to store all available replies.
Example:
25
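To make "the newest replies are kept" concrete, here is a small sketch of the capping rule. It assumes replies arrive oldest first, as they do in a thread's JSON `posts` array, and it treats a cap of zero as "keep nothing", which is an assumption rather than documented behavior. The function name `cap_replies` is hypothetical.

```python
from typing import Optional


def cap_replies(replies: list[dict], max_replies: Optional[int]) -> list[dict]:
    """Keep the newest max_replies replies; None means keep everything."""
    if max_replies is None:
        return list(replies)
    if max_replies <= 0:
        # Guard needed because replies[-0:] would return the whole list.
        return []
    return replies[-max_replies:]


replies = [{"no": n} for n in (1, 2, 3, 4)]
newest_two = cap_replies(replies, 2)
```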
scrapeReplies
- false: store only the original post, while still keeping thread-level counts and summary metadata
- true: store the original post and all available replies in each thread
Default: false
proxyConfiguration
Optional proxy settings for the requests. By default, the actor uses direct requests because the public 4chan JSON API is usually faster without a proxy. Enable Apify Proxy or provide custom proxy URLs only when your environment needs it.
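The input fields above can be combined into a single run input. The helper below is a hedged sketch: the key names and defaults come from this page, while the function `build_run_input` and its choice to drop `maxRepliesPerThread` when replies are disabled are illustrative assumptions.

```python
from typing import Any, Optional


def build_run_input(
    boards: Optional[list[str]] = None,
    max_threads_per_board: int = 10,
    scrape_replies: bool = False,
    max_replies_per_thread: Optional[int] = None,
    thread_urls: Optional[list[str]] = None,
) -> dict[str, Any]:
    """Assemble an input object using the documented field names and defaults."""
    run_input: dict[str, Any] = {
        # Board names are given without slashes, so strip any that slip in.
        "boards": [b.strip("/") for b in (boards or ["g"])],
        "maxThreadsPerBoard": max_threads_per_board,
        "scrapeReplies": scrape_replies,
    }
    if thread_urls:
        run_input["threadUrls"] = thread_urls
    # The reply cap only applies when replies are scraped at all.
    if scrape_replies and max_replies_per_thread is not None:
        run_input["maxRepliesPerThread"] = max_replies_per_thread
    return run_input


example_input = build_run_input(["/g/", "biz"], scrape_replies=True, max_replies_per_thread=25)
```

A dictionary like `example_input` could then be passed as the run input when starting the actor, for example through the Apify console's JSON input editor or an API client.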
Output
Each dataset item represents one scraped post row with repeated thread-level metadata.
Top-level fields include:
- board
- threadId
- threadUrl
- apiUrl
- scrapedAt
- subject
- semanticUrl
- replyCount
- imageCount
- isSticky
- isClosed
- isArchived
- archivedOn
- catalog
- stats
- participants
- links
- post
catalog
Board catalog context for the thread, including:
- catalog page number
- last modified timestamp
- omitted reply count
- omitted image count
- recent reply post IDs when available
stats
Thread-level summary fields, including:
- total posts and replies in the thread
- how many posts and replies are stored in the dataset item
- attachment totals
- quote counts
- external link counts
- board reference counts
- simple content flags such as code and greentext counts
participants
Participant summaries, including:
- unique poster IDs when present
- countries represented in the thread when present
links
Extracted link-related fields, including:
- external links
- external domains
- quoted post IDs
- board references such as >>>/g/123456789
post
Each post record can include:
- author and subject
- timestamp and formatted posting date
- comment HTML and cleaned comment text
- quote targets
- board references
- external links
- attachment metadata
- content flags such as containsCode and containsGreentext
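The link fields and content flags described above could be derived from a post's comment HTML roughly as follows. This is a sketch under stated assumptions, not the actor's implementation: it assumes greentext is marked up with `class="quote"` and code blocks with a `<pre>` element in the API's comment HTML, and every output key other than `containsCode` and `containsGreentext` is a hypothetical name chosen for illustration.

```python
import html
import re
from urllib.parse import urlparse

BR = re.compile(r"<br\s*/?>", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")
QUOTE = re.compile(r"(?<!>)>>(\d+)")          # >>123, but not part of >>>/g/123
BOARD_REF = re.compile(r">>>/[a-z0-9]+/\d*")  # >>>/g/123456789
URL = re.compile(r"https?://[^\s<>\"]+")


def clean_comment(comment_html: str) -> str:
    """Turn comment HTML into plain text, keeping line breaks."""
    text = BR.sub("\n", comment_html)
    text = TAG.sub("", text)
    return html.unescape(text)


def analyze_comment(comment_html: str) -> dict:
    """Extract quote targets, board references, links, and simple flags."""
    text = clean_comment(comment_html)
    # Trim sentence punctuation that often trails a pasted URL.
    urls = [u.rstrip(".,;:!?)") for u in URL.findall(text)]
    return {
        "text": text,
        "quoteTargets": [int(n) for n in QUOTE.findall(text)],
        "boardReferences": BOARD_REF.findall(text),
        "externalLinks": urls,
        "externalDomains": sorted({urlparse(u).netloc for u in urls}),
        "containsCode": "<pre" in comment_html,              # assumed markup
        "containsGreentext": 'class="quote"' in comment_html,  # assumed markup
    }


# Hand-written sample comment (not real data).
sample_html = (
    '<a href="#p105076684" class="quotelink">&gt;&gt;105076684</a><br>'
    '<span class="quote">&gt;still using tabs</span><br>'
    'see <a href="/g/thread/123456789" class="quotelink">&gt;&gt;&gt;/g/123456789</a> '
    "and https://example.com/docs."
)
result = analyze_comment(sample_html)
```

The negative lookbehind in `QUOTE` keeps board references from being counted as quoted post IDs.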
Best Practices
- Use scrapeReplies: false when you want faster, lighter discovery runs across many boards.
- Use scrapeReplies: true when you need full thread content.
- Use threadUrls when you already know specific threads; this skips catalog discovery and finishes faster.
- Use maxRepliesPerThread when you only need the latest replies and want to reduce dataset volume.
- Start with a smaller maxThreadsPerBoard if you are exploring new board mixes. The default value of 10 is chosen to keep quick validation and test runs lightweight.
- Split very wide crawls across multiple runs if you are scraping many boards at once.
- Keep normal runs at 128 MB. Launch larger reply-heavy runs at 256 MB, especially when scraping 50 or more thread detail pages or storing uncapped/high reply counts.
Large Scraping Guidance
This actor has been tested on larger multi-board runs and works well for long-running scrapes. For the best production experience:
- Use separate runs for very broad board coverage instead of putting every board into one run.
- Keep reply scraping enabled only when you need full thread bodies.
- Use leaner discovery runs first, then follow up with deeper runs on boards or threads that matter most.
- For reply-heavy board runs, allow roughly one second per fetched thread plus startup and catalog overhead. The actor logs a recommended timeout at startup and warns if the current run timeout is likely too low.
- The actor logs a recommended memory setting at startup. Use 256 MB for large reply-heavy runs and keep 128 MB for small discovery runs to minimize compute cost.
- The default cloud timeout is set high enough for large normal runs, but API users can still override it per run if they intentionally want a stricter cap.
In practice, splitting large board lists across scheduled runs is the safest approach for high-volume scraping.
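The timeout and memory guidance above reduces to simple arithmetic. The sketch below uses the figures stated on this page (about one second per fetched thread, 256 MB from 50 thread detail pages upward, 128 MB otherwise). The 60-second startup and catalog overhead is an assumed placeholder, and `estimate_run` is a hypothetical helper; the actor logs its own recommendation at startup, which should take precedence.

```python
SECONDS_PER_THREAD = 1  # rough figure from the guidance above


def estimate_run(thread_fetches: int, overhead_seconds: int = 60) -> dict[str, int]:
    """Estimate a run timeout and memory setting for a reply-heavy run."""
    return {
        "timeoutSeconds": overhead_seconds + thread_fetches * SECONDS_PER_THREAD,
        "memoryMb": 256 if thread_fetches >= 50 else 128,
    }


# Three boards at 20 threads each means 60 thread detail pages.
plan = estimate_run(thread_fetches=3 * 20)
```

For this example the estimate is 120 seconds and 256 MB; a 10-thread validation run would stay at 128 MB.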
Notes
- Invalid or unavailable boards are skipped.
- Threads that disappear before they are fetched are skipped.
- Results are pushed to the Apify dataset as each catalog batch or thread is processed; the actor does not wait until the end of the run to write all rows.
- Very large threads may be split into multiple dataset items to stay within dataset size limits.
- The actor only returns data available through the public 4chan JSON API at the time of scraping.
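Because rows are written incrementally and a large thread may span several dataset items, downstream code usually regroups rows by thread before analysis. The following sketch assumes each item carries the top-level `board`, `threadId`, and `post` fields listed in the Output section; the helper name `group_by_thread` and the `text` key inside the sample posts are hypothetical.

```python
from collections import defaultdict
from typing import Any, Iterable


def group_by_thread(items: Iterable[dict[str, Any]]) -> dict[tuple[str, int], list[dict]]:
    """Collect post records per (board, threadId), preserving arrival order."""
    threads: dict[tuple[str, int], list[dict]] = defaultdict(list)
    for item in items:
        threads[(item["board"], item["threadId"])].append(item["post"])
    return dict(threads)


# Hand-written sample rows (not real data).
rows = [
    {"board": "g", "threadId": 101, "post": {"text": "original post"}},
    {"board": "biz", "threadId": 7, "post": {"text": "other thread"}},
    {"board": "g", "threadId": 101, "post": {"text": "first reply"}},
]
grouped = group_by_thread(rows)
```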