4Chan Thread & Board Scraper

Pricing

from $0.35 / 1,000 results

Scrape threads from one or more 4chan boards using the official 4chan JSON API. Collect structured thread data, original posts, optional replies, attachments, extracted links, participant summaries, and thread-level metadata for research, monitoring, archiving, and downstream analysis.

Developer

Inus Grobler

Maintained by Community

Actor stats

Bookmarked: 1
Total users: 5
Monthly active users: 1
Last modified: 5 days ago

4chan JSON API Scraper

At a glance: this actor scrapes public 4chan board and thread data through the official JSON API. Inputs are board names or direct thread URLs. Output rows cover threads, posts, attachments, and replies. Typical use cases are research, monitoring, and archiving. Limitations, troubleshooting, and pricing/cost notes are covered below.

Scrape threads from one or more 4chan boards through the official 4chan JSON API.

This actor is useful when you want structured thread data for research, monitoring, archiving, enrichment, or downstream analysis. You choose which boards to scrape, how many threads to collect from each board, and whether to include replies or just the original post.

What The Actor Does

  • Fetches the current catalog for each selected board
  • Collects up to your chosen number of threads per board
  • Stores thread metadata, original post data, attachment details, and optional replies
  • Enriches output with catalog context, participation summaries, and extracted links
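The discovery step above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actor's actual code: it assumes the public catalog endpoint `https://a.4cdn.org/{board}/catalog.json`, whose payload is a list of pages that each carry a `threads` list, and the helper names `catalog_url` and `pick_threads` are hypothetical.

```python
from typing import Any

# Public 4chan JSON API catalog endpoint (assumed here; the actor's internals are not documented).
CATALOG_URL = "https://a.4cdn.org/{board}/catalog.json"


def catalog_url(board: str) -> str:
    """Build the catalog URL for a board name, tolerating stray slashes."""
    return CATALOG_URL.format(board=board.strip("/"))


def pick_threads(catalog: list[dict[str, Any]], max_threads: int) -> list[dict[str, Any]]:
    """Flatten catalog pages and keep at most max_threads threads, in catalog order."""
    picked: list[dict[str, Any]] = []
    for page in catalog:
        for thread in page.get("threads", []):
            if len(picked) >= max_threads:
                return picked
            # "no" is the thread number in the 4chan API payload.
            picked.append({"threadId": thread["no"], "page": page.get("page")})
    return picked


# Tiny hand-written payload standing in for a real catalog response.
sample_catalog = [
    {"page": 1, "threads": [{"no": 101}, {"no": 102}]},
    {"page": 2, "threads": [{"no": 103}]},
]
selected = pick_threads(sample_catalog, max_threads=2)
```

Keeping the selection logic separate from the HTTP call makes the per-board cap easy to check against a saved catalog payload.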

Input

boards

List of boards to scrape, without slashes.

Example:

["g", "biz", "tv"]

Default: ["g"]

maxThreadsPerBoard

Maximum number of threads to collect from each selected board.

Example:

20

Default: 10

threadUrls

Optional direct thread URLs to scrape. Use this when you already know the threads you need and want to skip board catalog discovery.

Example:

["https://boards.4chan.org/g/thread/105076684"]
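As a rough sketch of how a direct thread URL maps onto the JSON API, the snippet below extracts the board and thread ID and builds the corresponding thread endpoint. The endpoint pattern `https://a.4cdn.org/{board}/thread/{id}.json` is the public 4chan API format; the function names are hypothetical and the actor may parse URLs differently.

```python
import re

# Matches URLs such as https://boards.4chan.org/g/thread/105076684 (an optional slug may follow).
THREAD_URL_RE = re.compile(r"^https?://boards\.4chan\.org/([a-z0-9]+)/thread/(\d+)")


def parse_thread_url(url: str) -> tuple[str, int]:
    """Return (board, thread_id) for a thread URL, or raise ValueError."""
    match = THREAD_URL_RE.match(url.strip())
    if match is None:
        raise ValueError(f"Not a 4chan thread URL: {url}")
    return match.group(1), int(match.group(2))


def thread_api_url(board: str, thread_id: int) -> str:
    """Build the JSON API URL for one thread."""
    return f"https://a.4cdn.org/{board}/thread/{thread_id}.json"


board, thread_id = parse_thread_url("https://boards.4chan.org/g/thread/105076684")
api_url = thread_api_url(board, thread_id)
```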

maxRepliesPerThread

Optional cap on stored replies for each thread when scrapeReplies is enabled. The newest replies are kept. Leave empty to store all available replies.

Example:

25
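The "newest replies are kept" behaviour can be pictured with a small sketch. It assumes the thread's `posts` array lists the original post first and replies oldest-first, as the 4chan thread JSON does; `cap_replies` is a hypothetical helper, not a function exposed by the actor.

```python
from typing import Any, Optional


def cap_replies(posts: list[dict[str, Any]], max_replies: Optional[int]) -> list[dict[str, Any]]:
    """Keep the original post plus the newest max_replies replies.

    posts[0] is treated as the original post; None means no cap.
    """
    if not posts:
        return []
    op, replies = posts[0], posts[1:]
    if max_replies is None:
        return [op, *replies]
    # A slice from the end keeps the most recent replies.
    kept = replies[-max_replies:] if max_replies > 0 else []
    return [op, *kept]


thread_posts = [{"no": 1}, {"no": 2}, {"no": 3}, {"no": 4}]
capped = cap_replies(thread_posts, max_replies=2)
```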

scrapeReplies

  • false: store only the original post, while still keeping thread-level counts and summary metadata
  • true: store the original post and all available replies in each thread, or only the newest replies when maxRepliesPerThread is set

Default: false

proxyConfiguration

Optional proxy settings for the requests. By default, the actor uses direct requests because the public 4chan JSON API is usually faster without a proxy. Enable Apify Proxy or provide custom proxy URLs only when your environment needs it.
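Putting the input fields together, a run could be started from Python along these lines. This is a sketch under assumptions: it uses the `apify-client` package (`ApifyClient`, `actor(...).call(run_input=...)`, `dataset(...).iterate_items()`), the actor ID passed to `run_actor` is a placeholder you must replace, and `build_input` and `run_actor` are hypothetical helper names.

```python
from typing import Any, Optional


def build_input(
    boards: list[str],
    max_threads_per_board: int = 10,
    scrape_replies: bool = False,
    max_replies_per_thread: Optional[int] = None,
    thread_urls: Optional[list[str]] = None,
) -> dict[str, Any]:
    """Assemble an input object using the field names documented above."""
    run_input: dict[str, Any] = {
        "boards": [b.strip("/") for b in boards],  # board names without slashes
        "maxThreadsPerBoard": max_threads_per_board,
        "scrapeReplies": scrape_replies,
    }
    # Optional fields are left out entirely when not set.
    if max_replies_per_thread is not None:
        run_input["maxRepliesPerThread"] = max_replies_per_thread
    if thread_urls:
        run_input["threadUrls"] = thread_urls
    return run_input


def run_actor(token: str, actor_id: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start a run and collect its dataset items (requires `pip install apify-client`)."""
    from apify_client import ApifyClient  # imported lazily so the sketch loads without the package

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


example_input = build_input(["/g/", "biz"], scrape_replies=True, max_replies_per_thread=25)
```

Calling `run_actor(token, "<username>/<actor-name>", example_input)` with a real token and the actor's actual ID would then return the scraped rows.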

Output

Each dataset item represents one scraped post row with repeated thread-level metadata.

Top-level fields include:

  • board
  • threadId
  • threadUrl
  • apiUrl
  • scrapedAt
  • subject
  • semanticUrl
  • replyCount
  • imageCount
  • isSticky
  • isClosed
  • isArchived
  • archivedOn
  • catalog
  • stats
  • participants
  • links
  • post
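Because each row is a single post with the thread-level fields repeated, downstream code often regroups rows by thread. A minimal sketch, assuming only the top-level `board`, `threadId`, and `post` fields listed above (the contents of `post` in the sample are made up for illustration):

```python
from collections import defaultdict
from typing import Any


def group_by_thread(items: list[dict[str, Any]]) -> dict[tuple[str, int], list[dict[str, Any]]]:
    """Collect post records under their (board, threadId) key, preserving row order."""
    threads: dict[tuple[str, int], list[dict[str, Any]]] = defaultdict(list)
    for item in items:
        threads[(item["board"], item["threadId"])].append(item["post"])
    return dict(threads)


# Hand-written rows standing in for dataset items.
rows = [
    {"board": "g", "threadId": 100, "post": {"text": "original post"}},
    {"board": "g", "threadId": 100, "post": {"text": "first reply"}},
    {"board": "biz", "threadId": 200, "post": {"text": "another thread"}},
]
grouped = group_by_thread(rows)
```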

catalog

Board catalog context for the thread, including:

  • catalog page number
  • last modified timestamp
  • omitted reply count
  • omitted image count
  • recent reply post IDs when available

stats

Thread-level summary fields, including:

  • total posts and replies in the thread
  • how many posts and replies are stored in the dataset item
  • attachment totals
  • quote counts
  • external link counts
  • board reference counts
  • simple content flags such as code and greentext counts

participants

Participant summaries, including:

  • unique poster IDs when present
  • countries represented in the thread when present

links

Extracted link-related fields, including:

  • external links
  • external domains
  • quoted post IDs
  • board references such as >>>/g/123456789
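The kinds of references listed above can be pulled out of cleaned comment text with a few regular expressions. The sketch below is one plausible approach, not the actor's extraction logic, and the output key names (`externalLinks`, `externalDomains`, `quotedPostIds`, `boardReferences`) are hypothetical.

```python
import re
from typing import Any
from urllib.parse import urlparse

BOARD_REF_RE = re.compile(r">>>/[a-z0-9]+/\d+")   # cross-board reference, e.g. >>>/g/123456789
QUOTE_RE = re.compile(r"(?<!>)>>(\d+)")           # same-thread quote, e.g. >>105076684
URL_RE = re.compile(r"https?://[^\s<>\"]+")       # simple external URL matcher


def extract_links(text: str) -> dict[str, Any]:
    """Extract quotes, board references, and external links from cleaned comment text."""
    urls = URL_RE.findall(text)
    return {
        "externalLinks": urls,
        "externalDomains": sorted({urlparse(u).netloc for u in urls}),
        "quotedPostIds": [int(n) for n in QUOTE_RE.findall(text)],
        "boardReferences": BOARD_REF_RE.findall(text),
    }


comment = ">>105076684\nsee >>>/g/123456789 and https://example.com/a?b=1\n>not reading the docs"
links = extract_links(comment)
```

The negative lookbehind in `QUOTE_RE` keeps cross-board references from being counted as same-thread quotes.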

post

Each post record can include:

  • author and subject
  • timestamp and formatted posting date
  • comment HTML and cleaned comment text
  • quote targets
  • board references
  • external links
  • attachment metadata
  • content flags such as containsCode and containsGreentext
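To illustrate how cleaned text and content flags might be derived from comment HTML, here is a hedged sketch. It assumes 4chan's usual markup (`<br>` line breaks, `<pre>` for code blocks, HTML-escaped `>` characters); the actor's real cleaning rules are not documented here, and `clean_comment` and `content_flags` are hypothetical names.

```python
import html
import re


def clean_comment(comment_html: str) -> str:
    """Turn comment HTML into plain text: line breaks, tag stripping, entity decoding."""
    text = re.sub(r"<br\s*/?>", "\n", comment_html)
    text = re.sub(r"<[^>]+>", "", text)      # drop remaining tags before decoding entities
    return html.unescape(text).strip()


def content_flags(comment_html: str) -> dict[str, bool]:
    """Derive simple flags comparable to containsCode and containsGreentext."""
    text = clean_comment(comment_html)
    # Greentext: a line starting with a single ">" (">>" is a post quote instead).
    greentext = any(
        line.startswith(">") and not line.startswith(">>") for line in text.splitlines()
    )
    return {"containsCode": "<pre" in comment_html, "containsGreentext": greentext}


sample_html = (
    '<a href="#p1" class="quotelink">&gt;&gt;1</a><br>'
    '<span class="quote">&gt;implying</span><br>'
    '<pre class="prettyprint">print(1)</pre>'
)
cleaned = clean_comment(sample_html)
flags = content_flags(sample_html)
```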

Best Practices

  • Use scrapeReplies: false when you want faster, lighter discovery runs across many boards.
  • Use scrapeReplies: true when you need full thread content.
  • Use threadUrls when you already know specific threads; this skips catalog discovery and finishes faster.
  • Use maxRepliesPerThread when you only need the latest replies and want to reduce dataset volume.
  • Start with a smaller maxThreadsPerBoard if you are exploring new board mixes. The default value of 10 is chosen to keep quick validation and test runs lightweight.
  • Split very wide crawls across multiple runs if you are scraping many boards at once.
  • Keep normal runs at 128 MB. Launch larger reply-heavy runs at 256 MB, especially when scraping 50 or more thread detail pages or storing uncapped/high reply counts.
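The memory guidance above can be turned into a small pre-flight check. This is one reading of the rule, not the actor's own logic: it treats a run as reply-heavy when replies are enabled and roughly 50 or more thread detail pages will be fetched. The function name and the way the thread count is estimated are assumptions.

```python
def recommended_memory_mb(board_count: int, max_threads_per_board: int, scrape_replies: bool) -> int:
    """Suggest 256 MB for large reply-heavy runs and 128 MB otherwise."""
    thread_pages = board_count * max_threads_per_board  # upper bound on thread detail fetches
    reply_heavy = scrape_replies and thread_pages >= 50
    return 256 if reply_heavy else 128


small_run = recommended_memory_mb(board_count=1, max_threads_per_board=10, scrape_replies=False)
large_run = recommended_memory_mb(board_count=5, max_threads_per_board=20, scrape_replies=True)
```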

Large Scraping Guidance

This actor has been tested on larger multi-board runs and works well for long-running scrapes. For the best production experience:

  • Use separate runs for very broad board coverage instead of putting every board into one run.
  • Keep reply scraping enabled only when you need full thread bodies.
  • Use leaner discovery runs first, then follow up with deeper runs on boards or threads that matter most.
  • For reply-heavy board runs, allow roughly one second per fetched thread plus startup and catalog overhead. The actor logs a recommended timeout at startup and warns if the current run timeout is likely too low.
  • The actor logs a recommended memory setting at startup. Use 256 MB for large reply-heavy runs and keep 128 MB for small discovery runs to minimize compute cost.
  • The default cloud timeout is set high enough for large normal runs, but API users can still override it per run if they intentionally want a stricter cap.
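A back-of-the-envelope timeout estimate based on the "roughly one second per fetched thread" guidance might look like the sketch below. The startup overhead, per-board catalog cost, and safety factor are illustrative assumptions only; the value the actor logs at startup should take precedence.

```python
import math


def estimate_timeout_secs(
    board_count: int,
    max_threads_per_board: int,
    seconds_per_thread: float = 1.0,      # from the guidance above
    startup_secs: float = 30.0,           # assumed startup overhead
    catalog_secs_per_board: float = 2.0,  # assumed catalog fetch cost per board
    safety_factor: float = 1.5,           # assumed headroom
) -> int:
    """Estimate a run timeout in seconds for a reply-heavy board run."""
    fetch_secs = board_count * max_threads_per_board * seconds_per_thread
    overhead_secs = startup_secs + board_count * catalog_secs_per_board
    return math.ceil((fetch_secs + overhead_secs) * safety_factor)


timeout_secs = estimate_timeout_secs(board_count=5, max_threads_per_board=100)
```

With these assumed defaults, 5 boards at 100 threads each gives 500 s of fetching plus 40 s of overhead, scaled by 1.5.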

In practice, splitting large board lists across scheduled runs is the safest approach for high-volume scraping.
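Splitting a wide board list into per-run inputs is simple to script. A minimal sketch, assuming the `boards`, `maxThreadsPerBoard`, and `scrapeReplies` input fields described earlier; the batch size of three boards per run is an arbitrary illustration and `split_boards` is a hypothetical helper.

```python
from typing import Any


def split_boards(boards: list[str], boards_per_run: int) -> list[list[str]]:
    """Chunk a board list into consecutive batches of at most boards_per_run boards."""
    if boards_per_run < 1:
        raise ValueError("boards_per_run must be at least 1")
    return [boards[i:i + boards_per_run] for i in range(0, len(boards), boards_per_run)]


all_boards = ["g", "biz", "tv", "sci", "diy", "lit", "his"]
run_inputs: list[dict[str, Any]] = [
    {"boards": batch, "maxThreadsPerBoard": 10, "scrapeReplies": False}
    for batch in split_boards(all_boards, boards_per_run=3)
]
```

Each entry in `run_inputs` can then be submitted as its own run or attached to its own schedule.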

Notes

  • Invalid or unavailable boards are skipped.
  • Threads that disappear before they are fetched are skipped.
  • Results are pushed to the Apify dataset as each catalog batch or thread is processed; the actor does not wait until the end of the run to write all rows.
  • Very large threads may be split into multiple dataset items to stay within dataset size limits.
  • The actor only returns data available through the public 4chan JSON API at the time of scraping.