Lemmy Scraper — Posts, Comments & Community Data
Pricing
Pay per event
Scrape posts and comments from any public Lemmy community on any Fediverse instance. Fingerprint rotation, retries, and proxy fallback handled for you. Typed dataset rows, ready for SQL, CSV, or JSON.
Developer
DevilScrapes
Maintained by Community
We do the dirty work so your dataset stays clean. 😈
$2.05 / 1,000 posts — Export posts and optional comments from any public Lemmy community on any Lemmy instance. Pay only for results that land in your dataset. No credit card needed to try.
This fediverse scraper targets the ActivityPub-based Lemmy network, the federated Reddit alternative running on ~1,000 public instances (lemmy.world, lemmy.ml, beehaw.org, sh.itjust.works, and hundreds more). Every Lemmy community export comes out as a flat, denormalised dataset with community metadata, post metadata, and comment-tree path on every row, so it is self-contained for SQL, BI, or CSV analytics with zero joins.
🎯 What this scrapes
This Actor emits two row types into a single dataset, discriminated by the row_type column:
- Post rows (`row_type="post"`): one row per post in the target community. Always emitted.
- Comment rows (`row_type="comment"`): one row per comment on each post. Emitted only when `includeComments` is enabled.
Every row carries the community context (name, title, subscriber count, posts count, ActivityPub actor_id) and the post context (id, title, score, upvotes, downvotes, comment count) so downstream tools see a fully denormalised, join-free table. Comment rows additionally carry the comment id, content, path, score, and a derived comment_parent_id field that preserves the parent-child relationship for tree reconstruction.
| Field | Type | Description |
|---|---|---|
| `row_type` | string | `post` or `comment` |
| `instance_url` | string | Lemmy instance base URL |
| `community_actor_id` | string | ActivityPub actor_id of the community |
| `community_name` | string | Local community name (e.g. `asklemmy`) |
| `community_title` | string | Human-readable community title |
| `community_description` | string \| null | Community description, if set |
| `community_subscribers` | integer | Subscriber count |
| `community_posts_count` | integer | Total posts in the community |
| `community_local` | boolean | True if the community is local to the instance |
| `post_id` | integer | Lemmy local post id |
| `post_ap_id` | string | ActivityPub canonical URL of the post |
| `post_url` | string | Canonical post URL (same as `post_ap_id`) |
| `post_title` | string | Post title |
| `post_body` | string \| null | Post body text; null on link posts |
| `post_external_url` | string \| null | External link URL; null on text posts |
| `post_score` | integer | Net post score (upvotes - downvotes) |
| `post_upvotes` | integer | Number of upvotes on the post |
| `post_downvotes` | integer | Number of downvotes on the post |
| `post_comments_count` | integer | Total comments on the post |
| `post_published` | string | ISO 8601 UTC datetime |
| `post_updated` | string \| null | ISO 8601 UTC datetime of last edit |
| `comment_id` | integer \| null | Lemmy local comment id (comment rows only) |
| `comment_path` | string \| null | Lemmy thread path (e.g. `0.12345.67890`) |
| `comment_content` | string \| null | Comment body text |
| `comment_score` | integer \| null | Net comment score |
| `comment_published` | string \| null | ISO 8601 UTC datetime |
| `comment_parent_id` | integer \| null | Parent comment id derived from path |
| `author_name` | string | Author username (local form) |
| `author_display_name` | string \| null | Author display name, if set |
| `scraped_at` | string | ISO 8601 UTC datetime this row was written |
🔥 Features
What we handle for you so you don't have to:
- 🛡️ We rotate browser fingerprints: `curl-cffi` impersonates real browser TLS (Chrome / Firefox / Safari) so the target sees real-browser handshakes, not Python.
- 🔁 We retry with exponential backoff on `408 / 429 / 503` responses and honour `Retry-After`. Up to 5 attempts per page before surfacing a clear error.
- 🌐 We rotate through Apify Proxy on connection failures (fresh session ID, fresh exit IP) so transient blocks don't abort your run.
- 🧱 We pace requests per instance to avoid triggering rate limits; partial successes surface a clear `set_status_message`, and we never silently return an empty dataset.
- 🧊 We keep the dataset clean: Pydantic v2 validated rows, ISO 8601 timestamps, stable IDs, and a `comment_parent_id` field derived from Lemmy's path encoding so you can reconstruct the full thread tree without an extra API call.
- 💰 You pay only for results that land. No data → no charge beyond the small actor-start warm-up fee.
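The retry behaviour described above can be sketched roughly as follows. This is an illustration only, not the Actor's actual code: the `fetch` callable, the 1-second base delay, and the 30-second cap are assumptions made for the example; only the status codes, the `Retry-After` handling, and the 5-attempt limit come from the feature list.

```python
import time
from typing import Callable, Optional, Tuple

RETRYABLE_STATUSES = {408, 429, 503}  # statuses named in the feature list
MAX_ATTEMPTS = 5                      # "up to 5 attempts per page"


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based).

    A server-supplied Retry-After value wins; otherwise the delay doubles
    on each attempt (base and cap are assumed values).
    """
    if retry_after is not None:
        return max(retry_after, 0.0)
    return min(base * 2 ** (attempt - 1), cap)


def fetch_with_retries(fetch: Callable[[], Tuple[int, Optional[float], str]],
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Call `fetch` until it returns a non-retryable status or attempts run out.

    `fetch` is a hypothetical callable returning
    (status_code, retry_after_seconds_or_None, body).
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, retry_after, body = fetch()
        if status not in RETRYABLE_STATUSES:
            return body
        if attempt < MAX_ATTEMPTS:
            sleep(backoff_delay(attempt, retry_after))
    raise RuntimeError(f"gave up after {MAX_ATTEMPTS} attempts (last status {status})")
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.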
Additional capabilities:
- Supports any Lemmy instance: `lemmy.world`, `lemmy.ml`, `beehaw.org`, `sh.itjust.works`, and any public instance running Lemmy v0.19.
- Two operating modes from one input: posts only, or posts + comments in the same dataset.
- Federated community syntax: pass `memes@lemmy.world` to scrape a remote community from any other instance, or `asklemmy` for a local community on the chosen instance.
- 16-token sort enum: `Active`, `Hot`, `New`, `Old`, `Scaled`, `Controversial`, `MostComments`, `NewComments`, `TopHour`, `TopSixHour`, `TopTwelveHour`, `TopDay`, `TopWeek`, `TopMonth`, `TopYear`, `TopAll`.
- Cursor-based post pagination + integer-page comment pagination, both verified against Lemmy v0.19 on live instances.
- Denormalised output: community metadata on every row, no joins needed for downstream analytics or CSV exports.
- Pydantic v2 input validation with named sort enum and range bounds; bare `Top` (invalid on v0.19) is rejected up-front before any network call.
- Pairs with `bluesky-feed-posts` and `bluesky-starter-pack` as the Federated Social Suite.
💡 Use cases
- Reddit-alternative migration research: track how communities and engagement migrate from Reddit to Lemmy after policy changes; compare subscriber and post growth across instances.
- Newsroom monitoring: subscribe to journalism, politics, or breaking-news Lemmy communities and pipe the latest top posts to Slack or Google Sheets via Apify integrations.
- Brand monitoring on the fediverse: Lemmy is a growing channel for product complaints, support discussions, and competitor mentions outside the Reddit walled garden; this Actor surfaces them on a schedule.
- Academic federated-social research: Lemmy's public REST API makes it significantly more accessible for longitudinal community studies, sentiment tracking, or content-moderation research than platforms that gate their data.
- Community trend analysis: pull `TopWeek` posts across multiple communities and rank by score, comment count, or upvote ratio to benchmark community health.
- Comment-tree reconstruction: combine post rows with comment rows (joined by `post_id`) and the `comment_parent_id` field to rebuild the full discussion tree for NLP or moderation pipelines.
- NLP corpus building: Lemmy provides Reddit-shaped threaded conversation data useful for sentiment training, RAG pipelines, and discourse modelling without platform-specific OAuth hoops.
⚙️ How to use it
- Open the Actor input form.
- Set Lemmy instance URL to the base URL of any public Lemmy instance, e.g. `https://lemmy.world` or `https://lemmy.ml`. Trailing slash is stripped automatically.
- Set Community name to the community you want to scrape: either local form (`asklemmy`) for a community on the chosen instance, or federated form (`memes@lemmy.world`) for a remote community visible from the chosen instance.
- Pick a Post sort order from the 16 valid tokens. `Hot` (default) blends recency and engagement; `TopWeek` returns the highest-scoring posts of the last 7 days; `New` returns posts newest first.
- Adjust Max posts (default 100, max 5,000).
- Toggle Include comments if you want comment rows alongside post rows. When enabled, set Max comments per post (default 50, max 500).
- Configure Apify Proxy — we recommend leaving it at the default auto setting; the Actor activates proxy rotation automatically when it encounters rate-limit or connection errors.
- Click Start. Results stream into the default dataset and can be exported as JSON, CSV, Excel, or XML via the Export button.
Single-community example
    {
      "instanceUrl": "https://lemmy.world",
      "communityName": "asklemmy",
      "sort": "TopWeek",
      "maxPosts": 200,
      "includeComments": false
    }
Posts + comments example
    {
      "instanceUrl": "https://lemmy.ml",
      "communityName": "memes@lemmy.world",
      "sort": "Hot",
      "maxPosts": 50,
      "includeComments": true,
      "maxCommentsPerPost": 25
    }
📥 Input
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `instanceUrl` | string | yes | none | Base URL of the Lemmy instance (e.g. `https://lemmy.world`) |
| `communityName` | string | yes | none | Community name: local form (`asklemmy`) or federated form (`memes@lemmy.world`) |
| `sort` | string (enum) | no | `Hot` | One of the 16 valid Lemmy v0.19 sort tokens listed below |
| `maxPosts` | integer | no | 100 | Max post rows emitted (1–5000) |
| `includeComments` | boolean | no | false | If true, also fetch comments per post |
| `maxCommentsPerPost` | integer | no | 50 | Max comment rows per post (1–500) |
| `proxyConfiguration` | object | no | auto | Apify Proxy configuration; default auto activates on block |
The full list of valid sort tokens is: Active, Hot, New, Old, Scaled, Controversial, MostComments, NewComments, TopHour, TopSixHour, TopTwelveHour, TopDay, TopWeek, TopMonth, TopYear, TopAll. Bare Top is not valid on Lemmy v0.19 and is rejected by the input validator before any network call.
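The input rules in the table can be sketched with the standard library alone. The Actor itself is described as using Pydantic v2, so the `validate_input` helper below is a hypothetical stand-in that only mirrors the documented defaults, ranges, and sort tokens.

```python
VALID_SORTS = frozenset({
    "Active", "Hot", "New", "Old", "Scaled", "Controversial",
    "MostComments", "NewComments", "TopHour", "TopSixHour",
    "TopTwelveHour", "TopDay", "TopWeek", "TopMonth", "TopYear", "TopAll",
})


def _bounded(raw: dict, key: str, default: int, low: int, high: int) -> int:
    """Read an integer field and enforce the documented range."""
    value = int(raw.get(key, default))
    if not low <= value <= high:
        raise ValueError(f"{key} must be between {low} and {high}, got {value}")
    return value


def validate_input(raw: dict) -> dict:
    """Apply documented defaults and reject invalid values before any request."""
    sort = raw.get("sort", "Hot")
    if sort not in VALID_SORTS:  # bare "Top" fails here
        raise ValueError(f"invalid sort token: {sort!r}")
    return {
        "instanceUrl": str(raw["instanceUrl"]).rstrip("/"),  # strip trailing slash
        "communityName": str(raw["communityName"]),
        "sort": sort,
        "maxPosts": _bounded(raw, "maxPosts", 100, 1, 5000),
        "includeComments": bool(raw.get("includeComments", False)),
        "maxCommentsPerPost": _bounded(raw, "maxCommentsPerPost", 50, 1, 500),
    }
```

Validating before the first request means a typo in `sort` fails immediately instead of after a partial run.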
📤 Output
One row per post or per comment. Community + post metadata is denormalised onto every comment row so a flat CSV is self-contained.
    {
      "row_type": "post",
      "instance_url": "https://lemmy.world",
      "community_actor_id": "https://lemmy.world/c/asklemmy",
      "community_name": "asklemmy",
      "community_title": "Ask Lemmy",
      "community_description": "A Lemmy equivalent of Ask Reddit.",
      "community_subscribers": 39567,
      "community_posts_count": 8853,
      "community_local": true,
      "post_id": 46934299,
      "post_ap_id": "https://leminal.space/post/35446889",
      "post_url": "https://leminal.space/post/35446889",
      "post_title": "What's something that you feel genuinely sad about?",
      "post_body": null,
      "post_external_url": null,
      "post_score": 34,
      "post_upvotes": 38,
      "post_downvotes": 4,
      "post_comments_count": 18,
      "post_published": "2026-05-16T09:31:47.723504Z",
      "post_updated": null,
      "post_nsfw": false,
      "post_featured_community": false,
      "post_locked": false,
      "comment_id": null,
      "comment_ap_id": null,
      "comment_path": null,
      "comment_content": null,
      "comment_score": null,
      "comment_published": null,
      "comment_parent_id": null,
      "author_actor_id": "https://leminal.space/u/FosterMolasses",
      "author_name": "FosterMolasses",
      "author_display_name": null,
      "author_bot_account": false,
      "author_published": "2025-01-17T14:35:51.850105Z",
      "scraped_at": "2026-06-01T12:00:00.000Z"
    }
Comment rows have the same shape, with row_type="comment" and the comment fields populated. The comment_parent_id field is derived from the penultimate segment of comment.path — null for top-level comments, otherwise the integer id of the parent comment — so callers can reconstruct the full thread tree without an extra API call.
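As a minimal sketch of that derivation (the function name `parent_id_from_path` is made up for illustration, and the path layout is assumed to be exactly as described above, with a leading `0` root segment):

```python
from typing import Optional


def parent_id_from_path(path: str) -> Optional[int]:
    """Return the parent comment id encoded in a Lemmy comment path.

    "0.12345"       -> None  (top-level: the only ancestor is the root "0")
    "0.12345.67890" -> 12345 (penultimate segment is the parent)
    """
    segments = path.split(".")
    if len(segments) < 3:
        return None
    return int(segments[-2])
```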
Export formats
- JSON: full fidelity, all fields, newline-delimited
- CSV: flat, one row per post or comment, all columns including denormalised community + post metadata
- Excel: `.xlsx` via the Apify dataset converter
- XML: structured per-item
All formats are available via the Apify API: GET /datasets/{id}/items?format=csv&clean=true.
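A small helper for building that export URL might look like the sketch below. The base URL `https://api.apify.com/v2` is the public Apify API host and is an addition not stated in this README; the helper name and the restriction to the four formats listed above are choices made for the example. Authentication for non-public datasets is deliberately left out.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"            # assumed Apify API base URL
EXPORT_FORMATS = {"json", "csv", "xlsx", "xml"}  # formats named in this README


def dataset_items_url(dataset_id: str, fmt: str = "csv", clean: bool = True) -> str:
    """Build the dataset export URL for the given dataset id and format."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unsupported export format: {fmt!r}")
    params = {"format": fmt, "clean": "true" if clean else "false"}
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"
```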
💰 Pricing
Pay-Per-Event (PPE) — you pay only for what you use:
| Event | Price (USD) | When |
|---|---|---|
| `actor-start` | $0.05 | Once per run, at boot |
| `result-row` | $0.002 | Per post row written to the dataset |
| `result-row-comment` | $0.001 | Per comment row written to the dataset |
Example costs
| Run | Cost |
|---|---|
| 100 posts (no comments) | $0.25 |
| 500 posts (no comments) | $1.05 |
| 1,000 posts (no comments) | $2.05 |
| 100 posts × 50 comments = 5,100 rows | $5.25 |
| 500 posts × 25 comments = 13,000 rows | $13.55 |
Comment rows are half the price of post rows because they are higher-volume and have lower per-row commercial value for most analytics use cases. Disable comment fetching when you only need post-level data — the Actor runs faster and costs less.
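To budget a run before starting it, the three event prices can be combined as in this sketch. The `estimate_cost` helper is hypothetical, and it assumes every post yields exactly `comments_per_post` comment rows, so it gives an upper bound when some posts have fewer comments.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.05")   # charged once per run
POST_ROW = Decimal("0.002")     # per post row
COMMENT_ROW = Decimal("0.001")  # per comment row


def estimate_cost(posts: int, comments_per_post: int = 0) -> Decimal:
    """Estimated run cost in USD, rounded to cents."""
    comment_rows = posts * comments_per_post
    total = ACTOR_START + posts * POST_ROW + comment_rows * COMMENT_ROW
    return total.quantize(Decimal("0.01"))


print(estimate_cost(1000))     # 2.05
print(estimate_cost(500, 25))  # 13.55
```

`Decimal` is used instead of floats so the sub-cent prices add up exactly.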
🚧 Limitations
- Public communities only. Private or access-restricted communities require authentication and are out of scope.
- One community per run. This Actor scrapes one community per run; multi-community or instance-wide listings need separate runs or a different Actor.
- No media download. Image, video, and external link URLs are captured in the row, but the Actor does not download the media content itself.
- No real-time streaming. The Actor takes a snapshot at run time; for live updates schedule recurring runs via Apify Schedules.
- 7-day default storage retention on the Apify FREE tier. Export your dataset immediately after the run or upgrade for longer retention.
- Lemmy v0.19 only. The Actor targets the v0.19 API surface and field paths. Pydantic `extra="ignore"` absorbs additive changes, but breaking removals require a version update.
- Comment-tree depth is preserved via `comment_path`, but the Actor does not return comments in a pre-nested structure; callers reconstruct the tree from `comment_parent_id`.
- Federation deduplication is the caller's responsibility. A post on `c/news@lemmy.world` may appear on multiple instances; use `community_actor_id` to deduplicate across multi-instance runs.
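One possible way to do that deduplication is sketched below. It keys each row on `community_actor_id` plus the row's ActivityPub id, on the assumption that `post_ap_id` and `comment_ap_id` (the latter appears in the sample output row) are identical no matter which instance served the row. Local ids such as `post_id` are not used because they are assigned per instance.

```python
from typing import Iterable, List


def dedupe_rows(rows: Iterable[dict]) -> List[dict]:
    """Keep the first occurrence of each post or comment across merged runs."""
    seen = set()
    unique = []
    for row in rows:
        if row["row_type"] == "comment":
            ap_id = row["comment_ap_id"]  # assumed stable across instances
        else:
            ap_id = row["post_ap_id"]
        key = (row["community_actor_id"], row["row_type"], ap_id)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```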
❓ FAQ
Do I need a Lemmy account or API key?
Lemmy's public `/api/v3/` REST endpoint, the same client API that Lemmy's web UI and third-party apps use, accepts unauthenticated reads for public communities. This Actor uses that same public interface, so no login, signup, or API key is required on your end. We still handle rate-limit backoff and proxy rotation on our side so your run doesn't stall mid-export.
What is the federated community format?
Lemmy communities can live on any instance but be subscribed to and read from any other instance via ActivityPub federation. The federated form is community@instance.tld — for example memes@lemmy.world refers to the memes community hosted on lemmy.world, accessible from any other Lemmy instance that has fetched it via federation. Local communities (on the instance you point this Actor at) use the bare local name like asklemmy.
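Splitting the two forms apart is straightforward. The helper below is a hypothetical sketch, not the Actor's own parser:

```python
from typing import Optional, Tuple


def parse_community(name: str) -> Tuple[str, Optional[str]]:
    """Split a community reference into (local_name, host).

    host is None for the bare local form.
    """
    local, sep, host = name.strip().partition("@")
    if not local or (sep and not host):
        raise ValueError(f"malformed community name: {name!r}")
    return local, (host or None)


print(parse_community("memes@lemmy.world"))  # ('memes', 'lemmy.world')
print(parse_community("asklemmy"))           # ('asklemmy', None)
```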
Why does sort=Top not work?
Lemmy v0.19 removed the bare Top sort token and replaced it with compound tokens that embed a time range: TopHour, TopSixHour, TopTwelveHour, TopDay, TopWeek, TopMonth, TopYear, TopAll. Passing bare Top returns {"error": "unknown"} from the Lemmy API. This Actor rejects bare Top up-front during input validation — pick the compound token whose time range you want.
Which Lemmy instance should I point this at?
Any public Lemmy instance running v0.19. lemmy.world and lemmy.ml are the two largest general-purpose instances and good defaults. Topic-specific instances like beehaw.org, sh.itjust.works, or programming.dev are also fully supported. The instance URL determines which "view" of the federated network the Actor scrapes from — different instances may have fetched different remote communities, so re-running against a different instance can return slightly different remote-community data.
How do I reconstruct the comment tree from the dataset?
Each comment row carries comment_id and comment_parent_id. Group rows by post_id, then for each post: top-level comments have comment_parent_id IS NULL; replies have comment_parent_id pointing at the parent comment's comment_id within the same post. The comment_path field encodes the full ancestry (e.g. 0.12345.67890 means parent comment 12345, this comment 67890) for callers who need it.
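The grouping described above can be sketched like this. The node shape (`{"row": ..., "children": [...]}`) and the function name are choices made for the example. Comments whose parent is missing from the dataset, which can happen when `maxCommentsPerPost` cuts a thread short, are treated as top-level here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_comment_forest(rows: Iterable[dict]) -> Dict[int, List[dict]]:
    """Return {post_id: [top-level nodes]} with replies nested under "children"."""
    nodes = {}
    for row in rows:
        if row["row_type"] == "comment":
            nodes[(row["post_id"], row["comment_id"])] = {"row": row, "children": []}

    forest = defaultdict(list)
    for (post_id, _), node in nodes.items():
        parent_id = node["row"]["comment_parent_id"]
        parent = nodes.get((post_id, parent_id)) if parent_id is not None else None
        if parent is not None:
            parent["children"].append(node)
        else:
            forest[post_id].append(node)  # top-level, or parent not in dataset
    return dict(forest)
```

Keying nodes on `(post_id, comment_id)` keeps threads from different posts separate even if the rows arrive interleaved.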
Is this the right tool for building an NLP corpus from Lemmy data?
Yes. Lemmy provides Reddit-shaped threaded conversation data — post title, body, score, and a full comment tree with parent-child relationships — without requiring platform-specific OAuth credentials. Run the Actor against multiple communities across multiple instances, join on community_actor_id to deduplicate federation mirrors, and you have a clean conversation corpus ready for sentiment training, RAG indexing, or discourse analysis.
Is scraping public Lemmy data legal?
Lemmy is free and open-source software licensed under AGPL-3.0. Public Lemmy instances serve community data through an unauthenticated REST API. Always verify the specific instance's terms of service and your local jurisdiction's data-protection rules before using scraped data for commercial purposes.
Related Actors
Part of the Devil Scrapes Federated Social Suite:
- Bluesky Starter Pack Scraper — export full member lists from any public Bluesky Starter Pack via the AT Protocol public API.
- Bluesky Feed Posts Scraper — export posts from any public Bluesky custom feed or algorithm feed, with denormalised feed metadata on every row.
All three Actors share consistent pricing ($0.002 per post row, $0.05 per run) and field-naming conventions (<entity>_<field> snake_case) so cross-network federated-social analyses can join cleanly on author_handle / author_name.
💬 Your feedback
Found a bug, hit a rate limit, or need a new field on the output row? Open an issue on the Actor's Apify Store page or contact the Devil Scrapes team at apify.com/DevilScrapes. We ship updates within days of validated reports.