Lemmy Scraper — Posts, Comments & Community Data

Scrape posts and comments from any public Lemmy community on any Fediverse instance. Fingerprint rotation, retries, and proxy fallback handled for you. Typed dataset rows, ready for SQL, CSV, or JSON.

Developer: DevilScrapes (maintained by Community)

We do the dirty work so your dataset stays clean. 😈

$2.05 / 1,000 posts — Export posts and optional comments from any public Lemmy community on any Lemmy instance. Pay only for results that land in your dataset. No credit card needed to try.

This fediverse scraper targets the ActivityPub-based Lemmy network, the federated Reddit alternative running on ~1,000 public instances (lemmy.world, lemmy.ml, beehaw.org, sh.itjust.works, and hundreds more). Every Lemmy community export comes out as a flat, denormalised dataset with community metadata, post metadata, and the comment-tree path on every row, so it is self-contained for SQL, BI, or CSV analytics with zero joins.

🎯 What this scrapes

This Actor emits two row types into a single dataset, discriminated by the row_type column:

  1. Post rows (row_type="post") — one row per post in the target community. Always emitted.
  2. Comment rows (row_type="comment") — one row per comment on each post. Emitted only when includeComments is enabled.

Every row carries the community context (name, title, subscriber count, posts count, ActivityPub actor_id) and the post context (id, title, score, upvotes, downvotes, comment count) so downstream tools see a fully denormalised, join-free table. Comment rows additionally carry the comment id, content, path, score, and a derived comment_parent_id field that preserves the parent-child relationship for tree reconstruction.

Field                   Type             Description
row_type                string           post or comment
instance_url            string           Lemmy instance base URL
community_actor_id      string           ActivityPub actor_id of the community
community_name          string           Local community name (e.g. asklemmy)
community_title         string           Human-readable community title
community_description   string | null    Community description, if set
community_subscribers   integer          Subscriber count
community_posts_count   integer          Total posts in the community
community_local         boolean          True if the community is local to the instance
post_id                 integer          Lemmy local post id
post_ap_id              string           ActivityPub canonical URL of the post
post_url                string           Canonical post URL (same as post_ap_id)
post_title              string           Post title
post_body               string | null    Post body text; null on link posts
post_external_url       string | null    External link URL; null on text posts
post_score              integer          Net post score (upvotes - downvotes)
post_upvotes            integer          Number of upvotes on the post
post_downvotes          integer          Number of downvotes on the post
post_comments_count     integer          Total comments on the post
post_published          string           ISO 8601 UTC datetime
post_updated            string | null    ISO 8601 UTC datetime of last edit
comment_id              integer | null   Lemmy local comment id (comment rows only)
comment_path            string | null    Lemmy thread path (e.g. 0.12345.67890)
comment_content         string | null    Comment body text
comment_score           integer | null   Net comment score
comment_published       string | null    ISO 8601 UTC datetime
comment_parent_id       integer | null   Parent comment id derived from path
author_name             string           Author username (local form)
author_display_name     string | null    Author display name, if set
scraped_at              string           ISO 8601 UTC datetime this row was written
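
Because both row types share one dataset, a consumer usually starts by splitting on row_type. The following is a minimal sketch of that step, assuming the rows have already been loaded as a list of dicts; the helper name split_rows and the sample rows are illustrative and not part of the Actor.

```python
from collections import defaultdict


def split_rows(rows):
    """Separate post rows from comment rows and group the comments by post_id."""
    posts = []
    comments_by_post = defaultdict(list)
    for row in rows:
        if row["row_type"] == "post":
            posts.append(row)
        elif row["row_type"] == "comment":
            comments_by_post[row["post_id"]].append(row)
    return posts, dict(comments_by_post)


# Hand-written sample rows for illustration only (not real scraped data).
sample = [
    {"row_type": "post", "post_id": 1, "post_title": "Hello"},
    {"row_type": "comment", "post_id": 1, "comment_id": 10},
    {"row_type": "comment", "post_id": 1, "comment_id": 11},
]
posts, comments_by_post = split_rows(sample)
print(len(posts), "post(s),", len(comments_by_post[1]), "comment(s) on post 1")
```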

🔥 Features

What we handle for you so you don't have to:

  • 🛡️ We rotate browser fingerprints: curl-cffi impersonates real browser TLS (Chrome / Firefox / Safari), so the target sees real-browser handshakes, not Python.
  • 🔁 We retry with exponential backoff on 408 / 429 / 503 responses and honour Retry-After. Up to 5 attempts per page before surfacing a clear error.
  • 🌐 We rotate through Apify Proxy on connection failures — fresh session ID, fresh exit IP — so transient blocks don't abort your run.
  • 🧱 We pace requests per instance to avoid triggering rate limits; partial successes surface a clear set_status_message — we never silently return an empty dataset.
  • 🧊 We keep the dataset clean — Pydantic v2 validated rows, ISO 8601 timestamps, stable IDs, and a comment_parent_id field derived from Lemmy's path encoding so you can reconstruct the full thread tree without an extra API call.
  • 💰 You pay only for results that land. No data → no charge beyond the small actor-start warm-up fee.
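
The retry behaviour described above (exponential backoff on 408 / 429 / 503, honouring Retry-After, up to 5 attempts) can be sketched roughly as follows. This is not the Actor's code: the base delay, the 60-second cap, and the injected send and sleep callables are assumptions made so the example stays self-contained.

```python
RETRYABLE_STATUSES = {408, 429, 503}
MAX_ATTEMPTS = 5


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff
    return min(base * 2 ** (attempt - 1), cap)


def fetch_with_retries(send, sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    `send` is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, headers, body = send()
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt < MAX_ATTEMPTS:
            sleep(backoff_delay(attempt, headers.get("Retry-After")))
    raise RuntimeError(f"gave up after {MAX_ATTEMPTS} attempts (last status {status})")


# Usage with canned responses instead of a real network call.
canned = iter([(429, {"Retry-After": "7"}, ""), (503, {}, ""), (200, {}, "ok")])
waits = []
result = fetch_with_retries(lambda: next(canned), waits.append)
print(result, waits)
```

Injecting sleep keeps the sketch testable without real delays; a production version would typically add jitter as well.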

Additional capabilities:

  • Supports any Lemmy instance — lemmy.world, lemmy.ml, beehaw.org, sh.itjust.works, and any public instance running Lemmy v0.19.
  • Two operating modes from one input — posts only, or posts + comments in the same dataset.
  • Federated community syntax — pass memes@lemmy.world to scrape a remote community from any other instance, or asklemmy for a local community on the chosen instance.
  • 16-token sort enum: Active, Hot, New, Old, Scaled, Controversial, MostComments, NewComments, TopHour, TopSixHour, TopTwelveHour, TopDay, TopWeek, TopMonth, TopYear, TopAll.
  • Cursor-based post pagination + integer-page comment pagination — both verified against Lemmy v0.19 on live instances.
  • Denormalised output — community metadata on every row, no joins needed for downstream analytics or CSV exports.
  • Pydantic v2 input validation with named sort enum and range bounds; bare Top (invalid on v0.19) is rejected up-front before any network call.
  • Pairs with bluesky-feed-posts and bluesky-starter-pack as the Federated Social Suite.
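
Cursor-based post pagination, mentioned in the list above, generally follows the loop sketched here. The sketch assumes each page is a dict with a posts list and a next_page cursor that is fed back into the next request; those field names and the fetch_page callable are assumptions for illustration, not a description of the Actor's internals.

```python
def iter_posts(fetch_page, max_posts):
    """Yield up to max_posts posts by following the cursor until it runs out."""
    cursor = None
    emitted = 0
    while emitted < max_posts:
        page = fetch_page(cursor)
        posts = page.get("posts", [])
        if not posts:
            return  # empty page: nothing more to read
        for post in posts:
            yield post
            emitted += 1
            if emitted >= max_posts:
                return
        cursor = page.get("next_page")
        if cursor is None:
            return  # no further cursor: last page reached


# Fake three-page listing standing in for a real instance.
FAKE_PAGES = {
    None: {"posts": [1, 2], "next_page": "a"},
    "a": {"posts": [3, 4], "next_page": "b"},
    "b": {"posts": [5], "next_page": None},
}
first_three = list(iter_posts(FAKE_PAGES.__getitem__, 3))
print(first_three)
```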

💡 Use cases

  • Reddit-alternative migration research — track how communities and engagement migrate from Reddit to Lemmy after policy changes; compare subscriber and post growth across instances.
  • Newsroom monitoring — subscribe to journalism, politics, or breaking-news Lemmy communities and pipe the latest top posts to Slack or Google Sheets via Apify integrations.
  • Brand monitoring on the fediverse — Lemmy is a growing channel for product complaints, support discussions, and competitor mentions outside the Reddit walled garden; this Actor surfaces them on a schedule.
  • Academic federated-social research — Lemmy's public REST API makes it significantly more accessible for longitudinal community studies, sentiment tracking, or content-moderation research than platforms that gate their data.
  • Community trend analysis — pull TopWeek posts across multiple communities and rank by score, comment count, or upvote ratio to benchmark community health.
  • Comment-tree reconstruction — combine post rows with comment rows (joined by post_id) and the comment_parent_id field to rebuild the full discussion tree for NLP or moderation pipelines.
  • NLP corpus building — Lemmy provides Reddit-shaped threaded conversation data useful for sentiment training, RAG pipelines, and discourse modelling without platform-specific OAuth hoops.

⚙️ How to use it

  1. Open the Actor input form.
  2. Set Lemmy instance URL to the base URL of any public Lemmy instance — e.g. https://lemmy.world or https://lemmy.ml. Trailing slash is stripped automatically.
  3. Set Community name to the community you want to scrape. Either local form (asklemmy) for a community on the chosen instance, or federated form (memes@lemmy.world) for a remote community visible from the chosen instance.
  4. Pick a Post sort order from the 16 valid tokens. Hot (default) blends recency and engagement; TopWeek returns the highest-scoring posts of the last 7 days; New returns posts newest first.
  5. Adjust Max posts (default 100, max 5,000).
  6. Toggle Include comments if you want comment rows alongside post rows. When enabled, set Max comments per post (default 50, max 500).
  7. Configure Apify Proxy — we recommend leaving it at the default auto setting; the Actor activates proxy rotation automatically when it encounters rate-limit or connection errors.
  8. Click Start. Results stream into the default dataset and can be exported as JSON, CSV, Excel, or XML via the Export button.
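
Steps 2 and 3 describe two small normalisations: stripping the trailing slash from the instance URL, and telling the local community form apart from the federated one. A minimal sketch of both, assuming simple string handling is enough; the function names are hypothetical and the Actor's own validation may differ.

```python
def normalize_instance_url(url):
    """Strip surrounding whitespace and any trailing slash from the base URL."""
    return url.strip().rstrip("/")


def parse_community(name):
    """Return (local_name, host); host is None for the bare local form."""
    local, sep, host = name.strip().partition("@")
    if not local or (sep and not host):
        raise ValueError(f"invalid community name: {name!r}")
    return local, (host or None)


print(normalize_instance_url("https://lemmy.world/"))
print(parse_community("asklemmy"))
print(parse_community("memes@lemmy.world"))
```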

Single-community example

{
  "instanceUrl": "https://lemmy.world",
  "communityName": "asklemmy",
  "sort": "TopWeek",
  "maxPosts": 200,
  "includeComments": false
}

Posts + comments example

{
  "instanceUrl": "https://lemmy.ml",
  "communityName": "memes@lemmy.world",
  "sort": "Hot",
  "maxPosts": 50,
  "includeComments": true,
  "maxCommentsPerPost": 25
}

📥 Input

Field               Type           Required  Default  Description
instanceUrl         string         yes       -        Base URL of the Lemmy instance (e.g. https://lemmy.world)
communityName       string         yes       -        Community name: local form (asklemmy) or federated form (memes@lemmy.world)
sort                string (enum)  no        Hot      One of 16 valid Lemmy v0.19 sort tokens
maxPosts            integer        no        100      Max post rows emitted (1–5000)
includeComments     boolean        no        false    If true, also fetch comments per post
maxCommentsPerPost  integer        no        50       Max comment rows per post (1–500)
proxyConfiguration  object         no        auto     Apify Proxy configuration; default auto activates on block

The full list of valid sort tokens is: Active, Hot, New, Old, Scaled, Controversial, MostComments, NewComments, TopHour, TopSixHour, TopTwelveHour, TopDay, TopWeek, TopMonth, TopYear, TopAll. Bare Top is not valid on Lemmy v0.19 and is rejected by the input validator before any network call.
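
The up-front rejection of bare Top can be illustrated with a plain membership check against the token list given above. This sketch uses only the standard library rather than Pydantic, and validate_sort is a hypothetical name, so treat it as an approximation of the behaviour rather than the Actor's validator.

```python
# Sort tokens exactly as listed in this README.
VALID_SORTS = (
    "Active", "Hot", "New", "Old", "Scaled", "Controversial",
    "MostComments", "NewComments",
    "TopHour", "TopSixHour", "TopTwelveHour", "TopDay",
    "TopWeek", "TopMonth", "TopYear", "TopAll",
)


def validate_sort(value="Hot"):
    """Return the token unchanged, or raise before any network call is made."""
    if value not in VALID_SORTS:
        raise ValueError(
            f"invalid sort {value!r}; expected one of: {', '.join(VALID_SORTS)}"
        )
    return value


print(validate_sort("TopWeek"))
try:
    validate_sort("Top")
except ValueError as exc:
    print("rejected:", exc)
```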

📤 Output

One row per post or per comment. Community + post metadata is denormalised onto every comment row so a flat CSV is self-contained.

{
  "row_type": "post",
  "instance_url": "https://lemmy.world",
  "community_actor_id": "https://lemmy.world/c/asklemmy",
  "community_name": "asklemmy",
  "community_title": "Ask Lemmy",
  "community_description": "A Lemmy equivalent of Ask Reddit.",
  "community_subscribers": 39567,
  "community_posts_count": 8853,
  "community_local": true,
  "post_id": 46934299,
  "post_ap_id": "https://leminal.space/post/35446889",
  "post_url": "https://leminal.space/post/35446889",
  "post_title": "What's something that you feel genuinely sad about?",
  "post_body": null,
  "post_external_url": null,
  "post_score": 34,
  "post_upvotes": 38,
  "post_downvotes": 4,
  "post_comments_count": 18,
  "post_published": "2026-05-16T09:31:47.723504Z",
  "post_updated": null,
  "post_nsfw": false,
  "post_featured_community": false,
  "post_locked": false,
  "comment_id": null,
  "comment_ap_id": null,
  "comment_path": null,
  "comment_content": null,
  "comment_score": null,
  "comment_published": null,
  "comment_parent_id": null,
  "author_actor_id": "https://leminal.space/u/FosterMolasses",
  "author_name": "FosterMolasses",
  "author_display_name": null,
  "author_bot_account": false,
  "author_published": "2025-01-17T14:35:51.850105Z",
  "scraped_at": "2026-06-01T12:00:00.000Z"
}

Comment rows have the same shape, with row_type="comment" and the comment fields populated. The comment_parent_id field is derived from the penultimate segment of comment.path: null for top-level comments, otherwise the integer id of the parent comment. Callers can therefore reconstruct the full thread tree without an extra API call.
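
The derivation of comment_parent_id from the path can be sketched as below, assuming paths are dot-separated integer ids that always begin with the root marker 0, as in the 0.12345.67890 example. The function name is illustrative.

```python
def comment_parent_id(path):
    """Return the parent comment id encoded in a thread path, or None at top level."""
    segments = path.split(".")
    if len(segments) < 2:
        raise ValueError(f"unexpected comment path: {path!r}")
    parent = int(segments[-2])  # penultimate segment
    return None if parent == 0 else parent  # 0 is the root marker, not a comment


print(comment_parent_id("0.12345"))        # top-level comment
print(comment_parent_id("0.12345.67890"))  # reply to comment 12345
```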

Export formats

  • JSON — full fidelity, all fields, newline-delimited
  • CSV — flat, one row per post or comment, all columns including denormalised community + post metadata
  • Excel: .xlsx via the Apify dataset converter
  • XML — structured per-item

All formats are available via the Apify API: GET /datasets/{id}/items?format=csv&clean=true.
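
A small helper for building that export URL might look like this. The base URL https://api.apify.com/v2 is an assumption (the path above is given without a host), the dataset id is a placeholder, and authentication is left out entirely.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"  # assumed base URL, not stated in this README


def dataset_items_url(dataset_id, fmt="csv", clean=True):
    """Build the items-export URL for a dataset in the requested format."""
    query = urlencode({"format": fmt, "clean": "true" if clean else "false"})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


# "YOUR_DATASET_ID" is a placeholder; use the id shown for your run's dataset.
print(dataset_items_url("YOUR_DATASET_ID"))
print(dataset_items_url("YOUR_DATASET_ID", fmt="json"))
```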

💰 Pricing

Pay-Per-Event (PPE) — you pay only for what you use:

Event               Price (USD)  When
actor-start         $0.05        Once per run, at boot
result-row          $0.002       Per post row written to the dataset
result-row-comment  $0.001       Per comment row written to the dataset

Example costs

Run                                         Cost
100 posts (no comments)                     $0.25
500 posts (no comments)                     $1.05
1,000 posts (no comments)                   $2.05
100 posts × 50 comments each = 5,100 rows   $5.25
500 posts × 25 comments each = 13,000 rows  $13.55

Comment rows are half the price of post rows because they are higher-volume and have lower per-row commercial value for most analytics use cases. Disable comment fetching when you only need post-level data — the Actor runs faster and costs less.
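
To budget a run before starting it, the event prices above can be combined into a quick estimate. This sketch assumes every post yields the full number of comments, so it gives an upper bound; estimate_cost is a hypothetical helper, and Decimal is used to avoid floating-point rounding on currency.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.05")   # once per run
POST_ROW = Decimal("0.002")     # per post row
COMMENT_ROW = Decimal("0.001")  # per comment row


def estimate_cost(posts, comments_per_post=0):
    """Upper-bound run cost in USD for the given number of posts and comments."""
    comment_rows = posts * comments_per_post
    return ACTOR_START + posts * POST_ROW + comment_rows * COMMENT_ROW


for posts, per_post in [(100, 0), (1000, 0), (500, 25)]:
    print(posts, "posts x", per_post, "comments:", estimate_cost(posts, per_post))
```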

🚧 Limitations

  • Public communities only. Private or access-restricted communities require authentication and are out of scope.
  • One community per run. This Actor scrapes one community per run; multi-community or instance-wide listings need separate runs or a different Actor.
  • No media download. Image, video, and external link URLs are captured in the row, but the Actor does not download the media content itself.
  • No real-time streaming. The Actor takes a snapshot at run time; for live updates schedule recurring runs via Apify Schedules.
  • 7-day default storage retention on the Apify FREE tier. Export your dataset immediately after the run or upgrade for longer retention.
  • Lemmy v0.19 only. The Actor targets the v0.19 API surface and field paths. Pydantic extra="ignore" absorbs additive changes, but breaking removals require a version update.
  • Comment-tree depth is preserved via comment_path but the Actor does not return comments in a pre-nested structure — callers reconstruct the tree from comment_parent_id.
  • Federation deduplication is the caller's responsibility. A post on c/news@lemmy.world may appear on multiple instances; across multi-instance runs, use community_actor_id to recognise the same community and post_ap_id to recognise the same post.

❓ FAQ

Do I need a Lemmy account or API key?

Lemmy's public /api/v3/ REST endpoint, the same HTTP API that Lemmy's own web UI and third-party apps use, accepts unauthenticated reads for public communities. This Actor uses that same public interface, so no login, signup, or API key is required on your end. We still handle rate-limit backoff and proxy rotation on our side so your run doesn't stall mid-export.

What is the federated community format?

Lemmy communities can live on any instance but be subscribed to and read from any other instance via ActivityPub federation. The federated form is community@instance.tld — for example memes@lemmy.world refers to the memes community hosted on lemmy.world, accessible from any other Lemmy instance that has fetched it via federation. Local communities (on the instance you point this Actor at) use the bare local name like asklemmy.

Why does sort=Top not work?

Lemmy v0.19 removed the bare Top sort token and replaced it with compound tokens that embed a time range: TopHour, TopSixHour, TopTwelveHour, TopDay, TopWeek, TopMonth, TopYear, TopAll. Passing bare Top returns {"error": "unknown"} from the Lemmy API. This Actor rejects bare Top up-front during input validation — pick the compound token whose time range you want.

Which Lemmy instance should I point this at?

Any public Lemmy instance running v0.19. lemmy.world and lemmy.ml are the two largest general-purpose instances and good defaults. Topic-specific instances like beehaw.org, sh.itjust.works, or programming.dev are also fully supported. The instance URL determines which "view" of the federated network the Actor scrapes from — different instances may have fetched different remote communities, so re-running against a different instance can return slightly different remote-community data.

How do I reconstruct the comment tree from the dataset?

Each comment row carries comment_id and comment_parent_id. Group rows by post_id, then for each post: top-level comments have comment_parent_id IS NULL; replies have comment_parent_id pointing at the parent comment's comment_id within the same post. The comment_path field encodes the full ancestry (e.g. 0.12345.67890 means parent comment 12345, this comment 67890) for callers who need it.
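
The grouping described above can be sketched as follows for the comment rows of a single post. Comments whose parent is absent from the dataset (for example, cut off by Max comments per post) are treated as roots here; that fallback is a design choice of this sketch, not documented Actor behaviour.

```python
def build_comment_tree(comment_rows):
    """Nest the comment rows of one post under their parents."""
    nodes = {row["comment_id"]: {"row": row, "replies": []} for row in comment_rows}
    roots = []
    for row in comment_rows:
        node = nodes[row["comment_id"]]
        parent = nodes.get(row["comment_parent_id"])
        if parent is None:
            roots.append(node)  # top-level, or parent not present in the dataset
        else:
            parent["replies"].append(node)
    return roots


# Hand-written rows for illustration only.
rows = [
    {"comment_id": 1, "comment_parent_id": None},
    {"comment_id": 2, "comment_parent_id": 1},
    {"comment_id": 3, "comment_parent_id": 2},
    {"comment_id": 4, "comment_parent_id": None},
]
tree = build_comment_tree(rows)
print([node["row"]["comment_id"] for node in tree])
```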

Is this the right tool for building an NLP corpus from Lemmy data?

Yes. Lemmy provides Reddit-shaped threaded conversation data (post title, body, score, and a full comment tree with parent-child relationships) without requiring platform-specific OAuth credentials. Run the Actor against multiple communities across multiple instances, deduplicate federation mirrors on community_actor_id and post_ap_id, and you have a clean conversation corpus ready for sentiment training, RAG indexing, or discourse analysis.

Is scraping public Lemmy data legal?

Lemmy is free and open-source software licensed under AGPL-3.0. Lemmy instances serve public community data through an unauthenticated REST API and through ActivityPub federation. Always verify the specific instance's terms of service and your local jurisdiction's data-protection rules before using scraped data for commercial purposes.

Part of the Devil Scrapes Federated Social Suite:

  • Bluesky Starter Pack Scraper — export full member lists from any public Bluesky Starter Pack via the AT Protocol public API.
  • Bluesky Feed Posts Scraper — export posts from any public Bluesky custom feed or algorithm feed, with denormalised feed metadata on every row.

All three Actors share consistent pricing ($0.002 per post row, $0.05 per run) and field-naming conventions (<entity>_<field> snake_case) so cross-network federated-social analyses can join cleanly on author_handle / author_name.

💬 Your feedback

Found a bug, hit a rate limit, or need a new field on the output row? Open an issue on the Actor's Apify Store page or contact the Devil Scrapes team at apify.com/DevilScrapes. We ship updates within days of validated reports.