Lemmy Community Scraper

Scrape posts and comments from any public Lemmy community on any Lemmy instance — the federated Reddit alternative. We handle the pagination, retries, fingerprint rotation, and rate-limit pacing — you get typed dataset rows ready to export to CSV or JSON.

Developer

DevilScrapes

Maintained by Community

We do the dirty work so your dataset stays clean. 😈

$2.05 / 1,000 posts — Export posts and optional comments from any public Lemmy community on any Lemmy instance, via the public /api/v3/ REST API. No Lemmy account. No API key. No browser automation.

Lemmy is the federated Reddit alternative running the ActivityPub protocol; every Lemmy instance exposes the same versioned REST API with no authentication required for public reads. This Actor fetches community metadata once, cursor-paginates through posts, optionally fetches comments per post, and emits a flat dataset with community + post metadata denormalised onto every row — so flat CSV exports are self-contained for analytics, BI, or SQL with zero joins.

🎯 What this scrapes

This Actor exports two row types into a single dataset, discriminated by the row_type column:

  1. Post rows (row_type="post") — one row per post in the target community. Always emitted.
  2. Comment rows (row_type="comment") — one row per comment on each post. Emitted only when includeComments is enabled.

Every row carries the community context (name, title, subscriber count, posts count, ActivityPub actor_id) and the post context (id, title, score, upvotes, downvotes, comment count) so downstream tools see a fully denormalised, join-free table. Comment rows additionally carry the comment id, content, path, score, and a derived comment_parent_id field that preserves the parent-child relationship for tree reconstruction.

Field                   Type            Description
row_type                string          post or comment
instance_url            string          Lemmy instance base URL
community_actor_id      string          ActivityPub actor_id of the community
community_name          string          Local community name (e.g. asklemmy)
community_title         string          Human-readable community title
community_description   string | null   Community description, if set
community_subscribers   integer         Subscriber count
community_posts_count   integer         Total posts in the community
community_local         boolean         True if the community is local to the instance
post_id                 integer         Lemmy local post id
post_ap_id              string          ActivityPub canonical URL of the post
post_url                string          Canonical post URL (same as post_ap_id)
post_title              string          Post title
post_body               string | null   Post body text; null on link posts
post_external_url       string | null   External link URL; null on text posts
post_score              integer         Net post score (upvotes - downvotes)
post_upvotes            integer         Number of upvotes on the post
post_downvotes          integer         Number of downvotes on the post
post_comments_count     integer         Total comments on the post
post_published          string          ISO 8601 UTC datetime
post_updated            string | null   ISO 8601 UTC datetime of last edit
comment_id              integer | null  Lemmy local comment id (comment rows only)
comment_path            string | null   Lemmy thread path (e.g. 0.12345.67890)
comment_content         string | null   Comment body text
comment_score           integer | null  Net comment score
comment_published       string | null   ISO 8601 UTC datetime
comment_parent_id       integer | null  Parent comment id derived from path
author_name             string          Author username (local form)
author_display_name     string | null   Author display name, if set
scraped_at              string          ISO 8601 UTC datetime this row was written

🔥 Features

  • No Lemmy account required — uses the public unauthenticated /api/v3/ REST API.
  • Supports any Lemmy instance — lemmy.world, lemmy.ml, beehaw.org, sh.itjust.works, and any other public instance running Lemmy v0.19.
  • Two operating modes from one input — posts only, or posts + comments in the same dataset.
  • Federated community syntax — pass memes@lemmy.world to scrape a remote community from any other instance, or asklemmy for a local community on the chosen instance.
  • 16-token sort enum: Active, Hot, New, Old, Scaled, Controversial, MostComments, NewComments, TopHour, TopSixHour, TopTwelveHour, TopDay, TopWeek, TopMonth, TopYear, TopAll.
  • Cursor-based post pagination + integer-page comment pagination — both verified against Lemmy v0.19 on live instances.
  • Denormalised output — community metadata on every row, no joins needed for downstream analytics or CSV exports.
  • comment_parent_id derived from comment.path — preserves the comment tree so callers can reconstruct threaded discussion.
  • Exponential backoff with Retry-After honoured for 408 / 429 / 503 responses; max 5 attempts.
  • Pure HTTP client (curl-cffi with browser fingerprint impersonation) — no browser automation, low compute footprint.
  • Pydantic v2 input validation with named sort enum and range bounds; bare Top (invalid on v0.19) is rejected up-front before any network call.
  • Pairs with bluesky-feed-posts and bluesky-starter-pack as the Federated Social Suite.
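
The retry behaviour listed above (exponential backoff, Retry-After honoured for 408 / 429 / 503, at most 5 attempts) could look roughly like the sketch below. It is not the Actor's implementation: the send callable, the 1-second base, and the 60-second cap are assumptions chosen for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple

RETRYABLE_STATUSES = {408, 429, 503}  # statuses named in the feature list
MAX_ATTEMPTS = 5

# Hypothetical transport: returns (status_code, headers, body).
Send = Callable[[], Tuple[int, Dict[str, str], str]]


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if retry_after is not None:
        try:
            # Honour a numeric Retry-After header, clamped to [0, cap].
            return min(max(float(retry_after), 0.0), cap)
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled here
    return min(base * 2 ** (attempt - 1), cap)


def fetch_with_retries(send: Send,
                       sleep: Callable[[float], None] = time.sleep
                       ) -> Tuple[int, str]:
    """Call send() until the status is not retryable or attempts run out."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, headers, body = send()
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt < MAX_ATTEMPTS:
            sleep(retry_delay(attempt, headers.get("Retry-After")))
    raise RuntimeError(
        f"gave up after {MAX_ATTEMPTS} attempts (last status {status})"
    )
```

Passing sleep as a parameter keeps the function testable: a test can record the requested delays instead of actually waiting.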

💡 Use cases

  • Reddit-alternative migration research — track how communities and engagement migrate from Reddit to Lemmy after policy changes; compare subscriber and post growth across instances.
  • Newsroom monitoring — subscribe to journalism, politics, or breaking-news Lemmy communities and pipe the latest top posts to Slack or Google Sheets via Apify integrations.
  • Brand monitoring on the fediverse — Lemmy is a growing channel for product complaints, support discussions, and competitor mentions outside the Reddit walled garden; this Actor surfaces them on a schedule.
  • Academic federated-social research — Lemmy's open, public REST API is significantly more accessible than Reddit's gated API; ideal for longitudinal community studies, sentiment tracking, or content-moderation research.
  • Community trend analysis — pull TopWeek posts across multiple communities and rank by score, comment count, or upvote ratio to benchmark community health.
  • Comment-tree reconstruction — combine post rows with comment rows (joined by post_id) and the comment_parent_id field to rebuild the full discussion tree for NLP or moderation pipelines.

⚙️ How to use it

  1. Open the Actor input form.
  2. Set Lemmy instance URL to the base URL of any public Lemmy instance — e.g. https://lemmy.world or https://lemmy.ml. Trailing slash is stripped automatically.
  3. Set Community name to the community you want to scrape. Either local form (asklemmy) for a community on the chosen instance, or federated form (memes@lemmy.world) for a remote community visible from the chosen instance.
  4. Pick a Post sort order from the 16 valid tokens. Hot (default) blends recency and engagement; TopWeek returns the highest-scoring posts of the last 7 days; New returns chronological order.
  5. Adjust Max posts (default 100, max 5,000).
  6. Toggle Include comments if you want comment rows alongside post rows. When enabled, set Max comments per post (default 50, max 500).
  7. Leave Use Apify Proxy off unless you are behind a restrictive ISP — Lemmy instances do not block datacenter IPs, so direct routing is faster and free.
  8. Click Start. Results stream into the default dataset and can be exported as JSON, CSV, Excel, or XML via the Export button.

Single-community example

{
  "instanceUrl": "https://lemmy.world",
  "communityName": "asklemmy",
  "sort": "TopWeek",
  "maxPosts": 200,
  "includeComments": false,
  "useProxy": false
}

Posts + comments example

{
  "instanceUrl": "https://lemmy.ml",
  "communityName": "memes@lemmy.world",
  "sort": "Hot",
  "maxPosts": 50,
  "includeComments": true,
  "maxCommentsPerPost": 25,
  "useProxy": false
}

📥 Input

Field               Type           Required  Default  Description
instanceUrl         string         yes       -        Base URL of the Lemmy instance
communityName       string         yes       -        Community name (local or federated form)
sort                string (enum)  no        Hot      One of 16 valid Lemmy v0.19 sort tokens
maxPosts            integer        no        100      Max post rows emitted (1–5000)
includeComments     boolean        no        false    If true, also fetch comments per post
maxCommentsPerPost  integer        no        50       Max comment rows per post (1–500)
useProxy            boolean        no        false    Route through Apify Proxy (BUYPROXIES94952)

The full list of valid sort tokens is: Active, Hot, New, Old, Scaled, Controversial, MostComments, NewComments, TopHour, TopSixHour, TopTwelveHour, TopDay, TopWeek, TopMonth, TopYear, TopAll. Bare Top is not valid on Lemmy v0.19 and is rejected by the input validator before any network call.
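
As a rough illustration of the input rules above, the sketch below checks the same constraints with the standard library only. The Actor itself is described as using Pydantic v2, so this is a stand-in rather than its real validator; validate_input and _bounded_int are hypothetical names.

```python
# Sort tokens exactly as listed above; bare "Top" is deliberately absent.
VALID_SORTS = frozenset({
    "Active", "Hot", "New", "Old", "Scaled", "Controversial",
    "MostComments", "NewComments", "TopHour", "TopSixHour",
    "TopTwelveHour", "TopDay", "TopWeek", "TopMonth", "TopYear", "TopAll",
})


def _bounded_int(raw: dict, key: str, default: int, low: int, high: int) -> int:
    """Read an integer field and enforce its documented range."""
    value = raw.get(key, default)
    if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
        raise ValueError(f"{key} must be an integer in {low}..{high}, got {value!r}")
    return value


def validate_input(raw: dict) -> dict:
    """Return a normalised copy of the input, or raise ValueError."""
    instance_url = str(raw.get("instanceUrl", "")).strip().rstrip("/")
    if not instance_url.startswith(("http://", "https://")):
        raise ValueError("instanceUrl must be an http(s) URL")
    community = str(raw.get("communityName", "")).strip()
    if not community:
        raise ValueError("communityName is required")
    sort = raw.get("sort", "Hot")
    if sort not in VALID_SORTS:
        raise ValueError(f"unsupported sort token {sort!r}")
    return {
        "instanceUrl": instance_url,
        "communityName": community,
        "sort": sort,
        "maxPosts": _bounded_int(raw, "maxPosts", 100, 1, 5000),
        "includeComments": bool(raw.get("includeComments", False)),
        "maxCommentsPerPost": _bounded_int(raw, "maxCommentsPerPost", 50, 1, 500),
        "useProxy": bool(raw.get("useProxy", False)),
    }
```

Because every check runs before any request is built, an invalid token such as bare Top fails fast with a ValueError instead of a failed API call.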

📤 Output

One row per post or per comment. Community + post metadata is denormalised onto every comment row so a flat CSV is self-contained.

{
  "row_type": "post",
  "instance_url": "https://lemmy.world",
  "community_actor_id": "https://lemmy.world/c/asklemmy",
  "community_name": "asklemmy",
  "community_title": "Ask Lemmy",
  "community_description": "A Lemmy equivalent of Ask Reddit.",
  "community_subscribers": 39567,
  "community_posts_count": 8853,
  "community_local": true,
  "post_id": 46934299,
  "post_ap_id": "https://leminal.space/post/35446889",
  "post_url": "https://leminal.space/post/35446889",
  "post_title": "What's something that you feel genuinely sad about?",
  "post_body": null,
  "post_external_url": null,
  "post_score": 34,
  "post_upvotes": 38,
  "post_downvotes": 4,
  "post_comments_count": 18,
  "post_published": "2026-05-16T09:31:47.723504Z",
  "post_updated": null,
  "post_nsfw": false,
  "post_featured_community": false,
  "post_locked": false,
  "comment_id": null,
  "comment_ap_id": null,
  "comment_path": null,
  "comment_content": null,
  "comment_score": null,
  "comment_published": null,
  "comment_parent_id": null,
  "author_actor_id": "https://leminal.space/u/FosterMolasses",
  "author_name": "FosterMolasses",
  "author_display_name": null,
  "author_bot_account": false,
  "author_published": "2025-01-17T14:35:51.850105Z",
  "scraped_at": "2026-05-16T12:00:00.000Z"
}

Comment rows have the same shape, with row_type="comment" and the comment fields populated. The comment_parent_id field is derived from the penultimate segment of comment.path: it is null for top-level comments and otherwise the integer id of the parent comment, so callers can reconstruct the full thread tree without an extra API call.
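
The path-to-parent derivation can be sketched as below. The function name is hypothetical, and the sketch assumes that paths always start with the root segment 0, as in the 0.12345.67890 example.

```python
from typing import Optional


def parent_id_from_path(path: str) -> Optional[int]:
    """Return the parent comment id encoded in a Lemmy comment path."""
    segments = path.split(".")
    if len(segments) < 3:
        # "0.<comment_id>": the only ancestor is the root, so no parent.
        return None
    return int(segments[-2])  # penultimate segment is the direct parent


print(parent_id_from_path("0.67890"))        # None
print(parent_id_from_path("0.12345.67890"))  # 12345
```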

Export formats

  • JSON — full fidelity, all fields, newline-delimited
  • CSV — flat, one row per post or comment, all columns including denormalised community + post metadata
  • Excel (.xlsx), via the Apify dataset converter
  • XML — structured per-item

All formats are available via the Apify API: GET /datasets/{id}/items?format=csv&clean=true.

💰 Pricing

Pay-Per-Event (PPE) — you pay only for what you use:

Event               Price (USD)  When
actor-start         $0.05        Once per run, at boot
result-row          $0.002       Per post row written to the dataset
result-row-comment  $0.001       Per comment row written to the dataset

Example costs

Run                                     Cost
100 posts (no comments)                 $0.25
500 posts (no comments)                 $1.05
1,000 posts (no comments)               $2.05
100 posts × 50 comments = 5,100 rows    $5.25
500 posts × 25 comments = 13,000 rows   $13.55

Comment rows are half the price of post rows because they are higher-volume and have lower per-row commercial value for most analytics use cases. Disable comment fetching when you only need post-level data — the Actor is significantly cheaper and faster.
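
For budgeting, the event prices above reduce to a simple formula: start fee + posts × $0.002 + comment rows × $0.001. The helper below is a hypothetical estimator, not part of the Actor; it gives an upper bound, because it assumes every post yields the full number of comments requested.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.05")   # charged once per run
POST_ROW = Decimal("0.002")     # per post row
COMMENT_ROW = Decimal("0.001")  # per comment row


def estimate_cost(posts: int, comments_per_post: int = 0) -> Decimal:
    """Upper-bound run cost in USD for the given row counts."""
    comment_rows = posts * comments_per_post
    return ACTOR_START + posts * POST_ROW + comment_rows * COMMENT_ROW


print(f"{estimate_cost(1000):.2f}")     # 2.05
print(f"{estimate_cost(500, 25):.2f}")  # 13.55
```

Decimal is used instead of float so that fractions of a cent add up exactly.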

🚧 Limitations

  • Public communities only. Private or access-restricted communities require authentication and are out of scope.
  • One community per run. Multi-community or instance-wide listings need separate runs (or a different Actor).
  • No media download. Image, video, and external link URLs are captured in the row, but the Actor does not download the media content itself.
  • No real-time streaming. The Actor takes a snapshot at run time; for live updates schedule recurring runs via Apify Schedules.
  • 7-day default storage retention on the Apify FREE tier. Export your dataset immediately after the run or upgrade for longer retention.
  • Lemmy v0.19 only. The Actor targets the v0.19 API surface and field paths. v0.20 (when released) may require updates; Pydantic extra="ignore" absorbs additive changes, but breaking removals will require a new version.
  • Comment-tree depth is preserved via comment_path but the Actor does not return comments in a pre-nested structure — callers reconstruct the tree from comment_parent_id.

❓ FAQ

Do I need a Lemmy account?

No. Lemmy's public /api/v3/ REST API is unauthenticated by design — every endpoint this Actor calls is open to anyone without a login, signup, or API key.

What is the federated community format?

Lemmy communities can live on any instance but be subscribed to and read from any other instance via ActivityPub federation. The federated form is community@instance.tld — for example memes@lemmy.world refers to the memes community hosted on lemmy.world, accessible from any other Lemmy instance that has fetched it via federation. Local communities (on the instance you point this Actor at) use the bare local name like asklemmy.
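
To make the two forms concrete, here is a small sketch that splits a community name into its local part and optional host. CommunityRef and parse_community are hypothetical names used for illustration; the Actor may parse its input differently.

```python
from typing import NamedTuple, Optional


class CommunityRef(NamedTuple):
    name: str            # local community name, e.g. "memes"
    host: Optional[str]  # hosting instance, or None for the local form


def parse_community(value: str) -> CommunityRef:
    """Split 'community' or 'community@instance.tld' into its parts."""
    name, sep, host = value.strip().partition("@")
    if not name or (sep and not host):
        raise ValueError(f"not a valid community name: {value!r}")
    return CommunityRef(name, host or None)


print(parse_community("asklemmy"))
print(parse_community("memes@lemmy.world"))
```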

Why does sort=Top not work?

Lemmy v0.19 removed the bare Top sort token and replaced it with compound tokens that embed a time range: TopHour, TopSixHour, TopTwelveHour, TopDay, TopWeek, TopMonth, TopYear, TopAll. Passing bare Top returns {"error": "unknown"} from the Lemmy API. This Actor rejects bare Top up-front during input validation — pick the compound token whose time range you want.

Which Lemmy instance should I point this at?

Any public Lemmy instance running v0.19. lemmy.world and lemmy.ml are the two largest general-purpose instances and good defaults. Topic-specific instances like beehaw.org, sh.itjust.works, or programming.dev are also fully supported. The instance URL you choose determines which "view" of the federated network this Actor scrapes from — different instances may have fetched different remote communities and different comment ranges, so re-running against a different instance can return slightly different remote-community data.

How do I reconstruct the comment tree from the dataset?

Each comment row carries comment_id and comment_parent_id. Group rows by post_id, then for each post: top-level comments have comment_parent_id IS NULL; replies have comment_parent_id pointing at the parent comment's comment_id within the same post. The comment_path field encodes the full ancestry (e.g. 0.12345.67890 means parent comment 12345, this comment 67890) for callers who need it.
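
One possible way to do the grouping described above is sketched below. It works on plain dicts shaped like the dataset rows; build_comment_trees and the replies key are hypothetical names, not something the Actor emits.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_comment_trees(rows: Iterable[dict]) -> Dict[int, List[dict]]:
    """Map each post_id to its top-level comments, with nested 'replies'."""
    nodes = {}
    for row in rows:
        if row.get("row_type") != "comment":
            continue  # post rows carry no comment data
        nodes[(row["post_id"], row["comment_id"])] = {**row, "replies": []}

    trees: Dict[int, List[dict]] = defaultdict(list)
    for (post_id, _), node in nodes.items():
        parent_id = node["comment_parent_id"]
        parent = nodes.get((post_id, parent_id)) if parent_id is not None else None
        if parent is not None:
            parent["replies"].append(node)
        else:
            # Top-level comment, or a reply whose parent is not in the
            # dataset (for example, cut off by the per-post comment limit).
            trees[post_id].append(node)
    return dict(trees)
```

Replies whose parent row is missing are promoted to top level here rather than dropped; that is a design choice for this sketch, and a stricter pipeline might prefer to flag them instead.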

Is scraping public Lemmy data legal?

Lemmy is AGPL-3.0 free and open-source software. Lemmy instances explicitly serve public community data through an unauthenticated REST API designed for federation. Always verify the specific instance's terms of service and your local jurisdiction's data-protection rules before using scraped data for commercial purposes.

Part of the Devil Scrapes Federated Social Suite:

  • Bluesky Starter Pack Scraper — export full member lists from any public Bluesky Starter Pack via the AT Protocol public API.
  • Bluesky Feed Posts Scraper — export posts from any public Bluesky custom feed or algorithm feed, with denormalised feed metadata on every row.

All three Actors share consistent pricing ($0.002 per post row, $0.05 per run) and field-naming conventions (<entity>_<field> snake_case) so cross-network federated-social analyses can join cleanly on author_handle / author_name.

💬 Your feedback

Found a bug, hit a rate limit, or need a new field on the output row? Open an issue on the Actor's Apify Store page or contact the Devil Scrapes team at apify.com/DevilScrapes. We ship updates within days of validated reports.