Pricing

Pay per event

Lemmy Scraper — Posts, Comments & Community Data

Scrape posts and comments from any public Lemmy community on any Fediverse instance. Fingerprint rotation, retries, and proxy fallback handled for you. Typed dataset rows, ready for SQL, CSV, or JSON.

Developer

DevilScrapes

Last modified: 2 months ago

We do the dirty work so your dataset stays clean. 😈

$2.05 / 1,000 posts — Export posts and optional comments from any public Lemmy community on any Lemmy instance. Pay only for results that land in your dataset. No credit card needed to try.

This fediverse scraper targets Lemmy, the ActivityPub-based federated Reddit alternative that runs on roughly 1,000 public instances (lemmy.world, lemmy.ml, beehaw.org, sh.itjust.works, and hundreds more). Every Lemmy community export comes out as a flat, denormalised dataset with community and post metadata on every row, plus the comment-tree path on comment rows, so it is self-contained for SQL, BI, or CSV analytics with zero joins.

🎯 What this scrapes

This Actor emits two row types into a single dataset, discriminated by the row_type column:

Post rows (row_type="post") — one row per post in the target community. Always emitted.
Comment rows (row_type="comment") — one row per comment on each post. Emitted only when includeComments is enabled.

Every row carries the community context (name, title, subscriber count, posts count, ActivityPub actor_id) and the post context (id, title, score, upvotes, downvotes, comment count) so downstream tools see a fully denormalised, join-free table. Comment rows additionally carry the comment id, content, path, score, and a derived comment_parent_id field that preserves the parent-child relationship for tree reconstruction.

| Field | Type | Description |
| --- | --- | --- |
| `row_type` | string | `post` or `comment` |
| `instance_url` | string | Lemmy instance base URL |
| `community_actor_id` | string | ActivityPub actor_id of the community |
| `community_name` | string | Local community name (e.g. `asklemmy`) |
| `community_title` | string | Human-readable community title |
| `community_description` | string \| null | Community description, if set |
| `community_subscribers` | integer | Subscriber count |
| `community_posts_count` | integer | Total posts in the community |
| `community_local` | boolean | True if the community is local to the instance |
| `post_id` | integer | Lemmy local post id |
| `post_ap_id` | string | ActivityPub canonical URL of the post |
| `post_url` | string | Canonical post URL (same as `post_ap_id`) |
| `post_title` | string | Post title |
| `post_body` | string \| null | Post body text; null on link posts |
| `post_external_url` | string \| null | External link URL; null on text posts |
| `post_score` | integer | Net post score (upvotes - downvotes) |
| `post_upvotes` | integer | Number of upvotes on the post |
| `post_downvotes` | integer | Number of downvotes on the post |
| `post_comments_count` | integer | Total comments on the post |
| `post_published` | string | ISO 8601 UTC datetime |
| `post_updated` | string \| null | ISO 8601 UTC datetime of last edit |
| `comment_id` | integer \| null | Lemmy local comment id (comment rows only) |
| `comment_path` | string \| null | Lemmy thread path (e.g. `0.12345.67890`) |
| `comment_content` | string \| null | Comment body text |
| `comment_score` | integer \| null | Net comment score |
| `comment_published` | string \| null | ISO 8601 UTC datetime |
| `comment_parent_id` | integer \| null | Parent comment id derived from path |
| `author_name` | string | Author username (local form) |
| `author_display_name` | string \| null | Author display name, if set |
| `scraped_at` | string | ISO 8601 UTC datetime this row was written |

🔥 Features

What we handle for you so you don't have to:

🛡️ We rotate browser fingerprints — curl-cffi impersonates real browser TLS (Chrome / Firefox / Safari) so the target sees real-browser handshakes, not Python.
🔁 We retry with exponential backoff on 408 / 429 / 503 responses and honour Retry-After. Up to 5 attempts per page before surfacing a clear error.
🌐 We rotate through Apify Proxy on connection failures — fresh session ID, fresh exit IP — so transient blocks don't abort your run.
🧱 We pace requests per instance to avoid triggering rate limits. Partial successes surface a clear status message (via set_status_message); we never silently return an empty dataset.
🧊 We keep the dataset clean — Pydantic v2 validated rows, ISO 8601 timestamps, stable IDs, and a comment_parent_id field derived from Lemmy's path encoding so you can reconstruct the full thread tree without an extra API call.
💰 You pay only for results that land. No data → no charge beyond the small actor-start warm-up fee.

Additional capabilities:

Supports any Lemmy instance — lemmy.world, lemmy.ml, beehaw.org, sh.itjust.works, and any public instance running Lemmy v0.19.
Two operating modes from one input — posts only, or posts + comments in the same dataset.
Federated community syntax — pass memes@lemmy.world to scrape a remote community from any other instance, or asklemmy for a local community on the chosen instance.
16-token sort enum: Active, Hot, New, Old, Scaled, Controversial, MostComments, NewComments, TopHour, TopSixHour, TopTwelveHour, TopDay, TopWeek, TopMonth, TopYear, TopAll.
Cursor-based post pagination + integer-page comment pagination — both verified against Lemmy v0.19 on live instances.
Denormalised output — community metadata on every row, no joins needed for downstream analytics or CSV exports.
Pydantic v2 input validation with named sort enum and range bounds; bare Top (invalid on v0.19) is rejected up-front before any network call.
Pairs with bluesky-feed-posts and bluesky-starter-pack as the Federated Social Suite.
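
The retry behaviour listed above (exponential backoff on 408 / 429 / 503, honouring Retry-After, up to 5 attempts per page) can be sketched roughly as follows. This is an illustrative sketch, not the Actor's actual code: the function names, the 1-second base delay, and the 60-second cap are assumptions made for the example.

```python
from typing import Optional

RETRYABLE_STATUSES = {408, 429, 503}  # statuses named in the feature list
MAX_ATTEMPTS = 5                      # "up to 5 attempts per page"


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retrying after failed attempt `attempt` (1-based).

    A numeric Retry-After header wins; otherwise the delay doubles per attempt.
    `base` and `cap` are assumed values, not taken from the Actor.
    """
    if retry_after is not None:
        try:
            return min(max(float(retry_after), 0.0), cap)
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    return min(base * 2 ** (attempt - 1), cap)


def should_retry(status: int, attempt: int) -> bool:
    """True if the status is retryable and attempts remain."""
    return status in RETRYABLE_STATUSES and attempt < MAX_ATTEMPTS
```

A caller would loop over attempts, sleep for `backoff_delay(...)` whenever `should_retry(...)` is true, and surface an error once the attempts run out.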

💡 Use cases

Reddit-alternative migration research — track how communities and engagement migrate from Reddit to Lemmy after policy changes; compare subscriber and post growth across instances.
Newsroom monitoring — subscribe to journalism, politics, or breaking-news Lemmy communities and pipe the latest top posts to Slack or Google Sheets via Apify integrations.
Brand monitoring on the fediverse — Lemmy is a growing channel for product complaints, support discussions, and competitor mentions outside the Reddit walled garden; this Actor surfaces them on a schedule.
Academic federated-social research — Lemmy's public REST API makes it significantly more accessible for longitudinal community studies, sentiment tracking, or content-moderation research than platforms that gate their data.
Community trend analysis — pull TopWeek posts across multiple communities and rank by score, comment count, or upvote ratio to benchmark community health.
Comment-tree reconstruction — combine post rows with comment rows (joined by post_id) and the comment_parent_id field to rebuild the full discussion tree for NLP or moderation pipelines.
NLP corpus building — Lemmy provides Reddit-shaped threaded conversation data useful for sentiment training, RAG pipelines, and discourse modelling without platform-specific OAuth hoops.

⚙️ How to use it

1. Open the Actor input form.
2. Set Lemmy instance URL to the base URL of any public Lemmy instance, e.g. https://lemmy.world or https://lemmy.ml. A trailing slash is stripped automatically.
3. Set Community name to the community you want to scrape: either the local form (asklemmy) for a community on the chosen instance, or the federated form (memes@lemmy.world) for a remote community visible from the chosen instance.
4. Pick a Post sort order from the 16 valid tokens. Hot (default) blends recency and engagement; TopWeek returns the highest-scoring posts of the last 7 days; New returns posts newest first.
5. Adjust Max posts (default 100, max 5,000).
6. Toggle Include comments if you want comment rows alongside post rows. When enabled, set Max comments per post (default 50, max 500).
7. Configure Apify Proxy. We recommend leaving it at the default auto setting; the Actor activates proxy rotation automatically when it encounters rate-limit or connection errors.
8. Click Start. Results stream into the default dataset and can be exported as JSON, CSV, Excel, or XML via the Export button.

Single-community example

{
  "instanceUrl": "https://lemmy.world",
  "communityName": "asklemmy",
  "sort": "TopWeek",
  "maxPosts": 200,
  "includeComments": false
}

Posts + comments example

{
  "instanceUrl": "https://lemmy.ml",
  "communityName": "memes@lemmy.world",
  "sort": "Hot",
  "maxPosts": 50,
  "includeComments": true,
  "maxCommentsPerPost": 25
}

📥 Input

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `instanceUrl` | string | yes | (none) | Base URL of the Lemmy instance (e.g. `https://lemmy.world`) |
| `communityName` | string | yes | (none) | Community name in local form (`asklemmy`) or federated form (`memes@lemmy.world`) |
| `sort` | string (enum) | no | `Hot` | One of 16 valid Lemmy v0.19 sort tokens |
| `maxPosts` | integer | no | `100` | Max post rows emitted (1–5000) |
| `includeComments` | boolean | no | `false` | If true, also fetch comments per post |
| `maxCommentsPerPost` | integer | no | `50` | Max comment rows per post (1–500) |
| `proxyConfiguration` | object | no | auto | Apify Proxy configuration; the default auto setting activates on block |

The full list of valid sort tokens is: Active, Hot, New, Old, Scaled, Controversial, MostComments, NewComments, TopHour, TopSixHour, TopTwelveHour, TopDay, TopWeek, TopMonth, TopYear, TopAll. Bare Top is not valid on Lemmy v0.19 and is rejected by the input validator before any network call.
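
The input rules above (sort tokens, range bounds, trailing-slash stripping) could be checked along these lines. The Actor itself is described as using Pydantic v2; this sketch swaps in a standard-library dataclass so it runs without dependencies, and the class name `ScraperInput` is hypothetical.

```python
from dataclasses import dataclass

# Sort tokens as listed in this README; bare "Top" is deliberately absent.
VALID_SORTS = frozenset({
    "Active", "Hot", "New", "Old", "Scaled", "Controversial",
    "MostComments", "NewComments", "TopHour", "TopSixHour",
    "TopTwelveHour", "TopDay", "TopWeek", "TopMonth", "TopYear", "TopAll",
})


@dataclass
class ScraperInput:
    """Hypothetical stand-in for the Actor's input model."""
    instanceUrl: str
    communityName: str
    sort: str = "Hot"
    maxPosts: int = 100
    includeComments: bool = False
    maxCommentsPerPost: int = 50

    def __post_init__(self) -> None:
        # Normalise the base URL, then validate before any network call.
        self.instanceUrl = self.instanceUrl.rstrip("/")
        if self.sort not in VALID_SORTS:
            raise ValueError(f"invalid sort token: {self.sort!r}")
        if not 1 <= self.maxPosts <= 5000:
            raise ValueError("maxPosts must be between 1 and 5000")
        if not 1 <= self.maxCommentsPerPost <= 500:
            raise ValueError("maxCommentsPerPost must be between 1 and 500")
```

With this sketch, `ScraperInput("https://lemmy.world/", "asklemmy", sort="Top")` raises `ValueError` before anything is fetched.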

📤 Output

One row per post or per comment. Community + post metadata is denormalised onto every comment row so a flat CSV is self-contained.

{
  "row_type": "post",
  "instance_url": "https://lemmy.world",
  "community_actor_id": "https://lemmy.world/c/asklemmy",
  "community_name": "asklemmy",
  "community_title": "Ask Lemmy",
  "community_description": "A Lemmy equivalent of Ask Reddit.",
  "community_subscribers": 39567,
  "community_posts_count": 8853,
  "community_local": true,
  "post_id": 46934299,
  "post_ap_id": "https://leminal.space/post/35446889",
  "post_url": "https://leminal.space/post/35446889",
  "post_title": "What's something that you feel genuinely sad about?",
  "post_body": null,
  "post_external_url": null,
  "post_score": 34,
  "post_upvotes": 38,
  "post_downvotes": 4,
  "post_comments_count": 18,
  "post_published": "2026-05-16T09:31:47.723504Z",
  "post_updated": null,
  "post_nsfw": false,
  "post_featured_community": false,
  "post_locked": false,
  "comment_id": null,
  "comment_ap_id": null,
  "comment_path": null,
  "comment_content": null,
  "comment_score": null,
  "comment_published": null,
  "comment_parent_id": null,
  "author_actor_id": "https://leminal.space/u/FosterMolasses",
  "author_name": "FosterMolasses",
  "author_display_name": null,
  "author_bot_account": false,
  "author_published": "2025-01-17T14:35:51.850105Z",
  "scraped_at": "2026-06-01T12:00:00.000Z"
}

Comment rows have the same shape, with row_type="comment" and the comment fields populated. The comment_parent_id field is derived from the penultimate segment of comment.path — null for top-level comments, otherwise the integer id of the parent comment — so callers can reconstruct the full thread tree without an extra API call.
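
As a rough illustration of the derivation described above, the parent id can be read from the penultimate segment of the path. The function name is hypothetical, and the sketch assumes paths always start with the root segment `0`, as in the examples in this README.

```python
from typing import Optional


def parent_id_from_path(path: str) -> Optional[int]:
    """Return the parent comment id encoded in a Lemmy comment path.

    "0.12345"       -> None   (top-level: the penultimate segment is the root "0")
    "0.12345.67890" -> 12345  (reply to comment 12345)
    """
    segments = path.split(".")
    if len(segments) < 3:
        return None
    return int(segments[-2])
```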

Export formats

JSON: full fidelity, all fields (use the JSONL format if you need newline-delimited output)
CSV: flat, one row per post or comment, all columns including denormalised community + post metadata
Excel: .xlsx via the Apify dataset converter
XML: structured per-item

All formats are available via the Apify API: GET https://api.apify.com/v2/datasets/{id}/items?format=csv&clean=true.
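
A small helper for building that export URL might look like the sketch below. The `https://api.apify.com/v2` base is the Apify API's public base URL; the function name and the `DATASET_ID` placeholder are illustrative, and depending on the dataset's access settings a request may also need an API token.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "csv", clean: bool = True) -> str:
    """Build the dataset-items export URL for a given format."""
    query = urlencode({"format": fmt, "clean": "true" if clean else "false"})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


# DATASET_ID is a placeholder; substitute the id of your run's default dataset.
print(dataset_items_url("DATASET_ID"))
# https://api.apify.com/v2/datasets/DATASET_ID/items?format=csv&clean=true
```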

💰 Pricing

Pay-Per-Event (PPE) — you pay only for what you use:

| Event | Price (USD) | When |
| --- | --- | --- |
| `actor-start` | $0.05 | Once per run, at boot |
| `result-row` | $0.002 | Per post row written to the dataset |
| `result-row-comment` | $0.001 | Per comment row written to the dataset |

Example costs

| Run | Cost |
| --- | --- |
| 100 posts (no comments) | $0.25 |
| 500 posts (no comments) | $1.05 |
| 1,000 posts (no comments) | $2.05 |
| 100 posts × 50 comments = 5,100 rows | $5.25 |
| 500 posts × 25 comments = 13,000 rows | $13.55 |

Comment rows are half the price of post rows because they are higher-volume and have lower per-row commercial value for most analytics use cases. Disable comment fetching when you only need post-level data — the Actor runs faster and costs less.
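
To estimate a run before starting it, the event prices above can be combined as in this sketch. It assumes every post returns exactly `comments_per_post` comments, so for real runs it is an upper bound; the function name is illustrative.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.05")   # actor-start, once per run
POST_ROW = Decimal("0.002")     # result-row, per post row
COMMENT_ROW = Decimal("0.001")  # result-row-comment, per comment row


def run_cost(posts: int, comments_per_post: int = 0) -> Decimal:
    """Estimated cost in USD for a run emitting the given number of rows."""
    comment_rows = posts * comments_per_post
    return ACTOR_START + posts * POST_ROW + comment_rows * COMMENT_ROW
```

For example, `run_cost(1000)` equals `Decimal("2.05")`, matching the headline price per 1,000 posts. `Decimal` is used instead of floats so that fractions of a cent add up exactly.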

🚧 Limitations

Public communities only. Private or access-restricted communities require authentication and are out of scope.
One community per run. This Actor scrapes one community per run; multi-community or instance-wide listings need separate runs or a different Actor.
No media download. Image, video, and external link URLs are captured in the row, but the Actor does not download the media content itself.
No real-time streaming. The Actor takes a snapshot at run time; for live updates schedule recurring runs via Apify Schedules.
7-day default storage retention on the Apify FREE tier. Export your dataset immediately after the run or upgrade for longer retention.
Lemmy v0.19 only. The Actor targets the v0.19 API surface and field paths. Pydantic extra="ignore" absorbs additive changes, but breaking removals require a version update.
Comment-tree depth is preserved via comment_path but the Actor does not return comments in a pre-nested structure — callers reconstruct the tree from comment_parent_id.
Federation deduplication is the caller's responsibility. A post on c/news@lemmy.world may appear on multiple instances; use community_actor_id to identify the same community across multi-instance runs, and post_ap_id to identify the same post.

❓ FAQ

Do I need a Lemmy account or API key?

Lemmy's public /api/v3/ REST API, the same HTTP API that Lemmy's own web UI and third-party apps use, accepts unauthenticated reads for public communities. This Actor uses that public interface, so no login, signup, or API key is required on your end. We still handle rate-limit backoff and proxy rotation on our side so your run doesn't stall mid-export.

What is the federated community format?

Lemmy communities can live on any instance but be subscribed to and read from any other instance via ActivityPub federation. The federated form is community@instance.tld — for example memes@lemmy.world refers to the memes community hosted on lemmy.world, accessible from any other Lemmy instance that has fetched it via federation. Local communities (on the instance you point this Actor at) use the bare local name like asklemmy.
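
The two forms can be told apart with a simple split on `@`. The helper below is a hypothetical sketch of that parsing, not the Actor's own validation logic.

```python
from typing import Optional, Tuple


def split_community(name: str) -> Tuple[str, Optional[str]]:
    """Split "memes@lemmy.world" into ("memes", "lemmy.world").

    A bare local name such as "asklemmy" yields ("asklemmy", None).
    """
    local, sep, host = name.strip().partition("@")
    if not local or (sep and not host):
        raise ValueError(f"not a valid community name: {name!r}")
    return local, (host or None)
```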

Why does sort=Top not work?

Lemmy v0.19 removed the bare Top sort token and replaced it with compound tokens that embed a time range: TopHour, TopSixHour, TopTwelveHour, TopDay, TopWeek, TopMonth, TopYear, TopAll. Passing bare Top returns {"error": "unknown"} from the Lemmy API. This Actor rejects bare Top up-front during input validation — pick the compound token whose time range you want.

Which Lemmy instance should I point this at?

Any public Lemmy instance running v0.19. lemmy.world and lemmy.ml are the two largest general-purpose instances and good defaults. Topic-specific instances like beehaw.org, sh.itjust.works, or programming.dev are also fully supported. The instance URL determines which "view" of the federated network the Actor scrapes from — different instances may have fetched different remote communities, so re-running against a different instance can return slightly different remote-community data.

How do I reconstruct the comment tree from the dataset?

Each comment row carries comment_id and comment_parent_id. Group rows by post_id, then for each post: top-level comments have comment_parent_id IS NULL; replies have comment_parent_id pointing at the parent comment's comment_id within the same post. The comment_path field encodes the full ancestry (e.g. 0.12345.67890 means parent comment 12345, this comment 67890) for callers who need it.
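
The grouping described above could be implemented along these lines. This is a minimal sketch that assumes all rows come from a single run against one instance (so comment_id values are unique) and that rows are plain dicts with the column names from the field table; the function name and the "replies" key are illustrative.

```python
from collections import defaultdict
from typing import Any, Dict, List


def build_comment_trees(rows: List[Dict[str, Any]]) -> Dict[int, List[Dict[str, Any]]]:
    """Map each post_id to its top-level comments, with replies nested under "replies"."""
    # Index every comment row by its id and give it an empty reply list.
    nodes: Dict[int, Dict[str, Any]] = {}
    for row in rows:
        if row.get("row_type") == "comment":
            nodes[row["comment_id"]] = {**row, "replies": []}

    roots: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for node in nodes.values():
        parent = nodes.get(node["comment_parent_id"])
        if parent is not None and parent["post_id"] == node["post_id"]:
            parent["replies"].append(node)
        else:
            # Top-level comment, or a reply whose parent was not exported
            # (for example, cut off by maxCommentsPerPost): treat it as a root.
            roots[node["post_id"]].append(node)
    return dict(roots)
```

Treating orphaned replies as roots is a design choice for this sketch; a stricter pipeline might drop them or flag them instead.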

Is this the right tool for building an NLP corpus from Lemmy data?

Yes. Lemmy provides Reddit-shaped threaded conversation data — post title, body, score, and a full comment tree with parent-child relationships — without requiring platform-specific OAuth credentials. Run the Actor against multiple communities across multiple instances, join on community_actor_id to deduplicate federation mirrors, and you have a clean conversation corpus ready for sentiment training, RAG indexing, or discourse analysis.

Is scraping public Lemmy data legal?

Lemmy is free and open-source software licensed under AGPL-3.0. Lemmy instances serve public community data through an unauthenticated REST API, and the same content is shared with other servers via ActivityPub federation. Always verify the specific instance's terms of service and your local jurisdiction's data-protection rules before using scraped data for commercial purposes.

Part of the Devil Scrapes Federated Social Suite:

Bluesky Starter Pack Scraper — export full member lists from any public Bluesky Starter Pack via the AT Protocol public API.
Bluesky Feed Posts Scraper — export posts from any public Bluesky custom feed or algorithm feed, with denormalised feed metadata on every row.

All three Actors share consistent pricing ($0.002 per post row, $0.05 per run) and field-naming conventions (<entity>_<field> snake_case) so cross-network federated-social analyses can join cleanly on author_handle / author_name.

💬 Your feedback

Found a bug, hit a rate limit, or need a new field on the output row? Open an issue on the Actor's Apify Store page or contact the Devil Scrapes team at apify.com/DevilScrapes. We ship updates within days of validated reports.
