# Reddit Comment Scraper Pro (`crawlerbros/reddit-comment-scraper-pro`) Actor

Scrape comments from any Reddit post with advanced filters (minScore, maxDepth, excludeDeleted, authorFilter, keywordFilter) and rich per-comment fields: awards, gildedCount, controversiality, repliesCount, parentCommentId, body, bodyHtml, subreddit, permalink. No login required.

- **URL**: https://apify.com/crawlerbros/reddit-comment-scraper-pro.md
- **Developed by:** [Crawler Bros](https://apify.com/crawlerbros) (community)
- **Categories:** Social media, Developer tools, Automation
- **Stats:** 1 total users, 0 monthly users, 100.0% runs succeeded, 13 bookmarks
- **User rating**: 5.00 out of 5 stars

## Pricing

from $1.00 / 1,000 results

This Actor is paid per event plus platform usage: you are charged both a fixed price for specific events and for Apify platform usage.
Since this Actor supports Apify Store discounts, the higher your subscription plan, the lower the price.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, covering all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Reddit Comment Scraper

An Apify Actor for scraping comments from Reddit posts using browser automation with Playwright.

### Features

- 💬 Scrape comments from multiple Reddit posts
- 📊 Extract comprehensive comment data (text, author, score, timestamps, etc.)
- 🔄 Automatically expand collapsed threads and "load more" sections
- 🌳 Capture nested comment structure with depth levels
- 📦 No authentication required for public posts
- 💾 Data saved in structured JSON format
- 🌐 Browser automation bypasses API restrictions

### Input Parameters

The Actor accepts the following input parameters:

| Parameter       | Type    | Required | Default | Description                                                     |
| --------------- | ------- | -------- | ------- | --------------------------------------------------------------- |
| `postUrls`      | array   | Yes      | -       | List of Reddit post URLs to scrape comments from                |
| `maxComments`   | integer | No       | `100`   | Maximum number of comments to scrape from each post (1-10000)   |
| `expandThreads` | boolean | No       | `true`  | Automatically expand collapsed threads and "load more" sections |

#### Example Input

```json
{
  "postUrls": [
    "https://www.reddit.com/r/programming/comments/1abc123/interesting_discussion/",
    "https://old.reddit.com/r/python/comments/1def456/another_post/"
  ],
  "maxComments": 200,
  "expandThreads": true
}
```

### Output Fields

The Actor extracts the following data for each comment:

#### Comment Information

- `comment_id` - Unique comment ID (e.g., "abc123xyz")
- `comment_name` - Full comment name in Reddit format (e.g., "t1\_abc123xyz")
- `author` - Username of the comment author (or "\[deleted]")
- `text` - Full comment text/content

#### Engagement Metrics

- `score` - Comment score/karma (upvotes minus downvotes)
- `awards_count` - Number of awards/gildings the comment received

#### Links

- `permalink` - Direct link to the comment
- `post_url` - URL of the parent post

#### Metadata

- `depth` - Nesting level/depth in the comment thread (0 = top-level)
- `parent_comment_id` - ID of the parent comment (null for top-level comments)
- `is_op` - Boolean indicating if the author is the Original Poster
- `is_edited` - Boolean indicating if the comment was edited
- `is_stickied` - Boolean indicating if the comment is stickied/pinned

#### Timestamps

- `created_utc` - Unix timestamp when the comment was created
- `created_at` - ISO 8601 formatted datetime (e.g., "2025-10-14T12:30:45")
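
Both timestamp fields encode the same instant: `created_at` is `created_utc` rendered as ISO 8601. A minimal sketch of the conversion (assuming the Actor formats in UTC, which is not stated explicitly):

```python
from datetime import datetime, timezone

def to_iso(created_utc: int) -> str:
    """Render a Unix timestamp in the created_at format (ISO 8601, no offset)."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
```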

#### Example Output

```json
{
  "comment_id": "abc123xyz",
  "comment_name": "t1_abc123xyz",
  "author": "example_user",
  "text": "This is a great discussion! I totally agree with your points about...",
  "score": 42,
  "awards_count": 2,
  "permalink": "https://old.reddit.com/r/programming/comments/1abc123/_/abc123xyz/",
  "post_url": "https://old.reddit.com/r/programming/comments/1abc123/interesting_discussion/",
  "depth": 0,
  "parent_comment_id": null,
  "is_op": false,
  "is_edited": true,
  "is_stickied": false,
  "created_utc": 1760445045,
  "created_at": "2025-10-14T12:30:45"
}
```
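
Because items arrive as a flat list, `parent_comment_id` (together with `comment_id`) is enough to rebuild the thread structure. A minimal sketch, assuming items shaped like the example above:

```python
from collections import defaultdict

def build_thread_tree(comments):
    """Group flat comment records into a parent-id -> children mapping.

    Top-level comments (parent_comment_id is None) are keyed under None.
    """
    children = defaultdict(list)
    for c in comments:
        children[c["parent_comment_id"]].append(c)
    return children

def walk(children, parent_id=None, indent=0):
    """Yield (indent, comment) pairs in depth-first thread order."""
    for c in children.get(parent_id, []):
        yield indent, c
        yield from walk(children, c["comment_id"], indent + 1)
```

For example, `for depth, c in walk(build_thread_tree(items)): print("  " * depth + c["author"])` prints the dataset back as an indented thread.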

### Usage

#### Local Development

1. **Install dependencies**:

   ```bash
   pip install -r requirements.txt
   playwright install chromium
   ```

2. **Set up input** in `storage/key_value_stores/default/INPUT.json`:

   ```json
   {
     "postUrls": ["https://www.reddit.com/r/programming/comments/1example/"],
     "maxComments": 100,
     "expandThreads": true
   }
   ```

3. **Run the Actor**:

   ```bash
   python -m src
   ```

4. **Check results** in `storage/datasets/default/`
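
When running locally, the SDK typically writes each dataset item as a separate JSON file in that directory. A small sketch for loading the results back (assuming one item per `*.json` file, which matches the SDK's default local storage layout):

```python
import json
import pathlib

def load_local_results(dataset_dir="storage/datasets/default"):
    """Read all dataset items written by a local run, in file-name order."""
    items = []
    for path in sorted(pathlib.Path(dataset_dir).glob("*.json")):
        items.append(json.loads(path.read_text()))
    return items
```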

#### On Apify Platform

1. **Push to Apify**:

   - Login to Apify CLI: `apify login`
   - Initialize: `apify init` (if not already done)
   - Push to Apify: `apify push`

2. **Or manually upload**:

   - Create a new actor on Apify platform
   - Upload all files including `Dockerfile`, `requirements.txt`, and `.actor/` directory

3. **Configure and run**:
   - Set input parameters in the Apify console
   - Paste Reddit post URLs
   - Click "Start" to run the Actor
   - Download results from the dataset tab

### Technical Details

#### Browser Automation

- Uses **Playwright** with Chromium browser
- Scrapes `old.reddit.com` for better compatibility and simpler HTML structure
- Implements anti-detection measures:
  - Custom User-Agent headers
  - Disabled automation flags
  - Browser fingerprint masking
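
An illustrative sketch of what such options can look like in Playwright terms; these are common choices for the measures listed above, not the Actor's actual values:

```python
# These dicts would be passed as chromium.launch(**LAUNCH_OPTIONS) and
# browser.new_context(**CONTEXT_OPTIONS) in Playwright.

LAUNCH_OPTIONS = {
    "headless": True,
    "args": [
        # Hides the navigator.webdriver automation flag in Chromium
        "--disable-blink-features=AutomationControlled",
    ],
}

CONTEXT_OPTIONS = {
    # A common desktop Chrome User-Agent instead of Playwright's default
    "user_agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
}
```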

#### Features

- **Automatic thread expansion**: Clicks "load more" and "continue this thread" buttons
- **Smart extraction**: Handles nested comments and preserves thread structure
- **Depth tracking**: Captures comment nesting levels
- **Parent-child relationships**: Links comments to their parents
- **Error handling**: Gracefully handles deleted comments and missing data

#### Comment Expansion

The scraper automatically:

1. Clicks "load more comments" buttons (up to 10 per attempt)
2. Clicks "continue this thread" links (up to 5 per attempt)
3. Makes up to 3 expansion attempts to maximize comment coverage
4. Waits for new comments to load after each expansion
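
The loop above can be sketched as follows. `page` is assumed to be a Playwright `Page`; the `:has-text()` selectors are illustrative guesses at the old.reddit.com link texts, not the Actor's actual selectors:

```python
import time

def expand_comments(page, attempts=3, more_per_attempt=10, continue_per_attempt=5):
    """Repeatedly click expansion links until nothing is left or attempts run out."""
    for _ in range(attempts):
        clicked = 0
        # 1) "load more comments" buttons
        for link in page.query_selector_all("a:has-text('load more comments')")[:more_per_attempt]:
            link.click()
            clicked += 1
        # 2) "continue this thread" links
        for link in page.query_selector_all("a:has-text('continue this thread')")[:continue_per_attempt]:
            link.click()
            clicked += 1
        if clicked == 0:
            break  # nothing left to expand
        time.sleep(1)  # wait for new comments to load
```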

#### Performance

- Headless browser mode for efficiency
- Optimized page load strategy (`domcontentloaded`)
- Configurable wait times and timeouts
- Multiple posts processed sequentially, with delays between them

### Limitations

- Only works with public Reddit posts
- Cannot scrape private or restricted posts
- Browser automation is slower than direct API calls but more reliable
- Hidden scores show as 0 (when "\[score hidden]" is displayed)
- Maximum 10,000 comments per post (configurable)

### Dependencies

- `apify>=2.1.0` - Apify SDK for Python
- `playwright~=1.40.0` - Browser automation framework
- `beautifulsoup4~=4.12.0` - HTML parsing library

### Troubleshooting

#### Timeout Issues

If you encounter timeout errors:

- Check if the post URL is valid and accessible
- Increase timeout values in the code if needed
- Verify the post has comments

#### Missing Comments

If some comments are missing:

- Enable `expandThreads` to load collapsed comments
- Increase `maxComments` limit
- Some comments may be deleted or removed by moderators

#### "\[deleted]" Authors

- Comments from deleted accounts show "\[deleted]" as author
- This is normal Reddit behavior
- The comment text may still be available or show as "\[removed]"

### Use Cases

- **Sentiment Analysis**: Analyze community opinions on topics
- **Market Research**: Gather user feedback and discussions
- **Content Moderation**: Monitor discussions for moderation
- **Academic Research**: Study online community interactions
- **Data Analysis**: Build datasets for machine learning

### License

This Actor is provided as-is for scraping public Reddit data in accordance with Reddit's terms of service.

### Notes

- This scraper uses browser automation to access Reddit's public web interface
- Always respect Reddit's robots.txt and terms of service
- Use responsibly and avoid overwhelming Reddit's servers
- Consider implementing additional rate limiting for large-scale scraping
- The actor works best with the Apify platform's infrastructure
- Posts with thousands of comments may take longer to scrape

# Actor input Schema

## `postUrls` (type: `array`):

List of Reddit post URLs to scrape comments from.

## `maxComments` (type: `integer`):

Max comments per post.

## `expandThreads` (type: `boolean`):

Auto-expand collapsed comment threads and 'load more' sections.

## `minScore` (type: `integer`):

Drop comments with score below this number.

## `maxDepth` (type: `integer`):

Drop comments deeper than N levels (0 = top-level only).

## `excludeDeleted` (type: `boolean`):

Drop comments where author is `[deleted]` or text is `[removed]`.

## `authorFilter` (type: `string`):

Only emit comments by this author (case-insensitive substring match).

## `keywordFilter` (type: `string`):

Only emit comments whose body contains this substring (case-insensitive). Prefix with `!` to invert (`!spam` drops comments mentioning spam).

## `minWordCount` (type: `integer`):

Drop one-liner comments. Counts whitespace-separated tokens in the body.

## `maxWordCount` (type: `integer`):

Drop walls of text — useful when surveying short reactions only.
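
Taken together, the filters act as a single per-comment predicate. A sketch of the combined logic using the documented semantics (field names follow the output schema; this is an illustration, not the Actor's source):

```python
def passes_filters(comment, options):
    """Return True if `comment` survives every configured filter.

    `comment` uses the output field names (score, depth, author, text);
    `options` uses the input parameter names.
    """
    if (v := options.get("minScore")) is not None and comment["score"] < v:
        return False
    if (v := options.get("maxDepth")) is not None and comment["depth"] > v:
        return False
    if options.get("excludeDeleted") and (
        comment["author"] == "[deleted]" or comment["text"] == "[removed]"
    ):
        return False
    if (v := options.get("authorFilter")):
        if v.lower() not in comment["author"].lower():
            return False
    if (v := options.get("keywordFilter")):
        invert = v.startswith("!")
        needle = (v[1:] if invert else v).lower()
        found = needle in comment["text"].lower()
        if found == invert:  # drop when inverted-and-found, or plain-and-missing
            return False
    words = len(comment["text"].split())
    if (v := options.get("minWordCount")) is not None and words < v:
        return False
    if (v := options.get("maxWordCount")) is not None and words > v:
        return False
    return True
```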

## Actor input object example

```json
{
  "postUrls": [
    "https://www.reddit.com/r/programming/comments/1s9jkzi/announcement_temporary_llm_content_ban/"
  ],
  "maxComments": 100,
  "expandThreads": true,
  "excludeDeleted": false
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "postUrls": [
        "https://www.reddit.com/r/programming/comments/1s9jkzi/announcement_temporary_llm_content_ban/"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("crawlerbros/reddit-comment-scraper-pro").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "postUrls": ["https://www.reddit.com/r/programming/comments/1s9jkzi/announcement_temporary_llm_content_ban/"] }

# Run the Actor and wait for it to finish
run = client.actor("crawlerbros/reddit-comment-scraper-pro").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "postUrls": [
    "https://www.reddit.com/r/programming/comments/1s9jkzi/announcement_temporary_llm_content_ban/"
  ]
}' |
apify call crawlerbros/reddit-comment-scraper-pro --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=crawlerbros/reddit-comment-scraper-pro",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Reddit Comment Scraper Pro",
        "description": "Scrape comments from any Reddit post with advanced filters (minScore, maxDepth, excludeDeleted, authorFilter, keywordFilter) and rich per-comment fields: awards, gildedCount, controversiality, repliesCount, parentCommentId, body, bodyHtml, subreddit, permalink. No login required.",
        "version": "1.0",
        "x-build-id": "MaP3WhPTJXAxgdpU9"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/crawlerbros~reddit-comment-scraper-pro/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-crawlerbros-reddit-comment-scraper-pro",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/crawlerbros~reddit-comment-scraper-pro/runs": {
            "post": {
                "operationId": "runs-sync-crawlerbros-reddit-comment-scraper-pro",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/crawlerbros~reddit-comment-scraper-pro/run-sync": {
            "post": {
                "operationId": "run-sync-crawlerbros-reddit-comment-scraper-pro",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "postUrls"
                ],
                "properties": {
                    "postUrls": {
                        "title": "Reddit post URLs",
                        "type": "array",
                        "description": "List of Reddit post URLs to scrape comments from.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxComments": {
                        "title": "Max comments per post",
                        "minimum": 1,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Max comments per post.",
                        "default": 100
                    },
                    "expandThreads": {
                        "title": "Expand collapsed threads",
                        "type": "boolean",
                        "description": "Auto-expand collapsed comment threads and 'load more' sections.",
                        "default": true
                    },
                    "minScore": {
                        "title": "Min score (filter)",
                        "minimum": -10000,
                        "maximum": 10000000,
                        "type": "integer",
                        "description": "Drop comments with score below this number."
                    },
                    "maxDepth": {
                        "title": "Max comment depth",
                        "minimum": 0,
                        "maximum": 50,
                        "type": "integer",
                        "description": "Drop comments deeper than N levels (0 = top-level only)."
                    },
                    "excludeDeleted": {
                        "title": "Exclude deleted/removed",
                        "type": "boolean",
                        "description": "Drop comments where author is `[deleted]` or text is `[removed]`.",
                        "default": false
                    },
                    "authorFilter": {
                        "title": "Author filter (substring)",
                        "type": "string",
                        "description": "Only emit comments by this author (case-insensitive substring match)."
                    },
                    "keywordFilter": {
                        "title": "Keyword filter (substring in body)",
                        "type": "string",
                        "description": "Only emit comments whose body contains this substring (case-insensitive). Prefix with `!` to invert (`!spam` drops comments mentioning spam)."
                    },
                    "minWordCount": {
                        "title": "Min word count",
                        "minimum": 0,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Drop one-liner comments. Counts whitespace-separated tokens in the body."
                    },
                    "maxWordCount": {
                        "title": "Max word count",
                        "minimum": 1,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Drop walls of text — useful when surveying short reactions only."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
