# Reddit Scraper Pro (`crawlerbros/reddit-scraper-pro`) Actor

Scrape Reddit subreddit posts with advanced filters (`keywordFilter`, `minScore`, `maxAgeDays`, `excludeStickied`, `excludeNsfw`, `authorBlocklist`, `domainAllowlist`/`domainBlocklist`). Adds `is_video`, `awards`, `gilded`, `upvote_ratio`, and `media_metadata` to every post. Browser-based, no login required.

- **URL**: https://apify.com/crawlerbros/reddit-scraper-pro.md
- **Developed by:** [Crawler Bros](https://apify.com/crawlerbros) (community)
- **Categories:** Social media, Developer tools, Other
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded, 13 bookmarks
- **User rating**: 5.00 out of 5 stars

## Pricing

from $1.00 / 1,000 results

This Actor is priced per event and usage: you are charged a fixed price for specific events plus standard Apify platform usage.
Because this Actor supports Apify Store discounts, the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
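As a rough back-of-the-envelope check (illustrative arithmetic, not an official quote), at the listed rate of $1.00 per 1,000 results, scraping 50 posts from each of 3 subreddits would incur about $0.15 in event charges, before platform usage and any subscription discount:

```python
# Rough cost estimate at the listed rate; platform usage and
# subscription discounts are not included.
rate_per_result = 1.00 / 1000   # $1.00 per 1,000 results
results = 3 * 50                # 3 subreddits x 50 posts each
estimated_usd = results * rate_per_result
print(f"${estimated_usd:.2f}")  # $0.15
```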

## What's an Apify Actor?

Actors are software tools running on the Apify platform, used for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Note that "Actor" is always written with a capital "A".

## How to integrate an Actor?

To integrate an Actor into your project, choose the approach that fits your stack, and aim for an integration that is safe, well-documented, and production-ready.
The recommended options are as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).
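With plain REST, the synchronous run endpoint (documented in the OpenAPI specification in the [API](#api) section) accepts an HTTP POST of the Actor's JSON input. The helper below is a sketch that only builds the request URL; note that Actor IDs use `~` instead of `/` in API paths:

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"

def run_sync_dataset_url(actor_id: str, token: str) -> str:
    """Build the URL of the run-sync-get-dataset-items endpoint.

    POSTing the Actor's JSON input to this URL runs the Actor and
    returns its dataset items in the response.
    """
    path_id = actor_id.replace("/", "~")  # API paths use `~` inside Actor IDs
    return f"{API_BASE}/acts/{path_id}/run-sync-get-dataset-items?{urlencode({'token': token})}"

url = run_sync_dataset_url("crawlerbros/reddit-scraper-pro", "<YOUR_API_TOKEN>")
```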

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Reddit Subreddit Scraper

An Apify Actor for scraping posts from Reddit subreddits using browser automation with Playwright.

### Features

- 🎯 Scrape multiple subreddits in a single run
- 📊 Extract comprehensive post data (title, author, score, comments, etc.)
- 🔄 Support for different sorting methods (hot, new, top, rising, controversial)
- ⏰ Time filters for "top" and "controversial" posts
- 📦 No authentication required for public subreddits
- 💾 Data saved in structured JSON format
- 🌐 Browser automation bypasses API restrictions
- 🔄 Automatic pagination support

### Input Parameters

The actor accepts the following input parameters:

| Parameter    | Type    | Required | Default      | Description                                                                          |
| ------------ | ------- | -------- | ------------ | ------------------------------------------------------------------------------------ |
| `subreddits` | array   | Yes      | `["python"]` | List of subreddit names to scrape (without 'r/' prefix)                              |
| `maxPosts`   | integer | No       | `25`         | Maximum number of posts to scrape from each subreddit (1-1000)                       |
| `sort`       | string  | No       | `"hot"`      | How to sort posts: `hot`, `new`, `top`, `rising`, or `controversial`                 |
| `timeFilter` | string  | No       | `"day"`      | Time filter for 'top'/'controversial': `hour`, `day`, `week`, `month`, `year`, `all` |

#### Example Input

```json
{
  "subreddits": ["islamabad", "pakistan", "programming"],
  "maxPosts": 50,
  "sort": "hot",
  "timeFilter": "day"
}
````

### Output Fields

The actor extracts the following data for each post:

#### Subreddit Information

- `subreddit` - Subreddit name (e.g., "islamabad")
- `subreddit_prefixed` - Subreddit name with r/ prefix (e.g., "r/islamabad")

#### Post Content

- `post_id` - Unique post ID (e.g., "1kql1t5")
- `post_name` - Full post name in Reddit format (e.g., "t3_1kql1t5")
- `title` - Post title
- `author` - Username of the post author
- `selftext` - Text content preview (first 1000 chars, for self posts only)

#### Engagement Metrics

- `score` - Post score/karma (upvotes minus downvotes)
- `num_comments` - Number of comments on the post

#### Links

- `url` - URL of the linked content (external URL or permalink for self posts)
- `permalink` - Direct link to the Reddit post

#### Metadata

- `domain` - Domain of the linked content (e.g., "self.islamabad" for text posts)
- `is_self_post` - Boolean indicating if it's a text post (true) or link post (false)
- `link_flair` - Post flair/tag text (if any)
- `thumbnail_url` - URL of the post thumbnail image (if any)

#### Timestamps

- `created_utc` - Unix timestamp when the post was created
- `created_at` - ISO 8601 formatted datetime (e.g., "2025-05-19T19:40:28")
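The two timestamp fields are two renderings of the same instant: for the example output below, `created_at` is the UTC rendering of `created_utc` (assuming the Actor formats timestamps in UTC):

```python
from datetime import datetime, timezone

created_utc = 1747683628  # value from the example output below
created_at = datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
print(created_at)  # 2025-05-19T19:40:28
```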

#### Flags

- `is_stickied` - Boolean indicating if the post is stickied/pinned
- `is_locked` - Boolean indicating if the post is locked (no new comments)
- `is_nsfw` - Boolean indicating if the post is marked as NSFW (over 18)

#### Example Output

```json
{
  "subreddit": "islamabad",
  "subreddit_prefixed": "r/islamabad",
  "post_id": "1kql1t5",
  "post_name": "t3_1kql1t5",
  "title": "Everyone's always asking what to do in Islamabad - I made a list",
  "author": "hafmaestro",
  "selftext": "Note: I have not mentioned normal restaurants and cafes...",
  "score": 595,
  "num_comments": 101,
  "url": "https://old.reddit.com/r/islamabad/comments/1kql1t5/...",
  "permalink": "https://old.reddit.com/r/islamabad/comments/1kql1t5/...",
  "domain": "self.islamabad",
  "is_self_post": true,
  "link_flair": "Islamabad",
  "thumbnail_url": null,
  "created_utc": 1747683628,
  "created_at": "2025-05-19T19:40:28",
  "is_stickied": false,
  "is_locked": false,
  "is_nsfw": false
}
```

### Usage

#### Local Development

1. **Install dependencies**:

   ```bash
   pip install -r requirements.txt
   playwright install chromium
   ```

2. **Set up input** in `storage/key_value_stores/default/INPUT.json`:

   ```json
   {
     "subreddits": ["python"],
     "maxPosts": 25,
     "sort": "hot"
   }
   ```

3. **Run the actor**:

   ```bash
   python -m src
   ```

4. **Check results** in `storage/datasets/default/`

#### On Apify Platform

1. **Push to Apify**:

   - Login to Apify CLI: `apify login`
   - Initialize: `apify init` (if not already done)
   - Push to Apify: `apify push`

2. **Or manually upload**:

   - Create a new actor on Apify platform
   - Upload all files including `Dockerfile`, `requirements.txt`, and `.actor/` directory

3. **Configure and run**:
   - Set input parameters in the Apify console
   - Click "Start" to run the actor
   - Download results from the dataset tab

### Technical Details

#### Browser Automation

- Uses **Playwright** with Chromium browser
- Scrapes `old.reddit.com` for better compatibility and simpler HTML structure
- Implements anti-detection measures:
  - Custom User-Agent headers
  - Disabled automation flags
  - Browser fingerprint masking

#### Features

- **Automatic pagination**: Clicks "next" button to load more posts
- **Smart selectors**: Multiple fallback CSS selectors for reliability
- **Error handling**: Screenshots saved on errors for debugging
- **Rate limiting**: Built-in delays between requests

#### Performance

- Headless browser mode for efficiency
- Optimized page load strategy (`domcontentloaded`)
- Configurable wait times and timeouts

### Limitations

- Only works with public subreddits
- Cannot scrape private or restricted communities
- Browser automation is slower than direct API calls but more reliable
- Selftext preview limited to first 1000 characters

### Dependencies

- `apify>=2.1.0` - Apify SDK for Python
- `playwright~=1.40.0` - Browser automation framework
- `beautifulsoup4~=4.12.0` - HTML parsing library

### Troubleshooting

#### Timeout Issues

If you encounter timeout errors:

- Check the debug screenshots in the key-value store
- Increase timeout values in the code
- Verify the subreddit exists and is public

#### No Posts Found

- Verify the subreddit name is correct (without 'r/' prefix)
- Check if the subreddit has posts for the selected sort method
- Review logs for detailed error messages

### License

This actor is provided as-is for scraping public Reddit data in accordance with Reddit's terms of service.

### Notes

- This scraper uses browser automation to access Reddit's public web interface
- Always respect Reddit's robots.txt and terms of service
- Use responsibly and avoid overwhelming Reddit's servers
- Consider implementing additional rate limiting for large-scale scraping
- The actor works best with the Apify platform's infrastructure
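For the additional rate limiting suggested above, a minimal client-side throttle is often enough. This is an illustrative sketch, not part of the Actor's code:

```python
import time

class RateLimiter:
    """Minimal fixed-delay rate limiter: at most one call per `interval` seconds."""

    def __init__(self, interval: float):
        self.interval = interval
        self._last = 0.0

    def wait(self) -> None:
        # Sleep just long enough to keep `interval` seconds between calls.
        elapsed = time.monotonic() - self._last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last = time.monotonic()

limiter = RateLimiter(interval=0.05)  # 50 ms between requests for the demo
for _ in range(3):
    limiter.wait()  # would wrap each page fetch in a real scraper
```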

# Actor input Schema

## `subreddits` (type: `array`):

List of subreddit names without `r/` prefix.

## `maxPosts` (type: `integer`):

Max posts per subreddit.

## `sort` (type: `string`):

How to sort posts in the subreddit.

## `timeFilter` (type: `string`):

Time range for `top` or `controversial` sorts.

## `keywordFilter` (type: `string`):

Only emit posts whose title or content contains this substring (case-insensitive).

## `minScore` (type: `integer`):

Drop posts with score below this number.

## `maxAgeDays` (type: `integer`):

Drop posts older than N days.

## `excludeStickied` (type: `boolean`):

Drop pinned/stickied posts (typically mod announcements).

## `excludeNsfw` (type: `boolean`):

Drop NSFW posts.

## `excludeOriginalContent` (type: `boolean`):

Drop posts marked as Original Content.

## `authorBlocklist` (type: `array`):

Drop posts by these usernames (case-insensitive).

## `domainAllowlist` (type: `array`):

If set, only emit posts whose link domain is in this list (e.g. `["github.com", "arxiv.org"]`). Self-posts pass.

## `domainBlocklist` (type: `array`):

Drop posts whose link domain is in this list.
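Taken together, the filters above behave as a single predicate applied to each post. The sketch below approximates the documented semantics using the Actor's output field names; it is an illustration, not the Actor's actual implementation:

```python
from datetime import datetime, timezone

def passes_filters(post, *, keyword_filter=None, min_score=None, max_age_days=None,
                   exclude_stickied=False, exclude_nsfw=False,
                   author_blocklist=(), domain_allowlist=(), domain_blocklist=()):
    """Return True if `post` (a dict of output fields) survives every filter."""
    if keyword_filter:
        # Case-insensitive substring match against title and selftext.
        haystack = (post.get("title", "") + " " + post.get("selftext", "")).lower()
        if keyword_filter.lower() not in haystack:
            return False
    if min_score is not None and post.get("score", 0) < min_score:
        return False
    if max_age_days is not None:
        age_days = (datetime.now(timezone.utc).timestamp() - post["created_utc"]) / 86400
        if age_days > max_age_days:
            return False
    if exclude_stickied and post.get("is_stickied"):
        return False
    if exclude_nsfw and post.get("is_nsfw"):
        return False
    if post.get("author", "").lower() in {a.lower() for a in author_blocklist}:
        return False
    if not post.get("is_self_post"):  # self-posts always pass the domain filters
        domain = post.get("domain", "")
        if domain_allowlist and domain not in domain_allowlist:
            return False
        if domain in domain_blocklist:
            return False
    return True
```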

## Actor input object example

```json
{
  "subreddits": [
    "python",
    "programming"
  ],
  "maxPosts": 25,
  "sort": "hot",
  "timeFilter": "day",
  "excludeStickied": false,
  "excludeNsfw": false,
  "excludeOriginalContent": false,
  "authorBlocklist": [],
  "domainAllowlist": [],
  "domainBlocklist": []
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "subreddits": [
        "python"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("crawlerbros/reddit-scraper-pro").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "subreddits": ["python"] }

# Run the Actor and wait for it to finish
run = client.actor("crawlerbros/reddit-scraper-pro").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "subreddits": [
    "python"
  ]
}' |
apify call crawlerbros/reddit-scraper-pro --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=crawlerbros/reddit-scraper-pro",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Reddit Scraper Pro",
        "description": "Scrape Reddit subreddit posts with advanced filters (keywordFilter, minScore, maxAgeDays, excludeStickied, excludeNsfw, authorBlocklist, domainAllowlist/Blocklist). Adds is_video / awards / gilded / upvote_ratio / media_metadata to every post. Browser-based, no login.",
        "version": "1.0",
        "x-build-id": "9qhnTeaXi2vgd89uK"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/crawlerbros~reddit-scraper-pro/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-crawlerbros-reddit-scraper-pro",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/crawlerbros~reddit-scraper-pro/runs": {
            "post": {
                "operationId": "runs-sync-crawlerbros-reddit-scraper-pro",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/crawlerbros~reddit-scraper-pro/run-sync": {
            "post": {
                "operationId": "run-sync-crawlerbros-reddit-scraper-pro",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "subreddits"
                ],
                "properties": {
                    "subreddits": {
                        "title": "Subreddits",
                        "type": "array",
                        "description": "List of subreddit names without `r/` prefix.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxPosts": {
                        "title": "Max posts per subreddit",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Max posts per subreddit.",
                        "default": 25
                    },
                    "sort": {
                        "title": "Sort posts by",
                        "enum": [
                            "hot",
                            "new",
                            "top",
                            "rising",
                            "controversial"
                        ],
                        "type": "string",
                        "description": "How to sort posts in the subreddit.",
                        "default": "hot"
                    },
                    "timeFilter": {
                        "title": "Time filter (top/controversial)",
                        "enum": [
                            "hour",
                            "day",
                            "week",
                            "month",
                            "year",
                            "all"
                        ],
                        "type": "string",
                        "description": "Time range for `top` or `controversial` sorts.",
                        "default": "day"
                    },
                    "keywordFilter": {
                        "title": "Keyword filter (substring)",
                        "type": "string",
                        "description": "Only emit posts whose title or content contains this substring (case-insensitive)."
                    },
                    "minScore": {
                        "title": "Min score (filter)",
                        "minimum": -10000,
                        "maximum": 10000000,
                        "type": "integer",
                        "description": "Drop posts with score below this number."
                    },
                    "maxAgeDays": {
                        "title": "Max post age in days (filter)",
                        "minimum": 1,
                        "maximum": 36500,
                        "type": "integer",
                        "description": "Drop posts older than N days."
                    },
                    "excludeStickied": {
                        "title": "Exclude stickied posts",
                        "type": "boolean",
                        "description": "Drop pinned/stickied posts (typically mod announcements).",
                        "default": false
                    },
                    "excludeNsfw": {
                        "title": "Exclude NSFW",
                        "type": "boolean",
                        "description": "Drop NSFW posts.",
                        "default": false
                    },
                    "excludeOriginalContent": {
                        "title": "Exclude OC posts",
                        "type": "boolean",
                        "description": "Drop posts marked as Original Content.",
                        "default": false
                    },
                    "authorBlocklist": {
                        "title": "Author blocklist",
                        "type": "array",
                        "description": "Drop posts by these usernames (case-insensitive).",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "domainAllowlist": {
                        "title": "Domain allowlist",
                        "type": "array",
                        "description": "If set, only emit posts whose link domain is in this list (e.g. `[\"github.com\", \"arxiv.org\"]`). Self-posts pass.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "domainBlocklist": {
                        "title": "Domain blocklist",
                        "type": "array",
                        "description": "Drop posts whose link domain is in this list.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
