Pricing

Pay per usage

Go to Apify Store

Jalen Hurts

Philadelphia Eagles quarterback #1

Pricing

Pay per usage

Rating

0.0

(0)

Developer

Jalen Hurts

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

4 months ago

Last modified

Categories

Automation

SEO tools

Open source

Share

My Actor 1

modernistic_virtuoso/my-actor-1

Jalen Hurts Philadelphia Eagles QB1

Jalen Hurts

2

Philadelphia Crime Data Scraper | PPD Incident Reports

parseforge/philadelphia-crime-data-scraper

Extract Philadelphia Police Department crime incidents with offense, location, date, district, and ucr code. Filter by district, category, or date range. Built for journalists, neighborhood safety analysts, and researchers studying Philadelphia public safety.

ParseForge

2

Philadelphia Building Permits

velvety_bedbug/philadelphia-building-permits

Scrape building permit records from the City of Philadelphia via the public Carto SQL API. Filter by permit type, status, type of work, address, and date range.

Peters Bugs

2

Actor 1

code_crafter/actor-1

Code Pioneer

63

Immobilienscout24 (.de|.ch|.at) 💚 $1/1k | By Search URL

azzouzana/immobilienscout24-de-search-results-scraper-by-search-url

$1/1K 🔥 Fast #1 ImmoScout24 search pages scraper for Germany 🇩🇪, Switzerland 🇨🇭 & Austria 🇦🇹! Scrape search results for just $1 per 1k listings. Get data in seconds via JSON, CSV, Excel, or API. Simply paste the URL and extract thousands of listings instantly! Fast, cheap & DACH-ready. ⚡

Azzouzana

349

5.0

Google Maps $1.5/1k Listings Scraper

braveleads/google-maps-listings-scraper

$1.5 / 1k 🔥 Extract rich business data from thousands of Google Maps listings, including reviews, reviewer insights, images, contact details, opening hours, pricing, ratings, locations, and more.

Brave Leads

6

5.0

Google Maps 🟢 $1.5/1k Scraper 🟢

olympus/google-maps-leads-scraper

$1.5/1k 🔥 Gather detailed business data from thousands of Google Maps listings, including reviews, reviewer insights, images, contact information, operating hours, pricing, ratings, locations, and more.

Olympus

476

4.8

Your Actor 1

jupri/your-actor-1

cat

2

Ecommerce Price Scraper 1

fipper_ai/Ecommerce-Price-Scraper-1

Josh Baker

3

My Actor 1

labile_juniper/my-actor-1

现男董

4

.actor/actor.json

{
    "actorSpecification": 1,
    "name": "my-actor",
    "title": "Getting started with Python Crawlee and BeautifulSoup",
    "description": "Scrapes titles of websites using Crawlee and BeautifulSoup.",
    "version": "0.0",
    "buildTag": "latest",
    "meta": {
        "templateId": "python-crawlee-beautifulsoup",
        "generatedBy": "<FILL-IN-MODEL>"
    },
    "input": "./input_schema.json",
    "output": "./output_schema.json",
    "storages": {
        "dataset": "./dataset_schema.json"
    },
    "dockerfile": "../Dockerfile"
}

.actor/dataset_schema.json

{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {
                "fields": ["title", "url", "h1s", "h2s", "h3s"]
            },
            "display": {
                "component": "table",
                "properties": {
                    "title": {
                        "label": "Title",
                        "format": "text"
                    },
                    "url": {
                        "label": "URL",
                        "format": "link"
                    },
                    "h1s": {
                        "label": "H1s",
                        "format": "array"
                    },
                    "h2s": {
                        "label": "H2s",
                        "format": "array"
                    },
                    "h3s": {
                        "label": "H3s",
                        "format": "array"
                    }
                }
            }
        }
    }
}

.actor/input_schema.json

{
    "title": "Python Crawlee BeautifulSoup Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with",
            "prefill": [{ "url": "https://apify.com" }],
            "editor": "requestListSources"
        }
    },
    "required": ["start_urls"]
}

.actor/output_schema.json

{
    "actorOutputSchemaVersion": 1,
    "title": "Output schema of the files scraper",
    "properties": {
        "overview": {
            "type": "string",
            "title": "Overview",
            "template": "{{links.apiDefaultDatasetUrl}}/items?view=overview"
        }
    }
}

.dockerignore

.git
.mise.toml
.nvim.lua
storage

# The rest is copied from https://github.com/github/gitignore/blob/main/Python.gitignore

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#   For a library or package, you might want to ignore these files since the code is
#   intended to run in multiple environments; otherwise, check them in:
.python-version

# pdm
#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
#   in version control.
#   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
#  and can be added to the global gitignore or merged into this file.  For a more nuclear
#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/

# Visual Studio Code
#  Ignores the folder created by VS Code when changing workspace settings, doing debugger
#  configuration, etc. Can be commented out to share Workspace Settings within a team
.vscode

# Zed editor
#  Ignores the folder created when setting Project Settings in the Zed editor. Can be commented out
#  to share Project Settings within a team
.zed

.gitignore

.mise.toml
.nvim.lua
storage

# The rest is copied from https://github.com/github/gitignore/blob/main/Python.gitignore

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#   For a library or package, you might want to ignore these files since the code is
#   intended to run in multiple environments; otherwise, check them in:
.python-version

# pdm
#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
#   in version control.
#   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
#  and can be added to the global gitignore or merged into this file.  For a more nuclear
#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/

# Zed editor
#  Ignores the folder created when setting Project Settings in the Zed editor. Can be commented out
#  to share Project Settings within a team
.zed

AGENTS.md

1# Apify Actors Development Guide
2
3Important: Before you begin, fill in the `generatedBy` property in the meta section of `.actor/actor.json`. Replace it with the tool and model you're currently using, such as "Claude Code with Claude Sonnet 4.5". This helps Apify monitor and improve AGENTS.md for specific AI tools and models.
4
5## What are Apify Actors?
6
7- Actors are serverless programs that run in the cloud. They're inspired by the UNIX philosophy - programs that do one thing well and can be easily combined to build complex systems.
8- Actors are programs packaged as Docker images that run in isolated containers
9
10## Core Concepts
11
12- Accept well-defined JSON input
13- Perform isolated tasks (web scraping, automation, data processing)
14- Produce structured JSON output to datasets and/or store data in key-value stores
15- Can run from seconds to hours or even indefinitely
16- Persist state and can be restarted
17
18## Do
19
20- accept well-defined JSON input and produce structured JSON output
21- use Apify SDK (`apify`) for code running ON Apify platform
22- validate input early with proper error handling and fail gracefully
23- use CheerioCrawler for static HTML content (10x faster than browsers)
24- use PlaywrightCrawler only for JavaScript-heavy sites and dynamic content
25- use router pattern (createCheerioRouter/createPlaywrightRouter) for complex crawls
26- implement retry strategies with exponential backoff for failed requests
27- use proper concurrency settings (HTTP: 10-50, Browser: 1-5)
28- set sensible defaults in `.actor/input_schema.json` for all optional fields
29- set up output schema in `.actor/output_schema.json`
30- clean and validate data before pushing to dataset
31- use semantic CSS selectors and fallback strategies for missing elements
32- respect robots.txt, ToS, and implement rate limiting with delays
33- check which tools (cheerio/playwright/crawlee) are installed before applying guidance
34- use `Actor.log` for logging (censors sensitive data)
35- implement readiness probe handler for standby Actors
36- handle the `aborting` event to gracefully shut down when Actor is stopped
37
38## Don't
39
40- do not rely on `Dataset.getInfo()` for final counts on Cloud platform
41- do not use browser crawlers when HTTP/Cheerio works (massive performance gains with HTTP)
42- do not hard code values that should be in input schema or environment variables
43- do not skip input validation or error handling
44- do not overload servers - use appropriate concurrency and delays
45- do not scrape prohibited content or ignore Terms of Service
46- do not store personal/sensitive data unless explicitly permitted
47- do not use deprecated options like `requestHandlerTimeoutMillis` on CheerioCrawler (v3.x)
48- do not use `additionalHttpHeaders` - use `preNavigationHooks` instead
49- do not assume that local storage is persistent or automatically synced to Apify Console - when running locally with `apify run`, the `storage/` directory is local-only and is NOT pushed to the Cloud
50- do not disable standby mode (`usesStandbyMode: false`) without explicit permission
51
52## Logging
53
54- **ALWAYS use `Actor.log` for logging** - This logger contains critical security logic including censoring sensitive data (Apify tokens, API keys, credentials) to prevent accidental exposure in logs
55
56### Available Log Levels
57
58The Apify Actor logger provides the following methods for logging:
59
60- `Actor.log.debug()` - Debug level logs (detailed diagnostic information)
61- `Actor.log.info()` - Info level logs (general informational messages)
62- `Actor.log.warning()` - Warning level logs (warning messages for potentially problematic situations)
63- `Actor.log.error()` - Error level logs (error messages for failures)
64- `Actor.log.exception()` - Exception level logs (for exceptions with stack traces)
65
66**Best practices:**
67
68- Use `Actor.log.debug()` for detailed operation-level diagnostics (inside functions)
69- Use `Actor.log.info()` for general informational messages (API requests, successful operations)
70- Use `Actor.log.warning()` for potentially problematic situations (validation failures, unexpected states)
71- Use `Actor.log.error()` for actual errors and failures
72- Use `Actor.log.exception()` for caught exceptions with stack traces
73
74## Graceful Abort Handling
75
76Handle the `aborting` event to terminate the Actor quickly when stopped by user or platform, minimizing costs especially for PPU/PPE+U billing.
77
78```python
79import asyncio
80
81async def on_aborting() -> None:
82    # Persist any state, do any cleanup you need, and terminate the Actor using `await Actor.exit()` explicitly as soon as possible
83    # This will help ensure that the Actor is doing best effort to honor any potential limits on costs of a single run set by the user
84    # Wait 1 second to allow Crawlee/SDK state persistence operations to complete
85    # This is a temporary workaround until SDK implements proper state persistence in the aborting event
86    await asyncio.sleep(1)
87    await Actor.exit()
88
89Actor.on('aborting', on_aborting)
90```
91
92## Standby Mode
93
94- **NEVER disable standby mode (`usesStandbyMode: false`) in `.actor/actor.json` without explicit permission** - Actor Standby mode solves this problem by letting you have the Actor ready in the background, waiting for the incoming HTTP requests. In a sense, the Actor behaves like a real-time web server or standard API server instead of running the logic once to process everything in batch. Always keep `usesStandbyMode: true` unless there is a specific documented reason to disable it
95- **ALWAYS implement readiness probe handler for standby Actors** - Handle the `x-apify-container-server-readiness-probe` header at GET / endpoint to ensure proper Actor lifecycle management
96
97You can recognize a standby Actor by checking the `usesStandbyMode` property in `.actor/actor.json`. Only implement the readiness probe if this property is set to `true`.
98
99### Readiness Probe Implementation Example
100
101```python
102# Apify standby readiness probe
103from http.server import SimpleHTTPRequestHandler
104
105class GetHandler(SimpleHTTPRequestHandler):
106    def do_GET(self):
107        # Handle Apify standby readiness probe
108        if 'x-apify-container-server-readiness-probe' in self.headers:
109            self.send_response(200)
110            self.end_headers()
111            self.wfile.write(b'Readiness probe OK')
112            return
113
114        self.send_response(200)
115        self.end_headers()
116        self.wfile.write(b'Actor is ready')
117```
118
119Key points:
120
121- Detect the `x-apify-container-server-readiness-probe` header in incoming requests
122- Respond with HTTP 200 status code for both readiness probe and normal requests
123- This enables proper Actor lifecycle management in standby mode
124
125## Commands
126
127```bash
128# Local development
129apify run                              # Run Actor locally
130
131# Authentication & deployment
132apify login                            # Authenticate account
133apify push                             # Deploy to Apify platform
134
135# Help
136apify help                             # List all commands
137```
138
139## Safety and Permissions
140
141Allowed without prompt:
142
143- read files with `Actor.get_value()`
144- push data with `Actor.push_data()`
145- set values with `Actor.set_value()`
146- enqueue requests to RequestQueue
147- run locally with `apify run`
148
149Ask first:
150
151- npm/pip package installations
152- apify push (deployment to cloud)
153- proxy configuration changes (requires paid plan)
154- Dockerfile changes affecting builds
155- deleting datasets or key-value stores
156
157## Project Structure
158
159.actor/
160├── actor.json # Actor config: name, version, env vars, runtime settings
161├── input_schema.json # Input validation & Console form definition
162└── output_schema.json # Specifies where an Actor stores its output
163src/
164└── main.js # Actor entry point and orchestrator
165storage/ # Local-only storage for development (NOT synced to Cloud)
166├── datasets/ # Output items (JSON objects)
167├── key_value_stores/ # Files, config, INPUT
168└── request_queues/ # Pending crawl requests
169Dockerfile # Container image definition
170AGENTS.md # AI agent instructions (this file)
171
172## Local vs Cloud Storage
173
174When running locally with `apify run`, the Apify SDK emulates Cloud storage APIs using the local `storage/` directory. This local storage behaves differently from Cloud storage:
175
176- **Local storage is NOT persistent** - The `storage/` directory is meant for local development and testing only. Data stored there (datasets, key-value stores, request queues) exists only on your local disk.
177- **Local storage is NOT automatically pushed to Apify Console** - Running `apify run` does not upload any storage data to the Apify platform. The data stays local.
178- **Each local run may overwrite previous data** - The local `storage/` directory is reused between runs, but this is local-only behavior, not Cloud persistence.
179- **Cloud storage only works when running on Apify platform** - After deploying with `apify push` and running the Actor in the Cloud, storage calls (`Actor.push_data()`, `Actor.set_value()`, etc.) interact with real Apify Cloud storage, which is then visible in the Apify Console.
180- **To verify Actor output, deploy and run in Cloud** - Do not rely on local `storage/` contents as proof that data will appear in the Apify Console. Always test by deploying (`apify push`) and running the Actor on the platform.
181
182## Actor Input Schema
183
184The input schema defines the input parameters for an Actor. It's a JSON object comprising various field types supported by the Apify platform.
185
186### Structure
187
188```json
189{
190    "title": "<INPUT-SCHEMA-TITLE>",
191    "type": "object",
192    "schemaVersion": 1,
193    "properties": {
194        /* define input fields here */
195    },
196    "required": []
197}
198```
199
200### Example
201
202```json
203{
204    "title": "E-commerce Product Scraper Input",
205    "type": "object",
206    "schemaVersion": 1,
207    "properties": {
208        "startUrls": {
209            "title": "Start URLs",
210            "type": "array",
211            "description": "URLs to start scraping from (category pages or product pages)",
212            "editor": "requestListSources",
213            "default": [{ "url": "https://example.com/category" }],
214            "prefill": [{ "url": "https://example.com/category" }]
215        },
216        "followVariants": {
217            "title": "Follow Product Variants",
218            "type": "boolean",
219            "description": "Whether to scrape product variants (different colors, sizes)",
220            "default": true
221        },
222        "maxRequestsPerCrawl": {
223            "title": "Max Requests per Crawl",
224            "type": "integer",
225            "description": "Maximum number of pages to scrape (0 = unlimited)",
226            "default": 1000,
227            "minimum": 0
228        },
229        "proxyConfiguration": {
230            "title": "Proxy Configuration",
231            "type": "object",
232            "description": "Proxy settings for anti-bot protection",
233            "editor": "proxy",
234            "default": { "useApifyProxy": false }
235        },
236        "locale": {
237            "title": "Locale",
238            "type": "string",
239            "description": "Language/country code for localized content",
240            "default": "cs",
241            "enum": ["cs", "en", "de", "sk"],
242            "enumTitles": ["Czech", "English", "German", "Slovak"]
243        }
244    },
245    "required": ["startUrls"]
246}
247```
248
249## Actor Output Schema
250
251The Actor output schema builds upon the schemas for the dataset and key-value store. It specifies where an Actor stores its output and defines templates for accessing that output. Apify Console uses these output definitions to display run results.
252
253### Structure
254
255```json
256{
257    "actorOutputSchemaVersion": 1,
258    "title": "<OUTPUT-SCHEMA-TITLE>",
259    "properties": {
260        /* define your outputs here */
261    }
262}
263```
264
265### Example
266
267```json
268{
269    "actorOutputSchemaVersion": 1,
270    "title": "Output schema of the files scraper",
271    "properties": {
272        "files": {
273            "type": "string",
274            "title": "Files",
275            "template": "{{links.apiDefaultKeyValueStoreUrl}}/keys"
276        },
277        "dataset": {
278            "type": "string",
279            "title": "Dataset",
280            "template": "{{links.apiDefaultDatasetUrl}}/items"
281        }
282    }
283}
284```
285
286### Output Schema Template Variables
287
288- `links` (object) - Contains quick links to most commonly used URLs
289- `links.publicRunUrl` (string) - Public run url in format `https://console.apify.com/view/runs/:runId`
290- `links.consoleRunUrl` (string) - Console run url in format `https://console.apify.com/actors/runs/:runId`
291- `links.apiRunUrl` (string) - API run url in format `https://api.apify.com/v2/actor-runs/:runId`
292- `links.apiDefaultDatasetUrl` (string) - API url of default dataset in format `https://api.apify.com/v2/datasets/:defaultDatasetId`
293- `links.apiDefaultKeyValueStoreUrl` (string) - API url of default key-value store in format `https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId`
294- `links.containerRunUrl` (string) - URL of a webserver running inside the run in format `https://<containerId>.runs.apify.net/`
295- `run` (object) - Contains information about the run same as it is returned from the `GET Run` API endpoint
296- `run.defaultDatasetId` (string) - ID of the default dataset
297- `run.defaultKeyValueStoreId` (string) - ID of the default key-value store
298
299## Dataset Schema Specification
300
301The dataset schema defines how your Actor's output data is structured, transformed, and displayed in the Output tab in the Apify Console.
302
303### Example
304
305Consider an example Actor that calls `Actor.pushData()` to store data into dataset:
306
307```python
308# Dataset push example (Python)
309import asyncio
310from datetime import datetime
311from apify import Actor
312
313async def main():
314    await Actor.init()
315
316    # Actor code
317    await Actor.push_data({
318        'numericField': 10,
319        'pictureUrl': 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
320        'linkUrl': 'https://google.com',
321        'textField': 'Google',
322        'booleanField': True,
323        'dateField': datetime.now().isoformat(),
324        'arrayField': ['#hello', '#world'],
325        'objectField': {},
326    })
327
328    # Exit successfully
329    await Actor.exit()
330
331if __name__ == '__main__':
332    asyncio.run(main())
333```
334
335To set up the Actor's output tab UI, reference a dataset schema file in `.actor/actor.json`:
336
337```json
338{
339    "actorSpecification": 1,
340    "name": "book-library-scraper",
341    "title": "Book Library Scraper",
342    "version": "1.0.0",
343    "storages": {
344        "dataset": "./dataset_schema.json"
345    }
346}
347```
348
349Then create the dataset schema in `.actor/dataset_schema.json`:
350
351```json
352{
353    "actorSpecification": 1,
354    "fields": {},
355    "views": {
356        "overview": {
357            "title": "Overview",
358            "transformation": {
359                "fields": [
360                    "pictureUrl",
361                    "linkUrl",
362                    "textField",
363                    "booleanField",
364                    "arrayField",
365                    "objectField",
366                    "dateField",
367                    "numericField"
368                ]
369            },
370            "display": {
371                "component": "table",
372                "properties": {
373                    "pictureUrl": {
374                        "label": "Image",
375                        "format": "image"
376                    },
377                    "linkUrl": {
378                        "label": "Link",
379                        "format": "link"
380                    },
381                    "textField": {
382                        "label": "Text",
383                        "format": "text"
384                    },
385                    "booleanField": {
386                        "label": "Boolean",
387                        "format": "boolean"
388                    },
389                    "arrayField": {
390                        "label": "Array",
391                        "format": "array"
392                    },
393                    "objectField": {
394                        "label": "Object",
395                        "format": "object"
396                    },
397                    "dateField": {
398                        "label": "Date",
399                        "format": "date"
400                    },
401                    "numericField": {
402                        "label": "Number",
403                        "format": "number"
404                    }
405                }
406            }
407        }
408    }
409}
410```
411
412### Structure
413
414```json
415{
416    "actorSpecification": 1,
417    "fields": {},
418    "views": {
419        "<VIEW_NAME>": {
420            "title": "string (required)",
421            "description": "string (optional)",
422            "transformation": {
423                "fields": ["string (required)"],
424                "unwind": ["string (optional)"],
425                "flatten": ["string (optional)"],
426                "omit": ["string (optional)"],
427                "limit": "integer (optional)",
428                "desc": "boolean (optional)"
429            },
430            "display": {
431                "component": "table (required)",
432                "properties": {
433                    "<FIELD_NAME>": {
434                        "label": "string (optional)",
435                        "format": "text|number|date|link|boolean|image|array|object (optional)"
436                    }
437                }
438            }
439        }
440    }
441}
442```
443
444**Dataset Schema Properties:**
445
446- `actorSpecification` (integer, required) - Specifies the version of dataset schema structure document (currently only version 1)
447- `fields` (JSONSchema object, required) - Schema of one dataset object (use JsonSchema Draft 2020-12 or compatible)
448- `views` (DatasetView object, required) - Object with API and UI views description
449
450**DatasetView Properties:**
451
452- `title` (string, required) - Visible in UI Output tab and API
453- `description` (string, optional) - Only available in API response
454- `transformation` (ViewTransformation object, required) - Data transformation applied when loading from Dataset API
455- `display` (ViewDisplay object, required) - Output tab UI visualization definition
456
457**ViewTransformation Properties:**
458
459- `fields` (string[], required) - Fields to present in output (order matches column order)
460- `unwind` (string[], optional) - Deconstructs nested children into parent object
461- `flatten` (string[], optional) - Transforms nested object into flat structure
462- `omit` (string[], optional) - Removes specified fields from output
463- `limit` (integer, optional) - Maximum number of results (default: all)
464- `desc` (boolean, optional) - Sort order (true = newest first)
465
466**ViewDisplay Properties:**
467
468- `component` (string, required) - Only `table` is available
469- `properties` (Object, optional) - Keys matching `transformation.fields` with ViewDisplayProperty values
470
471**ViewDisplayProperty Properties:**
472
473- `label` (string, optional) - Table column header
474- `format` (string, optional) - One of: `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object`
475
476## Key-Value Store Schema Specification
477
478The key-value store schema organizes keys into logical groups called collections for easier data management.
479
480### Example
481
482Consider an example Actor that calls `Actor.setValue()` to save records into the key-value store:
483
484```python
485# Key-Value Store set example (Python)
486import asyncio
487from apify import Actor
488
489async def main():
490    await Actor.init()
491
492    # Actor code
493    await Actor.set_value('document-1', 'my text data', content_type='text/plain')
494
495    image_id = '123'          # example placeholder
496    image_buffer = b'...'     # bytes buffer with image data
497    await Actor.set_value(f'image-{image_id}', image_buffer, content_type='image/jpeg')
498
499    # Exit successfully
500    await Actor.exit()
501
502if __name__ == '__main__':
503    asyncio.run(main())
504```
505
506To configure the key-value store schema, reference a schema file in `.actor/actor.json`:
507
508```json
509{
510    "actorSpecification": 1,
511    "name": "data-collector",
512    "title": "Data Collector",
513    "version": "1.0.0",
514    "storages": {
515        "keyValueStore": "./key_value_store_schema.json"
516    }
517}
518```
519
520Then create the key-value store schema in `.actor/key_value_store_schema.json`:
521
522```json
523{
524    "actorKeyValueStoreSchemaVersion": 1,
525    "title": "Key-Value Store Schema",
526    "collections": {
527        "documents": {
528            "title": "Documents",
529            "description": "Text documents stored by the Actor",
530            "keyPrefix": "document-"
531        },
532        "images": {
533            "title": "Images",
534            "description": "Images stored by the Actor",
535            "keyPrefix": "image-",
536            "contentTypes": ["image/jpeg"]
537        }
538    }
539}
540```
541
542### Structure
543
544```json
545{
546    "actorKeyValueStoreSchemaVersion": 1,
547    "title": "string (required)",
548    "description": "string (optional)",
549    "collections": {
550        "<COLLECTION_NAME>": {
551            "title": "string (required)",
552            "description": "string (optional)",
553            "key": "string (conditional - use key OR keyPrefix)",
554            "keyPrefix": "string (conditional - use key OR keyPrefix)",
555            "contentTypes": ["string (optional)"],
556            "jsonSchema": "object (optional)"
557        }
558    }
559}
560```
561
562**Key-Value Store Schema Properties:**
563
564- `actorKeyValueStoreSchemaVersion` (integer, required) - Version of key-value store schema structure document (currently only version 1)
565- `title` (string, required) - Title of the schema
566- `description` (string, optional) - Description of the schema
567- `collections` (Object, required) - Object where each key is a collection ID and value is a Collection object
568
569**Collection Properties:**
570
571- `title` (string, required) - Collection title shown in UI tabs
572- `description` (string, optional) - Description appearing in UI tooltips
573- `key` (string, conditional\*) - Single specific key for this collection
574- `keyPrefix` (string, conditional\*) - Prefix for keys included in this collection
575- `contentTypes` (string[], optional) - Allowed content types for validation
576- `jsonSchema` (object, optional) - JSON Schema Draft 07 format for `application/json` content type validation
577
578\*Either `key` or `keyPrefix` must be specified for each collection, but not both.
579
580## Apify MCP Tools
581
582If MCP server is configured, use these tools for documentation:
583
584- `search-apify-docs` - Search documentation
585- `fetch-apify-docs` - Get full doc pages
586
587Otherwise, reference: `@https://mcp.apify.com/`
588
589## Resources
590
591- [docs.apify.com/llms.txt](https://docs.apify.com/llms.txt) - Quick reference
592- [docs.apify.com/llms-full.txt](https://docs.apify.com/llms-full.txt) - Complete docs
593- [crawlee.dev](https://crawlee.dev) - Crawlee documentation
594- [whitepaper.actor](https://raw.githubusercontent.com/apify/actor-whitepaper/refs/heads/master/README.md) - Complete Actor specification

CLAUDE.md

1# Apify Actors Development Guide
2
3Important: Before you begin, fill in the `generatedBy` property in the meta section of `.actor/actor.json`. Replace it with the tool and model you're currently using, such as "Claude Code with Claude Sonnet 4.5". This helps Apify monitor and improve AGENTS.md for specific AI tools and models.
4
5## What are Apify Actors?
6
7- Actors are serverless programs that run in the cloud. They're inspired by the UNIX philosophy - programs that do one thing well and can be easily combined to build complex systems.
8- Actors are programs packaged as Docker images that run in isolated containers
9
10## Core Concepts
11
12- Accept well-defined JSON input
13- Perform isolated tasks (web scraping, automation, data processing)
14- Produce structured JSON output to datasets and/or store data in key-value stores
15- Can run from seconds to hours or even indefinitely
16- Persist state and can be restarted
17
18## Do
19
20- accept well-defined JSON input and produce structured JSON output
21- use Apify SDK (`apify`) for code running ON Apify platform
22- validate input early with proper error handling and fail gracefully
23- use CheerioCrawler for static HTML content (10x faster than browsers)
24- use PlaywrightCrawler only for JavaScript-heavy sites and dynamic content
25- use router pattern (createCheerioRouter/createPlaywrightRouter) for complex crawls
26- implement retry strategies with exponential backoff for failed requests
27- use proper concurrency settings (HTTP: 10-50, Browser: 1-5)
28- set sensible defaults in `.actor/input_schema.json` for all optional fields
29- set up output schema in `.actor/output_schema.json`
30- clean and validate data before pushing to dataset
31- use semantic CSS selectors and fallback strategies for missing elements
32- respect robots.txt, ToS, and implement rate limiting with delays
33- check which tools (cheerio/playwright/crawlee) are installed before applying guidance
34- use `Actor.log` for logging (censors sensitive data)
35- implement readiness probe handler for standby Actors
36- handle the `aborting` event to gracefully shut down when Actor is stopped
37
38## Don't
39
40- do not rely on `Dataset.getInfo()` for final counts on Cloud platform
41- do not use browser crawlers when HTTP/Cheerio works (massive performance gains with HTTP)
42- do not hard code values that should be in input schema or environment variables
43- do not skip input validation or error handling
44- do not overload servers - use appropriate concurrency and delays
45- do not scrape prohibited content or ignore Terms of Service
46- do not store personal/sensitive data unless explicitly permitted
47- do not use deprecated options like `requestHandlerTimeoutMillis` on CheerioCrawler (v3.x)
48- do not use `additionalHttpHeaders` - use `preNavigationHooks` instead
49- do not assume that local storage is persistent or automatically synced to Apify Console - when running locally with `apify run`, the `storage/` directory is local-only and is NOT pushed to the Cloud
50- do not disable standby mode (`usesStandbyMode: false`) without explicit permission
51
52## Logging
53
54- **ALWAYS use `Actor.log` for logging** - This logger contains critical security logic including censoring sensitive data (Apify tokens, API keys, credentials) to prevent accidental exposure in logs
55
56### Available Log Levels
57
58The Apify Actor logger provides the following methods for logging:
59
60- `Actor.log.debug()` - Debug level logs (detailed diagnostic information)
61- `Actor.log.info()` - Info level logs (general informational messages)
62- `Actor.log.warning()` - Warning level logs (warning messages for potentially problematic situations)
63- `Actor.log.error()` - Error level logs (error messages for failures)
64- `Actor.log.exception()` - Exception level logs (for exceptions with stack traces)
65
66**Best practices:**
67
68- Use `Actor.log.debug()` for detailed operation-level diagnostics (inside functions)
69- Use `Actor.log.info()` for general informational messages (API requests, successful operations)
70- Use `Actor.log.warning()` for potentially problematic situations (validation failures, unexpected states)
71- Use `Actor.log.error()` for actual errors and failures
72- Use `Actor.log.exception()` for caught exceptions with stack traces
73
74## Graceful Abort Handling
75
76Handle the `aborting` event to terminate the Actor quickly when stopped by user or platform, minimizing costs especially for PPU/PPE+U billing.
77
78```python
79import asyncio
80
81async def on_aborting() -> None:
82    # Persist any state, do any cleanup you need, and terminate the Actor using `await Actor.exit()` explicitly as soon as possible
83    # This will help ensure that the Actor is doing best effort to honor any potential limits on costs of a single run set by the user
84    # Wait 1 second to allow Crawlee/SDK state persistence operations to complete
85    # This is a temporary workaround until SDK implements proper state persistence in the aborting event
86    await asyncio.sleep(1)
87    await Actor.exit()
88
89Actor.on('aborting', on_aborting)
90```
91
92## Standby Mode
93
94- **NEVER disable standby mode (`usesStandbyMode: false`) in `.actor/actor.json` without explicit permission** - Actor Standby mode solves this problem by letting you have the Actor ready in the background, waiting for the incoming HTTP requests. In a sense, the Actor behaves like a real-time web server or standard API server instead of running the logic once to process everything in batch. Always keep `usesStandbyMode: true` unless there is a specific documented reason to disable it
95- **ALWAYS implement readiness probe handler for standby Actors** - Handle the `x-apify-container-server-readiness-probe` header at GET / endpoint to ensure proper Actor lifecycle management
96
97You can recognize a standby Actor by checking the `usesStandbyMode` property in `.actor/actor.json`. Only implement the readiness probe if this property is set to `true`.
98
99### Readiness Probe Implementation Example
100
101```python
102# Apify standby readiness probe
103from http.server import SimpleHTTPRequestHandler
104
105class GetHandler(SimpleHTTPRequestHandler):
106    def do_GET(self):
107        # Handle Apify standby readiness probe
108        if 'x-apify-container-server-readiness-probe' in self.headers:
109            self.send_response(200)
110            self.end_headers()
111            self.wfile.write(b'Readiness probe OK')
112            return
113
114        self.send_response(200)
115        self.end_headers()
116        self.wfile.write(b'Actor is ready')
117```
118
119Key points:
120
121- Detect the `x-apify-container-server-readiness-probe` header in incoming requests
122- Respond with HTTP 200 status code for both readiness probe and normal requests
123- This enables proper Actor lifecycle management in standby mode
124
125## Commands
126
127```bash
128# Local development
129apify run                              # Run Actor locally
130
131# Authentication & deployment
132apify login                            # Authenticate account
133apify push                             # Deploy to Apify platform
134
135# Help
136apify help                             # List all commands
137```
138
139## Safety and Permissions
140
141Allowed without prompt:
142
143- read files with `Actor.get_value()`
144- push data with `Actor.push_data()`
145- set values with `Actor.set_value()`
146- enqueue requests to RequestQueue
147- run locally with `apify run`
148
149Ask first:
150
151- npm/pip package installations
152- apify push (deployment to cloud)
153- proxy configuration changes (requires paid plan)
154- Dockerfile changes affecting builds
155- deleting datasets or key-value stores
156
157## Project Structure
158
159.actor/
160├── actor.json # Actor config: name, version, env vars, runtime settings
161├── input_schema.json # Input validation & Console form definition
162└── output_schema.json # Specifies where an Actor stores its output
163src/
164└── main.js # Actor entry point and orchestrator
165storage/ # Local-only storage for development (NOT synced to Cloud)
166├── datasets/ # Output items (JSON objects)
167├── key_value_stores/ # Files, config, INPUT
168└── request_queues/ # Pending crawl requests
169Dockerfile # Container image definition
170AGENTS.md # AI agent instructions (this file)
171
172## Local vs Cloud Storage
173
174When running locally with `apify run`, the Apify SDK emulates Cloud storage APIs using the local `storage/` directory. This local storage behaves differently from Cloud storage:
175
176- **Local storage is NOT persistent** - The `storage/` directory is meant for local development and testing only. Data stored there (datasets, key-value stores, request queues) exists only on your local disk.
177- **Local storage is NOT automatically pushed to Apify Console** - Running `apify run` does not upload any storage data to the Apify platform. The data stays local.
178- **Each local run may overwrite previous data** - The local `storage/` directory is reused between runs, but this is local-only behavior, not Cloud persistence.
179- **Cloud storage only works when running on Apify platform** - After deploying with `apify push` and running the Actor in the Cloud, storage calls (`Actor.push_data()`, `Actor.set_value()`, etc.) interact with real Apify Cloud storage, which is then visible in the Apify Console.
180- **To verify Actor output, deploy and run in Cloud** - Do not rely on local `storage/` contents as proof that data will appear in the Apify Console. Always test by deploying (`apify push`) and running the Actor on the platform.
181
182## Actor Input Schema
183
184The input schema defines the input parameters for an Actor. It's a JSON object comprising various field types supported by the Apify platform.
185
186### Structure
187
188```json
189{
190    "title": "<INPUT-SCHEMA-TITLE>",
191    "type": "object",
192    "schemaVersion": 1,
193    "properties": {
194        /* define input fields here */
195    },
196    "required": []
197}
198```
199
200### Example
201
202```json
203{
204    "title": "E-commerce Product Scraper Input",
205    "type": "object",
206    "schemaVersion": 1,
207    "properties": {
208        "startUrls": {
209            "title": "Start URLs",
210            "type": "array",
211            "description": "URLs to start scraping from (category pages or product pages)",
212            "editor": "requestListSources",
213            "default": [{ "url": "https://example.com/category" }],
214            "prefill": [{ "url": "https://example.com/category" }]
215        },
216        "followVariants": {
217            "title": "Follow Product Variants",
218            "type": "boolean",
219            "description": "Whether to scrape product variants (different colors, sizes)",
220            "default": true
221        },
222        "maxRequestsPerCrawl": {
223            "title": "Max Requests per Crawl",
224            "type": "integer",
225            "description": "Maximum number of pages to scrape (0 = unlimited)",
226            "default": 1000,
227            "minimum": 0
228        },
229        "proxyConfiguration": {
230            "title": "Proxy Configuration",
231            "type": "object",
232            "description": "Proxy settings for anti-bot protection",
233            "editor": "proxy",
234            "default": { "useApifyProxy": false }
235        },
236        "locale": {
237            "title": "Locale",
238            "type": "string",
239            "description": "Language/country code for localized content",
240            "default": "cs",
241            "enum": ["cs", "en", "de", "sk"],
242            "enumTitles": ["Czech", "English", "German", "Slovak"]
243        }
244    },
245    "required": ["startUrls"]
246}
247```
248
249## Actor Output Schema
250
251The Actor output schema builds upon the schemas for the dataset and key-value store. It specifies where an Actor stores its output and defines templates for accessing that output. Apify Console uses these output definitions to display run results.
252
253### Structure
254
255```json
256{
257    "actorOutputSchemaVersion": 1,
258    "title": "<OUTPUT-SCHEMA-TITLE>",
259    "properties": {
260        /* define your outputs here */
261    }
262}
263```
264
265### Example
266
267```json
268{
269    "actorOutputSchemaVersion": 1,
270    "title": "Output schema of the files scraper",
271    "properties": {
272        "files": {
273            "type": "string",
274            "title": "Files",
275            "template": "{{links.apiDefaultKeyValueStoreUrl}}/keys"
276        },
277        "dataset": {
278            "type": "string",
279            "title": "Dataset",
280            "template": "{{links.apiDefaultDatasetUrl}}/items"
281        }
282    }
283}
284```
285
286### Output Schema Template Variables
287
288- `links` (object) - Contains quick links to most commonly used URLs
289- `links.publicRunUrl` (string) - Public run url in format `https://console.apify.com/view/runs/:runId`
290- `links.consoleRunUrl` (string) - Console run url in format `https://console.apify.com/actors/runs/:runId`
291- `links.apiRunUrl` (string) - API run url in format `https://api.apify.com/v2/actor-runs/:runId`
292- `links.apiDefaultDatasetUrl` (string) - API url of default dataset in format `https://api.apify.com/v2/datasets/:defaultDatasetId`
293- `links.apiDefaultKeyValueStoreUrl` (string) - API url of default key-value store in format `https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId`
294- `links.containerRunUrl` (string) - URL of a webserver running inside the run in format `https://<containerId>.runs.apify.net/`
295- `run` (object) - Contains information about the run same as it is returned from the `GET Run` API endpoint
296- `run.defaultDatasetId` (string) - ID of the default dataset
297- `run.defaultKeyValueStoreId` (string) - ID of the default key-value store
298
299## Dataset Schema Specification
300
301The dataset schema defines how your Actor's output data is structured, transformed, and displayed in the Output tab in the Apify Console.
302
303### Example
304
305Consider an example Actor that calls `Actor.pushData()` to store data into dataset:
306
307```python
308# Dataset push example (Python)
309import asyncio
310from datetime import datetime
311from apify import Actor
312
313async def main():
314    await Actor.init()
315
316    # Actor code
317    await Actor.push_data({
318        'numericField': 10,
319        'pictureUrl': 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
320        'linkUrl': 'https://google.com',
321        'textField': 'Google',
322        'booleanField': True,
323        'dateField': datetime.now().isoformat(),
324        'arrayField': ['#hello', '#world'],
325        'objectField': {},
326    })
327
328    # Exit successfully
329    await Actor.exit()
330
331if __name__ == '__main__':
332    asyncio.run(main())
333```
334
335To set up the Actor's output tab UI, reference a dataset schema file in `.actor/actor.json`:
336
337```json
338{
339    "actorSpecification": 1,
340    "name": "book-library-scraper",
341    "title": "Book Library Scraper",
342    "version": "1.0.0",
343    "storages": {
344        "dataset": "./dataset_schema.json"
345    }
346}
347```
348
349Then create the dataset schema in `.actor/dataset_schema.json`:
350
351```json
352{
353    "actorSpecification": 1,
354    "fields": {},
355    "views": {
356        "overview": {
357            "title": "Overview",
358            "transformation": {
359                "fields": [
360                    "pictureUrl",
361                    "linkUrl",
362                    "textField",
363                    "booleanField",
364                    "arrayField",
365                    "objectField",
366                    "dateField",
367                    "numericField"
368                ]
369            },
370            "display": {
371                "component": "table",
372                "properties": {
373                    "pictureUrl": {
374                        "label": "Image",
375                        "format": "image"
376                    },
377                    "linkUrl": {
378                        "label": "Link",
379                        "format": "link"
380                    },
381                    "textField": {
382                        "label": "Text",
383                        "format": "text"
384                    },
385                    "booleanField": {
386                        "label": "Boolean",
387                        "format": "boolean"
388                    },
389                    "arrayField": {
390                        "label": "Array",
391                        "format": "array"
392                    },
393                    "objectField": {
394                        "label": "Object",
395                        "format": "object"
396                    },
397                    "dateField": {
398                        "label": "Date",
399                        "format": "date"
400                    },
401                    "numericField": {
402                        "label": "Number",
403                        "format": "number"
404                    }
405                }
406            }
407        }
408    }
409}
410```
411
412### Structure
413
414```json
415{
416    "actorSpecification": 1,
417    "fields": {},
418    "views": {
419        "<VIEW_NAME>": {
420            "title": "string (required)",
421            "description": "string (optional)",
422            "transformation": {
423                "fields": ["string (required)"],
424                "unwind": ["string (optional)"],
425                "flatten": ["string (optional)"],
426                "omit": ["string (optional)"],
427                "limit": "integer (optional)",
428                "desc": "boolean (optional)"
429            },
430            "display": {
431                "component": "table (required)",
432                "properties": {
433                    "<FIELD_NAME>": {
434                        "label": "string (optional)",
435                        "format": "text|number|date|link|boolean|image|array|object (optional)"
436                    }
437                }
438            }
439        }
440    }
441}
442```
443
444**Dataset Schema Properties:**
445
446- `actorSpecification` (integer, required) - Specifies the version of dataset schema structure document (currently only version 1)
447- `fields` (JSONSchema object, required) - Schema of one dataset object (use JsonSchema Draft 2020-12 or compatible)
448- `views` (DatasetView object, required) - Object with API and UI views description
449
450**DatasetView Properties:**
451
452- `title` (string, required) - Visible in UI Output tab and API
453- `description` (string, optional) - Only available in API response
454- `transformation` (ViewTransformation object, required) - Data transformation applied when loading from Dataset API
455- `display` (ViewDisplay object, required) - Output tab UI visualization definition
456
457**ViewTransformation Properties:**
458
459- `fields` (string[], required) - Fields to present in output (order matches column order)
460- `unwind` (string[], optional) - Deconstructs nested children into parent object
461- `flatten` (string[], optional) - Transforms nested object into flat structure
462- `omit` (string[], optional) - Removes specified fields from output
463- `limit` (integer, optional) - Maximum number of results (default: all)
464- `desc` (boolean, optional) - Sort order (true = newest first)
465
466**ViewDisplay Properties:**
467
468- `component` (string, required) - Only `table` is available
469- `properties` (Object, optional) - Keys matching `transformation.fields` with ViewDisplayProperty values
470
471**ViewDisplayProperty Properties:**
472
473- `label` (string, optional) - Table column header
474- `format` (string, optional) - One of: `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object`
475
476## Key-Value Store Schema Specification
477
478The key-value store schema organizes keys into logical groups called collections for easier data management.
479
480### Example
481
482Consider an example Actor that calls `Actor.setValue()` to save records into the key-value store:
483
484```python
485# Key-Value Store set example (Python)
486import asyncio
487from apify import Actor
488
489async def main():
490    await Actor.init()
491
492    # Actor code
493    await Actor.set_value('document-1', 'my text data', content_type='text/plain')
494
495    image_id = '123'          # example placeholder
496    image_buffer = b'...'     # bytes buffer with image data
497    await Actor.set_value(f'image-{image_id}', image_buffer, content_type='image/jpeg')
498
499    # Exit successfully
500    await Actor.exit()
501
502if __name__ == '__main__':
503    asyncio.run(main())
504```
505
506To configure the key-value store schema, reference a schema file in `.actor/actor.json`:
507
508```json
509{
510    "actorSpecification": 1,
511    "name": "data-collector",
512    "title": "Data Collector",
513    "version": "1.0.0",
514    "storages": {
515        "keyValueStore": "./key_value_store_schema.json"
516    }
517}
518```
519
520Then create the key-value store schema in `.actor/key_value_store_schema.json`:
521
522```json
523{
524    "actorKeyValueStoreSchemaVersion": 1,
525    "title": "Key-Value Store Schema",
526    "collections": {
527        "documents": {
528            "title": "Documents",
529            "description": "Text documents stored by the Actor",
530            "keyPrefix": "document-"
531        },
532        "images": {
533            "title": "Images",
534            "description": "Images stored by the Actor",
535            "keyPrefix": "image-",
536            "contentTypes": ["image/jpeg"]
537        }
538    }
539}
540```
541
542### Structure
543
544```json
545{
546    "actorKeyValueStoreSchemaVersion": 1,
547    "title": "string (required)",
548    "description": "string (optional)",
549    "collections": {
550        "<COLLECTION_NAME>": {
551            "title": "string (required)",
552            "description": "string (optional)",
553            "key": "string (conditional - use key OR keyPrefix)",
554            "keyPrefix": "string (conditional - use key OR keyPrefix)",
555            "contentTypes": ["string (optional)"],
556            "jsonSchema": "object (optional)"
557        }
558    }
559}
560```
561
562**Key-Value Store Schema Properties:**
563
564- `actorKeyValueStoreSchemaVersion` (integer, required) - Version of key-value store schema structure document (currently only version 1)
565- `title` (string, required) - Title of the schema
566- `description` (string, optional) - Description of the schema
567- `collections` (Object, required) - Object where each key is a collection ID and value is a Collection object
568
569**Collection Properties:**
570
571- `title` (string, required) - Collection title shown in UI tabs
572- `description` (string, optional) - Description appearing in UI tooltips
573- `key` (string, conditional\*) - Single specific key for this collection
574- `keyPrefix` (string, conditional\*) - Prefix for keys included in this collection
575- `contentTypes` (string[], optional) - Allowed content types for validation
576- `jsonSchema` (object, optional) - JSON Schema Draft 07 format for `application/json` content type validation
577
578\*Either `key` or `keyPrefix` must be specified for each collection, but not both.
579
580## Apify MCP Tools
581
582If MCP server is configured, use these tools for documentation:
583
584- `search-apify-docs` - Search documentation
585- `fetch-apify-docs` - Get full doc pages
586
587Otherwise, reference: `@https://mcp.apify.com/`
588
589## Resources
590
591- [docs.apify.com/llms.txt](https://docs.apify.com/llms.txt) - Quick reference
592- [docs.apify.com/llms-full.txt](https://docs.apify.com/llms-full.txt) - Complete docs
593- [crawlee.dev](https://crawlee.dev) - Crawlee documentation
594- [whitepaper.actor](https://raw.githubusercontent.com/apify/actor-whitepaper/refs/heads/master/README.md) - Complete Actor specification

Dockerfile

# First, specify the base Docker image.
# You can see the Docker images from Apify at https://hub.docker.com/r/apify/.
# You can also use any other image from Docker Hub.
FROM apify/actor-python:3.14

USER myuser

# Second, copy just requirements.txt into the Actor image,
# since it should be the only file that affects the dependency install in the next step,
# in order to speed up the build
COPY --chown=myuser:myuser requirements.txt ./

# Install the packages specified in requirements.txt,
# Print the installed Python version, pip version
# and all installed packages with their versions for debugging
RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing dependencies:" \
 && pip install -r requirements.txt \
 && echo "All installed Python packages:" \
 && pip freeze

# Next, copy the remaining files and directories with the source code.
# Since we do this after installing the dependencies, quick build will be really fast
# for most source file changes.
COPY --chown=myuser:myuser . ./

# Use compileall to ensure the runnability of the Actor Python code.
RUN python3 -m compileall -q src/

# Specify how to launch the source code of your Actor.
# By default, the "python3 -m src" command is run
CMD ["python3", "-m", "src"]

requirements.txt

1# Feel free to add your Python dependencies below. For formatting guidelines, see:
2# https://pip.pypa.io/en/latest/reference/requirements-file-format/
3
4apify < 4.0.0
5crawlee[beautifulsoup]

src/init.py

src/main.py

1import asyncio
2
3from .main import main
4
5# Execute the Actor entry point.
6asyncio.run(main())

src/main.py

1"""Module defines the main entry point for the Apify Actor.
2
3Feel free to modify this file to suit your specific needs.
4
5To build Apify Actors, utilize the Apify SDK toolkit, read more at the official documentation:
6https://docs.apify.com/sdk/python
7"""
8
9from __future__ import annotations
10
11import asyncio
12
13from apify import Actor, Event
14from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
15
16
17async def main() -> None:
18    """Define a main entry point for the Apify Actor.
19
20    This coroutine is executed using `asyncio.run()`, so it must remain an asynchronous function for proper execution.
21    Asynchronous execution is required for communication with Apify platform, and it also enhances performance in
22    the field of web scraping significantly.
23    """
24    # Enter the context of the Actor.
25    async with Actor:
26        # Handle graceful abort - Actor is being stopped by user or platform
27        async def on_aborting() -> None:
28            # Persist any state, do any cleanup you need, and terminate the Actor using
29            # `await Actor.exit()` explicitly as soon as possible. This will help ensure that
30            # the Actor is doing best effort to honor any potential limits on costs of a
31            # single run set by the user.
32            # Wait 1 second to allow Crawlee/SDK state persistence operations to complete
33            # This is a temporary workaround until SDK implements proper state persistence in the aborting event
34            await asyncio.sleep(1)
35            await Actor.exit()
36
37        Actor.on(Event.ABORTING, on_aborting)
38
39        # Retrieve the Actor input, and use default values if not provided.
40        actor_input = await Actor.get_input() or {}
41        start_urls = [
42            url.get('url')
43            for url in actor_input.get(
44                'start_urls',
45                [{'url': 'https://apify.com'}],
46            )
47        ]
48
49        # Exit if no start URLs are provided.
50        if not start_urls:
51            Actor.log.info('No start URLs specified in Actor input, exiting...')
52            await Actor.exit()
53
54        # Create a crawler.
55        crawler = BeautifulSoupCrawler(
56            # Limit the crawl to max requests. Remove or increase it for crawling all links.
57            max_requests_per_crawl=10,
58        )
59
60        # Define a request handler, which will be called for every request.
61        @crawler.router.default_handler
62        async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
63            url = context.request.url
64            Actor.log.info(f'Scraping {url}...')
65
66            # Extract the desired data.
67            data = {
68                'url': context.request.url,
69                'title': context.soup.title.string if context.soup.title else None,
70                'h1s': [h1.text for h1 in context.soup.find_all('h1')],
71                'h2s': [h2.text for h2 in context.soup.find_all('h2')],
72                'h3s': [h3.text for h3 in context.soup.find_all('h3')],
73            }
74
75            # Store the extracted data to the default dataset.
76            await context.push_data(data)
77
78            # Enqueue additional links found on the current page.
79            await context.enqueue_links()
80
81        # Run the crawler with the starting requests.
82        await crawler.run(start_urls)

src/py.typed