# My Actor 5

- Pricing: Pay per usage
- Rating: 0.0 (0 reviews)
- Developer: OG Stoner (Maintained by Community)
- Actor stats: 0 bookmarked · 2 total users · 1 monthly active user · last modified 4 days ago
{ "actorSpecification": 1, "name": "my-actor-5", "title": "Getting started with Python and Playwright", "description": "Scrapes titles of websites using Playwright.", "version": "0.0", "buildTag": "latest", "meta": { "templateId": "python-playwright", "generatedBy": "<FILL-IN-MODEL>" }, "input": "./input_schema.json", "output": "./output_schema.json", "storages": { "dataset": "./dataset_schema.json" }, "dockerfile": "../Dockerfile"}{ "actorSpecification": 1, "fields": {}, "views": { "overview": { "title": "Overview", "transformation": { "fields": ["title", "url"] }, "display": { "component": "table", "properties": { "title": { "label": "Title", "format": "text" }, "url": { "label": "URL", "format": "link" } } } } }}{ "title": "Python Playwright Scraper", "type": "object", "schemaVersion": 1, "properties": { "start_urls": { "title": "Start URLs", "type": "array", "description": "URLs to start with", "prefill": [{ "url": "https://apify.com" }], "editor": "requestListSources" }, "max_depth": { "title": "Maximum depth", "type": "integer", "description": "Depth to which to scrape to", "default": 1 } }, "required": ["start_urls"]}{ "actorOutputSchemaVersion": 1, "title": "Output schema", "properties": { "overview": { "type": "string", "title": "Overview", "template": "{{links.apiDefaultDatasetUrl}}/items?view=overview" } }}.git.mise.toml.nvim.luastorage
# The rest is copied from https://github.com/github/gitignore/blob/main/Python.gitignore
# Byte-compiled / optimized / DLL files__pycache__/*.py[cod]*$py.class
# C extensions*.so
# Distribution / packaging.Pythonbuild/develop-eggs/dist/downloads/eggs/.eggs/lib/lib64/parts/sdist/var/wheels/share/python-wheels/*.egg-info/.installed.cfg*.eggMANIFEST
# PyInstaller# Usually these files are written by a python script from a template# before PyInstaller builds the exe, so as to inject date/other infos into it.*.manifest*.spec
# Installer logspip-log.txtpip-delete-this-directory.txt
# Unit test / coverage reportshtmlcov/.tox/.nox/.coverage.coverage.*.cachenosetests.xmlcoverage.xml*.cover*.py,cover.hypothesis/.pytest_cache/cover/
# Translations*.mo*.pot
# Django stuff:*.loglocal_settings.pydb.sqlite3db.sqlite3-journal
# Flask stuff:instance/.webassets-cache
# Scrapy stuff:.scrapy
# Sphinx documentationdocs/_build/
# PyBuilder.pybuilder/target/
# Jupyter Notebook.ipynb_checkpoints
# IPythonprofile_default/ipython_config.py
# pyenv# For a library or package, you might want to ignore these files since the code is# intended to run in multiple environments; otherwise, check them in:.python-version
# pdm# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.#pdm.lock# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it# in version control.# https://pdm.fming.dev/latest/usage/project/#working-with-version-control.pdm.toml.pdm-python.pdm-build/
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm__pypackages__/
# Celery stuffcelerybeat-schedulecelerybeat.pid
# SageMath parsed files*.sage.py
# Environments.env.venvenv/venv/ENV/env.bak/venv.bak/
# Spyder project settings.spyderproject.spyproject
# Rope project settings.ropeproject
# mkdocs documentation/site
# mypy.mypy_cache/.dmypy.jsondmypy.json
# Pyre type checker.pyre/
# pytype static type analyzer.pytype/
# Cython debug symbolscython_debug/
# PyCharm# JetBrains specific template is maintained in a separate JetBrains.gitignore that can# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore# and can be added to the global gitignore or merged into this file. For a more nuclear# option (not recommended) you can uncomment the following to ignore the entire idea folder..idea/
# Visual Studio Code# Ignores the folder created by VS Code when changing workspace settings, doing debugger# configuration, etc. Can be commented out to share Workspace Settings within a team.vscode
# Zed editor# Ignores the folder created when setting Project Settings in the Zed editor. Can be commented out# to share Project Settings within a team.zed.mise.toml.nvim.luastorage
**AGENTS.md**

# Apify Actors Development Guide
Important: Before you begin, fill in the `generatedBy` property in the meta section of `.actor/actor.json`. Replace it with the tool and model you're currently using, such as "Claude Code with Claude Sonnet 4.5". This helps Apify monitor and improve AGENTS.md for specific AI tools and models.

## What are Apify Actors?

- Actors are serverless programs that run in the cloud. They're inspired by the UNIX philosophy - programs that do one thing well and can be easily combined to build complex systems.
- Actors are programs packaged as Docker images that run in isolated containers

## Core Concepts

- Accept well-defined JSON input (see the minimal sketch after this list)
- Perform isolated tasks (web scraping, automation, data processing)
- Produce structured JSON output to datasets and/or store data in key-value stores
- Can run from seconds to hours or even indefinitely
- Persist state and can be restarted
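A minimal sketch of these concepts with the Python Apify SDK - read JSON input, do one isolated task, push JSON output. The `name` input field is hypothetical, not part of this template's schema:

```python
from apify import Actor


async def main() -> None:
    # Entering the Actor context initializes the SDK and its storages.
    async with Actor:
        # Accept well-defined JSON input.
        actor_input = await Actor.get_input() or {}
        name = actor_input.get('name', 'world')  # hypothetical field

        # Perform an isolated task and produce structured JSON output.
        await Actor.push_data({'greeting': f'Hello, {name}!'})
```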
## Do

- accept well-defined JSON input and produce structured JSON output
- use Apify SDK (`apify`) for code running ON Apify platform
- validate input early with proper error handling and fail gracefully (see the sketch after this list)
- use CheerioCrawler for static HTML content (10x faster than browsers)
- use PlaywrightCrawler only for JavaScript-heavy sites and dynamic content
- use router pattern (createCheerioRouter/createPlaywrightRouter) for complex crawls
- implement retry strategies with exponential backoff for failed requests
- use proper concurrency settings (HTTP: 10-50, Browser: 1-5)
- set sensible defaults in `.actor/input_schema.json` for all optional fields
- set up output schema in `.actor/output_schema.json`
- clean and validate data before pushing to dataset
- use semantic CSS selectors and fallback strategies for missing elements
- respect robots.txt, ToS, and implement rate limiting with delays
- check which tools (cheerio/playwright/crawlee) are installed before applying guidance
- use `Actor.log` for logging (censors sensitive data)
- implement readiness probe handler for standby Actors
- handle the `aborting` event to gracefully shut down when Actor is stopped
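A minimal early-validation sketch, assuming the `start_urls` field from this template's input schema; adapt the checks to your own schema:

```python
from apify import Actor


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}

        # Validate input early and fail gracefully instead of crashing mid-crawl.
        start_urls = actor_input.get('start_urls') or []
        if not start_urls:
            Actor.log.error('Missing required input field: start_urls')
            # Mirrors the template's own handling; exit before any work is done.
            await Actor.exit()
            return

        Actor.log.info(f'Validated input with {len(start_urls)} start URL(s)')
```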
## Don't

- do not rely on `Dataset.getInfo()` for final counts on Cloud platform
- do not use browser crawlers when HTTP/Cheerio works (massive performance gains with HTTP)
- do not hard-code values that should be in input schema or environment variables
- do not skip input validation or error handling
- do not overload servers - use appropriate concurrency and delays
- do not scrape prohibited content or ignore Terms of Service
- do not store personal/sensitive data unless explicitly permitted
- do not use deprecated options like `requestHandlerTimeoutMillis` on CheerioCrawler (v3.x)
- do not use `additionalHttpHeaders` - use `preNavigationHooks` instead
- do not disable standby mode (`usesStandbyMode: false`) without explicit permission
## Logging

- **ALWAYS use `Actor.log` for logging** - This logger contains critical security logic, including censoring sensitive data (Apify tokens, API keys, credentials) to prevent accidental exposure in logs

### Available Log Levels

The Apify Actor logger provides the following methods for logging:

- `Actor.log.debug()` - Debug level logs (detailed diagnostic information)
- `Actor.log.info()` - Info level logs (general informational messages)
- `Actor.log.warning()` - Warning level logs (warning messages for potentially problematic situations)
- `Actor.log.error()` - Error level logs (error messages for failures)
- `Actor.log.exception()` - Exception level logs (for exceptions with stack traces)

**Best practices:**

- Use `Actor.log.debug()` for detailed operation-level diagnostics (inside functions)
- Use `Actor.log.info()` for general informational messages (API requests, successful operations)
- Use `Actor.log.warning()` for potentially problematic situations (validation failures, unexpected states)
- Use `Actor.log.error()` for actual errors and failures
- Use `Actor.log.exception()` for caught exceptions with stack traces (see the sketch after this list)
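A short sketch putting the levels above together; the URL and the forced failure are illustrative only:

```python
from apify import Actor


async def fetch_page(url: str) -> None:
    Actor.log.debug(f'Opening {url} ...')  # detailed operation-level diagnostics
    Actor.log.info('Request sent')  # general progress message
    Actor.log.warning('Response was empty')  # unexpected but non-fatal state
    try:
        raise TimeoutError('connection timed out')  # illustrative failure
    except TimeoutError:
        # exception() logs at error level and includes the stack trace.
        Actor.log.exception(f'Cannot process {url}')
```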
## Graceful Abort Handling

Handle the `aborting` event to terminate the Actor quickly when it is stopped by the user or the platform, minimizing costs, especially for PPU/PPE+U billing.

```python
import asyncio

from apify import Actor


async def on_aborting() -> None:
    # Persist any state, do any cleanup you need, and terminate the Actor
    # explicitly with `await Actor.exit()` as soon as possible. This helps
    # ensure the Actor makes a best effort to honor any limits on the cost
    # of a single run set by the user.
    # Wait 1 second to allow Crawlee/SDK state persistence operations to
    # complete. This is a temporary workaround until the SDK implements
    # proper state persistence in the aborting event.
    await asyncio.sleep(1)
    await Actor.exit()


Actor.on('aborting', on_aborting)
```
## Standby Mode

- **NEVER disable standby mode (`usesStandbyMode: false`) in `.actor/actor.json` without explicit permission** - Actor Standby keeps the Actor ready in the background, waiting for incoming HTTP requests. In a sense, the Actor behaves like a real-time web server or standard API server instead of running its logic once to process everything in batch. Always keep `usesStandbyMode: true` unless there is a specific documented reason to disable it
- **ALWAYS implement readiness probe handler for standby Actors** - Handle the `x-apify-container-server-readiness-probe` header at the GET / endpoint to ensure proper Actor lifecycle management

You can recognize a standby Actor by checking the `usesStandbyMode` property in `.actor/actor.json`. Only implement the readiness probe if this property is set to `true`.
### Readiness Probe Implementation Example

```python
# Apify standby readiness probe
from http.server import SimpleHTTPRequestHandler


class GetHandler(SimpleHTTPRequestHandler):
    def do_GET(self):
        # Handle Apify standby readiness probe
        if 'x-apify-container-server-readiness-probe' in self.headers:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'Readiness probe OK')
            return

        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'Actor is ready')
```
Key points:

- Detect the `x-apify-container-server-readiness-probe` header in incoming requests
- Respond with HTTP 200 status code for both readiness probe and normal requests
- This enables proper Actor lifecycle management in standby mode (a server wiring sketch follows this list)
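A minimal sketch of wiring the `GetHandler` above into a server. It assumes the `ACTOR_STANDBY_PORT` environment variable the platform provides to standby Actors, with a fallback for local runs:

```python
import os
from http.server import HTTPServer

# The platform passes the port via ACTOR_STANDBY_PORT (assumption: standby
# Actor on the Apify platform); 8080 is an arbitrary local fallback.
port = int(os.environ.get('ACTOR_STANDBY_PORT', '8080'))
HTTPServer(('', port), GetHandler).serve_forever()
```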
## Commands

```bash
# Local development
apify run        # Run Actor locally

# Authentication & deployment
apify login      # Authenticate account
apify push       # Deploy to Apify platform

# Help
apify help       # List all commands
```
## Safety and Permissions

Allowed without prompt:

- read files with `Actor.get_value()`
- push data with `Actor.push_data()`
- set values with `Actor.set_value()`
- enqueue requests to RequestQueue
- run locally with `apify run`

Ask first:

- npm/pip package installations
- apify push (deployment to cloud)
- proxy configuration changes (requires paid plan)
- Dockerfile changes affecting builds
- deleting datasets or key-value stores
## Project Structure

```
.actor/
├── actor.json            # Actor config: name, version, env vars, runtime settings
├── input_schema.json     # Input validation & Console form definition
└── output_schema.json    # Specifies where an Actor stores its output
src/
└── main.py               # Actor entry point and orchestrator
storage/                  # Local storage (mirrors Cloud during development)
├── datasets/             # Output items (JSON objects)
├── key_value_stores/     # Files, config, INPUT
└── request_queues/       # Pending crawl requests
Dockerfile                # Container image definition
AGENTS.md                 # AI agent instructions (this file)
```
## Actor Input Schema

The input schema defines the input parameters for an Actor. It's a JSON object comprising various field types supported by the Apify platform.

### Structure

```json
{
  "title": "<INPUT-SCHEMA-TITLE>",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    /* define input fields here */
  },
  "required": []
}
```

### Example

```json
{
  "title": "E-commerce Product Scraper Input",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "startUrls": {
      "title": "Start URLs",
      "type": "array",
      "description": "URLs to start scraping from (category pages or product pages)",
      "editor": "requestListSources",
      "default": [{ "url": "https://example.com/category" }],
      "prefill": [{ "url": "https://example.com/category" }]
    },
    "followVariants": {
      "title": "Follow Product Variants",
      "type": "boolean",
      "description": "Whether to scrape product variants (different colors, sizes)",
      "default": true
    },
    "maxRequestsPerCrawl": {
      "title": "Max Requests per Crawl",
      "type": "integer",
      "description": "Maximum number of pages to scrape (0 = unlimited)",
      "default": 1000,
      "minimum": 0
    },
    "proxyConfiguration": {
      "title": "Proxy Configuration",
      "type": "object",
      "description": "Proxy settings for anti-bot protection",
      "editor": "proxy",
      "default": { "useApifyProxy": false }
    },
    "locale": {
      "title": "Locale",
      "type": "string",
      "description": "Language/country code for localized content",
      "default": "cs",
      "enum": ["cs", "en", "de", "sk"],
      "enumTitles": ["Czech", "English", "German", "Slovak"]
    }
  },
  "required": ["startUrls"]
}
```
## Actor Output Schema

The Actor output schema builds upon the schemas for the dataset and key-value store. It specifies where an Actor stores its output and defines templates for accessing that output. Apify Console uses these output definitions to display run results.

### Structure

```json
{
  "actorOutputSchemaVersion": 1,
  "title": "<OUTPUT-SCHEMA-TITLE>",
  "properties": {
    /* define your outputs here */
  }
}
```

### Example

```json
{
  "actorOutputSchemaVersion": 1,
  "title": "Output schema of the files scraper",
  "properties": {
    "files": {
      "type": "string",
      "title": "Files",
      "template": "{{links.apiDefaultKeyValueStoreUrl}}/keys"
    },
    "dataset": {
      "type": "string",
      "title": "Dataset",
      "template": "{{links.apiDefaultDatasetUrl}}/items"
    }
  }
}
```

### Output Schema Template Variables

- `links` (object) - Contains quick links to most commonly used URLs
- `links.publicRunUrl` (string) - Public run URL in format `https://console.apify.com/view/runs/:runId`
- `links.consoleRunUrl` (string) - Console run URL in format `https://console.apify.com/actors/runs/:runId`
- `links.apiRunUrl` (string) - API run URL in format `https://api.apify.com/v2/actor-runs/:runId`
- `links.apiDefaultDatasetUrl` (string) - API URL of default dataset in format `https://api.apify.com/v2/datasets/:defaultDatasetId`
- `links.apiDefaultKeyValueStoreUrl` (string) - API URL of default key-value store in format `https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId`
- `links.containerRunUrl` (string) - URL of a webserver running inside the run in format `https://<containerId>.runs.apify.net/`
- `run` (object) - Contains information about the run, same as returned from the `GET Run` API endpoint
- `run.defaultDatasetId` (string) - ID of the default dataset
- `run.defaultKeyValueStoreId` (string) - ID of the default key-value store (a sketch combining these variables follows this list)
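A small sketch combining the `run` and `links` variables above; the property names (`liveView`, `datasetId`) are illustrative, not prescribed:

```json
{
  "actorOutputSchemaVersion": 1,
  "title": "Run links",
  "properties": {
    "liveView": {
      "type": "string",
      "title": "Live view",
      "template": "{{links.containerRunUrl}}"
    },
    "datasetId": {
      "type": "string",
      "title": "Default dataset ID",
      "template": "{{run.defaultDatasetId}}"
    }
  }
}
```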
## Dataset Schema Specification

The dataset schema defines how your Actor's output data is structured, transformed, and displayed in the Output tab in the Apify Console.

### Example

Consider an example Actor that calls `Actor.push_data()` to store data into the dataset:

```python
# Dataset push example (Python)
import asyncio
from datetime import datetime

from apify import Actor


async def main():
    await Actor.init()

    # Actor code
    await Actor.push_data({
        'numericField': 10,
        'pictureUrl': 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
        'linkUrl': 'https://google.com',
        'textField': 'Google',
        'booleanField': True,
        'dateField': datetime.now().isoformat(),
        'arrayField': ['#hello', '#world'],
        'objectField': {},
    })

    # Exit successfully
    await Actor.exit()


if __name__ == '__main__':
    asyncio.run(main())
```

To set up the Actor's output tab UI, reference a dataset schema file in `.actor/actor.json`:

```json
{
  "actorSpecification": 1,
  "name": "book-library-scraper",
  "title": "Book Library Scraper",
  "version": "1.0.0",
  "storages": {
    "dataset": "./dataset_schema.json"
  }
}
```

Then create the dataset schema in `.actor/dataset_schema.json`:

```json
{
  "actorSpecification": 1,
  "fields": {},
  "views": {
    "overview": {
      "title": "Overview",
      "transformation": {
        "fields": [
          "pictureUrl",
          "linkUrl",
          "textField",
          "booleanField",
          "arrayField",
          "objectField",
          "dateField",
          "numericField"
        ]
      },
      "display": {
        "component": "table",
        "properties": {
          "pictureUrl": {
            "label": "Image",
            "format": "image"
          },
          "linkUrl": {
            "label": "Link",
            "format": "link"
          },
          "textField": {
            "label": "Text",
            "format": "text"
          },
          "booleanField": {
            "label": "Boolean",
            "format": "boolean"
          },
          "arrayField": {
            "label": "Array",
            "format": "array"
          },
          "objectField": {
            "label": "Object",
            "format": "object"
          },
          "dateField": {
            "label": "Date",
            "format": "date"
          },
          "numericField": {
            "label": "Number",
            "format": "number"
          }
        }
      }
    }
  }
}
```
### Structure

```json
{
  "actorSpecification": 1,
  "fields": {},
  "views": {
    "<VIEW_NAME>": {
      "title": "string (required)",
      "description": "string (optional)",
      "transformation": {
        "fields": ["string (required)"],
        "unwind": ["string (optional)"],
        "flatten": ["string (optional)"],
        "omit": ["string (optional)"],
        "limit": "integer (optional)",
        "desc": "boolean (optional)"
      },
      "display": {
        "component": "table (required)",
        "properties": {
          "<FIELD_NAME>": {
            "label": "string (optional)",
            "format": "text|number|date|link|boolean|image|array|object (optional)"
          }
        }
      }
    }
  }
}
```

**Dataset Schema Properties:**

- `actorSpecification` (integer, required) - Specifies the version of dataset schema structure document (currently only version 1)
- `fields` (JSONSchema object, required) - Schema of one dataset object (use JsonSchema Draft 2020-12 or compatible)
- `views` (DatasetView object, required) - Object with API and UI views description

**DatasetView Properties:**

- `title` (string, required) - Visible in UI Output tab and API
- `description` (string, optional) - Only available in API response
- `transformation` (ViewTransformation object, required) - Data transformation applied when loading from Dataset API
- `display` (ViewDisplay object, required) - Output tab UI visualization definition

**ViewTransformation Properties:**

- `fields` (string[], required) - Fields to present in output (order matches column order)
- `unwind` (string[], optional) - Deconstructs nested children into parent object
- `flatten` (string[], optional) - Transforms nested object into flat structure
- `omit` (string[], optional) - Removes specified fields from output
- `limit` (integer, optional) - Maximum number of results (default: all)
- `desc` (boolean, optional) - Sort order (true = newest first)

**ViewDisplay Properties:**

- `component` (string, required) - Only `table` is available
- `properties` (Object, optional) - Keys matching `transformation.fields` with ViewDisplayProperty values

**ViewDisplayProperty Properties:**

- `label` (string, optional) - Table column header
- `format` (string, optional) - One of: `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object` (a transformation sketch follows these lists)
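A sketch of a view using the transformation options above; the field names (`comments`, `debugInfo`) are hypothetical:

```json
{
  "actorSpecification": 1,
  "fields": {},
  "views": {
    "latestComments": {
      "title": "Latest comments",
      "transformation": {
        "fields": ["url", "comments"],
        "unwind": ["comments"],
        "omit": ["debugInfo"],
        "limit": 100,
        "desc": true
      },
      "display": { "component": "table" }
    }
  }
}
```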
## Key-Value Store Schema Specification

The key-value store schema organizes keys into logical groups called collections for easier data management.

### Example

Consider an example Actor that calls `Actor.set_value()` to save records into the key-value store:

```python
# Key-value store set example (Python)
import asyncio

from apify import Actor


async def main():
    await Actor.init()

    # Actor code
    await Actor.set_value('document-1', 'my text data', content_type='text/plain')

    image_id = '123'  # example placeholder
    image_buffer = b'...'  # bytes buffer with image data
    await Actor.set_value(f'image-{image_id}', image_buffer, content_type='image/jpeg')

    # Exit successfully
    await Actor.exit()


if __name__ == '__main__':
    asyncio.run(main())
```

To configure the key-value store schema, reference a schema file in `.actor/actor.json`:

```json
{
  "actorSpecification": 1,
  "name": "data-collector",
  "title": "Data Collector",
  "version": "1.0.0",
  "storages": {
    "keyValueStore": "./key_value_store_schema.json"
  }
}
```

Then create the key-value store schema in `.actor/key_value_store_schema.json`:

```json
{
  "actorKeyValueStoreSchemaVersion": 1,
  "title": "Key-Value Store Schema",
  "collections": {
    "documents": {
      "title": "Documents",
      "description": "Text documents stored by the Actor",
      "keyPrefix": "document-"
    },
    "images": {
      "title": "Images",
      "description": "Images stored by the Actor",
      "keyPrefix": "image-",
      "contentTypes": ["image/jpeg"]
    }
  }
}
```
### Structure

```json
{
  "actorKeyValueStoreSchemaVersion": 1,
  "title": "string (required)",
  "description": "string (optional)",
  "collections": {
    "<COLLECTION_NAME>": {
      "title": "string (required)",
      "description": "string (optional)",
      "key": "string (conditional - use key OR keyPrefix)",
      "keyPrefix": "string (conditional - use key OR keyPrefix)",
      "contentTypes": ["string (optional)"],
      "jsonSchema": "object (optional)"
    }
  }
}
```

**Key-Value Store Schema Properties:**

- `actorKeyValueStoreSchemaVersion` (integer, required) - Version of key-value store schema structure document (currently only version 1)
- `title` (string, required) - Title of the schema
- `description` (string, optional) - Description of the schema
- `collections` (Object, required) - Object where each key is a collection ID and value is a Collection object

**Collection Properties:**

- `title` (string, required) - Collection title shown in UI tabs
- `description` (string, optional) - Description appearing in UI tooltips
- `key` (string, conditional\*) - Single specific key for this collection
- `keyPrefix` (string, conditional\*) - Prefix for keys included in this collection
- `contentTypes` (string[], optional) - Allowed content types for validation
- `jsonSchema` (object, optional) - JSON Schema Draft 07 format for `application/json` content type validation

\*Either `key` or `keyPrefix` must be specified for each collection, but not both (a single-key sketch follows below).
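A short sketch of a single-`key` collection with JSON Schema validation; the `config` collection and its fields are hypothetical:

```json
{
  "actorKeyValueStoreSchemaVersion": 1,
  "title": "Key-Value Store Schema",
  "collections": {
    "config": {
      "title": "Configuration",
      "key": "config",
      "contentTypes": ["application/json"],
      "jsonSchema": {
        "type": "object",
        "properties": {
          "maxRetries": { "type": "integer" }
        },
        "required": ["maxRetries"]
      }
    }
  }
}
```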
## Apify MCP Tools

If an MCP server is configured, use these tools for documentation:

- `search-apify-docs` - Search documentation
- `fetch-apify-docs` - Get full doc pages

Otherwise, reference: `@https://mcp.apify.com/`
## Resources

- [docs.apify.com/llms.txt](https://docs.apify.com/llms.txt) - Quick reference
- [docs.apify.com/llms-full.txt](https://docs.apify.com/llms-full.txt) - Complete docs
- [crawlee.dev](https://crawlee.dev) - Crawlee documentation
- [whitepaper.actor](https://raw.githubusercontent.com/apify/actor-whitepaper/refs/heads/master/README.md) - Complete Actor specification

**Dockerfile**

```dockerfile
# First, specify the base Docker image.
# You can see the Docker images from Apify at https://hub.docker.com/r/apify/.
# You can also use any other image from Docker Hub.
FROM apify/actor-python-playwright:3.14-1.57.0

USER myuser

# Second, copy just requirements.txt into the Actor image,
# since it should be the only file that affects the dependency install in the next step,
# in order to speed up the build.
COPY requirements.txt ./

# Install the packages specified in requirements.txt,
# then print the installed Python version, pip version,
# and all installed packages with their versions for debugging.
RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing dependencies:" \
 && pip install -r requirements.txt \
 && echo "All installed Python packages:" \
 && pip freeze

# Next, copy the remaining files and directories with the source code.
# Since we do this after installing the dependencies, quick builds will be really fast
# for most source file changes.
COPY . ./

# Use compileall to ensure the runnability of the Actor Python code.
RUN python3 -m compileall -q src/

# Specify how to launch the source code of your Actor.
# By default, the "python3 -m src" command is run.
CMD ["python3", "-m", "src"]
```

**requirements.txt**

```
# Feel free to add your Python dependencies below. For formatting guidelines, see:
# https://pip.pypa.io/en/latest/reference/requirements-file-format/

apify < 4.0.0
playwright
```
**src/__main__.py**

```python
import asyncio

from .main import main

# Execute the Actor entry point.
asyncio.run(main())
```

**src/main.py**

```python
"""Module defines the main entry point for the Apify Actor.

Feel free to modify this file to suit your specific needs.

To build Apify Actors, utilize the Apify SDK toolkit, read more at the official documentation:
https://docs.apify.com/sdk/python
"""

from __future__ import annotations

from urllib.parse import urljoin

from apify import Actor, Request
from playwright.async_api import async_playwright

# Note: To run this Actor locally, ensure that Playwright browsers are installed.
# Run `playwright install --with-deps` in the Actor's virtual environment to install them.
# When running on the Apify platform, these dependencies are already included
# in the Actor's Docker image.


async def main() -> None:
    """Define a main entry point for the Apify Actor.

    This coroutine is executed using `asyncio.run()`, so it must remain an asynchronous function for proper
    execution. Asynchronous execution is required for communication with the Apify platform, and it also
    significantly enhances performance in the field of web scraping.
    """
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        max_depth = actor_input.get('max_depth', 1)

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Open the default request queue for handling URLs to be processed.
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs with an initial crawl depth of 0.
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing {url} ...')
            new_request = Request.from_url(url, user_data={'depth': 0})
            await request_queue.add_request(new_request)

        Actor.log.info('Launching Playwright...')

        # Launch Playwright and open a new browser context.
        async with async_playwright() as playwright:
            # Configure the browser to launch in headless mode as per Actor configuration.
            browser = await playwright.chromium.launch(
                headless=Actor.configuration.headless,
                args=['--disable-gpu'],
            )
            context = await browser.new_context()

            # Process the URLs from the request queue.
            while request := await request_queue.fetch_next_request():
                url = request.url

                if not isinstance(request.user_data['depth'], (str, int)):
                    raise TypeError('Request.depth is an unexpected type.')

                depth = int(request.user_data['depth'])
                Actor.log.info(f'Scraping {url} (depth={depth}) ...')

                try:
                    # Open a new page in the browser context and navigate to the URL.
                    page = await context.new_page()
                    await page.goto(url)

                    # If the current depth is less than max_depth, find nested links
                    # and enqueue them.
                    if depth < max_depth:
                        for link in await page.locator('a').all():
                            link_href = await link.get_attribute('href')
                            link_url = urljoin(url, link_href)

                            if link_url.startswith(('http://', 'https://')):
                                Actor.log.info(f'Enqueuing {link_url} ...')
                                new_request = Request.from_url(
                                    link_url,
                                    user_data={'depth': depth + 1},
                                )
                                await request_queue.add_request(new_request)

                    # Extract the desired data.
                    data = {
                        'url': url,
                        'title': await page.title(),
                    }

                    # Store the extracted data to the default dataset.
                    await Actor.push_data(data)

                except Exception:
                    Actor.log.exception(f'Cannot extract data from {url}.')

                finally:
                    await page.close()
                    # Mark the request as handled to ensure it is not processed again.
                    await request_queue.mark_request_as_handled(request)
```