Pricing

from $0.00005 / actor start

Go to Apify Store

test_actor

Try for free

This is a test actor

Pricing

from $0.00005 / actor start

Rating

0.0

(0)

Developer

Zubair Shahzad

Actor stats

Bookmarked

Total users

Monthly active users

4 months ago

Last modified

Test Actor

godofpotato/test-actor

This is a test actor do not use it!

Emir Yılmaz

test

piercing_runner/input-orchestrator

Test

Big Soup

My Personal Test Actor

enchanted_clamp/my-actor-1

This is my Test actor for Booking web scraper.

Pro Dev

Test Actor

rainminer/test-actor

rainminer

Test Actor

rainminer/test-actor-2

rainminer

Test Actor

zdenekkuncar/test-actor

Zdenek

test actor

elxin/test-actor

InvoUp

Test PPE Actor

dhrumil/test-ppe-actor

This is test PPE actor which describes how PPE is billed

Dhrumil Bhankhar

Test

matej/test

Matej Hamaš

Test

mitch320/Test123

Mitch

{ "$schema": "https://apify.com/schemas/v1/actor.ide.json", "actorSpecification": 1, "name": "my-actor", "title": "Getting started with Python and Scrapy", "description": "Scrapes titles of websites using Scrapy.", "version": "0.0", "buildTag": "latest", "meta": { "templateId": "python-scrapy", "generatedBy": "<FILL-IN-MODEL>" }, "input": "./input_schema.json", "dockerfile": "../Dockerfile" }

{ "$schema": "https://apify.com/schemas/v1/input.ide.json", "title": "Python Scrapy scraper", "type": "object", "schemaVersion": 1, "properties": { "startUrls": { "title": "Start URLs", "type": "array", "description": "URLs to start with", "editor": "requestListSources", "prefill": [ { "url": "https://apify.com/" } ], "default": [ { "url": "https://apify.com/" } ] }, "allowedDomains": { "title": "Allowed domains", "type": "array", "description": "Domains that the scraper is allowed to crawl.", "editor": "json", "prefill": ["apify.com"], "default": ["apify.com"] }, "proxyConfiguration": { "sectionCaption": "Proxy and HTTP configuration", "title": "Proxy configuration", "type": "object", "description": "Specifies proxy servers that will be used by the scraper in order to hide its origin.", "editor": "proxy", "prefill": { "useApifyProxy": true }, "default": { "useApifyProxy": true } } }, "required": ["startUrls"] }

1"""Scrapy spiders package. 2 3This package contains the spiders for your Scrapy project. Spiders are the classes that define how to scrape 4and process data from websites. 5 6For detailed information on creating and utilizing spiders, refer to the official documentation: 7https://docs.scrapy.org/en/latest/topics/spiders.html 8""" 9 10from .title import TitleSpider 11 12__all__ = ['TitleSpider']

1# ruff: noqa: RUF012, TID252 2 3from __future__ import annotations 4 5from typing import TYPE_CHECKING, Any 6from urllib.parse import urljoin 7 8from scrapy import Request, Spider 9 10from ..items import TitleItem 11 12if TYPE_CHECKING: 13 from collections.abc import Generator 14 15 from scrapy.http.response import Response 16 17 18class TitleSpider(Spider): 19 """A spider that scrapes web pages to extract titles and discover new links. 20 21 This spider retrieves the content of the <title> element from each page and queues 22 any valid hyperlinks for further crawling. 23 """ 24 25 name = 'title_spider' 26 27 # Limit the number of pages to scrape. 28 custom_settings = {'CLOSESPIDER_PAGECOUNT': 10} 29 30 def __init__( 31 self, 32 start_urls: list[str], 33 allowed_domains: list[str], 34 *args: Any, 35 **kwargs: Any, 36 ) -> None: 37 """Initialize a new instance. 38 39 Args: 40 start_urls: URLs to start the scraping from. 41 allowed_domains: Domains that the scraper is allowed to crawl. 42 *args: Additional positional arguments. 43 **kwargs: Additional keyword arguments. 44 """ 45 super().__init__(*args, **kwargs) 46 self.start_urls = start_urls 47 self.allowed_domains = allowed_domains 48 49 def parse(self, response: Response) -> Generator[TitleItem | Request, None, None]: 50 """Parse the web page response. 51 52 Args: 53 response: The web page response. 54 55 Yields: 56 Yields scraped `TitleItem` and new `Request` objects for links. 57 """ 58 self.logger.info('TitleSpider is parsing %s...', response) 59 60 # Extract and yield the TitleItem 61 url = response.url 62 title = response.css('title::text').extract_first() 63 yield TitleItem(url=url, title="Test Title") 64 65 # Extract all links from the page, create `Request` objects out of them, 66 # and yield them. 67 for link_href in response.css('a::attr("href")'): 68 link_url = urljoin(response.url, link_href.get()) 69 if link_url.startswith(('http://', 'https://')): 70 yield Request(link_url)

1"""Apify Actor integration for Scrapy projects. 2 3This module transforms a Scrapy project into an Apify Actor, handling the configuration of logging, patching Scrapy's 4logging system, and establishing the required environment to run the Scrapy spider within the Apify platform. 5 6This file is specifically designed to be executed when the project is run as an Apify Actor using `apify run` locally 7or being run on the Apify platform. It is not being executed when running the project as a Scrapy project using 8`scrapy crawl title_spider`. 9 10We recommend you do not modify this file unless you really know what you are doing. 11""" 12 13from __future__ import annotations 14 15import os 16 17from apify.scrapy import initialize_logging, run_scrapy_actor 18 19# Import your main Actor coroutine here. 20from .main import main 21 22# Ensure the location to the Scrapy settings module is defined. 23os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings' 24 25 26if __name__ == '__main__': 27 initialize_logging() 28 run_scrapy_actor(main())

1"""Scrapy item models module. 2 3Module defines Scrapy item models for scraped data. Items represent structured data 4extracted by spiders. 5 6For detailed information on creating and utilizing items, refer to the official documentation: 7https://docs.scrapy.org/en/latest/topics/items.html 8""" 9 10from __future__ import annotations 11 12from scrapy import Field, Item 13 14 15class TitleItem(Item): 16 """Represents a title item scraped from a web page.""" 17 18 url = Field() 19 title = Field()

1"""Module defines the main entry point for the Apify Actor. 2 3Module defines the main coroutine for the Apify Scrapy Actor, executed from the __main__.py file. The coroutine 4processes the Actor's input and executes the Scrapy spider. Additionally, it updates Scrapy project settings by 5applying Apify-related settings. Which includes adding a custom scheduler, retry middleware, and an item pipeline 6for pushing data to the Apify dataset. 7 8Customization: 9-------------- 10 11Feel free to customize this file to add specific functionality to the Actor, such as incorporating your own Scrapy 12components like spiders and handling Actor input. However, make sure you have a clear understanding of your 13modifications. For instance, removing `apply_apify_settings` break the integration between Scrapy and Apify. 14 15Documentation: 16-------------- 17 18For an in-depth description of the Apify-Scrapy integration process, our Scrapy components, known limitations and 19other stuff, please refer to the following documentation page: https://docs.apify.com/cli/docs/integrating-scrapy. 20""" 21 22from __future__ import annotations 23 24from apify import Actor 25from apify.scrapy import apply_apify_settings 26from scrapy.crawler import AsyncCrawlerRunner 27 28# Import your Scrapy spider here. 29from .spiders import TitleSpider as Spider 30 31 32async def main() -> None: 33 """Apify Actor main coroutine for executing the Scrapy spider.""" 34 async with Actor: 35 # Retrieve and process Actor input. 36 actor_input = await Actor.get_input() or {} 37 start_urls = [url['url'] for url in actor_input.get('startUrls', [])] 38 allowed_domains = actor_input.get('allowedDomains') 39 proxy_config = actor_input.get('proxyConfiguration') 40 41 # Apply Apify settings, which will override the Scrapy project settings. 42 settings = apply_apify_settings(proxy_config=proxy_config) 43 44 # Create AsyncCrawlerRunner and execute the Scrapy spider. 45 crawler_runner = AsyncCrawlerRunner(settings) 46 await crawler_runner.crawl( 47 Spider, 48 start_urls=start_urls, 49 allowed_domains=allowed_domains, 50 )

1"""Scrapy middlewares module. 2 3Module defines Scrapy middlewares. Middlewares are processing components that handle requests and 4responses, typically used for adding custom headers, retrying requests, and handling exceptions. 5 6There are 2 types of middlewares: spider middlewares and downloader middlewares. For detailed information 7on creating and utilizing them, refer to the official documentation: 8https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 9https://docs.scrapy.org/en/latest/topics/spider-middleware.html 10""" 11# ruff: noqa: D101, D102, ARG002, UP028 12 13from __future__ import annotations 14 15from typing import TYPE_CHECKING 16 17# Useful for handling different item types with a single interface 18from scrapy import Request, Spider, signals 19 20if TYPE_CHECKING: 21 from collections.abc import Generator, Iterable 22 23 from scrapy.crawler import Crawler 24 from scrapy.http.response import Response 25 26 27class TitleSpiderMiddleware: 28 # Not all methods need to be defined. If a method is not defined, 29 # scrapy acts as if the spider middleware does not modify the 30 # passed objects. 31 32 @classmethod 33 def from_crawler(cls, crawler: Crawler) -> TitleSpiderMiddleware: 34 # This method is used by Scrapy to create your spiders. 35 s = cls() 36 crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 37 return s 38 39 def process_spider_input(self, response: Response, spider: Spider) -> None: 40 # Called for each response that goes through the spider 41 # middleware and into the spider. 42 43 # Should return None or raise an exception. 44 return None 45 46 def process_spider_output( 47 self, 48 response: Response, 49 result: Iterable, 50 spider: Spider, 51 ) -> Generator[Iterable[Request] | None, None, None]: 52 # Called with the results returned from the Spider, after 53 # it has processed the response. 54 55 # Must return an iterable of Request, or item objects. 56 for i in result: 57 yield i 58 59 def process_spider_exception( 60 self, 61 response: Response, 62 exception: BaseException, 63 spider: Spider, 64 ) -> Iterable[Request] | None: 65 # Called when a spider or process_spider_input() method 66 # (from other spider middleware) raises an exception. 67 68 # Should return either None or an iterable of Request or item objects. 69 pass 70 71 def process_start_requests( 72 self, 73 start_requests: Iterable[Request], 74 spider: Spider, 75 ) -> Iterable[Request]: # Called with the start requests of the spider, and works 76 # similarly to the process_spider_output() method, except 77 # that it doesn't have a response associated. 78 79 # Must return only requests (not items). 80 for r in start_requests: 81 yield r 82 83 def spider_opened(self, spider: Spider) -> None: 84 pass 85 86 87class TitleDownloaderMiddleware: 88 # Not all methods need to be defined. If a method is not defined, 89 # scrapy acts as if the downloader middleware does not modify the 90 # passed objects. 91 92 @classmethod 93 def from_crawler(cls, crawler: Crawler) -> TitleDownloaderMiddleware: 94 # This method is used by Scrapy to create your spiders. 95 s = cls() 96 crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 97 return s 98 99 def process_request(self, request: Request, spider: Spider) -> Request | Response | None: 100 # Called for each request that goes through the downloader 101 # middleware. 102 103 # Must either: 104 # - return None: continue processing this request 105 # - or return a Response object 106 # - or return a Request object 107 # - or raise IgnoreRequest: process_exception() methods of 108 # installed downloader middleware will be called 109 return None 110 111 def process_response(self, request: Request, response: Response, spider: Spider) -> Request | Response: 112 # Called with the response returned from the downloader. 113 114 # Must either; 115 # - return a Response object 116 # - return a Request object 117 # - or raise IgnoreRequest 118 return response 119 120 def process_exception(self, request: Request, exception: BaseException, spider: Spider) -> Response | None: 121 # Called when a download handler or a process_request() 122 # (from other downloader middleware) raises an exception. 123 124 # Must either: 125 # - return None: continue processing this exception 126 # - return a Response object: stops process_exception() chain 127 # - return a Request object: stops process_exception() chain 128 pass 129 130 def spider_opened(self, spider: Spider) -> None: 131 pass

1"""Scrapy item pipelines module. 2 3Module defines Scrapy item pipelines for scraped data. Item pipelines are processing components 4that handle the scraped items, typically used for cleaning, validating, and persisting data. 5 6For detailed information on creating and utilizing item pipelines, refer to the official documentation: 7http://doc.scrapy.org/en/latest/topics/item-pipeline.html 8""" 9# ruff: noqa: ARG002, D102 10 11from __future__ import annotations 12 13from typing import TYPE_CHECKING 14 15if TYPE_CHECKING: 16 from scrapy import Spider 17 18 from .items import TitleItem 19 20 21class TitleItemPipeline: 22 """Define processing steps for `TitleItem` objects scraped by spiders.""" 23 24 def process_item(self, item: TitleItem, spider: Spider) -> TitleItem: 25 # Do something with the item here, such as cleaning it or persisting it to a database 26 return item

1"""Scrapy settings module. 2 3This module contains Scrapy settings for the project, defining various configurations and options. 4 5For more comprehensive details on Scrapy settings, refer to the official documentation: 6http://doc.scrapy.org/en/latest/topics/settings.html 7""" 8 9BOT_NAME = 'titlebot' 10DEPTH_LIMIT = 1 11LOG_LEVEL = 'INFO' 12NEWSPIDER_MODULE = 'src.spiders' 13ROBOTSTXT_OBEY = True 14SPIDER_MODULES = ['src.spiders'] 15TELNETCONSOLE_ENABLED = False 16# Do not change the Twisted reactor unless you really know what you are doing. 17TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor' 18HTTPCACHE_ENABLED = True 19HTTPCACHE_EXPIRATION_SECS = 7200 20ITEM_PIPELINES = { 21 'src.pipelines.TitleItemPipeline': 123, 22} 23SPIDER_MIDDLEWARES = { 24 'src.middlewares.TitleSpiderMiddleware': 543, 25} 26DOWNLOADER_MIDDLEWARES = { 27 'src.middlewares.TitleDownloaderMiddleware': 543, 28}

.git .mise.toml .nvim.lua storage # The rest is copied from https://github.com/github/gitignore/blob/main/Python.gitignore # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class # C extensions *.so # Distribution / packaging .Python build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ wheels/ share/python-wheels/ *.egg-info/ .installed.cfg *.egg MANIFEST # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. *.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .nox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *.cover *.py,cover .hypothesis/ .pytest_cache/ cover/ # Translations *.mo *.pot # Django stuff: *.log local_settings.py db.sqlite3 db.sqlite3-journal # Flask stuff: instance/ .webassets-cache # Scrapy stuff: .scrapy # Sphinx documentation docs/_build/ # PyBuilder .pybuilder/ target/ # Jupyter Notebook .ipynb_checkpoints # IPython profile_default/ ipython_config.py # pyenv # For a library or package, you might want to ignore these files since the code is # intended to run in multiple environments; otherwise, check them in: .python-version # pdm # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. #pdm.lock # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it # in version control. # https://pdm.fming.dev/latest/usage/project/#working-with-version-control .pdm.toml .pdm-python .pdm-build/ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm __pypackages__/ # Celery stuff celerybeat-schedule celerybeat.pid # SageMath parsed files *.sage.py # Environments .env .venv env/ venv/ ENV/ env.bak/ venv.bak/ # Spyder project settings .spyderproject .spyproject # Rope project settings .ropeproject # mkdocs documentation /site # mypy .mypy_cache/ .dmypy.json dmypy.json # Pyre type checker .pyre/ # pytype static type analyzer .pytype/ # Cython debug symbols cython_debug/ # PyCharm # JetBrains specific template is maintained in a separate JetBrains.gitignore that can # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore # and can be added to the global gitignore or merged into this file. For a more nuclear # option (not recommended) you can uncomment the following to ignore the entire idea folder. .idea/ # Visual Studio Code # Ignores the folder created by VS Code when changing workspace settings, doing debugger # configuration, etc. Can be commented out to share Workspace Settings within a team .vscode # Zed editor # Ignores the folder created when setting Project Settings in the Zed editor. Can be commented out # to share Project Settings within a team .zed

.mise.toml .nvim.lua storage # The rest is copied from https://github.com/github/gitignore/blob/main/Python.gitignore # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class # C extensions *.so # Distribution / packaging .Python build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ wheels/ share/python-wheels/ *.egg-info/ .installed.cfg *.egg MANIFEST # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. *.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .nox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *.cover *.py,cover .hypothesis/ .pytest_cache/ cover/ # Translations *.mo *.pot # Django stuff: *.log local_settings.py db.sqlite3 db.sqlite3-journal # Flask stuff: instance/ .webassets-cache # Scrapy stuff: .scrapy # Sphinx documentation docs/_build/ # PyBuilder .pybuilder/ target/ # Jupyter Notebook .ipynb_checkpoints # IPython profile_default/ ipython_config.py # pyenv # For a library or package, you might want to ignore these files since the code is # intended to run in multiple environments; otherwise, check them in: .python-version # pdm # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. #pdm.lock # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it # in version control. # https://pdm.fming.dev/latest/usage/project/#working-with-version-control .pdm.toml .pdm-python .pdm-build/ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm __pypackages__/ # Celery stuff celerybeat-schedule celerybeat.pid # SageMath parsed files *.sage.py # Environments .env .venv env/ venv/ ENV/ env.bak/ venv.bak/ # Spyder project settings .spyderproject .spyproject # Rope project settings .ropeproject # mkdocs documentation /site # mypy .mypy_cache/ .dmypy.json dmypy.json # Pyre type checker .pyre/ # pytype static type analyzer .pytype/ # Cython debug symbols cython_debug/ # PyCharm # JetBrains specific template is maintained in a separate JetBrains.gitignore that can # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore # and can be added to the global gitignore or merged into this file. For a more nuclear # option (not recommended) you can uncomment the following to ignore the entire idea folder. .idea/ # Visual Studio Code # Ignores the folder created by VS Code when changing workspace settings, doing debugger # configuration, etc. Can be commented out to share Workspace Settings within a team .vscode # Zed editor # Ignores the folder created when setting Project Settings in the Zed editor. Can be commented out # to share Project Settings within a team .zed

1# Apify Actors Development Guide 2 3Important: Before you begin, fill in the `generatedBy` property in the meta section of `.actor/actor.json`. Replace it with the tool and model you're currently using, such as "Claude Code with Claude Sonnet 4.5". This helps Apify monitor and improve AGENTS.md for specific AI tools and models. 4 5## What are Apify Actors? 6 7- Actors are serverless programs that run in the cloud. They're inspired by the UNIX philosophy - programs that do one thing well and can be easily combined to build complex systems. 8- Actors are programs packaged as Docker images that run in isolated containers 9 10## Core Concepts 11 12- Accept well-defined JSON input 13- Perform isolated tasks (web scraping, automation, data processing) 14- Produce structured JSON output to datasets and/or store data in key-value stores 15- Can run from seconds to hours or even indefinitely 16- Persist state and can be restarted 17 18## Do 19 20- accept well-defined JSON input and produce structured JSON output 21- use Apify SDK (`apify`) for code running ON Apify platform 22- validate input early with proper error handling and fail gracefully 23- use CheerioCrawler for static HTML content (10x faster than browsers) 24- use PlaywrightCrawler only for JavaScript-heavy sites and dynamic content 25- use router pattern (createCheerioRouter/createPlaywrightRouter) for complex crawls 26- implement retry strategies with exponential backoff for failed requests 27- use proper concurrency settings (HTTP: 10-50, Browser: 1-5) 28- set sensible defaults in `.actor/input_schema.json` for all optional fields 29- set up output schema in `.actor/output_schema.json` 30- clean and validate data before pushing to dataset 31- use semantic CSS selectors and fallback strategies for missing elements 32- respect robots.txt, ToS, and implement rate limiting with delays 33- check which tools (cheerio/playwright/crawlee) are installed before applying guidance 34- use `Actor.log` for logging (censors sensitive data) 35- implement readiness probe handler for standby Actors 36- handle the `aborting` event to gracefully shut down when Actor is stopped 37 38## Don't 39 40- do not rely on `Dataset.getInfo()` for final counts on Cloud platform 41- do not use browser crawlers when HTTP/Cheerio works (massive performance gains with HTTP) 42- do not hard code values that should be in input schema or environment variables 43- do not skip input validation or error handling 44- do not overload servers - use appropriate concurrency and delays 45- do not scrape prohibited content or ignore Terms of Service 46- do not store personal/sensitive data unless explicitly permitted 47- do not use deprecated options like `requestHandlerTimeoutMillis` on CheerioCrawler (v3.x) 48- do not use `additionalHttpHeaders` - use `preNavigationHooks` instead 49- do not assume that local storage is persistent or automatically synced to Apify Console - when running locally with `apify run`, the `storage/` directory is local-only and is NOT pushed to the Cloud 50- do not disable standby mode (`usesStandbyMode: false`) without explicit permission 51 52## Logging 53 54- **ALWAYS use `Actor.log` for logging** - This logger contains critical security logic including censoring sensitive data (Apify tokens, API keys, credentials) to prevent accidental exposure in logs 55 56### Available Log Levels 57 58The Apify Actor logger provides the following methods for logging: 59 60- `Actor.log.debug()` - Debug level logs (detailed diagnostic information) 61- `Actor.log.info()` - Info level logs (general informational messages) 62- `Actor.log.warning()` - Warning level logs (warning messages for potentially problematic situations) 63- `Actor.log.error()` - Error level logs (error messages for failures) 64- `Actor.log.exception()` - Exception level logs (for exceptions with stack traces) 65 66**Best practices:** 67 68- Use `Actor.log.debug()` for detailed operation-level diagnostics (inside functions) 69- Use `Actor.log.info()` for general informational messages (API requests, successful operations) 70- Use `Actor.log.warning()` for potentially problematic situations (validation failures, unexpected states) 71- Use `Actor.log.error()` for actual errors and failures 72- Use `Actor.log.exception()` for caught exceptions with stack traces 73 74## Graceful Abort Handling 75 76Handle the `aborting` event to terminate the Actor quickly when stopped by user or platform, minimizing costs especially for PPU/PPE+U billing. 77 78```python 79import asyncio 80 81async def on_aborting() -> None: 82 # Persist any state, do any cleanup you need, and terminate the Actor using `await Actor.exit()` explicitly as soon as possible 83 # This will help ensure that the Actor is doing best effort to honor any potential limits on costs of a single run set by the user 84 # Wait 1 second to allow Crawlee/SDK state persistence operations to complete 85 # This is a temporary workaround until SDK implements proper state persistence in the aborting event 86 await asyncio.sleep(1) 87 await Actor.exit() 88 89Actor.on('aborting', on_aborting) 90``` 91 92## Standby Mode 93 94- **NEVER disable standby mode (`usesStandbyMode: false`) in `.actor/actor.json` without explicit permission** - Actor Standby mode solves this problem by letting you have the Actor ready in the background, waiting for the incoming HTTP requests. In a sense, the Actor behaves like a real-time web server or standard API server instead of running the logic once to process everything in batch. Always keep `usesStandbyMode: true` unless there is a specific documented reason to disable it 95- **ALWAYS implement readiness probe handler for standby Actors** - Handle the `x-apify-container-server-readiness-probe` header at GET / endpoint to ensure proper Actor lifecycle management 96 97You can recognize a standby Actor by checking the `usesStandbyMode` property in `.actor/actor.json`. Only implement the readiness probe if this property is set to `true`. 98 99### Readiness Probe Implementation Example 100 101```python 102# Apify standby readiness probe 103from http.server import SimpleHTTPRequestHandler 104 105class GetHandler(SimpleHTTPRequestHandler): 106 def do_GET(self): 107 # Handle Apify standby readiness probe 108 if 'x-apify-container-server-readiness-probe' in self.headers: 109 self.send_response(200) 110 self.end_headers() 111 self.wfile.write(b'Readiness probe OK') 112 return 113 114 self.send_response(200) 115 self.end_headers() 116 self.wfile.write(b'Actor is ready') 117``` 118 119Key points: 120 121- Detect the `x-apify-container-server-readiness-probe` header in incoming requests 122- Respond with HTTP 200 status code for both readiness probe and normal requests 123- This enables proper Actor lifecycle management in standby mode 124 125## Commands 126 127```bash 128# Local development 129apify run # Run Actor locally 130 131# Authentication & deployment 132apify login # Authenticate account 133apify push # Deploy to Apify platform 134 135# Help 136apify help # List all commands 137``` 138 139## Safety and Permissions 140 141Allowed without prompt: 142 143- read files with `Actor.get_value()` 144- push data with `Actor.push_data()` 145- set values with `Actor.set_value()` 146- enqueue requests to RequestQueue 147- run locally with `apify run` 148 149Ask first: 150 151- npm/pip package installations 152- apify push (deployment to cloud) 153- proxy configuration changes (requires paid plan) 154- Dockerfile changes affecting builds 155- deleting datasets or key-value stores 156 157## Project Structure 158 159.actor/ 160├── actor.json # Actor config: name, version, env vars, runtime settings 161├── input_schema.json # Input validation & Console form definition 162└── output_schema.json # Specifies where an Actor stores its output 163src/ 164└── main.js # Actor entry point and orchestrator 165storage/ # Local-only storage for development (NOT synced to Cloud) 166├── datasets/ # Output items (JSON objects) 167├── key_value_stores/ # Files, config, INPUT 168└── request_queues/ # Pending crawl requests 169Dockerfile # Container image definition 170AGENTS.md # AI agent instructions (this file) 171 172## Local vs Cloud Storage 173 174When running locally with `apify run`, the Apify SDK emulates Cloud storage APIs using the local `storage/` directory. This local storage behaves differently from Cloud storage: 175 176- **Local storage is NOT persistent** - The `storage/` directory is meant for local development and testing only. Data stored there (datasets, key-value stores, request queues) exists only on your local disk. 177- **Local storage is NOT automatically pushed to Apify Console** - Running `apify run` does not upload any storage data to the Apify platform. The data stays local. 178- **Each local run may overwrite previous data** - The local `storage/` directory is reused between runs, but this is local-only behavior, not Cloud persistence. 179- **Cloud storage only works when running on Apify platform** - After deploying with `apify push` and running the Actor in the Cloud, storage calls (`Actor.push_data()`, `Actor.set_value()`, etc.) interact with real Apify Cloud storage, which is then visible in the Apify Console. 180- **To verify Actor output, deploy and run in Cloud** - Do not rely on local `storage/` contents as proof that data will appear in the Apify Console. Always test by deploying (`apify push`) and running the Actor on the platform. 181 182## Actor Input Schema 183 184The input schema defines the input parameters for an Actor. It's a JSON object comprising various field types supported by the Apify platform. 185 186### Structure 187 188```json 189{ 190 "title": "<INPUT-SCHEMA-TITLE>", 191 "type": "object", 192 "schemaVersion": 1, 193 "properties": { 194 /* define input fields here */ 195 }, 196 "required": [] 197} 198``` 199 200### Example 201 202```json 203{ 204 "title": "E-commerce Product Scraper Input", 205 "type": "object", 206 "schemaVersion": 1, 207 "properties": { 208 "startUrls": { 209 "title": "Start URLs", 210 "type": "array", 211 "description": "URLs to start scraping from (category pages or product pages)", 212 "editor": "requestListSources", 213 "default": [{ "url": "https://example.com/category" }], 214 "prefill": [{ "url": "https://example.com/category" }] 215 }, 216 "followVariants": { 217 "title": "Follow Product Variants", 218 "type": "boolean", 219 "description": "Whether to scrape product variants (different colors, sizes)", 220 "default": true 221 }, 222 "maxRequestsPerCrawl": { 223 "title": "Max Requests per Crawl", 224 "type": "integer", 225 "description": "Maximum number of pages to scrape (0 = unlimited)", 226 "default": 1000, 227 "minimum": 0 228 }, 229 "proxyConfiguration": { 230 "title": "Proxy Configuration", 231 "type": "object", 232 "description": "Proxy settings for anti-bot protection", 233 "editor": "proxy", 234 "default": { "useApifyProxy": false } 235 }, 236 "locale": { 237 "title": "Locale", 238 "type": "string", 239 "description": "Language/country code for localized content", 240 "default": "cs", 241 "enum": ["cs", "en", "de", "sk"], 242 "enumTitles": ["Czech", "English", "German", "Slovak"] 243 } 244 }, 245 "required": ["startUrls"] 246} 247``` 248 249## Actor Output Schema 250 251The Actor output schema builds upon the schemas for the dataset and key-value store. It specifies where an Actor stores its output and defines templates for accessing that output. Apify Console uses these output definitions to display run results. 252 253### Structure 254 255```json 256{ 257 "actorOutputSchemaVersion": 1, 258 "title": "<OUTPUT-SCHEMA-TITLE>", 259 "properties": { 260 /* define your outputs here */ 261 } 262} 263``` 264 265### Example 266 267```json 268{ 269 "actorOutputSchemaVersion": 1, 270 "title": "Output schema of the files scraper", 271 "properties": { 272 "files": { 273 "type": "string", 274 "title": "Files", 275 "template": "{{links.apiDefaultKeyValueStoreUrl}}/keys" 276 }, 277 "dataset": { 278 "type": "string", 279 "title": "Dataset", 280 "template": "{{links.apiDefaultDatasetUrl}}/items" 281 } 282 } 283} 284``` 285 286### Output Schema Template Variables 287 288- `links` (object) - Contains quick links to most commonly used URLs 289- `links.publicRunUrl` (string) - Public run url in format `https://console.apify.com/view/runs/:runId` 290- `links.consoleRunUrl` (string) - Console run url in format `https://console.apify.com/actors/runs/:runId` 291- `links.apiRunUrl` (string) - API run url in format `https://api.apify.com/v2/actor-runs/:runId` 292- `links.apiDefaultDatasetUrl` (string) - API url of default dataset in format `https://api.apify.com/v2/datasets/:defaultDatasetId` 293- `links.apiDefaultKeyValueStoreUrl` (string) - API url of default key-value store in format `https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId` 294- `links.containerRunUrl` (string) - URL of a webserver running inside the run in format `https://<containerId>.runs.apify.net/` 295- `run` (object) - Contains information about the run same as it is returned from the `GET Run` API endpoint 296- `run.defaultDatasetId` (string) - ID of the default dataset 297- `run.defaultKeyValueStoreId` (string) - ID of the default key-value store 298 299## Dataset Schema Specification 300 301The dataset schema defines how your Actor's output data is structured, transformed, and displayed in the Output tab in the Apify Console. 302 303### Example 304 305Consider an example Actor that calls `Actor.pushData()` to store data into dataset: 306 307```python 308# Dataset push example (Python) 309import asyncio 310from datetime import datetime 311from apify import Actor 312 313async def main(): 314 await Actor.init() 315 316 # Actor code 317 await Actor.push_data({ 318 'numericField': 10, 319 'pictureUrl': 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png', 320 'linkUrl': 'https://google.com', 321 'textField': 'Google', 322 'booleanField': True, 323 'dateField': datetime.now().isoformat(), 324 'arrayField': ['#hello', '#world'], 325 'objectField': {}, 326 }) 327 328 # Exit successfully 329 await Actor.exit() 330 331if __name__ == '__main__': 332 asyncio.run(main()) 333``` 334 335To set up the Actor's output tab UI, reference a dataset schema file in `.actor/actor.json`: 336 337```json 338{ 339 "actorSpecification": 1, 340 "name": "book-library-scraper", 341 "title": "Book Library Scraper", 342 "version": "1.0.0", 343 "storages": { 344 "dataset": "./dataset_schema.json" 345 } 346} 347``` 348 349Then create the dataset schema in `.actor/dataset_schema.json`: 350 351```json 352{ 353 "actorSpecification": 1, 354 "fields": {}, 355 "views": { 356 "overview": { 357 "title": "Overview", 358 "transformation": { 359 "fields": [ 360 "pictureUrl", 361 "linkUrl", 362 "textField", 363 "booleanField", 364 "arrayField", 365 "objectField", 366 "dateField", 367 "numericField" 368 ] 369 }, 370 "display": { 371 "component": "table", 372 "properties": { 373 "pictureUrl": { 374 "label": "Image", 375 "format": "image" 376 }, 377 "linkUrl": { 378 "label": "Link", 379 "format": "link" 380 }, 381 "textField": { 382 "label": "Text", 383 "format": "text" 384 }, 385 "booleanField": { 386 "label": "Boolean", 387 "format": "boolean" 388 }, 389 "arrayField": { 390 "label": "Array", 391 "format": "array" 392 }, 393 "objectField": { 394 "label": "Object", 395 "format": "object" 396 }, 397 "dateField": { 398 "label": "Date", 399 "format": "date" 400 }, 401 "numericField": { 402 "label": "Number", 403 "format": "number" 404 } 405 } 406 } 407 } 408 } 409} 410``` 411 412### Structure 413 414```json 415{ 416 "actorSpecification": 1, 417 "fields": {}, 418 "views": { 419 "<VIEW_NAME>": { 420 "title": "string (required)", 421 "description": "string (optional)", 422 "transformation": { 423 "fields": ["string (required)"], 424 "unwind": ["string (optional)"], 425 "flatten": ["string (optional)"], 426 "omit": ["string (optional)"], 427 "limit": "integer (optional)", 428 "desc": "boolean (optional)" 429 }, 430 "display": { 431 "component": "table (required)", 432 "properties": { 433 "<FIELD_NAME>": { 434 "label": "string (optional)", 435 "format": "text|number|date|link|boolean|image|array|object (optional)" 436 } 437 } 438 } 439 } 440 } 441} 442``` 443 444**Dataset Schema Properties:** 445 446- `actorSpecification` (integer, required) - Specifies the version of dataset schema structure document (currently only version 1) 447- `fields` (JSONSchema object, required) - Schema of one dataset object (use JsonSchema Draft 2020-12 or compatible) 448- `views` (DatasetView object, required) - Object with API and UI views description 449 450**DatasetView Properties:** 451 452- `title` (string, required) - Visible in UI Output tab and API 453- `description` (string, optional) - Only available in API response 454- `transformation` (ViewTransformation object, required) - Data transformation applied when loading from Dataset API 455- `display` (ViewDisplay object, required) - Output tab UI visualization definition 456 457**ViewTransformation Properties:** 458 459- `fields` (string[], required) - Fields to present in output (order matches column order) 460- `unwind` (string[], optional) - Deconstructs nested children into parent object 461- `flatten` (string[], optional) - Transforms nested object into flat structure 462- `omit` (string[], optional) - Removes specified fields from output 463- `limit` (integer, optional) - Maximum number of results (default: all) 464- `desc` (boolean, optional) - Sort order (true = newest first) 465 466**ViewDisplay Properties:** 467 468- `component` (string, required) - Only `table` is available 469- `properties` (Object, optional) - Keys matching `transformation.fields` with ViewDisplayProperty values 470 471**ViewDisplayProperty Properties:** 472 473- `label` (string, optional) - Table column header 474- `format` (string, optional) - One of: `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object` 475 476## Key-Value Store Schema Specification 477 478The key-value store schema organizes keys into logical groups called collections for easier data management. 479 480### Example 481 482Consider an example Actor that calls `Actor.setValue()` to save records into the key-value store: 483 484```python 485# Key-Value Store set example (Python) 486import asyncio 487from apify import Actor 488 489async def main(): 490 await Actor.init() 491 492 # Actor code 493 await Actor.set_value('document-1', 'my text data', content_type='text/plain') 494 495 image_id = '123' # example placeholder 496 image_buffer = b'...' # bytes buffer with image data 497 await Actor.set_value(f'image-{image_id}', image_buffer, content_type='image/jpeg') 498 499 # Exit successfully 500 await Actor.exit() 501 502if __name__ == '__main__': 503 asyncio.run(main()) 504``` 505 506To configure the key-value store schema, reference a schema file in `.actor/actor.json`: 507 508```json 509{ 510 "actorSpecification": 1, 511 "name": "data-collector", 512 "title": "Data Collector", 513 "version": "1.0.0", 514 "storages": { 515 "keyValueStore": "./key_value_store_schema.json" 516 } 517} 518``` 519 520Then create the key-value store schema in `.actor/key_value_store_schema.json`: 521 522```json 523{ 524 "actorKeyValueStoreSchemaVersion": 1, 525 "title": "Key-Value Store Schema", 526 "collections": { 527 "documents": { 528 "title": "Documents", 529 "description": "Text documents stored by the Actor", 530 "keyPrefix": "document-" 531 }, 532 "images": { 533 "title": "Images", 534 "description": "Images stored by the Actor", 535 "keyPrefix": "image-", 536 "contentTypes": ["image/jpeg"] 537 } 538 } 539} 540``` 541 542### Structure 543 544```json 545{ 546 "actorKeyValueStoreSchemaVersion": 1, 547 "title": "string (required)", 548 "description": "string (optional)", 549 "collections": { 550 "<COLLECTION_NAME>": { 551 "title": "string (required)", 552 "description": "string (optional)", 553 "key": "string (conditional - use key OR keyPrefix)", 554 "keyPrefix": "string (conditional - use key OR keyPrefix)", 555 "contentTypes": ["string (optional)"], 556 "jsonSchema": "object (optional)" 557 } 558 } 559} 560``` 561 562**Key-Value Store Schema Properties:** 563 564- `actorKeyValueStoreSchemaVersion` (integer, required) - Version of key-value store schema structure document (currently only version 1) 565- `title` (string, required) - Title of the schema 566- `description` (string, optional) - Description of the schema 567- `collections` (Object, required) - Object where each key is a collection ID and value is a Collection object 568 569**Collection Properties:** 570 571- `title` (string, required) - Collection title shown in UI tabs 572- `description` (string, optional) - Description appearing in UI tooltips 573- `key` (string, conditional\*) - Single specific key for this collection 574- `keyPrefix` (string, conditional\*) - Prefix for keys included in this collection 575- `contentTypes` (string[], optional) - Allowed content types for validation 576- `jsonSchema` (object, optional) - JSON Schema Draft 07 format for `application/json` content type validation 577 578\*Either `key` or `keyPrefix` must be specified for each collection, but not both. 579 580## Apify MCP Tools 581 582If MCP server is configured, use these tools for documentation: 583 584- `search-apify-docs` - Search documentation 585- `fetch-apify-docs` - Get full doc pages 586 587Otherwise, reference: `@https://mcp.apify.com/` 588 589## Resources 590 591- [docs.apify.com/llms.txt](https://docs.apify.com/llms.txt) - Quick reference 592- [docs.apify.com/llms-full.txt](https://docs.apify.com/llms-full.txt) - Complete docs 593- [crawlee.dev](https://crawlee.dev) - Crawlee documentation 594- [whitepaper.actor](https://raw.githubusercontent.com/apify/actor-whitepaper/refs/heads/master/README.md) - Complete Actor specification

# First, specify the base Docker image. # You can see the Docker images from Apify at https://hub.docker.com/r/apify/. # You can also use any other image from Docker Hub. FROM apify/actor-python:3.14 # Second, copy just requirements.txt into the Actor image, # since it should be the only file that affects the dependency install in the next step, # in order to speed up the build COPY --chown=myuser:myuser requirements.txt ./ # Install the packages specified in requirements.txt, # Print the installed Python version, pip version # and all installed packages with their versions for debugging RUN echo "Python version:" \ && python --version \ && echo "Pip version:" \ && pip --version \ && echo "Installing dependencies:" \ && pip install -r requirements.txt \ && echo "All installed Python packages:" \ && pip freeze # Next, copy the remaining files and directories with the source code. # Since we do this after installing the dependencies, quick build will be really fast # for most source file changes. COPY --chown=myuser:myuser . ./ # Use compileall to ensure the runnability of the Actor Python code. RUN python3 -m compileall -q src/ # Specify how to launch the source code of your Actor. # By default, the "python3 -m src" command is run CMD ["python3", "-m", "src"]

1# Feel free to add your Python dependencies below. For formatting guidelines, see: 2# https://pip.pypa.io/en/latest/reference/requirements-file-format/ 3 4apify[scrapy] >= 3.3.0, < 4.0.0 5scrapy >= 2.14.0, < 3.0.0

test_actor

Test Actor

test

My Personal Test Actor

Test Actor

Test Actor

Test Actor

test actor

Test PPE Actor

Test

Test

.actor/actor.json

.actor/input_schema.json

src/spiders/init.py

src/spiders/py.typed

src/spiders/title.py

src/init.py

src/main.py

src/items.py

src/main.py

src/middlewares.py

src/pipelines.py

src/py.typed

src/settings.py

.dockerignore

.gitignore

AGENTS.md

CLAUDE.md

Dockerfile

requirements.txt

scrapy.cfg

test_actor

You might also like

Test Actor

test

My Personal Test Actor

Test Actor

Test Actor

Test Actor

test actor

Test PPE Actor

Test

Test

.actor/actor.json

.actor/input_schema.json

src/spiders/__init__.py

src/spiders/py.typed

src/spiders/title.py

src/__init__.py

src/__main__.py

src/items.py

src/main.py

src/middlewares.py

src/pipelines.py

src/py.typed

src/settings.py

.dockerignore

.gitignore

AGENTS.md

CLAUDE.md

Dockerfile

requirements.txt

scrapy.cfg

src/spiders/init.py

src/init.py

src/main.py