# Apify Actors Development Guide

Important: Before you begin, fill in the `generatedBy` property in the meta section of `.actor/actor.json`. Set it to the model you are currently using. This helps Apify monitor and improve AGENTS.md for specific LLM models.

## What are Apify Actors?

- Actors are serverless programs that run in the cloud. They're inspired by the UNIX philosophy - programs that do one thing well and can be easily combined to build complex systems.
- Actors are programs packaged as Docker images that run in isolated containers.

## Core Concepts

- Accept well-defined JSON input
- Perform isolated tasks (web scraping, automation, data processing)
- Produce structured JSON output to datasets and/or store data in key-value stores
- Can run from seconds to hours, or even indefinitely
- Persist state and can be restarted (see the state-persistence sketch below)
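
A minimal sketch of the state-persistence idea, using only `Actor.get_value()` and `Actor.set_value()` from the Python SDK. The `STATE` key name and the shape of the state object are arbitrary choices for this illustration, not a platform convention.

```python
# State-persistence sketch (Python SDK). The 'STATE' key and the structure
# of the state dict are illustrative choices, not a platform requirement.
import asyncio
from apify import Actor

async def main():
    await Actor.init()

    # Resume from previously persisted state, or start fresh.
    state = await Actor.get_value('STATE') or {'processed': 0}

    for i in range(state['processed'], 10):
        Actor.log.info(f'Processing item {i}')
        state['processed'] = i + 1
        # Persist progress so a restarted run can pick up where it left off.
        await Actor.set_value('STATE', state)

    await Actor.exit()

if __name__ == '__main__':
    asyncio.run(main())
```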

## Do

- accept well-defined JSON input and produce structured JSON output
- use the Apify SDK (`apify`) for code running ON the Apify platform
- validate input early with proper error handling and fail gracefully (see the sketch after this list)
- use CheerioCrawler for static HTML content (10x faster than browsers)
- use PlaywrightCrawler only for JavaScript-heavy sites and dynamic content
- use the router pattern (`createCheerioRouter`/`createPlaywrightRouter`) for complex crawls
- implement retry strategies with exponential backoff for failed requests
- use proper concurrency settings (HTTP: 10-50, browser: 1-5)
- set sensible defaults in `.actor/input_schema.json` for all optional fields
- set up the output schema in `.actor/output_schema.json`
- clean and validate data before pushing it to the dataset
- use semantic CSS selectors and fallback strategies for missing elements
- respect robots.txt and ToS, and implement rate limiting with delays
- check which tools (cheerio/playwright/crawlee) are installed before applying guidance
- use `Actor.log` for logging (it censors sensitive data)
- implement a readiness probe handler for standby Actors
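
For the input-validation item above, here is a minimal sketch using `Actor.get_input()` from the Python SDK. The field names (`startUrls`, `maxRequestsPerCrawl`) mirror the example input schema later in this guide, and the exact checks are illustrative rather than prescriptive.

```python
# Input-validation sketch (Python SDK). Field names follow the example input
# schema shown later in this guide; the checks themselves are illustrative.
import asyncio
from apify import Actor

async def main():
    await Actor.init()

    actor_input = await Actor.get_input() or {}
    start_urls = actor_input.get('startUrls', [])
    max_requests = actor_input.get('maxRequestsPerCrawl', 1000)

    # Fail fast with a clear message instead of crashing mid-crawl.
    if not start_urls:
        Actor.log.error('Input is missing "startUrls" - nothing to scrape.')
        await Actor.fail()
        return

    Actor.log.info(f'Starting with {len(start_urls)} start URL(s), request limit {max_requests}.')
    # ... crawling logic goes here ...

    await Actor.exit()

if __name__ == '__main__':
    asyncio.run(main())
```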

## Don't

- do not rely on `Dataset.getInfo()` for final counts on the Cloud platform
- do not use browser crawlers when HTTP/Cheerio works (massive performance gains with plain HTTP)
- do not hard-code values that should live in the input schema or environment variables
- do not skip input validation or error handling
- do not overload servers - use appropriate concurrency and delays
- do not scrape prohibited content or ignore Terms of Service
- do not store personal/sensitive data unless explicitly permitted
- do not use deprecated options like `requestHandlerTimeoutMillis` on CheerioCrawler (v3.x)
- do not use `additionalHttpHeaders` - use `preNavigationHooks` instead
- do not disable standby mode (`usesStandbyMode: false`) without explicit permission

## Logging

- **ALWAYS use `Actor.log` for logging** - this logger contains critical security logic, including censoring sensitive data (Apify tokens, API keys, credentials), to prevent accidental exposure in logs

### Available Log Levels

The Apify Actor logger provides the following methods:

- `Actor.log.debug()` - debug-level logs (detailed diagnostic information)
- `Actor.log.info()` - info-level logs (general informational messages)
- `Actor.log.warning()` - warning-level logs (potentially problematic situations)
- `Actor.log.error()` - error-level logs (failures)
- `Actor.log.exception()` - exception-level logs (exceptions with stack traces)

**Best practices** (see the sketch after this list):

- Use `Actor.log.debug()` for detailed operation-level diagnostics (inside functions)
- Use `Actor.log.info()` for general informational messages (API requests, successful operations)
- Use `Actor.log.warning()` for potentially problematic situations (validation failures, unexpected states)
- Use `Actor.log.error()` for actual errors and failures
- Use `Actor.log.exception()` for caught exceptions with stack traces
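
A short sketch of these levels in practice with the Python SDK; the function, messages, and the simulated failure are made up for illustration:

```python
# Illustrative use of the Actor logger levels (Python SDK).
from apify import Actor

async def process_page(url: str, attempt: int) -> None:
    Actor.log.debug(f'Fetching {url} (attempt {attempt})')  # detailed diagnostics
    Actor.log.info(f'Processing {url}')                     # general progress

    if attempt > 1:
        # Unexpected but recoverable situation.
        Actor.log.warning(f'Retrying {url}, the previous attempt failed')

    try:
        raise RuntimeError('simulated failure')  # placeholder for real page-processing work
    except RuntimeError:
        Actor.log.error(f'Failed to process {url}')    # plain error message
        Actor.log.exception('Details of the failure')  # same event, with stack trace
```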

## Standby Mode

- **NEVER disable standby mode (`usesStandbyMode: false`) in `.actor/actor.json` without explicit permission** - Standby mode keeps the Actor running in the background, ready for incoming HTTP requests, so it behaves like a real-time web server or standard API server instead of running its logic once to process everything in a batch. Always keep `usesStandbyMode: true` unless there is a specific, documented reason to disable it
- **ALWAYS implement a readiness probe handler for standby Actors** - handle the `x-apify-container-server-readiness-probe` header on the GET / endpoint to ensure proper Actor lifecycle management

You can recognize a standby Actor by checking the `usesStandbyMode` property in `.actor/actor.json`. Only implement the readiness probe if this property is set to `true`.

### Readiness Probe Implementation Example

```python
# Apify standby readiness probe
import os
from http.server import HTTPServer, SimpleHTTPRequestHandler

class GetHandler(SimpleHTTPRequestHandler):
    def do_GET(self):
        # Handle the Apify standby readiness probe
        if 'x-apify-container-server-readiness-probe' in self.headers:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'Readiness probe OK')
            return

        # Normal request handling
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'Actor is ready')

if __name__ == '__main__':
    # On the platform, the standby port is provided via the ACTOR_STANDBY_PORT env var.
    port = int(os.environ.get('ACTOR_STANDBY_PORT', '8080'))
    HTTPServer(('', port), GetHandler).serve_forever()
```

Key points:

- Detect the `x-apify-container-server-readiness-probe` header in incoming requests
- Respond with an HTTP 200 status code for both the readiness probe and normal requests
- This enables proper Actor lifecycle management in standby mode

## Commands

```bash
# Local development
apify run    # Run Actor locally

# Authentication & deployment
apify login  # Authenticate account
apify push   # Deploy to Apify platform

# Help
apify help   # List all commands
```

## Safety and Permissions

Allowed without prompt:

- read files with `Actor.get_value()`
- push data with `Actor.push_data()`
- set values with `Actor.set_value()`
- enqueue requests to the RequestQueue
- run locally with `apify run`

Ask first:

- npm/pip package installations
- `apify push` (deployment to the cloud)
- proxy configuration changes (requires a paid plan)
- Dockerfile changes affecting builds
- deleting datasets or key-value stores

## Project Structure

```text
.actor/
├── actor.json            # Actor config: name, version, env vars, runtime settings
├── input_schema.json     # Input validation & Console form definition
└── output_schema.json    # Specifies where an Actor stores its output
src/
└── main.js               # Actor entry point and orchestrator
storage/                  # Local storage (mirrors Cloud during development)
├── datasets/             # Output items (JSON objects)
├── key_value_stores/     # Files, config, INPUT
└── request_queues/       # Pending crawl requests
Dockerfile                # Container image definition
AGENTS.md                 # AI agent instructions (this file)
```

## Actor Input Schema

The input schema defines the input parameters for an Actor. It's a JSON object comprising the various field types supported by the Apify platform.

### Structure

```json
{
  "title": "<INPUT-SCHEMA-TITLE>",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    /* define input fields here */
  },
  "required": []
}
```

### Example

```json
{
  "title": "E-commerce Product Scraper Input",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "startUrls": {
      "title": "Start URLs",
      "type": "array",
      "description": "URLs to start scraping from (category pages or product pages)",
      "editor": "requestListSources",
      "default": [{ "url": "https://example.com/category" }],
      "prefill": [{ "url": "https://example.com/category" }]
    },
    "followVariants": {
      "title": "Follow Product Variants",
      "type": "boolean",
      "description": "Whether to scrape product variants (different colors, sizes)",
      "default": true
    },
    "maxRequestsPerCrawl": {
      "title": "Max Requests per Crawl",
      "type": "integer",
      "description": "Maximum number of pages to scrape (0 = unlimited)",
      "default": 1000,
      "minimum": 0
    },
    "proxyConfiguration": {
      "title": "Proxy Configuration",
      "type": "object",
      "description": "Proxy settings for anti-bot protection",
      "editor": "proxy",
      "default": { "useApifyProxy": false }
    },
    "locale": {
      "title": "Locale",
      "type": "string",
      "description": "Language/country code for localized content",
      "default": "cs",
      "enum": ["cs", "en", "de", "sk"],
      "enumTitles": ["Czech", "English", "German", "Slovak"]
    }
  },
  "required": ["startUrls"]
}
```
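
For reference, this is roughly how the fields defined above arrive in the Actor at runtime. The sketch uses `Actor.get_input()` from the Python SDK; the fallback values simply repeat the schema defaults in case a field is absent (for example on local runs), and the `proxy` editor's output is passed to the SDK's proxy helper, `Actor.create_proxy_configuration()`.

```python
# Sketch: reading the input defined by the example schema above (Python SDK).
import asyncio
from apify import Actor

async def main():
    await Actor.init()

    actor_input = await Actor.get_input() or {}

    # Fall back to the schema defaults in case a field is missing.
    start_urls = actor_input.get('startUrls', [])
    follow_variants = actor_input.get('followVariants', True)
    max_requests = actor_input.get('maxRequestsPerCrawl', 1000)
    locale = actor_input.get('locale', 'cs')

    # Turn the 'proxy' editor object into a usable proxy configuration.
    proxy_configuration = await Actor.create_proxy_configuration(
        actor_proxy_input=actor_input.get('proxyConfiguration'),
    )

    Actor.log.info(f'Scraping {len(start_urls)} start URL(s), locale {locale}, limit {max_requests}')
    await Actor.exit()

if __name__ == '__main__':
    asyncio.run(main())
```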

## Actor Output Schema

The Actor output schema builds upon the schemas for the dataset and key-value store. It specifies where an Actor stores its output and defines templates for accessing that output. Apify Console uses these output definitions to display run results.

### Structure

```json
{
  "actorOutputSchemaVersion": 1,
  "title": "<OUTPUT-SCHEMA-TITLE>",
  "properties": {
    /* define your outputs here */
  }
}
```

### Example

```json
{
  "actorOutputSchemaVersion": 1,
  "title": "Output schema of the files scraper",
  "properties": {
    "files": {
      "type": "string",
      "title": "Files",
      "template": "{{links.apiDefaultKeyValueStoreUrl}}/keys"
    },
    "dataset": {
      "type": "string",
      "title": "Dataset",
      "template": "{{links.apiDefaultDatasetUrl}}/items"
    }
  }
}
```

### Output Schema Template Variables

- `links` (object) - Contains quick links to the most commonly used URLs
- `links.publicRunUrl` (string) - Public run URL in the format `https://console.apify.com/view/runs/:runId`
- `links.consoleRunUrl` (string) - Console run URL in the format `https://console.apify.com/actors/runs/:runId`
- `links.apiRunUrl` (string) - API run URL in the format `https://api.apify.com/v2/actor-runs/:runId`
- `links.apiDefaultDatasetUrl` (string) - API URL of the default dataset in the format `https://api.apify.com/v2/datasets/:defaultDatasetId` (see the sketch after this list)
- `links.apiDefaultKeyValueStoreUrl` (string) - API URL of the default key-value store in the format `https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId`
- `links.containerRunUrl` (string) - URL of a web server running inside the run, in the format `https://<containerId>.runs.apify.net/`
- `run` (object) - Contains the same information about the run as is returned from the `GET Run` API endpoint
- `run.defaultDatasetId` (string) - ID of the default dataset
- `run.defaultKeyValueStoreId` (string) - ID of the default key-value store
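
To make the templates concrete: `links.apiDefaultDatasetUrl` and `links.apiDefaultKeyValueStoreUrl` resolve to API URLs built from the run's default storage IDs. A sketch using the `apify-client` package; `RUN_ID` and `APIFY_TOKEN` are placeholders you would supply yourself.

```python
# Sketch: resolving what the output-schema templates point to, via apify-client.
from apify_client import ApifyClient

client = ApifyClient('APIFY_TOKEN')  # placeholder token
run = client.run('RUN_ID').get()     # same data the {{run.*}} variables draw from

dataset_items_url = f"https://api.apify.com/v2/datasets/{run['defaultDatasetId']}/items"
kv_store_keys_url = f"https://api.apify.com/v2/key-value-stores/{run['defaultKeyValueStoreId']}/keys"

print(dataset_items_url)   # what {{links.apiDefaultDatasetUrl}}/items expands to
print(kv_store_keys_url)   # what {{links.apiDefaultKeyValueStoreUrl}}/keys expands to
```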

## Dataset Schema Specification

The dataset schema defines how your Actor's output data is structured, transformed, and displayed in the Output tab in Apify Console.

### Example

Consider an example Actor that calls `Actor.push_data()` to store data into the dataset:

```python
# Dataset push example (Python)
import asyncio
from datetime import datetime
from apify import Actor

async def main():
    await Actor.init()

    # Actor code
    await Actor.push_data({
        'numericField': 10,
        'pictureUrl': 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
        'linkUrl': 'https://google.com',
        'textField': 'Google',
        'booleanField': True,
        'dateField': datetime.now().isoformat(),
        'arrayField': ['#hello', '#world'],
        'objectField': {},
    })

    # Exit successfully
    await Actor.exit()

if __name__ == '__main__':
    asyncio.run(main())
```

To set up the Actor's output tab UI, reference a dataset schema file in `.actor/actor.json`:

```json
{
  "actorSpecification": 1,
  "name": "book-library-scraper",
  "title": "Book Library Scraper",
  "version": "1.0.0",
  "storages": {
    "dataset": "./dataset_schema.json"
  }
}
```

Then create the dataset schema in `.actor/dataset_schema.json`:

```json
{
  "actorSpecification": 1,
  "fields": {},
  "views": {
    "overview": {
      "title": "Overview",
      "transformation": {
        "fields": [
          "pictureUrl",
          "linkUrl",
          "textField",
          "booleanField",
          "arrayField",
          "objectField",
          "dateField",
          "numericField"
        ]
      },
      "display": {
        "component": "table",
        "properties": {
          "pictureUrl": {
            "label": "Image",
            "format": "image"
          },
          "linkUrl": {
            "label": "Link",
            "format": "link"
          },
          "textField": {
            "label": "Text",
            "format": "text"
          },
          "booleanField": {
            "label": "Boolean",
            "format": "boolean"
          },
          "arrayField": {
            "label": "Array",
            "format": "array"
          },
          "objectField": {
            "label": "Object",
            "format": "object"
          },
          "dateField": {
            "label": "Date",
            "format": "date"
          },
          "numericField": {
            "label": "Number",
            "format": "number"
          }
        }
      }
    }
  }
}
```

### Structure

```json
{
  "actorSpecification": 1,
  "fields": {},
  "views": {
    "<VIEW_NAME>": {
      "title": "string (required)",
      "description": "string (optional)",
      "transformation": {
        "fields": ["string (required)"],
        "unwind": ["string (optional)"],
        "flatten": ["string (optional)"],
        "omit": ["string (optional)"],
        "limit": "integer (optional)",
        "desc": "boolean (optional)"
      },
      "display": {
        "component": "table (required)",
        "properties": {
          "<FIELD_NAME>": {
            "label": "string (optional)",
            "format": "text|number|date|link|boolean|image|array|object (optional)"
          }
        }
      }
    }
  }
}
```

**Dataset Schema Properties:**

- `actorSpecification` (integer, required) - Specifies the version of the dataset schema structure document (currently only version 1)
- `fields` (JSONSchema object, required) - Schema of one dataset object (use JSON Schema Draft 2020-12 or compatible)
- `views` (DatasetView object, required) - Object with API and UI views description

**DatasetView Properties:**

- `title` (string, required) - Visible in the UI Output tab and in the API
- `description` (string, optional) - Only available in the API response
- `transformation` (ViewTransformation object, required) - Data transformation applied when loading from the Dataset API
- `display` (ViewDisplay object, required) - Output tab UI visualization definition

**ViewTransformation Properties:**

- `fields` (string[], required) - Fields to present in the output (order matches column order)
- `unwind` (string[], optional) - Deconstructs nested children into the parent object (see the sketch after this list)
- `flatten` (string[], optional) - Transforms a nested object into a flat structure
- `omit` (string[], optional) - Removes the specified fields from the output
- `limit` (integer, optional) - Maximum number of results (default: all)
- `desc` (boolean, optional) - Sort order (true = newest first)
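
The `unwind` transformation is the least self-explanatory of these. Roughly, unwinding an array field yields one record per array element, merged with the parent object. The following is a local approximation of that behavior for the simple array case, intended purely as an illustration of the concept rather than the platform's implementation.

```python
# Rough, local approximation of unwinding an array field; illustration only.
def unwind_field(records: list[dict], field: str) -> list[dict]:
    result = []
    for record in records:
        children = record.get(field)
        if not isinstance(children, list):
            result.append(record)
            continue
        for child in children:
            merged = {key: value for key, value in record.items() if key != field}
            merged.update(child if isinstance(child, dict) else {field: child})
            result.append(merged)
    return result

# One product with two variants becomes two flat records.
products = [{'name': 'Mug', 'variants': [{'color': 'red'}, {'color': 'blue'}]}]
print(unwind_field(products, 'variants'))
# [{'name': 'Mug', 'color': 'red'}, {'name': 'Mug', 'color': 'blue'}]
```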

**ViewDisplay Properties:**

- `component` (string, required) - Only `table` is available
- `properties` (Object, optional) - Keys matching `transformation.fields` with ViewDisplayProperty values

**ViewDisplayProperty Properties:**

- `label` (string, optional) - Table column header
- `format` (string, optional) - One of: `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object`

## Key-Value Store Schema Specification

The key-value store schema organizes keys into logical groups called collections for easier data management.

### Example

Consider an example Actor that calls `Actor.set_value()` to save records into the key-value store:

```python
# Key-Value Store set example (Python)
import asyncio
from apify import Actor

async def main():
    await Actor.init()

    # Actor code
    await Actor.set_value('document-1', 'my text data', content_type='text/plain')

    image_id = '123'       # example placeholder
    image_buffer = b'...'  # bytes buffer with image data
    await Actor.set_value(f'image-{image_id}', image_buffer, content_type='image/jpeg')

    # Exit successfully
    await Actor.exit()

if __name__ == '__main__':
    asyncio.run(main())
```

To configure the key-value store schema, reference a schema file in `.actor/actor.json`:

```json
{
  "actorSpecification": 1,
  "name": "data-collector",
  "title": "Data Collector",
  "version": "1.0.0",
  "storages": {
    "keyValueStore": "./key_value_store_schema.json"
  }
}
```

Then create the key-value store schema in `.actor/key_value_store_schema.json`:

```json
{
  "actorKeyValueStoreSchemaVersion": 1,
  "title": "Key-Value Store Schema",
  "collections": {
    "documents": {
      "title": "Documents",
      "description": "Text documents stored by the Actor",
      "keyPrefix": "document-"
    },
    "images": {
      "title": "Images",
      "description": "Images stored by the Actor",
      "keyPrefix": "image-",
      "contentTypes": ["image/jpeg"]
    }
  }
}
```

### Structure

```json
{
  "actorKeyValueStoreSchemaVersion": 1,
  "title": "string (required)",
  "description": "string (optional)",
  "collections": {
    "<COLLECTION_NAME>": {
      "title": "string (required)",
      "description": "string (optional)",
      "key": "string (conditional - use key OR keyPrefix)",
      "keyPrefix": "string (conditional - use key OR keyPrefix)",
      "contentTypes": ["string (optional)"],
      "jsonSchema": "object (optional)"
    }
  }
}
```

**Key-Value Store Schema Properties:**

- `actorKeyValueStoreSchemaVersion` (integer, required) - Version of the key-value store schema structure document (currently only version 1)
- `title` (string, required) - Title of the schema
- `description` (string, optional) - Description of the schema
- `collections` (Object, required) - Object where each key is a collection ID and each value is a Collection object

**Collection Properties:**

- `title` (string, required) - Collection title shown in UI tabs
- `description` (string, optional) - Description appearing in UI tooltips
- `key` (string, conditional\*) - Single specific key for this collection
- `keyPrefix` (string, conditional\*) - Prefix for keys included in this collection
- `contentTypes` (string[], optional) - Allowed content types for validation
- `jsonSchema` (object, optional) - JSON Schema Draft 07 format for `application/json` content type validation

\*Either `key` or `keyPrefix` must be specified for each collection, but not both.

## Apify MCP Tools

If the MCP server is configured, use these tools for documentation:

- `search-apify-docs` - Search the documentation
- `fetch-apify-docs` - Fetch full documentation pages

Otherwise, reference: `@https://mcp.apify.com/`

## Resources

- [docs.apify.com/llms.txt](https://docs.apify.com/llms.txt) - Quick reference
- [docs.apify.com/llms-full.txt](https://docs.apify.com/llms-full.txt) - Complete docs
- [crawlee.dev](https://crawlee.dev) - Crawlee documentation
- [Actor whitepaper](https://raw.githubusercontent.com/apify/actor-whitepaper/refs/heads/master/README.md) - Complete Actor specification