# Apify Actors Development Guide

Important: Before you begin, fill in the `generatedBy` property in the meta section of `.actor/actor.json`. Replace its value with the name of the model you're currently using. This helps Apify monitor and improve AGENTS.md for specific LLM models.

## What are Apify Actors?

- Actors are serverless programs that run in the cloud. They're inspired by the UNIX philosophy - programs that do one thing well and can be easily combined to build complex systems.
- Actors are programs packaged as Docker images that run in isolated containers.

## Core Concepts

Actors:

- Accept well-defined JSON input
- Perform isolated tasks (web scraping, automation, data processing)
- Produce structured JSON output to datasets and/or store data in key-value stores
- Can run from seconds to hours or even indefinitely
- Persist state and can be restarted

## Do

- accept well-defined JSON input and produce structured JSON output
- use the Apify SDK (`apify`) for code running ON the Apify platform
- validate input early with proper error handling and fail gracefully
- use CheerioCrawler for static HTML content (10x faster than browsers)
- use PlaywrightCrawler only for JavaScript-heavy sites and dynamic content
- use the router pattern (createCheerioRouter/createPlaywrightRouter) for complex crawls
- implement retry strategies with exponential backoff for failed requests
- use proper concurrency settings (HTTP: 10-50, Browser: 1-5)
- set sensible defaults in `.actor/input_schema.json` for all optional fields
- set up the output schema in `.actor/output_schema.json`
- clean and validate data before pushing to the dataset
- use semantic CSS selectors and fallback strategies for missing elements
- respect robots.txt, ToS, and implement rate limiting with delays
- check which tools (cheerio/playwright/crawlee) are installed before applying guidance
- use `apify/log` package for logging (censors sensitive data)
- implement a readiness probe handler for standby Actors
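
The retry bullet above could be sketched roughly as follows. This is a minimal illustration and not Crawlee's built-in retry logic; the helper names `backoffDelayMillis` and `retryWithBackoff`, the 500 ms base delay, and the 30 s cap are all assumptions made for this example.

```javascript
// Hypothetical helper: delay before retry number `attempt` (0-based).
// The delay doubles with every attempt and is capped at `maxMillis`.
function backoffDelayMillis(attempt, baseMillis = 500, maxMillis = 30_000) {
    return Math.min(maxMillis, baseMillis * 2 ** attempt);
}

// Hypothetical helper: runs `task` until it succeeds or `maxRetries` is exhausted.
// `sleep` is injectable so the helper can be exercised without real waiting.
async function retryWithBackoff(task, options = {}) {
    const {
        maxRetries = 3,
        sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    } = options;

    let lastError;
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        try {
            return await task(attempt);
        } catch (error) {
            lastError = error;
            // Wait only if another attempt will follow.
            if (attempt < maxRetries) await sleep(backoffDelayMillis(attempt));
        }
    }
    throw lastError;
}
```

A call such as `await retryWithBackoff(() => fetch(url))` would then wait 500 ms, 1 s, and 2 s between attempts before giving up and rethrowing the last error.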

## Don't

- do not rely on `Dataset.getInfo()` for final counts on Cloud platform
- do not use browser crawlers when HTTP/Cheerio works (massive performance gains with HTTP)
- do not hard code values that should be in input schema or environment variables
- do not skip input validation or error handling
- do not overload servers - use appropriate concurrency and delays
- do not scrape prohibited content or ignore Terms of Service
- do not store personal/sensitive data unless explicitly permitted
- do not use deprecated options like `requestHandlerTimeoutMillis` on CheerioCrawler (v3.x)
- do not use `additionalHttpHeaders` - use `preNavigationHooks` instead
- do not disable standby mode (`usesStandbyMode: false`) without explicit permission

## Logging

- **ALWAYS use the `apify/log` package for logging** - This package contains critical security logic including censoring sensitive data (Apify tokens, API keys, credentials) to prevent accidental exposure in logs

### Available Log Levels in `apify/log`

The Apify log package provides the following methods for logging:

- `log.debug()` - Debug level logs (detailed diagnostic information)
- `log.info()` - Info level logs (general informational messages)
- `log.warning()` - Warning level logs (warning messages for potentially problematic situations)
- `log.warningOnce()` - Warning level logs (same warning message logged only once)
- `log.error()` - Error level logs (error messages for failures)
- `log.exception()` - Exception level logs (for exceptions with stack traces)
- `log.perf()` - Performance level logs (performance metrics and timing information)
- `log.deprecated()` - Deprecation level logs (warnings about deprecated code)
- `log.softFail()` - Soft failure logs (non-critical failures that don't stop execution, e.g., input validation errors, skipped items)
- `log.internal()` - Internal level logs (internal/system messages)

**Best practices:**

- Use `log.debug()` for detailed operation-level diagnostics (inside functions)
- Use `log.info()` for general informational messages (API requests, successful operations)
- Use `log.warning()` for potentially problematic situations (validation failures, unexpected states)
- Use `log.error()` for actual errors and failures
- Use `log.exception()` for caught exceptions with stack traces
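
As a rough illustration of the best practices above, the sketch below picks a log level based on the outcome of processing one page. The function name `reportItemResult` and the shape of `result` are hypothetical. The logger is passed in as a parameter so the snippet runs without dependencies; in a real Actor you would pass the `log` object from the Apify log package.

```javascript
// Hypothetical helper: logs the outcome of processing one URL at a fitting level.
// `log` is any object exposing the method names listed above.
// `result` is an assumed shape: { url, item?, error? }.
function reportItemResult(log, result) {
    if (result.error) {
        // Caught exception: keep the stack trace.
        log.exception(result.error, `Failed to process ${result.url}`);
        return 'exception';
    }
    if (!result.item) {
        // Unexpected state, but not a failure of the whole run.
        log.warning(`No data extracted from ${result.url}`);
        return 'warning';
    }
    // Detailed diagnostics go to debug, the summary goes to info.
    log.debug('Extracted item', { url: result.url, fieldCount: Object.keys(result.item).length });
    log.info(`Processed ${result.url}`);
    return 'info';
}
```

Returning the level name is only a convenience for testing the sketch; production code would usually not need it.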

## Standby Mode

- **NEVER disable standby mode (`usesStandbyMode: false`) in `.actor/actor.json` without explicit permission** - Standby mode keeps the Actor ready in the background, waiting for incoming HTTP requests. In a sense, the Actor behaves like a real-time web server or standard API server instead of running its logic once to process everything in a batch. Always keep `usesStandbyMode: true` unless there is a specific documented reason to disable it
- **ALWAYS implement a readiness probe handler for standby Actors** - Handle the `x-apify-container-server-readiness-probe` header at the `GET /` endpoint to ensure proper Actor lifecycle management

You can recognize a standby Actor by checking the `usesStandbyMode` property in `.actor/actor.json`. Only implement the readiness probe if this property is set to `true`.

### Readiness Probe Implementation Example

```javascript
// `app` is assumed to be an Express-style application created elsewhere
// Apify standby readiness probe at root path
app.get('/', (req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    if (req.headers['x-apify-container-server-readiness-probe']) {
        res.end('Readiness probe OK\n');
    } else {
        res.end('Actor is ready\n');
    }
});
```

Key points:

- Detect the `x-apify-container-server-readiness-probe` header in incoming requests
- Respond with HTTP 200 status code for both readiness probe and normal requests
- This enables proper Actor lifecycle management in standby mode
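
For projects without a web framework, the same idea can be sketched with Node's built-in `http` module. This is an illustrative variant, not an official template: the names `handleRequest` and `startStandbyServer`, the 404 fallback, and the use of the `ACTOR_WEB_SERVER_PORT` environment variable with a fallback to port 3000 are assumptions made for this example.

```javascript
const READINESS_HEADER = 'x-apify-container-server-readiness-probe';

// Plain request handler that can be passed to `http.createServer()`.
function handleRequest(req, res) {
    if (req.method !== 'GET' || req.url !== '/') {
        // Assumption for this sketch: everything except GET / is unknown.
        res.writeHead(404, { 'Content-Type': 'text/plain' });
        res.end('Not found\n');
        return;
    }
    // Both the readiness probe and normal requests receive HTTP 200.
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    if (req.headers[READINESS_HEADER]) {
        res.end('Readiness probe OK\n');
    } else {
        res.end('Actor is ready\n');
    }
}

// Starts the server; call this from the Actor entry point.
async function startStandbyServer() {
    const http = await import('node:http');
    // Assumed port source; falls back to 3000 for local runs.
    const port = Number(process.env.ACTOR_WEB_SERVER_PORT) || 3000;
    const server = http.createServer(handleRequest);
    await new Promise((resolve) => server.listen(port, resolve));
    return server;
}
```

Keeping `handleRequest` separate from the server start-up makes the probe logic easy to test with fake request and response objects.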

## Commands

```bash
# Local development
apify run # Run Actor locally

# Authentication & deployment
apify login # Authenticate account
apify push # Deploy to Apify platform

# Help
apify help # List all commands
```

## Safety and Permissions

Allowed without prompt:

- read files with `Actor.getValue()`
- push data with `Actor.pushData()`
- set values with `Actor.setValue()`
- enqueue requests to RequestQueue
- run locally with `apify run`

Ask first:

- npm/pip package installations
- apify push (deployment to cloud)
- proxy configuration changes (requires paid plan)
- Dockerfile changes affecting builds
- deleting datasets or key-value stores

## Project Structure

.actor/
├── actor.json # Actor config: name, version, env vars, runtime settings
├── input_schema.json # Input validation & Console form definition
└── output_schema.json # Specifies where an Actor stores its output
src/
└── main.js # Actor entry point and orchestrator
storage/ # Local storage (mirrors Cloud during development)
├── datasets/ # Output items (JSON objects)
├── key_value_stores/ # Files, config, INPUT
└── request_queues/ # Pending crawl requests
Dockerfile # Container image definition
AGENTS.md # AI agent instructions (this file)

## Actor Input Schema

The input schema defines the input parameters for an Actor. It's a JSON object comprising various field types supported by the Apify platform.

### Structure

```json
{
    "title": "<INPUT-SCHEMA-TITLE>",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        /* define input fields here */
    },
    "required": []
}
```

### Example

```json
{
    "title": "E-commerce Product Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start scraping from (category pages or product pages)",
            "editor": "requestListSources",
            "default": [{ "url": "https://example.com/category" }],
            "prefill": [{ "url": "https://example.com/category" }]
        },
        "followVariants": {
            "title": "Follow Product Variants",
            "type": "boolean",
            "description": "Whether to scrape product variants (different colors, sizes)",
            "default": true
        },
        "maxRequestsPerCrawl": {
            "title": "Max Requests per Crawl",
            "type": "integer",
            "description": "Maximum number of pages to scrape (0 = unlimited)",
            "default": 1000,
            "minimum": 0
        },
        "proxyConfiguration": {
            "title": "Proxy Configuration",
            "type": "object",
            "description": "Proxy settings for anti-bot protection",
            "editor": "proxy",
            "default": { "useApifyProxy": false }
        },
        "locale": {
            "title": "Locale",
            "type": "string",
            "description": "Language/country code for localized content",
            "default": "cs",
            "enum": ["cs", "en", "de", "sk"],
            "enumTitles": ["Czech", "English", "German", "Slovak"]
        }
    },
    "required": ["startUrls"]
}
```
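
To show how the example schema might be mirrored in code, here is a minimal sketch of early input validation. The function name `normalizeInput` is hypothetical, and the checks and defaults are copied by hand from the example schema above rather than derived from it automatically.

```javascript
// Allowed values copied from the `locale.enum` list in the example schema.
const ALLOWED_LOCALES = ['cs', 'en', 'de', 'sk'];

// Hypothetical helper: validates raw Actor input and fills in the schema defaults.
function normalizeInput(rawInput) {
    const {
        startUrls,
        followVariants = true,
        maxRequestsPerCrawl = 1000,
        locale = 'cs',
    } = rawInput ?? {};

    if (!Array.isArray(startUrls) || startUrls.length === 0) {
        throw new Error('Input field "startUrls" must be a non-empty array.');
    }
    for (const source of startUrls) {
        // `new URL()` throws a TypeError on missing or malformed URLs.
        new URL(source?.url);
    }
    if (typeof followVariants !== 'boolean') {
        throw new Error('Input field "followVariants" must be a boolean.');
    }
    if (!Number.isInteger(maxRequestsPerCrawl) || maxRequestsPerCrawl < 0) {
        throw new Error('Input field "maxRequestsPerCrawl" must be an integer >= 0.');
    }
    if (!ALLOWED_LOCALES.includes(locale)) {
        throw new Error(`Input field "locale" must be one of: ${ALLOWED_LOCALES.join(', ')}.`);
    }

    return { startUrls, followVariants, maxRequestsPerCrawl, locale };
}
```

In an Actor this would typically run right after initialization, for example `const input = normalizeInput(await Actor.getInput());`, so that invalid input fails the run before any crawling starts.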

## Actor Output Schema

The Actor output schema builds upon the schemas for the dataset and key-value store. It specifies where an Actor stores its output and defines templates for accessing that output. Apify Console uses these output definitions to display run results.

### Structure

```json
{
    "actorOutputSchemaVersion": 1,
    "title": "<OUTPUT-SCHEMA-TITLE>",
    "properties": {
        /* define your outputs here */
    }
}
```

### Example

```json
{
    "actorOutputSchemaVersion": 1,
    "title": "Output schema of the files scraper",
    "properties": {
        "files": {
            "type": "string",
            "title": "Files",
            "template": "{{links.apiDefaultKeyValueStoreUrl}}/keys"
        },
        "dataset": {
            "type": "string",
            "title": "Dataset",
            "template": "{{links.apiDefaultDatasetUrl}}/items"
        }
    }
}
```

### Output Schema Template Variables

- `links` (object) - Contains quick links to the most commonly used URLs
- `links.publicRunUrl` (string) - Public run URL in format `https://console.apify.com/view/runs/:runId`
- `links.consoleRunUrl` (string) - Console run URL in format `https://console.apify.com/actors/runs/:runId`
- `links.apiRunUrl` (string) - API run URL in format `https://api.apify.com/v2/actor-runs/:runId`
- `links.apiDefaultDatasetUrl` (string) - API URL of the default dataset in format `https://api.apify.com/v2/datasets/:defaultDatasetId`
- `links.apiDefaultKeyValueStoreUrl` (string) - API URL of the default key-value store in format `https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId`
- `links.containerRunUrl` (string) - URL of a web server running inside the run in format `https://<containerId>.runs.apify.net/`
- `run` (object) - Contains the same information about the run as is returned by the `GET Run` API endpoint
- `run.defaultDatasetId` (string) - ID of the default dataset
- `run.defaultKeyValueStoreId` (string) - ID of the default key-value store
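
To make the template variables above more concrete, the sketch below shows one way a `{{...}}` template could be expanded against a context object. It only illustrates the idea and is not how the Apify platform implements it; `resolveTemplate`, `exampleContext`, and the `EXAMPLE_DATASET_ID` placeholder are hypothetical.

```javascript
// Hypothetical helper: replaces every {{path.to.value}} with the value found in `context`.
function resolveTemplate(template, context) {
    return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match, path) => {
        // Walk the dotted path, e.g. "links.apiDefaultDatasetUrl".
        const value = path
            .split('.')
            .reduce((current, key) => (current == null ? undefined : current[key]), context);
        if (value === undefined) {
            throw new Error(`Unknown template variable: ${path}`);
        }
        return String(value);
    });
}

// Placeholder context shaped like the variables listed above.
const exampleContext = {
    links: {
        apiDefaultDatasetUrl: 'https://api.apify.com/v2/datasets/EXAMPLE_DATASET_ID',
    },
    run: {
        defaultDatasetId: 'EXAMPLE_DATASET_ID',
    },
};

const datasetItemsUrl = resolveTemplate('{{links.apiDefaultDatasetUrl}}/items', exampleContext);
// datasetItemsUrl === 'https://api.apify.com/v2/datasets/EXAMPLE_DATASET_ID/items'
```

Throwing on unknown variables is a design choice of this sketch; it surfaces typos in templates early instead of silently producing broken URLs.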

## Dataset Schema Specification

The dataset schema defines how your Actor's output data is structured, transformed, and displayed in the Output tab in the Apify Console.

### Example

Consider an example Actor that calls `Actor.pushData()` to store data in the dataset:

```javascript
import { Actor } from 'apify';
// Initialize the JavaScript SDK
await Actor.init();

/**
 * Actor code
 */
await Actor.pushData({
    numericField: 10,
    pictureUrl: 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
    linkUrl: 'https://google.com',
    textField: 'Google',
    booleanField: true,
    dateField: new Date(),
    arrayField: ['#hello', '#world'],
    objectField: {},
});

// Exit successfully
await Actor.exit();
```

To set up the Actor's output tab UI, reference a dataset schema file in `.actor/actor.json`:

```json
{
    "actorSpecification": 1,
    "name": "book-library-scraper",
    "title": "Book Library Scraper",
    "version": "1.0.0",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}
```

Then create the dataset schema in `.actor/dataset_schema.json`:

```json
{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {
                "fields": [
                    "pictureUrl",
                    "linkUrl",
                    "textField",
                    "booleanField",
                    "arrayField",
                    "objectField",
                    "dateField",
                    "numericField"
                ]
            },
            "display": {
                "component": "table",
                "properties": {
                    "pictureUrl": {
                        "label": "Image",
                        "format": "image"
                    },
                    "linkUrl": {
                        "label": "Link",
                        "format": "link"
                    },
                    "textField": {
                        "label": "Text",
                        "format": "text"
                    },
                    "booleanField": {
                        "label": "Boolean",
                        "format": "boolean"
                    },
                    "arrayField": {
                        "label": "Array",
                        "format": "array"
                    },
                    "objectField": {
                        "label": "Object",
                        "format": "object"
                    },
                    "dateField": {
                        "label": "Date",
                        "format": "date"
                    },
                    "numericField": {
                        "label": "Number",
                        "format": "number"
                    }
                }
            }
        }
    }
}
```

### Structure

```json
{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "<VIEW_NAME>": {
            "title": "string (required)",
            "description": "string (optional)",
            "transformation": {
                "fields": ["string (required)"],
                "unwind": ["string (optional)"],
                "flatten": ["string (optional)"],
                "omit": ["string (optional)"],
                "limit": "integer (optional)",
                "desc": "boolean (optional)"
            },
            "display": {
                "component": "table (required)",
                "properties": {
                    "<FIELD_NAME>": {
                        "label": "string (optional)",
                        "format": "text|number|date|link|boolean|image|array|object (optional)"
                    }
                }
            }
        }
    }
}
```

**Dataset Schema Properties:**

- `actorSpecification` (integer, required) - Specifies the version of dataset schema structure document (currently only version 1)
- `fields` (JSONSchema object, required) - Schema of one dataset object (use JSON Schema Draft 2020-12 or compatible)
- `views` (DatasetView object, required) - Object with API and UI views description

**DatasetView Properties:**

- `title` (string, required) - Visible in UI Output tab and API
- `description` (string, optional) - Only available in API response
- `transformation` (ViewTransformation object, required) - Data transformation applied when loading from Dataset API
- `display` (ViewDisplay object, required) - Output tab UI visualization definition

**ViewTransformation Properties:**

- `fields` (string[], required) - Fields to present in output (order matches column order)
- `unwind` (string[], optional) - Deconstructs nested children into parent object
- `flatten` (string[], optional) - Transforms nested object into flat structure
- `omit` (string[], optional) - Removes specified fields from output
- `limit` (integer, optional) - Maximum number of results (default: all)
- `desc` (boolean, optional) - Sort order (true = newest first)
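
The effect of these options can be approximated locally with a short sketch. It is a simplified model and not the Dataset API's actual implementation: `unwind` is left out, `flatten` handles only one level of nesting, and the order in which the steps are applied is an assumption. The names `flattenField` and `applyViewTransformation` are hypothetical.

```javascript
// Flattens one nested object field into dot-notation keys,
// e.g. { price: { amount: 10 } } becomes { 'price.amount': 10 }.
function flattenField(item, field) {
    const value = item[field];
    if (value === null || typeof value !== 'object' || Array.isArray(value)) return item;
    const { [field]: removed, ...rest } = item;
    for (const [key, nested] of Object.entries(value)) {
        rest[`${field}.${key}`] = nested;
    }
    return rest;
}

// Applies a simplified view transformation to items stored oldest-first.
function applyViewTransformation(items, transformation) {
    const { fields, flatten = [], omit = [], limit, desc = false } = transformation;

    // desc: true means newest items first.
    const ordered = desc ? [...items].reverse() : items;
    const limited = limit === undefined ? ordered : ordered.slice(0, limit);

    return limited.map((item) => {
        let result = flatten.reduce((current, field) => flattenField(current, field), item);
        // Drop omitted fields without mutating the original item.
        result = Object.fromEntries(Object.entries(result).filter(([key]) => !omit.includes(key)));
        // Keep only the requested fields, in the requested order.
        if (fields) {
            result = Object.fromEntries(
                fields.filter((field) => field in result).map((field) => [field, result[field]]),
            );
        }
        return result;
    });
}
```

For example, with `flatten: ['price']` and `fields: ['id', 'price.amount']`, an item `{ id: 1, price: { amount: 10, currency: 'USD' } }` would come out as `{ id: 1, 'price.amount': 10 }`.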

**ViewDisplay Properties:**

- `component` (string, required) - Only `table` is available
- `properties` (Object, optional) - Keys matching `transformation.fields` with ViewDisplayProperty values

**ViewDisplayProperty Properties:**

- `label` (string, optional) - Table column header
- `format` (string, optional) - One of: `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object`

## Key-Value Store Schema Specification

The key-value store schema organizes keys into logical groups called collections for easier data management.

### Example

Consider an example Actor that calls `Actor.setValue()` to save records into the key-value store:

```javascript
import { Actor } from 'apify';
// Initialize the JavaScript SDK
await Actor.init();

// Placeholder values for illustration; a real Actor would obtain these while running
const imageID = '123';
const imageBuffer = Buffer.from([0xff, 0xd8, 0xff]);

/**
 * Actor code
 */
await Actor.setValue('document-1', 'my text data', { contentType: 'text/plain' });

await Actor.setValue(`image-${imageID}`, imageBuffer, { contentType: 'image/jpeg' });

// Exit successfully
await Actor.exit();
```

To configure the key-value store schema, reference a schema file in `.actor/actor.json`:

```json
{
    "actorSpecification": 1,
    "name": "data-collector",
    "title": "Data Collector",
    "version": "1.0.0",
    "storages": {
        "keyValueStore": "./key_value_store_schema.json"
    }
}
```

Then create the key-value store schema in `.actor/key_value_store_schema.json`:

```json
{
    "actorKeyValueStoreSchemaVersion": 1,
    "title": "Key-Value Store Schema",
    "collections": {
        "documents": {
            "title": "Documents",
            "description": "Text documents stored by the Actor",
            "keyPrefix": "document-"
        },
        "images": {
            "title": "Images",
            "description": "Images stored by the Actor",
            "keyPrefix": "image-",
            "contentTypes": ["image/jpeg"]
        }
    }
}
```

### Structure

```json
{
    "actorKeyValueStoreSchemaVersion": 1,
    "title": "string (required)",
    "description": "string (optional)",
    "collections": {
        "<COLLECTION_NAME>": {
            "title": "string (required)",
            "description": "string (optional)",
            "key": "string (conditional - use key OR keyPrefix)",
            "keyPrefix": "string (conditional - use key OR keyPrefix)",
            "contentTypes": ["string (optional)"],
            "jsonSchema": "object (optional)"
        }
    }
}
```

**Key-Value Store Schema Properties:**

- `actorKeyValueStoreSchemaVersion` (integer, required) - Version of key-value store schema structure document (currently only version 1)
- `title` (string, required) - Title of the schema
- `description` (string, optional) - Description of the schema
- `collections` (Object, required) - Object where each key is a collection ID and value is a Collection object

**Collection Properties:**

- `title` (string, required) - Collection title shown in UI tabs
- `description` (string, optional) - Description appearing in UI tooltips
- `key` (string, conditional) - Single specific key for this collection
- `keyPrefix` (string, conditional) - Prefix for keys included in this collection
- `contentTypes` (string[], optional) - Allowed content types for validation
- `jsonSchema` (object, optional) - JSON Schema Draft 07 format for `application/json` content type validation

Either `key` or `keyPrefix` must be specified for each collection, but not both.
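
The `key` versus `keyPrefix` rule can be illustrated with a small sketch that checks a collection definition and works out which collections a record key falls into. The helper names `validateCollection` and `findCollections` are hypothetical, and the matching logic (exact match for `key`, `startsWith` for `keyPrefix`) is an assumption based on the property descriptions above.

```javascript
// Throws unless the collection defines exactly one of `key` or `keyPrefix`.
function validateCollection(name, collection) {
    const hasKey = typeof collection.key === 'string';
    const hasPrefix = typeof collection.keyPrefix === 'string';
    if (hasKey === hasPrefix) {
        throw new Error(`Collection "${name}" must define exactly one of "key" or "keyPrefix".`);
    }
}

// Returns the names of all collections that the given record key belongs to.
function findCollections(recordKey, collections) {
    return Object.entries(collections)
        .filter(([name, collection]) => {
            validateCollection(name, collection);
            return typeof collection.key === 'string'
                ? collection.key === recordKey
                : recordKey.startsWith(collection.keyPrefix);
        })
        .map(([name]) => name);
}

// Collections taken from the example schema in this section.
const exampleCollections = {
    documents: { title: 'Documents', keyPrefix: 'document-' },
    images: { title: 'Images', keyPrefix: 'image-', contentTypes: ['image/jpeg'] },
};
```

With these definitions, `findCollections('image-42', exampleCollections)` yields `['images']`, while a key such as `INPUT` matches no collection.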

## Apify MCP Tools

If the Apify MCP server is configured, use these tools for documentation:

- `search-apify-docs` - Search documentation
- `fetch-apify-docs` - Get full doc pages

Otherwise, reference: `@https://mcp.apify.com/`

## Resources

- [docs.apify.com/llms.txt](https://docs.apify.com/llms.txt) - Quick reference
- [docs.apify.com/llms-full.txt](https://docs.apify.com/llms-full.txt) - Complete docs
- [crawlee.dev](https://crawlee.dev) - Crawlee documentation
- [whitepaper.actor](https://raw.githubusercontent.com/apify/actor-whitepaper/refs/heads/master/README.md) - Complete Actor specification