# Apify Actors Development Guide

Important: Before you begin, fill in the `generatedBy` property in the meta section of `.actor/actor.json`. Replace it with the model you're currently using. This helps Apify monitor and improve AGENTS.md for specific LLM models.

## What are Apify Actors?

- Actors are serverless programs that run in the cloud. They're inspired by the UNIX philosophy - programs that do one thing well and can be easily combined to build complex systems.
- Actors are programs packaged as Docker images that run in isolated containers.

## Core Concepts

- Accept well-defined JSON input
- Perform isolated tasks (web scraping, automation, data processing)
- Produce structured JSON output to datasets and/or store data in key-value stores
- Can run from seconds to hours, or even indefinitely
- Persist state and can be restarted

## Do

- accept well-defined JSON input and produce structured JSON output
- use the Apify SDK (`apify`) for code running ON the Apify platform
- validate input early with proper error handling, and fail gracefully
- use CheerioCrawler for static HTML content (roughly 10x faster than browser crawlers)
- use PlaywrightCrawler only for JavaScript-heavy sites and dynamic content
- use the router pattern (`createCheerioRouter`/`createPlaywrightRouter`) for complex crawls
- implement retry strategies with exponential backoff for failed requests
- use appropriate concurrency settings (HTTP: 10-50, browser: 1-5)
- set sensible defaults in `.actor/input_schema.json` for all optional fields
- set up an output schema in `.actor/output_schema.json`
- clean and validate data before pushing it to the dataset
- use semantic CSS selectors and fallback strategies for missing elements
- respect robots.txt and ToS, and implement rate limiting with delays
- check which tools (cheerio/playwright/crawlee) are installed before applying guidance
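The retry-with-backoff guidance above can be sketched as a small helper. This is a minimal illustration in plain Python; the function and parameter names are made up for this example and are not part of the Apify SDK or Crawlee (which provide their own retry handling):

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Call `fetch()` and retry failures with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, propagate the last error
            # Double the delay each attempt, cap it, and add jitter so
            # concurrent workers don't retry in lockstep.
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))
```

When using Crawlee's crawlers, prefer their built-in `maxRequestRetries` handling; a helper like this is only needed for ad-hoc HTTP calls outside the crawler.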

## Don't

- do not rely on `Dataset.getInfo()` for final counts on the Cloud platform
- do not use browser crawlers when HTTP/Cheerio works (HTTP brings massive performance gains)
- do not hard-code values that belong in the input schema or environment variables
- do not skip input validation or error handling
- do not overload servers - use appropriate concurrency and delays
- do not scrape prohibited content or ignore Terms of Service
- do not store personal/sensitive data unless explicitly permitted
- do not use deprecated options like `requestHandlerTimeoutMillis` on CheerioCrawler (v3.x)
- do not use `additionalHttpHeaders` - use `preNavigationHooks` instead

## Commands

```bash
# Local development
apify run    # Run the Actor locally

# Authentication & deployment
apify login  # Authenticate your account
apify push   # Deploy to the Apify platform

# Help
apify help   # List all commands
```

## Safety and Permissions

Allowed without prompt:

- read files with `Actor.get_value()`
- push data with `Actor.push_data()`
- set values with `Actor.set_value()`
- enqueue requests to the RequestQueue
- run locally with `apify run`

Ask first:

- npm/pip package installations
- `apify push` (deployment to the cloud)
- proxy configuration changes (requires a paid plan)
- Dockerfile changes affecting builds
- deleting datasets or key-value stores

## Project Structure

```
.actor/
├── actor.json             # Actor config: name, version, env vars, runtime settings
├── input_schema.json      # Input validation & Console form definition
└── output_schema.json     # Specifies where an Actor stores its output
src/
└── main.js                # Actor entry point and orchestrator
storage/                   # Local storage (mirrors Cloud during development)
├── datasets/              # Output items (JSON objects)
├── key_value_stores/      # Files, config, INPUT
└── request_queues/        # Pending crawl requests
Dockerfile                 # Container image definition
AGENTS.md                  # AI agent instructions (this file)
```

## Actor Input Schema

The input schema defines the input parameters for an Actor. It's a JSON object comprising the various field types supported by the Apify platform.

### Structure

```json
{
  "title": "<INPUT-SCHEMA-TITLE>",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    /* define input fields here */
  },
  "required": []
}
```

### Example

```json
{
  "title": "E-commerce Product Scraper Input",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "startUrls": {
      "title": "Start URLs",
      "type": "array",
      "description": "URLs to start scraping from (category pages or product pages)",
      "editor": "requestListSources",
      "default": [{ "url": "https://example.com/category" }],
      "prefill": [{ "url": "https://example.com/category" }]
    },
    "followVariants": {
      "title": "Follow Product Variants",
      "type": "boolean",
      "description": "Whether to scrape product variants (different colors, sizes)",
      "default": true
    },
    "maxRequestsPerCrawl": {
      "title": "Max Requests per Crawl",
      "type": "integer",
      "description": "Maximum number of pages to scrape (0 = unlimited)",
      "default": 1000,
      "minimum": 0
    },
    "proxyConfiguration": {
      "title": "Proxy Configuration",
      "type": "object",
      "description": "Proxy settings for anti-bot protection",
      "editor": "proxy",
      "default": { "useApifyProxy": false }
    },
    "locale": {
      "title": "Locale",
      "type": "string",
      "description": "Language/country code for localized content",
      "default": "cs",
      "enum": ["cs", "en", "de", "sk"],
      "enumTitles": ["Czech", "English", "German", "Slovak"]
    }
  },
  "required": ["startUrls"]
}
```
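To make the role of `default` concrete, here is a minimal sketch in plain Python of how defaults fill in omitted fields. This is an illustration of the merging behavior, not the platform's actual validation code; the function name is made up for this example:

```python
def apply_schema_defaults(schema, user_input):
    """Return the user input merged with `default` values from an input schema."""
    merged = dict(user_input)
    for name, prop in schema.get('properties', {}).items():
        # Only fill in fields the caller omitted; explicit values win.
        if name not in merged and 'default' in prop:
            merged[name] = prop['default']
    return merged
```

So with the schema above, a user supplying only `startUrls` would still receive `maxRequestsPerCrawl: 1000` and `locale: "cs"`, while any value they do provide is left untouched.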

## Actor Output Schema

The Actor output schema builds upon the schemas for the dataset and key-value store. It specifies where an Actor stores its output and defines templates for accessing that output. Apify Console uses these output definitions to display run results.

### Structure

```json
{
  "actorOutputSchemaVersion": 1,
  "title": "<OUTPUT-SCHEMA-TITLE>",
  "properties": {
    /* define your outputs here */
  }
}
```

### Example

```json
{
  "actorOutputSchemaVersion": 1,
  "title": "Output schema of the files scraper",
  "properties": {
    "files": {
      "type": "string",
      "title": "Files",
      "template": "{{links.apiDefaultKeyValueStoreUrl}}/keys"
    },
    "dataset": {
      "type": "string",
      "title": "Dataset",
      "template": "{{links.apiDefaultDatasetUrl}}/items"
    }
  }
}
```

### Output Schema Template Variables

- `links` (object) - Contains quick links to the most commonly used URLs
- `links.publicRunUrl` (string) - Public run URL in the format `https://console.apify.com/view/runs/:runId`
- `links.consoleRunUrl` (string) - Console run URL in the format `https://console.apify.com/actors/runs/:runId`
- `links.apiRunUrl` (string) - API run URL in the format `https://api.apify.com/v2/actor-runs/:runId`
- `links.apiDefaultDatasetUrl` (string) - API URL of the default dataset in the format `https://api.apify.com/v2/datasets/:defaultDatasetId`
- `links.apiDefaultKeyValueStoreUrl` (string) - API URL of the default key-value store in the format `https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId`
- `links.containerRunUrl` (string) - URL of a web server running inside the run, in the format `https://<containerId>.runs.apify.net/`
- `run` (object) - Contains the same information about the run as returned by the `GET Run` API endpoint
- `run.defaultDatasetId` (string) - ID of the default dataset
- `run.defaultKeyValueStoreId` (string) - ID of the default key-value store
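Conceptually, a template like `{{links.apiDefaultDatasetUrl}}/items` is rendered by looking up the dotted path in the variables above. The sketch below illustrates that substitution in plain Python; it is an assumption about the general mechanism for explanation purposes, not Apify's actual template engine:

```python
import re

def render_template(template, variables):
    """Substitute {{dotted.path}} placeholders using values from a nested dict."""
    def resolve(match):
        value = variables
        # Walk the dotted path, e.g. 'links.apiDefaultDatasetUrl'.
        for part in match.group(1).strip().split('.'):
            value = value[part]
        return str(value)
    return re.sub(r'\{\{\s*([\w.]+)\s*\}\}', resolve, template)
```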

## Dataset Schema Specification

The dataset schema defines how your Actor's output data is structured, transformed, and displayed in the Output tab in Apify Console.

### Example

Consider an example Actor that calls `Actor.push_data()` to store data into the dataset:

```python
# Dataset push example (Python)
import asyncio
from datetime import datetime

from apify import Actor

async def main():
    await Actor.init()

    # Actor code
    await Actor.push_data({
        'numericField': 10,
        'pictureUrl': 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
        'linkUrl': 'https://google.com',
        'textField': 'Google',
        'booleanField': True,
        'dateField': datetime.now().isoformat(),
        'arrayField': ['#hello', '#world'],
        'objectField': {},
    })

    # Exit successfully
    await Actor.exit()

if __name__ == '__main__':
    asyncio.run(main())
```

To set up the Actor's output tab UI, reference a dataset schema file in `.actor/actor.json`:

```json
{
  "actorSpecification": 1,
  "name": "book-library-scraper",
  "title": "Book Library Scraper",
  "version": "1.0.0",
  "storages": {
    "dataset": "./dataset_schema.json"
  }
}
```

Then create the dataset schema in `.actor/dataset_schema.json`:

```json
{
  "actorSpecification": 1,
  "fields": {},
  "views": {
    "overview": {
      "title": "Overview",
      "transformation": {
        "fields": [
          "pictureUrl",
          "linkUrl",
          "textField",
          "booleanField",
          "arrayField",
          "objectField",
          "dateField",
          "numericField"
        ]
      },
      "display": {
        "component": "table",
        "properties": {
          "pictureUrl": {
            "label": "Image",
            "format": "image"
          },
          "linkUrl": {
            "label": "Link",
            "format": "link"
          },
          "textField": {
            "label": "Text",
            "format": "text"
          },
          "booleanField": {
            "label": "Boolean",
            "format": "boolean"
          },
          "arrayField": {
            "label": "Array",
            "format": "array"
          },
          "objectField": {
            "label": "Object",
            "format": "object"
          },
          "dateField": {
            "label": "Date",
            "format": "date"
          },
          "numericField": {
            "label": "Number",
            "format": "number"
          }
        }
      }
    }
  }
}
```

### Structure

```json
{
  "actorSpecification": 1,
  "fields": {},
  "views": {
    "<VIEW_NAME>": {
      "title": "string (required)",
      "description": "string (optional)",
      "transformation": {
        "fields": ["string (required)"],
        "unwind": ["string (optional)"],
        "flatten": ["string (optional)"],
        "omit": ["string (optional)"],
        "limit": "integer (optional)",
        "desc": "boolean (optional)"
      },
      "display": {
        "component": "table (required)",
        "properties": {
          "<FIELD_NAME>": {
            "label": "string (optional)",
            "format": "text|number|date|link|boolean|image|array|object (optional)"
          }
        }
      }
    }
  }
}
```

**Dataset Schema Properties:**

- `actorSpecification` (integer, required) - Version of the dataset schema structure document (currently only version 1)
- `fields` (JSONSchema object, required) - Schema of one dataset object (use JSON Schema Draft 2020-12 or compatible)
- `views` (DatasetView object, required) - Object with API and UI view descriptions

**DatasetView Properties:**

- `title` (string, required) - Visible in the UI Output tab and in the API
- `description` (string, optional) - Only available in the API response
- `transformation` (ViewTransformation object, required) - Data transformation applied when loading from the Dataset API
- `display` (ViewDisplay object, required) - Output tab UI visualization definition

**ViewTransformation Properties:**

- `fields` (string[], required) - Fields to present in the output (their order matches the column order)
- `unwind` (string[], optional) - Deconstructs nested children into the parent object
- `flatten` (string[], optional) - Transforms a nested object into a flat structure
- `omit` (string[], optional) - Removes the specified fields from the output
- `limit` (integer, optional) - Maximum number of results (default: all)
- `desc` (boolean, optional) - Sort order (true = newest first)

**ViewDisplay Properties:**

- `component` (string, required) - Only `table` is available
- `properties` (Object, optional) - Keys matching `transformation.fields`, with ViewDisplayProperty values

**ViewDisplayProperty Properties:**

- `label` (string, optional) - Table column header
- `format` (string, optional) - One of: `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object`
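To make the transformation options concrete, the sketch below applies a subset of them (`fields`, `omit`, `limit`, `desc`) to a list of dataset items in plain Python. This is an illustrative approximation of the semantics described above, not the platform's implementation, and it leaves out `unwind` and `flatten`:

```python
def apply_view_transformation(items, transformation):
    """Apply fields/omit/limit/desc from a ViewTransformation to dataset items."""
    result = list(items)
    if transformation.get('desc'):
        result = list(reversed(result))  # newest (last-pushed) items first
    limit = transformation.get('limit')
    if limit is not None:
        result = result[:limit]
    fields = transformation.get('fields', [])
    omit = set(transformation.get('omit', []))
    # Keep only the listed fields, preserving their order, minus omitted ones.
    return [{f: item.get(f) for f in fields if f not in omit} for item in result]
```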

## Key-Value Store Schema Specification

The key-value store schema organizes keys into logical groups called collections for easier data management.

### Example

Consider an example Actor that calls `Actor.set_value()` to save records into the key-value store:

```python
# Key-value store set example (Python)
import asyncio

from apify import Actor

async def main():
    await Actor.init()

    # Actor code
    await Actor.set_value('document-1', 'my text data', content_type='text/plain')

    image_id = '123'       # example placeholder
    image_buffer = b'...'  # bytes buffer with image data
    await Actor.set_value(f'image-{image_id}', image_buffer, content_type='image/jpeg')

    # Exit successfully
    await Actor.exit()

if __name__ == '__main__':
    asyncio.run(main())
```

To configure the key-value store schema, reference a schema file in `.actor/actor.json`:

```json
{
  "actorSpecification": 1,
  "name": "data-collector",
  "title": "Data Collector",
  "version": "1.0.0",
  "storages": {
    "keyValueStore": "./key_value_store_schema.json"
  }
}
```

Then create the key-value store schema in `.actor/key_value_store_schema.json`:

```json
{
  "actorKeyValueStoreSchemaVersion": 1,
  "title": "Key-Value Store Schema",
  "collections": {
    "documents": {
      "title": "Documents",
      "description": "Text documents stored by the Actor",
      "keyPrefix": "document-"
    },
    "images": {
      "title": "Images",
      "description": "Images stored by the Actor",
      "keyPrefix": "image-",
      "contentTypes": ["image/jpeg"]
    }
  }
}
```

### Structure

```json
{
  "actorKeyValueStoreSchemaVersion": 1,
  "title": "string (required)",
  "description": "string (optional)",
  "collections": {
    "<COLLECTION_NAME>": {
      "title": "string (required)",
      "description": "string (optional)",
      "key": "string (conditional - use key OR keyPrefix)",
      "keyPrefix": "string (conditional - use key OR keyPrefix)",
      "contentTypes": ["string (optional)"],
      "jsonSchema": "object (optional)"
    }
  }
}
```

**Key-Value Store Schema Properties:**

- `actorKeyValueStoreSchemaVersion` (integer, required) - Version of the key-value store schema structure document (currently only version 1)
- `title` (string, required) - Title of the schema
- `description` (string, optional) - Description of the schema
- `collections` (Object, required) - Object where each key is a collection ID and each value is a Collection object

**Collection Properties:**

- `title` (string, required) - Collection title shown in UI tabs
- `description` (string, optional) - Description appearing in UI tooltips
- `key` (string, conditional\*) - Single specific key belonging to this collection
- `keyPrefix` (string, conditional\*) - Prefix of the keys included in this collection
- `contentTypes` (string[], optional) - Allowed content types for validation
- `jsonSchema` (object, optional) - JSON Schema (Draft 07) used to validate `application/json` content

\*Either `key` or `keyPrefix` must be specified for each collection, but not both.
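The key/keyPrefix constraint above is easy to check mechanically. The sketch below is a minimal, hypothetical validator in plain Python (not part of the Apify tooling) that returns the names of collections violating the rule:

```python
def validate_collections(schema):
    """Return names of collections that don't define exactly one of `key`/`keyPrefix`."""
    errors = []
    for name, collection in schema.get('collections', {}).items():
        has_key = 'key' in collection
        has_prefix = 'keyPrefix' in collection
        if has_key == has_prefix:  # both present, or both missing
            errors.append(name)
    return errors
```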

## Apify MCP Tools

If the Apify MCP server is configured, use these tools for documentation lookups:

- `search-apify-docs` - Search the documentation
- `fetch-apify-docs` - Fetch full documentation pages

Otherwise, reference: `@https://mcp.apify.com/`

## Resources

- [docs.apify.com/llms.txt](https://docs.apify.com/llms.txt) - Quick reference
- [docs.apify.com/llms-full.txt](https://docs.apify.com/llms-full.txt) - Complete docs
- [crawlee.dev](https://crawlee.dev) - Crawlee documentation
- [whitepaper.actor](https://raw.githubusercontent.com/apify/actor-whitepaper/refs/heads/master/README.md) - Complete Actor specification