# Apify Actors Development Guide

Important: Before you begin, fill in the `model` property in the `meta` section of `.actor/actor.json`. Replace it with the model you're currently using. This helps Apify monitor and improve AGENTS.md for specific LLM models.

## What are Apify Actors?

- Actors are serverless programs that run in the cloud. They're inspired by the UNIX philosophy: programs that do one thing well and can be easily combined to build complex systems.
- Actors are programs packaged as Docker images that run in isolated containers.

## Core Concepts

- Accept well-defined JSON input
- Perform isolated tasks (web scraping, automation, data processing)
- Produce structured JSON output to datasets and/or store data in key-value stores
- Can run from seconds to hours, or even indefinitely
- Persist state and can be restarted

## Do

- accept well-defined JSON input and produce structured JSON output
- use the Apify SDK (`apify`) for code running ON the Apify platform
- validate input early, handle errors properly, and fail gracefully
- use CheerioCrawler for static HTML content (roughly 10x faster than browser-based crawling)
- use PlaywrightCrawler only for JavaScript-heavy sites and dynamic content
- use the router pattern (`createCheerioRouter`/`createPlaywrightRouter`) for complex crawls
- implement retry strategies with exponential backoff for failed requests
- use appropriate concurrency settings (HTTP: 10-50, browser: 1-5)
- set sensible defaults in `.actor/input_schema.json` for all optional fields
- set up an output schema in `.actor/output_schema.json`
- clean and validate data before pushing it to the dataset
- use semantic CSS selectors and fallback strategies for missing elements
- respect robots.txt and Terms of Service, and implement rate limiting with delays
- check which tools (cheerio/playwright/crawlee) are installed before applying guidance
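
Crawlee's crawlers retry failed requests for you (via options such as `maxRequestRetries`), so you rarely implement this by hand inside a crawler. For custom fetch logic outside a crawler, the exponential-backoff idea can be sketched in plain Python (the helper names here are illustrative, not part of any Apify API):

```python
import time

def backoff_delays(max_attempts: int = 4, base: float = 1.0, cap: float = 30.0):
    """Yield exponentially growing delays: base * 2**attempt, capped at `cap` seconds."""
    for attempt in range(max_attempts):
        yield min(cap, base * (2 ** attempt))

def fetch_with_retry(fetch, url: str, max_attempts: int = 4, base: float = 1.0):
    """Call fetch(url), sleeping with exponential backoff between failed attempts."""
    delays = list(backoff_delays(max_attempts, base=base))
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(delays[attempt])
```

In production you would also add jitter (randomizing each delay) so many parallel clients don't retry in lockstep.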

## Don't

- do not rely on `Dataset.getInfo()` for final counts on the cloud platform
- do not use browser crawlers when HTTP/Cheerio works (plain HTTP is far faster)
- do not hard-code values that belong in the input schema or environment variables
- do not skip input validation or error handling
- do not overload servers - use appropriate concurrency and delays
- do not scrape prohibited content or ignore Terms of Service
- do not store personal/sensitive data unless explicitly permitted
- do not use deprecated options such as `requestHandlerTimeoutMillis` on CheerioCrawler (v3.x)
- do not use `additionalHttpHeaders` - use `preNavigationHooks` instead

## Commands

```bash
# Local development
apify run    # Run Actor locally

# Authentication & deployment
apify login  # Authenticate account
apify push   # Deploy to Apify platform

# Help
apify help   # List all commands
```

## Safety and Permissions

Allowed without prompt:
- read files with `Actor.get_value()`
- push data with `Actor.push_data()`
- set values with `Actor.set_value()`
- enqueue requests to the RequestQueue
- run locally with `apify run`

Ask first:
- npm/pip package installations
- `apify push` (deployment to cloud)
- proxy configuration changes (requires a paid plan)
- Dockerfile changes affecting builds
- deleting datasets or key-value stores

## Project Structure

```text
.actor/
├── actor.json            # Actor config: name, version, env vars, runtime settings
├── input_schema.json     # Input validation & Console form definition
└── output_schema.json    # Specifies where an Actor stores its output
src/
└── main.js               # Actor entry point and orchestrator
storage/                  # Local storage (mirrors cloud storage during development)
├── datasets/             # Output items (JSON objects)
├── key_value_stores/     # Files, config, INPUT
└── request_queues/       # Pending crawl requests
Dockerfile                # Container image definition
AGENTS.md                 # AI agent instructions (this file)
```

## Actor Input Schema

The input schema defines the input parameters for an Actor. It's a JSON object comprising various field types supported by the Apify platform.

### Structure

```json
{
  "title": "<INPUT-SCHEMA-TITLE>",
  "type": "object",
  "schemaVersion": 1,
  "properties": { /* define input fields here */ },
  "required": []
}
```

### Example

```json
{
  "title": "E-commerce Product Scraper Input",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "startUrls": {
      "title": "Start URLs",
      "type": "array",
      "description": "URLs to start scraping from (category pages or product pages)",
      "editor": "requestListSources",
      "default": [{"url": "https://example.com/category"}],
      "prefill": [{"url": "https://example.com/category"}]
    },
    "followVariants": {
      "title": "Follow Product Variants",
      "type": "boolean",
      "description": "Whether to scrape product variants (different colors, sizes)",
      "default": true
    },
    "maxRequestsPerCrawl": {
      "title": "Max Requests per Crawl",
      "type": "integer",
      "description": "Maximum number of pages to scrape (0 = unlimited)",
      "default": 1000,
      "minimum": 0
    },
    "proxyConfiguration": {
      "title": "Proxy Configuration",
      "type": "object",
      "description": "Proxy settings for anti-bot protection",
      "editor": "proxy",
      "default": {"useApifyProxy": false}
    },
    "locale": {
      "title": "Locale",
      "type": "string",
      "description": "Language/country code for localized content",
      "default": "cs",
      "enum": ["cs", "en", "de", "sk"],
      "enumTitles": ["Czech", "English", "German", "Slovak"]
    }
  },
  "required": ["startUrls"]
}
```
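
The platform's Console form applies these `default` values and enforces `required` for you; to make the semantics concrete, the merge behavior can be sketched in plain Python (the helper names are illustrative, not Apify APIs):

```python
def validate_required(schema: dict, user_input: dict) -> None:
    """Fail early if any field listed in `required` is missing."""
    missing = [name for name in schema.get("required", []) if name not in user_input]
    if missing:
        raise ValueError(f"Missing required input fields: {missing}")

def apply_defaults(schema: dict, user_input: dict) -> dict:
    """Fill absent optional fields with the schema's `default` values."""
    merged = dict(user_input)
    for name, prop in schema.get("properties", {}).items():
        if name not in merged and "default" in prop:
            merged[name] = prop["default"]
    return merged
```

With the schema above, an input containing only `startUrls` would come out with `followVariants`, `maxRequestsPerCrawl`, `proxyConfiguration`, and `locale` filled in from their defaults.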

## Actor Output Schema

The Actor output schema builds upon the schemas for the dataset and key-value store. It specifies where an Actor stores its output and defines templates for accessing that output. Apify Console uses these output definitions to display run results.

### Structure

```json
{
  "actorOutputSchemaVersion": 1,
  "title": "<OUTPUT-SCHEMA-TITLE>",
  "properties": { /* define your outputs here */ }
}
```

### Example

```json
{
  "actorOutputSchemaVersion": 1,
  "title": "Output schema of the files scraper",
  "properties": {
    "files": {
      "type": "string",
      "title": "Files",
      "template": "{{links.apiDefaultKeyValueStoreUrl}}/keys"
    },
    "dataset": {
      "type": "string",
      "title": "Dataset",
      "template": "{{links.apiDefaultDatasetUrl}}/items"
    }
  }
}
```

### Output Schema Template Variables

- `links` (object) - Contains quick links to the most commonly used URLs
- `links.publicRunUrl` (string) - Public run URL in the format `https://console.apify.com/view/runs/:runId`
- `links.consoleRunUrl` (string) - Console run URL in the format `https://console.apify.com/actors/runs/:runId`
- `links.apiRunUrl` (string) - API run URL in the format `https://api.apify.com/v2/actor-runs/:runId`
- `links.apiDefaultDatasetUrl` (string) - API URL of the default dataset in the format `https://api.apify.com/v2/datasets/:defaultDatasetId`
- `links.apiDefaultKeyValueStoreUrl` (string) - API URL of the default key-value store in the format `https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId`
- `links.containerRunUrl` (string) - URL of a web server running inside the run, in the format `https://<containerId>.runs.apify.net/`
- `run` (object) - Contains the same information about the run as returned by the `GET Run` API endpoint
- `run.defaultDatasetId` (string) - ID of the default dataset
- `run.defaultKeyValueStoreId` (string) - ID of the default key-value store
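
For intuition, the `links.*` values above are plain string expansions over the run object. A sketch of the same construction (an illustrative helper, not an Apify API, assuming the run object carries an `id` field as the `GET Run` endpoint returns):

```python
def build_output_links(run: dict) -> dict:
    """Expand a run object into the URL formats listed above."""
    return {
        "apiRunUrl": f"https://api.apify.com/v2/actor-runs/{run['id']}",
        "apiDefaultDatasetUrl": f"https://api.apify.com/v2/datasets/{run['defaultDatasetId']}",
        "apiDefaultKeyValueStoreUrl": f"https://api.apify.com/v2/key-value-stores/{run['defaultKeyValueStoreId']}",
    }
```

A template such as `{{links.apiDefaultDatasetUrl}}/items` then simply appends the `/items` path to the expanded URL.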

## Dataset Schema Specification

The dataset schema defines how your Actor's output data is structured, transformed, and displayed in the Output tab in Apify Console.

### Example

Consider an example Actor that calls `Actor.push_data()` to store data into the dataset:

```python
# Dataset push example (Python)
import asyncio
from datetime import datetime

from apify import Actor

async def main():
    await Actor.init()

    # Actor code
    await Actor.push_data({
        'numericField': 10,
        'pictureUrl': 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
        'linkUrl': 'https://google.com',
        'textField': 'Google',
        'booleanField': True,
        'dateField': datetime.now().isoformat(),
        'arrayField': ['#hello', '#world'],
        'objectField': {},
    })

    # Exit successfully
    await Actor.exit()

if __name__ == '__main__':
    asyncio.run(main())
```

To set up the Actor's Output tab UI, reference a dataset schema file in `.actor/actor.json`:

```json
{
  "actorSpecification": 1,
  "name": "book-library-scraper",
  "title": "Book Library Scraper",
  "version": "1.0.0",
  "storages": {
    "dataset": "./dataset_schema.json"
  }
}
```

Then create the dataset schema in `.actor/dataset_schema.json`:

```json
{
  "actorSpecification": 1,
  "fields": {},
  "views": {
    "overview": {
      "title": "Overview",
      "transformation": {
        "fields": [
          "pictureUrl",
          "linkUrl",
          "textField",
          "booleanField",
          "arrayField",
          "objectField",
          "dateField",
          "numericField"
        ]
      },
      "display": {
        "component": "table",
        "properties": {
          "pictureUrl": { "label": "Image", "format": "image" },
          "linkUrl": { "label": "Link", "format": "link" },
          "textField": { "label": "Text", "format": "text" },
          "booleanField": { "label": "Boolean", "format": "boolean" },
          "arrayField": { "label": "Array", "format": "array" },
          "objectField": { "label": "Object", "format": "object" },
          "dateField": { "label": "Date", "format": "date" },
          "numericField": { "label": "Number", "format": "number" }
        }
      }
    }
  }
}
```

### Structure

```json
{
  "actorSpecification": 1,
  "fields": {},
  "views": {
    "<VIEW_NAME>": {
      "title": "string (required)",
      "description": "string (optional)",
      "transformation": {
        "fields": ["string (required)"],
        "unwind": ["string (optional)"],
        "flatten": ["string (optional)"],
        "omit": ["string (optional)"],
        "limit": "integer (optional)",
        "desc": "boolean (optional)"
      },
      "display": {
        "component": "table (required)",
        "properties": {
          "<FIELD_NAME>": {
            "label": "string (optional)",
            "format": "text|number|date|link|boolean|image|array|object (optional)"
          }
        }
      }
    }
  }
}
```

**Dataset Schema Properties:**

- `actorSpecification` (integer, required) - Version of the dataset schema structure document (currently only version 1)
- `fields` (JSON Schema object, required) - Schema of one dataset object (use JSON Schema Draft 2020-12 or compatible)
- `views` (DatasetView object, required) - Object with API and UI view descriptions

**DatasetView Properties:**

- `title` (string, required) - Visible in the UI Output tab and in the API
- `description` (string, optional) - Only available in the API response
- `transformation` (ViewTransformation object, required) - Data transformation applied when loading from the Dataset API
- `display` (ViewDisplay object, required) - Output tab UI visualization definition

**ViewTransformation Properties:**

- `fields` (string[], required) - Fields to present in the output (order matches column order)
- `unwind` (string[], optional) - Deconstructs nested children into the parent object
- `flatten` (string[], optional) - Transforms a nested object into a flat structure
- `omit` (string[], optional) - Removes specified fields from the output
- `limit` (integer, optional) - Maximum number of results (default: all)
- `desc` (boolean, optional) - Sort order (true = newest first)

**ViewDisplay Properties:**

- `component` (string, required) - Only `table` is available
- `properties` (object, optional) - Keys matching `transformation.fields`, with ViewDisplayProperty values

**ViewDisplayProperty Properties:**

- `label` (string, optional) - Table column header
- `format` (string, optional) - One of: `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object`
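
The platform applies these transformations server-side when the view is loaded. To build intuition for the `fields`/`omit`/`limit`/`desc` semantics, here is a rough plain-Python approximation (illustrative only; `desc` is modeled as reversed insertion order, i.e. newest first):

```python
def apply_view_transformation(items, fields, omit=(), limit=None, desc=False):
    """Approximate a dataset view: order, trim, and project a list of items."""
    rows = list(reversed(items)) if desc else list(items)
    if limit is not None:
        rows = rows[:limit]
    keep = [f for f in fields if f not in set(omit)]
    # Project each item onto the requested columns, in column order.
    return [{f: row[f] for f in keep if f in row} for row in rows]
```

With the `overview` view above, this would yield one table row per pushed item, showing only the listed columns.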

## Key-Value Store Schema Specification

The key-value store schema organizes keys into logical groups called collections, for easier data management.

### Example

Consider an example Actor that calls `Actor.set_value()` to save records into the key-value store:

```python
# Key-value store set example (Python)
import asyncio

from apify import Actor

async def main():
    await Actor.init()

    # Actor code
    await Actor.set_value('document-1', 'my text data', content_type='text/plain')

    image_id = '123'       # example placeholder
    image_buffer = b'...'  # bytes buffer with image data
    await Actor.set_value(f'image-{image_id}', image_buffer, content_type='image/jpeg')

    # Exit successfully
    await Actor.exit()

if __name__ == '__main__':
    asyncio.run(main())
```

To configure the key-value store schema, reference a schema file in `.actor/actor.json`:

```json
{
  "actorSpecification": 1,
  "name": "data-collector",
  "title": "Data Collector",
  "version": "1.0.0",
  "storages": {
    "keyValueStore": "./key_value_store_schema.json"
  }
}
```

Then create the key-value store schema in `.actor/key_value_store_schema.json`:

```json
{
  "actorKeyValueStoreSchemaVersion": 1,
  "title": "Key-Value Store Schema",
  "collections": {
    "documents": {
      "title": "Documents",
      "description": "Text documents stored by the Actor",
      "keyPrefix": "document-"
    },
    "images": {
      "title": "Images",
      "description": "Images stored by the Actor",
      "keyPrefix": "image-",
      "contentTypes": ["image/jpeg"]
    }
  }
}
```

### Structure

```json
{
  "actorKeyValueStoreSchemaVersion": 1,
  "title": "string (required)",
  "description": "string (optional)",
  "collections": {
    "<COLLECTION_NAME>": {
      "title": "string (required)",
      "description": "string (optional)",
      "key": "string (conditional - use key OR keyPrefix)",
      "keyPrefix": "string (conditional - use key OR keyPrefix)",
      "contentTypes": ["string (optional)"],
      "jsonSchema": "object (optional)"
    }
  }
}
```

**Key-Value Store Schema Properties:**

- `actorKeyValueStoreSchemaVersion` (integer, required) - Version of the key-value store schema structure document (currently only version 1)
- `title` (string, required) - Title of the schema
- `description` (string, optional) - Description of the schema
- `collections` (object, required) - Object where each key is a collection ID and each value is a Collection object

**Collection Properties:**

- `title` (string, required) - Collection title shown in UI tabs
- `description` (string, optional) - Description appearing in UI tooltips
- `key` (string, conditional*) - Single specific key belonging to this collection
- `keyPrefix` (string, conditional*) - Prefix of the keys included in this collection
- `contentTypes` (string[], optional) - Allowed content types for validation
- `jsonSchema` (object, optional) - JSON Schema (Draft 07) for validating `application/json` content

*Either `key` or `keyPrefix` must be specified for each collection, but not both.
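
The key/keyPrefix matching rule above can be sketched in a few lines of Python (an illustrative helper, not an Apify API):

```python
def collection_for_key(collections: dict, key: str):
    """Return the name of the first collection whose key or keyPrefix matches."""
    for name, spec in collections.items():
        # Exact-key collections match a single specific key.
        if spec.get("key") == key:
            return name
        # Prefix collections match every key starting with the prefix.
        if "keyPrefix" in spec and key.startswith(spec["keyPrefix"]):
            return name
    return None
```

With the example schema above, `document-1` falls into the `documents` collection and `image-123` into `images`.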

## Apify MCP Tools

If the MCP server is configured, use these tools for documentation:
- `search-apify-docs` - Search the documentation
- `fetch-apify-docs` - Fetch full documentation pages

Otherwise, reference: `@https://mcp.apify.com/`

## Resources

- [docs.apify.com/llms.txt](https://docs.apify.com/llms.txt) - Quick reference
- [docs.apify.com/llms-full.txt](https://docs.apify.com/llms-full.txt) - Complete docs
- [crawlee.dev](https://crawlee.dev) - Crawlee documentation
- [Actor whitepaper](https://raw.githubusercontent.com/apify/actor-whitepaper/refs/heads/master/README.md) - Complete Actor specification