# Apify Actor AI Agent Instructions

## What are Apify Actors?

- Actors are serverless cloud programs that can perform anything from a simple action, like filling out a web form, to a complex operation, like crawling an entire website or removing duplicates from a large dataset.
- Actors are programs packaged as Docker images, which accept a well-defined JSON input, perform an action, and optionally produce a well-defined JSON output.

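
The input-action-output contract described above can be sketched in plain JavaScript. This is only an illustration: the `removeDuplicates` function, the `uniqueKey` field, and the sample items are hypothetical. In a real Actor the input would typically be read with `Actor.getInput()` and each result stored with `Actor.pushData()` from the `apify` package.

```javascript
// Hypothetical Actor logic: take JSON input, perform an action, return JSON output.
function removeDuplicates(input) {
    const { items = [], uniqueKey = 'url' } = input ?? {};
    const seen = new Set();
    const unique = [];
    for (const item of items) {
        const key = item?.[uniqueKey];
        // Skip items that lack the key or whose key was already seen.
        if (key === undefined || seen.has(key)) continue;
        seen.add(key);
        unique.push(item);
    }
    return { uniqueCount: unique.length, items: unique };
}

// Sample input (made up for this sketch).
const exampleInput = {
    uniqueKey: 'url',
    items: [
        { url: 'https://example.com/a', title: 'A' },
        { url: 'https://example.com/b', title: 'B' },
        { url: 'https://example.com/a', title: 'A (duplicate)' },
    ],
};

const exampleOutput = removeDuplicates(exampleInput);
console.log(JSON.stringify(exampleOutput, null, 2));
```

Keeping the core logic in a pure function like this makes it easy to test locally before wiring it to the platform storage.
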
### Apify Actor directory structure

```text
.actor/
├── actor.json            # Actor config: name, version, env vars, runtime settings
├── input_schema.json     # Input validation & Console form definition
├── dataset_schema.json   # Dataset schema definition
└── output_schema.json    # Specifies where an Actor stores its output
src/
└── main.js               # Actor entry point and orchestrator
storage/                  # Local storage (mirrors Cloud during development)
├── datasets/             # Output items (JSON objects)
├── key_value_stores/     # Files, config, INPUT
└── request_queues/       # Pending crawl requests
Dockerfile                # Container image definition
AGENTS.md                 # AI agent instructions (this file)
```

## Apify CLI

### Installation

- Install the Apify CLI only if it is not already installed.
- If it is not installed, install it using one of the following commands:
  - macOS/Linux: `curl -fsSL https://apify.com/install-cli.sh | bash`
  - Windows: `irm https://apify.com/install-cli.ps1 | iex`

### Apify CLI Commands

```bash
# Local development
apify run      # Run Actor locally

# Authentication & deployment
apify login    # Authenticate with your Apify account
apify push     # Deploy to Apify platform

# Help
apify help     # List all commands
```

## Do

- use the default values for all fields in the `actor.json`, `input_schema.json`, `output_schema.json`, and `main.js` files
- use the Apify CLI to run the Actor locally and to push it to the Apify platform
- accept well-defined JSON input and produce structured JSON output
- use the Apify SDK (`apify`) for code running on the Apify platform
- validate input early with proper error handling and fail gracefully
- use CheerioCrawler for static HTML content (10x faster than browsers)
- use PlaywrightCrawler only for JavaScript-heavy sites and dynamic content
- use the router pattern (`createCheerioRouter`/`createPlaywrightRouter`) for complex crawls
- implement retry strategies with exponential backoff for failed requests
- use proper concurrency settings (HTTP: 10-50, Browser: 1-5)
- set sensible defaults in `.actor/input_schema.json` for all optional fields
- set up the output schema in `.actor/output_schema.json`
- clean and validate data before pushing to the dataset
- use semantic CSS selectors and fallback strategies for missing elements
- respect robots.txt and ToS, and implement rate limiting with delays
- check which tools (cheerio/playwright/crawlee) are installed before applying guidance

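
The retry guidance above can be sketched as a small exponential-backoff helper. This is a minimal sketch, not part of any Apify or Crawlee API: `backoffDelayMs`, `withRetries`, and the 500 ms base and 30 s cap are assumptions chosen for illustration. Crawlee crawlers already retry failed requests on their own (see the `maxRequestRetries` option), so a helper like this is mainly relevant for other calls, such as requests to a third-party API.

```javascript
// Hypothetical helper: delay before retry number `attempt` (0-based), capped at `maxMs`.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 30_000) {
    return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs an async `task`, retrying up to `maxRetries` times and waiting longer after each failure.
async function withRetries(task, { maxRetries = 3, baseMs = 500, sleepFn = sleep } = {}) {
    let lastError;
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        try {
            return await task(attempt);
        } catch (error) {
            lastError = error;
            if (attempt === maxRetries) break; // No retries left.
            await sleepFn(backoffDelayMs(attempt, baseMs));
        }
    }
    throw lastError;
}

// Delays for the first four retries, in milliseconds.
console.log([0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt)));
```

The `sleepFn` option exists only so the waiting can be replaced in tests; production code can rely on the default.
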
## Don't

- do not run the `apify create` command
- do not rely on `Dataset.getInfo()` for final counts on the Cloud platform
- do not use browser crawlers when HTTP/Cheerio works (massive performance gains with HTTP)
- do not hard-code values that should be in the input schema or environment variables
- do not skip input validation or error handling
- do not overload servers - use appropriate concurrency and delays
- do not scrape prohibited content or ignore Terms of Service
- do not store personal/sensitive data unless explicitly permitted
- do not use deprecated options like `requestHandlerTimeoutMillis` on CheerioCrawler (v3.x)
- do not use `additionalHttpHeaders` - use `preNavigationHooks` instead

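
As an illustration of validating input early and failing with a clear message, here is a minimal sketch in plain JavaScript. The input shape (`startUrls` as an array of `{ url }` objects and an optional `maxItems`) and the `validateInput` function are assumptions made for this example. In a real Actor the thrown error could be caught in `main.js` and reported with `Actor.fail()`.

```javascript
// Hypothetical input shape: { startUrls: [{ url }], maxItems?: number }.
function validateInput(input) {
    const errors = [];
    const { startUrls, maxItems = 100 } = input ?? {};

    if (!Array.isArray(startUrls) || startUrls.length === 0) {
        errors.push('"startUrls" must be a non-empty array');
    } else {
        startUrls.forEach((entry, index) => {
            try {
                // `new URL()` throws on missing or malformed URLs.
                const { protocol } = new URL(entry?.url);
                if (protocol !== 'http:' && protocol !== 'https:') {
                    errors.push(`startUrls[${index}].url must use http or https`);
                }
            } catch {
                errors.push(`startUrls[${index}].url is not a valid URL`);
            }
        });
    }

    if (!Number.isInteger(maxItems) || maxItems < 1) {
        errors.push('"maxItems" must be a positive integer');
    }

    // Report every problem at once so the user can fix the input in one pass.
    if (errors.length > 0) {
        throw new Error(`Invalid input: ${errors.join('; ')}`);
    }
    return { startUrls: startUrls.map(({ url }) => url), maxItems };
}

console.log(validateInput({ startUrls: [{ url: 'https://example.com' }] }));
```
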
## Actor Input Schema

The input schema defines the input parameters for an Actor. It's a JSON object comprising various field types supported by the Apify platform.

### Structure

```json
{
  "title": "<INPUT-SCHEMA-TITLE>",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    /* define input fields here */
  },
  "required": []
}
```

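
To make the structure above concrete, the sketch below builds an example schema as a JavaScript object whose JSON form could be saved as `.actor/input_schema.json`. The field names `startUrls` and `maxItems`, their titles and descriptions, and the helpers `applyDefaults` and `missingRequired` are hypothetical. The helpers only illustrate locally how `default` and `required` are meant to be read; they are not how the platform itself processes the schema.

```javascript
// Example schema object; the two fields are made up for this sketch.
const inputSchema = {
    title: 'Example scraper input',
    type: 'object',
    schemaVersion: 1,
    properties: {
        startUrls: {
            title: 'Start URLs',
            type: 'array',
            description: 'URLs to start crawling from.',
            editor: 'requestListSources',
            prefill: [{ url: 'https://example.com' }],
        },
        maxItems: {
            title: 'Max items',
            type: 'integer',
            description: 'Maximum number of items to save.',
            default: 100,
            minimum: 1,
        },
    },
    required: ['startUrls'],
};

// Fill in schema defaults for fields the caller left out (local illustration only).
function applyDefaults(schema, input = {}) {
    const result = { ...input };
    for (const [name, field] of Object.entries(schema.properties)) {
        if (result[name] === undefined && 'default' in field) result[name] = field.default;
    }
    return result;
}

// Names of required fields that are missing from the input.
function missingRequired(schema, input = {}) {
    return schema.required.filter((name) => input[name] === undefined);
}

// JSON form of the schema, suitable for .actor/input_schema.json.
console.log(JSON.stringify(inputSchema, null, 2));
```
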
## Actor Output Schema

The Actor output schema builds upon the schemas for the dataset and key-value store. It specifies where an Actor stores its output and defines templates for accessing that output. Apify Console uses these output definitions to display run results.

### Structure

```json
{
  "actorOutputSchemaVersion": 1,
  "title": "<OUTPUT-SCHEMA-TITLE>",
  "properties": {
    /* define your outputs here */
  }
}
```

### Output Schema Template Variables

- `links` (object) - Contains quick links to the most commonly used URLs
- `links.publicRunUrl` (string) - Public run URL in the format `https://console.apify.com/view/runs/:runId`
- `links.consoleRunUrl` (string) - Console run URL in the format `https://console.apify.com/actors/runs/:runId`
- `links.apiRunUrl` (string) - API run URL in the format `https://api.apify.com/v2/actor-runs/:runId`
- `links.apiDefaultDatasetUrl` (string) - API URL of the default dataset in the format `https://api.apify.com/v2/datasets/:defaultDatasetId`
- `links.apiDefaultKeyValueStoreUrl` (string) - API URL of the default key-value store in the format `https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId`
- `links.containerRunUrl` (string) - URL of a web server running inside the run in the format `https://<containerId>.runs.apify.net/`
- `run` (object) - Contains the same information about the run as the `GET Run` API endpoint returns
- `run.defaultDatasetId` (string) - ID of the default dataset
- `run.defaultKeyValueStoreId` (string) - ID of the default key-value store

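
The template variables above can be illustrated with a small local sketch. The `results` output, its `template` string, the `renderTemplate` helper, and the `DATASET_ID` placeholder are assumptions for this example. The platform resolves templates itself, so this code only shows how a `{{...}}` placeholder maps onto the variables listed above.

```javascript
// Hypothetical output schema with one output that points at the default dataset items.
const outputSchema = {
    actorOutputSchemaVersion: 1,
    title: 'Example output',
    properties: {
        results: {
            type: 'string',
            title: 'Results',
            template: '{{links.apiDefaultDatasetUrl}}/items',
        },
    },
};

// Replace each {{path.to.value}} with the matching value from `context`.
// Unknown placeholders are left untouched.
function renderTemplate(template, context) {
    return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match, path) => {
        const value = path.split('.').reduce((obj, key) => obj?.[key], context);
        return value === undefined ? match : String(value);
    });
}

// Made-up context mimicking the variables available to a template.
const context = {
    links: { apiDefaultDatasetUrl: 'https://api.apify.com/v2/datasets/DATASET_ID' },
    run: { defaultDatasetId: 'DATASET_ID' },
};

console.log(renderTemplate(outputSchema.properties.results.template, context));
// prints "https://api.apify.com/v2/datasets/DATASET_ID/items"
```
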
## Dataset Schema Specification

The dataset schema defines how your Actor's output data is structured, transformed, and displayed in the Output tab in the Apify Console.

### Structure

```json
{
  "actorSpecification": 1,
  "fields": {},
  "views": {
    "<VIEW_NAME>": {
      "title": "string (required)",
      "description": "string (optional)",
      "transformation": {
        "fields": ["string (required)"],
        "unwind": ["string (optional)"],
        "flatten": ["string (optional)"],
        "omit": ["string (optional)"],
        "limit": "integer (optional)",
        "desc": "boolean (optional)"
      },
      "display": {
        "component": "table (required)",
        "properties": {
          "<FIELD_NAME>": {
            "label": "string (optional)",
            "format": "text|number|date|link|boolean|image|array|object (optional)"
          }
        }
      }
    }
  }
}
```

**Dataset Schema Properties:**

- `actorSpecification` (integer, required) - Specifies the version of the dataset schema structure document (currently only version 1)
- `fields` (JSON Schema object, required) - Schema of one dataset object (use JSON Schema Draft 2020-12 or compatible)
- `views` (DatasetView object, required) - Object describing the API and UI views

**DatasetView Properties:**

- `title` (string, required) - Visible in UI Output tab and API
- `description` (string, optional) - Only available in API response
- `transformation` (ViewTransformation object, required) - Data transformation applied when loading from Dataset API
- `display` (ViewDisplay object, required) - Output tab UI visualization definition

**ViewTransformation Properties:**

- `fields` (string[], required) - Fields to present in output (order matches column order)
- `unwind` (string[], optional) - Deconstructs nested children into parent object
- `flatten` (string[], optional) - Transforms nested object into flat structure
- `omit` (string[], optional) - Removes specified fields from output
- `limit` (integer, optional) - Maximum number of results (default: all)
- `desc` (boolean, optional) - Sort order (true = newest first)

**ViewDisplay Properties:**

- `component` (string, required) - Only `table` is available
- `properties` (Object, optional) - Keys matching `transformation.fields` with ViewDisplayProperty values

**ViewDisplayProperty Properties:**

- `label` (string, optional) - Table column header
- `format` (string, optional) - One of: `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object`

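
To show roughly what the `transformation` options above do to dataset items, here is a local approximation in plain JavaScript. The `applyView` and `flattenFields` helpers and the sample items are hypothetical, `unwind` is left out for brevity, and the order in which the options are applied here is an assumption. The Dataset API performs the real transformation.

```javascript
// Flatten nested objects under the given top-level fields into dot-notation keys.
function flattenFields(item, fieldsToFlatten) {
    const result = {};
    const walk = (prefix, value) => {
        if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
            for (const [key, child] of Object.entries(value)) walk(`${prefix}.${key}`, child);
        } else {
            result[prefix] = value;
        }
    };
    for (const [key, value] of Object.entries(item)) {
        if (fieldsToFlatten.includes(key)) walk(key, value);
        else result[key] = value;
    }
    return result;
}

// Approximate a view: optional reverse order, limit, flatten, then pick `fields` minus `omit`.
function applyView(items, { fields, flatten = [], omit = [], limit, desc = false }) {
    let rows = desc ? [...items].reverse() : [...items];
    if (limit !== undefined) rows = rows.slice(0, limit);
    return rows.map((item) => {
        const flat = flattenFields(item, flatten);
        const row = {};
        for (const field of fields) {
            if (!omit.includes(field) && field in flat) row[field] = flat[field];
        }
        return row;
    });
}

// Made-up dataset items, oldest first.
const items = [
    { title: 'A', price: { amount: 10, currency: 'USD' }, internalId: 1 },
    { title: 'B', price: { amount: 20, currency: 'EUR' }, internalId: 2 },
];

const rows = applyView(items, {
    fields: ['title', 'price.amount', 'internalId'],
    flatten: ['price'],
    omit: ['internalId'],
    limit: 1,
    desc: true,
});
console.log(rows);
```
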
## Key-Value Store Schema Specification

The key-value store schema organizes keys into logical groups called collections for easier data management.

### Structure

```json
{
  "actorKeyValueStoreSchemaVersion": 1,
  "title": "string (required)",
  "description": "string (optional)",
  "collections": {
    "<COLLECTION_NAME>": {
      "title": "string (required)",
      "description": "string (optional)",
      "key": "string (conditional - use key OR keyPrefix)",
      "keyPrefix": "string (conditional - use key OR keyPrefix)",
      "contentTypes": ["string (optional)"],
      "jsonSchema": "object (optional)"
    }
  }
}
```

**Key-Value Store Schema Properties:**

- `actorKeyValueStoreSchemaVersion` (integer, required) - Version of key-value store schema structure document (currently only version 1)
- `title` (string, required) - Title of the schema
- `description` (string, optional) - Description of the schema
- `collections` (Object, required) - Object where each key is a collection ID and value is a Collection object

**Collection Properties:**

- `title` (string, required) - Collection title shown in UI tabs
- `description` (string, optional) - Description appearing in UI tooltips
- `key` (string, conditional) - Single specific key for this collection
- `keyPrefix` (string, conditional) - Prefix for keys included in this collection
- `contentTypes` (string[], optional) - Allowed content types for validation
- `jsonSchema` (object, optional) - JSON Schema Draft 07 format for `application/json` content type validation

Either `key` or `keyPrefix` must be specified for each collection, but not both.

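
The `key` versus `keyPrefix` rule above can be sketched as follows. The collection IDs `report` and `screenshots`, their keys, and the helpers `invalidCollections` and `findCollection` are hypothetical names used only for illustration. The platform applies the schema itself, so this code just demonstrates how keys are grouped.

```javascript
// Hypothetical key-value store schema with one exact-key and one prefix collection.
const kvSchema = {
    actorKeyValueStoreSchemaVersion: 1,
    title: 'Example key-value store',
    collections: {
        report: { title: 'Report', key: 'REPORT', contentTypes: ['application/json'] },
        screenshots: { title: 'Screenshots', keyPrefix: 'screenshot-', contentTypes: ['image/png'] },
    },
};

// IDs of collections that break the rule "exactly one of `key` or `keyPrefix`".
function invalidCollections(schema) {
    return Object.entries(schema.collections)
        .filter(([, collection]) => (collection.key !== undefined) === (collection.keyPrefix !== undefined))
        .map(([id]) => id);
}

// Return the ID of the first collection a store key belongs to, or null if none matches.
function findCollection(schema, storeKey) {
    for (const [id, collection] of Object.entries(schema.collections)) {
        if (collection.key !== undefined && collection.key === storeKey) return id;
        if (collection.keyPrefix !== undefined && storeKey.startsWith(collection.keyPrefix)) return id;
    }
    return null;
}

console.log(invalidCollections(kvSchema));
console.log(findCollection(kvSchema, 'screenshot-001'));
```
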
## Apify MCP Tools

If the Apify MCP server is configured, use these tools to look up documentation:

- `search-apify-docs` - Search documentation
- `fetch-apify-docs` - Get full doc pages

Otherwise, reference: `@https://mcp.apify.com/`

## Resources

- [docs.apify.com/llms.txt](https://docs.apify.com/llms.txt) - Quick reference
- [docs.apify.com/llms-full.txt](https://docs.apify.com/llms-full.txt) - Complete docs
- [crawlee.dev](https://crawlee.dev) - Crawlee documentation
- [whitepaper.actor](https://raw.githubusercontent.com/apify/actor-whitepaper/refs/heads/master/README.md) - Complete Actor specification