Website Health Monitor - Apify Actor Reference

A comprehensive reference implementation demonstrating all major Apify Actor features using CheerioCrawler. This Actor monitors website health by checking URLs for status codes, load times, and broken links.

Purpose

This template serves as a copy-paste reference for building Apify Actors. It demonstrates every major Apify feature in a single, working Actor that you can use as a starting point for your own projects.

Features Demonstrated

1. Actor Lifecycle

// main.ts - Actor.main() handles init/exit automatically
Actor.main(async () => {
    // Your Actor code here
    // Actor.init() called automatically at start
    // Actor.exit() called automatically at end
});

2. Input Handling

// Get typed input from Actor.getInput()
const rawInput = await Actor.getInput<ActorInput>();
const input = validateInput(rawInput);
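The validateInput helper lives in src/utils.ts and is not shown in this README; the following is a hypothetical sketch of what such a validator might do, applying the schema defaults (the field set and error message are assumptions):

```typescript
// Hypothetical sketch of a validateInput helper; the real
// implementation in src/utils.ts may differ.
interface ActorInput {
    urls: string[];
    maxConcurrency?: number;
    notifyOnFailure?: boolean;
}

function validateInput(raw: ActorInput | null): Required<ActorInput> {
    if (!raw || !Array.isArray(raw.urls) || raw.urls.length === 0) {
        throw new Error('Input must contain a non-empty "urls" array.');
    }
    return {
        urls: raw.urls,
        maxConcurrency: raw.maxConcurrency ?? 5, // schema default
        notifyOnFailure: raw.notifyOnFailure ?? false,
    };
}
```

Failing fast on missing input like this surfaces configuration mistakes at the start of the run instead of mid-crawl.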

Input Schema (input_schema.json):

  • urls (array, required) - URLs to monitor
  • maxConcurrency (integer, default: 5) - Concurrent requests
  • proxyConfig (object) - Apify proxy configuration
  • notifyOnFailure (boolean) - Enable failure notifications
  • notificationActorId (string) - Actor to call on failures
  • webhookUrl (string) - Webhook to trigger on completion
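A minimal .actor/input_schema.json covering these fields might look like the following sketch (titles and editor types are assumptions, not copied from the template):

```json
{
    "title": "Website Health Monitor input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "urls": {
            "title": "URLs to monitor",
            "type": "array",
            "editor": "stringList"
        },
        "maxConcurrency": {
            "title": "Max concurrency",
            "type": "integer",
            "default": 5
        },
        "proxyConfig": {
            "title": "Proxy configuration",
            "type": "object",
            "editor": "proxy"
        }
    },
    "required": ["urls"]
}
```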

3. Storage - Dataset

// Push results to dataset during crawling
await Dataset.pushData(healthCheckResult);
// Access dataset info
const dataset = await Actor.openDataset();
const info = await dataset.getInfo();

Dataset Output Schema:

{
    url: string;
    status: number;
    loadTime: number;
    pageTitle: string | null;
    brokenLinks: string[];
    totalLinks: number;
    isHealthy: boolean;
    errorMessage: string | null;
    timestamp: string;
}
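One way the isHealthy flag could be derived from the other fields (a sketch; the Actor's actual rule may differ):

```typescript
// Hypothetical health rule: 2xx/3xx status, no broken links, no error.
interface HealthFields {
    status: number;
    brokenLinks: string[];
    errorMessage: string | null;
}

function isHealthy(r: HealthFields): boolean {
    return r.status >= 200 && r.status < 400
        && r.brokenLinks.length === 0
        && r.errorMessage === null;
}
```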

4. Storage - Key-Value Store

// Open default Key-Value Store
const kvStore = await Actor.openKeyValueStore();
// Read INPUT (alternative to Actor.getInput())
const input = await kvStore.getValue('INPUT');
// Write OUTPUT summary
await kvStore.setValue('OUTPUT', summary);
// Write file with content type
await kvStore.setValue('SCREENSHOT_STATUS', content, {
    contentType: 'application/json',
});

5. Storage - Request Queue

// Open request queue
const requestQueue = await Actor.openRequestQueue();
// Add requests with user data
await requestQueue.addRequest({
    url: 'https://example.com',
    userData: { originalUrl: url, startTime: Date.now() },
});

6. Crawlee Integration (CheerioCrawler)

import { CheerioCrawler, createCheerioRouter } from 'crawlee';
// Create router with handlers
const router = createCheerioRouter();
router.addDefaultHandler(async ({ request, response, $, log }) => {
    // Extract data using Cheerio
    const title = $('title').text();
    const links = $('a[href]').map((_, el) => $(el).attr('href')).get();
    // Push to dataset
    await Dataset.pushData({ url: request.url, title, links });
});
// Create crawler
const crawler = new CheerioCrawler({
    requestQueue,
    requestHandler: router,
    // In Crawlee v3 the error is passed as a second argument,
    // not destructured from the crawling context
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Request ${request.url} failed: ${error.message}`);
    },
    maxConcurrency: 5,
    maxRequestRetries: 3,
});
// Run crawler
await crawler.run();
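The href values extracted above are often relative; before checking or enqueueing them you typically resolve them against the page URL. A sketch using the WHATWG URL API (resolveLinks is a hypothetical helper, not part of Crawlee):

```typescript
// Resolve raw href values against the page URL, dropping
// non-HTTP schemes (mailto:, javascript:) and unparsable values.
function resolveLinks(hrefs: string[], baseUrl: string): string[] {
    const resolved: string[] = [];
    for (const href of hrefs) {
        try {
            const url = new URL(href, baseUrl);
            if (url.protocol === 'http:' || url.protocol === 'https:') {
                resolved.push(url.href);
            }
        } catch {
            // Ignore hrefs that are not valid URLs
        }
    }
    return resolved;
}
```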

7. Proxy Configuration

// Create proxy from input configuration
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});
// Use in crawler
const crawler = new CheerioCrawler({
    proxyConfiguration,
    // ...
});

8. Actor-to-Actor Communication

// Call another Actor and wait for result
const run = await Actor.call(
    'apify/send-email', // Actor ID
    {
        // Input for called Actor
        subject: 'Alert',
        message: 'Something happened',
    },
    {
        memory: 256, // Memory in MB
        timeout: 60, // Timeout in seconds
    },
);
// Start Actor without waiting (fire and forget)
const startedRun = await Actor.start('apify/some-actor', input);
// Run a saved task and wait for it to finish
// (callTask returns the run object, not dataset items)
const taskRun = await Actor.callTask('user/my-task', input);
const taskDataset = await Actor.openDataset(taskRun.defaultDatasetId);
const { items } = await taskDataset.getData();

9. Logging

// Different log levels
Actor.log.debug('Detailed debug info');
Actor.log.info('General information');
Actor.log.warning('Non-critical warning');
Actor.log.error('Error occurred', { error: err.message });
// Log with structured data
Actor.log.info('Processing URL', {
    url: 'https://example.com',
    status: 200,
    loadTime: 150,
});

10. Status Messages

// Update Actor status (visible in Apify Console)
await Actor.setStatusMessage('Processing URL 5/10...');
await Actor.setStatusMessage('✓ Completed successfully');

11. Environment Information

// Get Actor environment variables
const env = Actor.getEnv();
console.log({
    actorId: env.actorId,
    actorRunId: env.actorRunId,
    userId: env.userId,
    memoryMbytes: env.memoryMbytes,
    isAtHome: env.isAtHome,
    defaultDatasetId: env.defaultDatasetId,
    defaultKeyValueStoreId: env.defaultKeyValueStoreId,
    startedAt: env.startedAt,
    timeoutAt: env.timeoutAt,
});

12. Graceful Shutdown

// Handle Actor migration (server change)
Actor.on('migrating', async () => {
    // Save state before migration
    const kvStore = await Actor.openKeyValueStore();
    await kvStore.setValue('MIGRATION_STATE', currentState);
});
// Handle Actor abort
Actor.on('aborting', async () => {
    // Save partial results
    await Dataset.pushData(partialResults);
});
// Other events: 'persistState', 'systemInfo'

13. Standby Mode (HTTP Server)

// Create HTTP server for standby mode
// Standby mode uses a plain Node.js HTTP server listening on the
// port the platform assigns via the ACTOR_STANDBY_PORT variable
import { createServer } from 'http';

const server = createServer((req, res) => {
    const url = new URL(req.url || '/', `http://${req.headers.host}`);
    if (url.pathname === '/health') {
        res.writeHead(200, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({
            status: 'running',
            urlsProcessed: 10,
            memoryUsageMB: 128,
        }));
    } else {
        res.writeHead(404);
        res.end('Not found');
    }
});
server.listen(Number(process.env.ACTOR_STANDBY_PORT ?? 3000));

File Structure

templates/cheerio-reference/
├── src/
│   ├── main.ts            # Entry point with Actor.main()
│   ├── routes.ts          # Crawlee router handlers
│   ├── types.ts           # TypeScript interfaces
│   └── utils.ts           # Helper functions
├── package.json           # Dependencies
├── tsconfig.json          # TypeScript config
├── Dockerfile             # Multi-stage build
├── .actor/
│   ├── actor.json         # Actor metadata
│   └── input_schema.json  # Input schema
└── README.md              # This file

Running Locally

# Install dependencies
npm install
# Build TypeScript
npm run build
# Provide test input (read from the default Key-Value Store)
echo '{"urls": ["https://example.com"]}' > storage/key_value_stores/default/INPUT.json
# Run locally (--purge clears the previous run's storage, keeping INPUT)
npx apify-cli run --purge

Deploying to Apify

# Login to Apify
npx apify-cli login
# Push to Apify platform
npx apify-cli push

Output

Dataset Items

Each checked URL produces a dataset item:

{
    "url": "https://example.com",
    "status": 200,
    "loadTime": 523,
    "pageTitle": "Example Domain",
    "brokenLinks": [],
    "totalLinks": 15,
    "isHealthy": true,
    "errorMessage": null,
    "timestamp": "2024-01-15T10:30:00.000Z"
}

Key-Value Store OUTPUT

Summary of the health check run:

{
    "totalChecked": 10,
    "failedCount": 2,
    "successCount": 8,
    "avgLoadTime": 450,
    "totalBrokenLinks": 5,
    "failedUrls": ["https://broken.example.com"],
    "startTime": "2024-01-15T10:00:00.000Z",
    "endTime": "2024-01-15T10:05:00.000Z",
    "durationSeconds": 300
}
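This summary can be derived from the dataset items at the end of the run. A sketch of how the counts might be computed (computeSummary is a hypothetical helper, not part of the template):

```typescript
// Aggregate per-URL health check results into an OUTPUT-style summary.
interface CheckItem {
    url: string;
    isHealthy: boolean;
    loadTime: number;
    brokenLinks: string[];
}

function computeSummary(items: CheckItem[]) {
    const failed = items.filter((i) => !i.isHealthy);
    const totalLoad = items.reduce((sum, i) => sum + i.loadTime, 0);
    return {
        totalChecked: items.length,
        failedCount: failed.length,
        successCount: items.length - failed.length,
        avgLoadTime: items.length ? Math.round(totalLoad / items.length) : 0,
        totalBrokenLinks: items.reduce((sum, i) => sum + i.brokenLinks.length, 0),
        failedUrls: failed.map((i) => i.url),
    };
}
```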

Quick Reference: Common Patterns

Reading Files from Key-Value Store

const kvStore = await Actor.openKeyValueStore();
const data = await kvStore.getValue('MY_DATA');

Writing Binary Files

await kvStore.setValue('image.png', buffer, {
    contentType: 'image/png',
});

Named Stores

// Open named stores (persist across runs)
const kvStore = await Actor.openKeyValueStore('my-store');
const dataset = await Actor.openDataset('my-dataset');
const queue = await Actor.openRequestQueue('my-queue');

Metamorph (Transform Actor)

// Transform into another Actor
await Actor.metamorph('apify/web-scraper', newInput);

Abort Run

// Abort with status message
await Actor.fail('Critical error occurred');
// Exit successfully
await Actor.exit('Completed');

License

ISC