Scrapy Cloud Runner
Pricing
from $2.00 / 1,000 results
Run Scrapy spiders on Apify with request queue, dataset export, proxy rotation, scheduling, and cloud-ready deployment.
Developer
Solutions Smart
Maintained by Community
Run Scrapy spiders on Apify with cloud scheduling, API access, dataset output, proxy support, and production-friendly crawl defaults.
What this actor does
Scrapy Cloud Runner is a Python Apify Actor that executes Scrapy spiders bundled with the actor codebase. It uses the official Apify Python SDK and Scrapy integration so you can run, schedule, and monitor Scrapy crawls in the Apify Console or through the API.
The actor:
- runs a selected Scrapy spider by spiderName
- reads input from the Apify input form or API
- pushes scraped items to the default dataset
- stores a crawl summary in the OUTPUT key-value store record
- supports Apify proxy configuration
- exposes crawl controls for limits, retries, delays, cache, and robots.txt
Included spider
The actor includes one bundled example spider:
- page_meta: crawls pages, extracts basic page metadata, and optionally follows links
The example spider is designed to be a solid starting point, not a universal website crawler. By default it stays on the same hostname as the start URLs to avoid drifting into subdomains with different blocking or rate-limit behavior.
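The same-hostname rule can be pictured as a small URL filter. The following is a minimal sketch, not the actor's actual code: the function names allowed_hostnames and should_follow are hypothetical, and it assumes that "same hostname" means an exact match on the parsed hostname, so docs.apify.com would not count as apify.com.

```python
from urllib.parse import urljoin, urlsplit


def allowed_hostnames(start_urls):
    """Collect the exact hostnames of the start URLs (hypothetical helper)."""
    hostnames = set()
    for url in start_urls:
        hostname = urlsplit(url).hostname
        if hostname:
            hostnames.add(hostname)
    return hostnames


def should_follow(link, page_url, hostnames):
    """Return True if a discovered link stays on one of the start hostnames."""
    absolute = urljoin(page_url, link)  # resolve relative links against the page
    parts = urlsplit(absolute)
    # Only http(s) links on an exact-match hostname are followed in this sketch.
    return parts.scheme in ("http", "https") and parts.hostname in hostnames


hosts = allowed_hostnames(["https://apify.com"])
same_host = should_follow("/pricing", "https://apify.com", hosts)
subdomain = should_follow("https://docs.apify.com/", "https://apify.com", hosts)
```

Under these assumptions, a relative link such as /pricing is followed, while a link to a subdomain is skipped.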
Why use it on Apify
Running Scrapy on Apify gives you:
- scheduled runs
- API-triggered runs
- centralized logs
- dataset export
- proxy integration
- managed cloud execution
You keep the Scrapy spider model, but you do not need to manage servers, deployment plumbing, or result storage yourself.
Quick start
- Open the actor in Apify Console.
- Set spiderName to page_meta or to your own bundled spider.
- Add one or more startUrls.
- Keep the default limits for the first run.
- Run the actor.
- Review results in the Dataset tab and the summary in the OUTPUT record.
Example input
{
  "spiderName": "page_meta",
  "startUrls": [{ "url": "https://apify.com" }],
  "followLinks": true,
  "sameHostnameOnly": true,
  "includeHtml": false,
  "maxRequestsPerCrawl": 20,
  "maxDepth": 1,
  "maxConcurrency": 16,
  "requestTimeoutSecs": 30,
  "downloadDelaySecs": 1,
  "retryTimes": 2,
  "useAutoThrottle": true,
  "autoThrottleTargetConcurrency": 1,
  "autoThrottleStartDelaySecs": 1,
  "autoThrottleMaxDelaySecs": 15,
  "respectRobotsTxt": true,
  "useHttpCache": true,
  "httpCacheExpirationSecs": 7200,
  "spiderArgs": [{ "key": "category", "value": "books" }]
}
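For API-triggered runs, the same input can be sent as the JSON body of a request to the Apify API's run-actor endpoint. The sketch below only builds the request with the standard library and does not send it. The actor ID your-username~scrapy-cloud-runner and the token YOUR_APIFY_TOKEN are placeholders, not real values, and build_run_request is a hypothetical helper name.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, actor_input):
    """Build (but do not send) a POST request that would start an actor run."""
    url = f"{API_BASE}/acts/{actor_id}/runs"
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


request = build_run_request(
    "your-username~scrapy-cloud-runner",  # placeholder actor ID
    "YOUR_APIFY_TOKEN",  # placeholder API token
    {
        "spiderName": "page_meta",
        "startUrls": [{"url": "https://apify.com"}],
        "maxRequestsPerCrawl": 20,
    },
)
# Passing the request to urllib.request.urlopen() would start the run;
# that call is left out so the sketch has no side effects.
```

Keeping request construction separate from sending makes the payload easy to inspect before a paid run is started.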
Input settings
| Input | Type | Description |
|---|---|---|
| spiderName | string | Name of the bundled Scrapy spider to run. |
| startUrls | array | Starting URLs for the crawl. |
| allowedDomains | array | Optional domain allowlist for Scrapy offsite filtering. |
| followLinks | boolean | Follow links discovered on crawled pages. |
| sameHostnameOnly | boolean | Restrict followed links to the exact hostnames from startUrls. Recommended for focused crawls. |
| includeHtml | boolean | Include raw HTML in dataset items. |
| maxRequestsPerCrawl | integer | Maximum number of scraped pages/items emitted by the bundled spider. |
| maxDepth | integer | Maximum follow depth from the initial pages. |
| maxConcurrency | integer | Maximum concurrent Scrapy requests. |
| requestTimeoutSecs | integer | Download timeout per request. |
| downloadDelaySecs | number | Base delay between requests to the same site. |
| retryTimes | integer | Retry count for retryable failures. |
| useAutoThrottle | boolean | Enable Scrapy AutoThrottle. |
| autoThrottleTargetConcurrency | number | Target average concurrency per remote site. |
| autoThrottleStartDelaySecs | number | Initial AutoThrottle delay. |
| autoThrottleMaxDelaySecs | number | Maximum AutoThrottle delay. |
| respectRobotsTxt | boolean | Respect robots.txt. |
| useHttpCache | boolean | Enable HTTP cache. |
| httpCacheExpirationSecs | integer | Cache expiration time in seconds. |
| proxyConfiguration | object | Apify proxy configuration. |
| spiderArgs | array | Spider arguments entered as schema-based key/value rows in Apify Console. |
| spiderArgsJson | object | Structured spider arguments for API callers. Merged over spiderArgs on duplicate keys. |
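The merge rule for spiderArgs and spiderArgsJson can be illustrated with a few lines of Python. This is a sketch of the documented behavior (spiderArgsJson wins on duplicate keys), not the actor's source. The function name merge_spider_args is hypothetical, and skipping rows without a key is an assumption.

```python
def merge_spider_args(spider_args, spider_args_json):
    """Merge key/value rows with a JSON object; the JSON object wins on duplicates."""
    merged = {}
    for row in spider_args or []:
        key = row.get("key")
        if key:  # assumption: rows without a key are ignored
            merged[key] = row.get("value")
    merged.update(spider_args_json or {})  # spiderArgsJson overrides spiderArgs
    return merged


merged_args = merge_spider_args(
    [{"key": "category", "value": "books"}, {"key": "lang", "value": "en"}],
    {"category": "music", "page": 2},
)
```

Here category comes from spiderArgsJson, lang survives from spiderArgs, and page is added from spiderArgsJson.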
Output
The default dataset contains one item per scraped page. For the bundled page_meta spider, each item includes fields such as:
{
  "url": "https://apify.com",
  "status": 200,
  "title": "Apify: Full-stack web scraping and data extraction platform",
  "metaDescription": "Cloud platform for web scraping, browser automation, AI agents, and data for AI.",
  "canonicalUrl": "https://apify.com",
  "h1": "Get real-time web data for your AI",
  "contentType": "text/html; charset=utf-8",
  "depth": 0,
  "referrer": null,
  "textLength": 125419,
  "crawledAt": "2026-05-16T11:21:11.435924+00:00",
  "html": null
}
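Once the dataset is exported, items with this shape are easy to post-process. The sketch below assumes the items have already been loaded as a list of dictionaries, for example from a JSON export, and flags pages with a non-200 status or missing metadata. The helper name find_problem_pages and the second sample item are made up for illustration.

```python
def find_problem_pages(items):
    """Return (url, reasons) pairs for items that look incomplete or failed."""
    problems = []
    for item in items:
        reasons = []
        if item.get("status") != 200:
            reasons.append(f"status {item.get('status')}")
        if not item.get("title"):
            reasons.append("missing title")
        if not item.get("metaDescription"):
            reasons.append("missing metaDescription")
        if reasons:
            problems.append((item.get("url"), reasons))
    return problems


sample_items = [
    {"url": "https://apify.com", "status": 200, "title": "Apify", "metaDescription": "Cloud platform"},
    # Illustrative item, not taken from a real run:
    {"url": "https://apify.com/missing", "status": 404, "title": None, "metaDescription": None},
]
problem_pages = find_problem_pages(sample_items)
```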
The actor also stores a summary record in OUTPUT:
{
  "availableSpiders": ["page_meta"],
  "finishedAt": "2026-05-16T11:21:15.000000+00:00",
  "itemCount": 5,
  "requestCount": 12,
  "spiderName": "page_meta",
  "startedAt": "2026-05-16T11:21:10.000000+00:00",
  "stats": {}
}
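The summary record lends itself to quick health checks after a run. As a minimal sketch, assuming the OUTPUT record has been loaded into a dictionary with the fields shown above, the crawl duration and the share of requests that produced items can be derived as follows. The function name summarize_run and its return keys are hypothetical.

```python
from datetime import datetime


def summarize_run(output):
    """Derive duration and items-per-request from an OUTPUT summary record."""
    started = datetime.fromisoformat(output["startedAt"])
    finished = datetime.fromisoformat(output["finishedAt"])
    request_count = output["requestCount"]
    return {
        "durationSecs": (finished - started).total_seconds(),
        # Guard against division by zero for runs that made no requests.
        "itemsPerRequest": output["itemCount"] / request_count if request_count else 0.0,
    }


run_summary = summarize_run(
    {
        "startedAt": "2026-05-16T11:21:10.000000+00:00",
        "finishedAt": "2026-05-16T11:21:15.000000+00:00",
        "itemCount": 5,
        "requestCount": 12,
    }
)
```

A low items-per-request ratio can be normal, since redirects, robots.txt fetches, and retries also count as requests in Scrapy.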
Default crawl behavior
The bundled actor defaults are tuned for focused website crawls:
- same-host following is enabled by default
- AutoThrottle is enabled by default
- HTTP cache uses RFC2616 policy
- common blocked/error responses such as 403 and 429 are not cached
- cookies are disabled
- robots.txt is respected by default
These defaults are more conservative and more production-friendly than simply running Scrapy at high parallelism.
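To make these defaults concrete, the sketch below shows one plausible way the actor input could map onto standard Scrapy settings. The Scrapy setting names are real, but the mapping itself is an assumption about how the actor works, not its confirmed implementation. Only inputs with a direct Scrapy counterpart are included. build_scrapy_settings is a hypothetical name.

```python
# Assumed mapping from actor input fields to standard Scrapy setting names.
INPUT_TO_SCRAPY = {
    "maxConcurrency": "CONCURRENT_REQUESTS",
    "requestTimeoutSecs": "DOWNLOAD_TIMEOUT",
    "downloadDelaySecs": "DOWNLOAD_DELAY",
    "retryTimes": "RETRY_TIMES",
    "maxDepth": "DEPTH_LIMIT",
    "useAutoThrottle": "AUTOTHROTTLE_ENABLED",
    "autoThrottleTargetConcurrency": "AUTOTHROTTLE_TARGET_CONCURRENCY",
    "autoThrottleStartDelaySecs": "AUTOTHROTTLE_START_DELAY",
    "autoThrottleMaxDelaySecs": "AUTOTHROTTLE_MAX_DELAY",
    "respectRobotsTxt": "ROBOTSTXT_OBEY",
    "useHttpCache": "HTTPCACHE_ENABLED",
    "httpCacheExpirationSecs": "HTTPCACHE_EXPIRATION_SECS",
}

# Settings implied by the documented defaults, independent of the input.
FIXED_SETTINGS = {
    "HTTPCACHE_POLICY": "scrapy.extensions.httpcache.RFC2616Policy",
    "COOKIES_ENABLED": False,
}


def build_scrapy_settings(actor_input):
    """Translate known input fields into a Scrapy settings dictionary."""
    settings = dict(FIXED_SETTINGS)
    for input_key, setting_name in INPUT_TO_SCRAPY.items():
        if input_key in actor_input:  # leave Scrapy's own default otherwise
            settings[setting_name] = actor_input[input_key]
    return settings


scrapy_settings = build_scrapy_settings(
    {"spiderName": "page_meta", "maxConcurrency": 16, "respectRobotsTxt": True, "httpCacheExpirationSecs": 7200}
)
```

Fields such as spiderName or startUrls have no settings counterpart and are ignored here; they would be handled by the spider or the runner itself.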
Add your own spiders
- Add a spider module under src/spiders/.
- Give the spider a unique Scrapy name.
- Read any custom runtime options from spider kwargs or spiderArgsJson.
- Deploy the updated actor.
- Run the actor with spiderName set to your spider's name.
The actor uses Scrapy's spider loader, so bundled spiders are discovered automatically from src.spiders.
Practical guidance
- Start with one or two startUrls.
- Keep sameHostnameOnly enabled unless you intentionally want cross-subdomain crawling.
- Use proxy configuration for websites with blocking or rate limiting.
- Keep includeHtml off unless you need full source in the dataset.
- For broad or multi-domain crawling, create a dedicated spider with different settings instead of using the bundled example as-is.
Legal and operational note
You are responsible for using this actor in compliance with the target site's terms, applicable law, and reasonable load limits. Keep respectRobotsTxt enabled unless you have a clear reason not to.