Scrapy Cloud Runner

Pricing

from $2.00 / 1,000 results

Run Scrapy spiders on Apify with request queue, dataset export, proxy rotation, scheduling, and cloud-ready deployment.

Developer

Solutions Smart

Maintained by Community

Run Scrapy spiders on Apify with cloud scheduling, API access, dataset output, proxy support, and production-friendly crawl defaults.

What this actor does

Scrapy Cloud Runner is a Python Apify Actor that executes Scrapy spiders bundled with the actor codebase. It uses the official Apify Python SDK and Scrapy integration so you can run, schedule, and monitor Scrapy crawls in the Apify Console or through the API.

The actor:

  • runs a selected Scrapy spider by spiderName
  • reads input from the Apify input form or API
  • pushes scraped items to the default dataset
  • stores a crawl summary in the OUTPUT record of the key-value store
  • supports Apify proxy configuration
  • exposes crawl controls for limits, retries, delays, cache, and robots.txt

Included spider

The actor includes one bundled example spider:

  • page_meta: crawls pages, extracts basic page metadata, and optionally follows links

The example spider is designed to be a solid starting point, not a universal website crawler. By default it stays on the same hostname as the start URLs to avoid drifting into subdomains with different blocking or rate-limit behavior.
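
The same-hostname rule can be illustrated with a short standard-library sketch. This is not the bundled spider's code: the function names `allowed_hostnames` and `filter_same_hostname` are hypothetical, and the sketch assumes that "same hostname" means an exact, case-insensitive match on the hostname of any start URL.

```python
from urllib.parse import urljoin, urlsplit


def allowed_hostnames(start_urls):
    """Collect the exact hostnames of the start URLs (urlsplit lowercases them)."""
    hosts = set()
    for url in start_urls:
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return hosts


def filter_same_hostname(page_url, hrefs, start_urls):
    """Resolve hrefs against page_url and keep only links on a start hostname."""
    allowed = allowed_hostnames(start_urls)
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # handles relative links such as "/pricing"
        if urlsplit(absolute).hostname in allowed:
            kept.append(absolute)
    return kept


links = filter_same_hostname(
    "https://apify.com/store",
    ["/pricing", "https://docs.apify.com/platform", "https://apify.com/about"],
    ["https://apify.com"],
)
# The docs.apify.com link is dropped because its hostname differs from apify.com.
```

Under this assumption a subdomain such as docs.apify.com counts as a different host, which matches the stated goal of not drifting into subdomains.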

Why use it on Apify

Running Scrapy on Apify gives you:

  • scheduled runs
  • API-triggered runs
  • centralized logs
  • dataset export
  • proxy integration
  • managed cloud execution

You keep the Scrapy spider model, but you do not need to manage servers, deployment plumbing, or result storage yourself.

Quick start

  1. Open the actor in Apify Console.
  2. Set spiderName to page_meta or to your own bundled spider.
  3. Add one or more startUrls.
  4. Keep the default limits for the first run.
  5. Run the actor.
  6. Review results in the Dataset tab and the summary in the OUTPUT record.

Example input

{
  "spiderName": "page_meta",
  "startUrls": [
    { "url": "https://apify.com" }
  ],
  "followLinks": true,
  "sameHostnameOnly": true,
  "includeHtml": false,
  "maxRequestsPerCrawl": 20,
  "maxDepth": 1,
  "maxConcurrency": 16,
  "requestTimeoutSecs": 30,
  "downloadDelaySecs": 1,
  "retryTimes": 2,
  "useAutoThrottle": true,
  "autoThrottleTargetConcurrency": 1,
  "autoThrottleStartDelaySecs": 1,
  "autoThrottleMaxDelaySecs": 15,
  "respectRobotsTxt": true,
  "useHttpCache": true,
  "httpCacheExpirationSecs": 7200,
  "spiderArgs": [
    { "key": "category", "value": "books" }
  ]
}
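
For API callers, an input like the one above can be sent to the Apify API's run-actor endpoint. The sketch below only builds the HTTP request with the standard library and does not send it. The helper name `build_run_request` is hypothetical, and `USERNAME~ACTOR_NAME` and `APIFY_TOKEN` are placeholders: this page does not state the actor ID, so substitute the real one along with your own token.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input):
    """Build (but do not send) a POST request that starts an actor run."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),  # the JSON body is the actor input
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


run_input = {
    "spiderName": "page_meta",
    "startUrls": [{"url": "https://apify.com"}],
    "maxRequestsPerCrawl": 20,
    "maxDepth": 1,
}

# Placeholders, not real values: replace with the actual actor ID and API token.
request = build_run_request("USERNAME~ACTOR_NAME", "APIFY_TOKEN", run_input)

# Sending is left commented out so the sketch has no side effects:
# with urllib.request.urlopen(request) as response:
#     run = json.load(response)["data"]
```

The official Apify client libraries wrap the same endpoint and are usually more convenient; the raw request is shown here only to keep the example dependency-free.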

Input settings

| Input | Type | Description |
| --- | --- | --- |
| spiderName | string | Name of the bundled Scrapy spider to run. |
| startUrls | array | Starting URLs for the crawl. |
| allowedDomains | array | Optional domain allowlist for Scrapy offsite filtering. |
| followLinks | boolean | Follow links discovered on crawled pages. |
| sameHostnameOnly | boolean | Restrict followed links to the exact hostnames from startUrls. Recommended for focused crawls. |
| includeHtml | boolean | Include raw HTML in dataset items. |
| maxRequestsPerCrawl | integer | Maximum number of scraped pages/items emitted by the bundled spider. |
| maxDepth | integer | Maximum follow depth from the initial pages. |
| maxConcurrency | integer | Maximum concurrent Scrapy requests. |
| requestTimeoutSecs | integer | Download timeout per request. |
| downloadDelaySecs | number | Base delay between requests to the same site. |
| retryTimes | integer | Retry count for retryable failures. |
| useAutoThrottle | boolean | Enable Scrapy AutoThrottle. |
| autoThrottleTargetConcurrency | number | Target average concurrency per remote site. |
| autoThrottleStartDelaySecs | number | Initial AutoThrottle delay. |
| autoThrottleMaxDelaySecs | number | Maximum AutoThrottle delay. |
| respectRobotsTxt | boolean | Respect robots.txt. |
| useHttpCache | boolean | Enable HTTP cache. |
| httpCacheExpirationSecs | integer | Cache expiration time in seconds. |
| proxyConfiguration | object | Apify proxy configuration. |
| spiderArgs | array | Spider arguments entered as schema-based key/value rows in Apify Console. |
| spiderArgsJson | object | Structured spider arguments for API callers. Merged over spiderArgs on duplicate keys. |
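
The merge rule for spiderArgs and spiderArgsJson can be sketched as follows. The function name `merge_spider_args` is hypothetical and the actor's real implementation may differ (for example in how it treats rows with an empty key); the sketch only shows the documented precedence, where spiderArgsJson wins on duplicate keys.

```python
def merge_spider_args(spider_args=None, spider_args_json=None):
    """Combine key/value rows with a JSON object; the JSON object wins on conflicts."""
    merged = {}
    for row in spider_args or []:
        key = row.get("key")
        if key:  # assumption: rows without a key are ignored
            merged[key] = row.get("value")
    merged.update(spider_args_json or {})  # spiderArgsJson overrides spiderArgs
    return merged


kwargs = merge_spider_args(
    spider_args=[
        {"key": "category", "value": "books"},
        {"key": "lang", "value": "en"},
    ],
    spider_args_json={"category": "music", "maxPrice": 20},
)
# kwargs == {"category": "music", "lang": "en", "maxPrice": 20}
```

Note that values from the key/value rows arrive as strings, while spiderArgsJson can carry numbers, booleans, and nested structures.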

Output

The default dataset contains one item per scraped page. For the bundled page_meta spider, each item includes fields such as:

{
  "url": "https://apify.com",
  "status": 200,
  "title": "Apify: Full-stack web scraping and data extraction platform",
  "metaDescription": "Cloud platform for web scraping, browser automation, AI agents, and data for AI.",
  "canonicalUrl": "https://apify.com",
  "h1": "Get real-time web data for your AI",
  "contentType": "text/html; charset=utf-8",
  "depth": 0,
  "referrer": null,
  "textLength": 125419,
  "crawledAt": "2026-05-16T11:21:11.435924+00:00",
  "html": null
}

The actor also stores a summary record in OUTPUT:

{
  "availableSpiders": ["page_meta"],
  "finishedAt": "2026-05-16T11:21:15.000000+00:00",
  "itemCount": 5,
  "requestCount": 12,
  "spiderName": "page_meta",
  "startedAt": "2026-05-16T11:21:10.000000+00:00",
  "stats": {}
}
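
Once a run finishes, the summary record is convenient for quick health checks. The sketch below works on a summary that has already been loaded into a dictionary (fetching it from the key-value store is out of scope here). The helper `summarize_run` and its output field names are hypothetical.

```python
from datetime import datetime


def summarize_run(summary):
    """Derive the run duration and items-per-request ratio from an OUTPUT summary."""
    started = datetime.fromisoformat(summary["startedAt"])
    finished = datetime.fromisoformat(summary["finishedAt"])
    requests = summary["requestCount"]
    return {
        "spiderName": summary["spiderName"],
        "durationSecs": (finished - started).total_seconds(),
        # Guard against division by zero when no requests were made.
        "itemsPerRequest": summary["itemCount"] / requests if requests else 0.0,
    }


summary = {
    "availableSpiders": ["page_meta"],
    "finishedAt": "2026-05-16T11:21:15.000000+00:00",
    "itemCount": 5,
    "requestCount": 12,
    "spiderName": "page_meta",
    "startedAt": "2026-05-16T11:21:10.000000+00:00",
    "stats": {},
}

report = summarize_run(summary)
# For the sample record above, durationSecs is 5.0.
```

A low items-per-request ratio is not necessarily a problem: requests may include robots.txt fetches, retries, or pages filtered out before an item is emitted.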

Default crawl behavior

The bundled actor defaults are tuned for focused website crawls:

  • same-host following is enabled by default
  • AutoThrottle is enabled by default
  • HTTP cache uses RFC2616 policy
  • common blocked/error responses such as 403 and 429 are not cached
  • cookies are disabled
  • robots.txt is respected by default

These defaults favor polite, stable crawling over raw speed, which makes them a safer production baseline than simply running Scrapy at high parallelism.
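
To make these controls concrete, here is one plausible mapping from the actor's input fields to standard Scrapy setting names. The Scrapy setting names are real, but the mapping itself, the `BASE_SETTINGS` values, and the function `build_scrapy_settings` are assumptions for illustration; the actor's actual wiring is not documented here. The rule about not caching 403 and 429 responses is left out because this page does not say how the actor implements it.

```python
# Hypothetical mapping from actor input fields to Scrapy setting names.
INPUT_TO_SCRAPY_SETTING = {
    "maxDepth": "DEPTH_LIMIT",
    "maxConcurrency": "CONCURRENT_REQUESTS",
    "requestTimeoutSecs": "DOWNLOAD_TIMEOUT",
    "downloadDelaySecs": "DOWNLOAD_DELAY",
    "retryTimes": "RETRY_TIMES",
    "useAutoThrottle": "AUTOTHROTTLE_ENABLED",
    "autoThrottleTargetConcurrency": "AUTOTHROTTLE_TARGET_CONCURRENCY",
    "autoThrottleStartDelaySecs": "AUTOTHROTTLE_START_DELAY",
    "autoThrottleMaxDelaySecs": "AUTOTHROTTLE_MAX_DELAY",
    "respectRobotsTxt": "ROBOTSTXT_OBEY",
    "useHttpCache": "HTTPCACHE_ENABLED",
    "httpCacheExpirationSecs": "HTTPCACHE_EXPIRATION_SECS",
}

# Assumed baseline reflecting the defaults listed above.
BASE_SETTINGS = {
    "AUTOTHROTTLE_ENABLED": True,
    "ROBOTSTXT_OBEY": True,
    "COOKIES_ENABLED": False,
    "HTTPCACHE_POLICY": "scrapy.extensions.httpcache.RFC2616Policy",
}


def build_scrapy_settings(actor_input):
    """Overlay the input fields that are present on top of the baseline settings."""
    settings = dict(BASE_SETTINGS)
    for field, setting in INPUT_TO_SCRAPY_SETTING.items():
        if field in actor_input:
            settings[setting] = actor_input[field]
    return settings


settings = build_scrapy_settings(
    {"spiderName": "page_meta", "maxDepth": 1, "useHttpCache": True, "downloadDelaySecs": 1}
)
```

Fields with no Scrapy equivalent in this mapping, such as spiderName or startUrls, are ignored by the function and would be handled elsewhere.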

Add your own spiders

  1. Add a spider module under src/spiders/.
  2. Give the spider a unique Scrapy name.
  3. Read any custom runtime options from spider kwargs or spiderArgsJson.
  4. Deploy the updated actor.
  5. Run the actor with spiderName set to your spider's name.

The actor uses Scrapy's spider loader, so bundled spiders are discovered automatically from src.spiders.

Practical guidance

  • Start with one or two startUrls.
  • Keep sameHostnameOnly enabled unless you intentionally want cross-subdomain crawling.
  • Use proxy configuration for websites with blocking or rate limiting.
  • Keep includeHtml off unless you need full source in the dataset.
  • For broad or multi-domain crawling, create a dedicated spider with different settings instead of using the bundled example as-is.

You are responsible for using this actor in compliance with the target site's terms, applicable law, and reasonable load limits. Keep respectRobotsTxt enabled unless you have a clear reason not to.