Scrapy Cloud Runner
Pricing
from $3.00 / 1,000 scraped items
Scrapy Cloud Runner runs Scrapy spiders on Apify with runtime arguments, custom settings, schedules, webhooks, and automatic dataset export. It supports custom spiders, compact JSON output, and JSON/CSV/Excel dataset downloads.
Developer
Sovanza
Maintained by Community
Scrapy Cloud Runner – Run & Schedule Scrapy Spiders on Apify
Run your Scrapy spiders on the Apify platform without managing servers, deployment pipelines, or dataset wiring. This actor executes Python crawlers in a managed environment, pushes compact structured items to the default dataset, and integrates with schedules, webhooks, and the Apify API.
Overview
Scrapy Cloud Runner is a thin Apify wrapper around your Scrapy project. You choose a spider by name, pass runtime arguments and settings, and the actor runs scrapy crawl with an Apify dataset pipeline attached.
Output is compact: empty or missing fields are omitted so each row contains only what your spider extracted for that page.
Key benefits
- Run Scrapy spiders on demand or on a schedule
- Pass spider arguments and Scrapy settings without rebuilding the Actor
- Automatic dataset integration with JSON, CSV, and Excel export
- Compact JSON — no noise from null or empty fields
- Add custom spiders under scrapy_cloud_runner/spiders/ and run them by name
Core features
- Execute any registered Scrapy spider via spiderName
- Pass keyword args through spiderArgs (scrapy crawl ... -a key=value)
- Override Scrapy settings through scrapySettings (scrapy crawl ... -s KEY=value)
- Configurable dataset push batch size (pushItemsBatchSize)
- Included example spider for quick smoke tests (title, meta, links, custom CSS selectors)
- Built-in ApifyDatasetPipeline compacts items before push
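The mapping from input fields to the Scrapy command line can be sketched as follows. This is a minimal illustration, not the Actor's actual code: the function names and the choice to serialize non-string values as JSON text are assumptions.

```python
import json
from typing import Any, Dict, List, Optional


def _cli_value(value: Any) -> str:
    """Render an input value as one CLI token (assumed convention)."""
    if isinstance(value, str):
        return value
    # Assumption: booleans, numbers, lists, and dicts are passed as JSON text.
    return json.dumps(value)


def build_scrapy_command(
    spider_name: str,
    spider_args: Optional[Dict[str, Any]] = None,
    scrapy_settings: Optional[Dict[str, Any]] = None,
) -> List[str]:
    """Build an argv list for `scrapy crawl` from Actor-style input."""
    command = ["scrapy", "crawl", spider_name]
    # Each spiderArgs entry becomes `-a key=value`.
    for key, value in (spider_args or {}).items():
        command += ["-a", f"{key}={_cli_value(value)}"]
    # Each scrapySettings entry becomes `-s KEY=value`.
    for key, value in (scrapy_settings or {}).items():
        command += ["-s", f"{key}={_cli_value(value)}"]
    return command


if __name__ == "__main__":
    print(build_scrapy_command(
        "example",
        {"start_url": "https://example.com", "include_links": True},
        {"DOWNLOAD_DELAY": 0.5},
    ))
```

Returning a list rather than a single string lets the command be passed to a subprocess without shell quoting issues.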
How to Use Scrapy Cloud Runner on Apify
Using the Actor
- Open the Actor on the Apify platform and go to the Input tab.
- Set spiderName to the spider you want to run (default: example).
- Configure spiderArgs and optional scrapySettings (see below).
- Start the run. The actor launches Scrapy, streams logs, and pushes compact items to the default dataset.
- Open the Dataset tab to browse, download JSON/CSV/Excel, or pull data via the Apify API.
- Schedule or integrate (optional): use schedules, webhooks, Zapier/Make, or your own code against the Apify API.
Input Configuration
Full schema: INPUT_SCHEMA.json. Example:
{
  "spiderName": "example",
  "spiderArgs": {
    "start_url": "https://news.ycombinator.com/",
    "include_links": true,
    "include_headers": false,
    "selectors": "{\"firstTitle\":\".titleline > a::text\",\"firstTitleUrl\":\".titleline > a::attr(href)\"}"
  },
  "scrapySettings": {
    "DOWNLOAD_DELAY": 0.5,
    "CONCURRENT_REQUESTS": 16
  },
  "logLevel": "INFO",
  "pushItemsBatchSize": 50
}
- spiderName (required): Spider name to run (must exist in scrapy_cloud_runner/spiders/).
- spiderArgs (optional): Keyword arguments passed to the spider. For the built-in example spider:
  - start_url: page to scrape (default https://example.com)
  - selectors: JSON object mapping field names to CSS selectors, passed as a JSON-encoded string as in the example above (e.g. Hacker News first story title)
  - include_links: include absolute links array (default true)
  - include_headers: include response headers object (default false)
- scrapySettings (optional): Scrapy settings overrides at runtime.
- logLevel (optional): CRITICAL, ERROR, WARNING, INFO, or DEBUG (default INFO).
- pushItemsBatchSize (optional): Buffer size before pushing to the dataset (default 50, max 1000).
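Because selectors is a JSON string nested inside the JSON input, escaping it by hand is error-prone. The sketch below builds an input for the example spider programmatically. The helper name is hypothetical, the checks only mirror the limits documented above, and the lower bound of 1 on the batch size is an assumption.

```python
import json
from typing import Any, Dict, Optional

LOG_LEVELS = {"CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"}
MAX_BATCH_SIZE = 1000  # documented maximum for pushItemsBatchSize


def build_example_input(
    start_url: str,
    selectors: Optional[Dict[str, str]] = None,
    log_level: str = "INFO",
    push_items_batch_size: int = 50,
) -> Dict[str, Any]:
    """Build an Actor input dict for the built-in example spider."""
    if log_level not in LOG_LEVELS:
        raise ValueError(f"Unsupported logLevel: {log_level}")
    # Assumption: the batch size must be at least 1.
    if not 1 <= push_items_batch_size <= MAX_BATCH_SIZE:
        raise ValueError("pushItemsBatchSize must be between 1 and 1000")

    spider_args: Dict[str, Any] = {"start_url": start_url}
    if selectors:
        # The selectors mapping travels as a JSON-encoded string.
        spider_args["selectors"] = json.dumps(selectors)

    return {
        "spiderName": "example",
        "spiderArgs": spider_args,
        "logLevel": log_level,
        "pushItemsBatchSize": push_items_batch_size,
    }


if __name__ == "__main__":
    actor_input = build_example_input(
        "https://news.ycombinator.com/",
        {"firstTitle": ".titleline > a::text"},
    )
    print(json.dumps(actor_input, indent=2))
```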
Run locally
INPUT.json is gitignored. Copy INPUT.example.json to INPUT.json, set APIFY_TOKEN and APIFY_DEFAULT_DATASET_ID (or run inside Apify), then:
cd scrapy-cloud-runner
pip install -r requirements.txt
cp INPUT.example.json INPUT.json
python main.py
Output
Results are stored in the Actor’s default dataset. Each item is a compact JSON object: fields that are empty or unknown are not included.
Typical fields from the included example spider (when data is available):
- Page identity: url, status, contentType
- HTML metadata: title, h1, metaDescription
- Custom selectors: any keys from spiderArgs.selectors (e.g. firstTitle, firstTitleUrl)
- Optional: links (when include_links is true), headers (when include_headers is true)
- Errors: error on failure rows from custom spiders
Example item (illustrative — real items only include keys that have values):
{
  "url": "https://news.ycombinator.com/",
  "status": 200,
  "title": "Hacker News",
  "contentType": "text/html; charset=utf-8",
  "firstTitle": "VoidZero Is Joining Cloudflare",
  "firstTitleUrl": "https://blog.cloudflare.com/voidzero-joins-cloudflare/",
  "links": ["https://news.ycombinator.com/item?id=123", "..."]
}
Custom spiders may yield any JSON-serializable shape; the pipeline compacts every item before push.
Add your own spiders
Place spiders under:
scrapy_cloud_runner/spiders/
Example:
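import scrapy


class MySpider(scrapy.Spider):
    # This name is the value you pass as spiderName in the Actor input.
    name = "my_spider"
    # Placeholder start page; replace with the pages you want to crawl.
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Yield one item per page; empty values are dropped by the pipeline.
        yield {"url": response.url, "title": response.css("title::text").get()}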
Run with:
{ "spiderName": "my_spider" }
Use Cases
- Web scraping automation — run Scrapy crawlers for sites, marketplaces, or directories
- Scheduled crawling — daily or hourly jobs via Apify schedules
- Data pipelines — feed scraped data into analytics tools or internal systems
- E-commerce monitoring — track listings, prices, and availability
- Data engineering — integrate Scrapy into larger processing workflows
Integrations & API
- Run and fetch results through the Apify API
- Use Python, Node.js, or HTTP clients against run and dataset endpoints
- Connect Zapier, Make, Google Sheets, and other Apify integrations
- Webhooks and schedules for recurring runs
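As a rough sketch of calling the Actor from your own code, the snippet below prepares a request to the Apify API endpoint that runs an Actor synchronously and returns its dataset items. It uses only the standard library. The Actor ID and token are placeholders you must supply, and the helper name is hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict

API_BASE = "https://api.apify.com/v2"


def build_run_request(
    actor_id: str, token: str, actor_input: Dict[str, Any]
) -> urllib.request.Request:
    """Prepare a POST request that runs an Actor and returns dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_and_fetch_items(
    actor_id: str, token: str, actor_input: Dict[str, Any]
) -> Any:
    """Send the request and decode the JSON response (performs network I/O)."""
    request = build_run_request(actor_id, token, actor_input)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The synchronous endpoint waits for the run to finish, so it suits short crawls. For long-running spiders, starting the run asynchronously and reading the dataset afterwards is likely the better fit.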
FAQ
How does Scrapy Cloud Runner work?
The actor reads input, builds a scrapy crawl <spiderName> command with your args and settings, runs it as a subprocess, and attaches ApifyDatasetPipeline so yielded items are compacted and pushed to the default dataset.
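The buffering controlled by pushItemsBatchSize can be pictured with the sketch below. It is an illustration under assumptions, not the actual ApifyDatasetPipeline: the class name and the push callback are hypothetical stand-ins for whatever the pipeline uses to write to the dataset.

```python
from typing import Any, Callable, Dict, List

Item = Dict[str, Any]


class BatchBuffer:
    """Collect items and hand them to `push` in batches of `batch_size`."""

    def __init__(self, push: Callable[[List[Item]], None], batch_size: int = 50):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._push = push
        self._batch_size = batch_size
        self._items: List[Item] = []

    def add(self, item: Item) -> None:
        """Buffer one item and flush once the batch is full."""
        self._items.append(item)
        if len(self._items) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Push whatever is buffered; call this when the spider closes."""
        if self._items:
            batch, self._items = self._items, []
            self._push(batch)


if __name__ == "__main__":
    pushed: List[List[Item]] = []
    buffer = BatchBuffer(pushed.append, batch_size=2)
    for n in range(3):
        buffer.add({"n": n})
    buffer.flush()
    print(pushed)
```

Larger batches mean fewer dataset writes, while smaller batches make items visible in the dataset sooner.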
Do I need my own server?
No. Scrapy runs inside the Apify Actor container.
Can I pass arguments to my spider?
Yes. Use spiderArgs — each key becomes -a key=value on the Scrapy CLI.
Can I override Scrapy settings?
Yes. Use scrapySettings — each key becomes -s KEY=value.
Why are some fields missing from output?
That is intentional. The pipeline removes null, empty strings, empty lists, and empty objects so dataset rows stay clean. If a CSS selector finds nothing (e.g. no <h1> on the page), that key is omitted.
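The compaction rule described above can be sketched as a small function. This is an assumed reading of the rule, not the pipeline's source: it handles top-level keys only, and it keeps 0 and false because they are real values rather than empty ones. How the actual pipeline treats nested values or whitespace-only strings is not stated here.

```python
from typing import Any, Dict


def _is_empty(value: Any) -> bool:
    """True for None, empty strings, empty lists, and empty dicts."""
    if value is None:
        return True
    return isinstance(value, (str, list, dict)) and len(value) == 0


def compact_item(item: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `item` without empty top-level values."""
    return {key: value for key, value in item.items() if not _is_empty(value)}


if __name__ == "__main__":
    raw = {
        "url": "https://example.com",
        "status": 200,
        "h1": None,
        "metaDescription": "",
        "links": [],
        "headers": {},
    }
    print(compact_item(raw))
```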
What formats can I export?
JSON, CSV, and Excel from the Apify dataset UI, plus full access via the Apify API.
Limitations
- One spider runs per Actor invocation (use schedules or API for multiple parallel runs).
- Output shape depends on your spider implementation; the dataset schema documents the example spider fields.
- Heavy crawls may require more Actor memory on Apify and tuned concurrency settings via scrapySettings.
Get Started
Set spiderName to example (or your own spider), adjust spiderArgs, and start your first run on Apify.