Scrapy Cloud Runner

Pricing

from $3.00 / 1,000 scraped items

Scrapy Cloud Runner runs Scrapy spiders on Apify with runtime arguments, custom settings, schedules, webhooks, and automatic dataset export. It supports custom spiders, compact JSON output, and JSON/CSV/Excel dataset downloads.

Developer

Sovanza

Maintained by Community

Scrapy Cloud Runner – Run & Schedule Scrapy Spiders on Apify

Run your Scrapy spiders on the Apify platform without managing servers, deployment pipelines, or dataset wiring. The Actor runs your Python crawlers in a managed environment, pushes compact structured items to the default dataset, and integrates with schedules, webhooks, and the Apify API.

Overview

Scrapy Cloud Runner is a thin Apify wrapper around your Scrapy project. You choose a spider by name, pass runtime arguments and settings, and the Actor runs scrapy crawl with an Apify dataset pipeline attached.

Output is compact: empty or missing fields are omitted so each row contains only what your spider extracted for that page.

Key benefits

  • Run Scrapy spiders on demand or on a schedule
  • Pass spider arguments and Scrapy settings without rebuilding the Actor
  • Automatic dataset integration with JSON, CSV, and Excel export
  • Compact JSON — no noise from null or empty fields
  • Add custom spiders under scrapy_cloud_runner/spiders/ and run them by name

Core features

  • Execute any registered Scrapy spider via spiderName
  • Pass keyword args through spiderArgs (scrapy crawl ... -a key=value)
  • Override Scrapy settings through scrapySettings (scrapy crawl ... -s KEY=value)
  • Configurable dataset push batch size (pushItemsBatchSize)
  • Included example spider for quick smoke tests (title, meta, links, custom CSS selectors)
  • Built-in ApifyDatasetPipeline compacts items before push
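
The batching behaviour behind pushItemsBatchSize can be pictured with a small item pipeline. This is a minimal sketch, not the Actor's actual ApifyDatasetPipeline: the class name BatchingPipeline and the injected push callable are hypothetical, and only the process_item and close_spider method names come from Scrapy's item pipeline interface.

```python
from typing import Any, Callable, Dict, List

Item = Dict[str, Any]


class BatchingPipeline:
    """Buffers items and hands them to `push` in batches (illustrative only)."""

    def __init__(self, push: Callable[[List[Item]], None], batch_size: int = 50) -> None:
        self._push = push              # hypothetical sink, e.g. a dataset client call
        self._batch_size = batch_size  # plays the role of pushItemsBatchSize
        self._buffer: List[Item] = []

    def process_item(self, item: Item, spider: Any = None) -> Item:
        # Scrapy calls process_item once for every item a spider yields.
        self._buffer.append(dict(item))
        if len(self._buffer) >= self._batch_size:
            self._flush()
        return item

    def close_spider(self, spider: Any = None) -> None:
        # Push whatever is left so a partial final batch is not lost.
        self._flush()

    def _flush(self) -> None:
        if self._buffer:
            self._push(self._buffer)
            self._buffer = []
```

A larger batch size means fewer push calls but more items held in memory between pushes, which is the trade-off the setting exposes.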

How to Use Scrapy Cloud Runner on Apify

Using the Actor

  1. Open the Actor on the Apify platform and go to the Input tab.
  2. Set spiderName to the spider you want to run (default: example).
  3. Configure spiderArgs and optional scrapySettings (see below).
  4. Start the run. The Actor launches Scrapy, streams logs, and pushes compact items to the default dataset.
  5. Open the Dataset tab to browse, download JSON/CSV/Excel, or pull data via the Apify API.
  6. Schedule or integrate (optional): use schedules, webhooks, Zapier/Make, or your own code against the Apify API.

Input Configuration

Full schema: INPUT_SCHEMA.json. Example:

{
  "spiderName": "example",
  "spiderArgs": {
    "start_url": "https://news.ycombinator.com/",
    "include_links": true,
    "include_headers": false,
    "selectors": "{\"firstTitle\":\".titleline > a::text\",\"firstTitleUrl\":\".titleline > a::attr(href)\"}"
  },
  "scrapySettings": {
    "DOWNLOAD_DELAY": 0.5,
    "CONCURRENT_REQUESTS": 16
  },
  "logLevel": "INFO",
  "pushItemsBatchSize": 50
}

  • spiderName (required): Spider name to run (must exist in scrapy_cloud_runner/spiders/).
  • spiderArgs (optional): Keyword arguments passed to the spider. For the built-in example spider:
    • start_url — page to scrape (default https://example.com)
    • selectors — JSON object mapping field names to CSS selectors (e.g. Hacker News first story title)
    • include_links — include absolute links array (default true)
    • include_headers — include response headers object (default false)
  • scrapySettings (optional): Scrapy settings overrides at runtime.
  • logLevel (optional): CRITICAL, ERROR, WARNING, INFO, or DEBUG (default INFO).
  • pushItemsBatchSize (optional): Buffer size before pushing to the dataset (default 50, max 1000).
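
The defaults and limits listed above can be expressed as a small normalization step. The following is a sketch under stated assumptions rather than the Actor's real input handling: normalize_input is a hypothetical helper, and the lower bound of 1 for pushItemsBatchSize is assumed, since only the maximum of 1000 is documented.

```python
from typing import Any, Dict

LOG_LEVELS = {"CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"}


def normalize_input(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the documented defaults and limits to a raw input dict."""
    spider_name = raw.get("spiderName")
    if not isinstance(spider_name, str) or not spider_name.strip():
        raise ValueError("spiderName is required")

    log_level = str(raw.get("logLevel", "INFO")).upper()
    if log_level not in LOG_LEVELS:
        raise ValueError(f"logLevel must be one of {sorted(LOG_LEVELS)}")

    batch_size = int(raw.get("pushItemsBatchSize", 50))
    # Assumption: the minimum is 1; only the maximum of 1000 is documented.
    if not 1 <= batch_size <= 1000:
        raise ValueError("pushItemsBatchSize must be between 1 and 1000")

    return {
        "spiderName": spider_name.strip(),
        "spiderArgs": dict(raw.get("spiderArgs") or {}),
        "scrapySettings": dict(raw.get("scrapySettings") or {}),
        "logLevel": log_level,
        "pushItemsBatchSize": batch_size,
    }
```

Calling normalize_input({"spiderName": "example"}) fills in logLevel "INFO" and pushItemsBatchSize 50, while an out-of-range batch size raises ValueError before any crawl starts.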

Run locally

INPUT.json is gitignored, so create it from INPUT.example.json. Set APIFY_TOKEN and APIFY_DEFAULT_DATASET_ID in your environment (not needed when running inside Apify), then run:

cd scrapy-cloud-runner
pip install -r requirements.txt
cp INPUT.example.json INPUT.json
python main.py

Output

Results are stored in the Actor’s default dataset. Each item is a compact JSON object: fields that are empty or unknown are not included.

Typical fields from the included example spider (when data is available):

  • Page identity: url, status, contentType
  • HTML metadata: title, h1, metaDescription
  • Custom selectors: any keys from spiderArgs.selectors (e.g. firstTitle, firstTitleUrl)
  • Optional: links (when include_links is true), headers (when include_headers is true)
  • Errors: error, present on rows that custom spiders emit for failed requests

Example item (illustrative — real items only include keys that have values):

{
  "url": "https://news.ycombinator.com/",
  "status": 200,
  "title": "Hacker News",
  "contentType": "text/html; charset=utf-8",
  "firstTitle": "VoidZero Is Joining Cloudflare",
  "firstTitleUrl": "https://blog.cloudflare.com/voidzero-joins-cloudflare/",
  "links": ["https://news.ycombinator.com/item?id=123", "..."]
}

Custom spiders may yield any JSON-serializable shape; the pipeline compacts every item before push.

Add your own spiders

Place spiders under:

scrapy_cloud_runner/spiders/

Example:

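import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"  # value to pass as spiderName
    start_urls = ["https://example.com"]  # placeholder; replace with your target

    def parse(self, response):
        # One item per page: its URL and <title> text.
        yield {"url": response.url, "title": response.css("title::text").get()}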

Run with:

{ "spiderName": "my_spider" }
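
Because spiderArgs are forwarded as -a key=value, a custom spider should expect its arguments to arrive as text. A spider that takes a boolean flag or a JSON selectors string, like the example spider does, might decode them with helpers along these lines. Both function names are hypothetical, and the set of accepted "true" spellings is an assumption.

```python
import json
from typing import Any, Dict


def parse_bool_arg(value: Any, default: bool) -> bool:
    """Interpret a spider argument that may arrive as a bool or as text."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    # Assumption: these spellings count as true; everything else is false.
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


def parse_selectors_arg(value: Any) -> Dict[str, str]:
    """Accept a JSON string or an already-decoded mapping of field -> CSS selector."""
    if not value:
        return {}
    decoded = json.loads(value) if isinstance(value, str) else value
    if not isinstance(decoded, dict):
        raise ValueError("selectors must be a JSON object")
    return {str(field): str(css) for field, css in decoded.items()}
```

Inside a spider's __init__, the results can be stored on self and used in parse, for example by calling response.css(css).get() for each field in the decoded selectors mapping.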

Use Cases

  • Web scraping automation — run Scrapy crawlers for sites, marketplaces, or directories
  • Scheduled crawling — daily or hourly jobs via Apify schedules
  • Data pipelines — feed scraped data into analytics tools or internal systems
  • E-commerce monitoring — track listings, prices, and availability
  • Data engineering — integrate Scrapy into larger processing workflows

Integrations & API

  • Run and fetch results through the Apify API
  • Use Python, Node.js, or HTTP clients against run and dataset endpoints
  • Connect Zapier, Make, Google Sheets, and other Apify integrations
  • Webhooks and schedules for recurring runs
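
As a rough illustration of the HTTP route, the sketch below builds the two requests involved: one that starts a run and one that downloads dataset items. It uses only the Python standard library and the public Apify API v2 paths; the helper names are hypothetical, and the Actor ID and token are values you supply. The functions only construct the requests, they do not send them.

```python
import json
import urllib.request
from typing import Any, Dict

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts an Actor run."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def build_items_request(dataset_id: str, token: str, fmt: str = "json") -> urllib.request.Request:
    """Build the GET request that downloads dataset items in the given format."""
    return urllib.request.Request(
        url=f"{API_BASE}/datasets/{dataset_id}/items?format={fmt}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

Either request can be sent with urllib.request.urlopen. The run response is a JSON document whose data.defaultDatasetId field identifies the dataset to read once the run has finished.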

FAQ

How does Scrapy Cloud Runner work?

The Actor reads the input, builds a scrapy crawl <spiderName> command from your arguments and settings, runs it as a subprocess, and attaches ApifyDatasetPipeline so that yielded items are compacted and pushed to the default dataset.
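
The mapping from input to command line can be sketched as follows. This is an illustration of the described behaviour, not the Actor's source: build_scrapy_command is a hypothetical name, and how the real implementation formats non-string values is not documented, so plain string formatting is assumed here.

```python
from typing import Any, Dict, List


def build_scrapy_command(actor_input: Dict[str, Any]) -> List[str]:
    """Translate Actor-style input into a `scrapy crawl` argument list."""
    cmd = ["scrapy", "crawl", actor_input["spiderName"]]
    for key, value in (actor_input.get("spiderArgs") or {}).items():
        cmd += ["-a", f"{key}={value}"]  # spider arguments
    for key, value in (actor_input.get("scrapySettings") or {}).items():
        cmd += ["-s", f"{key}={value}"]  # settings overrides
    return cmd
```

Returning a list rather than a single string lets the command go straight to subprocess.run(cmd, check=True) without shell quoting, which matters for values such as URLs or JSON selector strings.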

Do I need my own server?

No. Scrapy runs inside the Apify Actor container.

Can I pass arguments to my spider?

Yes. Use spiderArgs — each key becomes -a key=value on the Scrapy CLI.

Can I override Scrapy settings?

Yes. Use scrapySettings — each key becomes -s KEY=value.

Why are some fields missing from output?

That is intentional. The pipeline removes null values, empty strings, empty lists, and empty objects so dataset rows stay clean. If a CSS selector finds nothing (e.g. no <h1> on the page), that key is omitted.
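
The compaction rule can be approximated in a few lines. This sketch handles top-level keys only; whether the real pipeline also compacts nested values is not stated, so that part is left out. The function names are hypothetical.

```python
from typing import Any, Dict


def is_empty(value: Any) -> bool:
    """True for None, "", [] and {}; values such as 0 and False are kept."""
    return value is None or (isinstance(value, (str, list, dict)) and len(value) == 0)


def compact_item(item: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `item` without keys whose values are empty."""
    return {key: value for key, value in item.items() if not is_empty(value)}
```

Note that legitimate falsy values such as a status of 0 or a flag set to False survive, because only None and zero-length strings, lists, and dicts are treated as empty.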

What formats can I export?

JSON, CSV, and Excel from the Apify dataset UI, plus full access via the Apify API.

Limitations

  • One spider runs per Actor invocation (use schedules or the API to start multiple runs in parallel).
  • Output shape depends on your spider implementation; the dataset schema documents only the example spider's fields.
  • Heavy crawls may need more memory for the Actor run and concurrency settings tuned through scrapySettings.

Get Started

Set spiderName to example (or your own spider), adjust spiderArgs, and start your first run on Apify.