Pricing

from $3.00 / 1,000 scraped items

Scrapy Cloud Runner

Scrapy Cloud Runner runs Scrapy spiders on Apify with runtime arguments, custom settings, schedules, webhooks, and automatic dataset export. It supports custom spiders, compact JSON output, and JSON/CSV/Excel dataset downloads.

Rating: 5.0 (1)

Developer: Sovanza

Last modified: a month ago

Scrapy Cloud Runner – Run & Schedule Scrapy Spiders on Apify

Run your Scrapy spiders on the Apify platform without managing servers, deployment pipelines, or dataset wiring. This actor executes Python crawlers in a managed environment, pushes compact structured items to the default dataset, and integrates with schedules, webhooks, and the Apify API.

Overview

Scrapy Cloud Runner is a thin Apify wrapper around your Scrapy project. You choose a spider by name, pass runtime arguments and settings, and the Actor runs scrapy crawl with an Apify dataset pipeline attached.

Output is compact: empty or missing fields are omitted so each row contains only what your spider extracted for that page.

Key benefits

Run Scrapy spiders on demand or on a schedule
Pass spider arguments and Scrapy settings without rebuilding the Actor
Automatic dataset integration with JSON, CSV, and Excel export
Compact JSON — no noise from null or empty fields
Add custom spiders under scrapy_cloud_runner/spiders/ and run them by name

Core features

Execute any registered Scrapy spider via spiderName
Pass keyword args through spiderArgs (scrapy crawl ... -a key=value)
Override Scrapy settings through scrapySettings (scrapy crawl ... -s KEY=value)
Configurable dataset push batch size (pushItemsBatchSize)
Included example spider for quick smoke tests (title, meta, links, custom CSS selectors)
Built-in ApifyDatasetPipeline compacts items before push
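
The pushItemsBatchSize option implies that items are buffered and pushed to the dataset in groups. The sketch below shows that general pattern only. BatchBuffer and push_batch are hypothetical names used for illustration, and the real ApifyDatasetPipeline may be implemented differently.

```python
from typing import Any, Callable, Dict, List


class BatchBuffer:
    """Collect items and hand them to push_batch in groups of batch_size."""

    def __init__(self, push_batch: Callable[[List[Dict[str, Any]]], None],
                 batch_size: int = 50) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        # push_batch is a hypothetical stand-in for the real dataset push call.
        self._push_batch = push_batch
        self._batch_size = batch_size
        self._items: List[Dict[str, Any]] = []

    def add(self, item: Dict[str, Any]) -> None:
        """Buffer one item and flush once the buffer reaches batch_size."""
        self._items.append(item)
        if len(self._items) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Push whatever is buffered; call this once more when the spider closes."""
        if self._items:
            self._push_batch(self._items)
            self._items = []
```

A larger batch size means fewer push calls at the cost of holding more items in memory, and a final flush is needed so the last partial batch is not lost.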

How to Use Scrapy Cloud Runner on Apify

Using the Actor

Open the Actor on the Apify platform and go to the Input tab.
Set spiderName to the spider you want to run (default: example).
Configure spiderArgs and optional scrapySettings (see below).
Start the run. The Actor launches Scrapy, streams logs, and pushes compact items to the default dataset.
Open the Dataset tab to browse, download JSON/CSV/Excel, or pull data via the Apify API.
Schedule or integrate (optional): use schedules, webhooks, Zapier/Make, or your own code against the Apify API.

Input Configuration

Full schema: INPUT_SCHEMA.json. Example:

{
  "spiderName": "example",
  "spiderArgs": {
    "start_url": "https://news.ycombinator.com/",
    "include_links": true,
    "include_headers": false,
    "selectors": "{\"firstTitle\":\".titleline > a::text\",\"firstTitleUrl\":\".titleline > a::attr(href)\"}"
  },
  "scrapySettings": {
    "DOWNLOAD_DELAY": 0.5,
    "CONCURRENT_REQUESTS": 16
  },
  "logLevel": "INFO",
  "pushItemsBatchSize": 50
}

spiderName (required): Spider name to run (must exist in scrapy_cloud_runner/spiders/).
spiderArgs (optional): Keyword arguments passed to the spider. For the built-in example spider:
- start_url: page to scrape (default https://example.com)
- selectors: JSON-encoded string that maps output field names to CSS selectors (the example above extracts the first Hacker News story title and its URL)
- include_links: include an array of absolute links (default true)
- include_headers: include an object of response headers (default false)
scrapySettings (optional): Scrapy settings overrides at runtime.
logLevel (optional): CRITICAL, ERROR, WARNING, INFO, or DEBUG (default INFO).
pushItemsBatchSize (optional): Buffer size before pushing to the dataset (default 50, max 1000).
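
When the input is built from code, the selectors value has to be serialized to a JSON string, and the documented limits can be checked before a run starts. The helper below is a sketch under assumptions: build_run_input is a hypothetical name, and the lower bound of 1 for the batch size is assumed, since only the default (50) and the maximum (1000) are documented.

```python
import json
from typing import Any, Dict, Optional

LOG_LEVELS = {"CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"}
MAX_BATCH_SIZE = 1000  # documented maximum for pushItemsBatchSize


def build_run_input(spider_name: str,
                    start_url: str,
                    selectors: Optional[Dict[str, str]] = None,
                    log_level: str = "INFO",
                    push_items_batch_size: int = 50) -> Dict[str, Any]:
    """Build an input object for the example spider and validate basic limits."""
    if not spider_name:
        raise ValueError("spiderName is required")
    if log_level not in LOG_LEVELS:
        raise ValueError(f"logLevel must be one of {sorted(LOG_LEVELS)}")
    # The minimum of 1 is an assumption; the README only states the maximum.
    if not 1 <= push_items_batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"pushItemsBatchSize must be between 1 and {MAX_BATCH_SIZE}")

    spider_args: Dict[str, Any] = {"start_url": start_url}
    if selectors:
        # The example input passes selectors as a JSON string, not a nested object.
        spider_args["selectors"] = json.dumps(selectors)

    return {
        "spiderName": spider_name,
        "spiderArgs": spider_args,
        "logLevel": log_level,
        "pushItemsBatchSize": push_items_batch_size,
    }
```

For example, build_run_input("example", "https://news.ycombinator.com/", {"firstTitle": ".titleline > a::text"}) returns a dictionary that can be saved as INPUT.json or sent as run input.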

Run locally

INPUT.json is gitignored, so create it from INPUT.example.json. Unless you run inside Apify, set APIFY_TOKEN and APIFY_DEFAULT_DATASET_ID in your environment first, then:

cd scrapy-cloud-runner
pip install -r requirements.txt
cp INPUT.example.json INPUT.json
python main.py

Output

Results are stored in the Actor’s default dataset. Each item is a compact JSON object: fields that are empty or unknown are not included.

Typical fields from the included example spider (when data is available):

Page identity: url, status, contentType
HTML metadata: title, h1, metaDescription
Custom selectors: any keys from spiderArgs.selectors (e.g. firstTitle, firstTitleUrl)
Optional: links (when include_links is true), headers (when include_headers is true)
Errors: an error field on failure rows yielded by custom spiders

Example item (illustrative — real items only include keys that have values):

{
  "url": "https://news.ycombinator.com/",
  "status": 200,
  "title": "Hacker News",
  "contentType": "text/html; charset=utf-8",
  "firstTitle": "VoidZero Is Joining Cloudflare",
  "firstTitleUrl": "https://blog.cloudflare.com/voidzero-joins-cloudflare/",
  "links": ["https://news.ycombinator.com/item?id=123", "..."]
}

Custom spiders may yield any JSON-serializable shape; the pipeline compacts every item before push.

Add your own spiders

Place spiders under:

scrapy_cloud_runner/spiders/

Example:

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}

Run with:

{ "spiderName": "my_spider" }
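
Arguments passed with scrapy crawl -a reach a spider as strings, so a custom spider that expects booleans or a selector mapping usually has to parse them itself. The helpers below are a sketch: parse_bool_arg and parse_selectors_arg are hypothetical names, and the set of strings treated as true is an assumption rather than the Actor's documented behavior.

```python
import json
from typing import Any, Dict


def parse_bool_arg(value: Any, default: bool) -> bool:
    """Interpret a spider argument such as "true" or "False" as a boolean."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    # Assumed set of truthy spellings; anything else counts as False.
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


def parse_selectors_arg(value: Any) -> Dict[str, str]:
    """Decode a JSON-encoded selectors argument into a field -> CSS selector map."""
    if not value:
        return {}
    parsed = json.loads(value) if isinstance(value, str) else value
    if not isinstance(parsed, dict):
        raise ValueError("selectors must decode to a JSON object")
    return {str(field): str(css) for field, css in parsed.items()}
```

Scrapy's default Spider constructor stores each -a argument as an attribute, so inside a spider these helpers could be called as parse_bool_arg(getattr(self, "include_links", None), True).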

Use Cases

Web scraping automation — run Scrapy crawlers for sites, marketplaces, or directories
Scheduled crawling — daily or hourly jobs via Apify schedules
Data pipelines — feed scraped data into analytics tools or internal systems
E-commerce monitoring — track listings, prices, and availability
Data engineering — integrate Scrapy into larger processing workflows

Integrations & API

Run and fetch results through the Apify API
Use Python, Node.js, or HTTP clients against run and dataset endpoints
Connect Zapier, Make, Google Sheets, and other Apify integrations
Webhooks and schedules for recurring runs
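
As an illustration of the API route, the sketch below starts a run and reads its default dataset with the apify-client Python package. The "<actor-id>" placeholder and the run_spider_and_fetch helper are assumptions for this example; substitute the Actor ID shown on the Actor's API tab.

```python
from typing import Any, Dict, List


def run_spider_and_fetch(client: Any, actor_id: str,
                         run_input: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Start the Actor, wait for it to finish, and return its dataset items."""
    # call() starts the run and blocks until the run finishes.
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return run details")
    dataset_id = run["defaultDatasetId"]
    return list(client.dataset(dataset_id).iterate_items())


def main() -> None:
    # Imported here so the helper above can be reused without the package installed.
    import os
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    items = run_spider_and_fetch(
        client,
        "<actor-id>",  # placeholder, not a real Actor ID
        {"spiderName": "example", "spiderArgs": {"start_url": "https://example.com"}},
    )
    for item in items:
        print(item)
```

Calling main() requires pip install apify-client and an APIFY_TOKEN environment variable. The same two steps (start a run, then read the default dataset) apply to any HTTP client used against the Apify API.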

FAQ

How does Scrapy Cloud Runner work?

The actor reads input, builds a scrapy crawl <spiderName> command with your args and settings, runs it as a subprocess, and attaches ApifyDatasetPipeline so yielded items are compacted and pushed to the default dataset.
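
The command-building step can be pictured roughly as follows. This is a sketch, not the Actor's source: build_crawl_command is a hypothetical name, and rendering non-string values with json.dumps (so true becomes "true") is an assumption about how values are serialized.

```python
import json
from typing import Any, Dict, List


def _cli_value(value: Any) -> str:
    """Render an input value for the command line (assumed serialization)."""
    if isinstance(value, str):
        return value
    return json.dumps(value)


def build_crawl_command(actor_input: Dict[str, Any]) -> List[str]:
    """Translate Actor input into a scrapy crawl argument list."""
    command = ["scrapy", "crawl", actor_input["spiderName"]]
    # Each spiderArgs entry becomes "-a key=value".
    for key, value in (actor_input.get("spiderArgs") or {}).items():
        command += ["-a", f"{key}={_cli_value(value)}"]
    # Each scrapySettings entry becomes "-s KEY=value".
    for key, value in (actor_input.get("scrapySettings") or {}).items():
        command += ["-s", f"{key}={_cli_value(value)}"]
    return command
```

Passing the result to subprocess.run(command, check=True) as a list, rather than as one shell string, avoids quoting problems with URLs and JSON-encoded selectors.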

Do I need my own server?

No. Scrapy runs inside the Apify Actor container.

Can I pass arguments to my spider?

Yes. Use spiderArgs — each key becomes -a key=value on the Scrapy CLI.

Can I override Scrapy settings?

Yes. Use scrapySettings — each key becomes -s KEY=value.

Why are some fields missing from output?

That is intentional. The pipeline removes null, empty strings, empty lists, and empty objects so dataset rows stay clean. If a CSS selector finds nothing (e.g. no <h1> on the page), that key is omitted.
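
The compaction rule described above can be sketched in a few lines. compact_item is a hypothetical name, and this version only inspects top-level keys; whether the real pipeline also compacts nested objects is not stated here.

```python
from typing import Any, Dict


def is_empty(value: Any) -> bool:
    """True for null, empty string, empty list, and empty object."""
    return value is None or value == "" or value == [] or value == {}


def compact_item(item: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of item without keys whose values are null or empty."""
    return {key: value for key, value in item.items() if not is_empty(value)}
```

Values such as 0 and false are kept in this sketch because they are not empty, so a numeric or boolean field is not dropped just for being falsy.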

What formats can I export?

JSON, CSV, and Excel from the Apify dataset UI, plus full access via the Apify API.

SEO Keywords

scrapy cloud runner
run scrapy on apify
scrapy apify actor
scrapy spider cloud
scrapy crawler automation
scheduled scrapy crawler
scrapy web scraping
python scrapy cloud
scrapy dataset export
scrapy crawl api
apify scrapy integration
scrapy spider scheduler
cloud web crawler scrapy
scrapy data pipeline
scrapy spider hosting

Limitations

One spider runs per Actor invocation (use schedules or API for multiple parallel runs).
Output shape depends on your spider implementation; the dataset schema documents the example spider fields.
Heavy crawls may need more memory allocated to the Actor run, plus tuned concurrency through scrapySettings (for example CONCURRENT_REQUESTS).

Get Started

Set spiderName to example (or your own spider), adjust spiderArgs, and start your first run on Apify.
