Pricing

from $2.00 / 1,000 results

Scrapy Cloud Runner

Run Scrapy spiders on Apify with request queue, dataset export, proxy rotation, scheduling, and cloud-ready deployment.

Rating

0.0

(0)

Developer

Solutions Smart

Last modified: 2 days ago

What this actor does

Scrapy Cloud Runner is a Python Apify Actor that executes Scrapy spiders bundled with the actor codebase. It uses the official Apify Python SDK and Scrapy integration so you can run, schedule, and monitor Scrapy crawls in the Apify Console or through the API.

The actor:

runs a selected Scrapy spider by spiderName
reads input from the Apify input form or API
pushes scraped items to the default dataset
stores a crawl summary in the OUTPUT key-value store record
supports Apify proxy configuration
exposes crawl controls for limits, retries, delays, cache, and robots.txt

Included spider

The actor includes one bundled example spider:

page_meta: crawls pages, extracts basic page metadata, and optionally follows links

The example spider is designed to be a solid starting point, not a universal website crawler. By default it stays on the same hostname as the start URLs to avoid drifting into subdomains with different blocking or rate-limit behavior.
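The same-hostname rule can be illustrated with a minimal sketch. This is not the actor's own code; the helper names `allowed_hostnames` and `is_same_hostname` are hypothetical, and the sketch assumes "same hostname" means an exact match on the hostname of a start URL.

```python
from urllib.parse import urlsplit


def allowed_hostnames(start_urls):
    """Collect the exact hostnames of the start URLs (urlsplit lowercases them)."""
    hosts = set()
    for url in start_urls:
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return hosts


def is_same_hostname(link, hostnames):
    """Return True only when the link's hostname exactly matches a start hostname."""
    host = urlsplit(link).hostname
    return host is not None and host in hostnames


hosts = allowed_hostnames(["https://apify.com"])
print(is_same_hostname("https://apify.com/store", hosts))     # True
print(is_same_hostname("https://docs.apify.com/sdk", hosts))  # False
```

Under this rule a subdomain such as docs.apify.com counts as a different host, which matches the stated goal of not drifting into subdomains.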

Why use it on Apify

Running Scrapy on Apify gives you:

scheduled runs
API-triggered runs
centralized logs
dataset export
proxy integration
managed cloud execution

You keep the Scrapy spider model, but you do not need to manage servers, deployment plumbing, or result storage yourself.

Quick start

Open the actor in Apify Console.
Set spiderName to page_meta or to your own bundled spider.
Add one or more startUrls.
Keep the default limits for the first run.
Run the actor.
Review results in the Dataset tab and the summary in the OUTPUT record.

Example input

{
  "spiderName": "page_meta",
  "startUrls": [
    { "url": "https://apify.com" }
  ],
  "followLinks": true,
  "sameHostnameOnly": true,
  "includeHtml": false,
  "maxRequestsPerCrawl": 20,
  "maxDepth": 1,
  "maxConcurrency": 16,
  "requestTimeoutSecs": 30,
  "downloadDelaySecs": 1,
  "retryTimes": 2,
  "useAutoThrottle": true,
  "autoThrottleTargetConcurrency": 1,
  "autoThrottleStartDelaySecs": 1,
  "autoThrottleMaxDelaySecs": 15,
  "respectRobotsTxt": true,
  "useHttpCache": true,
  "httpCacheExpirationSecs": 7200,
  "spiderArgs": [
    { "key": "category", "value": "books" }
  ]
}
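
An input like the one above can also be sent through the API. The sketch below uses the `apify-client` Python package and is only an illustration: the environment variable names are assumptions, the actor ID is not given in this document and has to be copied from the actor's page, and `build_run_input` is a hypothetical helper that covers only a subset of the input fields.

```python
import os


def build_run_input(start_urls, spider_name="page_meta", max_requests=20):
    """Build a small input object using field names from the example input."""
    return {
        "spiderName": spider_name,
        "startUrls": [{"url": url} for url in start_urls],
        "followLinks": True,
        "sameHostnameOnly": True,
        "maxRequestsPerCrawl": max_requests,
        "maxDepth": 1,
        "respectRobotsTxt": True,
    }


def run_actor(actor_id, run_input, token):
    """Start a run, wait for it to finish, and return (items, summary)."""
    # Imported lazily so build_run_input() works without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
    record = client.key_value_store(run["defaultKeyValueStoreId"]).get_record("OUTPUT")
    summary = record["value"] if record else None
    return items, summary


# Both variable names are assumptions; nothing runs unless they are set.
token = os.environ.get("APIFY_TOKEN")
actor_id = os.environ.get("SCRAPY_RUNNER_ACTOR_ID")
if token and actor_id:
    items, summary = run_actor(actor_id, build_run_input(["https://apify.com"]), token)
    print(len(items), summary)
```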

Input settings

Input	Type	Description
`spiderName`	string	Name of the bundled Scrapy spider to run.
`startUrls`	array	Starting URLs for the crawl.
`allowedDomains`	array	Optional domain allowlist for Scrapy offsite filtering.
`followLinks`	boolean	Follow links discovered on crawled pages.
`sameHostnameOnly`	boolean	Restrict followed links to the exact hostnames from `startUrls`. Recommended for focused crawls.
`includeHtml`	boolean	Include raw HTML in dataset items.
`maxRequestsPerCrawl`	integer	Maximum number of scraped pages/items emitted by the bundled spider.
`maxDepth`	integer	Maximum follow depth from the initial pages.
`maxConcurrency`	integer	Maximum concurrent Scrapy requests.
`requestTimeoutSecs`	integer	Download timeout per request.
`downloadDelaySecs`	number	Base delay between requests to the same site.
`retryTimes`	integer	Retry count for retryable failures.
`useAutoThrottle`	boolean	Enable Scrapy AutoThrottle.
`autoThrottleTargetConcurrency`	number	Target average concurrency per remote site.
`autoThrottleStartDelaySecs`	number	Initial AutoThrottle delay.
`autoThrottleMaxDelaySecs`	number	Maximum AutoThrottle delay.
`respectRobotsTxt`	boolean	Respect robots.txt.
`useHttpCache`	boolean	Enable HTTP cache.
`httpCacheExpirationSecs`	integer	Cache expiration time in seconds.
`proxyConfiguration`	object	Apify proxy configuration.
`spiderArgs`	array	Spider arguments entered as schema-based key/value rows in Apify Console.
`spiderArgsJson`	object	Structured spider arguments for API callers. On duplicate keys, these values override those from `spiderArgs`.
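
The precedence between the two spider-argument inputs can be sketched as follows. The function `merge_spider_args` is hypothetical and only illustrates the documented rule that `spiderArgsJson` wins on duplicate keys; the actor's real merge code is not shown in this document.

```python
def merge_spider_args(spider_args=None, spider_args_json=None):
    """Combine key/value rows with a JSON object; the JSON object wins on conflicts."""
    merged = {}
    for row in spider_args or []:
        key = row.get("key")
        if key:  # skip rows without a key
            merged[key] = row.get("value")
    merged.update(spider_args_json or {})
    return merged


args = merge_spider_args(
    [{"key": "category", "value": "books"}, {"key": "lang", "value": "en"}],
    {"category": "music"},
)
print(args)  # {'category': 'music', 'lang': 'en'}
```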

Output

The default dataset contains one item per scraped page. For the bundled page_meta spider, each item includes fields such as:

{
  "url": "https://apify.com",
  "status": 200,
  "title": "Apify: Full-stack web scraping and data extraction platform",
  "metaDescription": "Cloud platform for web scraping, browser automation, AI agents, and data for AI.",
  "canonicalUrl": "https://apify.com",
  "h1": "Get real-time web data for your AI",
  "contentType": "text/html; charset=utf-8",
  "depth": 0,
  "referrer": null,
  "textLength": 125419,
  "crawledAt": "2026-05-16T11:21:11.435924+00:00",
  "html": null
}

The actor also stores a summary record in OUTPUT:

{
  "availableSpiders": ["page_meta"],
  "finishedAt": "2026-05-16T11:21:15.000000+00:00",
  "itemCount": 5,
  "requestCount": 12,
  "spiderName": "page_meta",
  "startedAt": "2026-05-16T11:21:10.000000+00:00",
  "stats": {}
}
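
Once a run has finished, the summary record can be reduced to a few headline numbers. The sketch below is an illustration that assumes the field names shown in the example record above; `summarize_run` is a hypothetical helper, not part of the actor.

```python
from datetime import datetime


def summarize_run(summary):
    """Derive run duration and items per request from an OUTPUT summary record."""
    started = datetime.fromisoformat(summary["startedAt"])
    finished = datetime.fromisoformat(summary["finishedAt"])
    requests = summary["requestCount"]
    return {
        "spiderName": summary["spiderName"],
        "durationSecs": (finished - started).total_seconds(),
        "itemsPerRequest": summary["itemCount"] / requests if requests else 0.0,
    }


example = {
    "spiderName": "page_meta",
    "startedAt": "2026-05-16T11:21:10.000000+00:00",
    "finishedAt": "2026-05-16T11:21:15.000000+00:00",
    "itemCount": 5,
    "requestCount": 12,
}
print(summarize_run(example)["durationSecs"])  # 5.0
```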

Default crawl behavior

The bundled actor defaults are tuned for focused website crawls:

same-host following is enabled by default
AutoThrottle is enabled by default
HTTP cache uses RFC2616 policy
common blocked/error responses such as 403 and 429 are not cached
cookies are disabled
robots.txt is respected by default

These defaults favor politeness over speed: crawls run more slowly than Scrapy at high parallelism but are less likely to trigger blocks or rate limits.
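
For readers who think in Scrapy terms, the inputs and defaults above correspond roughly to standard Scrapy settings. The mapping below is an assumption about how such a translation could look, not the actor's actual code; the setting names are real Scrapy settings, while `build_scrapy_settings` and its fallback values (taken from the example input) are hypothetical.

```python
def build_scrapy_settings(actor_input):
    """Translate actor input fields into a dict of Scrapy settings (illustrative)."""
    return {
        "CONCURRENT_REQUESTS": actor_input.get("maxConcurrency", 16),
        "DOWNLOAD_TIMEOUT": actor_input.get("requestTimeoutSecs", 30),
        "DOWNLOAD_DELAY": actor_input.get("downloadDelaySecs", 1),
        "RETRY_TIMES": actor_input.get("retryTimes", 2),
        "DEPTH_LIMIT": actor_input.get("maxDepth", 1),
        "AUTOTHROTTLE_ENABLED": actor_input.get("useAutoThrottle", True),
        "AUTOTHROTTLE_TARGET_CONCURRENCY": actor_input.get("autoThrottleTargetConcurrency", 1),
        "AUTOTHROTTLE_START_DELAY": actor_input.get("autoThrottleStartDelaySecs", 1),
        "AUTOTHROTTLE_MAX_DELAY": actor_input.get("autoThrottleMaxDelaySecs", 15),
        "ROBOTSTXT_OBEY": actor_input.get("respectRobotsTxt", True),
        "HTTPCACHE_ENABLED": actor_input.get("useHttpCache", True),
        "HTTPCACHE_EXPIRATION_SECS": actor_input.get("httpCacheExpirationSecs", 7200),
        "HTTPCACHE_POLICY": "scrapy.extensions.httpcache.RFC2616Policy",
        "COOKIES_ENABLED": False,
        # How the actor keeps 403 and 429 responses out of the cache is not
        # stated in this document, so that part is left out of the sketch.
    }


settings = build_scrapy_settings({"maxDepth": 2, "useHttpCache": False})
print(settings["DEPTH_LIMIT"], settings["HTTPCACHE_ENABLED"])  # 2 False
```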

Add your own spiders

Add a spider module under src/spiders/.
Give the spider a unique Scrapy name.
Read any custom runtime options from spider kwargs or spiderArgsJson.
Deploy the updated actor.
Run the actor with spiderName set to your spider's name.

The actor uses Scrapy's spider loader, so bundled spiders are discovered automatically from src.spiders.

Practical guidance

Start with one or two startUrls.
Keep sameHostnameOnly enabled unless you intentionally want cross-subdomain crawling.
Use proxy configuration for websites with blocking or rate limiting.
Keep includeHtml off unless you need full source in the dataset.
For broad or multi-domain crawling, create a dedicated spider with different settings instead of using the bundled example as-is.

Legal and operational note

You are responsible for using this actor in compliance with the target site's terms, applicable law, and reasonable load limits. Keep respectRobotsTxt enabled unless you have a clear reason not to.
