Scrapy
This example Scrapy spider scrapes page titles from URLs defined in the input parameter. It shows how to use the Apify SDK for Python together with Scrapy pipelines to save results.
```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings

from apify import Actor

from ..spiders.title_spider import TitleSpider


def _get_scrapy_settings(max_depth: int) -> Settings:
    """Get Scrapy project settings."""
    settings = get_project_settings()
    # Add our Actor Push Pipeline with the lowest priority
    settings['ITEM_PIPELINES']['src.apify.pipelines.ActorDatasetPushPipeline'] = 1
    # Disable the default Retry Middleware
    settings['DOWNLOADER_MIDDLEWARES']['scrapy.downloadermiddlewares.retry.RetryMiddleware'] = None
    # Add our custom Retry Middleware with the top priority
    settings['DOWNLOADER_MIDDLEWARES']['src.apify.middlewares.ApifyRetryMiddleware'] = 999
    # Add our custom Scheduler
    settings['SCHEDULER'] = 'src.apify.scheduler.ApifyScheduler'
    settings['DEPTH_LIMIT'] = max_depth
    return settings


async def main():
    async with Actor:
        Actor.log.info('Actor is being executed...')

        # Process Actor input
        actor_input = await Actor.get_input() or {}
        max_depth = actor_input.get('max_depth', 1)
        start_urls = [start_url.get('url') for start_url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])]
        settings = _get_scrapy_settings(max_depth)

        # Add start URLs to the request queue
        rq = await Actor.open_request_queue()
        for url in start_urls:
            await rq.add_request({'url': url})

        # Currently, execution of only one Spider is supported
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(TitleSpider)
        process.start()
```
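The `ActorDatasetPushPipeline` registered in the settings above is not shown in this snippet. As a rough, hypothetical sketch (not the actual Apify implementation, which lives in `src/apify/pipelines.py`), a Scrapy item pipeline that pushes each item to the Actor's default dataset could look like this:

```python
# Hypothetical sketch of an item pipeline like ActorDatasetPushPipeline
# (assumed, not the actual Apify source). Scrapy accepts coroutine
# process_item methods, so the pipeline can await the Apify SDK directly.

class ActorDatasetPushPipeline:
    """Push each scraped item to the default dataset of the Actor run."""

    async def process_item(self, item, spider):
        # Imported lazily so the class itself carries no hard dependency
        from apify import Actor

        await Actor.push_data(dict(item))  # store the item in the dataset
        return item  # pass the item on to any later pipelines
```

Giving it priority `1` in `ITEM_PIPELINES` means it runs before any other pipelines, so every item reaches the dataset even if a later pipeline drops it.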
Scrapy template
A template example built with Scrapy to scrape page titles from URLs defined in the input parameter. It shows how to use the Apify SDK for Python together with Scrapy pipelines to save results.
Included features
- Apify SDK for Python - a toolkit for building Apify Actors and scrapers in Python
- Input schema - define and easily validate a schema for your Actor's input
- Dataset - store structured data where each object stored has the same attributes
- Scrapy - a fast high-level web scraping framework
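For reference, an Actor input matching the keys read in `main()` might look like this (the values shown are illustrative; they mirror the defaults in the code):

```json
{
    "start_urls": [{ "url": "https://apify.com" }],
    "max_depth": 1
}
```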
How it works
This code is a Python script that uses Scrapy to scrape web pages and extract data from them. Here's a brief overview of how it works:
- The script reads the input data from the Actor instance, which is expected to contain a `start_urls` key with a list of URLs to scrape and a `max_depth` key with the maximum depth of nested links to follow.
- The script then creates a Scrapy spider that scrapes the URLs and follows links up to the specified `max_depth`. This Spider (class `TitleSpider`) stores URLs and titles.
- A Scrapy pipeline is used to save the results to the default dataset associated with the Actor run, using the `push_data` method of the Actor instance.
- The script catches any exceptions that occur during the web scraping process and logs an error message using the `Actor.log.exception` method.
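The spider's own source is not included in this overview. In the real `TitleSpider`, extracting the title would typically be a one-liner such as `response.css('title::text').get()`; independently of Scrapy, the core extraction (grabbing the text of the page's `<title>` element) can be sketched with just the standard library:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text content of the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and self.title is None:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Accumulate, since the parser may deliver text in chunks
            self.title = (self.title or '') + data


def extract_title(html: str):
    """Return the page title, or None if the page has no <title>."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip() if parser.title else parser.title
```

This is only an illustration of the extraction step; the actual spider also yields follow-up requests for links found on the page, which Scrapy cuts off at `DEPTH_LIMIT`.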
Resources
- Web scraping with Scrapy
- Python tutorials in Academy
- Alternatives to Scrapy for web scraping in 2023
- Beautiful Soup vs. Scrapy for web scraping
- Integration with Zapier, Make, Google Drive, and others
- Video guide on getting scraped data using Apify API
A short guide on how to build web scrapers using code templates: web scraper template
Related templates
- Example of a web scraper that uses Python Requests to scrape HTML from a single URL provided on input, parses it with BeautifulSoup, and saves the results to storage.
- Crawler example that uses headless Chrome driven by Playwright to scrape a website. Headless browsers render JavaScript and can help when getting blocked.
- Scraper example built with Selenium and a headless Chrome browser to scrape a website and save the results to storage. A popular alternative to Playwright.
- Empty template with the basic structure for an Actor built with the Apify SDK, so you can easily add your own functionality.