Scrapy
This example Scrapy spider scrapes page titles from URLs defined in the input parameter. It shows how to use Apify SDK for Python and Scrapy pipelines to save results.
src/main.py
src/spiders/title.py
src/__main__.py
src/items.py
src/pipelines.py
src/settings.py
"""
This module defines the main coroutine for the Apify Scrapy Actor, executed from the __main__.py file. The coroutine
processes the Actor's input and executes the Scrapy spider. Additionally, it updates Scrapy project settings by
applying Apify-related settings, which includes adding a custom scheduler, retry middleware, and an item pipeline
for pushing data to the Apify dataset.

Customization:
--------------

Feel free to customize this file to add specific functionality to the Actor, such as incorporating your own Scrapy
components like spiders and handling Actor input. However, make sure you have a clear understanding of your
modifications. For instance, removing `apply_apify_settings` breaks the integration between Scrapy and Apify.

Documentation:
--------------

For an in-depth description of the Apify-Scrapy integration process, our Scrapy components, known limitations, and
other details, please refer to the following documentation page: https://docs.apify.com/cli/docs/integrating-scrapy.
"""

from __future__ import annotations

from scrapy.crawler import CrawlerProcess

from apify import Actor
from apify.scrapy.utils import apply_apify_settings

# Import your Scrapy spider here
from .spiders.title import TitleSpider as Spider

# Default input values for local execution using `apify run`
LOCAL_DEFAULT_START_URLS = [{'url': 'https://apify.com'}]


async def main() -> None:
    """
    Apify Actor main coroutine for executing the Scrapy spider.
    """
    async with Actor:
        Actor.log.info('Actor is being executed...')

        # Process Actor input
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('startUrls', LOCAL_DEFAULT_START_URLS)
        proxy_config = actor_input.get('proxyConfiguration')

        # Add start URLs to the request queue
        rq = await Actor.open_request_queue()
        for start_url in start_urls:
            url = start_url.get('url')
            await rq.add_request(request={'url': url, 'method': 'GET'})

        # Apply Apify settings; they override the Scrapy project settings
        settings = apply_apify_settings(proxy_config=proxy_config)

        # Execute the spider using Scrapy CrawlerProcess
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider)
        process.start()
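The input handling in main() can be isolated into a small helper that is easy to test without running the Actor. The following is a minimal sketch, not part of the template: `extract_start_urls` is a hypothetical name, and skipping entries that lack a `url` key is an assumption about the desired behavior (the listing above passes on whatever `start_url.get('url')` returns).

```python
from __future__ import annotations

from typing import Any

# Same default as in src/main.py, used when the input has no 'startUrls' key.
LOCAL_DEFAULT_START_URLS = [{'url': 'https://apify.com'}]


def extract_start_urls(actor_input: dict[str, Any] | None) -> list[str]:
    """Return the URLs listed under 'startUrls', skipping malformed entries.

    Hypothetical helper for illustration; it is not part of the Apify SDK.
    """
    actor_input = actor_input or {}
    start_urls = actor_input.get('startUrls', LOCAL_DEFAULT_START_URLS)

    urls = []
    for entry in start_urls:
        # Assumption: entries that are not dicts, or have no 'url', are ignored.
        url = entry.get('url') if isinstance(entry, dict) else None
        if url:
            urls.append(url)
    return urls
```

Each returned URL could then be added to the request queue in the same way as in main().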
Scrapy template
A template example built with Scrapy to scrape page titles from URLs defined in the input parameter. It shows how to use Apify SDK for Python and Scrapy pipelines to save results.
Included features
- Apify SDK for Python - a toolkit for building Apify Actors and scrapers in Python
- Input schema - define and easily validate a schema for your Actor's input
- Request queue - queues into which you can put the URLs you want to scrape
- Dataset - store structured data where each object stored has the same attributes
- Scrapy - a fast high-level web scraping framework
How it works
This code is a Python script that uses Scrapy to scrape web pages and extract data from them. Here's a brief overview of how it works:
- The script reads the input data from the Actor instance, which is expected to contain a `startUrls` key with a list of URLs to scrape.
- The script then creates a Scrapy spider that scrapes the URLs. This spider (class `TitleSpider`) stores URLs and titles.
- A Scrapy pipeline saves the results to the default dataset associated with the Actor run using the `push_data` method of the Actor instance.
- The script catches any exceptions that occur during the web scraping process and logs an error message using the `Actor.log.exception` method.
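The data flow described above (extract a title per URL, then hand each item to a pipeline) can be sketched without Scrapy. This is a simplified stand-in, not the template's code: it uses the standard library's `html.parser` instead of Scrapy selectors and an in-memory list instead of the Apify dataset. The names `TitleParser`, `extract_title`, and `InMemoryPipeline` are hypothetical.

```python
from __future__ import annotations

from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> str | None:
        text = ''.join(self._parts).strip()
        return text or None


def extract_title(url: str, html: str) -> dict:
    """Build an item with the same two fields the spider stores: URL and title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return {'url': url, 'title': parser.title}


class InMemoryPipeline:
    """Stand-in for the item pipeline; the template saves items to the dataset instead."""

    def __init__(self) -> None:
        self.items = []

    def process_item(self, item: dict) -> dict:
        self.items.append(dict(item))
        return item


pipeline = InMemoryPipeline()
pipeline.process_item(
    extract_title('https://example.com', '<html><head><title> Example Domain </title></head></html>')
)
```

In the template itself, the equivalent of `process_item` pushes each item to the default dataset rather than appending it to a list.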
Resources
- Web scraping with Scrapy
- Python tutorials in Academy
- Alternatives to Scrapy for web scraping in 2023
- Beautiful Soup vs. Scrapy for web scraping
- Integration with Zapier, Make, Google Drive, and others
- Video guide on getting scraped data using Apify API
- A short guide on how to build web scrapers using code templates