Scrapy
This example Scrapy spider scrapes page titles from URLs defined in the input parameter. It shows how to use the Apify SDK for Python together with Scrapy pipelines to save results.
```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings

from apify import Actor

from ..spiders.title_spider import TitleSpider


def _get_scrapy_settings(max_depth: int) -> Settings:
    """Get Scrapy project settings."""
    settings = get_project_settings()
    # Add our Actor Push Pipeline with the lowest priority
    settings['ITEM_PIPELINES']['src.apify.pipelines.ActorDatasetPushPipeline'] = 1
    # Disable the default Retry Middleware
    settings['DOWNLOADER_MIDDLEWARES']['scrapy.downloadermiddlewares.retry.RetryMiddleware'] = None
    # Add our custom Retry Middleware with the top priority
    settings['DOWNLOADER_MIDDLEWARES']['src.apify.middlewares.ApifyRetryMiddleware'] = 999
    # Add our custom Scheduler
    settings['SCHEDULER'] = 'src.apify.scheduler.ApifyScheduler'
    settings['DEPTH_LIMIT'] = max_depth
    return settings


async def main():
    async with Actor:
        Actor.log.info('Actor is being executed...')

        # Process Actor input
        actor_input = await Actor.get_input() or {}
        max_depth = actor_input.get('max_depth', 1)
        start_urls = [start_url.get('url') for start_url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])]
        settings = _get_scrapy_settings(max_depth)

        # Add start URLs to the request queue
        rq = await Actor.open_request_queue()
        for url in start_urls:
            await rq.add_request({'url': url})

        # Currently, execution of only one Spider is supported
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(TitleSpider)
        process.start()
```
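The `ActorDatasetPushPipeline` registered in the settings above is not shown in this snippet. As a rough, hypothetical sketch (not the actual Apify implementation, which lives in `src/apify/pipelines.py`), a Scrapy item pipeline that pushes each item to the Actor's default dataset could look like this:

```python
# Hypothetical sketch of an item pipeline like ActorDatasetPushPipeline
# (assumed, not the actual Apify source). Scrapy accepts coroutine
# process_item methods, so the pipeline can await the Apify SDK directly.

class ActorDatasetPushPipeline:
    """Push each scraped item to the default dataset of the Actor run."""

    async def process_item(self, item, spider):
        # Imported lazily so the class itself carries no hard dependency
        from apify import Actor

        await Actor.push_data(dict(item))  # store the item in the dataset
        return item  # pass the item on to any later pipelines
```

Giving it priority `1` in `ITEM_PIPELINES` means it runs before any other pipelines, so every item reaches the dataset even if a later pipeline drops it.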
Scrapy template
A template example built with Scrapy to scrape page titles from URLs defined in the input parameter. It shows how to use the Apify SDK for Python together with Scrapy pipelines to save results.
Included features
- Apify SDK for Python - a toolkit for building Apify Actors and scrapers in Python
- Input schema - define and easily validate a schema for your Actor's input
- Dataset - store structured data where each object stored has the same attributes
- Scrapy - a fast high-level web scraping framework
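For reference, an Actor input matching the keys read in `main()` might look like this (the values shown are illustrative; they mirror the defaults in the code):

```json
{
    "start_urls": [{ "url": "https://apify.com" }],
    "max_depth": 1
}
```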
How it works
This code is a Python script that uses Scrapy to scrape web pages and extract data from them. Here's a brief overview of how it works:
- The script reads the input data from the Actor instance, which is expected to contain a `start_urls` key with a list of URLs to scrape and a `max_depth` key with the maximum depth of nested links to follow.
- The script then creates a Scrapy spider that scrapes the URLs and follows links up to the specified `max_depth`. This Spider (class `TitleSpider`) stores URLs and titles.
- A Scrapy pipeline is used to save the results to the default dataset associated with the Actor run, using the `push_data` method of the Actor instance.
- The script catches any exceptions that occur during the web scraping process and logs an error message using the `Actor.log.exception` method.
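The spider's own source is not included in this overview. In the real `TitleSpider`, extracting the title would typically be a one-liner such as `response.css('title::text').get()`; independently of Scrapy, the core extraction (grabbing the text of the page's `<title>` element) can be sketched with just the standard library:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text content of the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and self.title is None:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Accumulate, since the parser may deliver text in chunks
            self.title = (self.title or '') + data


def extract_title(html: str):
    """Return the page title, or None if the page has no <title>."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip() if parser.title else parser.title
```

This is only an illustration of the extraction step; the actual spider also yields follow-up requests for links found on the page, which Scrapy cuts off at `DEPTH_LIMIT`.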
Resources
- Web scraping with Scrapy
- Python tutorials in Academy
- Alternatives to Scrapy for web scraping in 2023
- Beautiful Soup vs. Scrapy for web scraping
- Integration with Zapier, Make, Google Drive, and others
- Video guide on getting scraped data using Apify API
A short guide on how to build web scrapers using code templates: web scraper template
Related templates
- Example of a web scraper that uses Python Requests to scrape HTML from a single URL provided on input, parses it with BeautifulSoup, and saves the results to storage.
- Crawler example that uses headless Chrome driven by Playwright to scrape a website. Headless browsers render JavaScript and can help when getting blocked.
- Scraper example built with Selenium and a headless Chrome browser to scrape a website and save the results to storage. A popular alternative to Playwright.
- Empty template with the basic structure for an Actor built with the Apify SDK, so you can easily add your own functionality.