
Scrapy

This example Scrapy spider scrapes page titles from URLs defined in the input parameter. It shows how to use the Apify SDK for Python and Scrapy pipelines to save results.

Language

python

Tools

scrapy

Use cases

Web scraping

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings

from apify import Actor

from ..spiders.title_spider import TitleSpider


def _get_scrapy_settings(max_depth: int) -> Settings:
    """
    Get Scrapy project settings.
    """
    settings = get_project_settings()
    # Add our Actor Push Pipeline with the lowest priority
    settings['ITEM_PIPELINES']['src.apify.pipelines.ActorDatasetPushPipeline'] = 1
    # Disable default Retry Middleware
    settings['DOWNLOADER_MIDDLEWARES']['scrapy.downloadermiddlewares.retry.RetryMiddleware'] = None
    # Add our custom Retry Middleware with the top priority
    settings['DOWNLOADER_MIDDLEWARES']['src.apify.middlewares.ApifyRetryMiddleware'] = 999
    # Add our custom Scheduler
    settings['SCHEDULER'] = 'src.apify.scheduler.ApifyScheduler'
    settings['DEPTH_LIMIT'] = max_depth
    return settings


async def main():
    async with Actor:
        Actor.log.info('Actor is being executed...')

        # Process Actor input
        actor_input = await Actor.get_input() or {}
        max_depth = actor_input.get('max_depth', 1)
        start_urls = [start_url.get('url') for start_url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])]
        settings = _get_scrapy_settings(max_depth)

        # Add start URLs to the request queue
        rq = await Actor.open_request_queue()
        for url in start_urls:
            await rq.add_request({'url': url})

        # Currently, execution of only one Spider is supported
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(TitleSpider)
        process.start()
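
The entry point above imports TitleSpider from the project's spiders module, which is not shown on this page. A minimal sketch of what such a spider might look like (the class body and selectors are assumptions for illustration, not the template's actual code):

from urllib.parse import urljoin

from scrapy import Request, Spider


class TitleSpider(Spider):
    # Hypothetical sketch of the spider the template refers to
    name = 'title_spider'

    def parse(self, response):
        # Yield the visited URL and its page title as one item
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        # Follow every link on the page; the DEPTH_LIMIT setting
        # configured in the settings above stops the recursion at max_depth
        for href in response.css('a::attr(href)').getall():
            yield Request(url=urljoin(response.url, href))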

Scrapy template

A template example built with Scrapy that scrapes page titles from URLs defined in the input parameter. It shows how to use the Apify SDK for Python and Scrapy pipelines to save results.

Included features

  • Apify SDK for Python - a toolkit for building Apify Actors and scrapers in Python
  • Input schema - define and easily validate a schema for your Actor's input
  • Dataset - store structured data where each object stored has the same attributes
  • Scrapy - a fast high-level web scraping framework

How it works

This code is a Python script that uses Scrapy to scrape web pages and extract data from them. Here's a brief overview of how it works:

  • The script reads the input from the Actor instance, which is expected to contain a start_urls key with a list of URLs to scrape and a max_depth key with the maximum depth of nested links to follow (an example input follows this list).
  • The script then creates a Scrapy spider that scrapes the URLs and follows links up to the specified max_depth. This spider (class TitleSpider, sketched after the main script above) stores the scraped URLs and page titles.
  • A Scrapy pipeline is used to save the results to the default dataset associated with the Actor run, using the push_data method of the Actor instance (a sketch of such a pipeline also follows this list).
  • The script catches any exceptions that occur during scraping and logs an error message using the Actor.log.exception method.
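
For reference, an Actor input matching what the script expects might look like the following dict (the values mirror the defaults in the code above and are illustrative):

# Illustrative input; Actor.get_input() returns a plain dict like this
actor_input = {
    'start_urls': [{'url': 'https://apify.com'}],  # objects with a 'url' key
    'max_depth': 1,                                # levels of nested links to follow
}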
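
The dataset pipeline registered in the settings (src.apify.pipelines.ActorDatasetPushPipeline) can be very small. A minimal sketch, assuming Scrapy's coroutine support for item pipelines:

from itemadapter import ItemAdapter

from apify import Actor


class ActorDatasetPushPipeline:
    """Sketch of a pipeline that pushes every scraped item to the dataset."""

    async def process_item(self, item, spider):
        item_dict = ItemAdapter(item).asdict()
        # push_data saves the item to the default dataset of the Actor run
        await Actor.push_data(item_dict)
        return item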

Resources

A short guide on how to build web scrapers using code templates: web scraper template

Already have a solution in mind?

Sign up for a free Apify account and deploy your code to the platform in just a few minutes! If you want a head start without coding it yourself, browse our Store of existing solutions.