Carmelita
Python Scrapy template
A template example built with Scrapy to scrape page titles from URLs defined in the input parameter. It shows how to use the Apify SDK for Python and Scrapy pipelines to save results.
Included features
- Apify SDK for Python - a toolkit for building Apify Actors and scrapers in Python
- Input schema - define and easily validate a schema for your Actor's input
- Request queue - queues into which you can put the URLs you want to scrape
- Dataset - store structured data where each object stored has the same attributes
- Scrapy - a fast high-level web scraping framework
How it works
This code is a Python script that uses Scrapy to scrape web pages and extract data from them. Here's a brief overview of how it works:
- The script reads the input data from the Actor instance, which is expected to contain a start_urls key with a list of URLs to scrape.
- The script then creates a Scrapy spider that will scrape the URLs. This spider (class TitleSpider) stores URLs and titles.
- A Scrapy pipeline saves the results to the default dataset associated with the Actor run using the push_data method of the Actor instance.
- The script catches any exceptions that occur during the web scraping process and logs an error message using the Actor.log.exception method.
Resources
- Web scraping with Scrapy
- Python tutorials in Academy
- Alternatives to Scrapy for web scraping in 2023
- Beautiful Soup vs. Scrapy for web scraping
- Integration with Zapier, Make, Google Drive, and others
- Video guide on getting scraped data using Apify API
- A short guide on how to build web scrapers using code templates
Getting started
For complete information see this article. In short, you will:
- Build the Actor
- Run the Actor
Pull the Actor for local development
If you would like to develop locally, you can pull the existing Actor from Apify Console using the Apify CLI:
- Install apify-cli

  Using Homebrew

      brew install apify-cli

  Using NPM

      npm -g install apify-cli

- Pull the Actor by its unique <ActorId>, which is one of the following:
  - unique name of the Actor to pull (e.g. "apify/hello-world")
  - or ID of the Actor to pull (e.g. "E2jjCZBezvAZnX8Rb")

  You can find both by clicking on the Actor title at the top of the page, which will open a modal containing both Actor unique name and Actor ID.

  This command will copy the Actor into the current directory on your local machine.

      apify pull <ActorId>
Documentation reference
To learn more about Apify and Actors, take a look at the following resources: