
Crawlee Blog Crawler

An example Actor that demonstrates how to use Crawlee on the Apify platform. It scrapes blog posts from the Crawlee blog, groups them by author, and saves the results to persistent storage called a dataset.

This tutorial shows how to create an Actor in two different ways: from an Apify template on the Apify platform, or locally from a Crawlee template.

Step by step creation

Create a project from an Apify template (on the Apify platform)

1. Click on Develop new.
2. Click on View all templates.
3. Filter by Python templates and select ParselCrawler.
4. Use the selected template.
5. Build it and start it.

Step 1 - Scrape the article links from the blog listing page

Navigate to https://crawlee.dev/blog and inspect the page structure. Notice the articles and their authors, and the list of all articles on the left side. Think about how to extract all the article names and group them by author.

Approach in this example:

We will scrape all article URLs from the list on the left side and then visit each article page to extract the author name and article title.

To inspect the page, navigate to it in your browser and press F12 to open the developer tools. Select the inspector and pick the element you want to inspect. You will see that all the links to the articles are relative and start with /blog...


To extract data from web pages we use crawlers, and to define how the crawler processes a specific page we use request handlers. A crawler can have many different request handlers, and we organize them in a router. The router decides which page will be processed by which handler. Let's create a default handler for processing the pages visited by our crawler.

In our case we want to create a handler that extracts all links like "https://crawlee.dev/blog/..." from the page. We can use some helper functions defined on the crawling context object. We will use context.enqueue_links to extract all links and add them to the crawler's request queue, together with include and exclude patterns to keep only relevant links. So we include links matching "https://crawlee.dev/blog/...", but exclude false positives that do not point to articles, such as "https://crawlee.dev/blog/tags/..." and "https://crawlee.dev/blog/page/...".

One way to define a request handler is through a decorator:

from crawlee import Glob

@crawler.router.default_handler
async def request_handler(context: ParselCrawlingContext) -> None:
    """Handler for processing the starting page."""
    context.log.info(f'Processing blog {context.request.url} ...')
    # Find relevant links and add them to the crawler request queue
    # https://crawlee.dev/python/docs/introduction/adding-more-urls#filter-urls-with-patterns
    await context.enqueue_links(
        include=[Glob("https://crawlee.dev/blog/**")],
        exclude=[Glob("https://crawlee.dev/blog/tags/**"), Glob("https://crawlee.dev/blog/page/**")],
        label="ARTICLE",
    )

We use the label parameter of the enqueue_links method to mark those requests with the "ARTICLE" label, which instructs the router to process these pages with a different handler.

Step 2 - Get the author and article name from each article detail page.

So now we have all article links in the request queue. We need to create another handler that will process each article page and extract the author name and article title.

This handler will use the Parsel library to parse the HTTP response and XPath selectors to pick out the information we are interested in. The handler then saves this information to a predefined defaultdict, so that each author (dictionary key) has a list of articles (dictionary value).

from collections import defaultdict

# Dictionary for the results
author_articles: dict[str, list[str]] = defaultdict(list)

@crawler.router.handler(label="ARTICLE")
async def request_handler(context: ParselCrawlingContext) -> None:
    """Handler for processing the article page."""
    context.log.info(f'Processing article {context.request.url} ...')
    # Extract relevant information from the page
    # https://parsel.readthedocs.io/en/latest/usage.html#some-xpath-tips
    author = context.selector.xpath("//div[contains(@class, 'avatar__name')]/a/span/text()").get()
    article_title = context.selector.xpath("//h1[contains(@class, 'title_xvU1')]//text()").get()
    author_articles[author].append(article_title)

Step 3 - Save the results to storage.

Now we have all the authors and their articles. Let's save them somewhere so that the user can view or download them after the crawler finishes.

# Get storage for storing the results
result_dataset = await crawler.get_dataset()
# Save results to the storage
await result_dataset.push_data(author_articles)
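
The push_data call accepts either a single object or a list of objects. If you would rather get one dataset row per author, a variation like the following would also work (an illustrative alternative, not what the workshop Actor does):

# Illustrative alternative: push one record per author for a tabular dataset
await result_dataset.push_data(
    [{"author": author, "articles": articles} for author, articles in author_articles.items()]
)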

Step 4 - Run Actor on the Apify platform

Rebuild the Actor if there are any changes and start it again.

Step 5 - View the results

Now you can inspect the extracted data saved in the dataset.

Step 6 - Add Actor input

You can also add inputs to your Actor. In this example we will add an input that optionally specifies the name of the author you are interested in. You can define the input schema in the .actor/input_schema.json file.

In the .actor folder you should have an actor.json file that defines the details of the Actor:
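
A minimal actor.json might look roughly like this; the values shown are illustrative, and the template already generates this file for you:

{
    "actorSpecification": 1,
    "name": "crawlee-blog-scraper",
    "title": "Crawlee Blog Scraper",
    "version": "0.0",
    "input": "./input_schema.json",
    "dockerfile": "../Dockerfile"
}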

Add this object to the properties of input_schema.json:

"author": {
"title": "Author",
"type": "string",
"description": "Show only articles from this author. If empty, show all articles.",
"prefill": "Max",
"editor": "textfield"
}
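
For reference, the whole input_schema.json could then look roughly like this (the title text here is illustrative):

{
    "title": "Crawlee Blog Scraper input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "author": {
            "title": "Author",
            "type": "string",
            "description": "Show only articles from this author. If empty, show all articles.",
            "prefill": "Max",
            "editor": "textfield"
        }
    },
    "required": []
}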

Extract the desired author from the input:

# Read the Actor input and extract the desired author
actor_input = await Actor.get_input() or {}
desired_author = actor_input.get("author", "")

Filter extracted authors by the desired author:

# Filter articles by author if provided. Return all articles if no author was selected.
author_articles = {desired_author: author_articles.get(desired_author, [])} if desired_author else author_articles
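
For example, with illustrative values:

# author_articles == {"Alice": ["Post 1"], "Bob": ["Post 2", "Post 3"]}
# desired_author == "Bob"  ->  {"Bob": ["Post 2", "Post 3"]}
# desired_author == ""     ->  the full author_articles dict is kept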

Step 7 - Monetization

In order to monetize your Actor you have to think about what the user will pay for. For example, you can charge the user for certain events in your code. For that you can use the Actor.charge method; when it executes, the user is charged the amount defined for that event:

# By calling Actor.charge you can monetize your Actor on Apify platform
await Actor.charge(event_name="result")
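
For example, you could charge once for every author record you push to the dataset. This is an illustrative variation; the "result" event name must match an event defined in your Actor's pricing configuration:

# Illustrative: push each author record and charge for it as a "result" event
for author, articles in author_articles.items():
    await result_dataset.push_data({"author": author, "articles": articles})
    await Actor.charge(event_name="result")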

You can then set up monetization of your Actor when publishing it to the Apify Store:

1. Open the Publication tab.
2. Set the pricing model.
3. Set the pricing events.

For more details about publishing and monetization, please refer to the documentation: https://docs.apify.com/platform/actors/publishing

Local development workflow

Create a project using a Crawlee template (locally)

Run the Crawlee CLI:

uvx 'crawlee[cli]' create

or

pipx run 'crawlee[cli]' create

Select the following options to create the template that we will build upon:

Navigate to the newly created project folder, examine it, and try to run the template:

python -m example_crawler

Run the Actor on the Apify platform

Make sure the Apify CLI is installed. You can follow the installation guide: https://docs.apify.com/cli/docs/installation
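
For example, with Node.js available you can install it via npm:

npm install -g apify-cli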

Push the locally developed Actor to the Apify platform by running:

apify push

Follow the link printed at the end of the apify push output and start the Actor on the Apify platform.

Did you enjoy scraping and want to learn more? Just check out one of the following links:

QR code to this repo:

(QR code image)