Crawlee Blog Crawler
Example Actor demonstrating how to use Crawlee on the Apify platform. It scrapes blog posts from the Crawlee blog, groups them by author, and saves the results to a persistent storage called a dataset.
This tutorial shows how to create an Actor in two different ways:
- in the browser on the Apify platform
  - you need an Apify account
- locally on your computer, pushing to the Apify platform using the Apify CLI
  - you need an Apify account
  - you need the Apify CLI
  - having uv or poetry helps
Step by step creation
Create project from Apify template (on Apify platform)
Click on Develop new
Click on View all templates
Filter by Python templates and select ParselCrawler
Use the selected template
Build it and start it
Navigate to Crawlee blog and inspect the page
Navigate to https://crawlee.dev/blog and inspect the page structure. Notice the articles and their authors, and also the list of all articles on the left side. Think about how to extract all article names and group them by author.
Approach in this example:
We will scrape all article URLs from the list on the left side and then visit each article page to extract the author name.
Step 1 - Get links to all articles.
To inspect the page, navigate to it in your browser and press F12 to open the developer tools. Select the inspector and pick the element you want to inspect. You will see that the links to the articles are relative and start with /blog...
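If you want to double-check this observation outside the browser, a quick standalone snippet can list the relative article links. This is only a sketch for exploration, not part of the Actor, and it assumes the httpx and parsel packages are installed:

```python
# Quick standalone check of the page structure (not part of the Actor).
import httpx
from parsel import Selector

html = httpx.get('https://crawlee.dev/blog').text
selector = Selector(text=html)

# Print every href that starts with /blog - these are the relative article links.
for href in selector.css('a::attr(href)').getall():
    if href.startswith('/blog'):
        print(href)
```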
To extract data from web pages we use crawlers, and to define how the crawler processes a specific page we use request handlers. A crawler can have many different request handlers, and we organize them in a router. The router decides which handler will process which page. Let's create a default handler for processing the pages visited by our crawler.
In our case we want to create a handler that extracts all links like "https://crawlee.dev/blog/..." from the page. We can use helper functions defined on the crawling context object. We will use context.enqueue_links to extract all links and add them to the crawler's request queue, with include and exclude patterns to keep only the relevant ones. So we include links matching the pattern "https://crawlee.dev/blog/...", but exclude false positives that match this pattern yet do not point to articles, such as "https://crawlee.dev/blog/tags/..." and "https://crawlee.dev/blog/page/...".
One way to define a request handler is through a decorator:
```python
from crawlee import Glob

@crawler.router.default_handler
async def request_handler(context: ParselCrawlingContext) -> None:
    """Handler for processing the starting page."""
    context.log.info(f'Processing blog {context.request.url} ...')
    # Find relevant links and add them to the crawler request queue
    # https://crawlee.dev/python/docs/introduction/adding-more-urls#filter-urls-with-patterns
    await context.enqueue_links(
        include=[Glob("https://crawlee.dev/blog/**")],
        exclude=[Glob("https://crawlee.dev/blog/tags/**"), Glob("https://crawlee.dev/blog/page/**")],
        label="ARTICLE",
    )
```
We can use the label parameter of the enqueue_links method to mark those requests with the "ARTICLE" label, which instructs the router to process these pages with a different handler.
Step 2 - Get author and article name from each detailed article page.
So now we have all article links in the request queue. We need to create another handler that will process each article page and extract the author name and article title.
This handler will use the Parsel library to parse the HTTP response and XPath selectors to pick out the information we are interested in. The handler will then save this information to a predefined defaultdict, so that each author (dictionary key) has a list of articles (dictionary value).
```python
from collections import defaultdict

# Dictionary for the results
author_articles: dict[str, list[str]] = defaultdict(list)

@crawler.router.handler(label="ARTICLE")
async def request_handler(context: ParselCrawlingContext) -> None:
    """Handler for processing the article page."""
    context.log.info(f'Processing article {context.request.url} ...')
    # Extract relevant information from the page
    # https://parsel.readthedocs.io/en/latest/usage.html#some-xpath-tips
    author = context.selector.xpath("//div[contains(@class, 'avatar__name')]/a/span/text()").get()
    article_title = context.selector.xpath("//h1[contains(@class, 'title_xvU1')]//text()").get()
    author_articles[author].append(article_title)
```
Step 3 - Save results to some storage.
Now we have all the authors and their articles. Let's save them somewhere so that the user can view or download them after the crawler is finished.
```python
# Get storage for storing the results
result_dataset = await crawler.get_dataset()

# Save results to the storage
await result_dataset.push_data(author_articles)
```
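For orientation, here is a minimal, self-contained sketch of how Steps 1-3 could fit together in the Actor's main coroutine. It mirrors the snippets above; the exact import paths and the entry-point module name depend on the template and Crawlee version you use:

```python
from collections import defaultdict

from apify import Actor
from crawlee import Glob
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    """Sketch of the Actor entry point combining Steps 1-3."""
    async with Actor:
        crawler = ParselCrawler(max_requests_per_crawl=100)

        # Dictionary for the results: author name -> list of article titles.
        author_articles: dict[str, list[str]] = defaultdict(list)

        @crawler.router.default_handler
        async def default_handler(context: ParselCrawlingContext) -> None:
            # Step 1: enqueue article links from the blog index page.
            context.log.info(f'Processing blog {context.request.url} ...')
            await context.enqueue_links(
                include=[Glob('https://crawlee.dev/blog/**')],
                exclude=[
                    Glob('https://crawlee.dev/blog/tags/**'),
                    Glob('https://crawlee.dev/blog/page/**'),
                ],
                label='ARTICLE',
            )

        @crawler.router.handler(label='ARTICLE')
        async def article_handler(context: ParselCrawlingContext) -> None:
            # Step 2: extract the author and article title from the article page.
            context.log.info(f'Processing article {context.request.url} ...')
            author = context.selector.xpath(
                "//div[contains(@class, 'avatar__name')]/a/span/text()"
            ).get()
            title = context.selector.xpath("//h1[contains(@class, 'title_xvU1')]//text()").get()
            author_articles[author].append(title)

        # Start crawling from the blog index page.
        await crawler.run(['https://crawlee.dev/blog'])

        # Step 3: save the aggregated results to the dataset.
        result_dataset = await crawler.get_dataset()
        await result_dataset.push_data(author_articles)
```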
Step 4 - Run Actor on the Apify platform
Rebuild the Actor if there are any changes and start it again.
Step 5 - View the results
Now you can inspect the extracted data saved in the dataset.
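Besides the platform UI, you can also fetch the dataset programmatically with the Apify API client. A small sketch, where the token and dataset ID are hypothetical placeholders you would replace with your own values:

```python
from apify_client import ApifyClient

# Placeholders - replace with your own API token and the dataset ID from the Actor run.
client = ApifyClient('YOUR_APIFY_TOKEN')
items = client.dataset('YOUR_DATASET_ID').list_items().items

for item in items:
    print(item)
```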
Step 6 - Add Actor input
You can also add inputs to your Actor. In this example we will add an input that optionally specifies the name of the author you are interested in. You can define the input schema in the .actor/input_schema.json file.
In the .actor folder you should have an actor.json file that defines details of the Actor.
Add this object to the properties of input_schema.json:
"author": {"title": "Author","type": "string","description": "Show only articles from this author. If empty, show all articles.","prefill": "Max","editor": "textfield"}
Extract the desired author from the input:
```python
# Read the Actor input (an empty dict if no input was provided)
actor_input = await Actor.get_input() or {}
# Extract the desired author from the input
desired_author = actor_input.get("author", "")
```
Filter extracted authors by the desired author:
```python
# Filter articles by author if provided. Return all articles if no author was selected.
author_articles = {desired_author: author_articles.get(desired_author, [])} if desired_author else author_articles
```
Step 7 - Monetization
In order to monetize your Actor, you have to think about what the user will pay for. For example, you can make the user pay for certain events in your code. For that you can use the Actor.charge method. When that method executes, the user is charged the amount defined for the event:
```python
# By calling Actor.charge you can monetize your Actor on the Apify platform
await Actor.charge(event_name="result")
```
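If you charge per extracted result, you would typically pass the number of produced items as well. A hedged sketch, assuming Actor.charge's count parameter and that a "result" event is defined in the Actor's pay-per-event configuration:

```python
# Charge once per article saved to the dataset. The "result" event name must match
# an event defined in the Actor's pay-per-event pricing configuration.
total_articles = sum(len(titles) for titles in author_articles.values())
await Actor.charge(event_name="result", count=total_articles)
```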
You can then set up monetization of your Actor in the Apify Store.
See the publication tab:
Set pricing model:
Set pricing events:
For more details about publishing and monetization, please refer to the documentation: https://docs.apify.com/platform/actors/publishing
Local development workflow
Create project using Crawlee template (locally)
Run crawlee-cli:
uvx crawlee[cli] create
or
pipx run crawlee[cli] create
Select the following options to create the template that we will build upon:
- example-crawler
- Parsel
- Impit
- Uv
- https://crawlee.dev/blog
- y
- y
Navigate to the newly created project folder, examine it, and try to run the template:
python -m example_crawler
Run Actor on the Apify platform
Make sure the apify CLI is installed. You can follow the installation guide: https://docs.apify.com/cli/docs/installation
Push the locally developed Actor to the Apify platform (if developing locally) by running:
apify push
Follow the link printed at the end of the apify push command output and start the Actor on the Apify platform.
Useful links
Did you enjoy scraping and want to learn more? Check out one of the following links:
- Crawlee for Python
- Apify SDK
- Apify API client
- Crawlee guides
- Crawlee blog
- Apify web scraping academy, and for Python here
- Step by step guide to extracting data here
- Looking for inspiration on what to build? Check the ideas page
- Actor whitepaper
- Create Actor from template video
- How to build and monetize Actors on Apify Store - Earn passive income from your scrapers video
- Apify Discord channel
- Apify Actors developers
QR code to this repo: