Changelog

What's new at Apify?

New

Crawlee

We launched our open-source web scraping and browser automation library Crawlee in August 2022 and got an amazing response from the JavaScript community. With many early adopters in its initial days, we got valuable feedback, which gave Crawlee a strong base for its success.

Since the launch, the feedback we’ve received most often has been to build Crawlee in Python so that the Python community can use all the features the JavaScript community does.

With these requests in mind, and to make life easier for Python web scraping developers, we’re launching Crawlee for Python today.

The new library is still in beta, and we're looking for early adopters. Here's how you can help:

Check it out, give us feedback, and star the library on GitHub! ⭐️

Support the release on Product Hunt 🚀

Saurav Jain

Developer Community Manager


New

Update

Console

We've improved bulk actions all over Apify Console. You can now:

  • Change run options (build, memory, timeout) when resurrecting multiple runs at once.
  • Abort multiple runs gracefully.
  • Configure alerts for multiple Actors and Actor tasks at the same time.

We've enhanced the OpenGraph images displayed when you share your developer profile on social networks.

And we've also made notification settings more flexible. These now support usage alerts and monthly developer summaries.

Marek Trunkát

CTO


New

API

Storage

We’re excited to announce the launch of Apify's new Request Queue storage system. This update unlocks new use cases and addresses common issues with the previous request queue.

New use cases

Distributed scraping: The new system integrates a locking mechanism that allows multiple clients to process the same request queue without duplicating work. This feature is crucial for distributed scraping tasks, where multiple Actor runs scrape one request queue.
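
To make the locking idea concrete, here is a toy in-memory sketch. It is not Apify's implementation or API: the LockableQueue class, its method names, and the client keys are all hypothetical, and it only illustrates how time-limited locks keep two clients from picking up the same request.

```javascript
// Toy in-memory model of a lockable request queue.
// Hypothetical illustration only, not Apify's implementation or API.
class LockableQueue {
    constructor(urls) {
        // Each request tracks who holds its lock and when that lock expires.
        this.requests = urls.map((url) => ({ url, lockedBy: null, lockedUntil: 0, handled: false }));
    }

    // Lock up to `limit` unhandled, unlocked requests for `clientKey` and return their URLs.
    lockHead(clientKey, { limit, lockSecs, now = Date.now() }) {
        const free = this.requests
            .filter((r) => !r.handled && r.lockedUntil <= now)
            .slice(0, limit);
        for (const r of free) {
            r.lockedBy = clientKey;
            r.lockedUntil = now + lockSecs * 1000;
        }
        return free.map((r) => r.url);
    }

    // Mark a request as handled; only the client holding a valid lock may do so.
    markHandled(clientKey, url, now = Date.now()) {
        const r = this.requests.find((req) => req.url === url);
        if (!r || r.lockedBy !== clientKey || r.lockedUntil <= now) return false;
        r.handled = true;
        return true;
    }
}

const queue = new LockableQueue([
    'https://example.com/1',
    'https://example.com/2',
    'https://example.com/3',
]);

// Two clients ask for work at the same moment; each gets a disjoint set of requests.
const batchA = queue.lockHead('client-a', { limit: 2, lockSecs: 60, now: 0 });
const batchB = queue.lockHead('client-b', { limit: 2, lockSecs: 60, now: 0 });
console.log(batchA); // the first two URLs
console.log(batchB); // only the third URL, because the first two are locked
```

If a client crashes, its locks simply expire and another client can pick up the unfinished requests, which is the property that makes this pattern useful for distributed runs.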

Batch operations: The new storage system supports batch operations, allowing multiple requests to be enqueued or dequeued in a single operation. This reduces network latency and accelerates the processing of large volumes of requests.
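
As a rough sketch of what batching looks like on the client side, the helper below splits a list of URLs into fixed-size groups so that each group can go out in one batch call. The toBatches function, the uniqueKey choice, and the batch size of 25 are assumptions made for illustration; check the API reference for the actual per-call limit.

```javascript
// Split URLs into fixed-size batches of request objects.
// `toBatches` is a hypothetical helper; the batch size is chosen by the caller.
function toBatches(urls, batchSize) {
    if (!Number.isInteger(batchSize) || batchSize < 1) {
        throw new RangeError('batchSize must be a positive integer');
    }
    const batches = [];
    for (let i = 0; i < urls.length; i += batchSize) {
        // Using the URL as the unique key is an assumption made for this sketch.
        batches.push(urls.slice(i, i + batchSize).map((url) => ({ url, uniqueKey: url })));
    }
    return batches;
}

const urls = Array.from({ length: 60 }, (_, i) => `https://example.com/page-${i + 1}`);
const batches = toBatches(urls, 25); // 25 is an assumed batch size
console.log(batches.map((batch) => batch.length)); // [ 25, 25, 10 ]
```

Sixty requests become three network calls instead of sixty, which is where the latency saving comes from.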

Unlimited data retention: Unlike the previous system where each request had a fixed expiration date, the new system allows for indefinite retention of requests in named queues. This feature facilitates incremental crawling by allowing you to append new URLs to the queue and pick up where you left off in subsequent runs.

For more detailed information on the new Request Queue system, please refer to our blog post. To learn how to implement these features directly, you can explore the Apify documentation or review the tutorials available at Apify Academy.

Jakub Drobník

Senior Engineer


New

Web

Actor

Our redesigned Store is now live on our website! It introduces a new design, new sections, Actor collections, and featured developers, all to support our community and highlight the amazing Actors you create!

If you're an Actor developer, a few things can help you get into Actor collections or the featured developers section:

  • A good-quality README
  • Short response times on your Actor issues
  • Engaging with and supporting the Discord community
  • Promoting your Actors on social media

Join the community and help grow Apify Store even faster!

Jan Ženíšek

VP of Product


New

Update

Actor

API

Performance improvements

As part of our continuous performance improvement initiative, we're happy to announce that we successfully improved the Apify API response time by 50% on average and the 90th-percentile startup time of Actors by about 20%. We will continue improving Apify in this direction.

API updates

The user limits endpoint now returns the maxConcurrentActorJobs and activeActorJobCount properties, so you can keep an eye on your concurrency limit.
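
One way to use these two properties is to compute how many more Actor jobs you can start. This is a minimal sketch: the freeActorJobSlots helper is hypothetical, and the response shape (maxConcurrentActorJobs under data.limits, activeActorJobCount under data.current) and the sample numbers are assumptions, so verify them against the API reference.

```javascript
// Compute the remaining concurrency slots from a user limits response body.
// The nesting under `data.limits` and `data.current` is an assumption.
function freeActorJobSlots(body) {
    const { limits, current } = body.data;
    // Never report a negative number of free slots.
    return Math.max(0, limits.maxConcurrentActorJobs - current.activeActorJobCount);
}

// Made-up sample body for illustration.
const sampleBody = {
    data: {
        limits: { maxConcurrentActorJobs: 25 },
        current: { activeActorJobCount: 22 },
    },
};

console.log(freeActorJobSlots(sampleBody)); // 3
```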

We also added the missing endpoint /actor-builds/:build-id/log, which lets you quickly access the log of a specific build without needing an Actor run ID.
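
A short sketch of calling this endpoint follows. The helper names, the base URL default, and the bearer-token header are assumptions made for illustration, and the build ID is a placeholder. It relies on the global fetch available in recent Node.js versions.

```javascript
// Build the URL of the build log endpoint for a given build ID.
// The base URL is an assumption; adjust it if your setup differs.
function buildLogUrl(buildId, baseUrl = 'https://api.apify.com/v2') {
    return `${baseUrl}/actor-builds/${encodeURIComponent(buildId)}/log`;
}

// Fetch the log as plain text (hypothetical helper, not called in this sketch).
async function fetchBuildLog(buildId, token) {
    const response = await fetch(buildLogUrl(buildId), {
        headers: { Authorization: `Bearer ${token}` },
    });
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    return response.text();
}

// Placeholder build ID, for illustration only.
console.log(buildLogUrl('HYPOTHETICAL_BUILD_ID'));
```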

Adaptive Playwright Crawler

Try out Crawlee's new AdaptivePlaywrightCrawler class. It extends PlaywrightCrawler with a more limited request handler interface, which lets it switch to HTTP-only crawling when it detects that this may be possible. This can lower your costs when crawling multiple websites.

import { AdaptivePlaywrightCrawler } from 'crawlee';

const crawler = new AdaptivePlaywrightCrawler({
    // 0.1 means rendering type detection runs on roughly 10% of requests
    renderingTypeDetectionRatio: 0.1,
    async requestHandler({ querySelector, pushData, enqueueLinks, request, log }) {
        // This function is called to extract data from a single web page
        const $prices = await querySelector('span.price');

        await pushData({
            url: request.url,
            price: $prices.filter(':contains("$")').first().text(),
        });

        await enqueueLinks({ selector: '.pagination a' });
    },
});

await crawler.run([
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);

Marek Trunkát

CTO


We're making the web more programmable.