Web Scraper

apify/web-scraper

Crawls arbitrary websites using the Chrome browser and extracts data from pages using provided JavaScript code. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance. This is Apify's basic tool for web crawling and scraping.

Tracking Webscraper

Closed

earnest_lawnmower opened this issue
3 months ago

Hi there. I am currently using your web scraping actor in my site to scrape websites. Some of these scraping jobs get large and can take a few minutes, so I want to develop a tracking system. Is there any way I can update, say, a counter in the context local to where I am initializing and running the actor? Thanks

earnest_lawnmower

3 months ago

Here is my code. I want to call some logic whenever a url is scraped:

    input.startUrls = urls;
    const run = await client
        .actor(properties.apifyCredentials.actorId)
        .call(input);
    // Run x logic when a url is scraped, not when all urls are scraped at end of job
    const { items } = await client.dataset(run.defaultDatasetId).listItems();

Hello and thank you for your interest in this Actor!

Just to make sure I understand your question - this is more about Apify and less about this specific Actor, right?

If you want to run some code whenever an Actor produces a result, you basically have two options:

  • The "easy" option: polling the Dataset API:
    • Here, we retrieve the run ID (and the default dataset ID) and repeatedly ask the server how many results are in the dataset.
    • This is very simple, but may not be granular enough for some use cases: if the Actor stores tens of results per second and you poll once per second, your code will see the results arrive in batches of roughly ten rather than one at a time (this might actually be a good thing, saving you some computing power).
    // Run the Actor and get the run ID right away - `waitSecs: 0` makes the call resolve
    // without waiting for the run to finish (it returns a reference to a "running" run).
    const { id: runId, defaultDatasetId } = await client.actor("yourActorId").call(input, { waitSecs: 0 });

    const interval = setInterval(async () => {
        const runInfo = await client.run(runId).get();
        // A freshly started run may still be 'READY'; stop polling only once it is no longer active.
        if (runInfo?.status !== 'READY' && runInfo?.status !== 'RUNNING') {
            clearInterval(interval);
        }

        // This still executes on the final tick, so the last count is logged after the run ends.
        const dataset = await client.dataset(defaultDatasetId).get();
        console.log(`The dataset currently contains ${dataset?.itemCount} items.`);
    }, 1000);
  • The "precise" option: make your Actor call your own API
    • With this option, you implement some "notification" system in your Actor so it calls your server whenever it stores a new result. Please note that this might not be possible with Web Scraper due to browser CORS policies.
    • This requires much more setup - a private server reachable from the Internet, the actual server application listening on the API and the Actor-side logic.
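
To give an idea of the server side of the "precise" option, here is a minimal sketch using only Node's built-in `http` module. Everything specific in it is an assumption made for illustration: the `/scraped` path, the `{"url": "..."}` JSON payload, and the `createTracker` / `createProgressServer` names are hypothetical, and the Actor-side code that would send these requests is not shown.

```javascript
const http = require('node:http');

// Hypothetical in-memory tracker: remembers every URL reported so far.
function createTracker() {
    const urls = [];
    return {
        record(payload) {
            if (typeof payload?.url !== 'string') {
                throw new TypeError('payload.url must be a string');
            }
            urls.push(payload.url);
            return urls.length; // running count of reported results
        },
        get count() {
            return urls.length;
        },
    };
}

// Hypothetical endpoint: the Actor would POST JSON such as
// {"url": "https://example.com"} to /scraped for every result it stores.
function createProgressServer(tracker) {
    return http.createServer((req, res) => {
        if (req.method !== 'POST' || req.url !== '/scraped') {
            res.writeHead(404).end();
            return;
        }
        let body = '';
        req.on('data', (chunk) => { body += chunk; });
        req.on('end', () => {
            try {
                tracker.record(JSON.parse(body));
                res.writeHead(204).end(); // accepted, nothing to return
            } catch {
                res.writeHead(400).end(); // malformed JSON or missing url
            }
        });
    });
}
```

To try it, something like `createProgressServer(createTracker()).listen(3000)` would start the listener locally. A real deployment would still need a publicly reachable address and some form of authentication, which this sketch leaves out.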

The choice is yours and largely depends on your use case. If precise timing is not your top priority, the first option is IMO perfect.
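
If you go with polling and want to run logic per scraped item rather than just read a count, one possible approach is to remember how many items you have already processed and fetch only the new ones on each tick. The sketch below is a hypothetical helper (`pollNewItems` and its option names are not part of apify-client); the data source is injected so that the loop itself stays independent of the API, and the commented wiring at the bottom is an assumption about how it could be connected.

```javascript
// Hypothetical helper: calls onItem once for every new dataset item.
// fetchItems(offset) must resolve to the items stored at and after `offset`;
// isRunActive() must resolve to true while the run has not finished.
async function pollNewItems({ fetchItems, isRunActive, onItem, intervalMs = 1000 }) {
    let offset = 0;
    for (;;) {
        // Check the run state before fetching, so that at least one more
        // fetch happens after the run is seen as finished.
        const active = await isRunActive();
        const items = await fetchItems(offset);
        for (const item of items) {
            await onItem(item, offset);
            offset += 1;
        }
        // Done once the run has ended and there is nothing left to read.
        if (!active && items.length === 0) return offset;
        // While the run is active, wait before asking again; after it has
        // ended, keep draining the remaining pages without delay.
        if (active) await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
}

// Assumed wiring with apify-client (needs `npm install apify-client` and a token):
//
// const { ApifyClient } = require('apify-client');
// const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
// const { id: runId, defaultDatasetId } = await client.actor('yourActorId').start(input);
// await pollNewItems({
//     fetchItems: async (offset) =>
//         (await client.dataset(defaultDatasetId).listItems({ offset, limit: 100 })).items,
//     isRunActive: async () =>
//         ['READY', 'RUNNING'].includes((await client.run(runId).get())?.status),
//     onItem: (item) => console.log('New item', item),
// });
```

Compared with a bare `setInterval`, the loop awaits each tick before starting the next one, so slow API responses cannot cause overlapping requests.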

Does this solve your problem? Let us know if you have any more questions. Thanks!

Closing due to inactivity.

Developer
Maintained by Apify
Actor metrics
  • 3.7k monthly users
  • 98.8% runs succeeded
  • 3.6 days response time
  • Created in Mar 2019
  • Modified about 1 month ago