Web Scraper

Developed by

Apify

Maintained by Apify

Crawls arbitrary websites using a web browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

4.5 (22)
Pricing: Pay per usage
697
Total users: 82.5k
Monthly users: 3.9k
Runs succeeded: >99%
Issue response: 32 days
Last modified: 19 days ago

Tracking Webscraper

Closed

earnest_lawnmower opened this issue
a year ago

Hi there. I am currently using your web scraping actor in my site to scrape websites. Some of these scraping jobs get large and can take a few minutes, so I want to develop a tracking system. Is there any way I can update, say, a counter in the context local to where I am initializing and running the actor? Thanks

earnest_lawnmower

a year ago

Here is my code. I want to call some logic whenever a url is scraped:

input.startUrls = urls;
const run = await client
    .actor(properties.apifyCredentials.actorId)
    .call(input);
// Run x logic when a url is scraped, not when all urls are scraped at end of job
const { items } = await client.dataset(run.defaultDatasetId).listItems();

jindrich.bar

Hello and thank you for your interest in this Actor!

Just to make sure I understand your question - this is more about Apify and less about this specific Actor, right?

If you want to run some code whenever an Actor produces a result, you basically have two options:

  • The "easy" option: polling the Dataset API:
    • Here, we retrieve the run ID (and the default dataset ID) and repeatedly ask the server how many results are in the dataset.
    • This is very simple, but may not be granular enough for some use cases: if the Actor stores tens of results per second, the code will only see the results appear in batches of tens rather than one by one (this might actually be a good thing, saving you some computation power).
// Start the Actor and get the run ID right away: `waitSecs: 0` makes the call
// resolve immediately and return a reference to a run that is still in progress.
const { id: runId, defaultDatasetId } = await client.actor("yourActorId").call(input, { waitSecs: 0 });
const interval = setInterval(async () => {
    const runInfo = await client.run(runId).get();
    // A freshly started run may still be 'READY', so stop only on a terminal status.
    if (!runInfo || ['SUCCEEDED', 'FAILED', 'ABORTED', 'TIMED-OUT'].includes(runInfo.status)) {
        clearInterval(interval);
    }
    const dataset = await client.dataset(defaultDatasetId).get();
    console.log(`The dataset currently contains ${dataset?.itemCount} items.`);
}, 1000);
  • The "precise" option: make your Actor call your own API
    • With this option, you implement some "notification" system in your Ac... [trimmed]
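Going back to the polling option: if you need to run logic per scraped item rather than just read a count, one possible variation is to remember how many items you have already seen and fetch only the new ones. The following is a rough sketch, not an official recipe. `watchRun`, `onItem` and `intervalMs` are names made up for this example; the only client calls it assumes are `client.run(id).get()` and `client.dataset(id).listItems({ offset, limit })` from `apify-client`.

```javascript
// Statuses after which a run will not store any more items.
const TERMINAL_STATUSES = ['SUCCEEDED', 'FAILED', 'ABORTED', 'TIMED-OUT'];
const PAGE_LIMIT = 1000; // items requested per listItems() call

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: polls the dataset and calls `onItem` once per new item.
// `client` is expected to behave like an ApifyClient instance.
async function watchRun({ client, runId, datasetId, onItem, intervalMs = 1000 }) {
    let offset = 0; // number of items already passed to `onItem`
    let finished = false;

    while (!finished) {
        // Check the status first, so the read below also picks up
        // items stored just before the run ended.
        const run = await client.run(runId).get();
        finished = !run || TERMINAL_STATUSES.includes(run.status);

        // Fetch only items not seen yet; keep going while pages come back full.
        let page;
        do {
            page = await client.dataset(datasetId).listItems({ offset, limit: PAGE_LIMIT });
            for (const item of page.items) await onItem(item);
            offset += page.items.length;
        } while (page.items.length === PAGE_LIMIT);

        if (!finished) await sleep(intervalMs);
    }
    return offset; // total number of items processed
}
```

With the `runId` and `defaultDatasetId` from the snippet above, this could be called as `await watchRun({ client, runId, datasetId: defaultDatasetId, onItem: (item) => { counter += 1; } })`. The callback still fires in small batches once per polling interval, so the granularity caveat from the first option applies here too.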
jindrich.bar

Closing due to inactivity.