[Deprecated] Outdated actor

Deprecated

Pricing

$5.00 / 1,000 results


Developed by

AIRabbit

Maintained by Community

Outdated

0.0 (0)

0

Total users: 1

Monthly users: 1

Last modified: 9 months ago

URL Fetcher for Recent Web Pages

This actor efficiently fetches URLs of web pages that have been recently added or updated on a specified website. It's designed to save time, money, and resources by focusing only on new or changed content, reducing unnecessary crawling and processing of already known pages.

Why Use This Actor?

  • Efficiency: By fetching only recent changes, you significantly reduce crawling time and resource usage.
  • Cost-effective: Less data to process means lower computing costs and storage requirements.
  • Up-to-date: Ensures you're always working with the latest content from your target website.
  • Flexible: Can be easily integrated into larger workflows for content monitoring, indexing, or analysis.

How It Works

  1. The actor takes a target URL and a time frame (in days) as input.
  2. It searches for pages on the specified website that have been added or updated within that time frame.
  3. The actor returns a list of URLs for these recent pages.

Input Configuration

  • url: The target website URL to search for recent pages.
  • daysAgo: Number of days to look back for changes (default: 1).
  • maxResults: Maximum number of results to fetch (default: 50).
  • language: Language for search results (default: 'en').
  • country: Country for search results (default: 'us').
  • searchDomain: Domain to use for search (default: 'com').
  • nextRunId: ID of the next task to run (optional).
  • nextRunAttribute: Attribute to replace in the next run (default: 'startUrls', optional).
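
The parameter list above can be turned into a small input builder. This is a minimal sketch, not the actor's own code: the helper name build_input and the positivity check on daysAgo and maxResults are assumptions made for illustration; only the field names and default values come from the list above.

```python
from typing import Any, Optional

# Defaults copied from the documented parameter list.
DEFAULTS: dict[str, Any] = {
    "daysAgo": 1,
    "maxResults": 50,
    "language": "en",
    "country": "us",
    "searchDomain": "com",
}


def build_input(url: str, next_run_id: Optional[str] = None, **overrides: Any) -> dict[str, Any]:
    """Hypothetical helper: return an input object with documented defaults filled in."""
    unknown = set(overrides) - set(DEFAULTS) - {"nextRunAttribute"}
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")

    run_input: dict[str, Any] = {"url": url, **DEFAULTS, **overrides}

    # Assumption: the actor expects positive integers here; the source does not say.
    if run_input["daysAgo"] < 1 or run_input["maxResults"] < 1:
        raise ValueError("daysAgo and maxResults must be positive")

    if next_run_id is not None:
        run_input["nextRunId"] = next_run_id
        # 'startUrls' is the documented default for nextRunAttribute.
        run_input.setdefault("nextRunAttribute", "startUrls")
    return run_input
```

Calling build_input("https://www.example-news-site.com", daysAgo=2) yields an object with daysAgo set to 2 and every other optional field at its documented default.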

Output

The actor outputs a list of URLs to recently added or updated pages on the target website. This data is stored in the default dataset associated with the actor run.

Usage Tips

Moving Window Approach

To ensure you don't miss any updates, it's recommended to use a moving window approach. This means setting the daysAgo parameter to a value slightly larger than the interval between your crawls.

For example:

  • If you run the actor daily, set daysAgo to 2 or 3.
  • If you run it weekly, set daysAgo to 8 or 9.

This overlapping window helps catch pages that might have been added or updated shortly after your last crawl, ensuring no content is missed due to delayed indexing or updates.
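
Because overlapping windows return some URLs more than once across runs, the approach pairs naturally with a "seen" set on the consumer side. The sketch below illustrates both halves; the function names and the in-memory set are assumptions for illustration, and a real pipeline would likely persist the set between runs.

```python
from typing import Iterable


def days_ago_for_interval(interval_days: int, overlap_days: int = 1) -> int:
    """Return a daysAgo value: the crawl interval plus a small overlap."""
    if interval_days < 1 or overlap_days < 0:
        raise ValueError("interval_days must be >= 1 and overlap_days >= 0")
    return interval_days + overlap_days


def drop_seen(urls: Iterable[str], seen: set[str]) -> list[str]:
    """Return URLs not yet in `seen`, preserving order, and record them in `seen`."""
    fresh: list[str] = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

With the default overlap of one day, a daily schedule gives daysAgo = 2 and a weekly schedule gives daysAgo = 8, matching the lower values suggested above; passing overlap_days=2 gives 3 and 9.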

Integrating with Other Actors

To process the fetched URLs with another actor:

  1. Create a task for the subsequent actor you want to use.
  2. In this URL Fetcher actor's input, specify the nextRunId (the ID of the task you created) and nextRunAttribute (usually 'startUrls').
  3. When this actor finishes, it will automatically start the task you specified, passing the fetched URLs as input.

Alternatively, you can access the output directly from the dataset and use it as needed in your workflows.
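
If you read the dataset yourself instead of using nextRunId, the items need to be reshaped into the input of the next actor. The following is a rough sketch of that step. It assumes the next actor accepts its start URLs as a list of {"url": ...} objects under the chosen attribute, which is a common convention but should be checked against that actor's input schema; the helper name is hypothetical.

```python
from typing import Any, Iterable


def to_next_run_input(
    items: Iterable[dict[str, Any]],
    attribute: str = "startUrls",
) -> dict[str, Any]:
    """Hypothetical helper: wrap fetched URLs under the attribute the next actor reads."""
    # Skip items without a usable "url" field rather than passing empty entries on.
    urls = [{"url": item["url"]} for item in items if item.get("url")]
    return {attribute: urls}
```

The attribute argument plays the same role as nextRunAttribute: change it if the downstream actor reads its URLs from a differently named field.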

Benefits

  • Focused crawling saves bandwidth and reduces load on target servers.
  • Smaller datasets mean faster processing and analysis downstream.
  • Ideal for monitoring frequently updated sites or tracking new content across multiple sources.
  • Easily integrates into larger data pipelines or content monitoring systems.

Example Inputs and Outputs

Example Input Configurations

  1. Basic configuration:

     {
       "url": "https://www.example-news-site.com",
       "daysAgo": 2,
       "maxResults": 100
     }

  2. Advanced configuration with next actor task:

     {
       "url": "https://www.example-e-commerce.com",
       "daysAgo": 3,
       "maxResults": 200,
       "language": "de",
       "country": "de",
       "searchDomain": "de",
       "nextRunId": "your-task-id",
       "nextRunAttribute": "startUrls"
     }

Example Output

The actor will output a list of URLs in the following format:

[
{ "url": "https://www.example-news-site.com/article/breaking-news-2024-08-20" },
{ "url": "https://www.example-news-site.com/article/tech-innovation-unveiled" },
{ "url": "https://www.example-news-site.com/article/sports-update-championship" },
...
]

These URLs represent pages that have been added or updated within the specified time frame on the target website.
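
When consuming this output, it can be useful to confirm that every item really points at the target website before passing it on. The sketch below is one way to do that using only the Python standard library; the function names are hypothetical, and treating "www." and other subdomains as part of the same site is an assumption that may not suit every setup.

```python
from typing import Any, Iterable
from urllib.parse import urlsplit


def same_site(url: str, site_host: str) -> bool:
    """True if the URL's host equals site_host or is a subdomain of it."""
    host = urlsplit(url).hostname or ""
    return host == site_host or host.endswith("." + site_host)


def urls_on_site(items: Iterable[dict[str, Any]], site_url: str) -> list[str]:
    """Return the "url" of each output item that belongs to the target site."""
    site_host = (urlsplit(site_url).hostname or "").removeprefix("www.")
    if not site_host:
        raise ValueError(f"cannot determine host from {site_url!r}")
    return [item["url"] for item in items if same_site(item.get("url", ""), site_host)]
```

Items without a "url" field, or with a URL on another host, are silently dropped; log them instead if you want to notice unexpected results.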

Getting Started

To run the actor locally, use the following command:

$ apify run

Deploying to Apify

You can deploy this actor to Apify in two ways:

  1. Connect your Git repository to Apify.

  2. Push the project from your local machine:

    • Log in to Apify: apify login
    • Deploy your actor: apify push

For more information, see the Apify documentation.
