This actor efficiently fetches URLs of web pages that have been recently added or updated on a specified website. It's designed to save time, money, and resources by focusing only on new or changed content, reducing unnecessary crawling and processing of already known pages.

Why Use This Actor?

Efficiency: By fetching only recent changes, you significantly reduce crawling time and resource usage.
Cost-effective: Less data to process means lower computing costs and storage requirements.
Up-to-date: Ensures you're always working with the latest content from your target website.
Flexible: Can be easily integrated into larger workflows for content monitoring, indexing, or analysis.

How It Works

The actor takes a target URL and a time frame (in days) as input.
It searches for pages on the specified website that have been added or updated within that time frame.
The actor returns a list of URLs for these recent pages.

Input Configuration

url: The target website URL to search for recent pages.
daysAgo: Number of days to look back for changes (default: 1).
maxResults: Maximum number of results to fetch (default: 50).
language: Language for search results (default: 'en').
country: Country for search results (default: 'us').
searchDomain: Top-level domain of the search site to use, such as 'de' (default: 'com').
nextRunId: ID of the next task to run (optional).
nextRunAttribute: Attribute to replace in the next run (default: 'startUrls', optional).
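
As an illustration of how the defaults above combine with user-supplied values, here is a minimal Python sketch. The helper name build_input and the validation rules (absolute http(s) URL, positive integers for daysAgo and maxResults) are assumptions made for this example; the actor's own input schema may accept or reject different values.

```python
from urllib.parse import urlparse

# Defaults copied from the input list above. The helper itself is hypothetical.
DEFAULTS = {
    "daysAgo": 1,
    "maxResults": 50,
    "language": "en",
    "country": "us",
    "searchDomain": "com",
}


def build_input(url, **overrides):
    """Return an input dict with the documented defaults applied."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"url must be an absolute http(s) URL, got {url!r}")

    # Later keys win, so explicit overrides replace the defaults.
    actor_input = {"url": url, **DEFAULTS, **overrides}

    # Assumed rule: both counters must be positive integers.
    for key in ("daysAgo", "maxResults"):
        value = actor_input[key]
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{key} must be a positive integer, got {value!r}")

    # nextRunAttribute only matters when a follow-up task is configured.
    if "nextRunId" in actor_input:
        actor_input.setdefault("nextRunAttribute", "startUrls")
    return actor_input
```

Calling build_input("https://www.example-news-site.com", daysAgo=2) yields a dict with daysAgo set to 2 and every other field at its documented default.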

Output

The actor outputs a list of URLs to recently added or updated pages on the target website. This data is stored in the default dataset associated with the actor run.

Usage Tips

Moving Window Approach

To ensure you don't miss any updates, it's recommended to use a moving window approach. This means setting the daysAgo parameter to a value slightly larger than the interval between your crawls.

For example:

If you run the actor daily, set daysAgo to 2 or 3.
If you run it weekly, set daysAgo to 8 or 9.

This overlapping window helps catch pages that were added or updated shortly after your last crawl, reducing the chance that content is missed because of delayed indexing or late updates.
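
The moving window can be sketched in Python as follows. Both helpers, days_ago_for_interval and new_urls_only, are hypothetical names introduced for this example and are not part of the actor. The first derives daysAgo from the crawl interval plus an overlap margin. The second filters out URLs already returned by an earlier run, since overlapping windows will report the same page more than once.

```python
def days_ago_for_interval(interval_days, overlap_days=1):
    """Return a daysAgo value covering the crawl interval plus an overlap."""
    if interval_days < 1 or overlap_days < 0:
        raise ValueError("interval_days must be >= 1 and overlap_days >= 0")
    return interval_days + overlap_days


def new_urls_only(fetched_urls, seen_urls):
    """Return URLs not seen in earlier runs, in order, and record them.

    seen_urls is a set that the caller keeps between runs; it is updated
    in place.
    """
    fresh = []
    for url in fetched_urls:
        if url not in seen_urls:
            seen_urls.add(url)
            fresh.append(url)
    return fresh
```

With the default overlap of one day, a daily schedule gives daysAgo = 2 and a weekly schedule gives daysAgo = 8, matching the lower values suggested above. Note that filtering by URL alone also skips a page that was updated again between two runs, so this filter suits new-page detection better than change tracking.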

Integrating with Other Actors

To process the fetched URLs with another actor:

Create a task for the subsequent actor you want to use.
In this URL Fetcher actor's input, specify the nextRunId (the ID of the task you created) and nextRunAttribute (usually 'startUrls').
When this actor finishes, it will automatically start the task you specified, passing the fetched URLs as input.

Alternatively, you can access the output directly from the dataset and use it as needed in your workflows.
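
If you read the dataset yourself, the hand-off to a follow-up task can be sketched like this. The function next_task_input is a hypothetical helper, and the shape it produces (a list of objects with a url key stored under the chosen attribute) is an assumption based on the example output format; check the input schema of the actor you are chaining to.

```python
def next_task_input(dataset_items, attribute="startUrls", base_input=None):
    """Build the follow-up task's input from this actor's dataset items.

    dataset_items: list of dicts such as {"url": "https://..."}.
    attribute: input field to fill, mirroring nextRunAttribute.
    base_input: optional dict of other settings for the follow-up task.
    """
    # Skip items that have no usable URL.
    urls = [{"url": item["url"]} for item in dataset_items if item.get("url")]

    # Copy so the caller's base_input is not modified.
    merged = dict(base_input or {})
    merged[attribute] = urls
    return merged
```

Any existing value under the chosen attribute in base_input is replaced, not extended.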

Benefits

Focused crawling saves bandwidth and reduces load on target servers.
Smaller datasets mean faster processing and analysis downstream.
Ideal for monitoring frequently updated sites or tracking new content across multiple sources.
Easily integrates into larger data pipelines or content monitoring systems.

Example Inputs and Outputs

Example Input Configurations

Basic configuration:

{
  "url": "https://www.example-news-site.com",
  "daysAgo": 2,
  "maxResults": 100
}

Advanced configuration with next actor task:

{
  "url": "https://www.example-e-commerce.com",
  "daysAgo": 3,
  "maxResults": 200,
  "language": "de",
  "country": "de",
  "searchDomain": "de",
  "nextRunId": "your-task-id",
  "nextRunAttribute": "startUrls"
}

Example Output

The actor will output a list of URLs in the following format:

[
  { "url": "https://www.example-news-site.com/article/breaking-news-2024-08-20" },
  { "url": "https://www.example-news-site.com/article/tech-innovation-unveiled" },
  { "url": "https://www.example-news-site.com/article/sports-update-championship" },
  ...
]

These URLs represent pages that have been added or updated within the specified time frame on the target website.
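
To consume this output in a script, a minimal Python sketch might look like the following. It assumes the dataset was exported as a complete JSON array (without the elision shown in the example above). The helper name urls_from_items and the optional same-host check are assumptions made for this example, not behavior of the actor.

```python
import json
from urllib.parse import urlparse


def urls_from_items(raw_json, target_url):
    """Extract URLs from exported dataset JSON, keeping only the target host."""
    host = urlparse(target_url).netloc.lower()
    if not host:
        raise ValueError(f"target_url must be an absolute URL, got {target_url!r}")

    urls = []
    for item in json.loads(raw_json):
        url = item.get("url", "")
        # Sanity check: drop entries that point at a different host.
        if urlparse(url).netloc.lower() == host:
            urls.append(url)
    return urls
```

The host comparison is exact, so URLs on subdomains of the target site are dropped; relax the check if you want to keep them.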

Getting Started

To run the actor locally, use the following command:

apify run

Deploying to Apify

You can deploy this actor to Apify in two ways:

Connect your Git repository to Apify:
- Go to the Actor creation page
- Click on the "Link Git Repository" button

Push the project from your local machine:
- Log in to Apify: apify login
- Deploy your actor: apify push

For more information, see the Apify documentation.
