[Deprecated] Outdated actor
Deprecated
Pricing
$5.00 / 1,000 results
Last modified: 9 months ago
URL Fetcher for Recent Web Pages
This actor efficiently fetches URLs of web pages that have been recently added or updated on a specified website. It's designed to save time, money, and resources by focusing only on new or changed content, reducing unnecessary crawling and processing of already known pages.
Why Use This Actor?
- Efficiency: By fetching only recent changes, you significantly reduce crawling time and resource usage.
- Cost-effective: Less data to process means lower computing costs and storage requirements.
- Up-to-date: Ensures you're always working with the latest content from your target website.
- Flexible: Can be easily integrated into larger workflows for content monitoring, indexing, or analysis.
How It Works
- The actor takes a target URL and a time frame (in days) as input.
- It searches for pages on the specified website that have been added or updated within that time frame.
- The actor returns a list of URLs for these recent pages.
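The time-window step can be illustrated with a small sketch. How the actor discovers pages and their modification dates is not described here, so the page records and the `last_modified` field below are hypothetical; only `days_ago` and `max_results` mirror the actor's `daysAgo` and `maxResults` inputs.

```python
from datetime import datetime, timedelta, timezone


def recent_urls(pages, days_ago=1, max_results=50, now=None):
    """Return URLs of pages changed within the last `days_ago` days.

    `pages` is a hypothetical list of dicts with "url" and "last_modified"
    (a timezone-aware datetime); it is not the actor's real data model.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days_ago)
    # Keep pages modified at or after the cutoff, newest first.
    fresh = sorted(
        (p for p in pages if p["last_modified"] >= cutoff),
        key=lambda p: p["last_modified"],
        reverse=True,
    )
    # Cap the result the same way maxResults caps the actor's output.
    return [p["url"] for p in fresh[:max_results]]
```

Passing `now` explicitly keeps the function deterministic, which makes the cutoff logic easy to check against known dates.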
Input Configuration
- url: The target website URL to search for recent pages.
- daysAgo: Number of days to look back for changes (default: 1).
- maxResults: Maximum number of results to fetch (default: 50).
- language: Language for search results (default: 'en').
- country: Country for search results (default: 'us').
- searchDomain: Domain to use for search (default: 'com').
- nextRunId: ID of the next task to run (optional).
- nextRunAttribute: Attribute to replace in the next run (default: 'startUrls', optional).
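As a rough sketch of how these defaults combine with caller-supplied values, the helper below merges overrides onto the defaults listed above. The `build_input` function is hypothetical and not part of the actor; treating `url` as required is also an assumption, since the list does not give it a default.

```python
# Defaults as listed above; nextRunId has no default and is left out.
DEFAULTS = {
    "daysAgo": 1,
    "maxResults": 50,
    "language": "en",
    "country": "us",
    "searchDomain": "com",
    "nextRunAttribute": "startUrls",
}


def build_input(url, **overrides):
    """Merge caller overrides onto the documented defaults (hypothetical helper)."""
    if not url:
        raise ValueError("url is required")  # assumption: url has no default
    # Reject misspelled field names early instead of silently ignoring them.
    unknown = set(overrides) - set(DEFAULTS) - {"nextRunId"}
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    return {"url": url, **DEFAULTS, **overrides}
```

For example, `build_input("https://www.example-news-site.com", daysAgo=2, maxResults=100)` yields the basic configuration with the remaining fields filled in from the defaults.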
Output
The actor outputs a list of URLs to recently added or updated pages on the target website. This data is stored in the default dataset associated with the actor run.
Usage Tips
Moving Window Approach
To avoid missing updates, use a moving window approach: set the daysAgo parameter to a value slightly larger than the interval between your crawls.
For example:
- If you run the actor daily, set daysAgo to 2 or 3.
- If you run it weekly, set daysAgo to 8 or 9.
This overlapping window helps catch pages that were added or updated shortly after your last crawl, reducing the chance that content is missed because of delayed indexing or late updates.
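A side effect of overlapping windows is that consecutive runs can return the same URL more than once. The sketch below derives the look-back value from the crawl interval and filters out URLs already handled. Both helpers and the `seen` set are illustrative assumptions; how you persist `seen` between runs is up to your workflow.

```python
def days_ago_for(interval_days, overlap_days=1):
    """Look-back window: the crawl interval plus a small overlap."""
    if interval_days < 1 or overlap_days < 0:
        raise ValueError("interval_days must be >= 1 and overlap_days >= 0")
    return interval_days + overlap_days


def new_urls(fetched, seen):
    """Return URLs not seen in earlier runs, recording them in `seen`."""
    fresh = []
    for url in fetched:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

With a daily schedule, `days_ago_for(1)` gives 2; with a weekly schedule and a two-day overlap, `days_ago_for(7, 2)` gives 9, matching the values suggested above.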
Integrating with Other Actors
To process the fetched URLs with another actor:
- Create a task for the subsequent actor you want to use.
- In this URL Fetcher actor's input, specify the nextRunId (the ID of the task you created) and nextRunAttribute (usually 'startUrls').
- When this actor finishes, it will automatically start the task you specified, passing the fetched URLs as input.
Alternatively, you can access the output directly from the dataset and use it as needed in your workflows.
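If you wire the handoff yourself from the dataset, the replacement of the chosen attribute might look like the sketch below. The `next_task_input` helper is hypothetical, and the `[{"url": ...}]` shape for the replaced attribute is an assumption based on the output format shown in this document; the actor's own handoff logic is not described here.

```python
def next_task_input(task_input, urls, attribute="startUrls"):
    """Return a copy of `task_input` with `attribute` replaced by the fetched URLs."""
    updated = dict(task_input)  # shallow copy so the caller's dict is untouched
    # Assumed shape: one {"url": ...} object per fetched URL.
    updated[attribute] = [{"url": u} for u in urls]
    return updated
```

Returning a copy rather than mutating the original keeps the saved task input reusable across runs.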
Benefits
- Focused crawling saves bandwidth and reduces load on target servers.
- Smaller datasets mean faster processing and analysis downstream.
- Ideal for monitoring frequently updated sites or tracking new content across multiple sources.
- Easily integrates into larger data pipelines or content monitoring systems.
Example Inputs and Outputs
Example Input Configurations
- Basic configuration:
{
  "url": "https://www.example-news-site.com",
  "daysAgo": 2,
  "maxResults": 100
}
- Advanced configuration with next actor task:
{
  "url": "https://www.example-e-commerce.com",
  "daysAgo": 3,
  "maxResults": 200,
  "language": "de",
  "country": "de",
  "searchDomain": "de",
  "nextRunId": "your-task-id",
  "nextRunAttribute": "startUrls"
}
Example Output
The actor will output a list of URLs in the following format:
[
  { "url": "https://www.example-news-site.com/article/breaking-news-2024-08-20" },
  { "url": "https://www.example-news-site.com/article/tech-innovation-unveiled" },
  { "url": "https://www.example-news-site.com/article/sports-update-championship" },
  ...
]
These URLs represent pages that have been added or updated within the specified time frame on the target website.
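Once the dataset items are available as JSON, extracting the plain URL list takes a few lines. This is a minimal sketch that assumes the items follow the format above; `urls_from_items` is a hypothetical helper, and the sample data is copied from the example output.

```python
import json


def urls_from_items(items):
    """Collect the "url" field from dataset items, skipping malformed entries."""
    return [
        item["url"]
        for item in items
        if isinstance(item, dict) and item.get("url")
    ]


# Sample items in the documented output format.
raw = """[
  {"url": "https://www.example-news-site.com/article/breaking-news-2024-08-20"},
  {"url": "https://www.example-news-site.com/article/tech-innovation-unveiled"}
]"""
urls = urls_from_items(json.loads(raw))
```

Skipping entries without a `url` field is a defensive choice for this sketch, not documented behavior of the actor.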
Getting Started
To run the actor locally, use the following command:
apify run
Deploying to Apify
You can deploy this actor to Apify in two ways:
- Connect your Git repository to Apify:
  - Go to the Actor creation page
  - Click on the "Link Git Repository" button
- Push the project from your local machine:
  - Log in to Apify:
    apify login
  - Deploy your actor:
    apify push
For more information, see the Apify documentation.