Blog / Dated Content Crawler

diarmuidr/blog-content-crawler
Crawl an entire blog or knowledge base, or filter to just the new content. Filtering pages by date helps keep AI queries relevant.

Developer
Maintained by Community

Actor Metrics

  • 9 Monthly users

  • 5.0 / 5 (2)

  • 3 bookmarks

  • 97% runs succeeded

  • Created in Feb 2025

  • Modified 17 hours ago

This actor crawls blogs and other dated-content websites. You can filter content by its publish date and keep only the content that is newer than a date you select. This is useful for AI applications, where it helps you avoid training a model or feeding an LLM with old, outdated, or irrelevant data.

It is also useful for any application that needs to download data from websites such as documentation, help articles, or your knowledge base.

How it works

  • Enter the URL(s) (startUrls) of the pages or site you want to crawl.
  • Optional: Enter a start date (startDate) or, more likely, a "Relative" start date (relativeStartDate) to filter the content by. "Relative" means that you can enter a value like "1 month" or "2 years", and the crawler calculates the date relative to the current date each time it runs.
  • Run the crawler.
  • The crawler retrieves only the pages that are newer than the start date (startDate) you entered, or all the pages if you don't enter a start date.
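
To make the "Relative" start date idea concrete, here is a minimal Python sketch of how a value like "1 month" or "2 years" could be turned into a concrete date on each run and then used as a filter. It is not the actor's own code: the function names, the accepted units, the `published` field, and the choice to treat "newer than" as strictly after the start date are all assumptions made for illustration.

```python
import calendar
import re
from datetime import date, timedelta


def resolve_relative_start_date(relative: str, today: date) -> date:
    """Turn a string such as "1 month" or "2 years" into a date before `today`.

    Hypothetical helper: the unit names accepted here are an assumption.
    """
    match = re.fullmatch(r"\s*(\d+)\s*(day|week|month|year)s?\s*", relative.lower())
    if match is None:
        raise ValueError(f"unsupported relative date: {relative!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "day":
        return today - timedelta(days=amount)
    if unit == "week":
        return today - timedelta(weeks=amount)
    # Months and years: step back whole calendar months and clamp the day,
    # so that 31 March minus 1 month lands on the last day of February.
    months = amount if unit == "month" else amount * 12
    total = today.year * 12 + (today.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(today.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def keep_newer(pages, start_date):
    """Keep pages published after `start_date`; keep everything if it is None.

    Assumes each page is a dict with a `published` date (hypothetical shape).
    """
    if start_date is None:
        return list(pages)
    return [page for page in pages if page["published"] > start_date]


if __name__ == "__main__":
    # The cutoff is recomputed from the current date on every run.
    cutoff = resolve_relative_start_date("1 month", date.today())
    sample_pages = [
        {"url": "https://example.com/blog/old-post", "published": date(2020, 1, 1)},
        {"url": "https://example.com/blog/new-post", "published": date.today()},
    ]
    for page in keep_newer(sample_pages, cutoff):
        print(page["url"])
```

Because the cutoff is recalculated from the current date each time, a scheduled run with a relative start date of "1 month" always covers a rolling one-month window, whereas a fixed startDate keeps the same cutoff on every run.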

More Details