Website Content Crawler

Developed and maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
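For example, crawl results can be loaded straight into LangChain. A minimal sketch using ApifyWrapper from langchain_community (the start URL is a placeholder, and APIFY_API_TOKEN must be set in the environment):

from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

apify = ApifyWrapper()  # reads APIFY_API_TOKEN from the environment

# Run the crawler and map each dataset item to a LangChain Document.
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://docs.apify.com"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("text") or "",
        metadata={"source": item.get("url", "")},
    ),
)

docs = loader.load()  # Documents ready for splitting, embedding, or a RAG index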

Rating: 3.7 (41)
Pricing: Pay per usage
Total users: 58K
Monthly users: 8.1K
Runs succeeded: >99%
Issues response: 7.6 days
Last modified: 3 hours ago


Feature Request: Automatic Recrawling Within Same Task Run for RAG System Integration

Closed

sprouto_net opened this issue 2 months ago

I'm currently using the Website Content Crawler in my Retrieval-Augmented Generation (RAG) system to extract and process website content. My objective is to automatically recrawl pages after a fixed duration, such as 30 days, all within the same task run, so that my system stays up to date with the latest information. Could you please advise whether such a feature is currently supported, or whether there are recommended approaches to achieve this? If not, I would like to suggest it as a feature request for future development.

jakub.kopecky

Hi,

Thank you for using Website Content Crawler!

For your use case, you can use Apify Schedules (https://docs.apify.com/platform/schedules) to schedule monthly task runs. Additionally, you can set up a webhook integration in the task (https://docs.apify.com/platform/integrations) to receive notifications when the task completes, and then retrieve the results.

Let me know if this works for you,

Jakub
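For reference, a minimal sketch of this setup using the Apify Python client (the token, task ID, schedule name, and webhook URL are all placeholders):

from apify_client import ApifyClient

client = ApifyClient("MY-APIFY-TOKEN")

# Run the crawler task on the first day of every month at midnight.
schedule = client.schedules().create(
    name="monthly-recrawl",
    cron_expression="0 0 1 * *",
    is_enabled=True,
    is_exclusive=True,
    actions=[{"type": "RUN_ACTOR_TASK", "actorTaskId": "YOUR_TASK_ID"}],
)

# Notify your RAG pipeline whenever a scheduled run finishes successfully.
client.webhooks().create(
    event_types=["ACTOR.RUN.SUCCEEDED"],
    request_url="https://example.com/apify-webhook",
    actor_task_id="YOUR_TASK_ID",
)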

jiri.spilka

I'll go ahead and close this issue for now, but let me add information about Apify's vector database support.

To support automatic content updates in vector databases like Pinecone, Apify provides a dedicated integration: 👉 Pinecone Integration

This integration automates the transfer of datasets (e.g., from Website Content Crawler) into Pinecone — or other supported vector DBs — with options for:

  • Chunking long text for embedding
  • Custom metadata mapping
  • Delta updates (only changed/new records)
  • Efficient document identification using url or other fields

After a crawl, you get a dataset like:

{
  "url": "https://example.com",
  "text": "...long page content...",
  "metadata": { "title": "Example Page" }
}
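These items can be read back with the Apify Python client, for example from the webhook handler mentioned above (a minimal sketch; the token and dataset ID are placeholders):

from apify_client import ApifyClient

client = ApifyClient("MY-APIFY-TOKEN")

# Stream all crawl results from the run's dataset.
for item in client.dataset("YOUR_DATASET_ID").iterate_items():
    print(item["url"], len(item["text"]))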

Then configure the integration like this:

{
  "datasetFields": ["text"],
  "metadataDatasetFields": { "title": "metadata.title" },
  "performChunking": true,
  "chunkSize": 1000,
  "chunkOverlap": 0,
  "dataUpdatesStrategy": "deltaUpdates",
  "dataUpdatePrimaryDatasetFields": ["url"],
  "usePineconeIdPrefix": true
}

This ensures:

  • Only updated content is re-indexed
  • Records are uniquely identified (e.g., by URL)
  • No duplicates or wasteful reprocessing

Besides Pinecone, Apify also offers similar integrations for Qdrant, Milvus, PGVector, and other databases.

You can find these in the Apify Store.
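As a rough sketch, the integration can also be invoked programmatically with the Apify Python client. The Actor ID and the Pinecone and embeddings fields below are assumptions, so verify them against the integration's input schema in the Store before relying on them:

from apify_client import ApifyClient

client = ApifyClient("MY-APIFY-TOKEN")

# Actor ID and the Pinecone/embeddings fields are assumptions; check the
# integration's input schema in the Apify Store before using them.
client.actor("apify/pinecone-integration").call(
    run_input={
        "datasetId": "YOUR_DATASET_ID",  # dataset produced by the crawl
        "pineconeApiKey": "YOUR_PINECONE_API_KEY",  # assumed field name
        "pineconeIndexName": "YOUR_INDEX_NAME",     # assumed field name
        "embeddingsProvider": "OpenAI",             # assumed field name
        "embeddingsApiKey": "YOUR_OPENAI_API_KEY",  # assumed field name
        # Configuration fields shown above:
        "datasetFields": ["text"],
        "metadataDatasetFields": {"title": "metadata.title"},
        "performChunking": True,
        "chunkSize": 1000,
        "chunkOverlap": 0,
        "dataUpdatesStrategy": "deltaUpdates",
        "dataUpdatePrimaryDatasetFields": ["url"],
        "usePineconeIdPrefix": True,
    },
)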