
Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Rating: 3.7 (41)
Pricing: Pay per usage
Total users: 58K
Monthly users: 8.1K
Runs succeeded: >99%
Issues response: 7.6 days
Last modified: 3 hours ago
Feature Request: Automatic Recrawling Within Same Task Run for RAG System Integration
Closed
I'm currently using the Website Content Crawler in my Retrieval-Augmented Generation (RAG) system to extract and process website content. My objective is to automatically recrawl pages after a fixed duration of time, such as 30 days, all within the same task run, so that my system stays up to date with the latest information. Could you please advise whether such a feature is currently supported, or whether there are recommended approaches to achieve this functionality? If not, I would like to suggest this as a feature request for future development.

Hi,
Thank you for using Website Content Crawler!
For your use case, you can use Apify Schedules (https://docs.apify.com/platform/schedules) to schedule monthly task runs. Additionally, you can set up a webhook integration in the task (https://docs.apify.com/platform/integrations) to receive notifications when the task completes, and then retrieve the results.
Let me know if this works for you,
Jakub
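
For illustration (not part of the original reply), here is a minimal sketch of the retrieval step Jakub describes, using the apify-client Python package. The token and dataset ID are placeholders; in practice, the dataset ID would come from the webhook payload, whose `resource` object for run events includes `defaultDatasetId`:

```python
from apify_client import ApifyClient

# Placeholder token; in practice, read it from an environment variable.
client = ApifyClient("YOUR_APIFY_TOKEN")

# When the scheduled task run finishes, the webhook payload's `resource`
# object includes `defaultDatasetId`; use it to fetch the crawled pages.
items = client.dataset("DEFAULT_DATASET_ID").list_items().items

for item in items:
    # Website Content Crawler records contain the page URL and extracted text.
    print(item["url"], item.get("text", "")[:100])
```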

I'll go ahead and close this issue for now, but let me add information about Apify's vector database support.
To support automatic content updates in vector databases like Pinecone, Apify provides a dedicated integration: 👉 Pinecone Integration
This integration automates the transfer of datasets (e.g., from Website Content Crawler) into Pinecone — or other supported vector DBs — with options for:
- Chunking long text for embedding
- Custom metadata mapping
- Delta updates (only changed/new records)
- Efficient document identification using `url` or other fields
After a crawl, you get a dataset like:
{"url": "https://example.com","text": "...long page content...","metadata": { "title": "Example Page" }}
Then configure the integration like this:
{"datasetFields": ["text"],"metadataDatasetFields": {"title": "metadata.title"},"performChunking": true,"chunkSize": 1000,"chunkOverlap": 0,"dataUpdatesStrategy": "deltaUpdates","dataUpdatePrimaryDatasetFields": ["url"],"usePineconeIdPrefix": true}
This ensures:
- Only updated content is re-indexed
- Records are uniquely identified (e.g., by URL)
- No duplicates or wasteful reprocessing
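
As a hedged illustration (not from the thread), the same configuration could also be supplied when invoking the integration programmatically via the Python apify-client. The actor ID and the `datasetId` input field are assumptions and should be checked against the integration's input schema in the Apify Store:

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Actor ID assumed from Apify's naming convention; verify it in the Apify Store.
run = client.actor("apify/pinecone-integration").call(
    run_input={
        # Dataset produced by a Website Content Crawler run (placeholder ID;
        # input field name assumed).
        "datasetId": "DATASET_ID_FROM_CRAWL",
        "datasetFields": ["text"],
        "metadataDatasetFields": {"title": "metadata.title"},
        "performChunking": True,
        "chunkSize": 1000,
        "chunkOverlap": 0,
        "dataUpdatesStrategy": "deltaUpdates",
        "dataUpdatePrimaryDatasetFields": ["url"],
        "usePineconeIdPrefix": True,
        # Real runs also need Pinecone connection settings (field names assumed):
        # "pineconeApiKey": "...", "pineconeIndexName": "...",
    }
)
print(run["status"])  # e.g., "SUCCEEDED"
```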
Besides Pinecone, Apify also offers similar integrations for Qdrant, Milvus, PGVector, and other databases.
You can find these in the Apify Store.