Pricing

Pay per usage

Website Content Crawler

apify/website-content-crawler

Developed by

Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

4.6 (38)

1.1k

Monthly users

Runs succeeded

>99%

Response time

2.3 days

Last modified

7 days ago

AI Developer tools

"htmlUrl" points to the same URL for multiple pages

Closed

cirez_d opened this issue

Hello, we have encountered the following issue in an Actor run:

Several pages share the same htmlUrl value, so their saved HTML is duplicated and the actual content of those pages is missing.
In the dataset, the htmlUrl values are recorded as identical even though each affected crawled URL has different content.
Could the URLs be truncated at some point during saving because of their length, causing the htmlUrl to be duplicated?

You can Ctrl-F for this htmlUrl (please delete after you have viewed this)

The status code is 200 and the loaded URL matches the crawled URL for each page, but the content behind htmlUrl is the same across pages.

This is currently a significant issue for us. I would greatly appreciate your prompt support and any suggestions for a temporary workaround.

Thank you!
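For anyone who wants to check their own dataset for the same symptom, here is a minimal sketch that groups exported dataset items by `htmlUrl` and reports values shared by more than one page. It assumes the items were exported as a list of dictionaries and that each item carries the crawled address in a `url` field; the sample records and the `find_shared_html_urls` helper are hypothetical.

```python
from collections import defaultdict


def find_shared_html_urls(items):
    """Return {htmlUrl: [page URLs]} for every htmlUrl used by more than one page."""
    pages_by_html_url = defaultdict(set)
    for item in items:
        html_url = item.get("htmlUrl")
        if html_url:
            # "url" as the crawled-page field is an assumption about the export format.
            pages_by_html_url[html_url].add(item.get("url", "<unknown>"))
    return {
        html_url: sorted(pages)
        for html_url, pages in pages_by_html_url.items()
        if len(pages) > 1
    }


# Hypothetical sample records, not taken from the run discussed in this issue.
items = [
    {"url": "https://example.com/a?x=1", "htmlUrl": "https://store.example/records/KEY-1"},
    {"url": "https://example.com/a?x=2", "htmlUrl": "https://store.example/records/KEY-1"},
    {"url": "https://example.com/b", "htmlUrl": "https://store.example/records/KEY-2"},
]

shared = find_shared_html_urls(items)
for html_url, pages in shared.items():
    print(html_url, "is shared by", len(pages), "pages")
```

Any `htmlUrl` that shows up in the result points to a record that was overwritten by a later page.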

Jakub Kopecký (jakub.kopecky)

Hi, thank you for using the Website Content Crawler.

There is an issue with key (URL) truncation when saving into the key-value store that affects URLs longer than 145 characters. This will be fixed, and I will keep you updated.

Jakub
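To illustrate why truncation produces this symptom, the sketch below shows two long URLs that differ only after character 145 collapsing to the same key once they are cut to that length. The 145-character threshold comes from the reply above; how the Actor actually derives its keys is not shown in this thread, so `truncated_key` is a hypothetical stand-in. The `hashed_key` alternative is only one common way to avoid such collisions and is not necessarily the fix that was released.

```python
import hashlib

MAX_KEY_LENGTH = 145  # threshold mentioned in the reply; the real key derivation is not shown


def truncated_key(url, limit=MAX_KEY_LENGTH):
    """Hypothetical stand-in: derive a storage key by cutting the URL to a fixed length."""
    return url[:limit]


def hashed_key(url):
    """Illustrative alternative: a fixed-length key that depends on the whole URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


# Two hypothetical URLs that share a long prefix and differ only at the very end.
base = "https://example.com/search?q=" + "a" * 150
url_one = base + "&page=1"
url_two = base + "&page=2"

# Both URLs are longer than the limit, so the differing suffix is cut off.
print(truncated_key(url_one) == truncated_key(url_two))
# The hash covers the full URL, so the two keys stay distinct.
print(hashed_key(url_one) == hashed_key(url_two))
```

With truncated keys the second page overwrites the first page's record, which matches the duplicated `htmlUrl` values reported above.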

Jakub Kopecký (jakub.kopecky)

Hi,

The fix was released in the beta build of the Actor.

Please try running it with the beta build and let me know.

Jakub

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage.

Fast Website Content Crawler

6sigmag/fast-website-content-crawler

A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

475

5.0

Deep Website Content Crawler

6sigmag/deep-website-content-crawler

Scrape Failed Killer! A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

224

AI Website Content Markdown Scraper

quaking_pail/ai-website-content-markdown-scraper

This Apify Actor, "Website Content Crawler with Markdown Extraction," is designed to perform a comprehensive crawl of specified websites, extract their text content, convert it into Markdown format, and store it in a structured dataset. The extracted content is suitable for feeding LLMs.

AI_Builder

460

/llms.txt Generator

jakub.kopecky/llmstxt-generator

The /llms.txt Generator 🕸️📄 extracts website content to create an llms.txt file for AI apps 🤖✨ like LLM fine-tuning and indexing. Output is available 📥 in the Key-Value Store for easy download and integration into workflows. 🚀

Jakub Kopecký

5.0

Webpage Singer 🎶

josef.prochazka/webpage-singer

Ever wondered what a website would sound like as a song? This Actor takes any webpage, turns its content into lyrics, and transforms it into a track in your favorite genre. Just drop in a URL, pick a style, and let the AI do the rest.

Josef Procházka

5.0

Backlink Opportunity Finder

easyapi/backlink-opportunity-finder

🔍 Discover high-quality backlink opportunities to boost your domain authority and search rankings. Extract valuable data about potential websites for building authoritative backlinks, including domain metrics, relevance analysis, and estimated SEO impact.

EasyApi

Example Website Screenshot Crawler

dz_omar/example-website-screenshot-crawler

Automated website screenshot crawler using Pyppeteer and Apify. This open-source actor captures screenshots from specified URLs, uploads them to the Apify Key-Value Store, and provides easy access to the results, making it ideal for monitoring website changes and archiving web content.

Abdlhakim hefaia

Cheerio Scraper

apify/cheerio-scraper

Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This Actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.

Apify

6.9k

4.7

Puppeteer Scraper

apify/puppeteer-scraper

Crawls websites with headless Chrome and the Puppeteer library using provided server-side Node.js code. This crawler is an alternative to apify/web-scraper that gives you finer control over the process. Supports both recursive crawling and lists of URLs. Supports logging in to websites.

Apify

5.9k

5.0

Web Scraper

apify/web-scraper

Crawls arbitrary websites using the Chrome browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

Apify

79.2k

4.5