Pricing

Pay per usage


Website Content Crawler


apify/website-content-crawler

Developed by

Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.


Crawler not crawling full page data

Closed

speakeasy_marketing opened this issue

I have a few websites with the same format. They all have text divided by CTAs with a "tel" attribute. Check out https://www.lusbylaw.com/civil-disputes-lawyer-rocky-mount-north-carolina/ and https://www.citizenslawgroup.com/bankruptcy-lawyer-evanston-illinois/ to see what I mean.

In every case, the crawler only scrapes the part before the first CTA. I've tried this with at least 4 websites. The rest of the text, all the way to the bottom of the page, is left unscraped.

Maybe I'm not doing something right in the settings. I'm using the default settings with a max crawl depth of 0 and max 4 pages. Everything else is default. Just putting this here in case it turns out to be a genuine bug and not just me being ignorant.

linkdev

I am facing the same issue as well.

Jakub Kopecký (jakub.kopecky)

Hi, thank you for using Website Content Crawler.

This might be caused by the behavior of the HTML transformer. Try setting HTML processing -> HTML transformer to None. See my run, where the crawler scraped the whole content of the page: https://console.apify.com/view/runs/9I0i85YcFkmpoFafs

Let me know if that helps,

Jakub
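For anyone starting the Actor through the API rather than the Console, the suggestion above could be sketched roughly as follows. This is a minimal sketch, not the official recipe: the input field names (`startUrls`, `maxCrawlDepth`, `maxCrawlPages`, `htmlTransformer`) and the value `"none"` are assumptions about the Actor's input schema, so check them against the Actor's Input tab. The helper names `build_run_input` and `run_crawl` are made up for illustration, and `run_crawl` needs the third-party `apify-client` package and a valid API token.

```python
import json

ACTOR_ID = "apify/website-content-crawler"


def build_run_input(urls, html_transformer="none", max_crawl_depth=0, max_crawl_pages=4):
    """Build an Actor input that mirrors the settings discussed in this thread.

    Field names are assumed from the Actor's input schema; verify them
    in the Console before relying on this.
    """
    return {
        "startUrls": [{"url": u} for u in urls],
        "maxCrawlDepth": max_crawl_depth,      # 0 = only the start URLs
        "maxCrawlPages": max_crawl_pages,      # same limit as in the report
        "htmlTransformer": html_transformer,   # "none" = HTML transformer set to None
    }


def run_crawl(token, run_input):
    """Start the Actor, wait for it to finish, and return the dataset items.

    Imported lazily so the input builder works without apify-client installed.
    """
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    run_input = build_run_input([
        "https://www.lusbylaw.com/civil-disputes-lawyer-rocky-mount-north-carolina/",
    ])
    # Print the input so it can be pasted into the Console's JSON input editor.
    print(json.dumps(run_input, indent=2))
```

Running the script only prints the input JSON. To actually crawl, call `run_crawl("<YOUR_API_TOKEN>", run_input)` and check whether the text of the returned items now extends past the first CTA.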


