Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

page scrolling

Open

kempt_trophy opened this issue

The data I need appears only when I scroll. How do I customize this in this Actor?

Jiří Spilka (jiri.spilka)

Hi, thank you for using Website Content Crawler.

I reached out internally, and all credit goes to @jindrichbar for finding a possible solution.

He made a few specific changes to the settings:

"crawlerType": "playwright:firefox" (was previously Chrome)
"dynamicContentWaitSecs": 20
"htmlTransformer": "none"
"removeCookieWarnings": false
"removeElementsCssSelector": ".i.want.everything"
"requestTimeoutSecs": 60
"useSitemaps": false
"waitForSelector": "[data-icon=\"clipboard\"]"
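For readers who start runs from code rather than the Console, the settings listed above can be collected into a single input object. The following is a minimal sketch, not the run configuration used in the thread: the start URL is a hypothetical placeholder (the thread does not show the crawled page), `build_run_input` is an illustrative helper name, and the `startUrls` field is assumed to follow the Actor's usual input format.

```python
import json

# Hypothetical placeholder: the thread does not show the page that was crawled.
START_URL = "https://example.com/page-to-crawl"


def build_run_input(start_url: str,
                    wait_for_selector: str = '[data-icon="clipboard"]') -> dict:
    """Return an Actor input dict carrying the settings quoted in the reply."""
    return {
        # Assumed field: list of start URLs in the Actor's usual input shape.
        "startUrls": [{"url": start_url}],
        # The eight settings below are copied from the reply.
        "crawlerType": "playwright:firefox",
        "dynamicContentWaitSecs": 20,
        "htmlTransformer": "none",
        "removeCookieWarnings": False,
        "removeElementsCssSelector": ".i.want.everything",
        "requestTimeoutSecs": 60,
        "useSitemaps": False,
        "waitForSelector": wait_for_selector,
    }


if __name__ == "__main__":
    # Print the input as JSON, e.g. to paste into the Actor's JSON input editor.
    print(json.dumps(build_run_input(START_URL), indent=2))
```

If the `apify-client` Python package is installed, a dict like this could presumably be passed as `run_input` to `ApifyClient(token).actor("apify/website-content-crawler").call(...)`; that call is left out here so the sketch runs without a token or network access.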

Please see his example run here.

The ideal approach would be to retrieve the data using the Perplexity AI Actor. However, I noticed that you’ve already raised an issue there without a solution yet.

I hope this solution works for you for now. I’ll go ahead and close this issue, but please feel free to reach out with any other questions.

kempt_trophy

Thanks for the reply. If you don't mind, can you help me figure out why I'm not getting the results I want with your settings?

https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/runs/uq3mwiAXydfARqGCz

Jiří Spilka (jiri.spilka)

For this URL, I had to change the selector to "waitForSelector": "[data-icon=\"arrow-up\"]" (as [data-icon="clipboard"] was not present). Please see my example run for reference.
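Because the selector differs from page to page, it can help to generate the `waitForSelector` value instead of hand-escaping the quotes. The sketch below uses a hypothetical helper, `data_icon_selector`, and lets `json.dumps` handle the escaping. Joining several attribute selectors with a comma is standard CSS selector-list syntax; whether the Actor accepts such a list in `waitForSelector` is an untested assumption, not something confirmed in this thread.

```python
import json


def data_icon_selector(*icon_names: str) -> str:
    """Build a CSS selector (or selector list) matching any given data-icon value."""
    return ", ".join(f'[data-icon="{name}"]' for name in icon_names)


# Selector used for the second URL in the thread.
single = data_icon_selector("arrow-up")

# Assumption: a selector list that matches whichever icon is present on the page.
either = data_icon_selector("clipboard", "arrow-up")

# json.dumps escapes the inner double quotes for the JSON input.
print(json.dumps({"waitForSelector": single}))
print(json.dumps({"waitForSelector": either}))
```

The first line printed is `{"waitForSelector": "[data-icon=\"arrow-up\"]"}`, which matches the escaped form used in the original settings list.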

I agree this solution is quite brittle, and using the Perplexity API would likely be a more convenient approach. Another option could be to use the RAG Web Browser Actor paired with an LLM Actor (though this one hasn’t been released yet).

I’d love to hear more about your use case if you don’t mind sharing it!

Please feel free to ask any additional questions.

kempt_trophy

I send a query to the AI and get the data back in its message; I only need to retrieve that data. Since the Actor I need is not working, I decided to use this one to get the results. I would like to get only the required result.

Also my link tends to be much longer. For example, like here:

https://console.apify.com/view/runs/qFOgaOMpqntQZpuzs

Jiří Spilka (jiri.spilka)

Thank you for sharing! The URL length shouldn’t be an issue.

kempt_trophy

https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/runs/w09Th2kXq6cF7Hjtx#output

The length of the link seems to make a difference. I don't get the Perplexity response data. Please help me understand.
