Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
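For example, the crawler's output can be loaded straight into LangChain documents. A minimal sketch, assuming the langchain-community Apify integration and an APIFY_API_TOKEN environment variable (import paths vary across LangChain versions, and the start URL is just a placeholder):

```python
import os
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

os.environ.setdefault("APIFY_API_TOKEN", "<your Apify API token>")

apify = ApifyWrapper()

# Run Website Content Crawler and map each result item to a LangChain Document.
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("text", ""),
        metadata={"source": item.get("url")},
    ),
)

docs = loader.load()  # Documents ready for a vector store or RAG pipeline
```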
I just scraped this URL:
https://www.paloaltonetworks.com/network-security/pan-os
And all I got was this:
Title: PAN-OS Software | ML-Powered NGFW Core Capabilities - Palo Alto Networks
Description: PAN-OS® is the software that runs Palo Alto Networks Next-Generation Firewalls. Featuring App-ID™, User-ID™, Device-ID™, Identity Security, Device Identity, SSL, and TLS Decryption and Cloud Identity Management.
PAN-OS Software | ML-Powered NGFW Core Capabilities
Eleven Years as a Leader
Doesn’t Happen by Magic
For the 11th straight year, we’ve been named a Leader in
the Gartner® Magic Quadrant™ for Network Firewalls.
The way we see it, that’s what happens when you
innovate to stop the most sophisticated threats
Get the report
Hello, and thank you for your interest in this Actor!
This is a common issue: by default, Website Content Crawler uses the Readable Text HTML extractor, which is sometimes too eager to remove content. Turning the HTML extractor off keeps more of the page. To do this, go to HTML Processing and set HTML Transformer to None.
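If you call the Actor through the API rather than the Console, the same setting can be passed in the run input. A minimal sketch with the Apify Python client follows; the htmlTransformer field name and the "none" value are taken from the Actor's input schema as I understand it, so verify them against the current schema:

```python
import os
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])

run_input = {
    "startUrls": [{"url": "https://www.paloaltonetworks.com/network-security/pan-os"}],
    # Equivalent of "HTML transformer: None" in the Console UI
    # (field name and value assumed from the input schema - double-check).
    "htmlTransformer": "none",
}

run = client.actor("apify/website-content-crawler").call(run_input=run_input)

# Extracted page text is stored in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"), len(item.get("text", "")))
```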
Feel free to check my run with this input option fixed: https://console.apify.com/view/runs/se9ind7JUrhz00HJq
I'll close this issue now, but feel free to ask additional questions if you have any.
Cheers!
That did solve the problem with that site. However, this URL (https://www.bigcommerce.com/) runs for 300 seconds and then ends, returning nothing. The 300-second timeout is, I believe, partly set by my code, but in any case it is long enough, and I would have preferred to get whatever data was found in the first 300 seconds rather than an error and nothing. Is there a way to configure the API call so that it returns something after n seconds even if it is not yet "finished" normally?
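For reference, a simplified sketch of the kind of call I mean, using the Apify Python client (wait_secs caps how long .call() blocks; whether the dataset then contains partial items for a failing start URL is exactly what I'm asking about):

```python
import os
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])

# Stop waiting after 300 seconds even if the run has not finished;
# the run itself keeps going (or times out) on the Apify platform.
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://www.bigcommerce.com/"}]},
    wait_secs=300,
)

# Read whatever items have been pushed to the default dataset so far.
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
print(f"{len(items)} pages after the wait window, run status: {run['status']}")
```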
Hi, there seems to be an issue with the above-mentioned URL - it’s failing to load, which is causing the crawl to fail. We’ll need to debug this.
In the meantime, you can enable Consider URLs from Sitemaps to get the crawl started.
Maybe you are testing different configurations right now, but for faster crawling, consider increasing the Max concurrency from 1 to 200.
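If you set these via the API run input rather than the Console, the two options map to fields roughly like this (field names assumed from the UI labels, so please check the Actor's input schema):

```python
run_input = {
    "startUrls": [{"url": "https://www.bigcommerce.com/"}],
    "useSitemaps": True,     # "Consider URLs from Sitemaps" in the UI (assumed field name)
    "maxConcurrency": 200,   # "Max concurrency" in the UI (assumed field name)
}
```

You would then pass this run_input to .call() as in the earlier sketch.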
I'll leave this issue open to debug it further.