Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.


Actor runs timing out after message about migration to another host

Closed

jjohnson-dev opened this issue
4 months ago

I've had to abort the last several runs I've attempted because they don't progress after getting to this point in the log. I've spent a ton of credits waiting, watching the logs and ultimately stopped the runs below.

2024-06-20T16:26:14.917Z ACTOR: Notifying Actor process about imminent migration to another host.
2024-06-20T16:26:58.568Z ACTOR: Sending Docker container SIGTERM signal.

https://console.apify.com/view/runs/Gnt7gbSCGAHxdYYLD
https://console.apify.com/view/runs/fIhvJM9MFHlew8hqy
https://console.apify.com/view/runs/CaeLop8FIErGB5lCa
https://console.apify.com/view/runs/puefrlo1EjzXfeIy8

Could you take a look and let me know whether it's my settings or there's an issue on your side? Is there any way we can get some credits back? Thanks!

jindrich.bar

Hello and thank you for your interest in this Actor!

It seems that you're trying to submit URLs from multiple domains in each run. Do you only want to scrape these pages, or do you also want to follow the links from them?

If you only want to scrape the pages at the URLs you're submitting directly, set maxCrawlDepth to 0. You can also turn off the useSitemaps option (set it to false): sitemap processing is what takes most of the time in your runs, and you don't need it if you only want the first level of pages. There is a minor performance issue with sitemap processing in Website Content Crawler (tracked here), but I doubt that's what you're hitting. In your runs, the crawler is simply trying to process too many sitemaps at once. The fix for that issue might still help in your case, though.
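
As a rough illustration of the settings described above, here is a minimal Python sketch that builds such an input. The option names maxCrawlDepth and useSitemaps come from this thread; the startUrls field and its list-of-objects shape, as well as the helper name build_run_input, are assumptions made for the example and should be checked against the Actor's input schema.

```python
from typing import Any


def build_run_input(urls: list[str]) -> dict[str, Any]:
    """Build an input that scrapes only the submitted pages (hypothetical helper)."""
    return {
        # Assumed shape: a list of objects, each with a "url" key.
        "startUrls": [{"url": u} for u in urls],
        # Do not follow links found on the submitted pages.
        "maxCrawlDepth": 0,
        # Skip sitemap discovery and processing.
        "useSitemaps": False,
    }


if __name__ == "__main__":
    import json

    # example.com is a placeholder domain, not one of the URLs from the runs above.
    print(json.dumps(build_run_input(["https://example.com/pricing"]), indent=2))
```

The resulting dictionary could be pasted as JSON into the Actor's input editor or passed to whichever API client you use to start the run.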

If you do want to follow the on-page links, split your run into multiple smaller runs, each with URLs from only one domain. That's what Website Content Crawler is optimized for and how you'll get the best performance out of it.
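
One way to prepare such a split is to group the URL list by hostname before starting the runs. The sketch below uses only the Python standard library; the function name group_urls_by_domain and the example URLs are made up for illustration, and it treats each distinct hostname (for example www.example.com versus example.com) as its own group, which may or may not match what you consider one domain.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_urls_by_domain(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by hostname so each group can go into its own run."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        # urlparse(...).hostname is lowercased and excludes any port.
        host = urlparse(url).hostname or ""
        groups[host].append(url)
    return dict(groups)


if __name__ == "__main__":
    # Placeholder URLs, not taken from the runs discussed in this thread.
    batches = group_urls_by_domain([
        "https://example.com/docs",
        "https://example.org/blog",
        "https://example.com/pricing",
    ])
    for host, batch in batches.items():
        print(host, batch)
```

Each group can then be submitted as the start URLs of a separate, smaller run.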

I will get in touch with our support team regarding the credit reimbursement and will let you know how it went. Cheers!

jindrich.bar

Hello, I just got word from our Support team that the credits lost on these runs were successfully offset in your favor.

I'll close this issue now, but feel free to reopen it (or open a new one) if you encounter any other problems. Cheers!

Developer
Maintained by Apify