Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Actor is not parsing the rest of the pages

Closed

Custombizio opened this issue
21 days ago

Hello, two issues:

  • First: The Actor is not parsing the rest of the website, only the first URL. Run ID: k0Zi66wkCr4NdGuBf. Task: custombizio/hvac-innogreen-solutions. Container URL: https://kyz0to5t3klc.runs.apify.net/

  • Second: My Apify usage is showing $87 / $200, but it says I have used up all my prepaid usage? Anything over the plan limit will be charged as overage? That doesn't make any sense, since I put in extra money to use Apify.

Here is the task I am trying to run: custombizio/hvac-innogreen-solutions

I haven't been able to scrape this site. It is only scraping 1 page: https://natural-resources.canada.ca/energy-efficiency/homes/canada-greener-homes-initiative/canada-greener-homes-grant/canada-greener-homes-grant/23441

Please give me some advice. Thanks, Syed Ali, Custombizio

jiri.spilka

Hi Syed, thank you for trying Website Content Crawler.

Regarding scraping (Run ID: k0Zi66wkCr4NdGuBf): The Actor only crawls sub-pages of the specified start URLs. For example, if you specify http://example.com/blog, it will crawl pages like http://example.com/blog/1 or http://example.com/blog/2, but not http://example.com/new.
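To make the scoping rule concrete, here is a minimal sketch of the check described above. It is an illustration only, not the Actor's actual implementation; the function name `is_under_start_url` is hypothetical, and it assumes matching is done on whole path segments (so /blogroll is not treated as being under /blog).

```python
from urllib.parse import urlsplit


def is_under_start_url(start_url: str, candidate: str) -> bool:
    """Return True if candidate is the start URL itself or a sub-page of it.

    Hypothetical helper: it compares host and path only, and matches on
    whole path segments. The real crawler may apply different rules.
    """
    start, cand = urlsplit(start_url), urlsplit(candidate)
    # A page on a different host is never under the start URL.
    if start.netloc.lower() != cand.netloc.lower():
        return False
    start_path = start.path.rstrip("/")
    cand_path = cand.path.rstrip("/")
    # Either the same page, or a path that continues below the start path.
    return cand_path == start_path or cand_path.startswith(start_path + "/")


print(is_under_start_url("http://example.com/blog", "http://example.com/blog/1"))
print(is_under_start_url("http://example.com/blog", "http://example.com/new"))
```

The first call reports that /blog/1 is in scope and the second that /new is not, which mirrors the example URLs in the reply.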

You need to provide a start URL that is generic enough to cover the desired subpages, or adjust the URL glob settings (such as includeUrlGlobs). Let me know what you’re trying to achieve, and I’ll help you set it up.

For the second issue: I’m not sure I fully understand. What do you mean by "put extra money in"? Here’s how you can check your current usage: go to Apify Console → Billing → Subscription. There is a column labeled "Next Invoice." Click View Breakdown, which explains that your usage includes the prepaid amount from your subscription plan and the redeemed coupon. However, you’ve used the platform beyond the subscription plan plus the redeemed coupon, so the additional usage will be added to your invoice.

Custombizio

18 days ago

That is what I am saying. It is NOT parsing the subpages of this URL. Here is the run ID: k0Zi66wkCr4NdGuBf and the URL: https://natural-resources.canada.ca/energy-efficiency/homes/canada-greener-homes-initiative/oil-heat-pump-affordability-program/24775

Run ID: yV8jm4qofNKMNJKbO URL: https://natural-resources.canada.ca/energy-efficiency/homes/canada-greener-homes-initiative/canada-greener-homes-grant/canada-greener-homes-grant/23441

The second part is: Apify is saying the limit will reset on Dec. 8. I put extra money in my account ($200) so I wouldn't have to worry about limits and could continue to parse.

Custombizio

18 days ago

I have added $200 to my account, so how is it possible that I have used more than the plan allows? It doesn't make sense to add extra money and not be able to use it.

jiri.spilka

Hi, I’m sorry for the misunderstanding.

If you want to crawl everything under energy-efficiency, you need to start the crawl using https://natural-resources.canada.ca/energy-efficiency so that other pages will be included as well. Please see my example run, which successfully crawled 904 pages.

Please refer to the documentation regarding the crawling details:

The actor crawls the start URLs, finds links to other pages, and recursively crawls those pages, too, as long as their URL is under the start URL.

If your start URL is very specific, such as https://natural-resources.canada.ca/energy-efficiency/homes/canada-greener-homes-initiative/oil-heat-pump-affordability-program/24775, the Actor will only scrape this URL (and any other pages that start with the same URL, i.e. are under this URL). As far as I can tell, that particular page does not link to any such URLs.
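As a rough sketch of the advice above, a very specific start URL can be broadened by trimming trailing path segments before it is passed to the Actor. The helper name `broaden_start_url` is hypothetical, and the `startUrls` input shape (a list of objects with a `url` key) is an assumption that should be checked against the Actor's input schema.

```python
from urllib.parse import urlsplit, urlunsplit


def broaden_start_url(url: str, levels: int) -> str:
    """Drop the last `levels` path segments so the crawl scope covers more pages.

    Hypothetical helper for illustration; query string and fragment are dropped.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Keep everything except the trailing `levels` segments (never below root).
    kept = segments[: max(len(segments) - levels, 0)]
    path = "/" + "/".join(kept) if kept else ""
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


specific = (
    "https://natural-resources.canada.ca/energy-efficiency/homes/"
    "canada-greener-homes-initiative/oil-heat-pump-affordability-program/24775"
)
broad = broaden_start_url(specific, 4)

# Assumed input shape for the Actor run; verify against the input schema.
run_input = {"startUrls": [{"url": broad}]}
print(run_input)
```

Trimming four segments from the specific URL yields the energy-efficiency start URL suggested in this reply. How far to trim is a judgment call: a broader start URL covers more sibling pages but also crawls (and bills for) more of the site.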

Regarding the payment issues, I can't see your payments. I’ll ask customer support to reach out to you about this.

jiri.spilka

I hope you were able to resolve the payment issue with our customer support. I will go ahead and close this issue for now.

If you have any other technical requests, please don’t hesitate to ask!
