Website Content Crawler

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

4.6 (38)

1310

Total users: 49.4k

Monthly users: 6.9k

Runs succeeded: >99%

Issue response: 3.8 days

Last modified: 7 days ago

simple page is throwing an error

Open

burgundy_zebra opened this issue
22 days ago

Here is an example of a page which isn't being crawled properly.

https://www.bestbuy.com/site/help-topics/zip-payments/pcmcat1678205761116.c?id=pcmcat1678205761116

The crawler throws a ton of errors and returns only the URL as the output.

jakub.kopecky

Hey, thanks for using Website Content Crawler!

The problem might be that this URL shows a country selection splash screen. Use this URL instead, which adds &intl=nosplash: https://www.bestbuy.com/site/help-topics/zip-payments/pcmcat1678205761116.c?id=pcmcat1678205761116&intl=nosplash

Please see my test Actor run: https://console.apify.com/view/runs/fduJ69lkSYeNJO2kt

Jakub
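
For anyone applying the same workaround to several URLs, here is a minimal Python sketch of adding a query parameter such as intl=nosplash to an existing URL. The helper name with_query_param is hypothetical and not part of the crawler; it only uses the standard library, and whether a given site honors the parameter is something to verify per site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url: str, key: str, value: str) -> str:
    """Return `url` with `key=value` in its query string.

    Any existing value for `key` is replaced, so calling this twice
    does not duplicate the parameter. (Hypothetical helper.)
    """
    parts = urlsplit(url)
    # Keep every existing parameter except the one being set.
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != key
    ]
    query.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


url = (
    "https://www.bestbuy.com/site/help-topics/zip-payments/"
    "pcmcat1678205761116.c?id=pcmcat1678205761116"
)
print(with_query_param(url, "intl", "nosplash"))
```

The resulting URL can then be used as the start URL for the run.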

burgundy_zebra

18 days ago

Hi Jakub,

Appreciate that you looked into it.

We scrape a lot of different web pages for different customers. As a result, we can't identify the escape hatch that bypasses the country check for each company. Instead, I was expecting the country code configuration to have the same effect. Why is that not working?

See attached where I used US for the run I shared earlier.

jakub.kopecky

Hey,

I tested the crawler using the US residential proxy configuration as shown in your screenshot, but the site still requires &intl=nosplash to bypass the country selection screen.

The Website Content Crawler is a generic tool designed for most websites, but some, like this one, need specific workarounds to handle features like the country selection screen, which the crawler doesn't support. You might find Best Buy scrapers in the Apify Store, though they focus on product scraping, not help pages: https://apify.com/store/categories?search=bestbuy.

Jakub

burgundy_zebra

16 days ago

We don't scrape products. We monitor pages on behalf of our customers to see if they fall out of compliance. If you look at this page (see screenshot: https://cdn.zappy.app/93fe6e185ea49cc3c9cf21830e6e9818.png), I don't need to add intl=nosplash, and it still works. This is because my IP is in the US. As a result, the source IP of the crawler needs to be a US IP. How do we get Apify to use a US-based IP address for crawling?

jakub.kopecky

Hi,

Apologies for the delayed response.

You can set the Apify proxy country under Crawler Settings > Proxy Configuration > Proxy Country by selecting United States. I tested this input option, and it now works without the &intl=nosplash using the US proxy: https://console.apify.com/view/runs/XLg7RLb9JNgrTN6RR

Please let me know if this solves your issue.

Jakub
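
For runs started through the API rather than the Console, the same Proxy Country setting would presumably be expressed in the Actor's JSON input. The sketch below builds such an input in Python. The field names startUrls, proxyConfiguration, useApifyProxy, and apifyProxyCountry are assumptions about how the UI option maps to the input and should be checked against the Actor's input schema. The function name build_run_input is hypothetical.

```python
import json


def build_run_input(start_url: str, country_code: str = "US") -> dict:
    """Build a run input that asks Apify Proxy for IPs from one country.

    Field names are assumptions, not confirmed by this thread; verify
    them against the Actor's input schema before relying on them.
    """
    return {
        "startUrls": [{"url": start_url}],
        "proxyConfiguration": {
            "useApifyProxy": True,
            # Two-letter country code; "US" corresponds to selecting
            # United States under Proxy Country in the Console.
            "apifyProxyCountry": country_code,
        },
    }


run_input = build_run_input(
    "https://www.bestbuy.com/site/help-topics/zip-payments/"
    "pcmcat1678205761116.c?id=pcmcat1678205761116"
)
print(json.dumps(run_input, indent=2))
```

The printed JSON could be pasted into the Actor's JSON input editor or passed as the run input by an API client, again assuming the field names above match the schema.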