Web Scraper

apify/web-scraper

Crawls arbitrary websites using the Chrome browser and extracts data from pages using JavaScript code. The Actor supports both recursive crawling and lists of URLs and automatically manages concurrency for maximum performance. This is Apify's basic tool for web crawling and scraping.

Crawling returns blank data

Closed

visit_network opened this issue
6 months ago

I have added custom JS code that scrapes the required data from the details page of a tour on Viator. This code sometimes returns blank fields. Also, when I try to run it with multiple URLs, it runs for a very long time and then always returns blank fields.

visit_network

6 months ago

Any updates on this? I am really stuck and this is the highest priority for me.

visit_network

6 months ago

This is actually because of the captcha. The crawler gets the captcha first and is unable to get the data from the page. How can I avoid the captcha?

visit_network

6 months ago

I have a paid account now, but all the IPs from the proxy are still being blocked by the source site. Do you have any idea what could be causing the site to block the requests made by the Actor?

jindrich.bar

Hello and thank you for your interest in this Actor!

Looking at your run, it seems that there are multiple problems. First, you're passing an invalid CSS selector in the Link selector input option. Searching for this selector locally on the page you're submitting returns null. Because of this, no pages are being enqueued and your run always ends after scraping just one page.
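One way to test a Link selector before starting a run is to evaluate it against the page and count the matches. The sketch below is only an illustration: `checkLinkSelector` is a hypothetical helper name, and `root` is assumed to be `document` or any object that exposes a DOM-style `querySelectorAll` method.

```javascript
// Hypothetical helper: reports whether a selector parses and how many elements it matches.
// `root` is assumed to be `document` (or any object with a DOM-style querySelectorAll).
function checkLinkSelector(root, selector) {
  try {
    const matches = root.querySelectorAll(selector);
    return { valid: true, count: matches.length };
  } catch (err) {
    // In browsers, querySelectorAll throws when the selector cannot be parsed.
    return { valid: false, count: 0, error: String(err.message) };
  }
}
```

In the browser console this could be called as `checkLinkSelector(document, 'a[href]')`. A result with `valid: false` or a count of 0 would suggest that the selector gives the crawler no links to enqueue.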

However, you're right that there is some blocking going on - I managed to get past it using Residential proxy type and Run browsers in headless mode option set to false. Also, make sure you are using the "Use Chrome" option (uses a regular Chrome browser instead of a Chromium instance). See my example run here. You can see that even in my fixed example run, some parts of the content are still missing, though - I didn't look into this too closely, but it seems that your CSS selectors in your page function contain "randomized" parts. Try using the CSS attribute selector wildcard variants for this (instead of searching for .submit-button-fa14ff, search for [class*="submit-button"]).
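To illustrate the attribute selector advice, here is a minimal sketch. Both helper names are hypothetical, and the suffix pattern (a dash followed by six or more hexadecimal characters, as in `submit-button-fa14ff`) is an assumption that may not match every generated class name on the site.

```javascript
// Hypothetical helper: turns a class name with a generated suffix into a
// substring attribute selector. The suffix pattern (a dash followed by six or
// more hex characters) is an assumption; adjust it to the markup you see.
function toWildcardSelector(className) {
  const base = className.replace(/-[0-9a-f]{6,}$/i, '');
  return `[class*="${base}"]`;
}

// Reads the trimmed text of the first match, or null when nothing matches.
// `root` is assumed to be `document` or any object with a DOM-style querySelector.
function readText(root, className) {
  const el = root.querySelector(toWildcardSelector(className));
  return el ? el.textContent.trim() : null;
}
```

Inside a page function, `readText(document, 'submit-button-fa14ff')` would then query `[class*="submit-button"]`. Keep in mind that `[class*=...]` matches any class attribute containing that substring, so a short base name may also match unrelated elements.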

Thank you for your patience and let us know how it went. Cheers!

visit_network

6 months ago

Thank you for the response. I applied the settings you suggested. I tried returning all the HTML of the page, and it appears to contain the HTML that displays a captcha. The run ID is '54itnBtlxJ9Hj99w8'. Can you please check what is causing this issue?

'<script data-cfasync="false">var dd={'rt':'c','cid':'AHrlqAAAAAMALZcVR-Q6WOQAF40Q4g==','hsh':'5D768A5D53EF4D2F5899708C392EAC','t':'fe','s':40397,'e':'3d0ac1fa33864070b7e07d7faa8de37660d235934b500c291afa2e8db35702f5','host':'geo.captcha-delivery.com'}<script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js\"><iframe src="https://geo.captcha-delivery.com/captcha/?initialCid=AHrlqAAAAAMALZcVR-Q6WOQAF40Q4g%3D%3D&amp;hash=5D768A5D53EF4D2F5899708C392EAC&amp;cid=eL5unNSMh59BT5bG2JwPClRlqE92jaERjzpF5ApAn2B2SIcfiqJWtMEDJH3muHMS1nfg_B2IVUchLKiVlgKeNoDt2Wl30xtCzvxSmd5KH_xmIbpDBpOcUBLxxmZp941Z&amp;t=fe&amp;referer=https%3A%2F%2Fwww.viator.com%2FVictoria-Falls%2Fd5309-ttd%2Fp-22862P6&amp;s=40397&amp;e=3d0ac1fa33864070b7e07d7faa8de37660d235934b500c291afa2e8db35702f5&amp;dm=cd\" width="100%" height="100%" style="height:100vh;" frameborder="0" border="0" scrolling="yes">'
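A page like the one above can be detected before any extraction is attempted, so that blank fields are reported as a block rather than as missing data. This is a rough sketch: `looksLikeCaptchaPage` is a hypothetical helper, and the only marker it checks is the `captcha-delivery.com` host that appears in the HTML pasted above. Other protection systems would need their own markers.

```javascript
// Marker taken from the HTML pasted above (an assumption about what the
// challenge page contains); other anti-bot vendors would need other markers.
const CAPTCHA_MARKERS = ['captcha-delivery.com'];

// Hypothetical helper: returns true when the HTML looks like a captcha challenge page.
function looksLikeCaptchaPage(html) {
  if (typeof html !== 'string') return false;
  const lower = html.toLowerCase();
  return CAPTCHA_MARKERS.some((marker) => lower.includes(marker));
}
```

In a page function, the check could be run on `document.documentElement.outerHTML` and the result returned alongside the scraped fields.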

visit_network

6 months ago

I have copied all the settings from your sample run and am still getting a 403 error. Please check the following error log:

2024-05-02T12:24:09.659Z ACTOR: Pulling Docker image of build tfHAuSQ2x4KpkucDF from repository.
2024-05-02T12:24:40.233Z ACTOR: Creating Docker container.
2024-05-02T12:24:40.657Z ACTOR: Starting Docker container.
2024-05-02T12:24:41.369Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2024-05-02T12:24:41.371Z Executing main command
2024-05-02T12:24:42.177Z INFO System info {"apifyVersion":"3.1.16","apifyClientVersion":"2.9.3","crawleeVersion":"3.8.2","osType":"Linux","nodeVersion":"v18.20.1"}
2024-05-02T12:24:42.258Z INFO Configuring Web Scraper.
2024-05-02T12:24:43.263Z INFO Configuration completed. Starting the scrape.
2024-05-02T12:24:43.391Z INFO PuppeteerCrawler: Starting the crawler.
2024-05-02T12:24:47.444Z WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 403 status code.
2024-05-02T12:24:47.445Z {"id":"3W2COR0xgUtULSl","url":"https://www.viator.com/Victoria-Falls/d5309-ttd/p-22862P6","retryCount":1}

Please try to scrape the following URL: 'https://www.viator.com/tours/Victoria-Falls/Two-Days-and-One-Night-Camping-Safari-in-Chobe-National-Park/d5309-22862P6'

visit_network

6 months ago

Could you please verify the above error as I am totally blocked? Is there any support number where I can directly call and communicate with your team? It is really urgent.

jindrich.bar

Hello again - indeed, it seems that this Actor is struggling with the antibot scripts on this specific website. Unfortunately, evading those is more of an art than an exact science - many of the antibot systems are based on stochastic (or ML) methods, so it's not always easy to tell why you're getting blocked.

My advice here would be: try to make a specific scraper for this website locally (e.g. using Crawlee - our Node.js crawling library). Debugging that might be a bit easier than debugging this Actor here - and chances are that you will run into fewer roadblocks too (for example, because you are scraping from your local IP address).
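As a starting point for such a local scraper, here is a minimal sketch of a Crawlee request handler. It is not a tested solution for this website: the handler name is hypothetical, the captcha check assumes the `captcha-delivery.com` host seen earlier in the thread, and the only data saved is the page title. The wiring in the trailing comment assumes the `crawlee` and `puppeteer` packages are installed locally.

```javascript
// Sketch of a Crawlee request handler for a single tour page (hypothetical name).
// It uses only `request`, `page`, `pushData` and `log` from the crawling context.
async function handleTourPage({ request, page, pushData, log }) {
  const html = await page.content();
  // Assumption: the challenge page can be recognised by the captcha host from the thread.
  if (html.includes('captcha-delivery.com')) {
    // Throwing marks the request as failed, so the crawler can retry it.
    throw new Error(`Captcha page served for ${request.url}`);
  }
  const title = await page.title();
  log.info(`Scraped ${request.url}`);
  await pushData({ url: request.url, title });
}

// Wiring, assuming `crawlee` and `puppeteer` are installed locally:
//
//   const { PuppeteerCrawler } = require('crawlee');
//   const crawler = new PuppeteerCrawler({
//     headless: false,                     // headful browser, as suggested earlier in the thread
//     launchContext: { useChrome: true },  // regular Chrome instead of the bundled Chromium
//     requestHandler: handleTourPage,
//   });
//   crawler.run(['https://www.viator.com/Victoria-Falls/d5309-ttd/p-22862P6']);
```

Keeping the handler as a plain function makes it possible to exercise it with stub objects before pointing a real browser at the site.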

I also suggest you take a look at our docs section about anti-scraping protections - you might get a better idea of what might be going on.

We're sorry that this Actor couldn't help you - keep in mind that this is still a generic solution that works well enough for the proverbial 80 percent of cases. Unfortunately, your use case seems to fall into the 20% of the cases that need that extra step.

Let us know if you have any additional questions regarding anything mentioned here. Cheers!

Developer
Maintained by Apify
Actor metrics
  • 2.6k monthly users
  • 210 stars
  • 99.9% runs succeeded
  • 28 days response time
  • Created in Mar 2019
  • Modified 3 months ago