Web Scraper

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawls arbitrary websites using a web browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

4.5 (23)

920

Total users: 90K
Monthly users: 4.9K
Runs succeeded: >99%
Issues response: 7.8 days
Last modified: 2 months ago

received 401 status code

Closed

Competent Path (competent_path) opened this issue
5 months ago

I tried this with the following input:

{
    "breakpointLocation": "NONE",
    "browserLog": false,
    "closeCookieModals": false,
    "debugLog": false,
    "downloadCss": false,
    "downloadMedia": false,
    "excludes": [
        {
            "glob": "/**/*.{png,jpg,jpeg,pdf}"
        }
    ],
    "globs": [
        {
            "glob": ""
        }
    ],
    "headless": false,
    "ignoreCorsAndCsp": true,
    "ignoreSslErrors": true,
    "injectJQuery": true,
    "keepUrlFragments": false,
    "pageFunction": "async function pageFunction(context) {\n const $ = context.jQuery;\n return {html: $('html').first().html()};\n}",
    "postNavigationHooks": "// We need to return array of (possibly async) functions here.\n// The functions accept a single argument: the \"crawlingContext\" object.\n[\n async (crawlingContext) => {\n // ...\n },\n]",
    "preNavigationHooks": "// We need to return array of (possibly async) functions here.\n// The functions accept two arguments: the \"crawlingContext\" object\n// and \"gotoOptions\".\n[\n async (crawlingContext, gotoOptions) => {\n // ...\n },\n]\n",
    "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": [
            "RESIDENTIAL"
        ]
    },
    "runMode": "PRODUCTION",
    "startUrls": [
        {
            "url": "https://www.wsj.com/livecoverage/stock-market-today-dow-sp500-nasdaq-live-08-07-2024/card/robinhood-reports-record-quarterly-revenue-and-profit-tIlQ0DnKKwNWFeqoRcA2",
            "method": "GET"
        }
    ],
    "useChrome": true,
    "waitUntil": [
        "networkidle2"
    ]
}
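
For readability, the escaped "pageFunction" string in the input above corresponds to roughly the following JavaScript. This is only an unescaped view of the same code with comments added; the indentation is a guess, since the original whitespace is not recoverable from the escaped string.

```javascript
// The pageFunction from the input, unescaped for readability.
async function pageFunction(context) {
    // context.jQuery is provided because "injectJQuery" is true in the input.
    const $ = context.jQuery;
    // Return the inner HTML of the page's <html> element as a single field.
    return { html: $('html').first().html() };
}
```

The function extracts no structured fields; it returns the page markup as one string, so a blocked response yields nothing useful for it to process.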

PuppeteerCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 401 status code. 2025-02-18T22:39:55.764Z {"id":"9nWDjDToDXvA6Ny","url":"https://www.wsj.com/livecoverage/stock-market-today-dow-sp500-nasdaq-live-08-07-2024/card/robinhood-reports-record-quarterly-revenue-and-profit-tIlQ0DnKKwNWFeqoRcA2","retryCount":1}

competent_path

Please let me know if there is any update on this one.

jindrich.bar

Hello, and sorry for the delay.

The 401 error indicates that access was blocked, and unfortunately, the Wall Street Journal has very strong anti-bot protections in place. We haven’t been able to successfully scrape this page using either the Web Scraper or Camoufox Scraper Actors with any combination of input options.
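
As a rough illustration of why the log says "Request blocked", a crawler can classify certain HTTP status codes as a sign of blocking rather than an ordinary failure. The sketch below assumes the set 401, 403, and 429, which is understood to match Crawlee's defaults but should be verified against the version in use. The names `BLOCKED_STATUS_CODES` and `isBlockedStatus` are hypothetical and are not taken from the Actor's source.

```javascript
// Assumption: status codes treated as "blocked". Believed to mirror
// Crawlee's defaults; check the documentation for your version.
const BLOCKED_STATUS_CODES = [401, 403, 429];

// Hypothetical helper: returns true when a response status suggests
// the target site refused the request, not that the page is missing.
function isBlockedStatus(statusCode, blockedCodes = BLOCKED_STATUS_CODES) {
    return blockedCodes.includes(statusCode);
}

console.log(isBlockedStatus(401)); // true
console.log(isBlockedStatus(404)); // false
```

Under this classification, retrying with a fresh session or proxy only helps if the site's protection depends on the session or IP address; it does not help when every browser-like client is rejected, as appears to be the case here.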

At the moment, the most viable path forward is to develop a custom solution tailored specifically for WSJ. If you don’t have the development capacity to do this yourself, you might consider hiring a freelancer from our official Apify Discord server.

I'll close this issue, as there is likely no way for this Actor to support scraping this site. Let us know if you have any other questions!