Web Scraper Experimental Debug

Developed by

Marek Trunkát

Experimental version of Apify Web Scraper with Chrome debugger integrated

Last modified

4 years ago

How it works
Getting Started
Input
Page function
context
Output
- Dataset

How it works

Web Scraper is a ready-made solution for scraping the web using the Chrome browser. It takes away all the work necessary to set up a browser for crawling, controls the browser automatically and produces machine readable results in several common formats.

Underneath, it uses the Puppeteer library to control the browser, but you don't need to worry about that. Using a simple web UI and a little basic JavaScript, you can tweak it to serve almost any scraping need.

Getting Started

If you're new to scraping or Apify, be sure to visit our tutorial, which walks you through creating your first scraping task step by step.

Input

Input is provided via the pre-configured UI. See the tooltips for more info on the available options.

Page function

Page function is a single JavaScript function that enables the user to control the Scraper's operation, manipulate the visited pages and extract data as needed. It is invoked with a context object containing the following properties:

const context = {
    // USEFUL DATA
    input, // Unaltered original input as parsed from the UI
    env, // Contains information about the run such as actorId or runId
    customData, // Value of the 'Custom data' scraper option.

    // EXPOSED OBJECTS
    request, // Apify.Request object.
    response, // Response object holding the status code and headers.
    globalStore, // Represents an in-memory store that can be used to share data across pageFunction invocations.
    log, // Reference to Apify.utils.log
    underscoreJs, // A reference to the Underscore _ object (if Inject Underscore was used).

    // EXPOSED FUNCTIONS
    setValue, // Reference to the Apify.setValue() function.
    getValue, // Reference to the Apify.getValue() function.
    saveSnapshot, // Saves a screenshot and full HTML of the current page to the key-value store.
    waitFor, // Helps with handling dynamic content by waiting for time, selector or function.
    skipLinks, // Prevents enqueueing more links via Pseudo URLs on the current page.
    enqueueRequest, // Adds a page to the request queue.
    jQuery, // A reference to the jQuery $ function (if Inject JQuery was used).

}

`context`

The following tables describe the context object in more detail.

Data structures

Argument	Type
`input`	`Object`
Input as it was received from the UI. Each `pageFunction` invocation gets a fresh copy, so you cannot modify the input by changing the values in this object.
`env`	`Object`
A map of all the relevant environment variables that you may want to use. See the `Apify.getEnv()` function for a preview of the structure and full documentation.
`customData`	`Object`
Since the input UI is fixed, it does not support adding other fields that a specific use case may need. If you need to pass arbitrary data to the scraper, use the Custom data input field and its contents will be available under the `customData` context key.
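
As a rough illustration of the `customData` key, the sketch below tags every result with a value taken from the Custom data field. It assumes the field contains `{ "category": "blog" }`; the `category` key and the `'uncategorized'` fallback are hypothetical and not part of the scraper.

```javascript
// Assumes the Custom data field contains { "category": "blog" }.
// The "category" key is a hypothetical example, not a built-in option.
async function pageFunction(context) {
    const { customData, request } = context;

    // Fall back to a default when the Custom data field was left empty.
    const category = (customData && customData.category) || 'uncategorized';

    // The returned object is saved to the dataset.
    return { url: request.url, category };
}
```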

Functions

The context object provides several helper functions that make scraping and saving data easier and more streamlined. All of the functions are async, so make sure to use `await` when calling them.

Function	Arguments
`setValue`	`(key: string, data: Object, options: Object)`
To save data to the default key-value store, you can use the `setValue` function. See the full documentation: `Apify.setValue()` function.
`getValue`	`(key: string)`
To read data from the default key-value store, you can use the `getValue` function. See the full documentation: `Apify.getValue()` function.
`waitFor`	`(task: number|string|Function, options: Object)`
The `waitFor` function enables you to wait for various events in the scraped page. The first argument determines its behavior. If you use a `number`, such as `await waitFor(1000)`, it will wait for the provided number of milliseconds. The other option is using a CSS selector `string` which will make the function wait until the given selector appears in the page. The final option is to use a `Function`. In that case, it will wait until the provided function returns `true`.
`saveSnapshot`
A helper function that saves a snapshot of the current page's HTML and its screenshot into the default key-value store. Each snapshot overwrites the previous one, and invocations are throttled if the function is called more than once in 2 seconds, to prevent abuse. So make sure you don't call it for every single request. You can find the screenshot under the SNAPSHOT-SCREENSHOT key and the HTML under the SNAPSHOT-HTML key.
`skipLinks`
With each invocation of the `pageFunction` the scraper attempts to extract new URLs from the page using the Link selector and PseudoURLs provided in the input UI. If you want to prevent this behavior in certain cases, call the `skipLinks` function and no URLs will be added to the queue for the given page.
`enqueueRequest`	`(request: Request|Object, options: Object)`
To enqueue a specific URL manually, instead of relying on the automatic combination of a Link selector and Pseudo URLs, use the `enqueueRequest` function. It accepts a plain object with the structure needed to construct a `Request` object. In practice, a URL is enough: `{ url: 'https://www.example.com' }`
`jQuery`	see jQuery docs
To make the DOM manipulation within the page easier, you may choose the Inject jQuery option in the UI and all the crawled pages will have an instance of the `jQuery` library available. However, since we do not want to modify the page in any way, we don't inject it into the global `$` object as you may be used to, but instead we make it available in `context`. Feel free to `const $ = context.jQuery` to get the familiar notation.
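
The helpers above can be combined in a single `pageFunction`. The following is a minimal sketch, not a reference implementation: the `h1` and `a.next` selectors are hypothetical placeholders for whatever the target site uses, and it assumes the Inject jQuery option is enabled so that `context.jQuery` is available.

```javascript
// A sketch of a pageFunction that uses the helpers described above.
// The selectors 'h1' and 'a.next' are hypothetical placeholders.
async function pageFunction(context) {
    const { request, log, waitFor, skipLinks, enqueueRequest } = context;
    const $ = context.jQuery; // Available only if Inject jQuery is enabled.

    // Wait until the heading appears in the page.
    await waitFor('h1');

    const title = $('h1').first().text().trim();
    log.info(`Scraped ${request.url}`);

    // Enqueue the next page manually, resolving a relative link
    // against the URL of the current request.
    const nextHref = $('a.next').attr('href');
    if (nextHref) {
        await enqueueRequest({ url: new URL(nextHref, request.url).href });
    }

    // Skip the automatic Link selector / Pseudo URL extraction for this page.
    await skipLinks();

    return { url: request.url, title };
}
```

Calling `skipLinks` here is a design choice for pages where you want full manual control over what gets enqueued; leave it out if the Link selector and Pseudo URLs should still apply.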

Class instances and namespaces

The following are either class instances or namespaces, which is just a way of saying objects with functions on them.

Request

Apify uses a request object to represent metadata about the currently crawled page, such as its URL or the number of retries. See the Request class for a preview of the structure and full documentation.

Response

The response object is produced by Puppeteer. Currently, we only pass the HTTP status code and the response headers to the context.

Global Store

globalStore represents an instance of a very simple in-memory store that is not scoped to the individual pageFunction invocation. This enables you to easily share global data such as API responses, tokens and other values. Since the stored data needs to cross from the browser to the Node.js process, it cannot be arbitrary data, only JSON stringifiable objects. You cannot store DOM objects, functions, circular objects and so on.

globalStore supports the full Map API, with the following limitations:

All methods of globalStore are async. Use await.
Only string keys can be used and the values need to be JSON stringifiable.
map.forEach() is not supported.
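
The limitations above suggest a usage pattern along these lines. This is only a sketch: the `getSharedToken` helper, the `apiToken` key, and the `fetchToken` callback are hypothetical names introduced for illustration, and the only assumption about `globalStore` is that it exposes async `has`, `get`, and `set` methods as the Map API implies.

```javascript
// Caches a value in globalStore so that later pageFunction invocations can reuse it.
// The 'apiToken' key and the fetchToken callback are hypothetical.
async function getSharedToken(globalStore, fetchToken) {
    // All globalStore methods are async, so each call is awaited.
    if (await globalStore.has('apiToken')) {
        return await globalStore.get('apiToken');
    }

    const token = await fetchToken();

    // Keys must be strings and values must be JSON stringifiable.
    await globalStore.set('apiToken', token);
    return token;
}
```

Inside a `pageFunction`, this could be called as `await getSharedToken(context.globalStore, myFetchToken)`, where `myFetchToken` is whatever async function obtains the value.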

Log

log is a reference to Apify.utils.log. You can use any of the logging methods such as log.info or log.exception. log.debug is special, because you can toggle the visibility of its messages in the scraper's log with the Debug log input option.

Underscore

Underscore is a helper library. You can use it in your pageFunction if you use the Inject Underscore input option.

Output

Output is a dataset containing extracted data for each scraped page. To save data into the dataset, return an Object or an Object[] from the pageFunction.

Dataset

For each of the scraped URLs, the dataset contains an object with results and some metadata. If you were scraping the HTML <title> of Apify and returning the following object from the pageFunction:

return {
  title: "Web Scraping, Data Extraction and Automation - Apify"
}

it would look like this:

{
  "title": "Web Scraping, Data Extraction and Automation - Apify",
  "#error": false,
  "#debug": {
    "requestId": "fvwscO2UJLdr10B",
    "url": "https://apify.com",
    "loadedUrl": "https://apify.com/",
    "method": "GET",
    "retryCount": 0,
    "errorMessages": null,
    "statusCode": 200
  }
}

You can remove the metadata (and results containing only metadata) from the results by selecting the Clean items option when downloading the dataset.

The result will look like this:

{
  "title": "Web Scraping, Data Extraction and Automation - Apify"
}
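
To make the effect of Clean items concrete, here is a small sketch that mimics it on already-downloaded items. It is not Apify's implementation; it assumes that the metadata fields are exactly those whose names start with `#`, as in the example above, and the `cleanItems` name is hypothetical.

```javascript
// Mimics the effect of the Clean items option on an array of dataset items.
// Assumption: metadata fields are those whose names start with '#'.
function cleanItems(items) {
    return items
        // Drop every metadata field from each item.
        .map((item) => Object.fromEntries(
            Object.entries(item).filter(([key]) => !key.startsWith('#')),
        ))
        // Drop items that contained only metadata.
        .filter((item) => Object.keys(item).length > 0);
}
```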
