On September 30, 2019, we'll be retiring the legacy Apify Crawler product together with API version 1. To avoid any disruption, you'll need to migrate your crawlers to the new apify/legacy-phantomjs-crawler actor, which provides the same functionality, configuration options and results as the legacy Crawler.

Please read this blog post to find out why we are retiring the Crawler product, what it means for you and how you can migrate your crawlers to the new actor, including the integrations. Once you read the blog post and understand all the implications, you can trigger migration on the Crawlers page in Apify app.

Apify provides a hosted web crawler for developers. Technically speaking, it is a bunch of web browsers hosted on Apify servers that enable you to scrape data from any website using the primary programming language of the web: JavaScript.

This document describes all the features of the crawler.

In order to extract structured data from a website, you only need two things. First, tell the crawler which pages it should visit (see Start URLs and Pseudo-URLs). Second, define JavaScript code that will be executed on every visited web page in order to extract the data from it (see Page function). The crawler is a full-featured web browser which loads and interprets JavaScript, and the code you provide is simply executed in the context of the pages it visits. This means that writing your data-extraction code is very similar to writing JavaScript code in front-end development; you can even use client-side libraries such as jQuery or Underscore.js.

Imagine the crawler as a guy sitting in front of a web browser. Let's call him Bob. Bob opens a start URL and waits for the page to load, executes your JavaScript code using a developer console, writes down the result and then right-clicks all links on the web page to open them in new browser tabs. After that, Bob closes the current tab, goes to the next tab and repeats the same action again. Bob is pretty smart and skips pages that he has already visited. When there are no more pages, he is done. And this is where the magic happens. Bob would need about a month to click through a few hundred pages. Apify can do it in a few seconds and makes fewer mistakes.

More formally, the crawler repeats the following steps:

  1. Add each of the Start URLs into the crawling queue.
  2. Fetch the first URL from the queue and load it in the virtual browser.
  3. Execute Page function on the loaded page and save its results.
  4. Find all links from the page using Clickable elements CSS selector. If a link matches any of the Pseudo-URLs and has not yet been enqueued, add it to the queue.
  5. If there are more items in the queue, go to step 2, otherwise finish.
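The steps above can be sketched as a simple queue-driven loop. This is only an illustrative simulation, not the crawler's actual implementation: the crawl function, its parameters and the in-memory site map (which stands in for loading a page and clicking its links) are all hypothetical names introduced for this example.

```javascript
// Illustrative simulation of the crawl loop (not Apify's real code).
// `site` maps each URL to the list of URLs that page links to.
// `matchesPseudoUrl` and `extract` are hypothetical callbacks that stand in
// for the Pseudo-URLs check and the Page function.
function crawl(startUrls, site, matchesPseudoUrl, extract) {
    var queue = [];
    var enqueued = {};   // URLs that have already been added to the queue
    var results = [];

    function enqueue(url) {
        if (!enqueued[url]) {
            enqueued[url] = true;
            queue.push(url);
        }
    }

    // Step 1: add each start URL to the crawling queue
    startUrls.forEach(enqueue);

    // Step 5: repeat while there are items in the queue
    while (queue.length > 0) {
        // Step 2: fetch the first URL from the queue and "load" it
        var url = queue.shift();
        var links = site[url] || [];

        // Step 3: run the extraction code and save its result
        results.push({ url: url, pageFunctionResult: extract(url) });

        // Step 4: enqueue matching links that have not been enqueued yet
        links.forEach(function (link) {
            if (matchesPseudoUrl(link)) {
                enqueue(link);
            }
        });
    }

    return results;
}
```

Because the enqueued map is checked before a URL is pushed to the queue, each page is visited at most once, even if several pages link to it.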

This process is depicted in the following diagram. Note that blue elements represent settings or operations that can be affected by crawler settings. These settings are described in detail in the following sections.

Web crawler activity diagram

Note that each crawler configuration setting can also be set using the API. The corresponding property name and type are {described in this font} right next to the property caption. When you export the crawler settings to JSON, the object will have these properties. For details, see the API section on the crawler details page.

Basic settings

Custom ID {customId: String}

A custom unique identifier of the crawler that is used to reference the crawler from API integrations. The string cannot be empty and ideally should not require URL encoding. Beware: if you change this value, the corresponding API endpoint URL will also change and your integrations might break.

Internal ID {_id: String}

An internal unique identifier of the crawler that can be used to reference the crawler in your API integrations instead of Custom ID. Note that this value is read-only and never changes.

Comments {comments: String}

Arbitrary notes or comments associated with this crawler.

Start URLs {startUrls: [{key: String, value: String}]}

The list of URLs of the first pages that the crawler will open. Optionally, each URL can be associated with a custom label that can be referenced from your JavaScript code to determine which page is currently open (see Request object for details). Each URL must start with either a http:// or https:// protocol prefix!

Note that it is possible to instruct the crawler to load a URL using an HTTP POST request simply by suffixing it with a [POST] marker, optionally followed by POST data (e.g. [POST]key1=value1&key2=value2). By default, POST requests are sent with the Content-Type: application/x-www-form-urlencoded header.

Maximum label length is 100 characters and maximum URL length is 2000 characters.
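To make the [POST] marker format more concrete, here is a minimal sketch of how such a start URL could be split into its parts. The parseStartUrl helper is hypothetical and only illustrates the format described above; the crawler's real parsing logic is not documented here and may differ. The returned property names mirror those of the Request object.

```javascript
// Hypothetical helper: splits a start URL value at the [POST] marker.
// This only illustrates the documented format, not the crawler's internals.
function parseStartUrl(value) {
    var marker = '[POST]';
    var index = value.indexOf(marker);

    // No marker: a plain GET request
    if (index === -1) {
        return { url: value, method: 'GET', postData: null, contentType: null };
    }

    // Everything before the marker is the URL, everything after is POST data
    return {
        url: value.substring(0, index),
        method: 'POST',
        postData: value.substring(index + marker.length),
        // Default content type for POST requests, as stated above
        contentType: 'application/x-www-form-urlencoded'
    };
}
```

For example, calling parseStartUrl with a placeholder value such as http://www.example.com/search[POST]key1=value1&key2=value2 would yield a POST request to http://www.example.com/search with key1=value1&key2=value2 as its data.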

Pseudo-URLs {crawlPurls: [{key: String, value: String}]}

Specifies which pages will be visited by the crawler using the pseudo-URL (PURL) format. A PURL is simply a URL with special directives enclosed in [] brackets. Currently, the only supported directive is [regexp], which defines a JavaScript-style regular expression to match against the URL.

For example, a PURL[(\w|-)*] will match all of the following URLs:


If either [ or ] is part of the normal query string, it must be encoded as [\x5B] or [\x5D], respectively. For example, the following PURL:[\x5B]load[\x5D]=1

will match the URL:[load]=1

Optionally, each PURL can be associated with a custom label that can be referenced from your JavaScript code to determine which page is currently open (see Request object for details).

Note that you don't need to use this setting at all, because you can completely control which pages the crawler will access using the Intercept request function.

Maximum label length is 100 characters and maximum PURL length is 1000 characters.
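One way to think about PURL matching is as a translation into a single regular expression: the text outside the brackets is matched literally and the text inside the brackets is used as a regular expression. The sketch below illustrates this idea under two assumptions: that the whole URL must match, and that a [regexp] directive does not itself contain a ] character. The purlToRegExp and escapeRegExp helpers are hypothetical and are not the crawler's actual implementation.

```javascript
// Escapes characters that have a special meaning in regular expressions.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: converts a pseudo-URL into a RegExp.
// Assumes directives contain no nested "]" and that the whole URL must match.
function purlToRegExp(purl) {
    var pattern = '';
    var pos = 0;

    while (pos < purl.length) {
        var open = purl.indexOf('[', pos);
        if (open === -1) {
            // No more directives: the rest is literal text
            pattern += escapeRegExp(purl.substring(pos));
            break;
        }
        var close = purl.indexOf(']', open);
        if (close === -1) {
            throw new Error('Unclosed [ in pseudo-URL');
        }
        // Literal text before the directive, then the directive as a group
        pattern += escapeRegExp(purl.substring(pos, open));
        pattern += '(' + purl.substring(open + 1, close) + ')';
        pos = close + 1;
    }

    return new RegExp('^' + pattern + '$');
}
```

With this translation, a placeholder PURL such as http://www.example.com/pages/[(\w|-)*] matches any URL under /pages/ whose last segment consists of word characters and dashes, and the [\x5B] and [\x5D] directives match literal [ and ] characters.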

Clickable elements {clickableElementsSelector: String}

CSS selector used to find links to other web pages. The crawler clicks all DOM elements matching this selector and then monitors whether the page generates a navigation request. If a navigation request is detected, the crawler checks whether it matches Pseudo-URLs, invokes Intercept request function, cancels the request and then continues clicking the next matching elements. By default, new crawlers are created with a safe CSS selector:


In order to reach more pages, you might want to use a wider CSS selector, such as:

a:not([rel=nofollow]), input, button, [onclick]:not([rel=nofollow])

Be careful - clicking certain DOM elements can cause unexpected and potentially harmful side effects. For example, by clicking buttons you might submit forms, flag comments, etc. In principle, the safest option is to narrow the CSS selector to as few elements as possible, which also makes the crawler run much faster.

Leave this field empty if you do not want the crawler to click any elements and only open Start URLs or pages enqueued using enqueuePage().

Page function {pageFunction: String}

A user-provided JavaScript function that is executed in the context of every page loaded by the crawler. Page function is typically used to extract some data from the page, but it can also be used to perform some non-trivial operation on the page, e.g. handle AJAX-based pagination.

IMPORTANT: Apify currently uses the PhantomJS headless web browser, which only supports the JavaScript ES5.1 standard (read more in a blog post about PhantomJS 2.0).

The basic page function with no effect has the following signature:

function pageFunction(context) {
    return null;
}

The function can return an arbitrary JavaScript object (including array, string, number, etc.) that can be stringified to JSON; this value will be saved in the crawling results as the pageFunctionResult field of the Request object corresponding to the web page on which the pageFunction was executed. Note that Apify provides crawling results in a computer-friendly form (JSON, JSONL, XML or RSS format), as well as in a human-friendly tabular form (HTML or CSV format). If the pageFunction's return value is an array, its elements will be displayed as separate rows in such a table, to make the results more readable.

The function accepts a single argument called context, which is an object with the following properties and functions:

Name Description
request An object holding all the available information about the currently loaded web page. See Request object for details.
jQuery A jQuery object, only available if the Inject jQuery setting is enabled.
underscoreJs The Underscore.js' _ object, only available if the Inject Underscore.js setting is enabled.
If called, the crawler will not follow any links from the current page and will continue with the next page from the queue. This is useful to speed up the crawl by avoiding unnecessary paths.
skipOutput() If called, no information about the current page will be saved to the Results, including the page function result itself. This is useful to reduce the size of the output JSON by skipping unimportant pages. Note that if the page function throws an exception, the skipOutput() call is ignored and the page is outputted anyway, so that the user has a chance to determine whether there was an error (see Request object's errorInfo field).
willFinishLater() Tells the crawler that the page function will continue performing some background operation even after it returns. This is useful when you want to fetch results from an asynchronous operation, e.g. an XHR request or a click on some DOM element. If you use the willFinishLater() function, make sure you also invoke finish(); otherwise the crawler will keep waiting for the result and eventually time out after the period specified in Page function timeout. Note that the normal return value of the page function is ignored.
finish(result) Tells the crawler that the page function finished its background operation. The result parameter receives the result of the page function - this is a replacement for the normal return value of the page function that was ignored (see willFinishLater() above).
saveSnapshot() Captures a screenshot of the web page and saves its DOM to an HTML file, which are both then displayed in the user's crawling console. This is especially useful for debugging your page function.

enqueuePage(request) Adds a new page request to the crawling queue, regardless of whether it matches any of the Pseudo-URLs. The request argument is an instance of the Request object, but only the following properties are taken into account: url, uniqueKey, label, method, postData, contentType, queuePosition and interceptRequestData; all other properties will be ignored. The url property is mandatory.

Note that the manually enqueued page is subject to the same processing as any other page found by the crawler. For example, the Intercept request function will be called for the new request, and the page will be checked to see whether it has already been visited by the crawler and skipped if so.

For backwards compatibility, the function also supports the following signature: enqueuePage(url, method, postData, contentType).
saveCookies([cookies]) Saves current cookies of the current PhantomJS browser to the crawler's Initial cookies. All subsequently started PhantomJS processes will use these cookies. For example, this is useful to store a login. Optionally, you can pass an array of cookies to set to the browser before saving (in PhantomJS format). Note that by passing an empty array you can unset all cookies.
customData Custom user data from crawler settings. See Custom data for details.
stats An object containing a snapshot of statistics from the current crawl (see API section on crawler run page for details). Note that the statistics are collected before the current page has been crawled.
actExecutionId String containing ID of this crawler execution. It might be used to control the crawler using the API, e.g. to stop it or fetch its results.
actId String containing internal ID of crawler. See Internal ID for details.

Note that any changes made to the context parameter will be ignored. When implementing the page function, it is the user's responsibility not to break the page's own scripts, which might affect the operation of the crawler.
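As an illustration of how the context properties fit together, here is a minimal sketch of a page function that treats pages differently based on their label. The START and SITEMAP labels and the www.example.com URL are placeholders chosen for this example only; the function uses only context.request, context.enqueuePage() and context.skipOutput() as described above.

```javascript
// A sketch of a page function that branches on the request label.
// The labels and the URL below are hypothetical placeholders.
function pageFunction(context) {
    var request = context.request;

    if (request.label === 'START') {
        // Manually enqueue another page; only `url` is mandatory
        context.enqueuePage({
            url: 'http://www.example.com/sitemap',
            label: 'SITEMAP'
        });

        // The start page carries no interesting data, keep it out of results
        context.skipOutput();
        return null;
    }

    // For all other pages, save a small record to the crawling results
    return {
        pageUrl: request.url,
        pageLabel: request.label
    };
}
```

Since the function only touches the context argument, it can be exercised outside the crawler by passing in a stub object with the same properties, which is handy when debugging the branching logic.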

Waiting for dynamic content

Some web pages do not load all their content immediately but only fetch it in the background using AJAX, while pageFunction might be executed before the content has actually been loaded. You can wait for dynamic content to load using the following code:

function pageFunction(context) {
    var $ = context.jQuery;
    var startedAt = Date.now();

    var extractData = function() {
        // timeout after 10 seconds
        if( Date.now() - startedAt > 10000 ) {
            context.finish("Timed out before #my_element was loaded");
            return;
        }

        // if my element still hasn't been loaded, wait a little more
        if( $('#my_element').length === 0 ) {
            setTimeout(extractData, 500);
            return;
        }

        // refresh page screenshot and HTML for debugging
        context.saveSnapshot();

        // save a result
        context.finish({
            value: $('#my_element').text()
        });
    };

    // tell the crawler that pageFunction will finish asynchronously
    context.willFinishLater();

    extractData();
}


Advanced settings

Intercept request function {interceptRequest: String}

A user-provided JavaScript function that is called whenever a new URL is about to be added to the crawling queue, which happens at the following times:

  • At the start of crawling for all Start URLs.
  • When the crawler looks for links to new pages by clicking elements matching the Clickable elements CSS selector and detects a page navigation request, i.e. a link (GET) or a form submission (POST) that would normally cause the browser to navigate to a new web page.
  • Whenever a loaded page tries to navigate to another page, e.g. by setting window.location in JavaScript.
  • When user code invokes enqueuePage() inside of Page function.

The intercept request function allows you to affect on a low level how new pages are enqueued by the crawler. For example, it can be used to ensure that the request is added to the crawling queue even if it doesn't match any of the Pseudo-URLs, or to change the way the crawler determines whether the page has already been visited or not. Similarly to the Page function, this function is executed in the context of the originating web page (or in the context of about:blank page for Start URLs).

IMPORTANT: Apify currently uses the PhantomJS headless web browser, which only supports the JavaScript ES5.1 standard (read more in a blog post about PhantomJS 2.0).

The basic intercept request function with no effect has the following signature:

function interceptRequest(context, newRequest) {
    return newRequest;
}

The context is an object with the following properties:

request An object holding all the available information about the currently loaded web page. See Request object for details.
jQuery A jQuery object, only available if the Inject jQuery setting is enabled.
underscoreJs An Underscore.js object, only available if the Inject Underscore.js setting is enabled.
clickedElement A reference to the DOM object whose clicking initiated the current navigation request. The value is null if the navigation request was initiated by other means, e.g. using some background JavaScript action.

Beware that in rare situations when the page redirects in its JavaScript before it was completely loaded by the crawler, the jQuery and underscoreJs objects will be undefined. The newRequest parameter contains a Request object corresponding to the new page.

The way the crawler handles the new page navigation request depends on the return value of the interceptRequest function in the following way:

  • If the function returns the newRequest object unchanged, the default crawler behaviour will apply.
  • If the function returns the newRequest object altered, the crawler behaviour will be modified, e.g. it will enqueue a page that would normally be skipped. The following fields can be altered: willLoad, url, method, postData, contentType, uniqueKey, label, interceptRequestData and queuePosition (see Request object for details).
  • If the function returns null, the request will be dropped and a new page will not be enqueued.
  • If the function throws an exception, the default crawler behaviour will apply and the error will be logged to the Request object's errorInfo field. Note that this is the only way a user can catch and debug such an exception.

Note that any changes made to the context parameter will be ignored (unlike the newRequest parameter). When implementing the function, it is the user's responsibility not to break normal page scripts that might affect the operation of the crawler. You have been warned. Also note that the function does not resolve HTTP redirects: it only reports the originally requested URL, but does not open it to find out which URL it eventually redirects to.
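The return-value rules above can be illustrated with a small sketch. This example drops navigation requests that point to PDF files and overrides uniqueKey so that URLs differing only in their query string are treated as the same page. Both rules are arbitrary choices made for this example; stripping the query string is only appropriate for sites where it carries nothing but tracking parameters.

```javascript
// A sketch of an intercept request function; the two rules below are
// illustrative choices, not recommended defaults.
function interceptRequest(context, newRequest) {
    // Returning null drops the request, so PDF files are never enqueued
    if (/\.pdf$/i.test(newRequest.url)) {
        return null;
    }

    // Altering uniqueKey changes how the crawler decides whether a page
    // has already been visited: here the query string is ignored
    newRequest.uniqueKey = newRequest.url.split('?')[0];

    // Returning the (altered) request lets the crawler continue with it
    return newRequest;
}
```

Note that the function modifies and returns the newRequest object itself, since changes to the context parameter are ignored.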


URL #fragments identify unique pages {considerUrlFragment: Boolean}

Indicates that the URL fragment identifier (i.e. the part of the URL after the # character) should be considered when matching a URL against a PURL or when checking whether a page has already been visited. Typically, URL fragments are used as internal page anchors and therefore they should be ignored because they don't represent separate pages. However, many AJAX-based websites nowadays use the URL fragment to represent page parameters; in such cases, this option should be enabled.

Download HTML images {loadImages: Boolean}

Indicates whether the crawler should load HTML images, both those included using the <img> tag as well as those included in CSS styles. Disable this feature after you have fine-tuned your crawler in order to increase crawling performance and reduce your bandwidth costs.

Download CSS files {loadCss: Boolean}

Indicates whether the crawler should load CSS stylesheet files. Disable this feature after you have fine-tuned your crawler in order to increase crawling performance and reduce your bandwidth costs.

Inject jQuery {injectJQuery: Boolean}

Indicates that the jQuery library should be injected to each page before Page function is invoked. Note that the jQuery object will not be registered into global namespace in order to avoid conflicts with libraries used by the web page. It can only be accessed through context.jQuery.

Inject Underscore.js {injectUnderscoreJs: Boolean}

Indicates that the Underscore.js library should be injected to each page before Page function is invoked. Note that the Underscore object will not be registered into global namespace in order to avoid conflicts with libraries used by the web page. It can only be accessed through context.underscoreJs.

Ignore robots exclusion standards {ignoreRobotsTxt: Boolean}

Indicates that the crawler should ignore robots.txt, <meta name="robots"> tags and X-Robots-Tag HTTP headers. Use this feature at your own risk!

Don't load frames and IFRAMEs {skipLoadingFrames: Boolean}

Indicates that child frames included using FRAME or IFRAME tags will not be loaded by the crawler. This might improve crawling performance. As a side-effect, JavaScript redirects issued by the page before it was completely loaded will not be performed, which might be useful in certain situations.

Verbose log {verboseLog: Boolean}

If enabled, the log will also contain DEBUG messages. Note that this setting will dramatically slow down the crawler as well as your web browser and increase the log size.

Disable web security {disableWebSecurity: Boolean}

If checked, the virtual browser will allow cross-domain XHRs and untrusted SSL certificates, so that your crawler can access content from any domain. Only activate this feature if you know what you're doing!

Rotate User-Agent headers {rotateUserAgents: Boolean}

If checked, the crawler automatically rotates the User-Agent HTTP header for each new IP address, choosing from a pre-defined list. This setting overrides the User-Agent key in Custom HTTP headers, if it is set.

Max pages per crawl {maxCrawledPages: Number}

Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. Always set this value in order to prevent infinite loops in misconfigured crawlers. For free plan users, the maximum is limited according to the current Monthly pages limit of the free plan (see pricing for details). Note that in cases of parallel crawling, the actual number of pages visited might be slightly higher than this value.

Max result records {maxOutputPages: Number}

Maximum number of pages the crawler can output to JSON. The crawl will stop when this limit is reached. This value is useful when you only need a limited number of results.

Max crawling depth {maxCrawlDepth: Number}

Defines how many links away from the start URLs the crawler will descend. This value is a safeguard against infinite crawling depths on misconfigured crawlers. Note that pages added using enqueuePage() (see Page function) are not subject to the maximum depth constraint.

Execution timeout {timeout: Number}

Timeout for the execution of the crawler, in seconds. If the crawler is running longer than this value, it will be forcibly stopped and its status set to TIMEOUT. By default, the timeout is 604800 seconds (=7 days).

Resource timeout {resourceTimeout: Number}

Timeout for network resources loaded by the crawler specified in milliseconds. The default value is 30000 milliseconds.

Page load timeout {pageLoadTimeout: Number}

Timeout for web page load, in milliseconds. If the web page does not load in this timeframe, it is considered to have failed and will be retried, similarly as with other page load errors. The default value is 60000 milliseconds.

Page function timeout {pageFunctionTimeout: Number}

Timeout for the asynchronous part of the page function, in milliseconds. Note that this value is only applied if your page function runs code in the background, i.e. when it invokes context.willFinishLater(); the page function itself always runs to completion regardless of the timeout. The default timeout is 600000 milliseconds (= 10 minutes).

Infinite scroll height {maxInfiniteScrollHeight: Number}

Defines the maximum client height in pixels to which the browser window is scrolled in order to fetch dynamic AJAX-based content from the web server (so-called infinite scroll). By default, the crawler doesn't scroll and uses a fixed browser window size. Note that you might need to enable Download HTML images to make infinite scroll work, because otherwise the crawler wouldn't know that some resources are still being loaded and will stop infinite scrolling prematurely.

Delay between requests {randomWaitBetweenRequests: Number}

This option forces the crawler to ensure a minimum time interval between opening two web pages, in order to prevent it from overloading the target server. The actual minimum time is a random value drawn from a Gaussian distribution with a mean specified by your setting (in milliseconds) and a standard deviation corresponding to 25% of the mean. The minimum value is 1000 milliseconds; the crawler never issues requests at intervals shorter than 1000 milliseconds.
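To get a feel for the resulting delays, the following sketch draws a value from a Gaussian distribution with the given mean and a standard deviation of 25% of the mean, using the Box-Muller transform. The sampleDelay function is hypothetical, and clamping the sample to 1000 milliseconds is an assumption about how the documented minimum could be enforced; the crawler's real implementation may differ.

```javascript
// Hypothetical sketch: samples a delay (in milliseconds) from a Gaussian
// distribution with the given mean and a standard deviation of 25% of it.
// `random` is an optional source of uniform numbers in [0, 1), injectable
// for testing; it defaults to Math.random.
function sampleDelay(meanMillis, random) {
    random = random || Math.random;

    // Box-Muller transform: turns two uniform samples into one sample
    // from the standard normal distribution
    var u1 = 1 - random();   // in (0, 1], avoids Math.log(0)
    var u2 = random();
    var z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);

    var delay = meanMillis + z * 0.25 * meanMillis;

    // Assumption: the 1000 ms minimum is enforced by clamping
    return Math.max(1000, delay);
}
```

With a mean of 4000 milliseconds, most sampled delays fall between 2000 and 6000 milliseconds (two standard deviations either side of the mean), and none is ever shorter than 1000 milliseconds.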

Max pages per IP address {maxCrawledPagesPerSlave: Number}

Maximum number of pages that a single crawling process will open before it is restarted with a new proxy server setting. This option can help avoid the blocking of the crawler by the target server and also ensures that the crawling processes don't grow too large, as they are killed periodically. The default is 50.

Parallel crawling processes {maxParallelRequests: Number}

The number of parallel processes that will perform the crawl. If more than one, page screenshots and HTML snapshots are disabled, because they would switch too quickly and it would make no sense for them to be enabled. The maximum value is determined by your subscription type (see Account for your service limits). Note that each of the parallel crawling processes typically uses a different IP address for outgoing HTTP requests.

Custom HTTP headers {customHttpHeaders: [{key: String, value: String}]}

Defines custom HTTP headers used by the crawler. The maximum length of the header name is 100 characters and the maximum length of the value is 1000 characters.

Proxy {proxyType: String}

Specifies the type of proxy servers that will be used by the crawler in order to hide its origin. The following table lists all available options:

Crawler will not use any proxies. All web pages will be loaded directly from IP addresses of Apify servers running on Amazon Web Services.
Apify Proxy (automatic)
The crawler will load all web pages using the Apify Proxy in the automatic mode. In this mode, the proxy uses all proxy groups that are available to the user, and for each new web page it automatically selects the proxy that hasn't been used in the longest time for the specific hostname, in order to reduce the chance of detection by the website. You can view the list of available proxy groups on the Proxy page in the app.
Apify Proxy (selected groups)
The crawler will load all web pages using the Apify Proxy with specific groups of target proxy servers. Please refer to the Proxy groups section for more details.
Custom proxies
Enables the crawler to use a custom list of proxy servers. Please refer to the Custom proxies section for more details.

Note that the custom proxy used to fetch a specific page is stored to the proxy field of the Request object. Note that for security reasons, the usernames and passwords are redacted from the proxy URL.

Proxy groups {proxyGroups: [String]}

This field is only available for the Selected proxy groups option of the Proxy field.

The crawler will use Apify Proxy with target proxies from the selected proxy groups. Each new web page will be served by a target proxy server that hasn't been used in the longest time for the specific hostname, in order to reduce the chance of detection by the website. You can view the list of available groups on the Proxy page in the app.

If you prefer to use your own proxy servers, select the Custom proxies option in the Proxy field and then enter the proxy servers into the Custom proxies field.

Custom proxies {customProxies: String}

This field is only available for the Custom proxies option of the Proxy field.

A list of custom proxy servers to be used by the crawler. Each proxy should be specified in the scheme://user:password@host:port format; multiple proxies should be separated by a space or new line. The URL scheme defines the proxy type; possible values are http and socks5. The user and password might be omitted, but the port must always be present.
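If you generate the proxy list programmatically, it can be useful to validate it before saving the crawler settings. The sketch below checks each entry against the format described above; the parseCustomProxies helper, its regular expression and the host names used when calling it are illustrative assumptions, not part of the Apify API.

```javascript
// Hypothetical validator for the custom proxies format:
// scheme://user:password@host:port, separated by spaces or new lines.
function parseCustomProxies(text) {
    // scheme is http or socks5; user and password are optional; port is required
    var pattern = /^(http|socks5):\/\/(?:([^:@\s]+)(?::([^@\s]*))?@)?([^:@\/\s]+):(\d+)$/;

    return text.split(/\s+/).filter(function (item) {
        // Ignore empty strings produced by leading or trailing whitespace
        return item.length > 0;
    }).map(function (item, index) {
        var match = pattern.exec(item);
        if (!match) {
            // Do not echo the entry itself, it may contain a password
            throw new Error('Invalid proxy at position ' + (index + 1));
        }
        return {
            scheme: match[1],
            user: match[2] || null,
            password: match[3] || null,
            host: match[4],
            port: parseInt(match[5], 10)
        };
    });
}
```

An entry without a port, or with a scheme other than http or socks5, causes the function to throw, which mirrors the rule that the port must always be present.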


If you want to combine your custom proxies with Apify Proxy groups, or if you wish to use the Apify Proxy rotation and proxy selection system for your custom proxies, please let us know.

Initial cookies {cookies: [Object]}

An array of cookies used to initialize the crawler. You can export the cookies from your own web browser, for example using the EditThisCookie plugin. This setting is typically used to start crawling when logged in to certain websites. The array might be null or empty, in which case the crawler will start with no cookies.

Note that if the Cookies persistence setting is Over all crawler runs, the cookies array will be overwritten with fresh cookies from the crawler whenever it successfully finishes.

WARNING: You should never share cookies or an exported crawler configuration containing cookies with untrusted parties, because they might use it to authenticate themselves to various websites with your credentials.


[
  {
    "domain": "",
    "expires": "Thu, 01 Jun 2017 16:14:38 GMT",
    "expiry": 1496333678,
    "httponly": true,
    "name": "NAME",
    "path": "/",
    "secure": false,
    "value": "Some value"
  },
  {
    "domain": "",
    "expires": "Thu, 01 Jun 2017 16:14:37 GMT",
    "expiry": 1496333677,
    "httponly": true,
    "name": "OTHER_NAME",
    "path": "/",
    "secure": false,
    "value": "Some other value"
  }
]

Cookies persistence {cookiesPersistence: String}

Indicates how the crawler saves and reuses cookies. When you start the crawler, the first PhantomJS process will use the cookies defined by the Initial cookies setting. Subsequent PhantomJS processes will use cookies as follows:

Per single crawling process only
Cookies are only maintained separately by each PhantomJS crawling process for the lifetime of that process. The cookies are not shared between crawling processes. This means that whenever the crawler rotates its IP address, it will start again with cookies defined by the Initial cookies setting. Use this setting for maximum privacy and to avoid detection of the crawler. This is the default option.
Per full crawler run
Indicates that cookies collected at the start of the crawl by the first PhantomJS process are reused by other PhantomJS processes, even when switching to a new IP address. This might be necessary to maintain a login performed at the beginning of your crawl, but it might help the server to detect the crawler. Note that cookies are only collected at the beginning of the crawl by the initial PhantomJS process. Cookies set by subsequent PhantomJS processes are only valid for the duration of that process and are not reused by other processes. This is necessary to enable crawl parallelization.
Over all crawler runs
This setting is similar to Per full crawler run. The only difference is that if the crawler finishes with the SUCCEEDED status, its current cookies are automatically saved to the Initial cookies setting so that a new crawler run starts where the previous run left off. This is useful to keep login cookies fresh and avoid their expiration.

Custom data {customData: Anything}

Custom user data passed to the page function and intercept request function as context.customData. This setting is mainly useful if you're invoking the crawler using an API, so that you can pass some arbitrary parameters to your code. In the crawler settings editor the value can only be a string, but when passing it through the API it can be an arbitrary JSON-stringifyable object.

Finish webhook URL {finishWebhookUrl: String}

A custom endpoint that receives an HTTP POST right after every run of the crawler ends, regardless of its status, i.e. whether it finished, failed, was stopped, etc. The POST payload is a JSON object defining an _id property, which contains the execution ID of the crawler run, and an actId property, which contains the internal ID of the crawler, e.g. {_id: "S76d9xzpvY7NLfSJc", actId: "lepE4f93lkDPqojdC"}. You can use this ID to query the crawl status and results using the API. Beware that the JSON object might be extended with other properties in the future.

The response to the POST request must have an HTTP status code in the 2XX range. Otherwise, it is considered an error and the request is retried every hour. If the request does not succeed after 48 hours, the system gives up and stops sending the requests.

For safety reasons, your URL should contain a secret token to ensure only Apify can invoke it. To test your endpoint, you can use one of the example crawlers. If your endpoint performs a time-consuming operation, you should respond to the request immediately so that it does not time out before Apify receives the response. The timeout of the webhook is 5 minutes. In rare circumstances, the webhook might be invoked more than once, so you should design your code to be idempotent with respect to duplicate calls.

You can test your webhook endpoint by clicking the Test button right next to your webhook URL. This will create a dummy crawl that is immediately finished and has zero results, whose only purpose is to test the finish webhook in real-world conditions.

Pro tip: If you want to run your crawler in an infinite loop, i.e. start a new run right after the previous one finishes, simply set the start crawler API endpoint as your finish webhook.
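The advice above (a secret token, a fast response and idempotent handling) can be sketched as a small handler that is independent of any particular web framework. The createFinishWebhookHandler function, its arguments and the returned objects are hypothetical names introduced for this example; in a real endpoint, the token would typically come from the request URL, the body from the POST payload, and the set of processed runs from persistent storage rather than memory.

```javascript
// Hypothetical, framework-independent sketch of a finish webhook handler.
// It validates the secret token and ignores duplicate calls for the same run.
function createFinishWebhookHandler(expectedToken) {
    // Execution IDs that were already handled (in memory, for illustration)
    var processedRuns = Object.create(null);

    return function handle(token, body) {
        // Reject calls that do not carry the secret token
        if (token !== expectedToken) {
            return { status: 403, processed: false };
        }

        var payload;
        try {
            payload = JSON.parse(body);
        } catch (e) {
            return { status: 400, processed: false };
        }
        if (!payload || typeof payload._id !== 'string') {
            return { status: 400, processed: false };
        }

        // Idempotency: a repeated call for the same run is acknowledged
        // with a 2XX status but triggers no further work
        if (processedRuns[payload._id]) {
            return { status: 200, processed: false };
        }
        processedRuns[payload._id] = true;

        // Time-consuming work should be queued here and performed later,
        // so that the response is sent well within the webhook timeout
        return { status: 200, processed: true };
    };
}
```

Answering duplicates with a 2XX status matters because any other status is treated as an error and causes the request to be retried.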

Finish webhook data {finishWebhookData: String}

You can add a custom string to be sent in the finish webhook's POST request. If you set the value of this field to my value, the JSON object sent as the POST request payload will look as follows:

{
  _id: "S76d9xzpvY7NLfSJc",
  actId: "lepE4f93lkDPqojdC",
  data: "my value"
}
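Since the field is a plain string, one way to pass structured data through it is to store serialized JSON and decode it on the receiving side. The following is a minimal sketch under that assumption; readWebhookData is a hypothetical helper and the customerId property is made up for illustration.

```javascript
// Hypothetical helper that reads the finish webhook payload.
// `rawBody` is the raw POST body received by your endpoint.
function readWebhookData(rawBody) {
    const payload = JSON.parse(rawBody);
    let data = payload.data;

    // If the custom string looks like a serialized JSON object or array,
    // try to decode it; otherwise keep it as plain text.
    if (typeof data === 'string' && /^\s*[\[{]/.test(data)) {
        try {
            data = JSON.parse(data);
        } catch (err) {
            // Not valid JSON after all, leave the original string untouched.
        }
    }
    return { executionId: payload._id, actId: payload.actId, data };
}

// Plain string, as in the example above.
const plain = readWebhookData('{"_id":"S76d9xzpvY7NLfSJc","actId":"lepE4f93lkDPqojdC","data":"my value"}');
console.log(plain.data);

// Serialized JSON stored in finishWebhookData (made-up content).
const structured = readWebhookData(JSON.stringify({
    _id: 'S76d9xzpvY7NLfSJc',
    actId: 'lepE4f93lkDPqojdC',
    data: JSON.stringify({ customerId: 42 }),
}));
console.log(structured.data.customerId);
```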

Test URL {testUrl: {key: String, value: String}}

A single URL with an optional label to test your crawler on. When using this to start the crawler, all Start URLs will be ignored and the crawler will be run for this URL only. All the Pseudo-URLs stay in place to allow for navigation testing.

Crawler internals

Request object

This object contains all the available information about every single web page the crawler encounters (both visited and not visited). It is passed to both the Page function and the Intercept request function, and the crawling results are simply an array of these objects.
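Because the results are just an array of Request objects, post-processing them is ordinary array code. Here is a minimal sketch that separates successful pages from failed ones using the url, errorInfo and pageFunctionResult fields described in the schema in this section. The sample data and the splitResults helper are made up for illustration.

```javascript
// Made-up sample results; real ones are produced by the crawler.
const sampleResults = [
    { url: 'http://example.com/a', errorInfo: '', pageFunctionResult: { title: 'A' } },
    { url: 'http://example.com/b', errorInfo: 'Page load timed out', pageFunctionResult: null },
    { url: 'http://example.com/c', errorInfo: null, pageFunctionResult: { title: 'C' } },
];

// Hypothetical helper that splits Request objects by whether an error occurred.
function splitResults(requests) {
    const succeeded = [];
    const failed = [];
    for (const request of requests) {
        // errorInfo is an empty string, null or undefined when there was no error.
        if (request.errorInfo) {
            failed.push({ url: request.url, error: request.errorInfo });
        } else {
            succeeded.push(request.pageFunctionResult);
        }
    }
    return { succeeded, failed };
}

const summary = splitResults(sampleResults);
console.log(summary.succeeded.length, summary.failed.length); // prints: 2 1
```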

The Request object has the following schema:

  // A string with a unique identifier of this Request object.
  // It is generated from the uniqueKey, therefore two pages from different crawls
  // with the same uniqueKey will also have the same ID.
  id: String,

  // The URL that was specified in the web page's navigation request,
  // possibly updated by the 'interceptRequest' function
  url: String,

  // The final URL reported by the browser after the page was opened
  // (will be different from 'url' if there was a redirect)
  loadedUrl: String,

  // Date and time of the original web page's navigation request
  requestedAt: Date,
  // Date and time when the page load was initiated in the web browser, or null if it wasn't
  loadingStartedAt: Date,
  // Date and time when the page was actually loaded, or null if it wasn't
  loadingFinishedAt: Date,

  // HTTP status and headers of the loaded page.
  // If there were any redirects, the status and headers correspond to the final response, not the intermediate responses.
  responseStatus: Number,
  responseHeaders: Object,

  // If the page could not be loaded for any reason (e.g. a timeout), this field contains a best guess of
  // the error code. The value is either one of the QNetworkReply::NetworkError codes
  // or 999 for an unknown error. This field is used internally to retry failed page loads.
  // Note that the field is only informative and might not be set for all types of errors.
  // Always use errorInfo to determine whether the page was processed successfully.
  loadErrorCode: Number,

  // Date and time when the page function started and finished
  pageFunctionStartedAt: Date,
  pageFunctionFinishedAt: Date,

  // An arbitrary string that uniquely identifies the web page in the crawling queue.
  // It is used by the crawler to determine whether a page has already been visited.
  // If two or more pages have the same uniqueKey, then the crawler only visits the first one.
  // By default, uniqueKey is generated from the 'url' property as follows:
  //  * hostname and protocol are converted to lower-case
  //  * trailing slash is removed
  //  * common tracking parameters starting with 'utm_' are removed
  //  * query parameters are sorted alphabetically
  //  * whitespace around all components of the URL is trimmed
  //  * if the 'considerUrlFragment' setting is disabled, the URL fragment is removed completely
  // If you prefer different generation of uniqueKey, you can override it in the 'interceptRequest'
  // or 'context.enqueuePage' functions.
  uniqueKey: String,

  // Describes the type of the request. It can be either one of the following values:
  // 'InitialAboutBlank', 'StartUrl', 'SingleUrl', 'ActorRequest', 'OnUrlChanged', 'UserEnqueued', 'FoundLink'
  // or in case the request originates from PhantomJS' onNavigationRequested() it can be one of the following values:
  // 'Undefined', 'LinkClicked', 'FormSubmitted', 'BackOrForward', 'Reload', 'FormResubmitted', 'Other'
  type: String,

  // Boolean value indicating whether the page was opened in a main frame or a child frame
  isMainFrame: Boolean,

  // HTTP POST payload
  postData: String,

  // Content-Type HTTP header of the POST request
  contentType: String,

  // Contains "GET" or "POST"
  method: String,

  // Indicates whether the page will be loaded by the crawler or not
  willLoad: Boolean,

  // The label specified in the startUrls or crawlPurls config settings for the URL/PURL that corresponds
  // to this page request. If more than one URL/PURL matches, this field contains the FIRST NON-EMPTY
  // label in the order in which the labels appear in the startUrls and crawlPurls arrays.
  // Note that labels are not mandatory, so the field might be null.
  label: String,

  // ID of the Request object from whose page this Request was first initiated, or null.
  referrerId: String,

  // Contains the Request object corresponding to 'referrerId'.
  // This value is only available in pageFunction and interceptRequest functions
  // and can be used to access properties and page function results of the page linking to the current page.
  // Note that the referrer Request object DOES NOT recursively define the 'referrer' property.
  referrer: Object,

  // How many links away from the Start URLs this page was found
  depth: Number,

  // If any error occurred while loading or processing the web page,
  // this field contains a non-empty string with a description of the error.
  // The field is used for all kinds of errors, such as page load errors, the page function or
  // intercept request function exceptions, timeouts, internal crawler errors etc.
  // If there is no error, the field is a falsy value (empty string, null or undefined).
  errorInfo: String,

  // Results of the user-provided 'pageFunction'
  pageFunctionResult: Anything,

  // A field that might be used by 'interceptRequest' function to save custom data related to this page request
  interceptRequestData: Anything,

  // Total size of all resources downloaded during this request
  downloadedBytes: Number,

  // Indicates the position where the request will be placed in the crawling queue.
  // Can either be 'LAST' to put the request to the end of the queue (default behavior)
  // or 'FIRST' to put it before any other requests.
  queuePosition: String,

  // Custom proxy used by the crawler, or null if custom proxies were not used.
  // For security reasons, the username and password are redacted from the URL.
  proxy: String
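
To get a feel for how the default uniqueKey rules listed above collapse different spellings of the same address, here is a rough approximation in JavaScript. It is not the crawler's actual implementation and edge cases may well differ: computeUniqueKey is a hypothetical function, it only trims whitespace around the whole URL, and it ignores any username or password in the URL.

```javascript
// Rough approximation of the default uniqueKey rules.
// NOT the crawler's real code; it only illustrates the listed steps.
function computeUniqueKey(url, considerUrlFragment = false) {
    // Trim surrounding whitespace. The URL parser itself lower-cases
    // the protocol and the hostname.
    const parsed = new URL(url.trim());

    // Drop tracking parameters starting with 'utm_' and sort the rest by name.
    const params = [...parsed.searchParams.entries()]
        .filter(([name]) => !name.startsWith('utm_'))
        .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    const query = new URLSearchParams(params).toString();

    // Remove the trailing slash from the path.
    const path = parsed.pathname.replace(/\/+$/, '');

    // Keep the fragment only if the considerUrlFragment setting is enabled.
    const fragment = considerUrlFragment ? parsed.hash : '';

    return `${parsed.protocol}//${parsed.host}${path}${query ? `?${query}` : ''}${fragment}`;
}

// Two spellings of the same page end up with the same key.
console.log(computeUniqueKey(' HTTP://Example.COM/Path/?utm_source=news&b=2&a=1#top '));
console.log(computeUniqueKey('http://example.com/Path?a=1&b=2'));
```

If this default does not suit your website, e.g. because the order of query parameters matters, you can set your own uniqueKey in the 'interceptRequest' or 'context.enqueuePage' functions as mentioned in the schema.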