Legacy PhantomJS Crawler
Replacement for the legacy Apify Crawler product with a backward-compatible interface. The actor uses PhantomJS headless browser to recursively crawl websites and extract data from them using a piece of front-end JavaScript code.
Start URLs
startUrls
array (required)
List of URLs that will be loaded by the crawler on start. For a POST request, append [POST] to the URL, e.g. http://www.example.com/[POST]
Pseudo-URLs
crawlPurls
array (optional)
Specifies URLs of pages to crawl. Put regular expressions in [ ] brackets, e.g. http://www.example.com/[.*]
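To illustrate the bracket notation, here is a minimal sketch of how a Pseudo-URL could be turned into a regular expression: text outside [ ] is matched literally, and text inside [ ] is used as a regular expression. The helper names pseudoUrlToRegExp and escapeLiteral are hypothetical, and this is not necessarily how the crawler implements matching; in particular, the sketch does not handle brackets nested inside a bracketed expression.

```javascript
// Hypothetical helper: escapes characters that have a special meaning in a RegExp.
function escapeLiteral(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: converts a Pseudo-URL such as
// "http://www.example.com/[.*]" into an anchored RegExp.
function pseudoUrlToRegExp(purl) {
    var pattern = '';
    var i = 0;
    while (i < purl.length) {
        var open = purl.indexOf('[', i);
        var close = open === -1 ? -1 : purl.indexOf(']', open);
        if (open === -1 || close === -1) {
            // No further (balanced) bracket pair: the rest is literal text.
            pattern += escapeLiteral(purl.slice(i));
            break;
        }
        // Literal text before the bracket, then the bracketed regular expression.
        pattern += escapeLiteral(purl.slice(i, open));
        pattern += '(' + purl.slice(open + 1, close) + ')';
        i = close + 1;
    }
    return new RegExp('^' + pattern + '$');
}

var matcher = pseudoUrlToRegExp('http://www.example.com/[.*]');
var matchesProduct = matcher.test('http://www.example.com/products/1');
var matchesOtherDomain = matcher.test('http://www.example.org/products/1');
```

With this sketch, matchesProduct is true and matchesOtherDomain is false, because the dots in the literal part are escaped rather than treated as wildcards.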
Clickable elements
clickableElementsSelector
string (optional)
CSS selector used to find links to other web pages. Leave empty to ignore all links.
For example: a[href]
Page function
pageFunction
string (optional)
JavaScript function that is executed on every crawled page; use it to extract data. Note that only ES5.1 syntax is supported.
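A minimal sketch of a Page function, written in ES5.1 syntax, might look as follows. It relies on context.jQuery (see Inject jQuery) and context.customData (see Custom data). The assumptions here are that the function receives a single context argument and that the object it returns is stored as the result for the page; the label field in the custom data is purely hypothetical.

```javascript
// Sketch of a Page function (ES5.1 only: no let/const, arrow functions or template strings).
function pageFunction(context) {
    // jQuery is not registered globally; it is only reachable through the context.
    var $ = context.jQuery;
    // Assumption: the returned object becomes the result record for this page.
    return {
        title: $('title').text(),
        // "label" is a hypothetical field passed in through the Custom data setting.
        label: context.customData ? context.customData.label : null
    };
}
```

Because the function only touches its context argument, it can be exercised outside the crawler by passing in a stub object with jQuery and customData properties.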
Intercept request function
interceptRequest
string (optional)
JavaScript function called whenever the crawler finds a link or form leading to a new web page. Note that only ES5.1 syntax is supported.
URL #fragments identify unique pages
considerUrlFragment
boolean (optional)
Indicates that the URL fragment identifier (e.g. #this-guy-here in http://example.com/page#this-guy-here) should be considered when matching a URL against a Pseudo-URL or when checking whether a page has already been visited. Typically, URL fragments are used as internal page anchors and should therefore be ignored, because they don't represent separate pages. However, many AJAX-based websites use URL fragments to represent page parameters; in such cases, this option should be enabled.
Default value of this property is false
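The effect of this option on the "already visited" check can be sketched as follows. The function name uniqueKey is hypothetical and the crawler's real deduplication logic may differ; the sketch only shows that the fragment is dropped unless considerUrlFragment is enabled.

```javascript
// Hypothetical helper: derives the key used to decide whether two URLs are the same page.
function uniqueKey(url, considerUrlFragment) {
    if (considerUrlFragment) {
        // Fragment is significant: keep the URL as it is.
        return url;
    }
    // Fragment is ignored: cut the URL at the first '#', if there is one.
    var hashIndex = url.indexOf('#');
    return hashIndex === -1 ? url : url.slice(0, hashIndex);
}

var ignored = uniqueKey('http://example.com/page#this-guy-here', false);
var considered = uniqueKey('http://example.com/page#this-guy-here', true);
```

Here ignored is http://example.com/page, so the URL would be treated as the same page as http://example.com/page, while considered keeps the full URL including the fragment.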
Download HTML images
loadImages
boolean (optional)
Indicates whether the crawler should load HTML images, both those included using the <img> tag as well as those included in CSS styles. Disable this feature after you have fine-tuned your crawler in order to increase crawling performance and reduce your bandwidth costs.
Default value of this property is true
Download CSS files
loadCss
boolean (optional)
Indicates whether the crawler should load CSS stylesheet files. Disable this feature after you have fine-tuned your crawler in order to increase crawling performance and reduce your bandwidth costs.
Default value of this property is true
Inject jQuery
injectJQuery
boolean (optional)
Indicates that the jQuery library should be injected into each page before the Page function is invoked. Note that the jQuery object will not be registered in the global namespace, in order to avoid conflicts with libraries used by the web page. It can only be accessed through context.jQuery.
Default value of this property is true
Inject Underscore.js
injectUnderscoreJs
boolean (optional)
Indicates that the Underscore.js library should be injected into each page before the Page function is invoked. Note that the Underscore object will not be registered in the global namespace, in order to avoid conflicts with libraries used by the web page. It can only be accessed through context.underscoreJs.
Default value of this property is false
Ignore robots exclusion standards
ignoreRobotsTxt
boolean (optional)
Indicates that the crawler should ignore robots.txt, <meta name='robots'> tags and X-Robots-Tag HTTP headers. Use this feature at your own risk!
Default value of this property is false
Don't load frames and IFRAMEs
skipLoadingFrames
boolean (optional)
Indicates that child frames included using FRAME or IFRAME tags will not be loaded by the crawler. This might improve crawling performance. As a side-effect, JavaScript redirects issued by the page before it was completely loaded will not be performed, which might be useful in certain situations.
Default value of this property is false
Verbose log
verboseLog
boolean (optional)
If enabled, the log will also contain DEBUG messages. Note that this setting will dramatically slow down the crawler as well as your web browser and increase the log size.
Default value of this property is false
Disable web security
disableWebSecurity
boolean (optional)
If checked, the virtual browser will allow cross-domain XHRs and untrusted SSL certificates, so that your crawler can access content from any domain. Only activate this feature if you know what you're doing!
Default value of this property is false
Rotate User-Agent headers
rotateUserAgents
boolean (optional)
If checked, the crawler automatically rotates the User-Agent HTTP header for each new IP address, choosing from a pre-defined list. This setting overrides any User-Agent set in Custom HTTP headers.
Default value of this property is false
Max pages per crawl
maxCrawledPages
integer (optional)
Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. Always set this value in order to prevent infinite loops in misconfigured crawlers. Note that in cases of parallel crawling, the actual number of pages visited might be slightly higher than this value.
Max result records
maxOutputPages
integer (optional)
Maximum number of pages the crawler can output to JSON. The crawl will stop when this limit is reached. This value is useful when you only need a limited number of results.
Max crawling depth
maxCrawlDepth
integer (optional)
Defines how many links away from the start URLs the crawler will descend. This value is a safeguard against infinite crawling depths on misconfigured crawlers. Note that pages added using enqueuePage() in the Page function are not subject to the maximum depth constraint.
Execution timeout
timeout
integer (optional)
This field has been deprecated and its value is ignored. To set the execution timeout, use the actor run timeout option instead.
Default value of this property is 604800
Resource timeout
resourceTimeout
integerOptional
Timeout for network resources loaded by the crawler, in milliseconds.
Default value of this property is 30000
Page load timeout
pageLoadTimeout
integer (optional)
Timeout for web page load, in milliseconds. If the web page does not load within this time frame, the load is considered to have failed and will be retried, as with other page load errors.
Default value of this property is 60000
Page function timeout
pageFunctionTimeout
integer (optional)
Timeout for the asynchronous part of the Page function, in milliseconds. Note that this value is only applied if your page function runs code in the background, i.e. when it invokes context.willFinishLater(). The page function itself always runs to completion regardless of the timeout.
Default value of this property is 600000
Infinite scroll height
maxInfiniteScrollHeight
integer (optional)
Defines the maximum client height in pixels to which the browser window is scrolled in order to fetch dynamic AJAX-based content from the web server. By default, the crawler doesn't scroll and uses a fixed browser window size. Note that you might need to enable Download HTML images to make infinite scroll work, because otherwise the crawler wouldn't know that some resources are still being loaded and will stop infinite scrolling prematurely.
Delay between requests
randomWaitBetweenRequests
integer (optional)
This option forces the crawler to ensure a minimum time interval between opening two web pages, in order to prevent it from overloading the target server. The actual minimum time is a random value drawn from a Gaussian distribution with a mean specified by your setting (in milliseconds) and a standard deviation corresponding to 25% of the mean. The minimum value is 1000 milliseconds; the crawler never issues requests at intervals shorter than that.
Default value of this property is 1000
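The delay described above can be sketched as a sample from a Gaussian distribution with the configured mean and a standard deviation of 25% of the mean, clamped to the 1000 ms floor. The function name sampleDelayMillis, the use of the Box-Muller transform and the optional random argument are assumptions made for illustration; the crawler's actual sampling code is not shown in this document.

```javascript
// Hypothetical helper: samples one delay (in milliseconds) for a given mean.
// "random" is an optional source of uniform numbers in [0, 1), defaulting to Math.random.
function sampleDelayMillis(meanMillis, random) {
    random = random || Math.random;
    // Box-Muller transform: two uniform samples give one standard normal sample.
    var u1 = 1 - random(); // in (0, 1], so Math.log never receives 0
    var u2 = random();
    var standardNormal = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
    // Standard deviation is 25% of the mean.
    var delay = meanMillis + standardNormal * 0.25 * meanMillis;
    // Never go below the documented 1000 ms minimum.
    return Math.max(1000, Math.round(delay));
}

var exampleDelay = sampleDelayMillis(2000);
```

With a mean of 2000 ms, most samples fall roughly between 1000 and 3000 ms, and none fall below 1000 ms because of the clamp.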
Max pages per IP address
maxCrawledPagesPerSlave
integer (optional)
Maximum number of pages that a single crawling process will open before it is restarted with a new proxy server setting. This option can help avoid the blocking of the crawler by the target server and also ensures that the crawling processes don't grow too large, as they are killed periodically.
Default value of this property is 50
Max parallel processes
maxParallelRequests
integer (optional)
The maximum number of parallel processes that will perform the crawl. The actual number might be lower if the actor runs without enough memory. Note that each parallel process uses a different proxy (if enabled).
Default value of this property is 50
Max page retries
maxPageRetryCount
integer (optional)
The maximum number of times the crawler will retry to open a web page on load error. Note that on page function errors, the pages are not retried.
Default value of this property is 3
Custom HTTP headers
customHttpHeaders
array (optional)
Custom HTTP headers that the crawler adds to all requests. It is an array of objects, where each object has the key and value properties.
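As an illustration of the array-of-objects shape, the following sketch defines a customHttpHeaders value and flattens it into a plain name-to-value map. The header names and values are made-up examples, and headersToObject is a hypothetical helper that is not part of the crawler.

```javascript
// Example value for customHttpHeaders: each entry has a "key" and a "value" property.
var customHttpHeaders = [
    { key: 'Accept-Language', value: 'en-US' },
    { key: 'X-Example-Header', value: '1' } // made-up header for illustration
];

// Hypothetical helper: converts the array into a plain { name: value } object.
function headersToObject(headers) {
    var result = {};
    for (var i = 0; i < headers.length; i++) {
        result[headers[i].key] = headers[i].value;
    }
    return result;
}

var headerMap = headersToObject(customHttpHeaders);
```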
Proxy configuration
proxyConfiguration
object (optional)
Specifies the type of proxy servers that will be used by the crawler in order to hide its origin.
Proxy type (legacy)
proxyType
string (optional)
Specifies the type of proxy servers that will be used by the crawler.
This is a legacy option only kept for backwards compatibility - use proxyConfiguration instead!
Proxy groups (legacy)
proxyGroups
array (optional)
Specifies Apify Proxy groups to be used when proxyType is SELECTED_PROXY_GROUPS.
This is a legacy option only kept for backwards compatibility - use proxyConfiguration instead!
Default value of this property is []
Custom proxies (legacy)
customProxies
string (optional)
Specifies the custom proxy servers to be used when proxyType is CUSTOM. Each proxy should be specified in the scheme://user:password@host:port format; multiple proxies should be separated by a space or a new line.
This is a legacy option only kept for backwards compatibility - use proxyConfiguration instead!
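The expected string format can be illustrated with a small sketch that splits a customProxies value into individual proxy URLs. The helper name parseCustomProxies and the host names in the example are hypothetical; the sketch only demonstrates the separator rules (space or new line) described above.

```javascript
// Hypothetical helper: splits a customProxies string on spaces and new lines.
function parseCustomProxies(text) {
    return text.split(/\s+/).filter(function (entry) {
        // Drop empty entries caused by leading, trailing or repeated separators.
        return entry.length > 0;
    });
}

// Made-up proxies in the scheme://user:password@host:port format.
var proxies = parseCustomProxies(
    'http://user:password@proxy1.example.com:8000 http://user:password@proxy2.example.com:8000\n' +
    'http://user:password@proxy3.example.com:8000'
);
```

In this example, proxies is an array with three entries, one per proxy URL.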
Custom data
customData
object, string, integer, boolean or array (optional)
A custom JSON object that is passed to the Page function and the Intercept request function as context.customData. This setting is mainly useful if you're invoking the crawler using the API, so that you can pass some arbitrary parameters to your code.
Finish webhook URL
finishWebhookUrl
string (optional)
An HTTP endpoint that receives a POST request right after the run of this actor finishes. The POST payload is a JSON object with the following properties: actorId, runId, taskId, datasetId and data.
For more information about finish webhooks, please see the actor README.
Finish webhook data
finishWebhookData
string (optional)
Custom string that is sent in the POST payload to Finish webhook URL, as the data property.
For more information about finish webhooks, please see the actor README.
Cookies persistence
cookiesPersistence
enum (optional)
Indicates how cookies collected by the crawler are persisted. This is useful if you need to maintain a login.
For more information about cookies, please see the actor README.
Value options (all strings): "PER_PROCESS", "PER_CRAWLER_RUN", "OVER_CRAWLER_RUNS"
Default value of this property is "PER_PROCESS"
Initial cookies
cookies
array (optional)
JSON array with cookies that the crawler starts with. This is useful for reusing a login from an external web browser. Note that if the Cookies persistence setting is Over all crawler runs, this field in the actor task configuration will be overwritten with new cookies from the crawler whenever it successfully finishes.
For more information about cookies, please see the actor README.
Actor Metrics
101 monthly users
-
21 stars
>99% runs succeeded
55 days response time
Created in Mar 2019
Modified 5 months ago