
Legacy PhantomJS Crawler
- apify/legacy-phantomjs-crawler
- Users: 1.4k
- Runs: 13.8M
- Created by Apify
Replacement for the legacy Apify Crawler product with a backward-compatible interface. The actor uses the PhantomJS headless browser to recursively crawl websites and extract data from them using a piece of front-end JavaScript code.
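For illustration, here is a minimal Page function in the style the actor expects: it runs inside the loaded page, and whatever object it returns is stored as one result record (a sketch only; see the actor README for the full context API):

```javascript
// Minimal Page function: executes in the context of the crawled page.
// The returned object becomes one record in the crawler's results.
function pageFunction(context) {
    return {
        url: window.location.href, // page URL as seen by the browser
        title: document.title      // extracted data point
    };
}
```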
Optional
boolean
Indicates that the URL fragment identifier (e.g. http://example.com/page#this-guy-here) should be considered when matching a URL against a Pseudo-URL or when checking whether a page has already been visited. Typically, URL fragments are used as internal page anchors and should therefore be ignored, because they don't represent separate pages. However, many AJAX-based websites nowadays use URL fragments to represent page parameters; in such cases, this option should be enabled.
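As an illustration (made-up URLs), the difference looks like this:

```text
# Fragments as in-page anchors - with this option off, both count as one page:
http://example.com/docs#installation
http://example.com/docs#usage

# Fragments as AJAX page parameters - enable the option to keep them distinct:
http://example.com/app#!/product/1
http://example.com/app#!/product/2
```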
Optional
boolean
Indicates whether the crawler should load HTML images, both those included using the <img> tag and those referenced from CSS styles. Disable this feature after you have fine-tuned your crawler, in order to increase crawling performance and reduce your bandwidth costs.
Optional
boolean
Indicates that the jQuery library should be injected into each page before the Page function is invoked. Note that the jQuery object will not be registered in the global namespace, in order to avoid conflicts with libraries used by the web page. It can only be accessed through context.jQuery.
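A sketch of how the injected library is reached inside the Page function (the jQuery usage itself is standard; only the context.jQuery access path comes from the description above):

```javascript
function pageFunction(context) {
    // jQuery is not in the global namespace; obtain it from the context.
    var $ = context.jQuery;
    return {
        title: $('title').text(),
        // collect the text of all top-level headings
        headings: $('h1').map(function () { return $(this).text(); }).get()
    };
}
```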
Optional
boolean
Indicates that the Underscore.js library should be injected into each page before the Page function is invoked. Note that the Underscore object will not be registered in the global namespace, in order to avoid conflicts with libraries used by the web page. It can only be accessed through context.underscoreJs.
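Similarly, a sketch using the injected Underscore.js via context.underscoreJs:

```javascript
function pageFunction(context) {
    // Underscore is not global; obtain it from the context.
    var _ = context.underscoreJs;
    // _.map works on array-like objects such as a NodeList;
    // _.uniq removes duplicate link targets.
    var hrefs = _.uniq(_.map(document.querySelectorAll('a'), function (a) {
        return a.href;
    }));
    return { uniqueLinkCount: hrefs.length };
}
```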
Optional
boolean
Indicates that child frames included using FRAME or IFRAME tags will not be loaded by the crawler. This might improve crawling performance. As a side effect, JavaScript redirects issued by a page before it has completely loaded will not be performed, which might be useful in certain situations.
Optional
integer
Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. Always set this value to prevent infinite loops in misconfigured crawlers. Note that when pages are crawled in parallel, the actual number of pages visited might slightly exceed this value.
Optional
integer
Defines how many links away from the start URLs the crawler will descend. This value is a safeguard against infinite crawl depth in misconfigured crawlers. Note that pages added using enqueuePage() in the Page function are not subject to the maximum depth constraint (see the sketch below).
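A sketch of such a manual enqueue; I'm assuming here that enqueuePage() accepts a request object with a url property, per the legacy Crawler API:

```javascript
function pageFunction(context) {
    // Manually enqueued pages bypass the maximum crawl depth constraint.
    // (Signature assumed from the legacy Crawler API.)
    context.enqueuePage({ url: 'http://example.com/detail/42' });
    return null; // nothing to save from this page
}
```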
Optional
integer
Timeout for the asynchronous part of the Page function, in milliseconds. Note that this value is only applied if your page function runs code in the background, i.e. when it invokes context.willFinishLater(). The page function itself always runs to completion regardless of the timeout.
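A sketch of an asynchronous Page function that this timeout governs; context.willFinishLater() is named in the description above, and I'm assuming context.finish() is its completing counterpart, per the legacy Crawler API:

```javascript
function pageFunction(context) {
    // Tell the crawler that the result will arrive asynchronously.
    context.willFinishLater();
    setTimeout(function () {
        // The timeout above applies to this background part. Assumed API:
        // context.finish() hands over the result and ends the page function.
        context.finish({ title: document.title });
    }, 5000);
}
```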
Optional
integer
Defines the maximum client height in pixels to which the browser window is scrolled in order to fetch dynamic AJAX-based content from the web server. By default, the crawler doesn't scroll and uses a fixed browser window size. Note that you might need to enable Download HTML images to make infinite scroll work, because otherwise the crawler can't tell that some resources are still being loaded and may stop scrolling prematurely.
Optional
integer
This option forces the crawler to ensure a minimum time interval between opening two web pages, to prevent it from overloading the target server. The actual wait time is a random value drawn from a Gaussian distribution with a mean equal to your setting (in milliseconds) and a standard deviation of 25% of the mean. The minimum value is 1000 milliseconds; the crawler never issues requests at intervals shorter than that.
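To make the distribution concrete, here is a sketch of how such a wait could be drawn (illustrative only, not the crawler's actual code):

```javascript
// Draw a wait time from a Gaussian with mean = setting (ms) and
// standard deviation = 25% of the mean, clamped to the 1000 ms minimum.
function randomWaitMillis(mean) {
    // Box-Muller transform: two uniform samples -> one standard normal sample
    var u1 = 1 - Math.random(); // avoid log(0)
    var u2 = Math.random();
    var gaussian = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
    return Math.max(1000, mean + gaussian * 0.25 * mean);
}
```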
Optional
integer
Maximum number of pages that a single crawling process will open before it is restarted with a new proxy server setting. This option can help avoid blocking by the target server, and it also ensures that the crawling processes don't grow too large, as they are killed periodically.
Optional
string
Specifies custom proxy servers to be used when proxyType is CUSTOM. Each proxy should be specified in the scheme://user:password@host:port format; multiple proxies should be separated by a space or new line. This is a legacy option kept only for backwards compatibility - use proxyConfiguration instead!
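For example, a value listing two made-up proxies, one per line:

```text
http://bob:p4ssw0rd@proxy-1.example.com:8000
http://bob:p4ssw0rd@proxy-2.example.com:8000
```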
Optional
object | string | integer | boolean | array
A custom JSON object that is passed to the Page function and the Intercept request function as context.customData. This setting is mainly useful if you're invoking the crawler using the API, so that you can pass arbitrary parameters to your code.
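A sketch of reading it back inside the Page function (the label key is a made-up example, not a built-in):

```javascript
function pageFunction(context) {
    // customData arrives exactly as passed in the actor input,
    // e.g. { "label": "contact-pages" }.
    if (context.customData.label === 'contact-pages') {
        // ... run extraction logic specific to this invocation ...
    }
    return { label: context.customData.label };
}
```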
Optional
string
An HTTP endpoint that receives a POST request right after the run of this actor finishes. The POST payload is a JSON object with the following properties: actorId, runId, taskId, datasetId and data. For more information about finish webhooks, please see the actor README.
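A hypothetical example of such a payload (property names from the description above; the values are made up):

```json
{
    "actorId": "aBcDeFgHiJkLmNoP",
    "runId": "qRsTuVwXyZaBcDeF",
    "taskId": "gHiJkLmNoPqRsTuV",
    "datasetId": "wXyZaBcDeFgHiJkL",
    "data": "..."
}
```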