contextractor - Trafilatura based
Pricing
Pay per usage
Extract clean, readable content. Uses Trafilatura, the top-rated extraction library, to strip away navigation, ads, and boilerplate, leaving just the text you need.
Glob patterns matching URLs of pages that will be included in crawling. Setting this option allows you to customize the crawling scope. For example https://{store,docs}.example.com/** lets the crawler access all URLs starting with https://store.example.com/ or https://docs.example.com/.
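The `{store,docs}` alternation combined with `**` can be illustrated with a small Python sketch. This is not the crawler's actual matcher (which presumably uses a minimatch-style glob engine); it is a rough approximation that expands brace alternations and then applies standard `fnmatch` globbing:

```python
from fnmatch import fnmatch

def expand_braces(pattern: str) -> list[str]:
    """Expand one {a,b} alternation group into separate glob patterns."""
    start = pattern.find("{")
    if start == -1:
        return [pattern]
    end = pattern.find("}", start)
    head, body, tail = pattern[:start], pattern[start + 1:end], pattern[end + 1:]
    results = []
    for option in body.split(","):
        # Recurse so multiple alternation groups are all expanded.
        results.extend(expand_braces(head + option + tail))
    return results

def url_matches(url: str, pattern: str) -> bool:
    """True if the URL matches any expansion of the glob pattern."""
    return any(fnmatch(url, p) for p in expand_braces(pattern))
```

With this sketch, `url_matches("https://store.example.com/cart", "https://{store,docs}.example.com/**")` is true, while a URL on `blog.example.com` is not matched.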
Glob patterns matching URLs of pages that will be excluded from crawling. Note that this affects only links found on pages, but not Start URLs, which are always crawled.
Pseudo-URLs to match links in the page that you want to enqueue. An alternative to glob patterns. Combine with the Link selector to tell the scraper where to find links.
CSS selector for links to enqueue. Leave empty to disable link enqueueing.
URL fragments (the parts of URL after a #) are not considered when the scraper determines whether a URL has already been visited. Turn this on to treat URLs with different fragments as different pages.
If enabled, the crawler will consult the robots.txt file for each domain before crawling pages.
Cookies that will be pre-set to all pages the scraper opens. This is useful for pages that require login. The value is expected to be a JSON array of objects with name and value properties. For example:
[{"name": "cookieName","value": "cookieValue","path": "/","domain": ".example.com"}]
You can use the EditThisCookie browser extension to copy browser cookies in this format, and paste it here.
Note that the value is secret and encrypted to protect your login cookies.
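If you already have cookies as plain name/value pairs, a small helper can produce the expected JSON array. The helper below is hypothetical (not part of the actor); it only demonstrates the shape described above:

```python
import json

def cookies_to_input(cookies: dict[str, str], domain: str, path: str = "/") -> str:
    """Convert {name: value} pairs into the JSON array format the actor expects.

    Hypothetical helper for illustration; the actor itself just takes the
    JSON string as input.
    """
    entries = [
        {"name": name, "value": value, "path": path, "domain": domain}
        for name, value in cookies.items()
    ]
    return json.dumps(entries)
```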
HTTP headers that will be added to all requests made by the crawler. This is useful for setting custom authentication headers or other headers required by the target website. The value is expected to be a JSON object with header names as keys and header values as values. For example: { "Authorization": "Bearer token123", "X-Custom-Header": "value" }.
Maximum pages to crawl. Includes start URLs and pagination pages. The crawler will automatically finish after reaching this number. 0 means unlimited.
Maximum number of results that will be saved to dataset. The scraper will terminate after reaching this number. 0 means unlimited.
Maximum link depth from Start URLs. Pages discovered further from start URLs than this limit will not be crawled. 0 means unlimited.
Maximum number of browser pages running in parallel. This setting is useful to avoid overloading target websites and getting blocked.
Maximum number of retries for failed requests on network, proxy, or server errors.
Trafilatura library extraction settings. Leave empty for balanced defaults. Keys: fast, favorPrecision, favorRecall, includeComments, includeTables, includeImages, includeFormatting, includeLinks, deduplicate, targetLanguage, withMetadata, onlyWithMetadata, teiValidation, pruneXpath.
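These camelCase keys correspond to Trafilatura's snake_case keyword arguments (e.g. `favorPrecision` maps to `favor_precision`, `includeTables` to `include_tables`). Assuming the actor forwards them this way, the conversion can be sketched as:

```python
import re

def to_trafilatura_kwargs(options: dict) -> dict:
    """Map the actor's camelCase option keys to snake_case keyword arguments.

    Assumption: the actor passes these through to trafilatura.extract(),
    whose parameters use snake_case naming.
    """
    def snake(name: str) -> str:
        # Insert an underscore before each interior uppercase letter.
        return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

    return {snake(key): value for key, value in options.items()}
```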
If enabled, the crawler saves the raw HTML of all pages to the default key-value store and includes the URL link in the dataset output.
If enabled, the crawler extracts plain text from all pages, saves it to the key-value store, and includes the URL link in the dataset output.
If enabled, the crawler extracts JSON with metadata from all pages, saves it to the key-value store, and includes the URL link in the dataset output.
If enabled, the crawler extracts Markdown from all pages, saves it to the key-value store, and includes the URL link in the dataset output.
If enabled, the crawler extracts XML from all pages, saves it to the key-value store, and includes the URL link in the dataset output.
If enabled, the crawler extracts XML-TEI (scholarly format) from all pages, saves it to the key-value store, and includes the URL link in the dataset output.
Name or ID of the dataset for storing results. Leave empty to use the default run dataset.
Name or ID of the key-value store for content files. Leave empty to use the default store.
Name of the request queue for pending URLs. Leave empty to use the default queue.
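Putting the crawl-scope, limit, output, and storage options together, a run input might look like the sketch below. The key names here are guesses based on the option descriptions above, not the actor's exact input schema, so check the actual schema before using them:

```python
# Hypothetical run input for this actor; field names are assumptions
# derived from the option descriptions, not the published input schema.
run_input = {
    "startUrls": [{"url": "https://docs.example.com/"}],
    "globs": ["https://docs.example.com/**"],
    "maxCrawlPages": 100,
    "saveMarkdown": True,
    "datasetName": "",        # empty -> default run dataset
    "keyValueStoreName": "",  # empty -> default key-value store
}
```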
Enables loading websites from IP addresses in specific geographies and helps circumvent blocking.
Proxy rotation strategy. RECOMMENDED automatically picks the best proxies. PER_REQUEST uses a new proxy for each request. UNTIL_FAILURE uses one proxy until it fails.
Maximum time to wait for a page to load, in seconds.
When to consider navigation finished.
Browser to use for crawling.
Ignore Content Security Policy and Cross-Origin Resource Sharing restrictions. Enables free XHR/Fetch requests from pages.
Automatically dismiss cookie consent modals.
Maximum pixels to scroll down the page until all content is loaded. Setting to 0 disables scrolling.
Ignore SSL certificate errors. Use at your own risk.