
Contextractor — Trafilatura Powered Web Content Extractor

Extract clean, readable content from any website. Uses Trafilatura to strip away navigation, ads, and boilerplate—leaving just the text you need.

Why Trafilatura?

Trafilatura is a Python library designed for web content extraction, created by Adrien Barbaresi at the Berlin-Brandenburg Academy of Sciences. The library achieves the highest F1 score (0.958) among open-source content extraction tools in independent benchmarks, outperforming newspaper4k (0.949), Mozilla Readability (0.947), and goose3 (0.896). [1][2]

With over 4,900 GitHub stars and production deployments at HuggingFace, IBM, and Microsoft Research, Trafilatura has become the de facto standard for text extraction in data pipelines and LLM applications. [3]

Understanding the F1 Score

The F1 score is a standard metric for evaluating extraction quality, combining two complementary measures:

  • Precision: How much of the extracted content is actually relevant (avoiding noise like ads, navigation, footers)
  • Recall: How much of the relevant content was successfully extracted (avoiding missed paragraphs or sections)

The F1 score is the harmonic mean of precision and recall, ranging from 0 to 1. A score of 0.958 means Trafilatura correctly extracts 95.8% of the main content while excluding nearly all boilerplate — the best balance among tested tools. [2]

For comparison, a tool with high precision but low recall might extract clean content but miss important paragraphs. Conversely, high recall with low precision captures everything but includes unwanted elements like sidebars and advertisements.
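The relationship is easy to verify with the benchmark numbers themselves; a quick sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Trafilatura's benchmark precision/recall (see the table below) recover its F1
print(round(f1_score(0.938, 0.978), 3))  # → 0.958
```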

Benchmark Comparison

The following results are from the ScrapingHub Article Extraction Benchmark, which tests extraction quality across 181 diverse web pages: [1]

| Tool | F1 Score | Precision | Recall | Best For |
|---|---|---|---|---|
| Trafilatura | 0.958 | 0.938 | 0.978 | General web content, LLM pipelines |
| newspaper4k | 0.949 | 0.964 | 0.934 | News sites with rich metadata |
| @mozilla/readability | 0.947 | 0.914 | 0.982 | Browser-based extraction |
| readability-lxml | 0.922 | 0.913 | 0.931 | Simple HTML preservation |
| goose3 | 0.896 | 0.940 | 0.856 | High-precision requirements |
| jusText | 0.804 | 0.858 | 0.756 | Academic corpus building |

Trafilatura's 0.978 recall is particularly notable — it captures nearly all relevant content while maintaining excellent precision. This balance is achieved through a hybrid extraction approach that combines multiple algorithms. [2]

Key Advantages

LLM-optimized output formats

Trafilatura natively supports markdown output, which reduces token count by approximately 67% compared to raw HTML. [4] This makes it ideal for RAG pipelines, LLM fine-tuning datasets, and any application where token efficiency matters. The library supports seven output formats: plain text, Markdown, HTML, XML, XML-TEI (for academic research), JSON, and CSV.

Comprehensive metadata extraction

Beyond main content, Trafilatura automatically extracts structured metadata including title, author, publication date, language (via py3langid), site name, categories, tags, and content license. This metadata is invaluable for content organization, filtering, and downstream processing. [3]

Hybrid extraction with intelligent fallbacks

Trafilatura achieves its superior accuracy through a multi-stage approach: it first applies its own heuristic algorithms, then falls back to jusText and readability-lxml when needed. This redundancy ensures robust extraction across diverse page layouts and edge cases. [2]

Production-proven at scale

The library is trusted by major organizations including HuggingFace (for dataset curation), IBM, and Microsoft Research. Its efficient implementation handles large-scale crawling workloads without performance bottlenecks. [3]

Academic validation

Unlike many extraction tools, Trafilatura has peer-reviewed academic backing. It was published at ACL 2021 (Association for Computational Linguistics), providing transparency into its methodology and benchmarks. [5]

Limitations

  • Results vary on galleries, catalogs, and link-heavy pages where main content is ambiguous [3]

Web App

Try the interactive web app at contextractor.com to configure extraction settings, preview commands, and explore all options before running at scale.

Features

  • Multiple output formats - Markdown, plain text, JSON, XML, or XML-TEI (scholarly)
  • JavaScript rendering - Handles dynamic sites with Playwright (Chromium/Firefox)
  • Link crawling - Follow links across a site with glob/pseudo-URL filtering
  • Metadata extraction - Title, author, date, description, site name, and language
  • Configurable precision - Balance between extracting more content vs. less noise

Use cases

  • Build training datasets for LLMs
  • Research and academic text extraction
  • Feed content into RAG pipelines
  • Monitor content changes

Input

Crawl Settings

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls (required) | array | | URLs to extract content from |
| maxPagesPerCrawl | int | 0 | Max pages to crawl (0 = unlimited) |
| maxCrawlingDepth | int | 0 | Max link depth from start URLs |
| maxConcurrency | int | 50 | Max parallel browser pages |
| maxRequestRetries | int | 3 | Max retries for failed requests |
| maxResultsPerCrawl | int | 0 | Max results (0 = unlimited) |

Proxy Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| proxyConfiguration | object | | Proxy settings (use the Apify proxy editor) |
| proxyRotation | string | "RECOMMENDED" | RECOMMENDED, PER_REQUEST, UNTIL_FAILURE |

Browser Settings

| Parameter | Type | Default | Description |
|---|---|---|---|
| launcher | string | "CHROMIUM" | Browser engine: CHROMIUM, FIREFOX |
| headless | bool | true | Run browser in headless mode |
| waitUntil | string | "LOAD" | Page load event: LOAD, NETWORKIDLE, DOMCONTENTLOADED |
| pageLoadTimeoutSecs | int | 60 | Page load timeout in seconds |
| ignoreCorsAndCsp | bool | false | Disable CORS/CSP restrictions |
| closeCookieModals | bool | false | Auto-dismiss cookie consent banners |
| maxScrollHeightPixels | int | 5000 | Max scroll height in pixels (0 = disable) |
| ignoreSslErrors | bool | false | Skip SSL certificate verification |
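For example, an input fragment that renders with Firefox, dismisses cookie banners, and waits only for DOMContentLoaded (the values here are illustrative, not recommendations):

```json
{
  "launcher": "FIREFOX",
  "headless": true,
  "waitUntil": "DOMCONTENTLOADED",
  "pageLoadTimeoutSecs": 90,
  "closeCookieModals": true,
  "maxScrollHeightPixels": 0
}
```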

Crawl Filtering

| Parameter | Type | Default | Description |
|---|---|---|---|
| globs | array | [] | Glob patterns for URLs to include |
| excludes | array | [] | Glob patterns for URLs to exclude |
| pseudoUrls | array | [] | Pseudo-URLs to match (alternative to globs) |
| linkSelector | string | "" | CSS selector for links to follow |
| keepUrlFragments | bool | false | Treat URLs with different fragments as different pages |
| respectRobotsTxtFile | bool | false | Honor robots.txt |

Cookies & Headers

| Parameter | Type | Default | Description |
|---|---|---|---|
| initialCookies | array | [] | Initial cookies (JSON array of {name, value, domain, path}) |
| customHttpHeaders | object | {} | Custom HTTP headers ({"Authorization": "Bearer token"}) |

Output Settings

| Parameter | Type | Default | Description |
|---|---|---|---|
| saveExtractedMarkdownToKeyValueStore | bool | true | Save Markdown to key-value store |
| saveRawHtmlToKeyValueStore | bool | false | Save raw HTML |
| saveExtractedTextToKeyValueStore | bool | false | Save plain text |
| saveExtractedJsonToKeyValueStore | bool | false | Save JSON |
| saveExtractedXmlToKeyValueStore | bool | false | Save XML |
| saveExtractedXmlTeiToKeyValueStore | bool | false | Save XML-TEI |
| datasetName | string | | Custom dataset name |
| keyValueStoreName | string | | Custom key-value store name |
| requestQueueName | string | | Custom request queue name |

Content Extraction

| Parameter | Type | Default | Description |
|---|---|---|---|
| trafilaturaConfig | object | {} | Extraction options (see below) |

Trafilatura config keys: favorPrecision, favorRecall, includeComments, includeTables, includeImages, includeFormatting, includeLinks, deduplicate, withMetadata, targetLanguage, fast.
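For instance, a trafilaturaConfig that favors precision, keeps tables, and drops comments might look like this (illustrative values):

```json
{
  "trafilaturaConfig": {
    "favorPrecision": true,
    "includeComments": false,
    "includeTables": true,
    "withMetadata": true,
    "targetLanguage": "en"
  }
}
```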

Output

Each crawled page produces a dataset item:

```json
{
  "loadedUrl": "https://example.com/article",
  "httpStatus": 200,
  "loadedAt": "2025-01-31T12:00:00.000Z",
  "metadata": {
    "title": "Article Title",
    "author": "John Doe",
    "publishedAt": "2025-01-15",
    "description": "Article description",
    "siteName": "Example Blog",
    "lang": "en"
  },
  "rawHtml": {
    "hash": "f8e6bd335e04d03e1be6798c2c72349c",
    "length": 45000
  },
  "extractedMarkdown": {
    "key": "a1b2c3d4e5f67890.md",
    "url": "https://api.apify.com/v2/key-value-stores/.../records/a1b2c3d4e5f67890.md",
    "hash": "43f204bfbee5dbe6862cb38620f257b5",
    "length": 5000
  }
}
```

Extracted content is saved to the key-value store. The extractedMarkdown (and similar fields for other formats) contains a url you can use to download the content directly.
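As an illustrative sketch using only the standard library (the item below is abbreviated, and "STORE_ID" is a made-up placeholder), reading that url field and downloading the record:

```python
import urllib.request

def record_url(item: dict, field: str = "extractedMarkdown") -> str:
    """Return the key-value store download URL stored on a dataset item."""
    return item[field]["url"]

def download_record(item: dict, field: str = "extractedMarkdown") -> str:
    """Fetch the saved content for one dataset item (requires network access)."""
    with urllib.request.urlopen(record_url(item, field)) as resp:
        return resp.read().decode("utf-8")

# Abbreviated dataset item; "STORE_ID" stands in for a real store ID
item = {
    "extractedMarkdown": {
        "key": "a1b2c3d4e5f67890.md",
        "url": "https://api.apify.com/v2/key-value-stores/STORE_ID/records/a1b2c3d4e5f67890.md",
    }
}
print(record_url(item))
```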

Example

Extract all blog posts from a site:

```json
{
  "startUrls": [{ "url": "https://example.com/blog" }],
  "globs": [{ "glob": "https://example.com/blog/**" }],
  "linkSelector": "a",
  "maxPagesPerCrawl": 100,
  "trafilaturaConfig": {},
  "saveExtractedMarkdownToKeyValueStore": true
}
```

References

1. ScrapingHub. "Article Extraction Benchmark." GitHub.
2. Barbaresi, Adrien. "Evaluation." Trafilatura Documentation v2.0.0.
3. Barbaresi, Adrien. "Trafilatura: A Python package & command-line tool to gather text on the Web." GitHub.
4. "An introduction to preparing your own dataset for LLM training." AWS Machine Learning Blog, Amazon Web Services.
5. Barbaresi, Adrien (2021). "Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction." ACL Anthology.
