
Contextractor

Extract clean, readable content from any website using Trafilatura.

Available as: PyPI | npm | Docker | Apify actor

Try the Playground to configure extraction settings and preview commands before running.

Install

$ pip install contextractor

or

$ npm install -g contextractor

Requires Python 3.12+ (pip) or Node.js 18+ (npm). Playwright Chromium is installed automatically.

Usage

$ contextractor https://example.com

Works with zero config. Pass URLs directly, or use a config file for complex setups:

contextractor https://example.com --precision --save json -o ./results
contextractor --config config.json --max-pages 10

CLI Options

contextractor [OPTIONS] [URLS...]

Crawl Settings:
  --config, -c                Path to JSON config file
  --output-dir, -o            Output directory
  --max-pages                 Max pages to crawl (0 = unlimited)
  --crawl-depth               Max link depth from start URLs (0 = start only)
  --headless/--no-headless    Browser headless mode (default: headless)
  --max-concurrency           Max parallel requests (default: 50)
  --max-retries               Max request retries (default: 3)
  --max-results               Max results per crawl (0 = unlimited)

Proxy:
  --proxy-urls                Comma-separated proxy URLs (http://user:pass@host:port)
  --proxy-rotation            Rotation: recommended, perRequest, untilFailure

Browser:
  --launcher                  Browser engine: chromium, firefox (default: chromium)
  --wait-until                Page load event: load, networkidle, domcontentloaded (default: load)
  --page-load-timeout         Timeout in seconds (default: 60)
  --ignore-cors               Disable CORS/CSP restrictions
  --close-cookie-modals       Auto-dismiss cookie banners
  --max-scroll-height         Max scroll height in pixels (default: 5000)
  --ignore-ssl-errors         Skip SSL certificate verification
  --user-agent                Custom User-Agent string

Crawl Filtering:
  --globs                     Comma-separated glob patterns to include
  --excludes                  Comma-separated glob patterns to exclude
  --link-selector             CSS selector for links to follow
  --keep-url-fragments        Preserve URL fragments
  --respect-robots-txt        Honor robots.txt

Cookies & Headers:
  --cookies                   JSON array of cookie objects
  --headers                   JSON object of custom HTTP headers

Output Format:
  --save                      Output formats, comma-separated (default: markdown)
                              Valid: markdown, html, text, json, jsonl, xml, xml-tei, all

Content Extraction:
  --precision                 High precision mode (less noise)
  --recall                    High recall mode (more content)
  --fast                      Fast extraction mode (less thorough)
  --no-links                  Exclude links from output
  --no-comments               Exclude comments from output
  --include-tables/--no-tables          Include tables (default: include)
  --include-images            Include image descriptions
  --include-formatting/--no-formatting  Preserve formatting (default: preserve)
  --deduplicate               Deduplicate extracted content
  --target-language           Filter by language (e.g. "en")
  --with-metadata/--no-metadata         Extract metadata (default: with)
  --prune-xpath               XPath patterns to remove from content

Diagnostics:
  --verbose, -v               Enable verbose logging

CLI flags override config file settings. Merge order: defaults → config file → CLI args
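
As a quick illustration of that merge order, suppose a config.json sets "maxPages": 100 (a hypothetical value); the CLI flag still wins:

$ contextractor --config config.json --max-pages 5
# effective limit is 5 pages: defaults → config file (100) → CLI flag (5)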

Config File (optional)

Use a JSON config file to set options:

{
  "urls": ["https://example.com", "https://docs.example.com"],
  "save": ["markdown"],
  "outputDir": "./output",
  "crawlDepth": 1,
  "proxy": {
    "urls": ["http://user:pass@host:port"],
    "rotation": "recommended"
  },
  "trafilaturaConfig": {
    "favorPrecision": true,
    "includeLinks": true,
    "includeTables": true,
    "deduplicate": true
  }
}

Crawl Settings

Field            Type     Default      Description
urls             array    []           URLs to extract content from
maxPages         int      0            Max pages to crawl (0 = unlimited)
outputDir        string   "./output"   Directory for extracted content
crawlDepth       int      0            How deep to follow links (0 = start URLs only)
headless         bool     true         Browser headless mode
maxConcurrency   int      50           Max parallel browser pages
maxRetries       int      3            Max retries for failed requests
maxResults       int      0            Max results per crawl (0 = unlimited)

Proxy Configuration

Field            Type     Default         Description
proxy.urls       array    []              Proxy URLs (http://user:pass@host:port or socks5://host:port)
proxy.rotation   string   "recommended"   recommended, perRequest, untilFailure
proxy.tiered     array    []              Tiered proxy escalation (config-file only)
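
For example, rotating between two proxies on every request (hypothetical hosts and credentials):

$ contextractor https://example.com --proxy-urls "http://user:pass@proxy1:8000,http://user:pass@proxy2:8000" --proxy-rotation perRequest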

Browser Settings

Field               Type     Default      Description
launcher            string   "chromium"   Browser engine: chromium, firefox
waitUntil           string   "load"       Page load event: load, networkidle, domcontentloaded
pageLoadTimeout     int      60           Page load timeout in seconds
ignoreCors          bool     false        Disable CORS/CSP restrictions
closeCookieModals   bool     true         Auto-dismiss cookie consent banners
maxScrollHeight     int      5000         Max scroll height in pixels (0 = disable)
ignoreSslErrors     bool     false        Skip SSL certificate verification
userAgent           string   ""           Custom User-Agent string

Crawl Filtering

Field              Type     Default   Description
globs              array    []        Glob patterns for URLs to include
excludes           array    []        Glob patterns for URLs to exclude
linkSelector       string   ""        CSS selector for links to follow
keepUrlFragments   bool     false     Treat URLs with different fragments as different pages
respectRobotsTxt   bool     false     Honor robots.txt
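
For example, crawling a docs site two levels deep while skipping its changelog (the glob patterns here are illustrative):

$ contextractor https://docs.example.com --crawl-depth 2 --globs "https://docs.example.com/**" --excludes "**/changelog/**"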

Cookies & Headers

Field     Type     Default   Description
cookies   array    []        Initial cookies ([{"name": "...", "value": "...", "domain": "..."}])
headers   object   {}        Custom HTTP headers ({"Authorization": "Bearer token"})
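
On the CLI, both take raw JSON; single-quote the values so the shell passes them through intact (token and cookie values below are placeholders):

$ contextractor https://app.example.com --headers '{"Authorization": "Bearer <token>"}' --cookies '[{"name": "session", "value": "abc123", "domain": "app.example.com"}]'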

Output Format

Field   Type    Default        Description
save    array   ["markdown"]   Output formats: markdown, html, text, json, jsonl, xml, xml-tei, all

Content Extraction

All options go under the trafilaturaConfig key in config files, or use the equivalent CLI flags:

Field               Type     Default   Description
favorPrecision      bool     false     High precision, less noise
favorRecall         bool     false     High recall, more content
includeComments     bool     true      Include comments
includeTables       bool     true      Include tables
includeImages       bool     false     Include images
includeFormatting   bool     true      Preserve formatting
includeLinks        bool     true      Include links
deduplicate         bool     false     Deduplicate content
withMetadata        bool     true      Extract metadata (title, author, date)
targetLanguage      string   null      Filter by language (e.g. "en")
fast                bool     false     Fast mode (less thorough)
pruneXpath          array    null      XPath patterns to remove from content
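
These map directly onto the flags in CLI Options; for example, the config combination favorPrecision + targetLanguage with comments excluded, as a one-off command:

$ contextractor https://example.com --precision --no-comments --target-language en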

Node.js API

Use contextractor as a library in your Node.js code:

const { extract } = require("contextractor");

// Extract a single URL
await extract("https://example.com", {
  save: "markdown",
  outputDir: "./output",
});

// Multiple URLs with extraction options
await extract(["https://a.com", "https://b.com"], {
  precision: true,
  noLinks: true,
  includeTables: true,
  save: ["markdown", "json"],
  outputDir: "./results",
});

// Using a config file
await extract("https://example.com", { config: "./config.json" });

ESM import:

import { extract } from "contextractor";

extract(urls, options) returns Promise<void> — output goes to outputDir or stdout. Options use the same camelCase names as listed in CLI Options and Config File.
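
A sketch combining crawl and extraction options under that camelCase convention (the option values are illustrative, not defaults):

import { extract } from "contextractor";

// Crawl two levels deep, cap the page count, write Markdown and JSON
await extract("https://docs.example.com", {
  crawlDepth: 2,
  maxPages: 20,
  save: ["markdown", "json"],
  outputDir: "./docs-dump",
});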

Python API

Install the extraction engine:

$ pip install contextractor-engine

Use ContentExtractor to extract content from HTML:

from contextractor_engine import ContentExtractor, TrafilaturaConfig

# Basic extraction
extractor = ContentExtractor()
result = extractor.extract(html, url="https://example.com", output_format="markdown")
print(result.content)

# High precision with custom config
config = TrafilaturaConfig(favor_precision=True, include_tables=True, deduplicate=True)
extractor = ContentExtractor(config=config)
result = extractor.extract(html, output_format="json")

Extract metadata:

meta = extractor.extract_metadata(html, url="https://example.com")
print(meta.title, meta.author, meta.date)

Available output formats: txt, markdown, json, xml, xmltei
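
A minimal end-to-end sketch: ContentExtractor operates on HTML you already have, so fetching with the standard library below is our own choice, not part of the engine's API:

import urllib.request

from contextractor_engine import ContentExtractor

url = "https://example.com"
with urllib.request.urlopen(url) as resp:
    html = resp.read().decode("utf-8", errors="replace")

extractor = ContentExtractor()
result = extractor.extract(html, url=url, output_format="markdown")
print(result.content)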

See the packages/contextractor_engine/README.md for full API reference.

Docker

$ docker run ghcr.io/contextractor/contextractor https://example.com

Save output to your local machine:

$ docker run -v ./output:/output ghcr.io/contextractor/contextractor https://example.com -o /output

Use a config file:

$ docker run -v ./config.json:/config.json ghcr.io/contextractor/contextractor --config /config.json

All CLI flags work the same inside Docker.

Docker from Code

Call Docker extraction programmatically:

Node.js:

const { execSync } = require("child_process");

const result = execSync(
  "docker run ghcr.io/contextractor/contextractor https://example.com",
  { encoding: "utf-8" }
);
console.log(result);

Python:

import subprocess

result = subprocess.run(
    ["docker", "run", "ghcr.io/contextractor/contextractor", "https://example.com"],
    capture_output=True, text=True
)
print(result.stdout)

Volume mount for output:

$ docker run -v $(pwd)/output:/output ghcr.io/contextractor/contextractor https://example.com -o /output

Output

One file per crawled page, named from the URL slug (e.g. example-com-page.md). Metadata (title, author, date) is included in the output header when available.
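
For example, a single-page run might produce (assuming one file per requested format, which the docs imply but do not state outright):

$ contextractor https://example.com/page --save markdown,json -o ./results
$ ls ./results
example-com-page.json  example-com-page.md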

Platforms

  • npm: macOS arm64, Linux (x64, arm64), Windows x64
  • Docker: linux/amd64, linux/arm64

License

Apache-2.0
