Contextractor
Extract clean, readable content from any website using Trafilatura.
Available as: PyPI | npm | Docker | Apify actor
Try the Playground to configure extraction settings and preview commands before running.
Install
$ pip install contextractor
or
$ npm install -g contextractor
Requires Python 3.12+ (pip) or Node.js 18+ (npm). Playwright Chromium is installed automatically.
Usage
$ contextractor https://example.com
Works with zero config. Pass URLs directly, or use a config file for complex setups:
contextractor https://example.com --precision --save json -o ./results
contextractor --config config.json --max-pages 10
CLI Options
contextractor [OPTIONS] [URLS...]

Crawl Settings:
  --config, -c               Path to JSON config file
  --output-dir, -o           Output directory
  --max-pages                Max pages to crawl (0 = unlimited)
  --crawl-depth              Max link depth from start URLs (0 = start only)
  --headless/--no-headless   Browser headless mode (default: headless)
  --max-concurrency          Max parallel requests (default: 50)
  --max-retries              Max request retries (default: 3)
  --max-results              Max results per crawl (0 = unlimited)

Proxy:
  --proxy-urls               Comma-separated proxy URLs (http://user:pass@host:port)
  --proxy-rotation           Rotation: recommended, perRequest, untilFailure

Browser:
  --launcher                 Browser engine: chromium, firefox (default: chromium)
  --wait-until               Page load event: load, networkidle, domcontentloaded (default: load)
  --page-load-timeout        Timeout in seconds (default: 60)
  --ignore-cors              Disable CORS/CSP restrictions
  --close-cookie-modals      Auto-dismiss cookie banners
  --max-scroll-height        Max scroll height in pixels (default: 5000)
  --ignore-ssl-errors        Skip SSL certificate verification
  --user-agent               Custom User-Agent string

Crawl Filtering:
  --globs                    Comma-separated glob patterns to include
  --excludes                 Comma-separated glob patterns to exclude
  --link-selector            CSS selector for links to follow
  --keep-url-fragments       Preserve URL fragments
  --respect-robots-txt       Honor robots.txt

Cookies & Headers:
  --cookies                  JSON array of cookie objects
  --headers                  JSON object of custom HTTP headers

Output Format:
  --save                     Output formats, comma-separated (default: markdown)
                             Valid: markdown, html, text, json, jsonl, xml, xml-tei, all

Content Extraction:
  --precision                High precision mode (less noise)
  --recall                   High recall mode (more content)
  --fast                     Fast extraction mode (less thorough)
  --no-links                 Exclude links from output
  --no-comments              Exclude comments from output
  --include-tables/--no-tables          Include tables (default: include)
  --include-images           Include image descriptions
  --include-formatting/--no-formatting  Preserve formatting (default: preserve)
  --deduplicate              Deduplicate extracted content
  --target-language          Filter by language (e.g. "en")
  --with-metadata/--no-metadata         Extract metadata (default: with)
  --prune-xpath              XPath patterns to remove from content

Diagnostics:
  --verbose, -v              Enable verbose logging
CLI flags override config file settings. Merge order: defaults → config file → CLI args
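For example, if config.json sets "maxPages": 10, the run below crawls at most 3 pages, because the CLI flag takes precedence:

$ contextractor --config config.json --max-pages 3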
Config File (optional)
Use a JSON config file to set options:
{"urls": ["https://example.com", "https://docs.example.com"],"save": ["markdown"],"outputDir": "./output","crawlDepth": 1,"proxy": {"urls": ["http://user:pass@host:port"],"rotation": "recommended"},"trafilaturaConfig": {"favorPrecision": true,"includeLinks": true,"includeTables": true,"deduplicate": true}}
Crawl Settings
| Field | Type | Default | Description |
|---|---|---|---|
| urls | array | [] | URLs to extract content from |
| maxPages | int | 0 | Max pages to crawl (0 = unlimited) |
| outputDir | string | "./output" | Directory for extracted content |
| crawlDepth | int | 0 | How deep to follow links (0 = start URLs only) |
| headless | bool | true | Browser headless mode |
| maxConcurrency | int | 50 | Max parallel browser pages |
| maxRetries | int | 3 | Max retries for failed requests |
| maxResults | int | 0 | Max results per crawl (0 = unlimited) |
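Putting the crawl fields together, a minimal config that follows links one hop deep and stops after 20 pages (illustrative values):

{
  "urls": ["https://example.com"],
  "crawlDepth": 1,
  "maxPages": 20,
  "maxConcurrency": 10
}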
Proxy Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| proxy.urls | array | [] | Proxy URLs (http://user:pass@host:port or socks5://host:port) |
| proxy.rotation | string | "recommended" | recommended, perRequest, untilFailure |
| proxy.tiered | array | [] | Tiered proxy escalation (config-file only) |
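This README does not document the shape of proxy.tiered entries; the sketch below assumes each tier is a list of proxy URLs escalated in order, with placeholder hosts. Treat it as an illustration, not a confirmed schema:

{
  "proxy": {
    "rotation": "recommended",
    "tiered": [
      ["http://datacenter-proxy:8000"],
      ["http://user:pass@residential-proxy:8000"]
    ]
  }
}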
Browser Settings
| Field | Type | Default | Description |
|---|---|---|---|
| launcher | string | "chromium" | Browser engine: chromium, firefox |
| waitUntil | string | "load" | Page load event: load, networkidle, domcontentloaded |
| pageLoadTimeout | int | 60 | Page load timeout in seconds |
| ignoreCors | bool | false | Disable CORS/CSP restrictions |
| closeCookieModals | bool | true | Auto-dismiss cookie consent banners |
| maxScrollHeight | int | 5000 | Max scroll height in pixels (0 = disable) |
| ignoreSslErrors | bool | false | Skip SSL certificate verification |
| userAgent | string | "" | Custom User-Agent string |
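For pages that render content after the initial load, a config fragment using the fields above might wait for network idle and allow more time (illustrative values):

{
  "launcher": "chromium",
  "waitUntil": "networkidle",
  "pageLoadTimeout": 120,
  "closeCookieModals": true
}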
Crawl Filtering
| Field | Type | Default | Description |
|---|---|---|---|
| globs | array | [] | Glob patterns for URLs to include |
| excludes | array | [] | Glob patterns for URLs to exclude |
| linkSelector | string | "" | CSS selector for links to follow |
| keepUrlFragments | bool | false | Treat URLs with different fragments as different pages |
| respectRobotsTxt | bool | false | Honor robots.txt |
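For example, to stay inside a docs site while skipping its changelog pages (patterns are illustrative):

{
  "urls": ["https://docs.example.com"],
  "crawlDepth": 2,
  "globs": ["https://docs.example.com/**"],
  "excludes": ["**/changelog/**"],
  "respectRobotsTxt": true
}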
Cookies & Headers
| Field | Type | Default | Description |
|---|---|---|---|
| cookies | array | [] | Initial cookies ([{"name": "...", "value": "...", "domain": "..."}]) |
| headers | object | {} | Custom HTTP headers ({"Authorization": "Bearer token"}) |
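Combining both fields for an authenticated crawl (all values are placeholders):

{
  "urls": ["https://example.com/account"],
  "cookies": [{ "name": "session", "value": "abc123", "domain": "example.com" }],
  "headers": { "Authorization": "Bearer token" }
}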
Output Format
| Field | Type | Default | Description |
|---|---|---|---|
| save | array | ["markdown"] | Output formats: markdown, html, text, json, jsonl, xml, xml-tei, all |
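On the CLI, pass the same list comma-separated; the all value selects every format:

$ contextractor https://example.com --save markdown,json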
Content Extraction
All options go under the trafilaturaConfig key in config files, or use the equivalent CLI flags:
| Field | Type | Default | Description |
|---|---|---|---|
| favorPrecision | bool | false | High precision, less noise |
| favorRecall | bool | false | High recall, more content |
| includeComments | bool | true | Include comments |
| includeTables | bool | true | Include tables |
| includeImages | bool | false | Include images |
| includeFormatting | bool | true | Preserve formatting |
| includeLinks | bool | true | Include links |
| deduplicate | bool | false | Deduplicate content |
| withMetadata | bool | true | Extract metadata (title, author, date) |
| targetLanguage | string | null | Filter by language (e.g. "en") |
| fast | bool | false | Fast mode (less thorough) |
| pruneXpath | array | null | XPath patterns to remove from content |
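Each trafilaturaConfig field maps to a CLI flag from the options list above, so the config fragment and the command below should be equivalent (illustrative values):

{
  "urls": ["https://example.com"],
  "trafilaturaConfig": {
    "favorPrecision": true,
    "includeLinks": false,
    "deduplicate": true,
    "targetLanguage": "en"
  }
}

$ contextractor https://example.com --precision --no-links --deduplicate --target-language en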
Node.js API
Use contextractor as a library in your Node.js code:
const { extract } = require("contextractor");

// Extract a single URL
await extract("https://example.com", {
  save: "markdown",
  outputDir: "./output",
});

// Multiple URLs with extraction options
await extract(["https://a.com", "https://b.com"], {
  precision: true,
  noLinks: true,
  includeTables: true,
  save: ["markdown", "json"],
  outputDir: "./results",
});

// Using a config file
await extract("https://example.com", { config: "./config.json" });
ESM import:
import { extract } from "contextractor";
extract(urls, options) returns Promise<void>; output goes to outputDir or stdout. Options use the same camelCase field names as the config file (the camelCase equivalents of the CLI flags listed above).
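Because extract returns a promise, failures surface as rejections; a minimal error-handling wrapper (a sketch; this README does not document the error shape):

const { extract } = require("contextractor");

async function main() {
  try {
    // Same camelCase options as in the examples above
    await extract("https://example.com", { save: "markdown", outputDir: "./output" });
  } catch (err) {
    // Log the failure and exit non-zero so shell pipelines can detect it
    console.error("Extraction failed:", err);
    process.exitCode = 1;
  }
}

main();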
Python API
Install the extraction engine:
$ pip install contextractor-engine
Use ContentExtractor to extract content from HTML:
from contextractor_engine import ContentExtractor, TrafilaturaConfig

# Basic extraction
extractor = ContentExtractor()
result = extractor.extract(html, url="https://example.com", output_format="markdown")
print(result.content)

# High precision with custom config
config = TrafilaturaConfig(favor_precision=True, include_tables=True, deduplicate=True)
extractor = ContentExtractor(config=config)
result = extractor.extract(html, output_format="json")
Extract metadata:
meta = extractor.extract_metadata(html, url="https://example.com")print(meta.title, meta.author, meta.date)
Available output formats: txt, markdown, json, xml, xmltei
See packages/contextractor_engine/README.md for the full API reference.
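ContentExtractor works on HTML you already have; it does not fetch pages itself. A self-contained end-to-end sketch, using Python's standard-library urllib for the fetch step (the fetch is our addition, not part of the engine API):

from urllib.request import urlopen

from contextractor_engine import ContentExtractor

url = "https://example.com"

# Fetch the raw HTML ourselves; ContentExtractor takes an HTML string.
with urlopen(url) as resp:
    html = resp.read().decode("utf-8", errors="replace")

extractor = ContentExtractor()
result = extractor.extract(html, url=url, output_format="markdown")
print(result.content)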
Docker
$ docker run ghcr.io/contextractor/contextractor https://example.com
Save output to your local machine:
$ docker run -v ./output:/output ghcr.io/contextractor/contextractor https://example.com -o /output
Use a config file:
$ docker run -v ./config.json:/config.json ghcr.io/contextractor/contextractor --config /config.json
All CLI flags work the same inside Docker.
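For example, combining extraction flags with a mounted output directory:

$ docker run -v ./output:/output ghcr.io/contextractor/contextractor https://example.com --precision --save markdown,json -o /output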
Docker from Code
Call Docker extraction programmatically:
Node.js:
const { execSync } = require("child_process");

const result = execSync(
  "docker run ghcr.io/contextractor/contextractor https://example.com",
  { encoding: "utf-8" }
);
console.log(result);
Python:
import subprocess

result = subprocess.run(
    ["docker", "run", "ghcr.io/contextractor/contextractor", "https://example.com"],
    capture_output=True, text=True,
)
print(result.stdout)
Volume mount for output:
$ docker run -v $(pwd)/output:/output ghcr.io/contextractor/contextractor https://example.com -o /output
Output
One file per crawled page, named from the URL slug (e.g. example-com-page.md). Metadata (title, author, date) is included in the output header when available.
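An illustrative layout after a small crawl (actual file names depend on the crawled URLs and the formats in save):

output/
├── example-com.md
├── example-com-about.md
└── example-com-blog-post.md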
Platforms
- npm: macOS arm64, Linux (x64, arm64), Windows x64
- Docker: linux/amd64, linux/arm64
License
Apache-2.0


