contextractor - Trafilatura based
Under maintenance
Pricing: Pay per usage
Extract clean, readable content from any website. Uses Trafilatura, a top-rated extraction library, to strip away navigation, ads, and boilerplate, leaving just the text you need.
Developer: Glueo
Maintained by Community
Contextractor
Extract clean, readable content from any website using Trafilatura.
Available as: PyPI | npm | Docker | Apify actor
Try the Playground to configure extraction settings and preview commands before running.
Install
$ pip install contextractor
or
$ npm install -g contextractor
Requires Python 3.12+ (pip) or Node.js 18+ (npm). Playwright Chromium is installed automatically.
Usage
$ contextractor https://example.com
Works with zero config. Pass URLs directly, or use a config file for complex setups:
$ contextractor https://example.com --precision --save json -o ./results
$ contextractor --config config.json --max-pages 10
CLI Options
    contextractor [OPTIONS] [URLS...]

    Crawl Settings:
      --config, -c                          Path to JSON config file
      --output-dir, -o                      Output directory
      --max-pages                           Max pages to crawl (0 = unlimited)
      --crawl-depth                         Max link depth from start URLs (0 = start only)
      --headless/--no-headless              Browser headless mode (default: headless)
      --max-concurrency                     Max parallel requests (default: 50)
      --max-retries                         Max request retries (default: 3)
      --max-results                         Max results per crawl (0 = unlimited)

    Proxy:
      --proxy-urls                          Comma-separated proxy URLs (http://user:pass@host:port)
      --proxy-rotation                      Rotation: recommended, perRequest, untilFailure

    Browser:
      --launcher                            Browser engine: chromium, firefox (default: chromium)
      --wait-until                          Page load event: load, networkidle, domcontentloaded (default: load)
      --page-load-timeout                   Timeout in seconds (default: 60)
      --ignore-cors                         Disable CORS/CSP restrictions
      --close-cookie-modals                 Auto-dismiss cookie banners
      --max-scroll-height                   Max scroll height in pixels (default: 5000)
      --ignore-ssl-errors                   Skip SSL certificate verification
      --user-agent                          Custom User-Agent string

    Crawl Filtering:
      --globs                               Comma-separated glob patterns to include
      --excludes                            Comma-separated glob patterns to exclude
      --link-selector                       CSS selector for links to follow
      --keep-url-fragments                  Preserve URL fragments
      --respect-robots-txt                  Honor robots.txt

    Cookies & Headers:
      --cookies                             JSON array of cookie objects
      --headers                             JSON object of custom HTTP headers

    Output Format:
      --save                                Output formats, comma-separated (default: markdown)
                                            Valid: markdown, html, text, json, jsonl, xml, xml-tei, all

    Content Extraction:
      --precision                           High precision mode (less noise)
      --recall                              High recall mode (more content)
      --fast                                Fast extraction mode (less thorough)
      --no-links                            Exclude links from output
      --no-comments                         Exclude comments from output
      --include-tables/--no-tables          Include tables (default: include)
      --include-images                      Include image descriptions
      --include-formatting/--no-formatting  Preserve formatting (default: preserve)
      --deduplicate                         Deduplicate extracted content
      --target-language                     Filter by language (e.g. "en")
      --with-metadata/--no-metadata         Extract metadata (default: with)
      --prune-xpath                         XPath patterns to remove from content

    Diagnostics:
      --verbose, -v                         Enable verbose logging
CLI flags override config file settings. Merge order: defaults → config file → CLI args
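The merge order can be pictured as three layers of settings where later layers win. The following is a minimal sketch of that idea in plain Python, not contextractor's actual implementation: the function name `merge_settings` is hypothetical, the default values are copied from the tables in this document, and the merge is shallow. Whether nested keys such as `proxy` or `trafilaturaConfig` are merged deeply is not stated here, so that part is left out.

```python
# Built-in defaults (values copied from the option tables in this document).
DEFAULTS = {"maxPages": 0, "crawlDepth": 0, "maxRetries": 3, "save": ["markdown"]}


def merge_settings(defaults, config_file, cli_args):
    """Hypothetical helper: later sources win (defaults, config file, CLI args)."""
    merged = dict(defaults)
    merged.update(config_file)
    # Only flags the user actually passed (non-None) override earlier layers.
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


config_file = {"crawlDepth": 1, "maxPages": 50}
cli_args = {"maxPages": 10, "crawlDepth": None}  # only --max-pages 10 was passed

settings = merge_settings(DEFAULTS, config_file, cli_args)
print(settings)
```

Here `maxPages` ends up as 10 (CLI beats config file), `crawlDepth` as 1 (config file beats default), and `maxRetries` stays at its default of 3.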
Config File (optional)
Use a JSON config file to set options:
    {
      "urls": ["https://example.com", "https://docs.example.com"],
      "save": ["markdown"],
      "outputDir": "./output",
      "crawlDepth": 1,
      "proxy": {
        "urls": ["http://user:pass@host:port"],
        "rotation": "recommended"
      },
      "trafilaturaConfig": {
        "favorPrecision": true,
        "includeLinks": true,
        "includeTables": true,
        "deduplicate": true
      }
    }
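When configs are generated by a script rather than written by hand, one option is to build the JSON in Python and assemble the CLI call from it. This is a sketch under assumptions: `write_config` and `build_command` are hypothetical helpers, and only the config keys and the `--config` / `--max-pages` flags come from this document.

```python
import json
import tempfile
from pathlib import Path


def write_config(path, urls, crawl_depth=1, favor_precision=True):
    """Hypothetical helper: write a config using the keys documented above."""
    config = {
        "urls": list(urls),
        "save": ["markdown"],
        "outputDir": "./output",
        "crawlDepth": crawl_depth,
        "trafilaturaConfig": {"favorPrecision": favor_precision},
    }
    path = Path(path)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


def build_command(config_path, max_pages=None):
    """Hypothetical helper: build the argv list for the contextractor CLI."""
    argv = ["contextractor", "--config", str(config_path)]
    if max_pages is not None:
        argv += ["--max-pages", str(max_pages)]
    return argv


config_path = write_config(
    Path(tempfile.gettempdir()) / "contextractor-config.json",
    ["https://example.com"],
)
command = build_command(config_path, max_pages=10)
print(command)
```

The resulting list can be handed to `subprocess.run(command)` once the CLI is installed; passing a list rather than a single string avoids shell quoting issues.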
Crawl Settings
| Field | Type | Default | Description |
|---|---|---|---|
| urls | array | [] | URLs to extract content from |
| maxPages | int | 0 | Max pages to crawl (0 = unlimited) |
| outputDir | string | "./output" | Directory for extracted content |
| crawlDepth | int | 0 | How deep to follow links (0 = start URLs only) |
| headless | bool | true | Browser headless mode |
| maxConcurrency | int | 50 | Max parallel browser pages |
| maxRetries | int | 3 | Max retries for failed requests |
| maxResults | int | 0 | Max results per crawl (0 = unlimited) |
Proxy Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| proxy.urls | array | [] | Proxy URLs (http://user:pass@host:port or socks5://host:port) |
| proxy.rotation | string | "recommended" | recommended, perRequest, untilFailure |
| proxy.tiered | array | [] | Tiered proxy escalation (config-file only) |
Browser Settings
| Field | Type | Default | Description |
|---|---|---|---|
| launcher | string | "chromium" | Browser engine: chromium, firefox |
| waitUntil | string | "load" | Page load event: load, networkidle, domcontentloaded |
| pageLoadTimeout | int | 60 | Page load timeout in seconds |
| ignoreCors | bool | false | Disable CORS/CSP restrictions |
| closeCookieModals | bool | true | Auto-dismiss cookie consent banners |
| maxScrollHeight | int | 5000 | Max scroll height in pixels (0 = disable) |
| ignoreSslErrors | bool | false | Skip SSL certificate verification |
| userAgent | string | "" | Custom User-Agent string |
Crawl Filtering
| Field | Type | Default | Description |
|---|---|---|---|
| globs | array | [] | Glob patterns for URLs to include |
| excludes | array | [] | Glob patterns for URLs to exclude |
| linkSelector | string | "" | CSS selector for links to follow |
| keepUrlFragments | bool | false | Treat URLs with different fragments as different pages |
| respectRobotsTxt | bool | false | Honor robots.txt |
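To reason about how `globs` and `excludes` might interact, here is a rough sketch using Python's standard `fnmatch` module. It rests on two assumptions that this document does not confirm: an empty `globs` list means "follow everything", and an exclude match always wins over an include match. The real crawler may also use a different glob dialect, so treat `should_follow` as a hypothetical illustration rather than the tool's matching logic.

```python
from fnmatch import fnmatchcase


def should_follow(url, globs=(), excludes=()):
    """Hypothetical filter: excludes win; an empty include list allows all URLs."""
    if any(fnmatchcase(url, pattern) for pattern in excludes):
        return False
    return not globs or any(fnmatchcase(url, pattern) for pattern in globs)


globs = ["https://example.com/docs/*"]
excludes = ["*.pdf"]

for url in (
    "https://example.com/docs/intro",
    "https://example.com/docs/manual.pdf",
    "https://example.com/blog/post",
):
    print(url, should_follow(url, globs, excludes))
```

Note that `fnmatch` lets `*` match across `/`, which is looser than many URL glob implementations.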
Cookies & Headers
| Field | Type | Default | Description |
|---|---|---|---|
| cookies | array | [] | Initial cookies ([{"name": "...", "value": "...", "domain": "..."}]) |
| headers | object | {} | Custom HTTP headers ({"Authorization": "Bearer token"}) |
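On the command line, `--cookies` and `--headers` take JSON, which is easy to get wrong when quoting by hand. A small sketch of serializing both values with `json.dumps` follows; the helper name `cookie_and_header_args` and the cookie values are made up for illustration, while the flag names and value shapes come from this document.

```python
import json


def cookie_and_header_args(cookies, headers):
    """Hypothetical helper: serialize cookies and headers for the CLI flags."""
    return ["--cookies", json.dumps(cookies), "--headers", json.dumps(headers)]


args = cookie_and_header_args(
    [{"name": "session", "value": "abc123", "domain": "example.com"}],
    {"Authorization": "Bearer token"},
)
command = ["contextractor", "https://example.com", *args]
print(command)
```

Building the command as a list and passing it to `subprocess.run` keeps the JSON intact without any shell escaping.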
Output Format
| Field | Type | Default | Description |
|---|---|---|---|
| save | array | ["markdown"] | Output formats: markdown, html, text, json, jsonl, xml, xml-tei, all |
Content Extraction
All options go under the trafilaturaConfig key in config files, or use the equivalent CLI flags:
| Field | Type | Default | Description |
|---|---|---|---|
| favorPrecision | bool | false | High precision, less noise |
| favorRecall | bool | false | High recall, more content |
| includeComments | bool | true | Include comments |
| includeTables | bool | true | Include tables |
| includeImages | bool | false | Include images |
| includeFormatting | bool | true | Preserve formatting |
| includeLinks | bool | true | Include links |
| deduplicate | bool | false | Deduplicate content |
| withMetadata | bool | true | Extract metadata (title, author, date) |
| targetLanguage | string | null | Filter by language (e.g. "en") |
| fast | bool | false | Fast mode (less thorough) |
| pruneXpath | array | null | XPath patterns to remove from content |
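The config file uses camelCase keys, while the Python example in this document passes snake_case keyword arguments (`favor_precision`, `include_tables`, `deduplicate`). If the remaining fields follow the same naming pattern, which is an assumption and not something stated here, a config block could be converted mechanically. `camel_to_snake` and `to_engine_kwargs` are hypothetical helpers.

```python
import re


def camel_to_snake(name):
    """Convert a camelCase key such as 'favorPrecision' to 'favor_precision'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def to_engine_kwargs(trafilatura_config):
    """Hypothetical helper: rename every key in a trafilaturaConfig block."""
    return {camel_to_snake(key): value for key, value in trafilatura_config.items()}


kwargs = to_engine_kwargs(
    {"favorPrecision": True, "includeTables": True, "deduplicate": True}
)
print(kwargs)
# {'favor_precision': True, 'include_tables': True, 'deduplicate': True}
```

Only the three keyword names shown in the output are confirmed by the Python API example; check the engine's README before relying on the others.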
Node.js API
Use contextractor as a library in your Node.js code:
    const { extract } = require("contextractor");

    // Top-level await is not available in CommonJS, so wrap the calls in an async function.
    async function main() {
      // Extract a single URL
      await extract("https://example.com", {
        save: "markdown",
        outputDir: "./output",
      });

      // Multiple URLs with extraction options
      await extract(["https://a.com", "https://b.com"], {
        precision: true,
        noLinks: true,
        includeTables: true,
        save: ["markdown", "json"],
        outputDir: "./results",
      });

      // Using a config file
      await extract("https://example.com", { config: "./config.json" });
    }

    main().catch(console.error);
ESM import:
import { extract } from "contextractor";
extract(urls, options) returns Promise<void> — output goes to outputDir or stdout. Options use the same camelCase names as listed in CLI Options and Config File.
Python API
Install the extraction engine:
$ pip install contextractor-engine
Use ContentExtractor to extract content from HTML:
    from contextractor_engine import ContentExtractor, TrafilaturaConfig

    # Any HTML string works here; this placeholder keeps the example self-contained.
    html = "<html><body><article><p>Example article text.</p></article></body></html>"

    # Basic extraction
    extractor = ContentExtractor()
    result = extractor.extract(html, url="https://example.com", output_format="markdown")
    print(result.content)

    # High precision with custom config
    config = TrafilaturaConfig(favor_precision=True, include_tables=True, deduplicate=True)
    extractor = ContentExtractor(config=config)
    result = extractor.extract(html, output_format="json")
Extract metadata:
    meta = extractor.extract_metadata(html, url="https://example.com")
    print(meta.title, meta.author, meta.date)
Available output formats: txt, markdown, json, xml, xmltei
See packages/contextractor_engine/README.md for the full API reference.
Docker
$ docker run ghcr.io/contextractor/contextractor https://example.com
Save output to your local machine:
$ docker run -v ./output:/output ghcr.io/contextractor/contextractor https://example.com -o /output
Use a config file:
$ docker run -v ./config.json:/config.json ghcr.io/contextractor/contextractor --config /config.json
All CLI flags work the same inside Docker.
Docker from Code
Call Docker extraction programmatically:
Node.js:
    const { execSync } = require("child_process");

    const result = execSync(
      "docker run ghcr.io/contextractor/contextractor https://example.com",
      { encoding: "utf-8" }
    );
    console.log(result);
Python:
    import subprocess

    result = subprocess.run(
        ["docker", "run", "ghcr.io/contextractor/contextractor", "https://example.com"],
        capture_output=True, text=True,
    )
    print(result.stdout)
Volume mount for output:
$ docker run -v $(pwd)/output:/output ghcr.io/contextractor/contextractor https://example.com -o /output
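The same volume-mounted command can be assembled from Python. The sketch below only builds the argument list and does not run Docker; `docker_command` is a hypothetical helper, and resolving the host directory to an absolute path mirrors what `$(pwd)` does in the shell command above.

```python
from pathlib import Path

IMAGE = "ghcr.io/contextractor/contextractor"


def docker_command(url, output_dir):
    """Hypothetical helper: docker run argv with output_dir mounted at /output."""
    host_dir = Path(output_dir).resolve()  # absolute host path for the -v mount
    return ["docker", "run", "-v", f"{host_dir}:/output", IMAGE, url, "-o", "/output"]


command = docker_command("https://example.com", "./output")
print(command)
```

Pass `command` to `subprocess.run(command, check=True)` to execute it; `check=True` raises an exception if the container exits with a non-zero status.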
Output
One file per crawled page, named from the URL slug (e.g. example-com-page.md). Metadata (title, author, date) is included in the output header when available.
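Since each page becomes its own file, a post-processing step usually starts by listing the output directory. The sketch below groups files by extension using only the standard library. `collect_outputs` is a hypothetical helper, and the sample file names, including the assumption that a JSON result shares the slug of the Markdown one, are illustrative only.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def collect_outputs(output_dir):
    """Hypothetical helper: group files in output_dir by extension."""
    grouped = defaultdict(list)
    for path in sorted(Path(output_dir).iterdir()):
        if path.is_file():
            grouped[path.suffix].append(path)
    return dict(grouped)


# Simulate an output directory so the sketch runs without crawling anything.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example-com-page.md").write_text("# Example\n", encoding="utf-8")
    (Path(tmp) / "example-com-page.json").write_text("{}", encoding="utf-8")
    outputs = collect_outputs(tmp)
    names = {ext: [p.name for p in paths] for ext, paths in outputs.items()}

print(names)
```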
Platforms
- npm: macOS arm64, Linux (x64, arm64), Windows x64
- Docker: linux/amd64, linux/arm64
License
Apache-2.0
Docs version
2026-04-16T12:41:28Z


