
Contextractor — Trafilatura Powered Web Content Extractor

Extract clean, readable content from any website. Uses Trafilatura to strip away navigation, ads, and boilerplate—leaving just the text you need.

Why Trafilatura?

Trafilatura is a Python library designed for web content extraction, created by Adrien Barbaresi at the Berlin-Brandenburg Academy of Sciences. The library achieves the highest F1 score (0.958) among open-source content extraction tools in independent benchmarks, outperforming newspaper4k (0.949), Mozilla Readability (0.947), and goose3 (0.896). [1][2]

With over 4,900 GitHub stars and production deployments at HuggingFace, IBM, and Microsoft Research, Trafilatura has become the de facto standard for text extraction in data pipelines and LLM applications. [3]

Understanding the F1 Score

The F1 score is a standard metric for evaluating extraction quality, combining two complementary measures:

  • Precision: How much of the extracted content is actually relevant (avoiding noise like ads, navigation, footers)
  • Recall: How much of the relevant content was successfully extracted (avoiding missed paragraphs or sections)

The F1 score is the harmonic mean of precision and recall, ranging from 0 to 1. A score of 0.958 means Trafilatura correctly extracts 95.8% of the main content while excluding nearly all boilerplate — the best balance among tested tools. [2]

For comparison, a tool with high precision but low recall might extract clean content but miss important paragraphs. Conversely, high recall with low precision captures everything but includes unwanted elements like sidebars and advertisements.
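The relationship is easy to verify with the benchmark numbers themselves; a quick sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Trafilatura's benchmark precision/recall (see the table below) recover its F1
print(round(f1_score(0.938, 0.978), 3))  # → 0.958
```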

Benchmark Comparison

The following results are from the ScrapingHub Article Extraction Benchmark, which tests extraction quality across 181 diverse web pages: [1]

| Tool | F1 Score | Precision | Recall | Best For |
|---|---|---|---|---|
| Trafilatura | 0.958 | 0.938 | 0.978 | General web content, LLM pipelines |
| newspaper4k | 0.949 | 0.964 | 0.934 | News sites with rich metadata |
| @mozilla/readability | 0.947 | 0.914 | 0.982 | Browser-based extraction |
| readability-lxml | 0.922 | 0.913 | 0.931 | Simple HTML preservation |
| goose3 | 0.896 | 0.940 | 0.856 | High-precision requirements |
| jusText | 0.804 | 0.858 | 0.756 | Academic corpus building |

Trafilatura's 0.978 recall is particularly notable — it captures nearly all relevant content while maintaining excellent precision. This balance is achieved through a hybrid extraction approach that combines multiple algorithms. [2]

Key Advantages

LLM-optimized output formats

Trafilatura natively supports markdown output, which reduces token count by approximately 67% compared to raw HTML. [4] This makes it ideal for RAG pipelines, LLM fine-tuning datasets, and any application where token efficiency matters. The library supports seven output formats: plain text, Markdown, HTML, XML, XML-TEI (for academic research), JSON, and CSV.

Comprehensive metadata extraction

Beyond main content, Trafilatura automatically extracts structured metadata including title, author, publication date, language (via py3langid), site name, categories, tags, and content license. This metadata is invaluable for content organization, filtering, and downstream processing. [3]

Hybrid extraction with intelligent fallbacks

Trafilatura achieves its superior accuracy through a multi-stage approach: it first applies its own heuristic algorithms, then falls back to jusText and readability-lxml when needed. This redundancy ensures robust extraction across diverse page layouts and edge cases. [2]

Production-proven at scale

The library is trusted by major organizations including HuggingFace (for dataset curation), IBM, and Microsoft Research. Its efficient implementation handles large-scale crawling workloads without performance bottlenecks. [3]

Academic validation

Unlike many extraction tools, Trafilatura has peer-reviewed academic backing. It was published at ACL 2021 (Association for Computational Linguistics), providing transparency into its methodology and benchmarks. [5]

Limitations

  • Results vary on galleries, catalogs, and link-heavy pages where main content is ambiguous [3]

Web App

Try the interactive web app at contextractor.com to configure extraction settings, preview commands, and explore all options before running at scale.

Features

  • Multiple output formats - Markdown, plain text, JSON, XML, or XML-TEI (scholarly)
  • JavaScript rendering - Handles dynamic sites with Playwright (Chromium/Firefox)
  • Link crawling - Follow links across a site with glob/pseudo-URL filtering
  • Metadata extraction - Title, author, date, description, site name, and language
  • Configurable precision - Balance between extracting more content vs. less noise

Use cases

  • Build training datasets for LLMs
  • Research and academic text extraction
  • Feed content into RAG pipelines
  • Monitor content changes

Input

Crawl Settings

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls (required) | array | | URLs to extract content from |
| maxPagesPerCrawl | int | 0 | Max pages to crawl (0 = unlimited) |
| maxCrawlingDepth | int | 0 | Max link depth from start URLs |
| maxConcurrency | int | 50 | Max parallel browser pages |
| maxRequestRetries | int | 3 | Max retries for failed requests |
| maxResultsPerCrawl | int | 0 | Max results (0 = unlimited) |

Proxy Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| proxyConfiguration | object | | Proxy settings (use the Apify proxy editor) |
| proxyRotation | string | "RECOMMENDED" | RECOMMENDED, PER_REQUEST, UNTIL_FAILURE |

Browser Settings

| Parameter | Type | Default | Description |
|---|---|---|---|
| launcher | string | "CHROMIUM" | Browser engine: CHROMIUM, FIREFOX |
| headless | bool | true | Run browser in headless mode |
| waitUntil | string | "LOAD" | Page load event: LOAD, NETWORKIDLE, DOMCONTENTLOADED |
| pageLoadTimeoutSecs | int | 60 | Page load timeout in seconds |
| ignoreCorsAndCsp | bool | false | Disable CORS/CSP restrictions |
| closeCookieModals | bool | false | Auto-dismiss cookie consent banners |
| maxScrollHeightPixels | int | 5000 | Max scroll height in pixels (0 = disable) |
| ignoreSslErrors | bool | false | Skip SSL certificate verification |
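For example, an input fragment that renders with Firefox, dismisses cookie banners, and waits only for DOMContentLoaded (the values here are illustrative, not recommendations):

```json
{
  "launcher": "FIREFOX",
  "headless": true,
  "waitUntil": "DOMCONTENTLOADED",
  "pageLoadTimeoutSecs": 90,
  "closeCookieModals": true,
  "maxScrollHeightPixels": 0
}
```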

Crawl Filtering

| Parameter | Type | Default | Description |
|---|---|---|---|
| globs | array | [] | Glob patterns for URLs to include |
| excludes | array | [] | Glob patterns for URLs to exclude |
| pseudoUrls | array | [] | Pseudo-URLs to match (alternative to globs) |
| linkSelector | string | "" | CSS selector for links to follow |
| keepUrlFragments | bool | false | Treat URLs with different fragments as different pages |
| respectRobotsTxtFile | bool | false | Honor robots.txt |

Cookies & Headers

| Parameter | Type | Default | Description |
|---|---|---|---|
| initialCookies | array | [] | Initial cookies (JSON array of {name, value, domain, path}) |
| customHttpHeaders | object | {} | Custom HTTP headers ({"Authorization": "Bearer token"}) |

Output Settings

| Parameter | Type | Default | Description |
|---|---|---|---|
| saveExtractedMarkdownToKeyValueStore | bool | true | Save Markdown to key-value store |
| saveRawHtmlToKeyValueStore | bool | false | Save raw HTML |
| saveExtractedTextToKeyValueStore | bool | false | Save plain text |
| saveExtractedJsonToKeyValueStore | bool | false | Save JSON |
| saveExtractedXmlToKeyValueStore | bool | false | Save XML |
| saveExtractedXmlTeiToKeyValueStore | bool | false | Save XML-TEI |
| datasetName | string | | Custom dataset name |
| keyValueStoreName | string | | Custom key-value store name |
| requestQueueName | string | | Custom request queue name |

Content Extraction

| Parameter | Type | Default | Description |
|---|---|---|---|
| trafilaturaConfig | object | {} | Extraction options (see below) |

Trafilatura config keys: favorPrecision, favorRecall, includeComments, includeTables, includeImages, includeFormatting, includeLinks, deduplicate, withMetadata, targetLanguage, fast.
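For instance, a trafilaturaConfig that favors precision, keeps tables, and drops comments might look like this (illustrative values):

```json
{
  "trafilaturaConfig": {
    "favorPrecision": true,
    "includeComments": false,
    "includeTables": true,
    "withMetadata": true,
    "targetLanguage": "en"
  }
}
```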

Output

Each crawled page produces a dataset item:

```json
{
  "loadedUrl": "https://example.com/article",
  "httpStatus": 200,
  "loadedAt": "2025-01-31T12:00:00.000Z",
  "metadata": {
    "title": "Article Title",
    "author": "John Doe",
    "publishedAt": "2025-01-15",
    "description": "Article description",
    "siteName": "Example Blog",
    "lang": "en"
  },
  "rawHtml": {
    "hash": "f8e6bd335e04d03e1be6798c2c72349c",
    "length": 45000
  },
  "extractedMarkdown": {
    "key": "a1b2c3d4e5f67890.md",
    "url": "https://api.apify.com/v2/key-value-stores/.../records/a1b2c3d4e5f67890.md",
    "hash": "43f204bfbee5dbe6862cb38620f257b5",
    "length": 5000
  }
}
```

Extracted content is saved to the key-value store. The extractedMarkdown (and similar fields for other formats) contains a url you can use to download the content directly.
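As an illustrative sketch using only the standard library (the item below is abbreviated, and "STORE_ID" is a made-up placeholder), reading that url field and downloading the record:

```python
import urllib.request

def record_url(item: dict, field: str = "extractedMarkdown") -> str:
    """Return the key-value store download URL stored on a dataset item."""
    return item[field]["url"]

def download_record(item: dict, field: str = "extractedMarkdown") -> str:
    """Fetch the saved content for one dataset item (requires network access)."""
    with urllib.request.urlopen(record_url(item, field)) as resp:
        return resp.read().decode("utf-8")

# Abbreviated dataset item; "STORE_ID" stands in for a real store ID
item = {
    "extractedMarkdown": {
        "key": "a1b2c3d4e5f67890.md",
        "url": "https://api.apify.com/v2/key-value-stores/STORE_ID/records/a1b2c3d4e5f67890.md",
    }
}
print(record_url(item))
```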

Example

Extract all blog posts from a site:

```json
{
  "startUrls": [{ "url": "https://example.com/blog" }],
  "globs": [{ "glob": "https://example.com/blog/**" }],
  "linkSelector": "a",
  "maxPagesPerCrawl": 100,
  "trafilaturaConfig": {},
  "saveExtractedMarkdownToKeyValueStore": true
}
```

References

1. ScrapingHub. "Article Extraction Benchmark." GitHub.
2. Barbaresi, Adrien. "Evaluation." Trafilatura Documentation v2.0.0.
3. Barbaresi, Adrien. "Trafilatura: A Python package & command-line tool to gather text on the Web." GitHub.
4. "An introduction to preparing your own dataset for LLM training." AWS Machine Learning Blog, Amazon Web Services.
5. Barbaresi, Adrien (2021). "Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction." ACL Anthology.
