contextractor - Trafilatura based

Extract clean, readable content. Uses Trafilatura, the top-rated library, to strip away navigation, ads, and boilerplate—leaving just the text you need.

Pricing: Pay per usage
Rating: 0.0 (0)
Developer: Glueo (Maintained by Community)
Actor stats: 0 bookmarked · 2 total users · 1 monthly active user
Last modified: 18 hours ago

Contextractor — Trafilatura Powered Web Content Extractor

Extract clean, readable content from any website. Uses Trafilatura to strip away navigation, ads, and boilerplate—leaving just the text you need.

Why Trafilatura?

Trafilatura is a Python library designed for web content extraction, created by Adrien Barbaresi at the Berlin-Brandenburg Academy of Sciences. The library achieves the highest F1 score (0.958) among open-source content extraction tools in independent benchmarks, outperforming newspaper4k (0.949), Mozilla Readability (0.947), and goose3 (0.896). [1][2]

With over 4,900 GitHub stars and production deployments at HuggingFace, IBM, and Microsoft Research, Trafilatura has become the de facto standard for text extraction in data pipelines and LLM applications. [3]

Understanding the F1 Score

The F1 score is a standard metric for evaluating extraction quality, combining two complementary measures:

  • Precision: How much of the extracted content is actually relevant (avoiding noise like ads, navigation, footers)
  • Recall: How much of the relevant content was successfully extracted (avoiding missed paragraphs or sections)

The F1 score is the harmonic mean of precision and recall, ranging from 0 to 1. A score of 0.958 means Trafilatura correctly extracts 95.8% of the main content while excluding nearly all boilerplate — the best balance among tested tools. [2]

For comparison, a tool with high precision but low recall might extract clean content but miss important paragraphs. Conversely, high recall with low precision captures everything but includes unwanted elements like sidebars and advertisements.
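The harmonic mean is easy to check directly. A quick sketch in plain Python, using the precision and recall figures from the benchmark:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Trafilatura's benchmark precision (0.938) and recall (0.978)
print(round(f1_score(0.938, 0.978), 3))  # → 0.958

# newspaper4k trades some recall for precision and lands slightly lower
print(round(f1_score(0.964, 0.934), 3))  # → 0.949
```

Because the harmonic mean punishes imbalance, a tool cannot buy a high F1 score with precision alone: both numbers have to be strong.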

Benchmark Comparison

The following results are from the ScrapingHub Article Extraction Benchmark, which tests extraction quality across 181 diverse web pages: [1]

Tool                   F1 Score   Precision   Recall   Best For
Trafilatura            0.958      0.938       0.978    General web content, LLM pipelines
newspaper4k            0.949      0.964       0.934    News sites with rich metadata
@mozilla/readability   0.947      0.914       0.982    Browser-based extraction
readability-lxml       0.922      0.913       0.931    Simple HTML preservation
goose3                 0.896      0.940       0.856    High-precision requirements
jusText                0.804      0.858       0.756    Academic corpus building

Trafilatura's 0.978 recall is particularly notable — it captures nearly all relevant content while maintaining excellent precision. This balance is achieved through a hybrid extraction approach that combines multiple algorithms. [2]

Key Advantages

LLM-optimized output formats

Trafilatura natively supports markdown output, which reduces token count by approximately 67% compared to raw HTML. [4] This makes it ideal for RAG pipelines, LLM fine-tuning datasets, and any application where token efficiency matters. The library supports seven output formats: plain text, Markdown, HTML, XML, XML-TEI (for academic research), JSON, and CSV.

Comprehensive metadata extraction

Beyond main content, Trafilatura automatically extracts structured metadata including title, author, publication date, language (via py3langid), site name, categories, tags, and content license. This metadata is invaluable for content organization, filtering, and downstream processing. [3]

Hybrid extraction with intelligent fallbacks

Trafilatura achieves its superior accuracy through a multi-stage approach: it first applies its own heuristic algorithms, then falls back to jusText and readability-lxml when needed. This redundancy ensures robust extraction across diverse page layouts and edge cases. [2]
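The general pattern is a chain of extractors tried in order of preference. This is an illustrative sketch in plain Python, not Trafilatura's actual internals; the extractor names and threshold are stand-ins:

```python
from typing import Callable, Optional

def extract_with_fallbacks(
    html: str,
    extractors: list[tuple[str, Callable[[str], Optional[str]]]],
    min_length: int = 25,  # arbitrary sanity threshold for this sketch
) -> tuple[Optional[str], Optional[str]]:
    """Try each extractor in order; return (name, text) of the first
    result that looks substantial enough, else (None, None)."""
    for name, extractor in extractors:
        text = extractor(html)
        if text and len(text.strip()) >= min_length:
            return name, text
    return None, None

# Stand-in extractors: the primary "fails" on this input, the fallback succeeds.
chain = [
    ("primary-heuristics", lambda html: None),
    ("justext-fallback", lambda html: "Recovered main content from the page body."),
]
name, text = extract_with_fallbacks("<html>...</html>", chain)
print(name)  # → justext-fallback
```

The design choice is that each stage only runs when the previous one produces nothing usable, so the common case stays fast while odd layouts still get handled.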

Production-proven at scale

The library is trusted by major organizations including HuggingFace (for dataset curation), IBM, and Microsoft Research. Its efficient implementation handles large-scale crawling workloads without performance bottlenecks. [3]

Academic validation

Unlike many extraction tools, Trafilatura has peer-reviewed academic backing. It was published at ACL 2021 (Association for Computational Linguistics), providing transparency into its methodology and benchmarks. [5]

Limitations

  • Results vary on galleries, catalogs, and link-heavy pages where main content is ambiguous [3]

Features

  • Multiple output formats - Markdown, plain text, JSON, XML, or XML-TEI (scholarly)
  • JavaScript rendering - Handles dynamic sites with Playwright (Chromium/Firefox)
  • Link crawling - Follow links across a site with glob/pseudo-URL filtering
  • Metadata extraction - Title, author, date, description, site name, and language
  • Configurable precision - Balance between extracting more content vs. less noise

Use cases

  • Build training datasets for LLMs
  • Research and academic text extraction
  • Feed content into RAG pipelines
  • Monitor content changes

Input

Parameter                              Description                                     Default
startUrls (required)                   URLs to extract content from
globs                                  Glob patterns for URLs to include in crawling   []
excludes                               Glob patterns for URLs to exclude               []
extractionMode                         FAVOR_PRECISION, BALANCED, or FAVOR_RECALL      BALANCED
maxPagesPerCrawl                       Limit total pages crawled (0 = unlimited)       0
maxCrawlingDepth                       Limit link depth from start URLs                0
saveExtractedMarkdownToKeyValueStore   Save Markdown to key-value store                true

See the full input schema for browser settings, proxy configuration, cookies, and custom headers.
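To build intuition for how globs and excludes shape a crawl, here is a rough stand-in using Python's fnmatch. The actual matching is done by the Apify platform and may differ in edge cases (for instance, how `**` treats path separators), so treat this only as an approximation:

```python
from fnmatch import fnmatch

def url_allowed(url: str, globs: list[str], excludes: list[str]) -> bool:
    """Approximate include/exclude filtering: a URL must match at least
    one include glob (if any are given) and no exclude glob."""
    if globs and not any(fnmatch(url, g) for g in globs):
        return False
    return not any(fnmatch(url, e) for e in excludes)

globs = ["https://example.com/blog/**"]
excludes = ["https://example.com/blog/tag/**"]

print(url_allowed("https://example.com/blog/post-1", globs, excludes))      # → True
print(url_allowed("https://example.com/blog/tag/python", globs, excludes))  # → False
print(url_allowed("https://example.com/about", globs, excludes))            # → False
```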

Output

Each crawled page produces a dataset item:

{
  "loadedUrl": "https://example.com/article",
  "httpStatus": 200,
  "loadedAt": "2025-01-31T12:00:00.000Z",
  "metadata": {
    "title": "Article Title",
    "author": "John Doe",
    "publishedAt": "2025-01-15",
    "description": "Article description",
    "siteName": "Example Blog",
    "lang": "en"
  },
  "rawHtml": {
    "hash": "f8e6bd335e04d03e1be6798c2c72349c",
    "length": 45000
  },
  "extractedMarkdown": {
    "key": "a1b2c3d4e5f67890.md",
    "url": "https://api.apify.com/v2/key-value-stores/.../records/a1b2c3d4e5f67890.md",
    "hash": "43f204bfbee5dbe6862cb38620f257b5",
    "length": 5000
  }
}

Extracted content is saved to the key-value store. The extractedMarkdown (and similar fields for other formats) contains a url you can use to download the content directly.

Example

Extract all blog posts from a site:

{
  "startUrls": [{ "url": "https://example.com/blog" }],
  "globs": [{ "glob": "https://example.com/blog/**" }],
  "linkSelector": "a",
  "maxPagesPerCrawl": 100,
  "extractionMode": "BALANCED",
  "saveExtractedMarkdownToKeyValueStore": true
}

References

1. ScrapingHub. "Article Extraction Benchmark." GitHub.
2. Barbaresi, Adrien. "Evaluation." Trafilatura Documentation v2.0.0.
3. Barbaresi, Adrien. "Trafilatura: A Python package & command-line tool to gather text on the Web." GitHub.
4. "An introduction to preparing your own dataset for LLM training." AWS Machine Learning Blog, Amazon Web Services.
5. Barbaresi, Adrien (2021). "Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction." ACL Anthology.

Docs version: 2026-01-31T18:42:11Z