Article Extractor

Developer: Tugelbay Konabayev (Maintained by Community) · Pricing: Pay per usage

Article Extractor — Clean Content from Any URL for LLMs

Extract clean, readable article content from any web page. Removes ads, navigation, sidebars, and boilerplate — returns just the article text with metadata. Output as Markdown, plain text, or clean HTML. Built for AI/LLM workflows, content analysis, and data pipelines.

Perfect for building RAG pipelines, AI training datasets, knowledge bases, and content monitoring systems.

What does Article Extractor do?

This actor takes a list of URLs and extracts the main article content from each page using Mozilla's Readability algorithm (the same technology behind Firefox Reader View). It returns structured data including:

  • Article text in Markdown, plain text, or clean HTML
  • Metadata: title, author, published date, description, language
  • Structured data: JSON-LD and Open Graph metadata parsing
  • Media: images, Open Graph image, links found in the article
  • Stats: word count, HTTP status code, extraction timestamp

You provide URLs — the actor does the rest. No custom selectors, no configuration per site, no CSS parsing. It just works.

Why use this instead of a generic web scraper?

| Feature | Generic Scraper | Website Content Crawler | Article Extractor |
| --- | --- | --- | --- |
| Content extraction | Raw HTML / CSS selectors | Full page content | Smart article detection |
| Output quality | Includes ads, nav, footers | Includes boilerplate | Clean article text only |
| Setup time | Write custom selectors per site | Minimal config | Zero config — just add URLs |
| LLM-ready output | Requires post-processing | Some formatting | Markdown ready for RAG |
| Metadata | Manual extraction | Basic | Auto-detected (author, date, JSON-LD, OG) |
| Pricing | Varies | Free (5,743 users) | PPE (pay per article) |
| Speed | Depends on implementation | Slower (full crawl) | Fast (parallel HTTP) |
| AI/MCP compatible | No | No (free) | Yes (PPE) |

vs. Website Content Crawler

Apify's Website Content Crawler (5,743 users, free) crawls entire websites and extracts all page content. Article Extractor is different:

  • Focused extraction: Only extracts the main article content, not the entire page
  • Cleaner output: Strips navigation, ads, sidebars, related articles — just the article
  • Richer metadata: Automatically extracts author, publish date, JSON-LD, Open Graph
  • Faster: Uses HTTP requests (no browser), processes pages in parallel
  • PPE pricing: Pay only for successfully extracted articles (AI/MCP compatible)

When to use which:

  • Use Article Extractor when you need clean article text from known URLs (news, blogs, docs)
  • Use Website Content Crawler when you need to crawl an entire website following links

Features

  • Smart article extraction using Mozilla Readability algorithm
  • Markdown output optimized for LLM consumption and RAG pipelines
  • Automatic metadata extraction (author, date, description, language)
  • JSON-LD and Open Graph metadata parsing
  • Image and link extraction from article body
  • Concurrent processing (up to 50 pages in parallel)
  • Proxy support for geo-restricted content
  • Handles news sites, blogs, documentation, and any content page
  • 5MB page size limit to prevent memory issues
  • PPE pricing — pay only for successfully extracted articles
  • First 100 extractions free

Input examples

Extract articles as Markdown (default)

```json
{
  "urls": [
    { "url": "https://blog.apify.com/what-is-web-scraping/" },
    { "url": "https://en.wikipedia.org/wiki/Web_scraping" }
  ],
  "outputFormat": "markdown",
  "maxItems": 100
}
```

Extract as plain text for NLP analysis

```json
{
  "urls": [{ "url": "https://techcrunch.com/2026/01/15/latest-ai-news/" }],
  "outputFormat": "text",
  "extractImages": false
}
```

Bulk extraction with proxy (100+ articles)

```json
{
  "urls": [
    { "url": "https://example.com/article-1" },
    { "url": "https://example.com/article-2" },
    { "url": "https://example.com/article-3" }
  ],
  "outputFormat": "markdown",
  "maxConcurrency": 20,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```

Extract images and links

```json
{
  "urls": [{ "url": "https://news.ycombinator.com/item?id=12345" }],
  "outputFormat": "markdown",
  "extractImages": true,
  "extractLinks": true
}
```

Input parameters

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| urls | Array | N/A | Yes | List of article/page URLs to extract content from |
| outputFormat | String | "markdown" | No | Output format: "markdown", "text", or "html" |
| maxItems | Integer | 100 | No | Maximum number of articles to extract (1–10,000) |
| extractImages | Boolean | true | No | Include image URLs found in the article |
| extractLinks | Boolean | false | No | Include links found in the article |
| timeout | Integer | 30 | No | Maximum seconds to wait for each page to load (5–120) |
| maxConcurrency | Integer | 10 | No | Number of pages to process simultaneously (1–50) |
| proxyConfiguration | Object | None | No | Proxy settings for accessing geo-restricted content |
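
The documented ranges can be enforced client-side before starting a run. The helper below is an illustrative sketch (not part of the actor) that turns plain URL strings into a valid input object and clamps the numeric parameters to their limits:

```python
def build_input(urls, output_format="markdown", max_items=100,
                timeout=30, max_concurrency=10):
    """Build a run input for Article Extractor, clamping numeric values to
    the documented ranges (maxItems 1-10,000; timeout 5-120; concurrency 1-50)."""
    if not urls:
        raise ValueError("urls is required and must not be empty")
    if output_format not in ("markdown", "text", "html"):
        raise ValueError(f"unsupported outputFormat: {output_format}")

    def clamp(value, lo, hi):
        return max(lo, min(hi, value))

    return {
        "urls": [{"url": u} for u in urls],
        "outputFormat": output_format,
        "maxItems": clamp(max_items, 1, 10_000),
        "timeout": clamp(timeout, 5, 120),
        "maxConcurrency": clamp(max_concurrency, 1, 50),
    }
```

The resulting dictionary can be passed directly as `run_input` to the Apify client.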

Output format

Each item in the dataset contains:

| Field | Type | Description |
| --- | --- | --- |
| url | String | Final page URL (after redirects) |
| canonicalUrl | String | Canonical URL if specified by the page |
| title | String | Article title |
| author | String | Article author (from meta tags, JSON-LD, or byline) |
| publishedDate | String | Publication date (ISO 8601) |
| description | String | Meta description or article summary |
| content | String | Extracted article in requested format (Markdown/text/HTML) |
| wordCount | Integer | Number of words in the article |
| language | String | Detected content language code |
| siteName | String | Website name (from Open Graph) |
| images | Array | Image URLs from the article (if extractImages: true) |
| links | Array | Links from the article (if extractLinks: true) |
| ogImage | String | Open Graph image URL |
| statusCode | Integer | HTTP response status code |
| error | String | Error message if extraction failed (null on success) |
| extractedAt | String | Extraction timestamp (ISO 8601) |
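
Because every item carries an error field, downstream code can separate clean articles from failures without extra bookkeeping. A small sketch (the item dicts are shaped as in the dataset):

```python
def summarize(items):
    """Split dataset items into successes and failures and total the words."""
    ok = [i for i in items if i.get("error") is None and i.get("content")]
    failed = [i for i in items if i.get("error") is not None]
    total_words = sum(i.get("wordCount", 0) for i in ok)
    return ok, failed, total_words
```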

Example output

```json
{
  "url": "https://blog.apify.com/what-is-web-scraping/",
  "canonicalUrl": "https://blog.apify.com/what-is-web-scraping/",
  "title": "What is web scraping? A beginner's guide",
  "author": "Apify Team",
  "publishedDate": "2024-03-15T10:00:00Z",
  "description": "Learn what web scraping is, how it works, and why it matters.",
  "content": "# What is web scraping?\n\nWeb scraping is the process of automatically extracting data from websites...\n\n## How does web scraping work?\n\n1. **Send HTTP request** to the target URL\n2. **Parse the HTML** response\n3. **Extract the data** you need\n4. **Store the results** in a structured format",
  "wordCount": 2450,
  "language": "en",
  "siteName": "Apify Blog",
  "images": ["https://blog.apify.com/content/images/web-scraping-hero.jpg"],
  "links": [],
  "ogImage": "https://blog.apify.com/content/images/og-web-scraping.jpg",
  "statusCode": 200,
  "error": null,
  "extractedAt": "2026-03-29T12:00:00+00:00"
}
```

Integrations

Apify MCP Server (Claude, AI agents)

Use as a tool in Claude Desktop, Claude Code, or any MCP-compatible AI agent framework. The actor is PPE-priced, making it native to AI agent workflows where each task triggers a separate extraction.

Python integration

```python
from apify_client import ApifyClient

client = ApifyClient("your-apify-api-token")

# Extract articles
run = client.actor("tugelbay/article-extractor").call(
    run_input={
        "urls": [
            {"url": "https://blog.apify.com/what-is-web-scraping/"},
            {"url": "https://en.wikipedia.org/wiki/Web_scraping"},
        ],
        "outputFormat": "markdown",
    }
)

# Read results
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"Title: {item['title']}")
    print(f"Author: {item.get('author', 'Unknown')}")
    print(f"Words: {item['wordCount']}")
    print(f"Content preview: {item['content'][:200]}...")
    print()
```

JavaScript/TypeScript integration

```javascript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "your-apify-api-token" });

const run = await client.actor("tugelbay/article-extractor").call({
  urls: [
    { url: "https://blog.apify.com/what-is-web-scraping/" },
    { url: "https://en.wikipedia.org/wiki/Web_scraping" },
  ],
  outputFormat: "markdown",
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
  console.log(`${item.title} (${item.wordCount} words)`);
  console.log(item.content?.substring(0, 200));
}
```

LangChain (RAG pipeline)

```python
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

apify = ApifyWrapper(apify_api_token="your-apify-api-token")

# call_actor returns an ApifyDatasetLoader; call .load() to get the Documents
loader = apify.call_actor(
    actor_id="tugelbay/article-extractor",
    run_input={
        "urls": [{"url": "https://example.com/article"}],
        "outputFormat": "markdown",
    },
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("content", ""),
        metadata={
            "url": item.get("url"),
            "title": item.get("title"),
            "author": item.get("author"),
            "publishedDate": item.get("publishedDate"),
        },
    ),
)
docs = loader.load()
```

Webhooks and integrations

The actor works with Apify's integration ecosystem:

  • Google Sheets — export extracted articles directly to a spreadsheet
  • Zapier / Make — trigger workflows on new results
  • Slack — get notifications when extraction completes
  • Email — receive dataset as email attachment
  • API — call programmatically via Apify REST API

Use cases

  • LLM training data — extract clean text from web pages for fine-tuning datasets
  • RAG pipelines — feed article content into vector databases for retrieval-augmented generation
  • Content analysis — analyze articles at scale for sentiment, topics, and trends
  • News monitoring — extract and archive news articles automatically on a schedule
  • Research — collect and structure academic or industry content for literature reviews
  • SEO analysis — extract competitor content for gap analysis and content strategy
  • Knowledge base — build searchable archives from documentation sites and blogs
  • Content migration — extract content from legacy sites during CMS migrations
  • AI agents — give your AI agent the ability to read and understand any web page
  • Newsletter curation — automatically extract and summarize articles for newsletters
  • Compliance monitoring — track content changes on regulatory or competitor pages

Cost estimation (PPE pricing)

| Event | Description |
| --- | --- |
| article-extracted | Each article successfully extracted |

Example costs:

| Scenario | Articles | Cost |
| --- | --- | --- |
| 10 blog posts | 10 | ~$0.05 |
| 100 news articles | 100 | ~$0.50 |
| 1,000 documentation pages | 1,000 | ~$5 |
| Daily news monitoring (50 articles/day) | 1,500/month | ~$7.50/month |
| Large-scale extraction | 10,000 | ~$50 |

First 100 extractions are free to help you evaluate the actor.
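
The examples in the table imply a flat rate of roughly $0.005 per article-extracted event. Assuming that rate (an inference from the table, not an official price), a back-of-envelope estimate looks like:

```python
PRICE_PER_ARTICLE = 0.005  # assumed rate implied by the table above

def estimate_cost(articles, free_remaining=0):
    """Rough PPE cost estimate; pass free_remaining=100 if the free tier is unused."""
    billable = max(0, articles - free_remaining)
    return round(billable * PRICE_PER_ARTICLE, 2)
```

Check actual event prices on the actor's pricing tab before budgeting large runs.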

Tip: Set extractImages: false and extractLinks: false to speed up extraction and reduce output size when you only need the text content.

FAQ

What types of pages work best?

Article Extractor works best on article-style pages: news articles, blog posts, documentation pages, Wikipedia articles, and similar content. The Readability algorithm is designed to identify the "main content" of a page and strip everything else.

Does it work on JavaScript-rendered pages (SPAs)?

No. Article Extractor uses fast HTTP requests (no browser). Pages that require JavaScript to render content (React SPAs, Angular apps) will return empty or minimal content. For those pages, use RAG Web Browser, which has an automatic browser fallback.

How fast is it?

Very fast. Since it uses HTTP requests (no browser), it can process 100 articles in 2–3 minutes with default concurrency. Increase maxConcurrency to 50 for even faster processing.
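
That throughput follows from simple batch arithmetic. Assuming an average of ~15 seconds per page (a pessimistic figure chosen here to include slow servers; it is an assumption, not a measured number), 100 URLs at the default concurrency of 10 take about 10 batches:

```python
import math

def estimated_minutes(articles, concurrency=10, avg_seconds_per_page=15):
    """Back-of-envelope runtime: pages are processed in parallel batches."""
    batches = math.ceil(articles / concurrency)
    return batches * avg_seconds_per_page / 60
```

Under the same assumption, raising maxConcurrency to 50 drops the estimate for 100 articles to about 30 seconds.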

Can I extract content behind login walls or paywalls?

No. Article Extractor only works with publicly accessible pages. It cannot bypass login walls, paywalls, or CAPTCHA-protected content.

What's the maximum page size?

5MB per page. Larger pages are truncated to prevent memory issues. This covers 99%+ of normal web articles.

Can I run this on a schedule?

Yes. Set up a Schedule in Apify Console to run the actor at any interval — hourly, daily, or custom cron expressions. Perfect for news monitoring and content tracking.

Why Markdown output?

Markdown is the most LLM-friendly format:

  • Preserves semantic structure (headers, emphasis, lists, code blocks)
  • Compact — fits more content in LLM context windows
  • Renders cleanly in chat interfaces and documentation tools
  • Easy to parse for downstream processing
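
Because headers survive in the output, the content field can be chunked along section boundaries before embedding, a common RAG pattern. A minimal sketch (the splitting rule is illustrative; real pipelines often use a library text splitter):

```python
def split_by_headers(markdown):
    """Split Markdown into chunks at top- and second-level headers."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith(("# ", "## ")) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```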

How does it handle errors?

If a page fails to load (timeout, 404, blocked), the actor returns the URL with an error field explaining what went wrong and a null content field. Other pages in the batch continue processing normally.
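
One practical consequence: failed URLs can be collected from the dataset and fed straight into a follow-up run. A sketch of that retry step (the timeout bump is a suggestion, not a requirement):

```python
def build_retry_input(items, timeout=60):
    """Collect URLs that failed and build an input for a second run."""
    failed_urls = [i["url"] for i in items if i.get("error") is not None]
    return {
        "urls": [{"url": u} for u in failed_urls],
        "outputFormat": "markdown",
        "timeout": timeout,  # give slow pages more time on the retry
    }
```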

Troubleshooting

Empty or very short content extraction

  • Cause: The page is a SPA (Single Page Application) that renders content with JavaScript
  • Fix: Use RAG Web Browser instead, which has browser fallback
  • Note: Very short pages (under 100 words) may not give Readability enough content to detect the main article

Missing author or publish date

  • Cause: The page doesn't include author/date in meta tags, JSON-LD, or standard HTML patterns
  • Fix: This is expected — not all pages provide this metadata. The fields will be null.

Timeout errors on some pages

  • Cause: The target page is slow to respond
  • Fix: Increase the timeout parameter (default: 30 seconds, max: 120 seconds)
  • Alternative: Reduce maxConcurrency if you're scraping many pages from the same domain
  • Cause: Some sites block datacenter IPs
  • Fix: Enable Apify proxy with residential proxy groups in proxyConfiguration
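
For the blocked-IP case, residential proxies can be requested in the run input. The sketch below uses RESIDENTIAL, a standard Apify proxy group name (availability depends on your Apify plan):

```python
# Run input enabling Apify residential proxies for sites that block
# datacenter IPs (assumes your plan includes residential proxy access).
run_input = {
    "urls": [{"url": "https://example.com/blocked-article"}],
    "outputFormat": "markdown",
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
    },
}
```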

Limitations

  • Only works with publicly accessible pages (no login-protected or paywalled content)
  • JavaScript-rendered content (SPAs) will not extract fully — use a browser-based solution for those
  • Very short pages (under 100 words) may not have enough content for Readability to detect
  • Maximum page size: 5MB (larger pages are truncated)
  • Maximum 10,000 articles per run (use multiple runs for larger datasets)
  • Metadata extraction depends on the page having proper meta tags, JSON-LD, or Open Graph markup

Changelog

v1.0 (2026-03-29)

  • Initial release
  • Markdown, plain text, and clean HTML output formats
  • Mozilla Readability-based article extraction
  • Metadata extraction (author, date, description, JSON-LD, Open Graph)
  • Image and link extraction
  • Concurrent processing with configurable concurrency (1–50)
  • Proxy support
  • PPE pricing (first 100 free)