Pricing

from $3.50 / 1,000 articles

Article Extraction API

Extract clean article text and metadata from URLs as Markdown, text, or HTML for RAG, AI agents, monitoring, and research. Guide: https://konabayev.com/tools/article-extractor/?utm_source=apify_info&utm_medium=referral&utm_campaign=article-extractor

Pricing

from $3.50 / 1,000 articles

Rating

0.0

(0)

Developer

Tugelbay Konabayev

Actor stats

Bookmarked

Total users

Monthly active users

7 hours ago

Last modified

Article Extraction API — URL to Markdown for RAG & LLMs

Clean article content from known URLs — extract title, author, date, Markdown/text/HTML, images, links, and metadata from news, blogs, docs, and knowledge-base pages. Built for AI ingestion — compact Markdown output for RAG pipelines, vector databases, LLM prompts, content monitoring, and research datasets. Reliable HTTP extraction — no browser overhead; retries transient 408/429/5xx responses and uses a conservative default concurrency for production runs.

Article Extractor overview: clean article content, Markdown output, metadata, and images

Article Extractor input and output example Article Extractor dataset preview

Convert article URLs into clean, readable content. Article Extractor removes ads, navigation, sidebars, and boilerplate, then returns the main article text with source metadata. Output as Markdown, plain text, or clean HTML for AI/LLM workflows, content analysis, and data pipelines.

Perfect for building RAG pipelines, AI training datasets, knowledge bases, and content monitoring systems.

For implementation recipes, production examples, and SEO/GEO notes, see the Article Extractor guide on Konabayev.com.

Article Extraction API for RAG and AI Agents

Call it from the Apify API, MCP tools, LangChain, Make, Zapier, or scheduled Apify runs. Give the actor known article URLs and get structured records that are ready to index, summarize, classify, or store.

URL to Markdown, Text, or Clean HTML

Paste a URL and get clean article text with title, author, publish date, body content, images, links, Open Graph data, canonical URL, and word count. Markdown is the default because it preserves headings and lists while keeping token usage lower than raw HTML.

Web Content Scraper for RAG Pipelines

Feed clean article text directly into your vector database or LLM prompt. Output as Markdown or plain text — ready for embedding.

Bulk Article Extraction (Up to 10K URLs)

Process hundreds or thousands of URLs in a single run. Perfect for news monitoring, content aggregation, and research datasets.

What does Article Extractor do?

This actor takes a list of URLs and extracts the main article content from each page using Mozilla's Readability algorithm (the same technology behind Firefox Reader View). It returns structured data including:

Article text in Markdown, plain text, or clean HTML
Metadata: title, author, published date, description, language
Structured data: JSON-LD and Open Graph metadata parsing
Media: images, Open Graph image, links found in the article
Stats: word count, HTTP status code, extraction timestamp

You provide article-style public URLs and the actor extracts clean content without custom selectors or per-site CSS parsing. Some protected, app-rendered, or non-article pages may return partial content.

Why use this instead of a generic web scraper?

Need	Generic scraper	Website crawler	Article Extractor
Main content extraction	Usually custom CSS selectors	Whole-page or crawl-oriented extraction	Readability-based article detection
Output cleanup	You remove nav, ads, cookie banners, and footers	Good for broad site ingestion	Focused on clean article body text
Setup time	Write and maintain selectors per site	Configure crawl scope and depth	Add URLs and run
LLM-ready output	Requires post-processing	Good for site-wide RAG	Markdown/text/HTML per known URL
Metadata	Manual extraction	Varies by crawler	Author, date, description, language, OG, JSON-LD
Best fit	Site-specific scraping logic	Crawling whole websites	News, blogs, docs, research, monitoring

vs. Website Content Crawler

Apify's Website Content Crawler crawls entire websites and extracts content across discovered pages. Article Extractor is different:

Focused extraction: Only extracts the main article content, not the entire page
Cleaner output: Strips navigation, ads, sidebars, related articles — just the article
Richer metadata: Automatically extracts author, publish date, JSON-LD, Open Graph
Faster: Uses HTTP requests (no browser), processes pages in parallel
Known-URL workflow: Optimized when your pipeline already has article URLs from RSS, search, sitemaps, feeds, monitoring, or another actor

When to use which:

Use Article Extractor when you need clean article text from known URLs (news, blogs, docs)
Use Website Content Crawler when you need to crawl an entire website following links
Use RAG Web Browser when pages require browser rendering or search-to-content workflows

Best-fit workflows

URL to Markdown API — convert a queue of article URLs into clean Markdown documents.
News and blog monitoring — combine with RSS, search, or scheduled runs to archive new articles.
RAG ingestion — push Markdown content and metadata into vector stores with source traceability.
Content research — collect article titles, authors, dates, descriptions, word counts, and clean bodies.
SEO content analysis — compare competitor articles without scraping menus, ads, and unrelated widgets.

Features

Smart article extraction using Mozilla Readability algorithm
Markdown output optimized for LLM consumption and RAG pipelines
Automatic metadata extraction (author, date, description, language)
JSON-LD and Open Graph metadata parsing
Image and link extraction from article body
Concurrent processing with a reliability-first default of 5 pages at a time and an advanced limit up to 50
Retries transient timeout, rate-limit, and 5xx responses before returning an error item
Proxy support for geo-restricted content
Best for public article-style pages such as news sites, blogs, and documentation
5MB page size limit to prevent memory issues
Pay-per-event pricing for extracted articles

Input examples

Extract articles as Markdown (default)

{
  "urls": [{ "url": "https://blog.apify.com/what-is-web-scraping/" }],
  "outputFormat": "markdown",
  "maxItems": 1
}

Extract as plain text for NLP analysis

{
  "urls": [{ "url": "https://blog.apify.com/how-to-build-a-web-scraper/" }],
  "outputFormat": "text",
  "extractImages": false
}

Bulk extraction with proxy (100+ articles)

{
  "urls": [
    { "url": "https://example.com/article-1" },
    { "url": "https://example.com/article-2" },
    { "url": "https://example.com/article-3" }
  ],
  "outputFormat": "markdown",
  "maxConcurrency": 5,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}

Extract with all metadata (images + links)

{
  "urls": [{ "url": "https://blog.apify.com/what-is-web-scraping/" }],
  "outputFormat": "markdown",
  "extractImages": true,
  "extractLinks": true
}

Input parameters

Parameter	Type	Default	Required	Description
`urls`	Array	—	Yes	List of article/page URLs to extract content from
`outputFormat`	String	`"markdown"`	No	Output format: `"markdown"`, `"text"`, or `"html"`
`maxItems`	Integer	10	No	Maximum number of articles to extract (1–10,000)
`extractImages`	Boolean	`true`	No	Include image URLs found in the article
`extractLinks`	Boolean	`false`	No	Include links found in the article
`timeout`	Integer	30	No	Maximum seconds to wait for each page to load (5–120)
`maxConcurrency`	Integer	5	No	Number of pages to process simultaneously (1–50). Keep the default for reliability; raise it for fast, stable sites.
`proxyConfiguration`	Object	None	No	Proxy settings for accessing geo-restricted content

Output format

Each item in the dataset contains:

Field	Type	Description
`url`	String	Final page URL (after redirects)
`canonicalUrl`	String	Canonical URL if specified by the page
`title`	String	Article title
`author`	String	Article author (from meta tags, JSON-LD, or byline)
`publishedDate`	String	Publication date (ISO 8601)
`description`	String	Meta description or article summary
`content`	String	Extracted article in requested format (Markdown/text/HTML)
`wordCount`	Integer	Number of words in the article
`language`	String	Detected content language code
`siteName`	String	Website name (from Open Graph)
`images`	Array	Image URLs from the article (if `extractImages: true`)
`links`	Array	Links from the article (if `extractLinks: true`)
`ogImage`	String	Open Graph image URL
`statusCode`	Integer	HTTP response status code
`error`	String	Error message if extraction failed (null on success)
`extractedAt`	String	Extraction timestamp (ISO 8601)

Example output

{
  "url": "https://blog.apify.com/what-is-web-scraping/",
  "canonicalUrl": "https://blog.apify.com/what-is-web-scraping/",
  "title": "What is web scraping? A beginner's guide",
  "author": "Apify Team",
  "publishedDate": "2024-03-15T10:00:00Z",
  "description": "Learn what web scraping is, how it works, and why it matters.",
  "content": "# What is web scraping?\n\nWeb scraping is the process of automatically extracting data from websites...\n\n## How does web scraping work?\n\n1. **Send HTTP request** to the target URL\n2. **Parse the HTML** response\n3. **Extract the data** you need\n4. **Store the results** in a structured format",
  "wordCount": 2450,
  "language": "en",
  "siteName": "Apify Blog",
  "images": ["https://blog.apify.com/content/images/web-scraping-hero.jpg"],
  "links": [],
  "ogImage": "https://blog.apify.com/content/images/og-web-scraping.jpg",
  "statusCode": 200,
  "error": null,
  "extractedAt": "2026-03-29T12:00:00+00:00"
}

Integrations

Apify MCP Server (Claude, AI agents)

Use as a tool in Claude Desktop, Claude Code, or any MCP-compatible AI agent framework. The actor is PPE-priced, making it native to AI agent workflows where each task triggers a separate extraction.

Python integration

from apify_client import ApifyClient

client = ApifyClient("your-apify-api-token")

# Extract articles
run = client.actor("tugelbay/article-extractor").call(
    run_input={
        "urls": [
            {"url": "https://blog.apify.com/what-is-web-scraping/"},
            {"url": "https://en.wikipedia.org/wiki/Web_scraping"},
        ],
        "outputFormat": "markdown",
    }
)

# Read results
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"Title: {item['title']}")
    print(f"Author: {item.get('author', 'Unknown')}")
    print(f"Words: {item['wordCount']}")
    print(f"Content preview: {item['content'][:200]}...")
    print()

JavaScript/TypeScript integration

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "your-apify-api-token" });

const run = await client.actor("tugelbay/article-extractor").call({
  urls: [
    { url: "https://blog.apify.com/what-is-web-scraping/" },
    { url: "https://en.wikipedia.org/wiki/Web_scraping" },
  ],
  outputFormat: "markdown",
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
  console.log(`${item.title} (${item.wordCount} words)`);
  console.log(item.content?.substring(0, 200));
}

LangChain (RAG pipeline)

from langchain_apify import ApifyWrapper
from langchain_core.documents import Document

apify = ApifyWrapper(apify_api_token="your-apify-api-token")

docs = apify.call_actor(
    actor_id="tugelbay/article-extractor",
    run_input={
        "urls": [{"url": "https://example.com/article"}],
        "outputFormat": "markdown",
    },
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("content") or "",
        metadata={
            "url": item.get("url"),
            "title": item.get("title"),
            "author": item.get("author"),
            "publishedDate": item.get("publishedDate"),
        },
    ),
)

n8n

Install the community node @apify/n8n-nodes-apify, then add an Apify node with operation Run Actor, set actor to tugelbay/article-extractor, and pass input:

{ "urls": [{ "url": "{{ $json.url }}" }], "outputFormat": "markdown", "maxItems": 10 }

Pipe output to Google Sheets, Notion, or an LLM node for summarization.

Make (Integromat)

Add the Apify app → module Run an Actor, connect your Apify token, pick actor tugelbay/article-extractor, map an upstream URL into urls, and route structured output to any of Make's 1,500+ apps (Airtable, Google Docs, etc.).

Zapier

Create a Zap with Apify as an action app → Run Actor, connect Apify, choose tugelbay/article-extractor, map the trigger's URL field into urls, and send extracted content/title to Zapier's 6,000+ apps.

Webhooks and integrations

Additional integration options:

Google Sheets — export extracted articles directly to a spreadsheet
Slack — get notifications when extraction completes
Email — receive dataset as email attachment
API — call programmatically via Apify REST API

Use cases

LLM training data — extract clean text from web pages for fine-tuning datasets
RAG pipelines — feed article content into vector databases for retrieval-augmented generation
Content analysis — analyze articles at scale for sentiment, topics, and trends
News monitoring — extract and archive news articles automatically on a schedule
Research — collect and structure academic or industry content for literature reviews
SEO analysis — extract competitor content for gap analysis and content strategy
Knowledge base — build searchable archives from documentation sites and blogs
Content migration — extract content from legacy sites during CMS migrations
AI agents — give your AI agent structured article content from public article-style URLs
Newsletter curation — automatically extract and summarize articles for newsletters
Compliance monitoring — track content changes on regulatory or competitor pages

Cost estimation (PPE pricing)

Event	Description
`article-extracted`	Each article successfully extracted

Apify currently displays pricing from $3.50 / 1,000 articles. The exact price depends on the user's Apify pricing tier; the examples below show the displayed "from" price and exclude the small actor-start event.

Example costs at the displayed "from" price:

Scenario	Articles	From cost
10 blog posts	10	~$0.04
100 news articles	100	~$0.35
1,000 documentation pages	1,000	~$3.50
Daily news monitoring (50 articles/day)	1,500/month	~$5.25/month
Large-scale extraction	10,000	~$35

Tip: Set extractImages: false and extractLinks: false to speed up extraction and reduce output size when you only need the text content.

FAQ

What types of pages work best?

Article Extractor works best on article-style pages: news articles, blog posts, documentation pages, Wikipedia articles, and similar content. The Readability algorithm is designed to identify the "main content" of a page and strip everything else.

Does it work on JavaScript-rendered pages (SPAs)?

No. Article Extractor uses fast HTTP requests (no browser). Pages that require JavaScript to render content (React SPAs, Angular apps) will return empty or minimal content. For those pages, use RAG Web Browser which has automatic browser fallback.

How fast is it?

Very fast. Since it uses HTTP requests (no browser), it can process 100 articles in 2–3 minutes with default concurrency. Increase maxConcurrency to 50 for even faster processing.

No. Article Extractor only works with publicly accessible pages. It cannot bypass login walls, paywalls, or CAPTCHA-protected content.

What's the maximum page size?

5MB per page. Larger pages are truncated to prevent memory issues. This covers 99%+ of normal web articles.

Can I run this on a schedule?

Yes. Set up a Schedule in Apify Console to run the actor at any interval — hourly, daily, or custom cron expressions. Perfect for news monitoring and content tracking.

Why Markdown output?

Markdown is the most LLM-friendly format:

Preserves semantic structure (headers, emphasis, lists, code blocks)
Compact — fits more content in LLM context windows
Renders cleanly in chat interfaces and documentation tools
Easy to parse for downstream processing

How does it handle errors?

If a page fails to load (timeout, 404, blocked), the actor returns the URL with an error field explaining what went wrong and a null content field. Other pages in the batch continue processing normally.

Troubleshooting

Empty or very short content extraction

Cause: The page is a SPA (Single Page Application) that renders content with JavaScript
Fix: Use RAG Web Browser instead, which has browser fallback
Alternative: Very short pages (<100 words) may not have enough content for Readability to detect the main article

Missing author or publish date

Cause: The page doesn't include author/date in meta tags, JSON-LD, or standard HTML patterns
Fix: This is expected — not all pages provide this metadata. The fields will be null.

Timeout errors on some pages

Cause: The target page is slow to respond
Fix: Increase the timeout parameter (default: 30 seconds, max: 120 seconds)
Alternative: Reduce maxConcurrency if you're scraping many pages from the same domain

Cause: Some sites block datacenter IPs
Fix: Enable Apify proxy with residential proxy groups in proxyConfiguration

Limitations

Only works with publicly accessible pages (no login-protected or paywalled content)
JavaScript-rendered content (SPAs) will not extract fully — use a browser-based solution for those
Very short pages (under 100 words) may not have enough content for Readability to detect
Maximum page size: 5MB (larger pages are truncated)
Maximum 10,000 articles per run (use multiple runs for larger datasets)
Metadata extraction depends on the page having proper meta tags, JSON-LD, or Open Graph markup

Changelog

1.0.39 (2026-07-02) — Dataset-push failures are now logged with their real cause (disk pressure, storage errors) before the fallback error item is pushed, instead of being silently swallowed.

v1.0 (2026-03-29)

Initial release
Markdown, plain text, and clean HTML output formats
Mozilla Readability-based article extraction
Metadata extraction (author, date, description, JSON-LD, Open Graph)
Image and link extraction
Concurrent processing with configurable concurrency (1–50)
Proxy support
Pay-per-event pricing

Support & Feedback

If this actor saves you time, please leave a review on its Store page — it takes under a minute and is the main trust signal that helps other users find a reliable tool.

Hit a bug or missing a feature? Open a ticket on the Actor's Issues tab — issues are typically answered within 24-48 hours and fixes ship fast.

Website Content Crawler — Crawl websites and extract Markdown for RAG/LLMs
RAG Web Browser — Search Google + extract as Markdown for AI agents
YouTube Transcript Extractor — Bulk extract video transcripts as SRT/VTT/Markdown
Website Tech Stack Detector — Identify 80+ technologies on any website
Google Maps Lead Extractor — Extract business leads with emails from Google Maps

See all actors: apify.com/tugelbay

RAG Web Browser API - Search & Extract

tugelbay/rag-web-browser

Google search + public URLs to Markdown/text/HTML for RAG and AI agents. Guide: https://konabayev.com/tools/rag-web-browser/?utm_source=apify_info&utm_medium=referral&utm_campaign=rag-web-browser

Tugelbay Konabayev

Website Content Crawler API - Markdown for RAG

tugelbay/website-content-crawler

Crawl public websites and extract clean Markdown, text, or HTML for RAG pipelines, AI agents, documentation indexing, and content monitoring. Guide: https://konabayev.com/tools/website-content-crawler/?utm_source=apify_info&utm_medium=referral&utm_campaign=website-content-crawler

Tugelbay Konabayev

Public Article Intelligence & Citation Extractor

jacksu/public-article-intelligence-agent

Extract clean article text, metadata, summaries, citations, diagnostics, and change signals from public article URLs.

jack su

YouTube Transcript API & Bulk Subtitle Downloader

tugelbay/youtube-transcript

Bulk YouTube transcript API for SRT/VTT, Markdown, JSON, and text exports with metadata for AI/RAG, research, subtitles, and content workflows. Guide: https://konabayev.com/tools/youtube-transcript-scraper/?utm_source=apify_info&utm_medium=referral&utm_campaign=youtube-transcript

Tugelbay Konabayev

Wikipedia Article Extractor

rambunctious_fingerprint/wikipedia-extractor

Casey Marsh

Wikipedia Article Extractor (AI-ready)

changeable_acacia/wikipedia-article-extractor-ai-ready

Extracts clean JSON from any Wikipedia article for AI/RAG use.

SABYASACHI TRIPATHY

🧠 Smart Article Extractor

scraper-engine/smart-article-extractor

Scraper Engine

🧠 Smart Article Extractor

scrapio/smart-article-extractor

Scrapio

🧠 Smart Article Extractor

scrapier/smart-article-extractor

Scrapier

🧠 Smart Article Extractor

simpleapi/smart-article-extractor

SimpleAPI

Article Extraction API

Article Extraction API — URL to Markdown for RAG & LLMs

Article Extraction API for RAG and AI Agents

URL to Markdown, Text, or Clean HTML

Web Content Scraper for RAG Pipelines

Bulk Article Extraction (Up to 10K URLs)

What does Article Extractor do?

Why use this instead of a generic web scraper?

vs. Website Content Crawler

Best-fit workflows

Features

Input examples

Extract articles as Markdown (default)

Extract as plain text for NLP analysis

Bulk extraction with proxy (100+ articles)

Extract with all metadata (images + links)

Input parameters

Output format

Example output

Integrations

Apify MCP Server (Claude, AI agents)

Python integration

JavaScript/TypeScript integration

LangChain (RAG pipeline)

n8n

Make (Integromat)

Zapier

Webhooks and integrations

Use cases

Cost estimation (PPE pricing)

FAQ

What types of pages work best?

Does it work on JavaScript-rendered pages (SPAs)?

How fast is it?

Can I extract content behind login walls or paywalls?

What's the maximum page size?

Can I run this on a schedule?

Why Markdown output?

How does it handle errors?

Troubleshooting

Empty or very short content extraction

Missing author or publish date

Timeout errors on some pages

Proxy-related errors

Limitations

Changelog

v1.0 (2026-03-29)

Support & Feedback

Related Actors

You might also like

RAG Web Browser API - Search & Extract

Website Content Crawler API - Markdown for RAG

Public Article Intelligence & Citation Extractor

YouTube Transcript API & Bulk Subtitle Downloader

Wikipedia Article Extractor

Wikipedia Article Extractor (AI-ready)

🧠 Smart Article Extractor

🧠 Smart Article Extractor

🧠 Smart Article Extractor

🧠 Smart Article Extractor