Pricing

$15.00/month + usage

Smart Article Scraper - Text, Data & Insights

Article Scraper & Content Extractor - Extract clean text, metadata, keywords & summaries from any web article or blog post. Perfect for research, competitive analysis & content marketing.

Rating

1.0

(1)

Developer

Xtech

Actor stats

Issues response

79 days

Last modified

11 days ago

Article Extractor

Extract clean, structured data from news articles and blog posts. This Apify Actor returns article text, title, authors, publication date, summary, image URLs, keywords, meta tags, language, and extraction status for each URL.

It uses trafilatura as the primary extractor and falls back to newspaper3k when needed, making it useful for news monitoring, content research, SEO workflows, data enrichment, and text analysis pipelines.
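
The primary-then-fallback order described above can be sketched roughly as follows. This is an illustration only, not the Actor's actual source: the helper names (`trafilatura_text`, `newspaper_text`, `extract_with_fallback`) and the `min_chars` threshold are assumptions, since the listing does not say how the Actor decides that the fallback is needed.

```python
from typing import Callable, Optional, Sequence, Tuple

# An extractor takes (html, url) and returns article text, or None on a miss.
Extractor = Callable[[str, str], Optional[str]]


def trafilatura_text(html: str, url: str) -> Optional[str]:
    """Primary extractor: trafilatura.extract returns the main text or None."""
    import trafilatura  # lazy import so the sketch loads without the package

    return trafilatura.extract(html, url=url)


def newspaper_text(html: str, url: str) -> Optional[str]:
    """Fallback extractor: let newspaper3k parse HTML that was already fetched."""
    from newspaper import Article  # lazy import, same reason as above

    article = Article(url)
    article.download(input_html=html)  # reuse the fetched HTML, no second request
    article.parse()
    return article.text or None


def extract_with_fallback(
    html: str,
    url: str,
    extractors: Sequence[Tuple[str, Extractor]],
    min_chars: int = 200,  # assumed threshold for "usable" text
) -> Tuple[Optional[str], Optional[str]]:
    """Try extractors in order; return (method, text) for the first usable result."""
    for name, extractor in extractors:
        try:
            text = extractor(html, url)
        except Exception:
            continue  # a crashing extractor counts as a miss
        if text and len(text.strip()) >= min_chars:
            return name, text.strip()
    return None, None


# Order mirrors the description: trafilatura first, newspaper3k second.
DEFAULT_EXTRACTORS = (
    ("trafilatura", trafilatura_text),
    ("newspaper3k", newspaper_text),
)
```

The returned method name corresponds to what the output reports in `scrapeMethod`; a `(None, None)` result would map to a failed extraction.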

Features

Extract full article text from article and blog post pages
Return titles, authors, publication dates, source URLs, and language
Collect summaries, keywords, meta descriptions, and meta keywords when available
Capture the main image URL, discovered image URLs, and embedded video URLs when supported by the extractor
Process multiple article URLs in one Actor run
Configure timeout, language hint, User-Agent, and proxy settings
Push one structured dataset item per input URL, including failed attempts with error details

Input

Provide one or more article URLs in `startUrls`.

{
  "startUrls": [
    { "url": "https://www.example.com/news/article1" },
    { "url": "https://www.example.com/blog/post2" }
  ],
  "language": "en",
  "requestTimeout": 30,
  "maxConcurrency": 3,
  "maxRetries": 2,
  "fetchImages": true,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}

Input fields

Field	Type	Description
`startUrls`	array	Direct article or blog post URLs to extract.
`language`	string	Optional two-letter language hint such as `en`, `es`, `de`, or `fr`. Defaults to `en`; leave empty to let extraction libraries infer the language where possible.
`requestTimeout`	integer	Maximum time in seconds to wait for each article page. Defaults to `30`.
`maxConcurrency`	integer	Number of article URLs to process in parallel. Defaults to `3`.
`maxRetries`	integer	Retries for transient request failures such as timeouts, rate limits, and temporary server errors. Defaults to `2`.
`fetchImages`	boolean	Include article image URLs when the newspaper3k fallback is used. The Actor stores image URLs, not image files.
`browserUserAgent`	string	Optional custom User-Agent header.
`proxyConfiguration`	object	Optional Apify Proxy or custom proxy configuration for sites that block direct requests.
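
As a rough sketch of how these fields fit together, the helper below builds an input object and starts a run with the `apify-client` Python package. The function names `build_run_input` and `run_actor` are made up for illustration, and the Actor ID must be supplied by the caller because it does not appear in this listing. The defaults mirror the documented ones, except `fetch_images`, which copies the example input because no default is stated.

```python
from typing import Any, Dict, Iterable, List


def build_run_input(
    urls: Iterable[str],
    language: str = "en",
    request_timeout: int = 30,
    max_concurrency: int = 3,
    max_retries: int = 2,
    fetch_images: bool = True,  # assumption: copied from the example input
    use_apify_proxy: bool = False,
) -> Dict[str, Any]:
    """Build the Actor input, dropping blank and duplicate URLs."""
    seen = set()
    start_urls = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            start_urls.append({"url": url})
    if not start_urls:
        raise ValueError("startUrls needs at least one URL")
    return {
        "startUrls": start_urls,
        "language": language,
        "requestTimeout": request_timeout,
        "maxConcurrency": max_concurrency,
        "maxRetries": max_retries,
        "fetchImages": fetch_images,
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }


def run_actor(token: str, actor_id: str, run_input: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Start the Actor, wait for the run to finish, and return its dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call would look like `run_actor("<APIFY_TOKEN>", "<username>/<actor-name>", build_run_input(["https://www.example.com/news/article1"]))`, with both placeholders replaced by real values.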

Output

The Actor returns a JSON dataset with the following fields for each article:

Field	Description
`articleURL`	Input or resolved article URL.
`sourceURL`	Source website URL when detected.
`articleLanguage`	Detected article language, for example `en` or `es`.
`articleTitle`	Article title.
`articleAuthors`	Article authors as an array.
`articlePublishDate`	Publication date when detected.
`articleText`	Clean extracted article text.
`articleTopImage`	Main article image URL when detected.
`articleAllImages`	Image URLs as an array when available.
`articleVideos`	Embedded video URLs as an array when available.
`articleKeywords`	Keywords or categories as an array when available.
`articleSummary`	Extracted or generated article summary.
`articleMetaDescription`	Page meta description when available.
`articleMetaKeywords`	Page meta keywords as an array when available.
`wordCount`	Number of words in the extracted article text.
`characterCount`	Number of characters in the extracted article text.
`scrapeMethod`	Extractor used for the successful result: `trafilatura` or `newspaper3k`.
`scrapeSuccess`	`true` when extraction succeeded, otherwise `false`.
`scrapeErrorMessage`	Error details for failed extractions.
`scrapedAt`	UTC timestamp when the URL was processed.

Example Output

[
  {
    "articleURL": "https://www.example.com/news/article1",
    "sourceURL": "https://www.example.com",
    "articleLanguage": "en",
    "articleTitle": "Example News Article",
    "articleAuthors": ["John Doe", "Jane Smith"],
    "articlePublishDate": "2024-07-27T10:00:00Z",
    "articleText": "This is the full text of the example news article...",
    "articleTopImage": "https://www.example.com/images/article1.jpg",
    "articleAllImages": ["https://www.example.com/images/article1.jpg", "https://www.example.com/images/article2.png"],
    "articleVideos": [],
    "articleKeywords": ["news", "example", "article"],
    "articleSummary": "A brief summary of the example news article.",
    "articleMetaDescription": "An example article for demonstration.",
    "articleMetaKeywords": ["example", "article", "news", "demo"],
    "wordCount": 825,
    "characterCount": 4920,
    "scrapeMethod": "newspaper3k",
    "scrapeSuccess": true,
    "scrapedAt": "2024-07-27T12:34:56Z"
  }
]

For failed URLs, the Actor still pushes a dataset item with `articleURL`, `scrapeSuccess: false`, `scrapeErrorMessage`, and `scrapedAt`, so you can audit which pages need retrying or proxy changes. The run finishes with status `SUCCEEDED` as long as the input was valid, even when every URL fails extraction, so check `scrapeSuccess` on each row.
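
Because a run can succeed while individual URLs fail, a per-row check is needed. The sketch below shows one way to separate the two groups. It relies only on the documented `scrapeSuccess`, `articleURL`, and `scrapeErrorMessage` fields; the function names are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

Item = Dict[str, Any]


def split_results(items: Iterable[Item]) -> Tuple[List[Item], List[Item]]:
    """Separate dataset items into (succeeded, failed) by scrapeSuccess."""
    succeeded: List[Item] = []
    failed: List[Item] = []
    for item in items:
        # Anything other than an explicit True is treated as a failure.
        if item.get("scrapeSuccess") is True:
            succeeded.append(item)
        else:
            failed.append(item)
    return succeeded, failed


def failure_report(failed: Iterable[Item]) -> Dict[str, str]:
    """Map each failed articleURL to its error message for auditing."""
    return {
        item["articleURL"]: item.get("scrapeErrorMessage") or "unknown error"
        for item in failed
    }
```

The keys of `failure_report` give the list of URLs to feed into a retry run.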

On the Apify cloud, the Residential proxy is enabled automatically unless you provide custom `proxyUrls`.

Use cases

Monitor news articles, press mentions, and competitor content
Build article datasets for research, analysis, or machine learning
Enrich URLs with titles, authors, dates, summaries, and clean text
Collect SEO metadata from article and blog pages
Feed extracted article text into downstream AI, analytics, or database workflows

Pair this Actor with others from the same author for full content pipelines:

YouTube Transcript Scraper Pro — video transcripts for the same research workflow
News Source Crawler — discover article URLs from entire news sites, then extract with this Actor
RSS Feed Scraper — monitor feeds and pass new article URLs here

Tips

Use direct article URLs instead of homepages, category pages, or search result pages.
If all URLs fail because of access restrictions, enable Apify Proxy and retry.
Increase `requestTimeout` for slow publishers or long-form pages.
Some JavaScript-heavy or paywalled pages may return partial text or fail if the article content is not present in the initial HTML.
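
The proxy and timeout tips can be combined into a retry input for URLs that failed on a first run. This is a sketch under assumptions: `build_retry_input` is a hypothetical helper, and the doubling factor and the 120-second cap are arbitrary choices, not limits documented for this Actor.

```python
import copy
from typing import Any, Dict, Iterable


def build_retry_input(
    previous_input: Dict[str, Any],
    failed_urls: Iterable[str],
    timeout_factor: int = 2,  # assumed: double the timeout on retry
    max_timeout: int = 120,   # assumed cap, not a documented limit
) -> Dict[str, Any]:
    """Copy the previous input, keep only failed URLs, and loosen the limits."""
    retry = copy.deepcopy(previous_input)  # leave the original input untouched
    retry["startUrls"] = [{"url": url} for url in failed_urls]
    # 30 seconds is the documented default when requestTimeout is not set.
    base_timeout = previous_input.get("requestTimeout", 30)
    retry["requestTimeout"] = min(base_timeout * timeout_factor, max_timeout)
    # Enable Apify Proxy for sites that blocked direct requests.
    retry.setdefault("proxyConfiguration", {})["useApifyProxy"] = True
    return retry
```

Pages whose content is rendered by JavaScript or sits behind a paywall may still fail after such a retry, since the text is not present in the initial HTML.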

Smart Article & Blog Extractor

lightkong/universal-blog-scraper

Extract clean text, author, title, and reading time from any news, blog, or article webpage. Perfect for AI/LLM training and RAG systems.

Lightkong

Article to Text Extractor (for TTS/LLMs)

andok/tts-reader

Extract the core readable text of any article or blog post, stripping out boilerplate. Perfect for Text-to-Speech or AI summaries.

Andok

Google Trends Scraper Pro

xtech/google-keyword-scraper-pro

Google Trends Scraper & Data Extractor - Extract trending data for keywords, topics & regions from Google Trends. Perfect for SEO, market research & content strategy.

Xtech

Article Extraction API

tugelbay/article-extractor

Extract clean article text and metadata from URLs as Markdown, text, or HTML for RAG, AI agents, monitoring, and research. Guide: https://konabayev.com/tools/article-extractor/?utm_source=apify_info&utm_medium=referral&utm_campaign=article-extractor

Tugelbay Konabayev

🤖 Any Website URL to Article Summarizer

easyapi/any-website-url-to-article-summarizer

Transform any article, blog post, or web content into concise, AI-powered summaries. Get key insights and main points instantly with smart text analysis and markdown formatting. Perfect for researchers, content creators, and busy professionals who need quick, accurate content digests.

EasyApi

Web Article Extractor — Clean Reader Mode Text & Metadata

maged120/reader-mode

Extract clean, readable article content from any web page. Strips ads, navigation, and clutter — returns title, author, full body text, and publish date in structured JSON.

Maged

Google News Article Scraper

webscrap18/google-news-article-scraper

Scrape Google News and extract full content with title, article text, images, and structured data.

WebScrap

LinkedIn Reactions Scraper - Extract Post & Article Engagement

benjarapi/linkedin-post-reactions

Scrape all reactions from any LinkedIn post or article in bulk

Benjar Scraping API

Smart Article Extractor

datapilot/smart-article-extractor

News Article Extractor Actor fetches article URLs and extracts structured content using Requests, , and Newspaper3k. It collects title, author, publish date, text, summary, keywords, images, and word count. Supports proxy use and outputs clean JSON results.

Data Pilot

Smart Article Extractor

parseforge/article-extractor

Extract clean article content from any news, blog, or publisher site! Pull full body text, author, publish date, word count, language, reading time, images, and metadata at scale. Ideal for content research, media monitoring, SEO audits, and AI training. Start extracting articles in minutes!