Smart Article Scraper - Text, Data & Insights

Pricing

$15.00/month + usage

Article Scraper & Content Extractor - Extract clean text, metadata, keywords & summaries from any web article or blog post. Perfect for research, competitive analysis & content marketing.

Rating: 1.0 (1)

Developer: Xtech

Maintained by Community

Actor stats

  • Bookmarked: 4
  • Total users: 85
  • Monthly active users: 2
  • Issues response: 79 days
  • Last modified: 11 days ago

Article Extractor

Extract clean, structured data from news articles and blog posts. This Apify Actor returns article text, title, authors, publication date, summary, image URLs, keywords, meta tags, language, and extraction status for each URL.

It uses trafilatura as the primary extractor and falls back to newspaper3k when needed, making it useful for news monitoring, content research, SEO workflows, data enrichment, and text analysis pipelines.
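
The primary-then-fallback flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Actor's actual source: the names `extract_with_fallback`, `trafilatura_extractor`, and `newspaper_extractor` are hypothetical, and treating an empty result or an exception as the trigger for the fallback is an assumption about when the fallback is "needed".

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# An extractor takes (url, html) and returns the article text, or None.
Extractor = Callable[[str, str], Optional[str]]


def trafilatura_extractor(url: str, html: str) -> Optional[str]:
    """Primary extractor; needs the third-party trafilatura package."""
    import trafilatura  # imported lazily so the sketch loads without it

    return trafilatura.extract(html, url=url)


def newspaper_extractor(url: str, html: str) -> Optional[str]:
    """Fallback extractor; needs the third-party newspaper3k package."""
    from newspaper import Article

    article = Article(url)
    article.download(input_html=html)  # reuse the HTML that was already fetched
    article.parse()
    return article.text or None


def extract_with_fallback(
    url: str, html: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Dict[str, object]:
    """Try each extractor in order and report which one produced text."""
    errors: List[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(url, html)
        except Exception as exc:  # assumption: any error moves on to the fallback
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return {
                "articleText": text.strip(),
                "scrapeMethod": name,
                "scrapeSuccess": True,
                "scrapeErrorMessage": None,
            }
        errors.append(f"{name}: no text extracted")
    return {
        "articleText": None,
        "scrapeMethod": None,
        "scrapeSuccess": False,
        "scrapeErrorMessage": "; ".join(errors),
    }
```

With both libraries installed, the chain would be `[("trafilatura", trafilatura_extractor), ("newspaper3k", newspaper_extractor)]`. Passing the extractors in as a list keeps the order explicit and makes the flow easy to exercise with stub functions.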

Features

  • Extract full article text from article and blog post pages
  • Return titles, authors, publication dates, source URLs, and language
  • Collect summaries, keywords, meta descriptions, and meta keywords when available
  • Capture the main image URL, discovered image URLs, and embedded video URLs when supported by the extractor
  • Process multiple article URLs in one Actor run
  • Configure timeout, language hint, User-Agent, and proxy settings
  • Push one structured dataset item per input URL, including failed attempts with error details

Input

Provide one or more article URLs in startUrls.

{
  "startUrls": [
    { "url": "https://www.example.com/news/article1" },
    { "url": "https://www.example.com/blog/post2" }
  ],
  "language": "en",
  "requestTimeout": 30,
  "maxConcurrency": 3,
  "maxRetries": 2,
  "fetchImages": true,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}

Input fields

| Field | Type | Description |
| --- | --- | --- |
| startUrls | array | Direct article or blog post URLs to extract. |
| language | string | Optional two-letter language hint such as en, es, de, or fr. Defaults to en; leave empty to let extraction libraries infer the language where possible. |
| requestTimeout | integer | Maximum time in seconds to wait for each article page. Defaults to 30. |
| maxConcurrency | integer | Number of article URLs to process in parallel. Defaults to 3. |
| maxRetries | integer | Retries for transient request failures such as timeouts, rate limits, and temporary server errors. Defaults to 2. |
| fetchImages | boolean | Include article image URLs when the newspaper3k fallback is used. The Actor stores image URLs, not image files. |
| browserUserAgent | string | Optional custom User-Agent header. |
| proxyConfiguration | object | Optional Apify Proxy or custom proxy configuration for sites that block direct requests. |
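
As a rough illustration of how the fields above fit together, the sketch below builds an input object from a list of URLs and fills in the documented defaults. The helper name `build_run_input` is hypothetical, and the rule that the integer fields must be non-negative is an assumption; the Actor's real validation is not documented here.

```python
import json
from typing import Any, Dict, Iterable

# Defaults as stated in the input field descriptions.
DEFAULTS: Dict[str, Any] = {
    "language": "en",
    "requestTimeout": 30,
    "maxConcurrency": 3,
    "maxRetries": 2,
}


def build_run_input(urls: Iterable[str], **overrides: Any) -> Dict[str, Any]:
    """Build an input dict in the shape shown in the Input section."""
    start_urls = [{"url": u.strip()} for u in urls if u and u.strip()]
    if not start_urls:
        raise ValueError("startUrls must contain at least one URL")

    run_input: Dict[str, Any] = {"startUrls": start_urls, **DEFAULTS, **overrides}

    # Assumption: the integer settings should be non-negative integers.
    for key in ("requestTimeout", "maxConcurrency", "maxRetries"):
        value = run_input[key]
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} must be a non-negative integer")
    return run_input


if __name__ == "__main__":
    example = build_run_input(
        ["https://www.example.com/news/article1"],
        requestTimeout=60,
        proxyConfiguration={"useApifyProxy": False},
    )
    print(json.dumps(example, indent=2))
```

The printed JSON can be pasted into the Actor's input editor or passed along by whatever client is used to start the run.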

Output

The Actor returns a JSON dataset with the following fields for each article:

| Field | Description |
| --- | --- |
| articleURL | Input or resolved article URL. |
| sourceURL | Source website URL when detected. |
| articleLanguage | Detected article language, for example en or es. |
| articleTitle | Article title. |
| articleAuthors | Article authors as an array. |
| articlePublishDate | Publication date when detected. |
| articleText | Clean extracted article text. |
| articleTopImage | Main article image URL when detected. |
| articleAllImages | Image URLs as an array when available. |
| articleVideos | Embedded video URLs as an array when available. |
| articleKeywords | Keywords or categories as an array when available. |
| articleSummary | Extracted or generated article summary. |
| articleMetaDescription | Page meta description when available. |
| articleMetaKeywords | Page meta keywords as an array when available. |
| wordCount | Number of words in the extracted article text. |
| characterCount | Number of characters in the extracted article text. |
| scrapeMethod | Extractor used for the successful result: trafilatura or newspaper3k. |
| scrapeSuccess | true when extraction succeeded, otherwise false. |
| scrapeErrorMessage | Error details for failed extractions. |
| scrapedAt | UTC timestamp when the URL was processed. |

Example Output

[
  {
    "articleURL": "https://www.example.com/news/article1",
    "sourceURL": "https://www.example.com",
    "articleLanguage": "en",
    "articleTitle": "Example News Article",
    "articleAuthors": ["John Doe", "Jane Smith"],
    "articlePublishDate": "2024-07-27T10:00:00Z",
    "articleText": "This is the full text of the example news article...",
    "articleTopImage": "https://www.example.com/images/article1.jpg",
    "articleAllImages": ["https://www.example.com/images/article1.jpg", "https://www.example.com/images/article2.png"],
    "articleVideos": [],
    "articleKeywords": ["news", "example", "article"],
    "articleSummary": "A brief summary of the example news article.",
    "articleMetaDescription": "An example article for demonstration.",
    "articleMetaKeywords": ["example", "article", "news", "demo"],
    "wordCount": 825,
    "characterCount": 4920,
    "scrapeMethod": "newspaper3k",
    "scrapeSuccess": true,
    "scrapedAt": "2024-07-27T12:34:56Z"
  }
]

For failed URLs, the Actor still pushes a dataset item with articleURL, scrapeSuccess: false, scrapeErrorMessage, and scrapedAt so you can audit which pages need retrying or proxy changes. The run finishes with status SUCCEEDED as long as input was valid, even when every URL fails extraction (check scrapeSuccess per row).
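
Because a SUCCEEDED run can still contain failed rows, downstream code has to check scrapeSuccess itself. The sketch below shows one way to do that once the dataset items are loaded as a list of dicts. The helper names `split_results` and `retry_candidates` are hypothetical; only the field names come from the output description above.

```python
from typing import Any, Dict, Iterable, List, Tuple

Item = Dict[str, Any]


def split_results(items: Iterable[Item]) -> Tuple[List[Item], List[Item]]:
    """Separate dataset rows into successful and failed extractions."""
    succeeded: List[Item] = []
    failed: List[Item] = []
    for item in items:
        # Rows without an explicit true value are treated as failures.
        if item.get("scrapeSuccess") is True:
            succeeded.append(item)
        else:
            failed.append(item)
    return succeeded, failed


def retry_candidates(failed: Iterable[Item]) -> List[Dict[str, str]]:
    """Turn failed rows back into a startUrls list for a follow-up run."""
    return [{"url": item["articleURL"]} for item in failed if item.get("articleURL")]


if __name__ == "__main__":
    rows = [
        {"articleURL": "https://www.example.com/news/article1", "scrapeSuccess": True},
        {
            "articleURL": "https://www.example.com/blog/post2",
            "scrapeSuccess": False,
            "scrapeErrorMessage": "timeout",
        },
    ]
    ok, bad = split_results(rows)
    print(len(ok), "succeeded;", len(bad), "failed")
    print(retry_candidates(bad))
```

The list returned by `retry_candidates` has the same shape as startUrls, so it can be fed into a second run with a longer requestTimeout or different proxy settings.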

On the Apify cloud, a Residential proxy is enabled automatically unless you provide custom proxyUrls.

Use cases

  • Monitor news articles, press mentions, and competitor content
  • Build article datasets for research, analysis, or machine learning
  • Enrich URLs with titles, authors, dates, summaries, and clean text
  • Collect SEO metadata from article and blog pages
  • Feed extracted article text into downstream AI, analytics, or database workflows

Pair this Actor with others from the same author for full content pipelines.

Tips

  • Use direct article URLs instead of homepages, category pages, or search result pages.
  • If all URLs fail because of access restrictions, enable Apify Proxy and retry.
  • Increase requestTimeout for slow publishers or long-form pages.
  • Some JavaScript-heavy or paywalled pages may return partial text or fail if the article content is not present in the initial HTML.
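
The first tip can be partly automated by screening URLs before a run. The sketch below is a heuristic only: the function name `looks_like_article_url` and the lists of path segments and query parameters are assumptions chosen for illustration, and real sites will need their own rules.

```python
from urllib.parse import parse_qs, urlparse

# Assumed markers of listing or search pages; adjust per site.
LISTING_SEGMENTS = {"category", "categories", "tag", "tags", "search", "archive", "author"}
SEARCH_PARAMS = {"s", "q", "query", "search"}


def looks_like_article_url(url: str) -> bool:
    """Return False for URLs that look like homepages, listings, or searches."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False

    segments = [part.lower() for part in parsed.path.split("/") if part]
    if not segments:
        return False  # bare domain, most likely a homepage
    if any(part in LISTING_SEGMENTS for part in segments):
        return False  # category, tag, or similar listing page
    if SEARCH_PARAMS & set(parse_qs(parsed.query)):
        return False  # search results page
    return True


if __name__ == "__main__":
    candidates = [
        "https://www.example.com/news/article1",
        "https://www.example.com/",
        "https://www.example.com/category/tech",
        "https://www.example.com/blog?q=scraping",
    ]
    for candidate in candidates:
        print(candidate, looks_like_article_url(candidate))
```

A filter like this only reduces obvious misses; pages it lets through can still fail or return partial text for the reasons listed in the tips.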