Smart Article Scraper - Text, Data & Insights
Pricing
$15.00/month + usage
Article Scraper & Content Extractor - Extract clean text, metadata, keywords & summaries from any web article or blog post. Perfect for research, competitive analysis & content marketing.
Article Extractor
Extract clean, structured data from news articles and blog posts. This Apify Actor returns article text, title, authors, publication date, summary, image URLs, keywords, meta tags, language, and extraction status for each URL.
It uses trafilatura as the primary extractor and falls back to newspaper3k when needed, making it useful for news monitoring, content research, SEO workflows, data enrichment, and text analysis pipelines.
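The primary-then-fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual code: the extractors are passed in as plain callables, and `extract_with_fallback`, `primary`, and `fallback` are hypothetical names standing in for wrappers around trafilatura and newspaper3k.

```python
from typing import Callable, Optional, Sequence, Tuple

# An extractor takes raw HTML and returns the article text, or None if it
# found nothing. Assumption: in a real pipeline these would wrap trafilatura
# and newspaper3k; here they are plain callables so the sketch has no
# third-party dependencies.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    html: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (name, extractor) in order; return (text, name) of the first hit."""
    for name, extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            # Treat an extractor crash like an empty result and move on.
            continue
        if text and text.strip():
            return text.strip(), name
    # No extractor produced usable text.
    return None, None


# Hypothetical stand-ins for the two libraries.
def primary(html: str) -> Optional[str]:
    return None  # pretend the primary extractor found nothing


def fallback(html: str) -> Optional[str]:
    return "Recovered article body"


text, method = extract_with_fallback(
    "<html>...</html>", [("trafilatura", primary), ("newspaper3k", fallback)]
)
print(method, "->", text)
```

Returning the extractor name alongside the text mirrors the scrapeMethod output field, so a caller can tell which path produced each result.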
Features
- Extract full article text from article and blog post pages
- Return titles, authors, publication dates, source URLs, and language
- Collect summaries, keywords, meta descriptions, and meta keywords when available
- Capture the main image URL, discovered image URLs, and embedded video URLs when supported by the extractor
- Process multiple article URLs in one Actor run
- Configure timeout, language hint, User-Agent, and proxy settings
- Push one structured dataset item per input URL, including failed attempts with error details
Input
Provide one or more article URLs in startUrls.
    {
      "startUrls": [
        { "url": "https://www.example.com/news/article1" },
        { "url": "https://www.example.com/blog/post2" }
      ],
      "language": "en",
      "requestTimeout": 30,
      "maxConcurrency": 3,
      "maxRetries": 2,
      "fetchImages": true,
      "proxyConfiguration": {
        "useApifyProxy": false
      }
    }
Input fields
| Field | Type | Description |
|---|---|---|
| startUrls | array | Direct article or blog post URLs to extract. |
| language | string | Optional two-letter language hint such as en, es, de, or fr. Defaults to en; leave empty to let extraction libraries infer the language where possible. |
| requestTimeout | integer | Maximum time in seconds to wait for each article page. Defaults to 30. |
| maxConcurrency | integer | Number of article URLs to process in parallel. Defaults to 3. |
| maxRetries | integer | Retries for transient request failures such as timeouts, rate limits, and temporary server errors. Defaults to 2. |
| fetchImages | boolean | Include article image URLs when the newspaper3k fallback is used. The Actor stores image URLs, not image files. |
| browserUserAgent | string | Optional custom User-Agent header. |
| proxyConfiguration | object | Optional Apify Proxy or custom proxy configuration for sites that block direct requests. |
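When the input is assembled programmatically, a small helper can keep it in the documented shape. The sketch below is an assumption-laden illustration: `build_actor_input` and `DOCUMENTED_DEFAULTS` are hypothetical names, the defaults are copied from the table above, and the URL check is a simple client-side sanity test rather than the Actor's own validation.

```python
import json
from typing import Any, Dict, Iterable

# Defaults as documented in the input table. Fields without a documented
# default (fetchImages, browserUserAgent, proxyConfiguration) are left out
# unless the caller supplies them.
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "language": "en",
    "requestTimeout": 30,
    "maxConcurrency": 3,
    "maxRetries": 2,
}


def build_actor_input(urls: Iterable[str], **overrides: Any) -> Dict[str, Any]:
    """Return an input object with startUrls in the {"url": ...} shape."""
    start_urls = []
    for url in urls:
        # Client-side sanity check only (an assumption, not the Actor's rule).
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"Not an absolute HTTP(S) URL: {url!r}")
        start_urls.append({"url": url})
    if not start_urls:
        raise ValueError("startUrls needs at least one URL")
    # Later keys win, so overrides replace the documented defaults.
    return {"startUrls": start_urls, **DOCUMENTED_DEFAULTS, **overrides}


actor_input = build_actor_input(
    ["https://www.example.com/news/article1"],
    requestTimeout=60,  # slower publisher, so allow more time
)
print(json.dumps(actor_input, indent=2))
```

The resulting dictionary can be serialized with `json.dumps` and pasted into the Actor's JSON input editor or passed to whichever client is used to start the run.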
Output
The Actor returns a JSON dataset with the following fields for each article:
| Field | Description |
|---|---|
| articleURL | Input or resolved article URL. |
| sourceURL | Source website URL when detected. |
| articleLanguage | Detected article language, for example en or es. |
| articleTitle | Article title. |
| articleAuthors | Article authors as an array. |
| articlePublishDate | Publication date when detected. |
| articleText | Clean extracted article text. |
| articleTopImage | Main article image URL when detected. |
| articleAllImages | Image URLs as an array when available. |
| articleVideos | Embedded video URLs as an array when available. |
| articleKeywords | Keywords or categories as an array when available. |
| articleSummary | Extracted or generated article summary. |
| articleMetaDescription | Page meta description when available. |
| articleMetaKeywords | Page meta keywords as an array when available. |
| wordCount | Number of words in the extracted article text. |
| characterCount | Number of characters in the extracted article text. |
| scrapeMethod | Extractor used for the successful result: trafilatura or newspaper3k. |
| scrapeSuccess | true when extraction succeeded, otherwise false. |
| scrapeErrorMessage | Error details for failed extractions. |
| scrapedAt | UTC timestamp when the URL was processed. |
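The two count fields can be reproduced approximately on the consumer side, for instance to sanity-check stored text. The sketch below assumes words are whitespace-separated tokens and characters are counted with `len`; the Actor's exact counting rules are not documented here, so small differences are possible. `text_stats` is a hypothetical helper name.

```python
from typing import Dict


def text_stats(article_text: str) -> Dict[str, int]:
    """Approximate wordCount and characterCount for a piece of article text."""
    return {
        # Assumption: a "word" is any run of non-whitespace characters.
        "wordCount": len(article_text.split()),
        # Assumption: every character counts, including spaces and punctuation.
        "characterCount": len(article_text),
    }


print(text_stats("Clean extracted article text."))
# {'wordCount': 4, 'characterCount': 29}
```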
Example Output
    [
      {
        "articleURL": "https://www.example.com/news/article1",
        "sourceURL": "https://www.example.com",
        "articleLanguage": "en",
        "articleTitle": "Example News Article",
        "articleAuthors": ["John Doe", "Jane Smith"],
        "articlePublishDate": "2024-07-27T10:00:00Z",
        "articleText": "This is the full text of the example news article...",
        "articleTopImage": "https://www.example.com/images/article1.jpg",
        "articleAllImages": [
          "https://www.example.com/images/article1.jpg",
          "https://www.example.com/images/article2.png"
        ],
        "articleVideos": [],
        "articleKeywords": ["news", "example", "article"],
        "articleSummary": "A brief summary of the example news article.",
        "articleMetaDescription": "An example article for demonstration.",
        "articleMetaKeywords": ["example", "article", "news", "demo"],
        "wordCount": 825,
        "characterCount": 4920,
        "scrapeMethod": "newspaper3k",
        "scrapeSuccess": true,
        "scrapedAt": "2024-07-27T12:34:56Z"
      }
    ]
For failed URLs, the Actor still pushes a dataset item with articleURL, scrapeSuccess: false, scrapeErrorMessage, and scrapedAt so you can audit which pages need retrying or proxy changes. The run finishes with status SUCCEEDED as long as input was valid, even when every URL fails extraction (check scrapeSuccess per row).
On Apify cloud, Residential proxy is enabled automatically unless you provide custom proxyUrls.
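Because a run can succeed while individual rows fail, downstream code usually needs to split the dataset on scrapeSuccess. The sketch below shows one way to do that and to build a startUrls list for a retry run. It assumes the dataset has already been loaded as a list of dictionaries; the function names and the sample error message are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Item = Dict[str, Any]


def split_by_success(items: List[Item]) -> Tuple[List[Item], List[Item]]:
    """Separate dataset rows into (succeeded, failed) using scrapeSuccess."""
    succeeded = [item for item in items if item.get("scrapeSuccess") is True]
    failed = [item for item in items if item.get("scrapeSuccess") is not True]
    return succeeded, failed


def retry_start_urls(failed: List[Item]) -> List[Dict[str, str]]:
    """Build a de-duplicated startUrls list from failed rows, keeping order."""
    seen = set()
    start_urls = []
    for item in failed:
        url = item.get("articleURL")
        if url and url not in seen:
            seen.add(url)
            start_urls.append({"url": url})
    return start_urls


# Sample rows; the error text is made up for illustration.
items = [
    {"articleURL": "https://www.example.com/news/article1", "scrapeSuccess": True},
    {
        "articleURL": "https://www.example.com/blog/post2",
        "scrapeSuccess": False,
        "scrapeErrorMessage": "request timed out",
        "scrapedAt": "2024-07-27T12:34:56Z",
    },
]

succeeded, failed = split_by_success(items)
print(len(succeeded), "succeeded;", len(failed), "failed")
print(retry_start_urls(failed))
```

The retry list can be dropped into a new input object, optionally with a longer requestTimeout or a different proxyConfiguration.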
Use cases
- Monitor news articles, press mentions, and competitor content
- Build article datasets for research, analysis, or machine learning
- Enrich URLs with titles, authors, dates, summaries, and clean text
- Collect SEO metadata from article and blog pages
- Feed extracted article text into downstream AI, analytics, or database workflows
Related Actors on Apify Store
Pair this Actor with others from the same author for full content pipelines:
- YouTube Transcript Scraper Pro: video transcripts for the same research workflow
- News Source Crawler: discover article URLs from entire news sites, then extract with this Actor
- RSS Feed Scraper: monitor feeds and pass new article URLs here
Tips
- Use direct article URLs instead of homepages, category pages, or search result pages.
- If all URLs fail because of access restrictions, enable Apify Proxy and retry.
- Increase requestTimeout for slow publishers or long-form pages.
- Some JavaScript-heavy or paywalled pages may return partial text or fail if the article content is not present in the initial HTML.