Pricing

from $5.00 / 1,000 articles scraped

News Article Scraper — Newsroom & Press Release Extractor

Scrape full article content from any newsroom, press release page, or blog. Get title, author, publish date, summary, SEO keywords, word count, and full body text. Auto-discovers article links and resumes from checkpoints. $5 per 1,000 articles.

Developer

Scrape Pilot

Last modified: 2 months ago

📰 News Article Scraper — Newsroom & Press Release Extractor

The most reliable News Article Scraper on Apify. Extract full article content, metadata, and SEO data from any company newsroom, press release page, news portal, or blog — title, author, publish date, summary, full body text, word count, keywords, and publisher name. Auto-discovers article links. Real-time push. Pay only for results.

🔍 What Is This Actor?

News Article Scraper is a production-ready Apify actor that extracts complete article content and metadata from any newsroom, press release page, news portal, or company blog — automatically discovering article links from the start URL and scraping each one in full.

Provide one or many start URLs — a company's press room, an industry news portal, a product blog, or any page containing article links — and receive back a clean, structured record for every article found: publisher name, article title, author, publish date, summary, SEO keywords, word count, full body text, and direct article URL.

This newsroom scraper auto-detects article links from the start page (filtering for news, press release, blog, and story URL patterns), scrapes each article independently, and pushes results to the dataset in real time — so no data is lost even if the run is interrupted. A checkpoint system ensures that restarted runs skip already-processed URLs automatically.

🚀 Why Use This News Article Scraper?

Feature	This Actor	Manual Research	RSS Scrapers	Other Scrapers
Full article body text	✅	✅ Slow	❌ Excerpt only	⚠️
Auto-discovers article links	✅	❌	❌	⚠️
Author & publish date	✅	✅	⚠️	⚠️
SEO keywords extracted	✅	❌	❌	❌
Word count per article	✅	❌	❌	❌
Publisher name auto-detected	✅	❌	✅	⚠️
Login/CAPTCHA page filtering	✅ Auto	❌	N/A	❌
Checkpoint resume on abort	✅	N/A	❌	❌
Real-time push per article	✅	N/A	N/A	❌
Pay only for results	✅	N/A	N/A	❌

Bottom line: This news article scraper combines automatic article link discovery, full body text extraction, SEO keyword output, word count, and checkpoint resume, with pay-per-result pricing so you only pay for successfully scraped articles.

🌐 Supported Website Types

This newsroom scraper works on any publicly accessible page containing article or press release links:

📢 Corporate & Brand Newsrooms

Company press release pages (/press, /news, /newsroom)
Investor relations news sections
Product launch and announcement blogs

📰 News Portals & Media Sites

Industry-specific news portals
Trade publication article pages
Technology, finance, health, and general news sites

✍️ Blogs & Content Sites

Company and personal blogs (/blog, /articles, /posts)
Thought leadership and opinion content
Tutorial and resource article pages

📣 Press Release Aggregators

Any page containing links to press releases or news articles
Partner and affiliate news feeds
Government and regulatory announcement pages

Any page where URLs contain news, press, release, blog, article, story, or post in the path is automatically detected as an article source.
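The path-keyword detection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actor's actual code: the names `LinkCollector` and `discover_article_links` are hypothetical, and "same domain" is approximated here by an exact host match.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Path keywords named in the text above.
ARTICLE_KEYWORDS = ("news", "press", "release", "blog", "article", "story", "post")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_article_links(html, base_url):
    """Return absolute same-host links whose path contains an article keyword."""
    collector = LinkCollector()
    collector.feed(html)
    base = urlparse(base_url)
    found = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        # Assumption: only links on exactly the same host are followed.
        if parsed.netloc != base.netloc:
            continue
        # Skip links that point back at the listing page itself.
        if parsed.path.rstrip("/") == base.path.rstrip("/"):
            continue
        path = parsed.path.lower()
        if any(keyword in path for keyword in ARTICLE_KEYWORDS) and absolute not in found:
            found.append(absolute)
    return found
```

Substring matching on the path is deliberately loose, so a path such as `/postal-codes` would also match `post`; a stricter variant could match whole path segments instead.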

🎯 Use Cases

📊 Competitive Intelligence & Market Monitoring

Scrape competitor newsrooms to track product launches, partnerships, and strategic announcements
Monitor industry news portals for coverage of key topics, companies, or technologies
Build automated press release feeds from multiple corporate newsrooms in one run

🤖 AI & NLP Training Datasets

Extract full article body text from news portals to build large text corpora for language models
Collect article metadata — author, date, keywords, word count — for news classification research
Build structured datasets of press releases for information extraction and NLP training

📣 PR & Communications Research

Scrape your own press coverage to audit how your announcements are being reported
Monitor brand mentions in news articles across multiple publication newsrooms
Track press release distribution and pickup by scraping multiple outlet pages

🛠️ Content & SEO Analytics

Extract SEO keywords from article meta tags across competitor content for keyword research
Analyze word count distributions and content length patterns across industry publications
Build content audit datasets from any website's blog or article archive

📰 Journalism & Research

Archive newsroom content for investigative journalism or historical analysis
Collect structured article datasets for media studies, framing research, or sentiment analysis
Monitor regulatory or government newsrooms for policy announcements and notices

🏢 Enterprise Content Pipelines

Feed scraped press release content into internal knowledge bases or RAG systems
Automate competitive news briefing generation using structured article data
Build content aggregation pipelines that monitor multiple industry newsrooms on a schedule

⚙️ Input Parameters

{
  "startUrls": [
    { "url": "https://newsroom.example.com/press-releases" },
    { "url": "https://techcrunch.com/tag/artificial-intelligence/" },
    { "url": "https://www.company.com/blog" }
  ],
  "maxItemsPerUrl": 10,
  "proxyConfiguration": {
    "useApifyProxy":    true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

Parameter	Type	Default	Description
`startUrls`	array	`[]`	Start pages to scrape — each as `{ "url": "https://..." }`. The actor discovers article links automatically from each page
`maxItemsPerUrl`	integer	`10`	Maximum articles to scrape per start URL
`proxyConfiguration`	object	Residential	Apify proxy config — residential proxy recommended for news and corporate sites

Tip: Provide the newsroom index page or article listing page in `startUrls`, not individual article URLs. The actor discovers individual article links automatically from that page.
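As a rough sketch, the input above could be assembled and submitted from Python. The helper names `build_run_input` and `run_actor` are hypothetical, the actor ID is left as a parameter because it is not stated on this page, and `run_actor` assumes the `apify-client` package is installed and that `call` returns a run record with a `defaultDatasetId` key.

```python
def build_run_input(urls, max_items_per_url=10, use_residential_proxy=True):
    """Build the input object documented in the parameter table above."""
    if not urls:
        raise ValueError("at least one start URL is required")
    run_input = {
        "startUrls": [{"url": url} for url in urls],
        "maxItemsPerUrl": max_items_per_url,
    }
    if use_residential_proxy:
        run_input["proxyConfiguration"] = {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        }
    return run_input


def run_actor(token, actor_id, run_input):
    """Start the actor, wait for it to finish, and return its dataset items."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

For example, `build_run_input(["https://newsroom.example.com/press-releases"], max_items_per_url=3)` produces the same shape as the JSON above.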

📋 Output Fields

Every successfully scraped article from this press release scraper includes:

Field	Type	Description	Example
`publisher`	string	Publisher name from OG metadata or domain	`"TechCrunch"`
`articleTitle`	string	Full article or press release title	`"Acme Corp Launches New AI Platform"`
`author`	string	Article author from metadata	`"Sarah Johnson"`
`publishDate`	string	Publication date (ISO format when available)	`"2024-03-15T09:00:00Z"`
`summary`	string	Article summary or meta description (max 300 chars)	`"Acme Corp today announced the launch of..."`
`seoKeywords`	string	SEO keywords from page meta tags	`"AI, machine learning, enterprise software"`
`wordCount`	integer	Total word count of the article body	`847`
`articleUrl`	string	Direct URL of the scraped article	`"https://newsroom.acme.com/press/ai-launch"`
`fullText`	string	Complete article body text	`"Acme Corp today announced..."`
`scrapedAt`	string	Extraction timestamp (ISO 8601 UTC)	`"2024-03-15T10:30:00Z"`

Quality filtering: Articles shorter than 150 words are automatically skipped. Login walls, CAPTCHA pages, and access-denied pages are detected and skipped by title keyword matching — only real article content reaches your dataset.
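A minimal sketch of this filter, assuming the title keywords are the ones listed in the FAQ ("login", "sign in", "access denied", "robot", "captcha") and that words are counted by splitting on whitespace. The function name `should_keep` is hypothetical.

```python
BLOCKED_TITLE_KEYWORDS = ("login", "sign in", "access denied", "robot", "captcha")
MIN_WORDS = 150


def should_keep(title, body_text):
    """Return True when a page looks like a real article of sufficient length."""
    lowered = title.lower()
    if any(keyword in lowered for keyword in BLOCKED_TITLE_KEYWORDS):
        return False
    # Assumption: word count is a simple whitespace split.
    return len(body_text.split()) >= MIN_WORDS
```

Plain substring matching on titles is simple but can misfire: a genuine article titled "Robotics startup raises funding" contains "robot" and would be skipped by this sketch.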

📦 Example Input & Output

Input:

{
  "startUrls": [{ "url": "https://newsroom.example.com/press-releases" }],
  "maxItemsPerUrl": 3
}

Output (one record):

{
  "publisher":    "Example Corp Newsroom",
  "articleTitle": "Example Corp Launches New AI-Powered Analytics Platform",
  "author":       "Sarah Johnson",
  "publishDate":  "2024-03-15T09:00:00Z",
  "summary":      "Example Corp today announced the general availability of its new AI-powered analytics platform, designed to help enterprises...",
  "seoKeywords":  "AI analytics, enterprise software, data platform, machine learning",
  "wordCount":    847,
  "articleUrl":   "https://newsroom.example.com/press-releases/ai-analytics-launch",
  "fullText":     "Example Corp today announced the general availability of its new AI-powered analytics platform. The platform, which has been in private beta for six months...",
  "scrapedAt":    "2024-03-15T10:30:00Z"
}
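When consuming records like the one above, note that `publishDate` is ISO formatted only "when available", so a defensive parser helps. The sketch below uses hypothetical helper names and treats timestamps without an offset as UTC, which is an assumption rather than documented behavior.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse an ISO 8601 string; return None when it is missing or not ISO."""
    try:
        # Replace a trailing "Z" so older Python versions accept the value.
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except (AttributeError, ValueError):
        return None
    # Assumption: timestamps without an offset are treated as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def hours_until_scraped(record):
    """Hours between publication and scraping, or None if either date is unusable."""
    published = parse_timestamp(record.get("publishDate"))
    scraped = parse_timestamp(record.get("scrapedAt"))
    if published is None or scraped is None:
        return None
    return (scraped - published).total_seconds() / 3600


record = {
    "publishDate": "2024-03-15T09:00:00Z",
    "scrapedAt": "2024-03-15T10:30:00Z",
}
print(hours_until_scraped(record))  # 1.5
```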

💰 Pricing

This actor uses pay-per-event pricing — you only pay for articles successfully scraped and pushed to the dataset.

Event	Price
Actor start fee	$0.02 per run
Per article successfully scraped	$0.005 per result ($5.00 per 1,000 articles)

How billing works:

✅ The $0.02 start fee applies once per run regardless of results
✅ Each article record pushed to the dataset is charged at $0.005
✅ Articles filtered out (too short, login wall, CAPTCHA) are not charged
✅ The actor stops automatically when your Apify account charge limit is reached
✅ No subscription — pay only for what you successfully extract

Example: Scrape 200 articles = $0.02 (start) + $1.00 (200 × $0.005) = $1.02 total
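The same arithmetic can be wrapped in a small estimator. This is a sketch using the two prices from the table above; the function names are hypothetical, and `Decimal` is used to avoid floating-point rounding on currency values.

```python
from decimal import Decimal

START_FEE = Decimal("0.02")          # charged once per run
PRICE_PER_ARTICLE = Decimal("0.005")  # charged per article pushed to the dataset


def estimate_cost(articles, runs=1):
    """Estimated charge in USD for a number of scraped articles across runs."""
    return START_FEE * runs + PRICE_PER_ARTICLE * articles


def max_articles_for_budget(budget, runs=1):
    """Largest number of articles that fits in a budget after start fees."""
    remaining = Decimal(str(budget)) - START_FEE * runs
    if remaining <= 0:
        return 0
    return int(remaining // PRICE_PER_ARTICLE)


print(estimate_cost(200))  # 1.020
```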

2-hour free trial available — click Try for free at the top of this page.

⚡ Performance & Limits

Start URLs	Articles Per URL	Estimated Time
1 URL	10	~1–3 minutes
5 URLs	10 each	~5–12 minutes
10 URLs	10 each	~10–25 minutes
1 URL	50	~5–12 minutes

Results pushed to the Apify dataset immediately after each article is scraped — no data loss on abort
Checkpoint saved after every completed start URL, so a restarted run skips start URLs that were already processed
0.5-second delay between article requests, a polite pace that reduces the risk of rate limiting
Login wall and CAPTCHA pages are automatically detected and skipped

❓ FAQ

Q: How does the actor discover article links? A: The actor loads the start URL, parses all <a href> links on the page, and filters for URLs containing news, press, release, blog, article, story, or post in the path. Only links from the same or related domain are followed. Pagination and tag page URLs are excluded.

Q: What happens if an article is behind a login wall or paywall? A: Pages with titles containing "login", "sign in", "access denied", "robot", or "captcha" are automatically detected and skipped. No charge is applied for skipped pages.

Q: What is the minimum article length for scraping? A: Articles with fewer than 150 words of body text are automatically filtered out. This prevents stub pages, redirect pages, and content-free landing pages from appearing in your dataset.

Q: What does the checkpoint feature do? A: After each start URL is fully processed, the actor saves a checkpoint. If the run is aborted or hits a spending limit, restarting it with the same input will skip already-completed URLs — no duplicate results and no wasted credits.
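The checkpoint behavior described in this answer can be illustrated with a small sketch. It stores completed start URLs in a local JSON file purely for demonstration; where the actor actually keeps its checkpoint is not stated here, and all function names are hypothetical.

```python
import json
from pathlib import Path


def load_checkpoint(path):
    """Return the set of start URLs already completed, or an empty set."""
    file = Path(path)
    if not file.exists():
        return set()
    return set(json.loads(file.read_text(encoding="utf-8")))


def save_checkpoint(path, completed):
    """Persist the completed start URLs as a sorted JSON list."""
    Path(path).write_text(json.dumps(sorted(completed)), encoding="utf-8")


def process_with_checkpoint(start_urls, process_url, path="checkpoint.json"):
    """Process each start URL once, skipping any recorded in the checkpoint."""
    completed = load_checkpoint(path)
    processed_now = []
    for url in start_urls:
        if url in completed:
            continue  # finished in an earlier run
        process_url(url)
        # Save only after the URL is fully processed, so an interrupted
        # URL is retried on the next run.
        completed.add(url)
        save_checkpoint(path, completed)
        processed_now.append(url)
    return processed_now
```

Because the checkpoint is written per start URL, a URL that was interrupted midway is processed again from the beginning on restart.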

Q: Can I scrape individual article URLs directly? A: Not directly. This actor is designed to start from a listing or index page and discover article links automatically. To scrape known individual articles, provide the listing page that contains those links as your start URL.

Q: Is fullText always populated? A: Yes, for every record that reaches the dataset. The actor first looks for an <article> HTML element. If none is found, it collects all <p> paragraphs with more than 30 characters. If the combined text is still under 150 words, the article is skipped and not charged.
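The extraction order in this answer (the `<article>` element first, then a paragraph fallback, then the 150-word minimum) could look roughly like the sketch below. It uses only the standard library; the class and function names are hypothetical, and skipping `<script>` and `<style>` content is an added assumption.

```python
from html.parser import HTMLParser

MIN_PARAGRAPH_CHARS = 30
MIN_WORDS = 150
SKIPPED_TAGS = {"script", "style"}  # assumption: non-visible text is ignored
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class BodyExtractor(HTMLParser):
    """Collects text inside <article> and, separately, every <p> paragraph."""

    def __init__(self):
        super().__init__()
        self.saw_article = False
        self.article_depth = 0
        self.skip_depth = 0
        self.in_paragraph = False
        self.article_chunks = []
        self.paragraphs = []
        self._current = []

    def _flush_paragraph(self):
        text = " ".join("".join(self._current).split())
        if len(text) > MIN_PARAGRAPH_CHARS:
            self.paragraphs.append(text)
        self._current = []
        self.in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "article":
            self.saw_article = True
            self.article_depth += 1
        elif tag == "p":
            self._flush_paragraph()  # closes an unterminated previous <p>
            self.in_paragraph = True
        if tag in BLOCK_TAGS and self.article_depth:
            self.article_chunks.append(" ")  # keep words in adjacent blocks apart

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag == "p" and self.in_paragraph:
            self._flush_paragraph()
        if tag in BLOCK_TAGS and self.article_depth:
            self.article_chunks.append(" ")

    def handle_data(self, data):
        if self.skip_depth:
            return
        if self.article_depth:
            self.article_chunks.append(data)
        if self.in_paragraph:
            self._current.append(data)


def extract_full_text(html):
    """Return the article body, or None when it has fewer than MIN_WORDS words."""
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    if parser.in_paragraph:
        parser._flush_paragraph()
    if parser.saw_article:
        text = " ".join("".join(parser.article_chunks).split())
    else:
        text = " ".join(parser.paragraphs)
    return text if len(text.split()) >= MIN_WORDS else None
```

Returning `None` for short pages mirrors the answer above: such pages are skipped rather than emitted with an empty `fullText`.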

Q: Can I export results to Excel or CSV? A: Yes. All results are pushed to the Apify dataset, which can be exported to JSON, CSV, Excel, and more directly from the Apify Console after each run.

📜 Changelog

v1.0.0 (Current)

✅ Auto-discovery of article links from any newsroom or blog index page
✅ Full article body text extraction — <article> tag first, paragraph fallback
✅ Article metadata: title, author, publish date, summary, SEO keywords, word count
✅ Publisher name auto-detection from OG metadata or domain
✅ Login wall and CAPTCHA page filtering — only real content reaches dataset
✅ Minimum 150-word quality filter — no stub pages in output
✅ Checkpoint/resume — restarted runs skip already-processed URLs
✅ Pay-per-event billing — charged per successfully scraped article
✅ Real-time dataset push as each article is scraped

🏷️ Tags

news article scraper, newsroom scraper, press release scraper, article extractor, news content scraper, blog scraper, full text article scraper, press release extractor, news data scraper, corporate newsroom scraper, article metadata scraper, content scraper

⚖️ Legal & Terms of Use

This actor accesses publicly visible article and press release content from websites in the same way a regular user browses those pages.

Please note:

Use extracted article data only for lawful purposes — research, competitive intelligence, NLP datasets, content monitoring, and journalism are common legitimate uses
Article content is copyrighted by its original publisher; do not republish scraped full-text content without appropriate authorization
Respect individual publication Terms of Service when scraping at high volume
The actor developer is not responsible for how extracted article content is used

🤝 Support & Feedback

Bug report? Contact us via the Apify actor page
Feature request? Post in the Apify Community forum
Loving it? Please leave a ⭐ review — it helps other users find this actor!

Built with ❤️ on Apify
The most reliable News Article Scraper — full text, metadata, SEO data, pay per result

💰 $0.02 per run + $5.00 per 1,000 articles · Pay only for results
