News Article Scraper — Newsroom & Press Release Extractor
Pricing
from $5.00 / 1,000 articles scraped
Scrape full article content from any newsroom, press release page, or blog. Get title, author, publish date, summary, SEO keywords, word count, and full body text. Auto-discovers article links. Checkpoint resume. $5 per 1,000 articles
Developer
Scrape Pilot
Maintained by Community
📰 News Article Scraper — Newsroom & Press Release Extractor
The most reliable News Article Scraper on Apify. Extract full article content, metadata, and SEO data from any company newsroom, press release page, news portal, or blog — title, author, publish date, summary, full body text, word count, keywords, and publisher name. Auto-discovers article links. Real-time push. Pay only for results.
📌 Table of Contents
- What Is This Actor?
- Why Use This News Article Scraper?
- Supported Website Types
- Use Cases
- Input Parameters
- Output Fields
- Example Input & Output
- Pricing
- Performance & Limits
- FAQ
- Changelog
- Legal & Terms of Use
🔍 What Is This Actor?
News Article Scraper is a production-ready Apify actor that extracts complete article content and metadata from any newsroom, press release page, news portal, or company blog — automatically discovering article links from the start URL and scraping each one in full.
Provide one or many start URLs — a company's press room, an industry news portal, a product blog, or any page containing article links — and receive back a clean, structured record for every article found: publisher name, article title, author, publish date, summary, SEO keywords, word count, full body text, and direct article URL.
This newsroom scraper auto-detects article links from the start page (filtering for news, press release, blog, and story URL patterns), scrapes each article independently, and pushes results to the dataset in real time — so no data is lost even if the run is interrupted. A checkpoint system ensures that restarted runs skip already-processed URLs automatically.
🚀 Why Use This News Article Scraper?
| Feature | This Actor | Manual Research | RSS Scrapers | Other Scrapers |
|---|---|---|---|---|
| Full article body text | ✅ | ✅ Slow | ❌ Excerpt only | ⚠️ |
| Auto-discovers article links | ✅ | ❌ | ❌ | ⚠️ |
| Author & publish date | ✅ | ✅ | ⚠️ | ⚠️ |
| SEO keywords extracted | ✅ | ❌ | ❌ | ❌ |
| Word count per article | ✅ | ❌ | ❌ | ❌ |
| Publisher name auto-detected | ✅ | ❌ | ✅ | ⚠️ |
| Login/CAPTCHA page filtering | ✅ Auto | ❌ | N/A | ❌ |
| Checkpoint resume on abort | ✅ | N/A | ❌ | ❌ |
| Real-time push per article | ✅ | N/A | N/A | ❌ |
| Pay only for results | ✅ | N/A | N/A | ❌ |
Bottom line: This news article scraper is the only actor that combines automatic article link discovery, full body text extraction, SEO keyword output, word count, and checkpoint resume — with pay-per-result pricing so you only pay for successfully scraped articles.
🌐 Supported Website Types
This newsroom scraper works on any publicly accessible page containing article or press release links:
📢 Corporate & Brand Newsrooms
- Company press release pages (/press, /news, /newsroom)
- Investor relations news sections
- Product launch and announcement blogs
📰 News Portals & Media Sites
- Industry-specific news portals
- Trade publication article pages
- Technology, finance, health, and general news sites
✍️ Blogs & Content Sites
- Company and personal blogs (/blog, /articles, /posts)
- Thought leadership and opinion content
- Tutorial and resource article pages
📣 Press Release Aggregators
- Any page containing links to press releases or news articles
- Partner and affiliate news feeds
- Government and regulatory announcement pages
Any page where URLs contain news, press, release, blog, article, story, or post in the path is automatically detected as an article source.
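The path-keyword rule above can be sketched in a few lines of Python. This is an illustration only, not the actor's actual code: the function name is hypothetical, and case-insensitive substring matching is an assumption, since the README does not say how the match is performed.

```python
from urllib.parse import urlparse

# Keywords listed in the README. Case-insensitive substring matching
# is an assumption made for this sketch.
ARTICLE_KEYWORDS = ("news", "press", "release", "blog", "article", "story", "post")


def looks_like_article_url(url: str) -> bool:
    """Return True if the URL *path* (not the host) contains an article keyword."""
    path = urlparse(url).path.lower()
    return any(keyword in path for keyword in ARTICLE_KEYWORDS)
```

With substring matching, a path such as /posters would also match because it contains "post". A stricter variant could compare whole path segments instead.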
🎯 Use Cases
📊 Competitive Intelligence & Market Monitoring
- Scrape competitor newsrooms to track product launches, partnerships, and strategic announcements
- Monitor industry news portals for coverage of key topics, companies, or technologies
- Build automated press release feeds from multiple corporate newsrooms in one run
🤖 AI & NLP Training Datasets
- Extract full article body text from news portals to build large text corpora for language models
- Collect article metadata — author, date, keywords, word count — for news classification research
- Build structured datasets of press releases for information extraction and NLP training
📣 PR & Communications Research
- Scrape your own press coverage to audit how your announcements are being reported
- Monitor brand mentions in news articles across multiple publication newsrooms
- Track press release distribution and pickup by scraping multiple outlet pages
🛠️ Content & SEO Analytics
- Extract SEO keywords from article meta tags across competitor content for keyword research
- Analyze word count distributions and content length patterns across industry publications
- Build content audit datasets from any website's blog or article archive
📰 Journalism & Research
- Archive newsroom content for investigative journalism or historical analysis
- Collect structured article datasets for media studies, framing research, or sentiment analysis
- Monitor regulatory or government newsrooms for policy announcements and notices
🏢 Enterprise Content Pipelines
- Feed scraped press release content into internal knowledge bases or RAG systems
- Automate competitive news briefing generation using structured article data
- Build content aggregation pipelines that monitor multiple industry newsrooms on a schedule
⚙️ Input Parameters
{
  "startUrls": [
    { "url": "https://newsroom.example.com/press-releases" },
    { "url": "https://techcrunch.com/tag/artificial-intelligence/" },
    { "url": "https://www.company.com/blog" }
  ],
  "maxItemsPerUrl": 10,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | [] | Start pages to scrape, each as { "url": "https://..." }. The actor discovers article links automatically from each page |
| maxItemsPerUrl | integer | 10 | Maximum articles to scrape per start URL |
| proxyConfiguration | object | Residential | Apify proxy config; residential proxy recommended for news and corporate sites |
Tip: Provide the newsroom index page or article listing page in startUrls, not individual article URLs. The actor discovers individual article links automatically from the page.
📋 Output Fields
Every successfully scraped article from this press release scraper includes:
| Field | Type | Description | Example |
|---|---|---|---|
| publisher | string | Publisher name from OG metadata or domain | "TechCrunch" |
| articleTitle | string | Full article or press release title | "Acme Corp Launches New AI Platform" |
| author | string | Article author from metadata | "Sarah Johnson" |
| publishDate | string | Publication date (ISO format when available) | "2024-03-15T09:00:00Z" |
| summary | string | Article summary or meta description (max 300 chars) | "Acme Corp today announced the launch of..." |
| seoKeywords | string | SEO keywords from page meta tags | "AI, machine learning, enterprise software" |
| wordCount | integer | Total word count of the article body | 847 |
| articleUrl | string | Direct URL of the scraped article | "https://newsroom.acme.com/press/ai-launch" |
| fullText | string | Complete article body text | "Acme Corp today announced..." |
| scrapedAt | string | Extraction timestamp (ISO 8601 UTC) | "2024-03-15T10:30:00Z" |
Quality filtering: Articles shorter than 150 words are automatically skipped. Login walls, CAPTCHA pages, and access-denied pages are detected and skipped by title keyword matching — only real article content reaches your dataset.
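A minimal sketch of this quality filter, assuming the title keywords listed in the FAQ ("login", "sign in", "access denied", "robot", "captcha") and a whitespace-based word count. The function names are hypothetical, and the actor's real word-counting method is not documented here.

```python
# Title keywords taken from the FAQ; lowercase comparison is an assumption.
BLOCKED_TITLE_KEYWORDS = ("login", "sign in", "access denied", "robot", "captcha")
MIN_WORDS = 150  # articles with fewer words are skipped


def is_blocked_title(title: str) -> bool:
    """Detect login walls, CAPTCHA pages, and access-denied pages by title."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in BLOCKED_TITLE_KEYWORDS)


def passes_quality_filter(title: str, body_text: str) -> bool:
    """Keep a page only if its title is not blocked and its body is long enough."""
    if is_blocked_title(title):
        return False
    # Whitespace splitting is a simple stand-in for the actor's word count.
    return len(body_text.split()) >= MIN_WORDS
```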
📦 Example Input & Output
Input:
{
  "startUrls": [
    { "url": "https://newsroom.example.com/press-releases" }
  ],
  "maxItemsPerUrl": 3
}
Output (one record):
{
  "publisher": "Example Corp Newsroom",
  "articleTitle": "Example Corp Launches New AI-Powered Analytics Platform",
  "author": "Sarah Johnson",
  "publishDate": "2024-03-15T09:00:00Z",
  "summary": "Example Corp today announced the general availability of its new AI-powered analytics platform, designed to help enterprises...",
  "seoKeywords": "AI analytics, enterprise software, data platform, machine learning",
  "wordCount": 847,
  "articleUrl": "https://newsroom.example.com/press-releases/ai-analytics-launch",
  "fullText": "Example Corp today announced the general availability of its new AI-powered analytics platform. The platform, which has been in private beta for six months...",
  "scrapedAt": "2024-03-15T10:30:00Z"
}
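When consuming records downstream, it can help to parse them into typed objects. The sketch below is one possible approach, not part of the actor: ArticleRecord, parse_iso, and parse_record are hypothetical names. Because publishDate is ISO-formatted only "when available", non-ISO values are mapped to None here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ArticleRecord:
    publisher: str
    article_title: str
    author: str
    publish_date: Optional[datetime]
    summary: str
    seo_keywords: list[str]
    word_count: int
    article_url: str
    full_text: str
    scraped_at: Optional[datetime]


def parse_iso(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp; return None if missing or not ISO."""
    if not value:
        return None
    try:
        # Older Python versions do not accept a trailing "Z", so normalize it.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None


def parse_record(raw: dict) -> ArticleRecord:
    """Convert one dataset item (a dict with the output fields) to a typed record."""
    keywords = [k.strip() for k in (raw.get("seoKeywords") or "").split(",") if k.strip()]
    return ArticleRecord(
        publisher=raw.get("publisher") or "",
        article_title=raw.get("articleTitle") or "",
        author=raw.get("author") or "",
        publish_date=parse_iso(raw.get("publishDate")),
        summary=raw.get("summary") or "",
        seo_keywords=keywords,
        word_count=int(raw.get("wordCount") or 0),
        article_url=raw.get("articleUrl") or "",
        full_text=raw.get("fullText") or "",
        scraped_at=parse_iso(raw.get("scrapedAt")),
    )
```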
💰 Pricing
This actor uses pay-per-event pricing — you only pay for articles successfully scraped and pushed to the dataset.
| Event | Price |
|---|---|
| Actor start fee | $0.02 per run |
| Per article successfully scraped | $0.005 per result ($5.00 per 1,000 articles) |
How billing works:
- ✅ The $0.02 start fee applies once per run regardless of results
- ✅ Each article record pushed to the dataset is charged at $0.005
- ✅ Articles filtered out (too short, login wall, CAPTCHA) are not charged
- ✅ The actor stops automatically when your Apify account charge limit is reached
- ✅ No subscription — pay only for what you successfully extract
Example: Scrape 200 articles = $0.02 (start) + $1.00 (200 × $0.005) = $1.02 total
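The same arithmetic as a small helper, using the two prices from the table above. The function name is hypothetical. Decimal is used so that cent values stay exact.

```python
from decimal import Decimal

START_FEE = Decimal("0.02")           # charged once per run
PRICE_PER_ARTICLE = Decimal("0.005")  # charged per article pushed to the dataset


def estimate_run_cost(articles_pushed: int, runs: int = 1) -> Decimal:
    """Estimate the charge for a number of runs and successfully scraped articles."""
    return START_FEE * runs + PRICE_PER_ARTICLE * articles_pushed
```

Skipped articles (too short, login wall, CAPTCHA) are not charged, so only count the records that actually reach the dataset.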
2-hour free trial available — click Try for free at the top of this page.
⚡ Performance & Limits
| Start URLs | Articles Per URL | Estimated Time |
|---|---|---|
| 1 URL | 10 | ~1–3 minutes |
| 5 URLs | 10 each | ~5–12 minutes |
| 10 URLs | 10 each | ~10–25 minutes |
| 1 URL | 50 | ~5–12 minutes |
- Results pushed to the Apify dataset immediately after each article is scraped — no data loss on abort
- Checkpoint saved after every completed URL — restart resumes from the last processed URL
- 0.5-second delay between article requests — polite scraping that avoids rate limiting
- Login wall and CAPTCHA pages are automatically detected and skipped
❓ FAQ
Q: How does the actor discover article links?
A: The actor loads the start URL, parses all <a href> links on the page, and filters for URLs containing news, press, release, blog, article, story, or post in the path. Only links from the same or related domain are followed. Pagination and tag page URLs are excluded.
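The discovery steps described in this answer could look roughly like the following standard-library sketch. Several details are assumptions made for illustration: "same or related domain" is simplified to an exact host match, and the /page/ and /tag/ markers are hypothetical stand-ins for the undocumented pagination and tag-page exclusion rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

ARTICLE_KEYWORDS = ("news", "press", "release", "blog", "article", "story", "post")
# Hypothetical markers; the README only says pagination and tag pages are excluded.
EXCLUDED_MARKERS = ("/page/", "/tag/")


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_article_links(html: str, start_url: str) -> list[str]:
    """Return unique, same-host links whose path looks like an article URL."""
    collector = LinkCollector()
    collector.feed(html)
    start_host = urlparse(start_url).netloc.lower()
    seen: set[str] = set()
    links: list[str] = []
    for href in collector.hrefs:
        url = urljoin(start_url, href)  # resolve relative links
        parsed = urlparse(url)
        path = parsed.path.lower()
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.netloc.lower() != start_host:
            continue  # simplification: same host only
        if url.rstrip("/") == start_url.rstrip("/"):
            continue  # do not treat the listing page itself as an article
        if not any(keyword in path for keyword in ARTICLE_KEYWORDS):
            continue
        if any(marker in path for marker in EXCLUDED_MARKERS):
            continue
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```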
Q: What happens if an article is behind a login wall or paywall?
A: Pages with titles containing "login", "sign in", "access denied", "robot", or "captcha" are automatically detected and skipped. No charge is applied for skipped pages.
Q: What is the minimum article length for scraping?
A: Articles with fewer than 150 words of body text are automatically filtered out. This prevents stub pages, redirect pages, and content-free landing pages from appearing in your dataset.
Q: What does the checkpoint feature do?
A: After each start URL is fully processed, the actor saves a checkpoint. If the run is aborted or hits a spending limit, restarting it with the same input will skip already-completed URLs, so there are no duplicate results and no wasted credits.
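The checkpoint pattern described in this answer can be illustrated as follows. This README does not say where the actor stores its checkpoint, so a local JSON file stands in for that storage. All function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_checkpoint(path: Path) -> set[str]:
    """Return the set of start URLs already completed in a previous run."""
    if path.exists():
        return set(json.loads(path.read_text(encoding="utf-8")))
    return set()


def save_checkpoint(path: Path, done: set[str]) -> None:
    """Persist the completed start URLs."""
    path.write_text(json.dumps(sorted(done)), encoding="utf-8")


def run_with_checkpoint(
    start_urls: Iterable[str],
    process: Callable[[str], None],
    path: Path,
) -> list[str]:
    """Process each start URL once, saving a checkpoint after each one completes."""
    done = load_checkpoint(path)
    processed_now: list[str] = []
    for url in start_urls:
        if url in done:
            continue  # finished in an earlier run: skip it
        process(url)  # may raise if the run is interrupted
        done.add(url)
        save_checkpoint(path, done)  # saved only after the URL is fully processed
        processed_now.append(url)
    return processed_now
```

Because the checkpoint is written only after a start URL completes, an interrupted URL is retried from the beginning on the next run in this sketch.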
Q: Can I scrape individual article URLs directly?
A: This actor is designed to start from a listing or index page and discover article links automatically. For scraping known individual article URLs, provide the listing page that contains those links as your start URL.
Q: Is fullText always populated?
A: Yes, for every record that reaches the dataset. The actor first looks for an <article> HTML element. If none is found, it collects all <p> paragraphs with more than 30 characters. If the combined text is still under 150 words, the article is skipped and not charged.
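The fallback order in this answer can be sketched with Python's built-in HTML parser. This is an approximation under stated assumptions, not the actor's implementation: the class and function names are hypothetical, script and style content is ignored, and text chunks inside the article element are joined with spaces, which can shift the word count slightly around inline tags.

```python
from html.parser import HTMLParser
from typing import Optional

MIN_PARAGRAPH_CHARS = 30  # fallback paragraphs must be longer than this
MIN_WORDS = 150           # shorter results are treated as "skip"


class BodyTextParser(HTMLParser):
    """Collect text inside <article> and, separately, the text of each <p>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.saw_article = False
        self.article_depth = 0
        self.skip_depth = 0
        self.in_paragraph = False
        self.article_chunks: list[str] = []
        self.paragraphs: list[str] = []
        self._current: list[str] = []

    def flush_paragraph(self) -> None:
        if self.in_paragraph:
            self.paragraphs.append(" ".join("".join(self._current).split()))
            self._current = []
            self.in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "article":
            self.saw_article = True
            self.article_depth += 1
        elif tag == "p":
            self.flush_paragraph()  # tolerate unclosed <p> tags
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag == "p":
            self.flush_paragraph()

    def handle_data(self, data):
        if self.skip_depth:
            return
        if self.article_depth:
            self.article_chunks.append(data)
        if self.in_paragraph:
            self._current.append(data)


def extract_full_text(html: str) -> Optional[str]:
    """Return the body text, or None when the page should be skipped."""
    parser = BodyTextParser()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()
    if parser.saw_article:
        text = " ".join(" ".join(parser.article_chunks).split())
    else:
        long_paragraphs = [p for p in parser.paragraphs if len(p) > MIN_PARAGRAPH_CHARS]
        text = " ".join(long_paragraphs)
    return text if len(text.split()) >= MIN_WORDS else None
```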
Q: Can I export results to Excel or CSV?
A: Yes. All results are pushed to the Apify dataset, which can be exported to JSON, CSV, Excel, and more directly from the Apify Console after each run.
📜 Changelog
v1.0.0 (Current)
- ✅ Auto-discovery of article links from any newsroom or blog index page
- ✅ Full article body text extraction: <article> tag first, paragraph fallback
- ✅ Article metadata: title, author, publish date, summary, SEO keywords, word count
- ✅ Publisher name auto-detection from OG metadata or domain
- ✅ Login wall and CAPTCHA page filtering — only real content reaches dataset
- ✅ Minimum 150-word quality filter — no stub pages in output
- ✅ Checkpoint/resume — restarted runs skip already-processed URLs
- ✅ Pay-per-event billing — charged per successfully scraped article
- ✅ Real-time dataset push as each article is scraped
🏷️ Tags
news article scraper, newsroom scraper, press release scraper, article extractor, news content scraper, blog scraper, full text article scraper, press release extractor, news data scraper, corporate newsroom scraper, article metadata scraper, content scraper
⚖️ Legal & Terms of Use
This actor accesses publicly visible article and press release content from websites in the same way a regular user browses those pages.
Please note:
- Use extracted article data only for lawful purposes — research, competitive intelligence, NLP datasets, content monitoring, and journalism are common legitimate uses
- Article content is copyright of the original publisher — do not republish scraped full-text content without appropriate authorization
- Respect individual publication Terms of Service when scraping at high volume
- The actor developer is not responsible for how extracted article content is used
🤝 Support & Feedback
- Bug report? Contact us via the Apify actor page
- Feature request? Post in the Apify Community forum
- Loving it? Please leave a ⭐ review — it helps other users find this actor!
Built with ❤️ on Apify
The most reliable News Article Scraper — full text, metadata, SEO data, pay per result
💰 $0.02 per run + $5.00 per 1,000 articles · Pay only for results