News Website Crawler & Article Extractor

Under maintenance

2 hours trial then $20.00/month - No credit card required now

This Actor is under maintenance.

This Actor may be unreliable while under maintenance.

xtech/news-source-crawler

Scrape all articles from any news website. Extract full text, metadata, keywords, and summaries. Ideal for content analysis, research, and news aggregation.

News Source Crawler 📰🚀 (Apify Actor)

Crawl an entire news website and extract clean, structured data from all its articles. Get article text, metadata, keywords, summaries, and more – perfect for content analysis, market research, news aggregation, and SEO monitoring. No coding required!

Pricing 💰

  • $35/month for unlimited usage
  • Includes all features and Apify platform benefits
  • No additional costs or hidden fees

Features ✨

  • Full Website Crawl: 🌐 Scrapes articles from a specified news source URL
  • Comprehensive Article Extraction: 📰 Get full article text, publication date, author(s), and source URL
  • SEO & Content Analysis: 🔍 Extract keywords, meta descriptions, and automatically generated summaries
  • Multimedia Extraction: 🖼️ Get links to the main image, all images, and embedded videos
  • Language Support: 🌐 Specify the article language
  • Limit Articles: 🔢 Set a maximum number of articles to scrape (optional)
  • Proxy Support: ⚙️ Works with Apify Proxy for reliable scraping, or with your own custom proxies
  • Analysis-Ready Data (JSON): 💾 Structured data output, perfect for analysis and integration
  • Error Handling: ✅ Articles that fail to scrape are reported in the dataset with scrapeSuccess set to false and a scrapeErrorMessage

Why Use This News Source Crawler? 🤔

This Actor is designed to efficiently extract data from entire news websites. It crawls all linked articles from a starting URL, making it ideal for:

  • Large-Scale Data Collection: Quickly gather data from an entire news source
  • Comprehensive Analysis: Analyze the content, trends, and SEO strategies of a website
  • Automated News Feeds: Build custom news feeds with structured data
  • Time Savings: Automate the process of collecting articles from a specific source

Data Output 📦

The Actor pushes data to the dataset as it scrapes, so results are available in real time. Each item represents a single article (or a failed scrape attempt) and contains the following fields:

  • articleURL: The URL of the scraped article
  • sourceURL: The base URL of the news source
  • articleLanguage: The language of the article (e.g., "en", "es")
  • articleTitle: The title of the article
  • articleAuthors: A comma-separated list of the article's authors
  • articlePublishDate: The publication date (ISO 8601 format), if available
  • articleText: The full text content of the article
  • articleTopImage: The URL of the main image
  • articleAllImages: A comma-separated list of URLs for all images
  • articleVideos: A comma-separated list of URLs for embedded videos
  • articleKeywords: A comma-separated list of extracted keywords
  • articleSummary: A concise summary of the article
  • scrapedAt: The timestamp of when the article was scraped (ISO 8601)
  • scrapeSuccess: true if scraped successfully, false otherwise
  • articleMetaDescription: The meta description of the article
  • articleMetaKeywords: A comma-separated list of the meta keywords
  • scrapeErrorMessage: An error message if scrapeSuccess is false
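
Because error items share the dataset with successful articles, it usually helps to separate the two before any analysis. The following is a minimal sketch of that step, assuming the items have already been downloaded as a list of dictionaries. The helper name split_by_success and the sample error text are hypothetical and not part of the Actor.

```python
from typing import Any


def split_by_success(
    items: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Separate successfully scraped articles from error items."""
    ok: list[dict[str, Any]] = []
    failed: list[dict[str, Any]] = []
    for item in items:
        # A missing flag is treated as a failure so that incomplete
        # items are never mistaken for articles.
        if item.get("scrapeSuccess") is True:
            ok.append(item)
        else:
            failed.append(item)
    return ok, failed


# Hypothetical sample items; the error text is made up for illustration.
items = [
    {"articleURL": "https://www.example.com/news/article1", "scrapeSuccess": True},
    {
        "articleURL": "https://www.example.com/news/article2",
        "scrapeSuccess": False,
        "scrapeErrorMessage": "Request timed out",
    },
]

articles, errors = split_by_success(items)
for err in errors:
    print(err["articleURL"], "->", err.get("scrapeErrorMessage", "unknown error"))
```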

Example Output

[
  {
    "articleURL": "https://www.example.com/news/article1",
    "sourceURL": "https://www.example.com",
    "articleLanguage": "en",
    "articleTitle": "Example News Article",
    "articleAuthors": "John Doe, Jane Smith",
    "articlePublishDate": "2024-07-27T10:00:00Z",
    "articleText": "This is the full text of the example article...",
    "articleTopImage": "https://www.example.com/images/article1.jpg",
    "articleAllImages": "https://www.example.com/images/article1.jpg,https://www.example.com/images/article2.png",
    "articleVideos": "",
    "articleKeywords": "news, example, article",
    "articleSummary": "A brief summary of the example article.",
    "scrapedAt": "2024-07-27T12:34:56Z",
    "scrapeSuccess": true,
    "articleMetaDescription": "Meta description of the example news article.",
    "articleMetaKeywords": "example, article, news"
  }
]
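
Several fields in the example arrive as comma-separated strings, and the dates arrive as ISO 8601 text. A small normalization step can turn them into lists and datetime objects. This is only a sketch: the helper names (split_csv, parse_iso, normalize) are hypothetical, and splitting on commas is an assumption that would mis-split any author name or URL that itself contains a comma.

```python
from datetime import datetime
from typing import Any, Optional

LIST_FIELDS = (
    "articleAuthors",
    "articleAllImages",
    "articleVideos",
    "articleKeywords",
    "articleMetaKeywords",
)
DATE_FIELDS = ("articlePublishDate", "scrapedAt")


def split_csv(value: Optional[str]) -> list[str]:
    """Split a comma-separated string, tolerating 'a, b' and 'a,b' alike."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


def parse_iso(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp; return None if missing or unparseable."""
    if not value:
        return None
    # Older Python versions reject a trailing "Z", so map it to an explicit offset.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None


def normalize(item: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the item with list and date fields converted."""
    out = dict(item)
    for field in LIST_FIELDS:
        out[field] = split_csv(item.get(field))
    for field in DATE_FIELDS:
        out[field] = parse_iso(item.get(field))
    return out


# A subset of the example item above.
raw_item = {
    "articleAuthors": "John Doe, Jane Smith",
    "articlePublishDate": "2024-07-27T10:00:00Z",
    "articleAllImages": "https://www.example.com/images/article1.jpg,https://www.example.com/images/article2.png",
    "articleVideos": "",
    "articleKeywords": "news, example, article",
    "scrapedAt": "2024-07-27T12:34:56Z",
}
normalized = normalize(raw_item)
print(normalized["articleAuthors"])
print(normalized["articlePublishDate"])
```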

Use Cases 💡

Content Marketing & SEO 📢

  • Competitor Analysis: Track all content published by competitors
  • Content Audits: Analyze an entire website's content strategy
  • Keyword Research: Identify trending topics across a whole site
  • Backlink Monitoring: Find sites linking to a news source
  • Brand Monitoring: Track how often and in what context a source mentions your brand
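
As an illustration of the keyword research use case, the articleKeywords field can be tallied across a crawl to see which topics recur. This is a rough sketch rather than a finished analysis: the function name keyword_counts is hypothetical, the sample items are invented, and counting each keyword at most once per article is a design assumption.

```python
from collections import Counter
from typing import Any


def keyword_counts(items: list[dict[str, Any]]) -> Counter:
    """Count in how many successfully scraped articles each keyword appears."""
    counts: Counter = Counter()
    for item in items:
        if not item.get("scrapeSuccess"):
            continue  # skip error items
        raw = item.get("articleKeywords") or ""
        # A set ensures a keyword is counted once per article, case-insensitively.
        keywords = {kw.strip().lower() for kw in raw.split(",") if kw.strip()}
        counts.update(keywords)
    return counts


# Invented sample data for illustration only.
sample_items = [
    {"scrapeSuccess": True, "articleKeywords": "news, example, article"},
    {"scrapeSuccess": True, "articleKeywords": "News, economy"},
    {"scrapeSuccess": False, "articleKeywords": "news"},
]

top_keywords = keyword_counts(sample_items).most_common(1)
print(top_keywords)
```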

Market Research & Business Intelligence 📊

  • News Aggregation: Build comprehensive news feeds from specific sources
  • Trend Analysis: Identify emerging trends within a news domain
  • Sentiment Analysis: Analyze the tone and sentiment of articles from a source
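
For trend analysis, one simple starting point is to count articles per publication day using articlePublishDate. The sketch below assumes the dates parse as ISO 8601, skips items where the date is missing or unparseable, and groups by the calendar date as written in the timestamp without converting time zones. The function name articles_per_day and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Any


def articles_per_day(items: list[dict[str, Any]]) -> dict[str, int]:
    """Count articles per publication date, sorted chronologically."""
    per_day: Counter = Counter()
    for item in items:
        raw = item.get("articlePublishDate")
        if not raw:
            continue  # the publish date is only present if available
        if raw.endswith("Z"):
            raw = raw[:-1] + "+00:00"
        try:
            day = datetime.fromisoformat(raw).date()
        except ValueError:
            continue  # ignore dates that do not parse
        per_day[day.isoformat()] += 1
    return dict(sorted(per_day.items()))


# Invented sample data for illustration only.
sample_dates = [
    {"articlePublishDate": "2024-07-28T09:15:00Z"},
    {"articlePublishDate": "2024-07-27T10:00:00Z"},
    {"articlePublishDate": "2024-07-27T18:30:00Z"},
    {"articlePublishDate": None},
]

daily = articles_per_day(sample_dates)
print(daily)
```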

Academic Research 🎓

  • Data Collection: Gather large datasets of articles for research
  • Text Analysis: Analyze the content of entire news websites
  • Niche Collections: Gather articles from sources that cover a specific niche

Other Applications 🌐

  • Machine Learning: Train models with large sets of scraped articles
  • Content Curation: Easily find and collect relevant articles

Getting Started 🚀

  1. Find the "News Source Crawler" in the Apify Store

  2. Configure the input:

    • url: (Required) The URL of the news website to crawl
    • language: (Optional) The expected language (default: "en")
    • maxArticles: (Optional) The maximum number of articles to scrape
    • proxyConfiguration: (Optional) Select an Apify Proxy configuration or provide custom proxies
  3. Run the Actor

  4. Access results in JSON, CSV, Excel, or other formats, directly from the dataset as the Actor runs

  5. Optional: Schedule the Actor, set up webhooks, or integrate with other Actors
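
The same steps can also be driven from code instead of the input form. The sketch below builds an input object from the four documented fields and, optionally, starts a run with the apify-client Python package. Several details are assumptions rather than facts from this page: the {"useApifyProxy": True} shape follows the common Apify proxy input convention and has not been verified against this Actor's schema, and build_input and run_crawler are hypothetical helper names.

```python
from typing import Any, Optional

ACTOR_ID = "xtech/news-source-crawler"  # Actor ID as shown on this page


def build_input(
    url: str,
    language: str = "en",
    max_articles: Optional[int] = None,
    use_apify_proxy: bool = True,
) -> dict[str, Any]:
    """Assemble the Actor input from the documented fields."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be an absolute http(s) URL")
    run_input: dict[str, Any] = {"url": url, "language": language}
    if max_articles is not None:
        if max_articles < 1:
            raise ValueError("max_articles must be a positive integer")
        run_input["maxArticles"] = max_articles
    if use_apify_proxy:
        # Assumed shape, based on the usual Apify proxy input convention.
        run_input["proxyConfiguration"] = {"useApifyProxy": True}
    return run_input


def run_crawler(token: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start a run and return its dataset items (requires `pip install apify-client`)."""
    from apify_client import ApifyClient  # imported lazily; not needed to build input

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


example_input = build_input("https://www.example.com", max_articles=50)
print(example_input)
```

Calling run_crawler("<YOUR_APIFY_TOKEN>", example_input) would then start a run and wait for it to finish, so the items are returned only once the run has ended.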

Key Benefits 🏆

Data Quality

  • ✅ Reliable & Accurate: Provides high-quality extracted data
  • ✅ Clean Data: Extracts only the relevant information
  • ✅ Structured Format: Easy to use and integrate

Platform Advantages (Apify)

  • Scalable & Serverless: Handles large crawls without infrastructure management
  • Cost-Effective: Pay only for what you use
  • Full Apify Integration: Connects seamlessly with other Apify tools
  • User-Friendly: No coding required – simple input form
  • Real-time Results: Data is pushed to the dataset as it's scraped
  • Automated Updates: The Actor is maintained and updated
  • Isolated Runs: Each run is in a fresh, isolated container

Start crawling news sources today! ➡️

Developer
Maintained by Community

Actor Metrics

  • 2 monthly users

  • No stars yet

  • >99% runs succeeded

  • Created in Feb 2025

  • Modified 3 days ago