In Depth News Scraper

3-day trial, then $5.00/month. No credit card required now.

sync-network/in-depth-news-scraper

Extract full-length articles from top news sources to streamline collecting the latest updates on any subject. The key feature is retrieving complete content, not just headlines. Customise your output from concise summaries to complete articles.

In-Depth News Scraper

The In-Depth News Scraper is an Apify actor designed to revolutionise how you gather and process news data. It stands apart from conventional scrapers by delivering complete article content rather than just headlines, enabling comprehensive analysis across diverse news categories.

Key Advantages

  • Thorough content extraction, not just headlines
  • Support for major news categories and outlets
  • Flexible search and filtering capabilities
  • Structured, analysis-ready output

Features

  • Category-Based Filtering: Focus your news gathering by targeting specific categories such as World, Business, or Technology.
  • Complete Article Extraction: Access full article content directly, surpassing the limitations of basic news aggregators.
  • Customisable Content Length: Control output size by specifying word count or retrieving complete articles.
  • Intelligent Filtering: Exclude irrelevant content using customisable keyword filters.
  • Time-Range Selection: Gather current news or research historical content with flexible time frame options.
  • Structured Data Output: Receive consistently formatted data including titles, URLs, dates, and sources.
  • Optional Image Support: Choose whether to include article images based on your requirements.

Input Parameters

The actor accepts the following configuration options:

Parameter            Type      Description
newsCategory         String    Required. Category filter (e.g., "World", "Technology")
additionalKeywords   String    Optional. Refine search within selected category
numberOfItems        Number    Number of articles to retrieve (default: 10, max: 100)
filterBadKeywords    Array     Optional. Keywords to exclude from results
contentLength        String    Content extraction mode: "Full" or "Summary" (default: "Full")
timeRange            String    Time period for article selection
retrieveImage        Boolean   Include image URLs in output (default: false)

Example configuration:

{
    "newsCategory": "Technology",
    "additionalKeywords": "artificial intelligence",
    "numberOfItems": 20,
    "filterBadKeywords": ["sponsored", "advertisement"],
    "contentLength": "Full",
    "timeRange": "Past week",
    "retrieveImage": false
}
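
A configuration of this shape can be assembled and checked before a run. The following is a minimal Python sketch, not the actor's own validation: `build_run_input` is a hypothetical helper, and the limits it enforces (1 to 100 items, "Full" or "Summary") are taken from the parameter table. The accepted `timeRange` values are not listed in this document, so the sketch passes that value through unchecked.

```python
from typing import Any, Dict, List, Optional

# Limits taken from the parameter table; the helper itself is hypothetical.
ALLOWED_CONTENT_LENGTHS = {"Full", "Summary"}
MAX_ITEMS = 100


def build_run_input(
    news_category: str,
    additional_keywords: Optional[str] = None,
    number_of_items: int = 10,
    filter_bad_keywords: Optional[List[str]] = None,
    content_length: str = "Full",
    time_range: Optional[str] = None,
    retrieve_image: bool = False,
) -> Dict[str, Any]:
    """Build an input object using the documented parameter names."""
    if not news_category.strip():
        raise ValueError("newsCategory is required")
    if not 1 <= number_of_items <= MAX_ITEMS:
        raise ValueError(f"numberOfItems must be between 1 and {MAX_ITEMS}")
    if content_length not in ALLOWED_CONTENT_LENGTHS:
        raise ValueError('contentLength must be "Full" or "Summary"')

    run_input: Dict[str, Any] = {
        "newsCategory": news_category,
        "numberOfItems": number_of_items,
        "contentLength": content_length,
        "retrieveImage": retrieve_image,
    }
    # Optional fields are only included when a value was supplied.
    if additional_keywords:
        run_input["additionalKeywords"] = additional_keywords
    if filter_bad_keywords:
        run_input["filterBadKeywords"] = list(filter_bad_keywords)
    if time_range:
        # Passed through as-is: valid values are not documented here.
        run_input["timeRange"] = time_range
    return run_input
```

Validating locally is a design choice that surfaces obvious mistakes, such as requesting more than 100 items, before a run is started.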

Supported Categories

The actor provides coverage across these primary news categories:

  • World
  • Business
  • Technology
  • Entertainment
  • Health
  • Science
  • Sports
  • Politics

Output Structure

Each article in the dataset contains the following fields:

{
    "title": "Article headline",
    "link": "Article URL",
    "pubDate": "2025-02-05T10:00:00.000Z",
    "source": "Publishing outlet name",
    "summary": "Brief article overview",
    "content": "Full article text (length based on contentLength parameter)",
    "imageUrl": "Main image URL (if retrieveImage is true)"
}
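
For downstream analysis it can help to load each item into a typed record. The sketch below is one possible mapping in Python, under two assumptions that the document does not confirm: `pubDate` is always an ISO 8601 timestamp like the one in the sample, and `imageUrl` may be missing or empty when `retrieveImage` is false. The `Article` class and `parse_article` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass
class Article:
    title: str
    link: str
    pub_date: datetime
    source: str
    summary: str
    content: str
    image_url: Optional[str] = None


def parse_article(item: Dict[str, Any]) -> Article:
    """Convert one dataset item into an Article record."""
    # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
    pub_date = datetime.fromisoformat(item["pubDate"].replace("Z", "+00:00"))
    return Article(
        title=item["title"],
        link=item["link"],
        pub_date=pub_date,
        source=item["source"],
        summary=item["summary"],
        content=item["content"],
        # Assumption: the field may be absent or empty without image retrieval.
        image_url=item.get("imageUrl") or None,
    )
```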

Implementation Guide

  1. Choose your target news category
  2. Add any specific keywords to refine results
  3. Set additional parameters as needed
  4. Execute the actor
  5. Access your structured dataset
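
The steps above can also be scripted. The following is a minimal sketch using the Apify API client for Python (`apify-client`); it is one way to run the actor, not the only one. The `run_scraper` and `main` names are hypothetical, and reading the token from an `APIFY_TOKEN` environment variable is an assumption about your setup.

```python
import os
from typing import Any, Dict, List

ACTOR_ID = "sync-network/in-depth-news-scraper"


def run_scraper(client: Any, run_input: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Start the actor, wait for it to finish, and return the dataset items."""
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return run details")
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())


def main() -> None:
    # Imported here so the helper above has no hard dependency on the package.
    from apify_client import ApifyClient

    # Assumption: the API token is stored in the APIFY_TOKEN variable.
    client = ApifyClient(os.environ["APIFY_TOKEN"])
    items = run_scraper(
        client,
        {"newsCategory": "Technology", "numberOfItems": 5},
    )
    for item in items:
        print(item.get("title"), "-", item.get("link"))
```

Calling `main()` starts a real run and counts against your Apify usage, so the sketch defines it without invoking it.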

Performance Considerations

Performance varies based on several factors:

  • Processing Duration: Typically 5-10 seconds per article for full extraction
  • Volume Handling: Efficiently processes up to 100 articles per run
  • Request Management: Sequential processing with appropriate intervals

For optimal results:

  • Request 50 items or fewer per run for faster completion
  • Use precise keywords to target relevant content
  • Consider a shorter content length (for example, contentLength "Summary") unless full text is required
  • Disable image retrieval when not essential

Note: Network conditions and source website responsiveness may affect performance.
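
As a rough planning aid, the per-article figure above can be turned into a run-time estimate. This Python sketch assumes strictly sequential processing at 5 to 10 seconds per article and ignores actor start-up time and the intervals between requests, so real runs may take longer. `estimate_run_seconds` is a hypothetical helper.

```python
from typing import Tuple

# Per-article range quoted above for full extraction, in seconds.
SECONDS_PER_ARTICLE = (5, 10)


def estimate_run_seconds(number_of_items: int) -> Tuple[int, int]:
    """Return a (low, high) estimate in seconds for a sequential run."""
    if number_of_items < 0:
        raise ValueError("number_of_items must not be negative")
    low, high = SECONDS_PER_ARTICLE
    return number_of_items * low, number_of_items * high


for count in (10, 50, 100):
    low, high = estimate_run_seconds(count)
    print(f"{count} articles: roughly {low}-{high} s")
# 10 articles: roughly 50-100 s
# 50 articles: roughly 250-500 s
# 100 articles: roughly 500-1000 s
```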

Error Handling and Troubleshooting

The actor implements comprehensive error handling:

  • Connection Issues: Automatic retry (up to 3 attempts) for failed connections
  • Rate Management: Dynamic delays between requests to prevent rate limiting
  • Content Fallback: Defaults to article summary if full content extraction fails
  • Input Validation: Clear error messages for invalid configurations

Troubleshooting Common Issues

  • Timeout Errors: Consider reducing batch size or increasing time between requests
  • Missing Content: Check if the source website requires authentication
  • Rate Limiting: The actor will automatically pause and retry; no action needed
  • Error Logs: Available in the actor's run details for debugging

For detailed error information, consult the actor's run log in the Apify Console.
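
When investigating missing content, it can be quicker to scan the dataset than to read every log line. The Python sketch below flags items whose `content` is empty or identical to `summary`. Treating identical text as a sign of the summary fallback is an assumption; the document does not say how a fallback appears in the output. `find_fallback_items` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List


def find_fallback_items(items: Iterable[Dict[str, Any]]) -> List[str]:
    """Return links of items that appear to lack full article content."""
    flagged: List[str] = []
    for item in items:
        content = (item.get("content") or "").strip()
        summary = (item.get("summary") or "").strip()
        # Assumption: a fallback leaves content empty or equal to the summary.
        if not content or content == summary:
            flagged.append(item.get("link", ""))
    return flagged
```

Flagged links are candidates for sources that require authentication, as noted in the list above. With `contentLength` set to "Summary" the check is less meaningful, because short content is then expected.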

Technical Support

For implementation assistance or to report issues:

  1. Check the actor's run log for specific error messages
  2. Review the troubleshooting section above
  3. Contact support with the actor run ID for detailed investigation

The actor continuously logs its progress and any errors encountered, facilitating quick problem resolution.

Developer
Maintained by Community

Actor Metrics

  • 2 monthly users

  • 2 bookmarks

  • >99% runs succeeded

  • Created in Feb 2025

  • Modified 16 days ago