Sports News Scraper

Under maintenance

Get the latest news for the sports categories you choose and love.

Pricing

Pay per usage

Rating

0.0 (0 ratings)

Developer

Rohan Dani

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 1
  • Monthly active users: 0
  • Last modified: 10 days ago


An Apify Actor that scrapes the latest sports news from multiple websites based on user-selected sport categories. Perfect for sports enthusiasts, journalists, and data analysts who need aggregated sports news from multiple sources.

Overview

This actor collects sports news articles from various sports websites, processes them to remove duplicates, classifies transfer news, and outputs structured data ready for analysis or integration into your applications. It uses Cheerio for efficient HTML parsing and includes robust error handling to ensure reliable data collection even when individual sources fail.

Features

  • Multi-Category Support: Scrape news for Cricket, Football, Kabaddi, Ice Hockey, Basketball, and Baseball
  • Multiple Sources: Aggregates news from various reputable sports websites
  • Transfer News Classification: Automatically identifies and classifies transfer news as rumors or confirmed
  • Custom Sources: Add your own websites with custom CSS selectors
  • Smart Deduplication: Removes duplicate articles across sources based on title similarity
  • Error Resilience: Continues scraping even if individual sources fail
  • Retry Logic: Automatic retries with exponential backoff for network errors
  • Structured Output: Clean, consistent JSON output saved to Apify dataset
  • Comprehensive Logging: Detailed logs for monitoring and debugging

Input Configuration

Required Parameters

  • categories (array): One or more sport categories to scrape
    • Options: cricket, football, kabaddi, ice-hockey, basketball, baseball
    • Example: ["cricket", "football"]

Optional Parameters

  • customWebsites (array): Add custom websites to scrape

    • Each website requires: name, url, category, and selectors
    • Example:
      {
        "name": "Custom Sports Site",
        "url": "https://example.com/sports",
        "category": "cricket",
        "selectors": {
          "article": ".article-item",
          "title": ".title",
          "link": "a",
          "date": ".date",
          "description": ".summary"
        }
      }
  • useOnlyCustomWebsites (boolean): If true, only scrape custom websites (default: false)

  • maxArticlesPerSource (integer): Maximum articles per source (default: 20, range: 1-100)
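As a rough sketch, input validation along these lines could enforce the schema above. The function name and error messages here are illustrative only; the actor's real validation is driven by INPUT_SCHEMA.json:

```javascript
// Illustrative validation of the input described above (hypothetical helper,
// not the actor's actual code).
const ALLOWED_CATEGORIES = [
  'cricket', 'football', 'kabaddi', 'ice-hockey', 'basketball', 'baseball',
];

function validateInput(input) {
  const errors = [];
  if (!Array.isArray(input.categories) || input.categories.length === 0) {
    errors.push('categories must be a non-empty array');
  } else {
    for (const c of input.categories) {
      if (!ALLOWED_CATEGORIES.includes(c)) errors.push(`unknown category: ${c}`);
    }
  }
  const max = input.maxArticlesPerSource ?? 20; // default from the schema
  if (!Number.isInteger(max) || max < 1 || max > 100) {
    errors.push('maxArticlesPerSource must be an integer between 1 and 100');
  }
  return errors;
}
```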

Output Format

Each scraped article includes:

{
  "title": "Article title",
  "url": "https://example.com/article",
  "date": "2025-11-12T10:30:00Z",
  "description": "Article summary or description",
  "source": "Source website name",
  "category": "cricket",
  "tags": ["transfer", "news"],
  "transferInfo": {
    "isTransfer": true,
    "status": "confirmed",
    "confidence": 0.9
  },
  "scrapedAt": "2025-11-12T12:00:00Z"
}
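Dataset items in this shape are easy to filter downstream. As one example, the sketch below (a hypothetical consumer-side helper, not part of the actor) keeps only confirmed transfers above a confidence threshold:

```javascript
// Hypothetical downstream helper: keep only confirmed transfer articles
// whose classifier confidence meets a threshold.
function confirmedTransfers(items, minConfidence = 0.8) {
  return items.filter((a) =>
    a.transferInfo &&
    a.transferInfo.isTransfer &&
    a.transferInfo.status === 'confirmed' &&
    a.transferInfo.confidence >= minConfidence
  );
}
```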

Usage Examples

Basic Usage - Single Category

{
  "categories": ["cricket"]
}

Multiple Categories

{
  "categories": ["cricket", "football", "kabaddi", "ice-hockey", "basketball", "baseball"]
}

With Custom Website

{
  "categories": ["cricket"],
  "customWebsites": [
    {
      "name": "My Cricket Site",
      "url": "https://mycricketsite.com/news",
      "category": "cricket",
      "selectors": {
        "article": ".news-item",
        "title": "h2",
        "link": "a",
        "date": ".publish-date"
      }
    }
  ]
}

Custom Websites Only

{
  "categories": ["football"],
  "useOnlyCustomWebsites": true,
  "customWebsites": [
    {
      "name": "My Football Source",
      "url": "https://myfootball.com/news",
      "category": "football",
      "selectors": {
        "article": ".article",
        "title": ".headline",
        "link": "a.read-more"
      }
    }
  ]
}

Transfer News Classification

The actor automatically detects and classifies transfer-related news:

  • Rumor: Articles containing keywords like "rumor", "speculation", "reported", "linked"
  • Confirmed: Articles with keywords like "confirmed", "official", "announced", "signed"
  • Unknown: Transfer news without clear classification
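A keyword-based classifier in this spirit might look like the sketch below. This is illustrative: the keywords mirror the lists above, but the actor's actual keyword sets, precedence rules, and confidence scoring may differ.

```javascript
// Keyword-based transfer classification sketch. Assumes "confirmed" keywords
// take precedence over "rumor" keywords when both appear in a title.
const RUMOR_WORDS = ['rumor', 'speculation', 'reported', 'linked'];
const CONFIRMED_WORDS = ['confirmed', 'official', 'announced', 'signed'];

function classifyTransfer(title) {
  const text = title.toLowerCase();
  if (CONFIRMED_WORDS.some((w) => text.includes(w))) return 'confirmed';
  if (RUMOR_WORDS.some((w) => text.includes(w))) return 'rumor';
  return 'unknown';
}
```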

Error Handling

  • Network errors trigger automatic retries (up to 3 attempts with exponential backoff)
  • Failed sources are logged but don't stop the entire scraping process
  • Partial results are saved even if some sources fail

Important Notes

Anti-Scraping Protection

Many sports news websites implement anti-scraping measures that can block direct requests. When running on the Apify platform, the actor benefits from Apify's infrastructure (such as its proxy services), which improves the chance of successful scraping. For best results:

  • Use Custom Websites: Add your own trusted sources with the customWebsites parameter
  • Apify Platform: The actor performs best when deployed on the Apify platform, with higher success rates than local testing
  • Alternative Sources: Some default sources may be blocked - use custom sources for reliable scraping

For production use, we recommend:

  1. Deploy the actor to the Apify platform
  2. Test with your specific custom sources
  3. Monitor success rates and adjust sources as needed
  4. Use websites that are more scraper-friendly (smaller news sites, RSS feeds, etc.)

Troubleshooting

No Results Returned

Problem: Actor completes but returns no articles.

Solutions:

  • Anti-scraping blocks: Many major sports sites block automated requests. Try using custom websites with less restrictive policies
  • Use the Apify platform: The actor has higher success rates on the Apify platform than in local testing
  • Verify that the selected categories have configured sources
  • Check that custom website URLs are accessible and not blocked
  • Review actor logs for specific error messages or network failures (403, 429 errors indicate blocking)
  • Ensure useOnlyCustomWebsites is not set to true without providing custom websites
  • Test custom selectors on the target website to ensure they match elements

Missing Data Fields

Problem: Some articles are missing date, description, or other fields.

Explanation: Not all sources provide all fields. The actor handles missing fields gracefully by setting them to null.

Solutions:

  • This is expected behavior - filter results in your application if needed
  • For custom websites, verify selectors are correctly targeting the desired elements
  • Check if the source website actually provides the missing information

Parsing Errors

Problem: Actor logs show parsing errors for specific sources.

Solutions:

  • Website structure changes may break selectors - this is common with web scraping
  • For default sources, check if there's an updated version of the actor
  • For custom sources, inspect the website HTML and update your selectors
  • Use browser DevTools to test CSS selectors before adding them to configuration

Network Timeouts

Problem: Actor fails with timeout errors.

Solutions:

  • Some websites may be slow or temporarily unavailable
  • The actor automatically retries failed requests up to 3 times
  • Consider increasing the actor's timeout setting in Apify Console
  • Check if the website is blocking automated requests

Memory Limit Exceeded

Problem: Actor fails with out-of-memory error.

Solutions:

  • Reduce maxArticlesPerSource to limit memory usage
  • Scrape fewer categories in a single run
  • Increase memory allocation in Apify Console (recommended: 512MB or higher)

Rate Limiting / Blocked Requests

Problem: Websites return 429 or 403 errors.

Solutions:

  • Some websites may block automated requests
  • The actor includes retry logic with exponential backoff
  • Consider using Apify's proxy services for better success rates
  • Reduce the number of concurrent requests if scraping many sources

Performance Tips

  • Optimize Article Limits: Set maxArticlesPerSource to a reasonable value (10-30) for faster runs
  • Select Specific Categories: Only scrape categories you need to reduce runtime
  • Use Scheduling: Schedule regular runs to keep data fresh without manual intervention
  • Monitor Success Rates: Check logs to identify consistently failing sources

Development

Local Testing

# Install dependencies
npm install

# Run locally with Apify CLI
apify run

# Or with Node.js directly
npm start

# Run tests
npm test

Project Structure

apify-sports-news-scraper/
├── .actor/
│   └── actor.json        # Actor metadata and configuration
├── src/
│   ├── main.js           # Entry point
│   ├── config.js         # Source configuration
│   ├── scraper.js        # Scraping logic
│   ├── classifier.js     # Transfer classification
│   ├── processor.js      # Data processing
│   └── utils.js          # Utility functions
├── test/                 # Test files
├── INPUT_SCHEMA.json     # Input validation schema
├── Dockerfile            # Docker configuration
├── package.json          # Dependencies
└── README.md             # This file

Technical Details

Dependencies

  • apify: Apify SDK for platform integration
  • cheerio: Fast HTML parsing
  • axios: HTTP client with retry support

Retry Strategy

Network requests are automatically retried up to 3 times with exponential backoff:

  • 1st retry: 1 second delay
  • 2nd retry: 2 seconds delay
  • 3rd retry: 4 seconds delay
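That schedule corresponds to a delay of 1000 × 2^(attempt − 1) milliseconds. A retry wrapper in this spirit could be sketched as follows; this is an illustration of the strategy, not the actor's exact code:

```javascript
// Delay before the Nth retry: 1s, 2s, 4s (matches the schedule above).
function backoffDelayMs(attempt) {
  return 1000 * 2 ** (attempt - 1);
}

// Illustrative retry wrapper: rethrows after maxRetries failed retries.
async function withRetries(fn, maxRetries = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt > maxRetries) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```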

Deduplication Algorithm

Articles are deduplicated based on title similarity. Articles with very similar titles (from different sources) are merged, keeping the first occurrence and combining source information.
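A simplified version of such deduplication, using exact matching on normalized titles (the actor may use a fuzzier similarity measure), could look like:

```javascript
// Sketch of title-based dedup: normalize each title, keep the first
// occurrence, and merge source names from later duplicates.
function dedupeByTitle(articles) {
  const seen = new Map();
  for (const a of articles) {
    // Lowercase and collapse punctuation/whitespace to a single space.
    const key = a.title.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
    if (seen.has(key)) {
      const kept = seen.get(key);
      if (!kept.sources.includes(a.source)) kept.sources.push(a.source);
    } else {
      seen.set(key, { ...a, sources: [a.source] });
    }
  }
  return [...seen.values()];
}
```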

Limitations

  • The actor scrapes publicly available news only
  • Some websites may block automated scraping
  • Selector-based scraping may break if websites change their HTML structure
  • Transfer classification is keyword-based and may not be 100% accurate

Support

For issues, feature requests, or questions:

  • Check the troubleshooting section above
  • Review actor logs in Apify Console for detailed error messages
  • Submit feedback through the Apify platform
  • Contact the actor maintainer for custom source configurations

License

ISC