Sports News Scraper

Under maintenance

Get the latest news for the sports categories you choose and love.

Pricing

Pay per usage

Rating

0.0 (0 ratings)

Developer

Rohan Dani

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 1
  • Monthly active users: 0
  • Last modified: 10 days ago


An Apify Actor that scrapes the latest sports news from multiple websites based on user-selected sport categories. Perfect for sports enthusiasts, journalists, and data analysts who need aggregated sports news from multiple sources.

Overview

This actor collects sports news articles from various sports websites, processes them to remove duplicates, classifies transfer news, and outputs structured data ready for analysis or integration into your applications. It uses Cheerio for efficient HTML parsing and includes robust error handling to ensure reliable data collection even when individual sources fail.

Features

  • Multi-Category Support: Scrape news for Cricket, Football, Kabaddi, Ice Hockey, Basketball, and Baseball
  • Multiple Sources: Aggregates news from various reputable sports websites
  • Transfer News Classification: Automatically identifies and classifies transfer news as rumors or confirmed
  • Custom Sources: Add your own websites with custom CSS selectors
  • Smart Deduplication: Removes duplicate articles across sources based on title similarity
  • Error Resilience: Continues scraping even if individual sources fail
  • Retry Logic: Automatic retries with exponential backoff for network errors
  • Structured Output: Clean, consistent JSON output saved to Apify dataset
  • Comprehensive Logging: Detailed logs for monitoring and debugging

Input Configuration

Required Parameters

  • categories (array): One or more sport categories to scrape
    • Options: cricket, football, kabaddi, ice-hockey, basketball, baseball
    • Example: ["cricket", "football"]

Optional Parameters

  • customWebsites (array): Add custom websites to scrape

    • Each website requires: name, url, category, and selectors
    • Example:
      {
        "name": "Custom Sports Site",
        "url": "https://example.com/sports",
        "category": "cricket",
        "selectors": {
          "article": ".article-item",
          "title": ".title",
          "link": "a",
          "date": ".date",
          "description": ".summary"
        }
      }
  • useOnlyCustomWebsites (boolean): If true, only scrape custom websites (default: false)

  • maxArticlesPerSource (integer): Maximum articles per source (default: 20, range: 1-100)
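As a rough sketch, input validation along these lines could enforce the schema above. The function name and error messages here are illustrative only; the actor's real validation is driven by INPUT_SCHEMA.json:

```javascript
// Illustrative validation of the input described above (hypothetical helper,
// not the actor's actual code).
const ALLOWED_CATEGORIES = [
  'cricket', 'football', 'kabaddi', 'ice-hockey', 'basketball', 'baseball',
];

function validateInput(input) {
  const errors = [];
  if (!Array.isArray(input.categories) || input.categories.length === 0) {
    errors.push('categories must be a non-empty array');
  } else {
    for (const c of input.categories) {
      if (!ALLOWED_CATEGORIES.includes(c)) errors.push(`unknown category: ${c}`);
    }
  }
  const max = input.maxArticlesPerSource ?? 20; // default from the schema
  if (!Number.isInteger(max) || max < 1 || max > 100) {
    errors.push('maxArticlesPerSource must be an integer between 1 and 100');
  }
  return errors;
}
```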

Output Format

Each scraped article includes:

{
  "title": "Article title",
  "url": "https://example.com/article",
  "date": "2025-11-12T10:30:00Z",
  "description": "Article summary or description",
  "source": "Source website name",
  "category": "cricket",
  "tags": ["transfer", "news"],
  "transferInfo": {
    "isTransfer": true,
    "status": "confirmed",
    "confidence": 0.9
  },
  "scrapedAt": "2025-11-12T12:00:00Z"
}
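Dataset items in this shape are easy to filter downstream. As one example, the sketch below (a hypothetical consumer-side helper, not part of the actor) keeps only confirmed transfers above a confidence threshold:

```javascript
// Hypothetical downstream helper: keep only confirmed transfer articles
// whose classifier confidence meets a threshold.
function confirmedTransfers(items, minConfidence = 0.8) {
  return items.filter((a) =>
    a.transferInfo &&
    a.transferInfo.isTransfer &&
    a.transferInfo.status === 'confirmed' &&
    a.transferInfo.confidence >= minConfidence
  );
}
```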

Usage Examples

Basic Usage - Single Category

{
  "categories": ["cricket"]
}

Multiple Categories

{
  "categories": ["cricket", "football", "kabaddi", "ice-hockey", "basketball", "baseball"]
}

With Custom Website

{
  "categories": ["cricket"],
  "customWebsites": [
    {
      "name": "My Cricket Site",
      "url": "https://mycricketsite.com/news",
      "category": "cricket",
      "selectors": {
        "article": ".news-item",
        "title": "h2",
        "link": "a",
        "date": ".publish-date"
      }
    }
  ]
}

Custom Websites Only

{
  "categories": ["football"],
  "useOnlyCustomWebsites": true,
  "customWebsites": [
    {
      "name": "My Football Source",
      "url": "https://myfootball.com/news",
      "category": "football",
      "selectors": {
        "article": ".article",
        "title": ".headline",
        "link": "a.read-more"
      }
    }
  ]
}

Transfer News Classification

The actor automatically detects and classifies transfer-related news:

  • Rumor: Articles containing keywords like "rumor", "speculation", "reported", "linked"
  • Confirmed: Articles with keywords like "confirmed", "official", "announced", "signed"
  • Unknown: Transfer news without clear classification
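A keyword-based classifier in this spirit might look like the sketch below. This is illustrative: the keywords mirror the lists above, but the actor's actual keyword sets, precedence rules, and confidence scoring may differ.

```javascript
// Keyword-based transfer classification sketch. Assumes "confirmed" keywords
// take precedence over "rumor" keywords when both appear in a title.
const RUMOR_WORDS = ['rumor', 'speculation', 'reported', 'linked'];
const CONFIRMED_WORDS = ['confirmed', 'official', 'announced', 'signed'];

function classifyTransfer(title) {
  const text = title.toLowerCase();
  if (CONFIRMED_WORDS.some((w) => text.includes(w))) return 'confirmed';
  if (RUMOR_WORDS.some((w) => text.includes(w))) return 'rumor';
  return 'unknown';
}
```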

Error Handling

  • Network errors trigger automatic retries (up to 3 attempts with exponential backoff)
  • Failed sources are logged but don't stop the entire scraping process
  • Partial results are saved even if some sources fail

Important Notes

Anti-Scraping Protection

Many sports news websites implement anti-scraping measures that can block direct requests. When running on the Apify platform, the actor benefits from Apify's infrastructure (such as its proxy services), which improves the chance of successful scraping. For best results:

  • Use Custom Websites: Add your own trusted sources with the customWebsites parameter
  • Apify Platform: The actor performs best when deployed on the Apify platform, with higher success rates than local testing
  • Alternative Sources: Some default sources may be blocked - use custom sources for reliable scraping

For production use, we recommend:

  1. Deploy the actor to the Apify platform
  2. Test with your specific custom sources
  3. Monitor success rates and adjust sources as needed
  4. Use websites that are more scraper-friendly (smaller news sites, RSS feeds, etc.)

Troubleshooting

No Results Returned

Problem: Actor completes but returns no articles.

Solutions:

  • Anti-scraping blocks: Many major sports sites block automated requests. Try using custom websites with less restrictive policies
  • Use the Apify platform: The actor has higher success rates on the Apify platform than in local testing
  • Verify that the selected categories have configured sources
  • Check that custom website URLs are accessible and not blocked
  • Review actor logs for specific error messages or network failures (403, 429 errors indicate blocking)
  • Ensure useOnlyCustomWebsites is not set to true without providing custom websites
  • Test custom selectors on the target website to ensure they match elements

Missing Data Fields

Problem: Some articles are missing date, description, or other fields.

Explanation: Not all sources provide all fields. The actor handles missing fields gracefully by setting them to null.

Solutions:

  • This is expected behavior - filter results in your application if needed
  • For custom websites, verify selectors are correctly targeting the desired elements
  • Check if the source website actually provides the missing information

Parsing Errors

Problem: Actor logs show parsing errors for specific sources.

Solutions:

  • Website structure changes may break selectors - this is common with web scraping
  • For default sources, check if there's an updated version of the actor
  • For custom sources, inspect the website HTML and update your selectors
  • Use browser DevTools to test CSS selectors before adding them to configuration

Network Timeouts

Problem: Actor fails with timeout errors.

Solutions:

  • Some websites may be slow or temporarily unavailable
  • The actor automatically retries failed requests up to 3 times
  • Consider increasing the actor's timeout setting in Apify Console
  • Check if the website is blocking automated requests

Memory Limit Exceeded

Problem: Actor fails with out-of-memory error.

Solutions:

  • Reduce maxArticlesPerSource to limit memory usage
  • Scrape fewer categories in a single run
  • Increase memory allocation in Apify Console (recommended: 512MB or higher)

Rate Limiting / Blocked Requests

Problem: Websites return 429 or 403 errors.

Solutions:

  • Some websites may block automated requests
  • The actor includes retry logic with exponential backoff
  • Consider using Apify's proxy services for better success rates
  • Reduce the number of concurrent requests if scraping many sources

Performance Tips

  • Optimize Article Limits: Set maxArticlesPerSource to a reasonable value (10-30) for faster runs
  • Select Specific Categories: Only scrape categories you need to reduce runtime
  • Use Scheduling: Schedule regular runs to keep data fresh without manual intervention
  • Monitor Success Rates: Check logs to identify consistently failing sources

Development

Local Testing

# Install dependencies
npm install

# Run locally with Apify CLI
apify run

# Or with Node.js directly
npm start

# Run tests
npm test

Project Structure

apify-sports-news-scraper/
├── .actor/
│   └── actor.json        # Actor metadata and configuration
├── src/
│   ├── main.js           # Entry point
│   ├── config.js         # Source configuration
│   ├── scraper.js        # Scraping logic
│   ├── classifier.js     # Transfer classification
│   ├── processor.js      # Data processing
│   └── utils.js          # Utility functions
├── test/                 # Test files
├── INPUT_SCHEMA.json     # Input validation schema
├── Dockerfile            # Docker configuration
├── package.json          # Dependencies
└── README.md             # This file

Technical Details

Dependencies

  • apify: Apify SDK for platform integration
  • cheerio: Fast HTML parsing
  • axios: HTTP client with retry support

Retry Strategy

Network requests are automatically retried up to 3 times with exponential backoff:

  • 1st retry: 1 second delay
  • 2nd retry: 2 seconds delay
  • 3rd retry: 4 seconds delay
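That schedule corresponds to a delay of 1000 × 2^(attempt − 1) milliseconds. A retry wrapper in this spirit could be sketched as follows; this is an illustration of the strategy, not the actor's exact code:

```javascript
// Delay before the Nth retry: 1s, 2s, 4s (matches the schedule above).
function backoffDelayMs(attempt) {
  return 1000 * 2 ** (attempt - 1);
}

// Illustrative retry wrapper: rethrows after maxRetries failed retries.
async function withRetries(fn, maxRetries = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt > maxRetries) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```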

Deduplication Algorithm

Articles are deduplicated based on title similarity. Articles with very similar titles (from different sources) are merged, keeping the first occurrence and combining source information.
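A simplified version of such deduplication, using exact matching on normalized titles (the actor may use a fuzzier similarity measure), could look like:

```javascript
// Sketch of title-based dedup: normalize each title, keep the first
// occurrence, and merge source names from later duplicates.
function dedupeByTitle(articles) {
  const seen = new Map();
  for (const a of articles) {
    // Lowercase and collapse punctuation/whitespace to a single space.
    const key = a.title.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
    if (seen.has(key)) {
      const kept = seen.get(key);
      if (!kept.sources.includes(a.source)) kept.sources.push(a.source);
    } else {
      seen.set(key, { ...a, sources: [a.source] });
    }
  }
  return [...seen.values()];
}
```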

Limitations

  • The actor scrapes publicly available news only
  • Some websites may block automated scraping
  • Selector-based scraping may break if websites change their HTML structure
  • Transfer classification is keyword-based and may not be 100% accurate

Support

For issues, feature requests, or questions:

  • Check the troubleshooting section above
  • Review actor logs in Apify Console for detailed error messages
  • Submit feedback through the Apify platform
  • Contact the actor maintainer for custom source configurations

License

ISC