Sports News Scraper
Get the latest news for the sports categories you choose.
Developer: Rohan Dani
An Apify Actor that scrapes the latest sports news from multiple websites based on user-selected sport categories. Perfect for sports enthusiasts, journalists, and data analysts who need aggregated sports news from multiple sources.
Overview
This actor collects sports news articles from various sports websites, processes them to remove duplicates, classifies transfer news, and outputs structured data ready for analysis or integration into your applications. It uses Cheerio for efficient HTML parsing and includes robust error handling to ensure reliable data collection even when individual sources fail.
Features
- Multi-Category Support: Scrape news for Cricket, Football, Kabaddi, Ice Hockey, Basketball, and Baseball
- Multiple Sources: Aggregates news from various reputable sports websites
- Transfer News Classification: Automatically identifies and classifies transfer news as rumors or confirmed
- Custom Sources: Add your own websites with custom CSS selectors
- Smart Deduplication: Removes duplicate articles across sources based on title similarity
- Error Resilience: Continues scraping even if individual sources fail
- Retry Logic: Automatic retries with exponential backoff for network errors
- Structured Output: Clean, consistent JSON output saved to Apify dataset
- Comprehensive Logging: Detailed logs for monitoring and debugging
Input Configuration
Required Parameters
- categories (array): One or more sport categories to scrape
  - Options: cricket, football, kabaddi, ice-hockey, basketball, baseball
  - Example: ["cricket", "football"]
Optional Parameters
- customWebsites (array): Custom websites to scrape
  - Each website requires: name, url, category, and selectors
  - Example:
    ```json
    {
      "name": "Custom Sports Site",
      "url": "https://example.com/sports",
      "category": "cricket",
      "selectors": {
        "article": ".article-item",
        "title": ".title",
        "link": "a",
        "date": ".date",
        "description": ".summary"
      }
    }
    ```
- useOnlyCustomWebsites (boolean): If true, only custom websites are scraped (default: false)
- maxArticlesPerSource (integer): Maximum articles per source (default: 20, range: 1-100)
Output Format
Each scraped article includes:
```json
{
  "title": "Article title",
  "url": "https://example.com/article",
  "date": "2025-11-12T10:30:00Z",
  "description": "Article summary or description",
  "source": "Source website name",
  "category": "cricket",
  "tags": ["transfer", "news"],
  "transferInfo": {
    "isTransfer": true,
    "status": "confirmed",
    "confidence": 0.9
  },
  "scrapedAt": "2025-11-12T12:00:00Z"
}
```
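When consuming the dataset downstream, it can help to sanity-check each item against the format above. A minimal sketch (the `validateArticle` helper is hypothetical, written for consumers of the output, not part of the actor itself):

```javascript
// Shape check for scraped articles; the field names follow the output
// format documented above. Optional fields (date, description) may be
// null by design, so only the always-present fields are required here.
function validateArticle(article) {
  const required = ['title', 'url', 'source', 'category', 'scrapedAt'];
  const missing = required.filter((key) => article[key] == null);
  return { ok: missing.length === 0, missing };
}

const sample = {
  title: 'Article title',
  url: 'https://example.com/article',
  date: null, // missing fields are set to null by the actor
  description: 'Article summary or description',
  source: 'Source website name',
  category: 'cricket',
  scrapedAt: '2025-11-12T12:00:00Z',
};

console.log(validateArticle(sample)); // { ok: true, missing: [] }
```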
Usage Examples
Basic Usage - Single Category
{"categories": ["cricket"]}
Multiple Categories
{"categories": ["cricket", "football", "kabaddi", "ice-hockey", "basketball", "baseball"]}
With Custom Website
```json
{
  "categories": ["cricket"],
  "customWebsites": [
    {
      "name": "My Cricket Site",
      "url": "https://mycricketsite.com/news",
      "category": "cricket",
      "selectors": {
        "article": ".news-item",
        "title": "h2",
        "link": "a",
        "date": ".publish-date"
      }
    }
  ]
}
```
Custom Websites Only
```json
{
  "categories": ["football"],
  "useOnlyCustomWebsites": true,
  "customWebsites": [
    {
      "name": "My Football Source",
      "url": "https://myfootball.com/news",
      "category": "football",
      "selectors": {
        "article": ".article",
        "title": ".headline",
        "link": "a.read-more"
      }
    }
  ]
}
```
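Before launching a run, it can save time to validate custom website entries against the required fields listed in the input configuration. A hypothetical pre-flight check (not part of the actor; the minimum-selector rule here is an assumption based on the examples above):

```javascript
// Pre-flight validation for a customWebsites entry. Assumes "article"
// and "title" are the minimum useful selectors, as in the examples
// above; "date" and "description" selectors are optional.
function checkCustomWebsite(site) {
  const errors = [];
  for (const field of ['name', 'url', 'category']) {
    if (typeof site[field] !== 'string' || site[field] === '') {
      errors.push(`missing or empty "${field}"`);
    }
  }
  if (!site.selectors || !site.selectors.article || !site.selectors.title) {
    errors.push('selectors must include at least "article" and "title"');
  }
  return errors;
}

const site = {
  name: 'My Football Source',
  url: 'https://myfootball.com/news',
  category: 'football',
  selectors: { article: '.article', title: '.headline', link: 'a.read-more' },
};

console.log(checkCustomWebsite(site)); // []
```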
Transfer News Classification
The actor automatically detects and classifies transfer-related news:
- Rumor: Articles containing keywords like "rumor", "speculation", "reported", "linked"
- Confirmed: Articles with keywords like "confirmed", "official", "announced", "signed"
- Unknown: Transfer news without clear classification
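The classification rules above can be sketched as a keyword matcher. This is illustrative only; the actor's actual keyword lists and confidence scoring live in `src/classifier.js` and may differ:

```javascript
// Keyword-based transfer classification, mirroring the rules described
// above: confirmed keywords win over rumor keywords, and transfer news
// without either is classified as "unknown".
const CONFIRMED = ['confirmed', 'official', 'announced', 'signed'];
const RUMOR = ['rumor', 'speculation', 'reported', 'linked'];

function classifyTransfer(title) {
  const text = title.toLowerCase();
  const isTransfer =
    /transfer|signs?|deal|move/.test(text) ||
    CONFIRMED.some((k) => text.includes(k)) ||
    RUMOR.some((k) => text.includes(k));
  if (!isTransfer) return { isTransfer: false, status: null };
  if (CONFIRMED.some((k) => text.includes(k))) {
    return { isTransfer: true, status: 'confirmed' };
  }
  if (RUMOR.some((k) => text.includes(k))) {
    return { isTransfer: true, status: 'rumor' };
  }
  return { isTransfer: true, status: 'unknown' };
}

console.log(classifyTransfer('Club officially announced new signing'));
// { isTransfer: true, status: 'confirmed' }
```

As noted under Limitations, keyword matching is inherently approximate: a headline like "Transfer confirmed as rumors end" matches both lists, and the ordering above decides the tie.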
Error Handling
- Network errors trigger automatic retries (up to 3 attempts with exponential backoff)
- Failed sources are logged but don't stop the entire scraping process
- Partial results are saved even if some sources fail
Important Notes
Anti-Scraping Protection
Many sports news websites implement anti-scraping measures that may block direct requests. Running on the Apify platform gives the actor access to Apify's infrastructure, which improves success rates compared with local runs. For best results:
- Use Custom Websites: Add your own trusted sources with the customWebsites parameter
- Apify Platform: The actor works best when deployed on the Apify platform (better success rates than local testing)
- Alternative Sources: Some default sources may be blocked - use custom sources for reliable scraping
Recommended Approach
For production use, we recommend:
- Deploy the actor to Apify platform
- Test with your specific custom sources
- Monitor success rates and adjust sources as needed
- Use websites that are more scraper-friendly (smaller news sites, RSS feeds, etc.)
Troubleshooting
No Results Returned
Problem: Actor completes but returns no articles.
Solutions:
- Anti-scraping blocks: Many major sports sites block automated requests. Try using custom websites with less restrictive policies
- Use Apify platform: The actor has better success rates on Apify platform than local testing
- Verify that the selected categories have configured sources
- Check that custom website URLs are accessible and not blocked
- Review actor logs for specific error messages or network failures (403, 429 errors indicate blocking)
- Ensure useOnlyCustomWebsites is not set to true without providing custom websites
- Test custom selectors on the target website to ensure they match elements
Missing Data Fields
Problem: Some articles are missing date, description, or other fields.
Explanation: Not all sources provide all fields. The actor handles missing fields gracefully by setting them to null.
Solutions:
- This is expected behavior - filter results in your application if needed
- For custom websites, verify selectors are correctly targeting the desired elements
- Check if the source website actually provides the missing information
Parsing Errors
Problem: Actor logs show parsing errors for specific sources.
Solutions:
- Website structure changes may break selectors - this is common with web scraping
- For default sources, check if there's an updated version of the actor
- For custom sources, inspect the website HTML and update your selectors
- Use browser DevTools to test CSS selectors before adding them to configuration
Network Timeouts
Problem: Actor fails with timeout errors.
Solutions:
- Some websites may be slow or temporarily unavailable
- The actor automatically retries failed requests up to 3 times
- Consider increasing the actor's timeout setting in Apify Console
- Check if the website is blocking automated requests
Memory Limit Exceeded
Problem: Actor fails with out-of-memory error.
Solutions:
- Reduce maxArticlesPerSource to limit memory usage
- Scrape fewer categories in a single run
- Increase memory allocation in Apify Console (recommended: 512MB or higher)
Rate Limiting / Blocked Requests
Problem: Websites return 429 or 403 errors.
Solutions:
- Some websites may block automated requests
- The actor includes retry logic with exponential backoff
- Consider using Apify's proxy services for better success rates
- Reduce the number of concurrent requests if scraping many sources
Performance Tips
- Optimize Article Limits: Set maxArticlesPerSource to a reasonable value (10-30) for faster runs
- Select Specific Categories: Only scrape categories you need to reduce runtime
- Use Scheduling: Schedule regular runs to keep data fresh without manual intervention
- Monitor Success Rates: Check logs to identify consistently failing sources
Development
Local Testing
```bash
# Install dependencies
npm install

# Run locally with Apify CLI
apify run

# Or with Node.js directly
npm start

# Run tests
npm test
```
Project Structure
```
apify-sports-news-scraper/
├── .actor/
│   └── actor.json        # Actor metadata and configuration
├── src/
│   ├── main.js           # Entry point
│   ├── config.js         # Source configuration
│   ├── scraper.js        # Scraping logic
│   ├── classifier.js     # Transfer classification
│   ├── processor.js      # Data processing
│   └── utils.js          # Utility functions
├── test/                 # Test files
├── INPUT_SCHEMA.json     # Input validation schema
├── Dockerfile            # Docker configuration
├── package.json          # Dependencies
└── README.md             # This file
```
Technical Details
Dependencies
- apify: Apify SDK for platform integration
- cheerio: Fast HTML parsing
- axios: HTTP client with retry support
Retry Strategy
Network requests are automatically retried up to 3 times with exponential backoff:
- 1st retry: 1 second delay
- 2nd retry: 2 seconds delay
- 3rd retry: 4 seconds delay
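The schedule above doubles the delay on each attempt. A sketch of the pattern (illustrative; the actor's actual retry wrapper in `src/utils.js` may differ in details):

```javascript
// Exponential backoff matching the delays listed above:
// attempt 1 -> 1000 ms, attempt 2 -> 2000 ms, attempt 3 -> 4000 ms.
const backoffDelayMs = (attempt) => 1000 * 2 ** (attempt - 1);

// Generic retry wrapper: retry up to maxRetries times, sleeping
// baseDelayMs * 2^attempt between failures, then rethrow.
async function withRetry(fn, maxRetries = 3, baseDelayMs = 1000) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

console.log([1, 2, 3].map(backoffDelayMs)); // [ 1000, 2000, 4000 ]
```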
Deduplication Algorithm
Articles are deduplicated based on title similarity. Articles with very similar titles (from different sources) are merged, keeping the first occurrence and combining source information.
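One common way to implement "very similar titles" is token-overlap (Jaccard) similarity over normalized titles. The actor's exact metric and threshold are internal, so the 0.8 threshold below is an assumed value for illustration:

```javascript
// Normalize a title into a set of lowercase word tokens.
const tokens = (title) =>
  new Set(title.toLowerCase().replace(/[^a-z0-9\s]/g, '').split(/\s+/).filter(Boolean));

// Jaccard similarity: shared tokens / total distinct tokens.
function titleSimilarity(a, b) {
  const ta = tokens(a);
  const tb = tokens(b);
  const common = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : common / union;
}

// Keep the first occurrence; merge source info from near-duplicates.
// The 0.8 threshold is an assumption, not the actor's actual value.
function deduplicate(articles, threshold = 0.8) {
  const kept = [];
  for (const article of articles) {
    const dup = kept.find((k) => titleSimilarity(k.title, article.title) >= threshold);
    if (dup) {
      dup.sources = [...new Set([...(dup.sources ?? [dup.source]), article.source])];
    } else {
      kept.push({ ...article });
    }
  }
  return kept;
}
```

With this approach, "Star striker signs new deal" and "Star Striker Signs New Deal!" collapse into one article whose `sources` lists both origins.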
Limitations
- The actor scrapes publicly available news only
- Some websites may block automated scraping
- Selector-based scraping may break if websites change their HTML structure
- Transfer classification is keyword-based and may not be 100% accurate
Support
For issues, feature requests, or questions:
- Check the troubleshooting section above
- Review actor logs in Apify Console for detailed error messages
- Submit feedback through the Apify platform
- Contact the actor maintainer for custom source configurations
License
ISC


