News scraper that collects up to 10K news articles from over 4,500 news sources in less than 20 minutes. It covers more than 20 categories, e.g. Crypto News, World News, Latest News, Celebrities News, and a lot more.
You can get news from websites like Fox News, BBC News, and CNN News, as well as crypto and cryptocurrency news sources.
This scraper is still in Beta mode, so you might find some bugs or issues. Please don't hesitate to report them by opening an issue, or contact me through my GitHub.
New features added
Allow specific agencies, or even a single agency, so the scraper collects only from them (if found).
New Date Range feature. It's way better now; its documentation is further down this page.
Better code and speed improvements; you can scrape way more historical data now.
API Mode is faster. You can use it to get the raw HTML and extract the text on your own instead of having the scraper do it (this will lower your costs a lot).
Better logging system for detecting bugs and errors.
Better and more accurate data collection.
More scrapers added to collect a lot more data from the web.
How to use the v2.0 Beta new features
Date Range
You can enter a date range, enter today to get today's news, or enter all to get articles from any date (the scraper can reach articles that are months old).
today = articles dated today
yesterday = articles dated yesterday
To scrape a specific date range, use / to separate the two dates (only two dates are counted).
Dates are always in the format YYYY-MM-DD.
Example: 2021-12-31/2022-01-01
To scrape a specific date, enter it like this: 2021-12-31. All article dates will be in this format.
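The accepted values above can be illustrated with a small parser. This is only a sketch of how such input could be interpreted, not the scraper's actual code: the function name `parse_date_range` is hypothetical, and sorting the two dates so the earlier one comes first is an assumption.

```python
from datetime import date, timedelta


def parse_date_range(value, today=None):
    """Return (start, end) dates, or None when the value is 'all' (no date filter).

    Hypothetical helper: it mirrors the documented input formats only.
    """
    today = today or date.today()
    text = value.strip().lower()
    if text == "all":
        return None
    if text == "today":
        return today, today
    if text == "yesterday":
        day = today - timedelta(days=1)
        return day, day
    # Split on "/" and keep at most two dates ("only two dates are counted").
    parts = [part.strip() for part in text.split("/") if part.strip()][:2]
    if not parts:
        raise ValueError(f"empty date range: {value!r}")
    # Dates must be in YYYY-MM-DD format; fromisoformat raises ValueError otherwise.
    dates = [date.fromisoformat(part) for part in parts]
    # Assumption: order does not matter, so the earlier date becomes the start.
    return min(dates), max(dates)


if __name__ == "__main__":
    print(parse_date_range("2021-12-31/2022-01-01"))
    print(parse_date_range("2021-12-31"))
```

A single date is treated as a range that starts and ends on the same day.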
This version also supports relative formats (1s, 2m, 3h, 1d, 1w, 5M, 2y), so you can use them to scrape articles by age. Note that lowercase m means minutes and uppercase M means months.
1s for 1-second-old articles
1m for 1-minute-old articles
1h for 1-hour-old articles
1d for 1-day-old articles
1w for 1-week-old articles
1M for 1-month-old articles
1y for 1-year-old articles
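To make the relative formats concrete, here is a minimal sketch that turns a value such as 3h or 5M into a time span. It is not the scraper's own parser: the name `parse_age` is hypothetical, and counting a month as 30 days and a year as 365 days is an assumption.

```python
import re
from datetime import timedelta

# Seconds per unit. "m" is minutes and "M" is months, so matching is case-sensitive.
# Assumption: a month is 30 days and a year is 365 days.
_UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 7 * 86400,
    "M": 30 * 86400,
    "y": 365 * 86400,
}


def parse_age(value):
    """Convert a relative value like '2m' or '1w' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhdwMy])", value.strip())
    if match is None:
        raise ValueError(f"unsupported relative format: {value!r}")
    amount, unit = match.groups()
    return timedelta(seconds=int(amount) * _UNIT_SECONDS[unit])


if __name__ == "__main__":
    for example in ("1s", "2m", "3h", "1d", "1w", "5M", "2y"):
        print(example, "->", parse_age(example))
```

Subtracting the result from the current time gives a cutoff timestamp that article dates can be compared against.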
Maximum articles amount
Set this amount to cap the scraper's output. The scraper collects around that number rather than exactly that number, so it might find fewer articles than the value you set.
API mode
You can lower the scraper's costs by scraping the news as HTML instead of letting the scraper extract the text; you can then extract the text yourself.
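If you go this route, the text extraction happens on your side. The sketch below shows one simple way to do it with Python's standard library only, by collecting the text inside paragraph tags. The class and function names are hypothetical, and how the raw HTML reaches you in the scraper's output is not covered here; the function just takes an HTML string.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace and drop empty paragraphs.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_text(html):
    """Return the article text as one paragraph per line."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello <b>world</b>.</p><p> Second </p></body></html>"
    print(extract_text(sample))
```

Real news pages are messier than this sample, so a dedicated extraction library may give cleaner results; the point is only that the raw HTML can be processed outside the scraper.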
Exclude specific agencies
Set this to exclude any agency from being scraped. Separate agencies with commas, like this: bbc, fox
This feature works like this: if agency in exclude_agencies.split(','): skip article. So please be careful with this setting.
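One reason to be careful: a plain split(',') keeps the space after each comma, so "bbc, fox" becomes ['bbc', ' fox']. The sketch below shows the same membership check with names trimmed and lowercased first. Whether the scraper itself normalizes names this way is not stated, so treat this as an illustration only; the `agency` key and both function names are hypothetical.

```python
def parse_agencies(raw):
    """Split a comma-separated setting into a set of trimmed, lowercase names."""
    return {name.strip().lower() for name in raw.split(",") if name.strip()}


def filter_excluded(articles, exclude_agencies):
    """Keep only the articles whose agency is not in the exclude list.

    Assumption: each article is a dict with an "agency" key.
    """
    excluded = parse_agencies(exclude_agencies)
    return [
        article
        for article in articles
        if article["agency"].strip().lower() not in excluded
    ]


if __name__ == "__main__":
    articles = [
        {"agency": "bbc", "title": "A"},
        {"agency": "fox", "title": "B"},
        {"agency": "cnn", "title": "C"},
    ]
    # Without trimming, " fox" would not match "fox".
    print("bbc, fox".split(","))
    print(filter_excluded(articles, "bbc, fox"))
```

The match is on the whole name, so an entry only excludes an agency whose name is written exactly the same way.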
Allow specific agencies
Same as exclude, but works in reverse: you can allow one, two, or many more specific news agencies to be scraped. For example,
bbc news, fox news will allow only BBC News and Fox News to be scraped.
Proxies
With this option you can set your own proxies from whatever provider you use, or you can use Apify proxies.
Using proxies is entirely optional; this scraper doesn't need any. But if no articles are being scraped, allow this actor to use some proxies as a test. Otherwise, contact me through my GitHub.
More info
This project took me over 55 hours of pure coding to finish. If you find any bugs or errors, please report them to me and I will fix them all.
If you have ANY suggestions or questions, please contact me through my GitHub or open an issue on the scraper. I will be very happy to maintain this scraper for as long as I can.