Fast News Scraper

3 days trial then $29.00/month - No credit card required now

timgreen/fast-news-scraper

Extract full article text and metadata from popular news sites like The New York Times, Bloomberg, Reuters, BBC, CNBC, and Wired. Scrape thousands of articles in just a few minutes. Scrape a single site or provide a list of article URLs to scrape.

Fast News Scraper extracts full article text from select news and content websites with a focus on speed. It uses private APIs where available and only makes plain HTTP requests. This won't work for every website, but with a little ingenuity, it can work in a surprising number of cases. Thousands of full articles can be pulled in just minutes.

In addition to the full article text, Fast News Scraper also retrieves various pieces of metadata for each article. The full output is detailed below.

What news websites are supported?

Fast News Scraper currently supports scraping articles from the following websites:

  • The New York Times (nytimes.com)
  • The Washington Post (washingtonpost.com)
  • Bloomberg (bloomberg.com)
  • CNN (cnn.com)
  • BBC (bbc.com)
  • Reuters (reuters.com)
  • Seeking Alpha (seekingalpha.com/market-news) - market news only
  • Wired (wired.com)
  • CNBC (cnbc.com)

Additional websites will be added over time. If there's a website you'd like to see supported, go to the Issues tab and create a new issue.

Why scrape full news articles?

There are a variety of reasons why scraping full news articles is useful:

  1. Media monitoring: Scrape news articles to track mentions of your company, competitors, or industry-related keywords, allowing you to stay on top of your online reputation and market trends.
  2. Research and analysis: Collect and analyze news articles to identify patterns, trends, and insights on various topics, such as politics, economics, or social issues.
  3. Sentiment analysis: Analyze news articles to determine the sentiment around a particular topic, company, or individual, helping you understand public opinion and make informed decisions.
  4. Event detection: Scrape news articles to detect and track events, such as natural disasters, protests, or product launches, allowing you to respond quickly and effectively.
  5. Topic modeling: Use scraped news articles to identify underlying topics and themes, enabling you to understand the broader context and relationships between different news stories.
  6. Entity extraction: Extract specific entities, such as people, organizations, and locations, from news articles to build databases, create profiles, or track relationships.
  7. News recommendation: Scrape news articles to build personalized news recommendation systems, suggesting relevant content to users based on their interests and preferences.
  8. Fake news detection: Analyze news articles to identify potential fake news stories, helping to combat misinformation and promote fact-based journalism.
  9. Historical research: Scrape news articles to create archives of historical events, allowing researchers and scholars to study and analyze past events and trends.
  10. Business intelligence: Collect and analyze news articles to gather competitive intelligence, track market trends, and identify business opportunities.
  11. Content generation: Use scraped news articles as inspiration or input for generating new content, such as summaries, abstracts, or even entire articles.
  12. Academic research: Collect and analyze news articles to support academic research in fields like journalism, communication, sociology, and political science.
  13. Data journalism: Scrape news articles to create interactive visualizations, dashboards, and stories that help journalists and researchers.
  14. AI training: AI models require large quantities of training data. News articles can provide a rich source of such data.

Input configuration

There are two ways to run Fast News Scraper:

  1. Search a specific website with a query (some websites support a blank query, in which case all articles will be returned).
  2. Scrape a list of article URLs from supported websites.

Here are all the supported input fields. For more details, see the Input tab.

Field | Type | Description | Default value
site | string | The site to scrape. Must be one of the supported sites. | reuters.com
query | string | The query term used to search the selected site. Not all sites support queries, and only some sites allow an empty query. | artificial intelligence
sort | string | The order in which articles are returned. Must be either date or relevance. Not all websites support both. | date
maxItems | number | The approximate maximum number of items that will be returned by a run. The actual number returned may be slightly higher or lower. | 500
articleURLs | array | A list of article URLs to scrape. If this field is not empty, only the provided URLs will be scraped, and the site, query, and sort fields will be ignored. This is useful if you acquire a list of URLs elsewhere. | []
datasetName | string | If this field is present, a named dataset will be used. This is useful for appending results from multiple runs. | null
requestQueueName | string | If this field is present, a named request queue will be used. This allows you to avoid scraping the same content across multiple runs. | null
beginDate | string | ONLY SUPPORTED FOR SOME WEBSITES. Extract articles on or after this date. | null
endDate | string | ONLY SUPPORTED FOR SOME WEBSITES. Extract articles on or before this date. | null
proxy | object | The proxy configuration to use. This field is required. | { "useApifyProxy": true }
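
The two run modes described above can be sketched as two small input builders. This is a minimal illustration in Python, not the actor's own code: the helper names `build_search_input` and `build_url_input` are hypothetical, while the field names and default values come from the table above.

```python
import json

# Sites listed as supported in this README.
SUPPORTED_SITES = {
    "nytimes.com", "washingtonpost.com", "bloomberg.com", "cnn.com",
    "bbc.com", "reuters.com", "seekingalpha.com/market-news",
    "wired.com", "cnbc.com",
}


def build_search_input(site, query="", sort="date", max_items=500):
    """Mode 1: search one supported site with a query (hypothetical helper)."""
    if site not in SUPPORTED_SITES:
        raise ValueError(f"unsupported site: {site}")
    if sort not in ("date", "relevance"):
        raise ValueError("sort must be 'date' or 'relevance'")
    return {
        "site": site,
        "query": query,
        "sort": sort,
        "maxItems": max_items,
        "proxy": {"useApifyProxy": True},  # the proxy field is required
    }


def build_url_input(article_urls):
    """Mode 2: scrape a given list of article URLs (hypothetical helper)."""
    if not article_urls:
        raise ValueError("articleURLs must not be empty in this mode")
    # site, query, and sort are ignored when articleURLs is non-empty,
    # so they are left out here.
    return {
        "articleURLs": list(article_urls),
        "proxy": {"useApifyProxy": True},
    }


search_input = build_search_input("reuters.com", "artificial intelligence")
url_input = build_url_input(
    ["https://www.bbc.com/news/world-asia-china-68894782"]
)
print(json.dumps(search_input, indent=2))
```

Either dictionary can be serialized to JSON and used as the run input. Per-site restrictions (for example, sites that reject empty queries) are not checked in this sketch.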

Output example

The scraped articles will be shown as a dataset which you can find in the Output tab. Note that the output will first be organized as a table for viewing convenience.

You can preview all the fields and choose the format in which to download the extracted data: JSON, CSV, Excel, HTML table, or XML. Below is the same dataset in JSON:

{
	"site": "bbc.com",
	"query": "China",
	"url": "https://www.bbc.com/news/world-asia-china-68894782",
	"title": "Blinken arrives in China as relations crackle with tension",
	"tags": [
		"article",
		"news",
		"China"
	],
	"summary": "Blinken's trip is a sign of improved US-China ties, but the two still view each other with suspicion.",
	"image": "https://ichef.bbci.co.uk/news/480/cpsprodpb/10718/production/_133225376_gettyimages-2149484148-1.jpg",
	"author": "By Laura Bicker, Tom Bateman and Tessa Wong",
	"published": "2024-04-25T04:36:15.000Z",
	"updated": "2024-04-25T04:36:15.000Z",
	"label": "bbc.com.article",
	"content": "\"Three, two, one - hut!\" shouts quarterback Mu Yang, as he throws the ball across the field.\n\nHis Beijing Cyclones teammate Henry Mu sprints to the corner for the catch, his studs thudding off the AstroTurf as he jumps for the ball.\n\n\"I was so surprised to find American football here,\" says Henry as he catches his breath. \"It's very tough, physically and mentally, you must defeat your fear.\"\n\nHere, men and women play together in a team sport that you'd associate more with Baltimore than Beijing.\n\nFor many Americans, this is more than just a game - it is an expression of their national identity. For this Chinese team, it is something new - there are only a few thousand players in China, but millions of fans.\n\nThis is exactly the kind of \"people to people\" exchanges and cultural connection that Beijing wants with the US, as the two rival superpowers try to calm their tumultuous relationship.\n\nSince President Xi Jinping visited San Francisco last November...(truncated)"
}

Note: Some fields will be blank, empty, or null depending on the website and article. If you've provided a list of article URLs rather than a site and query term, the query field will be missing.
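
Because fields can be blank, null, or missing, downstream code should read items defensively. The sketch below shows one possible way to do that in Python. The helper names are hypothetical, the first sample item is a shortened copy of the example above, and the second item (with an example.com URL) is made up to show the missing-field case.

```python
import json
from datetime import datetime


def parse_timestamp(value):
    """Parse timestamps shaped like "2024-04-25T04:36:15.000Z"; None if blank."""
    if not value:
        return None
    # Replace the trailing "Z" so older Python versions can parse the offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def normalize_article(item):
    """Return a dict with predictable types, whatever the site returned."""
    content = item.get("content") or ""
    return {
        "url": item.get("url"),
        "title": item.get("title") or "",
        "query": item.get("query"),  # absent when articleURLs was used
        "tags": item.get("tags") or [],
        "published": parse_timestamp(item.get("published")),
        "word_count": len(content.split()),
    }


raw_items = json.loads("""
[
  {
    "site": "bbc.com",
    "query": "China",
    "url": "https://www.bbc.com/news/world-asia-china-68894782",
    "title": "Blinken arrives in China as relations crackle with tension",
    "tags": ["article", "news", "China"],
    "published": "2024-04-25T04:36:15.000Z",
    "content": "Since President Xi Jinping visited San Francisco last November"
  },
  {
    "url": "https://example.com/article",
    "title": null,
    "tags": null,
    "published": "",
    "content": null
  }
]
""")

articles = [normalize_article(item) for item in raw_items]
for article in articles:
    print(article["url"], article["published"], article["word_count"])
```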

How long does it take to scrape news articles?

The article extraction rate for each supported website differs. Using the default settings, here's a rough idea of how quickly you can scrape full articles using Fast News Scraper based on some test runs:

Note: All runs listed below used Datacenter proxies unless otherwise noted.

Site | Articles | Time | Rate | Notes
reuters.com | 1,999 | 4m 17s | 467 articles/minute |
cnn.com | 1,586 | 2m 03s | 774 articles/minute |
bbc.com | 1,902 | 3m 13s | 591 articles/minute |
bloomberg.com | 1,987 | 6m 06s | 326 articles/minute | Residential proxies
wired.com | 1,882 | 4m 25s | 426 articles/minute |
seekingalpha.com/market-news | 10,008 | 1m 37s | 6,191 articles/minute | !!!!!
nytimes.com | 924 | 4m 47s | 193 articles/minute | Residential proxies
washingtonpost.com | 290 | 4m 54s | 59 articles/minute |
cnbc.com | 645 | 3m 19s | 195 articles/minute |
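
The rates in the table are simply the article count divided by the run time in minutes. The short Python sketch below reproduces that arithmetic and uses it for a rough planning estimate. The function names are hypothetical, and an estimate from a single test run is only a ballpark figure.

```python
import re


def parse_duration(text):
    """Convert a duration like "4m 17s" into seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = int(match.group(1)), int(match.group(2))
    return minutes * 60 + seconds


def articles_per_minute(articles, duration):
    """Rate = articles / elapsed minutes."""
    return articles / (parse_duration(duration) / 60)


def estimate_minutes(target_articles, rate):
    """Rough run time for a target article count at a measured rate."""
    return target_articles / rate


reuters_rate = articles_per_minute(1999, "4m 17s")
print(round(reuters_rate))  # 467, matching the reuters.com row
print(f"{estimate_minutes(5000, reuters_rate):.1f} minutes for 5,000 articles")
```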

How to scrape articles from Bloomberg (bloomberg.com)

Bloomberg is a leading international news agency that provides 24/7 coverage of global business, finance, economics, and politics. Founded in 1990 by Michael Bloomberg, the company is headquartered in New York City and has bureaus in over 120 countries worldwide. Bloomberg's news site, Bloomberg.com, offers in-depth analysis, breaking news, and commentary on markets, industries, and governments, as well as video and audio content.

Bloomberg locks articles behind a paywall, only allowing free users to access a limited number of articles per month. Fast News Scraper gets around this, providing access to the full text of Bloomberg articles.

To extract Bloomberg articles:

  • Set site to bloomberg.com.
  • Set query to a non-empty string. Bloomberg does not allow empty queries.
  • Set sort to either date or relevance. If sort is omitted, articles will be returned by date.

Bloomberg does not support beginDate or endDate.

Note: Using Residential proxies is recommended for scraping Bloomberg articles to avoid getting blocked and to ensure that articles are not dropped.

Note: Each Bloomberg query can only return a maximum of 10,000 articles.

Note: Some types of content, including video, audio, and graphics, are skipped.

How to scrape articles from The New York Times (nytimes.com)

The New York Times is a daily newspaper based in New York City that is widely regarded as one of the most respected and authoritative sources of news and information in the world. Founded in 1851, The Times has a long history of journalistic excellence, having won 127 Pulitzer Prizes, more than any other newspaper. Known for its in-depth reporting and thoughtful analysis, The Times covers a wide range of topics, including national and international news, politics, business, culture, and more.

The New York Times locks articles behind a paywall, only allowing free users to access a limited number of articles per month. Fast News Scraper gets around this, providing access to the full text of New York Times articles.

To extract New York Times articles:

  • Set site to nytimes.com.
  • If query is omitted or left empty, all New York Times content will be returned. A non-empty query will use the website's search functionality.
  • Set sort to either date or relevance. If sort is omitted, articles will be returned by date.

Unlike the other supported sites, The New York Times does support beginDate and endDate.

Each New York Times search will only return a maximum of ~1,000 articles. However, by running multiple searches with different date ranges, you can pull any number of articles. You can use a named dataset and request queue to ensure that articles aren't extracted twice.
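
One way to plan such a series of searches is to split a long period into consecutive, non-overlapping date windows and build one input per window. The Python sketch below illustrates the idea under several assumptions that this README does not confirm: the YYYY-MM-DD date format, the 7-day window size, and the dataset and request queue names are all hypothetical. The residential proxy group setting follows the usual Apify proxy input shape and should be checked against the Input tab.

```python
from datetime import date, timedelta


def date_windows(start, end, days):
    """Yield inclusive (begin, end) windows that cover start..end once."""
    begin = start
    while begin <= end:
        window_end = min(begin + timedelta(days=days - 1), end)
        yield begin, window_end
        begin = window_end + timedelta(days=1)


def nytimes_inputs(query, start, end, days=7):
    """Build one run input per date window (hypothetical helper)."""
    return [
        {
            "site": "nytimes.com",
            "query": query,
            "sort": "date",
            # ASSUMPTION: dates are passed as YYYY-MM-DD strings.
            "beginDate": begin.isoformat(),
            "endDate": window_end.isoformat(),
            # Hypothetical names; sharing them across runs is what
            # prevents the same article from being extracted twice.
            "datasetName": "nytimes-articles",
            "requestQueueName": "nytimes-articles",
            # ASSUMPTION: standard Apify proxy input for residential proxies.
            "proxy": {
                "useApifyProxy": True,
                "apifyProxyGroups": ["RESIDENTIAL"],
            },
        }
        for begin, window_end in date_windows(start, end, days)
    ]


runs = nytimes_inputs("climate", date(2024, 1, 1), date(2024, 1, 20))
for run in runs:
    print(run["beginDate"], "to", run["endDate"])
```

If a window still hits the ~1,000 article cap, a smaller window size for that period is the natural adjustment.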

Note: Using Residential proxies is recommended for scraping New York Times articles to avoid getting blocked and to ensure that articles are not dropped.

Note: Some types of content, including live news content and any articles that live on a subdomain, are skipped.

How to scrape articles from The Washington Post (washingtonpost.com)

The Washington Post is a major American daily newspaper published in Washington, D.C. Founded in 1877, it is one of the oldest and most respected newspapers in the United States. Known for its in-depth coverage of national politics, The Post has won numerous Pulitzer Prizes for its investigative reporting, including its coverage of the Watergate scandal in the 1970s. Today, The Washington Post is a leading source of news and opinion on politics, business, sports, and culture, with a print and online circulation of millions.

The Washington Post locks articles behind a paywall, only allowing free users to access a limited number of articles per month. Fast News Scraper gets around this, providing access to the full text of Washington Post articles.

To extract Washington Post articles:

  • Set site to washingtonpost.com.
  • Set query to a non-empty string. The Washington Post does not allow empty queries.
  • Articles will always be returned sorted by relevance. The sort field will be ignored.

The Washington Post does not support beginDate or endDate.

The default Datacenter proxies work just fine with The Washington Post.

Note: Each query generally only returns a few hundred articles.

How to scrape articles from Reuters (reuters.com)

Reuters is a leading international news agency that provides comprehensive and unbiased coverage of global news, including politics, business, finance, technology, and more. Founded in 1851, Reuters is one of the oldest and most respected news agencies in the world, with a reputation for accuracy, speed, and independence. Reuters.com offers real-time news coverage, in-depth analysis, and commentary on global events, as well as video and photography from around the world.

Reuters requires registration to view unlimited articles, only allowing unregistered users to access a limited number of articles per month. Fast News Scraper gets around this, providing access to the full text of Reuters articles.

To extract Reuters articles:

  • Set site to reuters.com.
  • Set query to a non-empty string. Reuters does not allow empty queries.
  • Set sort to either date or relevance. If sort is omitted, articles will be returned by date.

Reuters does not support beginDate or endDate.

The default Datacenter proxies work just fine with Reuters.

How to scrape articles from BBC (bbc.com)

BBC News is a British public service broadcaster that provides impartial and comprehensive coverage of global news, including politics, business, entertainment, and more. The BBC is one of the most trusted and respected news sources in the world, with a reputation for accuracy, fairness, and in-depth reporting. BBC.com offers a wide range of news content, including video, audio, and written articles, as well as live streaming of BBC TV and radio programs.

BBC doesn't require registration to view news articles, so scraping the website is relatively straightforward.

To extract BBC articles:

  • Set site to bbc.com.
  • Set query to a non-empty string. BBC does not allow empty queries.
  • BBC will always return articles based on relevance, so the sort field will be ignored.

BBC does not support beginDate or endDate.

The default Datacenter proxies work just fine with BBC.

Note: Some types of content, including audio, video, live content, and newsround content, will be skipped.

How to scrape articles from CNN (cnn.com)

CNN (Cable News Network) is a 24-hour cable news channel that provides continuous coverage of global news, politics, business, entertainment, and more. Founded in 1980, CNN is one of the most recognized and respected news brands in the world, known for its breaking news coverage, in-depth reporting, and live coverage of major events. CNN.com offers a wide range of news content, including video, articles, and blogs, as well as live streaming of CNN TV programming.

CNN doesn't require registration to view news articles, so scraping the website is relatively straightforward.

To extract CNN articles:

  • Set site to cnn.com.
  • If query is omitted or left empty, all CNN content will be returned. A non-empty query will use the website's search functionality.
  • Set sort to either date or relevance. If sort is omitted, articles will be returned by date.

CNN does not support beginDate or endDate.

The default Datacenter proxies work just fine with CNN.

Note: Some types of content, including video, live news, CNN Underscored, and gallery content, will be skipped. Any articles that are in a special format (interactive, etc.) will likely fail to be extracted.

How to scrape articles from Wired (wired.com)

Wired is a technology-focused news site that provides in-depth coverage of the latest developments in tech, science, and innovation. Founded in 1993, Wired is known for its cutting-edge reporting on emerging trends, gadgets, and ideas that are shaping the future of business, culture, and society. Wired.com features news, analysis, and commentary on topics such as artificial intelligence, cybersecurity, robotics, and more, as well as profiles of innovators and entrepreneurs who are changing the world.

Wired locks articles behind a paywall, only allowing free users to access a limited number of articles per month. Fast News Scraper gets around this, providing access to the full text of Wired articles.

To extract Wired articles:

  • Set site to wired.com.
  • If query is omitted or left empty, all Wired content will be returned. A non-empty query will use the website's search functionality.
  • Set sort to either date or relevance. If sort is omitted, articles will be returned by date.

Wired does not support beginDate or endDate.

The default Datacenter proxies work just fine with Wired.

Note: Sponsored content is skipped.

How to scrape articles from CNBC (cnbc.com)

CNBC, or Consumer News and Business Channel, is a 24-hour cable television network that provides business news and financial information to a global audience. Founded in 1989, CNBC is a leading source of business and market news, offering live coverage of stock markets, economic indicators, and corporate news. The network's programming includes popular shows such as "Squawk Box," "Fast Money," and "Mad Money with Jim Cramer," featuring expert analysis and commentary from experienced journalists and financial experts. CNBC also provides online content, including articles, videos, and podcasts, making it a one-stop shop for investors, business leaders, and anyone interested in staying informed about the world of finance.

CNBC locks its PRO articles behind a paywall. Fast News Scraper gets around this, providing access to the full text of CNBC articles, regardless of whether they're standard articles or PRO articles.

To extract CNBC articles:

  • Set site to cnbc.com.
  • Set query to a non-empty string. CNBC does not allow empty queries.
  • Set sort to either date or relevance. If sort is omitted, articles will be returned by date.

CNBC does not support beginDate or endDate.

The default Datacenter proxies work just fine with CNBC.

Note: Video content is skipped, and some "live update" content will fail to be scraped.

How to scrape market news from Seeking Alpha (seekingalpha.com/market-news)

Seeking Alpha is a financial news and analysis website that provides stock market insights, news, and commentary to individual and professional investors. Founded in 2004, Seeking Alpha is known for its crowdsourced approach, featuring articles and analysis from a network of thousands of contributors, including experienced investors, analysts, and industry experts. The site offers a wide range of content, including news, earnings calls, dividend analysis, and stock ratings, as well as tools and resources for investors to make informed investment decisions.

Seeking Alpha locks its market news content behind a paywall, requiring a Premium subscription for unfettered access. Fast News Scraper gets around this, providing access to the full text of Seeking Alpha's market news.

To extract Seeking Alpha market news content:

  • Set site to seekingalpha.com/market-news.
  • query is not supported. All market news content will be returned.
  • sort is not supported. Market news content is always sorted by date.

Seeking Alpha does not support beginDate or endDate.

The default Datacenter proxies work just fine with Seeking Alpha.

Note: Only market news is supported. Seeking Alpha's longer-form articles and analysis will not be returned.

Note: Content is scraped rapidly at a rate of thousands of market news items per minute.
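
The per-site rules in the sections above can be collected into one lookup table and used to check an input before starting a run. This is a sketch in Python with hypothetical names (`SiteRules`, `SITE_RULES`, `check_input`); the values are transcribed from this README, not read from the actor itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteRules:
    query: str                     # "required", "optional", or "unsupported"
    sort_honored: bool             # False if the sort field is ignored
    date_range: bool               # True if beginDate/endDate are supported
    residential_recommended: bool  # True if Residential proxies are advised


SITE_RULES = {
    "bloomberg.com": SiteRules("required", True, False, True),
    "nytimes.com": SiteRules("optional", True, True, True),
    "washingtonpost.com": SiteRules("required", False, False, False),
    "reuters.com": SiteRules("required", True, False, False),
    "bbc.com": SiteRules("required", False, False, False),
    "cnn.com": SiteRules("optional", True, False, False),
    "wired.com": SiteRules("optional", True, False, False),
    "cnbc.com": SiteRules("required", True, False, False),
    "seekingalpha.com/market-news": SiteRules("unsupported", False, False, False),
}


def check_input(run_input):
    """Return a list of human-readable problems; empty means none found."""
    site = run_input.get("site")
    rules = SITE_RULES.get(site)
    if rules is None:
        return [f"unsupported site: {site}"]
    problems = []
    query = run_input.get("query") or ""
    if rules.query == "required" and not query:
        problems.append(f"{site} does not allow empty queries")
    if rules.query == "unsupported" and query:
        problems.append(f"{site} does not support query")
    if "sort" in run_input and not rules.sort_honored:
        problems.append(f"{site} ignores the sort field")
    if not rules.date_range and (
        run_input.get("beginDate") or run_input.get("endDate")
    ):
        problems.append(f"{site} does not support beginDate or endDate")
    return problems


print(check_input({"site": "bbc.com", "query": "", "sort": "date"}))
print(check_input({"site": "nytimes.com", "beginDate": "2024-01-01"}))
```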

How to pull only new articles

Let's say you want to schedule Fast News Scraper to run once a week and pull any new articles from wired.com that you haven't already extracted. The key is to use a named dataset and request queue, which can be done using the datasetName and requestQueueName input fields. Each time you run Fast News Scraper, only articles that have not yet been scraped will be processed, and the scraper will automatically stop once it's reached a point where the only articles it's finding are articles you've already scraped. This way you avoid wasting time and money repeatedly scraping the same content.

Learn more about scheduling on the Apify platform.

Note: If you use a named dataset, data will be pushed to the named dataset and an unnamed dataset linked to the run. This is a limitation of the Apify platform. You can view the full dataset and request queue by navigating to the Storage page in the Apify console.
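
For the weekly wired.com scenario above, the same input can be reused on every scheduled run as long as the storage names stay fixed. The Python sketch below builds such an input. The helper names and the "news-" prefix are hypothetical, and restricting names to lowercase letters, digits, and hyphens is an assumption about Apify's naming rules for named storages.

```python
import re


def storage_name(site, prefix="news"):
    """Derive a stable storage name from a site (hypothetical convention)."""
    # ASSUMPTION: names may only contain a-z, 0-9, and hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", site.lower()).strip("-")
    return f"{prefix}-{slug}"


def incremental_input(site, query=""):
    """Build an input that can be reused unchanged on every scheduled run."""
    name = storage_name(site)
    return {
        "site": site,
        "query": query,
        "sort": "date",
        "datasetName": name,       # results from all runs are appended here
        "requestQueueName": name,  # remembers which articles were processed
        "proxy": {"useApifyProxy": True},
    }


weekly = incremental_input("wired.com")
print(weekly["datasetName"], weekly["requestQueueName"])
```

Changing either name between runs would start a fresh dataset or queue, so previously scraped articles would be fetched again.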

Extracting articles is legal, as you are scraping publicly available content. Please be aware that most articles are protected by copyright laws. Before you publish extracted articles anywhere, check the terms of use of the scraped website. In other words: Don't be a jerk.

News icons created by Freepik - Flaticon
