Cyclingnews Races & News Scraper

Pricing: Pay per event

Scrapes pro-cycling news articles and race reports from Cyclingnews.com. Extracts headline, author, dates, body text, summary, and LATAM-cycling relevance flags (riders and races). For sports-analytics, LLM training, and cycling intelligence dashboards.

Developer: BowTiedRaccoon (maintained by Community)

Scrapes pro-cycling news articles and race reports from Cyclingnews.com — the largest English-language cycling news outlet, owned by Future plc. Returns structured article data including headline, author, publish date, full body text, and a curated LATAM-cycling relevance layer.

The site is server-rendered with rich JSON-LD structured data on every article. No browser required. The scraper pulls from the Google News sitemap and the live /news/ listing page, so each run returns the freshest content without you managing pagination or archives.
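
Because the metadata lives in JSON-LD, it can be read without rendering the page. The following is a minimal sketch of that idea using only the Python standard library. It is not the Actor's own code; the sample HTML, the class name `JsonLdCollector`, and the function `find_news_article` are illustrative assumptions.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_news_article(html):
    """Return the first JSON-LD object whose @type is NewsArticle, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole article
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "NewsArticle":
                return item
    return None


# Hypothetical sample page, not real Cyclingnews markup.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "Example stage report",
 "author": {"name": "Jane Doe"}, "datePublished": "2025-01-01T10:00:00Z"}
</script>
</head><body><div id="article-body"><p>Body text.</p></div></body></html>
"""

article = find_news_article(SAMPLE_HTML)
print(article["headline"], article["author"]["name"])
```

Real pages may nest the article under `@graph` or give `@type` as a list; this sketch handles neither case.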

What It Returns

Every record is one article. The dataset includes:

| Field | Type | Description |
|---|---|---|
| article_id | String | URL-slug identifier derived from the canonical URL |
| article_url | String | Canonical URL of the article |
| article_title | String | Headline (HTML entities decoded) |
| article_author | String | Primary author name |
| article_published_at | String | ISO-8601 publish timestamp |
| article_modified_at | String | ISO-8601 last-modified timestamp |
| article_body_text | String | Plain-text article body, up to 50,000 characters |
| article_summary | String | Sub-headline or deck |
| article_section | String | Section label (e.g. Racing, Women's Cycling, Teams & Riders) |
| article_tags | Array | Open Graph article:tag values |
| latam_relevant | Boolean | True if the article mentions a curated LATAM rider or race |
| latam_riders | Array | LATAM riders mentioned (Quintana, Bernal, Carapaz, Higuita, etc.) |
| latam_races | Array | LATAM races mentioned (Tour Colombia, Vuelta San Juan, etc.) |
| source_url | String | Always https://www.cyclingnews.com |
| scraped_at | String | ISO-8601 scrape timestamp |
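
For downstream Python code, the fields listed above can be expressed as a typed record. This is only a sketch: the `ArticleRecord` name is an assumption, and every value in `EXAMPLE_RECORD` is made up to show the shape of a row, not taken from a real run.

```python
from typing import List, TypedDict


class ArticleRecord(TypedDict):
    """One dataset row, mirroring the documented output fields."""
    article_id: str
    article_url: str
    article_title: str
    article_author: str
    article_published_at: str  # ISO-8601
    article_modified_at: str   # ISO-8601
    article_body_text: str     # capped at 50,000 characters
    article_summary: str
    article_section: str
    article_tags: List[str]
    latam_relevant: bool
    latam_riders: List[str]
    latam_races: List[str]
    source_url: str
    scraped_at: str            # ISO-8601


# Hypothetical values for illustration only.
EXAMPLE_RECORD: ArticleRecord = {
    "article_id": "example-article-slug",
    "article_url": "https://www.cyclingnews.com/news/example-article-slug/",
    "article_title": "Example headline",
    "article_author": "Jane Doe",
    "article_published_at": "2025-01-01T10:00:00Z",
    "article_modified_at": "2025-01-01T12:00:00Z",
    "article_body_text": "Example body text.",
    "article_summary": "Example deck.",
    "article_section": "Racing",
    "article_tags": ["Example tag"],
    "latam_relevant": False,
    "latam_riders": [],
    "latam_races": [],
    "source_url": "https://www.cyclingnews.com",
    "scraped_at": "2025-01-02T00:00:00Z",
}
```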

LATAM Enrichment

The latam_relevant flag and companion arrays are the value-add. The scraper checks every article against a curated list of ~30 Colombian, Ecuadorian, and other Latin American riders — Nairo Quintana, Egan Bernal, Richard Carapaz, Sergio Higuita, Santiago Buitrago, and others — plus ~25 LATAM races including Tour Colombia, Vuelta a Colombia, Vuelta San Juan, and Ruta de los Conquistadores. Downstream models and dashboards can filter on latam_relevant: true without re-reading the body text.
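
The enrichment can be pictured as a lookup against two name lists. The sketch below assumes riders are matched by surname and races by full name, both case-insensitively on word boundaries; the Actor's actual matching rules and full lists are not documented here. Only names mentioned in this README are used, and `find_mentions` and `latam_enrich` are hypothetical helper names.

```python
import re

# Illustrative subset only; the Actor's curated lists are longer.
LATAM_RIDERS = ["Nairo Quintana", "Egan Bernal", "Richard Carapaz",
                "Sergio Higuita", "Santiago Buitrago"]
LATAM_RACES = ["Tour Colombia", "Vuelta a Colombia", "Vuelta San Juan",
               "Ruta de los Conquistadores"]


def find_mentions(text, names, key=lambda name: name):
    """Return the names whose search key appears in text as a whole phrase."""
    hits = []
    for name in names:
        pattern = r"\b" + re.escape(key(name)) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(name)
    return hits


def latam_enrich(body_text):
    """Build the three LATAM fields for one article body."""
    # Assumption: a rider counts as mentioned when the surname appears.
    riders = find_mentions(body_text, LATAM_RIDERS, key=lambda n: n.split()[-1])
    races = find_mentions(body_text, LATAM_RACES)
    return {
        "latam_relevant": bool(riders or races),
        "latam_riders": riders,
        "latam_races": races,
    }


sample = "Bernal and Carapaz will both start the Tour Colombia next month."
print(latam_enrich(sample))
```

Surname matching is a trade-off: it catches the common second-reference style in race reports but can produce false positives when an unrelated person shares the surname.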

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| maxItems | Integer | 10 | Maximum articles to scrape. Set to 0 to capture every available article. The Google News sitemap refreshes every few hours with ~27 recent articles. |

How It Works

Each run:

  1. Fetches sitemap-news.xml (Google News sitemap — always publicly accessible) and collects article URLs for the past 48–72 hours.
  2. Also scrapes the live /news/ listing page for any articles not yet indexed in the sitemap.
  3. Deduplicates and caps to maxItems, then fetches each article.
  4. Parses the JSON-LD NewsArticle schema for structured metadata and reads #article-body for the body text.
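
Steps 1 to 3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the Actor's source: the sample XML and URLs are made up, `sitemap_urls` and `merge_and_cap` are hypothetical names, duplicates are detected by ignoring a trailing slash, and `max_items == 0` is treated as "no cap" as described under Coverage.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace; news sitemaps add their own on top of it.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Extract every <loc> URL from a sitemap document."""
    root = ET.fromstring(xml_text.strip())
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


def merge_and_cap(sitemap, listing, max_items):
    """Merge both URL sources, drop duplicates, and apply the item cap."""
    seen = set()
    merged = []
    for url in [*sitemap, *listing]:
        key = url.rstrip("/")  # assumption: trailing slash is not significant
        if key not in seen:
            seen.add(key)
            merged.append(url)
    return merged if max_items == 0 else merged[:max_items]


# Hypothetical sitemap content for illustration.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://www.cyclingnews.com/news/example-a/</loc>
    <news:news><news:title>Example A</news:title></news:news>
  </url>
  <url><loc>https://www.cyclingnews.com/news/example-b/</loc></url>
</urlset>
"""

from_sitemap = sitemap_urls(SAMPLE_SITEMAP)
from_listing = ["https://www.cyclingnews.com/news/example-b",
                "https://www.cyclingnews.com/news/example-c/"]
print(merge_and_cap(from_sitemap, from_listing, max_items=2))
```

Sitemap URLs come first in the merged list, so when the cap applies, listing-only articles are the first to be dropped. Whether the Actor uses the same ordering is not stated.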

The scraper uses impit — a Chrome TLS fingerprint HTTP client — which passes Fastly CDN edge checks without a browser. No proxy required.

Use Cases

  • Sports-analytics pipelines: feed article bodies into NLP models to extract race results, rider performance signals, and team news.
  • LLM training corpora: Cyclingnews is the canonical English-language source for pro-cycling narrative. The body text is editorial-quality, structured, and tagged.
  • LATAM cycling intelligence dashboards: the latam_riders and latam_races arrays make it simple to track Colombian Grand Tour coverage, contract news, and race reports without keyword scanning.
  • Journalism aggregators: combine with a scheduling trigger to catch every article within hours of publication.
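
As a small illustration of the dashboard use case, the sketch below counts rider mentions across a batch of records using only the documented `latam_relevant` and `latam_riders` fields. The records are made-up examples and `latam_rider_counts` is a hypothetical helper.

```python
from collections import Counter


def latam_rider_counts(records):
    """Count how many LATAM-relevant articles mention each rider."""
    counts = Counter()
    for record in records:
        if record.get("latam_relevant"):
            counts.update(record.get("latam_riders", []))
    return counts


# Hypothetical dataset rows, trimmed to the fields used here.
RECORDS = [
    {"article_id": "a", "latam_relevant": True,
     "latam_riders": ["Egan Bernal", "Richard Carapaz"]},
    {"article_id": "b", "latam_relevant": False, "latam_riders": []},
    {"article_id": "c", "latam_relevant": True,
     "latam_riders": ["Egan Bernal"]},
]

for rider, n in latam_rider_counts(RECORDS).most_common():
    print(rider, n)
```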

Coverage

Cyclingnews publishes 50–80 articles per week across racing, women's cycling, teams & riders, tech/gear, and features. The Google News sitemap covers a rolling window of roughly 48–72 hours, so run on a daily or twice-daily schedule to maintain a complete archive. A single run with maxItems: 0 captures all available articles (~27 from the news sitemap plus the listing page).
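
Scheduled runs overlap, so an archive built from them needs deduplication across runs. One possible approach, sketched below, keys the archive on `article_id` and keeps whichever copy has the later `article_modified_at`. The helper names and sample rows are assumptions, and the timestamp parsing assumes ISO-8601 values with either a `Z` suffix or an explicit offset.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO-8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def merge_run(archive, run_records):
    """Merge one run into the archive, keeping the most recently modified copy."""
    for record in run_records:
        key = record["article_id"]
        current = archive.get(key)
        if current is None or (parse_ts(record["article_modified_at"])
                               > parse_ts(current["article_modified_at"])):
            archive[key] = record
    return archive


# Hypothetical rows from two overlapping runs.
morning = [
    {"article_id": "a", "article_modified_at": "2025-01-01T08:00:00Z"},
    {"article_id": "b", "article_modified_at": "2025-01-01T09:00:00Z"},
]
evening = [
    {"article_id": "b", "article_modified_at": "2025-01-01T15:30:00Z"},
    {"article_id": "c", "article_modified_at": "2025-01-01T16:00:00Z"},
]

archive = {}
merge_run(archive, morning)
merge_run(archive, evening)
print(sorted(archive))
```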

Limitations

The Google News sitemap covers recent articles only (~48–72 hours). Older articles are reachable only through paginated listing pages, and Future plc returns 403 on non-recent listing pages, so this Actor cannot backfill historical archives. For historical ingestion, feed a list of known article URLs to a separate, custom pipeline.


Data sourced from Cyclingnews.com (Future plc). Use in accordance with applicable terms of service.