Cyclingnews Races & News Scraper
Pricing
Pay per event
Scrapes pro-cycling news articles and race reports from Cyclingnews.com. Extracts headline, author, dates, body text, summary, and LATAM-cycling relevance flags (riders and races). For sports-analytics, LLM training, and cycling intelligence dashboards.
Developer
BowTiedRaccoon
Scrapes pro-cycling news articles and race reports from Cyclingnews.com — the largest English-language cycling news outlet, owned by Future plc. Returns structured article data including headline, author, publish date, full body text, and a curated LATAM-cycling relevance layer.
The site is server-rendered with rich JSON-LD structured data on every article. No browser required. The scraper pulls from the Google News sitemap and the live /news/ listing page, so each run returns the freshest content without you managing pagination or archives.
What It Returns
Every record is one article. The dataset includes:
| Field | Type | Description |
|---|---|---|
| `article_id` | String | URL-slug identifier derived from the canonical URL |
| `article_url` | String | Canonical URL of the article |
| `article_title` | String | Headline (HTML entities decoded) |
| `article_author` | String | Primary author name |
| `article_published_at` | String | ISO-8601 publish timestamp |
| `article_modified_at` | String | ISO-8601 last-modified timestamp |
| `article_body_text` | String | Plain-text article body, up to 50,000 characters |
| `article_summary` | String | Sub-headline or deck |
| `article_section` | String | Section label (e.g. Racing, Women's Cycling, Teams & Riders) |
| `article_tags` | Array | Open Graph `article:tag` values |
| `latam_relevant` | Boolean | True if the article mentions a curated LATAM rider or race |
| `latam_riders` | Array | LATAM riders mentioned (Quintana, Bernal, Carapaz, Higuita, etc.) |
| `latam_races` | Array | LATAM races mentioned (Tour Colombia, Vuelta San Juan, etc.) |
| `source_url` | String | Always `https://www.cyclingnews.com` |
| `scraped_at` | String | ISO-8601 scrape timestamp |
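For consumers who want a typed view of the dataset, the table above can be mirrored as a Python `TypedDict`. This is only a sketch: the class name `ArticleRecord` and the helper `missing_fields` are illustrative and not part of the scraper, and the Python types are inferred from the Type column.

```python
from typing import List, TypedDict


class ArticleRecord(TypedDict):
    """One dataset item, mirroring the field table (types inferred from it)."""
    article_id: str
    article_url: str
    article_title: str
    article_author: str
    article_published_at: str
    article_modified_at: str
    article_body_text: str
    article_summary: str
    article_section: str
    article_tags: List[str]
    latam_relevant: bool
    latam_riders: List[str]
    latam_races: List[str]
    source_url: str
    scraped_at: str


def missing_fields(record: dict) -> List[str]:
    """Return documented field names that are absent from a record, in table order."""
    return [name for name in ArticleRecord.__annotations__ if name not in record]
```

Running `missing_fields` over each item before loading it downstream is a cheap way to notice if the output shape ever changes.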
LATAM Enrichment
The latam_relevant flag and companion arrays are the value-add. The scraper checks every article against a curated list of ~30 Colombian, Ecuadorian, and other Latin American riders — Nairo Quintana, Egan Bernal, Richard Carapaz, Sergio Higuita, Santiago Buitrago, and others — plus ~25 LATAM races including Tour Colombia, Vuelta a Colombia, Vuelta San Juan, and Ruta de los Conquistadores. Downstream models and dashboards can filter on latam_relevant: true without re-reading the body text.
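The scraper's actual matching logic and full lists are not published here, so the following is a minimal sketch of how such an enrichment could work. It assumes case-insensitive whole-phrase matching on full names and uses only the riders and races named above. The function names `latam_flags` and `_mentions` are hypothetical.

```python
import re
from typing import Dict, List

# Only the names mentioned in this README; the real curated lists are longer.
RIDERS = ["Nairo Quintana", "Egan Bernal", "Richard Carapaz",
          "Sergio Higuita", "Santiago Buitrago"]
RACES = ["Tour Colombia", "Vuelta a Colombia", "Vuelta San Juan",
         "Ruta de los Conquistadores"]


def _mentions(text: str, names: List[str]) -> List[str]:
    """Return the names that appear in text as whole phrases (assumed rule)."""
    found = []
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found


def latam_flags(body_text: str) -> Dict[str, object]:
    """Build the three LATAM fields for one article body."""
    riders = _mentions(body_text, RIDERS)
    races = _mentions(body_text, RACES)
    return {
        "latam_relevant": bool(riders or races),
        "latam_riders": riders,
        "latam_races": races,
    }
```

Because the flag is derived from the two arrays, a consumer can filter on `latam_relevant` alone and only inspect the arrays when it needs to know which rider or race triggered the match.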
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `maxItems` | Integer | 10 | Maximum articles to scrape. Set to 0 to capture all available articles. The Google News sitemap refreshes every few hours with ~27 recent articles. |
How It Works
Each run:
- Fetches `sitemap-news.xml` (the Google News sitemap, always publicly accessible) and collects article URLs for the past 48–72 hours.
- Also scrapes the live `/news/` listing page for any articles not yet indexed in the sitemap.
- Deduplicates and caps to `maxItems`, then fetches each article.
- Parses the JSON-LD `NewsArticle` schema for structured metadata and `#article-body` for body text.
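The steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the scraper's code: it works on already-fetched strings (no HTTP), skips the `#article-body` extraction, and the function names are hypothetical. The mapping from schema.org `NewsArticle` properties (`headline`, `author`, `datePublished`, `dateModified`) to output fields is assumed.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import List, Optional

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> List[str]:
    """Collect every <loc> URL from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def dedupe_and_cap(urls: List[str], max_items: int) -> List[str]:
    """Drop duplicate URLs (keeping first-seen order); 0 means no cap."""
    unique = list(dict.fromkeys(urls))
    return unique if max_items <= 0 else unique[:max_items]


class _JsonLdCollector(HTMLParser):
    """Gather the raw text of each <script type="application/ld+json">."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def news_article_metadata(html: str) -> Optional[dict]:
    """Return metadata from the first NewsArticle JSON-LD block, if any."""
    collector = _JsonLdCollector()
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # ignore malformed blocks
        for item in (data if isinstance(data, list) else [data]):
            if isinstance(item, dict) and item.get("@type") == "NewsArticle":
                author = item.get("author") or {}
                if isinstance(author, list):
                    author = author[0] if author else {}
                name = author.get("name") if isinstance(author, dict) else author
                return {
                    "article_title": item.get("headline"),
                    "article_author": name,
                    "article_published_at": item.get("datePublished"),
                    "article_modified_at": item.get("dateModified"),
                }
    return None
```

Reading metadata from JSON-LD rather than from page markup keeps the parser independent of layout changes, which is presumably why the scraper prefers it for structured fields.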
The scraper uses impit — a Chrome TLS fingerprint HTTP client — which passes Fastly CDN edge checks without a browser. No proxy required.
Use Cases
- Sports-analytics pipelines: feed article bodies into NLP models to extract race results, rider performance signals, and team news.
- LLM training corpora: Cyclingnews is the canonical English-language source for pro-cycling narrative. The body text is editorial-quality, structured, and tagged.
- LATAM cycling intelligence dashboards: the `latam_riders` and `latam_races` arrays make it simple to track Colombian Grand Tour coverage, contract news, and race reports without keyword scanning.
- Journalism aggregators: combine with a scheduling trigger to catch every article within hours of publication.
Coverage
Cyclingnews publishes 50–80 articles per week across racing, women's cycling, teams & riders, tech/gear, and features. The Google News sitemap covers a rolling window of roughly 48–72 hours, so run on a daily or twice-daily schedule to maintain a complete archive. A single run with `maxItems: 0` captures all available articles (~27 from the news sitemap plus the listing page).
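Because consecutive scheduled runs overlap, an archive needs to deduplicate across runs. A minimal sketch, assuming records are keyed by `article_id` and that a later run should overwrite an earlier copy so edits are picked up. The function name `merge_run` and the in-memory `dict` archive are illustrative choices.

```python
from typing import Dict, Iterable, List


def merge_run(archive: Dict[str, dict], run_items: Iterable[dict]) -> List[str]:
    """Upsert one run's records into the archive; return IDs not seen before."""
    new_ids = []
    for item in run_items:
        article_id = item["article_id"]
        if article_id not in archive:
            new_ids.append(article_id)
        # Later runs overwrite earlier copies, so updated articles stay current.
        archive[article_id] = item
    return new_ids
```

The returned list of new IDs is useful for alerting or incremental processing; in practice the archive would live in a database or key-value store rather than in memory.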
Limitations
The Google News sitemap covers recent articles only (~48–72 hours). Older articles can only be reached by paginating the listing pages, and Future plc returns HTTP 403 for non-recent listing pages, so historical archives are not accessible to this scraper. For historical ingestion, supply a list of known article URLs via a custom pipeline.
Data sourced from Cyclingnews.com (Future plc). Use in accordance with applicable terms of service.