News Blindspot
Detects open-data media coverage blindspots — topics heavily covered by one political side but undercovered by others. Uses GDELT, AllSides, and MBFC data only.
Pricing
Pay per usage
Developer
spiros douk
Maintained by Community
6 months ago
Last modified
Open-data Blindspot Detector
Detects media coverage blindspots by analyzing political bias distribution across news sources using only public, legal data sources.
What it does
Finds blindspots — topics covered by one political side but under-covered by others — using public sources only. The actor:
- Fetches news articles from GDELT DOC API for specified queries
- Labels sources using AllSides (brand→bias), MBFC (domain→bias), and optional local overrides
- Clusters articles by story (title + day) to deduplicate rewrites
- Calculates known-only shares (Unknown excluded from denominator) for each political side
- Raises gap-aware flags when a side is both below a threshold and meaningfully lower than the next side
- Emits representative URLs per side and an Unknown% confidence proxy
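The clustering step can be pictured with a minimal sketch. It assumes each article carries `title`, `url`, and a GDELT-style `seendate` timestamp (`YYYYMMDDTHHMMSSZ`), and that titles are normalized by lowercasing and stripping punctuation. The helper names and the normalization rule are illustrative assumptions, not the actor's documented behavior.

```javascript
// Hypothetical helper: normalize a title so trivial rewrites compare equal.
function normalizeTitle(title) {
  return title
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, " ") // replace punctuation with spaces
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical helper: group articles that share a normalized title and a day.
function clusterByStory(articles) {
  const clusters = new Map();
  for (const article of articles) {
    const day = article.seendate.slice(0, 8); // "YYYYMMDD" (assumed timestamp format)
    const key = `${day}|${normalizeTitle(article.title)}`;
    if (!clusters.has(key)) clusters.set(key, []);
    clusters.get(key).push(article);
  }
  return [...clusters.values()];
}

const sampleArticles = [
  { title: "Senate passes climate bill", url: "https://a.example/1", seendate: "20250101T080000Z" },
  { title: "Senate Passes Climate Bill!", url: "https://b.example/2", seendate: "20250101T093000Z" },
  { title: "Senate passes climate bill", url: "https://c.example/3", seendate: "20250102T010000Z" },
];
const storyClusters = clusterByStory(sampleArticles);
console.log(storyClusters.length); // 2: the first two articles share a cluster
```

Shares are then computed over clusters rather than raw articles, so a wire story republished by many outlets on the same day counts once.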
Stack & data sources
Stack:
- Node.js 20+
- Apify Actor framework
- axios, p-limit, tldts, csv-parse
- Optional: @xenova/transformers (for LLM weak labeler)
Data sources:
- GDELT DOC API — article metadata (title, URL, source, date)
- AllSides — brand→bias mapping (GitHub CSV)
- MBFC — domain→bias mapping (GitHub Gist)
- Local overrides — optional JSON file for manual corrections
No private or paid data; no background scraping.
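As a rough illustration of the GDELT fetch, the sketch below only builds a request URL and sends nothing. The endpoint and the `query`, `mode`, `format`, `timespan`, and `maxrecords` parameters reflect the public DOC 2.0 API as commonly documented; `buildGdeltUrl` and its option names are hypothetical, and the actor's actual query construction may differ.

```javascript
// Assumed public endpoint of the GDELT DOC 2.0 API.
const GDELT_DOC_ENDPOINT = "https://api.gdeltproject.org/api/v2/doc/doc";

// Hypothetical helper: build an article-list request for one query.
function buildGdeltUrl(query, { sinceHours = 24, restrictToEnglish = true, maxRecords = 250 } = {}) {
  const terms = [query];
  if (restrictToEnglish) terms.push("sourcelang:english"); // English-only filter
  const params = new URLSearchParams({
    query: terms.join(" "),
    mode: "artlist",             // list of matching articles
    format: "json",
    timespan: `${sinceHours}h`,  // lookback window in hours
    maxrecords: String(maxRecords),
  });
  return `${GDELT_DOC_ENDPOINT}?${params}`;
}

const exampleUrl = buildGdeltUrl("climate change", { sinceHours: 24 });
console.log(exampleUrl);
```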
Key design choices
- English-only filter: GDELT `sourcelang:english` to reduce non-English noise
- Cluster by story: title+day clustering → compute shares on clusters, not raw articles
- Known-only math: `*_pct_known` excludes Unknown from the denominator; `excludeUnknownFromBlindspot` uses known-only shares for flagging
- Gap-aware flags: requires both `side_pct < blindspotThresholdPct` AND `gap_vs_next_pct >= gapMinPct`
- Label precedence: Overrides > AllSides > MBFC > learned aliases > fuzzy; else unknown
- LLM weak labeler: optional, OFF by default; cached; conservative thresholds (confidence ≥0.8, margin ≥0.12)
- Provenance: label method captured for auditability (`domain`, `bias`, `method`)
- Confidence: `unknown_pct` reported; `confidence = 1 - unknown_pct/100`
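The known-only math, the gap-aware flag, and the confidence proxy can be worked through on small numbers. The sketch below is one possible reading, not the actor's code: it assumes that only the lowest-share side is tested and that "next side" means the side with the next-lowest share. The function names are hypothetical, and the cluster counts are chosen to reproduce the percentages in the sample output of this README.

```javascript
const SIDES = ["left", "center", "right"];
const round1 = (x) => Math.round(x * 10) / 10; // one decimal place

// Shares over known clusters only: Unknown is left out of the denominator.
function knownShares(counts) {
  const known = SIDES.reduce((sum, side) => sum + counts[side], 0);
  const shares = {};
  for (const side of SIDES) {
    shares[side] = known === 0 ? 0 : round1((100 * counts[side]) / known);
  }
  return { known, shares };
}

// Assumption: test the lowest side against the next-lowest side.
function blindspotFlags(shares, { blindspotThresholdPct = 20, gapMinPct = 12 } = {}) {
  const ranked = SIDES.map((side) => ({ side, pct: shares[side] })).sort((a, b) => a.pct - b.pct);
  const [lowest, next] = ranked;
  const gap = round1(next.pct - lowest.pct);
  const flagged = lowest.pct < blindspotThresholdPct && gap >= gapMinPct;
  return flagged ? [{ side: lowest.side, pct: lowest.pct, gap_vs_next_pct: gap }] : [];
}

// 41 clusters: 13 left, 0 center, 2 right, 26 unknown.
const counts = { left: 13, center: 0, right: 2, unknown: 26 };
const { known, shares } = knownShares(counts);
const unknownPct = round1((100 * counts.unknown) / (known + counts.unknown));
const confidence = Math.round((1 - unknownPct / 100) * 100) / 100;
const flags = blindspotFlags(shares);

console.log(shares);     // { left: 86.7, center: 0, right: 13.3 }
console.log(flags);      // [ { side: 'center', pct: 0, gap_vs_next_pct: 13.3 } ]
console.log(confidence); // 0.37
```

With 26 of 41 clusters unknown, the shares rest on only 15 known clusters, which is why the confidence proxy is low even though the flag fires.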
Install & run
Prerequisites:
- Node.js 20+
- Apify CLI (`npm install -g apify-cli@latest`) or Apify platform account
  - Note: This project uses the new `.actor/actor.json` format with a minimal `apify.json` for CLI compatibility
Local:
```bash
npm install
npx apify run
```
Apify platform:
- Upload actor to Apify platform
- Click "Run actor" with input JSON (see below)
Typical input
```json
{
  "queries": ["climate change", "immigration"],
  "sinceHours": 24,
  "blindspotThresholdPct": 20,
  "gapMinPct": 12,
  "maxRepUrlsPerSide": 3,
  "restrictToEnglish": true,
  "excludeUnknownFromBlindspot": true,
  "overridesPath": "./bias-overrides.json",
  "enableLearning": true,
  "learningMinCount": 3,
  "learningMinConsistency": 0.8,
  "suggestionsMax": 15,
  "suggestionsMinCount": 2,
  "enableLlmWeakLabeler": false,
  "llmMaxDomains": 15,
  "llmMinCount": 3,
  "llmConfidenceThreshold": 0.8,
  "llmMarginThreshold": 0.12,
  "llmSampleTitlesPerDomain": 8,
  "forceRefreshCache": false
}
```
Important defaults:
- `restrictToEnglish: true` filters to English sources
- `excludeUnknownFromBlindspot: true` uses known-only shares for blindspot detection
- `blindspotThresholdPct: 20` means a side must be below 20% to flag
- `gapMinPct: 12` means the gap to the next side must be at least 12 percentage points to flag
- `enableLlmWeakLabeler: false` is the recommended default (keeps runs lean)
Input fields
| Field | Type | Default | Description |
|---|---|---|---|
| `queries` | string[] | `["climate change", "immigration"]` | Search queries for GDELT |
| `sinceHours` | integer | `24` | Lookback window (1–168 hours) |
| `blindspotThresholdPct` | number | `20` | Side must be below this % to flag (0–100) |
| `gapMinPct` | number | `12` | Minimum gap vs next side to flag (0–100) |
| `maxRepUrlsPerSide` | integer | `3` | Representative URLs per side (1–20) |
| `restrictToEnglish` | boolean | `true` | Flag: Filter to English sources via GDELT |
| `excludeUnknownFromBlindspot` | boolean | `true` | Flag: Use known-only shares for blindspot math |
| `overridesPath` | string | `"./bias-overrides.json"` | Path to overrides JSON, e.g. `{ "example.com": "center" }`; each value is `left`, `center`, or `right` |
| `enableLearning` | boolean | `true` | Enable conservative alias learning |
| `learningMinCount` | integer | `3` | Min samples to learn alias (2–100) |
| `learningMinConsistency` | number | `0.8` | Min consistency to learn (0.7–1) |
| `suggestionsMax` | integer | `15` | Max suggested overrides per query (0–100) |
| `suggestionsMinCount` | integer | `2` | Min articles to suggest override (1–50) |
| `enableLlmWeakLabeler` | boolean | `false` | Flag: Enable LLM weak labeler (OFF by default) |
| `llmMaxDomains` | integer | `15` | Top unknown eTLD+1 to try (1–50) |
| `llmMinCount` | integer | `3` | Min articles per domain (2–50) |
| `llmConfidenceThreshold` | number | `0.8` | Min confidence (0.5–1) |
| `llmMarginThreshold` | number | `0.12` | Min margin winner–runnerUp (0–1) |
| `llmSampleTitlesPerDomain` | integer | `8` | Titles per domain to sample (1–20) |
| `forceRefreshCache` | boolean | `false` | Force refresh bias cache (ignores 7-day TTL) |
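For local runs it can help to check a few numeric fields against the ranges in the table before starting. The sketch below is an assumption-laden illustration: `resolveInput`, `INPUT_DEFAULTS`, and `INPUT_RANGES` are hypothetical names, only four fields are covered, and the actor's own input schema in `.actor/actor.json` remains the authoritative source.

```javascript
// Defaults and allowed ranges copied from the input table (subset only).
const INPUT_DEFAULTS = { sinceHours: 24, blindspotThresholdPct: 20, gapMinPct: 12, maxRepUrlsPerSide: 3 };
const INPUT_RANGES = {
  sinceHours: [1, 168],
  blindspotThresholdPct: [0, 100],
  gapMinPct: [0, 100],
  maxRepUrlsPerSide: [1, 20],
};

// Hypothetical helper: merge defaults, then reject out-of-range values.
function resolveInput(input = {}) {
  const resolved = { ...INPUT_DEFAULTS, ...input };
  for (const [field, [min, max]] of Object.entries(INPUT_RANGES)) {
    const value = resolved[field];
    if (typeof value !== "number" || value < min || value > max) {
      throw new RangeError(`${field} must be between ${min} and ${max}, got ${value}`);
    }
  }
  return resolved;
}

console.log(resolveInput({ sinceHours: 48 }).sinceHours); // 48
```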
Output schema
Each query produces one result object:
```json
{
  "schema_version": "1.1.0",
  "generated_at_utc": "2025-01-01T00:00:00.000Z",
  "query": "climate change",
  "total_clusters": 41,
  "total_articles_raw": 96,
  "left_pct": 31.7,
  "center_pct": 0.0,
  "right_pct": 4.9,
  "unknown_pct": 63.4,
  "left_pct_known": 86.7,
  "center_pct_known": 0.0,
  "right_pct_known": 13.3,
  "known_articles": 15,
  "blindspot_flags": [
    { "side": "center", "pct": 0.0, "gap_vs_next_pct": 13.3 }
  ],
  "representative_urls": {
    "left": ["https://..."],
    "center": [],
    "right": ["https://..."],
    "unknown": ["https://..."]
  },
  "confidence": 0.37,
  "confidence_note": "Unknown 63.4% — results based on 15 known of 41 total clusters",
  "unknown_summary": {
    "top_unknown_hosts": ["allafrica.com (3)", "scoop.co.nz (3)"],
    "suggested_overrides": [
      {
        "eTLD1": "allafrica.com",
        "support": 3,
        "mode_side": null,
        "consistency": null
      }
    ],
    "suggested_overrides_snippet": "{\n \"allafrica.com\": \"/* decide bias */\"\n}"
  },
  "provenance": {
    "labels_used": [
      {
        "domain": "bostonglobe.com",
        "bias": "left",
        "method": "domain:authoritative:host"
      }
    ]
  }
}
```
Field notes:
- `*_pct`: numbers (percent values to one decimal)
- `blindspot_flags`: zero or more flags; a flag appears when `side_pct < blindspotThresholdPct` and `gap_vs_next_pct >= gapMinPct`
- `confidence`: `1 - unknown_pct/100` (0.0–1.0)
- `representative_urls`: earliest URL per cluster, deduped by host, up to `maxRepUrlsPerSide` per side
- `suggested_overrides`: candidates for manual review (sorted by support count)
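The selection rule for `representative_urls` (earliest URL per cluster, deduped by host, capped per side) might look roughly like the sketch below. The cluster shape `{ side, articles: [{ url, seendate }] }` and the function name are assumptions made for illustration, and `seendate` is taken to be a sortable `YYYYMMDDTHHMMSSZ` string.

```javascript
// Hypothetical helper: pick representative URLs per side.
function representativeUrls(clusters, maxRepUrlsPerSide = 3) {
  const reps = { left: [], center: [], right: [], unknown: [] };
  const seenHosts = { left: new Set(), center: new Set(), right: new Set(), unknown: new Set() };
  for (const { side, articles } of clusters) {
    // Earliest article in the cluster (timestamps compare lexicographically).
    const earliest = articles.reduce((a, b) => (b.seendate < a.seendate ? b : a));
    const host = new URL(earliest.url).hostname;
    if (reps[side].length >= maxRepUrlsPerSide || seenHosts[side].has(host)) continue;
    seenHosts[side].add(host); // at most one URL per host on each side
    reps[side].push(earliest.url);
  }
  return reps;
}

const labeledClusters = [
  { side: "left", articles: [
    { url: "https://a.example/late", seendate: "20250101T120000Z" },
    { url: "https://b.example/early", seendate: "20250101T080000Z" },
  ] },
  { side: "left", articles: [{ url: "https://b.example/other", seendate: "20250101T090000Z" }] },
  { side: "right", articles: [{ url: "https://c.example/story", seendate: "20250101T100000Z" }] },
];
const reps = representativeUrls(labeledClusters);
console.log(reps.left);  // [ 'https://b.example/early' ]
console.log(reps.right); // [ 'https://c.example/story' ]
```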
Interpreting results
Blindspot flags:
- A flag means a side is both below the threshold (`blindspotThresholdPct`) and has a meaningful gap vs the next side (`gapMinPct`).
- Example: `{ "side": "center", "pct": 0.0, "gap_vs_next_pct": 13.3 }` means center has 0% coverage and is 13.3 percentage points below the next side.
Unknown% and confidence:
- High `unknown_pct` → low `confidence` → results based on fewer known sources.
- If `unknown_pct` > 50%, consider widening `sinceHours` or adding overrides.
Clusters vs articles:
- `total_clusters` ≤ `total_articles_raw` due to title+day deduplication (rewrites of the same story).
Representativeness caveats:
- GDELT is broad but not exhaustive.
- Bias maps (AllSides/MBFC) focus on US outlets; non-US outlets are often `unknown` unless overridden.
- Representative URLs are earliest per cluster, not necessarily most authoritative.
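A consumer of the dataset might apply these rules before acting on a flag. The sketch below is a hypothetical triage helper: the status labels and the 50% cutoff parameter are illustrative choices based on the guidance above, not part of the actor's output.

```javascript
// Hypothetical helper: decide how much weight one result object deserves.
function triageResult(result, maxUnknownPct = 50) {
  if (result.unknown_pct > maxUnknownPct) {
    // Too few known sources: widen sinceHours or add overrides first.
    return { query: result.query, status: "low-confidence", flaggedSides: [] };
  }
  const flaggedSides = result.blindspot_flags.map((flag) => flag.side);
  const status = flaggedSides.length > 0 ? "blindspot" : "no-flag";
  return { query: result.query, status, flaggedSides };
}

const sampleResult = {
  query: "climate change",
  unknown_pct: 63.4,
  blindspot_flags: [{ side: "center", pct: 0.0, gap_vs_next_pct: 13.3 }],
};
console.log(triageResult(sampleResult).status); // low-confidence
```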
Overrides & human-in-the-loop
Maintain a small bias-overrides.json (≤100 entries) for frequent eTLD+1s:
```json
{
  "example.com": "center",
  "paper.co.uk": "right"
}
```
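Before committing an overrides file, a quick sanity check can catch typos in the side names. The sketch below is a hypothetical validator; the rule that keys should be bare lowercase domains is an assumption based on the examples in this README.

```javascript
const VALID_SIDES = new Set(["left", "center", "right"]);

// Hypothetical helper: return a list of human-readable problems (empty if fine).
function validateOverrides(overrides) {
  const problems = [];
  for (const [domain, side] of Object.entries(overrides)) {
    if (!VALID_SIDES.has(side)) {
      problems.push(`${domain}: unexpected side "${side}"`);
    }
    if (domain !== domain.toLowerCase() || domain.includes("/")) {
      problems.push(`${domain}: expected a bare lowercase domain`);
    }
  }
  return problems;
}

console.log(validateOverrides({ "example.com": "center", "paper.co.uk": "right" })); // []
console.log(validateOverrides({ "Example.com/news": "centre" }).length); // 2
```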
Workflow:
- Review `unknown_summary.suggested_overrides` (sorted by support count).
- Research each domain (editorial stance, ownership, fact-checking records).
- Add to `bias-overrides.json` with chosen side (`left`, `center`, or `right`).
- Auditable process: PRs should state evidence and chosen side.
Precedence: Overrides > AllSides > MBFC > learned aliases > fuzzy; else unknown.
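The precedence chain can be read as a first-match lookup. The sketch below assumes four plain lookup tables and omits the fuzzy step; the `labelSource` name, the table shapes, and the `method` strings are hypothetical and do not match the actor's real provenance values.

```javascript
// Hypothetical helper: return the first label found, in precedence order.
function labelSource({ domain, brand }, maps) {
  const lookups = [
    ["override", maps.overrides?.[domain]],
    ["allsides", maps.allSides?.[brand]], // AllSides is keyed by brand name
    ["mbfc", maps.mbfc?.[domain]],        // MBFC is keyed by domain
    ["learned", maps.learned?.[domain]],
  ];
  for (const [method, bias] of lookups) {
    if (bias) return { domain, bias, method };
  }
  return { domain, bias: "unknown", method: "none" };
}

const maps = {
  overrides: { "paper.co.uk": "right" },
  allSides: { "Example Times": "left" },
  mbfc: { "paper.co.uk": "center", "example-times.com": "center" },
  learned: {},
};
console.log(labelSource({ domain: "paper.co.uk", brand: "The Paper" }, maps).bias);           // right
console.log(labelSource({ domain: "example-times.com", brand: "Example Times" }, maps).bias); // left
console.log(labelSource({ domain: "nowhere.example", brand: "Nowhere" }, maps).bias);         // unknown
```

Because overrides are checked first, a manual correction always wins over a conflicting entry in either public bias map.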
LLM weak labeler (optional)
Default: OFF. When enabled:
- Uses Transformers.js zero-shot classification (tries multilingual XNLI, falls back to English NLI).
- Strict thresholds: `confidence ≥ 0.8`, `margin ≥ 0.12`.
- Caches results in `bias-cache.json` (`targeted_cache.llm`).
- Negative cache: skips domains for 14 days if low-confidence or non-English (with English-only model).
- Not a replacement for overrides; use sparingly for top unknown domains.
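The two thresholds combine as a simple acceptance test on the classifier scores. The sketch below does not call Transformers.js; it assumes per-domain scores have already been aggregated into an object such as `{ left, center, right }`, and `acceptWeakLabel` is a hypothetical name.

```javascript
// Hypothetical helper: accept the top label only if it is both confident
// and clearly ahead of the runner-up; otherwise return null (stay unknown).
function acceptWeakLabel(scores, { confidenceThreshold = 0.8, marginThreshold = 0.12 } = {}) {
  const ranked = Object.entries(scores).sort((a, b) => b[1] - a[1]);
  if (ranked.length === 0) return null;
  const [winner, runnerUp] = ranked;
  const margin = winner[1] - (runnerUp ? runnerUp[1] : 0);
  return winner[1] >= confidenceThreshold && margin >= marginThreshold ? winner[0] : null;
}

console.log(acceptWeakLabel({ left: 0.85, center: 0.1, right: 0.05 })); // left
console.log(acceptWeakLabel({ left: 0.55, center: 0.4, right: 0.05 })); // null
```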
When to enable:
- High Unknown% and manual overrides are impractical.
- The models are already available offline, or the run environment can download them (Transformers.js downloads models on first use).
Operational notes
User-Agent:
- Update `USER_AGENT` in `main.js` with a real contact email: `"BlindspotDetectorBot/2.1 (contact: you@example.com)"`
Caching:
- `bias-cache.json` is refreshed once it is more than 7 days old (or immediately with `forceRefreshCache: true`).
- Includes `targeted_cache.llm` (LLM weak labels) and `llm_neg` (14-day skip list).
- Cache persists across runs.
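The two time windows (7-day TTL for the bias maps, 14-day negative cache for the LLM labeler) reduce to timestamp comparisons. The sketch below assumes ISO timestamps stored under hypothetical field names `fetched_at` and `skipped_at`; the real layout of `bias-cache.json` is not documented here.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// True if the ISO timestamp is valid and younger than ttlDays.
function isFresh(timestampIso, nowMs, ttlDays) {
  const then = Date.parse(timestampIso);
  return Number.isFinite(then) && nowMs - then < ttlDays * DAY_MS;
}

// Bias maps: refetch when forced, missing, or older than 7 days.
function shouldRefreshBiasMaps(cache, nowMs, forceRefreshCache = false) {
  return forceRefreshCache || !cache || !isFresh(cache.fetched_at, nowMs, 7);
}

// Negative cache: skip a domain for 14 days after a failed labeling attempt.
function isSkippedByNegativeCache(entry, nowMs) {
  return Boolean(entry) && isFresh(entry.skipped_at, nowMs, 14);
}

const nowMs = Date.parse("2025-01-10T00:00:00Z");
console.log(shouldRefreshBiasMaps({ fetched_at: "2025-01-05T00:00:00Z" }, nowMs));    // false
console.log(shouldRefreshBiasMaps({ fetched_at: "2025-01-01T00:00:00Z" }, nowMs));    // true
console.log(isSkippedByNegativeCache({ skipped_at: "2025-01-01T00:00:00Z" }, nowMs)); // true
```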
Logs to expect:
- Fetch counts: `🔍 Fetching news from GDELT...`
- Clusters: `📦 Query "X": N articles → M clusters`
- Unknown%: `confidence_note` in output
- Flags: `blindspot_flags` array
Throughput tips:
- Limit `queries` array size (parallel fetches, but GDELT rate limits apply).
- Widen `sinceHours` for sparse topics (more articles → better coverage).
Limitations
- Bias maps are imperfect: AllSides/MBFC focus on US outlets; some domains unresolved.
- Non-English and hyper-local outlets: Often remain `unknown` unless overridden.
- GDELT dedup isn't perfect: Title+day clustering helps but won't catch every variation.
- Not a fact-checker: Measures coverage, not truth or accuracy.
- English filter: `restrictToEnglish: true` excludes non-English sources (may miss relevant coverage).
Security & ethics
- Public sources only: GDELT DOC API, AllSides CSV, MBFC Gist (all public GitHub).
- No scraping: No paywalls, login, or background scraping.
- Transparent heuristics: Provenance captured; human review for overrides.
- Respectful rate limits: Retries with backoff; User-Agent identifies bot.
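Retries with backoff and an identifying User-Agent could be combined along the lines of the sketch below. It is an illustration only: it uses Node's built-in `fetch` rather than the project's axios dependency to stay self-contained, and the function names, retry count, and delay values are assumptions.

```javascript
// Exponential backoff: 500 ms, 1 s, 2 s, ... capped at maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical helper: GET a JSON document, retrying failed attempts.
async function fetchJsonWithRetry(url, {
  retries = 3,
  userAgent = "BlindspotDetectorBot/2.1 (contact: you@example.com)",
} = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await fetch(url, { headers: { "User-Agent": userAgent } });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return await res.json();
    } catch (err) {
      if (attempt >= retries) throw err; // give up after the last retry
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}

console.log([0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt))); // [ 500, 1000, 2000, 4000 ]
```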
Roadmap
- Title-relevance gate (behind a flag) to filter off-topic articles.
- Fill `mode_side`/`consistency` in suggestions by tallies from known sources.
- Provenance source granularity (AllSides vs MBFC vs Overrides in method field).
- Multilingual support when infrastructure allows (currently English-only filter).
Troubleshooting
High Unknown%:
- Widen `sinceHours` (more articles → more known sources).
- Add overrides for frequent unknown domains (see `unknown_summary.suggested_overrides`).
Empty reps for a side:
- Story imbalance or sparse data; increase `sinceHours` or check query specificity.
LLM model load errors:
- Ensure the models are available offline or can be downloaded (Transformers.js downloads them on first use).
- Keep `enableLlmWeakLabeler: false` if models are unavailable.
Cache issues:
- Set `forceRefreshCache: true` to rebuild the bias map.
- Check that `bias-cache.json` exists and is valid JSON.
Development
Repo structure:
```
blindspot-detector/
├── main.js              # Main actor logic
├── apify.json           # Minimal config for CLI compatibility
├── .actor/
│   └── actor.json       # Detailed actor configuration & input schema
├── package.json         # Dependencies
├── bias-cache.json      # Cached bias maps (auto-generated)
├── bias-overrides.json  # Manual overrides (optional)
└── input.json           # Example input
```
Configuration files:
- `apify.json`: minimal configuration file for Apify CLI compatibility (legacy format support)
- `.actor/actor.json`: detailed actor configuration including input schema, metadata, and build settings
Scripts:
```bash
npm install     # Install dependencies
npx apify run   # Run locally
```
Code style:
- Node.js 20+, ES modules
- Async/await, p-limit for concurrency
- Conservative learning thresholds
License
MIT
