News Blindspot
Pricing
Pay per usage
Detects open-data media coverage blindspots — topics heavily covered by one political side but undercovered by others. Uses GDELT, AllSides, and MBFC data only.
Developer

spiros douk
Open-data Blindspot Detector
Detects media coverage blindspots by analyzing political bias distribution across news sources using only public, legal data sources.
What it does
Finds blindspots — topics covered by one political side but under-covered by others — using public sources only. The actor:
- Fetches news articles from GDELT DOC API for specified queries
- Labels sources using AllSides (brand→bias), MBFC (domain→bias), and optional local overrides
- Clusters articles by story (title + day) to deduplicate rewrites
- Calculates known-only shares (Unknown excluded from denominator) for each political side
- Raises gap-aware flags when a side is both below a threshold and meaningfully lower than the next side
- Emits representative URLs per side and an Unknown% confidence proxy
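The clustering step above can be sketched as follows. This is a minimal illustration rather than the actor's code: the article shape (`title`, `url`, `seendate` as an ISO 8601 string) and the helper names `normalizeTitle` and `clusterByTitleAndDay` are assumptions, and the real title normalization may differ.

```javascript
// Hypothetical article shape: { title, url, seendate } with seendate as an ISO 8601 string.
function normalizeTitle(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ") // replace punctuation with spaces
    .replace(/\s+/g, " ")         // collapse runs of whitespace
    .trim();
}

// Group articles whose normalized title and calendar day both match.
function clusterByTitleAndDay(articles) {
  const clusters = new Map();
  for (const article of articles) {
    const day = article.seendate.slice(0, 10); // "YYYY-MM-DD"
    const key = `${normalizeTitle(article.title)}|${day}`;
    if (!clusters.has(key)) clusters.set(key, []);
    clusters.get(key).push(article);
  }
  return [...clusters.values()];
}

const sampleArticles = [
  { title: "Climate Bill Passes Senate", url: "https://a.example/1", seendate: "2025-01-01T08:00:00Z" },
  { title: "Climate bill passes Senate!", url: "https://b.example/2", seendate: "2025-01-01T11:30:00Z" },
  { title: "Climate bill passes Senate", url: "https://c.example/3", seendate: "2025-01-02T09:00:00Z" },
];
const sampleClusters = clusterByTitleAndDay(sampleArticles);
console.log(sampleClusters.length); // 2 (the first two articles share a title and day)
```

Shares are then computed over clusters rather than raw articles, so a story rewritten by many outlets on the same day counts once.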
Stack & data sources
Stack:
- Node.js 20+
- Apify Actor framework
- axios, p-limit, tldts, csv-parse
- Optional: @xenova/transformers (for LLM weak labeler)
Data sources:
- GDELT DOC API — article metadata (title, URL, source, date)
- AllSides — brand→bias mapping (GitHub CSV)
- MBFC — domain→bias mapping (GitHub Gist)
- Local overrides — optional JSON file for manual corrections
No private or paid data; no background scraping.
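As an illustration of the GDELT fetch, the request URL could be built roughly like this. The endpoint and the `query`, `mode`, `format`, `timespan`, and `maxrecords` parameters come from GDELT's public DOC 2.0 API documentation; which of them this actor actually sends is an assumption, and `buildGdeltUrl` is a hypothetical helper name.

```javascript
// Sketch only: builds a GDELT DOC 2.0 API URL for an article-list query.
// The parameter choices here are assumptions, not taken from the actor's source.
function buildGdeltUrl(query, { sinceHours = 24, restrictToEnglish = true, maxRecords = 250 } = {}) {
  // sourcelang:english is a GDELT query operator that limits results to English sources.
  const fullQuery = restrictToEnglish ? `${query} sourcelang:english` : query;
  const params = new URLSearchParams({
    query: fullQuery,
    mode: "artlist",           // return a list of matching articles
    format: "json",
    timespan: `${sinceHours}h`, // lookback window in hours
    maxrecords: String(maxRecords),
  });
  return `https://api.gdeltproject.org/api/v2/doc/doc?${params}`;
}

console.log(buildGdeltUrl("immigration", { sinceHours: 48 }));
```

Multi-word queries may need to be wrapped in double quotes to be matched as an exact phrase; the sketch passes the query through unchanged.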
Key design choices
- English-only filter: GDELT `sourcelang:english` to reduce non-English noise
- Cluster by story: title+day clustering → compute shares on clusters, not raw articles
- Known-only math: `*_pct_known` excludes Unknown from the denominator; `excludeUnknownFromBlindspot` uses known-only shares for flagging
- Gap-aware flags: requires both `side_pct < blindspotThresholdPct` AND `gap_vs_next_pct >= gapMinPct`
- Label precedence: Overrides > AllSides > MBFC > learned aliases > fuzzy; else unknown
- LLM weak labeler: optional, OFF by default; cached; conservative thresholds (confidence ≥0.8, margin ≥0.12)
- Provenance: label method captured for auditability (`domain`, `bias`, `method`)
- Confidence: `unknown_pct` reported; `confidence = 1 - unknown_pct/100`
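The known-only math and the confidence proxy can be made concrete with a small sketch. The function name `computeShares` and the per-side cluster counts are assumptions; the counts (13 left, 0 center, 2 right, 26 unknown) were chosen because they reproduce the percentages in the sample output shown under "Output schema".

```javascript
// Sketch: compute raw shares, known-only shares, and the confidence proxy
// from per-side cluster counts. Percentages are rounded to one decimal.
function computeShares({ left, center, right, unknown }) {
  const total = left + center + right + unknown;
  const known = left + center + right; // Unknown is excluded from this denominator
  const pct = (n, d) => (d === 0 ? 0 : Math.round((n / d) * 1000) / 10);
  const unknownPct = pct(unknown, total);
  return {
    left_pct: pct(left, total),
    center_pct: pct(center, total),
    right_pct: pct(right, total),
    unknown_pct: unknownPct,
    left_pct_known: pct(left, known),
    center_pct_known: pct(center, known),
    right_pct_known: pct(right, known),
    confidence: Math.round((1 - unknownPct / 100) * 100) / 100,
  };
}

const shares = computeShares({ left: 13, center: 0, right: 2, unknown: 26 });
console.log(shares.left_pct, shares.left_pct_known, shares.confidence); // 31.7 86.7 0.37
```

Left is 31.7% of all clusters but 86.7% of the known ones, which is why flagging on known-only shares and reporting `unknown_pct` alongside matter.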
Install & run
Prerequisites:
- Node.js 20+
- Apify CLI (`npm install -g apify-cli@latest`) or an Apify platform account
  - Note: This project uses the new `.actor/actor.json` format with a minimal `apify.json` for CLI compatibility
Local:
npm install
npx apify run
Apify platform:
- Upload actor to Apify platform
- Click "Run actor" with input JSON (see below)
Typical input
{
  "queries": ["climate change", "immigration"],
  "sinceHours": 24,
  "blindspotThresholdPct": 20,
  "gapMinPct": 12,
  "maxRepUrlsPerSide": 3,
  "restrictToEnglish": true,
  "excludeUnknownFromBlindspot": true,
  "overridesPath": "./bias-overrides.json",
  "enableLearning": true,
  "learningMinCount": 3,
  "learningMinConsistency": 0.8,
  "suggestionsMax": 15,
  "suggestionsMinCount": 2,
  "enableLlmWeakLabeler": false,
  "llmMaxDomains": 15,
  "llmMinCount": 3,
  "llmConfidenceThreshold": 0.8,
  "llmMarginThreshold": 0.12,
  "llmSampleTitlesPerDomain": 8,
  "forceRefreshCache": false
}
Important defaults:
- `restrictToEnglish: true` filters to English sources
- `excludeUnknownFromBlindspot: true` uses known-only shares for blindspot detection
- `blindspotThresholdPct: 20` means a side must be <20% to flag
- `gapMinPct: 12` means the gap to the next side must be ≥12% to flag
- `enableLlmWeakLabeler: false` is the recommended default (keeps runs lean)
Input fields
| Field | Type | Default | Description |
|---|---|---|---|
| `queries` | string[] | `["climate change", "immigration"]` | Search queries for GDELT |
| `sinceHours` | integer | `24` | Lookback window (1–168 hours) |
| `blindspotThresholdPct` | number | `20` | Side must be below this % to flag (0–100) |
| `gapMinPct` | number | `12` | Minimum gap vs next side to flag (0–100) |
| `maxRepUrlsPerSide` | integer | `3` | Representative URLs per side (1–20) |
| `restrictToEnglish` | boolean | `true` | Flag: Filter to English sources via GDELT |
| `excludeUnknownFromBlindspot` | boolean | `true` | Flag: Use known-only shares for blindspot math |
| `overridesPath` | string | `"./bias-overrides.json"` | Path to overrides JSON mapping domain to side, e.g. `{ "example.com": "center" }`; allowed sides are `left`, `center`, `right` |
| `enableLearning` | boolean | `true` | Enable conservative alias learning |
| `learningMinCount` | integer | `3` | Min samples to learn alias (2–100) |
| `learningMinConsistency` | number | `0.8` | Min consistency to learn (0.7–1) |
| `suggestionsMax` | integer | `15` | Max suggested overrides per query (0–100) |
| `suggestionsMinCount` | integer | `2` | Min articles to suggest override (1–50) |
| `enableLlmWeakLabeler` | boolean | `false` | Flag: Enable LLM weak labeler (OFF by default) |
| `llmMaxDomains` | integer | `15` | Top unknown eTLD+1 to try (1–50) |
| `llmMinCount` | integer | `3` | Min articles per domain (2–50) |
| `llmConfidenceThreshold` | number | `0.8` | Min confidence (0.5–1) |
| `llmMarginThreshold` | number | `0.12` | Min margin between winner and runner-up (0–1) |
| `llmSampleTitlesPerDomain` | integer | `8` | Titles per domain to sample (1–20) |
| `forceRefreshCache` | boolean | `false` | Force refresh bias cache (ignores 7-day TTL) |
Output schema
Each query produces one result object:
{
  "schema_version": "1.1.0",
  "generated_at_utc": "2025-01-01T00:00:00.000Z",
  "query": "climate change",
  "total_clusters": 41,
  "total_articles_raw": 96,
  "left_pct": 31.7,
  "center_pct": 0.0,
  "right_pct": 4.9,
  "unknown_pct": 63.4,
  "left_pct_known": 86.7,
  "center_pct_known": 0.0,
  "right_pct_known": 13.3,
  "known_articles": 15,
  "blindspot_flags": [
    { "side": "center", "pct": 0.0, "gap_vs_next_pct": 13.3 }
  ],
  "representative_urls": {
    "left": ["https://..."],
    "center": [],
    "right": ["https://..."],
    "unknown": ["https://..."]
  },
  "confidence": 0.37,
  "confidence_note": "Unknown 63.4% — results based on 15 known of 41 total clusters",
  "unknown_summary": {
    "top_unknown_hosts": ["allafrica.com (3)", "scoop.co.nz (3)"],
    "suggested_overrides": [
      {
        "eTLD1": "allafrica.com",
        "support": 3,
        "mode_side": null,
        "consistency": null
      }
    ],
    "suggested_overrides_snippet": "{\n \"allafrica.com\": \"/* decide bias */\"\n}"
  },
  "provenance": {
    "labels_used": [
      {
        "domain": "bostonglobe.com",
        "bias": "left",
        "method": "domain:authoritative:host"
      }
    ]
  }
}
Field notes:
- `*_pct`: numbers (percent values to one decimal)
- `blindspot_flags`: zero or more flags; a flag appears when `side_pct < blindspotThresholdPct` and `gap_vs_next_pct >= gapMinPct`
- `confidence`: `1 - unknown_pct/100` (0.0–1.0)
- `representative_urls`: earliest URL per cluster, deduped by host, up to `maxRepUrlsPerSide` per side
- `suggested_overrides`: candidates for manual review (sorted by support count)
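The `representative_urls` rule (earliest URL per cluster, deduped by host, capped per side) could look roughly like the sketch below. The function name `pickRepresentativeUrls` and the cluster shape (arrays of `{ url, seendate }` with ISO 8601 timestamps, already filtered to one side) are assumptions for illustration.

```javascript
// Sketch: choose representative URLs for one side.
// clusters: non-empty arrays of { url, seendate } (hypothetical shape).
function pickRepresentativeUrls(clusters, maxRepUrlsPerSide = 3) {
  const seenHosts = new Set();
  const reps = [];
  for (const cluster of clusters) {
    // ISO 8601 timestamps in the same format sort correctly as strings.
    const earliest = cluster.reduce((a, b) => (a.seendate <= b.seendate ? a : b));
    const host = new URL(earliest.url).hostname;
    if (seenHosts.has(host)) continue; // at most one URL per host
    seenHosts.add(host);
    reps.push(earliest.url);
    if (reps.length >= maxRepUrlsPerSide) break;
  }
  return reps;
}

const leftClusters = [
  [
    { url: "https://a.example/late", seendate: "2025-01-01T10:00:00Z" },
    { url: "https://b.example/early", seendate: "2025-01-01T08:00:00Z" },
  ],
  [{ url: "https://b.example/other", seendate: "2025-01-01T09:00:00Z" }],
  [{ url: "https://c.example/x", seendate: "2025-01-01T12:00:00Z" }],
];
console.log(pickRepresentativeUrls(leftClusters));
// [ 'https://b.example/early', 'https://c.example/x' ]
```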
Interpreting results
Blindspot flags:
- A flag means a side is both below the threshold (`blindspotThresholdPct`) and has a meaningful gap vs the next side (`gapMinPct`).
- Example: `{ "side": "center", "pct": 0.0, "gap_vs_next_pct": 13.3 }` means center has 0% coverage and is 13.3 percentage points below the next side.
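A minimal sketch of the two-condition check follows. How the actor defines "the next side" is not spelled out here, so the sketch assumes it is the side ranked immediately above when shares are sorted in ascending order; `findBlindspots` is a hypothetical name. Under this reading the sample output's known-only shares would flag right as well as center, so the actor's real rule may be narrower.

```javascript
// Sketch: flag a side when its share is below the threshold AND the gap
// to the next-higher side is at least gapMinPct (assumed interpretation).
function findBlindspots(sharesBySide, blindspotThresholdPct = 20, gapMinPct = 12) {
  const ranked = Object.entries(sharesBySide).sort((a, b) => a[1] - b[1]);
  const flags = [];
  for (let i = 0; i < ranked.length - 1; i++) {
    const [side, pct] = ranked[i];
    const gap = Math.round((ranked[i + 1][1] - pct) * 10) / 10;
    if (pct < blindspotThresholdPct && gap >= gapMinPct) {
      flags.push({ side, pct, gap_vs_next_pct: gap });
    }
  }
  return flags;
}

// Center is low and well below the next side: flagged.
console.log(findBlindspots({ left: 55, center: 10, right: 35 }));
// [ { side: 'center', pct: 10, gap_vs_next_pct: 25 } ]

// Center is low, but only 7 points below the next side: not flagged.
console.log(findBlindspots({ left: 55, center: 19, right: 26 }));
// []
```

The second case shows what the gap condition adds: a side just under the threshold is not reported when its neighbor is nearly as low.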
Unknown% and confidence:
- High `unknown_pct` → low `confidence` → results based on fewer known sources.
- If `unknown_pct > 50%`, consider widening `sinceHours` or adding overrides.
Clusters vs articles:
- `total_clusters` ≤ `total_articles_raw` due to title+day deduplication (rewrites of the same story).
Representativeness caveats:
- GDELT is broad but not exhaustive.
- Bias maps (AllSides/MBFC) focus on US outlets; non-US outlets are often `unknown` unless overridden.
- Representative URLs are earliest per cluster, not necessarily most authoritative.
Overrides & human-in-the-loop
Maintain a small bias-overrides.json (≤100 entries) for frequent eTLD+1s:
{
  "example.com": "center",
  "paper.co.uk": "right"
}
Workflow:
- Review `unknown_summary.suggested_overrides` (sorted by support count).
- Research each domain (editorial stance, ownership, fact-checking records).
- Add to `bias-overrides.json` with the chosen side (`left`, `center`, or `right`).
- Auditable process: PRs should state evidence and chosen side.
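For orientation, the suggestion list that feeds this workflow can be approximated by tallying unknown domains. The sketch below is an assumption about the logic, not the actor's implementation: `suggestOverrides` is a hypothetical name, its input is a flat list of unlabeled eTLD+1 strings (one per article), and `mode_side` and `consistency` are left `null` as in the sample output.

```javascript
// Sketch: count unknown eTLD+1 domains and emit the most frequent as candidates.
function suggestOverrides(unknownDomains, { suggestionsMinCount = 2, suggestionsMax = 15 } = {}) {
  const support = new Map();
  for (const domain of unknownDomains) {
    support.set(domain, (support.get(domain) ?? 0) + 1);
  }
  return [...support.entries()]
    .filter(([, count]) => count >= suggestionsMinCount)
    // Highest support first; ties broken alphabetically for stable output.
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, suggestionsMax)
    .map(([eTLD1, count]) => ({ eTLD1, support: count, mode_side: null, consistency: null }));
}

const unknownDomains = [
  "allafrica.com", "scoop.co.nz", "allafrica.com",
  "scoop.co.nz", "allafrica.com", "scoop.co.nz", "one-off.example",
];
console.log(suggestOverrides(unknownDomains).map((s) => `${s.eTLD1} (${s.support})`));
// [ 'allafrica.com (3)', 'scoop.co.nz (3)' ]
```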
Precedence: Overrides > AllSides > MBFC > learned aliases > fuzzy; else unknown.
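The precedence chain amounts to an ordered lookup that stops at the first source with an answer. In this simplified sketch every source is a `Map` keyed by domain, which is an assumption: AllSides is actually keyed by brand and the fuzzy step is a matcher rather than a table. The `method` strings are also illustrative and do not match the actor's real provenance values.

```javascript
// Sketch: resolve a domain's bias by checking sources in precedence order.
const PRECEDENCE = ["overrides", "allsides", "mbfc", "learned", "fuzzy"];

function resolveBias(domain, sources) {
  for (const method of PRECEDENCE) {
    const bias = sources[method]?.get(domain); // missing sources are skipped
    if (bias) return { domain, bias, method };
  }
  return { domain, bias: "unknown", method: "none" };
}

const sources = {
  overrides: new Map([["paper.co.uk", "right"]]),
  allsides: new Map([["paper.co.uk", "center"], ["bostonglobe.com", "left"]]),
  mbfc: new Map(),
};

console.log(resolveBias("paper.co.uk", sources));
// { domain: 'paper.co.uk', bias: 'right', method: 'overrides' }
console.log(resolveBias("unlisted.example", sources));
// { domain: 'unlisted.example', bias: 'unknown', method: 'none' }
```

Returning the winning `method` alongside the label is what makes the result auditable in the `provenance` output.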
LLM weak labeler (optional)
Default: OFF. When enabled:
- Uses Transformers.js zero-shot classification (tries multilingual XNLI, falls back to English NLI).
- Strict thresholds: `confidence ≥ 0.8`, `margin ≥ 0.12`.
- Caches results in `bias-cache.json` (`targeted_cache.llm`).
- Negative cache: skips domains for 14 days if low-confidence or non-English (with English-only model).
- Not a replacement for overrides; use sparingly for top unknown domains.
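The two thresholds can be read as an acceptance gate on the classifier's per-side scores. The sketch below covers only that gate and does not call Transformers.js; `acceptWeakLabel` is a hypothetical name, and the scores are assumed to be independent per label (so a high winner can still have a close runner-up). How the actor aggregates scores across sampled titles is not shown here.

```javascript
// Sketch: accept a weak label only if the top score clears the confidence
// threshold and beats the runner-up by at least the margin threshold.
function acceptWeakLabel(scores, { confidenceThreshold = 0.8, marginThreshold = 0.12 } = {}) {
  const ranked = Object.entries(scores).sort((a, b) => b[1] - a[1]);
  if (ranked.length === 0) return null;
  const [winner, runnerUp] = ranked;
  const margin = winner[1] - (runnerUp ? runnerUp[1] : 0);
  if (winner[1] >= confidenceThreshold && margin >= marginThreshold) {
    return winner[0];
  }
  return null; // caller would leave the domain unknown (and may negative-cache it)
}

console.log(acceptWeakLabel({ left: 0.85, center: 0.1, right: 0.05 })); // left
console.log(acceptWeakLabel({ left: 0.9, center: 0.85, right: 0.1 }));  // null (margin too small)
console.log(acceptWeakLabel({ left: 0.55, center: 0.4, right: 0.05 })); // null (confidence too low)
```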
When to enable:
- High Unknown% and manual overrides are impractical.
- The model is already cached locally or the run can download it (Transformers.js downloads models on first use).
Operational notes
User-Agent:
- Update `USER_AGENT` in `main.js` with a real contact email: `"BlindspotDetectorBot/2.1 (contact: you@example.com)"`
Caching:
- `bias-cache.json` is refreshed once it is older than 7 days (or immediately with `forceRefreshCache: true`).
- Includes `targeted_cache.llm` (LLM weak labels) and `llm_neg` (14-day skip list).
- Cache persists across runs.
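The two time windows (7 days for the bias map, 14 days for the negative cache) reduce to simple timestamp comparisons. The sketch below assumes a `fetchedAt` millisecond timestamp on the cache object and a domain-to-timestamp layout for `llm_neg`; both field layouts and all function names are hypothetical.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// True while the timestamp is still within ttlDays of "now".
function isFresh(timestampMs, ttlDays, nowMs = Date.now()) {
  return nowMs - timestampMs <= ttlDays * DAY_MS;
}

// Refresh when forced, when the cache is missing or malformed, or after 7 days.
function shouldRefreshBiasCache(cache, forceRefreshCache, nowMs = Date.now()) {
  if (forceRefreshCache || !cache || typeof cache.fetchedAt !== "number") return true;
  return !isFresh(cache.fetchedAt, 7, nowMs);
}

// A domain on the skip list is ignored by the LLM labeler for 14 days.
function isNegativelyCached(llmNeg, domain, nowMs = Date.now()) {
  const skippedAt = llmNeg[domain];
  return typeof skippedAt === "number" && isFresh(skippedAt, 14, nowMs);
}

const now = 1_700_000_000_000; // fixed "now" so the example is deterministic
console.log(shouldRefreshBiasCache({ fetchedAt: now - 3 * DAY_MS }, false, now)); // false
console.log(shouldRefreshBiasCache({ fetchedAt: now - 8 * DAY_MS }, false, now)); // true
console.log(isNegativelyCached({ "x.example": now - 10 * DAY_MS }, "x.example", now)); // true
```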
Logs to expect:
- Fetch counts: `🔍 Fetching news from GDELT...`
- Clusters: `📦 Query "X": N articles → M clusters`
- Unknown%: `confidence_note` in output
- Flags: `blindspot_flags` array
Throughput tips:
- Limit `queries` array size (parallel fetches, but GDELT rate limits apply).
- Widen `sinceHours` for sparse topics (more articles → better coverage).
Limitations
- Bias maps are imperfect: AllSides/MBFC focus on US outlets; some domains unresolved.
- Non-English and hyper-local outlets: Often remain `unknown` unless overridden.
- GDELT dedup isn't perfect: Title+day clustering helps but won't catch every variation.
- Not a fact-checker: Measures coverage, not truth or accuracy.
- English filter: `restrictToEnglish: true` excludes non-English sources (may miss relevant coverage).
Security & ethics
- Public sources only: GDELT DOC API, AllSides CSV, MBFC Gist (all public GitHub).
- No scraping: No paywalls, login, or background scraping.
- Transparent heuristics: Provenance captured; human review for overrides.
- Respectful rate limits: Retries with backoff; User-Agent identifies bot.
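"Retries with backoff" can be illustrated with a generic exponential-backoff wrapper. This is a common pattern and not the actor's actual retry code; the function names, retry count, and delay values are all assumptions.

```javascript
// Delay doubles with each attempt, capped at maxMs: 500, 1000, 2000, ...
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls fn until it succeeds or the retries are used up, waiting between attempts.
async function withRetry(fn, { retries = 3, baseMs = 500, sleep = defaultSleep } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < retries) await sleep(backoffDelayMs(attempt, baseMs));
    }
  }
  throw lastError;
}

console.log([0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt))); // [ 500, 1000, 2000, 4000 ]
```

A fetch would be wrapped as `withRetry(() => fetchSomething(url))`, where `fetchSomething` stands for whatever request function is in use.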
Roadmap
- Title-relevance gate (behind a flag) to filter off-topic articles.
- Fill `mode_side`/`consistency` in suggestions by tallies from known sources.
- Provenance source granularity (AllSides vs MBFC vs Overrides in method field).
- Multilingual support when infrastructure allows (currently English-only filter).
Troubleshooting
High Unknown%:
- Widen `sinceHours` (more articles → more known sources).
- Add overrides for frequent unknown domains (see `unknown_summary.suggested_overrides`).
Empty reps for a side:
- Story imbalance or sparse data; increase `sinceHours` or check query specificity.
LLM model load errors:
- Ensure the models are cached locally or that the run has network access to fetch them (Transformers.js downloads on first use).
- Keep `enableLlmWeakLabeler: false` if models are unavailable.
Cache issues:
- Set `forceRefreshCache: true` to rebuild the bias map.
- Check that `bias-cache.json` exists and is valid JSON.
Development
Repo structure:
blindspot-detector/
├── main.js                # Main actor logic
├── apify.json             # Minimal config for CLI compatibility
├── .actor/
│   └── actor.json         # Detailed actor configuration & input schema
├── package.json           # Dependencies
├── bias-cache.json        # Cached bias maps (auto-generated)
├── bias-overrides.json    # Manual overrides (optional)
└── input.json             # Example input
Configuration files:
- `apify.json`: minimal configuration file for Apify CLI compatibility (legacy format support)
- `.actor/actor.json`: detailed actor configuration including input schema, metadata, and build settings
Scripts:
npm install      # Install dependencies
npx apify run    # Run locally
Code style:
- Node.js 20+, ES modules
- Async/await, p-limit for concurrency
- Conservative learning thresholds
License
MIT
