News Blindspot

Pricing

Pay per usage

Detects open-data media coverage blindspots — topics heavily covered by one political side but undercovered by others. Uses GDELT, AllSides, and MBFC data only.

Developer

spiros douk

Maintained by Community

Open-data Blindspot Detector

Detects media coverage blindspots by analyzing political bias distribution across news sources using only public, legal data sources.

What it does

Finds blindspots — topics covered by one political side but under-covered by others — using public sources only. The actor:

  • Fetches news articles from GDELT DOC API for specified queries
  • Labels sources using AllSides (brand→bias), MBFC (domain→bias), and optional local overrides
  • Clusters articles by story (title + day) to deduplicate rewrites
  • Calculates known-only shares (Unknown excluded from denominator) for each political side
  • Raises gap-aware flags when a side is both below a threshold and meaningfully lower than the next side
  • Emits representative URLs per side and an Unknown% confidence proxy
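
As a rough illustration of the clustering step, the sketch below groups articles by normalized title and UTC day. The article shape (`title`, `url`, `seendate` as an ISO 8601 string) and both function names are assumptions made for this example; the actor's real normalization may differ.

```javascript
// Hypothetical article shape: { title, url, seendate } with seendate as an ISO 8601 string.
function normalizeTitle(title) {
  return title
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, " ") // replace punctuation with spaces
    .replace(/\s+/g, " ")              // collapse repeated whitespace
    .trim();
}

function clusterByTitleAndDay(articles) {
  const clusters = new Map();
  for (const article of articles) {
    // UTC calendar day, YYYY-MM-DD. An unparseable date throws a RangeError here.
    const day = new Date(article.seendate).toISOString().slice(0, 10);
    const key = `${normalizeTitle(article.title)}|${day}`;
    if (!clusters.has(key)) clusters.set(key, []);
    clusters.get(key).push(article);
  }
  return [...clusters.values()];
}
```

Shares would then be computed over the returned clusters rather than over raw articles, so a story republished by many outlets with the same headline counts once.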

Stack & data sources

Stack:

  • Node.js 20+
  • Apify Actor framework
  • axios, p-limit, tldts, csv-parse
  • Optional: @xenova/transformers (for LLM weak labeler)

Data sources:

  • GDELT DOC API — article metadata (title, URL, source, date)
  • AllSides — brand→bias mapping (GitHub CSV)
  • MBFC — domain→bias mapping (GitHub Gist)
  • Local overrides — optional JSON file for manual corrections

No private or paid data; no background scraping.
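
For orientation, a GDELT DOC request URL might be assembled as in the sketch below. The parameter names (`query`, `mode`, `format`, `timespan`, `maxrecords`) and the `sourcelang:english` operator are taken from the public GDELT DOC 2.0 documentation as understood here; the helper name `buildGdeltUrl` and its options are hypothetical, and the actor's actual request may be built differently.

```javascript
// Hypothetical helper: builds a GDELT DOC 2.0 article-list URL for one query.
function buildGdeltUrl(query, { sinceHours = 24, restrictToEnglish = true, maxRecords = 250 } = {}) {
  // Quote multi-word queries so they are matched as a phrase.
  const terms = [query.includes(" ") ? `"${query}"` : query];
  if (restrictToEnglish) terms.push("sourcelang:english");

  const params = new URLSearchParams({
    query: terms.join(" "),
    mode: "ArtList",             // list of matching articles
    format: "json",
    timespan: `${sinceHours}h`,  // lookback window in hours
    maxrecords: String(maxRecords),
  });
  return `https://api.gdeltproject.org/api/v2/doc/doc?${params}`;
}
```

Calling `buildGdeltUrl("climate change", { sinceHours: 48 })` yields a URL whose `query` parameter is the quoted phrase followed by `sourcelang:english`.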

Key design choices

  • English-only filter — GDELT sourcelang:english to reduce non-English noise
  • Cluster by story — title+day clustering → compute shares on clusters, not raw articles
  • Known-only math: *_pct_known excludes Unknown from the denominator; excludeUnknownFromBlindspot uses known-only shares for flagging
  • Gap-aware flags — requires both side_pct < blindspotThresholdPct AND gap_vs_next_pct >= gapMinPct
  • Label precedence: Overrides > AllSides > MBFC > learned aliases > fuzzy; else unknown
  • LLM weak labeler — optional, OFF by default; cached; conservative thresholds (confidence ≥0.8, margin ≥0.12)
  • Provenance — label method captured for auditability (domain, bias, method)
  • Confidence: unknown_pct reported; confidence = 1 - unknown_pct/100
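
The known-only math and the confidence proxy can be sketched as follows. The function name `computeShares` and the input shape (cluster counts per label) are assumptions for illustration; the output field names mirror the output schema documented further down.

```javascript
// Round to one decimal place, as the *_pct fields are reported.
function round1(x) {
  return Math.round(x * 10) / 10;
}

// counts: cluster counts per label, e.g. { left: 13, center: 0, right: 2, unknown: 26 }
function computeShares({ left = 0, center = 0, right = 0, unknown = 0 }) {
  const known = left + center + right;
  const total = known + unknown;
  const pct = (n, d) => (d === 0 ? 0 : round1((n / d) * 100));
  const unknownPct = pct(unknown, total);
  return {
    left_pct: pct(left, total),
    center_pct: pct(center, total),
    right_pct: pct(right, total),
    unknown_pct: unknownPct,
    // Known-only shares: Unknown is excluded from the denominator.
    left_pct_known: pct(left, known),
    center_pct_known: pct(center, known),
    right_pct_known: pct(right, known),
    // confidence = 1 - unknown_pct/100, rounded to two decimals
    confidence: Math.round((1 - unknownPct / 100) * 100) / 100,
  };
}
```

With counts of 13 left, 0 center, 2 right, and 26 unknown (41 clusters in total), this gives left_pct 31.7, unknown_pct 63.4, left_pct_known 86.7, right_pct_known 13.3, and confidence 0.37.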

Install & run

Prerequisites:

  • Node.js 20+
  • Apify CLI (npm install -g apify-cli@latest) or Apify platform account
    • Note: This project uses the new .actor/actor.json format with a minimal apify.json for CLI compatibility

Local:

npm install
npx apify run

Apify platform:

  1. Upload actor to Apify platform
  2. Click "Run actor" with input JSON (see below)

Typical input

{
  "queries": ["climate change", "immigration"],
  "sinceHours": 24,
  "blindspotThresholdPct": 20,
  "gapMinPct": 12,
  "maxRepUrlsPerSide": 3,
  "restrictToEnglish": true,
  "excludeUnknownFromBlindspot": true,
  "overridesPath": "./bias-overrides.json",
  "enableLearning": true,
  "learningMinCount": 3,
  "learningMinConsistency": 0.8,
  "suggestionsMax": 15,
  "suggestionsMinCount": 2,
  "enableLlmWeakLabeler": false,
  "llmMaxDomains": 15,
  "llmMinCount": 3,
  "llmConfidenceThreshold": 0.8,
  "llmMarginThreshold": 0.12,
  "llmSampleTitlesPerDomain": 8,
  "forceRefreshCache": false
}

Important defaults:

  • restrictToEnglish: true — filters to English sources
  • excludeUnknownFromBlindspot: true — uses known-only shares for blindspot detection
  • blindspotThresholdPct: 20 — side must be <20% to flag
  • gapMinPct: 12 — gap to next side must be ≥12% to flag
  • enableLlmWeakLabeler: false (recommended default; keeps runs lean)

Input fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| queries | string[] | ["climate change", "immigration"] | Search queries for GDELT |
| sinceHours | integer | 24 | Lookback window (1–168 hours) |
| blindspotThresholdPct | number | 20 | Side must be below this % to flag (0–100) |
| gapMinPct | number | 12 | Minimum gap vs next side to flag (0–100) |
| maxRepUrlsPerSide | integer | 3 | Representative URLs per side (1–20) |
| restrictToEnglish | boolean | true | Flag: Filter to English sources via GDELT |
| excludeUnknownFromBlindspot | boolean | true | Flag: Use known-only shares for blindspot math |
| overridesPath | string | "./bias-overrides.json" | Path to overrides JSON: { "example.com": "left" } (value is left, center, or right) |
| enableLearning | boolean | true | Enable conservative alias learning |
| learningMinCount | integer | 3 | Min samples to learn alias (2–100) |
| learningMinConsistency | number | 0.8 | Min consistency to learn (0.7–1) |
| suggestionsMax | integer | 15 | Max suggested overrides per query (0–100) |
| suggestionsMinCount | integer | 2 | Min articles to suggest override (1–50) |
| enableLlmWeakLabeler | boolean | false | Flag: Enable LLM weak labeler (OFF by default) |
| llmMaxDomains | integer | 15 | Top unknown eTLD+1 to try (1–50) |
| llmMinCount | integer | 3 | Min articles per domain (2–50) |
| llmConfidenceThreshold | number | 0.8 | Min confidence (0.5–1) |
| llmMarginThreshold | number | 0.12 | Min margin winner–runnerUp (0–1) |
| llmSampleTitlesPerDomain | integer | 8 | Titles per domain to sample (1–20) |
| forceRefreshCache | boolean | false | Force refresh bias cache (ignores 7-day TTL) |
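
If input needs to be checked before a run, a caller could merge defaults and validate the documented ranges along the lines of the sketch below. It covers only four numeric fields, and `DEFAULTS`, `RANGES`, and `normalizeInput` are hypothetical names for this example; on the Apify platform the input schema in .actor/actor.json would normally handle this.

```javascript
// Defaults and allowed ranges copied from the input fields listed above (subset only).
const DEFAULTS = { sinceHours: 24, blindspotThresholdPct: 20, gapMinPct: 12, maxRepUrlsPerSide: 3 };
const RANGES = {
  sinceHours: [1, 168],
  blindspotThresholdPct: [0, 100],
  gapMinPct: [0, 100],
  maxRepUrlsPerSide: [1, 20],
};

function normalizeInput(input = {}) {
  const merged = { ...DEFAULTS, ...input }; // caller values win over defaults
  for (const [field, [min, max]] of Object.entries(RANGES)) {
    const value = merged[field];
    if (typeof value !== "number" || Number.isNaN(value) || value < min || value > max) {
      throw new RangeError(`${field} must be a number between ${min} and ${max}, got ${value}`);
    }
  }
  return merged;
}
```

For example, `normalizeInput({ sinceHours: 48 })` keeps the caller's lookback window and fills in the other three defaults, while `normalizeInput({ sinceHours: 500 })` throws a RangeError.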

Output schema

Each query produces one result object:

{
  "schema_version": "1.1.0",
  "generated_at_utc": "2025-01-01T00:00:00.000Z",
  "query": "climate change",
  "total_clusters": 41,
  "total_articles_raw": 96,
  "left_pct": 31.7,
  "center_pct": 0.0,
  "right_pct": 4.9,
  "unknown_pct": 63.4,
  "left_pct_known": 86.7,
  "center_pct_known": 0.0,
  "right_pct_known": 13.3,
  "known_articles": 15,
  "blindspot_flags": [
    { "side": "center", "pct": 0.0, "gap_vs_next_pct": 13.3 }
  ],
  "representative_urls": {
    "left": ["https://..."],
    "center": [],
    "right": ["https://..."],
    "unknown": ["https://..."]
  },
  "confidence": 0.37,
  "confidence_note": "Unknown 63.4% — results based on 15 known of 41 total clusters",
  "unknown_summary": {
    "top_unknown_hosts": ["allafrica.com (3)", "scoop.co.nz (3)"],
    "suggested_overrides": [
      {
        "eTLD1": "allafrica.com",
        "support": 3,
        "mode_side": null,
        "consistency": null
      }
    ],
    "suggested_overrides_snippet": "{\n \"allafrica.com\": \"/* decide bias */\"\n}"
  },
  "provenance": {
    "labels_used": [
      {
        "domain": "bostonglobe.com",
        "bias": "left",
        "method": "domain:authoritative:host"
      }
    ]
  }
}

Field notes:

  • *_pct: numbers (percent values to one decimal)
  • blindspot_flags — zero or more flags; appears when side_pct < blindspotThresholdPct and gap_vs_next_pct >= gapMinPct
  • confidence: 1 - unknown_pct/100 (0.0–1.0)
  • representative_urls — earliest URL per cluster, deduped by host, up to maxRepUrlsPerSide per side
  • suggested_overrides — candidates for manual review (sorted by support count)
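
The representative_urls rule (earliest URL per cluster, deduped by host, capped per side) could look roughly like the sketch below. The cluster shape (`side`, `articles` with `url` and an ISO 8601 `seendate`) and the function name are assumptions for this example, not the actor's internal structures.

```javascript
// clusters: [{ side: "left" | "center" | "right" | "unknown", articles: [{ url, seendate }] }]
function pickRepresentativeUrls(clusters, maxRepUrlsPerSide = 3) {
  const reps = { left: [], center: [], right: [], unknown: [] };
  const seenHosts = { left: new Set(), center: new Set(), right: new Set(), unknown: new Set() };

  for (const cluster of clusters) {
    const bucket = reps[cluster.side];
    if (!bucket || bucket.length >= maxRepUrlsPerSide) continue;

    // Earliest article in the cluster (copy before sorting to avoid mutating the input).
    const earliest = [...cluster.articles].sort(
      (a, b) => Date.parse(a.seendate) - Date.parse(b.seendate)
    )[0];
    if (!earliest) continue;

    // Skip hosts already represented on this side.
    const host = new URL(earliest.url).hostname;
    if (seenHosts[cluster.side].has(host)) continue;

    seenHosts[cluster.side].add(host);
    bucket.push(earliest.url);
  }
  return reps;
}
```

Because the earliest article wins, the chosen URL reflects who published first and says nothing about which outlet is most authoritative.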

Interpreting results

Blindspot flags:

  • A flag means a side is both below the threshold (blindspotThresholdPct) and has a meaningful gap vs the next side (gapMinPct).
  • Example: { "side": "center", "pct": 0.0, "gap_vs_next_pct": 13.3 } means center has 0% coverage and is 13.3 percentage points below the next side.
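
One possible reading of the gap-aware rule is sketched below. It assumes "next side" means the side with the next-higher share, which this README does not spell out, so the actor's own flags may differ from what this function returns. The name `findBlindspots` is hypothetical.

```javascript
// shares: percent per side, e.g. { left: 45, center: 40, right: 15 }
// (known-only shares when excludeUnknownFromBlindspot is enabled).
function findBlindspots(shares, { blindspotThresholdPct = 20, gapMinPct = 12 } = {}) {
  // Rank sides from lowest to highest share.
  const ranked = Object.entries(shares).sort((a, b) => a[1] - b[1]);
  const flags = [];
  // The highest side has no "next side", so it is never flagged.
  for (let i = 0; i < ranked.length - 1; i++) {
    const [side, pct] = ranked[i];
    // Assumption: gap is measured against the next-higher side, in percentage points.
    const gap = Math.round((ranked[i + 1][1] - pct) * 10) / 10;
    if (pct < blindspotThresholdPct && gap >= gapMinPct) {
      flags.push({ side, pct, gap_vs_next_pct: gap });
    }
  }
  return flags;
}
```

With shares of 45 (left), 40 (center), and 15 (right) and the default thresholds, only right is flagged: it is below 20 and trails center by 25 points. Requiring both conditions avoids flagging a side that is low only because every side is low and close together.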

Unknown% and confidence:

  • High unknown_pct → low confidence → results based on fewer known sources.
  • If unknown_pct > 50%, consider widening sinceHours or adding overrides.

Clusters vs articles:

  • total_clusters ≤ total_articles_raw due to title+day deduplication (rewrites of the same story collapse into one cluster).

Representativeness caveats:

  • GDELT is broad but not exhaustive.
  • Bias maps (AllSides/MBFC) focus on US outlets; non-US outlets are often unknown unless overridden.
  • Representative URLs are earliest per cluster, not necessarily most authoritative.

Overrides & human-in-the-loop

Maintain a small bias-overrides.json (≤100 entries) for frequent eTLD+1s:

{
"example.com": "center",
"paper.co.uk": "right"
}

Workflow:

  1. Review unknown_summary.suggested_overrides (sorted by support count).
  2. Research each domain (editorial stance, ownership, fact-checking records).
  3. Add to bias-overrides.json with chosen side (left, center, or right).
  4. Auditable process: PRs should state evidence and chosen side.

Precedence: Overrides > AllSides > MBFC > learned aliases > fuzzy; else unknown.
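
The precedence chain can be pictured as an ordered lookup that stops at the first source with an answer. In the sketch below, every table name (`overrides`, `allsides`, `mbfc`, `learnedAliases`, `fuzzy`) and the `method` strings are hypothetical, and all tables are keyed by domain for simplicity even though AllSides maps brands rather than domains.

```javascript
// sources: plain-object lookup tables keyed by domain (simplified, hypothetical shape).
function resolveLabel(domain, sources) {
  // Highest-precedence source first.
  const order = [
    ["override", sources.overrides],
    ["allsides", sources.allsides],
    ["mbfc", sources.mbfc],
    ["learned", sources.learnedAliases],
    ["fuzzy", sources.fuzzy],
  ];
  for (const [method, table] of order) {
    const bias = table?.[domain];
    // Return with the method recorded so the label can be audited later.
    if (bias) return { domain, bias, method };
  }
  return { domain, bias: "unknown", method: "none" };
}
```

Putting manual overrides first means a reviewed decision is never silently replaced by an upstream bias map.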

LLM weak labeler (optional)

Default: OFF. When enabled:

  • Uses Transformers.js zero-shot classification (tries multilingual XNLI, falls back to English NLI).
  • Strict thresholds: confidence ≥ 0.8, margin ≥ 0.12.
  • Caches results in bias-cache.json (targeted_cache.llm).
  • Negative cache: skips domains for 14 days if low-confidence or non-English (with English-only model).
  • Not a replacement for overrides; use sparingly for top unknown domains.
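
The strict thresholds amount to a two-part acceptance test on the classifier's scores. The sketch below shows only that decision and does not call Transformers.js; the function name `acceptWeakLabel` and the scores shape are assumptions for illustration.

```javascript
// scores: per-side classifier scores, e.g. { left: 0.85, center: 0.10, right: 0.05 }
function acceptWeakLabel(scores, { confidenceThreshold = 0.8, marginThreshold = 0.12 } = {}) {
  const ranked = Object.entries(scores).sort((a, b) => b[1] - a[1]); // highest score first
  if (ranked.length < 2) return null;

  const [[winner, top], [, runnerUp]] = ranked;
  const margin = top - runnerUp;

  // Accept only when the winner is confident AND clearly ahead of the runner-up.
  if (top >= confidenceThreshold && margin >= marginThreshold) {
    return { bias: winner, confidence: top, margin };
  }
  return null; // caller would leave the domain as unknown
}
```

A `null` result is the natural point at which to record the domain in the negative cache so it is not retried on every run.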

When to enable:

  • High Unknown% and manual overrides are impractical.
  • The model is already cached locally, or the run is able to download it (Transformers.js downloads models on first use).

Operational notes

User-Agent:

  • Update USER_AGENT in main.js with a real contact email: "BlindspotDetectorBot/2.1 (contact: you@example.com)"

Caching:

  • bias-cache.json is reused for up to 7 days and then refreshed (forceRefreshCache: true refreshes it immediately).
  • Includes targeted_cache.llm (LLM weak labels) and llm_neg (14-day skip list).
  • Cache persists across runs.
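
The two time windows above reduce to simple age checks. The following sketch assumes timestamps are stored as epoch milliseconds; the function names and option names other than `forceRefreshCache` are hypothetical and do not describe the actual layout of bias-cache.json.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// True when the cached bias map can be reused without refetching.
function isCacheFresh(fetchedAtMs, { nowMs = Date.now(), ttlDays = 7, forceRefreshCache = false } = {}) {
  if (forceRefreshCache) return false; // explicit refresh ignores the TTL
  return nowMs - fetchedAtMs <= ttlDays * DAY_MS;
}

// True while a domain is still on the LLM negative-cache skip list.
function isNegativelyCached(skippedAtMs, { nowMs = Date.now(), skipDays = 14 } = {}) {
  return nowMs - skippedAtMs < skipDays * DAY_MS;
}
```

Passing `nowMs` explicitly keeps both checks deterministic, which makes them easy to unit test.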

Logs to expect:

  • Fetch counts: 🔍 Fetching news from GDELT...
  • Clusters: 📦 Query "X": N articles → M clusters
  • Unknown%: confidence_note in output
  • Flags: blindspot_flags array

Throughput tips:

  • Limit queries array size (parallel fetches, but GDELT rate limits apply).
  • Widen sinceHours for sparse topics (more articles → better coverage).

Limitations

  • Bias maps are imperfect: AllSides/MBFC focus on US outlets; some domains unresolved.
  • Non-English and hyper-local outlets: Often remain unknown unless overridden.
  • GDELT dedup isn't perfect: Title+day clustering helps but won't catch every variation.
  • Not a fact-checker: Measures coverage, not truth or accuracy.
  • English filter: restrictToEnglish: true excludes non-English sources (may miss relevant coverage).

Security & ethics

  • Public sources only: GDELT DOC API, plus the AllSides CSV and MBFC Gist (both public on GitHub).
  • No scraping: No paywalls, login, or background scraping.
  • Transparent heuristics: Provenance captured; human review for overrides.
  • Respectful rate limits: Retries with backoff; User-Agent identifies bot.
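
Retries with backoff can take many forms; the sketch below shows a plain capped exponential backoff. The delays, retry count, and the names `backoffDelayMs` and `withRetry` are assumptions for this example and are not taken from main.js.

```javascript
// Delay before the next attempt: 500 ms, 1 s, 2 s, 4 s, ... capped at maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs an async task, retrying on failure with increasing delays.
async function withRetry(task, { retries = 3, sleep = (ms) => new Promise((r) => setTimeout(r, ms)) } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries: surface the last error
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

A fetch would be wrapped as `withRetry(() => fetchArticles(url))`, where `fetchArticles` stands in for whatever request function is in use. The injectable `sleep` lets tests run without real waiting.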

Roadmap

  • Title-relevance gate (behind a flag) to filter off-topic articles.
  • Fill mode_side/consistency in suggestions by tallies from known sources.
  • Provenance source granularity (AllSides vs MBFC vs Overrides in method field).
  • Multilingual support when infrastructure allows (currently English-only filter).

Troubleshooting

High Unknown%:

  • Widen sinceHours (more articles → more known sources).
  • Add overrides for frequent unknown domains (see unknown_summary.suggested_overrides).

Empty reps for a side:

  • Story imbalance or sparse data; increase sinceHours or check query specificity.

LLM model load errors:

  • Ensure the models are available locally or can be downloaded (Transformers.js downloads them on first use).
  • Keep enableLlmWeakLabeler: false if models are unavailable.

Cache issues:

  • Set forceRefreshCache: true to rebuild bias map.
  • Check bias-cache.json exists and is valid JSON.

Development

Repo structure:

blindspot-detector/
├── main.js # Main actor logic
├── apify.json # Minimal config for CLI compatibility
├── .actor/
│ └── actor.json # Detailed actor configuration & input schema
├── package.json # Dependencies
├── bias-cache.json # Cached bias maps (auto-generated)
├── bias-overrides.json # Manual overrides (optional)
└── input.json # Example input

Configuration files:

  • apify.json - Minimal configuration file for Apify CLI compatibility (legacy format support)
  • .actor/actor.json - Detailed actor configuration including input schema, metadata, and build settings

Scripts:

npm install # Install dependencies
npx apify run # Run locally

Code style:

  • Node.js 20+, ES modules
  • Async/await, p-limit for concurrency
  • Conservative learning thresholds

License

MIT