News Blindspot

Pricing

Pay per usage

Detects media coverage blindspots from open data: topics heavily covered by one political side but under-covered by the others. Uses GDELT, AllSides, and MBFC data only.

Developer

spiros douk

Maintained by Community

Last modified: 2 months ago

Open-data Blindspot Detector

Detects media coverage blindspots by analyzing political bias distribution across news sources using only public, legal data sources.

Live demo

Run: View on Apify
Dataset: View JSON output

What it does

Finds blindspots — topics covered by one political side but under-covered by others — using public sources only. The actor:

Fetches news articles from GDELT DOC API for specified queries
Labels sources using AllSides (brand→bias), MBFC (domain→bias), and optional local overrides
Clusters articles by story (title + day) to deduplicate rewrites
Calculates known-only shares (Unknown excluded from denominator) for each political side
Raises gap-aware flags when a side is both below a threshold and meaningfully lower than the next side
Emits representative URLs per side and an Unknown% confidence proxy
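
The clustering step above can be sketched as follows. This is a minimal illustration, not the actor's code: the `normalizeTitle` helper, the article shape (`title`, `url`, `seenAt`), and the use of the UTC day are all assumptions made for the example.

```javascript
// Hypothetical helper: normalize a title so trivial rewrites collide.
function normalizeTitle(title) {
  return title
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, ' ') // replace punctuation with spaces
    .replace(/\s+/g, ' ')              // collapse runs of whitespace
    .trim();
}

// Group articles whose normalized title and UTC day match.
// `articles` is assumed to be [{ title, url, seenAt }] with ISO 8601 `seenAt`.
function clusterByTitleAndDay(articles) {
  const clusters = new Map();
  for (const article of articles) {
    const day = new Date(article.seenAt).toISOString().slice(0, 10);
    const key = `${normalizeTitle(article.title)}|${day}`;
    if (!clusters.has(key)) clusters.set(key, []);
    clusters.get(key).push(article);
  }
  return [...clusters.values()];
}

const clusters = clusterByTitleAndDay([
  { title: 'Senate Passes Climate Bill', url: 'https://a.example/1', seenAt: '2025-01-01T08:00:00Z' },
  { title: 'Senate passes climate bill!', url: 'https://b.example/2', seenAt: '2025-01-01T15:30:00Z' },
  { title: 'Senate passes climate bill', url: 'https://c.example/3', seenAt: '2025-01-02T01:00:00Z' },
]);
console.log(clusters.length); // 2: the first two articles share a title and a day
```

Shares are then computed over clusters rather than raw articles, so a story republished by many outlets on the same side counts per cluster, not per copy.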

Stack & data sources

Stack:

Node.js 20+
Apify Actor framework
axios, p-limit, tldts, csv-parse
Optional: @xenova/transformers (for LLM weak labeler)

Data sources:

GDELT DOC API — article metadata (title, URL, source, date)
AllSides — brand→bias mapping (GitHub CSV)
MBFC — domain→bias mapping (GitHub Gist)
Local overrides — optional JSON file for manual corrections

No private or paid data; no background scraping.
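
For orientation, a request to the GDELT DOC API might be built like this. The function name `buildGdeltUrl` and the specific parameter choices (`ArtList` mode, JSON format, 250 records) are assumptions for the sketch; the README only confirms that the actor queries the DOC API and can add `sourcelang:english`.

```javascript
// Build a GDELT DOC 2.0 API URL for one query.
// Mode, format, and maxrecords are illustrative assumptions, not the actor's settings.
function buildGdeltUrl(query, { sinceHours = 24, restrictToEnglish = true } = {}) {
  const fullQuery = restrictToEnglish ? `${query} sourcelang:english` : query;
  const params = new URLSearchParams({
    query: fullQuery,
    mode: 'ArtList',
    format: 'json',
    timespan: `${sinceHours}h`,
    maxrecords: '250',
  });
  return `https://api.gdeltproject.org/api/v2/doc/doc?${params}`;
}

console.log(buildGdeltUrl('immigration', { sinceHours: 48 }));
```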

Key design choices

English-only filter — GDELT sourcelang:english to reduce non-English noise
Cluster by story — title+day clustering → compute shares on clusters, not raw articles
Known-only math — *_pct_known excludes Unknown from denominator; excludeUnknownFromBlindspot uses known-only for flagging
Gap-aware flags — requires both side_pct < blindspotThresholdPct AND gap_vs_next_pct >= gapMinPct
Label precedence: Overrides > AllSides > MBFC > learned aliases > fuzzy; else unknown
LLM weak labeler — optional, OFF by default; cached; conservative thresholds (confidence ≥0.8, margin ≥0.12)
Provenance — label method captured for auditability (domain, bias, method)
Confidence — unknown_pct reported; confidence = 1 - unknown_pct/100
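
The label precedence above could look roughly like the following. It is a simplified sketch: every source is modeled as a plain `{ domain: side }` object (AllSides is really keyed by brand), the fuzzy-match step is omitted, and the `method` strings are invented for the example rather than taken from the actor.

```javascript
// Resolve a bias label by walking sources in precedence order.
// `maps` is a hypothetical shape: { overrides, allsides, mbfc, learned }.
function resolveBias(domain, maps) {
  const order = [
    ['override', maps.overrides],
    ['allsides', maps.allsides],
    ['mbfc', maps.mbfc],
    ['learned', maps.learned],
  ];
  for (const [method, map] of order) {
    if (map && Object.hasOwn(map, domain)) {
      return { domain, bias: map[domain], method };
    }
  }
  // Nothing matched: the domain counts toward unknown_pct.
  return { domain, bias: 'unknown', method: 'none' };
}

const maps = {
  overrides: { 'example.com': 'center' },
  allsides: { 'example.com': 'left', 'news.example': 'right' },
  mbfc: { 'other.example': 'left' },
  learned: {},
};
console.log(resolveBias('example.com', maps)); // the override wins over AllSides
```

Returning the method alongside the label is what makes the provenance list auditable: a reviewer can see which source decided each domain.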

Install & run

Prerequisites:

Node.js 20+
Apify CLI (npm install -g apify-cli@latest) or Apify platform account
Note: This project uses the new .actor/actor.json format with a minimal apify.json for CLI compatibility.

Local:

npm install
npx apify run

Apify platform:

Upload actor to Apify platform
Click "Run actor" with input JSON (see below)

Typical input

{
  "queries": ["climate change", "immigration"],
  "sinceHours": 24,
  "blindspotThresholdPct": 20,
  "gapMinPct": 12,
  "maxRepUrlsPerSide": 3,
  "restrictToEnglish": true,
  "excludeUnknownFromBlindspot": true,
  "overridesPath": "./bias-overrides.json",
  "enableLearning": true,
  "learningMinCount": 3,
  "learningMinConsistency": 0.8,
  "suggestionsMax": 15,
  "suggestionsMinCount": 2,
  "enableLlmWeakLabeler": false,
  "llmMaxDomains": 15,
  "llmMinCount": 3,
  "llmConfidenceThreshold": 0.8,
  "llmMarginThreshold": 0.12,
  "llmSampleTitlesPerDomain": 8,
  "forceRefreshCache": false
}

Important defaults:

restrictToEnglish: true — filters to English sources
excludeUnknownFromBlindspot: true — uses known-only shares for blindspot detection
blindspotThresholdPct: 20 — side must be <20% to flag
gapMinPct: 12 — gap to next side must be ≥12% to flag
enableLlmWeakLabeler: false — recommended default (keeps runs lean)

Input fields

Field	Type	Default	Description
`queries`	`string[]`	`["climate change", "immigration"]`	Search queries for GDELT
`sinceHours`	`integer`	`24`	Lookback window (1–168 hours)
`blindspotThresholdPct`	`number`	`20`	Side must be below this % to flag (0–100)
`gapMinPct`	`number`	`12`	Minimum gap vs next side to flag (0–100)
`maxRepUrlsPerSide`	`integer`	`3`	Representative URLs per side (1–20)
`restrictToEnglish`	`boolean`	`true`	Flag: Filter to English sources via GDELT
`excludeUnknownFromBlindspot`	`boolean`	`true`	Flag: Use known-only shares for blindspot math
`overridesPath`	`string`	`"./bias-overrides.json"`	Path to overrides JSON, e.g. `{ "example.com": "center" }`; each value is `left`, `center`, or `right`
`enableLearning`	`boolean`	`true`	Enable conservative alias learning
`learningMinCount`	`integer`	`3`	Min samples to learn alias (2–100)
`learningMinConsistency`	`number`	`0.8`	Min consistency to learn (0.7–1)
`suggestionsMax`	`integer`	`15`	Max suggested overrides per query (0–100)
`suggestionsMinCount`	`integer`	`2`	Min articles to suggest override (1–50)
`enableLlmWeakLabeler`	`boolean`	`false`	Flag: Enable LLM weak labeler (OFF by default)
`llmMaxDomains`	`integer`	`15`	Top unknown eTLD+1 to try (1–50)
`llmMinCount`	`integer`	`3`	Min articles per domain (2–50)
`llmConfidenceThreshold`	`number`	`0.8`	Min confidence (0.5–1)
`llmMarginThreshold`	`number`	`0.12`	Min margin winner–runnerUp (0–1)
`llmSampleTitlesPerDomain`	`integer`	`8`	Titles per domain to sample (1–20)
`forceRefreshCache`	`boolean`	`false`	Force refresh bias cache (ignores 7-day TTL)
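
As an illustration of the documented defaults and ranges, the sketch below fills in missing numeric fields and clamps out-of-range values for a subset of the table. Whether the actor itself clamps or rejects such values is not stated; `normalizeInput` and `NUMERIC_FIELDS` are hypothetical names.

```javascript
// Defaults and ranges copied from the table above (subset only).
const NUMERIC_FIELDS = {
  sinceHours: { def: 24, min: 1, max: 168 },
  blindspotThresholdPct: { def: 20, min: 0, max: 100 },
  gapMinPct: { def: 12, min: 0, max: 100 },
  maxRepUrlsPerSide: { def: 3, min: 1, max: 20 },
};

// Assumption: out-of-range values are clamped rather than rejected.
function normalizeInput(input = {}) {
  const out = { ...input };
  for (const [name, { def, min, max }] of Object.entries(NUMERIC_FIELDS)) {
    const value = Number.isFinite(input[name]) ? input[name] : def;
    out[name] = Math.min(max, Math.max(min, value));
  }
  return out;
}

console.log(normalizeInput({ queries: ['immigration'], sinceHours: 500 }));
```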

Output schema

Each query produces one result object:

{
  "schema_version": "1.1.0",
  "generated_at_utc": "2025-01-01T00:00:00.000Z",
  "query": "climate change",
  "total_clusters": 41,
  "total_articles_raw": 96,
  "left_pct": 31.7,
  "center_pct": 0.0,
  "right_pct": 4.9,
  "unknown_pct": 63.4,
  "left_pct_known": 86.7,
  "center_pct_known": 0.0,
  "right_pct_known": 13.3,
  "known_articles": 15,
  "blindspot_flags": [
    { "side": "center", "pct": 0.0, "gap_vs_next_pct": 13.3 }
  ],
  "representative_urls": {
    "left": ["https://..."],
    "center": [],
    "right": ["https://..."],
    "unknown": ["https://..."]
  },
  "confidence": 0.37,
  "confidence_note": "Unknown 63.4% — results based on 15 known of 41 total clusters",
  "unknown_summary": {
    "top_unknown_hosts": ["allafrica.com (3)", "scoop.co.nz (3)"],
    "suggested_overrides": [
      {
        "eTLD1": "allafrica.com",
        "support": 3,
        "mode_side": null,
        "consistency": null
      }
    ],
    "suggested_overrides_snippet": "{\n  \"allafrica.com\": \"/* decide bias */\"\n}"
  },
  "provenance": {
    "labels_used": [
      {
        "domain": "bostonglobe.com",
        "bias": "left",
        "method": "domain:authoritative:host"
      }
    ]
  }
}

Field notes:

*_pct — numbers (percent values to one decimal)
blindspot_flags — zero or more flags; appears when side_pct < blindspotThresholdPct and gap_vs_next_pct >= gapMinPct
confidence — 1 - unknown_pct/100 (0.0–1.0)
representative_urls — earliest URL per cluster, deduped by host, up to maxRepUrlsPerSide per side
suggested_overrides — candidates for manual review (sorted by support count)
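
The relationship between the raw shares, the known-only shares, and confidence can be checked with a small sketch. The cluster counts used below (13 left, 0 center, 2 right, 26 unknown) are chosen to reproduce the percentages in the example output; they are not taken from a real run, and `computeShares` is an illustrative name.

```javascript
const round1 = (x) => Math.round(x * 10) / 10;

// `counts` holds cluster counts per label: { left, center, right, unknown }.
function computeShares({ left, center, right, unknown }) {
  const known = left + center + right;
  const total = known + unknown;
  const pct = (n, d) => (d > 0 ? round1((n / d) * 100) : 0);
  const unknownPct = pct(unknown, total);
  return {
    left_pct: pct(left, total),
    center_pct: pct(center, total),
    right_pct: pct(right, total),
    unknown_pct: unknownPct,
    // Known-only shares: Unknown is excluded from the denominator.
    left_pct_known: pct(left, known),
    center_pct_known: pct(center, known),
    right_pct_known: pct(right, known),
    confidence: Math.round((1 - unknownPct / 100) * 100) / 100,
  };
}

const shares = computeShares({ left: 13, center: 0, right: 2, unknown: 26 });
console.log(shares.left_pct, shares.left_pct_known, shares.confidence); // 31.7 86.7 0.37
```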

Interpreting results

Blindspot flags:

A flag means a side is both below the threshold (blindspotThresholdPct) and has a meaningful gap vs the next side (gapMinPct).
Example: { "side": "center", "pct": 0.0, "gap_vs_next_pct": 13.3 } means center has a 0% share of coverage and sits 13.3 percentage points below the next side.
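
A minimal sketch of the gap-aware rule follows. The README does not spell out how the "next side" is chosen; this sketch assumes it is the side with the next-higher share, so under this reading more than one side can be flagged. `findBlindspots` is a hypothetical name.

```javascript
// `shares` maps each side to its share in percent, e.g. the *_pct_known values.
function findBlindspots(shares, { blindspotThresholdPct = 20, gapMinPct = 12 } = {}) {
  // Rank sides from lowest to highest share.
  const ranked = Object.entries(shares).sort((a, b) => a[1] - b[1]);
  const flags = [];
  for (let i = 0; i < ranked.length - 1; i++) {
    const [side, pct] = ranked[i];
    // Assumption: "next side" is the next-higher share in the ranking.
    const gap = Math.round((ranked[i + 1][1] - pct) * 10) / 10;
    if (pct < blindspotThresholdPct && gap >= gapMinPct) {
      flags.push({ side, pct, gap_vs_next_pct: gap });
    }
  }
  return flags;
}

console.log(findBlindspots({ left: 50, center: 40, right: 10 }));
// [ { side: 'right', pct: 10, gap_vs_next_pct: 30 } ]
```

Requiring both conditions avoids flagging a side that is low only because all three sides are close together.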

Unknown% and confidence:

High unknown_pct → low confidence → results based on fewer known sources.
If unknown_pct > 50%, consider widening sinceHours or adding overrides.

Clusters vs articles:

total_clusters ≤ total_articles_raw due to title+day deduplication (rewrites of same story).

Representativeness caveats:

GDELT is broad but not exhaustive.
Bias maps (AllSides/MBFC) focus on US outlets; non-US outlets are often unknown unless overridden.
Representative URLs are earliest per cluster, not necessarily most authoritative.
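
To make the last caveat concrete, representative URL selection (earliest per cluster, deduped by host, capped per side) might be sketched as below. The cluster shape and the `seenAt` field are assumptions for the example.

```javascript
// `clusters` is assumed to be an array of clusters for ONE side,
// each cluster an array of { url, seenAt } with ISO 8601 timestamps.
function pickRepresentativeUrls(clusters, maxRepUrlsPerSide = 3) {
  const seenHosts = new Set();
  const urls = [];
  for (const cluster of clusters) {
    if (cluster.length === 0) continue;
    // Earliest article in the cluster, not the most authoritative one.
    const earliest = cluster.reduce((a, b) =>
      (Date.parse(a.seenAt) <= Date.parse(b.seenAt) ? a : b));
    const host = new URL(earliest.url).hostname;
    if (seenHosts.has(host)) continue; // one URL per host
    seenHosts.add(host);
    urls.push(earliest.url);
    if (urls.length >= maxRepUrlsPerSide) break;
  }
  return urls;
}

const reps = pickRepresentativeUrls([
  [
    { url: 'https://a.example/late', seenAt: '2025-01-01T10:00:00Z' },
    { url: 'https://b.example/early', seenAt: '2025-01-01T08:00:00Z' },
  ],
  [{ url: 'https://b.example/other', seenAt: '2025-01-01T09:00:00Z' }],
  [{ url: 'https://c.example/x', seenAt: '2025-01-01T11:00:00Z' }],
]);
console.log(reps); // [ 'https://b.example/early', 'https://c.example/x' ]
```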

Overrides & human-in-the-loop

Maintain a small bias-overrides.json (≤100 entries) for frequent eTLD+1s:

{
  "example.com": "center",
  "paper.co.uk": "right"
}

Workflow:

Review unknown_summary.suggested_overrides (sorted by support count).
Research each domain (editorial stance, ownership, fact-checking records).
Add to bias-overrides.json with chosen side (left, center, or right).
Auditable process: PRs should state evidence and chosen side.

Precedence: Overrides > AllSides > MBFC > learned aliases > fuzzy; else unknown.
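
Because suggested_overrides_snippet ships with a "/* decide bias */" placeholder, it can help to validate the overrides file before use. The sketch below keeps only entries whose value is left, center, or right; `parseOverrides` is a hypothetical helper, and lowercasing the domain is an assumption.

```javascript
const VALID_SIDES = new Set(['left', 'center', 'right']);

// Parse overrides JSON text and separate well-formed entries from the rest.
function parseOverrides(jsonText) {
  const raw = JSON.parse(jsonText);
  const valid = {};
  const rejected = [];
  for (const [domain, side] of Object.entries(raw)) {
    if (VALID_SIDES.has(side)) valid[domain.toLowerCase()] = side;
    else rejected.push(domain); // e.g. an undecided placeholder value
  }
  return { valid, rejected };
}

const parsed = parseOverrides(
  '{ "Example.com": "center", "allafrica.com": "/* decide bias */" }',
);
console.log(parsed.valid, parsed.rejected);
```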

LLM weak labeler (optional)

Default: OFF. When enabled:

Uses Transformers.js zero-shot classification (tries multilingual XNLI, falls back to English NLI).
Strict thresholds: confidence ≥ 0.8, margin ≥ 0.12.
Caches results in bias-cache.json (targeted_cache.llm).
Negative cache: skips domains for 14 days if low-confidence or non-English (with English-only model).
Not a replacement for overrides; use sparingly for top unknown domains.
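
The two thresholds can be read as an acceptance test on the classifier's scores. The sketch below assumes the scores arrive as a `{ side: score }` object; the actual output shape of the actor's labeler is not documented here, and `acceptWeakLabel` is an illustrative name.

```javascript
// `scores` maps each candidate side to a classifier score in [0, 1].
function acceptWeakLabel(
  scores,
  { llmConfidenceThreshold = 0.8, llmMarginThreshold = 0.12 } = {},
) {
  const ranked = Object.entries(scores).sort((a, b) => b[1] - a[1]);
  if (ranked.length < 2) return null;
  const [[side, top], [, runnerUp]] = ranked;
  const margin = top - runnerUp; // winner minus runner-up
  return top >= llmConfidenceThreshold && margin >= llmMarginThreshold ? side : null;
}

console.log(acceptWeakLabel({ left: 0.85, center: 0.1, right: 0.05 })); // 'left'
console.log(acceptWeakLabel({ left: 0.6, center: 0.3, right: 0.1 }));   // null
```

A `null` result leaves the domain unknown, which is the conservative outcome the thresholds are meant to favor.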

When to enable:

High Unknown% and manual overrides are impractical.
The model is already cached or can be downloaded (Transformers.js downloads models on first use).

Operational notes

User-Agent:

Update USER_AGENT in main.js with a real contact email: "BlindspotDetectorBot/2.1 (contact: you@example.com)"

Caching:

bias-cache.json is reused for up to 7 days and then refreshed (forceRefreshCache: true forces an immediate refresh).
Includes targeted_cache.llm (LLM weak labels) and llm_neg (14-day skip list).
Cache persists across runs.
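
The two time windows above (7-day cache TTL, 14-day negative cache) reduce to simple timestamp checks. The function names and the millisecond-timestamp arguments are assumptions; the layout of bias-cache.json is not documented here.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// True when the cached bias map can be reused without refetching.
function isCacheFresh(fetchedAtMs, nowMs, { ttlDays = 7, forceRefreshCache = false } = {}) {
  if (forceRefreshCache) return false;
  return nowMs - fetchedAtMs <= ttlDays * DAY_MS;
}

// True when a domain is still on the LLM skip list.
function isNegativelyCached(skippedAtMs, nowMs, skipDays = 14) {
  return nowMs - skippedAtMs <= skipDays * DAY_MS;
}

const now = Date.parse('2025-01-10T00:00:00Z');
console.log(isCacheFresh(now - 6 * DAY_MS, now)); // true
console.log(isCacheFresh(now - 8 * DAY_MS, now)); // false
```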

Logs to expect:

Fetch counts: 🔍 Fetching news from GDELT...
Clusters: 📦 Query "X": N articles → M clusters
Unknown%: confidence_note in output
Flags: blindspot_flags array

Throughput tips:

Limit queries array size (parallel fetches, but GDELT rate limits apply).
Widen sinceHours for sparse topics (more articles → better coverage).

Limitations

Bias maps are imperfect: AllSides/MBFC focus on US outlets; some domains unresolved.
Non-English and hyper-local outlets: Often remain unknown unless overridden.
GDELT dedup isn't perfect: Title+day clustering helps but won't catch every variation.
Not a fact-checker: Measures coverage, not truth or accuracy.
English filter: restrictToEnglish: true excludes non-English sources (may miss relevant coverage).

Security & ethics

Public sources only: GDELT DOC API, AllSides CSV, MBFC Gist (all public GitHub).
No scraping: No paywalls, login, or background scraping.
Transparent heuristics: Provenance captured; human review for overrides.
Respectful rate limits: Retries with backoff; User-Agent identifies bot.
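
"Retries with backoff" can be illustrated with a generic exponential-backoff wrapper. This is a sketch of the general technique, not the actor's implementation; the retry count and delay values are arbitrary assumptions.

```javascript
// Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Run `task` (a function returning a value or a promise), retrying on failure.
async function withRetry(task, { retries = 3, baseMs = 500, maxMs = 8000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= retries) throw err; // give up after the last retry
      const delay = backoffDelayMs(attempt, baseMs, maxMs);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

console.log([0, 1, 2, 3].map((n) => backoffDelayMs(n))); // [ 500, 1000, 2000, 4000 ]
```

A caller would wrap each outbound request, for example `withRetry(() => fetchArticles(url))`, where `fetchArticles` stands in for whatever function performs the HTTP call.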

Roadmap

Title-relevance gate (behind a flag) to filter off-topic articles.
Fill mode_side/consistency in suggestions by tallies from known sources.
Provenance source granularity (AllSides vs MBFC vs Overrides in method field).
Multilingual support when infrastructure allows (currently English-only filter).

Troubleshooting

High Unknown%:

Widen sinceHours (more articles → more known sources).
Add overrides for frequent unknown domains (see unknown_summary.suggested_overrides).

Empty reps for a side:

Story imbalance or sparse data; increase sinceHours or check query specificity.

LLM model load errors:

Ensure the models are available offline, or allow network access on the first run (Transformers.js downloads models on first use).
Keep enableLlmWeakLabeler: false if models unavailable.

Cache issues:

Set forceRefreshCache: true to rebuild bias map.
Check bias-cache.json exists and is valid JSON.

Development

Repo structure:

blindspot-detector/
├── main.js              # Main actor logic
├── apify.json           # Minimal config for CLI compatibility
├── .actor/
│   └── actor.json       # Detailed actor configuration & input schema
├── package.json         # Dependencies
├── bias-cache.json      # Cached bias maps (auto-generated)
├── bias-overrides.json  # Manual overrides (optional)
└── input.json           # Example input

Configuration files:

apify.json - Minimal configuration file for Apify CLI compatibility (legacy format support)
.actor/actor.json - Detailed actor configuration including input schema, metadata, and build settings

Scripts:

npm install              # Install dependencies
npx apify run           # Run locally

Code style:

Node.js 20+, ES modules
Async/await, p-limit for concurrency
Conservative learning thresholds

License

MIT
