WikiGrabber

WikiGrabber is an Apify Actor and lightweight web app for finding Wikipedia pages with citation-needed tags, dead-link templates, broken-link signals, and other source-cleanup hints.

What it does

  • Searches English Wikipedia by keyword
  • Parses page wikitext and rendered HTML
  • Detects citation-needed, dead-link, and cleanup-style signals
  • Extracts exact citation and dead-link locations from article sections
  • Adds direct article, section, and section-edit links for faster action
  • Scores results so higher-opportunity pages rise to the top
  • Stores filtered results in an Apify dataset
  • Lets you browse results in the built-in browser UI
  • Exports saved results as CSV

Endpoints

  • GET / serves the browser UI
  • GET /api/health returns a simple health check
  • GET /api/search?keyword=SEO&limit=30&page=1 runs a keyword search and creates a request-safe dataset
  • GET /api/dataset?dataset=<datasetName-from-search>&page=2&limit=20 pages through saved dataset results
  • GET /api/export.csv?dataset=<datasetName-from-search> exports a dataset as CSV (see the example flow after this list)
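
A typical end-to-end flow, assuming the app is running locally on the default port; DATASET_NAME is a placeholder for the dataset name returned by the search call:

curl "http://localhost:4321/api/health"
curl "http://localhost:4321/api/search?keyword=SEO&limit=30&page=1"
curl "http://localhost:4321/api/dataset?dataset=DATASET_NAME&page=2&limit=20"
curl -o results.csv "http://localhost:4321/api/export.csv?dataset=DATASET_NAME"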

Advanced result workflow

  • Filter result pages by Show all, Missing Citations, or Dead Links
  • See exact issue rows with section title, line reference, and excerpt
  • Open the exact Wikipedia section directly from the result card
  • Jump straight into action=edit&section=<n> links to add a citation or replace a dead link (example below)
  • Review mixed pages that contain both citation and dead-link opportunities
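
For instance, a row pointing at a section of the article Link building could carry an edit link in this shape (the article and section index here are illustrative):

https://en.wikipedia.org/w/index.php?title=Link_building&action=edit&section=4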

Local development

npm install
npm start

By default the app starts on http://localhost:4321.

For a local one-off QA run that follows the same standard-run code path as Apify's automated test, put an INPUT.json file under your chosen CRAWLEE_STORAGE_DIR, then start the actor with WIKI_GRABBER_FORCE_STANDARD_MODE=1.
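
A minimal sketch of that setup, assuming the Apify SDK's usual storage layout where run input is read from key_value_stores/default/INPUT.json inside the storage directory:

mkdir -p ./storage/key_value_stores/default
echo '{"keyword":"seo tool","limit":10}' > ./storage/key_value_stores/default/INPUT.json
CRAWLEE_STORAGE_DIR=./storage WIKI_GRABBER_FORCE_STANDARD_MODE=1 npm start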

Deploy on Apify

npx apify login
npx apify push
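
After pushing, a standard run can also be triggered through the generic Apify API run-sync endpoint; USERNAME~wiki-grabber and YOUR_APIFY_TOKEN below are placeholders for your own actor ID and API token:

curl -X POST \
  "https://api.apify.com/v2/acts/USERNAME~wiki-grabber/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"keyword":"seo tool","limit":10}'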

Important note about Apify run modes

This project supports both Apify run modes, but they behave differently:

  • Standard Actor run: the Actor does not keep the HTTP server alive on Apify. Instead, it treats the run as a one-off batch job. If you provide input like {"keyword":"seo tool","limit":10}, it builds the dataset, saves output, and finishes with SUCCEEDED. If a standard run starts without a keyword, the actor falls back to the built-in QA keyword seo tool, so automated tests and manual one-off runs still produce a non-empty default dataset.
  • Standby mode: the Actor behaves like a web server behind a stable URL, and Apify keeps standby runs available according to the standby configuration.

If you want a persistent app-like experience, use Standby mode instead of manually starting a normal Actor run from the Console.

The input schema now uses both prefill and default on the search keyword for maximum compatibility with Apify's QA flow, while operational settings such as limit keep a real default value for API, task, and scheduler runs.
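
A sketch of the corresponding INPUT_SCHEMA.json fragment; the titles and editor value are illustrative, while the prefill/default pairing on keyword and the plain default on limit match the behavior described above:

{
  "title": "WikiGrabber input",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "keyword": {
      "title": "Search keyword",
      "type": "string",
      "editor": "textfield",
      "prefill": "seo tool",
      "default": "seo tool"
    },
    "limit": {
      "title": "Result limit",
      "type": "integer",
      "default": 10
    }
  }
}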

Apify QA checklist

  • In Apify Console, use Source > Input > Restore example input and confirm it fills keyword: "seo tool" with limit: 10
  • Start the Actor from that restored example input and verify the run finishes within Apify's 5-minute automated-test window
  • Confirm the default dataset is non-empty and that fallback rows, when emitted, are clearly marked with resultType: "fallback"
  • If Wikipedia is temporarily unavailable during the test window, expect a successful run with a diagnostic fallback row instead of an empty default dataset (sample row below)
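
The exact shape of a fallback row is implementation-defined; a plausible example, using fields from the Output fields list below:

{
  "resultType": "fallback",
  "keyword": "seo tool",
  "title": "WikiGrabber diagnostic fallback",
  "note": "Wikipedia API was unreachable during the run; no live results were fetched."
}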

Standby behavior

  • Repeated identical searches can be served from an in-memory cache while a Standby run stays warm
  • Concurrent identical requests share the same in-flight search work instead of duplicating Wikipedia fetches
  • Each generated dataset name is request-safe, so one user search does not drop or overwrite another user's dataset
  • Add refresh=true to /api/search if you want to bypass the cache and force a new dataset build (example after this list)
  • Wikipedia API calls automatically retry on transient timeout and 429/5xx responses, and large revision batches fall back to smaller groups when needed
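
For example, to force a fresh dataset build from a warm Standby run (the host below is a placeholder for your Standby URL):

curl "https://YOUR-STANDBY-URL/api/search?keyword=SEO&limit=30&refresh=true"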

Example use cases

  • Wikipedia citation research
  • Dead-link replacement prospecting
  • Link-building opportunity discovery
  • SEO outreach research
  • Topic-based cleanup analysis
  • CSV export for campaign workflows

Output fields

Each result can include the fields below; a sample record follows the list:

  • resultType
  • keyword
  • title
  • note
  • pageid
  • url
  • snippet
  • wordcount
  • timestamp
  • citationNeededTemplates
  • deadLinkTemplates
  • brokenLinkSignals
  • cleanupTemplates
  • bareUrlCount
  • refCount
  • score
  • issueCounts
  • locations[]
  • actionLinks
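
A hedged sample record, assuming one result row per page; every value here is illustrative, and the non-fallback resultType value plus the exact shapes of issueCounts, locations[], and actionLinks are assumptions rather than guaranteed output:

{
  "resultType": "page",
  "keyword": "seo tool",
  "title": "Search engine optimization",
  "note": null,
  "pageid": 12345,
  "url": "https://en.wikipedia.org/wiki/Search_engine_optimization",
  "snippet": "Search engine optimization (SEO) is the process of...",
  "wordcount": 5400,
  "timestamp": "2024-05-01T12:00:00Z",
  "citationNeededTemplates": 3,
  "deadLinkTemplates": 1,
  "brokenLinkSignals": 2,
  "cleanupTemplates": 0,
  "bareUrlCount": 4,
  "refCount": 120,
  "score": 7.5,
  "issueCounts": { "citationNeeded": 3, "deadLinks": 1 },
  "locations": [
    {
      "section": "History",
      "line": 42,
      "type": "citation-needed",
      "excerpt": "...search engines grew in popularity[citation needed]..."
    }
  ],
  "actionLinks": {
    "article": "https://en.wikipedia.org/wiki/Search_engine_optimization",
    "section": "https://en.wikipedia.org/wiki/Search_engine_optimization#History",
    "sectionEdit": "https://en.wikipedia.org/w/index.php?title=Search_engine_optimization&action=edit&section=2"
  }
}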