Pricing

Pay per usage

ArXiv Paper Scraper — Search by Category, Bulk JSON, DOI

arXiv corpus as JSON — arxivId, title, authors, abstract, categories, dates, DOI, PDF URL. By search OR category. Built for ML/AI training data + lit reviews. 19 runs. Backed by 951-run Trustpilot flagship + 31-actor portfolio. spinov001@gmail.com · blog.spinov.online · t.me/scraping_ai

Developer

Alex

Last modified: a month ago

arXiv Paper Scraper — Bulk Research-Paper Metadata via arXiv API

Pull research-paper metadata from the official arXiv API by category or free-text search query — titles, abstracts, authors, primary + secondary categories, submission/update dates, DOI, journal reference, PDF URL — as a flat JSON dataset. No API key, no auth.

Verified against src/main.js — every output field below is what the actor actually pushes.

What you get per paper (14 fields)

| Field | Type | Example | Source in `main.js` |
| --- | --- | --- | --- |
| `arxivId` | string | `2403.14812v1` | parsed from `<id>` URL tail |
| `title` | string | `Attention Is All You Need` | `<title>` text, whitespace-collapsed |
| `authors` | string[] | `["Ashish Vaswani", "Noam Shazeer"]` | each `<author><name>` text (no affiliations, no ORCIDs) |
| `abstract` | string | full abstract, whitespace-collapsed | `<summary>` text |
| `categories` | string[] | `["cs.CL", "cs.LG"]` | every `<category term="...">` |
| `primaryCategory` | string \| null | `cs.CL` | first `<category>` |
| `published` | ISO string | `2017-06-12T17:57:34Z` | `<published>` |
| `updated` | ISO string | `2023-08-02T00:41:18Z` | `<updated>` |
| `doi` | string \| null | `10.5555/3294996.3295349` | `<arxiv:doi>` or null |
| `journalRef` | string \| null | `NeurIPS 2017` | `<arxiv:journal_ref>` or null |
| `pdfUrl` | string | `http://arxiv.org/pdf/2403.14812v1` | `<id>` with `/abs/` → `/pdf/` |
| `abstractUrl` | string | `http://arxiv.org/abs/2403.14812v1` | `<id>` text |
| `source` | string | `category:cs.AI` or `search:RAG` | tagged at scrape time |
| `scrapedAt` | ISO string | `2026-04-29T10:30:00.000Z` | actor capture time |

The actor parses arXiv's Atom-XML response with Cheerio (xmlMode: true). It does not extract author affiliations, institutions, page counts, figure counts, reference counts, citations, or comment-field metadata — those would require parsing PDFs or hitting Semantic Scholar / OpenAlex (separate actors).
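To make the field mapping concrete, here is a minimal Python sketch of the same Atom-to-record conversion using the standard library. It is not the actor's code (which uses Cheerio in `main.js`); the helper names `collapse` and `entry_to_record`, the `/abs/` split used for `arxivId`, and the `SAMPLE` feed are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespaces used by arXiv's Atom feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

def collapse(text):
    """Collapse runs of whitespace (including newlines) to single spaces."""
    return " ".join((text or "").split())

def entry_to_record(entry, source):
    """Map one <entry> element to the 14-field record (hypothetical helper)."""
    def text(path):
        raw = entry.findtext(path, default=None, namespaces=NS)
        return collapse(raw) if raw is not None else None

    abs_url = text("atom:id") or ""
    cats = [c.get("term") for c in entry.findall("atom:category", NS) if c.get("term")]
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "arxivId": abs_url.split("/abs/")[-1],  # assumption: ID is the URL tail after /abs/
        "title": text("atom:title") or "",
        "authors": [collapse(a.findtext("atom:name", default="", namespaces=NS))
                    for a in entry.findall("atom:author", NS)],
        "abstract": text("atom:summary") or "",
        "categories": cats,
        "primaryCategory": cats[0] if cats else None,
        "published": text("atom:published"),
        "updated": text("atom:updated"),
        "doi": text("arxiv:doi"),
        "journalRef": text("arxiv:journal_ref"),
        "pdfUrl": abs_url.replace("/abs/", "/pdf/"),
        "abstractUrl": abs_url,
        "source": source,
        "scrapedAt": now.replace("+00:00", "Z"),
    }

# Made-up sample feed with a single entry, for illustration only.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom" xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/2403.14812v1</id>
    <title>An  Example
      Title</title>
    <summary> Some abstract. </summary>
    <published>2024-03-21T00:00:00Z</published>
    <updated>2024-03-21T00:00:00Z</updated>
    <author><name>A. Author</name></author>
    <category term="cs.CL"/>
    <category term="cs.LG"/>
  </entry>
</feed>"""

records = [entry_to_record(e, "category:cs.CL")
           for e in ET.fromstring(SAMPLE).findall("atom:entry", NS)]
```

Because the sample entry has no `<arxiv:doi>` or `<arxiv:journal_ref>`, both fields come back as `None`, which mirrors the null behavior described in the table.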

Inputs (from `.actor/input_schema.json`)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `searchQueries` | string[] | `[]` | free-text queries, sent to arXiv as `all:<query>` (matches title + abstract + author + comment) |
| `categories` | string[] | `[]` | arXiv category codes (e.g. `cs.AI`, `cs.LG`, `stat.ML`) |
| `maxPapersPerSource` | integer | `50` | cap per query / per category. Caveat: searchQueries paginate up to this cap (max 500 = 5 batches of 100). Categories fetch only ONE batch (≤100); values >100 for categories are silently capped at 100. To pull >100 papers per category, use a searchQuery `cat:cs.AI` instead, or request a paginated-categories custom build. |
| `sortBy` | string | `submittedDate` | `submittedDate` \| `lastUpdatedDate` \| `relevance` |
| `sortOrder` | string | `descending` | `ascending` \| `descending` |

If both searchQueries and categories are empty, the actor fetches the latest 25 cs.AI papers as a default sample.
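The caps above can be turned into a quick estimate of how many records a run can return at most. The sketch below only encodes the limits stated in this README (500 per search query, 100 per category, 25 for the empty-input default); the function names `effective_cap` and `expected_upper_bound` are hypothetical and not part of the actor.

```python
def effective_cap(kind, max_papers_per_source=50):
    """Upper bound on papers for one source, per the caps documented above."""
    if kind == "search":
        return min(max_papers_per_source, 500)   # paginated, 5 batches of 100
    if kind == "category":
        return min(max_papers_per_source, 100)   # single batch only
    raise ValueError(f"unknown source kind: {kind!r}")

def expected_upper_bound(search_queries, categories, max_papers_per_source=50):
    """Most records a run could push; actual counts may be lower."""
    if not search_queries and not categories:
        return 25  # documented default sample: latest 25 cs.AI papers
    return (
        len(search_queries) * effective_cap("search", max_papers_per_source)
        + len(categories) * effective_cap("category", max_papers_per_source)
    )
```

For example, one search query plus two categories with `maxPapersPerSource` set to 200 gives at most 200 + 100 + 100 = 400 records, not 600.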

Use cases

Weekly digest — pull new cs.AI / cs.LG / cs.CL submissions every Monday for a research-team Slack feed
RAG corpus ingestion — feed arxivId + pdfUrl into a downstream PDF-extraction pipeline (use pypdf / PyMuPDF)
Citation graph stub — collect doi / journalRef for joining against OpenAlex or Semantic Scholar later
Topic monitoring — search "retrieval augmented generation" weekly, diff against last week
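For the weekly-diff use case, one possible approach is to compare runs by arXiv ID with the version suffix removed, so that a `v2` of a paper already seen as `v1` is not reported as new. This is a sketch of downstream code, not something the actor does; `base_id` and `new_papers` are hypothetical helper names, and the only field assumed on each item is `arxivId`.

```python
import re

_VERSION_SUFFIX = re.compile(r"v\d+$")

def base_id(arxiv_id):
    """Strip the trailing version suffix, e.g. '2403.14812v2' -> '2403.14812'."""
    return _VERSION_SUFFIX.sub("", arxiv_id)

def new_papers(this_week, last_week):
    """Return items from this_week whose base ID was not present last week."""
    seen = {base_id(p["arxivId"]) for p in last_week}
    fresh, emitted = [], set()
    for paper in this_week:
        key = base_id(paper["arxivId"])
        if key not in seen and key not in emitted:
            emitted.add(key)       # also de-duplicate within this week's run
            fresh.append(paper)
    return fresh
```

Both arguments are plain lists of dataset items, for example the results of two `iterate_items()` calls collected into lists.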

Quick start

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("knotless_cadence/arxiv-paper-scraper").call(
    run_input={
        "categories": ["cs.AI", "cs.LG"],
        "searchQueries": ["retrieval augmented generation"],
        # Applies per source: the search query paginates up to 200,
        # but each category is capped at 100 (single batch).
        "maxPapersPerSource": 200,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
)

# Iterate over the records the run pushed to its default dataset.
for p in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(p["arxivId"], p["primaryCategory"], "|", p["title"])
    print("  Authors:", ", ".join(p["authors"][:3]))
    print("  Published:", p["published"])
    print("  PDF:", p["pdfUrl"])

Bash one-liner

curl -s "https://api.apify.com/v2/acts/knotless_cadence~arxiv-paper-scraper/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"categories":["cs.AI","cs.LG"],"maxPapersPerSource":50,"sortBy":"submittedDate","sortOrder":"descending"}' \
  | jq -r '.[] | "\(.arxivId)\t\(.primaryCategory)\t\(.title)"'

How it works

For each searchQueries entry → arXiv Atom API: ?search_query=all:<query>&sortBy=...&max_results=...&start=... (paginated up to maxPapersPerSource, max 500)
For each categories entry → arXiv Atom API: ?search_query=cat:<cat>&sortBy=...&max_results=<min(maxPapersPerSource,100)> (single-batch only — no pagination; max 100)
Cheerio parses XML (xmlMode: true), walks each <entry>, builds the 14-field record above
3-second delay between PAGINATED batches WITHIN a single query — delay(3000) after each batch. No delay between separate queries or between separate categories — multi-input runs may issue back-to-back requests faster than arXiv's published 3s/request etiquette. Throttle yourself by spacing queries across runs, or request a custom build that adds inter-source delay.
Actor.pushData() per paper, capped at maxPapersPerSource
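The request plan described in the steps above can be sketched in Python as a function that lists the URLs one source would generate. This is an illustration of the documented behavior, not a port of `main.js`: the function name `plan_requests`, the fixed batch size of 100, and the inclusion of `start=0` on category requests are assumptions. The endpoint is the plain-HTTP URL the README says the actor calls.

```python
from urllib.parse import urlencode

API = "http://export.arxiv.org/api/query"  # endpoint named in this README
BATCH = 100                                # assumed batch size per request

def plan_requests(kind, value, max_papers=50,
                  sort_by="submittedDate", sort_order="descending"):
    """List the request URLs for one search query or one category."""
    if kind == "search":
        search_query, total = f"all:{value}", min(max_papers, 500)
    elif kind == "category":
        # Categories are fetched in a single batch, so the total is capped at 100.
        search_query, total = f"cat:{value}", min(max_papers, BATCH)
    else:
        raise ValueError(f"unknown source kind: {kind!r}")

    urls = []
    for start in range(0, total, BATCH):
        params = {
            "search_query": search_query,
            "start": start,
            "max_results": min(BATCH, total - start),
            "sortBy": sort_by,
            "sortOrder": sort_order,
        }
        urls.append(f"{API}?{urlencode(params)}")
    return urls
```

A search with `max_papers=250` yields three URLs (`start` 0, 100, 200, the last with `max_results=50`), while a category with `max_papers=500` yields exactly one URL with `max_results=100`. A caller fetching these would sleep 3 seconds between consecutive URLs to match the delay described above.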

Honest limitations (read before bulk runs)

HTTP not HTTPS. Code calls http://export.arxiv.org/api/query — paper metadata transits cleartext, MITM-tamperable on hostile networks. arXiv supports HTTPS at the same hostname; this would be a one-line code change in a custom build.
Categories don't paginate. A category fetch issues exactly one request with max_results=min(maxPapersPerSource,100). Asking for 500 papers in cs.AI returns 100, NOT 500. Workaround: use a searchQuery formatted as cat:cs.AI — that path DOES paginate up to 500.
3s delay is intra-query only. A run with 5 queries + 5 categories issues up to 10 back-to-back requests with no inter-source delay — over arXiv's etiquette threshold for batched callers. Heavy users should expect occasional rate-limiting. Custom build can add delay(3000) between sources.
Outer try/catch wraps the entire run. A 5xx on query #3 of 10 throws, the catch fires, and queries #4-#10 + ALL categories are SKIPPED. Workaround: split inputs across runs.
Single-attempt fetch — no retry on transient errors.
searchQueries use all: operator — matches title + abstract + author + comment. For more precise control, write the operator yourself in the query (e.g. ti:transformer AND au:vaswani).
sortBy=relevance is best-effort against arXiv's relevance ranker; results can vary.
No author affiliations / ORCIDs / institution / page-count / figure-count / reference-count / citation-count. arXiv API returns paper-level metadata only; deeper enrichment requires Semantic Scholar / OpenAlex / direct PDF parsing.
No version chain — arxivId includes the version suffix (v1, v2...) but the actor does NOT walk the replacement history.
doi and journalRef are often null — many arXiv papers never get a journal-of-record DOI assigned.
Cheerio XML namespace selector quirk — arxiv\\:doi, doi matches both namespaced and bare doi elements; if arXiv changes their feed structure, the bare-name fallback covers it.
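Several of the limitations above share the same workaround: split inputs across runs, and route large category pulls through a `cat:` search query. A minimal sketch of that planning step follows. The helper name `split_into_runs` is hypothetical, and the `cat:` rewrite relies on the workaround as documented in this README; it has not been verified independently here.

```python
def split_into_runs(search_queries, categories, max_papers_per_source=50):
    """Build one run_input dict per source so a failure only affects that source."""
    cap = min(max_papers_per_source, 500)
    runs = []
    for query in search_queries:
        runs.append({
            "searchQueries": [query],
            "categories": [],
            "maxPapersPerSource": cap,
        })
    for category in categories:
        if max_papers_per_source > 100:
            # Category fetches do not paginate; use the documented cat: workaround.
            runs.append({
                "searchQueries": [f"cat:{category}"],
                "categories": [],
                "maxPapersPerSource": cap,
            })
        else:
            runs.append({
                "searchQueries": [],
                "categories": [category],
                "maxPapersPerSource": max_papers_per_source,
            })
    return runs
```

Each dict can be passed as `run_input` to the `call()` shown in the quick start. Pausing between runs, for example with `time.sleep(3)`, keeps separate sources from hitting arXiv back to back.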

OpenLibrary Book Scraper — apify.com/knotless_cadence/openlibrary-book-scraper
Hacker News Scraper — discussion threads + scoring → apify.com/knotless_cadence/hacker-news-scraper
Reddit Discussion Scraper — subreddit threads → apify.com/knotless_cadence/reddit-discussion-scraper
GitHub Trending Scraper — open-source signal → apify.com/knotless_cadence/github-trending-scraper

Need a custom build?

Apify-as-a-Service tiers:

Pilot — $97: 1 actor, basic config, 7-day support
Standard — $297: custom actor + Slack/email alerts, 30-day support
Premium — $797: custom actor + dashboard + 90-day support + 1 modification round

Common arXiv-related custom builds: institution-affiliation extraction (PDF-parse), citation graph join with OpenAlex, weekly delta alerts to Slack, RAG-ready chunking + embedding pipeline.

Email: spinov001@gmail.com Blog (case studies): https://blog.spinov.online Telegram channel (scraping & data engineering): https://t.me/scraping_ai

Honest disclosure

Public arXiv API only (http://export.arxiv.org/api/query) — no key, no auth, no scraping behind paywalls. HTTP not HTTPS in current code — custom build can switch to HTTPS.
arXiv's published etiquette is one request every 3 seconds — actor enforces this between paginated batches within a single query, NOT between separate queries or categories. Multi-input runs can exceed etiquette.
maxPapersPerSource for categories is silently capped at 100 (single-batch fetch, no pagination). For pagination use a searchQuery formatted as cat:cs.AI.
The actor returns arXiv's native metadata only. It does not extract: author affiliations, ORCIDs, page counts, figure counts, reference counts, citation counts, paper comments, or replacement history. For those, use Semantic Scholar / OpenAlex / direct PDF parsing.
Independent project — not affiliated with arXiv or Cornell University. arXiv data is free for research and educational use under arXiv's license terms.
