📄 arXiv Papers Monitor
Pricing
from $3.50 / 1,000 arXiv papers saved
Pull new AI / ML / CS / physics / math papers from arXiv as they land via the official arXiv API. Title, abstract, authors, PDF link, DOI, and LLM-ready summary card per paper. For ML researchers, AI agents, and journalists. Export, run via API, schedule, or integrate with other tools.
Developer
Skootle
an hour ago
Last modified
TL;DR
AI engineers and ML researchers waste 30+ minutes a day refreshing arxiv.org/list/cs.AI/recent and copy-pasting abstracts into spreadsheets. This delivers a clean daily diff of new arXiv papers in your tracked categories (cs.AI, cs.LG, cs.CL, cs.CV, stat.ML, math.OC, q-bio, more), deduplicated by arxivId, with full abstract, every author, PDF URL, DOI, and an LLM-ready markdown card per record. Watchlist mode emits only papers new since the last run, so a daily schedule feeds a RAG pipeline, vector DB, weekly research email, or Slack digest with zero duplicates and ISO 8601 timestamps your downstream sort logic can trust.
Try it on a small dataset (the 10-paper default fits the free $5 trial credit), then let us know what you think in a review.
What does arXiv Papers Monitor do?
It calls the public arXiv API on your behalf and turns the raw Atom feed into clean JSON your code can use immediately. Each paper record includes:
- `arxivId` (e.g. `2304.12345`) and the version-aware `arxivIdVersion` (`2304.12345v2`)
- Full `title` and full `abstract`
- Every `authors[].name` and (when arXiv provides it) `authors[].affiliation`
- `primaryCategory` plus the full `categories[]` list, with an `isCrossListed` flag
- `submittedDate` and `updatedDate` in ISO 8601
- Direct `pdfUrl` and `abstractPageUrl`
- `doi`, `journalRef`, and the author's own `comment` (often "NeurIPS 2026 spotlight" or page count)
- `agentMarkdown`: a 5-line markdown card formatted for Claude / Codex / Slack / a CRM ticket
One API call replaces the manual workflow of opening arxiv.org, choosing a category, paging through 50 abstracts at a time, copy-pasting fields into a spreadsheet, and chasing PDF links. We collapse that to a JSON dataset you can pipe into a vector DB, an LLM agent, an alerting system, or a research dashboard.
Why scrape arXiv?
arXiv is where most AI, ML, vision, and NLP papers land first, often weeks or months before peer review. If your job is "what was published this week in X," refreshing arxiv.org/list/cs.AI/recent and copy-pasting abstracts into a spreadsheet eats 30+ minutes a day.
Feed a RAG pipeline, drive a weekly research newsletter, watch a specific lab or topic, or build training corpora, all from one daily diff. The buyers here are AI engineers wiring research retrieval, ML researchers tracking sub-fields, and editors of weekly AI newsletters who need a clean "what's new since yesterday" feed.
Who needs this?
- AI agent builders wiring research-paper retrieval into RAG pipelines who need clean text plus PDF URLs without writing an Atom parser
- ML researchers tracking three or four sub-fields who want a daily digest of new submissions in their categories
- AI journalists chasing weekly stories who need to spot trending architectures, models, and lab outputs as they appear
- M&A and corp-dev analysts profiling AI startups by tracking which authors and labs are publishing what
- Recruiters sourcing ML talent by pulling first-author lists from hot subfields (RLHF, MoE, agents, vision-language)
- Data scientists at LLM labs building reproduction pipelines who need full abstracts and DOIs, not titles
- Conference reviewers and editors who want a structured, per-category submission feed for trend analysis
If your job involves "what was published on arXiv this week in X," you are the buyer.
How to use arXiv Papers Monitor
- Open the actor in Apify Console.
- Pick your `categories` (e.g. `["cs.AI","cs.CL"]`) or type a `query` (e.g. "retrieval augmented generation").
- Optionally set `submittedAfter` to limit results to recent papers, or turn `watchlistMode` on for a daily-new feed.
- Click Start. A run with the default `maxItems: 10` finishes in about 30 seconds.
- Download the dataset as JSON, CSV, or Excel, or pull it via the API at https://api.apify.com/v2/acts/skootle~arxiv-papers/runs/last/dataset/items.
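If you pull results over HTTP rather than downloading them, a small URL builder keeps the query string tidy. This is a minimal sketch: the base URL is the one quoted above, while the `token`, `format`, and `limit` query parameters are assumed to follow the standard Apify dataset API, and `dataset_items_url` is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint quoted in the steps above.
BASE_URL = "https://api.apify.com/v2/acts/skootle~arxiv-papers/runs/last/dataset/items"


def dataset_items_url(token: str, fmt: str = "json", limit: Optional[int] = None) -> str:
    """Build the URL for the most recent run's dataset items.

    `token`, `format`, and `limit` are assumed query parameters; check the
    Apify API reference before relying on them.
    """
    params = {"token": token, "format": fmt}
    if limit is not None:
        params["limit"] = limit
    return f"{BASE_URL}?{urlencode(params)}"


# Placeholder token; substitute your own before fetching.
url = dataset_items_url("YOUR_APIFY_TOKEN", fmt="csv", limit=10)
print(url)
```

Pass the resulting URL to any HTTP client (`urllib.request`, `curl`, a spreadsheet import) to fetch the data.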
How much will scraping arXiv cost?
Pay-per-result pricing. You only pay for papers actually saved, plus a one-time start fee per run.
| Plan | Per paper | Run start |
|---|---|---|
| FREE | $0.005 | $0.005 |
| BRONZE | $0.0045 | $0.005 |
| SILVER | $0.004 | $0.005 |
| GOLD | $0.0035 | $0.005 |
| PLATINUM | $0.003 | $0.005 |
| DIAMOND | $0.003 | $0.005 |
Typical daily watchlist run for one researcher (50 new papers across cs.AI + cs.CL): about $0.26 on FREE, $0.16 on PLATINUM. A weekly bulk pull of 1000 papers is about $5 on FREE, $3 on PLATINUM. The $5 free Apify credit covers roughly 1000 records on the FREE tier.
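The arithmetic behind those estimates is one multiplication plus the start fee. The sketch below copies the prices from the table above; `estimate_run_cost` is a hypothetical helper, and it assumes exactly one start fee per run and no other platform charges.

```python
# Per-paper prices in USD, copied from the pricing table above.
PER_PAPER = {
    "FREE": 0.005,
    "BRONZE": 0.0045,
    "SILVER": 0.004,
    "GOLD": 0.0035,
    "PLATINUM": 0.003,
    "DIAMOND": 0.003,
}
RUN_START_FEE = 0.005  # identical on every plan


def estimate_run_cost(papers: int, plan: str = "FREE") -> float:
    """Cost of one run: papers saved times the per-paper price, plus the start fee."""
    return round(papers * PER_PAPER[plan] + RUN_START_FEE, 4)


for plan in ("FREE", "PLATINUM"):
    print(plan, "daily 50 papers:", estimate_run_cost(50, plan))
    print(plan, "weekly 1000 papers:", estimate_run_cost(1000, plan))
```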
Is it legal to scrape arXiv?
arXiv runs an official, public, unauthenticated query API explicitly intended for programmatic access. We honor their published rate limit (1 request per 3 seconds) and identify ourselves with a descriptive User-Agent header. arXiv's Terms of Use cover non-commercial use directly; for commercial redistribution of paper content, follow up with arXiv directly and consult your own counsel.
This actor pulls only the metadata + abstract that arXiv exposes through the public API. It does not download PDFs, does not bypass any auth, and does not touch withdrawn papers.
Examples
1. Daily new cs.AI papers
{"categories": ["cs.AI"],"sortBy": "submittedDate","sortOrder": "descending","maxItems": 50,"watchlistMode": true}
Schedule daily, point the dataset webhook at Slack or a vector DB.
2. RAG-themed papers from the last 30 days
{"query": "retrieval augmented generation","submittedAfter": "2026-04-09","submittedBefore": "2026-05-09","maxItems": 200}
3. NLP + ML cross-listed papers
{"categories": ["cs.CL", "cs.LG"],"sortBy": "submittedDate","maxItems": 100}
4. Specific lab tracking via lab keyword in title or abstract
{"query": "DeepMind OR Anthropic","categories": ["cs.AI", "cs.LG"],"maxItems": 100}
5. Diffusion-model survey
{"query": "diffusion model","sortBy": "relevance","maxItems": 100}
6. Math optimization for ML
{"categories": ["math.OC", "stat.ML"],"submittedAfter": "2026-01-01","maxItems": 200}
7. Computational neuroscience
{"categories": ["q-bio.NC", "cs.NE"],"maxItems": 50}
8. Title-only feed for fast indexing
{"categories": ["cs.CV"],"includeAbstract": false,"maxItems": 1000}
Input parameters
| Field | Type | Description |
|---|---|---|
| `query` | string | Free-text search across title + abstract |
| `categories` | string[] | arXiv category codes (cs.AI, cs.LG, cs.CL, cs.CV, stat.ML, math.OC, physics.*, q-bio.*, more) |
| `submittedAfter` | string (ISO date) | Earliest submission date |
| `submittedBefore` | string (ISO date) | Latest submission date |
| `sortBy` | enum | submittedDate, lastUpdatedDate, or relevance |
| `sortOrder` | enum | descending or ascending |
| `maxItems` | int | Max papers per run (default 10, max 2000) |
| `includeAbstract` | bool | Toggle full abstract vs title-only (default true) |
| `watchlistMode` | bool | Emit only new papers since the last run |
| `proxyConfiguration` | object | Optional residential proxy for very large bulk runs |
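A rolling date window, like the "last 30 days" example, is easier to compute than to hard-code. The sketch below assembles an input object from the field names in the table; `build_run_input` is a hypothetical helper, and the lower bound of 1 on `maxItems` is an assumption (only the maximum of 2000 is documented).

```python
from datetime import date, timedelta


def build_run_input(categories, days_back=30, max_items=200, today=None):
    """Assemble an actor input dict covering the last `days_back` days."""
    # Documented maximum is 2000; the minimum of 1 is an assumption.
    if not 1 <= max_items <= 2000:
        raise ValueError("maxItems must be between 1 and 2000")
    today = today or date.today()
    return {
        "categories": list(categories),
        "submittedAfter": (today - timedelta(days=days_back)).isoformat(),
        "submittedBefore": today.isoformat(),
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "maxItems": max_items,
    }


# Fixed `today` so the result is reproducible.
run_input = build_run_input(["cs.AI", "cs.CL"], today=date(2026, 5, 9))
print(run_input)
```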
arXiv output format
arxiv_paper record
| Field | Type | Notes |
|---|---|---|
| `recordType` | string | Always "arxiv_paper" |
| `outputSchemaVersion` | string | "2026-05-10". Bumps on schema change. |
| `arxivId` | string | "2304.12345" (no version) |
| `arxivIdVersion` | string | "2304.12345v2" |
| `doi` | string \| null | DOI when assigned |
| `title` | string | Full title |
| `abstract` | string | Full abstract, whitespace-normalized |
| `authors` | object[] | { name, affiliation } per author |
| `authorCount` | int | Length of authors |
| `primaryCategory` | string | e.g. "cs.AI" |
| `categories` | string[] | All assigned categories |
| `submittedDate` | string | ISO 8601 |
| `updatedDate` | string | ISO 8601 |
| `pdfUrl` | string | Direct PDF URL |
| `abstractPageUrl` | string | arxiv.org abs page |
| `journalRef` | string \| null | "Nature 612, 2026" style reference if accepted |
| `comment` | string \| null | Author note ("NeurIPS 2026 spotlight", page count, etc.) |
| `estimatedReadMinutes` | int | Abstract word count / 200 |
| `isCrossListed` | bool | True when categories.length > 1 |
| `agentMarkdown` | string | LLM-ready 5-line card |
| `fieldCompletenessScore` | int | 0-100, 10 fields evaluated |
| `scrapedAt` | string | ISO 8601 |
Sample record
{"recordType": "arxiv_paper","outputSchemaVersion": "2026-05-10","arxivId": "2605.06667","arxivIdVersion": "2605.06667v1","doi": null,"title": "ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation","abstract": "For artistic applications, video generation requires fine-grained control...","authors": [{ "name": "Omar El Khalifi", "affiliation": null },{ "name": "Thomas Rossi", "affiliation": null }],"authorCount": 9,"primaryCategory": "cs.CV","categories": ["cs.CV", "cs.AI", "cs.LG"],"submittedDate": "2026-05-07T17:59:58Z","updatedDate": "2026-05-07T17:59:58Z","pdfUrl": "https://arxiv.org/pdf/2605.06667v1","abstractPageUrl": "https://arxiv.org/abs/2605.06667v1","journalRef": null,"comment": "SIGGRAPH 2026","estimatedReadMinutes": 2,"isCrossListed": true,"agentMarkdown": "📄 ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation (2605.06667)\n👥 Omar El Khalifi + 8 more\n📅 Submitted 2026-05-07 · Category cs.CV\n📊 2 min read · Cross-listed\n🔗 https://arxiv.org/pdf/2605.06667v1","fieldCompletenessScore": 80,"scrapedAt": "2026-05-09T20:50:00Z"}
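Downstream code can use `fieldCompletenessScore` to drop sparse rows before indexing. The sketch below runs on a trimmed copy of the sample record; `keep_complete` and the threshold of 70 are illustrative choices, not part of the actor.

```python
import json

# Trimmed copy of the sample record above (only the fields used here).
raw = """[{"arxivId": "2605.06667", "primaryCategory": "cs.CV",
  "categories": ["cs.CV", "cs.AI", "cs.LG"], "isCrossListed": true,
  "fieldCompletenessScore": 80,
  "pdfUrl": "https://arxiv.org/pdf/2605.06667v1"}]"""


def keep_complete(records, min_score=70):
    """Return records whose completeness score meets the threshold."""
    return [r for r in records if r.get("fieldCompletenessScore", 0) >= min_score]


records = json.loads(raw)
selected = keep_complete(records)
for paper in selected:
    print(paper["arxivId"], paper["primaryCategory"], paper["pdfUrl"])
```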
During the actor run
No authentication required. The actor honors arXiv's published 1-request-per-3-seconds rate limit and identifies itself with a descriptive User-Agent, so the source stays available for everyone. A 1000-paper pull typically completes in about 30 seconds.
A run summary lands at the OUTPUT key, a markdown digest of the top 5 papers at AGENT_BRIEFING, and (with watchlistMode: true) the rolling 50,000-id dedupe window at WATCHLIST_STATE.
FAQ
How is this different from arXiv's free API?
The free API returns raw Atom XML with namespaced tags. You write an XML parser, you write a paginator that respects the 3-second rate limit, you write a normalizer for affiliation / journal_ref / comment fields, and you write the watchlist diff yourself. Then you maintain it. We give you typed JSON, idempotent IDs, watchlist mode, an agent-ready markdown card per record, and a versioned schema so your downstream pipeline does not silently break.
What about HuggingFace papers or PapersWithCode?
Different sources, different scope. HuggingFace Papers is curated and lags arXiv. PapersWithCode focuses on code-attached papers. Use this actor for the firehose, then enrich with HF / PWC if you need code-availability signals. We will likely ship companion actors for both in v0.2.
Can I track only papers from specific universities or labs?
Indirectly. arXiv's API does not expose a clean "affiliation" filter, but you can query by lab keywords ("Anthropic", "DeepMind", "Stanford NLP") and the term will match titles and abstracts. For author-list filtering, post-process the dataset (authors[].name) downstream.
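The downstream author filter can be a few lines over `authors[].name`. This is a minimal sketch: `by_author` is a hypothetical helper doing case-insensitive substring matching, and the second record is made-up illustration data.

```python
def by_author(records, needles):
    """Keep records where any author name contains any needle (case-insensitive)."""
    lowered = [n.lower() for n in needles]
    return [
        r
        for r in records
        if any(n in a["name"].lower() for a in r.get("authors", []) for n in lowered)
    ]


records = [
    # Author names taken from the sample record in this README.
    {"arxivId": "2605.06667", "authors": [{"name": "Omar El Khalifi"}, {"name": "Thomas Rossi"}]},
    # Hypothetical record for illustration only.
    {"arxivId": "0000.00000", "authors": [{"name": "Jane Doe"}]},
]

matches = by_author(records, ["rossi"])
print([r["arxivId"] for r in matches])
```

Substring matching will not separate authors who share a surname, so treat the result as a shortlist rather than a definitive match.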
How does watchlist mode work?
Flip watchlistMode: true. The actor reads WATCHLIST_STATE from the key-value store, runs the search, and emits only papers whose arxivId it has not delivered before. After each run it appends the newly seen IDs back to state (rolling window of 50,000). Pair with a daily Apify schedule for a clean "what's new" feed.
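To make the rolling-window idea concrete, here is one way such a dedupe could be sketched. It illustrates the behavior described above and is not the actor's actual implementation; the `RollingSeenIds` class and its method names are hypothetical.

```python
from collections import OrderedDict


class RollingSeenIds:
    """Remember the most recent `capacity` ids; the oldest are evicted first."""

    def __init__(self, capacity=50_000, seen=()):
        self.capacity = capacity
        self._ids = OrderedDict((i, None) for i in seen)
        self._trim()

    def _trim(self):
        # Drop the oldest ids until the window fits the capacity.
        while len(self._ids) > self.capacity:
            self._ids.popitem(last=False)

    def filter_new(self, records):
        """Return only records with an unseen arxivId and remember them."""
        fresh = []
        for record in records:
            if record["arxivId"] not in self._ids:
                self._ids[record["arxivId"]] = None
                fresh.append(record)
        self._trim()
        return fresh

    def to_state(self):
        """Serializable list of ids, oldest first, to persist between runs."""
        return list(self._ids)


window = RollingSeenIds(capacity=2)  # tiny capacity to show eviction
first = window.filter_new([{"arxivId": "a"}, {"arxivId": "b"}])
second = window.filter_new([{"arxivId": "b"}, {"arxivId": "c"}])
print(len(first), [r["arxivId"] for r in second], window.to_state())
```

One consequence of a bounded window: an id evicted from it would be treated as new if it showed up again in a later search.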
Can I use this with Python?
Yes. pip install apify-client, call client.actor("skootle/arxiv-papers").call(run_input=...), then iterate client.dataset(run["defaultDatasetId"]).iterate_items().
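Put together, those calls might look like the sketch below. The client calls are the ones named above; reading the token from an `APIFY_TOKEN` environment variable and the `collect_papers` / `main` helper names are assumptions for illustration.

```python
import os


def collect_papers(client, run_input):
    """Run the actor and return its dataset items as a list."""
    run = client.actor("skootle/arxiv-papers").call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


def main():
    # Imported here so the helper above can be used without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    # Assumption: the API token lives in the APIFY_TOKEN environment variable.
    client = ApifyClient(os.environ["APIFY_TOKEN"])
    papers = collect_papers(client, {"categories": ["cs.AI"], "maxItems": 10})
    for paper in papers:
        print(paper["arxivId"], paper["title"])
```

Call `main()` from your own script once the token is set. Each call starts a billable run.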
Can I integrate with Make / Zapier / n8n / Slack?
Yes. Apify exposes webhook triggers on dataset items and run completion. n8n and Make have native Apify connectors; Zapier works through the standard webhook bridge.
Why does this cost more than free arXiv scrapers?
If you are wiring this into a customer-facing product or a daily AI-agent pipeline, the per-record cost ($0.0035 at GOLD) buys you reliability free actors do not provide: versioned schema, idempotent IDs, watchlist diff, daily Apify automated tests, and a maintenance commitment. Free actors can break when the source changes a tag name; you do not get notified, and your pipeline silently goes empty.
What rate limits should I worry about?
arXiv asks for at most 1 request per 3 seconds. We honor that automatically. With 100 papers per page, a 1000-paper pull takes roughly 30 seconds plus arXiv processing time.
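The same estimate works for any pull size. This is a rough sketch that budgets one 3-second slot per page of 100 papers, as in the figures above, and ignores arXiv processing time and actor startup; `min_pull_seconds` is a hypothetical helper.

```python
import math

PAGE_SIZE = 100          # papers per API page, as stated above
SECONDS_PER_REQUEST = 3  # arXiv's published rate limit


def min_pull_seconds(papers: int) -> int:
    """Rate-limit budget for a pull: one 3-second slot per page."""
    pages = math.ceil(papers / PAGE_SIZE)
    return pages * SECONDS_PER_REQUEST


for n in (10, 1000, 2000):
    print(n, "papers:", min_pull_seconds(n), "seconds")
```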
Does this download the full PDF?
No, only metadata and abstract. The pdfUrl field gives you the direct PDF link if your downstream needs the full text.
Why choose arXiv Papers Monitor
- Watchlist mode emits only what's new since the last run. A rolling 50,000-id window means your RAG pipeline ingests each paper exactly once.
- Reliability free actors can't deliver. Free arXiv scrapers can break when source tags change. You don't get notified, and your pipeline silently goes empty. The per-record cost ($0.0035 at GOLD) buys daily automated tests and a 24-48 hour fix turnaround.
- Sub-minute runtime, no rate-limit babysitting. Pure HTTP against the official arXiv API, no HTML parsing, no headless browser, 1000 papers in about 30 seconds.
- Drop-in for LLM agents. An `agentMarkdown` card is baked into every record, plus a per-run `AGENT_BRIEFING.md` digest of the top 5 papers ready for Slack or a daily LLM context window.
- Schema doesn't break your pipeline: it is versioned and bumped on every breaking change.
- Re-runs are safe to dedupe by ID: `arxivId`-keyed records upsert cleanly across runs.
- AI agents can self-filter sparse rows via `fieldCompletenessScore` (0-100, 10 fields evaluated).
Your feedback
Hit a bug or want a feature? Open an issue on the Issues tab rather than the reviews page, and we will fix it fast (typically within 48 hours).
Other Skootle actors you might want to check
- skootle/hackernews-watchlist, watchlist new HN stories matching keywords or domains
- skootle/github-trending, daily trending repos by language with stargazer + commit signals
- skootle/reddit-subreddit-monitor, new posts in any subreddit with watchlist diff
- skootle/sec-edgar-filings, public SEC filings normalized for AI agents
Support and contact
Found a bug or need a new field? Open an issue. For commercial use questions, email jamie.kester@gmail.com.