📄 arXiv Papers Monitor

Pricing

from $3.50 / 1,000 arXiv papers saved

Pull new AI / ML / CS / physics / math papers from arXiv as they land via the official arXiv API. Title, abstract, authors, PDF link, DOI, and LLM-ready summary card per paper. For ML researchers, AI agents, and journalists. Export, run via API, schedule, or integrate with other tools.

Developer

Skootle

Maintained by Community

TL;DR

AI engineers and ML researchers can lose 30+ minutes a day refreshing arxiv.org/list/cs.AI/recent and copy-pasting abstracts into spreadsheets. This actor delivers a clean daily diff of new arXiv papers in your tracked categories (cs.AI, cs.LG, cs.CL, cs.CV, stat.ML, math.OC, q-bio, and more), deduplicated by arxivId, with the full abstract, every author, PDF URL, DOI, and an LLM-ready markdown card per record. Watchlist mode emits only papers new since the last run, so a daily schedule can feed a RAG pipeline, vector DB, weekly research email, or Slack digest without duplicates, with ISO 8601 timestamps your downstream sort logic can rely on.

Try it on a small dataset (the 10-paper default fits the free $5 trial credit), then let us know what you think in a review.


What does arXiv Papers Monitor do?

It calls the public arXiv API on your behalf and turns the raw Atom feed into clean JSON your code can use immediately. Each paper record includes:

  • arxivId (e.g. 2304.12345) and the version-aware arxivIdVersion (2304.12345v2)
  • Full title and full abstract
  • Every authors[].name and (when arXiv provides it) authors[].affiliation
  • primaryCategory plus the full categories[] list, with an isCrossListed flag
  • submittedDate and updatedDate in ISO 8601
  • Direct pdfUrl and abstractPageUrl
  • doi, journalRef, and the author's own comment (often "NeurIPS 2026 spotlight" or page count)
  • agentMarkdown: a 5-line markdown card formatted for Claude / Codex / Slack / a CRM ticket

One API call replaces the manual workflow of opening arxiv.org, choosing a category, paging through 50 abstracts at a time, copy-pasting fields into a spreadsheet, and chasing PDF links. We collapse that to a JSON dataset you can pipe into a vector DB, an LLM agent, an alerting system, or a research dashboard.

Why scrape arXiv?

arXiv is where most AI, ML, vision, and NLP papers land first, often weeks or months before peer review. If your job is "what was published this week in X," refreshing arxiv.org/list/cs.AI/recent and copy-pasting abstracts into a spreadsheet eats 30+ minutes a day.

Feed a RAG pipeline, drive a weekly research newsletter, watch a specific lab or topic, or build training corpora, all from one daily diff. The buyers here are AI engineers wiring research retrieval, ML researchers tracking sub-fields, and editors of weekly AI newsletters who need a clean "what's new since yesterday" feed.

Who needs this?

  • AI agent builders who wire research-paper retrieval into RAG pipelines and need clean text plus PDF URLs without writing an Atom parser
  • ML researchers tracking three or four sub-fields and wanting a daily digest of new submissions in their categories
  • AI journalists chasing weekly stories who need to spot trending architectures, models, and lab outputs as they appear
  • M&A and corp-dev analysts profiling AI startups by tracking which authors and labs are publishing what
  • Recruiters sourcing ML talent by pulling first-author lists from hot subfields (RLHF, MoE, agents, vision-language)
  • Data scientists at LLM labs building reproduction pipelines who need full abstracts and DOIs, not just titles
  • Conference reviewers and editors who want a structured, per-category submission feed for trend analysis

If your job involves "what was published on arXiv this week in X," you are the buyer.

How to use arXiv Papers Monitor

  1. Open the actor in Apify Console.
  2. Pick your categories (e.g. ["cs.AI","cs.CL"]) or type a query ("retrieval augmented generation").
  3. Optionally set submittedAfter to limit to recent papers, or flip watchlistMode on for a daily-new feed.
  4. Click Start. A run with the default maxItems: 10 takes about 30 seconds.
  5. Download the dataset as JSON, CSV, or Excel, or pull it via the API at https://api.apify.com/v2/acts/skootle~arxiv-papers/runs/last/dataset/items.

How much will scraping arXiv cost?

Pay-per-result pricing. You only pay for papers actually saved, plus a one-time start fee per run.

| Plan | Per paper | Run start |
| --- | --- | --- |
| FREE | $0.005 | $0.005 |
| BRONZE | $0.0045 | $0.005 |
| SILVER | $0.004 | $0.005 |
| GOLD | $0.0035 | $0.005 |
| PLATINUM | $0.003 | $0.005 |
| DIAMOND | $0.003 | $0.005 |

Typical daily watchlist run for one researcher (50 new papers across cs.AI + cs.CL): about $0.26 on FREE, $0.16 on PLATINUM. A weekly bulk pull of 1000 papers is about $5 on FREE, $3 on PLATINUM. The $5 free Apify credit covers roughly 1000 records on the FREE tier.
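The arithmetic behind those figures can be checked with a few lines of Python. This is a minimal sketch that copies the per-paper prices and the run-start fee from the pricing table above; the function name estimate_run_cost is made up for illustration and is not part of the actor.

```python
# Per-paper prices and the flat run-start fee, copied from the pricing table.
PER_PAPER_USD = {
    "FREE": 0.005,
    "BRONZE": 0.0045,
    "SILVER": 0.004,
    "GOLD": 0.0035,
    "PLATINUM": 0.003,
    "DIAMOND": 0.003,
}
RUN_START_USD = 0.005


def estimate_run_cost(papers: int, plan: str = "FREE") -> float:
    """Estimated charge in USD for one run that saves `papers` records."""
    if papers < 0:
        raise ValueError("papers must be non-negative")
    # Pay-per-result: price per saved paper plus one start fee per run.
    return round(papers * PER_PAPER_USD[plan] + RUN_START_USD, 4)


if __name__ == "__main__":
    for plan in ("FREE", "PLATINUM"):
        print(plan, estimate_run_cost(50, plan), estimate_run_cost(1000, plan))
```

A 50-paper run on FREE works out to 50 x $0.005 + $0.005 = $0.255, which the text rounds to $0.26.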

arXiv runs an official, public, unauthenticated query API explicitly intended for programmatic access. We honor their published rate limit (1 request per 3 seconds) and identify ourselves with a descriptive User-Agent header. arXiv's Terms of Use cover non-commercial use directly; for commercial redistribution of paper content, follow up with arXiv directly and consult your own counsel.

This actor pulls only the metadata + abstract that arXiv exposes through the public API. It does not download PDFs, does not bypass any auth, and does not touch withdrawn papers.

Examples

1. Daily new cs.AI papers

{
  "categories": ["cs.AI"],
  "sortBy": "submittedDate",
  "sortOrder": "descending",
  "maxItems": 50,
  "watchlistMode": true
}

Schedule daily, point the dataset webhook at Slack or a vector DB.

2. RAG-themed papers from the last 30 days

{
  "query": "retrieval augmented generation",
  "submittedAfter": "2026-04-09",
  "submittedBefore": "2026-05-09",
  "maxItems": 200
}
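The dates in this example are hard-coded, so a scheduled run would drift away from "the last 30 days". A small helper can compute the window instead. This is a sketch only: last_n_days_input is a hypothetical name, and it assumes submittedAfter and submittedBefore accept plain YYYY-MM-DD strings, as in the example above.

```python
from datetime import date, timedelta


def last_n_days_input(query: str, days: int, today: date, max_items: int = 200) -> dict:
    """Build an actor input whose date window ends at `today` and spans `days` days."""
    start = today - timedelta(days=days)
    return {
        "query": query,
        "submittedAfter": start.isoformat(),   # e.g. "2026-04-09"
        "submittedBefore": today.isoformat(),  # e.g. "2026-05-09"
        "maxItems": max_items,
    }


if __name__ == "__main__":
    print(last_n_days_input("retrieval augmented generation", 30, date.today()))
```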

3. NLP + ML cross-listed papers

{
  "categories": ["cs.CL", "cs.LG"],
  "sortBy": "submittedDate",
  "maxItems": 100
}

4. Specific lab tracking via lab keywords in title or abstract

{
  "query": "DeepMind OR Anthropic",
  "categories": ["cs.AI", "cs.LG"],
  "maxItems": 100
}

5. Diffusion-model survey

{
  "query": "diffusion model",
  "sortBy": "relevance",
  "maxItems": 100
}

6. Math optimization for ML

{
  "categories": ["math.OC", "stat.ML"],
  "submittedAfter": "2026-01-01",
  "maxItems": 200
}

7. Computational neuroscience

{
  "categories": ["q-bio.NC", "cs.NE"],
  "maxItems": 50
}

8. Title-only feed for fast indexing

{
  "categories": ["cs.CV"],
  "includeAbstract": false,
  "maxItems": 1000
}

Input parameters

| Field | Type | Description |
| --- | --- | --- |
| query | string | Free-text search across title + abstract |
| categories | string[] | arXiv category codes (cs.AI, cs.LG, cs.CL, cs.CV, stat.ML, math.OC, physics.*, q-bio.*, and more) |
| submittedAfter | string (ISO date) | Earliest submission date |
| submittedBefore | string (ISO date) | Latest submission date |
| sortBy | enum | submittedDate, lastUpdatedDate, or relevance |
| sortOrder | enum | descending or ascending |
| maxItems | int | Max papers per run (default 10, max 2000) |
| includeAbstract | bool | Toggle full abstract vs title-only (default true) |
| watchlistMode | bool | Emit only new papers since the last run |
| proxyConfiguration | object | Optional residential proxy for very large bulk runs |

arXiv output format

arxiv_paper record

| Field | Type | Notes |
| --- | --- | --- |
| recordType | string | Always "arxiv_paper" |
| outputSchemaVersion | string | "2026-05-10". Bumps on schema change. |
| arxivId | string | "2304.12345" (no version) |
| arxivIdVersion | string | "2304.12345v2" |
| doi | string or null | DOI when assigned |
| title | string | Full title |
| abstract | string | Full abstract, whitespace-normalized |
| authors | object[] | { name, affiliation } per author |
| authorCount | int | Length of authors |
| primaryCategory | string | e.g. "cs.AI" |
| categories | string[] | All assigned categories |
| submittedDate | string | ISO 8601 |
| updatedDate | string | ISO 8601 |
| pdfUrl | string | Direct PDF URL |
| abstractPageUrl | string | arxiv.org abs page |
| journalRef | string or null | "Nature 612, 2026" style reference if accepted |
| comment | string or null | Author note ("NeurIPS 2026 spotlight", page count, etc.) |
| estimatedReadMinutes | int | Abstract word count / 200 |
| isCrossListed | bool | True when categories.length > 1 |
| agentMarkdown | string | LLM-ready 5-line card |
| fieldCompletenessScore | int | 0-100, 10 fields evaluated |
| scrapedAt | string | ISO 8601 |

Sample record

{
  "recordType": "arxiv_paper",
  "outputSchemaVersion": "2026-05-10",
  "arxivId": "2605.06667",
  "arxivIdVersion": "2605.06667v1",
  "doi": null,
  "title": "ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation",
  "abstract": "For artistic applications, video generation requires fine-grained control...",
  "authors": [
    { "name": "Omar El Khalifi", "affiliation": null },
    { "name": "Thomas Rossi", "affiliation": null }
  ],
  "authorCount": 9,
  "primaryCategory": "cs.CV",
  "categories": ["cs.CV", "cs.AI", "cs.LG"],
  "submittedDate": "2026-05-07T17:59:58Z",
  "updatedDate": "2026-05-07T17:59:58Z",
  "pdfUrl": "https://arxiv.org/pdf/2605.06667v1",
  "abstractPageUrl": "https://arxiv.org/abs/2605.06667v1",
  "journalRef": null,
  "comment": "SIGGRAPH 2026",
  "estimatedReadMinutes": 2,
  "isCrossListed": true,
  "agentMarkdown": "📄 ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation (2605.06667)\n👥 Omar El Khalifi + 8 more\n📅 Submitted 2026-05-07 · Category cs.CV\n📊 2 min read · Cross-listed\n🔗 https://arxiv.org/pdf/2605.06667v1",
  "fieldCompletenessScore": 80,
  "scrapedAt": "2026-05-09T20:50:00Z"
}
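If you want a card in a different layout, the five lines of agentMarkdown can be rebuilt from the other fields. The sketch below reproduces the card in the sample record; build_card is a hypothetical helper, and the wording for single-author or non-cross-listed papers is a guess, since the sample does not show those cases.

```python
def build_card(rec: dict) -> str:
    """Rebuild a 5-line card in the layout of the sample record's agentMarkdown."""
    first = rec["authors"][0]["name"] if rec["authors"] else "Unknown"
    others = rec["authorCount"] - 1
    # Assumption: single-author papers show just the name.
    authors = f"{first} + {others} more" if others > 0 else first
    read = f"{rec['estimatedReadMinutes']} min read"
    if rec["isCrossListed"]:
        # Assumption: the cross-listed suffix is simply omitted otherwise.
        read += " · Cross-listed"
    return "\n".join([
        f"📄 {rec['title']} ({rec['arxivId']})",
        f"👥 {authors}",
        f"📅 Submitted {rec['submittedDate'][:10]} · Category {rec['primaryCategory']}",
        f"📊 {read}",
        f"🔗 {rec['pdfUrl']}",
    ])
```

The date line uses only the first ten characters of submittedDate, which is the YYYY-MM-DD part of the ISO 8601 timestamp.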

During the actor run

No authentication required. The actor honors arXiv's published 1-request-per-3-seconds rate limit and identifies itself with a descriptive User-Agent, so the source stays available for everyone. A 1000-paper pull typically completes in about 30 seconds.

A run summary lands at the OUTPUT key, a markdown digest of the top 5 papers at AGENT_BRIEFING, and (with watchlistMode: true) the rolling 50,000-id dedupe window at WATCHLIST_STATE.

FAQ

How is this different from arXiv's free API?

The free API returns raw Atom XML with namespaced tags. You write an XML parser, you write a paginator that respects the 3-second rate limit, you write a normalizer for affiliation / journal_ref / comment fields, and you write the watchlist diff yourself. Then you maintain it. We give you typed JSON, idempotent IDs, watchlist mode, an agent-ready markdown card per record, and a versioned schema so your downstream pipeline does not silently break.
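To give a sense of the do-it-yourself route, here is a minimal sketch of the parsing step alone, using Python's standard library. The feed string is a hand-written, abbreviated entry built from the sample record's values, not real API output, and the mapping of the Atom published element to submittedDate is an assumption about how the actor names things. Pagination, rate limiting, and the watchlist diff are not covered.

```python
import re
import xml.etree.ElementTree as ET

# Namespaces used by arXiv's Atom feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hand-written, abbreviated entry for illustration only.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/2605.06667v1</id>
    <published>2026-05-07T17:59:58Z</published>
    <title>ActCam: Zero-Shot Joint Camera and
      3D Motion Control for Video Generation</title>
    <author><name>Omar El Khalifi</name></author>
    <author><name>Thomas Rossi</name></author>
    <arxiv:comment>SIGGRAPH 2026</arxiv:comment>
    <arxiv:primary_category term="cs.CV"/>
    <category term="cs.CV"/>
    <category term="cs.AI"/>
    <category term="cs.LG"/>
  </entry>
</feed>
"""


def _text(node, path):
    """Return whitespace-normalized text at `path`, or None if missing."""
    value = node.findtext(path, default=None, namespaces=NS)
    return " ".join(value.split()) if value else None


def parse_feed(atom_xml: str) -> list:
    """Turn an Atom feed string into a list of flat dicts."""
    records = []
    for entry in ET.fromstring(atom_xml.strip()).findall("atom:entry", NS):
        versioned = _text(entry, "atom:id").split("/abs/")[-1]
        primary = entry.find("arxiv:primary_category", NS)
        categories = [c.get("term") for c in entry.findall("atom:category", NS)]
        records.append({
            "arxivId": re.sub(r"v\d+$", "", versioned),  # strip the version suffix
            "arxivIdVersion": versioned,
            "title": _text(entry, "atom:title"),
            "authors": [{"name": _text(a, "atom:name")}
                        for a in entry.findall("atom:author", NS)],
            "primaryCategory": primary.get("term") if primary is not None else None,
            "categories": categories,
            "submittedDate": _text(entry, "atom:published"),  # assumed mapping
            "comment": _text(entry, "arxiv:comment"),
            "isCrossListed": len(categories) > 1,
        })
    return records
```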

What about HuggingFace papers or PapersWithCode?

Different sources, different scope. HuggingFace Papers is curated and lags arXiv. PapersWithCode focuses on code-attached papers. Use this actor for the firehose, then enrich with HF / PWC if you need code-availability signals. We will likely ship companion actors for both in v0.2.

Can I track only papers from specific universities or labs?

Indirectly. arXiv's API does not expose a clean "affiliation" filter, but you can query by lab keywords ("Anthropic", "DeepMind", "Stanford NLP") and the term will match titles and abstracts. For author-list filtering, post-process the dataset (authors[].name) downstream.
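The downstream author filter can be a few lines. This sketch assumes records shaped like the sample record (an authors list of objects with a name key); filter_by_author is a hypothetical helper, and it uses case-insensitive substring matching, which you may want to tighten for common surnames.

```python
def filter_by_author(records: list, names: list) -> list:
    """Keep records where any author name contains any of `names` (case-insensitive)."""
    wanted = [n.casefold() for n in names]
    kept = []
    for rec in records:
        author_names = [a["name"].casefold() for a in rec.get("authors", [])]
        if any(w in name for name in author_names for w in wanted):
            kept.append(rec)
    return kept
```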

How does watchlist mode work?

Flip watchlistMode: true. The actor reads WATCHLIST_STATE from the key-value store, runs the search, and emits only papers whose arxivId it has not delivered before. After each run it appends the newly seen IDs back to state (rolling window of 50,000). Pair with a daily Apify schedule for a clean "what's new" feed.
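The diff logic described here can be sketched as a pure function. This illustrates the idea only and is not the actor's code: the real format of WATCHLIST_STATE is not documented on this page, so the sketch assumes it is an ordered list of arxivId strings, oldest first, and watchlist_diff is a made-up name.

```python
def watchlist_diff(seen_ids: list, fetched: list, window: int = 50_000):
    """Return (new_records, new_state) for one run.

    seen_ids: previously delivered arxivIds, oldest first (assumed state shape).
    fetched:  records from the current search, each with an "arxivId" key.
    """
    seen = set(seen_ids)
    fresh = []
    for rec in fetched:
        if rec["arxivId"] not in seen:
            seen.add(rec["arxivId"])  # also guards against repeats within one run
            fresh.append(rec)
    # Append the newly delivered IDs, then keep only the most recent `window`.
    state = (list(seen_ids) + [r["arxivId"] for r in fresh])[-window:]
    return fresh, state
```

One consequence of a rolling window is that an ID pushed out of it could be emitted again if it reappears in a later search; with a 50,000-id window and a daily schedule that should be rare.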

Can I use this with Python?

Yes. pip install apify-client, call client.actor("skootle/arxiv-papers").call(run_input=...), then iterate client.dataset(run["defaultDatasetId"]).iterate_items().
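Put together, those calls fit in one small function. This is a sketch: fetch_papers is a hypothetical wrapper, it takes an already-constructed client so it stays independent of how you store your token, and the None check is a defensive assumption about failed runs.

```python
def fetch_papers(client, run_input: dict) -> list:
    """Run the actor and return its dataset items as a list of dicts.

    `client` is expected to be an apify_client.ApifyClient instance.
    """
    run = client.actor("skootle/arxiv-papers").call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

To use it, create the client with from apify_client import ApifyClient and ApifyClient("<YOUR_APIFY_TOKEN>"), then call fetch_papers(client, {"categories": ["cs.AI"], "maxItems": 10}).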

Can I integrate with Make / Zapier / n8n / Slack?

Yes. Apify exposes webhook triggers on dataset items and run completion. n8n and Make have native Apify connectors; Zapier works through the standard webhook bridge.

Why does this cost more than free arXiv scrapers?

If you are wiring this into a customer-facing product or a daily AI-agent pipeline, the per-record cost ($0.0035 at GOLD) buys reliability that free actors typically do not provide: a versioned schema, idempotent IDs, a watchlist diff, daily automated tests on Apify, and a maintenance commitment. Free actors can break when the source changes a tag name; you do not get notified, and your pipeline silently goes empty.

What rate limits should I worry about?

arXiv asks for at most 1 request per 3 seconds. We honor that automatically. With 100 papers per page, a 1000-paper pull takes roughly 30 seconds plus arXiv processing time.
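The 30-second figure follows from the page count. A rough sketch of the arithmetic, assuming one 3-second slot per page of 100 papers and ignoring arXiv's own response time (min_pull_seconds is a hypothetical name):

```python
import math


def min_pull_seconds(papers: int, page_size: int = 100, delay_s: float = 3.0) -> float:
    """Lower bound on pull time: one rate-limit slot per page requested."""
    pages = math.ceil(papers / page_size)
    return pages * delay_s


if __name__ == "__main__":
    print(min_pull_seconds(1000))  # 10 pages at 3 seconds each
```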

Does this download the full PDF?

No, only metadata and abstract. The pdfUrl field gives you the direct PDF link if your downstream needs the full text.

Why choose arXiv Papers Monitor

  • Watchlist mode emits only what's new since the last run. A rolling 50,000-id window means your RAG pipeline ingests each paper once.
  • Reliability free actors can't deliver. Free arXiv scrapers can break when source tags change: you don't get notified, and your pipeline silently goes empty. The per-record cost ($0.0035 at GOLD) buys daily automated testing and a 24-48 hour fix turnaround.
  • Sub-minute runtime, no rate-limit babysitting. Pure HTTP against the official arXiv API, no HTML parsing, no headless browser, 1000 papers in about 30 seconds.
  • Drop-in for LLM agents. agentMarkdown card baked into every record, plus a per-run AGENT_BRIEFING.md digest of the top 5 papers ready for Slack or a daily LLM context window.
  • Schema doesn't break your pipeline: it is versioned (outputSchemaVersion) and bumped on every breaking change.
  • Re-runs are safe to dedupe by ID: arxivId-keyed records upsert cleanly across runs.
  • AI agents can self-filter sparse rows via fieldCompletenessScore (0-100, 10 fields evaluated).
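The last two points can be combined in a small downstream step. This sketch keeps an in-memory store keyed by arxivId, skips rows below a completeness threshold, and lets a record with a later updatedDate replace an older one. The helper name upsert_papers and the threshold are illustrative, and the string comparison assumes every updatedDate uses the same UTC ISO 8601 format as the sample record.

```python
def upsert_papers(store: dict, records: list, min_score: int = 0) -> dict:
    """Upsert records into `store` keyed by arxivId, skipping sparse rows."""
    for rec in records:
        if rec.get("fieldCompletenessScore", 0) < min_score:
            continue  # too sparse for this pipeline
        current = store.get(rec["arxivId"])
        # Same-format ISO 8601 UTC strings sort chronologically as plain text.
        if current is None or rec["updatedDate"] >= current["updatedDate"]:
            store[rec["arxivId"]] = rec
    return store
```

In a real pipeline the dict would be a database table or vector-store collection with arxivId as the primary key.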

Your feedback

Hit a bug or want a feature? Open an issue on the Issues tab rather than the reviews page, and we will fix it fast (typically within 48 hours).

Support and contact

Found a bug or need a new field? Open an issue. For commercial use questions, email jamie.kester@gmail.com.