arXiv Daily Digest Scraper

Pricing

Pay per usage

Scrape arXiv papers by search query or category. Extract titles, authors, abstracts, and PDF links from recent submissions.

Developer

Donny Nguyen

Maintained by Community

arXiv Daily Digest

Monitor arXiv for new papers by topic, author, or keyword. This actor extracts comprehensive paper metadata from arXiv.org including titles, authors, abstracts, categories, and PDF links.

Features

  • Search by Keywords: Query arXiv using any search terms
  • Monitor Categories: Track specific arXiv categories (cs.AI, cs.LG, physics, etc.)
  • Custom URLs: Provide your own arXiv search or list page URLs
  • Date Filtering: Only include papers from the last N days
  • Comprehensive Metadata: Extract title, authors, abstract, categories, arXiv ID, PDF URL, and publication date

Input Parameters

  • Search Queries (stringList): Keywords or phrases to search on arXiv (e.g., "machine learning", "quantum computing")
  • arXiv Categories (stringList): Category codes to monitor (e.g., "cs.AI", "cs.LG", "physics.gen-ph")
  • Start URLs (requestListSources): Custom arXiv URLs to scrape
  • Max Results (integer, default: 50): Maximum number of papers to extract per URL
  • Days Back (integer, default: 7): Only include papers from the last N days (0 = no filter)
  • Use Residential Proxy (boolean, default: false): Use residential proxies for better reliability

Output

Each paper includes:

{
  "title": "Paper Title",
  "authors": ["Author 1", "Author 2"],
  "abstract": "Paper abstract text...",
  "categories": ["cs.AI", "cs.LG"],
  "arxivId": "2401.12345",
  "pdfUrl": "https://arxiv.org/pdf/2401.12345.pdf",
  "publishedDate": "1 Jan 2024",
  "url": "https://arxiv.org/abs/2401.12345",
  "scrapedAt": "2024-01-15T10:30:00.000Z"
}
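
For downstream processing, a record like the one above can be filtered by age on the consumer side. The following is a minimal Python sketch, not the actor's own code: it assumes `publishedDate` always follows the day, abbreviated month, year pattern shown in the sample ("1 Jan 2024"), and the helper names `parse_published` and `within_days_back` are hypothetical.

```python
from datetime import date, datetime, timedelta


def parse_published(value: str) -> date:
    """Parse a date such as '1 Jan 2024' (format assumed from the sample record)."""
    return datetime.strptime(value, "%d %b %Y").date()


def within_days_back(record: dict, days_back: int, today: date) -> bool:
    """Apply the documented Days Back rule: 0 means no filter."""
    if days_back == 0:
        return True
    cutoff = today - timedelta(days=days_back)
    return parse_published(record["publishedDate"]) >= cutoff


# Sample record trimmed to the fields used here.
sample = {
    "arxivId": "2401.12345",
    "publishedDate": "1 Jan 2024",
}

# Keep the record only if it falls inside a 7-day window ending on 5 Jan 2024.
keep = within_days_back(sample, days_back=7, today=date(2024, 1, 5))
```

Passing `today` explicitly keeps the check deterministic, which makes it easier to test than calling `date.today()` inside the function.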

Use Cases

  • Stay updated on research in your field
  • Track specific research topics or authors
  • Build research paper databases
  • Monitor new publications in specific categories
  • Automated literature review workflows

Example Configuration

{
  "searchQueries": ["machine learning", "deep learning"],
  "categories": ["cs.AI", "cs.LG"],
  "maxResults": 50,
  "daysBack": 7
}
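
The same configuration can be assembled programmatically before it is submitted to the actor. The sketch below is illustrative only: the key names come from the example configuration above and the defaults from the Input Parameters list, while the function `build_run_input` and its validation rules are assumptions, not part of the actor.

```python
import json


def build_run_input(search_queries=(), categories=(), max_results=50, days_back=7):
    """Build an input object using the keys shown in the example configuration."""
    if max_results < 1:
        raise ValueError("max_results must be at least 1")
    if days_back < 0:
        raise ValueError("days_back must be 0 (no filter) or a positive number of days")
    return {
        "searchQueries": list(search_queries),
        "categories": list(categories),
        "maxResults": max_results,
        "daysBack": days_back,
    }


run_input = build_run_input(
    search_queries=["machine learning", "deep learning"],
    categories=["cs.AI", "cs.LG"],
)

# Serialize for pasting into the actor's JSON input editor.
print(json.dumps(run_input, indent=2))
```

The resulting dictionary can be pasted as JSON into the actor's input form or passed as the run input when calling the actor through an Apify API client.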

Notes

  • arXiv is an open-access repository; this actor is designed to respect its terms of use
  • Date filtering is based on the submission/announcement date
  • The actor handles both search result pages and category list pages
  • Results are capped per URL (see Max Results) to avoid excessive scraping
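
To illustrate the two page types mentioned in the notes, here is a rough Python sketch of how search queries and category codes could map to arXiv URLs, and how a custom URL could be classified. This is an assumption about the general approach, not the actor's actual routing logic: the URL patterns follow arXiv's public search and listing pages, and the helpers `search_url`, `category_url`, and `page_kind` are hypothetical.

```python
from urllib.parse import urlencode, urlparse


def search_url(query: str) -> str:
    """Build an arXiv search results URL for a keyword query."""
    return "https://arxiv.org/search/?" + urlencode({"query": query, "searchtype": "all"})


def category_url(category: str) -> str:
    """Build the recent-submissions list URL for a category code such as 'cs.AI'."""
    return f"https://arxiv.org/list/{category}/recent"


def page_kind(url: str) -> str:
    """Classify a URL as a search page, a category list page, or unknown."""
    path = urlparse(url).path
    if path.startswith("/search"):
        return "search"
    if path.startswith("/list/"):
        return "list"
    return "unknown"


urls = [search_url("machine learning"), category_url("cs.AI")]
kinds = [page_kind(u) for u in urls]
```

A classifier like `page_kind` is one way to decide which parsing routine to apply to each entry supplied through Start URLs.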

Performance

  • Speed: Fast (Cheerio-based, no browser overhead)
  • Cost: Low (datacenter proxies sufficient for arXiv)
  • Memory: 256-512 MB recommended

Built with Apify SDK and Crawlee using CheerioCrawler for efficient scraping.