Pricing

from $0.01 / 1,000 results

HuggingFaceTP

Scrapes trending research papers from HuggingFace, capturing each paper’s title, description, and URL. The scraper collects data from the listing page and visits individual paper pages for full abstracts, providing a structured dataset of the latest AI research.

Developer

amazing

Last modified: 3 months ago

🚀 Features

✅ Scrapes trending AI/ML research papers from HuggingFace
✅ Extracts paper titles, authors, abstracts, and publication dates
✅ Collects paper URLs and direct links to research papers
✅ Fast and efficient scraping with Playwright
✅ Easy to use via Apify Console
✅ Exports data in JSON, CSV, or Excel format
✅ Configurable number of papers to scrape

📊 Data Extracted

The scraper collects the following information for each paper:

Field	Description
Paper Title	Full title of the research paper
Authors	List of paper authors
Abstract	Paper abstract/summary
Publication Date	When the paper was published
Paper URL	Link to the HuggingFace paper page
ArXiv URL	Direct link to the paper on ArXiv (if available)
Upvotes	Number of upvotes on HuggingFace
Comments	Number of comments/discussions
Scraped At	Timestamp when data was collected

🛠️ How to Use

Option 1: Using Apify Console (No Coding Required)

1. Create an Apify Account
   - Go to apify.com and sign up for free
2. Import This Actor
   - Click on Actors → Create new
   - Choose this actor from the store or import via GitHub
3. Configure Input
   - Set Max Papers (default: 50)
   - Optionally adjust other settings
4. Run the Actor
   - Click the Start button
   - Wait for the scraper to complete (usually 1-3 minutes)
5. Download Results
   - Go to the Dataset tab
   - Click Export and choose your format (CSV, JSON, Excel)

Option 2: Using Apify API

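// apify-client exposes ApifyClient as a named export
const { ApifyClient } = require('apify-client');

// Authenticate with your Apify API token
const client = new ApifyClient({
    token: 'YOUR_APIFY_TOKEN',
});

const input = {
    maxPapers: 30,
};

// Wrap the calls in an async function: top-level await is not available in CommonJS
(async () => {
    // Start the actor and wait for the run to finish
    const run = await client.actor('YOUR_ACTOR_ID').call(input);
    // Fetch the scraped papers from the run's default dataset
    const { items } = await client.dataset(run.defaultDatasetId).listItems();

    console.log(items);
})();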

Option 3: Scheduled Runs

Set up automatic daily/weekly scraping:

1. Go to Schedules in Apify Console
2. Click Create new
3. Select this actor
4. Choose frequency (daily, weekly, etc.)
5. Save and activate
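
A schedule can also be described in code instead of the Console. The sketch below only builds the schedule definition (a daily run at 09:00 UTC); it makes no API call itself. The field names are based on the Apify API's schedule object and should be checked against the current Apify documentation. The schedule name and the `buildDailySchedule` helper are hypothetical and not part of this actor.

```javascript
// Hypothetical helper: builds a schedule definition that runs an actor daily.
// Field names assume the Apify API schedule object; verify before use.
function buildDailySchedule(actorId, maxPapers) {
    return {
        name: 'huggingface-papers-daily', // illustrative name
        cronExpression: '0 9 * * *',      // every day at 09:00
        timezone: 'UTC',
        isEnabled: true,
        actions: [
            {
                type: 'RUN_ACTOR',
                actorId,
                runInput: {
                    // The actor input is passed as a serialized JSON body
                    body: JSON.stringify({ maxPapers }),
                    contentType: 'application/json; charset=utf-8',
                },
            },
        ],
    };
}

const schedule = buildDailySchedule('YOUR_ACTOR_ID', 50);
console.log(schedule.cronExpression, schedule.actions[0].type);
```

With a configured `ApifyClient` instance, a definition like this could be submitted with `await client.schedules().create(schedule)`.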

⚙️ Configuration Options

Input Parameters

{
    "maxPapers": 50,
    "startUrls": [
        {
            "url": "https://huggingface.co/papers"
        }
    ],
    "proxyConfiguration": {
        "useApifyProxy": true
    }
}

Parameter	Type	Default	Description
`maxPapers`	Number	50	Maximum number of papers to scrape
`startUrls`	Array	HuggingFace Papers	URLs to start scraping from
`proxyConfiguration`	Object	Apify Proxy	Proxy settings to avoid blocking
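
When calling the actor from your own code, it can help to merge overrides into the defaults from the table above and check them before starting a run. This is a minimal client-side sketch; `buildInput` is a hypothetical helper, and the validation rules are assumptions rather than the actor's own input schema.

```javascript
// Defaults taken from the parameter table above
const DEFAULT_INPUT = {
    maxPapers: 50,
    startUrls: [{ url: 'https://huggingface.co/papers' }],
    proxyConfiguration: { useApifyProxy: true },
};

// Hypothetical helper: merges overrides into the defaults (shallow merge)
// and rejects values that are obviously invalid.
function buildInput(overrides = {}) {
    const input = { ...DEFAULT_INPUT, ...overrides };

    if (!Number.isInteger(input.maxPapers) || input.maxPapers < 1) {
        throw new RangeError('maxPapers must be a positive integer');
    }
    if (!Array.isArray(input.startUrls) || input.startUrls.length === 0) {
        throw new TypeError('startUrls must be a non-empty array');
    }
    for (const { url } of input.startUrls) {
        new URL(url); // throws a TypeError if the URL is malformed
    }
    return input;
}

console.log(buildInput({ maxPapers: 30 }));
```

The returned object can be passed directly as the `input` argument in the API example from Option 2.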

📦 Output Format

JSON Example

[
    {
        "Paper Title": "Attention Is All You Need",
        "Authors": "Vaswani et al.",
        "Abstract": "The dominant sequence transduction models...",
        "Publication Date": "2017-06-12",
        "Paper URL": "https://huggingface.co/papers/1706.03762",
        "ArXiv URL": "https://arxiv.org/abs/1706.03762",
        "Upvotes": 1250,
        "Comments": 45,
        "Scraped At": "2025-12-06T09:45:00.000Z"
    }
]

CSV Example

Paper Title,Authors,Abstract,Publication Date,Paper URL,ArXiv URL,Upvotes,Comments,Scraped At
"Attention Is All You Need","Vaswani et al.","The dominant sequence...","2017-06-12","https://huggingface.co/papers/1706.03762","https://arxiv.org/abs/1706.03762",1250,45,"2025-12-06T09:45:00.000Z"

🔧 Technical Details

Built With

Apify SDK - Actor framework
Crawlee - Web crawling and scraping library
Playwright - Headless browser automation
Cheerio - HTML parsing

Requirements

Node.js 18+
Apify account (free tier available)

📈 Use Cases

Research Tracking: Stay updated with trending AI research
Content Curation: Aggregate papers for newsletters or blogs
Academic Monitoring: Track specific research areas
Data Analysis: Analyze trends in AI/ML research
Literature Review: Collect papers for research projects

🚨 Rate Limiting & Best Practices

The scraper uses Apify proxy by default to avoid blocking
Respects HuggingFace's robots.txt
Implements reasonable delays between requests
Recommended: Run no more than once per hour
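
If you trigger runs from your own code rather than a schedule, a small guard can enforce the once-per-hour recommendation. The sketch below is one possible approach; `canRunAgain` is a hypothetical helper, and where the last-run timestamp comes from is left to the caller.

```javascript
const ONE_HOUR_MS = 60 * 60 * 1000;

// Hypothetical helper: returns true when at least `minIntervalMs` has passed
// since the previous run started, or when no previous run is known.
function canRunAgain(lastRunStartedAt, now = new Date(), minIntervalMs = ONE_HOUR_MS) {
    if (!lastRunStartedAt) return true; // no previous run recorded
    const elapsed = now.getTime() - new Date(lastRunStartedAt).getTime();
    return elapsed >= minIntervalMs;
}

// Example timestamps, 30 minutes apart
const lastRun = '2025-12-06T09:00:00.000Z';
console.log(canRunAgain(lastRun, new Date('2025-12-06T09:30:00.000Z')));
```

One possible source for the timestamp is the `startedAt` field of the run returned by `client.actor('YOUR_ACTOR_ID').lastRun().get()`; treat that as an assumption and confirm it against the apify-client reference.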

🐛 Troubleshooting

No Data Scraped

Check if HuggingFace changed their page structure
Verify proxy settings are enabled
Increase wait time in settings

Partial Data

Some papers may not have all fields available
The scraper handles missing data gracefully
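
Downstream code should still expect gaps. This README does not specify how a missing field appears in the output (absent, `null`, or an empty string), so the sketch below assumes any of the three and maps them all to `null`. `normalizePaper` is a hypothetical helper, not part of the actor.

```javascript
// Field names as listed in the "Data Extracted" table
const FIELDS = [
    'Paper Title', 'Authors', 'Abstract', 'Publication Date',
    'Paper URL', 'ArXiv URL', 'Upvotes', 'Comments', 'Scraped At',
];

// Hypothetical helper: returns a record containing every expected field,
// with null wherever the scraped item has no usable value.
function normalizePaper(item) {
    const normalized = {};
    for (const field of FIELDS) {
        const value = item[field];
        normalized[field] = value === undefined || value === '' ? null : value;
    }
    return normalized;
}

const partial = { 'Paper Title': 'Attention Is All You Need', 'ArXiv URL': '' };
console.log(normalizePaper(partial));
```

Normalizing first keeps later steps, such as CSV export or database inserts, from having to handle each representation separately.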

Actor Fails

Check the logs in the Run tab
Ensure you have sufficient Apify credits
Try reducing maxPapers value

📝 Example Use Case: Daily AI Research Digest

1. Schedule the actor to run daily at 9 AM
2. Connect to Zapier/Make to send results to:
   - Notion database
   - Google Sheets
   - Slack channel
   - Email digest
3. Filter papers by keywords in your own processing pipeline
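
The keyword filtering step could look like the sketch below, which matches keywords against each paper's title and abstract and formats the matches as plain text, sorted by upvotes. Both helper names are hypothetical, and the field names assume the output format shown in the JSON example.

```javascript
// Hypothetical helper: keeps papers whose title or abstract mentions any keyword
// (case-insensitive substring match).
function filterByKeywords(papers, keywords) {
    const needles = keywords.map((k) => k.toLowerCase());
    return papers.filter((paper) => {
        const text = `${paper['Paper Title'] ?? ''} ${paper['Abstract'] ?? ''}`.toLowerCase();
        return needles.some((needle) => text.includes(needle));
    });
}

// Hypothetical helper: formats papers as a numbered list, most upvoted first.
function formatDigest(papers) {
    return papers
        .slice() // copy so the caller's array is not reordered
        .sort((a, b) => (b['Upvotes'] ?? 0) - (a['Upvotes'] ?? 0))
        .map((p, i) => `${i + 1}. ${p['Paper Title']} (${p['Upvotes'] ?? 0} upvotes)\n   ${p['Paper URL']}`)
        .join('\n');
}

// Illustrative sample data, not real scraper output
const papers = [
    { 'Paper Title': 'A Study of Diffusion Models', 'Abstract': 'Image generation...', 'Upvotes': 12, 'Paper URL': 'https://huggingface.co/papers/0000.00001' },
    { 'Paper Title': 'Efficient Transformers', 'Abstract': 'Attention variants...', 'Upvotes': 40, 'Paper URL': 'https://huggingface.co/papers/0000.00002' },
];

console.log(formatDigest(filterByKeywords(papers, ['attention', 'diffusion'])));
```

The resulting text can be sent as the message body to Slack or email through the Zapier/Make connection.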

🤝 Contributing

Found a bug or want to suggest improvements?

Open an issue in the repository
Submit a pull request
Contact support via Apify Console

📄 License

This actor is provided as-is under the MIT License.

💡 Tips

Combine with other scrapers: Use alongside arXiv or Google Scholar scrapers for comprehensive coverage
Set up alerts: Use Apify webhooks to get notified when new papers are found
Custom filtering: Process the output with your own scripts to filter by topics/authors
Data enrichment: Combine with citation APIs to get paper impact metrics

Note: This scraper is for educational and research purposes. Always respect website terms of service and rate limits. Use responsibly! 🎓

Last Updated: December 2025
