Arxiv Keyword Spider

Pricing

from $9.00 / 1,000 results

Arxiv Keyword Spider efficiently scrapes arXiv.org for research papers using keywords, delivering comprehensive metadata like titles, authors, abstracts, and categories. Perfect for academic research, market analysis, and trend monitoring....

Rating: 0.0 (0)

Developer: GetDataForMe

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 11 hours ago

Introduction

The Arxiv Keyword Spider is a powerful Apify Actor designed to scrape and extract research papers from arXiv.org based on user-defined keywords. It provides comprehensive metadata for each paper, including titles, authors, abstracts, and categorization, enabling efficient data collection for academic research, market analysis, and trend monitoring. This tool streamlines the process of gathering insights from the vast arXiv repository, saving time and effort for users seeking targeted scientific information.

Features

  • Keyword-Based Search: Perform precise queries on arXiv to retrieve relevant papers matching specific terms or topics.
  • Comprehensive Metadata Extraction: Captures essential details such as paper IDs, URLs, titles, authors, abstracts, and categories for thorough analysis.
  • High Reliability: Built on robust scraping technology to ensure accurate and consistent data retrieval from arXiv's dynamic content.
  • Scalable Performance: Handles large volumes of results efficiently, with options for pagination and filtering to manage output size.
  • Structured Output: Delivers data in clean JSON format, ready for integration into databases, analytics tools, or downstream processing.
  • Error Handling: Includes built-in mechanisms to manage rate limits, network issues, and incomplete data gracefully.
  • No Coding Required: User-friendly interface on Apify Store for easy configuration and execution without technical expertise.

Input Parameters

| Parameter | Type | Required | Description | Example |
| --- | --- | --- | --- | --- |
| Query | string | No | The keyword or phrase to search for in arXiv papers. Defaults to a basic term if not specified. | "machine learning" |

Example Usage

To run the Arxiv Keyword Spider, configure the input parameters in the Apify console or via API. Here's an example input JSON:

{
  "Query": "artificial intelligence"
}

This will search for papers related to "artificial intelligence". The output will be a JSON array of objects, each representing a paper. Example output:

[
  {
    "arxiv_id": "2604.06123",
    "abstract_url": "https://arxiv.org/abs/2604.06123",
    "pdf_url": "https://arxiv.org/pdf/2604.06123",
    "title": "A Large-Scale Empirical Comparison of Meta-Learners and Causal Forests for Heterogeneous Treatment Effect Estimation in Marketing Uplift Modeling",
    "authors": [
      "Aman Singh"
    ],
    "abstract": "Estimating Conditional Average Treatment Effects (CATE) at the individual level is central to precision marketing, yet systematic benchmarking of uplift modeling methods at industrial scale remains limited. We present UpliftBench, an empirical evaluation of four CATE estimators: S-Learner, T-Learner, X-Learner (all with LightGBM base learners), and Causal Forest (EconML), applied to the Criteo Uplift v2.1 dataset comprising 13.98 million customer records. The near-random treatment assignment (propensity AUC = 0.509) provides strong internal validity for causal estimation. Evaluated via Qini coefficient and cumulative gain curves, the S-Learner achieves the highest Qini score of 0.376, with the top 20% of customers ranked by predicted CATE capturing 77.7% of all incremental conversions, a 3.9x improvement over random targeting. SHAP analysis identifies f8 as the dominant heterogeneous treatment effect (HTE) driver among the 12 anonymized covariates. Causal Forest uncertainty quantification reveals that 1.9% of customers are confident persuadables (lower 95% CI > 0) and 0.1% are confident sleeping",
    "primary_category": "stat.CO",
    "all_categories": [
      "stat.CO",
      "cs.LG",
      "econ.EM",
      "stat.ME"
    ],
    "actor_id": "nzczPpnpwdctoDoPa",
    "run_id": "Vuycmdfs1pUgS6tsg"
  },
  {
    "arxiv_id": "2604.01708",
    "abstract_url": "https://arxiv.org/abs/2604.01708",
    "pdf_url": "https://arxiv.org/pdf/2604.01708",
    "title": "OpenGo: An OpenClaw-Based Robotic",
    "authors": [
      "Hanbing Li",
      "Xuewei Cao",
      "Zhiwen Zeng",
      "Yuhan Wu",
      "Yanyong Zhang",
      "Yan Xia"
    ],
    "abstract": "Adaptation to complex tasks and multiple scenarios remains a significant challenge for a single robot agent. The ability to acquire organize, and switch between a wide range of skills in real time, particularly in dynamic environments, has become a fundamental requirement for embodied intelligence. We introduce OpenGo, an OpenClaw-powered embodied robotic",
    "primary_category": "cs.RO",
    "all_categories": [
      "cs.RO",
      "cs.AI"
    ],
    "actor_id": "nzczPpnpwdctoDoPa",
    "run_id": "Vuycmdfs1pUgS6tsg"
  },
  {
    "arxiv_id": "2603.29271",
    "abstract_url": "https://arxiv.org/abs/2603.29271",
    "pdf_url": "https://arxiv.org/pdf/2603.29271",
    "title": "ConInfer: Context-Aware Inference for Training-Free Open-Vocabulary Remote Sensing Segmentation",
    "authors": [
      "Wenyang Chen",
      "Zhanxuan Hu",
      "Yaping Zhang",
      "Hailong Ning",
      "Yonghang Tai"
    ],
    "abstract": "Training-free open-vocabulary remote sensing segmentation (OVRSS), empowered by vision-language models, has emerged as a promising paradigm for achieving category-agnostic semantic understanding in remote sensing imagery. Existing approaches mainly focus on enhancing feature representations or mitigating modality discrepancies to improve patch-level prediction accuracy. However, such independent prediction schemes are fundamentally misaligned with the intrinsic characteristics of remote sensing data. In real-world applications, remote sensing scenes are typically large-scale and exhibit strong spatial as well as semantic correlations, making isolated patch-wise predictions insufficient for accurate segmentation. To address this limitation, we propose ConInfer, a context-aware inference framework for OVRSS that performs joint prediction across multiple spatial units while explicitly modeling their inter-unit semantic dependencies. By incorporating global contextual cues, our method significantly enhances segmentation consistency, robustness, and generalization in complex remote sensing environments. Extensive experiments on multiple benchmark datasets demonstrate that our approach consistently surpasses state-of-the-art per-pixel VLM-based baselines such as SegEarth-OV, achieving average improvements of 2.80% and 6.13% on open-vocabulary semantic segmentation and object extraction tasks, respectively. The implementation code is available at: https://github.com/",
    "primary_category": "cs.CV",
    "all_categories": [
      "cs.CV"
    ],
    "actor_id": "nzczPpnpwdctoDoPa",
    "run_id": "Vuycmdfs1pUgS6tsg"
  }
]
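
For readers who prefer the API route mentioned above, the following is a minimal sketch of how the example input might be submitted with Python's standard library. It assumes the Apify API's `run-sync-get-dataset-items` endpoint and token-in-query authentication; the helper names `build_run_request` and `fetch_papers` are hypothetical, and using the `actor_id` value from the example output as the Actor identifier is an assumption. Check the Actor's API tab for the exact identifier before relying on it.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that runs the Actor
    synchronously and asks for its dataset items in the response."""
    path = f"/acts/{urllib.parse.quote(actor_id, safe='~')}/run-sync-get-dataset-items"
    params = urllib.parse.urlencode({"token": token})
    # The request body is the Actor input shown in the example above.
    body = json.dumps({"Query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}{path}?{params}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_papers(actor_id: str, token: str, query: str, timeout: float = 300.0) -> list:
    """Send the request and decode the JSON array of paper records."""
    request = build_run_request(actor_id, token, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_papers("<actor id>", "<your API token>", "artificial intelligence")` would then be expected to return a list shaped like the example output, provided the assumptions above hold.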

Use Cases

  • Academic Research: Quickly gather papers on emerging topics like AI or quantum computing for literature reviews.
  • Market Analysis: Monitor trends in fields such as machine learning or data science to inform business strategies.
  • Competitive Intelligence: Track publications from specific authors or institutions for industry insights.
  • Content Aggregation: Build datasets of abstracts and metadata for blogs, newsletters, or educational platforms.
  • Trend Monitoring: Identify popular categories and keywords in scientific discourse for forecasting.
  • Business Automation: Automate data collection for reports on technological advancements in sectors like robotics or statistics.
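
As one illustration of the trend-monitoring use case, the sketch below tallies how many papers fall under each top-level arXiv archive (the part of `primary_category` before the dot). The function name `count_archives` is hypothetical, and the sample records are trimmed copies of the example output, not new data.

```python
from collections import Counter


def count_archives(papers: list) -> Counter:
    """Count papers per top-level archive, e.g. 'cs' for 'cs.CV'."""
    counts = Counter()
    for paper in papers:
        archive = paper.get("primary_category", "").partition(".")[0]
        if archive:  # skip records without a primary category
            counts[archive] += 1
    return counts


# Trimmed records taken from the example output.
sample = [
    {"arxiv_id": "2604.06123", "primary_category": "stat.CO"},
    {"arxiv_id": "2604.01708", "primary_category": "cs.RO"},
    {"arxiv_id": "2603.29271", "primary_category": "cs.CV"},
]

print(count_archives(sample).most_common())  # [('cs', 2), ('stat', 1)]
```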

Installation and Usage

  1. Search for "Arxiv Keyword Spider" in the Apify Store.
  2. Click "Try for free" or "Run".
  3. Configure input parameters (e.g., set your query keyword).
  4. Click "Start" to begin extraction.
  5. Monitor progress in the log.
  6. Export results in your preferred format (JSON, CSV, Excel).
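
If you export JSON in step 6 and want to flatten it locally, something like the following sketch may help. It joins the `authors` and `all_categories` arrays into single cells; the column selection, the `"; "` separator, and the function name `to_csv` are illustrative choices, not part of the Actor.

```python
import csv
import io

# Columns kept in the flattened file; other fields (e.g. run_id) are dropped.
FIELDS = [
    "arxiv_id", "title", "authors", "primary_category",
    "all_categories", "abstract_url", "pdf_url",
]


def to_csv(papers: list) -> str:
    """Return the records as CSV text with list fields joined by '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for paper in papers:
        row = dict(paper)
        row["authors"] = "; ".join(paper.get("authors", []))
        row["all_categories"] = "; ".join(paper.get("all_categories", []))
        writer.writerow(row)
    return buffer.getvalue()
```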

Output Format

The output is a JSON array of objects, each containing metadata for a single arXiv paper. Key fields include:

  • arxiv_id: Unique identifier for the paper.
  • abstract_url and pdf_url: Direct links to the abstract and PDF.
  • title: Full title of the paper.
  • authors: Array of author names.
  • abstract: Summary text of the paper.
  • primary_category and all_categories: arXiv classification codes.
  • actor_id and run_id: Identifiers for the Apify run.

This structured format ensures easy parsing and integration.
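
A minimal sketch of parsing one record into a typed object is shown below, assuming every record carries the fields listed above. The `Paper` class and `parse_paper` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Paper:
    arxiv_id: str
    title: str
    authors: tuple[str, ...]
    abstract: str
    primary_category: str
    all_categories: tuple[str, ...]
    abstract_url: str
    pdf_url: str


REQUIRED = (
    "arxiv_id", "title", "authors", "abstract",
    "primary_category", "all_categories", "abstract_url", "pdf_url",
)


def parse_paper(record: dict[str, Any]) -> Paper:
    """Convert one output object into a Paper, failing loudly on missing fields."""
    missing = [key for key in REQUIRED if key not in record]
    if missing:
        raise ValueError(f"record is missing fields: {', '.join(missing)}")
    values = {key: record[key] for key in REQUIRED}
    # Store list fields as tuples so the frozen dataclass stays hashable.
    values["authors"] = tuple(record["authors"])
    values["all_categories"] = tuple(record["all_categories"])
    return Paper(**values)
```

The `actor_id` and `run_id` fields are ignored here because they describe the Apify run, not the paper.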

Error Handling

The Actor includes robust error handling for common issues like network timeouts, invalid queries, or arXiv site changes. If errors occur, check the run logs for details. For persistent problems, retry with adjusted parameters or contact support.

Rate Limiting and Best Practices

arXiv may impose rate limits, and the Actor is designed to respect them to avoid bans. Best practices: use specific queries to limit results, run during off-peak hours, and export data incrementally. Avoid broad keywords that return very large result sets.
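
When several specific queries are run back to back, one way to follow these practices is to pause between runs and handle each result set as it arrives. The sketch below assumes a caller-supplied `run_one` function that executes a single query (for example through the Apify console export or API); the `run_queries` name and the 5-second default pause are arbitrary illustrations, not limits published by arXiv or Apify.

```python
import time
from typing import Callable, Iterable, Iterator


def run_queries(
    queries: Iterable[str],
    run_one: Callable[[str], list],
    pause_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[tuple[str, list]]:
    """Run queries one at a time, pausing between them, and yield each
    result set as soon as it is available so it can be saved incrementally."""
    for index, query in enumerate(queries):
        if index > 0:
            sleep(pause_seconds)  # space out consecutive runs
        yield query, run_one(query)
```

Because the function is a generator, each `(query, results)` pair can be written to disk before the next run starts.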

Limitations and Considerations

Results depend on arXiv's search capabilities, so vague queries may miss relevant papers. Abstracts, and occasionally titles, can be truncated in the output, as the example output shows. Ensure compliance with arXiv's terms of use. For large datasets, consider pagination to manage memory.
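
Since abstracts can be truncated, a downstream pipeline may want to flag records that look cut off. The heuristic below is only a rough sketch: it treats text that does not end in closing punctuation as possibly truncated, which will misclassify some complete abstracts. The name `looks_truncated` is hypothetical.

```python
# Characters that plausibly end a complete abstract.
CLOSING_CHARS = (".", "!", "?", '"', ")")


def looks_truncated(text: str) -> bool:
    """Return True when non-empty text does not end in closing punctuation."""
    stripped = text.rstrip()
    return bool(stripped) and not stripped.endswith(CLOSING_CHARS)
```

Applied to the example output, all three abstracts would be flagged, since they end in "confident sleeping", "embodied robotic", and "https://github.com/".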

Support

For custom/simplified outputs or bug reports, please contact:

We're here to help you get the most out of this Actor!