# PyPI Scraper (`crawlerbros/pypi-scraper`) Actor

Scrape Python package metadata from PyPI: exact-name lookup, newly-added packages, and recently-updated packages. Pulls version, license, classifiers, dependencies, project URLs, and maintainer info.

- **URL**: https://apify.com/crawlerbros/pypi-scraper.md
- **Developed by:** [Crawler Bros](https://apify.com/crawlerbros) (community)
- **Categories:** Developer tools, Automation, Other
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 10 bookmarks
- **User rating**: 5.00 out of 5 stars

## Pricing

from $1.00 / 1,000 results

This Actor is paid per event and usage. You are charged both the fixed price for specific events and for Apify platform usage.
Since this Actor supports Apify Store discounts, the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## PyPI Scraper

Scrape Python package metadata from PyPI (the Python Package Index): exact-name lookup, newly-added packages, and recently-updated packages. Pulls version, license, classifiers, dependencies, project URLs, author/maintainer info, and latest artifact details. HTTP-only, via PyPI's public JSON and RSS endpoints. No auth, no proxy.

### What this Actor does

- **Modes:** `lookup` (exact package names), `newest` (RSS feed of newly-added packages), `updates` (RSS feed of recently-updated package versions); the input schema also accepts `byVersion` (pinned `name==version` entries from `packageVersions`)
- **Rich metadata:** name, latest version, summary, full description (markdown), license, classifiers, keywords, requires_python, project_urls (Documentation / Source / Issues / etc.)
- **Artifacts of latest release:** filename, package type (wheel / sdist), python version, URL, size, upload time
- **Filters:** classifier any-of, license substring, minimum supported Python version
- **Optional:** `includeReleases` (full version history), `includeUrls` (project_urls map)
- **Empty fields are omitted** — no nulls / blank strings reach the dataset
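The "empty fields are omitted" behaviour can be pictured with a small helper. This is only a sketch of the idea, not the Actor's code: `drop_empty` is a hypothetical name, and dropping empty lists and empty nested objects (in addition to nulls and blank strings) is an assumption.

```python
from typing import Any, Dict


def drop_empty(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `record` without None, blank-string, or empty-container values."""
    cleaned: Dict[str, Any] = {}
    for key, value in record.items():
        # Clean nested objects (e.g. author / maintainer) first.
        if isinstance(value, dict):
            value = drop_empty(value)
        if value is None:
            continue
        if isinstance(value, str) and not value.strip():
            continue
        # Assumption: empty lists and emptied nested objects are dropped too.
        if isinstance(value, (list, dict)) and not value:
            continue
        cleaned[key] = value
    return cleaned


raw = {
    "name": "requests",
    "summary": "",
    "maintainer": {"name": None, "email": "  "},
    "keywords": ["http"],
    "downloadUrl": None,
}
print(drop_empty(raw))
```

A consumer of the dataset should therefore read optional fields with `item.get("maintainer")` rather than indexing them directly.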

### Output per package

- `name`, `latestVersion`, `summary`, `description`, `descriptionContentType`
- `license` — prefers `license_expression` (SPDX) then falls back to `license`
- `homePage`, `downloadUrl`, `docsUrl`, `bugTrackUrl`
- `requiresPython` (e.g. `>=3.8,<4`)
- `keywords[]` — auto-detects comma vs space separator
- `classifiers[]` — full list of PyPI classifiers
- `author` — `{name, email}`
- `maintainer` — `{name, email}` (when present)
- `projectUrls` — `{Documentation, Source, Issues, ...}` map (when `includeUrls=true`)
- `requiresDist[]` — runtime dependency specifiers
- `latestArtifacts[]` — `[{filename, packageType, pythonVersion, url, size, uploadTime}, ...]`
- `versions[]` — sorted (reverse) list of release versions when `includeReleases=true`
- `vulnerabilityCount` — number of known vulnerabilities (PyPI's reported list)
- `pypiUrl`, `pypiJsonUrl`
- `recordType: "package"`, `scrapedAt`
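The separator auto-detection mentioned for `keywords[]` could work roughly as follows. This is a minimal sketch under the assumption that a comma anywhere in the raw string means "comma-separated" and that anything else is split on whitespace; `split_keywords` is a hypothetical helper name.

```python
from typing import List, Optional


def split_keywords(raw: Optional[str]) -> List[str]:
    """Split a raw `keywords` metadata string into a clean list."""
    if not raw:
        return []
    # Assumption: prefer commas when present, otherwise split on any whitespace.
    separator = "," if "," in raw else None
    return [part.strip() for part in raw.split(separator) if part.strip()]


print(split_keywords("http, client ,rest"))
print(split_keywords("http client  rest"))
```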

### Input

| Field | Type | Default | Description |
|---|---|---|---|
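| `mode` | string | `newest` | `lookup` / `newest` / `updates` / `byVersion` |
| `packageNames` | array | `[]` | Required for `mode=lookup` (e.g. `["requests", "numpy"]`) |
| `packageVersions` | array | `[]` | `name==version` strings for `mode=byVersion` (e.g. `["requests==2.31.0"]`) |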
| `classifierAnyOf` | array | `[]` | Only emit packages with at least one of these classifiers |
| `licenseContains` | string | – | Only emit packages whose license contains this substring (case-insensitive) |
| `minPythonVersion` | string | – | e.g. `3.10` — only packages whose `requires_python` allows this version |
| `includeReleases` | bool | `false` | Emit full version history |
| `includeUrls` | bool | `true` | Emit `project_urls` map |
| `maxItems` | int | `50` | Hard cap (1–1000) |

#### Example: lookup specific packages

```json
{
  "mode": "lookup",
  "packageNames": ["requests", "numpy", "fastapi"]
}
````

#### Example: newest packages on PyPI

```json
{
  "mode": "newest",
  "maxItems": 30
}
```

#### Example: recently updated packages, MIT-licensed only

```json
{
  "mode": "updates",
  "licenseContains": "MIT",
  "minPythonVersion": "3.10",
  "maxItems": 50
}
```

#### Example: tracking SQL ORM packages

```json
{
  "mode": "lookup",
  "packageNames": ["sqlalchemy", "tortoise-orm", "peewee", "pony"],
  "classifierAnyOf": ["Topic :: Database"],
  "includeReleases": true
}
```
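Before submitting an input like the ones above, it can be checked client-side against the documented constraints. The sketch below is an illustration only: `validate_input` is a hypothetical helper, the mode names are taken from the input table plus the `byVersion` value listed in the OpenAPI enum further down, and the Actor's own validation may differ.

```python
from typing import Any, Dict, List

# Mode names as documented in this README and its OpenAPI input schema.
VALID_MODES = {"lookup", "newest", "updates", "byVersion"}


def validate_input(actor_input: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the input looks fine."""
    problems: List[str] = []

    mode = actor_input.get("mode", "newest")
    if mode not in VALID_MODES:
        problems.append(f"unknown mode: {mode!r}")

    if mode == "lookup" and not actor_input.get("packageNames"):
        problems.append("packageNames is required when mode is 'lookup'")

    max_items = actor_input.get("maxItems", 50)
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_int = isinstance(max_items, int) and not isinstance(max_items, bool)
    if not is_int or not 1 <= max_items <= 1000:
        problems.append("maxItems must be an integer between 1 and 1000")

    return problems


print(validate_input({"mode": "lookup", "packageNames": ["requests"]}))
print(validate_input({"mode": "newest", "maxItems": 5000}))
```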

### Use cases

- **Open-source intelligence** — track adoption / version cadence of Python packages
- **Security teams** — track maintainer churn, monitor `vulnerabilityCount`, audit licenses
- **DevRel & growth** — find similar / competing packages, monitor share of voice
- **Compliance** — bulk-fetch SPDX license expressions across an entire dependency tree
- **Package discovery** — find newly-published packages in your domain
- **Release monitoring** — wire up the `updates` feed to alert on new releases of watched packages

### FAQ

**Why no search mode?** PyPI does not offer a programmatic search endpoint that returns structured JSON, and its legacy XML-RPC `search` method has been disabled since December 2020. Use `lookup` for known names, `newest` / `updates` for new-package discovery, or narrow the RSS feeds with the classifier / license / Python-version filters.

**Why are the RSS modes fast?** Each RSS feed call returns up to 40 items in one request. The Actor then fetches each package's `pypi.org/pypi/<name>/json` to enrich it, so `newest` mode costs 1 RSS call + N package calls.
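The "1 RSS call + N package calls" pattern can be sketched offline with a hand-written feed. Everything here is illustrative: the sample XML only approximates the feed layout (the item shape, with a `/project/<name>/` link, is an assumption), the package names are placeholders, and `package_names_from_feed` / `enrichment_urls` are hypothetical helpers that build URLs without sending any request.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlparse

# Hand-written stand-in for one RSS response; names are placeholders.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>PyPI newest packages</title>
    <item><title>example-a added to PyPI</title><link>https://pypi.org/project/example-a/</link></item>
    <item><title>example-b added to PyPI</title><link>https://pypi.org/project/example-b/</link></item>
  </channel>
</rss>"""


def package_names_from_feed(feed_xml: str) -> List[str]:
    """Extract unique package names, in feed order, from item links."""
    root = ET.fromstring(feed_xml)
    names: List[str] = []
    for link in root.iterfind("./channel/item/link"):
        parts = [p for p in urlparse(link.text or "").path.split("/") if p]
        # Assumed link shape: /project/<name>/[<version>/]
        if len(parts) >= 2 and parts[0] == "project" and parts[1] not in names:
            names.append(parts[1])
    return names


def enrichment_urls(names: List[str], max_items: int) -> List[str]:
    """One JSON URL per package, capped at max_items."""
    return [f"https://pypi.org/pypi/{name}/json" for name in names[:max_items]]


names = package_names_from_feed(SAMPLE_FEED)
urls = enrichment_urls(names, max_items=50)
print(f"1 RSS call + {len(urls)} package calls")
```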

**What's `license` vs `license_expression`?** `license` is free-form (often `MIT License`, `Apache 2.0`). `license_expression` is the SPDX identifier (e.g. `Apache-2.0`). The Actor prefers `license_expression` if present.
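The preference order, together with the case-insensitive `licenseContains` filter from the input table, might look like the following. It is a sketch, not the Actor's source: `pick_license` and `license_matches` are hypothetical names, and treating a missing license as "no match" when a filter is set is an assumption.

```python
from typing import Any, Dict, Optional


def pick_license(info: Dict[str, Any]) -> Optional[str]:
    """Prefer the SPDX `license_expression`, fall back to free-form `license`."""
    value = info.get("license_expression") or info.get("license") or ""
    return value.strip() or None


def license_matches(license_text: Optional[str], needle: Optional[str]) -> bool:
    """Case-insensitive substring check; no filter means everything passes."""
    if not needle:
        return True
    # Assumption: packages without any license fail an active filter.
    return bool(license_text) and needle.lower() in license_text.lower()


info = {"license_expression": "Apache-2.0", "license": "Apache 2.0"}
print(pick_license(info))
print(license_matches("MIT License", "mit"))
```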

**How does `minPythonVersion` work?** It parses `requires_python` (e.g. `>=3.8,<4`), extracts the lowest required version, and checks if your threshold is `>=` that. So `minPythonVersion: "3.10"` keeps packages that support Python 3.10 (i.e. `requires_python` lower bound ≤ 3.10).
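The lower-bound check described above can be approximated in a few lines. This sketch makes several assumptions: only `>=` clauses count as lower bounds, upper bounds such as `<4` are ignored, a package with no `requires_python` passes, and the helper names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the version in clauses such as ">=3.8" or ">= 3.10.1".
_LOWER_BOUND = re.compile(r">=\s*(\d+(?:\.\d+)*)")


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '3.10' into (3, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def lower_bound(requires_python: Optional[str]) -> Optional[Tuple[int, ...]]:
    """Return the strictest `>=` bound, or None when there is none."""
    if not requires_python:
        return None
    bounds = [parse_version(v) for v in _LOWER_BOUND.findall(requires_python)]
    return max(bounds) if bounds else None


def allows_python(requires_python: Optional[str], min_python_version: str) -> bool:
    """True when the threshold is at or above the package's lower bound."""
    bound = lower_bound(requires_python)
    return bound is None or parse_version(min_python_version) >= bound


print(allows_python(">=3.8,<4", "3.10"))
print(allows_python(">=3.11", "3.10"))
```

Comparing integer tuples rather than strings matters here: as plain strings, `"3.10"` sorts before `"3.8"`.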

**What does `vulnerabilityCount` track?** PyPI exposes a `vulnerabilities` array on each package's JSON payload (sourced from OSV.dev). We count entries — a non-zero count is a signal to dig deeper.
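Counting the entries is a one-liner once a package's JSON payload is in hand. In this sketch the payload is a hand-written stand-in with placeholder identifiers, and reading an `id` key from each entry is an assumption about the payload shape.

```python
from typing import Any, Dict, List


def vulnerability_count(payload: Dict[str, Any]) -> int:
    """Number of entries in the payload's `vulnerabilities` array (0 when absent)."""
    return len(payload.get("vulnerabilities") or [])


def vulnerability_ids(payload: Dict[str, Any]) -> List[str]:
    """Collect entry ids for follow-up; assumes each entry carries an `id` key."""
    entries = payload.get("vulnerabilities") or []
    return [e["id"] for e in entries if isinstance(e, dict) and e.get("id")]


# Placeholder payload, not real advisory data.
payload = {
    "info": {"name": "example-package"},
    "vulnerabilities": [{"id": "PLACEHOLDER-1"}, {"id": "PLACEHOLDER-2"}],
}
print(vulnerability_count(payload), vulnerability_ids(payload))
```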

**Are dependencies fully resolved?** No — `requiresDist` returns the raw specifiers (e.g. `"urllib3<3,>=1.21.1"`). For full resolution, feed the package into `pip-compile` or `uv lock` downstream.
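For light post-processing of `requiresDist` (for example, listing direct dependency names), the raw specifiers can be split without resolving them. The helper below is a simplified sketch with a hypothetical name and does not cover the full dependency-specifier grammar (URL requirements, for instance); the `packaging` library's `Requirement` class is the more robust choice when it is available.

```python
import re
from typing import Any, Dict

# name, optional [extras], then whatever version constraints follow
_REQUIREMENT = re.compile(
    r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(?:\[([^\]]*)\])?\s*(.*)$"
)


def split_requirement(raw: str) -> Dict[str, Any]:
    """Split one raw specifier into name, extras, version specifier, and marker."""
    requirement, _, marker = raw.partition(";")
    match = _REQUIREMENT.match(requirement)
    if match is None:
        raise ValueError(f"unrecognised requirement: {raw!r}")
    name, extras, specifier = match.groups()
    return {
        "name": name,
        "extras": [e.strip() for e in extras.split(",")] if extras else [],
        # Older metadata sometimes wraps constraints in parentheses.
        "specifier": specifier.strip().strip("()").strip(),
        "marker": marker.strip() or None,
    }


print(split_requirement("urllib3<3,>=1.21.1"))
print(split_requirement('PySocks!=1.5.7,>=1.5.6; extra == "socks"'))
```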

**How fresh is the data?** Real-time. PyPI's JSON is served from the same backend that serves the website; RSS feeds update every few minutes.

# Actor input Schema

## `mode` (type: `string`):

`lookup` fetches metadata for exact package names. `newest` lists the most recently added PyPI packages. `updates` lists the most recently updated package versions. `byVersion` fetches metadata for the exact `name==version` entries in `packageVersions`.

## `packageNames` (type: `array`):

Exact PyPI package names to fetch.

## `packageVersions` (type: `array`):

List of `name==version` strings (e.g. `requests==2.31.0`, `numpy==1.26.0`). Returns metadata frozen at that release.

## `classifierAnyOf` (type: `array`):

Only emit packages whose `classifiers` include at least one of these (e.g. `Programming Language :: Python :: 3.12`, `License :: OSI Approved :: MIT License`).

## `licenseContains` (type: `string`):

Only emit packages whose `license` field contains this substring (case-insensitive). Example: `MIT`, `Apache`.

## `minPythonVersion` (type: `string`):

Only emit packages whose `requires_python` allows this Python version (e.g. `3.10`).

## `includeReleases` (type: `boolean`):

Emit the full sorted list of released versions.

## `includeUrls` (type: `boolean`):

Emit the `project_urls` map (Documentation, Source, Bug Tracker, etc.).

## `maxItems` (type: `integer`):

Hard cap on emitted records.

## Actor input object example

```json
{
  "mode": "newest",
  "packageNames": [
    "requests",
    "numpy",
    "pandas"
  ],
  "packageVersions": [
    "requests==2.31.0"
  ],
  "classifierAnyOf": [],
  "includeReleases": false,
  "includeUrls": true,
  "maxItems": 50
}
```

# Actor output Schema

## `packages` (type: `string`):

Dataset containing all scraped PyPI packages.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "mode": "newest",
    "packageNames": [
        "requests",
        "numpy",
        "pandas"
    ],
    "packageVersions": [
        "requests==2.31.0"
    ],
    "classifierAnyOf": [],
    "includeReleases": false,
    "includeUrls": true,
    "maxItems": 50
};

// Run the Actor and wait for it to finish
const run = await client.actor("crawlerbros/pypi-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "mode": "newest",
    "packageNames": [
        "requests",
        "numpy",
        "pandas",
    ],
    "packageVersions": ["requests==2.31.0"],
    "classifierAnyOf": [],
    "includeReleases": False,
    "includeUrls": True,
    "maxItems": 50,
}

# Run the Actor and wait for it to finish
run = client.actor("crawlerbros/pypi-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "mode": "newest",
  "packageNames": [
    "requests",
    "numpy",
    "pandas"
  ],
  "packageVersions": [
    "requests==2.31.0"
  ],
  "classifierAnyOf": [],
  "includeReleases": false,
  "includeUrls": true,
  "maxItems": 50
}' |
apify call crawlerbros/pypi-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=crawlerbros/pypi-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "PyPI Scraper",
        "description": "Scrape Python package metadata from PyPI: exact-name lookup, newly-added packages, and recently-updated packages. Pulls version, license, classifiers, dependencies, project URLs, and maintainer info.",
        "version": "1.0",
        "x-build-id": "aDXJZ0tDeelhI0HRA"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/crawlerbros~pypi-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-crawlerbros-pypi-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/crawlerbros~pypi-scraper/runs": {
            "post": {
                "operationId": "runs-sync-crawlerbros-pypi-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/crawlerbros~pypi-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-crawlerbros-pypi-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "mode"
                ],
                "properties": {
                    "mode": {
                        "title": "Mode",
                        "enum": [
                            "lookup",
                            "newest",
                            "updates",
                            "byVersion"
                        ],
                        "type": "string",
                        "description": "`lookup` fetches metadata for exact package names. `newest` lists the most recently added PyPI packages. `updates` lists the most recently updated package versions.",
                        "default": "newest"
                    },
                    "packageNames": {
                        "title": "Package names (mode=lookup)",
                        "type": "array",
                        "description": "Exact PyPI package names to fetch.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "packageVersions": {
                        "title": "Package + version (mode=byVersion)",
                        "type": "array",
                        "description": "List of `name==version` strings (e.g. `requests==2.31.0`, `numpy==1.26.0`). Returns metadata frozen at that release.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "classifierAnyOf": {
                        "title": "Classifier filter",
                        "type": "array",
                        "description": "Only emit packages whose `classifiers` include at least one of these (e.g. `Programming Language :: Python :: 3.12`, `License :: OSI Approved :: MIT License`).",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "licenseContains": {
                        "title": "License contains",
                        "type": "string",
                        "description": "Only emit packages whose `license` field contains this substring (case-insensitive). Example: `MIT`, `Apache`."
                    },
                    "minPythonVersion": {
                        "title": "Minimum Python version",
                        "type": "string",
                        "description": "Only emit packages whose `requires_python` allows this Python version (e.g. `3.10`)."
                    },
                    "includeReleases": {
                        "title": "Include version history",
                        "type": "boolean",
                        "description": "Emit the full sorted list of released versions.",
                        "default": false
                    },
                    "includeUrls": {
                        "title": "Include project_urls map",
                        "type": "boolean",
                        "description": "Emit the `project_urls` map (Documentation, Source, Bug Tracker, etc.).",
                        "default": true
                    },
                    "maxItems": {
                        "title": "Max items",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Hard cap on emitted records.",
                        "default": 50
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
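For projects that cannot add the `apify-client` dependency, the `run-sync-get-dataset-items` path from the specification above can be called with the standard library alone. This is a sketch under assumptions: `build_request` and `run_actor` are hypothetical helpers, the response is assumed to be a JSON array of dataset items, and the 300-second timeout is an arbitrary choice.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/crawlerbros~pypi-scraper/run-sync-get-dataset-items"


def build_request(token: str, actor_input: Dict[str, Any]) -> urllib.request.Request:
    """Build the POST request; the token goes in the query string, as in the spec."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(token: str, actor_input: Dict[str, Any], timeout: float = 300) -> List[Dict[str, Any]]:
    """Send the request and decode the response (assumed to be a JSON array)."""
    request = build_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network; run_actor() does.
request = build_request("abc123", {"mode": "lookup", "packageNames": ["requests"]})
print(request.get_method(), request.full_url)
```

Calling `run_actor("<YOUR_API_TOKEN>", {...})` starts a billable run, so the example stops at constructing the request.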
