PyPI Scraper

Pricing

from $1.00 / 1,000 results

Scrape Python package metadata from PyPI: exact-name lookup, newly-added packages, and recently-updated packages. Pulls version, license, classifiers, dependencies, project URLs, and maintainer info.

Rating

5.0

(10)

Developer

Crawler Bros

Maintained by Community

Actor stats

10

Bookmarked

2

Total users

1

Monthly active users

a day ago

Last modified

Scrape Python package metadata from the Python Package Index (PyPI): exact-name lookup, newly added packages, and recently updated packages. Pulls version, license, classifiers, dependencies, project URLs, author/maintainer info, and latest artifact details. HTTP-only via PyPI's public JSON and RSS endpoints; no authentication or proxy required.

What this actor does

  • Three modes: lookup (exact package names), newest (RSS feed of newly-added packages), updates (RSS feed of recently-updated package versions)
  • Rich metadata: name, latest version, summary, full description (markdown), license, classifiers, keywords, requires_python, project_urls (Documentation / Source / Issues / etc.)
  • Artifacts of latest release: filename, package type (wheel / sdist), python version, URL, size, upload time
  • Filters: classifier any-of, license substring, minimum supported Python version
  • Optional: includeReleases (full version history), includeUrls (project_urls map)
  • Empty fields are omitted — no nulls / blank strings reach the dataset
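The "empty fields are omitted" behaviour can be illustrated with a small recursive pruning helper. This is a minimal sketch, not the actor's code: the function names `drop_empty` and `_is_empty` are hypothetical, and treating whitespace-only strings as blank is an assumption.

```python
from typing import Any


def _is_empty(value: Any) -> bool:
    """True for None and for zero-length strings, lists, and dicts.

    Numbers and booleans are never considered empty, so 0 and False survive.
    """
    return value is None or (isinstance(value, (str, list, dict)) and len(value) == 0)


def drop_empty(value: Any) -> Any:
    """Recursively strip strings and remove empty entries from dicts and lists."""
    if isinstance(value, dict):
        cleaned = {key: drop_empty(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if not _is_empty(item)}
    if isinstance(value, list):
        cleaned_items = [drop_empty(item) for item in value]
        return [item for item in cleaned_items if not _is_empty(item)]
    if isinstance(value, str):
        # Assumption: whitespace-only strings count as blank.
        return value.strip()
    return value
```

With this approach a nested object such as a maintainer with a blank name and no email disappears entirely, while a numeric zero is kept.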

Output per package

  • name, latestVersion, summary, description, descriptionContentType
  • license — prefers license_expression (SPDX) then falls back to license
  • homePage, downloadUrl, docsUrl, bugTrackUrl
  • requiresPython (e.g. >=3.8,<4)
  • keywords[] — auto-detects comma vs space separator
  • classifiers[] — full list of PyPI classifiers
  • author{name, email}
  • maintainer{name, email} (when present)
  • projectUrls{Documentation, Source, Issues, ...} map (when includeUrls=true)
  • requiresDist[] — runtime dependency specifiers
  • latestArtifacts[{filename, packageType, pythonVersion, url, size, uploadTime}, ...]
  • versions[] — sorted (reverse) list of release versions when includeReleases=true
  • vulnerabilityCount — number of known vulnerabilities (PyPI's reported list)
  • pypiUrl, pypiJsonUrl
  • recordType: "package", scrapedAt
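Two of the normalisation rules in the list above (the license preference and the keyword separator detection) can be sketched in a few lines. This is an illustrative sketch only: `pick_license` and `split_keywords` are hypothetical names, and the input is assumed to be the `info` object of a PyPI JSON payload, where `keywords` arrives as a single string.

```python
from typing import List, Optional


def pick_license(info: dict) -> Optional[str]:
    """Prefer the SPDX license_expression, then fall back to free-form license."""
    for key in ("license_expression", "license"):
        value = (info.get(key) or "").strip()
        if value:
            return value
    return None


def split_keywords(raw: Optional[str]) -> List[str]:
    """Split on commas when any are present, otherwise on whitespace."""
    if not raw:
        return []
    separator = "," if "," in raw else None  # None makes str.split use whitespace
    return [part.strip() for part in raw.split(separator) if part.strip()]
```

Checking for a comma first means a value like "http client, rest api" keeps its multi-word keywords intact instead of being split on every space.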

Input

| Field | Type | Default | Description |
|---|---|---|---|
| mode | string | newest | lookup / newest / updates |
| packageNames | array | (none) | Required for mode=lookup (e.g. ["requests", "numpy"]) |
| classifierAnyOf | array | [] | Only emit packages with at least one of these classifiers |
| licenseContains | string | (none) | Only emit packages whose license contains this substring (case-insensitive) |
| minPythonVersion | string | (none) | Only emit packages whose requires_python allows this version (e.g. 3.10) |
| includeReleases | bool | false | Emit full version history |
| includeUrls | bool | true | Emit project_urls map |
| maxItems | int | 50 | Hard cap (1–1000) |
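To make the classifierAnyOf and licenseContains semantics concrete, here is a rough sketch of how such filters could be applied to an output record. The function name `passes_filters` is hypothetical, and exact string equality for classifiers is an assumption; the actor may match classifiers differently (for example by prefix).

```python
from typing import Iterable, Optional


def passes_filters(
    record: dict,
    classifier_any_of: Iterable[str] = (),
    license_contains: Optional[str] = None,
) -> bool:
    """Return True when the record satisfies every filter that is set."""
    wanted = set(classifier_any_of)
    if wanted:
        # Any-of: at least one requested classifier must be present.
        # Assumption: classifiers are compared by exact string equality.
        if not wanted & set(record.get("classifiers", [])):
            return False
    if license_contains:
        # Case-insensitive substring match against the emitted license field.
        if license_contains.lower() not in record.get("license", "").lower():
            return False
    return True
```

Filters that are left unset never reject a record, which mirrors the empty defaults in the input table.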

Example: lookup specific packages

{
"mode": "lookup",
"packageNames": ["requests", "numpy", "fastapi"]
}

Example: newest packages on PyPI

{
"mode": "newest",
"maxItems": 30
}

Example: recently updated packages, MIT-licensed and supporting Python 3.10

{
"mode": "updates",
"licenseContains": "MIT",
"minPythonVersion": "3.10",
"maxItems": 50
}

Example: tracking SQL ORM packages

{
"mode": "lookup",
"packageNames": ["sqlalchemy", "tortoise-orm", "peewee", "pony"],
"classifierAnyOf": ["Topic :: Database"],
"includeReleases": true
}

Use cases

  • Open-source intelligence — track adoption / version cadence of Python packages
  • Security teams — track maintainer churn, monitor vulnerabilityCount, audit licenses
  • DevRel & growth — find similar / competing packages, monitor share of voice
  • Compliance — bulk-fetch SPDX license expressions across an entire dependency tree
  • Package discovery — find newly-published packages in your domain
  • Release monitoring — wire up the updates feed to alert on new releases of watched packages

FAQ

Why no search mode? PyPI does not offer a programmatic search endpoint that returns structured JSON (its XML-RPC search method has been disabled since late 2020). Use lookup for known names, use newest / updates to discover new and recently updated packages, and narrow the RSS results with the classifier / license / Python-version filters.

Why are the RSS modes fast? Each RSS feed call returns up to 40 items in a single request. The actor then fetches pypi.org/pypi/<name>/json for each package to enrich it, so newest mode costs 1 RSS call + N package calls.
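The "1 RSS call + N package calls" pattern can be sketched offline by parsing a feed document and deriving the JSON URL for each package. This is a hedged sketch: `package_names_from_rss` and `json_url` are hypothetical helpers, and the assumption that each item's link has the shape /project/<name>/ or /project/<name>/<version>/ is mine, not something the actor documents.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlparse


def package_names_from_rss(rss_xml: str) -> List[str]:
    """Extract unique package names, in feed order, from RSS item links."""
    root = ET.fromstring(rss_xml)
    names: List[str] = []
    for item in root.iter("item"):
        link = item.findtext("link") or ""
        parts = [segment for segment in urlparse(link).path.split("/") if segment]
        # Assumed path shape: /project/<name>/ or /project/<name>/<version>/
        if len(parts) >= 2 and parts[0] == "project" and parts[1] not in names:
            names.append(parts[1])
    return names


def json_url(name: str) -> str:
    """Build the per-package JSON endpoint named in the FAQ above."""
    return f"https://pypi.org/pypi/{name}/json"
```

De-duplicating names matters mostly for the updates feed, where one package can appear several times if it published multiple versions in a short window.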

What's license vs license_expression? license is free-form (often MIT License, Apache 2.0). license_expression is the SPDX identifier (e.g. Apache-2.0). The actor prefers license_expression if present.

How does minPythonVersion work? It parses requires_python (e.g. >=3.8,<4), extracts the lowest required version, and checks if your threshold is >= that. So minPythonVersion: "3.10" keeps packages that support Python 3.10 (i.e. requires_python lower bound ≤ 3.10).
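The lower-bound check described above can be approximated as follows. This is a simplified sketch, not the actor's parser: `lowest_required` and `allows` are hypothetical names, only the `>=`, `>`, `==`, and `~=` operators are treated as lower bounds, upper bounds and exclusions are ignored, and packages without a lower bound are assumed to pass.

```python
import re
from typing import Optional, Tuple

_LOWER_BOUND = re.compile(r"\s*(>=|>|==|~=)\s*(\d+(?:\.\d+)*)")


def lowest_required(requires_python: str) -> Optional[Tuple[int, ...]]:
    """Return the lowest lower-bound version as an int tuple, or None."""
    bounds = []
    for clause in requires_python.split(","):
        match = _LOWER_BOUND.match(clause)
        if match:
            bounds.append(tuple(int(part) for part in match.group(2).split(".")))
    return min(bounds) if bounds else None


def allows(requires_python: Optional[str], min_python_version: str) -> bool:
    """Keep a package when the threshold is >= its lowest required version."""
    if not requires_python:
        return True  # Assumption: no constraint means every version is allowed.
    lower = lowest_required(requires_python)
    if lower is None:
        return True  # Only upper bounds or exclusions were declared.
    threshold = tuple(int(part) for part in min_python_version.split("."))
    return threshold >= lower
```

Comparing integer tuples rather than strings is what makes 3.10 sort above 3.8; a plain string comparison would get that wrong.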

What does vulnerabilityCount track? PyPI exposes a vulnerabilities array on each package's JSON payload (sourced from OSV.dev). We count entries — a non-zero count is a signal to dig deeper.
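Counting the entries is a one-liner once the package JSON has been fetched. The sketch below assumes the payload is already parsed into a dict with a top-level `vulnerabilities` list whose entries carry an `id` field; `vulnerability_count` and `vulnerability_ids` are hypothetical helper names, and the advisory IDs used in any example data are made up.

```python
from typing import List


def vulnerability_count(payload: dict) -> int:
    """Count entries in the payload's vulnerabilities array (0 if absent)."""
    return len(payload.get("vulnerabilities") or [])


def vulnerability_ids(payload: dict) -> List[str]:
    """Collect advisory IDs so a non-zero count can be followed up."""
    return [
        entry["id"]
        for entry in payload.get("vulnerabilities") or []
        if entry.get("id")
    ]
```

Keeping the IDs alongside the count gives a starting point for the "dig deeper" step, for example by looking each one up on OSV.dev.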

Are dependencies fully resolved? No — requiresDist returns the raw specifiers (e.g. "urllib3<3,>=1.21.1"). For full resolution, feed the package into pip-compile or uv lock downstream.

How fresh is the data? Real-time. PyPI's JSON is served from the same backend that serves the website; RSS feeds update every few minutes.