PyPI Metadata API Scraper

Pull rich metadata for any PyPI package via the PyPI JSON API: current version, dependencies, classifiers, author, license, home page, download URLs, and release history. Export to JSON or CSV. The PyPI API is free and needs no key. Download statistics are not part of this API and are out of scope for this Actor.

Developer

DevilScrapes

Maintained by Community


🎯 What this scrapes

PyPI's metadata API (pypi.org/pypi/<package>/json) returns a full metadata record for every published package. This Actor accepts a list of package names, fans the requests out in parallel, and writes one typed dataset row per package — version, requires_dist, classifiers (Python versions, OS, license, framework), author, project URLs, and the latest 10 release timestamps. Bulk PyPI data export in a single run.
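As a rough illustration of what one such request and row could look like, here is a minimal sketch using only the standard library. The function names (`metadata_url`, `flatten`, `fetch_row`) and the `uploaded_at` key are hypothetical and are not the Actor's actual code or schema; the payload keys (`info`, `releases`, `upload_time_iso_8601`) are the ones the PyPI JSON API returns.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def metadata_url(package: str) -> str:
    """Build the JSON API URL for one package name."""
    return f"https://pypi.org/pypi/{quote(package.strip())}/json"


def flatten(payload: dict, max_releases: int = 10) -> dict:
    """Reduce a PyPI JSON payload to one flat row (illustrative shape only)."""
    info = payload.get("info", {})
    uploads = []
    for version, files in payload.get("releases", {}).items():
        # A release can have several files; use the earliest upload time.
        times = [f["upload_time_iso_8601"] for f in files
                 if f.get("upload_time_iso_8601")]
        if times:
            uploads.append({"version": version, "uploaded_at": min(times)})
    # Assumes timestamps share one ISO-8601 format, so string order is time order.
    uploads.sort(key=lambda r: r["uploaded_at"], reverse=True)
    return {
        "name": info.get("name"),
        "version": info.get("version"),
        "requires_python": info.get("requires_python"),
        "requires_dist": info.get("requires_dist") or [],
        "classifiers": info.get("classifiers") or [],
        "release_history": uploads[:max_releases],
    }


def fetch_row(package: str) -> dict:
    """Fetch one package record over the network and flatten it."""
    with urlopen(metadata_url(package), timeout=30) as resp:
        return flatten(json.load(resp))
```

Calling `fetch_row("httpx")` would perform a single live request; the Actor additionally parallelises, retries, and validates rows, none of which this sketch attempts.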

🔥 Features

  • 🛡️ Browser fingerprint rotation: curl-cffi impersonates real Chrome, Firefox, and Safari TLS handshakes so the endpoint sees a browser, not a Python script. Profiles rotate across requests.
  • 🌐 Residential proxy rotation via Apify Proxy — fresh session and exit IP on every block or rate-limit response.
  • 🔁 Retries with exponential backoff on 408 / 429 / 5xx — up to 5 attempts per package, Retry-After respected.
  • 🧱 Rate-limit-aware pacing — when the API pushes back we throttle automatically; you never get silently blocked or handed an empty dataset.
  • 🧊 Clean, typed rows — Pydantic-validated, ISO-8601 timestamps, stable field names. Export to JSON, CSV, or Excel straight from the Apify Console.
  • 💰 Pay-Per-Event pricing: you pay a small actor-start charge per run plus a per-row charge for rows that hit your dataset. Packages that produce no row incur no result charge.
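The retry behaviour described in the list above (408 / 429 / 5xx, up to 5 attempts, Retry-After respected) can be sketched roughly as follows. This is an assumption-laden illustration, not the Actor's implementation: the names `is_retryable`, `backoff_delay`, and `call_with_retries` are hypothetical, the base delay and cap are made-up values, and only the delta-seconds form of Retry-After is handled (the HTTP-date form is ignored).

```python
import random
import time
from typing import Callable, Optional, Tuple

RETRYABLE = {408, 429}


def is_retryable(status: int) -> bool:
    """408, 429, and any 5xx status are worth retrying."""
    return status in RETRYABLE or 500 <= status <= 599


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0,
                  jitter: float = 0.0) -> float:
    """Seconds to wait before the next try; `attempt` is 0-based."""
    # A numeric Retry-After header wins over the computed delay.
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    delay = min(base * (2 ** attempt), cap)  # 1, 2, 4, 8, ... capped
    return delay + random.uniform(0, jitter)


def call_with_retries(send: Callable[[], Tuple[int, dict, object]],
                      max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep):
    """Call `send` until it returns a non-retryable status or attempts run out.

    `send` is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if not is_retryable(status):
            return status, body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, headers.get("Retry-After")))
    return status, body
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; in practice a non-zero `jitter` helps avoid synchronised retries across parallel workers.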

💡 Use cases

  • Dependency intel — feed your dependency tree to surface outdated or yanked packages across a full monorepo.
  • Compliance audit — pull license classifier and Python-version bounds for every direct dependency in one bulk PyPI data export.
  • Maintainer mapping — correlate your stack to authors and maintainers for supply-chain analysis.
  • Release monitoring — schedule a daily run on a watch list; alert your team when a new version lands.
  • RAG corpus — pull PyPI package READMEs in bulk for AI retrieval pipelines or language-model fine-tuning datasets.
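For the compliance-audit use case, the `classifiers` array of each row can be reduced to license names and declared Python versions. The sketch below is one possible post-processing step, not something the Actor does for you; `summarize_classifiers` is a hypothetical helper, and it relies only on the standard `A :: B :: C` classifier format.

```python
def summarize_classifiers(classifiers):
    """Pull license names and declared Python versions out of trove classifiers."""
    licenses, python_versions = [], []
    for classifier in classifiers:
        parts = [p.strip() for p in classifier.split("::")]
        if parts[0] == "License" and len(parts) > 1:
            # e.g. "License :: OSI Approved :: BSD License" -> "BSD License"
            licenses.append(parts[-1])
        elif parts[:2] == ["Programming Language", "Python"] and len(parts) == 3:
            # Keep "3" or "3.11"; skip entries such as "3 :: Only".
            if parts[2][:1].isdigit():
                python_versions.append(parts[2])
    return {"licenses": licenses, "python_versions": python_versions}
```

Because classifiers are maintainer-supplied, an empty `licenses` list means "not declared", not "unlicensed"; the row's `license` field is worth checking as a fallback.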

⚙️ How to use it

  1. Click Try for free at the top of the page.
  2. Paste your package names into the Package names field — one per line.
  3. Toggle Include release history if you want the latest 10 version timestamps per package.
  4. Click Start. Output streams into the run's dataset as each package resolves.
  5. Export from Storage → Dataset as JSON, CSV, or Excel — or pull results via the Apify API.

📥 Input

Field               Type     Required  Default                              Notes
packages            array    yes       ['requests', 'httpx', 'selectolax']  PyPI package names. Case-insensitive: Requests and requests resolve the same record.
includeReleases     boolean  no        true                                 When true, appends the latest 10 release versions + upload dates per package. No extra API call needed.
concurrency         integer  no        8                                    Parallel API requests. Raise for large lists; lower if you see 429s on a shared IP.
proxyConfiguration  object   no        {"useApifyProxy": false}             Proxy settings. Leave off for small lists; enable residential proxies for bulk runs.

Example input

{
  "packages": [
    "httpx"
  ],
  "includeReleases": true,
  "concurrency": 4,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}

📤 Output

Every row is one dataset item corresponding to one PyPI package.

Field                     Type           Notes
name                      string         Canonical package name (PyPI returns the canonical casing).
version                   string         Currently published version.
summary                   string | null  One-line description.
description               string | null  Long description (README or similar). May be large.
description_content_type  string | null  Markup format of the long description.
author                    string | null  Author name.
author_email              string | null  Author email.
maintainer                string | null  Maintainer name.
license                   string | null  License classifier or raw license text.
home_page                 string | null  Project home page URL.
project_url               string | null  PyPI project URL.
project_urls              object | null  Map of label → URL for additional project links (Source, Bug Tracker, etc.).
requires_python           string | null  PEP 440 version specifier, e.g. >=3.9.
requires_dist             array          Declared dependencies (Requires-Dist) as PEP 508 requirement strings, including those tied to optional extras.
classifiers               array          PyPI classifiers: Programming Language, Topic, License, OS.
keywords                  string | null  Comma-separated keyword string from the project metadata.
yanked                    boolean        true if the current release was yanked from the index.
release_history           array          Recent releases (version + upload timestamp), present when includeReleases is true.
package_url               string         Canonical pypi.org URL for the package.
scraped_at                string         ISO-8601 timestamp of when this row was recorded.

Example output

{
  "name": "httpx",
  "version": "0.27.0",
  "summary": "The next generation HTTP client.",
  "requires_python": ">=3.9",
  "license": "BSD-3-Clause",
  "home_page": null,
  "project_url": "https://pypi.org/project/httpx/"
}

💰 Pricing

Pay-Per-Event — you pay only when these events fire:

Event        USD      What it is
actor-start  $0.005   One-off warm-up charge per run
result       $0.0015  Per dataset row written

1 000 results at these rates cost 1 000 × $0.0015 = $1.50, plus the $0.005 actor-start charge, for $1.505 in total per run. No subscription, no minimum commitment, no card required to start: every new Apify account gets $5 of free credit.
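To budget a run before starting it, the two event prices from the table can be combined in a few lines. This is only a back-of-the-envelope sketch: `estimate_cost` is a hypothetical helper, and the constants are copied from the table above, so they need updating if the listed prices change.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")   # charged once per run
PER_RESULT = Decimal("0.0015")   # charged per dataset row written


def estimate_cost(rows: int, runs: int = 1) -> Decimal:
    """Estimated USD cost for `rows` dataset rows spread over `runs` runs."""
    return ACTOR_START * runs + PER_RESULT * rows
```

`Decimal` is used instead of floats so that small per-row amounts add up exactly; for example, a daily watch list of 200 packages over 30 runs would be `estimate_cost(200 * 30, runs=30)`.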

Also see our npm-package-scraper — identical pricing, same bulk-export pattern for the npm ecosystem.

🚧 Limitations

This Actor queries the PyPI metadata API (/pypi/<pkg>/json) endpoint only. Download counts, vulnerability scores, and reverse-dependency graphs live elsewhere — see pypistats.org, OSV, and Snyk for those. Classifier data is only as complete as what individual maintainers have filed with PyPI; we return whatever the API provides and never fabricate missing values.

❓ FAQ

What is the PyPI metadata API?

PyPI exposes a free JSON endpoint at pypi.org/pypi/<package>/json. It serves one package per request and has no bulk mode. Each response is the full metadata record for a published package: version info, classifiers, dependencies, author details, and release history. This Actor wraps that endpoint with retry logic, proxy rotation, and parallel requests so you can pull thousands of records in a single run.

Are download counts included?

The PyPI metadata API does not expose download counts. For historical download statistics, use the public PyPI dataset on Google BigQuery (bigquery-public-data.pypi) or the pypistats.org API. Those are separate services and are out of scope here.

What happens if a package name doesn't exist?

The Actor logs a 404 and skips that package. The dataset still contains every package that resolved successfully; the run log lists skipped names.

Why is the description field so large?

Many maintainers bundle their full README into the PyPI package metadata. If your pipeline doesn't need the long description, drop the field in post-processing or filter it in your Apify dataset view.
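If you export the dataset as JSON and post-process it yourself, dropping the field takes one small function. The sketch below assumes the rows are already loaded as a list of dicts; `drop_fields` is a hypothetical helper name.

```python
def drop_fields(rows, fields=("description",)):
    """Return copies of the rows without the named fields."""
    return [
        {key: value for key, value in row.items() if key not in fields}
        for row in rows
    ]
```

The input rows are left untouched, so the full records remain available if a later step needs the README text after all.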

Can I get vulnerability or CVE data?

That is out of scope for this Actor. Use the OSV API (osv.dev) or Snyk's advisory database for vulnerability enrichment; they publish dedicated APIs built for that purpose.

How do I do a bulk PyPI data export for my whole requirements.txt?

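Reduce each requirements.txt line to a bare package name (drop version specifiers, extras, comments, and option lines such as -r), then paste the names into the Package names field. For large lists (1 000+ packages) raise concurrency to 16 and consider enabling Apify residential proxies to distribute the load. At $1.50 / 1 000 results the cost scales linearly.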
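One way to turn a requirements.txt into Actor input is sketched below. It is a simplified parser written for illustration: `package_names` and `build_input` are hypothetical helpers, URL and VCS requirements are skipped rather than resolved, and only the `packages`, `includeReleases`, and `concurrency` input fields documented above are used.

```python
import re

_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def package_names(requirements_text: str) -> list:
    """Extract unique package names from requirements.txt content."""
    names, seen = [], set()
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()      # strip comments
        if not line or line.startswith("-"):     # skip blanks and options (-r, -e, ...)
            continue
        match = _NAME.match(line)
        if not match:
            continue
        rest = line[match.end():].lstrip()
        if rest.startswith("+") or rest.startswith("://"):
            continue                             # skip VCS / URL requirements
        name = match.group(0)
        key = re.sub(r"[-_.]+", "-", name).lower()  # normalised form for de-duplication
        if key not in seen:
            seen.add(key)
            names.append(name)
    return names


def build_input(names: list) -> dict:
    """Build an input object; concurrency 16 for 1 000+ names, else the default 8."""
    return {
        "packages": names,
        "includeReleases": True,
        "concurrency": 16 if len(names) >= 1000 else 8,
    }
```

Serialising `build_input(package_names(text))` with `json.dumps` gives an object you could paste into the Actor's JSON input editor or send through the Apify API.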

💬 Your feedback

Spotted a bug, hit an unexpected edge case, or need a field the Actor doesn't currently return? Open an issue on the Actor's Issues tab in the Apify Console — we ship fixes weekly and read every report.