PyPI Metadata API Scraper
Pull rich metadata for any PyPI package via the PyPI JSON API (current version, dependencies, classifiers, author, license, home page, project URLs, and release history) and export it to JSON or CSV. The PyPI API is free and needs no key. Download statistics are not part of this API; see Limitations.
Developer: DevilScrapes (maintained by Community)
🎯 What this scrapes
PyPI's metadata API (pypi.org/pypi/<package>/json) returns a full metadata record for every published package. This Actor accepts a list of package names, fans the requests out in parallel, and writes one typed dataset row per package — version, requires_dist, classifiers (Python versions, OS, license, framework), author, project URLs, and the latest 10 release timestamps. Bulk PyPI data export in a single run.
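As a rough illustration of the record shape, the sketch below maps one JSON API payload to a flat row. The top-level `info` and `releases` keys and the per-file `upload_time_iso_8601` field come from the PyPI JSON API; the helper names `metadata_url` and `to_row`, the trimmed sample payload, and the exact row layout are assumptions for illustration, not the Actor's actual code.

```python
from typing import Any


def metadata_url(package: str) -> str:
    """Build the JSON API URL for one package (no request is made here)."""
    return f"https://pypi.org/pypi/{package}/json"


def to_row(payload: dict[str, Any], max_releases: int = 10) -> dict[str, Any]:
    """Flatten one JSON API payload into a single dataset-style row."""
    info = payload["info"]
    # Each release maps to a list of uploaded files. Use the earliest upload
    # time among its files as the release timestamp; skip releases with no files.
    history = [
        {"version": version,
         "uploaded_at": min(f["upload_time_iso_8601"] for f in files)}
        for version, files in payload.get("releases", {}).items()
        if files
    ]
    # Same-format ISO-8601 UTC strings sort chronologically as plain strings.
    history.sort(key=lambda r: r["uploaded_at"], reverse=True)
    return {
        "name": info["name"],
        "version": info["version"],
        "requires_python": info.get("requires_python"),
        "requires_dist": info.get("requires_dist") or [],
        "classifiers": info.get("classifiers") or [],
        "release_history": history[:max_releases],
    }


# Hypothetical, heavily trimmed payload used only to exercise the function.
sample = {
    "info": {"name": "example-pkg", "version": "2.0.0",
             "requires_python": ">=3.9", "requires_dist": None,
             "classifiers": []},
    "releases": {
        "1.0.0": [{"upload_time_iso_8601": "2023-01-01T00:00:00.000000Z"}],
        "2.0.0": [{"upload_time_iso_8601": "2024-01-01T00:00:00.000000Z"}],
        "1.5.0": [],  # a release with no files is skipped
    },
}
row = to_row(sample)
```

A missing `requires_dist` is normalised to an empty list here so downstream code can iterate without a null check; whether the Actor does the same is not stated.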
🔥 Features
- 🛡️ Browser fingerprint rotation: `curl-cffi` impersonates real Chrome, Firefox, and Safari TLS handshakes so the endpoint sees a browser, not a Python script. Profiles rotate across requests.
- 🌐 Residential proxy rotation via Apify Proxy: fresh session and exit IP on every block or rate-limit response.
- 🔁 Retries with exponential backoff on `408 / 429 / 5xx`: up to 5 attempts per package, `Retry-After` respected.
- 🧱 Rate-limit-aware pacing: when the API pushes back we throttle automatically; you never get silently blocked or handed an empty dataset.
- 🧊 Clean, typed rows: Pydantic-validated, ISO-8601 timestamps, stable field names. Export to JSON, CSV, or Excel straight from the Apify Console.
- 💰 Pay-Per-Event pricing: you pay only for rows that hit your dataset. No data, no charge.
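The retry policy in the feature list (retry on 408 / 429 / 5xx, up to 5 attempts, `Retry-After` respected) could be sketched roughly as follows. The function names, the 1-second base delay, the 60-second cap, and the absence of jitter are all assumptions; the Actor's actual schedule is not documented here.

```python
from typing import Optional

MAX_ATTEMPTS = 5  # taken from the feature list


def is_retryable(status: int) -> bool:
    """True for the status codes the feature list names: 408, 429, any 5xx."""
    return status in (408, 429) or 500 <= status <= 599


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    A numeric Retry-After header wins; otherwise the delay doubles on each
    attempt up to `cap`. HTTP-date Retry-After values are not handled here.
    """
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after)
    return min(base * 2 ** (attempt - 1), cap)


# Waits between the 5 attempts (4 gaps) under the assumed defaults.
schedule = [backoff_delay(n) for n in range(1, MAX_ATTEMPTS)]
```

In practice a small random jitter is often added to each delay so that parallel workers do not retry in lockstep.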
💡 Use cases
- Dependency intel — feed your dependency tree to surface outdated or yanked packages across a full monorepo.
- Compliance audit — pull license classifier and Python-version bounds for every direct dependency in one bulk PyPI data export.
- Maintainer mapping — correlate your stack to authors and maintainers for supply-chain analysis.
- Release monitoring — schedule a daily run on a watch list; alert your team when a new version lands.
- RAG corpus — pull PyPI package READMEs in bulk for AI retrieval pipelines or language-model fine-tuning datasets.
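For the release-monitoring use case, one way to detect new versions is to compare the `name` and `version` fields from two scheduled runs. This is only a sketch: `new_versions` and the sample rows are hypothetical, and the check reports any change in the version string without deciding whether the version went up or down.

```python
def new_versions(previous, current):
    """Return (name, old_version, new_version) for packages whose version changed.

    Names are compared case-insensitively; packages absent from the previous
    run are ignored rather than reported as new.
    """
    seen = {row["name"].lower(): row["version"] for row in previous}
    changes = []
    for row in current:
        old = seen.get(row["name"].lower())
        if old is not None and old != row["version"]:
            changes.append((row["name"], old, row["version"]))
    return changes


# Hypothetical rows from yesterday's and today's runs.
yesterday = [{"name": "example-a", "version": "1.0.0"},
             {"name": "example-b", "version": "3.2.1"}]
today = [{"name": "example-a", "version": "1.1.0"},
         {"name": "example-b", "version": "3.2.1"},
         {"name": "example-c", "version": "0.1.0"}]
changed = new_versions(yesterday, today)
```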
⚙️ How to use it
- Click Try for free at the top of the page.
- Paste your package names into the Package names field — one per line.
- Toggle Include release history if you want the latest 10 version timestamps per package.
- Click Start. Output streams into the run's dataset as each package resolves.
- Export from Storage → Dataset as JSON, CSV, or Excel — or pull results via the Apify API.
📥 Input
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| `packages` | array | yes | `['requests', 'httpx', 'selectolax']` | PyPI package names. Case-insensitive: `Requests` and `requests` resolve to the same record. |
| `includeReleases` | boolean | no | `true` | When true, appends the latest 10 release versions and upload dates per package. No extra API call needed. |
| `concurrency` | integer | no | `8` | Number of parallel API requests. Raise for large lists; lower if you see 429s on a shared IP. |
| `proxyConfiguration` | object | no | `{"useApifyProxy": false}` | Proxy settings. Leave off for small lists; enable residential proxies for bulk runs. |
Example input
```json
{
  "packages": ["httpx"],
  "includeReleases": true,
  "concurrency": 4,
  "proxyConfiguration": {"useApifyProxy": false}
}
```
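The `concurrency` setting bounds how many requests are in flight at once. A minimal sketch of that fan-out pattern with a thread pool follows; `fetch_all` and the injected `fetch` callable are hypothetical names, and the stand-in fetcher makes no network call. The Actor's own implementation may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def fetch_all(packages: Iterable[str], fetch: Callable[[str], dict],
              concurrency: int = 8) -> list[dict]:
    """Run `fetch` for every package with at most `concurrency` calls in flight.

    Results come back in input order because Executor.map preserves ordering.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(fetch, packages))


def fake_fetch(name: str) -> dict:
    """Stand-in for an HTTP call; a real fetcher would request the JSON API."""
    return {"name": name.lower()}


rows = fetch_all(["Requests", "httpx", "selectolax"], fake_fetch, concurrency=4)
```

Passing the fetcher in as an argument keeps the concurrency logic testable without touching the network.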
📤 Output
Every row is one dataset item corresponding to one PyPI package.
| Field | Type | Notes |
|---|---|---|
| `name` | string | Canonical package name (PyPI returns the canonical casing). |
| `version` | string | Currently published version. |
| `summary` | string or null | One-line description. |
| `description` | string or null | Long description (README or similar). May be large. |
| `description_content_type` | string or null | Markup format of the long description. |
| `author` | string or null | Author name. |
| `author_email` | string or null | Author email. |
| `maintainer` | string or null | Maintainer name. |
| `license` | string or null | License classifier or raw license text. |
| `home_page` | string or null | Project home page URL. |
| `project_url` | string or null | PyPI project URL. |
| `project_urls` | object or null | Map of label → URL for additional project links (Source, Bug Tracker, etc.). |
| `requires_python` | string or null | PEP 440 version specifier, e.g. `>=3.9`. |
| `requires_dist` | array | Declared dependencies as PEP 508 requirement strings (the `Requires-Dist` metadata), including entries conditional on extras or environment markers. |
| `classifiers` | array | PyPI trove classifiers (Programming Language, Topic, License, Operating System). |
| `keywords` | string or null | Comma-separated keyword string from the project metadata. |
| `yanked` | boolean | `true` if the current release was yanked from the index. |
| `release_history` | array | Recent releases (version and upload timestamp), present when `includeReleases` is true. |
| `package_url` | string | Canonical pypi.org URL for the package. |
| `scraped_at` | string | ISO-8601 timestamp of when this row was recorded. |
Example output
```json
{
  "name": "httpx",
  "version": "0.27.0",
  "summary": "The next generation HTTP client.",
  "requires_python": ">=3.9",
  "license": "BSD-3-Clause",
  "home_page": null,
  "project_url": "https://pypi.org/project/httpx/"
}
```
💰 Pricing
Pay-Per-Event — you pay only when these events fire:
| Event | USD | What it is |
|---|---|---|
| `actor-start` | $0.005 | One-off warm-up charge per run |
| `result` | $0.0015 | Per dataset row written |
1 000 results cost 1 000 × $0.0015 = $1.50, plus the $0.005 start charge per run. No subscription, no minimum commitment, no card required to start: every new Apify account gets $5 of free credit.
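The arithmetic behind that estimate can be written out as a small sketch. The two rates come from the pricing table; `estimate_cost` is a hypothetical helper, and `Decimal` is used only to avoid floating-point rounding in the currency math.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")   # charged once per run
PER_RESULT = Decimal("0.0015")   # charged per dataset row written


def estimate_cost(rows: int, runs: int = 1) -> Decimal:
    """Estimated USD cost for `rows` dataset rows spread over `runs` runs."""
    return ACTOR_START * runs + PER_RESULT * rows


# 1 000 rows in a single run: 1.50 for results plus 0.005 for the start event.
total = estimate_cost(1000)
```

Because the start charge is per run, one large run is marginally cheaper than the same package list split across many small runs.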
Also see our npm-package-scraper — identical pricing, same bulk-export pattern for the npm ecosystem.
🚧 Limitations
This Actor queries the PyPI metadata API (/pypi/<pkg>/json) endpoint only. Download counts, vulnerability scores, and reverse-dependency graphs live elsewhere — see pypistats.org, OSV, and Snyk for those. Classifier data is only as complete as what individual maintainers have filed with PyPI; we return whatever the API provides and never fabricate missing values.
❓ FAQ
What is the PyPI metadata API?
PyPI exposes a free, undocumented-for-bulk-use JSON endpoint at pypi.org/pypi/<package>/json. It returns the full metadata record for a published package: version info, classifiers, dependencies, author details, and release history. This Actor wraps that endpoint with retry logic, proxy rotation, and bulk throughput so you can pull thousands of records in a single run.
Are download counts included?
The PyPI metadata API does not expose download counts. For historical download statistics, use the public PyPI dataset on Google BigQuery (`bigquery-public-data.pypi`) or the pypistats.org API. Those are separate services and out of scope here.
What happens if a package name doesn't exist?
The Actor logs a 404 and skips that package. The dataset still contains every package that resolved successfully; the run log lists skipped names.
Why is the description field so large?
Many maintainers bundle their full README into the PyPI package metadata. If your pipeline doesn't need the long description, drop the field in post-processing or filter it in your Apify dataset view.
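If you post-process exported rows yourself, dropping the field takes only a few lines. A sketch, assuming rows are plain dicts as in a JSON export; `drop_fields` is a hypothetical helper, not something the Actor provides.

```python
def drop_fields(rows, fields=("description",)):
    """Return copies of the rows without the named fields; inputs are not mutated."""
    unwanted = set(fields)
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]


# Hypothetical exported row with a long description attached.
exported = [{"name": "httpx", "version": "0.27.0", "description": "# README ..."}]
slim = drop_fields(exported)
```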
Can I get vulnerability or CVE data?
That is out of scope for this Actor. Use the OSV API (osv.dev) or Snyk's advisory database for vulnerability enrichment; they publish dedicated APIs built for that purpose.
How do I do a bulk PyPI data export for my whole requirements.txt?
Paste your package list into the Package names field. For large lists (1 000+ packages) raise concurrency to 16 and consider enabling Apify residential proxies to distribute the load. At $1.50 / 1 000 results the cost scales linearly.
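To build the `packages` input from a requirements file, something like the following sketch can help. It handles only simple lines (a name followed by optional extras, version specifiers, or environment markers) and skips comments, option lines such as `-r` or `-e`, and direct URL references. `package_names` is a hypothetical helper, not part of the Actor.

```python
import json
import re

# A project name: starts and ends with a letter or digit, with ., _ or - inside.
_NAME = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?")


def package_names(requirements_text: str) -> list[str]:
    """Extract bare package names from simple requirements.txt content."""
    names: list[str] = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-") or "://" in line:
            continue  # blank line, pip option, or direct URL reference
        match = _NAME.match(line)
        if match and match.group(0) not in names:
            names.append(match.group(0))
    return names


text = """
# core
requests>=2.31
httpx[http2]==0.27.0
-r dev.txt
selectolax ; python_version >= "3.9"
"""
run_input = {
    "packages": package_names(text),
    "includeReleases": True,
    "concurrency": 16,
}
print(json.dumps(run_input))
```

The printed JSON can be pasted into the Actor's input editor or sent through the Apify API.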
💬 Your feedback
Spotted a bug, hit an unexpected edge case, or need a field the Actor doesn't currently return? Open an issue on the Actor's Issues tab in the Apify Console — we ship fixes weekly and read every report.