Data.gov Dataset Catalog Crawler
Pricing
Pay per event
Crawl 400K+ US government datasets from Data.gov. Extract titles, organizations, tags, formats, download URLs, API endpoints, temporal and spatial coverage, contacts, and resources. Filter by agency, format, category, and tags.
Developer
BowTiedRaccoon
Extract dataset metadata from the Data.gov federal open data catalog. Covers 400K+ US government datasets — titles, descriptions, organizations, tags, formats, download URLs, API endpoints, temporal and spatial coverage, contact information, and individual resource details.
Data.gov Crawler Features
- Search across 400K+ federal datasets by keyword, organization, format, category, or tag
- Filter by 20+ federal agencies including EPA, NASA, Census Bureau, HHS, DOE, and USDA
- Filter by resource format: CSV, JSON, XML, API, Excel, PDF, Shapefile, GeoJSON, KML, RDF
- Extract download URLs and API endpoints for every resource in a dataset
- Includes temporal coverage (date ranges), spatial coverage (geographic bounds), and update frequency
- Pulls contact names and emails for dataset maintainers
- Toggle resource-level detail on or off — run lean for metadata, or include full file listings
- Sort by relevance, recently modified, recently created, or alphabetically
- Reads the CKAN v3 JSON API directly — no HTML parsing, no proxy required
- Pay-per-event pricing at roughly $0.50 per 1,000 datasets
Who Uses Data.gov Dataset Metadata?
- Data scientists and researchers — discover federal datasets by topic, format, and time period without manually browsing catalog.data.gov
- Government contractors — identify available datasets from specific agencies for proposal research and compliance reporting
- Journalists and civic technologists — find public datasets on health, environment, transportation, or public safety for data-driven stories and apps
- Open data advocates — monitor dataset freshness, publication rates, and coverage gaps across federal agencies
- B2B data teams — build pipelines that pull structured government data feeds into internal analytics platforms
How Data.gov Crawler Works
- You configure search filters: keywords, organization, data format, category, tags, or any combination.
- The crawler queries the Data.gov CKAN API with your filters and paginates through all matching results (1,000 per request, 200ms courtesy delay).
- Each dataset record is transformed into a clean, flat output object with all metadata fields extracted and normalized.
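The pagination step above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: the `paginate` and `fetch_page_http` names are hypothetical, the endpoint URL is an assumption based on the standard CKAN `package_search` action, and the response shape (`result.count`, `result.results`) follows the usual CKAN payload.

```python
import json
import time
from typing import Callable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the standard CKAN package_search action on catalog.data.gov.
API_URL = "https://catalog.data.gov/api/3/action/package_search"
PAGE_SIZE = 1000      # rows per request, as described above
DELAY_SECONDS = 0.2   # 200 ms courtesy delay between requests


def fetch_page_http(start: int, rows: int, query: str = "") -> dict:
    """Fetch one page of results over HTTP (hypothetical helper)."""
    params = urlencode({"q": query, "start": start, "rows": rows})
    with urlopen(f"{API_URL}?{params}", timeout=30) as response:
        return json.load(response)


def paginate(fetch_page: Callable[[int, int], dict], max_items: int = 0,
             delay: float = DELAY_SECONDS) -> Iterator[dict]:
    """Yield records page by page until results or max_items run out.

    max_items = 0 means unlimited, mirroring the maxItems input.
    """
    start = 0
    yielded = 0
    while True:
        rows = PAGE_SIZE
        if max_items:
            rows = min(PAGE_SIZE, max_items - yielded)
        result = fetch_page(start, rows)["result"]
        records = result["results"]
        if not records:
            return
        for record in records:
            yield record
            yielded += 1
            if max_items and yielded >= max_items:
                return
        start += len(records)
        if start >= result["count"]:
            return
        time.sleep(delay)  # be polite to the public API
```

Passing the fetch function in as a parameter keeps the loop testable offline; in a live run you would call `paginate(fetch_page_http, max_items=100)`.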
Input
Browse all datasets (default)
{"maxItems": 100}
Search by keyword
{"searchQuery": "air quality monitoring", "maxItems": 50}
Filter by agency and format
{"organization": "environmental-protection-agency", "dataFormat": "CSV", "maxItems": 200}
Filter by category and tags
{"category": "health", "tags": ["covid-19"], "maxItems": 100}
Input Parameters
| Field | Type | Default | Description |
|---|---|---|---|
| searchQuery | string | "" | Full-text search across dataset titles, descriptions, and tags. Leave empty to browse all datasets. |
| organization | string | "" | Filter by publishing agency. Options include environmental-protection-agency, department-of-commerce, national-aeronautics-and-space-administration, census-bureau, and 16 others. |
| dataFormat | string | "" | Filter by resource format: CSV, JSON, XML, API, XLS, PDF, SHP, GeoJSON, KML, RDF, HTML. |
| category | string | "" | Filter by Data.gov topic: agriculture, business, climate, consumer, ecosystems, education, energy, finance, health, manufacturing, ocean, public-safety, science, weather. |
| tags | string[] | [] | Filter by dataset tags (e.g., health, environment). Datasets must match all specified tags. |
| includeResources | boolean | true | Include individual resource details (file names, formats, URLs) within each dataset. Set to false for faster, leaner output. |
| sortBy | string | "score desc" | Sort order: score desc (relevance), metadata_modified desc (recently modified), metadata_created desc (recently created), name asc (A-Z). |
| maxItems | integer | 100 | Maximum dataset records to return. Set to 0 for unlimited. |
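One plausible way these inputs could translate into a CKAN `package_search` request is sketched below. The mapping is an assumption, not the crawler's documented behavior: `build_search_params` is a hypothetical helper, and the Solr field names (`organization`, `res_format`, `groups`, `tags`) are the ones CKAN catalogs commonly expose for filtering.

```python
def build_search_params(actor_input: dict, start: int = 0, rows: int = 1000) -> dict:
    """Map the actor's input fields onto CKAN package_search parameters.

    Assumed mapping: each filter becomes one clause in the `fq` filter query,
    and all clauses are joined with AND so a dataset must satisfy every one.
    """
    filters = []
    if actor_input.get("organization"):
        filters.append(f'organization:"{actor_input["organization"]}"')
    if actor_input.get("dataFormat"):
        filters.append(f'res_format:"{actor_input["dataFormat"]}"')
    if actor_input.get("category"):
        filters.append(f'groups:"{actor_input["category"]}"')
    for tag in actor_input.get("tags", []):
        filters.append(f'tags:"{tag}"')  # one clause per tag: all must match

    return {
        "q": actor_input.get("searchQuery", ""),
        "fq": " AND ".join(filters),
        "sort": actor_input.get("sortBy", "score desc"),
        "start": start,
        "rows": rows,
    }


params = build_search_params(
    {"organization": "environmental-protection-agency", "dataFormat": "CSV"}
)
print(params["fq"])
```

Joining the clauses with an explicit `AND` reflects the "match every condition" behavior described for combined filters; `maxItems` is left out because it limits how many pages are fetched rather than what each request asks for.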
Data.gov Crawler Output Fields
Example Output
{
  "dataset_title": "Air Quality System (AQS) Data",
  "dataset_id": "c9310bc1-b224-4527-b3c2-0bd2bb24f455",
  "organization": "Environmental Protection Agency",
  "description": "The Air Quality System contains ambient air quality data collected by EPA, state, local, and tribal air pollution control agencies.",
  "tags": ["air-quality", "environment", "monitoring"],
  "categories": ["Climate", "Science & Research"],
  "update_frequency": "Annually",
  "temporal_coverage": "1980-01-01/2024-12-31",
  "spatial_coverage": "-180.0,18.0,-66.0,72.0",
  "formats": ["CSV", "API", "HTML"],
  "download_url": "https://aqs.epa.gov/aqsweb/airdata/download_files.html",
  "api_endpoint": "https://aqs.epa.gov/data/api",
  "license": "Public Domain",
  "modified_date": "2024-06-20",
  "created_date": "2014-10-09",
  "contact_name": "AQS Helpdesk",
  "contact_email": "aqs@epa.gov",
  "publisher": "US EPA Office of Air Quality",
  "resource_count": 3,
  "resources": [
    "Data CSV | CSV | https://aqs.epa.gov/aqsweb/airdata/download_files.html",
    "API Endpoint | API | https://aqs.epa.gov/data/api",
    "Documentation | HTML | https://aqs.epa.gov/aqsweb/documents.html"
  ],
  "dataset_url": "https://catalog.data.gov/dataset/air-quality-system-aqs-data"
}
Output Field Reference
| Field | Type | Description |
|---|---|---|
| dataset_title | string | Title of the dataset |
| dataset_id | string | Unique CKAN dataset identifier |
| organization | string | Publishing organization (federal agency name) |
| description | string | Full description of the dataset |
| tags | string[] | Tags and keywords associated with the dataset |
| categories | string[] | Topic categories (climate, health, energy, etc.) |
| update_frequency | string | How often the dataset is updated (Daily, Monthly, Annually, etc.) |
| temporal_coverage | string | Time period covered (e.g., 2010-01-01/2023-12-31) |
| spatial_coverage | string | Geographic area covered (coordinate bounds or description) |
| formats | string[] | Available resource formats (CSV, JSON, XML, API, etc.) |
| download_url | string | Direct download URL for the primary resource |
| api_endpoint | string | API endpoint URL if the dataset offers an API |
| license | string | License type (Public Domain, Creative Commons, etc.) |
| modified_date | string | Date the metadata was last modified (YYYY-MM-DD) |
| created_date | string | Date the dataset was first published (YYYY-MM-DD) |
| contact_name | string | Dataset contact person or maintainer name |
| contact_email | string | Dataset contact email |
| publisher | string | Publishing sub-agency or office |
| resource_count | number | Number of individual resources (files/APIs) in the dataset |
| resources | string[] | Resource details, one string per resource in the form `name \| format \| url` |
| dataset_url | string | URL to the dataset page on catalog.data.gov |
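Because the output is flat, a couple of fields pack several values into one string. Judging from the example output, each `resources` entry looks like `name | format | url` and `temporal_coverage` looks like `start/end`. The sketch below shows one way to unpack them downstream; `Resource`, `parse_resource`, and `parse_temporal_coverage` are hypothetical names, and real records may deviate from these shapes (for example, free-text coverage values).

```python
from datetime import date
from typing import NamedTuple, Tuple


class Resource(NamedTuple):
    name: str
    format: str
    url: str


def parse_resource(entry: str) -> Resource:
    """Split a 'name | format | url' string into its three parts.

    Splitting from the right keeps any '|' inside the resource name intact.
    """
    name, fmt, url = (part.strip() for part in entry.rsplit("|", 2))
    return Resource(name, fmt, url)


def parse_temporal_coverage(value: str) -> Tuple[date, date]:
    """Turn 'YYYY-MM-DD/YYYY-MM-DD' into a (start, end) pair of dates."""
    start, end = value.split("/", 1)
    return date.fromisoformat(start), date.fromisoformat(end)


resource = parse_resource("API Endpoint | API | https://aqs.epa.gov/data/api")
print(resource.format, resource.url)
```

Both helpers raise `ValueError` on input that does not match the expected shape, which is usually preferable to silently producing half-parsed records in a pipeline.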
FAQ
How many datasets does Data.gov Crawler cover? Data.gov Crawler queries the full Data.gov catalog — over 400,000 datasets published by federal agencies. If it is listed on catalog.data.gov, the crawler can find it.
Does this crawler need proxies? No. Data.gov is a public government CKAN API with no authentication, no rate limits, and no bot detection. Proxies are disabled by default because they are genuinely unnecessary.
How fast does it run? Roughly 1,000 datasets per 80 seconds. A 100-dataset run completes in under a minute on 256 MB memory. The full 400K+ catalog takes approximately 11 hours.
What is the difference between includeResources on and off? With includeResources enabled (default), each dataset record lists every individual file and API resource: name, format, and URL. Disable it for faster runs when you only need dataset-level metadata.
Can I filter by multiple criteria at once? Yes. Combine any of the filters — organization, format, category, tags, and search query. The CKAN API intersects all active filters, narrowing results to datasets that match every condition.
What data formats are available on Data.gov? Common formats include CSV, JSON, XML, API, Excel (XLS/XLSX), PDF, Shapefile (SHP), GeoJSON, KML, RDF, and HTML. Filter by any of these to find datasets in the format you need.
Need More Features?
Need custom fields, additional filters, or a different government data source? File an issue or get in touch.
Why Use Data.gov Crawler?
- Full catalog access — 400K+ federal datasets from 20+ agencies, searchable and filterable without navigating catalog.data.gov manually.
- Structured output — Every record is normalized to the same flat schema with 21 fields, ready for downstream pipelines.
- Built on the official API — Reads CKAN v3 JSON directly, so it does not break when someone redesigns a webpage.