Data.gov Dataset Catalog Crawler

Pricing

Pay per event

Crawl 300K+ US government datasets from Data.gov. Extract titles, organizations, tags, formats, download URLs, API endpoints, temporal and spatial coverage, contacts, and resources. Filter by agency, format, category, and tags.

Developer

BowTiedRaccoon

Maintained by Community

Extract dataset metadata from the Data.gov federal open data catalog. Covers 400K+ US government datasets — titles, descriptions, organizations, tags, formats, download URLs, API endpoints, temporal and spatial coverage, contact information, and individual resource details.

Data.gov Crawler Features

  • Search across 400K+ federal datasets by keyword, organization, format, category, or tag
  • Filter by 20+ federal agencies including EPA, NASA, Census Bureau, HHS, DOE, and USDA
  • Filter by resource format: CSV, JSON, XML, API, Excel, PDF, Shapefile, GeoJSON, KML, RDF
  • Extract download URLs and API endpoints for every resource in a dataset
  • Include temporal coverage (date ranges), spatial coverage (geographic bounds), and update frequency
  • Pull contact names and emails for dataset maintainers
  • Toggle resource-level detail on or off — run lean for metadata, or include full file listings
  • Sort by relevance, recently modified, recently created, or alphabetically
  • Reads the CKAN v3 JSON API directly — no HTML parsing, no proxy required
  • Pay-per-event pricing at roughly $0.50 per 1,000 datasets

Who Uses Data.gov Dataset Metadata?

  • Data scientists and researchers — discover federal datasets by topic, format, and time period without manually browsing catalog.data.gov
  • Government contractors — identify available datasets from specific agencies for proposal research and compliance reporting
  • Journalists and civic technologists — find public datasets on health, environment, transportation, or public safety for data-driven stories and apps
  • Open data advocates — monitor dataset freshness, publication rates, and coverage gaps across federal agencies
  • B2B data teams — build pipelines that pull structured government data feeds into internal analytics platforms

How Data.gov Crawler Works

  1. You configure search filters: keywords, organization, data format, category, tags, or any combination.
  2. The crawler queries the Data.gov CKAN API with your filters and paginates through all matching results (1,000 per request, 200ms courtesy delay).
  3. Each dataset record is transformed into a clean, flat output object with all metadata fields extracted and normalized.
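The pagination in step 2 can be sketched roughly as follows. This is a minimal illustration rather than the actor's actual code: `crawl`, `fetch_page`, and `fetch_from_ckan` are hypothetical names, and the endpoint URL assumes the standard CKAN Action API path (`/api/3/action/package_search`) on catalog.data.gov. The page fetcher is passed in so the loop can be tried without network access.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Iterator

PAGE_SIZE = 1000      # rows per request, as described in step 2
DELAY_SECONDS = 0.2   # courtesy delay between requests

# Assumed endpoint: the standard CKAN Action API path on catalog.data.gov.
SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"


def fetch_from_ckan(start: int, rows: int) -> dict:
    """Hypothetical fetcher: request one page and return the CKAN 'result' object."""
    query = urllib.parse.urlencode({"start": start, "rows": rows})
    with urllib.request.urlopen(f"{SEARCH_URL}?{query}") as response:
        return json.load(response)["result"]


def crawl(fetch_page: Callable[[int, int], dict], max_items: int = 100,
          page_size: int = PAGE_SIZE, delay: float = DELAY_SECONDS) -> Iterator[dict]:
    """Yield dataset records page by page; max_items=0 means unlimited."""
    start = 0
    yielded = 0
    while True:
        rows = page_size
        if max_items:
            # Never request more rows than are still needed.
            rows = min(page_size, max_items - yielded)
        result = fetch_page(start, rows)
        datasets = result.get("results", [])
        for dataset in datasets:
            yield dataset
            yielded += 1
            if max_items and yielded >= max_items:
                return
        start += len(datasets)
        # Stop on an empty page or once the reported total is reached.
        if not datasets or start >= result.get("count", 0):
            return
        time.sleep(delay)
```

Calling `crawl(fetch_from_ckan, max_items=50)` would walk the live catalog; swapping in a stub fetcher keeps the loop testable offline.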

Input

Browse all datasets (default)

{
  "maxItems": 100
}

Search by keyword

{
  "searchQuery": "air quality monitoring",
  "maxItems": 50
}

Filter by agency and format

{
  "organization": "environmental-protection-agency",
  "dataFormat": "CSV",
  "maxItems": 200
}

Filter by category and tags

{
  "category": "health",
  "tags": ["covid-19"],
  "maxItems": 100
}

Input Parameters

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| searchQuery | string | "" | Full-text search across dataset titles, descriptions, and tags. Leave empty to browse all datasets. |
| organization | string | "" | Filter by publishing agency. Options include environmental-protection-agency, department-of-commerce, national-aeronautics-and-space-administration, census-bureau, and 16 others. |
| dataFormat | string | "" | Filter by resource format: CSV, JSON, XML, API, XLS, PDF, SHP, GeoJSON, KML, RDF, HTML. |
| category | string | "" | Filter by Data.gov topic: agriculture, business, climate, consumer, ecosystems, education, energy, finance, health, manufacturing, ocean, public-safety, science, weather. |
| tags | string[] | [] | Filter by dataset tags (e.g., health, environment). Datasets must match all specified tags. |
| includeResources | boolean | true | Include individual resource details (file names, formats, URLs) within each dataset. Set to false for faster, leaner output. |
| sortBy | string | "score desc" | Sort order: score desc (relevance), metadata_modified desc (recently modified), metadata_created desc (recently created), name asc (A-Z). |
| maxItems | integer | 100 | Maximum dataset records to return. Set to 0 for unlimited. |
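To make the filter semantics concrete, here is one way the input fields could be translated into CKAN `package_search` parameters. This is an assumption-laden sketch, not the actor's confirmed mapping: `build_search_params` is a hypothetical helper, and the Solr field names (`organization`, `res_format`, `groups`, `tags`) are the conventional CKAN ones, which the actor may or may not use.

```python
def build_search_params(run_input: dict) -> dict:
    """Translate actor-style input into CKAN package_search parameters (assumed mapping)."""
    filters = []
    if run_input.get("organization"):
        filters.append(f'organization:"{run_input["organization"]}"')
    if run_input.get("dataFormat"):
        filters.append(f'res_format:"{run_input["dataFormat"]}"')
    if run_input.get("category"):
        filters.append(f'groups:"{run_input["category"]}"')
    # One clause per tag, so a dataset must carry every requested tag.
    for tag in run_input.get("tags") or []:
        filters.append(f'tags:"{tag}"')

    params = {
        # "*:*" is the Solr match-all query, used when no keyword is given.
        "q": run_input.get("searchQuery") or "*:*",
        "sort": run_input.get("sortBy") or "score desc",
    }
    if filters:
        # AND-joining the clauses intersects all active filters.
        params["fq"] = " AND ".join(filters)
    return params


params = build_search_params({
    "organization": "environmental-protection-agency",
    "dataFormat": "CSV",
})
print(params["fq"])
```

Joining the clauses with `AND` mirrors the documented behavior that every active filter must match.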

Data.gov Crawler Output Fields

Example Output

{
  "dataset_title": "Air Quality System (AQS) Data",
  "dataset_id": "c9310bc1-b224-4527-b3c2-0bd2bb24f455",
  "organization": "Environmental Protection Agency",
  "description": "The Air Quality System contains ambient air quality data collected by EPA, state, local, and tribal air pollution control agencies.",
  "tags": ["air-quality", "environment", "monitoring"],
  "categories": ["Climate", "Science & Research"],
  "update_frequency": "Annually",
  "temporal_coverage": "1980-01-01/2024-12-31",
  "spatial_coverage": "-180.0,18.0,-66.0,72.0",
  "formats": ["CSV", "API", "HTML"],
  "download_url": "https://aqs.epa.gov/aqsweb/airdata/download_files.html",
  "api_endpoint": "https://aqs.epa.gov/data/api",
  "license": "Public Domain",
  "modified_date": "2024-06-20",
  "created_date": "2014-10-09",
  "contact_name": "AQS Helpdesk",
  "contact_email": "aqs@epa.gov",
  "publisher": "US EPA Office of Air Quality",
  "resource_count": 3,
  "resources": [
    "Data CSV | CSV | https://aqs.epa.gov/aqsweb/airdata/download_files.html",
    "API Endpoint | API | https://aqs.epa.gov/data/api",
    "Documentation | HTML | https://aqs.epa.gov/aqsweb/documents.html"
  ],
  "dataset_url": "https://catalog.data.gov/dataset/air-quality-system-aqs-data"
}
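Downstream code will often want the coverage strings as typed values. The sketch below parses them under the assumption that they look like the example record above: an ISO start/end pair separated by `/`, and four comma-separated numbers read here as west, south, east, north (the ordering is a guess from the sample values). Both helper names are hypothetical, and both return `None` for anything else, such as a free-text place description.

```python
from datetime import date
from typing import Optional, Tuple


def parse_temporal(value: str) -> Optional[Tuple[date, date]]:
    """Parse 'YYYY-MM-DD/YYYY-MM-DD' into (start, end); None if it does not fit."""
    try:
        start, end = value.split("/")
        return date.fromisoformat(start), date.fromisoformat(end)
    except (ValueError, AttributeError):
        return None


def parse_bbox(value: str) -> Optional[Tuple[float, float, float, float]]:
    """Parse 'west,south,east,north' (assumed order) into floats; None otherwise."""
    try:
        west, south, east, north = (float(part) for part in value.split(","))
    except (ValueError, AttributeError):
        return None
    return west, south, east, north


record = {
    "temporal_coverage": "1980-01-01/2024-12-31",
    "spatial_coverage": "-180.0,18.0,-66.0,72.0",
}
print(parse_temporal(record["temporal_coverage"]))
print(parse_bbox(record["spatial_coverage"]))
```

Returning `None` instead of raising keeps a bulk pass over thousands of records from stopping on the first descriptive or missing value.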

Output Field Reference

| Field | Type | Description |
| --- | --- | --- |
| dataset_title | string | Title of the dataset |
| dataset_id | string | Unique CKAN dataset identifier |
| organization | string | Publishing organization (federal agency name) |
| description | string | Full description of the dataset |
| tags | string[] | Tags and keywords associated with the dataset |
| categories | string[] | Topic categories (climate, health, energy, etc.) |
| update_frequency | string | How often the dataset is updated (Daily, Monthly, Annually, etc.) |
| temporal_coverage | string | Time period covered (e.g., 2010-01-01/2023-12-31) |
| spatial_coverage | string | Geographic area covered (coordinate bounds or description) |
| formats | string[] | Available resource formats (CSV, JSON, XML, API, etc.) |
| download_url | string | Direct download URL for the primary resource |
| api_endpoint | string | API endpoint URL if the dataset offers an API |
| license | string | License type (Public Domain, Creative Commons, etc.) |
| modified_date | string | Date the metadata was last modified (YYYY-MM-DD) |
| created_date | string | Date the dataset was first published (YYYY-MM-DD) |
| contact_name | string | Dataset contact person or maintainer name |
| contact_email | string | Dataset contact email |
| publisher | string | Publishing sub-agency or office |
| resource_count | number | Number of individual resources (files/APIs) in the dataset |
| resources | string[] | Resource details, one string per resource: name, format, and URL (pipe-separated in the example output) |
| dataset_url | string | URL to the dataset page on catalog.data.gov |
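As an illustration of how a raw CKAN package might be flattened into this schema, the sketch below maps a handful of the fields. It is not the actor's implementation: `flatten_package` is a hypothetical helper, it covers only part of the schema, the input keys (`title`, `name`, `organization.title`, `tags[].name`, `resources[].format`, `metadata_modified`) are the usual CKAN package keys, and omitting `resources` when the toggle is off is an assumption.

```python
def flatten_package(pkg: dict, include_resources: bool = True) -> dict:
    """Flatten a CKAN package dict into a partial output record (illustrative subset)."""
    resources = pkg.get("resources") or []

    # Collect distinct resource formats in first-seen order.
    formats = []
    for res in resources:
        fmt = res.get("format") or ""
        if fmt and fmt not in formats:
            formats.append(fmt)

    record = {
        "dataset_title": pkg.get("title", ""),
        "dataset_id": pkg.get("id", ""),
        "organization": (pkg.get("organization") or {}).get("title", ""),
        "tags": [tag.get("name", "") for tag in pkg.get("tags") or []],
        "formats": formats,
        # Keep only the date part of the ISO timestamp.
        "modified_date": (pkg.get("metadata_modified") or "")[:10],
        "resource_count": len(resources),
        "dataset_url": f"https://catalog.data.gov/dataset/{pkg.get('name', '')}",
    }
    if include_resources:
        # One "name | format | url" string per resource, as in the example output.
        record["resources"] = [
            " | ".join([res.get("name") or "", res.get("format") or "", res.get("url") or ""])
            for res in resources
        ]
    return record
```

Encoding each resource as a single delimited string keeps the record flat, which suits CSV and spreadsheet exports, at the cost of a `split(" | ")` when the parts are needed separately.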

FAQ

How many datasets does Data.gov Crawler cover? Data.gov Crawler queries the full Data.gov catalog — over 400,000 datasets published by federal agencies. If it is listed on catalog.data.gov, the crawler can find it.

Does this crawler need proxies? No. The Data.gov catalog exposes a public CKAN API with no authentication, no rate limits, and no bot detection, so proxies are unnecessary and are disabled by default.

How fast does it run? Roughly 1,000 datasets per 80 seconds. A 100-dataset run completes in under a minute on 256 MB memory. The full 400K+ catalog takes approximately 11 hours.

What is the difference between includeResources on and off? With includeResources enabled (default), each dataset record lists every individual file and API resource — name, format, and URL. Disable it for faster runs when you only need dataset-level metadata.

Can I filter by multiple criteria at once? Yes. Combine any of the filters — organization, format, category, tags, and search query. The CKAN API intersects all active filters, narrowing results to datasets that match every condition.

What data formats are available on Data.gov? Common formats include CSV, JSON, XML, API, Excel (XLS/XLSX), PDF, Shapefile (SHP), GeoJSON, KML, RDF, and HTML. Filter by any of these to find datasets in the format you need.

Need More Features?

Need custom fields, additional filters, or a different government data source? File an issue or get in touch.

Why Use Data.gov Crawler?

  • Full catalog access — 400K+ federal datasets from 20+ agencies, searchable and filterable without navigating catalog.data.gov manually.
  • Structured output — Every record is normalized to the same flat schema with 21 fields, ready for downstream pipelines.
  • Built on the official API — Reads CKAN v3 JSON directly, so it does not break when someone redesigns a webpage.