Pricing

Pay per event

Data.gov Dataset Catalog Crawler

Crawl 300K+ US government datasets from Data.gov. Extract titles, organizations, tags, formats, download URLs, API endpoints, temporal and spatial coverage, contacts, and resources. Filter by agency, format, category, and tags.

Developer

BowTiedRaccoon

Last modified

6 hours ago

Data.gov Crawler Features

Search across 400K+ federal datasets by keyword, organization, format, category, or tag
Filter by 20+ federal agencies including EPA, NASA, Census Bureau, HHS, DOE, and USDA
Filter by resource format: CSV, JSON, XML, API, Excel, PDF, Shapefile, GeoJSON, KML, RDF
Extract download URLs and API endpoints for every resource in a dataset
Includes temporal coverage (date ranges), spatial coverage (geographic bounds), and update frequency
Pulls contact names and emails for dataset maintainers
Toggle resource-level detail on or off — run lean for metadata, or include full file listings
Sort by relevance, recently modified, recently created, or alphabetically
Reads the CKAN v3 JSON API directly — no HTML parsing, no proxy required
Pay-per-event pricing at roughly $0.50 per 1,000 datasets

Who Uses Data.gov Dataset Metadata?

Data scientists and researchers — discover federal datasets by topic, format, and time period without manually browsing catalog.data.gov
Government contractors — identify available datasets from specific agencies for proposal research and compliance reporting
Journalists and civic technologists — find public datasets on health, environment, transportation, or public safety for data-driven stories and apps
Open data advocates — monitor dataset freshness, publication rates, and coverage gaps across federal agencies
B2B data teams — build pipelines that pull structured government data feeds into internal analytics platforms

How Data.gov Crawler Works

You configure search filters: keywords, organization, data format, category, tags, or any combination.
The crawler queries the Data.gov CKAN API with your filters and pages through the matching results, 1,000 per request with a 200 ms courtesy delay between requests, until it reaches maxItems or runs out of results.
Each dataset record is transformed into a clean, flat output object with all metadata fields extracted and normalized.
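
The paging step above can be sketched as a small generator. This is a minimal illustration, not the Actor's actual code: `iter_datasets` and `fetch_page` are hypothetical names, and the sketch assumes the standard CKAN `package_search` action with its `q`, `rows`, and `start` parameters and a response shaped like `{"result": {"count": ..., "results": [...]}}`.

```python
import time
from typing import Callable, Dict, Iterator

# Public CKAN Action API endpoint for the Data.gov catalog.
PACKAGE_SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"
PAGE_SIZE = 1000      # rows per request, as described above
DELAY_SECONDS = 0.2   # 200 ms courtesy delay between requests


def iter_datasets(
    fetch_page: Callable[[Dict[str, object]], dict],
    query: str = "",
    max_items: int = 100,
    delay: float = DELAY_SECONDS,
) -> Iterator[dict]:
    """Yield raw CKAN dataset records, one page at a time.

    fetch_page is a hypothetical callable that sends the given params to
    PACKAGE_SEARCH_URL and returns the decoded JSON response.
    max_items=0 means no limit, mirroring the maxItems input.
    """
    start = 0
    yielded = 0
    while True:
        rows = PAGE_SIZE
        if max_items:
            # Never request more rows than are still needed.
            rows = min(PAGE_SIZE, max_items - yielded)
        params = {"q": query or "*:*", "rows": rows, "start": start}
        result = fetch_page(params)["result"]
        records = result["results"]
        if not records:
            return
        for record in records:
            yield record
            yielded += 1
            if max_items and yielded >= max_items:
                return
        start += len(records)
        if start >= result["count"]:
            return
        time.sleep(delay)
```

Passing the HTTP call in as `fetch_page` keeps the loop testable offline. With the third-party `requests` package installed, one possible callable is `lambda p: requests.get(PACKAGE_SEARCH_URL, params=p, timeout=30).json()`.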

Input

Browse all datasets (default)

{
  "maxItems": 100
}

Search by keyword

{
  "searchQuery": "air quality monitoring",
  "maxItems": 50
}

Filter by agency and format

{
  "organization": "environmental-protection-agency",
  "dataFormat": "CSV",
  "maxItems": 200
}

Filter by category and tags

{
  "category": "health",
  "tags": ["covid-19"],
  "maxItems": 100
}

Input Parameters

Field	Type	Default	Description
searchQuery	string	`""`	Full-text search across dataset titles, descriptions, and tags. Leave empty to browse all datasets.
organization	string	`""`	Filter by publishing agency. Options include `environmental-protection-agency`, `department-of-commerce`, `national-aeronautics-and-space-administration`, `census-bureau`, and 16 others.
dataFormat	string	`""`	Filter by resource format: CSV, JSON, XML, API, XLS, PDF, SHP, GeoJSON, KML, RDF, HTML.
category	string	`""`	Filter by Data.gov topic: agriculture, business, climate, consumer, ecosystems, education, energy, finance, health, manufacturing, ocean, public-safety, science, weather.
tags	string[]	`[]`	Filter by dataset tags (e.g., `health`, `environment`). Datasets must match all specified tags.
includeResources	boolean	`true`	Include individual resource details (file names, formats, URLs) within each dataset. Set to `false` for faster, leaner output.
sortBy	string	`"score desc"`	Sort order: `score desc` (relevance), `metadata_modified desc` (recently modified), `metadata_created desc` (recently created), `name asc` (A-Z).
maxItems	integer	`100`	Maximum dataset records to return. Set to `0` for unlimited.
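
When inputs are generated from a script, it can help to start from the defaults in the table above and override only what changes. The sketch below is one way to do that; `DEFAULT_INPUT`, `SORT_OPTIONS`, and `build_input` are hypothetical helper names, and the validation rules are assumptions drawn from the table, not the Actor's own schema checks.

```python
import json

# Defaults copied from the Input Parameters table.
DEFAULT_INPUT = {
    "searchQuery": "",
    "organization": "",
    "dataFormat": "",
    "category": "",
    "tags": [],
    "includeResources": True,
    "sortBy": "score desc",
    "maxItems": 100,
}

# Sort orders listed in the table.
SORT_OPTIONS = {
    "score desc",
    "metadata_modified desc",
    "metadata_created desc",
    "name asc",
}


def build_input(**overrides):
    """Merge overrides into the documented defaults and sanity-check them."""
    unknown = set(overrides) - set(DEFAULT_INPUT)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    run_input = {**DEFAULT_INPUT, **overrides}
    # Copy the list so callers never mutate the shared default.
    run_input["tags"] = list(run_input["tags"])
    if run_input["sortBy"] not in SORT_OPTIONS:
        raise ValueError(f"unsupported sortBy: {run_input['sortBy']!r}")
    if not isinstance(run_input["maxItems"], int) or run_input["maxItems"] < 0:
        raise ValueError("maxItems must be a non-negative integer (0 = unlimited)")
    return run_input


if __name__ == "__main__":
    # Same filters as the "Filter by agency and format" example.
    print(json.dumps(build_input(
        organization="environmental-protection-agency",
        dataFormat="CSV",
        maxItems=200,
    ), indent=2))
```

The printed JSON can be pasted into the Actor's input editor or passed to whichever client starts the run.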

Data.gov Crawler Output Fields

Example Output

{
  "dataset_title": "Air Quality System (AQS) Data",
  "dataset_id": "c9310bc1-b224-4527-b3c2-0bd2bb24f455",
  "organization": "Environmental Protection Agency",
  "description": "The Air Quality System contains ambient air quality data collected by EPA, state, local, and tribal air pollution control agencies.",
  "tags": ["air-quality", "environment", "monitoring"],
  "categories": ["Climate", "Science & Research"],
  "update_frequency": "Annually",
  "temporal_coverage": "1980-01-01/2024-12-31",
  "spatial_coverage": "-180.0,18.0,-66.0,72.0",
  "formats": ["CSV", "API", "HTML"],
  "download_url": "https://aqs.epa.gov/aqsweb/airdata/download_files.html",
  "api_endpoint": "https://aqs.epa.gov/data/api",
  "license": "Public Domain",
  "modified_date": "2024-06-20",
  "created_date": "2014-10-09",
  "contact_name": "AQS Helpdesk",
  "contact_email": "aqs@epa.gov",
  "publisher": "US EPA Office of Air Quality",
  "resource_count": 3,
  "resources": [
    "Data CSV | CSV | https://aqs.epa.gov/aqsweb/airdata/download_files.html",
    "API Endpoint | API | https://aqs.epa.gov/data/api",
    "Documentation | HTML | https://aqs.epa.gov/aqsweb/documents.html"
  ],
  "dataset_url": "https://catalog.data.gov/dataset/air-quality-system-aqs-data"
}

Output Field Reference

Field	Type	Description
dataset_title	string	Title of the dataset
dataset_id	string	Unique CKAN dataset identifier
organization	string	Publishing organization (federal agency name)
description	string	Full description of the dataset
tags	string[]	Tags and keywords associated with the dataset
categories	string[]	Topic categories (climate, health, energy, etc.)
update_frequency	string	How often the dataset is updated (Daily, Monthly, Annually, etc.)
temporal_coverage	string	Time period covered (e.g., `2010-01-01/2023-12-31`)
spatial_coverage	string	Geographic area covered (coordinate bounds or description)
formats	string[]	Available resource formats (CSV, JSON, XML, API, etc.)
download_url	string	Direct download URL for the primary resource
api_endpoint	string	API endpoint URL if the dataset offers an API
license	string	License type (Public Domain, Creative Commons, etc.)
modified_date	string	Date the metadata was last modified (YYYY-MM-DD)
created_date	string	Date the dataset was first published (YYYY-MM-DD)
contact_name	string	Dataset contact person or maintainer name
contact_email	string	Dataset contact email
publisher	string	Publishing sub-agency or office
resource_count	number	Number of individual resources (files/APIs) in the dataset
resources	string[]	Resource details, one string per resource, in the form `name | format | url` (as in the example output)
dataset_url	string	URL to the dataset page on catalog.data.gov
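
Several output fields pack structure into a single string, so downstream code usually needs to split them. The helpers below are a sketch based only on the example output: `parse_resource`, `parse_temporal`, and `parse_bbox` are hypothetical names, the `start/end` and four-number layouts are inferred from the sample record, and the west, south, east, north ordering of the bounds is an assumption.

```python
from datetime import date
from typing import Dict, Optional, Tuple


def parse_resource(entry: str) -> Dict[str, str]:
    """Split a 'name | format | url' string into its three parts.

    Raises ValueError if the entry has fewer than three parts.
    """
    # rsplit keeps any ' | ' inside the resource name intact.
    name, fmt, url = entry.rsplit(" | ", 2)
    return {"name": name.strip(), "format": fmt.strip(), "url": url.strip()}


def parse_temporal(value: str) -> Optional[Tuple[date, date]]:
    """Parse 'YYYY-MM-DD/YYYY-MM-DD' into (start, end), or None if it does not fit."""
    parts = value.split("/")
    if len(parts) != 2:
        return None
    try:
        return date.fromisoformat(parts[0]), date.fromisoformat(parts[1])
    except ValueError:
        return None


def parse_bbox(value: str) -> Optional[Tuple[float, float, float, float]]:
    """Parse four comma-separated numbers, or return None for free-text coverage."""
    parts = value.split(",")
    if len(parts) != 4:
        return None
    try:
        west, south, east, north = (float(p) for p in parts)
    except ValueError:
        return None
    return (west, south, east, north)
```

Returning `None` instead of raising for the coverage fields reflects the field reference, which says spatial coverage may be a description instead of coordinates.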

FAQ

How many datasets does Data.gov Crawler cover? Data.gov Crawler queries the full Data.gov catalog — over 400,000 datasets published by federal agencies. If it is listed on catalog.data.gov, the crawler can find it.

Does this crawler need proxies? No. The Data.gov catalog is served through a public CKAN API with no authentication, no rate limits, and no bot detection, so proxies are unnecessary and are disabled by default.

How fast does it run? Roughly 1,000 datasets per 80 seconds. A 100-dataset run completes in under a minute on 256 MB memory. At that rate, 400,000 datasets works out to about 9 hours (400 × 80 seconds); allow approximately 11 hours for the full 400K+ catalog.
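
A rough way to budget a run is to extrapolate linearly from two figures quoted in this listing: about 80 seconds and about $0.50 per 1,000 datasets. The sketch below does only that arithmetic. `estimate_run` is a hypothetical helper, and real runs may take longer than a straight-line estimate suggests.

```python
SECONDS_PER_1000 = 80    # throughput figure quoted in this FAQ
USD_PER_1000 = 0.50      # approximate pay-per-event price from the feature list


def estimate_run(datasets: int) -> dict:
    """Linearly extrapolate run time and cost for a given number of datasets."""
    seconds = datasets * SECONDS_PER_1000 / 1000
    return {
        "seconds": seconds,
        "hours": seconds / 3600,
        "usd": datasets * USD_PER_1000 / 1000,
    }


if __name__ == "__main__":
    for n in (100, 10_000, 400_000):
        est = estimate_run(n)
        print(f"{n:>7} datasets: {est['hours']:.2f} h, about ${est['usd']:.2f}")
```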

What is the difference between includeResources on and off? With includeResources enabled (default), each dataset record lists every individual file and API resource — name, format, and URL. Disable it for faster runs when you only need dataset-level metadata.

Can I filter by multiple criteria at once? Yes. Combine any of the filters — organization, format, category, tags, and search query. The CKAN API intersects all active filters, narrowing results to datasets that match every condition.
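
To make the intersection concrete, here is one way the filters could be combined into a single CKAN filter query. This is an assumption-heavy sketch, not the Actor's documented behavior: `build_fq` is a hypothetical helper, the Solr field names (`organization`, `res_format`, `groups`, `tags`) are the ones a stock CKAN index exposes, and the Actor may translate its input values (for example agency slugs or category names) differently before querying.

```python
from typing import Iterable


def _quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_fq(
    organization: str = "",
    data_format: str = "",
    category: str = "",
    tags: Iterable[str] = (),
) -> str:
    """Join every active filter with AND so a dataset must satisfy all of them."""
    clauses = []
    if organization:
        clauses.append(f"organization:{_quote(organization)}")
    if data_format:
        clauses.append(f"res_format:{_quote(data_format)}")
    if category:
        # Assumption: topic categories are stored as CKAN groups.
        clauses.append(f"groups:{_quote(category)}")
    # Each tag gets its own clause, so all listed tags are required.
    clauses.extend(f"tags:{_quote(tag)}" for tag in tags)
    return " AND ".join(clauses)
```

For instance, `build_fq(category="health", tags=["covid-19"])` produces `groups:"health" AND tags:"covid-19"`, which could be sent as the `fq` parameter alongside the free-text `q` search.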

What data formats are available on Data.gov? Common formats include CSV, JSON, XML, API, Excel (XLS/XLSX), PDF, Shapefile (SHP), GeoJSON, KML, RDF, and HTML. Filter by any of these to find datasets in the format you need.

Need More Features?

Need custom fields, additional filters, or a different government data source? File an issue or get in touch.

Why Use Data.gov Crawler?

Full catalog access — 400K+ federal datasets from 20+ agencies, searchable and filterable without navigating catalog.data.gov manually.
Structured output — Every record is normalized to the same flat schema with 21 fields, ready for downstream pipelines.
Built on the official API — Reads CKAN v3 JSON directly, so it does not break when someone redesigns a webpage.

Data.gov Catalog Scraper

crawlerbros/data-gov-catalog-scraper

Scrape the Data.gov catalog (catalog.data.gov). Search 300,000+ open government datasets by keyword, organization, and format. Fetch dataset details or list organizations. No API key required.

Crawler Bros

Data.gov Catalog Scraper

crawlergang/data-gov-catalog-scraper

Scrape the Data.gov catalog (catalog.data.gov). Search 300,000+ open government datasets by keyword, organization, and format. Fetch dataset details or list organizations. No API key required.

Crawler Gang

Data Gov Catalog Scraper

fortuitous_pirate/data-gov-catalog-scraper

Search, filter, and download metadata for 300,000+ federal open datasets from Data.gov. Filter by agency (NASA, EPA, NOAA), format (CSV, JSON, API), tags, and topics. Returns dataset details, resource links, and organization info.

Fortuitous Pirate

Data.gov API - US Open Government Datasets

alizarin_refrigerator-owner/data-gov-api---us-open-government-datasets

Access the Data.gov catalog of 300,000+ US government datasets. Search datasets by topic, agency, format, and keywords. Discover open data from federal, state, and local governments

The Howlers

Data.gov Dataset Search

ryanclinton/datagov-dataset-search

Search and extract metadata from 300,000+ datasets in the official United States government open data catalog at [Data.gov](https://catalog.data.gov/).

Ryan Clinton

Data.gov.uk Scraper - Low-cost💲🔥📚🇬🇧

delectable_incubator/data-gov-uk-scraper-low-cost

Scrape data.gov.uk dataset listings 🔎📊 with a powerful open data scraper. Extract dataset titles, publishers, update dates, descriptions, tags, and dataset URLs from search results. Ideal for government data monitoring, open data research, dataset discovery, and structured data catalog creation 🚀

Prime Scrape

Australia Open Data (data.gov.au) Scraper

parseforge/australia-data-gov-au-scraper

Export Australian government open datasets from data.gov.au. Browse 70k+ datasets across federal, state, and territory agencies. Pull dataset metadata, resources, organization, license, tags, and update frequency. Catalog mode lists all; dataset mode fetches one by ID.

ParseForge

Hong Kong Open Data Scraper

parseforge/data-gov-hk-hong-kong-scraper

Export datasets from data.gov.hk, the Hong Kong government open data portal. Browse the full catalog or fetch specific datasets. Pull titles, organizations, descriptions, tags, update frequency, resource files, formats, licences, and direct download links.

ParseForge

Data.gov.uk Scraper - Cheap 🌐📊🇬🇧

scrapestorm/data-gov-uk-scraper---cheap

🔎 Easily collect dataset listings from data.gov.uk Provide one or multiple search URLs and extract dataset information such as 📄 Dataset Title 🏢 Published By 🕒 Last Updated 📝 Description 🔗 Dataset URL & more Perfect for open data research, government data monitoring & dataset discovery 📊🚀

Storm_Scraper

Gov Contracts Scraper

labrat011/gov-contracts-scraper

Search U.S. federal contract opportunities, awards, and agencies from SAM.gov. Filter by keyword, NAICS code, set-aside type, state, agency, and more. Returns structured data including contacts, deadlines, award amounts, and direct SAM.gov links. Requires a free SAM.gov API key.