Recipe JSON-LD Bulk Harvester
Pricing
Pay per event
Harvest structured recipe data from any food blog. URL mode: scrape a provided list. Domain mode: auto-discover the sitemap, filter Recipe pages, and crawl them. Extracts name, author, parsed ingredients, instructions, nutrition, and ratings from schema.org/Recipe JSON-LD and hRecipe microformat.
Developer
BowTiedRaccoon
Maintained by Community
Harvest structured recipe data from any food blog or website. Unlike tools that only work on a fixed list of known sites, this actor works on any domain — supply URLs directly or let the actor auto-discover the site's sitemap and find all recipe pages automatically.
What it does
- URL mode: Provide a list of recipe page URLs (or a text file of URLs) and scrape each one.
- Domain mode: Provide one or more domain names and the actor fetches `robots.txt`, discovers the site's sitemap(s), filters pages that look like recipes, and crawls them up to your `maxItems` limit.
Data is extracted from schema.org/Recipe JSON-LD (the structured-data format most food blogs publish for Google rich results), with an hRecipe microformat fallback for legacy sites.
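A minimal sketch of the JSON-LD step, assuming the page embeds its recipe in a `<script type="application/ld+json">` block, either at the top level or inside an `@graph` array. The names `JsonLdCollector` and `find_recipe` are hypothetical and are not the actor's internals. The hRecipe fallback is not shown.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _is_recipe(node):
    # @type may be a single string or a list of strings.
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return "Recipe" in types


def _walk(node):
    """Yield every dict in a parsed JSON-LD value, including @graph members."""
    if isinstance(node, list):
        for item in node:
            yield from _walk(item)
    elif isinstance(node, dict):
        yield node
        yield from _walk(node.get("@graph", []))


def find_recipe(html):
    """Return the first schema.org/Recipe object in the page, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed block: skip it and try the next one
        for node in _walk(data):
            if _is_recipe(node):
                return node
    return None
```

A page with no matching block returns `None`, which corresponds to the `schema_type` value of none described in the field table.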
What you get
Each result record contains:
| Field | Description |
|---|---|
| `name` | Recipe title |
| `author` | Author name |
| `description` | Recipe summary |
| `recipe_category` | Category (e.g. Dessert, Main Course) |
| `recipe_cuisine` | Cuisine type (e.g. Italian, Mexican) |
| `prep_time` | Preparation time (ISO 8601, e.g. PT15M) |
| `cook_time` | Cook time (ISO 8601) |
| `total_time` | Total time |
| `recipe_yield` | Servings (e.g. "4 servings") |
| `recipe_ingredient` | Raw ingredient strings from the page |
| `recipe_ingredient_parsed` | Structured ingredients, each parsed to "quantity unit item, prep" |
| `recipe_instructions` | Step-by-step instructions (one per array item) |
| `nutrition` | Nutrition facts as JSON string (calories, fat, protein, carbs, etc.) |
| `aggregate_rating` | Star rating (number) |
| `rating_count` | Number of ratings |
| `keywords` | Recipe tags/keywords |
| `image_urls` | Recipe photo URLs |
| `video_url` | Recipe video URL if present |
| `date_published` | Publication date (ISO 8601) |
| `source_domain` | Domain scraped |
| `url` | Full page URL |
| `schema_type` | Extraction method: `recipe-jsonld`, `hrecipe-microformat`, or `none` |
| `extraction_warnings` | Non-fatal issues (missing fields, parse errors) |
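When consuming records downstream, the ISO 8601 durations and the `nutrition` JSON string usually need decoding first. The sketch below shows one way to do that. The sample record and the `calories` key are made up for illustration, and `duration_to_minutes` is a hypothetical helper that only handles day/hour/minute/second durations.

```python
import json
import re

# ISO 8601 durations limited to days/hours/minutes/seconds (e.g. PT15M, PT1H30M).
_DURATION = re.compile(r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$")


def duration_to_minutes(value):
    """Convert a duration such as 'PT1H30M' to minutes; None if unparseable."""
    match = _DURATION.match(value or "")
    if not match or not any(match.groups()):
        return None
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return days * 1440 + hours * 60 + minutes + seconds / 60


# Hypothetical record: values and nutrition keys are invented for this example.
record = {
    "name": "Example Cookies",
    "prep_time": "PT15M",
    "cook_time": "PT1H30M",
    "nutrition": '{"calories": "210 kcal"}',
}

prep = duration_to_minutes(record["prep_time"])
cook = duration_to_minutes(record["cook_time"])
# nutrition arrives as a JSON string, so decode it before reading keys.
nutrition = json.loads(record["nutrition"]) if record["nutrition"] else {}
```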
Structured ingredient parser
The `recipe_ingredient_parsed` field is the headline feature. It breaks each raw ingredient string into structured components:
- "2 cups all-purpose flour, sifted" -> "2 cups all-purpose flour, sifted"
- "1/2 tsp kosher salt" -> "0.5 tsp kosher salt"
- "1 large egg, at room temperature" -> "1 egg, at room temperature"
Handles Unicode fractions, mixed fractions ("1 1/2"), and common unit abbreviations.
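To illustrate the kind of parsing involved, here is a rough sketch that handles Unicode fractions, mixed fractions, and a handful of unit abbreviations. It is not the actor's parser: the `UNITS` table and the `parse_ingredient` function are hypothetical, it returns a dict rather than the actor's "quantity unit item, prep" string, and it does not drop size words such as "large".

```python
import re
import unicodedata
from fractions import Fraction

# Hypothetical unit table; a real parser would need many more entries.
UNITS = {
    "cup": "cup", "cups": "cup",
    "tsp": "tsp", "teaspoon": "tsp", "teaspoons": "tsp",
    "tbsp": "tbsp", "tablespoon": "tbsp", "tablespoons": "tbsp",
    "oz": "oz", "ounce": "oz", "ounces": "oz",
    "g": "g", "gram": "g", "grams": "g",
}


def _number(token):
    """Return the token as a Fraction, or None if it is not numeric."""
    if len(token) == 1 and unicodedata.category(token) == "No":  # e.g. "½"
        return Fraction(unicodedata.numeric(token)).limit_denominator(100)
    if re.fullmatch(r"\d+/[1-9]\d*", token):  # e.g. "1/2"
        return Fraction(token)
    if re.fullmatch(r"\d+(?:\.\d+)?", token):  # e.g. "2" or "1.5"
        return Fraction(token)
    return None


def parse_ingredient(raw):
    # Separate a digit from an attached vulgar fraction: "1½" -> "1 ½".
    text = re.sub(r"(\d)([\u00bc-\u00be\u2150-\u215e])", r"\1 \2", raw.strip())
    tokens = text.split()

    quantity, consumed = Fraction(0), 0
    for token in tokens[:2]:
        value = _number(token)
        if value is None:
            break
        # A second number only counts when it is a proper fraction ("1 1/2").
        if consumed == 1 and value >= 1:
            break
        quantity += value
        consumed += 1
    tokens = tokens[consumed:]

    unit = None
    if tokens and tokens[0].lower().rstrip(".") in UNITS:
        unit = UNITS[tokens.pop(0).lower().rstrip(".")]

    item, _, prep = " ".join(tokens).partition(",")
    return {
        "quantity": float(quantity) if consumed else None,
        "unit": unit,
        "item": item.strip(),
        "prep": prep.strip() or None,
    }
```

Using `Fraction` for the intermediate sum keeps "1 1/2" exact until the final conversion to a float.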
Input
URL mode
    {
      "urls": [
        "https://www.allrecipes.com/recipe/10813/best-chocolate-chip-cookies/",
        "https://www.simplyrecipes.com/best-easy-roast-chicken-recipe-5207046"
      ],
      "maxItems": 100
    }
You can also use requestsFromUrl to point to a plain-text file with one URL per line.
Domain mode
    {
      "domains": [
        "www.seriouseats.com",
        "www.kingarthurbaking.com"
      ],
      "maxItems": 500
    }
The actor fetches robots.txt from each domain, discovers listed sitemaps (or falls back to /sitemap.xml), traverses sitemap indexes, and filters URLs that look like recipe pages.
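The discovery step can be sketched as follows, assuming the robots.txt body has already been downloaded and the site is served over HTTPS. `sitemaps_from_robots` is a hypothetical name, not part of the actor.

```python
from urllib.parse import urljoin


def sitemaps_from_robots(robots_txt, domain):
    """Return sitemap URLs listed in robots.txt, or the /sitemap.xml fallback."""
    base = f"https://{domain}/"  # assumption: HTTPS
    found = []
    for line in robots_txt.splitlines():
        # "Sitemap: https://..." splits at the first colon only.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            # urljoin leaves absolute URLs alone and resolves relative ones.
            found.append(urljoin(base, value.strip()))
    return found or [urljoin(base, "/sitemap.xml")]
```

The directive name is matched case-insensitively because sites write it as both `Sitemap:` and `sitemap:`.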
Input fields
| Field | Type | Description |
|---|---|---|
| `urls` | array | Recipe page URLs to scrape (URL mode) |
| `domains` | array | Domains to auto-discover and crawl (domain mode) |
| `maxItems` | integer | Maximum results to return (0 = unlimited) |
| `requestsFromUrl` | string | URL of a text file with one recipe URL per line |
Provide either urls (+ optional requestsFromUrl) or domains — not both.
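If you build inputs programmatically, a small pre-flight check can catch a mixed or empty input before starting a run. This is a sketch of the rules stated above, not the actor's own validation. The helper names are hypothetical, and treating a missing `maxItems` as 0 is an assumption.

```python
def resolve_mode(actor_input):
    """Return 'url' or 'domain', rejecting inputs that mix or omit both."""
    has_urls = bool(actor_input.get("urls") or actor_input.get("requestsFromUrl"))
    has_domains = bool(actor_input.get("domains"))
    if has_urls and has_domains:
        raise ValueError("Provide either urls/requestsFromUrl or domains, not both")
    if not (has_urls or has_domains):
        raise ValueError("Provide urls, requestsFromUrl, or domains")
    return "domain" if has_domains else "url"


def effective_limit(actor_input):
    """Map maxItems to a cap; 0 means unlimited and is returned as None."""
    max_items = int(actor_input.get("maxItems", 0))  # assumption: default is 0
    if max_items < 0:
        raise ValueError("maxItems must be 0 or greater")
    return max_items or None


def parse_url_list(text):
    """Split a plain-text body into URLs, one per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```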
How it works
URL mode — The actor resolves the URL list, crawls each page, and extracts recipe data directly.
Domain mode — For each domain:
- Fetch `robots.txt` to discover sitemap URLs
- Fall back to `/sitemap.xml` if robots.txt lists none
- Walk sitemap indexes to find leaf sitemaps
- Filter URLs by recipe-path heuristics (path contains `/recipe/`, slug has 3+ hyphen-separated words, etc.)
- Crawl each filtered URL and extract recipe data
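The sitemap walk and the URL filter above can be sketched as follows. Only the two heuristics named in the list are implemented; whatever else the actor checks is not documented here. `parse_sitemap` and `looks_like_recipe` are hypothetical names, and the XML is assumed to be already downloaded and decompressed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def _local(tag):
    """Strip the XML namespace: '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Return (kind, urls); kind is 'index' for a sitemap index, else 'urlset'."""
    root = ET.fromstring(xml_text)
    kind = "index" if _local(root.tag) == "sitemapindex" else "urlset"
    urls = []
    for entry in root:  # <sitemap> or <url> elements
        for child in entry:  # direct children only, so nested image locs are ignored
            if _local(child.tag) == "loc" and child.text:
                urls.append(child.text.strip())
    return kind, urls


def looks_like_recipe(url):
    """Loose heuristic: '/recipe/' in the path, or a slug of 3+ hyphenated words."""
    path = urlparse(url).path.lower()
    if "/recipe/" in path:
        return True
    segments = [s for s in path.split("/") if s]
    if not segments:
        return False
    words = [w for w in segments[-1].split("-") if w]
    return len(words) >= 3
```

A crawler would call `parse_sitemap` on each discovered sitemap, recurse into the returned URLs when the kind is `index`, and pass the leaf URLs through `looks_like_recipe`. The slug rule is deliberately loose, so non-recipe posts with long slugs will pass and only get rejected once no Recipe schema is found on the page.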
Supported sites
Works on any food blog or cooking site that emits schema.org/Recipe JSON-LD. That covers most food sites, because Google's recipe rich results depend on Recipe structured data and JSON-LD is the format Google recommends. This includes:
- Recipe-plugin-powered WordPress sites (Tasty Recipes, WP Recipe Maker, Recipe Card Blocks, etc.)
- Major food media (Allrecipes, Simply Recipes, Serious Eats, Food Network, BBC Good Food, etc.)
- Independent food bloggers
- Any site using hRecipe microformat (legacy support)
Pricing
Billed per recipe record saved. The default pricing profile charges a small fee per record plus a run start fee.
Notes
- Rate limiting: The actor respects per-domain rate limiting — sites that throttle will be retried with backoff automatically.
- Paywalled pages: Pages that return 403 or require login will be skipped with a warning in `extraction_warnings`.
- Missing schema: Pages where no Recipe schema is found produce a stub record with `schema_type: "none"` and a warning.
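For readers writing their own client around a similar crawl, the retry and skip behaviour described in these notes might look like the sketch below. The actor's actual retry counts, delays, and status handling are not documented here, so `RETRYABLE`, `SKIPPED`, the exponential schedule, and `fetch_with_backoff` are all assumptions.

```python
import time

RETRYABLE = {429, 503}  # assumption: statuses treated as throttling
SKIPPED = {401, 403}    # assumption: treated as paywalled or login-required


def fetch_with_backoff(fetch, url, max_retries=3, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body), retrying throttled responses.

    Returns (body_or_None, warnings).
    """
    warnings = []
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status in SKIPPED:
            warnings.append(f"skipped: HTTP {status}")
            return None, warnings
        if status not in RETRYABLE:
            return body, warnings
        if attempt < max_retries:
            # Exponential backoff: base, 2x base, 4x base, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    warnings.append(f"gave up after {max_retries} retries")
    return None, warnings
```

Passing `sleep` as a parameter keeps the function testable without real delays.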