Recipe JSON-LD Bulk Harvester

Harvest structured recipe data from any food blog. URL mode: scrape a provided list. Domain mode: auto-discover the sitemap, filter Recipe pages, and crawl them. Extracts name, author, parsed ingredients, instructions, nutrition, and ratings from schema.org/Recipe JSON-LD and hRecipe microformat.

Developer

BowTiedRaccoon

Maintained by Community

Harvest structured recipe data from any food blog or website. Unlike tools that only work on a fixed list of known sites, this actor works on any domain — supply URLs directly or let the actor auto-discover the site's sitemap and find all recipe pages automatically.

What it does

  • URL mode: Provide a list of recipe page URLs (or a text file of URLs) and scrape each one.
  • Domain mode: Provide one or more domain names and the actor fetches robots.txt, discovers the site's sitemap(s), filters pages that look like recipes, and crawls them up to your maxItems limit.

Data is extracted from schema.org/Recipe JSON-LD, the structured-data format that most food blogs publish to qualify for Google recipe rich results, with an hRecipe microformat fallback for legacy sites.
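To illustrate what JSON-LD extraction involves, here is a minimal sketch that uses only the Python standard library. It is not the actor's own code: the names `JsonLdCollector` and `find_recipes` are hypothetical, and the handling of `@graph` wrappers and list-valued `@type` reflects common publishing patterns rather than anything this page documents.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _is_recipe(node):
    # "@type" may be a single string or a list of strings.
    node_type = node.get("@type")
    types = node_type if isinstance(node_type, list) else [node_type]
    return "Recipe" in types


def find_recipes(html):
    """Return every JSON-LD node in the page whose @type includes "Recipe"."""
    collector = JsonLdCollector()
    collector.feed(html)
    recipes = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        # The top level may be one object, a list of objects, or an @graph wrapper.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if not isinstance(item, dict):
                continue
            graph = [n for n in item.get("@graph", []) if isinstance(n, dict)]
            recipes.extend(n for n in [item] + graph if _is_recipe(n))
    return recipes
```

Calling `find_recipes(page_html)` on a page without Recipe markup returns an empty list, which is the situation where a fallback such as hRecipe parsing would apply.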

What you get

Each result record contains:

| Field | Description |
| --- | --- |
| name | Recipe title |
| author | Author name |
| description | Recipe summary |
| recipe_category | Category (e.g. Dessert, Main Course) |
| recipe_cuisine | Cuisine type (e.g. Italian, Mexican) |
| prep_time | Preparation time (ISO 8601, e.g. PT15M) |
| cook_time | Cook time (ISO 8601) |
| total_time | Total time |
| recipe_yield | Servings (e.g. "4 servings") |
| recipe_ingredient | Raw ingredient strings from the page |
| recipe_ingredient_parsed | Structured ingredients, each parsed to "quantity unit item, prep" |
| recipe_instructions | Step-by-step instructions (one per array item) |
| nutrition | Nutrition facts as JSON string (calories, fat, protein, carbs, etc.) |
| aggregate_rating | Star rating (number) |
| rating_count | Number of ratings |
| keywords | Recipe tags/keywords |
| image_urls | Recipe photo URLs |
| video_url | Recipe video URL if present |
| date_published | Publication date (ISO 8601) |
| source_domain | Domain scraped |
| url | Full page URL |
| schema_type | Extraction method: recipe-jsonld, hrecipe-microformat, or none |
| extraction_warnings | Non-fatal issues (missing fields, parse errors) |
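Two of these fields usually need a small conversion before analysis: the ISO 8601 durations and the nutrition JSON string. The sketch below shows one way to handle both. The helper `duration_to_minutes` and the sample record are hypothetical, and the parser covers only the day/hour/minute/second forms that commonly appear in recipe markup.

```python
import json
import re

# Matches durations such as PT15M, PT1H30M, or P1DT2H (no weeks, months, or years).
_DURATION = re.compile(r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$")


def duration_to_minutes(value):
    """Convert an ISO 8601 duration string to minutes, or None if unparseable."""
    match = _DURATION.match(value or "")
    if not match or value in ("P", "PT"):
        return None
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return days * 1440 + hours * 60 + minutes + seconds / 60


# Hypothetical record for illustration; real values come from the actor's dataset.
record = {
    "prep_time": "PT15M",
    "total_time": "PT1H30M",
    "nutrition": '{"calories": "250 kcal"}',
}

prep_minutes = duration_to_minutes(record["prep_time"])
total_minutes = duration_to_minutes(record["total_time"])
# nutrition is delivered as a JSON string, so decode it before reading values.
nutrition = json.loads(record["nutrition"]) if record.get("nutrition") else {}
```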

Structured ingredient parser

The recipe_ingredient_parsed field is the headline feature — it breaks each raw ingredient string into structured components:

"2 cups all-purpose flour, sifted" -> "2 cups all-purpose flour, sifted"
"1/2 tsp kosher salt" -> "0.5 tsp kosher salt"
"1 large egg, at room temperature" -> "1 egg, at room temperature"

Handles Unicode fractions, mixed fractions ("1 1/2"), and common unit abbreviations.
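As a rough illustration of the quantity handling described above, the following sketch normalizes a leading quantity written as an integer, decimal, ASCII fraction, mixed fraction, or Unicode fraction. It is an assumption about how such a parser could work, not the actor's implementation. The function names are hypothetical, and unit abbreviations and size words such as "large" are left untouched.

```python
import unicodedata
from fractions import Fraction

UNICODE_FRACTIONS = "¼½¾⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞"


def _to_fraction(token):
    """Parse '2', '0.5', '1/2', or a single Unicode fraction such as '½'."""
    if len(token) == 1 and token in UNICODE_FRACTIONS:
        return Fraction(unicodedata.numeric(token)).limit_denominator(100)
    return Fraction(token)  # raises ValueError for non-numeric tokens


def split_quantity(raw):
    """Return (quantity, rest); quantity is a float, or None if none was found."""
    text = raw.strip()
    # Pad Unicode fractions with spaces so "1½" tokenizes like "1 ½".
    for ch in UNICODE_FRACTIONS:
        text = text.replace(ch, f" {ch} ")
    tokens = text.split()
    total = Fraction(0)
    consumed = 0
    for token in tokens[:2]:
        try:
            value = _to_fraction(token)
        except (ValueError, ZeroDivisionError):
            break
        # A second numeric token only counts as the fractional part ("1 1/2").
        if consumed == 1 and value >= 1:
            break
        total += value
        consumed += 1
    if consumed == 0:
        return None, raw.strip()
    return float(total), " ".join(tokens[consumed:])


def normalize_ingredient(raw):
    """Rewrite the leading quantity as a decimal and keep the rest of the string."""
    quantity, rest = split_quantity(raw)
    if quantity is None:
        return rest
    return f"{quantity:g} {rest}".strip()
```

With this sketch, `normalize_ingredient("1/2 tsp kosher salt")` gives `"0.5 tsp kosher salt"` and `normalize_ingredient("1 1/2 cups sugar")` gives `"1.5 cups sugar"`.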

Input

URL mode

{
  "urls": [
    "https://www.allrecipes.com/recipe/10813/best-chocolate-chip-cookies/",
    "https://www.simplyrecipes.com/best-easy-roast-chicken-recipe-5207046"
  ],
  "maxItems": 100
}

You can also use requestsFromUrl to point to a plain-text file with one URL per line.

Domain mode

{
  "domains": [
    "www.seriouseats.com",
    "www.kingarthurbaking.com"
  ],
  "maxItems": 500
}

The actor fetches robots.txt from each domain, discovers listed sitemaps (or falls back to /sitemap.xml), traverses sitemap indexes, and filters URLs that look like recipe pages.
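The robots.txt step can be sketched as a pure function over the file's text. This is an illustrative assumption rather than the actor's code: `sitemap_urls_from_robots` is a hypothetical name, the `https` scheme is assumed, and fetching the file is left to the caller.

```python
from urllib.parse import urljoin


def sitemap_urls_from_robots(robots_txt, domain):
    """Return sitemap URLs listed in robots.txt, or the /sitemap.xml fallback."""
    base = f"https://{domain}/"  # assumption: the site is served over HTTPS
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        key, sep, value = line.partition(":")
        # Directive names are case-insensitive: "Sitemap:", "sitemap:", and so on.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(urljoin(base, value.strip()))
    return found or [urljoin(base, "/sitemap.xml")]
```

Each returned URL may be either a sitemap index or a leaf sitemap, so the caller still needs to inspect the XML to decide whether to recurse.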

Input fields

| Field | Type | Description |
| --- | --- | --- |
| urls | array | Recipe page URLs to scrape (URL mode) |
| domains | array | Domains to auto-discover and crawl (domain mode) |
| maxItems | integer | Maximum results to return (0 = unlimited) |
| requestsFromUrl | string | URL of a text file with one recipe URL per line |

Provide either urls (+ optional requestsFromUrl) or domains — not both.
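If you build the input programmatically, a small pre-flight check can catch a mixed-mode input before a run starts. The sketch below is hypothetical: `resolve_mode` is not part of the actor, raising an error on mixed input is an assumption about how you might want to handle it, and treating a missing maxItems as 0 is also an assumption.

```python
def resolve_mode(run_input):
    """Return "url" or "domain" for a valid input dict, else raise ValueError."""
    has_urls = bool(run_input.get("urls")) or bool(run_input.get("requestsFromUrl"))
    has_domains = bool(run_input.get("domains"))
    if has_urls and has_domains:
        raise ValueError("Provide either urls/requestsFromUrl or domains, not both")
    if not (has_urls or has_domains):
        raise ValueError("Provide urls, requestsFromUrl, or domains")

    # Assumption: a missing maxItems is treated like 0 (unlimited).
    max_items = run_input.get("maxItems", 0)
    if isinstance(max_items, bool) or not isinstance(max_items, int) or max_items < 0:
        raise ValueError("maxItems must be a non-negative integer (0 = unlimited)")

    return "domain" if has_domains else "url"
```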

How it works

URL mode — The actor resolves the URL list, crawls each page, and extracts recipe data directly.

Domain mode — For each domain:

  1. Fetch robots.txt to discover sitemap URLs
  2. Fall back to /sitemap.xml if robots.txt lists none
  3. Walk sitemap indexes to find leaf sitemaps
  4. Filter URLs by recipe-path heuristics (path contains /recipe/, slug has 3+ hyphen-separated words, etc.)
  5. Crawl each filtered URL and extract recipe data
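Steps 3 and 4 can be sketched with the standard library as follows. The code is illustrative only: `parse_sitemap` and `looks_like_recipe` are hypothetical names, and the filter implements just the two heuristics named above, while the actor's actual rule set may include others.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def _local(tag):
    """Strip the XML namespace: '{ns}loc' becomes 'loc'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Return (kind, locs); kind is 'sitemapindex' or 'urlset'."""
    root = ET.fromstring(xml_text)
    locs = []
    for entry in root:  # <sitemap> or <url> elements
        for child in entry:  # direct children only, so image:loc entries are skipped
            if _local(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return _local(root.tag), locs


def looks_like_recipe(url):
    """Heuristic: the path contains /recipe/ or the slug has 3+ hyphenated words."""
    path = urlparse(url).path.lower()
    if "/recipe/" in path:
        return True
    slug = path.rstrip("/").rsplit("/", 1)[-1]
    words = [w for w in slug.split("-") if w]
    return len(words) >= 3
```

When `kind` is `sitemapindex`, each entry in `locs` is another sitemap to fetch and parse. When it is `urlset`, the entries are page URLs that can be passed through `looks_like_recipe` before crawling.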

Supported sites

Works on any food blog or cooking site that emits schema.org/Recipe JSON-LD. That covers the vast majority of food sites, since Google requires Recipe structured data for recipe rich results and recommends JSON-LD as the format. This includes:

  • Recipe-plugin-powered WordPress sites (Tasty Recipes, WP Recipe Maker, Recipe Card Blocks, etc.)
  • Major food media (Allrecipes, Simply Recipes, Serious Eats, Food Network, BBC Good Food, etc.)
  • Independent food bloggers
  • Any site using hRecipe microformat (legacy support)

Pricing

Billed per recipe record saved. The default pricing profile charges a small fee per record plus a run start fee.

Notes

  • Rate limiting: The actor applies per-domain rate limiting. Requests to sites that throttle are retried automatically with backoff.
  • Paywalled pages: Pages that return 403 or require login will be skipped with a warning in extraction_warnings.
  • Missing schema: Pages where no Recipe schema is found produce a stub record with schema_type: "none" and a warning.
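Because skipped and schema-less pages still show up in the output, it can help to separate usable records from stubs after a run. The sketch below assumes that records are plain dicts and that extraction_warnings is a list of strings. The `triage` helper and the sample records are hypothetical.

```python
from collections import Counter


def triage(records):
    """Split records into extracted recipes and stubs, and count warnings."""
    extracted = [r for r in records if r.get("schema_type") != "none"]
    stubs = [r for r in records if r.get("schema_type") == "none"]
    # Assumption: extraction_warnings is a list of strings (it may be missing or empty).
    warnings = Counter(
        w for r in records for w in (r.get("extraction_warnings") or [])
    )
    return extracted, stubs, warnings


# Hypothetical records for illustration; the warning text is invented.
sample = [
    {"url": "https://example.com/recipe/1/", "schema_type": "recipe-jsonld",
     "extraction_warnings": []},
    {"url": "https://example.com/about", "schema_type": "none",
     "extraction_warnings": ["no Recipe schema found"]},
]
extracted, stubs, warnings = triage(sample)
```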