Web Content Extractor API — URL to JSON

Extract structured JSON from any webpage. Articles, products, recipes, jobs. Auto-detects content type. Returns metadata, headings, images, links. For AI agents and RAG.

Developer: George Kioko (maintained by Community)

🔍 Web Content Extractor API — URL to Structured JSON

One API call. Any URL. Clean structured JSON. Extract articles, products, recipes, job postings, and more — automatically detected and organized. Built for AI agents, RAG pipelines, and data workflows.


Architecture Overview

flowchart TB
    subgraph Input
        URL[/"URL: any webpage"/]
    end
    subgraph Processing["Extraction Pipeline"]
        FETCH["1. Fetch & Parse HTML"]
        DETECT["2. Auto-Detect Content Type"]
        SCORE["3. Score Content Blocks"]
        EXTRACT["4. Extract Structured Data"]
        ENRICH["5. Enrich with Metadata"]
    end
    subgraph Detection["Content Type Detection"]
        ART["Article"]
        PROD["Product"]
        REC["Recipe"]
        JOB["Job Posting"]
        EVT["Event"]
        WEB["Generic Webpage"]
    end
    subgraph Output["Structured JSON"]
        META["Metadata: title, author, date, image"]
        CONTENT["Content: text, headings, word count"]
        MEDIA["Media: images, links"]
        SCHEMA["JSON-LD Structured Data"]
        TYPED["Type-Specific: price, ingredients, salary..."]
    end
    URL --> FETCH --> DETECT --> SCORE --> EXTRACT --> ENRICH
    DETECT --> ART & PROD & REC & JOB & EVT & WEB
    ENRICH --> META & CONTENT & MEDIA & SCHEMA & TYPED
    style Input fill:#1a1a2e,color:#fff
    style Processing fill:#16213e,color:#fff
    style Detection fill:#0f3460,color:#fff
    style Output fill:#533483,color:#fff

What Makes This Different?

| Feature | This Actor | Typical Scrapers |
| --- | --- | --- |
| Output format | Structured JSON | Raw HTML |
| Content detection | Auto-detects 6 types | Manual configuration |
| Setup time | Zero: just pass a URL | Hours of selector writing |
| AI-ready | Yes: clean text for LLMs | Needs post-processing |
| Batch support | Up to 25 URLs per call | One at a time |
| Response time | 1-3 seconds | 5-30 seconds |

Request Flow

sequenceDiagram
    participant Client as Your App
    participant API as Content Extractor
    participant Web as Target Website
    participant Cache as 30-min Cache
    Client->>API: GET /extract?url=example.com
    API->>Cache: Check cache
    alt Cache Hit
        Cache-->>API: Return cached result
        API-->>Client: JSON response (instant)
    else Cache Miss
        API->>Web: Fetch HTML
        Web-->>API: HTML content
        API->>API: Detect type + Extract + Score
        API->>Cache: Store result
        API-->>Client: Structured JSON (1-3s)
    end
    Note over Client,API: PPE charge: $0.003 per extraction
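
The cache branch in the diagram can be sketched as a small time-to-live cache. This is an illustration only: the Actor's real cache implementation is not documented here, and `TTLCache`, `extract_with_cache`, and keying on the raw URL string are all assumptions made for the example.

```python
import time

CACHE_TTL_SECONDS = 30 * 60  # the 30-minute window named in the diagram


class TTLCache:
    """In-memory cache whose entries expire after a fixed time-to-live."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}     # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)


def extract_with_cache(url, cache, fetch_and_extract):
    """Return (result, was_cached), following the hit/miss branches above."""
    cached = cache.get(url)
    if cached is not None:
        return cached, True
    result = fetch_and_extract(url)  # stand-in for fetch + detect + extract
    cache.set(url, result)
    return result, False
```

Passing the clock in as a parameter keeps the expiry logic deterministic in tests; in normal use the default `time.monotonic` is enough.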

API Endpoints

GET /extract — Extract from URL

GET /extract?url=https://techcrunch.com/2026/03/24/ai-news&format=full

| Parameter | Type | Required | Default | Options |
| --- | --- | --- | --- | --- |
| url | string | Yes | | Any valid URL |
| format | string | No | full | full, article, metadata |
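
A minimal sketch of building the GET request URL in Python with the standard library. The base URL is a placeholder assumption (the Actor's real Standby URL is not given in this document), `build_extract_url` is a hypothetical helper, and any authentication the host requires is left out.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; substitute the Actor's actual Standby URL.
BASE_URL = "https://example-extractor.invalid"

ALLOWED_FORMATS = {"full", "article", "metadata"}


def build_extract_url(target_url: str, fmt: str = "full", base_url: str = BASE_URL) -> str:
    """Return the GET /extract URL with the target URL percent-encoded."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    query = urlencode({"url": target_url, "format": fmt})
    return f"{base_url}/extract?{query}"


print(build_extract_url("https://techcrunch.com/2026/03/24/ai-news"))
```

Percent-encoding the `url` value matters once the target URL carries its own query string; without it, the target's `&` and `=` characters would be read as parameters of the `/extract` call.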

POST /extract — Extract with JSON body

POST /extract
{
  "url": "https://techcrunch.com/2026/03/24/ai-news",
  "format": "article"
}

POST /batch — Extract multiple URLs

POST /batch
{
  "urls": [
    "https://news.ycombinator.com",
    "https://techcrunch.com",
    "https://bbc.com/news"
  ],
  "format": "full"
}
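
For lists longer than the 25-URL batch limit, one option is to split the list client-side and send one POST /batch call per group. The sketch below only builds the requests and does not send them. The base URL is a placeholder assumption, and `chunk_urls` and `build_batch_requests` are hypothetical helper names.

```python
import json
from urllib.request import Request

# Hypothetical placeholder; substitute the Actor's actual Standby URL.
BASE_URL = "https://example-extractor.invalid"
MAX_BATCH = 25  # "Up to 25 URLs per call"


def chunk_urls(urls, size=MAX_BATCH):
    """Split a URL list into consecutive groups of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_batch_requests(urls, fmt="full", base_url=BASE_URL):
    """Return one POST /batch Request per group of up to 25 URLs."""
    requests = []
    for group in chunk_urls(urls):
        body = json.dumps({"urls": group, "format": fmt}).encode("utf-8")
        requests.append(
            Request(
                f"{base_url}/batch",
                data=body,
                headers={"Content-Type": "application/json"},
                method="POST",
            )
        )
    return requests


reqs = build_batch_requests([f"https://example.com/page/{n}" for n in range(60)])
print([len(json.loads(r.data)["urls"]) for r in reqs])  # [25, 25, 10]
```

Each request can then be sent with `urllib.request.urlopen(req, timeout=...)` and the response body parsed with `json.load`.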

GET / — Health check

Returns API status, version, and endpoint documentation.


Content Type Detection

flowchart LR
    HTML["HTML Page"] --> CHECK{"Detect Signals"}
    CHECK -->|"og:type=article<br/>or article tag"| ART["**article**<br/>title, author, date,<br/>full text, headings"]
    CHECK -->|"Schema: Product<br/>or .product-price"| PROD["**product**<br/>name, price, rating,<br/>images, SKU, brand"]
    CHECK -->|"Schema: Recipe<br/>or .recipe"| REC["**recipe**<br/>ingredients, instructions,<br/>prep time, servings"]
    CHECK -->|"Schema: JobPosting<br/>or .job-title"| JOB["**job_posting**<br/>title, company, salary,<br/>location, type"]
    CHECK -->|"Schema: Event<br/>or .event-date"| EVT["**event**<br/>name, date, location,<br/>description"]
    CHECK -->|"No specific<br/>signals found"| WEB["**webpage**<br/>metadata, content,<br/>links, images"]
    style ART fill:#10b981,color:#fff
    style PROD fill:#f59e0b,color:#fff
    style REC fill:#ef4444,color:#fff
    style JOB fill:#3b82f6,color:#fff
    style EVT fill:#8b5cf6,color:#fff
    style WEB fill:#6b7280,color:#fff
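
The signals in the diagram can be approximated with Python's built-in HTML parser. This is a rough sketch under stated assumptions, not the Actor's actual detection code: the rule order (article first, then the diagram's order), the decision to read only top-level JSON-LD `@type` strings, and the names `SignalCollector` and `detect_content_type` are all choices made for illustration.

```python
import json
from html.parser import HTMLParser


class SignalCollector(HTMLParser):
    """Collect og:type, tag names, CSS classes, and JSON-LD @type values."""

    def __init__(self):
        super().__init__()
        self.og_type = None
        self.tags = set()
        self.classes = set()
        self.schema_types = set()
        self._in_jsonld = False
        self._jsonld_buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.tags.add(tag)
        self.classes.update((attrs.get("class") or "").split())
        if tag == "meta" and attrs.get("property") == "og:type":
            self.og_type = attrs.get("content")
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._jsonld_buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._jsonld_buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                doc = json.loads("".join(self._jsonld_buf))
            except ValueError:
                return  # ignore malformed JSON-LD
            for item in doc if isinstance(doc, list) else [doc]:
                if isinstance(item, dict) and isinstance(item.get("@type"), str):
                    self.schema_types.add(item["@type"])


# (output type, schema.org type, CSS class) taken from the diagram's edge labels.
RULES = [
    ("product", "Product", "product-price"),
    ("recipe", "Recipe", "recipe"),
    ("job_posting", "JobPosting", "job-title"),
    ("event", "Event", "event-date"),
]


def detect_content_type(html: str) -> str:
    signals = SignalCollector()
    signals.feed(html)
    # Assumed precedence: article signals win, then the rules in listed order.
    if signals.og_type == "article" or "article" in signals.tags:
        return "article"
    for name, schema_type, css_class in RULES:
        if schema_type in signals.schema_types or css_class in signals.classes:
            return name
    return "webpage"


print(detect_content_type('<div class="card job-title">Engineer</div>'))  # job_posting
```

A production detector would likely also score competing signals, since a product page can legitimately contain an `article` tag.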

Output Examples

Article Extraction

{
  "url": "https://techcrunch.com/2026/03/24/ai-agents",
  "type": "article",
  "metadata": {
    "title": "AI Agents Are Reshaping Enterprise Software",
    "description": "How autonomous AI agents are changing B2B SaaS",
    "author": "Sarah Perez",
    "date": "2026-03-24T10:00:00Z",
    "image": "https://techcrunch.com/hero.jpg",
    "siteName": "TechCrunch",
    "locale": "en-US",
    "canonical": "https://techcrunch.com/2026/03/24/ai-agents",
    "keywords": ["AI", "agents", "enterprise", "SaaS"]
  },
  "content": {
    "text": "The rise of AI agents represents a fundamental shift in how enterprise software operates. Unlike traditional chatbots...",
    "headings": [
      { "level": 2, "text": "What Are AI Agents?" },
      { "level": 2, "text": "The Enterprise Impact" },
      { "level": 3, "text": "Case Study: Salesforce" }
    ],
    "wordCount": 2847
  },
  "media": {
    "images": [
      { "src": "https://techcrunch.com/diagram.png", "alt": "AI agent architecture" }
    ],
    "links": [
      { "href": "https://openai.com/agents", "text": "OpenAI's agent framework" }
    ]
  },
  "structuredData": [{ "@type": "NewsArticle", "headline": "..." }],
  "extractedAt": "2026-03-24T12:34:56.789Z"
}
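
As an example of consuming this response, the sketch below turns the `headings` array into an indented outline. `render_outline` is a hypothetical helper, and the `.get` defaults assume that some fields may be absent for other content types.

```python
def render_outline(result: dict) -> str:
    """Render the page title followed by its headings, indented by level."""
    title = result.get("metadata", {}).get("title", "(untitled)")
    lines = [title]
    for heading in result.get("content", {}).get("headings", []):
        indent = "  " * (heading["level"] - 1)
        lines.append(f"{indent}{heading['text']}")
    return "\n".join(lines)


# Trimmed copy of the article response shown above.
sample = {
    "metadata": {"title": "AI Agents Are Reshaping Enterprise Software"},
    "content": {
        "headings": [
            {"level": 2, "text": "What Are AI Agents?"},
            {"level": 2, "text": "The Enterprise Impact"},
            {"level": 3, "text": "Case Study: Salesforce"},
        ],
        "wordCount": 2847,
    },
}

print(render_outline(sample))
```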

Product Extraction

{
  "url": "https://store.example.com/product/widget-pro",
  "type": "product",
  "metadata": { "title": "Widget Pro - Best Seller", "siteName": "Example Store" },
  "content": { "text": "The Widget Pro is our most popular...", "wordCount": 342 },
  "product": {
    "name": "Widget Pro",
    "price": "$49.99",
    "currency": "USD",
    "availability": "InStock",
    "rating": "4.8",
    "reviewCount": "1,247",
    "brand": "WidgetCo",
    "sku": "WP-2026",
    "images": ["https://store.example.com/widget-pro-1.jpg"]
  }
}
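
In this example `price`, `rating`, and `reviewCount` arrive as strings, so numeric comparisons need a conversion step first. The sketch below is one way to do that, assuming US-style formatting (`.` as the decimal separator, `,` for thousands). The helper names are hypothetical, and prices formatted like `1.247,00` would need different handling.

```python
import re
from decimal import Decimal


def parse_price(price: str) -> Decimal:
    """'$49.99' -> Decimal('49.99'); strips currency symbols and thousands commas."""
    return Decimal(re.sub(r"[^\d.]", "", price))


def parse_count(count: str) -> int:
    """'1,247' -> 1247."""
    return int(re.sub(r"\D", "", count))


def normalize_product(product: dict) -> dict:
    """Return a copy of the product object with numeric fields converted."""
    return {
        **product,
        "price": parse_price(product["price"]),
        "rating": float(product["rating"]),
        "reviewCount": parse_count(product["reviewCount"]),
    }


normalized = normalize_product(
    {"name": "Widget Pro", "price": "$49.99", "rating": "4.8", "reviewCount": "1,247"}
)
print(normalized["price"], normalized["rating"], normalized["reviewCount"])
```

`Decimal` is used for the price so that later arithmetic does not pick up binary floating-point rounding.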

Use Case Workflows

RAG Pipeline Integration

flowchart LR
    URLs["URL List<br/>100+ sources"] --> EXTRACT["Web Content<br/>Extractor API"]
    EXTRACT --> TEXT["Clean Text<br/>+ Metadata"]
    TEXT --> CHUNK["Text Chunking<br/>(LangChain)"]
    CHUNK --> EMBED["Embeddings<br/>(OpenAI)"]
    EMBED --> VECTOR["Vector DB<br/>(Pinecone)"]
    VECTOR --> RAG["RAG Query<br/>Engine"]
    RAG --> ANSWER["AI-Powered<br/>Answers"]
    style EXTRACT fill:#10b981,color:#fff
    style RAG fill:#3b82f6,color:#fff
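
The chunking step can be sketched without any framework. The code below is a plain-Python stand-in for the LangChain splitter named in the diagram, not LangChain itself: it splits `content.text` into overlapping word windows and attaches the source URL and title to each chunk. The function names and the default sizes are assumptions for illustration.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def to_documents(result: dict, **chunk_options) -> list:
    """Turn one extraction result into chunk records ready for embedding."""
    meta = result.get("metadata", {})
    text = result.get("content", {}).get("text", "")
    return [
        {"text": chunk, "source": result.get("url"), "title": meta.get("title"), "chunk": i}
        for i, chunk in enumerate(chunk_words(text, **chunk_options))
    ]


docs = to_documents(
    {"url": "https://example.com/a", "metadata": {"title": "T"}, "content": {"text": "a b c"}},
    chunk_size=2,
    overlap=0,
)
print(docs)
```

Keeping `source` and `title` on every chunk lets the query engine cite where an answer came from after retrieval.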

Competitive Intelligence Pipeline

flowchart LR
    COMP["Competitor<br/>URLs"] --> EXTRACT["Web Content<br/>Extractor API"]
    EXTRACT --> PROD["Product Data:<br/>prices, features"]
    EXTRACT --> NEWS["News & Blog:<br/>announcements"]
    PROD --> DASH["Analytics<br/>Dashboard"]
    NEWS --> ALERT["Email<br/>Alerts"]
    style EXTRACT fill:#10b981,color:#fff

Pricing

| Event | Price per call | Cost per 1,000 |
| --- | --- | --- |
| Content extraction | $0.003 | $3.00 |
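
The arithmetic behind the table, as a short sketch. It assumes every extracted URL is billed as one event at $0.003; whether cache hits or individual URLs inside a batch call are billed differently is not stated here, so treat the result as an estimate. The function names are hypothetical.

```python
from decimal import Decimal

PRICE_PER_EXTRACTION = Decimal("0.003")  # USD, from the pricing table


def extraction_cost(n_urls: int) -> Decimal:
    """Estimated cost in USD for n_urls extractions."""
    return PRICE_PER_EXTRACTION * n_urls


def urls_for_budget(budget_usd: str) -> int:
    """How many extractions fit in a budget given as a decimal string."""
    return int(Decimal(budget_usd) // PRICE_PER_EXTRACTION)


print(extraction_cost(1000))  # 3.000
print(urls_for_budget("10"))  # 3333
```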

Cost Comparison

| Solution | Cost per 1,000 URLs | Setup Time |
| --- | --- | --- |
| This Actor | $3.00 | 0 minutes |
| Diffbot | $299/month flat | Hours |
| Custom scraper | $50+ developer hours | Days |
| Manual copy-paste | 40+ hours labor | Forever |

Integrations

| Platform | How to Connect |
| --- | --- |
| LangChain | Use as Document Loader via HTTP |
| LlamaIndex | Custom reader pointing to /extract |
| Zapier | Webhook trigger -> GET /extract |
| Make (Integromat) | HTTP module -> POST /extract |
| n8n | HTTP Request node |
| Apify Orchestrator | Direct actor call or Standby URL |

FAQ

Q: How fast is extraction? A: 1-3 seconds for a single URL. A batch call processes up to 25 URLs in parallel.

Q: Does it handle paywalled content? A: It extracts whatever is publicly visible in the HTML. Paywalled content behind JavaScript auth won't be extracted.

Q: What about JavaScript-rendered pages (SPAs)? A: The current version works on the HTML returned by the server and does not render JavaScript. For JS-heavy pages, pair it with our Screenshot & PDF API.

Q: Is there a rate limit? A: No hard rate limit. Apify Standby handles concurrent requests automatically.

Q: What languages are supported? A: Any language. The extractor works with HTML structure, not language-specific parsing.



Built by George Kioko | 6,196+ data extraction jobs completed | 35+ production APIs