Web Content Extractor API — URL to JSON
Pricing
Pay per usage
Extract structured JSON from any webpage. Articles, products, recipes, jobs. Auto-detects content type. Returns metadata, headings, images, links. For AI agents and RAG.
Developer
George Kioko
🔍 Web Content Extractor API — URL to Structured JSON
One API call. Any URL. Clean structured JSON. Extract articles, products, recipes, job postings, and more — automatically detected and organized. Built for AI agents, RAG pipelines, and data workflows.
Architecture Overview
```mermaid
flowchart TB
    subgraph Input
        URL[/"URL: any webpage"/]
    end
    subgraph Processing["Extraction Pipeline"]
        FETCH["1. Fetch & Parse HTML"]
        DETECT["2. Auto-Detect Content Type"]
        SCORE["3. Score Content Blocks"]
        EXTRACT["4. Extract Structured Data"]
        ENRICH["5. Enrich with Metadata"]
    end
    subgraph Detection["Content Type Detection"]
        ART["Article"]
        PROD["Product"]
        REC["Recipe"]
        JOB["Job Posting"]
        EVT["Event"]
        WEB["Generic Webpage"]
    end
    subgraph Output["Structured JSON"]
        META["Metadata: title, author, date, image"]
        CONTENT["Content: text, headings, word count"]
        MEDIA["Media: images, links"]
        SCHEMA["JSON-LD Structured Data"]
        TYPED["Type-Specific: price, ingredients, salary..."]
    end
    URL --> FETCH --> DETECT --> SCORE --> EXTRACT --> ENRICH
    DETECT --> ART & PROD & REC & JOB & EVT & WEB
    ENRICH --> META & CONTENT & MEDIA & SCHEMA & TYPED
    style Input fill:#1a1a2e,color:#fff
    style Processing fill:#16213e,color:#fff
    style Detection fill:#0f3460,color:#fff
    style Output fill:#533483,color:#fff
```
What Makes This Different?
| Feature | This Actor | Typical Scrapers |
|---|---|---|
| Output format | Structured JSON | Raw HTML |
| Content detection | Auto-detects 6 types | Manual configuration |
| Setup time | Zero — just pass URL | Hours of selector writing |
| AI-ready | Yes — clean text for LLMs | Needs post-processing |
| Batch support | Up to 25 URLs per call | One at a time |
| Response time | 1-3 seconds | 5-30 seconds |
Request Flow
```mermaid
sequenceDiagram
    participant Client as Your App
    participant API as Content Extractor
    participant Web as Target Website
    participant Cache as 30-min Cache
    Client->>API: GET /extract?url=example.com
    API->>Cache: Check cache
    alt Cache Hit
        Cache-->>API: Return cached result
        API-->>Client: JSON response (instant)
    else Cache Miss
        API->>Web: Fetch HTML
        Web-->>API: HTML content
        API->>API: Detect type + Extract + Score
        API->>Cache: Store result
        API-->>Client: Structured JSON (1-3s)
    end
    Note over Client,API: PPE charge: $0.003 per extraction
```
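The cache branch in the diagram above can be pictured as a small time-to-live cache keyed by URL. The following is only a sketch of that idea, not the actor's actual implementation: the class name `TTLCache`, the `get_or_compute` method, and the injectable `clock` are all hypothetical, and the only detail taken from the diagram is the 30-minute window.

```python
import time
from typing import Any, Callable, Dict, Tuple

CACHE_TTL_SECONDS = 30 * 60  # the 30-minute window shown in the diagram


class TTLCache:
    """Hypothetical in-memory cache: entries expire after `ttl` seconds."""

    def __init__(self, ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[str], Any]) -> Tuple[Any, bool]:
        """Return (value, was_cache_hit); call `compute` only on a miss."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1], True   # cache hit: reuse the stored result
        value = compute(key)        # cache miss: fetch, detect, extract
        self._entries[key] = (now, value)
        return value, False
```

In this sketch a repeated request for the same URL inside the window skips the fetch entirely, which matches the "instant" response on the cache-hit path.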
API Endpoints
GET /extract — Extract from URL
```
GET /extract?url=https://techcrunch.com/2026/03/24/ai-news&format=full
```

| Parameter | Type | Required | Default | Options |
|---|---|---|---|---|
| url | string | Yes | (none) | Any valid URL |
| format | string | No | full | full, article, metadata |
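When building the GET request in code, the target URL should be percent-encoded so that its own `?` and `&` characters do not leak into the query string. A minimal sketch follows, assuming a placeholder base URL (`BASE_URL` is hypothetical; substitute the URL of your own deployment) and a hypothetical helper name `build_extract_url`. Authentication, if your deployment requires it, is left out.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; replace with the base URL of your deployment.
BASE_URL = "https://example-extractor.invalid"

# The three values listed in the parameter table.
ALLOWED_FORMATS = {"full", "article", "metadata"}


def build_extract_url(target_url: str, fmt: str = "full",
                      base_url: str = BASE_URL) -> str:
    """Return the GET /extract request URL for one target page."""
    if not target_url:
        raise ValueError("url is required")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}, got {fmt!r}")
    # urlencode percent-encodes the target URL so it survives as one parameter.
    query = urlencode({"url": target_url, "format": fmt})
    return f"{base_url.rstrip('/')}/extract?{query}"
```

The resulting string can be passed to any HTTP client, for example `urllib.request.urlopen`.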
POST /extract — Extract with JSON body
```
POST /extract
{
  "url": "https://techcrunch.com/2026/03/24/ai-news",
  "format": "article"
}
```
POST /batch — Extract multiple URLs
```
POST /batch
{
  "urls": [
    "https://news.ycombinator.com",
    "https://techcrunch.com",
    "https://bbc.com/news"
  ],
  "format": "full"
}
```
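Because a batch call accepts up to 25 URLs, a longer list has to be split into several request bodies on the client side. The sketch below shows one way to do that; the helper name `batch_payloads` is hypothetical, and the body shape simply mirrors the `POST /batch` example above.

```python
from typing import Dict, Iterator, List

MAX_BATCH_SIZE = 25  # per-call limit stated in the comparison table


def batch_payloads(urls: List[str], fmt: str = "full",
                   size: int = MAX_BATCH_SIZE) -> Iterator[Dict[str, object]]:
    """Yield POST /batch request bodies, each holding at most `size` URLs."""
    if not 1 <= size <= MAX_BATCH_SIZE:
        raise ValueError(f"size must be between 1 and {MAX_BATCH_SIZE}")
    for start in range(0, len(urls), size):
        yield {"urls": urls[start:start + size], "format": fmt}
```

Each yielded dictionary can be serialized with `json.dumps` and sent as the body of one `POST /batch` call.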
GET / — Health check
Returns API status, version, and endpoint documentation.
Content Type Detection
```mermaid
flowchart LR
    HTML["HTML Page"] --> CHECK{"Detect Signals"}
    CHECK -->|"og:type=article<br/>or article tag"| ART["**article**<br/>title, author, date,<br/>full text, headings"]
    CHECK -->|"Schema: Product<br/>or .product-price"| PROD["**product**<br/>name, price, rating,<br/>images, SKU, brand"]
    CHECK -->|"Schema: Recipe<br/>or .recipe"| REC["**recipe**<br/>ingredients, instructions,<br/>prep time, servings"]
    CHECK -->|"Schema: JobPosting<br/>or .job-title"| JOB["**job_posting**<br/>title, company, salary,<br/>location, type"]
    CHECK -->|"Schema: Event<br/>or .event-date"| EVT["**event**<br/>name, date, location,<br/>description"]
    CHECK -->|"No specific<br/>signals found"| WEB["**webpage**<br/>metadata, content,<br/>links, images"]
    style ART fill:#10b981,color:#fff
    style PROD fill:#f59e0b,color:#fff
    style REC fill:#ef4444,color:#fff
    style JOB fill:#3b82f6,color:#fff
    style EVT fill:#8b5cf6,color:#fff
    style WEB fill:#6b7280,color:#fff
```
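To make the signals in the diagram concrete, here is a simplified illustration of how such detection could work using only the Python standard library. It is a sketch under stated assumptions, not the actor's actual code: the names `SignalCollector` and `detect_content_type` are hypothetical, and the priority order (schema types first, then CSS class hints, then article signals) is an assumption, since the diagram does not specify one.

```python
import json
from html.parser import HTMLParser

# Signals taken from the diagram; the lookup order is an assumption.
SCHEMA_TYPES = {"Product": "product", "Recipe": "recipe",
                "JobPosting": "job_posting", "Event": "event"}
CLASS_HINTS = {"product-price": "product", "recipe": "recipe",
               "job-title": "job_posting", "event-date": "event"}


class SignalCollector(HTMLParser):
    """Collects JSON-LD @type values, CSS classes, og:type and <article> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.schema_types, self.classes = set(), set()
        self.og_type, self.has_article_tag = None, False
        self._in_jsonld, self._buffer = False, []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        self.classes.update((attr.get("class") or "").split())
        if tag == "article":
            self.has_article_tag = True
        elif tag == "meta" and attr.get("property") == "og:type":
            self.og_type = attr.get("content")
        elif tag == "script" and attr.get("type") == "application/ld+json":
            self._in_jsonld, self._buffer = True, []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "script" or not self._in_jsonld:
            return
        self._in_jsonld = False
        try:
            doc = json.loads("".join(self._buffer))
        except json.JSONDecodeError:
            return  # ignore malformed JSON-LD
        for item in doc if isinstance(doc, list) else [doc]:
            if isinstance(item, dict) and isinstance(item.get("@type"), str):
                self.schema_types.add(item["@type"])


def detect_content_type(html: str) -> str:
    """Return one of the six type labels shown in the diagram."""
    signals = SignalCollector()
    signals.feed(html)
    signals.close()
    for schema_type, label in SCHEMA_TYPES.items():
        if schema_type in signals.schema_types:
            return label
    for css_class, label in CLASS_HINTS.items():
        if css_class in signals.classes:
            return label
    if signals.og_type == "article" or signals.has_article_tag:
        return "article"
    return "webpage"  # no specific signals found
```

A real detector would likely weigh several signals and handle cases this sketch skips, such as `@type` arrays or `@graph` containers in JSON-LD.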
Output Examples
Article Extraction
```json
{
  "url": "https://techcrunch.com/2026/03/24/ai-agents",
  "type": "article",
  "metadata": {
    "title": "AI Agents Are Reshaping Enterprise Software",
    "description": "How autonomous AI agents are changing B2B SaaS",
    "author": "Sarah Perez",
    "date": "2026-03-24T10:00:00Z",
    "image": "https://techcrunch.com/hero.jpg",
    "siteName": "TechCrunch",
    "locale": "en-US",
    "canonical": "https://techcrunch.com/2026/03/24/ai-agents",
    "keywords": ["AI", "agents", "enterprise", "SaaS"]
  },
  "content": {
    "text": "The rise of AI agents represents a fundamental shift in how enterprise software operates. Unlike traditional chatbots...",
    "headings": [
      { "level": 2, "text": "What Are AI Agents?" },
      { "level": 2, "text": "The Enterprise Impact" },
      { "level": 3, "text": "Case Study: Salesforce" }
    ],
    "wordCount": 2847
  },
  "media": {
    "images": [
      { "src": "https://techcrunch.com/diagram.png", "alt": "AI agent architecture" }
    ],
    "links": [
      { "href": "https://openai.com/agents", "text": "OpenAI's agent framework" }
    ]
  },
  "structuredData": [
    { "@type": "NewsArticle", "headline": "..." }
  ],
  "extractedAt": "2026-03-24T12:34:56.789Z"
}
```
Product Extraction
```json
{
  "url": "https://store.example.com/product/widget-pro",
  "type": "product",
  "metadata": { "title": "Widget Pro - Best Seller", "siteName": "Example Store" },
  "content": { "text": "The Widget Pro is our most popular...", "wordCount": 342 },
  "product": {
    "name": "Widget Pro",
    "price": "$49.99",
    "currency": "USD",
    "availability": "InStock",
    "rating": "4.8",
    "reviewCount": "1,247",
    "brand": "WidgetCo",
    "sku": "WP-2026",
    "images": ["https://store.example.com/widget-pro-1.jpg"]
  }
}
```
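In the product example above, `price`, `rating`, and `reviewCount` arrive as strings, so downstream code that compares or aggregates them will probably want numeric types. The sketch below is one way to convert them; the helper name `normalize_product` is hypothetical, and it assumes US-style formatting (period as the decimal separator, comma as the thousands separator) and that all three fields are present.

```python
import re
from decimal import Decimal


def _numeric_part(value: str) -> str:
    """Drop currency symbols and thousands separators, keeping digits and '.'."""
    return re.sub(r"[^0-9.]", "", value)


def normalize_product(product: dict) -> dict:
    """Return a copy of the `product` object with numeric fields converted."""
    return {
        **product,
        "price": Decimal(_numeric_part(product["price"])),        # "$49.99" -> Decimal
        "rating": float(product["rating"]),                       # "4.8" -> float
        "reviewCount": int(_numeric_part(product["reviewCount"])),  # "1,247" -> int
    }
```

`Decimal` is used for the price to avoid binary floating-point rounding when prices are later summed or compared.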
Use Case Workflows
RAG Pipeline Integration
```mermaid
flowchart LR
    URLs["URL List<br/>100+ sources"] --> EXTRACT["Web Content<br/>Extractor API"]
    EXTRACT --> TEXT["Clean Text<br/>+ Metadata"]
    TEXT --> CHUNK["Text Chunking<br/>(LangChain)"]
    CHUNK --> EMBED["Embeddings<br/>(OpenAI)"]
    EMBED --> VECTOR["Vector DB<br/>(Pinecone)"]
    VECTOR --> RAG["RAG Query<br/>Engine"]
    RAG --> ANSWER["AI-Powered<br/>Answers"]
    style EXTRACT fill:#10b981,color:#fff
    style RAG fill:#3b82f6,color:#fff
```
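The chunking step in this pipeline can be sketched without any framework. The example below splits the `content.text` field of one extraction result into overlapping word windows and attaches the source URL and title to each chunk. The function name `chunk_extraction` and the default sizes are assumptions chosen for illustration; the diagram names LangChain for this step, and its text splitters would serve the same purpose.

```python
from typing import Dict, List


def chunk_extraction(result: Dict, chunk_words: int = 200,
                     overlap: int = 40) -> List[Dict]:
    """Split one extraction result into overlapping, metadata-tagged chunks."""
    if not 0 <= overlap < chunk_words:
        raise ValueError("overlap must be non-negative and smaller than chunk_words")
    words = result["content"]["text"].split()
    step = chunk_words - overlap  # each window starts `step` words after the last
    chunks: List[Dict] = []
    for start in range(0, len(words), step):
        chunks.append({
            "text": " ".join(words[start:start + chunk_words]),
            "source": result["url"],
            "title": result.get("metadata", {}).get("title"),
            "chunk_index": len(chunks),
        })
        if start + chunk_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Keeping `source` and `title` on every chunk lets the query engine cite the original page alongside each retrieved passage.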
Competitive Intelligence Pipeline
```mermaid
flowchart LR
    COMP["Competitor<br/>URLs"] --> EXTRACT["Web Content<br/>Extractor API"]
    EXTRACT --> PROD["Product Data:<br/>prices, features"]
    EXTRACT --> NEWS["News & Blog:<br/>announcements"]
    PROD --> DASH["Analytics<br/>Dashboard"]
    NEWS --> ALERT["Email<br/>Alerts"]
    style EXTRACT fill:#10b981,color:#fff
```
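The branching in this pipeline comes down to reading the `type` field of each result. A rough sketch, assuming results shaped like the output examples earlier in this README; the function name `route_results` and the `dashboard` and `alerts` bucket names are hypothetical.

```python
from typing import Dict, Iterable, List


def route_results(results: Iterable[Dict]) -> Dict[str, List[Dict]]:
    """Send product results to the dashboard bucket and articles to alerts."""
    routed: Dict[str, List[Dict]] = {"dashboard": [], "alerts": []}
    for result in results:
        metadata = result.get("metadata", {})
        if result.get("type") == "product":
            product = result.get("product", {})
            routed["dashboard"].append({
                "url": result["url"],
                "name": product.get("name"),
                "price": product.get("price"),
            })
        elif result.get("type") == "article":
            routed["alerts"].append({
                "url": result["url"],
                "title": metadata.get("title"),
                "date": metadata.get("date"),
            })
        # Other types (recipe, event, webpage, ...) are skipped in this sketch.
    return routed
```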
Pricing
| Event | Price per call | Cost per 1,000 |
|---|---|---|
| Content extraction | $0.003 | $3.00 |
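For budgeting, the table reduces to a single multiplication. The sketch below assumes that every extracted URL is billed as one extraction event, including URLs sent through `/batch`; this README does not confirm how batch calls or cached responses are billed, so treat the result as an upper-level estimate. The helper name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

PRICE_PER_EXTRACTION = Decimal("0.003")  # USD, from the pricing table


def estimate_cost(url_count: int) -> Decimal:
    """Estimated USD cost, assuming one billed extraction event per URL."""
    if url_count < 0:
        raise ValueError("url_count must not be negative")
    return PRICE_PER_EXTRACTION * url_count
```

For example, `estimate_cost(1000)` reproduces the $3.00 per 1,000 figure in the table.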
Cost Comparison
| Solution | Cost per 1,000 URLs | Setup Time |
|---|---|---|
| This Actor | $3.00 | 0 minutes |
| Diffbot | $299/month flat | Hours |
| Custom scraper | $50+ developer hours | Days |
| Manual copy-paste | 40+ hours labor | Forever |
Integrations
| Platform | How to Connect |
|---|---|
| LangChain | Use as Document Loader via HTTP |
| LlamaIndex | Custom reader pointing to /extract |
| Zapier | Webhook trigger -> GET /extract |
| Make (Integromat) | HTTP module -> POST /extract |
| n8n | HTTP Request node |
| Apify Orchestrator | Direct actor call or Standby URL |
FAQ
Q: How fast is extraction? A: 1-3 seconds for a single URL. A batch call processes up to 25 URLs in parallel.
Q: Does it handle paywalled content? A: It extracts whatever is publicly visible in the HTML. Paywalled content behind JavaScript auth won't be extracted.
Q: What about JavaScript-rendered pages (SPAs)? A: The current version works from the HTML the server returns and does not render JavaScript. For JS-heavy pages, pair it with our Screenshot & PDF API.
Q: Is there a rate limit? A: No hard rate limit. Apify Standby handles concurrent requests automatically.
Q: What languages are supported? A: Any language. The extractor works with HTML structure, not language-specific parsing.
Related Actors
- WebSight API — Technical website analysis (SEO, tech stack, AI score)
- Screenshot & PDF API — Pixel-perfect webpage captures
- Website Contact Scraper — Extract emails, phones, social links
Built by George Kioko | 6,196+ data extraction jobs completed | 35+ production APIs