Web Content Extractor API — URL to JSON

Under maintenance

Extract structured JSON from any webpage: articles, products, recipes, job postings, and more. The content type is detected automatically, and each response includes metadata, headings, images, and links. Built for AI agents and RAG pipelines.

Developer

George Kioko


🔍 Web Content Extractor API — URL to Structured JSON

One API call. Any URL. Clean structured JSON. Extract articles, products, recipes, job postings, and more — automatically detected and organized. Built for AI agents, RAG pipelines, and data workflows.

Architecture Overview

flowchart TB
    subgraph Input
        URL[/"URL: any webpage"/]
    end

    subgraph Processing["Extraction Pipeline"]
        FETCH["1. Fetch & Parse HTML"]
        DETECT["2. Auto-Detect Content Type"]
        SCORE["3. Score Content Blocks"]
        EXTRACT["4. Extract Structured Data"]
        ENRICH["5. Enrich with Metadata"]
    end

    subgraph Detection["Content Type Detection"]
        ART["Article"]
        PROD["Product"]
        REC["Recipe"]
        JOB["Job Posting"]
        EVT["Event"]
        WEB["Generic Webpage"]
    end

    subgraph Output["Structured JSON"]
        META["Metadata: title, author, date, image"]
        CONTENT["Content: text, headings, word count"]
        MEDIA["Media: images, links"]
        SCHEMA["JSON-LD Structured Data"]
        TYPED["Type-Specific: price, ingredients, salary..."]
    end

    URL --> FETCH --> DETECT --> SCORE --> EXTRACT --> ENRICH
    DETECT --> ART & PROD & REC & JOB & EVT & WEB
    ENRICH --> META & CONTENT & MEDIA & SCHEMA & TYPED

    style Input fill:#1a1a2e,color:#fff
    style Processing fill:#16213e,color:#fff
    style Detection fill:#0f3460,color:#fff
    style Output fill:#533483,color:#fff

What Makes This Different?

Feature	This Actor	Typical Scrapers
Output format	Structured JSON	Raw HTML
Content detection	Auto-detects 6 types	Manual configuration
Setup time	Zero — just pass URL	Hours of selector writing
AI-ready	Yes — clean text for LLMs	Needs post-processing
Batch support	Up to 25 URLs per call	One at a time
Response time	1-3 seconds	5-30 seconds

Request Flow

sequenceDiagram
    participant Client as Your App
    participant API as Content Extractor
    participant Web as Target Website
    participant Cache as 30-min Cache

    Client->>API: GET /extract?url=example.com
    API->>Cache: Check cache

    alt Cache Hit
        Cache-->>API: Return cached result
        API-->>Client: JSON response (instant)
    else Cache Miss
        API->>Web: Fetch HTML
        Web-->>API: HTML content
        API->>API: Detect type + Extract + Score
        API->>Cache: Store result
        API-->>Client: Structured JSON (1-3s)
    end

    Note over Client,API: Pay-per-event (PPE) charge: $0.003 per extraction
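
The cache-hit and cache-miss branches in the diagram can be sketched in a few lines of Python. This is only an illustration of the flow, not the actor's actual code: the `TTLCache` class, the `extract_with_cache` helper, and the injected `fetch_and_extract` callable are all hypothetical names, and the only detail taken from the diagram is the 30-minute lifetime.

```python
import time
from typing import Callable, Dict, Optional, Tuple

CACHE_TTL_SECONDS = 30 * 60  # the 30-minute window named in the diagram


class TTLCache:
    """Tiny in-memory cache keyed by URL; entries expire after a fixed TTL."""

    def __init__(self, ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock  # injectable so the expiry logic can be tested
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get(self, url: str) -> Optional[dict]:
        entry = self._store.get(url)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._store[url]  # expired entries count as a miss
            return None
        return value

    def put(self, url: str, value: dict) -> None:
        self._store[url] = (self._clock(), value)


def extract_with_cache(url: str, cache: TTLCache,
                       fetch_and_extract: Callable[[str], dict]) -> Tuple[dict, bool]:
    """Return (result, was_cache_hit), following the two branches of the diagram."""
    cached = cache.get(url)
    if cached is not None:
        return cached, True           # cache hit: no fetch needed
    result = fetch_and_extract(url)   # cache miss: fetch, detect, extract
    cache.put(url, result)
    return result, False
```

Passing the clock in as a parameter keeps the sketch deterministic in tests; a real service would more likely rely on a shared cache store than a per-process dictionary.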

API Endpoints

`GET /extract` — Extract from URL

GET /extract?url=https://techcrunch.com/2026/03/24/ai-news&format=full

Parameter	Type	Required	Default	Options
`url`	string	Yes	—	Any valid URL
`format`	string	No	`full`	`full`, `article`, `metadata`
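
When building the GET request from code, it is safest to percent-encode the target URL so that its own query string cannot be confused with the `url` and `format` parameters. The sketch below assumes a placeholder host (`BASE_URL` is hypothetical; substitute the actor's real endpoint) and uses only the two parameters from the table above.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; replace with the actor's real endpoint.
BASE_URL = "https://your-standby-host.example"
ALLOWED_FORMATS = {"full", "article", "metadata"}  # options listed in the table


def build_extract_url(target_url: str, fmt: str = "full",
                      base_url: str = BASE_URL) -> str:
    """Build a GET /extract URL with the target URL safely percent-encoded."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    query = urlencode({"url": target_url, "format": fmt})
    return f"{base_url}/extract?{query}"


if __name__ == "__main__":
    print(build_extract_url("https://techcrunch.com/2026/03/24/ai-news"))
```

Rejecting unknown `format` values on the client side avoids spending a request on a call that would presumably be refused anyway.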

`POST /extract` — Extract with JSON body

POST /extract
{
  "url": "https://techcrunch.com/2026/03/24/ai-news",
  "format": "article"
}
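
The same call can be prepared from Python with the standard library alone. This is a minimal sketch under a few assumptions: `BASE_URL` is a hypothetical placeholder, the `Content-Type` header is the conventional one for JSON bodies rather than something this page specifies, and any authentication the endpoint may require is not shown because it is not described here.

```python
import json
import urllib.request

# Hypothetical placeholder; replace with the actor's real endpoint.
BASE_URL = "https://your-standby-host.example"


def build_post_request(target_url: str, fmt: str = "article",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare (but do not send) a POST /extract request with a JSON body."""
    body = json.dumps({"url": target_url, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending makes the payload easy to inspect or log before any network traffic happens.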

`POST /batch` — Extract multiple URLs

POST /batch
{
  "urls": [
    "https://news.ycombinator.com",
    "https://techcrunch.com",
    "https://bbc.com/news"
  ],
  "format": "full"
}
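
Since a batch call accepts up to 25 URLs, a longer list has to be split on the client side. The helper below is a sketch of one way to do that; `batch_payloads` is a hypothetical name, and dropping duplicate URLs before batching is an assumption made here to avoid extracting the same page twice.

```python
from typing import Iterator, Sequence

MAX_BATCH_SIZE = 25  # per-call limit stated for batch support


def batch_payloads(urls: Sequence[str], fmt: str = "full",
                   batch_size: int = MAX_BATCH_SIZE) -> Iterator[dict]:
    """Yield POST /batch bodies, each holding at most `batch_size` URLs."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    unique = list(dict.fromkeys(urls))  # drop duplicates, keep original order
    for start in range(0, len(unique), batch_size):
        yield {"urls": unique[start:start + batch_size], "format": fmt}


if __name__ == "__main__":
    sources = [f"https://example.com/page/{i}" for i in range(60)]
    for payload in batch_payloads(sources):
        print(len(payload["urls"]), "URLs in this batch")
```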

`GET /` — Health check

Returns API status, version, and endpoint documentation.

Content Type Detection

flowchart LR
    HTML["HTML Page"] --> CHECK{"Detect Signals"}

    CHECK -->|"og:type=article<br/>or article tag"| ART["**article**<br/>title, author, date,<br/>full text, headings"]
    CHECK -->|"Schema: Product<br/>or .product-price"| PROD["**product**<br/>name, price, rating,<br/>images, SKU, brand"]
    CHECK -->|"Schema: Recipe<br/>or .recipe"| REC["**recipe**<br/>ingredients, instructions,<br/>prep time, servings"]
    CHECK -->|"Schema: JobPosting<br/>or .job-title"| JOB["**job_posting**<br/>title, company, salary,<br/>location, type"]
    CHECK -->|"Schema: Event<br/>or .event-date"| EVT["**event**<br/>name, date, location,<br/>description"]
    CHECK -->|"No specific<br/>signals found"| WEB["**webpage**<br/>metadata, content,<br/>links, images"]

    style ART fill:#10b981,color:#fff
    style PROD fill:#f59e0b,color:#fff
    style REC fill:#ef4444,color:#fff
    style JOB fill:#3b82f6,color:#fff
    style EVT fill:#8b5cf6,color:#fff
    style WEB fill:#6b7280,color:#fff
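
To make the signals in the diagram concrete, here is a rough standard-library sketch of a detector that looks for the same hints: the `og:type` meta tag, an `article` tag, JSON-LD `@type` values, and the listed CSS classes. It is not the actor's implementation. In particular, the priority order (specific types before `article`) is an assumption, and `SignalCollector` and `detect_type` are hypothetical names.

```python
import json
from html.parser import HTMLParser

# Signals taken from the diagram above.
SCHEMA_TYPES = {"Product": "product", "Recipe": "recipe",
                "JobPosting": "job_posting", "Event": "event"}
CLASS_HINTS = {"product-price": "product", "recipe": "recipe",
               "job-title": "job_posting", "event-date": "event"}
# Assumed priority: specific types win over the more generic "article".
PRIORITY = ["product", "recipe", "job_posting", "event", "article"]


class SignalCollector(HTMLParser):
    """Collects content-type hints while streaming through the HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.signals = set()
        self._in_jsonld = False
        self._jsonld_parts = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "article":
            self.signals.add("article")
        if tag == "meta" and attr.get("property") == "og:type" \
                and attr.get("content") == "article":
            self.signals.add("article")
        for css_class in (attr.get("class") or "").split():
            if css_class in CLASS_HINTS:
                self.signals.add(CLASS_HINTS[css_class])
        if tag == "script" and attr.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._jsonld_parts = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._jsonld_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self._record_jsonld("".join(self._jsonld_parts))

    def _record_jsonld(self, raw: str) -> None:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            return  # malformed JSON-LD is ignored rather than fatal
        for item in data if isinstance(data, list) else [data]:
            schema_type = item.get("@type") if isinstance(item, dict) else None
            if isinstance(schema_type, str) and schema_type in SCHEMA_TYPES:
                self.signals.add(SCHEMA_TYPES[schema_type])


def detect_type(html: str) -> str:
    """Return the detected content type, falling back to 'webpage'."""
    collector = SignalCollector()
    collector.feed(html)
    collector.close()
    for content_type in PRIORITY:
        if content_type in collector.signals:
            return content_type
    return "webpage"
```

Real pages often carry several signals at once (a product page wrapped in an `article` tag, for example), which is why some explicit tie-breaking rule is needed; the one chosen here is only a guess.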

Output Examples

Article Extraction

{
  "url": "https://techcrunch.com/2026/03/24/ai-agents",
  "type": "article",
  "metadata": {
    "title": "AI Agents Are Reshaping Enterprise Software",
    "description": "How autonomous AI agents are changing B2B SaaS",
    "author": "Sarah Perez",
    "date": "2026-03-24T10:00:00Z",
    "image": "https://techcrunch.com/hero.jpg",
    "siteName": "TechCrunch",
    "locale": "en-US",
    "canonical": "https://techcrunch.com/2026/03/24/ai-agents",
    "keywords": ["AI", "agents", "enterprise", "SaaS"]
  },
  "content": {
    "text": "The rise of AI agents represents a fundamental shift in how enterprise software operates. Unlike traditional chatbots...",
    "headings": [
      { "level": 2, "text": "What Are AI Agents?" },
      { "level": 2, "text": "The Enterprise Impact" },
      { "level": 3, "text": "Case Study: Salesforce" }
    ],
    "wordCount": 2847
  },
  "media": {
    "images": [
      { "src": "https://techcrunch.com/diagram.png", "alt": "AI agent architecture" }
    ],
    "links": [
      { "href": "https://openai.com/agents", "text": "OpenAI's agent framework" }
    ]
  },
  "structuredData": [{ "@type": "NewsArticle", "headline": "..." }],
  "extractedAt": "2026-03-24T12:34:56.789Z"
}
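
As a small illustration of consuming this shape, the sketch below turns the `content.headings` array into an indented outline, which can be handy as a table of contents or as section hints for an LLM prompt. `heading_outline` is a hypothetical helper, and it assumes the field names shown in the example above.

```python
from typing import List


def heading_outline(response: dict) -> List[str]:
    """Render content.headings as indented lines, relative to the top level."""
    headings = response.get("content", {}).get("headings", [])
    if not headings:
        return []
    top_level = min(h["level"] for h in headings)
    return ["  " * (h["level"] - top_level) + h["text"] for h in headings]


if __name__ == "__main__":
    sample = {
        "content": {
            "headings": [
                {"level": 2, "text": "What Are AI Agents?"},
                {"level": 2, "text": "The Enterprise Impact"},
                {"level": 3, "text": "Case Study: Salesforce"},
            ]
        }
    }
    print("\n".join(heading_outline(sample)))
```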

Product Extraction

{
  "url": "https://store.example.com/product/widget-pro",
  "type": "product",
  "metadata": { "title": "Widget Pro - Best Seller", "siteName": "Example Store" },
  "content": { "text": "The Widget Pro is our most popular...", "wordCount": 342 },
  "product": {
    "name": "Widget Pro",
    "price": "$49.99",
    "currency": "USD",
    "availability": "InStock",
    "rating": "4.8",
    "reviewCount": "1,247",
    "brand": "WidgetCo",
    "sku": "WP-2026",
    "images": ["https://store.example.com/widget-pro-1.jpg"]
  }
}
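
In the example above, `price`, `rating`, and `reviewCount` arrive as strings, so numeric comparisons need a normalization step. The following sketch is one way to do it; `parse_decimal` and `normalize_product` are hypothetical helpers, and the parsing assumes US-style number formatting (comma as thousands separator, dot as decimal point).

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_decimal(raw) -> Optional[Decimal]:
    """Strip currency symbols and thousands separators; None if nothing numeric.

    Assumes US-style formatting; a value such as "1.247,00" would be misread.
    """
    cleaned = re.sub(r"[^\d.\-]", "", str(raw) if raw is not None else "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def normalize_product(product: dict) -> dict:
    """Convert the string fields of a product block into typed values."""
    price = parse_decimal(product.get("price"))
    rating = parse_decimal(product.get("rating"))
    reviews = parse_decimal(product.get("reviewCount"))
    return {
        "name": product.get("name"),
        "price": price,  # kept as Decimal to avoid float rounding on money
        "currency": product.get("currency"),
        "rating": float(rating) if rating is not None else None,
        "review_count": int(reviews) if reviews is not None else None,
        "in_stock": product.get("availability") == "InStock",
    }
```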

Use Case Workflows

RAG Pipeline Integration

flowchart LR
    URLs["URL List<br/>100+ sources"] --> EXTRACT["Web Content<br/>Extractor API"]
    EXTRACT --> TEXT["Clean Text<br/>+ Metadata"]
    TEXT --> CHUNK["Text Chunking<br/>(LangChain)"]
    CHUNK --> EMBED["Embeddings<br/>(OpenAI)"]
    EMBED --> VECTOR["Vector DB<br/>(Pinecone)"]
    VECTOR --> RAG["RAG Query<br/>Engine"]
    RAG --> ANSWER["AI-Powered<br/>Answers"]

    style EXTRACT fill:#10b981,color:#fff
    style RAG fill:#3b82f6,color:#fff
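
The chunking step in this pipeline can be approximated without any framework. The sketch below uses plain Python in place of the LangChain splitter named in the diagram, splitting on whitespace with a fixed overlap; `chunk_words` and `to_documents` are hypothetical helpers, and the chunk sizes are arbitrary defaults rather than recommendations.

```python
from typing import Dict, List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word windows of `chunk_size` that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def to_documents(response: dict, chunk_size: int = 200,
                 overlap: int = 40) -> List[Dict]:
    """Attach source metadata to each chunk so answers can cite their origin."""
    metadata = response.get("metadata", {})
    text = response.get("content", {}).get("text", "")
    return [
        {
            "text": chunk,
            "metadata": {
                "source": response.get("url"),
                "title": metadata.get("title"),
                "chunk": index,
            },
        }
        for index, chunk in enumerate(chunk_words(text, chunk_size, overlap))
    ]
```

Each returned dictionary can then be passed to whichever embedding model and vector store the pipeline uses; carrying `source` and `title` along with the text is what lets the final answer point back to the original page.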

Competitive Intelligence Pipeline

flowchart LR
    COMP["Competitor<br/>URLs"] --> EXTRACT["Web Content<br/>Extractor API"]
    EXTRACT --> PROD["Product Data:<br/>prices, features"]
    EXTRACT --> NEWS["News & Blog:<br/>announcements"]
    PROD --> DASH["Analytics<br/>Dashboard"]
    NEWS --> ALERT["Email<br/>Alerts"]

    style EXTRACT fill:#10b981,color:#fff
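
The branching in this diagram amounts to routing each result by its `type` field and comparing product prices against the previous run. A minimal sketch, assuming the response shape from the output examples; `route`, `price_changes`, and the destination labels are hypothetical.

```python
from typing import Dict, Iterable, List


def route(result: dict) -> str:
    """Pick a destination for one extraction result based on its type."""
    content_type = result.get("type")
    if content_type == "product":
        return "dashboard"   # prices and features go to analytics
    if content_type == "article":
        return "alerts"      # news and blog posts trigger notifications
    return "ignored"


def price_changes(previous: Dict[str, str],
                  results: Iterable[dict]) -> List[dict]:
    """Compare current product prices with the last known price per URL."""
    changes = []
    for result in results:
        if result.get("type") != "product":
            continue
        url = result.get("url")
        new_price = result.get("product", {}).get("price")
        old_price = previous.get(url)
        if old_price is not None and new_price is not None and new_price != old_price:
            changes.append({"url": url, "old": old_price, "new": new_price})
    return changes
```

Prices are compared as raw strings here to keep the sketch short; a production version would probably normalize them to numbers first so that formatting differences are not reported as changes.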

Pricing

Event	Price per call	Cost per 1,000
Content extraction	$0.003	$3.00
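
For budgeting, the per-call price translates directly into a cost estimate. The sketch below assumes that every extracted URL counts as one billable event, including each URL inside a batch call; that billing detail, and whether cached responses are charged, are not stated on this page. `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

PRICE_PER_EXTRACTION = Decimal("0.003")  # USD, from the pricing table


def estimate_cost(extractions: int) -> Decimal:
    """Estimated USD cost, assuming one billable event per extracted URL."""
    if extractions < 0:
        raise ValueError("extractions must be non-negative")
    return PRICE_PER_EXTRACTION * extractions


if __name__ == "__main__":
    for count in (25, 1_000, 50_000):
        print(f"{count} extractions: ${estimate_cost(count):.2f}")
```

`Decimal` is used instead of `float` so that small per-call amounts add up without binary rounding drift.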

Cost Comparison

Solution	Cost	Setup Time
This Actor	$3.00 per 1,000 URLs	0 minutes
Diffbot	$299/month flat	Hours
Custom scraper	$50+ developer hours	Days
Manual copy-paste	40+ hours labor	Forever

Integrations

Platform	How to Connect
LangChain	Use as Document Loader via HTTP
LlamaIndex	Custom reader pointing to /extract
Zapier	Webhook trigger -> GET /extract
Make (Integromat)	HTTP module -> POST /extract
n8n	HTTP Request node
Apify Orchestrator	Direct actor call or Standby URL

FAQ

Q: How fast is extraction? A: 1-3 seconds for a single URL. Batch requests process up to 25 URLs in parallel.

Q: Does it handle paywalled content? A: It extracts whatever is publicly visible in the HTML. Paywalled content behind JavaScript auth won't be extracted.

Q: What about JavaScript-rendered pages (SPAs)? A: The current version parses the HTML returned by the server and does not execute JavaScript. For JS-heavy pages, pair it with our Screenshot & PDF API.

Q: Is there a rate limit? A: No hard rate limit. Apify Standby handles concurrent requests automatically.

Q: What languages are supported? A: Any language. The extractor works with HTML structure, not language-specific parsing.

Related APIs

WebSight API — Technical website analysis (SEO, tech stack, AI score)
Screenshot & PDF API — Pixel-perfect webpage captures
Website Contact Scraper — Extract emails, phones, social links

Built by George Kioko | 6,196+ data extraction jobs completed | 35+ production APIs
