# XavvyNess AI Web Extractor (`xavvyness/xavvyness-smart-extractor`) Actor

Extract data from any website using plain English — no CSS selectors, no code. Describe what you want, get JSON, CSV, or Markdown back. Works even when site layouts change. Example: 'Extract job titles, company names, and salaries'. Support email: hello@xavvyness.ai

- **URL**: https://apify.com/xavvyness/xavvyness-smart-extractor.md
- **Developed by:** [XavvyNess](https://apify.com/xavvyness) (community)
- **Categories:** AI, Developer tools, Automation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $25.00 / 1,000 ai-extracted pages

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

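As a rough budgeting aid, the listed rate can be turned into a per-run estimate. This is a minimal sketch that assumes the "from $25.00 / 1,000 pages" rate applies uniformly to every successfully extracted page; the actual charge may differ by plan or tier, and `estimate_cost` is a hypothetical helper, not part of any Apify library.

```python
from decimal import Decimal

# Assumption: a flat rate of $25.00 per 1,000 AI-extracted pages.
PRICE_PER_1000_PAGES = Decimal("25.00")


def estimate_cost(pages: int) -> Decimal:
    """Return the estimated charge in USD for `pages` extracted pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_1000_PAGES * pages / 1000


if __name__ == "__main__":
    for pages in (40, 1000):
        print(f"{pages} pages: about ${estimate_cost(pages)}")
```
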
## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## 🤖 XavvyNess Smart Extractor — Natural Language Web Scraping

Extract structured data from any website using plain English. No code, no XPath, no CSS selectors. Just describe what you want and get clean JSON, CSV, or Markdown back. Works even when websites change their HTML.

> **Same price as Apify's official AI Web Scraper ($25/1,000 pages) — but with JSON, CSV, and Markdown output, plus specific error messages instead of generic failures.**

---

### 🚀 What It Does

1. Crawls any webpage and extracts clean text content
2. Sends the content + your extraction prompt to AI
3. Returns structured data in the format you choose (JSON, CSV, Markdown)

**Perfect for:** lead generation, price monitoring, content aggregation, data pipelines, research automation

---

### 📥 Input

| Field | Required | Default | Description |
|---|---|---|---|
| `urls` | ✅ | — | URLs to extract data from |
| `extractionPrompt` | ✅ | — | Plain English description of what to extract |
| `outputFormat` | — | `json` | `json` / `csv` / `markdown` |
| `maxItems` | — | `50` | Maximum items to extract per page (1-500) |

**Example inputs:**

```json
{
  "urls": ["https://news.ycombinator.com/"],
  "extractionPrompt": "Extract all post titles, point scores, and comment counts. Return as a list.",
  "outputFormat": "json",
  "maxItems": 30
}
````

```json
{
  "urls": ["https://www.g2.com/products/hubspot/reviews"],
  "extractionPrompt": "Extract reviewer name, star rating, review title, and the pros and cons mentioned in each review.",
  "outputFormat": "json"
}
```
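
Before starting a run, it can help to check an input against the constraints listed in the Input table. The following is a minimal sketch: `build_input` is a hypothetical helper, and the requirement that `urls` contain at least one entry is an assumption (the schema only marks the field as required).

```python
ALLOWED_FORMATS = {"json", "csv", "markdown"}


def build_input(urls, extraction_prompt, output_format="json", max_items=50):
    """Build an Actor input dict, enforcing the limits from the Input table."""
    # Assumption: an empty URL list is treated as invalid.
    if not urls:
        raise ValueError("urls must contain at least one URL")
    if not extraction_prompt or not extraction_prompt.strip():
        raise ValueError("extractionPrompt must not be empty")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    if not 1 <= max_items <= 500:
        raise ValueError("maxItems must be between 1 and 500")
    return {
        "urls": list(urls),
        "extractionPrompt": extraction_prompt,
        "outputFormat": output_format,
        "maxItems": max_items,
    }


if __name__ == "__main__":
    print(build_input(
        ["https://news.ycombinator.com/"],
        "Extract all post titles, point scores, and comment counts.",
        max_items=30,
    ))
```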

***

### 📤 Output (JSON format)

Real output from a live run on Hacker News:

```json
{
  "sourceUrl": "https://news.ycombinator.com",
  "extractionPrompt": "Extract top 10 story titles with their point scores and comment counts",
  "items": [
    { "title": "I ported Mac OS X to the Nintendo Wii", "points": 1032, "comments": 194 },
    { "title": "Git commands I run before reading any code", "points": 1653, "comments": 355 },
    { "title": "Veracrypt project update", "points": 1077, "comments": 404 },
    { "title": "They're made out of meat (1991)", "points": 348, "comments": 99 },
    { "title": "ML promises to be profoundly weird", "points": 314, "comments": 359 },
    { "title": "Muse Spark: Scaling towards personal superintelligence", "points": 214, "comments": 257 },
    { "title": "Understanding the Kalman filter with a simple radar example", "points": 156, "comments": 25 },
    { "title": "USB for Software Developers", "points": 104, "comments": 15 },
    { "title": "Expanding Swift's IDE Support", "points": 55, "comments": 30 },
    { "title": "Pgit: I Imported the Linux Kernel into PostgreSQL", "points": 47, "comments": 4 }
  ],
  "itemCount": 10,
  "totalFound": 10,
  "outputFormat": "json",
  "extractedAt": "2026-04-08T22:22:20.139Z",
  "agent": "XavvyNess Smart Extractor"
}
```
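
If you have already fetched a record in JSON form and want a spreadsheet-friendly view, the `items` array can be flattened locally. This sketch assumes each record has the shape shown above (an `items` list of flat objects); `items_to_csv` is a hypothetical helper, and the Actor's own `csv` output format may be laid out differently.

```python
import csv
import io


def items_to_csv(record: dict) -> str:
    """Render the `items` list of one output record as CSV text."""
    items = record.get("items", [])
    # Collect column names in first-seen order so the header is stable.
    fieldnames = []
    for item in items:
        for key in item:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)  # keys missing from an item become empty cells
    return buffer.getvalue()


if __name__ == "__main__":
    sample = {
        "sourceUrl": "https://news.ycombinator.com",
        "items": [
            {"title": "USB for Software Developers", "points": 104, "comments": 15},
            {"title": "Expanding Swift's IDE Support", "points": 55, "comments": 30},
        ],
    }
    print(items_to_csv(sample))
```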

***

### 💡 Writing Good Extraction Prompts

Be specific about what fields you want and their types:

| ❌ Vague | ✅ Specific |
|---|---|
| "Get the jobs" | "Extract job title, company name, location, and salary range for each listing" |
| "Scrape reviews" | "Extract reviewer name, star rating (1-5), and the main complaint from each review" |
| "Get prices" | "Extract product name, original price, discounted price, and stock status" |

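When prompts are generated by code rather than typed by hand, one way to keep them in the "Specific" style is to build them from an explicit field list. This is only an illustrative sketch; `build_prompt` and its parameters are hypothetical, and nothing here is required by the Actor.

```python
def build_prompt(fields, unit):
    """Join field names into a prompt shaped like the 'Specific' column above."""
    cleaned = [field.strip() for field in fields if field.strip()]
    if not cleaned:
        raise ValueError("at least one field is required")
    if len(cleaned) == 1:
        listed = cleaned[0]
    elif len(cleaned) == 2:
        listed = " and ".join(cleaned)
    else:
        listed = ", ".join(cleaned[:-1]) + ", and " + cleaned[-1]
    return f"Extract {listed} for each {unit}"


if __name__ == "__main__":
    print(build_prompt(
        ["job title", "company name", "location", "salary range"],
        "listing",
    ))
```
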
***

### ⚙️ Setup — API Keys

| Variable | Required | Where to Get |
|---|---|---|
| `GROQ_API_KEY` | Recommended (free) | [console.groq.com](https://console.groq.com) |
| `GOOGLE_API_KEY` | Optional fallback | [aistudio.google.com](https://aistudio.google.com) |

***

### ❓ FAQ

**Q: What if the site uses JavaScript rendering (React/Vue/Angular)?**\
A: The Actor uses CheerioCrawler, which parses static HTML and does not execute JavaScript. For JS-heavy SPAs, the extracted text may be limited; try URLs that serve server-side rendered content.

**Q: What if the site blocks the crawler (403)?**\
A: You'll get a clear error message: "Access denied (403) — site blocks automated requests". Try again with a different URL from the same site, or contact us about proxy options.

**Q: Can I extract from multiple pages at once?**\
A: Yes — add multiple URLs to the `urls` array. Each page is processed independently with the same extraction prompt.

**Q: How is this different from a normal scraper?**\
A: A normal scraper needs hard-coded CSS selectors that break when the site updates. This Actor uses AI to interpret the content structure, so it adapts when the layout changes.

***

### 🔗 Use Cases

1. **Lead generation** — Extract company names, emails, and phone numbers from directories
2. **Price monitoring** — Track competitor pricing across e-commerce sites
3. **Review aggregation** — Collect G2, Trustpilot, or Amazon reviews for sentiment analysis
4. **Job board scraping** — Extract job listings with titles, requirements, and salaries
5. **News monitoring** — Pull headlines and summaries from any news site
6. **Research automation** — Extract structured data from academic or government pages

***

### 📊 Performance

- ✅ Most pages: under 15 seconds
- ✅ Handles dynamic prompt structures — no hardcoding required
- ✅ Clear error messages for every failure mode
- ✅ Groq → Gemini fallback — resilient to API outages
- ✅ **Failed runs are not charged** — you only pay for successful extractions

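The provider fallback mentioned in the list above follows a common pattern: try providers in order and return the first success. The sketch below illustrates that general pattern only. It is not the Actor's actual code, and `call_with_fallback` and the stand-in provider functions are hypothetical.

```python
from typing import Callable, Sequence

# A provider is a (name, callable) pair; the callable takes a prompt and returns text.
Provider = tuple[str, Callable[[str], str]]


def call_with_fallback(providers: Sequence[Provider], prompt: str) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, result) for the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # record the failure and move to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in providers for demonstration; real ones would call an LLM API.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def backup(prompt: str) -> str:
    return f"extracted: {prompt}"


if __name__ == "__main__":
    print(call_with_fallback([("primary", flaky_primary), ("backup", backup)], "titles"))
```
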
***

### 📊 vs. Competitors

| | XavvyNess Smart Extractor | Apify AI Web Scraper |
|---|---|---|
| Price | $25/1,000 pages | $25/1,000 pages |
| AI provider | Groq/Gemini (free tier) | OpenAI (paid) |
| Natural language prompts | ✅ | ✅ |
| Output formats | JSON, CSV, Markdown | JSON |
| Error messages | Specific, actionable | Generic |

***

### Integration

#### Via Apify JavaScript client

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('IN4O5pGUjye34xW0O').call({
  urls: ['https://news.ycombinator.com/', 'https://producthunt.com/'],
  extractionPrompt: 'Extract all post titles, upvote counts, and URLs.',
  outputFormat: 'json',
  maxItems: 50,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(result => {
  console.log(result.sourceUrl);  // URL scraped
  console.log(result.items);      // extracted data array
  console.log(result.itemCount);  // how many items found
});
```

#### Via HTTP API

```bash
curl -X POST \
  "https://api.apify.com/v2/acts/IN4O5pGUjye34xW0O/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://news.ycombinator.com/"],
    "extractionPrompt": "Extract all post titles and scores."
  }'
```
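
For Python projects that prefer not to add a dependency, the same kind of call can be sketched with the standard library. This example targets the `run-sync-get-dataset-items` endpoint listed in the OpenAPI specification in this document and passes the token in an `Authorization: Bearer` header instead of the query string, which keeps it out of logged URLs. `build_sync_request` and `run_and_fetch` are hypothetical helper names, and the 300-second timeout is an arbitrary choice.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR = "xavvyness~xavvyness-smart-extractor"


def build_sync_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Prepare a POST request that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{ACTOR}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_and_fetch(token: str, actor_input: dict, timeout: float = 300.0):
    """Send the request and decode the JSON response (performs a network call)."""
    request = build_sync_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `run_and_fetch("<YOUR_API_TOKEN>", {...})` blocks until the run finishes, so it suits short runs; for long runs, starting the run via `/runs` and polling is usually the safer choice.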

#### Via Make.com / Zapier

Use the **Apify** module → **Run Actor** action. Actor ID: `IN4O5pGUjye34xW0O`. Describe what to extract in plain English in the `extractionPrompt` field — no code required.

***

*Built by XavvyNess — AI agent services that do real work.*

# Actor input Schema

## `urls` (type: `array`):

URLs to extract data from.

## `extractionPrompt` (type: `string`):

Describe in plain language what data you want extracted. E.g. 'Extract all job titles, companies, and locations from this page.'

## `outputFormat` (type: `string`):

Format for the extracted data output.

## `maxItems` (type: `integer`):

Maximum number of items to extract per page.

## Actor input object example

```json
{
  "urls": [
    "https://news.ycombinator.com/"
  ],
  "extractionPrompt": "Extract all post titles, scores, and comment counts from this Hacker News page.",
  "outputFormat": "json",
  "maxItems": 50
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        "https://news.ycombinator.com/"
    ],
    "extractionPrompt": "Extract all post titles, scores, and comment counts from this Hacker News page."
};

// Run the Actor and wait for it to finish
const run = await client.actor("xavvyness/xavvyness-smart-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "urls": ["https://news.ycombinator.com/"],
    "extractionPrompt": "Extract all post titles, scores, and comment counts from this Hacker News page.",
}

# Run the Actor and wait for it to finish
run = client.actor("xavvyness/xavvyness-smart-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
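
The example above reads the dataset without checking how the run ended. A small guard can make failures explicit before any results are consumed. This is a sketch under two assumptions: the run object is a dict with `status`, `id`, and `defaultDatasetId` keys, as in the run response schema in this document, and the client may return `None` when no run is available. `require_success` is a hypothetical helper.

```python
def require_success(run):
    """Return the default dataset ID if the run succeeded, else raise."""
    if run is None:
        raise RuntimeError("Actor call did not return a run object")
    status = run.get("status")
    if status != "SUCCEEDED":
        raise RuntimeError(f"Actor run {run.get('id')} ended with status {status}")
    return run["defaultDatasetId"]


if __name__ == "__main__":
    # Stand-in run object for demonstration; a real one comes from client.actor(...).call(...).
    fake_run = {"id": "run123", "status": "SUCCEEDED", "defaultDatasetId": "ds456"}
    print(require_success(fake_run))
```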

## CLI example

```bash
echo '{
  "urls": [
    "https://news.ycombinator.com/"
  ],
  "extractionPrompt": "Extract all post titles, scores, and comment counts from this Hacker News page."
}' |
apify call xavvyness/xavvyness-smart-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=xavvyness/xavvyness-smart-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "XavvyNess AI Web Extractor",
        "description": "Extract data from any website using plain English — no CSS selectors, no code. Describe what you want, get JSON, CSV, or Markdown back. Works even when site layouts change. Example: 'Extract job titles, company names, and salaries'. Support email: hello@xavvyness.ai",
        "version": "1.0",
        "x-build-id": "Lj3U41I1CzQDg5UNG"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/xavvyness~xavvyness-smart-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-xavvyness-xavvyness-smart-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/xavvyness~xavvyness-smart-extractor/runs": {
            "post": {
                "operationId": "runs-sync-xavvyness-xavvyness-smart-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/xavvyness~xavvyness-smart-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-xavvyness-xavvyness-smart-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "urls",
                    "extractionPrompt"
                ],
                "properties": {
                    "urls": {
                        "title": "Target URLs",
                        "type": "array",
                        "description": "URLs to extract data from.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "extractionPrompt": {
                        "title": "What to Extract",
                        "type": "string",
                        "description": "Describe in plain language what data you want extracted. E.g. 'Extract all job titles, companies, and locations from this page.'"
                    },
                    "outputFormat": {
                        "title": "Output Format",
                        "enum": [
                            "json",
                            "csv",
                            "markdown"
                        ],
                        "type": "string",
                        "description": "Format for the extracted data output.",
                        "default": "json"
                    },
                    "maxItems": {
                        "title": "Max Items to Extract",
                        "minimum": 1,
                        "maximum": 500,
                        "type": "integer",
                        "description": "Maximum number of items to extract per page.",
                        "default": 50
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
