Website Categorization API — 6-LLM Consensus URL Classifier
Pricing
from $7.00 / 1,000 results
Reduce hallucinated category labels. Run URLs through 6 LLMs voting in parallel (DeepSeek-v4, Llama-4, Qwen-3.5, Nemotron-3, GLM-5.1, MiniMax) for higher-confidence taxonomy classification. Built for lead-gen filtering, content moderation, and dataset labeling. $0.007 per URL.
Developer
yanmiayn
Multi-Model Consensus Web Page Classifier
Classify any list of URLs into your custom taxonomy using a 6-model consensus engine (open-weights frontier LLMs voting in parallel). Reduces single-model hallucination on edge cases — useful for lead-gen filtering, content moderation queues, knowledge-graph ingestion, and dataset labeling.
Why consensus?
A single LLM occasionally hallucinates labels on ambiguous pages. This actor fans out the same classification prompt to 6 independent open-weights models and returns the consensus label plus a confidence signal. Agreement across models makes a label more trustworthy; when the models disagree, the row is flagged for review.
Models in the pool: DeepSeek-v4, Llama-4-maverick, Qwen-3.5, NVIDIA Nemotron-3, GLM-5.1, MiniMax-m2.7.
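To make the voting idea concrete, here is a minimal sketch of a majority vote over per-model labels. It is not the actor's actual implementation, which is not published in this listing; the function name `consensus_label`, the `min_agreement` threshold, and the returned agreement ratio are all assumptions made for illustration.

```python
from collections import Counter


def consensus_label(votes, min_agreement=0.5):
    """Return (label, agreement, needs_review) for a list of per-model labels.

    Hypothetical helper: the threshold and the return shape are assumptions,
    not part of the actor's documented behavior.
    """
    if not votes:
        # No usable votes: fall back to the safety bucket and ask for review.
        return ("other", 0.0, True)
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    # Flag the row unless the top label wins a strict majority of the votes.
    needs_review = agreement <= min_agreement
    return (label, agreement, needs_review)
```

With six voters, a 5-to-1 split would pass, while a 3-to-3 tie would be flagged for review under this rule.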
Pricing
Pay-per-event (no subscription):
- $0.007 per URL classified (charged on each result row written)
- $0.01 per run (one-time orchestration fee)
A 1,000-URL run costs ~$7.01.
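The arithmetic behind that figure can be sketched as follows, using the two prices listed above. The function name `estimate_cost` is hypothetical, and `Decimal` is used only to avoid floating-point rounding in the example.

```python
from decimal import Decimal

PER_URL = Decimal("0.007")  # charged per result row written
PER_RUN = Decimal("0.01")   # one-time orchestration fee per run


def estimate_cost(charged_rows, runs=1):
    """Estimate the USD cost for a number of charged rows spread over some runs."""
    return PER_URL * charged_rows + PER_RUN * runs
```

For example, `estimate_cost(1000)` gives 7.01, while the same 1,000 URLs split across 34 runs gives 7.34, because the per-run fee applies to each run.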
Input
    {
        "urls": ["https://stripe.com", "https://nytimes.com"],
        "taxonomy": ["fintech", "news_media", "developer_tools", "ecommerce", "other"],
        "consensusMode": "majority",
        "maxConcurrency": 5
    }
| Field | Type | Default | Description |
|---|---|---|---|
| `urls` | `string[]` | (none) | Public URLs to classify. |
| `taxonomy` | `string[]` | (none) | 2–30 candidate categories. Should be mutually exclusive and include an `"other"` bucket. |
| `consensusMode` | `"majority"` \| `"deep"` | `majority` | `majority` uses fewer models (faster). `deep` uses the full pool. |
| `maxConcurrency` | `int` | `5` | Parallel URL fetches (1–20). |
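Checking an input locally before starting a run can catch mistakes that would otherwise cost a run fee. The sketch below encodes the constraints from the table; the helper name `validate_input`, the duplicate-category check, and the rule that `urls` must be non-empty are assumptions rather than documented behavior of the actor.

```python
def validate_input(run_input):
    """Return a list of problems found in an actor input dict (empty if none)."""
    problems = []
    urls = run_input.get("urls") or []
    taxonomy = run_input.get("taxonomy") or []
    mode = run_input.get("consensusMode", "majority")
    concurrency = run_input.get("maxConcurrency", 5)

    if not urls:
        # Assumption: an empty URL list is treated as an error.
        problems.append("urls must contain at least one URL")
    if not 2 <= len(taxonomy) <= 30:
        problems.append("taxonomy must have 2-30 categories")
    if len(set(taxonomy)) != len(taxonomy):
        problems.append("taxonomy contains duplicate categories")
    if "other" not in taxonomy:
        problems.append('taxonomy should include an "other" bucket')
    if mode not in ("majority", "deep"):
        problems.append('consensusMode must be "majority" or "deep"')
    if not isinstance(concurrency, int) or not 1 <= concurrency <= 20:
        problems.append("maxConcurrency must be an integer from 1 to 20")
    return problems
```

An empty list means the input satisfies every constraint listed in the table.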
Output (per URL)
    {
        "url": "https://stripe.com",
        "title": "Stripe | Financial Infrastructure for the Internet",
        "status": "ok",
        "category": "fintech",
        "confidence": null,
        "consensusMode": "majority",
        "durationMs": 1840
    }
The `category` field returns the consensus answer when models agree, or `"other"` as a safe fallback when JSON parsing fails. `confidence` may be `null` while the post-processing extractor is being improved.
For URLs that take too long to fetch or where the consensus engine times out, `status: "error"` is returned with a reason; those rows are not charged.
Use cases
- Lead-gen filtering — bucket scraped homepages by industry before SDR outreach.
- Content moderation triage — pre-tag URLs in user-submitted feeds.
- Dataset labeling — bootstrap a training set with consensus labels.
- Affiliate / partner discovery — group competitor sites by vertical.
- Compliance pre-screening — surface pages that may belong to regulated categories.
Tips
- Treat the actor as a first-pass classifier: high-confidence rows go straight through, ambiguous or error rows go to a human queue.
- Categories work better when they are concrete and non-overlapping. Add `"other"` as the safety bucket.
- Heavy single-page-application URLs may exceed the 120s consensus timeout; expect a small percentage of `error` rows on JS-heavy targets.
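The first-pass workflow from the tips above could be sketched like this. The function name `triage`, the 0.8 threshold, and the assumption that `confidence` is a number between 0 and 1 are illustrative only; the listing does not specify the confidence scale.

```python
def triage(rows, min_confidence=0.8):
    """Split result rows into (accepted, review) lists.

    Assumes each row is a dict shaped like the output example, and that
    confidence, when present, is a number on a 0-1 scale (an assumption).
    """
    accepted, review = [], []
    for row in rows:
        confidence = row.get("confidence")
        if (row.get("status") == "ok"
                and confidence is not None
                and confidence >= min_confidence):
            accepted.append(row)
        else:
            # Error rows, null confidence, and low confidence all go to humans.
            review.append(row)
    return accepted, review
```

Because `confidence` is often `null` at present, this conservative rule would send most rows to the review queue; loosening it is a judgment call for each use case.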
How it works
- Fetches each URL (10s budget, follows redirects).
- Extracts title + meta description + ~250 characters of body text.
- Sends a compact classification prompt to the public consensus endpoint (`/v1/public`), which fans out to the 6-model pool and returns the agreed JSON label.
- Parses the result and pushes one row per URL to the Apify dataset.
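The extraction step in the list above can be approximated with the Python standard library. This is a sketch of the general technique, not the actor's own code: the class and function names are hypothetical, the fetch step is omitted, and the exact rules the actor uses to pick body text are unknown.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript"}


class PageSummary(HTMLParser):
    """Collects the <title>, the meta description, and visible body text."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.description = ""
        self.body_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.description = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_body and self._skip_depth == 0:
            self.body_parts.append(data)


def summarize(html, body_chars=250):
    """Return title, meta description, and the first body_chars of body text."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the character budget is spent on text.
    body = " ".join(" ".join(parser.body_parts).split())
    title = " ".join("".join(parser.title_parts).split())
    return {"title": title, "description": parser.description, "body": body[:body_chars]}
```

Keeping the payload this small is what makes the prompt "compact", and it also explains why JS-heavy pages with little server-rendered text are harder to classify.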
No personal data is stored — only the public page text and your taxonomy are sent for classification. The consensus engine is rate-limited at 10 requests per IP per day on the free public endpoint.
Limitations (honest)
- The actor is a fresh listing (May 2026). Accuracy claims have not been independently benchmarked yet — early users help us calibrate.
- A small fraction of buyer test runs hit the public endpoint's per-IP rate limit on bursts. Use small batches (≤30 URLs/run) for now, or contact the publisher for a private endpoint key.
- `confidence` extraction is being tightened; for now `null` is common.
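Following the small-batch advice above, a long URL list can be split before submitting runs. A minimal sketch, where `chunk_urls` is a hypothetical helper and the default of 30 mirrors the suggested batch size:

```python
def chunk_urls(urls, batch_size=30):
    """Split a URL list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
```

Each batch would then be submitted as its own run, so the $0.01 per-run fee applies once per batch.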
Source
Built and maintained by yanmiayn. Bug reports and feature requests via the actor's Issues tab on Apify.