Website Categorization API — 6-LLM Consensus URL Classifier

Stop hallucinated category labels. Run URLs through 6 LLMs voting in parallel (DeepSeek-v4, Llama-4, Qwen-3.5, Nemotron-3, GLM-5.1, MiniMax) for higher-confidence taxonomy classification. Lead-gen filtering, content moderation, dataset labeling. $0.007 per URL.

Multi-Model Consensus Web Page Classifier

Classify any list of URLs into your custom taxonomy using a 6-model consensus engine (open-weights frontier LLMs voting in parallel). Reduces single-model hallucination on edge cases — useful for lead-gen filtering, content moderation queues, knowledge-graph ingestion, and dataset labeling.

Why consensus?

A single LLM occasionally hallucinates labels on ambiguous pages. This actor fans out the same classification prompt to 6 independent open-weights models and returns the consensus label plus a confidence signal. When the models agree, the label is more trustworthy than a single model's answer; when they disagree, the row is flagged for review.

Models in the pool: DeepSeek-v4, Llama-4-maverick, Qwen-3.5, NVIDIA Nemotron-3, GLM-5.1, MiniMax-m2.7.
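
The voting idea can be illustrated with a small sketch. The actor's real aggregation rule is not documented here, so the function name, the `min_agreement` threshold, and the shape of `votes` are all assumptions made for illustration.

```python
from collections import Counter


def consensus_label(votes, taxonomy, min_agreement=0.5):
    """Return (label, agreement, needs_review) for one URL.

    votes: one label per model (hypothetical input shape, not the actor's API).
    """
    # Votes outside the taxonomy are treated as abstentions.
    valid = [v for v in votes if v in taxonomy]
    if not valid:
        return "other", 0.0, True
    label, count = Counter(valid).most_common(1)[0]
    # Agreement is the winning share of all votes cast, abstentions included.
    agreement = count / len(votes)
    # Flag the row unless the winner holds more than min_agreement of the votes.
    return label, agreement, agreement <= min_agreement
```

With six votes, four matching labels clear the default threshold, while a 3-3 split is flagged for review.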

Pricing

Pay-per-event (no subscription):

  • $0.007 per URL classified (charged on each result row written)
  • $0.01 per run (one-time orchestration fee)

A 1,000-URL run costs ~$7.01.
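
As a quick check of that arithmetic, here is a minimal cost estimator. It assumes only the two fees listed above and that error rows are free, as stated in the Output section; the function and constant names are illustrative.

```python
from decimal import Decimal

PER_URL = Decimal("0.007")  # charged per result row written
PER_RUN = Decimal("0.01")   # one-time orchestration fee per run


def estimate_cost(successful_rows, runs=1):
    """Estimated charge in USD for a number of ok rows spread over `runs` runs."""
    return PER_URL * successful_rows + PER_RUN * runs
```

Splitting the same 1,000 URLs into 34 runs of at most 30 URLs each (the batch size suggested under Limitations) raises the estimate from $7.01 to $7.34, because the per-run fee is paid 34 times.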

Input

{
  "urls": ["https://stripe.com", "https://nytimes.com"],
  "taxonomy": ["fintech", "news_media", "developer_tools", "ecommerce", "other"],
  "consensusMode": "majority",
  "maxConcurrency": 5
}

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | string[] | | Public URLs to classify. |
| taxonomy | string[] | | 2–30 candidate categories. Should be mutually exclusive and include an "other" bucket. |
| consensusMode | "majority" or "deep" | majority | majority uses fewer models (faster). deep uses the full pool. |
| maxConcurrency | int | 5 | Parallel URL fetches (1–20). |
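
Before starting a run, it can help to check the input against the constraints listed above. The sketch below is a client-side convenience, not part of the actor; `validate_input` is a hypothetical helper, and the URL-scheme and duplicate checks are extra assumptions beyond the documented limits.

```python
def validate_input(run_input):
    """Return a list of problems found in an input dict (empty list means OK)."""
    errors = []
    urls = run_input.get("urls") or []
    taxonomy = run_input.get("taxonomy") or []
    mode = run_input.get("consensusMode", "majority")
    concurrency = run_input.get("maxConcurrency", 5)

    if not urls:
        errors.append("urls must contain at least one URL")
    # Assumption: only http(s) URLs are meaningful for a public page fetch.
    if any(not str(u).startswith(("http://", "https://")) for u in urls):
        errors.append("every URL must start with http:// or https://")
    if not 2 <= len(taxonomy) <= 30:
        errors.append("taxonomy must list 2-30 categories")
    if len(set(taxonomy)) != len(taxonomy):
        errors.append("taxonomy contains duplicate categories")
    if mode not in ("majority", "deep"):
        errors.append('consensusMode must be "majority" or "deep"')
    if not isinstance(concurrency, int) or not 1 <= concurrency <= 20:
        errors.append("maxConcurrency must be an integer from 1 to 20")
    return errors
```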

Output (per URL)

{
  "url": "https://stripe.com",
  "title": "Stripe | Financial Infrastructure for the Internet",
  "status": "ok",
  "category": "fintech",
  "confidence": null,
  "consensusMode": "majority",
  "durationMs": 1840
}

The category field holds the consensus label when the models agree, or "other" as a safe fallback when the engine's JSON response cannot be parsed. confidence may be null while the post-processing extractor is being improved.

For URLs that take too long to fetch or where the consensus engine times out, status: "error" is returned with a reason — those rows are not charged.

Use cases

  • Lead-gen filtering — bucket scraped homepages by industry before SDR outreach.
  • Content moderation triage — pre-tag URLs in user-submitted feeds.
  • Dataset labeling — bootstrap a training set with consensus labels.
  • Affiliate / partner discovery — group competitor sites by vertical.
  • Compliance pre-screening — surface pages that may belong to regulated categories.

Tips

  • Treat the actor as a first-pass classifier: high-confidence rows go straight through, ambiguous or error rows go to a human queue.
  • Categories work better when they are concrete and non-overlapping. Add "other" as the safety bucket.
  • Heavy single-page-application URLs may exceed the 120s consensus timeout; expect a small percentage of error rows on JS-heavy targets.
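
The first tip can be sketched as a small routing function over the output rows. The scale of confidence is not documented, so the 0-to-1 range and the `min_confidence` default below are assumptions; rows with a null confidence are sent to review rather than trusted.

```python
def triage(rows, min_confidence=0.8):
    """Split output rows into (accepted, review) lists.

    Assumes confidence, when present, is a number between 0 and 1.
    """
    accepted, review = [], []
    for row in rows:
        confidence = row.get("confidence")
        is_ok = row.get("status") == "ok"
        if is_ok and confidence is not None and confidence >= min_confidence:
            accepted.append(row)
        else:
            # Error rows, low-confidence rows, and null confidence all need a human.
            review.append(row)
    return accepted, review
```

Since null is currently a common confidence value, most rows will land in the review list under this rule until confidence extraction improves.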

How it works

  1. Fetches each URL (10s budget, follows redirects).
  2. Extracts title + meta description + ~250 characters of body text.
  3. Sends a compact classification prompt to the public consensus endpoint (/v1/public), which fans out to the 6-model pool and returns the agreed JSON label.
  4. Parses the result and pushes one row per URL to the Apify dataset.
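
Step 2 can be approximated with the standard library alone. This is a rough sketch of that kind of extraction, not the actor's actual code: the class and function names are made up, and "body text" is taken here to mean any text outside title, script, style, and noscript elements.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the title, meta description, and visible text of a page."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth:
            self.text_parts.append(data)


def summarize(html, body_chars=250):
    """Return title, description, and the first `body_chars` characters of text."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the character budget is spent on words.
    body = " ".join(" ".join(parser.text_parts).split())
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "body": body[:body_chars],
    }
```

Keeping the excerpt this short keeps the classification prompt compact, at the cost of missing content that only appears further down the page or is rendered by JavaScript.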

No personal data is stored — only the public page text and your taxonomy are sent for classification. The consensus engine is rate-limited at 10 requests per IP per day on the free public endpoint.

Limitations (honest)

  • The actor is a fresh listing (May 2026). Accuracy claims have not been independently benchmarked yet — early users help us calibrate.
  • A small fraction of buyer test runs hit the public endpoint's per-IP rate limit on bursts. Use small batches (≤30 URLs/run) for now, or contact the publisher for a private endpoint key.
  • confidence extraction is being tightened; for now null is common.
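
To follow the small-batch advice above, a URL list can be split client-side before each run. A minimal sketch, where `batches` is a hypothetical helper and 30 is the batch size suggested above:

```python
def batches(urls, size=30):
    """Yield consecutive slices of `urls`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]
```

Each batch becomes the urls field of one run; note that every extra run also incurs the per-run orchestration fee.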

Source

Built and maintained by yanmiayn. Bug reports and feature requests via the actor's Issues tab on Apify.