Deprecated

Dataset Classifier

Automatically classify rows in any Apify dataset into categories you define. Point it at a dataset, pick a text column, provide your categories, and get back the original data with a new classification column added.

Pricing: Pay per usage
Rating: 0.0 (0)
Developer: Lukas Priban
Last modified: 2 months ago

What it does

Reads rows from an Apify dataset or a CSV file you upload
Classifies each row's text field into one (or more) of your categories
Outputs the original rows with an added _classification field

No setup, no API keys, no configuration beyond the basics. Just provide your data and categories.

Use cases

Categorize scraped articles — Sort news articles into Technology, Business, Sports, etc.
Tag product listings — Classify products by type, audience, or price tier
Filter leads — Separate B2B from B2C leads based on company descriptions
Sort reviews by topic — Classify customer feedback into Product, Shipping, Support, etc.
Organize job postings — Tag jobs by seniority, department, or work model

Input

Field	Type	Required	Description
Source Dataset	string	No*	Apify dataset to classify (use picker or paste ID)
CSV File	string	No*	Upload a CSV file to classify
Field to Classify	string	Yes	Name of the text column (e.g. `description`, `title`)
Categories	string[]	Yes	List of target categories (may be left empty only when Suggest Categories From Sample is enabled)
Category Descriptions	object	No	Descriptions to help disambiguate similar categories
Context Fields	string[]	No	Extra columns to provide as context alongside the main field
Output Field Name	string	No	Name of the new column (default: `_classification`)
Allow Multiple Categories	boolean	No	Assign multiple categories per row (default: `false`)
Max Items	integer	No	Limit how many rows to process
LLM Model	string	No	Override the default model (`openai/gpt-4o-mini`). See openrouter.ai/models.
Suggest Categories From Sample	boolean	No	If enabled, skip classification and have the LLM propose categories from a sample of your data. See "Suggest mode" below.
Sample Size For Category Suggestion	integer	No	Rows to sample when suggesting categories (default 30, range 5–100).
Allow UNCATEGORIZED	boolean	No	Let the model use `"UNCATEGORIZED"` when no listed category fits. Default off (model forced to pick).
Suggest New Categories From UNCATEGORIZED	boolean	No	After classification, sample UNCATEGORIZED rows and propose new categories that would cover them. Requires Allow UNCATEGORIZED.

*Provide either Source Dataset or CSV File, not both.
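The either/or rule for the data source can be checked before a run is started. The following is a minimal sketch, assuming the JSON keys that appear in this README's example inputs (`datasetId`, `csvFile`, `classifyField`, `categories`). `build_input` is a hypothetical helper, not part of the Actor, and it covers classification runs only (suggest mode leaves Categories empty).

```python
from typing import Optional


def build_input(
    classify_field: str,
    categories: list[str],
    dataset_id: Optional[str] = None,
    csv_file: Optional[str] = None,
) -> dict:
    """Assemble a classification run input with exactly one data source."""
    # "Provide either Source Dataset or CSV File, not both."
    if (dataset_id is None) == (csv_file is None):
        raise ValueError("Provide exactly one of dataset_id or csv_file")
    if not classify_field:
        raise ValueError("classify_field is required")
    if not categories:
        raise ValueError("categories must not be empty for a classification run")

    # Key names are assumed from the README's example inputs.
    run_input = {"classifyField": classify_field, "categories": list(categories)}
    if dataset_id is not None:
        run_input["datasetId"] = dataset_id
    else:
        run_input["csvFile"] = csv_file
    return run_input
```

Failing early on the client side avoids starting a run that the Actor would presumably reject anyway.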

Example input (dataset)

{
    "datasetId": "abc123DEF456",
    "classifyField": "title",
    "categories": ["Technology", "Business", "Entertainment", "Sports", "Science"]
}

Example input (CSV)

{
    "csvFile": "title,description\nApple Vision Pro,Apple launches mixed reality headset\nSuper Bowl,Chiefs win in overtime",
    "classifyField": "description",
    "categories": ["Technology", "Sports", "Business"]
}
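In the example above, `csvFile` holds the CSV text inline, with rows separated by `\n`. If your rows start out as Python dictionaries, that string can be produced with the standard `csv` module, which also quotes values that contain commas. This is a sketch under the assumption that the Actor accepts inline CSV text exactly as the example shows; `rows_to_csv_text` is a hypothetical helper name.

```python
import csv
import io
import json


def rows_to_csv_text(rows: list[dict], columns: list[str]) -> str:
    """Serialize rows to CSV text with a header line and newline row separators."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    # Drop the trailing newline so the text has the same shape as the README example.
    return buffer.getvalue().rstrip("\n")


rows = [
    {"title": "Apple Vision Pro", "description": "Apple launches mixed reality headset"},
    {"title": "Super Bowl", "description": "Chiefs win in overtime"},
]
run_input = {
    "csvFile": rows_to_csv_text(rows, ["title", "description"]),
    "classifyField": "description",
    "categories": ["Technology", "Sports", "Business"],
}
print(json.dumps(run_input, indent=4))
```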

Using category descriptions

When categories are ambiguous or overlap, add descriptions to improve accuracy:

{
    "datasetId": "abc123DEF456",
    "classifyField": "description",
    "categories": ["B2B", "B2C", "Internal"],
    "categoryDescriptions": {
        "B2B": "Business-to-business products and services sold to companies",
        "B2C": "Consumer products and services sold to individuals",
        "Internal": "Internal tools, documentation, and employee-facing resources"
    }
}
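In the example above, every key in `categoryDescriptions` matches a name in `categories`. A mistyped key would most likely leave its description unused, so a quick consistency check can help. This sketch assumes keys must match category names exactly (the README does not say how the Actor treats mismatches); `check_descriptions` is a hypothetical helper.

```python
def check_descriptions(categories: list[str], descriptions: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the two fields agree."""
    problems = []
    known = set(categories)
    # Descriptions whose key is not one of the listed categories.
    for name in descriptions:
        if name not in known:
            problems.append(f"description for unknown category: {name!r}")
    # Descriptions that are present but blank.
    for name in categories:
        if name in descriptions and not descriptions[name].strip():
            problems.append(f"empty description for category: {name!r}")
    return problems
```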

Using context fields

Provide additional columns for more accurate classification:

{
    "datasetId": "abc123DEF456",
    "classifyField": "description",
    "categories": ["Positive", "Negative", "Neutral"],
    "contextFields": ["rating", "author"]
}

Suggest mode

Not sure what categories make sense for your data? Enable Suggest Categories From Sample and leave the Categories field empty. The Actor will:

Read a sample (default 30 rows) from your dataset or CSV.
Ask the LLM to propose 5–10 mutually-exclusive, content-specific categories.
Write the proposals to the SUGGESTED_CATEGORIES record in the key-value store and log them to the run output.
Exit. No dataset rows are pushed in this mode, so it's much cheaper than a full classification.

Review the suggestions, pick the ones you want, then re-run the Actor with Suggest Categories From Sample disabled and your chosen names in Categories to classify the full dataset.

After your first real classification pass, some rows may genuinely not have fit any of your categories. Enable both Allow UNCATEGORIZED and Suggest New Categories From UNCATEGORIZED on the next run, and the Actor will:

Classify normally, marking rows that don't fit as "UNCATEGORIZED".
After classification finishes, sample up to Sample Size For Category Suggestion rows from those marked UNCATEGORIZED.
Ask the LLM for 3–7 new categories that would cover those rows (without overlapping your existing ones).
Write the proposals to SUGGESTED_NEW_CATEGORIES in the key-value store and log them.

Add the names you like to Categories, re-run, and your UNCATEGORIZED count should drop.
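Both suggestion flows end with a record in the key-value store (`SUGGESTED_CATEGORIES` or `SUGGESTED_NEW_CATEGORIES`). The following is a rough sketch of reading such a record with the Apify Python client. It assumes the dict-returning `apify-client` API (`actor(...).call(run_input=...)` and `key_value_store(...).get_record(...)`) and assumes the record lives in the run's default key-value store. The Actor ID and the input keys that switch on suggest mode are not given in this README, so the caller supplies them; `run_and_read_record` is a hypothetical helper.

```python
from typing import Any, Optional


def run_and_read_record(client: Any, actor_id: str, run_input: dict, record_key: str) -> Optional[Any]:
    """Run the Actor, then return the value of one key-value store record (or None)."""
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        return None
    # Assumption: the suggestions are written to the run's default key-value store.
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    record = store.get_record(record_key)
    return None if record is None else record["value"]


# Usage (placeholders, not real values):
#   from apify_client import ApifyClient
#   client = ApifyClient("<APIFY_TOKEN>")
#   suggestions = run_and_read_record(client, "<actor-id>", suggest_input, "SUGGESTED_CATEGORIES")
```

The shape of the record's value is not documented here, so inspect it before feeding the names back into Categories.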

Output

The Actor outputs the original dataset rows with one new field added. All original fields are preserved.

Field	Type	Description
`_classification`	string or string[]	Assigned category (or array if multiple categories enabled)

Example output row

{
    "title": "Apple announces new M5 chip at WWDC",
    "url": "https://example.com/article/123",
    "description": "Apple unveiled its next-generation M5 processor...",
    "_classification": "Technology"
}
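Because the classification field is a string in single-category mode and an array when multiple categories are enabled, downstream code is simpler if it normalizes both shapes. A minimal sketch of counting rows per category follows; `labels_of` and `count_labels` are hypothetical helpers, and the field name defaults to `_classification`.

```python
from collections import Counter
from typing import Iterable


def labels_of(row: dict, field: str = "_classification") -> list[str]:
    """Return the row's categories as a list, whether stored as a string or an array."""
    value = row.get(field)
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def count_labels(rows: Iterable[dict], field: str = "_classification") -> Counter:
    """Count how many rows carry each category."""
    counts = Counter()
    for row in rows:
        counts.update(labels_of(row, field))
    return counts
```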

By default, every row is classified into one of the categories you provide; even genuinely borderline content is forced into its closest-fitting category. If you need an explicit "doesn't fit" bucket, either enable Allow UNCATEGORIZED or add a fallback to your category list (e.g. "Neutral" in ["Positive", "Negative", "Neutral"]).

Items that could not be classified are marked "CLASSIFICATION_ERROR". This happens when:

The LLM kept returning malformed or unparseable responses for that row after retries.
The LLM returned a category that wasn't on your list (a hallucinated label).

The Actor continues past errored rows rather than aborting, so a few bad rows don't kill a large job. Inspect CLASSIFICATION_ERROR rows after the run if you need to retry them separately.
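Pulling the errored rows out of a finished run is a simple filter. This sketch assumes the marker appears either as the field's string value or as an element of the array in multiple-category mode (the README does not specify the latter); `split_errors` is a hypothetical helper.

```python
ERROR_LABEL = "CLASSIFICATION_ERROR"


def split_errors(rows: list[dict], field: str = "_classification") -> tuple[list[dict], list[dict]]:
    """Split rows into (classified, errored) based on the error marker."""
    classified, errored = [], []
    for row in rows:
        value = row.get(field)
        # Treat a plain string and an array of labels the same way.
        labels = [value] if isinstance(value, str) else list(value or [])
        (errored if ERROR_LABEL in labels else classified).append(row)
    return classified, errored
```

The errored rows can then be fed into a separate, smaller run.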

How to get a dataset ID

Every Apify Actor run produces a dataset. You can find the dataset ID in several ways:

From the Apify Console — Open any Actor run, go to the Storage tab, and copy the dataset ID
From the API — The dataset ID is returned in the defaultDatasetId field of every run response
From integrations — When chaining Actors, pass the defaultDatasetId from one run as input to this Actor
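When chaining, the upstream run object already carries the ID this Actor needs. A minimal sketch, assuming the run is available as a dictionary with a `defaultDatasetId` field as described above; `classifier_input_from_run` is a hypothetical helper and the input key names follow this README's examples.

```python
def classifier_input_from_run(run: dict, classify_field: str, categories: list[str]) -> dict:
    """Build a classifier input that points at the dataset produced by another run."""
    dataset_id = run.get("defaultDatasetId")
    if not dataset_id:
        raise ValueError("run has no defaultDatasetId")
    return {
        "datasetId": dataset_id,
        "classifyField": classify_field,
        "categories": list(categories),
    }
```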

Pricing

This Actor uses pay-per-event pricing: a small charge per classified row, kept low so that it covers the underlying LLM cost plus a thin margin. There is no monthly rental and no markup on the standard Apify platform costs (compute time, dataset operations) that any Actor incurs. See the Actor's pricing tab on Apify Store for the current per-item rate.

Limitations

The text field must contain meaningful text for classification — empty or very short values may be classified as UNCATEGORIZED
Very long text fields (>2000 characters) are handled automatically but may slightly increase processing time
Accuracy depends on how distinct your categories are; use category descriptions to improve results when categories overlap

Acceptable use

This Actor is a classification tool, not a general-purpose AI endpoint. Do not submit content that is unlawful, infringes third-party rights, or violates the terms of service of any underlying AI providers used by the Actor. You are responsible for the content you submit and for ensuring it is appropriate for automated processing.
