Dataset Classifier
Automatically classify rows in any Apify dataset into categories you define. Point it at a dataset, pick a text column, provide your categories, and get back the original data with a new classification column added.
Developer: Lukas Priban (maintained by Community)
What it does
- Reads rows from an Apify dataset or a CSV file you upload
- Classifies each row's text field into one (or more) of your categories
- Outputs the original rows with an added `_classification` field
No setup, no API keys, no configuration beyond the basics. Just provide your data and categories.
Use cases
- Categorize scraped articles — Sort news articles into Technology, Business, Sports, etc.
- Tag product listings — Classify products by type, audience, or price tier
- Filter leads — Separate B2B from B2C leads based on company descriptions
- Sort reviews by topic — Classify customer feedback into Product, Shipping, Support, etc.
- Organize job postings — Tag jobs by seniority, department, or work model
Input
| Field | Type | Required | Description |
|---|---|---|---|
| Source Dataset | string | No* | Apify dataset to classify (use picker or paste ID) |
| CSV File | string | No* | Upload a CSV file to classify |
| Field to Classify | string | Yes | Name of the text column (e.g. description, title) |
| Categories | string[] | Yes | List of target categories |
| Category Descriptions | object | No | Descriptions to help disambiguate similar categories |
| Context Fields | string[] | No | Extra columns to provide as context alongside the main field |
| Output Field Name | string | No | Name of the new column (default: _classification) |
| Allow Multiple Categories | boolean | No | Assign multiple categories per row (default: false) |
| Max Items | integer | No | Limit how many rows to process |
| LLM Model | string | No | Override the default model (openai/gpt-4o-mini). See openrouter.ai/models. |
| Suggest Categories From Sample | boolean | No | If enabled, skip classification and have the LLM propose categories from a sample of your data. See "Suggest mode" below. |
| Sample Size For Category Suggestion | integer | No | Rows to sample when suggesting categories (default 30, range 5–100). |
| Allow UNCATEGORIZED | boolean | No | Let the model use "UNCATEGORIZED" when no listed category fits. Default off (model forced to pick). |
| Suggest New Categories From UNCATEGORIZED | boolean | No | After classification, sample UNCATEGORIZED rows and propose new categories that would cover them. Requires Allow UNCATEGORIZED. |
*Provide either Source Dataset or CSV File, not both.
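The "Required" column and the either/or rule for the data source can be checked locally before starting a run. The following is a minimal sketch, assuming the JSON keys that appear in this README's example inputs (`datasetId`, `csvFile`, `classifyField`, `categories`). The function name `validate_input` is hypothetical and not part of the Actor.

```python
def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems found in a classification run input."""
    problems = []

    has_dataset = bool(run_input.get("datasetId"))
    has_csv = bool(run_input.get("csvFile"))
    if has_dataset == has_csv:
        # Either both sources are set or neither is; exactly one is expected.
        problems.append("Provide either datasetId or csvFile, not both.")

    if not run_input.get("classifyField"):
        problems.append("classifyField is required.")

    # Note: in suggest mode the Categories field is left empty on purpose,
    # so this check only applies to a normal classification run.
    categories = run_input.get("categories") or []
    if not categories:
        problems.append("categories must list at least one category.")
    elif len(set(categories)) != len(categories):
        problems.append("categories contains duplicates.")

    return problems


example = {
    "datasetId": "abc123DEF456",
    "classifyField": "title",
    "categories": ["Technology", "Business", "Entertainment", "Sports", "Science"],
}
print(validate_input(example))  # prints []
```

An empty list means the input passed these local checks; the Actor may still apply its own validation.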
Example input (dataset)
{"datasetId": "abc123DEF456","classifyField": "title","categories": ["Technology", "Business", "Entertainment", "Sports", "Science"]}
Example input (CSV)
{"csvFile": "title,description\nApple Vision Pro,Apple launches mixed reality headset\nSuper Bowl,Chiefs win in overtime","classifyField": "description","categories": ["Technology", "Sports", "Business"]}
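Hand-writing the CSV string gets error-prone once values contain commas or quotes. As a sketch, assuming the `csvFile` field accepts inline CSV text exactly as in the example above, Python's standard `csv` module can produce that string with correct quoting. The helper name `rows_to_csv_text` is hypothetical.

```python
import csv
import io
import json


def rows_to_csv_text(rows: list[dict], columns: list[str]) -> str:
    """Serialize rows to CSV text with a header line and LF line endings."""
    buffer = io.StringIO()
    # lineterminator is set explicitly because csv defaults to CRLF.
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"title": "Apple Vision Pro", "description": "Apple launches mixed reality headset"},
    {"title": "Super Bowl", "description": "Chiefs win in overtime"},
]
run_input = {
    "csvFile": rows_to_csv_text(rows, ["title", "description"]),
    "classifyField": "description",
    "categories": ["Technology", "Sports", "Business"],
}
print(json.dumps(run_input, indent=2))
```

The generated text matches the example input except for a trailing newline after the last row.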
Using category descriptions
When categories are ambiguous or overlap, add descriptions to improve accuracy:
{"datasetId": "abc123DEF456","classifyField": "description","categories": ["B2B", "B2C", "Internal"],"categoryDescriptions": {"B2B": "Business-to-business products and services sold to companies","B2C": "Consumer products and services sold to individuals","Internal": "Internal tools, documentation, and employee-facing resources"}}
Using context fields
Provide additional columns for more accurate classification:
{"datasetId": "abc123DEF456","classifyField": "description","categories": ["Positive", "Negative", "Neutral"],"contextFields": ["rating", "author"]}
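This README does not say what happens when a context field names a column that does not exist, so it may be worth checking column names before a run. A minimal sketch for CSV input, assuming the first CSV line is the header; `missing_columns` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import csv
import io


def missing_columns(csv_text: str, run_input: dict) -> list[str]:
    """Return the referenced column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    wanted = [run_input["classifyField"], *run_input.get("contextFields", [])]
    return [name for name in wanted if name not in header]


csv_text = "description,rating\nGreat phone,5\nBroke after a week,1\n"
run_input = {
    "csvFile": csv_text,
    "classifyField": "description",
    "categories": ["Positive", "Negative", "Neutral"],
    "contextFields": ["rating", "author"],
}
print(missing_columns(csv_text, run_input))  # prints ['author']
```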
Suggest mode
Not sure what categories make sense for your data? Enable Suggest Categories From Sample and leave the Categories field empty. The Actor will:
- Read a sample (default 30 rows) from your dataset or CSV.
- Ask the LLM to propose 5–10 mutually-exclusive, content-specific categories.
- Write the proposals to the `SUGGESTED_CATEGORIES` record in the key-value store and log them to the run output.
- Exit. No dataset rows are pushed in this mode, so it's much cheaper than a full classification.
Review the suggestions, pick the ones you want, then re-run the Actor with Suggest Categories From Sample disabled and your chosen names in Categories to classify the full dataset.
Iterative taxonomy refinement
After your first real classification pass, some rows may have genuinely not fit any of your categories. Enable both Allow UNCATEGORIZED and Suggest New Categories From UNCATEGORIZED on the next run and the Actor will:
- Classify normally, marking rows that don't fit as `"UNCATEGORIZED"`.
- After classification finishes, sample up to Sample Size UNCATEGORIZED rows.
- Ask the LLM for 3–7 new categories that would cover those rows (without overlapping your existing ones).
- Write the proposals to `SUGGESTED_NEW_CATEGORIES` in the key-value store and log them.
Add the names you like to Categories, re-run, and your UNCATEGORIZED count should drop.
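To see whether a refinement pass helped, you can compare the share of UNCATEGORIZED rows between runs. The sketch below assumes the output rows have already been downloaded as a list of dictionaries and that the output field keeps its default name `_classification`. Both helper names are hypothetical and the sample rows are made up.

```python
def uncategorized_share(rows: list[dict], field: str = "_classification") -> float:
    """Fraction of rows whose classification is (or includes) UNCATEGORIZED."""
    if not rows:
        return 0.0

    def is_uncategorized(value) -> bool:
        # The field is a string, or a list when multiple categories are enabled.
        labels = value if isinstance(value, list) else [value]
        return "UNCATEGORIZED" in labels

    hits = sum(1 for row in rows if is_uncategorized(row.get(field)))
    return hits / len(rows)


def extend_categories(current: list[str], accepted: list[str]) -> list[str]:
    """Append accepted suggestions, keeping order and skipping duplicates."""
    merged = list(current)
    for name in accepted:
        if name not in merged:
            merged.append(name)
    return merged


rows = [
    {"title": "Quarterly earnings beat forecasts", "_classification": "Business"},
    {"title": "Recipe: lemon tart", "_classification": "UNCATEGORIZED"},
    {"title": "New GPU announced", "_classification": "Technology"},
    {"title": "Ten tips for houseplants", "_classification": "UNCATEGORIZED"},
]
print(uncategorized_share(rows))  # prints 0.5
print(extend_categories(["Technology", "Business"], ["Lifestyle", "Business"]))
# prints ['Technology', 'Business', 'Lifestyle']
```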
Output
The Actor outputs the original dataset rows with one new field added. All original fields are preserved.
| Field | Type | Description |
|---|---|---|
| `_classification` | string or string[] | Assigned category (or array if multiple categories enabled) |
Example output row
{"title": "Apple announces new M5 chip at WWDC","url": "https://example.com/article/123","description": "Apple unveiled its next-generation M5 processor...","_classification": "Technology"}
By default, every row is classified into one of the categories you provide; even genuinely borderline content is forced into its closest-fitting category. If you need an explicit "doesn't fit" bucket, either enable Allow UNCATEGORIZED or add a catch-all category to your list yourself (e.g. "Neutral" in ["Positive", "Negative", "Neutral"]).
Items that could not be classified are marked "CLASSIFICATION_ERROR". This happens when:
- The LLM kept returning malformed or unparseable responses for that row after retries.
- The LLM returned a category that wasn't on your list (a hallucinated label).
The Actor continues past errored rows rather than aborting, so a few bad rows don't kill a large job. Inspect CLASSIFICATION_ERROR rows after the run if you need to retry them separately.
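Separating errored rows for a retry can be done after downloading the output. This is a sketch under the assumption that rows are plain dictionaries and the output field is the default `_classification`. The helper `split_errors` is hypothetical and the third sample row is invented to show the error case.

```python
ERROR_LABEL = "CLASSIFICATION_ERROR"


def split_errors(rows, field="_classification"):
    """Separate rows marked CLASSIFICATION_ERROR from classified rows."""
    ok, failed = [], []
    for row in rows:
        value = row.get(field)
        # Handle both the single-label string and the multi-label list form.
        labels = value if isinstance(value, list) else [value]
        (failed if ERROR_LABEL in labels else ok).append(row)
    return ok, failed


rows = [
    {"title": "Apple announces new M5 chip at WWDC", "_classification": "Technology"},
    {"title": "Chiefs win in overtime", "_classification": "Sports"},
    {"title": "???", "_classification": "CLASSIFICATION_ERROR"},
]
ok, failed = split_errors(rows)
print(len(ok), len(failed))  # prints 2 1

# Drop the error marker so the failed rows can be submitted again as fresh input.
retry_rows = [{k: v for k, v in row.items() if k != "_classification"} for row in failed]
print(retry_rows)  # prints [{'title': '???'}]
```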
How to get a dataset ID
Every Apify Actor run produces a dataset. You can find the dataset ID in several ways:
- From the Apify Console: Open any Actor run, go to the Storage tab, and copy the dataset ID
- From the API: The dataset ID is returned in the `defaultDatasetId` field of every run response
- From integrations: When chaining Actors, pass the `defaultDatasetId` from one run as input to this Actor
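When chaining Actors in your own code, the hand-off amounts to copying one field. The sketch below assumes the run information is available as a dictionary, either the raw API response (which, to my understanding, wraps the run object in a top-level `data` key) or an already unwrapped run object. Both function names and the placeholder run ID are hypothetical.

```python
def dataset_id_from_run(run: dict) -> str:
    """Extract defaultDatasetId from a run object or a raw API response."""
    # Accept both the wrapped form {"data": {...}} and the bare run object.
    run_object = run.get("data", run)
    dataset_id = run_object.get("defaultDatasetId")
    if not dataset_id:
        raise ValueError("run response has no defaultDatasetId")
    return dataset_id


def classifier_input(run: dict, classify_field: str, categories: list[str]) -> dict:
    """Build this Actor's input from a previous run's default dataset."""
    return {
        "datasetId": dataset_id_from_run(run),
        "classifyField": classify_field,
        "categories": categories,
    }


api_response = {"data": {"id": "run-id-placeholder", "defaultDatasetId": "abc123DEF456"}}
print(classifier_input(api_response, "title", ["Technology", "Business"]))
```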
Pricing
This Actor uses pay-per-event pricing with a small minimum charge per classified row, kept low to cover the underlying LLM cost and a thin margin. There is no monthly rental or platform-usage markup beyond the standard Apify costs (compute time, dataset operations) that any Actor incurs. See the Actor's pricing tab on Apify Store for the current per-item rate.
Limitations
- The text field must contain meaningful text for classification; empty or very short values may be classified as `UNCATEGORIZED`
- Very long text fields (>2000 characters) are handled automatically but may slightly increase processing time
- Accuracy depends on how distinct your categories are; use category descriptions to improve results when categories overlap
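Because rows are charged per classification, it can make sense to set aside rows with empty or very short text before submitting them. A minimal sketch: the 10-character threshold is an arbitrary assumption rather than a value documented by the Actor, and `split_by_text_quality` is a hypothetical helper.

```python
def split_by_text_quality(rows, field, min_chars=10):
    """Split rows into (usable, skipped) by the stripped length of `field`."""
    usable, skipped = [], []
    for row in rows:
        # Treat a missing or None value as empty text.
        text = str(row.get(field) or "").strip()
        (usable if len(text) >= min_chars else skipped).append(row)
    return usable, skipped


rows = [
    {"id": 1, "description": "Apple launches mixed reality headset"},
    {"id": 2, "description": ""},
    {"id": 3, "description": "ok"},
    {"id": 4, "description": None},
]
usable, skipped = split_by_text_quality(rows, "description")
print([r["id"] for r in usable])   # prints [1]
print([r["id"] for r in skipped])  # prints [2, 3, 4]
```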
Acceptable use
This Actor is a classification tool, not a general-purpose AI endpoint. Do not submit content that is unlawful, infringes third-party rights, or violates the terms of service of any underlying AI providers used by the Actor. You are responsible for the content you submit and for ensuring it is appropriate for automated processing.


