
Append to dataset

valek.josef/append-to-dataset
A utility actor that lets you build a single large dataset from the individual default datasets of other actor runs.
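For a one-time merge you can call the actor directly. Below is a minimal sketch using the apify-client package, assuming an APIFY_TOKEN environment variable; the actor ID comes from the listing above, while the dataset identifiers are placeholders:

const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

(async () => {
    // Call append-to-dataset and wait for the run to finish.
    const run = await client.actor('valek.josef/append-to-dataset').call({
        datasetIdOrName: 'my-big-dataset',    // target dataset (id or name), placeholder
        sourceDatasetId: 'SOURCE_DATASET_ID', // one-time use case: dataset to append, placeholder
    });
    console.log('Run finished with status:', run.status);
})();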

.editorconfig

root = true

[*]
indent_style = space
indent_size = 4
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
end_of_line = lf

.eslintrc

{
    "extends": "@apify"
}

.gitignore

# This file tells Git which files shouldn't be added to source control

.idea
node_modules

Dockerfile

# First, specify the base Docker image. You can read more about
# the available images at https://sdk.apify.com/docs/guides/docker-images
# You can also use any other image from Docker Hub.
FROM apify/actor-node:16

# Second, copy just package.json and package-lock.json since they should be
# the only files that affect "npm install" in the next step, to speed up the build.
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging.
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --only=prod --no-optional --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Next, copy the remaining files and directories with the source code.
# Since we do this after NPM install, a quick build will be really fast
# for most source file changes.
COPY . ./

# Optionally, specify how to launch the source code of your actor.
# By default, Apify's base Docker images define the CMD instruction
# that runs the Node.js source code using the command specified
# in the "scripts.start" section of the package.json file.
# In short, the instruction looks something like this:
#
# CMD npm start

INPUT_SCHEMA.json

{
    "title": "Input schema for the append-to-dataset actor.",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "datasetIdOrName": {
            "title": "Target Dataset (id or name)",
            "type": "string",
            "description": "Dataset that the items should be appended to",
            "editor": "textfield"
        },
        "sourceDatasetId": {
            "sectionCaption": "Advanced settings",
            "title": "Source Dataset (id or name)",
            "description": "In the one-time use case, fill in the dataset to be appended",
            "type": "string",
            "editor": "textfield"
        },
        "eventData": {
            "title": "Event Data",
            "description": "If the actor is run via webhook, eventData.actorRunId will be determined from the webhook payload and that run's default dataset will be appended",
            "type": "object",
            "editor": "json"
        }
    },
    "required": ["datasetIdOrName"]
}
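For illustration, the schema admits two input shapes, sketched below as plain objects (all identifiers are placeholders): one for the one-time use case, and one for the webhook use case, where the platform's default payload supplies eventData.actorRunId.

// One-time use case: the source dataset is given explicitly.
const oneTimeInput = {
    datasetIdOrName: 'my-big-dataset',
    sourceDatasetId: 'SOURCE_DATASET_ID',
};

// Webhook use case: the default webhook payload already carries
// eventData.actorRunId, so the actor resolves that run's default dataset itself.
const webhookInput = {
    datasetIdOrName: 'my-big-dataset',
    eventData: { actorRunId: 'SOURCE_RUN_ID' },
};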

apify.json

{
    "env": { "npm_config_loglevel": "silent" }
}

main.js

const Apify = require('apify');

Apify.main(async () => {
    // pageSize is not part of the input schema, but it can be passed in the raw
    // input to tune how many items are transferred per request; it defaults to 100.
    const { eventData, datasetIdOrName, sourceDatasetId, pageSize = 100 } = await Apify.getInput();

    // Check that the input really contains sufficient identification of the source dataset
    if (!eventData?.actorRunId && !sourceDatasetId) {
        throw new Error('Missing source dataset id or actor run id in event data');
    }

    const client = Apify.newClient();

    // Prepare the target dataset
    const targetDataset = await Apify.openDataset(datasetIdOrName);

    // Prepare the source dataset client: either the explicitly given dataset,
    // or the default dataset of the run referenced by the webhook payload
    const sourceDatasetClient = sourceDatasetId
        ? client.dataset(sourceDatasetId)
        : client.run(eventData.actorRunId).dataset();

    let currentOffset = 0;
    // eslint-disable-next-line no-constant-condition
    while (true) {
        // Get one page of items from the source dataset
        const { items, total, offset } = await sourceDatasetClient.listItems({
            clean: true,
            limit: pageSize,
            offset: currentOffset,
        });

        // Push the items to the target dataset
        await targetDataset.pushData(items);

        Apify.utils.log.info('Transferred items', {
            count: items.length,
            total,
            offset,
        });

        // Increase the offset to go to the next page
        currentOffset += pageSize;

        // If we got all the items, we can stop
        if (offset + items.length >= total) {
            Apify.utils.log.info('All items were transferred');
            break;
        }
    }
});
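The webhook use case is what makes this a building block: attach an ACTOR.RUN.SUCCEEDED webhook to your scraper and every successful run's default dataset gets appended automatically. Here is a sketch of setting that up with apify-client; the scraper ID and target dataset name are placeholders, and the request URL is the standard run-actor API endpoint:

const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

(async () => {
    await client.webhooks().create({
        eventTypes: ['ACTOR.RUN.SUCCEEDED'],
        condition: { actorId: 'YOUR_SCRAPER_ACTOR_ID' }, // placeholder
        // Start append-to-dataset whenever the scraper succeeds; {{eventData}}
        // expands to an object containing actorRunId, which main.js reads.
        requestUrl: `https://api.apify.com/v2/acts/valek.josef~append-to-dataset/runs?token=${process.env.APIFY_TOKEN}`,
        payloadTemplate: '{"datasetIdOrName": "my-big-dataset", "eventData": {{eventData}}}',
    });
})();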

package.json

{
    "name": "project-empty",
    "version": "0.0.1",
    "description": "This is a boilerplate of an Apify actor.",
    "dependencies": {
        "apify": "^2.3.2"
    },
    "devDependencies": {
        "@apify/eslint-config": "^0.1.3",
        "eslint": "^7.0.0"
    },
    "scripts": {
        "start": "node main.js",
        "lint": "./node_modules/.bin/eslint . --ext .js,.jsx",
        "lint:fix": "./node_modules/.bin/eslint . --ext .js,.jsx --fix",
        "test": "echo \"Error: oops, the actor has no tests yet, sad!\" && exit 1"
    },
    "author": "It's not you it's me",
    "license": "ISC"
}
Maintained by Community

Actor Metrics

  • 5 monthly users

  • 2 stars

  • 99% runs succeeded

  • Created in Jun 2022

  • Modified 8 months ago