By leveraging Jasmine, its extensible expect, and the Apify SDK, you can test tasks and actors and check their output for consistency and/or duplicates.

It goes well with the monitoring suite for your production runs, but for best results this actor should be run on a schedule.

Features

- You can run many tests in parallel or in series (as your account memory allows).
- You can run tests locally while still accessing platform storage and actors.
- Abstracts access to two other public actors:
  - Results checker
  - Duplications checker

Testing

The testing interface will be familiar to anyone who has written Jasmine BDD tests, but it adds Apify-specific async matchers:

({
    it,
    run,
    expectAsync,
    input, // Object containing the current input, you can access customData here
    describe, // describe subsections
    expect, // default Jasmine expect
    _, // lodash as a helper to traverse array items and objects
    moment, // Moment.JS to help with dates and time math
    Apify, // Apify SDK v2
    apifyClient // Apify client v2
}) => {

  // describe is not needed, but it's good to keep everything tidy
  describe('sub', () => {

    it('should have preconfigured task working', async () => {
        const myTaskResult = await run({
            // actorId: 'actor/from-store', // can use an actorId directly
            taskId: 'myuser/my-task-name',
            input: {
                some: 'extra input' // optional overrides
            },
            options: {
                timeout: 15000 // optional call options
            },
            name: 'should have preconfigured task working'
        });

        // sync assertions are not very useful on their own; most expectations should live inside async assertions
        expect(myTaskResult.runId).not.toBeEmptyString();

        /**
         * Async assertions call resources on the platform
         */

        // reads the OUTPUT key
        await expectAsync(myTaskResult).withOutput(async ({ contentType, value }) => {
            expect(contentType)
                // withContext gives more information about what you're testing
                .withContext(myTaskResult.format('Body should be utf-8 JSON'))
                .toEqual('application/json; charset=utf-8');

            expect(value).toEqual({ hello: 'world' }, myTaskResult.format('Output body'));
        });

        // reads any key, fails the test if not found
        await expectAsync(myTaskResult).withKeyValueStore(async ({ key, contentType, value }) => {
            expect(value).toEqual({ status: true });
        }, { keyName: 'INPUT' });

        // gets requestQueue information
        await expectAsync(myTaskResult).withRequestQueue(async ({
            // contains everything from RequestQueueInfo
            id, userId, createdAt,
            modifiedAt, accessedAt, expireAt,
            totalRequestCount, handledRequestCount, pendingRequestCount,
            actId, actRunId, hadMultipleClients
        }) => {
            expect(totalRequestCount).toBeGreaterThan(1);
        });

        // check log for errors
        await expectAsync(myTaskResult).withLog((log) => {
            expect(log).not.toContain('ReferenceError');
            expect(log).not.toContain('TypeError');
            expect(log).not.toContain('The function passed to Apify.main() threw an exception');
        });

        // Check for dataset consistency
        await expectAsync(myTaskResult).withChecker(({ runResult, output }) => {
            expect(output.badItemCount).toBe(0);
        }, {
            functionalChecker: () => ({
                myField: (field) => typeof field === 'string'
            })
        });

        // Check for duplicate items
        await expectAsync(myTaskResult).withDuplicates(({ runResult, output }) => {
            expect(output).toEqual({});
        }, {
            taskId: 'myTaskId'
        });
    });

  });
}

Supports all extra Jasmine matchers, including asymmetrical matchers from https://github.com/JamieMason/Jasmine-Matchers. To access any of them without the JS editor complaining on the platform, you need to use global.any[asymmetricMatcher].

The special run parameter gives you the ability to run your tasks or actors, and returns an accessor for their resources:

const result = await run({
  taskId: 'xxx',    // task either by id or using user/task-name
  actorId: 'xxx',   // actor either by id or using user/actor-name
  input: {},        // custom input override
  options: {},      // specific memory, timeout options
  nonce: '1',       // additional nonce for tasks running with the same input and options
  name: 'run name', // give the run a name to be able to distinguish between them
});

The run is idempotent and will run the same task only once per test, but you can specify the nonce to force running it every time.
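The idempotency described above can be pictured as a cache keyed by the run parameters. The following is only a minimal sketch of that idea, not the actor's actual implementation: `createIdempotentRun`, the fake runner, and the choice to leave `name` out of the cache key are all assumptions made for illustration.

```javascript
// Hypothetical sketch, NOT the actor's real code: wrap any async "runner"
// so that calls with identical parameters reuse the same promise.
const createIdempotentRun = (runner) => {
    const cache = new Map();

    return (params) => {
        const { taskId, actorId, input = {}, options = {}, nonce = '' } = params;
        // Assumption: `name` does not take part in the key, only the values
        // that influence the run itself. Key order inside `input` matters here.
        const key = JSON.stringify([taskId, actorId, input, options, nonce]);

        if (!cache.has(key)) {
            cache.set(key, Promise.resolve(runner(params)));
        }
        return cache.get(key);
    };
};

// Usage with a fake runner that only counts how often it is invoked
let calls = 0;
const run = createIdempotentRun(async () => ({ runId: `fake-${++calls}` }));

run({ taskId: 'myuser/my-task-name', input: { some: 'extra input' } });
run({ taskId: 'myuser/my-task-name', input: { some: 'extra input' } }); // reused
run({ taskId: 'myuser/my-task-name', input: { some: 'extra input' }, nonce: '1' }); // new run

console.log(calls); // 2
```

Changing only the nonce produces a new cache key, which is why it forces a fresh run even when the input and options are unchanged.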

The run function returns an object containing the standard API client run info plus some extra data:

runResult = {
    runInput, // Actual input of the run with default fields filled
    maxResults, // Attempts at parsing maxResults or similar field from input (use runInput to do this yourself)
    data: {
        ...runInfo,
        taskId,
        actorName,
        taskName,
        name: run.name,
    }
}

Matchers

These async matchers are lazy and only evaluated when you use them. Pass the result of the run function to expectAsync() to use them. They abstract many common platform API calls. Callbacks can be plain or async functions; they are awaited either way.

You also have full access to the Apify variable inside your tests.

toHaveStatus(status: 'SUCCEEDED' | 'FAILED' | 'ABORTED' | 'TIMED-OUT')

Checks for the proper run status

withLog((logContent: string) => void)

Run expectations on the logContent
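When several log markers need to be checked, it can help to collect the matches first and assert on the result once. The helper below is a hypothetical sketch (`findLogErrors` and `ERROR_MARKERS` are not part of the actor); the marker strings are the ones used in the example test above.

```javascript
// Hypothetical helper, not provided by the actor.
// Returns every marker that appears somewhere in the log text.
const ERROR_MARKERS = [
    'ReferenceError',
    'TypeError',
    'The function passed to Apify.main() threw an exception',
];

const findLogErrors = (logContent, markers = ERROR_MARKERS) => markers
    .filter((marker) => logContent.includes(marker));

// Made-up log content for illustration
const sampleLog = 'INFO Starting the crawl\nERROR TypeError: x is not a function';

console.log(findLogErrors(sampleLog)); // [ 'TypeError' ]
```

Inside a withLog callback this could be used as `expect(findLogErrors(log)).toEqual([])`, so a failure message lists every marker found at once.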

withDuplicates((result: { runResult: Object, output: Object }) => void, input?: Object)

Ensures that no duplicates are found. You can provide a taskId of a pre-configured task, or you can provide all the input manually according to the Duplications checker docs. By default, anything above 2 counted items is considered a duplicate.

Returns the OUTPUT of the run, containing an object like this:

{
  // the keys here mean all the values that were found on the target dataset
  "$$": {
    "count": 4,
    "originalIndexes": [
      0,
      12,
      13,
      15
    ],
    "outputIndexes": [
      9,
      10,
      11,
      13
    ]
  },
  "MISSING!": { // this means it's missing or null value
    "count": 8,
    "originalIndexes": [
      1,
      3,
      4,
      6,
      10,
      14,
      16,
      17
    ],
    "outputIndexes": [
      0,
      1,
      2,
      5,
      8,
      12,
      14,
      15
    ]
  },
  "$$$": {
    "count": 4,
    "originalIndexes": [
      2,
      5,
      7,
      8
    ],
    "outputIndexes": [
      3,
      4,
      6,
      7
    ]
  }
}
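To turn an object like the one above into a readable failure message, the entries can be reduced to a sorted list. This is a hypothetical helper (`listDuplicates` and its `minCount` parameter are assumptions, not part of the actor or of the Duplications checker), shown with a trimmed copy of the sample output.

```javascript
// Hypothetical helper, not part of the actor: list the duplicated values,
// most frequent first. `minCount` is an assumed, caller-chosen threshold.
const listDuplicates = (output, minCount = 2) => Object.entries(output)
    .filter(([, entry]) => entry.count > minCount)
    .map(([value, { count, originalIndexes }]) => ({
        value,
        count,
        firstIndex: originalIndexes[0], // first position in the original dataset
    }))
    .sort((a, b) => b.count - a.count);

// Trimmed copy of the sample output shown above
const duplicatesOutput = {
    '$$': { count: 4, originalIndexes: [0, 12, 13, 15], outputIndexes: [9, 10, 11, 13] },
    'MISSING!': {
        count: 8,
        originalIndexes: [1, 3, 4, 6, 10, 14, 16, 17],
        outputIndexes: [0, 1, 2, 5, 8, 12, 14, 15],
    },
};

console.log(listDuplicates(duplicatesOutput).map((d) => d.value)); // [ 'MISSING!', '$$' ]
```

An empty output object yields an empty list, which matches the `expect(output).toEqual({})` assertion used in the example test.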

withChecker((result: { runResult: Object, output: Object }) => void, input: Object, options?: Object)

Input is required: you need at least a taskId parameter pointing to a pre-configured results-checker task, or you can pass everything in the input. See the Results checker docs for the available fields.

Options are the Apify.call/callTask options. Returns the OUTPUT of the run, containing an object like this:

{
  "totalItemCount": 17,
  "badItemCount": 0,
  "identificationFields": [],
  "badFields": {},
  "extraFields": {},
  "totalFieldCounts": {
    "categories": 17,
    "info": 17,
    "likes": 17,
    "messenger": 17,
    "posts": 17,
    "priceRange": 10,
    "title": 17,
    "pageUrl": 17,
    "address": 17,
    "awards": 17,
    "email": 15,
    "impressum": 17,
    "instagram": 2,
    "phone": 15,
    "products": 17,
    "transit": 4,
    "twitter": 1,
    "website": 16,
    "youtube": 0,
    "mission": 17,
    "overview": 17,
    "payment": 2,
    "checkins": 12,
    "#startedAt": 17,
    "verified": 0,
    "#url": 17,
    "#ref": 17,
    "reviews": 14,
    "#version": 17,
    "#finishedAt": 17
  },
  "badItems": "https://api.apify.com/v2/key-value-stores/_/records/BAD-ITEMS?disableRedirect=true"
}
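Besides badItemCount, the totalFieldCounts map can be used to spot fields that are filled in only rarely. The sketch below is a hypothetical helper (`fieldCoverage` and `sparseFields` are not part of the actor or of the Results checker) and works on a trimmed copy of the sample output.

```javascript
// Hypothetical helpers, not part of the actor.
// fieldCoverage: share of items (0..1) in which each field was counted.
const fieldCoverage = ({ totalItemCount, totalFieldCounts }) => Object.fromEntries(
    Object.entries(totalFieldCounts).map(([field, count]) => [
        field,
        totalItemCount > 0 ? count / totalItemCount : 0,
    ]),
);

// sparseFields: names of fields whose coverage is below `minRatio`.
const sparseFields = (checkerOutput, minRatio) => Object.entries(fieldCoverage(checkerOutput))
    .filter(([, ratio]) => ratio < minRatio)
    .map(([field]) => field);

// Trimmed copy of the sample output shown above
const checkerOutput = {
    totalItemCount: 17,
    totalFieldCounts: { title: 17, email: 15, instagram: 2, youtube: 0 },
};

console.log(sparseFields(checkerOutput, 0.5)); // [ 'instagram', 'youtube' ]
```

Inside a withChecker callback, an assertion such as `expect(sparseFields(output, 0.5)).not.toContain('title')` would then guard the fields that must always be present.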

withDataset((result: { dataset: Object, info: Object }) => void, options?: Object)

Returns dataset information and the items. Options can be passed to limit the number of items returned, to use the unwind parameter, or to set any other option supported by Dataset getItems.

The dataset object contains:

{
    items: [ [Object] ],
    total: 1,
    offset: 0,
    count: 1,
    limit: 999999999999
}

The info object contains:

{
    id: '',
    userId: '',
    createdAt: 2020-12-05T18:44:45.041Z,
    modifiedAt: 2020-12-05T18:44:50.515Z,
    accessedAt: 2020-12-05T18:44:50.515Z,
    itemCount: 1,
    cleanItemCount: 1,
    actId: '',
    actRunId: '',
    stats: {
      uploadedBytes: 0,
      downloadedBytes: 0,
      deflatedBytes: 0,
      inflatedBytes: 21,
      s3PutCount: 0,
      s3GetCount: 0,
      s3DeleteCount: 0,
      readCount: 0,
      writeCount: 1
    }
}

N.B.: this method waits at least 12 seconds before reading, to make sure the remote storage is ready to be accessed after the task/actor started with run has finished.

withOutput((output: { value: any, contentType: string }) => void)

Returns the OUTPUT key of the run. The value can have any content type, so check contentType.

withStatistics((stats: Object) => void, options?: { index: number = 0 })

Returns the SDK_CRAWLER_STATISTICS_0 key of the run by default, unless provided with another index in the options.

Returns an object like this:

{
  "requestsFinished": 217,
  "requestsFailed": 99,
  "requestsRetries": 0,
  "requestsFailedPerMinute": 3,
  "requestsFinishedPerMinute": 8,
  "requestMinDurationMillis": 3071,
  "requestMaxDurationMillis": 41800,
  "requestTotalFailedDurationMillis": 686856,
  "requestTotalFinishedDurationMillis": 3161769,
  "crawlerStartedAt": "2020-12-07T05:06:44.107Z",
  "crawlerFinishedAt": null,
  "statsPersistedAt": "2020-12-07T05:34:04.209Z",
  "crawlerRuntimeMillis": 1640402,
  "crawlerLastStartTimestamp": 1607317603807,
  "requestRetryHistogram": [
    316
  ],
  "statsId": 0,
  "requestAvgFailedDurationMillis": 6938,
  "requestAvgFinishedDurationMillis": 14570,
  "requestTotalDurationMillis": 3848625,
  "requestsTotal": 316
}
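A common use of these statistics is asserting that the share of failed requests stays below some limit. The helper below is a hypothetical sketch (`failureRate` is not part of the actor) applied to the relevant fields of the sample above.

```javascript
// Hypothetical helper, not part of the actor:
// share of failed requests among all requests, as a number between 0 and 1.
const failureRate = ({ requestsFailed, requestsTotal }) => (
    requestsTotal > 0 ? requestsFailed / requestsTotal : 0
);

// Relevant fields copied from the sample statistics above
const crawlerStats = { requestsFinished: 217, requestsFailed: 99, requestsTotal: 316 };

console.log(failureRate(crawlerStats).toFixed(3)); // 0.313
```

Inside a withStatistics callback this could become `expect(failureRate(stats)).toBeLessThan(0.1)`; the 0.1 limit is only an example value.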

withKeyValueStore((output: { value: any, contentType: string }) => void, options: { keyName: string })

Returns the content of the selected keyName. The test fails if the key doesn't exist. You can access the INPUT that was used for the run using { keyName: 'INPUT' }

withRequestQueue((requestQueue: Object) => void)

Accesses the requestQueue object, which contains:

{
    id: '',
    userId: '',
    createdAt: 2020-12-05T18:44:45.048Z,
    modifiedAt: 2020-12-05T18:44:45.048Z,
    accessedAt: 2020-12-05T18:44:45.048Z,
    expireAt: 2021-02-03T18:44:45.048Z,
    totalRequestCount: 0,
    handledRequestCount: 0,
    pendingRequestCount: 0,
    actId: '',
    actRunId: '',
    hadMultipleClients: false
}

N.B.: all of these matchers exist only on expectAsync and need to be awaited, as demonstrated above:

await expectAsync(runResult).withDataset((something) => {
    expect(something).toEqual('here');
});

jasmine.any() and jasmine.anything() can be accessed using global.jasmine

Output

The test output is available in the key-value store under the OUTPUT key, with the following structure:

{
  "suite2": {
    "id": "suite2",
    "description": "one",
    "fullName": "Actor tests one",
    "failedExpectations": [],
    "deprecationWarnings": [],
    "duration": 26484,
    "properties": null,
    "status": "passed",
    "specs": [
      {
        "id": "spec0",
        "description": "should work",
        "fullName": "Actor tests one should work",
        "failedExpectations": [],
        "passedExpectations": [
          {
            "matcherName": "toHaveStatus",
            "message": "Passed.",
            "stack": "",
            "passed": true
          },
          {
            "matcherName": "toEqual",
            "message": "Passed.",
            "stack": "",
            "passed": true
          },
          {
            "matcherName": "withDataset",
            "message": "Passed.",
            "stack": "",
            "passed": true
          },
          {
            "matcherName": "withRequestQueue",
            "message": "Passed.",
            "stack": "",
            "passed": true
          },
          {
            "matcherName": "withOutput",
            "message": "Passed.",
            "stack": "",
            "passed": true
          },
          {
            "matcherName": "withKeyValueStore",
            "message": "Passed.",
            "stack": "",
            "passed": true
          },
          {
            "matcherName": "withChecker",
            "message": "Passed.",
            "stack": "",
            "passed": true
          }
        ],
        "deprecationWarnings": [],
        "pendingReason": "",
        "duration": 26480,
        "properties": null,
        "status": "passed"
      }
    ]
  },
  "suite3": {
    "id": "suite3",
    "description": "two",
    "fullName": "Actor tests two",
    "failedExpectations": [],
    "deprecationWarnings": [],
    "duration": 21,
    "properties": null,
    "status": "passed",
    "specs": [
      {
        "id": "spec1",
        "description": "works",
        "fullName": "Actor tests two works",
        "failedExpectations": [
          {
            "matcherName": "toBe",
            "message": "Expected true to be false.",
            "stack": "Error: Expected true to be false.\n    at <Jasmine>\n    at listOnTimeout (internal/timers.js:549:17)\n    at processTimers (internal/timers.js:492:7)",
            "passed": false,
            "expected": false,
            "actual": true
          }
        ],
        "passedExpectations": [],
        "deprecationWarnings": [],
        "pendingReason": "",
        "duration": 15,
        "properties": null,
        "status": "failed"
      }
    ]
  }
}
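When this report is consumed by another actor or a notification script, the failed specs usually need to be pulled out of the nested structure. The following is a hypothetical sketch (`collectFailedSpecs` is not part of the actor) run against a trimmed-down report in the shape shown above.

```javascript
// Hypothetical helper, not part of the actor: flatten the OUTPUT report
// into a list of failed specs together with their failure messages.
const collectFailedSpecs = (report) => Object.values(report)
    .flatMap((suite) => (suite.specs || [])
        .filter((spec) => spec.status === 'failed')
        .map((spec) => ({
            suiteId: suite.id,
            fullName: spec.fullName,
            messages: spec.failedExpectations.map((failure) => failure.message),
        })));

// Trimmed-down report in the shape shown above
const report = {
    suite2: {
        id: 'suite2',
        specs: [{ fullName: 'Actor tests one should work', status: 'passed', failedExpectations: [] }],
    },
    suite3: {
        id: 'suite3',
        specs: [{
            fullName: 'Actor tests two works',
            status: 'failed',
            failedExpectations: [{ message: 'Expected true to be false.' }],
        }],
    },
};

console.log(JSON.stringify(collectFailedSpecs(report)));
// [{"suiteId":"suite3","fullName":"Actor tests two works","messages":["Expected true to be false."]}]
```

The helper looks at the status of each spec rather than the suite, because in the sample above suite3 reports "passed" even though its only spec failed.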

Expected consumption

This is a very lightweight actor that only mediates other actor runs, so it can be run with the lowest amount of memory, which is 128MB. Running for an hour should consume around 0.125 CUs.
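The 0.125 figure follows from the usual definition of a compute unit as 1 GB of memory used for 1 hour, which is assumed here. A small sketch of that arithmetic (`computeUnits` is a hypothetical helper, not part of the actor):

```javascript
// Hypothetical helper. Assumes 1 CU = 1 GB (1024 MB) of memory for 1 hour.
const computeUnits = (memoryMbytes, hours) => (memoryMbytes / 1024) * hours;

console.log(computeUnits(128, 1)); // 0.125
console.log(computeUnits(128, 24)); // 3
```

This covers only the testing actor itself; the tasks and actors it starts consume their own compute units.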

Reasoning

Automated and integration tests are a must-have for any complex piece of software, and Apify actors are no different. An actor can map one input (or many inputs) to a single output, or it can produce many items (through the dataset).

License

Apache 2.0

