Results Checker

lukaskrivka/results-checker

Check the results of your scrapers with this flexible checker. Just supply a dataset or key-value store ID and a few simple rules to get a detailed report.

Overview

Results Checker is an Apify actor that helps you find inconsistencies in your scraper output so you can track down and fix bugs.

  • Loads data from an Apify dataset, a key-value store, or arbitrary JSON passed directly, and runs a check on each item.
  • The check takes from a few seconds up to a few minutes for larger datasets.
  • Produces a report so you know exactly how many problems there are and which items contain them.
  • It is very useful to trigger this actor with a webhook, and you can easily chain another actor after it, for example to send an email or add the report to your Google Sheets. Check Apify Store for more.

How it works

  • Loads data in batches into memory (key-value store records and raw data are loaded all at once).
  • Each item in the batch is scanned.
  • Each field is checked with a predicate. Extra fields are considered bad (the whole item is marked bad).
  • A report is created from the whole batch.
  • Between batches, the state of the actor is saved so it doesn't have to repeat work after a restart (migration).
  • In the end, the reports from all batches are merged and saved as OUTPUT to the default key-value store.
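
The steps above can be sketched roughly as follows. This is a minimal illustration and not the actor's source code: `checkBatch`, `mergeReports` and `checkInBatches` are hypothetical names, `findBadFields` is an assumed callback that returns the bad field names of one item, and saving state between batches is only marked with a comment.

```javascript
// Hypothetical sketch: names and report shape are illustrative, not the actor's source.
// `findBadFields(item)` is assumed to return an array of bad field names.
function checkBatch(batch, offset, findBadFields) {
    const report = { totalItemCount: batch.length, badItemCount: 0, badFields: {}, badItems: [] };
    batch.forEach((item, i) => {
        const badFields = findBadFields(item);
        if (badFields.length === 0) return;
        report.badItemCount += 1;
        for (const field of badFields) {
            report.badFields[field] = (report.badFields[field] || 0) + 1;
        }
        // Assumption: the index is the zero-based position in the whole input.
        report.badItems.push({ badFields, DEBUG_itemIndex: offset + i });
    });
    return report;
}

// Adds the counters of two reports and concatenates their bad items.
function mergeReports(a, b) {
    const badFields = { ...a.badFields };
    for (const [field, count] of Object.entries(b.badFields)) {
        badFields[field] = (badFields[field] || 0) + count;
    }
    return {
        totalItemCount: a.totalItemCount + b.totalItemCount,
        badItemCount: a.badItemCount + b.badItemCount,
        badFields,
        badItems: [...a.badItems, ...b.badItems],
    };
}

function checkInBatches(items, batchSize, findBadFields) {
    let merged = { totalItemCount: 0, badItemCount: 0, badFields: {}, badItems: [] };
    for (let offset = 0; offset < items.length; offset += batchSize) {
        const batch = items.slice(offset, offset + batchSize);
        merged = mergeReports(merged, checkBatch(batch, offset, findBadFields));
        // The actor saves its state at this point so a restart can resume here.
    }
    return merged;
}
```

Keeping the per-batch report separate from the merge step means only one batch of items needs to be in memory at a time, while the merged report grows only with the number of bad items.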

Usage

  • For smaller datasets you can use 128 MB of memory, but if the run fails with a 137 error code (out of memory), you will need to increase it. Add more memory for increased speed. The maximum effective memory is usually about 4 GB since the checker can use just one CPU core.
  • If the report would be too big to be saved or opened, just run a few smaller runs of this actor using limit and offset parameters.
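
One way to plan such smaller runs is to generate one limit/offset pair per run. The sketch below assumes the total item count is known in advance; `splitIntoRuns` is a hypothetical helper, not part of the actor.

```javascript
// Hypothetical helper: builds { offset, limit } pairs that together cover all items.
function splitIntoRuns(totalItems, itemsPerRun) {
    const runs = [];
    for (let offset = 0; offset < totalItems; offset += itemsPerRun) {
        // The last run may be shorter than the others.
        runs.push({ offset, limit: Math.min(itemsPerRun, totalItems - offset) });
    }
    return runs;
}

const runs = splitIntoRuns(250000, 100000);
// Each pair would be merged into the input of one actor run.
console.log(runs);
```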

Compute units (CU) consumption examples

  • 10,000 items (complex check) - 0.005 CU (a few seconds)
  • 100,000 items (complex check) - 0.05 CU (one minute; the computation is instant, but loading the items takes time)

Input

This actor expects a JSON object as an input. You can also set it up in a visual UI on Apify. You can find examples in the Input and Example Run tabs of the actor page in Apify Store.

  • apifyStorageId <string> Apify ID of the storage where the data is located. Can be the ID of a dataset, a key-value store, or a crawler execution. A key-value store also requires recordKey. You have to specify either this or rawData, but not both.
  • recordKey <string> Key of the record in the key-value store from which the data is loaded. Only allowed when apifyStorageId points to a key-value store.
  • rawData <array> Array of objects to be checked. You have to specify either this or apifyStorageId, but not both.
  • functionalChecker <stringified function> Stringified JavaScript function that returns an object with item fields as keys and predicates (functions that return true/false) as values. See the Functional checker section. Required.
  • identificationFields <array> Array of fields (strings) that will be shown for the bad items in the OUTPUT report. Useful for identification (usually URL, itemId, color etc.).
  • limit <number> How many items will be checked. Default: all.
  • offset <number> From which item the checking will start. Use with limit to check specific items. Default: 0.
  • batchSize <number> Number of items loaded and processed in each batch. You only need to change this if you have really huge items. Default: 50000.
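
For illustration, an input object could be assembled like this. The storage ID is a placeholder and the fields in the checker are examples only; the sketch assumes that functionalChecker can be produced by stringifying an ordinary function with toString().

```javascript
// The checker is written as a normal function and stringified for the input.
const checker = () => ({
    url: (url) => typeof url === 'string' && url.startsWith('http'),
    title: (title) => typeof title === 'string' && title.length >= 3,
});

const input = {
    apifyStorageId: 'YOUR_DATASET_ID', // placeholder, not a real ID
    functionalChecker: checker.toString(),
    identificationFields: ['url'],
    limit: 1000,
    offset: 0,
};

// Print the JSON that would be passed to the actor.
console.log(JSON.stringify(input, null, 2));
```

Because the input uses apifyStorageId, rawData is left out; the two are mutually exclusive.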

Functional checker

A checker that uses functions lets you write custom and flexible checks in plain JavaScript. Let's first look at some examples of the checker.

Very simple: This checker ensures the url is in the correct format most of the time. It also allows an optional color field. All other extra fields will be marked bad.

() => ({
    url: (url) => typeof url === 'string' && url.startsWith('http') && url.length > 10,
    color: (field) => true // optional field
})

As you can see, the name of the parameter doesn't matter because the predicate is just a regular JavaScript function. The object keys (url and color in this example) must match the item's field names exactly.
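
To make the matching rules concrete, here is a rough sketch of how such a checker could be applied to a single item. `findBadFields` is a hypothetical helper written for this example, not the actor's code. It assumes a field is bad when its predicate returns a falsy value or when the item has a key without a predicate.

```javascript
const checker = () => ({
    url: (url) => typeof url === 'string' && url.startsWith('http') && url.length > 10,
    color: (field) => true, // optional field
});

// Hypothetical helper illustrating the rules described above.
function findBadFields(predicates, item) {
    const hasPredicate = (key) => Object.prototype.hasOwnProperty.call(predicates, key);
    // Fields whose predicate rejects the value (an absent field is passed as undefined).
    const failed = Object.keys(predicates).filter((key) => !predicates[key](item[key], item));
    // Fields present on the item that the checker does not know about.
    const extra = Object.keys(item).filter((key) => !hasPredicate(key));
    return [...failed, ...extra];
}

const predicates = checker();
console.log(findBadFields(predicates, { url: 'https://example.com/item/1' })); // []
console.log(findBadFields(predicates, { url: 'n/a', colour: 'red' })); // [ 'url', 'colour' ]
```

The second item fails twice: its url does not pass the predicate, and the misspelled colour key counts as an extra field.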

Medium complexity: Checks more fields.

() => ({
    url: (url) => typeof url === 'string' && url.startsWith('http') && url.length > 10,
    title: (title) => typeof title === 'string' && title.length >= 3,
    itemId: (itemId) => typeof itemId === 'string' && itemId.length >= 4,
    source: (source) => typeof source === 'string',
    status: (status) => status === 'NEW',
})

Complex

() => ({
    url: (url) => typeof url === 'string' && url.startsWith('http') && url.length > 10 && !url.includes('?'),
    original_url: (original_url, item) => typeof original_url === 'string' && original_url.startsWith('http') && original_url.length >= item.url.length,
    categories_json: (categories_json) => Array.isArray(categories_json),
    title: (title) => typeof title === 'string' && title.length >= 3,
    designer_name: (designer_name) => typeof designer_name === 'string' || designer_name === null,
    manufacturer: (manufacturer) => typeof manufacturer === 'string' && manufacturer.length > 0,
    itemId: (itemId) => typeof itemId === 'string' && itemId.length >= 4,
    sku: (sku) => typeof sku === 'string' && sku.length >= 4,
    price: (price) => typeof price === 'number',
    sale_price: (sale_price, item) => (typeof sale_price === 'number' || sale_price === null) && sale_price !== item.price,
    source: (source) => typeof source === 'string',
    currency: (currency) => typeof currency === 'string' && currency.length === 3,
    description: (description) => typeof description === 'string' && description.length >= 5,
    mapped_category: (mapped_category) => typeof mapped_category === 'string' && mapped_category !== 'other',
    composition: (composition) => Array.isArray(composition),
    long_description: (long_description) => typeof long_description === 'string' || long_description === null,
    images: (images) => Array.isArray(images) && images.length > 0 && typeof images[0] === 'string' && images[0].includes('http'),
    stock_total: (stock_total) => typeof stock_total === 'number',
    variants: (variants) => Array.isArray(variants), // This is not that important now to do deeper check
    color: () => true,
    otherColors: () => true,
    shipFrom: () => true,
})

Let's look at some advanced checks we did here:

  • You can pass a second parameter, item, to the predicate (checking function) so that you always have a reference to all other fields. In this case, we first check that price is a number. Then sale_price can be either a number or null, but it cannot equal price, so it only shows up if there is a real discount; otherwise, it should stay null.
    price: (price) => typeof price === 'number',
    sale_price: (sale_price, item) => (typeof sale_price === 'number' || sale_price === null) && sale_price !== item.price,
  • If the predicate always returns true, the field can have any value, even undefined, so it can be absent and still pass.
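
A quick way to see both behaviours is to call the predicates directly. The item values below are made up for this example.

```javascript
const predicates = {
    price: (price) => typeof price === 'number',
    sale_price: (sale_price, item) => (typeof sale_price === 'number' || sale_price === null) && sale_price !== item.price,
    color: () => true,
};

const discounted = { price: 100, sale_price: 80 };
const notDiscounted = { price: 100, sale_price: null };
const fakeDiscount = { price: 100, sale_price: 100 };

// The second argument gives the predicate access to the whole item.
console.log(predicates.sale_price(discounted.sale_price, discounted)); // true
console.log(predicates.sale_price(notDiscounted.sale_price, notDiscounted)); // true
console.log(predicates.sale_price(fakeDiscount.sale_price, fakeDiscount)); // false

// color is absent on the item, so the predicate receives undefined and still passes.
console.log(predicates.color(discounted.color, discounted)); // true
```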

JSON Schema Checker

To be added in the next version

Reports

At the end of the actor run, the report is saved to the default key-value store as OUTPUT.

It contains:

  • totalItemCount and badItemCount
  • badFields array that shows how many times each field was bad. This way you instantly see your problematic spots.
  • badItems array of all bad items. They are displayed whole or just with identificationFields to shorten the length. Also for each bad item, you will see exactly the bad fields (that didn't match the predicate or were extra) and DEBUG_itemIndex to locate your item in the dataset.
{
  "totalItemCount": 41117,
  "badItemCount": 63,
  "badFields": {
    "sku": 63,
    "price": 63,
    "status": 63,
    "images": 63,
    "title": 2,
    "itemId": 2
  },
  "badItems": [
    {
      "url": "https://en-ae.namshi.com/buy-trendyol-puff-sleeve-sheer-detail-dress-cd1088znaa8k.html",
      "badFields": [
        "sku",
        "price",
        "status",
        "images"
      ],
      "DEBUG_itemIndex": 4
    },
    ... // other items here
  ]
}
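
Once the report is loaded as a plain object (how you fetch OUTPUT is left out here), the most problematic fields can be ranked with a few lines. `rankBadFields` is a hypothetical helper, and the sample object is a shortened stand-in for a real report.

```javascript
// Hypothetical helper: returns [field, count] pairs, most frequent first.
function rankBadFields(report) {
    return Object.entries(report.badFields).sort(([, a], [, b]) => b - a);
}

// Shortened sample; a real report would also list the bad items.
const report = {
    totalItemCount: 41117,
    badItemCount: 63,
    badFields: { title: 2, sku: 63, price: 63, itemId: 2 },
    badItems: [],
};

for (const [field, count] of rankBadFields(report)) {
    // Share of all checked items in which this field was bad.
    const share = ((count / report.totalItemCount) * 100).toFixed(2);
    console.log(`${field}: ${count} bad (${share}% of items)`);
}
```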

Epilogue

If you find any problem or would like to add a new feature, please create an issue on the GitHub page.

Thanks to everybody for using it and for the feedback!
