Results Checker

lukaskrivka/results-checker

Check the results of your scrapers with this flexible checker. Just supply a dataset or key-value store ID and a few simple rules to get a detailed report.

Overview
How it works
Usage
Input
Functional checker
JSON Schema checker
Reports
Epilogue

Overview

Results Checker is an Apify actor that helps you find inconsistencies in your scraper output so that you can track down and fix bugs.

It loads data from an Apify dataset, a key-value store, or arbitrary raw JSON and runs a check on each item.
The check takes from a few seconds up to a few minutes for larger datasets.
It produces a report so you know exactly how many problems there are and which items contain them.
You can trigger this actor from a webhook and chain another actor after it, for example to send an email or add the report to Google Sheets. Check Apify Store for more.

How it works

Data is loaded into memory in batches (a key-value store record or raw data is loaded all at once).
Each item in the batch is scanned.
Each field is checked with a predicate. Extra fields are considered bad (the whole item is marked bad).
Each field is also checked for a truthy value for a separate totalFieldCounts report.
A report is created from the whole batch.
Between batches, the state of the actor is saved so it doesn't have to repeat work after a restart (migration).
At the end, the reports from all batches are merged and saved as OUTPUT to the default key-value store.
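The steps above can be sketched roughly as follows. This is only an illustration and not the actor's source code: checkBatch, mergeReports and isSeen are hypothetical names, and the sketch assumes plain function predicates.

```javascript
// Hypothetical sketch of the batch flow described above (not the actor's code).

// A field counts as "seen" unless it is null, undefined or an empty string (0 is valid).
const isSeen = (value) => value !== null && value !== undefined && value !== '';

const bump = (counts, key) => { counts[key] = (counts[key] || 0) + 1; };

function checkBatch(items, checker, context = {}) {
    const predicates = checker();
    const report = { totalItemCount: 0, badItemCount: 0, badFields: {}, extraFields: {}, totalFieldCounts: {} };
    for (const item of items) {
        let isBad = false;
        // Every declared field is checked with its predicate, even when it is absent.
        for (const [field, predicate] of Object.entries(predicates)) {
            if (!predicate(item[field], item, context)) {
                bump(report.badFields, field);
                isBad = true;
            }
        }
        for (const [field, value] of Object.entries(item)) {
            // A field without a predicate is extra and marks the whole item bad.
            if (!Object.prototype.hasOwnProperty.call(predicates, field)) {
                bump(report.extraFields, field);
                isBad = true;
            }
            if (isSeen(value)) bump(report.totalFieldCounts, field);
        }
        report.totalItemCount += 1;
        if (isBad) report.badItemCount += 1;
    }
    return report;
}

// Sums the counters of two objects key by key.
const sumCounts = (a, b) => {
    const out = { ...a };
    for (const [key, value] of Object.entries(b)) out[key] = (out[key] || 0) + value;
    return out;
};

// Merges two batch reports into one.
function mergeReports(a, b) {
    return {
        totalItemCount: a.totalItemCount + b.totalItemCount,
        badItemCount: a.badItemCount + b.badItemCount,
        badFields: sumCounts(a.badFields, b.badFields),
        extraFields: sumCounts(a.extraFields, b.extraFields),
        totalFieldCounts: sumCounts(a.totalFieldCounts, b.totalFieldCounts),
    };
}
```

Keeping each batch report as plain counters is what makes the final merge a simple key-by-key sum.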

Usage

For smaller datasets you can use 128 MB of memory, but if the run fails with a 137 error code (out of memory), you will need to increase it. Adding more memory also increases speed. The maximum effective memory is usually about 4 GB since the checker can use just one CPU core.
If the report would be too big to be saved or opened, split the work into a few smaller runs of this actor using the limit and offset parameters.
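As a small illustration of splitting a large dataset into several runs, the sketch below computes the limit and offset pair for each run. The function name splitIntoRuns and the chunk size are assumptions made for this example; how you start the runs is up to you.

```javascript
// Hypothetical helper: computes { offset, limit } windows that cover totalItems.
function splitIntoRuns(totalItems, itemsPerRun) {
    const runs = [];
    for (let offset = 0; offset < totalItems; offset += itemsPerRun) {
        // The last window may be shorter than itemsPerRun.
        runs.push({ offset, limit: Math.min(itemsPerRun, totalItems - offset) });
    }
    return runs;
}

console.log(splitIntoRuns(250000, 100000));
```

Each resulting object can be merged into the actor input of one run.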

Compute units (CU) consumption examples (complex check & large items)

10,000 items - 0.005 CU (a few seconds)
100,000 items - 0.05 CU (about one minute; the computation is instant but loading the items takes time)
1,000,000 items - 2 CU (requires up to 16 GB of memory to hold the data, so it is better to split into smaller runs; this may get fixed in a future version)

Input

This actor expects a JSON object as input. You can also set it up in the visual UI on Apify. You can find examples in the Input and Example Run tabs of the actor page in Apify Store.

apifyStorageId <string> ID of the Apify storage where the data is located. Can be the ID of a dataset, a key-value store or a crawler execution. A key-value store also requires recordKey. You have to specify either this or rawData, but not both.
recordKey <string> Key of the record in the key-value store from which the data is loaded. Only allowed when apifyStorageId points to a key-value store.
rawData <array> Array of objects to be checked. You have to specify either this or apifyStorageId, but not both.
functionalChecker <stringified function> Stringified JavaScript function that returns an object with item fields as keys and predicates (functions that return true/false) as values. See the Functional checker section. Required.
context <object> Custom object where you can put any value; it is accessible in the functional checker predicates as the third parameter. Useful for dynamic values coming from other actors.
identificationFields <array> Array of fields (strings) that will be shown for the bad items in the OUTPUT report. Useful for identification (usually URL, itemId, color etc.).
minimalSuccessRate <object> Minimal success rate (0 to 1) for any field. If a field's success rate is higher than this, the field is not counted as a bad field. This needs to be an object with fields as keys and success rates as values. Default: empty object (every field must have a success rate of 1, i.e. 100%).
limit <number> How many items will be checked. Default: all
offset <number> The item from which checking starts. Use with limit to check specific items. Default: 0
batchSize <number> Number of items loaded and processed in each batch. You only need to change this if you have really huge items. Default: 50000
maxBadItemsSaved <number> Sets how many bad items are saved to the report for each unique combination of bad fields. Keeping this small and running the actor again after fixing some problems is the best approach. It speeds up the actor and reduces its memory needs.
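As a minimal sketch of putting these parameters together, the snippet below builds an input object in JavaScript and prints it as JSON. The storage ID is a placeholder and all values are made up for illustration. Calling toString() on the function is just one convenient way to produce the stringified checker.

```javascript
// The checker is written as a normal function and stringified for the input.
const checker = () => ({
    url: (url) => typeof url === 'string' && url.startsWith('http'),
    price: (price) => typeof price === 'number',
});

const input = {
    apifyStorageId: 'YOUR_DATASET_ID', // placeholder, replace with a real ID
    functionalChecker: checker.toString(),
    context: { onlyShirts: true },
    identificationFields: ['url'],
    minimalSuccessRate: { price: 0.9 },
    limit: 1000,
    offset: 0,
    maxBadItemsSaved: 20, // example value only
};

// JSON you can paste into the actor input or send via the API.
console.log(JSON.stringify(input, null, 2));
```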

Reading from webhook

You can call this actor from an Apify webhook without changing the webhook payload. The actor will automatically extract the default dataset of the run that triggered the webhook.
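The idea can be sketched as below. This is not the actor's code: resolveStorageId is a hypothetical name, and the sketch assumes the unchanged webhook payload carries the finished run under resource with a defaultDatasetId property.

```javascript
// Hypothetical sketch: choose the storage ID from either direct input or a webhook payload.
// Assumes the default payload shape { resource: { defaultDatasetId, ... }, ... }.
function resolveStorageId(input) {
    if (input.apifyStorageId) return input.apifyStorageId;
    const { resource } = input;
    if (resource && typeof resource.defaultDatasetId === 'string') {
        return resource.defaultDatasetId;
    }
    return null; // nothing to check; the caller should report an input error
}

console.log(resolveStorageId({ resource: { defaultDatasetId: 'abc123' } }));
```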

Functional checker

A checker that uses functions allows us to write custom and flexible checks in plain JavaScript. Let's first look at some examples of the checker.

Very simple: This checker ensures that the url looks like a valid URL in most cases. It also allows an optional color field. All other extra fields will be marked bad.

() => ({
    url: (url) => typeof url === 'string' && url.startsWith('http') && url.length > 10,
    color: (field) => true // optional field
})

As you can see, the name of the parameter doesn't matter because it is just a regular JavaScript function. The object keys (url and color in this example) need to match the item fields exactly.

Medium complexity: Checks more fields.

() => ({
    url: (url) => typeof url === 'string' && url.startsWith('http') && url.length > 10,
    title: (title) => typeof title === 'string' && title.length >= 3,
    itemId: (itemId) => typeof itemId === 'string' && itemId.length >= 4,
    source: (source) => typeof source === 'string',
    status: (status) => status === 'NEW',
})

Complex

() => ({
    url: (url) => typeof url === 'string' && url.startsWith('http') && url.length > 10 && !url.includes('?'),
    original_url: (original_url, item) => typeof original_url === 'string' && original_url.startsWith('http') && original_url.length >= item.url.length,
    categories_json: (categories_json, item, context) => Array.isArray(categories_json) && (context && context.onlyShirts ? categories_json.length === 1 && categories_json[0] === 'shirts' : true),
    title: (title) => typeof title === 'string' && title.length >= 3,
    designer_name: (designer_name) => typeof designer_name === 'string' || designer_name === null,
    manufacturer: (manufacturer) => typeof manufacturer === 'string' && manufacturer.length > 0,
    itemId: (itemId) => typeof itemId === 'string' && itemId.length >= 4,
    sku: (sku) => typeof sku === 'string' && sku.length >= 4,
    price: (price) => typeof price === 'number',
    sale_price: (sale_price, item) => (typeof sale_price === 'number' || sale_price === null) && sale_price !== item.price,
    source: (source) => typeof source === 'string',
    currency: (currency) => typeof currency === 'string' && currency.length === 3,
    description: (description) => typeof description === 'string' && description.length >= 5,
    mapped_category: (mapped_category) => typeof mapped_category === 'string' && mapped_category !== 'other',
    composition: (composition) => Array.isArray(composition),
    long_description: (long_description) => typeof long_description === 'string' || long_description === null,
    images: (images) => Array.isArray(images) && typeof images[0] === 'string' && images[0].includes('http'),
    stock_total: (stock_total) => typeof stock_total === 'number',
    variants: (variants) => Array.isArray(variants), // This is not that important now to do deeper check
    color: () => true,
    otherColors: () => true,
    shipFrom: () => true,
})

Let's look at some advanced checks we did here:

The predicate (checking function) receives the whole item as its second parameter, so you always have a reference to all other fields. In this case, we first check that price is a number. Then sale_price can be either a number or null but cannot be equal to price, so it only shows up if there is a real discount; otherwise, it should stay null.
The predicate receives context as its third parameter, which is any object you passed via the actor input. In this case we may pass context.onlyShirts, which means the checker will verify that we got only the shirts category and nothing else. If context.onlyShirts is not passed, we just check that categories_json is a valid array.

price: (price) => typeof price === 'number',
sale_price: (sale_price, item) => (typeof sale_price === 'number' || sale_price === null) && sale_price !== item.price,
categories_json: (categories_json, item, context) => Array.isArray(categories_json) && (context && context.onlyShirts ? categories_json.length === 1 && categories_json[0] === 'shirts' : true),

If the predicate always returns true, the field can have any value, even undefined, so it can be absent and still pass.

Important: You should always define your predicates in a way that cannot crash. For example, (images) => images[0].includes('http') can crash in several ways. The correct definition is (images) => Array.isArray(images) && typeof images[0] === 'string' && images[0].includes('http'). An error occurring in a predicate will crash the whole actor because the check can no longer be trusted. If that happens, the problematic item is logged so you can correct the check.
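To make the difference concrete, the short sketch below runs the unsafe and the safe predicate from the paragraph above against a few sample values. The sample values and the names unsafeImages and safeImages are made up for this example.

```javascript
// The two predicates from the text above.
const unsafeImages = (images) => images[0].includes('http');
const safeImages = (images) => Array.isArray(images)
    && typeof images[0] === 'string'
    && images[0].includes('http');

// Illustrative values a scraper might produce for the images field.
const samples = [['https://example.com/a.jpg'], [], undefined, [42]];

for (const sample of samples) {
    let unsafeResult;
    try {
        unsafeResult = unsafeImages(sample);
    } catch (err) {
        // The unsafe predicate throws instead of returning false.
        unsafeResult = `throws ${err.name}`;
    }
    console.log(JSON.stringify(sample), '| unsafe:', unsafeResult, '| safe:', safeImages(sample));
}
```

The safe version relies on short-circuit evaluation: each && guard returns false before the next expression can throw.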

JSON Schema Checker

To be added in the next version

Reports

At the end of the actor run, the report is saved to the default key-value store as OUTPUT.

It contains:

totalItemCount, badItemCount, identificationFields
badFields Object that shows how many times each field was bad. This way you instantly see your problematic spots.
extraFields Object that shows how many times each extra field was encountered.
totalFieldCounts Object that shows how many times each field was seen in the dataset. A field is considered seen if its value is not null or '' (empty string). This is like a JS truthy value but considers 0 valid.
badItems Link to another record with an array of all bad items. The data of each item is displayed either whole or just as the identificationFields plus the bad fields to keep the report short. For each bad item, you will also see exactly which badFields failed (they didn't match the predicate or were extra) and the itemIndex to locate the item in the dataset.

{
  "totalItemCount": 41117,
  "badItemCount": 63,
  "identificationFields": ["url"],
  "badFields": {
    "sku": 63,
    "price": 63,
    "status": 63,
    "images": 63,
    "title": 2,
    "itemId": 2
  },
  "extraFields": {
    "garbage": 1
  },
  "totalFieldCounts": {
    "url": 41117,
    "title": 41115,
    "garbage": 1
  },
  "badItems": "https://api.apify.com/v2/key-value-stores/defaultKeyValueStoreId/records/BAD-ITEMS?disableRedirect=true"
}

Detailed bad items report from the previous link

[
    {
        "data": {
            "url": "https://en-ae.namshi.com/buy-trendyol-puff-sleeve-sheer-detail-dress-cd1088znaa8k.html",
            "garbage": "sfsdfd"
        },
        "badFields": [
            "sku",
            "price",
            "status",
            "images"
        ],
        "extraFields": [
            "garbage"
        ],
        "itemIndex": 4
    },
    ... // other items here
]
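Once you have the OUTPUT report, a few lines of code are enough to rank the problematic fields. The sketch below is only an illustration: summarizeBadFields is a hypothetical helper, and it assumes badFields holds plain numbers as in the report above.

```javascript
// Hypothetical helper: ranks fields by how often they failed.
function summarizeBadFields(report) {
    return Object.entries(report.badFields)
        .map(([field, badCount]) => ({
            field,
            badCount,
            // Share of all checked items in which this field was bad.
            badShare: badCount / report.totalItemCount,
        }))
        .sort((a, b) => b.badCount - a.badCount);
}

// Numbers copied from the example report above.
const report = {
    totalItemCount: 41117,
    badFields: { sku: 63, price: 63, status: 63, images: 63, title: 2, itemId: 2 },
};

for (const { field, badCount, badShare } of summarizeBadFields(report)) {
    console.log(`${field}: ${badCount} bad (${(badShare * 100).toFixed(3)} %)`);
}
```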

Minimal success rate

Sometimes you know that some items will always fail the check due to external factors (like the website being broken). This actor lets you define a minimalSuccessRate for a field. If the share of items that pass the field's check is higher than minimalSuccessRate, the field will not be present in the badFields or badItems reports.

There are two ways to set minimalSuccessRate:

As input parameter

Provide an object in the input that lists the fields allowed to fail for some percentage of items. All values are between 0 and 1.

"minimalSuccessRate": {
    "url": 0.99,
    "price": 0.9,
    "composition": 0.5
}
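The rule can be sketched as a small filter over the bad field counts. This is an illustration under assumptions, not the actor's implementation: applyMinimalSuccessRate is a hypothetical name, and the comparison is strict (higher than) because that is how the rule is worded here.

```javascript
// Hypothetical sketch: drops fields whose success rate is higher than the configured minimum.
function applyMinimalSuccessRate(badFields, totalItemCount, minimalSuccessRate = {}) {
    const reported = {};
    for (const [field, badCount] of Object.entries(badFields)) {
        const successRate = (totalItemCount - badCount) / totalItemCount;
        // Fields without a configured rate must pass for every item (rate 1).
        const required = Object.prototype.hasOwnProperty.call(minimalSuccessRate, field)
            ? minimalSuccessRate[field]
            : 1;
        // Assumption: strictly "higher than" the minimum means the field is tolerated.
        if (badCount > 0 && !(successRate > required)) reported[field] = badCount;
    }
    return reported;
}

const badFields = { url: 5, price: 200, composition: 400, title: 2 };
const rates = { url: 0.99, price: 0.9, composition: 0.5 };
console.log(applyMinimalSuccessRate(badFields, 1000, rates));
```

With 1,000 items, url (99.5 % success) and composition (60 %) clear their minimums, while price (80 %) and title (no minimum set) stay in the report.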

Inside functional checker

You can also define it directly in your checker, which gives you even more flexibility. In that case, you have to change the field's checker from a function to an object that holds the check function.

id: {
    minimalSuccessRate: 0.5,
    check: (field) => /^\d+$/.test(field)
}

You can also have several checks for each field. This is an example of one field that has two checks: a stricter one and a general one that should always pass (if you don't provide minimalSuccessRate, the check has to pass for every item).

id: {
    any: {
        check: (field) => !!field
    },
    exact: {
        minimalSuccessRate: 0.5,
        check: (field) => /^\d+$/.test(field)
    }
}

The keys any and exact are completely up to you and are used only for identification. The badFields report then has the same structure as your checker (instead of just a plain number).

"badFields": {
    "id": {
        "any": 43,
        "exact": 100
    },
    "price": 63,
    "status": 63,
    "images": 63,
    "title": 2,
    "itemId": 2
}
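One way to picture how the three checker shapes (plain function, object with check, object of named checks) lead to this report structure is the sketch below. It is a hypothetical illustration, not the actor's code: normalize and countBad are made-up names and the sample items are invented.

```javascript
// Hypothetical sketch: brings every checker shape into one form and counts failures.
function normalize(definition) {
    // Plain predicate function.
    if (typeof definition === 'function') return { named: false, checks: { value: definition } };
    // Object with a single check (and optionally minimalSuccessRate).
    if (typeof definition.check === 'function') return { named: false, checks: { value: definition.check } };
    // Object of named checks, e.g. { any: {...}, exact: {...} }.
    const checks = {};
    for (const [name, def] of Object.entries(definition)) checks[name] = def.check;
    return { named: true, checks };
}

function countBad(items, field, definition) {
    const { named, checks } = normalize(definition);
    const counts = {};
    for (const [name, check] of Object.entries(checks)) {
        counts[name] = items.filter((item) => !check(item[field], item)).length;
    }
    // Named checks keep their structure; single checks collapse to a number.
    return named ? counts : counts.value;
}

const items = [{ id: '123' }, { id: 'abc' }, { id: '' }, {}];
const idChecker = {
    any: { check: (field) => !!field },
    exact: { minimalSuccessRate: 0.5, check: (field) => /^\d+$/.test(field) },
};
console.log(countBad(items, 'id', idChecker));
```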

Epilogue

If you find a problem or would like a new feature, please create an issue on the GitHub page.

Thanks to everybody for using it and for the feedback!

Developer

Lukáš Křivka

Actor metrics

1 monthly user
8 stars
Created in May 2019
Modified over 3 years ago

