Pricing

Pay per usage

Go to Apify Store

Dataset Validity Checker

Try for free

Developed by

Matěj Sochor

Automatically checks, whether default datasets created by runs of an actor differ too much from the previously encountered ones, allowing it to warn you about web scraping problems caused by, e.g., a website layout changing, or other significant changes in the resulting data.

0.0 (0)

Pricing

Pay per usage

Last modified

3 years ago

Developer tools

Dataset Validity Checker (DVC)

How it works

DVC goes through default datasets of the actor specified in its input and calculates indicators, that allow it to recognize, when something is amiss with datasets from later runs. When it notices such a case, it will send you an email to address specified in the input, as well as write a note to its console output. All datasets that pass the check are then also added to the history DVC uses to determine validity of the later datasets.

You can use DVC for an entire actor, or only for a specific task, depending on whether you provide it with a task id or actor id in its input. Moreover, DVC can be used simultaneously (and independently) for any number of actors/tasks. You only need to run it with different task/actor ids.

Please note, that DVC doesn't check individual items in a dataset, but the dataset as a whole, so it will not catch a mistake influencing only a small portion of the items. Also, it only checks datasets from successful runs.

How to use it

Before DVC can work properly, it needs to have historical information about the runs of your actor/task. Therefore, the first run of DVC (and perhaps several others, if you don't have enough runs of the checked actor/task yet) will go through the existing runs and obtain the information it needs. Because of this, the first run might take a relatively long time. For ways to decrease it, see the next section. Please, make sure, that all the runs the first DVC run processes are valid, otherwise, accuracy of the check will be lower. For ways to exclude invalid runs you know about, see the next section.

After the first run, I recommend running DVC after each run of the checked actor/task completes (best achieved using a webhook). If you specify a warning email in DVCs input, it will send you an email for each dataset it considers invalid. If you don't specify it, you can check the console logs from the runs - they contain the same information.

Useful tips

1. Restricting the scope

This tip will be useful (especially for the first run of DVC), if at least one of the following applies to you:

a) You already have a large number of runs (e.g. hundreds) in your actor/task and the first run would therefore take too long.

b) You know there is a mistake somewhere in the previous runs of your actor/task and don't want DVC to consider it normal.

c) You know the website you are scraping changed over time and you don't want DVC to consider the older state normal.

If any of those apply to you, you can use the parameter 'Starting At' (or 'startingAt' in the JSON input) to control what will be the earliest run DVC will process. If you need even more control, you can use the parameter 'Until' (or 'until' in the JSON input) to define, what will be the latest run DVC will process.

2. Clearing the history

Previous tip was mainly concerned with the first run, but what if the website changes significantly without causing an error in your scraping? This could for example happen, when an e-shop decides to widen the selection of items it offers. To prevent false positives (valid datasets flagged as invalid by DVC) in this case, you can use the parameter 'Clear History' (or 'clearHistory' in the JSON input) for a single run to delete all information about what a dataset should look like gathered before that run, allowing DVC to start anew for the particular actor/task.

3. Adjusting strictness

As with any similar algorithm, there has to be a tradeoff between false positives and false negatives (invalid datasets flagged as valid by DVC). If the default setting doesn't work well enough for you, you can use parameters 'Average Multiplying Coefficient' and 'Maximal Multiplying Coefficient' (or 'averageMultiplyingCoefficient' and 'maximalMultiplyingCoefficient' in the JSON input) to adjust the tradeoff, or even change them both using the 'Leniency Coefficient' parameter (or 'leniencyCoefficient' in the JSON input).

On this page

Dataset Validity Checker (DVC)

Share Actor:

Append to dataset

valek.josef/append-to-dataset

Utility actor that allows you to build a single large dataset from individual default datasets of other actor runs.

Josef Válek

Failed Runs Monitor

jannovotny/failed-runs-monitor

This actor will let you know about failed or time outed runs of your actors and tasks via Slack or email. It can also notice you about successful runs with empty dataset, check JSON schema of dataset items, or about runs that are running for too long.

Jan Novotný

Website-checker-starter

vaclavrut/website-checker-starter

Works with lukaskrivka/website-checker. The idea is that this actor manages more URLs on the input, will start website-checker with 10 runs at a time and store all data to one datasets.

Vaclav Rut

Website Checker Workload

lukaskrivka/website-checker-workload

Creates reasonable workloads for analyzing any website with the Website Checker actor and combines the resulting data. This is the easiest way to analyze any website for compute unit usage and anti-scraping blocking.

Lukáš Křivka

Content Checker

jakubbalada/content-checker

Monitor a website or web page for content changes. Automatically saves before and after screenshots and sends an email notification when content changes are detected.

Jakub Balada

2.4K

4.4

Scraper Results Checker

drobnikj/check-crawler-results

This actor checks results from Apify's scrapers or any other actor that stores its result to a dataset, and sends a notification if there are errors. It's designed to run from webhook.

Jakub Drobník

Get Countries Info By Code Scraper

dev_bodex/get-countries-info-by-code-scraper

This scraper is designed to retrieve detailed information about countries by their respective ISO country codes (e.g., "US" for the United States) or by their currency codes (e.g., "USD" for US Dollar).

Festus Befgrp

Web Crawler

rigelbytes/webcrawler

This web crawler is designed to provide users with complete flexibility by allowing them to use their **own proxies**. The scraper collects all pages from the website and returns extracts the **MetaData**, **Title**, and **Content** of the page in MarkDown.

Rigel Bytes

Extract Website With URL

mrahil/extract-website-with-url

The Extract Website with URL API allows users to extract structured data from any webpage by providing a URL. It retrieves HTML, metadata, tables, and images, returning data in JSON format. Ideal for web scraping, SEO analysis, and content extraction. Use it for e-commerce data, news scraping

Mohammed Rahil

110

Website Backup

mhamas/website-backup

Enables to create a backup of any website by crawling it, so that you don’t lose any content by accident. Ideal e.g. for your personal or company blog.