Dataset Validity Checker

equidem/dataset-validity-checker

Automatically checks whether the default datasets created by runs of an actor differ too much from previously encountered ones. This lets it warn you about web scraping problems caused by, e.g., a change in a website's layout, or by other significant changes in the resulting data.

Author: Matěj Sochor
  • Users: 4
  • Runs: 3,120

Actor Id

actId

Optional

string

Id of the actor whose datasets the validity checker is supposed to process.

Task Id

taskId

Optional

string

Id of the task whose datasets the validity checker is supposed to process. Supersedes the actId.
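The precedence between the two fields can be sketched as follows. This is only an illustration of the rule stated above, not the actor's actual code; the function name `resolve_target` is hypothetical, and what happens when neither field is filled is not documented here, so the sketch simply returns `None` in that case.

```python
from typing import Optional, Tuple


def resolve_target(run_input: dict) -> Optional[Tuple[str, str]]:
    """Pick the examined target from the input; taskId wins over actId."""
    task_id = run_input.get("taskId")
    if task_id:
        return ("task", task_id)
    act_id = run_input.get("actId")
    if act_id:
        return ("actor", act_id)
    # Neither field filled: the resulting behaviour is not documented.
    return None
```

With both fields set, `resolve_target({"actId": "a", "taskId": "t"})` returns `("task", "t")`.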

User Token

token

Optional

string

Token of the user who owns the examined actor/task. If left empty, the token of the user starting the Dataset Validity Checker is used.

Warning Email

warningEmail

Optional

string

The email address to which warnings about invalid datasets should be sent.

Clear History

clearHistory

Optional

boolean

Set to true if you want the validity checker to discard all previously gathered information about datasets and start anew. Use this option if you change the actor in a way that significantly changes its results, or if the website changes significantly in a way that doesn't actually break your actor (e.g. the number of different items available for purchase at an e-shop changes drastically).

Previous Datasets Considered

previousDatasetsTakenIntoAccount

Optional

integer

The number of previous datasets that are considered when determining whether a dataset is valid. If not filled, the value is 100.

Minimal Datasets

minimalDatasetCount

Optional

integer

Minimal number of datasets that must have been processed before further datasets can be validated. Must not exceed the value of 'Previous Datasets Considered'. If not filled, the value is 10.
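The relationship between the two counts and their defaults can be sketched like this. The helper name `resolve_counts` and the choice to raise an error are assumptions made for illustration; how the actor itself reacts to an inconsistent pair is not stated here.

```python
DEFAULT_PREVIOUS = 100  # default for previousDatasetsTakenIntoAccount
DEFAULT_MINIMAL = 10    # default for minimalDatasetCount


def resolve_counts(run_input: dict) -> tuple:
    """Apply the documented defaults and check minimal <= previous."""
    previous = run_input.get("previousDatasetsTakenIntoAccount")
    minimal = run_input.get("minimalDatasetCount")
    if previous is None:
        previous = DEFAULT_PREVIOUS
    if minimal is None:
        minimal = DEFAULT_MINIMAL
    if minimal > previous:
        # Assumed reaction; the real actor may handle this differently.
        raise ValueError(
            "minimalDatasetCount must not exceed previousDatasetsTakenIntoAccount"
        )
    return previous, minimal
```

For example, `resolve_counts({})` returns `(100, 10)`, while an input with `previousDatasetsTakenIntoAccount` set to 5 and no `minimalDatasetCount` fails the check, because the default of 10 is larger than 5.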

Number Handling Policy

numberHandlingPolicy

Optional

string

Governs which attributes the Dataset Validity Checker considers to be numbers. With 'strict', only values saved with a number type are treated as numbers. With 'loose', strings that contain numbers in non-scientific notation are also handled as numbers. The 'strict' policy is generally better, but if your actor doesn't convert numbers to the proper type, 'loose' should give you better results.

Options:

"loose", "strict"
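The difference between the two policies can be sketched as below. This is an assumption-laden illustration, not the checker's implementation: the function `is_number` and the regular expression are hypothetical, and the exact set of strings the real 'loose' policy accepts (signs, leading or trailing dots, whitespace) is not specified here.

```python
import re

# Assumed grammar for "a number in non-scientific notation":
# optional sign, digits with an optional fractional part, no exponent.
_PLAIN_NUMBER = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def is_number(value, policy="strict"):
    """Decide whether a dataset value counts as a number under a policy."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return False
    if isinstance(value, (int, float)):
        return True
    if policy == "loose" and isinstance(value, str):
        return _PLAIN_NUMBER.fullmatch(value.strip()) is not None
    return False
```

Under this sketch, `is_number("42")` is `False` with the default strict policy, `is_number("42", "loose")` is `True`, and `is_number("1e5", "loose")` is `False` because of the exponent.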

Starting At

startingAt

Optional

string

Sets the earliest run whose dataset will be processed by this run of the Dataset Validity Checker. It is superseded if runs from a later time have already been processed. Must be an ISO 8601 compliant date/time in UTC.

Until

until

Optional

string

Sets the latest run whose dataset will be processed by this run of the Dataset Validity Checker. Must be an ISO 8601 compliant date/time in UTC.
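One way to produce values for 'Starting At' and 'Until' is sketched below. The helper `to_utc_iso` and the example dates are hypothetical, and the `Z`-suffixed form is just one ISO 8601 representation of UTC; which exact variants the actor accepts is not stated here.

```python
from datetime import datetime, timezone


def to_utc_iso(dt: datetime) -> str:
    """Format a timezone-aware datetime as an ISO 8601 UTC string."""
    if dt.tzinfo is None:
        # A naive datetime has no defined offset, so refuse to guess.
        raise ValueError("datetime must be timezone-aware")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Example window (dates are illustrative only).
window = {
    "startingAt": to_utc_iso(datetime(2024, 1, 1, tzinfo=timezone.utc)),
    "until": to_utc_iso(datetime(2024, 2, 1, tzinfo=timezone.utc)),
}
```

Here `window["startingAt"]` is `"2024-01-01T00:00:00Z"` and `window["until"]` is `"2024-02-01T00:00:00Z"`.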

Average Multiplying Coefficient

averageMultiplyingCoefficient

Optional

string

Controls how much a dataset may differ from the previously seen datasets and still be considered valid, expressed as a multiple of the average difference. Default value is 5.

Maximal Multiplying Coefficient

maximalMultiplyingCoefficient

Optional

string

Controls how much a dataset may differ from the previously seen datasets and still be considered valid, expressed as a multiple of the maximal difference. Default value is 2.

Leniency Coefficient

leniencyCoefficient

Optional

string

Allows you to control both 'Maximal Multiplying Coefficient' and 'Average Multiplying Coefficient' at the same time. It is multiplicative, so a value of 2 increases both of them by a factor of 2. Default value is 1.
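The way the leniency coefficient scales the other two can be sketched as follows. The function `effective_coefficients` is hypothetical; it only applies the documented defaults and the multiplication, parsing the values from strings because that is the type the fields are listed with. How the checker measures the differences, and how it combines the two resulting limits, is not described here and is left out of the sketch.

```python
def effective_coefficients(run_input: dict) -> tuple:
    """Return (average, maximal) coefficients after applying leniency."""
    # The three fields are listed as strings, so parse them as floats.
    average = float(run_input.get("averageMultiplyingCoefficient") or "5")
    maximal = float(run_input.get("maximalMultiplyingCoefficient") or "2")
    leniency = float(run_input.get("leniencyCoefficient") or "1")
    return average * leniency, maximal * leniency
```

With an empty input this yields `(5.0, 2.0)`; setting `leniencyCoefficient` to `"2"` yields `(10.0, 4.0)`.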