Data Consistency Test
As in the previous calculations, instead of counting the exact numbers with a complicated combinatorial argument, I make one observation and we can take a very simple shortcut, working directly with the probabilities.
I write out the argument in more detail below.
We assume each number/entry starts with one of the digits 1, 2, 3, ..., 9, each equally likely. If you don't count numbers starting with a zero, there are only nine digits to choose from, so the individual probabilities are not 1/10 but 1/9.
In effect a number never starts with the digit zero (a leading zero is just an empty space), so there are nine possible initial digits, and the probability of each must be 1/9, since the nine probabilities are all equal and must sum to 1.
The observed frequency of each initial digit should then approach 1/9 = 0.111... as the data set becomes larger.
If I then have a very large set of evenly distributed numbers (of a certain size and with a specified largest value), I can count the leading digits of the individual entries, compare the frequencies, and get a clear indication of whether the numbers in the table are random and could as such be reliable.
Writing an algorithm to count these frequencies in a given table is a very routine matter, so it is possible to avoid the difficulties of a rigorous combinatorial argument completely.
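Here is a minimal sketch of such a counter in Python, assuming the table is available as a list of positive integers (the function name leading_digit_frequencies is my own choice for illustration, not an existing library routine):

from collections import Counter

def leading_digit_frequencies(table):
    # Count how often each leading digit 1..9 occurs among the
    # positive entries of the table (assumed non-empty), and return
    # the relative frequencies so they can be compared with 1/9.
    counts = Counter(str(n)[0] for n in table if n > 0)
    total = sum(counts.values())
    return {d: counts.get(str(d), 0) / total for d in range(1, 10)}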
See my previous Forum entry:
https://www.abctales.com/forum/mon-2024-11-04-1153/suggested-fraud-test
Some clarification and generalisation
One thing might not have been clear enough. For the digit 1, say, we count how many entries in the table start with a 1. That frequency should approach the probability 1/9, as explained, and by symmetry the same must hold for any given digit.
Count how many entries start with a 2, how many with a 3, and so on up to 9; regard each count as a distinct test, and each frequency should come out near 1/9.
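To see all nine tests at once, one can run the counter on an artificial table. The bound 999999 below is my own choice of "specified largest" value; with it every leading digit covers exactly 111111 of the possible values, so each observed frequency should settle near 1/9:

import random

# Draw a large table evenly from 1..999999 and compare each observed
# leading-digit frequency with the expected value 1/9 = 0.111...
table = [random.randint(1, 999999) for _ in range(100000)]
for digit, frequency in leading_digit_frequencies(table).items():
    print(f"digit {digit}: observed {frequency:.4f}, expected {1/9:.4f}")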
Mathematically, the individual entries do not all have to be equally large, but they must not be larger than some fixed bound and are assumed to be spread "evenly" below it; for instance, entries spread evenly from 1 up to 999999, where each leading digit covers exactly 111111 of the possible values.
The test can be extended in several ways. For instance, instead of the first digit of a given entry/number, you may count the frequencies of the second digit, the third digit from the beginning, and so on; but the probability for a digit in the second place, third place, etc. must now be 1/10 = 0.1, because the digit 0 must also be counted.
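A sketch of that extension, reusing the Counter import from the first sketch (again the function name is my own illustration):

def kth_digit_frequencies(table, k):
    # Relative frequencies of the digit in position k (k = 1 is the
    # leading digit), counting only entries with at least k digits.
    # For k >= 2 the expected frequency is 1/10, because the digit 0
    # can now occur as well.
    digits = [s[k - 1] for s in (str(n) for n in table if n > 0) if len(s) >= k]
    counts = Counter(digits)
    return {d: counts.get(str(d), 0) / len(digits) for d in range(0, 10)}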
By this time I would imagine you have a reasonably good idea of whether the distributions are reliable, i.e. the numbers have not been tampered with, provided your table was large enough!
I would love some discussion on this; I am sure these ideas can be implemented in practice, and the blogs would do.
Tom Brown