Analysis of large
datasets and high-intensity event streams is of growing practical importance,
since virtually all areas of life are now computerized. In practice, such
analysis is usually carried out using methods developed within the theory of
stationary random processes, while actual nonstationarity is taken into account
through heuristic adaptive methods or within so-called cointegrated models of
random processes.
It should be noted that most machine learning methods rely on fixed, selected
datasets, which therefore amount to learning on stationary distributions. In
the context of analyzing high-intensity data streams, these methods tend to
overestimate the accuracy of results and are therefore not particularly
effective. The methodological problem lies in the fact that if data series have
a stationary distribution, then it is sufficient to analyze a single sample of
the required size in order to formulate a statistical criterion directly in
terms of measurable quantities. If, however, the distribution is nonstationary,
then, according to the definition of a nonstationary random process, one should
study not the data themselves, but distribution functions constructed from
non-overlapping samples, which requires substantially more computational
resources. For such sets of distributions, one computes distances between them —
typically in the norm of summable functions or in the norm of continuous
functions — after which the distribution of these distances is analyzed. In
other words, homogeneity criteria for the data must be formulated not in terms
of the elements of the series, but in terms of distances between sample
distributions. Such analysis requires the use of methods of kinetic theory,
i.e. the theory of time-varying distribution functions generated by measurable
parameters of a complex system. However, the computational complexity of the
corresponding procedures makes it infeasible to process large arrays of
multidimensional data. Consequently, it is desirable to have a certain
benchmark of statistics analogous to those for stationary series, which,
however, is difficult to obtain for arbitrary nonstationary time series.
Nearest neighbor
graphs (NNGs) make it possible to partially overcome this difficulty, since it
has been established that for stationary distributions of distances between
elements of a point set in a metric space, such statistical characteristics of
the graph as its distribution by number of connected components and the degree
distribution of vertices do not depend on the specific form of the distance
distribution. Thus, computational experiments for determining the statistics of
random NNG structures provide an opportunity to construct a new nonparametric
criterion of independence for sample data. A corresponding benchmark was
constructed by the author in [1, 2]. It appears important to demonstrate the
methodology of data independence analysis using NNGs on a number of practical
tasks, where the developed method proves to be highly effective compared to
traditional approaches that employ criteria in terms of sample moments computed
from observed data and similar statistics.
It should be noted
that computational experiments aimed at finding percentiles of statistical
criteria had been conducted before. For example, the Dickey–Fuller unit root
test in autoregressive models [3] is carried out by numerical simulation of the
Wiener process using the Monte Carlo method. The results of such numerical
experiments are valid because they rely on proven statements about the
mathematical properties of the simulated process.
The theory of
random graphs, developed in the context of machine learning methods, originated
in the 1950s thanks to the pioneering work of Gilbert [4], and later Erdős
and Rényi [5], who laid its foundations. Since then, the theory has
undergone significant development. The study of various aspects of graph theory
— particularly geometric and topological — and the emergence of numerous
corollaries have led to its application in a wide range of scientific
disciplines, including telecommunications [6], astronomy [7], statistical
physics, the social and biological sciences, and machine learning [8].
At present, the
main theoretical results in random graph theory are collected in V.F. Kolchin’s
monograph [9], as well as in the course [10] of the School of Mathematics,
University of Birmingham. Their practical applications are associated with the
involvement of external scientific disciplines. One current direction that
requires further development is the use of random graphs and their connection
to pseudorandom number generators. Sample statistical analysis of random graphs
is challenging because a “sample of graphs” does not exist as such, since a
graph merely visualizes some property of the studied group of objects. Thus,
the concept of sampling applies to the group of objects themselves, which must
be homogeneous with respect to the studied property in a statistical sense. But
since the property under investigation is revealed precisely as a result of
analyzing the graph structure, it is impossible to pre-form groups of objects
with the desired property. In this case, one can proceed as follows. By
studying the distribution of parameters that are considered key to describing a
given system model, one can replicate samples with the empirically obtained
distribution law of these parameters, and then study the statistics of graphs
corresponding (under the model’s assumption) to such a family of systems. Since
the parameter sample is finite, the corresponding empirical distribution
function fluctuates from sample to sample. Therefore, one should examine the
variability of the structure of the family of graphs as the parameter distribution
of the system changes. It is worth noting that similar issues, albeit in a
somewhat different formulation, were considered in the work of V.A. Kalyagin
and co-authors [11], where graphs were used as a tool for detecting
relationships between elements of network structures with randomly scattered
parameters.
The
nearest-neighbor graph is a convenient tool for qualitative analysis of sample
data dependence. Along with numerical estimates of vertex statistics and
connectivity components, visualization immediately reveals the presence of
dependence, since for independent data the sum of frequencies of vertices with
degrees from zero to six accounts for more than 99% of the normalization.
Therefore, the appearance of a vertex of degree, say, 13 in a graph with 10,000
vertices has a negligibly small probability under independence.
The paper will present
examples demonstrating the effectiveness of visualization for obtaining quick
qualitative assessments.
We assume that pairwise distances between points are
distinct, meaning every point has a unique nearest neighbor. Under this
assumption, Nearest Neighbor Graphs possess the following main properties:
○ Each vertex has exactly one outgoing edge, hence no isolated vertices appear in NNGs.
○ The number of vertices equals the
number of edges, which follows from the uniqueness of the nearest neighbor
of any order.
○ The sum of vertex degrees over
incoming edges equals the total number of vertices.
○ Every connected component of an NNG
contains precisely one cycle of length two.
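Under the distinct-distance assumption, these properties are easy to check numerically. The following sketch (an illustration with our own helper names, not the code used in [1]) builds an NNG from a random symmetric matrix by the row-minimum rule and verifies that the number of mutual nearest pairs equals the number of connected components:

```python
import random

def nng_edges(dist):
    """For each vertex i, point an edge to its nearest neighbor
    (the index of the minimal off-diagonal entry in row i)."""
    n = len(dist)
    return [min((j for j in range(n) if j != i), key=lambda j: dist[i][j])
            for i in range(n)]

def components(nn):
    """Count connected components of the undirected NNG via union-find."""
    parent = list(range(len(nn)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in enumerate(nn):
        parent[find(i)] = find(j)
    return len({find(i) for i in range(len(nn))})

random.seed(1)
n = 50
# symmetric random matrix of "distances" with zero diagonal
d = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        d[i][j] = d[j][i] = random.random()

nn = nng_edges(d)
assert len(nn) == n                     # one outgoing edge per vertex
mutual = sum(1 for i, j in enumerate(nn) if nn[j] == i)
# every component contains exactly one 2-cycle (one mutual nearest pair)
assert mutual // 2 == components(nn)
```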
In reference [1], a series of new
results concerning NNG structures have been proved. The primary statements
utilized in designing algorithms for generating NNGs and collecting statistics
include:
Theorem 1. Probabilities of realizing NNG structures are independent of the distribution
function governing intervertex distances.
The proof relies on a monotone mapping between the uniform and an arbitrary
distribution, which preserves the ordering of the sample elements interpreted
as distances rather than coordinates.
Theorem 2. Realization probabilities of NNG
structures remain unaffected by whether triangle inequality holds for matrix
elements treated as distance matrices.
The proof constructs an algorithm generating random matrices satisfying the
triangle inequality for any three elements. Furthermore, it demonstrates that
no coordinate distribution exists in R^n for which the distribution of
distances between points is uniform. Therefore, transforming the element
distribution of the constructed matrix to a uniform distribution will
inevitably violate the triangle inequality while maintaining the original
ordering among elements.
Consequently,
these assertions imply the possibility of constructing a benchmark for
statistics of NNG structures based on uniformly distributed random matrices.
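Theorem 1 can also be illustrated numerically: any strictly increasing transformation of the matrix entries preserves every row ordering and therefore leaves the NNG itself unchanged. A minimal sketch (our illustration; the exponential transform is an arbitrary choice of monotone map):

```python
import math, random

def nng_edges(dist):
    """Row-minimum rule: each vertex points to its nearest neighbor."""
    n = len(dist)
    return [min((j for j in range(n) if j != i), key=lambda j: dist[i][j])
            for i in range(n)]

random.seed(7)
n = 40
u = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        u[i][j] = u[j][i] = random.random()

# monotone transform: uniform entries -> exponentially distributed entries;
# the ordering within every row is preserved, so the NNG is the same graph
e = [[-math.log(1.0 - u[i][j]) if i != j else 0.0 for j in range(n)]
     for i in range(n)]

assert nng_edges(u) == nng_edges(e)
```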
In reference [12],
distributions of NNGs by numbers of connected components and vertex degrees
were built for random uniform distributions of point coordinates in
low-dimensional spaces R^n.
It turned out that as dimensionality increases, the probability distributions
of NNG structures converge towards those obtained from generating elements of
random matrices irrespective of whether these elements represent actual
distances or simply positive random values. Thus, we can numerically derive the
asymptotic behavior of vertex degree distributions in NNGs for high-dimensional
spaces, e.g., when n is large,
without actually conducting computationally intractable numerical experiments
involving random coordinate generation and subsequent construction of
corresponding distance matrices.
To construct a
benchmark for NNG statistics, simulations of random graphs with varying numbers
of vertices N from 100 to 1500 in steps of 50 were performed. Statistics
were collected from 1 million realizations of symmetric random matrices for
each size. The graph was constructed according to the following procedure: in
each row of the matrix, the minimal nonzero entry was identified. The column
index containing this minimum corresponded to the vertex closest to the vertex
indexed by the current row. It was discovered that the distribution of
first-neighbor graphs by the number of connected components Q closely
approximates a Gaussian distribution whose mode is located at Q/N = 0.25,
so the distribution density is approximated by the function

G(Q/N) ≈ (2πσ^2)^{-1/2} exp( -(Q/N - 0.25)^2 / (2σ^2) ),        (1)

where σ is the empirically determined standard deviation of the ratio Q/N.
In particular, for N = 250, the probabilities of NNG realizations with a specific ratio
Q/N are presented in Table 1.
Table 1. Distribution of NNG by the number of connected components, N = 250

Q/N     Frequency     Q/N     Frequency     Q/N     Frequency
0.175   < 0.00001     0.225   0.03614       0.275   0.03614
0.180   0.00001       0.230   0.05669       0.280   0.02085
0.185   0.00003       0.235   0.08044       0.285   0.01089
0.190   0.00009       0.240   0.10329       0.290   0.00514
0.195   0.00030       0.245   0.12000       0.295   0.00220
0.200   0.00085       0.250   0.12616       0.300   0.00085
0.205   0.00220       0.255   0.12000       0.305   0.00030
0.210   0.00514       0.260   0.10329       0.310   0.00009
0.215   0.01089       0.265   0.08044       0.315   0.00003
0.220   0.02085       0.270   0.05669       0.320   0.00001
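The generation procedure described above can be sketched as follows (an illustrative reimplementation with far fewer realizations than the 10^6-realization benchmark; the function names are ours):

```python
import random

def nng_component_count(n, rng):
    """Generate a symmetric random matrix with zero diagonal, apply the
    row-minimum rule to obtain nearest-neighbor edges, and count the
    connected components of the resulting graph via union-find."""
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = rng.random()
    nn = [min((j for j in range(n) if j != i), key=lambda j: d[i][j])
          for i in range(n)]
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in enumerate(nn):
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

rng = random.Random(42)
n, reps = 100, 100
ratios = [nng_component_count(n, rng) / n for _ in range(reps)]
mean_ratio = sum(ratios) / reps
# the benchmark mode of Q/N for the "dimensionless" case is 0.25
assert 0.22 < mean_ratio < 0.28
```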
If, instead of a random matrix, we
directly generate N
points with uniformly distributed coordinates in an
n-dimensional cube, the resulting distribution of the NNGs by the number of
fragments will be the same as distribution (1), but shifted to the right. The
mode of distribution (1), which we will call "dimensionless," is
located at the point Q/N = 0.25. This is an exact result derived from
the properties of Young diagrams for representing the number of ways a natural
number can be partitioned into a sum of natural numbers. For distributions
based on the number of connected components of the NNG in spaces of small
dimensions, the numerically determined positions of the probability maxima are
given in Table 2.
Table 2. Most likely values of the number of connected components of NNG in n-dimensional space

n            1       2       3       4       5       6       7
argmax(G)    0.340   0.310   0.290   0.275   0.265   0.260   0.255
Thus, the
distribution by the number of fragments for a seven-dimensional space turns out
to be very close to the "dimensionless" situation. As the
dimensionality increases further, these distributions coincide with high
accuracy. Therefore, in practice, for high-dimensional random vectors, the
"dimensionless" benchmark, which can be obtained with relatively low
computational costs, can be used.
The empirical
joint distribution of the numbers of NNG vertices by the degrees of their
incoming edges is formally described by a multinomial distribution, which, for
large dimensions and a large number of vertices, asymptotically becomes a
multivariate normal distribution:

f(m_0, ..., m_k) ≈ ((2πN)^k det Σ)^{-1/2} exp( -(1/(2N)) Σ_{i,j} A_{ij} (m_i - N p_i)(m_j - N p_j) ),        (2)

where det Σ is the determinant of the covariance matrix, the elements of which
are normalized by N; A_{ij} are the elements of the inverse covariance matrix;
m_i is the number of vertices with degree i; and p_i is the average fraction of
vertices with degree i.
Based on the results from 1 million graph generations, the following values of
p_i, independent of N, were obtained:

p_0 = 0.367,  p_1 = 0.368,  p_2 = 0.183,  ...        (3)
The first four
frequencies p_0, ..., p_3 have a relative accuracy of 0.002; for p_4 the
accuracy is 0.01, for p_5 it is 0.03, and for p_6 the relative accuracy
is 0.1. These values (3) are the same for both symmetric matrices
(distances) and arbitrary random matrices with a zero main diagonal
(quasi-distances). The values presented in (3) are obtained for the first time.
They are universal for any stationary distributions of distances between random
points. It is noteworthy that the sum of the frequencies of the first seven
degrees in (3) equals 0.9975, i.e., the probability of finding a vertex with a
degree greater than 6 in a random graph is less than 0.003. Therefore, the
presence of such vertices would indicate a non-random nature of the distances
between the elements of the set.
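The estimation of the degree frequencies p_k can be sketched in the same way (an illustration with far fewer realizations than the benchmark; the helper names are ours):

```python
import random
from collections import Counter

def indegree_frequencies(n, reps, rng):
    """Estimate the fractions p_k of vertices whose incoming-edge
    degree equals k, over `reps` random symmetric matrices."""
    counts = Counter()
    for _ in range(reps):
        d = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                d[i][j] = d[j][i] = rng.random()
        nn = [min((j for j in range(n) if j != i), key=lambda j: d[i][j])
              for i in range(n)]
        indeg = Counter(nn)  # incoming-edge degree of each target vertex
        for v in range(n):
            counts[indeg.get(v, 0)] += 1
    total = n * reps
    return {k: c / total for k, c in counts.items()}

p = indegree_frequencies(100, 100, random.Random(0))
assert abs(p[0] - 0.367) < 0.03       # benchmark zero-degree frequency
# degrees 0..6 should carry more than 99% of the normalization
assert sum(p.get(k, 0.0) for k in range(7)) > 0.99
```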
Due to the
different levels of significance in defining empirical frequencies (3), using a
multivariate distribution to test data for independence is not convenient. To
test the corresponding statistical hypothesis, it is sufficient to consider the
distribution of vertices with zero degree. This distribution, at a significance
level of 0.01, is approximated by a normal distribution of the form:
f(ν_0) ≈ (2πσ_0^2)^{-1/2} exp( -(ν_0 - p_0)^2 / (2σ_0^2) ),        (4)

where ν_0 is the observed fraction of zero-degree vertices, p_0 = 0.367 is its
benchmark mean value, and σ_0 is the benchmark standard deviation.
An
example of the benchmark probability of dependence based on the statistics of
the number of vertices with zero degree is presented in Table 3. In this table,
P is the probability of dependence corresponding to a deviation δ of the
observed zero-degree frequency from its benchmark mean, and the value 1 - P is
an estimate of the probability that the data under consideration are
statistically independent.
Table 3. Probability of dependency between variables based on the deviation δ of the zero-degree frequency

δ       P         δ       P         δ       P
0.001   0.040     0.020   0.684     0.039   0.949
0.002   0.080     0.021   0.707     0.040   0.954
0.003   0.120     0.022   0.729     0.041   0.959
0.004   0.160     0.023   0.750     0.042   0.964
0.005   0.199     0.024   0.770     0.043   0.968
0.006   0.237     0.025   0.789     0.044   0.972
0.007   0.275     0.026   0.807     0.045   0.975
0.008   0.313     0.027   0.823     0.046   0.978
0.009   0.349     0.028   0.839     0.047   0.981
0.010   0.385     0.029   0.853     0.048   0.983
0.011   0.419     0.030   0.866     0.049   0.986
0.012   0.453     0.031   0.879     0.050   0.987
0.013   0.486     0.032   0.890     0.051   0.989
0.014   0.518     0.033   0.901     0.052   0.991
0.015   0.548     0.034   0.911     0.053   0.992
0.016   0.578     0.035   0.920     0.054   0.993
0.017   0.606     0.036   0.928     0.055   0.994
0.018   0.633     0.037   0.935     0.056   0.995
0.019   0.659     0.038   0.942     0.057   0.996
It
is evident that even a very small deviation from the mean leads to the
conclusion that the quantities are dependent.
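The entries of Table 3 are reproduced by a normal approximation with a benchmark standard deviation of the zero-degree frequency of about 0.02. Assuming that value (an inference from the tabulated numbers, not a figure stated in the text), the test can be sketched as:

```python
import math

SIGMA0 = 0.02   # assumed benchmark standard deviation of the zero-degree
                # frequency; this value reproduces the entries of Table 3

def dependence_probability(delta, sigma=SIGMA0):
    """Probability that a deviation delta = |nu0 - p0| indicates
    dependence, under a normal approximation of form (4)."""
    return math.erf(delta / (sigma * math.sqrt(2.0)))

# spot-checks against Table 3
assert abs(dependence_probability(0.010) - 0.385) < 0.005
assert abs(dependence_probability(0.030) - 0.866) < 0.005
assert abs(dependence_probability(0.048) - 0.983) < 0.005
```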
In
spaces of small dimensions, the distribution of vertices by degree remains the
same, only the maxima of the distributions (i.e., the mean frequencies) shift
slightly. The corresponding values are presented in Table 4.
Table 4. Mean values of the frequencies of the main vertex degrees of the NNGs in n-dimensional space

n      1       2       3       4       5       6       7
p_0    0.421   0.392   0.382   0.375   0.369   0.368   0.367
p_1    0.449   0.427   0.418   0.400   0.375   0.369   0.368
p_2    0.130   0.143   0.148   0.166   0.178   0.181   0.183
Next, practical
examples of applying the constructed benchmark to various situations related to
data independence analysis will be considered.
Consider an
example that presents significant methodological interest. It is required to
determine at what confidence level it can be assumed that the distances between
the capitals of the world’s countries form an independent sequence of random
variables. The coordinates (latitude and longitude) of the world’s capitals are
known: they can be found, for example, on the website [www.flagpictures.com],
which lists the latitude φ and longitude λ for the capitals of 251 countries.
Assuming the Earth's surface is approximately a sphere with a radius of
R = 6371 km (average radius), the distance between points (φ_1, λ_1) and
(φ_2, λ_2) on the sphere can be calculated using the great-circle formula:

d = R · arccos( sin φ_1 sin φ_2 + cos φ_1 cos φ_2 cos(λ_1 - λ_2) ).
After that, the
corresponding distance matrix is constructed. As a result, we obtain about
30,000 data points.
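The distance computation can be sketched as follows (an illustration of the spherical-law-of-cosines formula; the city coordinates used in the check are approximate):

```python
import math

R_EARTH = 6371.0  # mean Earth radius, km

def great_circle(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points
    given in degrees, in the spherical-law-of-cosines form."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    c = (math.sin(p1) * math.sin(p2)
         + math.cos(p1) * math.cos(p2) * math.cos(l1 - l2))
    # clamp against floating-point overshoot before acos
    return R_EARTH * math.acos(max(-1.0, min(1.0, c)))

# Moscow -> Paris is roughly 2500 km
d = great_circle(55.75, 37.62, 48.86, 2.35)
assert 2400 < d < 2600
```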
Various
homogeneity tests can be applied to the obtained values. For example, by
arbitrarily splitting the data into two equal parts, one can use the
Kolmogorov-Smirnov test and conclude that both parts are drawn from the same
distribution with a probability of 0.95. Therefore, if the data are
independent, one would naturally expect typical values for the parameters of
the structure of a random nearest-neighbor graph, which, in this case, is constructed
for points in a two-dimensional space.
On one hand, the
capitals of countries seem to be randomly distributed, meaning the geographical
or other metric principles by which the positions of capitals are chosen are
not clear. On the other hand, it is evident that a capital city, like any other
city, is unlikely to be located at an arbitrary point on the surface, as most
of this surface is covered by oceans.
The nearest
neighbor graph for this set is shown in Figure 1. It was constructed using the
NetworkX library for Python 3.10.
Fig. 1. Nearest neighbor graph for world capitals
Visually, major
cluster centers are absent in this example. However, there are quite a few
linear sequential structures, indicating a large number of vertices with degree
1.
Indeed,
calculations showed that the graph consists of 80 disconnected fragments,
yielding a ratio Q/N equal to 0.3187. From Table 2, for the case n = 2, we find
that the most probable fragment fraction value for random data is 0.310. The
deviation of the obtained value from the benchmark value is less than 0.01.
Referring to Table 1, we observe that at a deviation of 0.05 from the maximum,
the probability of graph realization decreases by 0.06. Therefore, by linear
interpolation, with a confidence level of approximately 0.99, the data can be
considered independent according to the criterion of NNG connected components.
On the other hand,
the proportion of vertices with zero degree is 0.255. Vertices with degree 1
make up 0.506, those with degree 2 amount to 0.223, and those with degree 3
contribute 0.016. No other degrees occur in this graph. According to Table 4,
the most probable fraction of vertices with zero degree in the two-dimensional
case is 0.392, which exceeds the observed value by 0.137. Based on Table 3,
even at half this deviation, the likelihood of dependency in the data already
surpasses 0.96. Therefore, with a probability greater than 0.999, the
arrangement of data points should be regarded as dependent.
Thus, applying
various criteria based on the probabilities of NNG structures allowed detecting
dependency among the data. Note that this result does not contradict the fact
that other criteria indicated independence of point locations. The reason is
that a non-random graph may still share some characteristics with a random
graph, whereas the converse statement does not hold.
This methodology can also be used to
analyze independence between data in an arbitrary sample. Such a situation
typically arises when analyzing a sequence of some physical measurements. In
this case, the data are arranged in the same sequence as in the experiment, in
the form of the upper triangular part of a square matrix, and are interpreted
as conditional distances between points, regardless of the dimension.
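The packing of a measurement series into a symmetric matrix can be sketched as follows (illustrative helper names; random data stand in for actual measurements):

```python
import random

def series_to_matrix(series, n):
    """Fill the upper triangle of an n x n matrix with consecutive
    series values, mirror to the lower triangle, zero the diagonal."""
    assert len(series) >= n * (n - 1) // 2
    m = [[0.0] * n for _ in range(n)]
    it = iter(series)
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = next(it)
    return m

def nng_edges(dist):
    """Row-minimum rule: each vertex points to its nearest neighbor."""
    n = len(dist)
    return [min((j for j in range(n) if j != i), key=lambda j: dist[i][j])
            for i in range(n)]

rng = random.Random(3)
n = 30
data = [rng.random() for _ in range(n * (n - 1) // 2)]
nn = nng_edges(series_to_matrix(data, n))
# fraction of zero-in-degree vertices, to be compared with the benchmark
nu0 = sum(1 for v in range(n) if v not in set(nn)) / n
assert 0.0 <= nu0 <= 1.0
```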
As an example,
consider the task of analyzing the stability of locomotion during human
movement on a treadmill. The experiment was described in detail in the work
[13]. A sequence of pressure sensor readings is analyzed as a person moves on a
special platform at a constant speed. The pressure curve is quasi-periodic,
which allows for the extraction of the pattern shape of one step, i.e., the
determination of some average pressure profile for a single step. After this,
the obtained average profile is subtracted from the profiles of the actual
steps. This results in a series of residuals. If this series is stationary, it
indicates stable movement, meaning the body does not generate mechanical disorders.
If the residual series is non-stationary, it indicates a disturbance in
locomotor functions. In this example, the movement of an astronaut returning
from a long spaceflight was analyzed. Therefore, it is reasonable to assume
that the movement would not be entirely stable. A statistical analysis of this
residual series can provide a valid quantitative assessment of this
instability. However, the difficulty lies in the fact that the amount of data
is not particularly large — approximately 20,000 data points. This prevents the
use of time-varying distribution function analysis, as there should be many
such segments, each of sufficient length for a reliable estimation of the
corresponding probabilities. At the same time, these data can be represented as
elements of an upper triangular matrix of size N × N (here N(N - 1)/2 ≈ 20,000
gives N ≈ 200), where each row is considered as a sequence of distances between
the point
numbered by that row and the other points. A nearest-neighbor graph for this
structure is then constructed, as shown in Figure 2.
Fig. 2. Nearest neighbor graph for locomotion analysis
It is evident that
the graph contains several attractor vertices with large degrees, so a high
probability of dependence (non-stationarity) in the data is expected. The
statistical properties of this graph are as follows. The number of structures
is 50, i.e., Q/N = 0.25, which corresponds to the most probable number of
connected components for the "dimensionless" case. However, the proportion of
vertices with zero degree is 0.415 instead of the reference value of 0.367. The
deviation in frequencies amounted to 0.048. The probability of independence for
such data can be determined using formula (4), which gives a value of
approximately 0.02. Thus, the sequence of the residual series is
non-stationary, and the astronaut's locomotor functions are disrupted.
The benchmark statistics
of NNG structures generated by a non-symmetric random matrix can also be used
in problems related to assessing the dependence of a system of original
objects. In practice, such a problem arises when studying the proximity of
probability density functions in a non-symmetric quasi-norm, known as relative
entropy or the Kullback-Leibler distance.
Let’s formulate
the definition of this quasi-norm in the discrete case, which is the one
considered in this example. Let
be
some frequency distribution (empirical or theoretical), and let
be another distribution.
We assume that both
and
are strictly positive.
The Kullback-Leibler distance between the distributions
and
is defined as:
|
.
|
(5)
|
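The construction of the NNG in the Kullback-Leibler quasi-norm can be sketched as follows (illustrative; random distributions stand in for the authors' letter frequencies):

```python
import math, random

def kl(p, q):
    """Kullback-Leibler quasi-distance (5) between strictly positive
    discrete distributions p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def nng_from_quasidistances(dists):
    """Row-minimum rule: object i points to its KL-nearest object;
    the matrix need not be symmetric."""
    n = len(dists)
    return [min((j for j in range(n) if j != i), key=lambda j: dists[i][j])
            for i in range(n)]

rng = random.Random(5)
def random_distribution(k):
    w = [rng.random() + 0.01 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

profiles = [random_distribution(33) for _ in range(20)]  # e.g. letter frequencies
d = [[kl(p, q) for q in profiles] for p in profiles]     # non-symmetric matrix
nn = nng_from_quasidistances(d)
assert len(nn) == len(profiles)
assert all(kl(p, p) == 0.0 for p in profiles)
```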
Consider as such
distributions the empirical frequencies of letter usage in literary texts by
different authors in the Russian language. In the work [14], benchmarks for
over 8000 authors were constructed. The matrix of quasi-distances, calculated
using formula (5), contains about 64 million elements. By determining the minimum
element in each row, we construct an adjacency matrix. The corresponding NNG is
shown in Figure 3. It is evident that the graph contains quite a few clustering
points — these are authors who, either consciously or unconsciously, try to
imitate each other, creating texts that are statistically similar. Even without
referring to the benchmark, it can be confidently said that the representative
vectors of the authors are dependent with a probability close to 1.
Fig. 3. Nearest-neighbor
graph for author benchmark letter frequencies in the Kullback-Leibler
quasi-norm.
The following
results were obtained for the vectors under consideration. The proportion of
vertices with zero degree was 0.503, with degree one was 0.253, with degree two
was 0.122, with degree three was 0.056, and with degree four was 0.032.
Comparing these results with the data from (3), we see that they differ
significantly. Using the approximation (4), we find that the hypothesis of
independence for the system of author benchmarks is rejected with a probability
of one, as was expected from the qualitative assessment of the pattern in
Figure 3.
Let us also
present another important example of using the benchmark, when analyzing a
sufficiently long sequence of random numbers. Such a sequence can be considered
as the result of a pseudorandom number generator (PRNG), the quality of which
is compared to the benchmark built using reliable generators. These include,
for example, Xoshiro 512 [15] and Mersenne Twister [16]. If we use the existing
sequence of numbers to generate an alternative benchmark for NNGs structures,
the deviation of the resulting distributions from the base benchmark would
indicate that the data in such a long sample are dependent.
In particular, due
to the need for generating sufficiently long sequences of random independent
variables, the statistical analysis of the sequence of digits in the decimal
expansion of the number π is of great practical interest. On the one hand,
since the number π can be calculated using mathematical formulas, it is not
random. The question, however, is different: if we do not know that we are
dealing with the number π and analyze only the sequence of its decimal digits,
is it possible to determine that this sequence is not random? The author is not
aware of any statistical proof of such dependence. On the contrary, classical
statistical tests show that the digits of π form a random, uniformly
distributed sequence, and the correlation between samples is close to zero. Of
course, we do not know the entire number, but for the analysis a computed
sequence of 10 trillion decimal digits of π was used [17]. Thus, the results of
further analysis pertain not to the theory of the number π itself, but only to
the given finite sequence of its digits.
A computational
experiment was conducted to test how independent the digits of the number π
are, using the constructed benchmark for NNG structure statistics. The
experiment was as follows: a window of 20 decimal digits was considered and
associated with a number from the interval [0, 1), in which these 20 digits
appeared after the decimal point. The window was moved along the sequence of
digits of π in steps of 20. The generated series of numbers was treated as the
result of generating a random sequence in [0, 1). Using this sequence, NNG
structure statistics were built for symmetric matrices based on 1 million
generated graphs.
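The windowing step can be sketched as follows (illustrative; only the first 50 decimal digits of π are used here):

```python
# first 50 decimal digits of pi after the decimal point
PI_DIGITS = "14159265358979323846264338327950288419716939937510"

def windows_to_numbers(digits, width=20):
    """Slide a non-overlapping window of `width` digits along the
    sequence and interpret each window as a number in [0, 1)."""
    out = []
    for k in range(0, len(digits) - width + 1, width):
        out.append(int(digits[k:k + width]) / 10 ** width)
    return out

xs = windows_to_numbers(PI_DIGITS)
assert len(xs) == 2
assert all(0.0 <= x < 1.0 for x in xs)
assert abs(xs[0] - 0.14159265358979323846) < 1e-15
```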
Figures 4-6 show
the NNGs distributions by fragments for two variants of random uniform
symmetric generation and for the "pi-generator" for comparison.
Fig. 4. NNGs distributions by the number of fragments for
N = 100 for the first
neighbor in regular symmetric and “pi-generation”
From these graphs,
an important conclusion can be drawn: the NNG distributions for both standard,
but structurally different, PRNGs coincide. This test showed that these
generators are equivalent in terms of creating structures where there are no
hidden connections between the elements that are not detected by standard
criteria, i.e., they are sufficiently reliable. Moreover, the distributions for
both generators match both in the number of connected components and in the
degrees of the vertices.
Fig. 5. NNGs distributions by the number of fragments for
N = 1000 for the first
neighbor in regular symmetric and “pi-generation”
Fig. 6. NNGs distributions by the number of fragments for N = 100 for the 10-th neighbor in
regular symmetric and “pi-generation”
Note that the maximum of the first neighbors' distribution for the "pi-generator"
is shifted to the left of the point Q/N = 0.25.
This is quite an interesting point, as for distance distributions in a
finite-dimensional space, a similar graph is shifted to the right, as indicated
by the data in Table 2. Therefore, the interpretation of the pi-generator as a
random distribution in a small finite-dimensional space does not hold, and we
are forced to admit that the pi-generator generates dependent systems of
variables.
How significant
are the differences in the distributions? The shift of the maximum to the left
is fundamental. The difference between the two distribution functions in the
C-norm was 0.174, which is a large value for samples of length around 25
million (this is the full number of data points for the number of NNGs
fragments). According to the estimates for the agreed level of stationarity
[18], such a distance is characteristic of samples of length 100 from the same
distribution. For samples of length over 20 million, the probability of this
distance, assuming the independence of the sample elements, is less than 0.01.
Therefore, the distributions by fragments are different with a probability
practically equal to one, and the digits of the number π cannot be considered
statistically independent.
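The C-norm comparison of two empirical distributions can be sketched as follows (illustrative; tiny samples stand in for the simulated fragment counts):

```python
def c_norm_distance(sample_a, sample_b):
    """Sup-norm (C-norm) distance between the empirical distribution
    functions of two samples, e.g. fragment counts of simulated NNGs."""
    support = sorted(set(sample_a) | set(sample_b))
    na, nb = len(sample_a), len(sample_b)
    best = 0.0
    for x in support:
        fa = sum(1 for v in sample_a if v <= x) / na
        fb = sum(1 for v in sample_b if v <= x) / nb
        best = max(best, abs(fa - fb))
    return best

# identical samples are at distance zero; disjoint ones at distance one
assert c_norm_distance([1, 2, 3], [1, 2, 3]) == 0.0
assert c_norm_distance([1, 1], [5, 5]) == 1.0
```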
Also, for the
"pi-generator," a distribution of vertices by degrees was obtained (6) that
differs significantly from the values presented in (3). With a probability
practically equal to one, the digits of the number π form a statistically
dependent sequence.
The distribution
of the number of vertices with degree zero in the nearest neighbor graphs with
100 vertices, compared to standard symmetric generation, is shown in Figure 7.
Fig. 7. Comparison of
distributions of the number of vertices with zero degree for regular and
“pi-generation”
The distance
between the pi-distribution and both PRNGs in the C-norm was 0.209, which
indicates that these are two distinct distributions. For the graphs generated
by the number π, the number of vertices that are not nearest neighbors to any
other vertex is slightly higher than for a regular random sample of numbers.
At the same time,
the claim of dependence between the digits of the number π in this case has a
high degree of reliability (greater than 0.99). Meanwhile, the Kolmogorov-Smirnov
test applied to the sample of the considered set of random numbers (matrix
elements) confirms the hypothesis of their uniform distribution with a
probability greater than 0.99. Therefore, statistical criteria based on
probability distributions of NNGs structures are, in certain cases, a more
subtle tool for detecting hidden dependencies.
The construction
of the benchmark for the probabilities of NNG structure realizations for random
matrices from independent variables allowed for the development of two new
approaches to analyzing random data in terms of testing their stationarity.
First, when the
data is relatively small — on the order of a few thousand — conducting a full
analysis of the stationarity of their sample distributions depending on the
sample size is not feasible, as the number of samples for analyzing evolution
is small. In this case, the data can be arranged in the upper triangular part
of a matrix of appropriate size, symmetrically extended to the lower part, and
the NNG generated by this matrix can be studied. If the number of disconnected
fragments in the resulting graph and its distribution of vertex degrees are
close to the maxima of the corresponding benchmark distributions, the
sample can be considered to consist of independent data, and the significance
level at which this decision is made can be specified.
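The construction just described can be sketched as follows. This is a minimal illustration under our own naming conventions; the decision rule against the benchmark distributions is omitted, and the data here are synthetic:

```python
import random

def nng_from_data(data, n):
    """Fill the upper triangle of an n x n matrix with the data, mirror it
    symmetrically to the lower triangle, and return each vertex's nearest
    neighbor (off-diagonal row argmin)."""
    assert len(data) >= n * (n - 1) // 2
    dist = [[0.0] * n for _ in range(n)]
    it = iter(data)
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = next(it)
    return [min((d, j) for j, d in enumerate(row) if j != i)[1]
            for i, row in enumerate(dist)]

def zero_in_degree_count(neighbors):
    """Vertices that are the nearest neighbor of no other vertex."""
    return len(set(range(len(neighbors))) - set(neighbors))

random.seed(1)
n = 100
data = [random.random() for _ in range(n * (n - 1) // 2)]
nn = nng_from_data(data, n)
print(zero_in_degree_count(nn))
```

The resulting statistics (zero-degree count, degree distribution, number of fragments) would then be compared with the benchmark to assign a significance level to the independence decision.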
Second, when there
is a large amount of data, the issue of their homogeneity arises, i.e., whether
they belong to the same distribution. Before proceeding with the technically
complex task of identifying possible clusters and relationships between them,
it is first necessary to determine whether such an analysis is even meaningful.
To do this, a rapid graph-structure generation method is applied, treating the
analyzed data as distances generated by a PRNG. A second, test benchmark is
then constructed and compared with the original reference benchmark.
If the differences in the distributions are significant, the data are
considered dependent, after which filtering procedures can be applied using
various methods, depending on the specific task.
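A possible sketch of such a comparison is given below. Here the graph statistic is the zero-degree count, the C-norm is taken as the sup-norm distance between the two empirical distribution functions of that statistic, and the 30-vertex, 200-replicate sizes are purely illustrative assumptions:

```python
import random

def nn_zero_degree(n, rng):
    """Generate a symmetric random distance matrix, build its NNG, and return
    the number of vertices with zero in-degree (nobody's nearest neighbor)."""
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = rng.random()
    nn = [min((d, j) for j, d in enumerate(row) if j != i)[1]
          for i, row in enumerate(dist)]
    return len(set(range(n)) - set(nn))

def c_norm(sample_a, sample_b):
    """Sup-norm distance between the empirical CDFs of two integer samples."""
    support = sorted(set(sample_a) | set(sample_b))
    na, nb = len(sample_a), len(sample_b)
    return max(abs(sum(x <= t for x in sample_a) / na
                   - sum(x <= t for x in sample_b) / nb) for t in support)

rng = random.Random(2)
ref = [nn_zero_degree(30, rng) for _ in range(200)]   # reference benchmark
test = [nn_zero_degree(30, rng) for _ in range(200)]  # test benchmark
dist_c = c_norm(ref, test)
print(dist_c)  # small when both benchmarks come from the same PRNG
```

When the test benchmark is built from actual data rather than PRNG output, a C-norm distance well above the reference level (such as the value 0.209 observed for the pi-distribution) signals dependence.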
The novelty of the
proposed approach to analyzing samples from a non-stationary series, compared
to classical stationary analogs of this task, lies in the fact that the
indicator is not the deviation of current sample distributions from one another
or from some pattern, but the deviation in the probability distributions of the
realizations of structures, which are indirectly related to these
distributions. Examples of applying this approach to a number of practical problems
described in this work demonstrated its effectiveness compared to classical
stationary methods.
This work was supported by the Russian Science Foundation (RSF), project No. 23-71-10055.
1. Kislitsyn A.A. Investigation of nearest neighbor graph statistics. (in Russian) Mathematical Modeling, 2022, Vol. 34, No. 8, pp. 110-126.
2. Kislitsyn A.A. Modeling of nearest neighbor graphs to assess the probability of independence of sample data. (in Russian) Mathematical Modeling, 2023, Vol. 35, No. 7, pp. 63-82.
3. Verbik M.A. Guide to Modern Econometrics. (in Russian) Moscow: Nauchnaya kniga, 2008, 615 p.
4. Gilbert E.N. Random plane networks. J. Soc. Ind. Appl. Math., 1961, Vol. 9, pp. 533-553.
5. Erdős P., Rényi A. On random graphs. Publ. Math. Debr., 1959, Vol. 6, pp. 290-297.
6. Haenggi M., Andrews J.G., Baccelli F., Dousse O., Franceschetti M. Stochastic geometry and random graphs for the analysis and design of wireless networks. IEEE Journal on Selected Areas in Communications, 2009, Vol. 27, No. 7, pp. 1029-1046.
7. Pavlou O., Michos I., Papadopoulou Lesta V., Papadopoulos M., Papaefthymiou E.S., Efstathiou A. Graph theoretical analysis of local ultraluminous infrared galaxies and quasars. Astronomy and Computing, 2023, 45 p.
8. Waikhom L., Patgiri R. A survey of graph neural networks in various learning paradigms: methods, applications, and challenges. Artif. Intell. Rev., 2023, Vol. 56, pp. 6295-6364.
9. Kolchin V.F. Random graphs. (in Russian) Moscow: Fizmatlit, 2004, 256 p.
10. Krivelevich M., Panagiotou K., Penrose M., McDiarmid C. Random Graphs, Geometry and Asymptotic Structure. Fountoulakis N., Hefetz D., eds. Cambridge University Press, 2016.
11. Kalyagin V.A., Koldanov A.P., Koldanov P., Pardalos P.M. Statistical Analysis of Graph Structures in Random Variable Networks. Springer, 2020, 101 p.
12. Kislitsyn A.A., Goguev M.V. Computational expansion into nearest neighbor graphs: statistics and dimensions of space. (in Russian) Keldysh Institute Preprints, 2022, No. 88, pp. 1-32.
13. Fomina E.V., Ivchenko A.Yu., Lysova N.Yu., Zhedyaev R.Yu., Orlov Yu.N. Indicators of cosmonaut locomotor functions stability: a new method for ground-reaction forces analysis. Acta Astronautica, 2021, Vol. 189, pp. 679-686.
14. Kislitsyna M.Yu., Orlov Yu.N. Statistical analysis of the complete corpus of fiction in Russian and recognition of the author. (in Russian) Keldysh Institute Preprints, 2024, No. 17, pp. 1-24.
15. Blackman D., Vigna S. Scrambled Linear Pseudorandom Number Generators. arXiv:1805.01407v3, 28 Mar 2022.
16. Matsumoto M., Nishimura T. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans. Model. Comput. Simul., 1998, Vol. 8, No. 1, pp. 3-30.
17. Trub P. Digit Statistics of the First 22.4 Trillion Decimal Digits of Pi. arXiv, 2016, https://doi.org/10.48550/arXiv.1612.00489.
18. Orlov Yu.N. Kinetic methods of nonstationary time series. (in Russian) Moscow: MIPT, 2014, 276 p.